20 Notable Differences Between Hadoop 2.x and Hadoop 3.x

by TechVidvan Team

The objective of this Hadoop tutorial is to give you a clearer understanding of how the Hadoop versions differ. This blog covers the top 20 differences between Hadoop 2.x and Hadoop 3.x, compared feature by feature.

Difference Between Hadoop 2.x vs Hadoop 3.x

Apache Hadoop is an open source software framework for distributed storage and processing of very large data sets.

Hadoop 3.x was introduced to overcome the limitations of Hadoop 2.x. It adds several new features, while the existing features remain available.

A detailed feature-wise comparison of Hadoop 2.x and Hadoop 3.x is given below:

a. License

Hadoop 2.x- Apache 2.0, open source
Hadoop 3.x- Apache 2.0, open source

b. Minimum supported version of Java

Hadoop 2.x- Java 7.
Hadoop 3.x- Java 8.

c. Fault Tolerance

Hadoop 2.x- In this version, replication handles fault tolerance.
Hadoop 3.x- In this version, erasure coding can handle fault tolerance. Replication is still supported and remains the default.
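To make the idea behind erasure coding concrete, here is a minimal sketch that rebuilds a lost block from parity. It deliberately uses a single XOR parity block, which is a simplified stand-in and can recover only one lost block; HDFS erasure coding typically relies on Reed-Solomon codes, which tolerate more failures. The function name xor_blocks and the sample data are hypothetical and chosen only for illustration.

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks (hypothetical helper)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


# Three equally sized data blocks (illustrative data, not real HDFS blocks).
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is stored alongside the data blocks.
parity = xor_blocks(data)

# Simulate losing the second data block, then rebuild it from the
# surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
```

With replication, the lost block would instead be copied from another full replica. The parity approach stores less extra data but needs computation to rebuild a block.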

d. Data Balancing

Hadoop 2.x- Uses the HDFS Balancer, which balances data across DataNodes.
Hadoop 3.x- Adds an intra-DataNode balancer, which balances data across the disks of a single DataNode and is invoked via the HDFS disk balancer CLI.

e. Storage Scheme

Hadoop 2.x- Uses a 3x replication scheme.
Hadoop 3.x- Supports erasure coding in addition to 3x replication.

f. Storage Overhead

Hadoop 2.x- In this version, HDFS has a 200% storage overhead with 3x replication.
Hadoop 3.x- In this version, HDFS has a 50% storage overhead when erasure coding is used.

g. Storage Overhead Example

Hadoop 2.x- If there are 6 blocks and each block is replicated 3 times, the data occupies the space of 18 blocks.
Hadoop 3.x- If there are 6 blocks, the data occupies the space of 9 blocks, i.e. 6 data blocks and 3 parity blocks.
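The arithmetic in this example can be checked with a short sketch. It assumes the 6 data + 3 parity layout implied by the example above; the function names and parameters are hypothetical and are not part of any Hadoop API.

```python
def replication_footprint(data_blocks, replicas=3):
    """Total blocks stored and overhead (%) under n-way replication."""
    total = data_blocks * replicas
    overhead_pct = (total - data_blocks) / data_blocks * 100
    return total, overhead_pct


def erasure_coding_footprint(data_blocks, data_units=6, parity_units=3):
    """Total blocks stored and overhead (%) under an assumed 6+3 layout."""
    # Every group of up to `data_units` data blocks gets `parity_units`
    # parity blocks; -(-a // b) is ceiling division.
    groups = -(-data_blocks // data_units)
    parity = groups * parity_units
    total = data_blocks + parity
    overhead_pct = parity / data_blocks * 100
    return total, overhead_pct


print(replication_footprint(6))      # (18, 200.0)
print(erasure_coding_footprint(6))   # (9, 50.0)
```

For the same 6 data blocks, replication stores 12 extra blocks while the 6+3 layout stores only 3, which is where the 200% and 50% figures come from.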

h. YARN Timeline Service

Hadoop 2.x- Uses the old timeline service, which has scalability issues.
Hadoop 3.x- Introduces Timeline Service v2, which improves the scalability and reliability of the timeline service.

j. Default Ports Range

Hadoop 2.x- In this version, several default service ports fall within the Linux ephemeral port range. Hence a service can fail to bind at startup if another application is already using its port.
Hadoop 3.x- In this version, these default ports have been moved out of the ephemeral range.
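The port change can be illustrated with a small check. The sketch below assumes the common Linux default ephemeral range of 32768 to 60999 and a handful of HDFS default ports as they appear in the Hadoop 2.x and 3.x default configuration; treat both as assumptions and verify them against your own kernel settings and hdfs-default.xml. The names EPHEMERAL_RANGE, DEFAULT_PORTS, and in_ephemeral_range are hypothetical.

```python
# Assumed Linux default for net.ipv4.ip_local_port_range; check your system.
EPHEMERAL_RANGE = (32768, 60999)

# service: (Hadoop 2.x default, Hadoop 3.x default). Assumed values;
# verify against the hdfs-default.xml of your release.
DEFAULT_PORTS = {
    "NameNode HTTP UI": (50070, 9870),
    "NameNode HTTPS UI": (50470, 9871),
    "Secondary NameNode HTTP": (50090, 9868),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
    "DataNode HTTP UI": (50075, 9864),
    "DataNode HTTPS UI": (50475, 9865),
}


def in_ephemeral_range(port, port_range=EPHEMERAL_RANGE):
    """True if the port lies inside the (inclusive) ephemeral range."""
    low, high = port_range
    return low <= port <= high


for service, (old, new) in DEFAULT_PORTS.items():
    print(f"{service}: {old} ephemeral={in_ephemeral_range(old)}, "
          f"{new} ephemeral={in_ephemeral_range(new)}")
```

Under these assumptions every Hadoop 2.x port in the table falls inside the ephemeral range and every Hadoop 3.x port falls outside it, so the operating system can no longer hand the port to an outgoing connection before the daemon starts.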

k. Tools

Hadoop 2.x- Hive, Pig, Tez, Hama, and other Hadoop tools are available.
Hadoop 3.x- Hive, Pig, Tez, Hama, and other Hadoop tools are also available in this version.

l. Compatible File System

Hadoop 2.x- It supports HDFS (the default file system), the FTP file system (which stores all its data on remotely accessible FTP servers), the Amazon S3 (Simple Storage Service) file system, and the Windows Azure Storage Blobs (WASB) file system.
Hadoop 3.x- It supports all of the above as well as the Microsoft Azure Data Lake file system.

m. Datanode Resources

Hadoop 2.x- DataNode resources are not dedicated to MapReduce. Other applications can use them as well.
Hadoop 3.x- In this version too, DataNode resources can be used by other applications.

n. MR API Compatibility

Hadoop 2.x- The MR API is compatible with Hadoop 1.x, so Hadoop 1.x programs can run on Hadoop 2.x.
Hadoop 3.x- The MR API is also compatible with Hadoop 1.x, so Hadoop 1.x programs can run on Hadoop 3.x.

o. Support for Microsoft Windows

Hadoop 2.x- It can be deployed on Windows.
Hadoop 3.x- It also supports Microsoft Windows.

p. Slots/container

Hadoop 2.x- Hadoop 1.x works on the concept of slots, while Hadoop 2.x works on the concept of containers.
Hadoop 3.x- Hadoop 3.x also works on the concept of containers.

q. Single point of failure

Hadoop 2.x- It has features to overcome the single point of failure (SPOF). Whenever the NameNode fails, it recovers automatically.
Hadoop 3.x- It also has features to overcome SPOF. Whenever the NameNode fails, it recovers automatically, with no need for manual intervention.

r. HDFS Federation

Hadoop 2.x- In Hadoop 1.x, a single NameNode manages the entire namespace. Hadoop 2.x supports multiple NameNodes for multiple namespaces.
Hadoop 3.x- It also supports multiple NameNodes for multiple namespaces.

s. Scalability

Hadoop 2.x- We can scale up to 10,000 nodes per cluster.
Hadoop 3.x- We can scale to more than 10,000 nodes per cluster.

t. HDFS Snapshot

Hadoop 2.x- It adds support for snapshots, which provide disaster recovery and protection against user error.
Hadoop 3.x- It also supports the snapshot feature.

u. Platform

Hadoop 2.x- It serves as a platform for a wide variety of data analytics. It is also possible to run event processing, streaming, and real-time operations.
Hadoop 3.x- It is also possible to run event processing, streaming, and real-time operations on top of YARN.

Conclusion

In conclusion, Hadoop 3.x has added new features such as erasure coding to handle fault tolerance. With erasure coding, Hadoop 3.x also reduces the storage overhead from 200% to 50%.

It also introduced a new command line tool, the disk balancer. Overall, these changes make Hadoop 3.x more storage-efficient and easier to operate than Hadoop 2.x.

If you find any other difference between Hadoop 2.x and Hadoop 3.x, do let us know in the comment section.