20 Notable Differences Between Hadoop 2.x and Hadoop 3.x

The objective of this Hadoop tutorial is to give you a clearer understanding of how the Hadoop versions differ. In this blog we cover the top 20 differences between Hadoop 2.x and Hadoop 3.x.

The comparison between Hadoop 2 and Hadoop 3 is organized feature by feature.

Differences Between Hadoop 2.x and Hadoop 3.x

Apache Hadoop is an open-source software framework for distributed storage and processing of very large data sets.

Hadoop 3.x was introduced to overcome the limitations of Hadoop 2.x. It adds several new features while keeping the existing ones in use.

A detailed feature-wise comparison between Hadoop 2.x and Hadoop 3.x is given below:

a. License

  • Hadoop 2.x- Apache 2.0, open source
  • Hadoop 3.x- Apache 2.0, open source

b. Minimum supported version of Java

  • Hadoop 2.x- Java 7.
  • Hadoop 3.x- Java 8.

c. Fault Tolerance

  • Hadoop 2.x- In this version, replication handles fault tolerance.
  • Hadoop 3.x- In this version, erasure coding can handle fault tolerance. Replication is still available alongside it.
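
To make the idea of rebuilding lost data from parity concrete, here is a minimal sketch that uses a single XOR parity block. Please treat it as an illustration only: HDFS erasure coding typically relies on Reed-Solomon codes, which tolerate more than one lost block, and XOR is used here just because it fits in a few lines. The function names xor_blocks and recover_missing are hypothetical and not part of any Hadoop API.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors plus the parity."""
    return xor_blocks(surviving_blocks + [parity])

# Three toy data blocks and their parity block.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# Simulate losing the second block, then rebuild it.
survivors = [data_blocks[0], data_blocks[2]]
rebuilt = recover_missing(survivors, parity)
print(rebuilt)  # b'BBBB'
```

Replication would keep full copies of every block instead. Parity trades some computation during recovery for less stored data.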

d. Data Balancing

  • Hadoop 2.x- Uses the HDFS Balancer, which balances data across DataNodes.
  • Hadoop 3.x- Adds an intra-DataNode balancer, which balances data across the disks within a single DataNode and is invoked via the HDFS disk balancer CLI.
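
The following sketch shows what "balancing inside one DataNode" means in numbers. It is a simplified illustration under our own assumptions, not the planning algorithm HDFS actually uses. The function intra_node_plan and the disk names are hypothetical.

```python
def intra_node_plan(disks):
    """disks maps a disk name to (used_bytes, capacity_bytes).

    Returns, per disk, how many bytes would have to move off the disk
    (positive) or onto it (negative) so that every disk ends up at the
    node's average utilization.
    """
    total_used = sum(used for used, _ in disks.values())
    total_capacity = sum(capacity for _, capacity in disks.values())
    target = total_used / total_capacity  # average utilization of the node
    return {
        name: used - target * capacity
        for name, (used, capacity) in disks.items()
    }

# One node with a nearly full disk and a nearly empty one.
disks = {"disk1": (900, 1000), "disk2": (100, 1000)}
plan = intra_node_plan(disks)
print(plan)  # {'disk1': 400.0, 'disk2': -400.0}
```

A balancer that only looks across nodes would see this node as 50% full and leave it alone, even though one of its disks is almost out of space.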

e. Storage Scheme

  • Hadoop 2.x- Uses a 3x replication scheme.
  • Hadoop 3.x- Supports erasure coding in addition to replication.

f. Storage Overhead

  • Hadoop 2.x- In this version, HDFS with 3x replication has a 200% storage overhead.
  • Hadoop 3.x- In this version, HDFS with erasure coding has a 50% storage overhead.

g. Storage Overhead Example

  • Hadoop 2.x- If there are 6 blocks and each block is replicated 3x, the result is 18 blocks, so the data occupies 18 blocks of space.
  • Hadoop 3.x- If there are 6 blocks, the data occupies 9 blocks of space, i.e. 6 data blocks plus 3 parity blocks.
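
The arithmetic behind the two figures above can be checked with a short sketch. The helper names are hypothetical, and the 6-data/3-parity layout is simply the one used in the example. It also assumes the data fills whole stripes.

```python
def storage_overhead(data_blocks, extra_blocks):
    """Return (total_blocks, overhead_percent) for a storage scheme."""
    total = data_blocks + extra_blocks
    overhead = 100 * extra_blocks / data_blocks
    return total, overhead

def replication(data_blocks, factor=3):
    # Each block is stored `factor` times, so factor - 1 copies are extra.
    return storage_overhead(data_blocks, data_blocks * (factor - 1))

def erasure_coding(data_blocks, data_units=6, parity_units=3):
    # Assumes data_blocks is a multiple of data_units (full stripes only).
    stripes = data_blocks // data_units
    return storage_overhead(data_blocks, stripes * parity_units)

print(replication(6))     # (18, 200.0)
print(erasure_coding(6))  # (9, 50.0)
```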

h. YARN Timeline Service

  • Hadoop 2.x- Uses the old timeline service, which has scalability issues.
  • Hadoop 3.x- Introduces timeline service v2, which improves the scalability and reliability of the timeline service.

j. Default Port Range

  • Hadoop 2.x- In this version, several default ports fall within the Linux ephemeral port range. Hence a service can fail to bind at startup if another application already holds the port.
  • Hadoop 3.x- In this version, these default ports have been moved out of the ephemeral range.
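
As a quick illustration of the port change, the sketch below checks a few default ports against the ephemeral range. The port numbers are the HDFS defaults as we understand them from the Hadoop release notes, so please verify them against the hdfs-default.xml of your own version. The range 32768-61000 is likewise an assumption, since the real range on a host is configurable (net.ipv4.ip_local_port_range). The names EPHEMERAL_RANGE, DEFAULT_PORTS and in_ephemeral_range are hypothetical.

```python
# Assumed Linux ephemeral range; the actual range depends on the host's
# net.ipv4.ip_local_port_range setting.
EPHEMERAL_RANGE = range(32768, 61001)

# A few default ports: service -> (Hadoop 2.x, Hadoop 3.x).
# Assumed values; check them against your version's hdfs-default.xml.
DEFAULT_PORTS = {
    "NameNode HTTP UI": (50070, 9870),
    "Secondary NameNode HTTP": (50090, 9868),
    "DataNode data transfer": (50010, 9866),
    "DataNode HTTP UI": (50075, 9864),
}

def in_ephemeral_range(port):
    """True if the port could be handed out by the OS as an ephemeral port."""
    return port in EPHEMERAL_RANGE

for service, (old, new) in DEFAULT_PORTS.items():
    print(f"{service}: {old} ephemeral={in_ephemeral_range(old)}, "
          f"{new} ephemeral={in_ephemeral_range(new)}")
```

Every Hadoop 2.x port in this table lands inside the assumed range, and every Hadoop 3.x port lands outside it.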

k. Tools

  • Hadoop 2.x- Hive, Pig, Tez, Hama, and other Hadoop tools are available.
  • Hadoop 3.x- In this version too, Hive, Pig, Tez, Hama, and other Hadoop tools are available.

l. Compatible File System

  • Hadoop 2.x- It supports HDFS (the default file system), the FTP file system (which stores all its data on remotely accessible FTP servers), the Amazon S3 (Simple Storage Service) file system, and the Windows Azure Storage Blobs (WASB) file system.
  • Hadoop 3.x- It supports all of the previous ones as well as the Microsoft Azure Data Lake file system.

m. DataNode Resources

  • Hadoop 2.x- DataNode resources are not dedicated to MapReduce. Other applications can use them as well.
  • Hadoop 3.x- In this version too, DataNode resources can be used by other applications.

n. MR API Compatibility

  • Hadoop 2.x- The MR API is compatible with Hadoop 1.x, so Hadoop 1.x programs can execute on Hadoop 2.x.
  • Hadoop 3.x- The MR API is also compatible with Hadoop 1.x, so Hadoop 1.x programs can execute on Hadoop 3.x.

o. Support for Microsoft

  • Hadoop 2.x- It can be deployed on Windows.
  • Hadoop 3.x- It also supports Microsoft Windows.

p. Slots/container

  • Hadoop 2.x- Hadoop 1.x works on the concept of slots, while Hadoop 2.x works on the concept of containers.
  • Hadoop 3.x- Hadoop 3.x also works on the concept of containers.

q. Single point of failure

  • Hadoop 2.x- It has features to overcome the single point of failure (SPOF), so whenever the NameNode fails, it recovers automatically.
  • Hadoop 3.x- It also has features to overcome SPOF, so whenever the NameNode fails, it recovers automatically with no need for manual intervention.

r. HDFS Federation

  • Hadoop 2.x- Hadoop 1.x has only a single NameNode to manage the entire namespace, whereas Hadoop 2.x has multiple NameNodes for multiple namespaces.
  • Hadoop 3.x- It also has multiple NameNodes for multiple namespaces.

s. Scalability

  • Hadoop 2.x- We can scale up to 10,000 nodes per cluster.
  • Hadoop 3.x- We can scale to more than 10,000 nodes per cluster.

t. HDFS Snapshot

  • Hadoop 2.x- It adds support for snapshots, which help with disaster recovery and protect against user error.
  • Hadoop 3.x- It also supports the snapshot feature.

u. Platform

  • Hadoop 2.x- It serves as a platform for a wide variety of data analytics. It is also possible to run event processing, streaming, and real-time operations.
  • Hadoop 3.x- It is also possible to run event processing, streaming, and real-time operations on top of YARN.

Conclusion

In conclusion, Hadoop 3.x has added new features such as erasure coding to handle fault tolerance. With erasure coding, Hadoop 3.x also reduces the storage overhead from 200% to 50%.

It also introduces an intra-DataNode balancer, invoked through the disk balancer command-line tool, along with a more scalable and reliable timeline service.

If you find any other differences between Hadoop 2.x and Hadoop 3.x, do let us know in the comment section.