Rack Awareness in Hadoop and its Advantages

This Hadoop tutorial covers Rack Awareness in HDFS.

First, we look at what the HDFS Rack Awareness property is and why Hadoop needs it. Then we discuss replica placement via Rack Awareness in HDFS.

Finally, we go over the benefits of Rack Awareness in the Hadoop framework.

Introduction to HDFS Rack Awareness

Rack Awareness in Hadoop is the concept of choosing closer DataNodes based on rack information. By default, a Hadoop installation assumes that all nodes belong to the same rack.

To reduce network traffic while reading and writing HDFS files in large Hadoop clusters, the NameNode chooses DataNodes that are on the same rack as, or on a rack near, the client node that issues the read/write request. The NameNode obtains this rack information by maintaining the rack ID of each DataNode.
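One common way to give the NameNode these rack IDs is a topology script, referenced from the `net.topology.script.file.name` property. Hadoop calls the script with one or more host names or IP addresses as arguments and reads back one rack ID per argument. The following is a minimal sketch of such a script; the addresses and rack names in `RACK_BY_HOST` are hypothetical, and unknown hosts fall back to `/default-rack`, the rack Hadoop assumes when no topology is configured.

```python
#!/usr/bin/env python3
"""Sketch of a rack topology script (all host/rack values are made up)."""
import sys

DEFAULT_RACK = "/default-rack"

# Hypothetical mapping from DataNode address to rack ID.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}


def resolve(hosts):
    """Return one rack ID per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    racks = resolve(sys.argv[1:])
    if racks:
        # One rack ID per line, matching the order of the arguments.
        print("\n".join(racks))
```

In a real cluster the mapping would usually come from a maintained file or an inventory system rather than a hard-coded dictionary.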

Why Rack Awareness?

The main purposes of Rack Awareness are to:

  • Improve data reliability and data availability.
  • Improve cluster performance.
  • Prevent data loss if an entire rack fails.
  • Make better use of network bandwidth.
  • Keep bulk data flow in-rack when possible.

Replica placement via Rack Awareness in Hadoop

The main purpose of the rack-aware replica placement policy is to improve data reliability, data availability, and network bandwidth utilization.

A simple policy is to place replicas on different racks. This prevents data loss when an entire rack fails and allows reads of a file to use bandwidth from multiple racks.

On multi-rack clusters, block replication follows the policy below:

No more than one replica is placed on any one node, and no more than two replicas are placed on the same rack. This comes with a constraint: the number of racks used for block replication should always be less than the total number of block replicas.

For example:

  • When the Hadoop framework creates a new block, it places the first replica on the local node, the second on a node in a different rack, and the third on a different node in the same rack as the second.
  • When re-replicating a block, if there is only one existing replica, place the second on a different rack.
  • When there are two existing replicas and both are in the same rack, place the third on a different rack.
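The placement rules above can be sketched in a few lines of Python. This is a simplified illustration, not HDFS's actual block placement code: the cluster in `RACK_OF` and all function names are hypothetical, and candidates are picked in sorted order to keep the result deterministic, whereas a real NameNode also weighs factors such as node load and free space.

```python
from collections import Counter

# Hypothetical cluster: DataNode name -> rack ID.
RACK_OF = {
    "dn1": "/rack1", "dn2": "/rack1",
    "dn3": "/rack2", "dn4": "/rack2",
    "dn5": "/rack3",
}


def place_new_block(writer, rack_of):
    """Pick three nodes: local node, remote rack, same remote rack."""
    local_rack = rack_of[writer]
    remote = sorted(n for n in rack_of if rack_of[n] != local_rack)
    if not remote:
        raise ValueError("need at least two racks")
    second = remote[0]
    # Prefer a different node on the same rack as the second replica.
    same_remote_rack = [n for n in remote
                        if n != second and rack_of[n] == rack_of[second]]
    others = sorted(n for n in rack_of if n not in (writer, second))
    if same_remote_rack:
        third = same_remote_rack[0]
    elif others:
        third = others[0]
    else:
        raise ValueError("need at least three nodes")
    return [writer, second, third]


def rereplication_target(existing, rack_of):
    """Pick a node for one more replica of an under-replicated block."""
    used_racks = {rack_of[n] for n in existing}
    candidates = sorted(n for n in rack_of if n not in existing)
    if len(used_racks) == 1:
        # All replicas share one rack, so prefer a different rack.
        for node in candidates:
            if rack_of[node] not in used_racks:
                return node
    per_rack = Counter(rack_of[n] for n in existing)
    for node in candidates:
        if per_rack[rack_of[node]] < 2:  # at most two replicas per rack
            return node
    return None


def follows_policy(replicas, rack_of):
    """Check: one replica per node, at most two replicas per rack."""
    per_rack = Counter(rack_of[n] for n in replicas)
    return len(set(replicas)) == len(replicas) and max(per_rack.values()) <= 2
```

With this sample cluster, a block written from `dn1` lands on `dn1`, `dn3`, and `dn4`: one replica on the local rack and two on a single remote rack, so the loss of either rack still leaves at least one copy.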

Advantages of Rack Awareness in Hadoop

Let’s now discuss some advantages of Rack Awareness in HDFS:

  • Provides higher bandwidth and lower latency – This policy maximizes network bandwidth by transferring blocks within a rack rather than between racks. YARN can also optimize MapReduce job performance by assigning tasks to nodes that are closer to their data in terms of network topology.
  • Provides data protection against rack failure – The NameNode assigns the 2nd and 3rd replicas of a block to nodes in a different rack from the first replica. Thus, the data stays available even if a whole rack fails. However, this is possible only if Hadoop is configured with knowledge of its rack topology.
  • Minimizes write cost and maximizes read speed – The rack awareness policy directs read/write requests to replicas in the same rack as the client where possible. This minimizes the cost of writes and maximizes read speed.
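"Closer in terms of network topology" can be made concrete with a distance measure over topology paths. The sketch below assumes each node is identified by a path of the form `/datacenter/rack/node` (the paths and function names are hypothetical) and counts the hops from each node up to their closest common ancestor, which is the usual way Hadoop's network distance is described: 0 for the same node, 2 for the same rack, 4 for different racks in one data center.

```python
def distance(path_a, path_b):
    """Hops from each node up to their closest common ancestor, summed."""
    parts_a = path_a.strip("/").split("/")
    parts_b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    return (len(parts_a) - common) + (len(parts_b) - common)


def closest_replica(reader, replicas):
    """Return the replica location with the smallest distance to the reader."""
    return min(replicas, key=lambda replica: distance(reader, replica))


# Hypothetical reader and replica locations.
reader = "/dc1/rack1/dn1"
replicas = ["/dc1/rack2/dn3", "/dc1/rack1/dn2", "/dc1/rack2/dn4"]
nearest = closest_replica(reader, replicas)  # the in-rack replica, dn2
```

The same ordering explains both benefits in the list: reads prefer the in-rack copy, and a scheduler that knows these distances can run a task on or near a node that already holds the block.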

Conclusion

In conclusion, Rack Awareness is the concept of choosing closer DataNodes based on rack information to improve data reliability. Its main purpose is to prevent data loss if an entire rack fails. It also makes better use of network bandwidth.

If you have any questions about Rack Awareness in Hadoop, please share them with us in the comment section. We will try our best to help you.