Introduction to Data Locality in Hadoop MapReduce

This Hadoop tutorial explains the concept of data locality in Hadoop.

We start with an introduction to data locality in MapReduce, then discuss why Hadoop needs it, the categories of data locality, and data locality optimization.

Finally, we look at the advantages of the data locality principle.

What is Data Locality in Hadoop MapReduce?

Data locality in Hadoop means moving the computation close to where the data resides instead of moving large amounts of data to the computation. This reduces network congestion and increases the overall throughput of the system.

Moving huge amounts of data between nodes creates heavy cross-switch network traffic, which was a major drawback for Hadoop. Data locality was introduced to overcome it.

In Hadoop, HDFS stores the datasets. The framework divides each dataset into blocks and stores them across the DataNodes. When a client runs a MapReduce job, the framework looks up the block locations held by the NameNode and sends the MapReduce code to the DataNodes that hold the data the job needs.
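
To make the block idea concrete, here is a minimal Python sketch of splitting a file into blocks and recording which nodes hold each block. It is only a toy model: the 128 MB block size and replication factor of 3 mirror common HDFS defaults, but the round-robin placement and the names `split_into_blocks`, `place_blocks`, and `nodes_holding` are assumptions made for illustration. Real HDFS placement is rack-aware and more involved.

```python
import math

# 128 MB is a common HDFS default block size; treated as an assumption here.
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return how many blocks are needed to store file_size bytes."""
    return math.ceil(file_size / block_size)


def place_blocks(num_blocks, datanodes, replication=3):
    """Toy round-robin placement (hypothetical, not the HDFS policy).

    Block i is stored on `replication` consecutive nodes, wrapping around.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


def nodes_holding(block_id, placement):
    """Nodes where a map task for this block could read its input locally."""
    return placement[block_id]


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    placement = place_blocks(blocks, nodes, replication=2)
    for block_id in placement:
        print(block_id, nodes_holding(block_id, placement))
```

A scheduler that knows this block-to-node mapping can try to start each map task on one of the nodes returned by `nodes_holding`.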

Requirement for Hadoop Data Locality

The Hadoop architecture needs to satisfy the following conditions to get the full benefit of data locality:

  • First, the Hadoop cluster should have an appropriate topology, and the Hadoop code should be able to read data locality information.
  • Second, Apache Hadoop should be aware of the topology of the nodes where tasks are executed, and it should know where the data is located.
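
One common way to make a cluster topology-aware is a rack-mapping script that translates host addresses into rack paths. The sketch below shows the general shape of such a script in Python. The addresses and rack names in `RACK_TABLE` are hypothetical, and the way the script is registered with Hadoop is not shown here, so check the rack awareness documentation for your version.

```python
import sys

# Hypothetical host-to-rack table; a real cluster would load this
# from its own inventory rather than hard-coding it.
RACK_TABLE = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

# Fallback rack for hosts that are missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts, table=RACK_TABLE):
    """Map each host address to its rack path, in the same order."""
    return [table.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print one rack path per host given on the command line.
    for rack in resolve_racks(sys.argv[1:]):
        print(rack)
```

With a mapping like this in place, the framework can tell whether two nodes share a rack, which is what the locality categories depend on.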

Categories of Data locality in Hadoop

The various categories in Hadoop Data Locality are as follows:

1. Data local data locality in Hadoop

Here the data is located on the same node as the mapper working on it, so the data is as close to the computation as it can be. Data local data locality is the most preferred scenario.

2. Intra-Rack data locality in Hadoop

It is not always possible to execute the mapper on the same DataNode because of resource constraints. In this case, it is preferred to run the mapper on a different node in the same rack.

3. Inter-Rack data locality in Hadoop

Sometimes it is also not possible to execute the mapper on a different node in the same rack. In that situation, the mapper runs on a node in a different rack. Inter-rack data locality is the least preferred scenario.
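
The three categories can be sketched as a small classification function. This is a simplified illustration in Python, not Hadoop's scheduler code. The names `locality_level` and `pick_node` and the node and rack labels are assumptions made for the example.

```python
# Preference order: lower number means more preferred.
PREFERENCE = {"data-local": 0, "intra-rack": 1, "inter-rack": 2}


def locality_level(task_node, replica_nodes, rack_of):
    """Classify how close task_node is to the nodes holding the data."""
    if task_node in replica_nodes:
        return "data-local"
    task_rack = rack_of[task_node]
    if any(rack_of[node] == task_rack for node in replica_nodes):
        return "intra-rack"
    return "inter-rack"


def pick_node(free_nodes, replica_nodes, rack_of):
    """Choose the free node with the best locality level for this input."""
    return min(
        free_nodes,
        key=lambda node: PREFERENCE[locality_level(node, replica_nodes, rack_of)],
    )


if __name__ == "__main__":
    # Hypothetical cluster: n1 and n2 share rack r1, n3 sits in rack r2.
    rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
    replicas = ["n1"]  # the block is stored only on n1
    for node in ["n1", "n2", "n3"]:
        print(node, locality_level(node, replicas, rack_of))
    # n1 is busy, so the choice is between n3 and n2.
    print(pick_node(["n3", "n2"], replicas, rack_of))
```

When the node holding the data is busy, `pick_node` falls back to a node in the same rack before it considers a node in another rack, which matches the preference order described above.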

Hadoop Data locality Optimization

Data locality is the main advantage of Hadoop MapReduce, but in practice its benefit is not always realized, for reasons such as heterogeneous clusters, speculative execution, data distribution and placement, and data layout.

These challenges become more prevalent in large clusters: the more data nodes and data a cluster has, the lower the locality tends to be.

In larger clusters, some nodes are newer and faster than others, which throws the data-to-compute ratio out of balance. Large clusters therefore tend not to be completely homogeneous.

With speculative execution, Hadoop runs a copy of a task on another node where the data might not be local, but the copy still uses that node's compute power. The root cause often lies in the data layout and placement. Non-local data processing also puts a strain on the network, which limits scalability, so the network becomes the bottleneck.

We can improve data locality by first detecting which jobs have degraded over time or have a data locality problem. Solving the problem is more complex and involves changing the data placement and data layout, or using a different scheduler.

After that, we have to verify whether a new execution of the same workload has a better data locality ratio.
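
One way to express the data locality ratio is the share of map tasks that ran at each locality level. The Python sketch below computes it from per-job counts. The key names are modeled on Hadoop's job counters for data-local, rack-local, and other map tasks, but treat them as assumptions and check the counters your version reports. All the numbers are made up for illustration.

```python
# Counter names modeled on Hadoop job counters (an assumption; verify
# against the counters your cluster actually reports).
LOCALITY_KEYS = ("DATA_LOCAL_MAPS", "RACK_LOCAL_MAPS", "OTHER_LOCAL_MAPS")


def locality_ratios(counters):
    """Return the fraction of map tasks at each locality level."""
    total = sum(counters.get(key, 0) for key in LOCALITY_KEYS)
    if total == 0:
        return {key: 0.0 for key in LOCALITY_KEYS}
    return {key: counters.get(key, 0) / total for key in LOCALITY_KEYS}


def locality_improved(before, after):
    """True if the data-local share is higher in the second run."""
    return (
        locality_ratios(after)["DATA_LOCAL_MAPS"]
        > locality_ratios(before)["DATA_LOCAL_MAPS"]
    )


if __name__ == "__main__":
    # Hypothetical counts from two runs of the same workload.
    before = {"DATA_LOCAL_MAPS": 60, "RACK_LOCAL_MAPS": 30, "OTHER_LOCAL_MAPS": 10}
    after = {"DATA_LOCAL_MAPS": 80, "RACK_LOCAL_MAPS": 15, "OTHER_LOCAL_MAPS": 5}
    print(locality_ratios(before))
    print(locality_ratios(after))
    print(locality_improved(before, after))
```

Comparing the ratios from two runs of the same workload shows whether a change in placement, layout, or scheduler actually helped.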

Advantages of data locality in Hadoop

  • High Throughput – Data locality in Hadoop increases the overall throughput of the system.
  • Faster Execution – With data locality, the framework moves the code to the node where the data resides instead of moving large data to the node. Because the program is usually much smaller than the data, moving the data is what would bottleneck the network, so moving the code instead makes Hadoop faster.
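
A rough back-of-the-envelope calculation shows why moving code is cheaper than moving data. The Python sketch below compares the ideal transfer time of a job's code with that of a single input block. The 10 MB code size, 128 MB block size, and 1 Gbit/s link are hypothetical figures chosen for illustration, and the model ignores latency, protocol overhead, and contention.

```python
MB = 1024 * 1024


def transfer_seconds(size_bytes, bandwidth_bits_per_sec):
    """Ideal time to push size_bytes over a link of the given bandwidth."""
    return size_bytes * 8 / bandwidth_bits_per_sec


if __name__ == "__main__":
    link = 1_000_000_000   # hypothetical 1 Gbit/s network link
    code_size = 10 * MB    # hypothetical size of the job's code
    block_size = 128 * MB  # hypothetical size of one input block

    code_time = transfer_seconds(code_size, link)
    block_time = transfer_seconds(block_size, link)
    print(f"code:  {code_time:.3f} s")
    print(f"block: {block_time:.3f} s")
    print(f"moving the block takes {block_time / code_time:.1f}x longer")
```

The gap widens with every additional block a job reads, because the code is shipped to a node once while each non-local block has to cross the network.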

Conclusion

In conclusion, data locality in Hadoop reduces network congestion, which improves the overall execution of the system and makes Hadoop faster.

If you found this blog helpful, or if you have any questions, leave a comment in the comment section below. We will be glad to answer them.