Hadoop MapReduce Performance Tuning

Performance tuning in Hadoop helps in optimizing Hadoop cluster performance. In this MapReduce performance tuning article, you will first explore the various ways to improve Hadoop cluster performance and get the best results from MapReduce programming in Hadoop.

Then the article covers seven important techniques for Hadoop MapReduce performance tuning: memory tuning, minimizing the map disk spill, tuning mapper tasks, minimizing mapper output, balancing the reducer load, using a combiner, and speculative execution.

These techniques can be used to set up production Hadoop clusters on commodity hardware and enhance performance at minimal operational cost.

Introduction to Hadoop MapReduce Performance Tuning

Installing the Hadoop cluster in production is just half the battle. For a Hadoop administrator, it is extremely important to tune the Hadoop cluster setup in order to get maximum performance.

Hadoop performance tuning helps in optimizing Hadoop cluster performance and achieving the best results while running MapReduce jobs in Big Data companies.

During installation, the Hadoop cluster is configured with the default configuration settings.

It is therefore very important for Hadoop administrators to be familiar with the hardware of the cluster, such as the RAM capacity, the number of disks mounted on the DataNodes, the number of physical or virtual CPU cores, the NIC cards, and so on.

There is no single performance tuning technique that fits all Hadoop jobs, because it is very difficult to balance all the resources while solving a big data problem.

We can choose the performance tuning tips and tricks on the basis of the amount of data to be moved and the type of Hadoop job to be run in production. Choosing the right combination of these techniques is what ultimately yields the best performance.

To do this, we repeat the process below until the desired result is achieved in an optimal way.
Run Job -> Identify Bottleneck -> Address Bottleneck.

So, for performance tuning, we first run the Hadoop MapReduce job, identify the bottleneck, and then address it using the methods below, repeating this cycle until the desired level of performance is reached.

Tips and Tricks for MapReduce Performance Tuning

The techniques used for Hadoop MapReduce performance tuning fall into two categories:

1. Hadoop run-time parameters based performance tuning

2. Hadoop application-specific performance tuning

Let us now discuss how we can improve the Hadoop cluster’s performance based on these two categories.

1. Hadoop Run-Time Parameters Based Performance Tuning

This category deals with tuning Hadoop run-time parameters that govern CPU, memory, disk, and network usage. The techniques included in this category are:

a. Memory Tuning

The most important step for getting maximum performance out of a Hadoop job is to tune the memory configuration parameters, guided by monitoring memory usage on the servers.

Each MapReduce job in Hadoop reports information such as the number of input records read, the number of reducer records, the number of records passed on for further processing, the swap memory in use, the heap size set, and so on.

Hadoop tasks are generally not CPU-bound, so the prime concern is to optimize memory usage and disk spills.

The rule of thumb for memory tuning is to ensure that the MapReduce jobs do not trigger swapping: use as much memory as you can without pushing the node into swap.

Tools like Cloudera Manager, Nagios, or Ganglia can be used to monitor swap memory usage.

Whenever swap utilization is high, memory usage should be reduced by lowering the amount of RAM allotted to each task through the mapred.child.java.opts property.

For example, we can adjust the per-task heap by setting mapred.child.java.opts to -Xmx2048M in mapred-site.xml.
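
As an illustration, the same per-task heap setting can also be applied from the job driver through the Configuration API instead of editing mapred-site.xml. This is only a sketch: the 2048 MB value is an example and must be sized so that all concurrently running tasks fit in the node's physical RAM.

    import org.apache.hadoop.mapreduce.Job;

    public class MemoryTuning {
        // Give each task JVM a 2 GB heap (example value only). Keep the total
        // across all concurrent map and reduce tasks below physical RAM so the
        // node never starts swapping.
        public static void configureTaskHeap(Job job) {
            job.getConfiguration().set("mapred.child.java.opts", "-Xmx2048M");
        }
    }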

b. Minimize the Map Disk Spill

Disk IO is often the main performance bottleneck in Apache Hadoop, and there are several parameters we can tune to minimize spilling, for example:

  • Compressing the mapper output.
  • Sizing the in-memory spill (sort) buffer so that it uses roughly 70% of the mapper's heap.

But is frequent spilling really a good idea? It is strongly recommended that a mapper spill no more than once: if the map output spills repeatedly, the spilled data has to be read back and rewritten when the spill files are merged, roughly tripling the disk IO. A sketch of the relevant settings follows.
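
The snippet below is a sketch of these two levers: it enables intermediate compression and enlarges the sort buffer through the job's Configuration. The property names are the current mapreduce.* ones (older releases use mapred.compress.map.output and io.sort.mb), and the 512 MB buffer and 70% threshold are example values that must fit inside the task heap configured above.

    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillTuning {
        public static void minimizeMapSpill(Job job) {
            // Compress intermediate map output to cut disk and shuffle IO.
            // Snappy needs the native library; GzipCodec is a slower fallback.
            job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
            job.getConfiguration().setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);
            // Enlarge the in-memory sort buffer (in MB) so map output spills at most once.
            job.getConfiguration().setInt("mapreduce.task.io.sort.mb", 512);
            // Start spilling when the buffer is 70% full (example threshold).
            job.getConfiguration().setFloat("mapreduce.map.sort.spill.percent", 0.70f);
        }
    }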

c. Tuning Mapper Tasks

The number of map tasks can only be set indirectly, mainly through the input split size. The most common and effective way to tune the mapper side is therefore to control the number of mappers and the amount of data each one processes.

When dealing with large files, the framework splits them into smaller chunks (input splits) so that several mappers can process them in parallel. However, initializing a new mapper task usually takes a few seconds, which is overhead that has to be minimized. The suggestions are:

  • Reuse task JVMs instead of starting a new JVM for every task.
  • Aim for map tasks that run for about 1 to 3 minutes each. If the average mapper runs for less than a minute, increase mapred.min.split.size so that fewer, larger splits are created and the mapper initialization overhead is reduced.
  • Use CombineFileInputFormat when the input consists of many small files (see the sketch after this list).
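
The following sketch sets these three knobs from the job driver. The 256 MB minimum split size and unlimited JVM reuse are example values, and JVM reuse only has an effect on classic (MR1) clusters.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class MapperTuning {
        public static void tuneMappers(Job job) {
            // Raise the minimum split size to 256 MB so fewer, larger mappers run
            // (the modern equivalent of the mapred.min.split.size property above).
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
            // Pack many small files into each split instead of one mapper per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Reuse the task JVM for all tasks of the job (classic MR1 only).
            job.getConfiguration().setInt("mapreduce.job.jvm.numtasks", -1);
        }
    }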

2. Hadoop Application-Specific Performance Tuning

The techniques included in this category are:

a. Minimizing Mapper Output

By minimizing the mapper output we can improve performance, because the mapper output is written to local disk, shuffled across the network, and buffered in memory during the shuffle phase. We can achieve this by:

  • Filtering records on the mapper side instead of the reducer side (see the mapper sketch after this list).
  • Using minimal data to form the mapper output key and value.
  • Compressing the mapper output.
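
As a sketch, the hypothetical mapper below applies the first two points to a log-processing job: it drops every non-error line in the map phase and emits only a tiny key/value pair, so far less data reaches the shuffle. The "ERROR" marker and the counting scheme are assumptions for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ErrorCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text errorKey = new Text("ERROR");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Filter on the mapper side: skip every line that is not an error.
            if (line.toString().contains("ERROR")) {
                // Emit a minimal key and a count instead of the whole line.
                context.write(errorKey, ONE);
            }
        }
    }
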
b. Balancing the Reducer Load

Unbalanced reduce tasks create performance issues: a few reducers receive most of the mapper output and run far longer than the others. We can balance the reducer load by:

  • Implementing a better hash function in the Partitioner class (a sketch follows this list).
  • Writing a preprocessing job that separates the skewed keys using multiple outputs, and then a second MapReduce job that processes those problematic keys on their own.
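
A minimal sketch of both ideas, assuming a single known hot key (the "ERROR" key here is hypothetical) and an aggregation whose partial results can be merged by a small follow-up job:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
        // Hypothetical hot key that would otherwise overload a single reducer.
        private static final String HOT_KEY = "ERROR";
        private int nextPartition = 0;

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (HOT_KEY.equals(key.toString())) {
                // Fan the hot key out across all reducers; their partial results
                // must be merged afterwards (e.g. by a second, small job).
                nextPartition = (nextPartition + 1) % numPartitions;
                return nextPartition;
            }
            // Default HashPartitioner-style behaviour for all other keys.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It is registered on the job with job.setPartitionerClass(SkewAwarePartitioner.class).
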
c. Reducing Intermediate Data with a Combiner in Hadoop

We can further tune the performance of the Hadoop cluster by writing a combiner. A combiner reduces the amount of data that has to be transferred from the mappers to the reducers, which in turn reduces network congestion.
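
For aggregations that are commutative and associative, such as summing counts, the reducer class itself (or Hadoop's built-in IntSumReducer) can be registered as the combiner. A minimal sketch:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class CombinerSetup {
        // Pre-aggregate counts on the map side so far fewer records cross the network.
        // This is safe only because summation does not depend on how the values
        // are grouped or ordered.
        public static void addCombiner(Job job) {
            job.setCombinerClass(IntSumReducer.class);
        }
    }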

d. Speculative Execution

The performance of MapReduce jobs is seriously affected when some tasks take much longer than others to finish. Speculative execution in Hadoop is the common approach to this problem: slow tasks are backed up by duplicate attempts on other nodes, and whichever attempt finishes first is used.

We can enable speculative execution by setting the configuration parameters mapreduce.map.speculative and mapreduce.reduce.speculative (mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution on older releases) to true. This can reduce the job execution time when stragglers are caused by slow nodes.
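
A sketch of enabling both settings from the driver, using the current property names. Note that backup attempts consume extra cluster resources, so this pays off mainly when stragglers come from slow or failing nodes.

    import org.apache.hadoop.conf.Configuration;

    public class SpeculationSettings {
        public static void enableSpeculation(Configuration conf) {
            // Launch backup attempts for slow map and reduce tasks.
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", true);
        }
    }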

Summary

Finally, we have seen that performance tuning in Hadoop helps in optimizing Hadoop cluster performance. The article explained various tips and tricks for tuning a Hadoop cluster and highlighted some of the most effective ones for maximizing performance.

If you have any queries about this topic, feel free to share them with us in the comment section.