6 Best MapReduce Job Optimization Techniques

Performance tuning helps you get the best out of your Hadoop cluster. In this blog, we will discuss techniques for MapReduce job optimization.

In this MapReduce tutorial, we will give you six important tips for MapReduce job optimization, such as proper configuration of your cluster, LZO compression usage, and proper tuning of the number of MapReduce tasks.

MapReduce Job Optimization Techniques


Below are some MapReduce job optimization techniques that will help you improve MapReduce job performance.

1. Proper configuration of your cluster

  • Mount DFS and MapReduce storage with the -noatime option. This disables access time tracking and thus improves I/O performance.
  • Avoid RAID on TaskTracker and DataNode machines; it generally reduces performance.
  • Ensure that you have configured mapred.local.dir and dfs.data.dir to point to one directory on each of your disks, so that all of your I/O capacity is used.
  • Monitor swap usage and network usage with monitoring software. If you see that swap is being used, reduce the amount of RAM allocated to each task in mapred.child.java.opts (see the sketch after this list).
  • Make sure you have smart monitoring of the health status of your disk drives. This is one of the important practices for MapReduce performance tuning.
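
For reference, here is a minimal sketch of the properties mentioned above. In practice mapred.local.dir and dfs.data.dir are set once in mapred-site.xml and hdfs-site.xml on every node rather than in job code; the disk paths and the 512 MB heap used here are purely illustrative values.

import org.apache.hadoop.conf.Configuration;

public class ClusterTuningSketch {
    public static Configuration tune() {
        Configuration conf = new Configuration();
        // Spread MapReduce scratch space and HDFS block storage over every disk.
        conf.set("mapred.local.dir", "/disk1/mapred/local,/disk2/mapred/local");
        conf.set("dfs.data.dir", "/disk1/dfs/data,/disk2/dfs/data");
        // If swap is being used, shrink the per-task heap here.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        return conf;
    }
}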

2. LZO compression usage

Compressing intermediate data is almost always a good idea. Every Hadoop job that generates a non-negligible amount of map output will benefit from intermediate data compression with LZO.

Although LZO adds a small amount of CPU overhead, it saves time by reducing the amount of disk I/O during the shuffle.

Set mapred.compress.map.output to true and point mapred.map.output.compression.codec at the LZO codec class to enable LZO compression, as in the sketch below.
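
A minimal sketch of that configuration on the old (pre-YARN) API follows. It assumes the hadoop-lzo jar and native libraries are installed on every node; the class name LzoMapOutputConfig is just an illustrative wrapper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class LzoMapOutputConfig {
    public static JobConf configure(Configuration base) {
        JobConf conf = new JobConf(base);
        // Compress the map output that is spilled to disk and shuffled to reducers.
        conf.setBoolean("mapred.compress.map.output", true);
        // Use the LZO codec from the hadoop-lzo project for that compression.
        conf.set("mapred.map.output.compression.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        return conf;
    }
}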

3. Proper tuning of the number of MapReduce tasks

  • In a MapReduce job, if each task takes only 30-40 seconds or less, reduce the number of tasks. Each mapper or reducer process involves the following: first the JVM is started (loaded into memory), then the JVM is initialized, and after processing (map/reduce) the JVM is torn down. These JVM steps are costly. Suppose a mapper runs its task for just 20-30 seconds; we still need to start, initialize, and stop a JVM for it, which can take a considerable share of the total time. So it is strongly recommended that each task run for at least 1 minute.
  • If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB, so that the number of tasks is smaller. You can change the block size of existing data with the command hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-largeblocks
  • As long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.
  • Don’t run too many reduce tasks. For most jobs, the number of reduce tasks should be equal to or a bit less than the number of reduce slots in the cluster (see the sketch below).
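
Below is a minimal sketch of sizing the reduce tasks against the cluster's reduce slots, using the old JobClient/JobConf API that matches the mapred.* properties in this post. The 0.95 factor is only an illustrative choice for "a bit less than the number of slots".

import java.io.IOException;

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ReduceTaskSizing {
    public static void setReducers(JobConf conf) throws IOException {
        ClusterStatus cluster = new JobClient(conf).getClusterStatus();
        // Total reduce slots available across the cluster.
        int reduceSlots = cluster.getMaxReduceTasks();
        // Aim a bit below the slot count so all reduces can finish in one wave.
        conf.setNumReduceTasks(Math.max(1, (int) (reduceSlots * 0.95)));
    }
}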

4. Combiner between Mapper and Reducer

If your algorithm involves computing aggregates of any sort, you should use a Combiner. The Combiner performs some aggregation before the data reaches the reducer.

The Hadoop MapReduce framework runs the combiner intelligently to reduce the amount of data that has to be written to disk and transferred between the Map and Reduce stages of the computation.
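
As an illustration, here is a minimal sum-style reducer on the old (org.apache.hadoop.mapred) API. Because addition is associative and commutative, the same class can be registered as both the combiner and the reducer; the class name SumReducer is only illustrative.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // aggregate the partial counts
        }
        output.collect(key, new IntWritable(sum));
    }

    // Typical wiring: run the same aggregation locally on each mapper's output.
    public static void wire(JobConf conf) {
        conf.setCombinerClass(SumReducer.class);
        conf.setReducerClass(SumReducer.class);
    }
}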

5. Usage of the most appropriate and compact Writable type for data

Users switching from Hadoop Streaming to Java MapReduce often use the Text writable type unnecessarily. Although Text can be convenient, converting numeric data to and from UTF-8 strings is inefficient and can actually make up a significant portion of CPU time.
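
As a small, self-contained illustration (the value 42 is arbitrary), compare emitting a number as Text with emitting it as an IntWritable:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class CompactWritableDemo {
    public static void main(String[] args) {
        // Inefficient: the number is formatted into a UTF-8 string and would
        // have to be parsed back to an int on the reduce side.
        Text asText = new Text(Integer.toString(42));

        // Compact: IntWritable serializes the value as four raw bytes and
        // skips the string conversion entirely.
        IntWritable asInt = new IntWritable(42);

        System.out.println("Text length in bytes: " + asText.getLength());
        System.out.println("IntWritable value: " + asInt.get());
    }
}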

6. Reuse of Writables

One very common mistake MapReduce users make is to allocate a new Writable object for every output from a mapper or reducer. Consider, for example, the following word-count mapper implementation:

public void map(...) {
    ...
    for (String word : words) {
        // a new Text and a new IntWritable are allocated for every output record
        output.collect(new Text(word), new IntWritable(1));
    }
}

This implementation causes the allocation of thousands of short-lived objects. While the Java garbage collector does a reasonable job of dealing with this, it is more efficient to write:

class MyMapper ... {
    Text wordText = new Text();
    IntWritable one = new IntWritable(1);

    public void map(...) {
        ...
        for (String word : words) {
            // reuse the same Text instance instead of allocating a new one
            wordText.set(word);
            output.collect(wordText, one);
        }
    }
}

Conclusion

Hence, there are various MapReduce job optimization techniques that help you optimize your MapReduce jobs, such as using a combiner between the mapper and reducer, using LZO compression, properly tuning the number of MapReduce tasks, and reusing Writables.

If you know of any other technique for MapReduce job optimization, do let us know in the comment section below.