MapReduce Shuffling and Sorting in Hadoop
This Hadoop tutorial is all about MapReduce Shuffling and Sorting. Here we will provide you a detailed description of Hadoop Shuffling and Sorting phase. Firstly we will discuss what is MapReduce Shuffling, next with MapReduce Sorting, then we will cover MapReduce secondary sorting phase in detail.
What is MapReduce Shuffling and Sorting?
Shuffling is the process by which it transfers mappers intermediate output to the reducer. Reducer gets 1 or more keys and associated values on the basis of reducers. The intermediated key – value generated by mapper is sorted automatically by key. In Sort phase merging and sorting of map output takes place.
Shuffling and Sorting in Hadoop occurs simultaneously.
Shuffling in MapReduce
The process of transferring data from the mappers to reducers is shuffling. It is also the process by which the system performs the sort. Then it transfers the map output to the reducer as input. This is the reason shuffle phase is necessary for the reducers. Otherwise, they would not have any input (or input from every mapper). Since shuffling can start even before the map phase has finished. So this saves some time and completes the tasks in lesser time.
Sorting in MapReduce
MapReduce Framework automatically sort the keys generated by the mapper. Thus, before starting of reducer, all intermediate key-value pairs get sorted by key and not by value. It does not sort values passed to each reducer. They can be in any order.
Sorting in a MapReduce job helps reducer to easily distinguish when a new reduce task should start. This saves time for the reducer. Reducer in MapReduce starts a new reduce task when the next key in the sorted input data is different than the previous. Each reduce task takes key value pairs as input and generates key-value pair as output.
The important thing to note is that shuffling and sorting in Hadoop MapReduce are will not take place at all if you specify zero reducers (setNumReduceTasks(0)). If reducer is zero, then the MapReduce job stops at the map phase. And the map phase does not include any kind of sorting (even the map phase is faster).
Secondary Sorting in MapReduce
If we want to sort reducer values, then we use a secondary sorting technique. This technique enables us to sort the values (in ascending or descending order) passed to each reducer.
In conclusion, MapReduce Shuffling and Sorting occurs simultaneously to summarize the Mapper intermediate output. Hadoop Shuffling-Sorting will not take place if you specify zero reducers (setNumReduceTasks (0)). Framework sorts all intermediate key-value pair by key, not by value. It uses secondary sorting for sorting by value. If you have any suggestion or query related to MapReduce Shuffling and Sorting phase, so please leave a comment in a comment box. We will be happy to solve them.