Hadoop RecordReader Introduction, Working & Types

In our previous blog, we studied Hadoop Counters in detail. Now in this tutorial, we are going to discuss the RecordReader in Hadoop.

Here we will cover the introduction to Hadoop RecordReader and how it works. In this MapReduce tutorial we will also discuss the types of RecordReader in MapReduce and the maximum size of a single record in Hadoop MapReduce.

[Figure: Working of Hadoop RecordReader]

What is RecordReader in MapReduce?

A RecordReader converts the byte-oriented view of the input into a record-oriented view for the Mapper and Reducer tasks to process.

To understand the Hadoop RecordReader, we need to understand the MapReduce data flow. Let us see how the data flows:

MapReduce is a simple model of data processing. Inputs and outputs for the map and reduce functions are key-value pairs. Following is the general form of the map and reduce functions:

  • Map: (K1, V1) → list (K2, V2)
  • Reduce: (K2, list (V2)) → list (K3, V3)
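
As an illustration (not from the original tutorial), a word-count style Mapper and Reducer in Java make these forms concrete: here K1 is a LongWritable byte offset, V1 is a Text line, K2 and K3 are Text words, and V2 and V3 are IntWritable counts.

// Illustrative sketch only: Map (K1, V1) -> list(K2, V2) and
// Reduce (K2, list(V2)) -> list(K3, V3), word-count style.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key (K1) is the byte offset of the line, value (V1) is the line itself.
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);               // emit (K2, V2)
      }
    }
  }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();                            // fold list(V2)
    }
    context.write(key, new IntWritable(sum));    // emit (K3, V3)
  }
}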

Now, before processing starts, the framework needs to know which data to process. The InputFormat class helps achieve this. This class selects the files from HDFS that are the input to the map function, and it is also responsible for creating the input splits.

It also divides them into records. InputFormat divides the data into splits (typically 64 MB or 128 MB, matching the HDFS block size), and each such split is known as an InputSplit. An InputSplit is a logical representation of the data. In a MapReduce job, the number of map tasks executed equals the number of InputSplits.

By calling getSplits(), the client calculates the splits for the job. The splits are then sent to the application master, which uses their storage locations to schedule map tasks that will process them on the cluster.

After that, the map task passes its split to the createRecordReader() method of the InputFormat to obtain a RecordReader for that split. The RecordReader generates records (key-value pairs), which it then passes one at a time to the map function.
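
These two responsibilities, computing splits and handing out a RecordReader, are exactly the two methods an InputFormat must provide. As a hedged sketch (the class name is hypothetical; this is essentially what TextInputFormat does), a minimal InputFormat can inherit getSplits() from FileInputFormat and implement only createRecordReader():

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical example: getSplits() comes from FileInputFormat, and each map
// task calls createRecordReader() to get a reader for its own split.
public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    return new LineRecordReader();
  }
}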

During MapReduce job execution, the Hadoop RecordReader uses the data within the boundaries created by the InputSplit and creates key-value pairs for the mapper. The “start” is the byte position in the file at which the RecordReader begins generating key-value pairs, and the “end” is where it stops reading records.

The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper. It keeps communicating with the InputSplit until it has finished reading the file.

How does RecordReader work in Hadoop?

A RecordReader is more than an iterator over the records. The map task uses one record at a time to generate a key-value pair, which it passes to the map function. We can see this in the Mapper’s run() method, given below:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}

Although it is not mandatory for the RecordReader to stay within the boundaries created by the InputSplit to generate key-value pairs, it usually does. A custom implementation can even read more data from outside its InputSplit.

Then, after running setup(), nextKeyValue() is called repeatedly on the context. This populates the key and value objects for the mapper. Through the context, the framework retrieves the key-value pair from the RecordReader and passes it to the map() method to do its work.

Hence, the (key-value) input to the map function is processed as per the logic written in the map code. When the RecordReader reaches the end of its records, the nextKeyValue() method returns false.
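
To make this concrete, below is a minimal, hypothetical RecordReader sketch (the class and its upper-casing behavior are illustrative, not part of Hadoop) that delegates to the built-in LineRecordReader; the overridden methods are the real org.apache.hadoop.mapreduce.RecordReader API that the run() loop above drives. Such a reader would be returned from createRecordReader() in an InputFormat like the one sketched earlier.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical example: delegates to LineRecordReader and upper-cases each line.
public class UpperCaseLineRecordReader extends RecordReader<LongWritable, Text> {
  private final LineRecordReader delegate = new LineRecordReader();
  private final Text currentValue = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    delegate.initialize(split, context);   // positions the reader at "start"
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    // Returns false once the reader has passed the "end" of its split.
    if (!delegate.nextKeyValue()) {
      return false;
    }
    currentValue.set(delegate.getCurrentValue().toString().toUpperCase());
    return true;
  }

  @Override
  public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

  @Override
  public Text getCurrentValue() { return currentValue; }

  @Override
  public float getProgress() throws IOException { return delegate.getProgress(); }

  @Override
  public void close() throws IOException { delegate.close(); }
}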

Types of Hadoop RecordReader

The InputFormat defines the RecordReader instance in Hadoop. By default, Hadoop uses TextInputFormat, whose RecordReader converts data into key-value pairs. Two commonly used RecordReaders are as follows:

1. LineRecordReader

LineRecordReader is the default RecordReader, provided by TextInputFormat. It treats each line of the input file as a value, with the byte offset of that line within the file as the associated key. It always skips the first line in the split (or part of it) if it is not the first split.

At the end, it always reads one line beyond the boundary of the split (if data is available, i.e. it is not the last split).

2. SequenceFileRecordReader

This Hadoop RecordReader reads data as specified by the header of a sequence file.
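
Because the RecordReader is supplied by the InputFormat, you choose between these two readers by setting the input format on the job. A minimal driver sketch (the class and job names are placeholders) could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatChoice {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "recordreader-demo");

    // TextInputFormat (the default) supplies LineRecordReader.
    job.setInputFormatClass(TextInputFormat.class);

    // For sequence files, SequenceFileInputFormat supplies SequenceFileRecordReader:
    // job.setInputFormatClass(SequenceFileInputFormat.class);
  }
}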

The Maximum Size of a Single Record

We can set the maximum size of a single record by using the parameter below:

conf.setInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);

Conclusion

In conclusion, the Hadoop RecordReader creates the (key-value) input to the Mapper. By default, it uses TextInputFormat for converting data into key-value pairs.

I hope you have liked this blog. If you have any questions related to Hadoop RecordReader, feel free to share them with us. We will be glad to resolve them.