Hadoop InputFormat & Types of InputFormat in MapReduce

1. Objective

In our previous Hadoop tutorial, we gave a detailed description of the Hadoop Mapper and Reducer. In this blog, we cover another component of the MapReduce process: Hadoop InputFormat. We will discuss what InputFormat is in Hadoop, what functionality MapReduce InputFormat provides, the types of InputFormat in MapReduce, and how the mapper gets its data through InputFormat.

2. What is Hadoop InputFormat?

Hadoop InputFormat describes the input-specification for execution of the Map-Reduce job.

InputFormat describes how to split up and read input files. In MapReduce job execution, InputFormat is the first step. It is also responsible for creating the input splits and dividing them into records.

Input files store the data for a MapReduce job and typically reside in HDFS. Their format is arbitrary: line-based log files and binary formats can both be used. In MapReduce, the InputFormat class is one of the fundamental classes and provides the following functionality:

  • InputFormat selects the files or other objects to use as input.
  • It defines the input splits (InputSplits). Each split determines how much data an individual map task processes and on which servers that task can potentially run.
  • It defines the RecordReader, which is responsible for reading the actual records from the input files.

3. How does the Mapper get the data?

The mapper gets its data through two InputFormat methods, getSplits() and createRecordReader(), which are declared as follows:

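import java.io.IOException;
import java.util.List;

public abstract class InputFormat<K, V> {

    // Logically splits the job's input into InputSplits, one per map task.
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Creates the RecordReader that turns one split into key-value records.
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context)
            throws IOException, InterruptedException;
}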
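
To make the division of labour between the two methods concrete, here is a minimal plain-Java sketch of the same contract. It does not use Hadoop at all: MiniSplit, MiniInputFormat and MiniJobDemo are hypothetical stand-ins invented for illustration, the "input" is an in-memory list, and the "map function" simply upper-cases each record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for an InputSplit: a range of records.
class MiniSplit {
    final int start;   // index of the first record in this split
    final int length;  // number of records in this split

    MiniSplit(int start, int length) {
        this.start = start;
        this.length = length;
    }
}

// Hypothetical, simplified stand-in for an InputFormat over an in-memory list.
class MiniInputFormat {
    private final List<String> data;
    private final int recordsPerSplit;

    MiniInputFormat(List<String> data, int recordsPerSplit) {
        this.data = data;
        this.recordsPerSplit = recordsPerSplit;
    }

    // Analogue of getSplits(): decide how the input is divided up.
    List<MiniSplit> getSplits() {
        List<MiniSplit> splits = new ArrayList<>();
        for (int start = 0; start < data.size(); start += recordsPerSplit) {
            int length = Math.min(recordsPerSplit, data.size() - start);
            splits.add(new MiniSplit(start, length));
        }
        return splits;
    }

    // Analogue of createRecordReader(): iterate over the records of one split.
    Iterator<String> createRecordReader(MiniSplit split) {
        return data.subList(split.start, split.start + split.length).iterator();
    }
}

class MiniJobDemo {
    // Runs one "map task" per split and collects the mapped records in order.
    static List<String> run(MiniInputFormat format) {
        List<String> output = new ArrayList<>();
        for (MiniSplit split : format.getSplits()) {
            Iterator<String> reader = format.createRecordReader(split);
            while (reader.hasNext()) {
                output.add(reader.next().toUpperCase()); // the "map" function
            }
        }
        return output;
    }

    public static void main(String[] args) {
        MiniInputFormat format = new MiniInputFormat(Arrays.asList("a", "b", "c"), 2);
        System.out.println("splits: " + format.getSplits().size());
        System.out.println("mapped: " + run(format));
    }
}
```

In a real job the framework, not user code, drives this loop, and each split is normally handled by a separate map task rather than sequentially.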

4. Types of InputFormat in MapReduce

There are different types of MapReduce InputFormat in Hadoop, each used for a different purpose. Let’s discuss the Hadoop InputFormat types below:

4.1. FileInputFormat

It is the base class for all file-based InputFormats. FileInputFormat specifies the input directory where the data files are located. When we start a MapReduce job, FileInputFormat is given a path containing the files to read. It reads all of these files and divides them into one or more InputSplits.
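
As a rough illustration of how a file is divided into splits, the sketch below clamps the block size between a minimum and a maximum split size, which is the shape of the rule FileInputFormat uses, and then cuts a file of a given length into (offset, length) ranges. The class SplitSizeSketch and its method splitFile are hypothetical names, and the sketch deliberately ignores details of the real implementation, such as how a small trailing remainder is handled and files that cannot be split.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSizeSketch {
    // Clamp the block size between the configured minimum and maximum split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs that together cover a file of the given length.
    static List<long[]> splitFile(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] { offset, length });
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed values for illustration: 128 MB blocks, no effective min/max limit.
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        for (long[] split : splitFile(300 * mb, splitSize)) {
            System.out.println("offset=" + split[0] + " length=" + split[1]);
        }
    }
}
```

With these assumed numbers, a 300 MB file yields two full 128 MB ranges followed by a 44 MB remainder.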

4.2. TextInputFormat

It is the default InputFormat. It treats each line of each input file as a separate record and performs no parsing. TextInputFormat is useful for unformatted data or line-based records such as log files. Hence,

  • Key – The byte offset of the beginning of the line within the file (not within the split). It is therefore unique within a file, and globally unique when combined with the file name.
  • Value – The contents of the line, excluding line terminators.
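
The key/value rule above can be sketched in plain Java, without Hadoop. The class LineOffsetSketch is a hypothetical name, and the sketch assumes UTF-8 text with '\n' line endings; it only shows how byte-offset keys relate to line contents, not how Hadoop reads a split.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LineOffsetSketch {
    // Maps each line's starting byte offset to its contents (terminator removed).
    // Assumes '\n' line endings and UTF-8 encoding.
    static Map<Long, String> toRecords(String fileContents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.put(offset, line);
            // Advance by the line's byte length plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        Map<Long, String> records = toRecords("hello\nworld!\nbye\n");
        for (Map.Entry<Long, String> e : records.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

For the three-line input in main, the keys are 0, 6 and 13: each key is the previous key plus the previous line's byte length plus one for the newline.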

4.3. KeyValueTextInputFormat

It is similar to TextInputFormat in that it also treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, whereas KeyValueTextInputFormat breaks the line into a key and a value at the first tab character (‘\t’). Hence,

  • Key – Everything up to the tab character.
  • Value – The remaining part of the line after the tab character.
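
A minimal plain-Java sketch of this splitting rule follows. KeyValueLineSketch and parse are hypothetical names; the separator is passed as a parameter because Hadoop allows it to be configured, and a line without any separator is treated as a key with an empty value.

```java
class KeyValueLineSketch {
    // Returns {key, value}, splitting on the first occurrence of the separator.
    static String[] parse(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // No separator found: the whole line is the key, the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("user1\tlogin\tok", '\t');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

Because only the first tab is significant, any later tabs stay inside the value.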

4.4. SequenceFileInputFormat

It is an InputFormat that reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. They can be block-compressed and support direct serialization and deserialization of arbitrary data types. Hence,

Key and Value are both user-defined.

4.5. SequenceFileAsTextInputFormat

It is a variant of SequenceFileInputFormat that converts the sequence file’s keys and values to Text objects by calling toString() on them. Hence, SequenceFileAsTextInputFormat makes sequence files suitable input for streaming.

4.6. SequenceFileAsBinaryInputFormat

By using SequenceFileAsBinaryInputFormat, we can extract the sequence file’s keys and values as opaque binary objects.

4.7. NLineInputFormat

It is another form of TextInputFormat: the keys are the byte offsets of the lines and the values are the contents of the lines. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; the number depends on the size of the split and the length of the lines. If we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat.

N – The number of lines of input that each mapper receives.

By default (N=1), each mapper receives exactly one line of input.

Suppose N=2; then each split contains two lines. One mapper receives the first two key-value pairs, and another mapper receives the next two.
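
The grouping behaviour can be sketched in plain Java as follows. NLineSketch and group are hypothetical names, and the sketch only models which lines end up together in a split; it assumes the final split simply holds whatever lines are left over.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NLineSketch {
    // Groups lines into splits of at most n lines each; the last split may be shorter.
    static List<List<String>> group(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l1", "l2", "l3", "l4", "l5");
        // With N=2, each inner list is the input of one mapper.
        System.out.println(group(lines, 2));
    }
}
```

With five lines and N=2, three mappers are needed, and the third one receives a single line.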

4.8. DBInputFormat

This InputFormat reads data from a relational database using JDBC. It is best suited to loading small datasets, perhaps for joining with larger datasets from HDFS using MultipleInputs. Hence,

  • Key – LongWritable
  • Value – DBWritable

5. Conclusion

Hence, InputFormat defines how data is read from the input into the Mapper instances. In this tutorial, we have learned about many types of InputFormat, such as FileInputFormat and TextInputFormat. The default input format is TextInputFormat. If you have any query related to MapReduce InputFormat, feel free to share it with us and we will try to answer it.
