Difference Between InputSplit vs Blocks in Hadoop
In this MapReduce tutorial, we will compare MapReduce InputSplit and Blocks in Hadoop. First, we will see what HDFS data blocks are, and then what a Hadoop InputSplit is. Next, we will go through the feature-wise differences between InputSplit and Blocks. Finally, we will work through an example of Hadoop InputSplit and data blocks in HDFS.
2. Introduction to InputSplit and Blocks in Hadoop
Let’s first discuss what HDFS Data Blocks are and what a Hadoop InputSplit is, one by one.
2.1. What is a Block in HDFS?
Hadoop HDFS splits large files into smaller chunks known as Blocks. A block is the minimum amount of data that HDFS can read or write, and HDFS stores each file as a sequence of blocks. Hadoop distributes these data blocks across multiple nodes in the cluster. The HDFS client has no control over details such as block location; the NameNode decides all such things.
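The way a file is cut into blocks can be sketched in a few lines. This is only an illustration of the arithmetic, not HDFS code: the function name `block_layout` and the 300 MB file size are assumptions made for the example, and 128 MB is used as the block size.

```python
MB = 1024 * 1024


def block_layout(file_size, block_size=128 * MB):
    """Return (offset, length) pairs describing how a file is cut into blocks.

    Hypothetical helper for illustration; it only models the arithmetic.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


layout = block_layout(300 * MB)
print([length // MB for _, length in layout])  # [128, 128, 44]
```

Only the last block is smaller than the block size, and it occupies just the space it needs.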
2.2. What is InputSplit in Hadoop?
An InputSplit represents the data that an individual mapper processes. Thus the number of map tasks is equal to the number of InputSplits. The framework divides each split into records, which the mapper then processes one at a time.
Initially, input files store the data for a MapReduce job. An input file typically resides in HDFS. The InputFormat describes how to split up and read input files, and it is responsible for creating the InputSplits.
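To make the idea concrete, a file-based split can be pictured as a small record that describes a byte range rather than holding bytes. The sketch below is a hypothetical Python model (the class name `FileSplitSketch`, the path, and the host names are all made up); Hadoop's own `FileSplit` exposes comparable information through `getPath()`, `getStart()`, `getLength()`, and `getLocations()`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FileSplitSketch:
    """Illustrative model of a file-based split: a description, not the data."""
    path: str                # file the split belongs to
    start: int               # byte offset where the split begins
    length: int              # number of bytes covered by the split
    hosts: Tuple[str, ...]   # nodes that hold the underlying data


MB = 1024 * 1024
splits = [
    FileSplitSketch("/data/input.txt", 0, 128 * MB, ("node1", "node2")),
    FileSplitSketch("/data/input.txt", 128 * MB, 44 * MB, ("node3",)),
]

# One map task is launched per split.
num_map_tasks = len(splits)
print(num_map_tasks)  # 2
```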
3. Comparison Between InputSplit vs Blocks in Hadoop
Let’s now discuss the feature-wise differences between InputSplit and Blocks in the Hadoop framework.
3.1. Data Representation
- Block – An HDFS Block is the physical representation of data in Hadoop.
- InputSplit – A MapReduce InputSplit is the logical representation of the data present in the blocks. It is used during data processing in a MapReduce program or other processing techniques. The key point is that an InputSplit doesn’t contain the actual data; it is just a reference to the data.
3.2. Size
- Block – By default, the HDFS block size is 128 MB, which you can change as per your requirement. All blocks of a file are the same size except the last block, which can be either the same size or smaller. The Hadoop framework breaks files into 128 MB blocks and then stores them in the Hadoop file system.
- InputSplit – By default, the InputSplit size is approximately equal to the block size, but it is user-configurable. In a MapReduce program the user can control the split size based on the size of the data.
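As a rough sketch of how the split size relates to the block size, the snippet below mirrors the rule used by Hadoop's `FileInputFormat`, where the split size is `max(minSize, min(maxSize, blockSize))` and the last split may be up to 10% larger than the split size. The Python function names are hypothetical, and the real implementation handles more cases (for example, files that cannot be split), so treat this as a simplified model.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the final split may exceed the split size by up to 10%


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size = max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def split_lengths(file_size, split_size):
    """Lengths of the splits produced for one splittable file (simplified)."""
    lengths = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        lengths.append(split_size)
        remaining -= split_size
    if remaining > 0:
        lengths.append(remaining)
    return lengths


default_size = compute_split_size(128 * MB)                    # equals the block size
smaller = compute_split_size(128 * MB, max_size=64 * MB)       # user lowers the maximum
larger = compute_split_size(128 * MB, min_size=256 * MB)       # user raises the minimum

print([n // MB for n in split_lengths(300 * MB, default_size)])  # [128, 128, 44]
print([n // MB for n in split_lengths(300 * MB, smaller)])       # [64, 64, 64, 64, 44]
print([n // MB for n in split_lengths(300 * MB, larger)])        # [256, 44]
```

In a real job, the minimum and maximum are typically set through the `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize` properties.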
3.3. Example of Block and InputSplit in Hadoop
Suppose we need to store a file in HDFS. HDFS stores files as blocks, where a block is the smallest unit of data that can be stored on or retrieved from the disk. The default block size is 128 MB. HDFS breaks the file into blocks and then stores these blocks on different nodes in the cluster.
For example, suppose we have a file of 132 MB. HDFS will break this file into 2 blocks: one of 128 MB and one of 4 MB.
Now, if we ran a MapReduce operation on each block in isolation, the data would not be processed correctly. The reason is that a record that starts near the end of the first block may continue into the second block, so neither block holds that record completely. InputSplit solves this problem. A MapReduce InputSplit forms a logical grouping over the blocks, because the InputSplit includes the location of the next block and the byte offset of the data needed to complete the record.
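The boundary handling can be sketched with a small line-oriented reader. This is a simplified model loosely based on how a line record reader behaves, not Hadoop's actual code: the function `read_records`, the sample data, and the tiny 10-byte split size are assumptions chosen so the effect is easy to see. A reader whose split does not start at offset 0 skips its first line (the previous reader owns it), and every reader is allowed to read past the end of its split to finish its last record.

```python
def read_records(data, start, length):
    """Return the newline-terminated records owned by the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # The first (possibly partial) line belongs to the previous split.
        newline = data.find(b"\n", pos)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # A record is owned by this split if it starts at or before `end`;
    # reading may continue past `end` to complete that record.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        records.append(data[pos:newline].decode())
        pos = newline + 1
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"   # 26 bytes
splits = [(0, 10), (10, 10), (20, 6)]      # "bravo" straddles the first boundary

for start, length in splits:
    print((start, length), read_records(data, start, length))
# (0, 10) ['alpha', 'bravo']
# (10, 10) ['charlie', 'delta']
# (20, 6) []
```

Every record is read exactly once, even though the split boundaries cut through the middle of lines.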
Hence, an InputSplit is only a logical chunk of data, i.e. it holds just the address or location of the blocks, while a Block is the physical representation of data. We hope you now have a clearer understanding of InputSplit and HDFS data blocks after reading this blog. If you find any other difference between InputSplit and Blocks, do let us know in the comment section.