HDFS Tutorial – A Complete Introduction to HDFS for Beginners

Wants to know how Hadoop stores massive amounts of data in a reliable and fault-tolerant manner?

In this HDFS tutorial, we are going to discuss one of the core components of Hadoop, that is, Hadoop Distributed File System (HDFS).

First, we will see an introduction to Distributed FileSystem. Then we will study the Hadoop Distributed FileSystem. The article explains the reason for using HDFS, HDFS architecture, and blocks in HDFS.

The article also enlists some of the features of Hadoop HDFS. Also, you will come to know about the heartbeat messages in Hadoop HDFS.

This HDFS tutorial provides the complete introductory guide to the most reliable storage Hadoop HDFS.

Let us first start with an introduction to Distributed FileSystem.

Distributed FileSystem

When the dataset exceeds the storage capacity of a single machine, then it becomes mandatory to partition the dataset across several separate machines. The filesystem that manages the data across the network of machines is called a distributed filesystem.

A distributed filesystem is a filesystem that allows us to store data across multiple machines or nodes in a cluster and allows multiple users to access data.

Since the DFS is based on the network, all the complications of network programming kick in, making a distributed file system more complex than the regular filesystem. One of the biggest challenges in DFS is to tolerate node failure without suffering data loss.

Hadoop comes with a distributed filesystem called Hadoop Distributed Filesystem for storing vast amounts of data while providing fault tolerance and high availability.

Curious to know HDFS? So now, let’s start with the HDFS tutorial.

HDFS Tutorial – Introduction

Hadoop Distributed FileSystem (HDFS) is a java based distributed file system used in Hadoop for storing a large amount of structured or unstructured data, ranging in size from GigaBytes to PetaBytes, across a cluster of commodity hardware. It is the most reliable storage known to date on the planet.

In HDFS, data is stored in multiple locations, so if any of the machines fails, then data can be fetched from other machine containing the copy of data. Thus it is highly fault-tolerant and ensures no data loss even in the case of hardware failure.

It is the major component of Hadoop, along with MapReduce, YARN, and other common utilities.

It follows a Write-Once-Read-Many philosophy that simplifies data coherency and enables high-throughput access.

Why HDFS?

In today’s IT world, almost 75% of the world’s data resides in Hadoop HDFS. It is due to the following reason:

HDFS stores data across the commodity hardware due to which there is no need for high-end machines for storing big data. Thus provides economical storage for storing big data.
HDFS follows the most efficient data processing pattern that is Write-Once-Read-Many-Times pattern. A dataset generated from various sources are copied, and then the various analysis is performed on that dataset over time. So, it is best for batch processing.
HDFS can store data of any size generated from any source in any formats, either structured or unstructured.
Its write-one-read-many model relaxes the concurrency control requirements. The data can be accessed multiple times without any issue regarding data coherency.
HDFS works on the data locality assumption that is moving of computation to data is much easier and faster than moving data to the computational unit. HDFS facilitates locating processing logic near the data rather than moving data to the application space. Thus this reduces network congestion and overall turnaround time.

So, moving ahead in this HDFS tutorial, let us jump to HDFS Architecture.

HDFS Architecture

Hadoop DFS follows master-slave architecture. The HDFS consists of two types of nodes that are master node and slave nodes. The master node manages the file system namespace, that is, it stores the metadata about the blocks of files.

The slave nodes store the user data and are responsible for processing data based on the instruction from the master node.

HDFS Master

Master in HDFS is the centerpiece of Hadoop HDFS. They are the high-end machines that store metadata related to all the files stored in HDFS. It manages and maintains the filesystem namespace and provides instructions to the slave nodes.

The NameNode is the master node in Hadoop HDFS.

HDFS Slave

Slave Nodes are responsible for storing the actual business data. They are the normal configuration machines (commodity hardware) that stores and processes the datasets upon instruction from the master node.

The DataNodes are the slave nodes in Hadoop HDFS.

HDFS NameNode

NameNode is the master node. It manages filesystem namespace operations like opening/closing, renaming files, and directories. NameNode maps data blocks to DataNodes and records each change made to the filesystem namespace.

HDFS DataNode

DataNodes are the slave nodes that handle read/write requests from HDFS clients. DataNodes creates, deletes, and replicates data blocks as per the instructions from the governing name node.

Wondering how data gets stored in HDFS?

Blocks in HDFS

HDFS split the files into block-size chunks called data blocks. These blocks are stored across multiple DataNodes in the cluster. The default block size is 128 MB. We can configure the default block size, depending on the cluster configuration.

For the cluster with high-end machines, the block size can be kept large (like 256 Mb or more). For the cluster with machines having configuration like 8Gb RAM, the block size can be kept smaller (like 64 Mb).

Also, HDFS creates replicas of blocks based on the replication factor ( a number that defines the total copies of a block of a file). By default, the replication factor is 3. It means that 3 copies of each block are created and stored across multiple nodes.

If any of the DataNode fails, then the block is fetched from other DataNode containing a replica of a block. This makes HDFS fault tolerance.

Have you thought how NameNode discovers DataNode failure?

DataNode Failure

All the DataNodes in Hadoop HDFS continuously sends a small heartbeat message (signals) to NameNode to tell “I am Alive” in every 3 seconds.

If NameNode does not get a heartbeat message from any particular DataNode for more than 10 minutes, then it considers that DataNode as dead and starts creating a replica of blocks that were available on that DataNode.

NameNode instructs the DataNodes containing a copy of that data to replicate that data on other DataNodes to balance the replication. In this way, NameNode discovers DataNode failure.

Want to know how NameNode places replicas on different DataNode? Let us explore rack awareness in HDFS to get an answer to the above question.

Rack Awareness in HDFS

Hadoop HDFS stores data across the cluster of commodity hardware. To provide fault tolerance, replicas of blocks are created and stored on different DataNodes.

NameNode places the replicas of blocks on multiple DataNodes by following the Rack Awareness algorithm to ensure no data loss even if DataNode or the whole rack goes down. The NameNode places the first replica on the nearest DataNode.

It stores the second replica on different DataNode on the same rack and the third replica on different DataNode on a different rack.

If the replication factor is 2, then it places the second replica on a different DataNode on a different rack so that if a complete rack goes down, then also the system will be highly available.

The main purpose of a rack-aware replica placement policy is to improve fault tolerance, data reliability, availability.

Next in the HDFS tutorial, we discuss some key features of Hadoop HDFS.

Important Features of Hadoop HDFS

1. High Availability

It is a highly available file system. In this file system, data gets replicated among the nodes in the Hadoop cluster by creating a replica of the blocks on the other slaves present in the HDFS cluster. So, whenever a user wants to access this data, they can access their data from the slaves, which contain its blocks.

2. Fault Tolerance

Fault tolerance in Hadoop HDFS is the working strength of a system in unfavorable conditions. It is highly fault-tolerant. Hadoop framework divides data into blocks.

After that, it creates multiple copies of blocks on different machines in the cluster. So, when any machine in the cluster goes down, then a client can easily access their data from the other machine, which contains the same copy of data blocks.

3. High Reliability

HDFS provides reliable data storage. It can store data in the range of 100s of petabytes. HDFS stores data reliably on a cluster. It divides the data into blocks. Then, the Hadoop framework stores these blocks on nodes present in the cluster.

HDFS also stores data reliably by creating a replica of each and every block present in the cluster. Hence provides fault tolerance facility.

4. Replication

Data Replication is a unique feature of HDFS. Replication solves the problem of data loss in an unfavorable condition like hardware failure, crashing of nodes, etc. HDFS maintains the process of replication at a regular interval of time.

It also keeps creating replicas of user data on different machines present in the cluster. So, when any node goes down, the user can access the data from other machines. Thus, there is no possibility of losing of user data.

5. Scalability

It stores data on multiple nodes in the cluster. So, whenever requirements increase, you can scale the cluster. Two scalability mechanisms are available in HDFS: Vertical and Horizontal Scalability.

6. Distributed Storage

HDFS features are achieved via distributed storage and replication. It stores data in a distributed manner across the nodes. In Hadoop, data is divided into blocks and stored on the nodes present in the cluster.

After that, it creates the replica of each and every block and store on other nodes. When the single machine in the cluster gets crashed, we can easily access our data from the other nodes which contain its replica.

Next in the HDFS tutorial, we discuss some useful HDFS operations.

HDFS Operation

Hadoop HDFS has many similarities with the Linux file system. We can do almost all the operation we can do with a local file system like create a directory, copy the file, change permissions, etc.

It also provides different access rights like read, write and execute to users, groups, and others.

1. Read Operation

When the HDFS client wants to read any file from HDFS, the client first interacts with NameNode. NameNode is the only place that stores metadata. NameNode specifies the address of the slaves where data is stored. Then, the client interacts with the specified DataNodes and read the data from there.

HDFS client interacts with the distributed file system API. Then, it sends a request to NameNode to send a block location. NameNode first checks if the client has sufficient privileges to access the data or not? After that, NameNode will share the address at which data is stored in the DataNode.

NameNode provides a token to the client, which it shows to the DataNode for reading the file for security purposes. When a client goes to DataNode for reading the file, after checking the token, DataNode allows the client to read that particular block.

After that client opens the input stream and starts reading data from the specified DataNodes. Thus, in this manner, the client reads data directly from DataNode.

2. Writing Operation

For writing a file, the client first interacts with NameNode. HDFS NameNode provides the address of the DataNode on which data has to be written by the client.

When the client finishes writing the block, the DataNode starts replicating the block into another DataNode. Then it copies the block to the third DataNode. Once it creates required replication, it sends a final acknowledgment to the client. The authentication is the same as the read operation.

The client just sends 1 copy of data irrespective of our replication factor, while DataNodes replicate the blocks. Writing of file is not costly because it writes multiple blocks parallelly multiple blocks on several DataNodes.

Summary

In the HDFS tutorial conclusion, we can say that Hadoop HDFS stores data in a distributed manner across the cluster of commodity hardware.

Hadoop HDFS is a highly reliable, fault-tolerant, and highly available storage system known to date. It follows the master-slave architecture where NameNode is the master node, and the DataNodes are the slave nodes.

Also, the HDFS splits the client’s input file into blocks of size 128 MB, which we can configure as per our requirement. It also stores replicas of blocks to provide fault tolerance.

NameNode follows rack awareness policy for placing replicas on DataNode to ensure that no data is lost during machine failure or hardware failure. In addition, the DataNodes sends a heartbeat message to NameNode to ensure that they are alive.

During file read or write, the client first interacts with the NameNode.

The Hadoop HDFS is scalable, reliable, distributed, fault-tolerant, and highly available storage system for storing big data.