HDFS – A Complete Introduction to HDFS for Beginners
In this HDFS tutorial, we are going to discuss everything about the Hadoop Distributed File System. First of all, we will answer what HDFS in Hadoop is and what the NameNode and DataNode are. We will also cover the HDFS architecture, its features, and HDFS data read and write operations in this Hadoop tutorial.
2. What is HDFS?
The Hadoop Distributed File System is the primary storage system of Hadoop. It stores very large files on a cluster of commodity hardware and is based on GFS (Google File System). HDFS stores data reliably even in the case of hardware failure. It also provides high-throughput access to applications by accessing data in parallel. One prediction estimated that by the end of 2017, 75% of the data available on the planet would be residing in HDFS.
3. HDFS Nodes
HDFS works in a master-slave fashion. It has two types of nodes: the NameNode (master) and DataNodes (slaves).
- Master Node – Also called the NameNode. The NameNode manages all the slave nodes and assigns work to them. The NameNode is deployed on reliable hardware, as it is the centerpiece of the Hadoop Distributed File System.
- Slave Node – Also called a DataNode. It is deployed on each slave machine and provides the actual storage. DataNodes are the actual worker nodes and are responsible for serving read-write requests from clients.
4. HDFS Daemons
There are 3 daemons that run in HDFS for the storage of data.
- NameNode – The daemon that runs on the master. It stores metadata such as the number of blocks, filenames, and so on. The metadata is kept in memory on the master for faster retrieval; a copy is also kept on the local disk for persistence.
- Secondary NameNode – Its main function is to take checkpoints of the file system metadata present on the NameNode. It is not a backup node; it is a helper to the primary NameNode and does not replace it.
- DataNode – The daemon that runs on each slave. DataNodes are the actual worker nodes that store the data.
5. HDFS Data storage
When a client wants to write a file to HDFS, the Hadoop framework breaks the file into small pieces of data known as blocks. The default block size is 128 MB, which we can configure as per our requirements. The blocks are then stored across the cluster in a distributed manner and written in parallel.
The Hadoop framework replicates each block and stores the copies on different nodes across the cluster. By default, the replication factor is 3.
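The arithmetic of block splitting and replication can be sketched in a few lines of Python. This is a minimal illustration of the sizing logic, not HDFS code; the function name and the 300 MB example file are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default dfs.blocksize: 128 MB
REPLICATION = 3                 # default dfs.replication

def split_into_blocks(file_size):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    full, last = divmod(file_size, BLOCK_SIZE)
    sizes = [BLOCK_SIZE] * full
    if last:
        sizes.append(last)  # the final block only occupies the bytes it needs
    return sizes

# A 300 MB file becomes two 128 MB blocks plus one 44 MB block; with a
# replication factor of 3, the cluster ends up storing 9 block copies.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks), len(blocks) * REPLICATION)  # → 3 9
```

Note that a small final block does not waste a full 128 MB on disk; each block copy occupies only as much local storage as the data it actually holds.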
6. Rack Awareness in HDFS
HDFS runs on a cluster of computers that are commonly spread across many racks. The NameNode places replicas of a block on more than one rack, so that even if a complete rack goes down the system remains highly available. The main purpose of this rack-aware replica placement policy is to improve fault tolerance, data reliability, and availability.
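The default placement policy (for a replication factor of 3) puts the first replica on the writer's rack and the other two on a single remote rack, balancing write bandwidth against rack-failure tolerance. Below is a toy Python sketch of that policy; the rack and DataNode names are hypothetical, and real HDFS also accounts for node load and free space.

```python
import random

def place_replicas(nodes_by_rack, local_rack):
    """Toy version of HDFS's default rack-aware placement for replication
    factor 3: first replica on the local rack, the second on a different
    (remote) rack, and the third on that same remote rack but another node."""
    first = (local_rack, random.choice(nodes_by_rack[local_rack]))
    remote = random.choice([r for r in nodes_by_rack if r != local_rack])
    second_node, third_node = random.sample(nodes_by_rack[remote], 2)
    return [first, (remote, second_node), (remote, third_node)]

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
print(place_replicas(cluster, "rack1"))
```

Whatever random choices are made, the three replicas always land on three distinct nodes spread over exactly two racks, so the block survives the loss of any one node or any one rack.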
7. HDFS Architecture
The architecture gives a complete picture of Hadoop HDFS. There is one NameNode and multiple DataNodes. The NameNode stores metadata, while the DataNodes are the actual worker nodes. Nodes are arranged in racks, and replicas of data blocks are stored on different racks in the cluster to improve data reliability and fault tolerance. To read or write a file, the client interacts with the NameNode, which is the centerpiece of the cluster.
There are many DataNodes in the cluster, and each stores data on its local disk. A DataNode sends a heartbeat message to the NameNode periodically to indicate that it is alive. It also replicates data to other DataNodes as per the replication factor.
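The heartbeat bookkeeping on the NameNode side can be sketched as follows. This is a simplified Python model, not Hadoop code; the class and DataNode names are hypothetical, and the 10.5-minute dead-node threshold reflects the usual defaults (3-second heartbeats, roughly 630 seconds before a silent node is declared dead), which are configurable in a real cluster.

```python
HEARTBEAT_INTERVAL = 3       # DataNodes typically report every 3 seconds
DEAD_AFTER = 10 * 60 + 30    # a silent node is declared dead after ~10.5 min

class NameNode:
    def __init__(self):
        self.last_heartbeat = {}  # datanode name -> timestamp of last report

    def receive_heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def live_datanodes(self, now):
        """DataNodes that have reported within the dead-node window."""
        return [dn for dn, t in self.last_heartbeat.items() if now - t <= DEAD_AFTER]

nn = NameNode()
nn.receive_heartbeat("dn1", now=0)
nn.receive_heartbeat("dn2", now=600)
# At t=700, dn1 last reported 700 s ago (> 630 s) and is considered dead.
print(nn.live_datanodes(now=700))  # → ['dn2']
```

Once a DataNode is declared dead, the real NameNode schedules re-replication of the blocks it held so the cluster returns to the configured replication factor.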
8. Features of HDFS
Important features of HDFS are:
8.1. High Availability
HDFS is a highly available file system. Data gets replicated among the nodes in the Hadoop cluster by creating replicas of the blocks on other slaves present in the HDFS cluster. So, whenever a user wants to access the data, they can read it from any of the slaves that contain its blocks.
8.2. Fault Tolerance
Fault tolerance in Hadoop HDFS is the working strength of the system in unfavorable conditions. HDFS is highly fault-tolerant: the Hadoop framework divides data into blocks and then creates multiple copies of the blocks on different machines in the cluster. So, when any machine in the cluster goes down, a client can still access the data from another machine that holds the same copy of the data blocks.
8.3. High Reliability
HDFS provides reliable data storage and can store data in the range of hundreds of petabytes. It divides data into blocks, and the Hadoop framework stores these blocks on the nodes present in the cluster. HDFS also stores data reliably by creating a replica of each and every block, and hence provides fault tolerance.
8.4. Replication
Data replication is a unique feature of HDFS. Replication solves the problem of data loss in unfavorable conditions like hardware failure or the crashing of nodes. HDFS maintains the replication process at regular intervals of time and keeps creating replicas of user data on different machines present in the cluster. So, when any node goes down, the user can access the data from other machines, and there is little possibility of losing user data.
8.5. Scalability
HDFS stores data on multiple nodes in the cluster, so whenever requirements increase you can scale the cluster. Two scalability mechanisms are available in HDFS: vertical and horizontal scalability.
8.6. Distributed Storage
HDFS features are achieved via distributed storage and replication. It stores data in a distributed manner across the nodes: in Hadoop, data is divided into blocks and stored on the nodes present in the cluster. After that, HDFS creates a replica of each and every block and stores it on other nodes. When a single machine in the cluster crashes, we can easily access our data from the other nodes that contain its replica.
9. HDFS Operation
Hadoop HDFS has many similarities with the Linux file system. We can do almost all the operations we can do with a local file system, such as creating a directory, copying a file, changing permissions, and so on.
It also provides different access rights (read, write, and execute) to users, groups, and others.
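These access rights follow the familiar POSIX triads, so a permission string such as `rwxr-x---` maps to the octal mode you would pass to `hdfs dfs -chmod`. A small Python sketch of that mapping (the function name is an illustration, not an HDFS API):

```python
def mode_to_octal(mode):
    """Convert an 'rwxr-x---'-style permission string to its octal form.

    The nine characters are three triads: user, group, others.
    r counts as 4, w as 2, x as 1 within each triad.
    """
    assert len(mode) == 9, "expected exactly three rwx triads"
    digits = []
    for i in range(0, 9, 3):
        triad = mode[i:i + 3]
        value = 0
        if triad[0] == "r": value += 4
        if triad[1] == "w": value += 2
        if triad[2] == "x": value += 1
        digits.append(str(value))
    return "".join(digits)

# 'hdfs dfs -chmod 750 /dir' grants rwx to the owner, r-x to the group,
# and no access to others.
print(mode_to_octal("rwxr-x---"))  # → 750
```

The same triads appear in the output of `hdfs dfs -ls`, just as they do in `ls -l` on Linux.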
9.1. Read Operation
When an HDFS client wants to read a file from HDFS, the client first interacts with the NameNode. The NameNode is the only place that stores the metadata, and it provides the addresses of the slaves where the data is stored. The client then interacts with the specified DataNodes and reads the data from there.
The HDFS client interacts with the distributed file system API, which sends a request to the NameNode for the block locations. The NameNode first checks whether the client has sufficient privileges to access the data; if so, it shares the addresses of the DataNodes at which the data is stored.
For security purposes, the NameNode provides a token to the client, which the client shows to the DataNode before reading the file. When the client goes to a DataNode to read the file, the DataNode checks the token and then allows the client to read that particular block. The client then opens an input stream and starts reading data from the specified DataNodes. In this manner, the client reads data directly from the DataNodes.
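The key point of the read path (metadata from the NameNode, data directly from the DataNodes) can be sketched as a toy Python model. All class, block, and node names here are hypothetical, and the token check and input-stream machinery are omitted for brevity.

```python
class NameNode:
    """Toy NameNode: holds only metadata, never the file data itself."""
    def __init__(self, block_locations):
        # filename -> ordered list of (block_id, [DataNodes holding it])
        self.block_locations = block_locations

    def get_block_locations(self, filename):
        return self.block_locations[filename]

class DataNode:
    """Toy DataNode: holds the actual block bytes."""
    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, datanodes, filename):
    """Client-side read: fetch block locations from the NameNode, then
    stream each block directly from one DataNode that stores it."""
    data = b""
    for block_id, holders in namenode.get_block_locations(filename):
        data += datanodes[holders[0]].read_block(block_id)
    return data

dn1 = DataNode({"blk_1": b"Hello, "})
dn2 = DataNode({"blk_2": b"HDFS!"})
nn = NameNode({"/user/file.txt": [("blk_1", ["dn1"]), ("blk_2", ["dn2"])]})
print(read_file(nn, {"dn1": dn1, "dn2": dn2}, "/user/file.txt"))
```

Notice that the file bytes never pass through the NameNode; this is what keeps the NameNode from becoming a bandwidth bottleneck.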
9.2. Writing Operation
For writing a file, the client first interacts with the NameNode. The HDFS NameNode provides the addresses of the DataNodes to which the client has to write the data.
When the client finishes writing a block, the first DataNode starts replicating the block to another DataNode, which then copies it to a third DataNode. Once the required replication is created, a final acknowledgment is sent to the client. Authentication works the same as in the read operation.
The client sends only one copy of the data irrespective of the replication factor, while the DataNodes replicate the blocks among themselves. Writing a file is therefore not costly, as HDFS writes multiple blocks in parallel on several DataNodes.
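This chained replication can be sketched as a toy Python pipeline. The class and node names are hypothetical, and the real HDFS pipeline streams packets and acknowledgments rather than whole blocks, but the shape is the same: the client writes once, and each DataNode forwards to the next.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data, pipeline):
        """Store the block locally, then forward it to the next DataNode
        in the replication pipeline (mirrors HDFS's chained replication)."""
        self.blocks[block_id] = data
        if pipeline:
            pipeline[0].write_block(block_id, data, pipeline[1:])

dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
# The client sends a single copy to the first DataNode only;
# dn1 forwards the block to dn2, which forwards it to dn3.
dn1.write_block("blk_1", b"payload", [dn2, dn3])
print([dn.blocks["blk_1"] for dn in (dn1, dn2, dn3)])
```

Because the client uploads each block exactly once, its outbound bandwidth does not grow with the replication factor; the cluster's internal network carries the replica traffic instead.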
In conclusion, we can say that HDFS stores huge amounts of data in a distributed manner. The NameNode and DataNodes are the two types of nodes in HDFS. HDFS breaks large files into small chunks known as blocks; a block is the smallest unit of data in the filesystem. It then replicates these blocks and stores them across the cluster. Replication of blocks provides high availability of data, which is the main strength of HDFS. Learn about the other Hadoop Ecosystem components in detail.