How Hadoop Works – Understand the Working of Hadoop
In this Hadoop tutorial, we will discuss how Hadoop works internally. Hadoop runs five daemons – NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager – and these daemons carry out Hadoop's internal working. In this Big Data Hadoop tutorial, we will look at each of these daemons in depth and see how Apache Hadoop works with their help.
Apache Hadoop is an open-source software framework that stores data in a distributed manner and processes that data in parallel. Hadoop provides a reliable storage layer – HDFS, a batch processing engine – MapReduce, and a resource management layer – YARN. Five daemons run across these three layers. Daemons are processes that run in the background. The five daemons of Hadoop are as follows:
- NameNode
- DataNode
- Secondary NameNode
- ResourceManager
- NodeManager
3. Apache Hadoop Daemons
Let’s discuss these five Hadoop daemons in detail.
3.1. NameNode
It works as the Master in a Hadoop cluster. The NameNode stores metadata, i.e. the number of blocks, their replicas, and other details. This metadata is kept in memory on the master. The NameNode also assigns tasks to the slave nodes. As it is the centerpiece of HDFS, it should be deployed on reliable hardware.
3.2. DataNode
It works as a Slave in a Hadoop cluster. In Hadoop HDFS, the DataNode is responsible for storing the actual data. DataNodes perform read and write operations as requested by the clients. DataNodes can be deployed on commodity hardware.
3.3. Secondary NameNode
Its main function is to take checkpoints of the file system metadata present on the NameNode. It is not a backup NameNode. It is a helper to the primary NameNode, but it does not replace the primary NameNode.
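The checkpoint idea can be sketched in a few lines of plain Python. This is an illustrative toy, not Hadoop's actual API: the function names and the dictionary-based "fsimage" are assumptions made for the example. The point is that pending edit-log entries are merged into the metadata snapshot so the NameNode's edit log does not grow without bound.

```python
# Toy sketch of checkpointing (not the real Hadoop API):
# merge the pending edit log into the fsimage snapshot.

def checkpoint(fsimage, edit_log):
    """Apply pending edit-log entries to the fsimage snapshot."""
    merged = dict(fsimage)
    for op, path, value in edit_log:
        if op == "create":
            merged[path] = value       # record the new file's metadata
        elif op == "delete":
            merged.pop(path, None)     # drop metadata for deleted files
    return merged, []                  # new fsimage, emptied edit log

# Hypothetical metadata: path -> replication factor
fsimage = {"/data/a.txt": 3}
edits = [("create", "/data/b.txt", 3), ("delete", "/data/a.txt", None)]
fsimage, edits = checkpoint(fsimage, edits)
print(fsimage)  # {'/data/b.txt': 3}
```

After the merge, the Secondary NameNode ships the fresh fsimage back to the NameNode, which can then truncate its edit log.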
3.4. ResourceManager
It is a cluster-level component and runs on the master machine. It manages resources and schedules applications running on top of YARN. It has two components: the Scheduler and the ApplicationsManager.
3.5. NodeManager
It is a node-level component. A NodeManager runs on each slave machine. It continuously communicates with the ResourceManager to keep it up to date.
4. How Hadoop Works Internally
Now that we have discussed the Apache Hadoop daemons in detail, let us learn how Hadoop works with their help.
To process any data, the client first submits the data and the program. Hadoop stores the data using HDFS and then processes it using MapReduce.
4.1. Hadoop Data Storage
Let us first learn how Hadoop stores the data.
Hadoop Distributed File System – HDFS – is the primary storage system of Hadoop. It stores very large files across a cluster of commodity hardware. HDFS stores data reliably even in the case of machine failure. It also provides high-throughput access to applications by serving data in parallel.
The data is broken into small chunks called blocks. A block is the smallest unit of data that the file system stores. Hadoop distributes the data blocks across multiple nodes. Then, each block is replicated as per the replication factor (by default 3). Once all the blocks of the data are stored on DataNodes, the user can process the data.
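The splitting and replication described above can be sketched as a small Python simulation. This is a toy, not the HDFS API: the function names are invented for illustration, and the placement here is simple round-robin, whereas real HDFS placement is rack-aware.

```python
# Illustrative sketch (not the HDFS API): split a byte payload into
# fixed-size blocks and place each block on `replication` distinct nodes.

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break the payload into chunks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Round-robin replica placement; real HDFS is rack-aware."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

# A 300-byte "file" with a 100-byte block size yields 3 blocks.
blocks = split_into_blocks(b"x" * 300, block_size=100)
print(len(blocks))  # 3
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

Each block ends up on three different DataNodes, so the loss of any single machine never loses data.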
4.2. Hadoop Data Processing
Let us now learn how Hadoop processes the data.
Hadoop MapReduce is the data processing layer. It is the framework for writing applications that process the vast amounts of data stored in HDFS. MapReduce processes a huge amount of data in parallel by dividing the job into a set of independent tasks (sub-jobs). In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce.
- Map – It is the first phase of processing, in which we specify all the complex logic, business rules, and costly code. The map takes a set of data and converts it into another set of data, breaking individual elements into tuples (key-value pairs).
- Reduce – It is the second phase of processing, in which we specify light-weight processing such as aggregation or summation. The output from the map is the input to the reducer. The reducer combines tuples (key-value pairs) based on the key and then modifies the value of the key accordingly.
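The two phases above can be illustrated with the classic word-count example, written here as a toy in plain Python rather than Hadoop's Java API. The `shuffle` step, which groups map output by key before the reduce phase, is performed automatically by the framework in real Hadoop.

```python
# Toy word-count illustrating the Map and Reduce phases (not Hadoop's API).
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) key-value pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big cluster", "big data"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster, many map tasks run in parallel on different blocks of the input, and the reducers aggregate their combined output.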
In this tutorial, we have learned how Hadoop stores huge amounts of data using HDFS and how MapReduce processes that data. Hadoop works by breaking the input data into small chunks called blocks and moving those blocks to different nodes. Once all the blocks are stored on DataNodes, the user can process the data. I hope this blog helps you learn the workings of Apache Hadoop. If you have any query about how Hadoop works, you can share it with us in the comment section.