
Hadoop – An Apache Hadoop Tutorial for Beginners

Apache Hadoop Tutorial

The main goal of this Hadoop tutorial is to cover every important aspect of the Apache Hadoop framework. It is designed so that you can learn Hadoop from the basics.

In this article, we answer questions such as: what is Big Data Hadoop, why is Hadoop needed, what is the history of Hadoop, and what are the advantages and disadvantages of the Apache Hadoop framework?

Our hope is that after reading this article, you will have a clear understanding of the Hadoop framework.

What is Hadoop?

Hadoop is an open-source software framework for the distributed storage and processing of very large data sets. Open source means it is freely available, and you can even modify its source code to suit your requirements.

It makes it possible to run applications on clusters with thousands of nodes. Its distributed file system provides rapid data transfer between nodes and allows the system to continue operating even when a node fails.

Hadoop provides:

- Distributed storage through HDFS
- Distributed parallel processing through MapReduce
- Resource management through YARN

Hadoop – History

Hadoop has its roots in Nutch, an open-source web crawler started by Doug Cutting and Mike Cafarella to index millions of web pages. In October 2003, Google published its GFS (Google File System) paper, and the ideas in that paper shaped what would become Hadoop's storage layer.

In 2004, Google published its MapReduce paper, and in 2005 the Nutch project implemented its own versions of GFS and MapReduce to perform its operations.

In 2006, computer scientists Doug Cutting and Mike Cafarella created Hadoop from this work. In early 2006, Doug Cutting joined Yahoo, which provided the resources and a dedicated team to turn Hadoop into a system that ran at web scale. In 2007, Yahoo started using Hadoop on a 100-node cluster.

In January 2008, Hadoop became a top-level project at Apache, confirming its success. By then, many companies besides Yahoo! were using Hadoop, such as The New York Times and Facebook.

In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data: running on a 910-node cluster, it sorted one terabyte in 209 seconds.

In December 2011, Apache Hadoop released version 1.0. In August 2013, version 2.0.6 became available, and in June 2017 Apache Hadoop 3.0.0-alpha4 was released. The ASF (Apache Software Foundation) manages and maintains Hadoop's framework and its ecosystem of technologies.

Why Hadoop?

Having covered the introduction, let us now look at why Hadoop is needed.

Hadoop emerged as a solution to the following “Big Data” problems:

a. Storage for Big Data – HDFS solves this problem by storing Big Data in a distributed manner. HDFS stores each file as a set of blocks; a block is the smallest unit of data in the filesystem.

Suppose you have 512 MB of data and you have configured HDFS to create 128 MB data blocks. HDFS divides the data into 4 blocks (512 / 128 = 4) and stores them across different DataNodes. It also replicates each block on multiple DataNodes.

Hence, storing big data is not a challenge.
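To make this concrete, here is a minimal sketch (not from the original tutorial) that uses the HDFS Java API to copy a file into HDFS with a 128 MB block size and then print the blocks it was split into. The NameNode address and the file paths are hypothetical placeholders.

```java
// Minimal sketch: write a file to HDFS with 128 MB blocks and list its blocks.
// The cluster address and paths below are assumed placeholders, not real values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // assumed NameNode address
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);  // 128 MB blocks

        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("/local/data/input-512mb.bin"); // hypothetical 512 MB local file
        Path dst = new Path("/user/hadoop/input-512mb.bin");
        fs.copyFromLocalFile(src, dst);

        // A 512 MB file with 128 MB blocks yields 512 / 128 = 4 blocks,
        // each replicated on several DataNodes.
        FileStatus status = fs.getFileStatus(dst);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```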

b. Scalability – Hadoop also solves the scaling problem. It focuses on horizontal scaling rather than vertical scaling: you add extra DataNodes to the HDFS cluster as and when required, instead of scaling up the resources of individual nodes.

This enhances capacity and performance dramatically.

c. Storing a variety of data – HDFS solves this problem too. It can store all kinds of data (structured, semi-structured, or unstructured) and follows a write-once, read-many model.

Because of this, you can write any kind of data once and read it many times to extract insights.

d. Data Processing Speed – This is a major problem with big data. To solve it, Hadoop moves the computation to the data instead of moving the data to the computation. This principle is called data locality.

Hadoop Core Components

Now we will look at the Apache Hadoop core components in detail. Hadoop has 3 core components:

- HDFS
- MapReduce
- YARN

Let’s discuss these core components one by one.

a. HDFS

The Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. HDFS stores very large files on a cluster of commodity hardware. It follows the principle of storing a small number of large files rather than a huge number of small files.

HDFS stores data reliably even in the case of hardware failure, and it provides high-throughput access to applications by reading data in parallel.

Components of HDFS:

- NameNode – the master daemon that stores the filesystem metadata (namespace and block locations) and manages the DataNodes.
- DataNode – the slave daemon that stores the actual data blocks and serves read/write requests from clients.

b. MapReduce

MapReduce is the data processing layer of Hadoop. It processes large volumes of structured and unstructured data stored in HDFS, and it does so in parallel.

It does this by dividing the submitted job into a set of independent tasks (sub-jobs). MapReduce breaks the processing into two phases: Map and Reduce.
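The standard WordCount program that ships with Hadoop illustrates these two phases; a minimal version is sketched below. The Mapper emits a (word, 1) pair for every word in the input, and the Reducer sums those counts for each word.

```java
// Minimal WordCount sketch (the standard Hadoop example, not code from this article).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would typically package this class into a JAR and submit it with a command such as `hadoop jar wordcount.jar WordCount <input path> <output path>`.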

c. YARN

YARN provides resource management and is often described as the operating system of Hadoop. It is responsible for managing and monitoring workloads and for implementing security controls. Apache YARN is also a central platform for delivering data governance tools across clusters.

YARN allows multiple data processing engines, such as real-time streaming and batch processing, to run on the same cluster.

Components of YARN:

- ResourceManager – the master daemon that arbitrates cluster resources among all running applications.
- NodeManager – the per-node daemon that launches and monitors containers and reports resource usage back to the ResourceManager.
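As an illustration, the sketch below (not part of the original article) uses Hadoop's YarnClient API to ask the ResourceManager for the running NodeManagers and the applications currently known to the cluster. The ResourceManager address is a hypothetical placeholder.

```java
// Minimal sketch, assuming a reachable ResourceManager: query YARN for
// cluster nodes and running applications via the YarnClient API.
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical ResourceManager address; in practice this usually
        // comes from yarn-site.xml on the classpath.
        conf.set("yarn.resourcemanager.address", "resourcemanager:8032");

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // NodeManagers currently registered and running.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println("Node " + node.getNodeId()
                    + " capacity: " + node.getCapability());
        }

        // Applications (MapReduce jobs and other workloads) known to YARN.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " " + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```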

Advantages of Hadoop

Let us now discuss the main advantages of Hadoop for solving big data problems:

- Scalability – new DataNodes can be added to the cluster as data grows.
- Fault tolerance – data blocks are replicated across DataNodes, so the system keeps working when nodes fail.
- Flexibility – structured, semi-structured, and unstructured data can all be stored and analyzed.
- Low cost – Hadoop is open source and runs on clusters of commodity hardware.
- Data locality – computation moves to the data, which reduces network traffic and speeds up processing.

Disadvantages of Hadoop

Some disadvantages of the Apache Hadoop framework are given below:

- Small files – HDFS is designed for a small number of large files; a huge number of small files overloads the NameNode's metadata.
- Latency – MapReduce is a batch processing engine, so it is not well suited to low-latency or real-time workloads.
- Iterative processing – each MapReduce job reads from and writes to disk, which makes iterative algorithms slow.
- Complexity – writing, tuning, and operating MapReduce jobs and clusters requires significant expertise.

Conclusion

In conclusion, Hadoop is one of the most popular and powerful Big Data tools. It stores huge amounts of data in a distributed manner and then processes that data in parallel on a cluster of nodes.

It provides HDFS, the world's most reliable storage layer; MapReduce, the batch processing engine; and YARN, the resource management layer.

Together, these layers and their daemons provide Hadoop's core functionality.
