Limitations of Hadoop, Ways to Resolve Hadoop Drawbacks
We have discussed Hadoop Features in our previous Hadoop tutorial. Now we are going to cover the limitations of Hadoop. The Apache Hadoop framework has various drawbacks, for example the small files problem, slow processing, batch processing only, latency, security issues, vulnerability, and no caching. We will discuss all these limitations of Hadoop in detail in this Hadoop tutorial.
2. What is Hadoop?
Apache Hadoop is an open source software framework for distributed storage and processing of very large data sets. Open source means it is freely available, and we can even change its source code as per our requirements. Apache Hadoop also makes it possible to run applications on a system with thousands of nodes. Its distributed file system provides rapid data transfer rates among nodes. It also allows the system to continue operating in case of node failure. The main features of Hadoop are as follows:
- In Apache Hadoop, data is available despite machine failure due to many copies of data. So, if any machine crashes, then one can access the data from another path.
- Apache Hadoop is scalable, as it is easy to add new nodes to the cluster.
- Hadoop is highly fault-tolerant, as by default 3 replicas of each block are stored across the cluster. So, if any node in the cluster goes down, the data on that node can easily be recovered from another node.
- Apache Hadoop runs on a cluster of commodity hardware which is not very expensive.
- In Apache Hadoop, data is reliably stored on the cluster despite hardware failure due to replication of data on the cluster.
Although Hadoop is one of the most powerful tools for Big Data, it has various limitations. Because of these limitations, Apache Spark and Apache Flink came into existence.
3. Limitations of Hadoop
The various limitations of Apache Hadoop are given below, along with their solutions.
3.1. Issues with Small Files
The main problem with Hadoop is that it is not suitable for small data. Because of its high-capacity design, HDFS lacks the ability to support random reading of small files.
Small files are files significantly smaller than the HDFS block size (default 128 MB). HDFS cannot efficiently handle a huge number of such files, because it was designed to work with a small number of large files for storing large data sets rather than a large number of small files. If there are too many small files, the NameNode is overloaded, since it stores the namespace of HDFS.
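To see why the NameNode is the bottleneck, a rough back-of-the-envelope estimate can help. The sketch below assumes a commonly quoted rule of thumb of about 150 bytes of NameNode memory per namespace object (one object per file plus one per block). That figure and the helper names `namenode_objects` and `namenode_bytes` are assumptions for illustration, not values taken from this tutorial or from Hadoop itself.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size mentioned above (128 MB)
BYTES_PER_OBJECT = 150          # ASSUMPTION: rough rule of thumb, not an exact figure


def namenode_objects(file_size, file_count, block_size=BLOCK_SIZE):
    """Count namespace objects: one per file plus one per block of each file."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return file_count * (1 + blocks_per_file)


def namenode_bytes(file_size, file_count):
    """Estimate NameNode memory for file_count files of file_size bytes each."""
    return namenode_objects(file_size, file_count) * BYTES_PER_OBJECT


# The same 1 TB of data stored two ways.
total = 1024 ** 4
small_file = 100 * 1024  # 100 KB files
small = namenode_bytes(small_file, total // small_file)
large = namenode_bytes(BLOCK_SIZE, total // BLOCK_SIZE)

print("100 KB files:", small, "bytes of NameNode memory (estimate)")
print("128 MB files:", large, "bytes of NameNode memory (estimate)")
```

Under these assumptions, the same volume of data costs the NameNode over a thousand times more memory when it is stored as 100 KB files, even though the DataNodes hold the same number of bytes.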
One solution is simply to merge the small files into bigger files and then copy the bigger files to HDFS.
Hadoop Archives (HAR files) also deal with the problem of lots of small files. Hadoop Archives work by building a layered filesystem on top of HDFS. HAR files are created with the Hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. However, reading files through a HAR is no more efficient than reading through HDFS. In fact it is slower, because each HAR file access requires reading two index files as well as the data file.
Sequence files also overcome the small file problem. Here we use the filename as the key and the file contents as the value. By writing a program for small files (for example, 100 KB each), we can put them into a single Sequence file and then process them in a streaming fashion, operating on the Sequence file. Because a Sequence file is splittable, MapReduce in Hadoop can break it into chunks and operate on each chunk independently.
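The key-value idea behind this approach can be sketched in a few lines. The example below is only an illustration of packing many small files into one container with the filename as the key and the contents as the value. It does not produce the actual Hadoop SequenceFile format, and the length-prefixed layout, the function names, and the sample file names are all hypothetical.

```python
import io
import struct


def write_record(stream, key, value):
    """Append one (key, value) record; each part is prefixed by its 4-byte length."""
    key_bytes = key.encode("utf-8")
    stream.write(struct.pack(">I", len(key_bytes)))
    stream.write(key_bytes)
    stream.write(struct.pack(">I", len(value)))
    stream.write(value)


def read_records(stream):
    """Yield (key, value) records one at a time, in a streaming fashion."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of container
        (key_len,) = struct.unpack(">I", header)
        key = stream.read(key_len).decode("utf-8")
        (value_len,) = struct.unpack(">I", stream.read(4))
        yield key, stream.read(value_len)


# Hypothetical small files, kept in memory so the sketch is self-contained.
small_files = {"a.log": b"first", "b.log": b"second", "c.log": b""}

container = io.BytesIO()
for name, contents in small_files.items():
    write_record(container, name, contents)

# Read everything back from the single container.
container.seek(0)
recovered = dict(read_records(container))
print(sorted(recovered))
```

The point of the design is that the file system sees one large file instead of many small ones, while a reader can still recover each original file by its key.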
Storing files in HBase is another way to overcome the small file problem. We are not actually storing millions of small files in HBase; rather, we add the binary content of each file to a cell.
3.2. Slow Processing Speed
MapReduce processes huge amounts of data. In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce. MapReduce requires a lot of time to perform these tasks, which increases latency and hence reduces processing speed.
Apache Spark overcomes this issue through in-memory processing of data. With in-memory processing, little time is spent moving data and processes in and out of the disk, which makes it faster. Apache Spark has been reported to run up to 100 times faster than MapReduce for workloads that fit in memory.
Flink can also overcome this issue. Because of its streaming architecture, Flink can process data faster than Spark.
3.3. Support for Batch Processing only
Hadoop supports only batch processing; it is not suitable for streaming data, so overall performance is slower. The MapReduce framework also does not make full use of the memory of the Hadoop cluster.
Apache Spark solves this problem, as it supports stream processing. However, Spark stream processing is not as efficient as Flink's, because it uses micro-batch processing. Apache Flink improves the overall performance, as it provides a single runtime for streaming as well as batch processing.
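The difference between micro-batch and record-at-a-time processing can be illustrated with a simplified model. The sketch below measures how long an event waits before processing can start, in abstract "ticks". It ignores processing time and scheduling overhead entirely, and the function names are hypothetical; it is not how Spark or Flink are implemented.

```python
def micro_batch_delay(arrival_tick, batch_interval):
    """Ticks an event waits until the end of its batch interval (micro-batch style)."""
    batch_end = (arrival_tick // batch_interval + 1) * batch_interval
    return batch_end - arrival_tick


def per_record_delay(arrival_tick):
    """In a record-at-a-time model, processing can start as soon as the event arrives."""
    return 0


arrivals = range(6)  # one event arrives at each of ticks 0..5
batch_delays = [micro_batch_delay(t, batch_interval=3) for t in arrivals]
stream_delays = [per_record_delay(t) for t in arrivals]

print("micro-batch waits:", batch_delays)
print("per-record waits: ", stream_delays)
```

In this model every event in a micro-batch waits for its interval to close, so the average wait grows with the batch interval, while the per-record model has no such waiting.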
3.4. No Real-time Processing
Apache Hadoop is a batch processing framework. This means it takes a huge amount of data as input, processes it, and produces the result. Batch processing is very efficient for processing a high volume of data, but depending on the size of the data being processed and the computational power of the system, the output can be delayed significantly. Apache Hadoop is therefore not suitable for real-time processing.
Spark is suitable for stream processing. Stream processing provides continuous input and output of data, and it processes data within a small amount of time.
Flink provides a single runtime for both streaming and batch processing.
3.5. Iterative Processing
Apache Hadoop is not very efficient for iterative processing, as Hadoop does not support cyclic data flow (i.e. a chain of stages in which each output of the previous stage is the input to the next stage).
Spark overcomes this issue, as Apache Spark accesses data from RAM instead of the disk. This dramatically improves the performance of an iterative algorithm that accesses the same dataset repeatedly. In Apache Spark, for iterative processing, each iteration still has to be scheduled and executed separately.
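A small simulation shows why keeping the dataset in memory matters for iterative algorithms. The `SimulatedDisk` class and both functions below are hypothetical names used only for illustration; the sketch counts how often the input is loaded and is not Spark or Hadoop code.

```python
class SimulatedDisk:
    """Stands in for slow storage and counts how many times the data is read."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._data)


def iterate_reloading(disk, iterations):
    """Every iteration reads its input from storage again."""
    estimate = 0.0
    for _ in range(iterations):
        data = disk.load()
        # Toy iterative update: move the estimate halfway towards the mean.
        estimate = (estimate + sum(data) / len(data)) / 2
    return estimate


def iterate_cached(disk, iterations):
    """The input is read once and then reused from memory in every iteration."""
    data = disk.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = (estimate + sum(data) / len(data)) / 2
    return estimate


disk_a = SimulatedDisk([1, 2, 3])
disk_b = SimulatedDisk([1, 2, 3])
result_a = iterate_reloading(disk_a, iterations=5)
result_b = iterate_cached(disk_b, iterations=5)

print("reloading reads:", disk_a.reads, "cached reads:", disk_b.reads)
```

Both variants compute the same result, but the reloading variant touches storage once per iteration, which is the cost that in-memory access avoids.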
3.6. Latency
MapReduce in Hadoop is slower because it supports different formats and structures and huge amounts of data. In MapReduce, Map takes a set of data and converts it into another set of data, where an individual element is broken down into a key-value pair. Reduce takes the output from the map as input and processes it further. MapReduce requires a lot of time to perform these tasks, thereby increasing latency.
Apache Spark can reduce this issue. Although Spark is a batch system, it is relatively faster, because it caches much of the input data in memory using RDDs. Apache Flink's data streaming achieves low latency and high throughput.
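The Map and Reduce steps described here can be sketched as a word count in plain Python. This is a single-process illustration of the programming model only, with hypothetical function names and sample input; in Hadoop the intermediate pairs are written to disk and moved between nodes, which is part of where the latency comes from.

```python
from collections import defaultdict


def map_phase(lines):
    """Break each input line into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group all values that share a key, as happens between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the grouped values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["Hadoop stores data", "Spark processes data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```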
3.7. No Ease of Use
A MapReduce developer in Hadoop needs to hand-code each and every operation, which makes it very difficult to work with. In Hadoop, MapReduce has no interactive mode, but adding Hive and Pig makes working with MapReduce a little easier.
Spark has overcome this issue, as it has an interactive mode, so that developers and users alike can get intermediate feedback for queries and other activities. Spark is also easy to program, as it has many high-level operators. One can also use Apache Flink, as it has high-level operators too.
3.8. Security Issue
Managing security for complex applications in Apache Hadoop is challenging. Out of the box, Hadoop does not encrypt data at the storage and network levels, which is a major point of concern. Apache Hadoop supports Kerberos authentication, which is hard to manage.
Apache Spark provides a security bonus. If you run Apache Spark on HDFS, it can use HDFS ACLs and file-level permissions.
3.9. Vulnerable by Nature
Apache Hadoop is written in Java. Java is one of the most popular languages, and hence it is also among those most heavily exploited by cybercriminals.
3.10. No Caching
Apache Hadoop is not efficient for caching. MapReduce cannot cache intermediate data in memory for later use, and this diminishes the performance of Hadoop.
Spark and Flink overcome this issue. Spark and Flink cache data in memory for further iterations, which enhances the overall performance.
3.11. Lengthy Code
Apache Hadoop has about 120,000 lines of code. The more lines of code there are, the more bugs there are likely to be, and the more time it takes to execute the programs.
Spark and Flink are written in Scala and Java. Much of the implementation is in Scala, so the number of lines of code is smaller than in Hadoop. Thus, it takes less time to execute the programs.
As a result of Hadoop's limitations, the need for Spark and Flink emerged. They make the system friendlier for working with huge amounts of data.
Apache Spark provides in-memory processing of data, which improves the processing speed. Flink improves performance, as it provides a single runtime for streaming as well as batch processing. Spark also provides a security bonus. Hence, one can resolve many of these Hadoop limitations by using other big data technologies like Apache Spark and Flink.
If you find other limitations of Hadoop, please let us know by leaving a comment in the section below.