Battle: Apache Spark vs Hadoop MapReduce

1. Spark vs Hadoop – Objective

Spark vs Hadoop is a popular debate these days, and the rising popularity of Apache Spark is the starting point of this battle. In the big data world, Spark and Hadoop are both popular Apache projects. We can say that Apache Spark is an improvement on the original Hadoop MapReduce component. Since Spark can be up to 100x faster than Hadoop MapReduce and offers more comfortable APIs, some people think this could be the end of the Hadoop era. Still, there is a debate on whether Spark is replacing Apache Hadoop, and on which is becoming the top big data analytics tool. Ultimately, we will see who wins the battle, Hadoop or Spark.

In this blog, we will compare both on the basis of different features, so we will get answers to all our questions. First, we will introduce each of them to set up the comparison. Afterwards, we will move on to the differences between Spark and Hadoop MapReduce. That will help us conclude who wins the battle between Hadoop and Spark.

Apache Spark v/s Hadoop MapReduce

2. Introduction: Spark vs Hadoop

2.1. Hadoop MapReduce

Hadoop MapReduce is an open-source framework for writing applications that process structured and unstructured data stored in HDFS. Hadoop MapReduce can handle large volumes of data on a cluster of commodity hardware, processing the data in batch mode.
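The MapReduce model itself is simple to sketch outside Hadoop. Below is a minimal, illustrative Python word count that mimics the map, shuffle, and reduce phases; the function names are hypothetical stand-ins, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data in hdfs", "spark and hadoop process data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["hadoop"])  # 2
```

In a real Hadoop job, the map and reduce functions run on many machines, and the shuffle moves data between them over the network; the data flow, however, is exactly this.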

2.2. Apache Spark

Apache Spark is also an open-source big data framework, and a general-purpose data processing engine designed around fast computation. In memory, it can be up to 100x faster than MapReduce. Apart from batch processing, it also covers a wide range of workloads, for example interactive, iterative, and streaming jobs.

Now, let’s start the battle, Hadoop MapReduce vs Spark, on the basis of their features.

3. Apache Spark vs Hadoop MapReduce

3.1. Implementation Language

Hadoop MapReduce: It is written in Java.

Apache Spark: It is written in Scala.

3.2. Programming Language support

Hadoop MapReduce: Initially, Hadoop MapReduce supported only Java. Other languages, for example C, C++, Ruby, Groovy, Perl, and Python, are also supported via Hadoop Streaming.
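Hadoop Streaming works by piping data through any executable that reads lines on stdin and writes tab-separated key/value pairs on stdout. As an illustrative sketch of that contract in Python (run here on an in-memory list rather than real stdin/stdout; in a real job these two functions would live in separate scripts passed to the streaming jar):

```python
from itertools import groupby

def mapper(lines):
    # Streaming mapper: emit "word<TAB>1" for each word, one record per line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    # Streaming reducer: input arrives sorted by key; sum the counts per word.
    for word, group in groupby(sorted_records, key=lambda r: r.split("\t")[0]):
        total = sum(int(r.split("\t")[1]) for r in group)
        yield f"{word}\t{total}"

# Hadoop sorts mapper output by key before the reducer sees it; sorted() mimics that.
records = sorted(mapper(["to be or not to be"]))
result = dict(r.split("\t") for r in reducer(records))
print(result["to"])  # prints 2
```

Because the contract is just "lines in, lines out", the same mapper and reducer could equally be written in Ruby, Perl, or C++.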

Apache Spark: Spark has rich APIs. It supports Scala, Java, Python, and R, as well as SQL.

3.3. Real-time analysis

Hadoop MapReduce: When we talk about real-time data processing, MapReduce falls short. It was purposely designed to perform batch processing on huge amounts of data.

Apache Spark: Spark easily supports real-time data processing. It can handle live event streams, such as Twitter feeds or Facebook shares and posts, at a rate of millions of events per second. In other words, efficient processing of live streams is one of Spark's strengths.
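Spark Streaming achieves this by slicing the live stream into small micro-batches and running an ordinary batch job on each one. A rough, Spark-free sketch of that idea in plain Python (all names here are illustrative, not Spark's API):

```python
def micro_batch_stream(events, batch_size):
    # Slice an incoming event stream into fixed-size micro-batches,
    # the way Spark Streaming slices a live stream into small time windows.
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def process_batch(batch):
    # Run an ordinary batch computation on each micro-batch:
    # here, count events per user.
    counts = {}
    for user, _action in batch:
        counts[user] = counts.get(user, 0) + 1
    return counts

events = [("alice", "post"), ("bob", "share"), ("alice", "like"),
          ("bob", "post"), ("carol", "share"), ("alice", "post")]
results = [process_batch(b) for b in micro_batch_stream(events, batch_size=3)]
print(results[0])  # {'alice': 2, 'bob': 1}
```

The key design point is that streaming reuses the batch engine: each micro-batch is just a small batch job, which is why Spark needs no separate system such as Storm.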

3.4. Speed

Hadoop MapReduce: Processing speed is slow, because each step reads from and writes to disk.

Apache Spark: Spark runs applications up to 100x faster in memory and up to 10x faster on disk than Hadoop MapReduce. This is because Spark reduces the number of read/write cycles to disk and stores intermediate data in memory.

3.5. Interactive mode

Hadoop MapReduce: It does not support Interactive Mode.

Apache Spark: In Spark, data can be processed interactively, for example through the Spark shell.

3.6. Latency

Hadoop MapReduce: It is a high-latency computing framework.

Apache Spark: Spark is a low latency computing framework.

3.7. Difficulty

Hadoop MapReduce: It is not easy to work with Hadoop MapReduce, since developers need to hand-code each operation.

Apache Spark: Spark's RDDs (Resilient Distributed Datasets) come with tons of high-level operators. Therefore, Spark is easy to program.
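The difference is easy to see in miniature: where MapReduce forces every computation into map and reduce steps, RDD-style code chains high-level operators. The tiny class below is a toy, single-machine stand-in for an RDD (not Spark's API) just to show the programming style:

```python
from functools import reduce

class ToyRDD:
    # A toy stand-in for an RDD, only to illustrate chaining high-level
    # operators; real RDDs are distributed, lazy, and fault-tolerant.
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))

    def reduce(self, f):
        return reduce(f, self.data)

numbers = ToyRDD(range(1, 11))
# Sum of squares of the even numbers, written as one operator chain.
total = (numbers.filter(lambda x: x % 2 == 0)
                .map(lambda x: x * x)
                .reduce(lambda a, b: a + b))
print(total)  # 220
```

Expressing the same pipeline as a hand-coded MapReduce job would take a mapper class, a reducer class, and driver boilerplate; the operator chain is the whole program.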

3.8. Streaming

Hadoop MapReduce: It can only process data in batch mode.

Apache Spark: Through Spark Streaming, Spark can process real-time data.

3.9. Easy to Manage

Hadoop MapReduce: MapReduce provides only a batch engine, so for other requirements we depend on different engines, for example Storm, Giraph, or Impala. Hence, it is very difficult to manage so many components.

Apache Spark: Spark is a complete data analytics engine. It can perform batch processing, interactive processing, machine learning, and streaming over the same cluster, so there is no need to manage a different component for each need. Installing Spark over the cluster is enough to fulfill all requirements.

3.10. Ease of use

Hadoop MapReduce: We need to work with low-level APIs to process the data, which requires lots of hand coding. As a result, MapReduce is more complex.

Apache Spark: In Spark, RDDs enable users to process data using high-level operators, and Spark provides rich APIs in Java, Scala, Python, and R. Therefore, Spark is easier to use.

3.11. Fault tolerance

Hadoop MapReduce: MapReduce is highly fault-tolerant. In case of any failure, there is no need to restart the application from scratch.

Apache Spark: Like MapReduce, Apache Spark is also fault-tolerant. Thus, in case of any failure, there is no need to restart the application from scratch.

3.12. Scheduler

Hadoop MapReduce: To schedule complex workflows, MapReduce needs an external job scheduler such as Oozie.

Apache Spark: Because it computes in memory, Spark has its own flow scheduler.

3.13. Recovery

Hadoop MapReduce: As we know, Hadoop MapReduce is a highly fault-tolerant system. Therefore, it is naturally resilient to system faults or failures.

Apache Spark: With RDDs, Spark can recover partitions on failed nodes by re-computing the DAG of transformations. Spark also supports checkpointing, a recovery style more similar to Hadoop's, which reduces the dependencies of an RDD.
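The recovery-by-recomputation idea can be sketched concisely: instead of replicating every intermediate result, record the base data plus the chain of transformations (the lineage), and rebuild any lost partition by replaying that chain. All names below are illustrative, not Spark's internals:

```python
def rebuild_partition(base_partition, lineage):
    # Recover a partition by replaying its recorded lineage
    # (the ordered list of transformations) over the base data.
    data = list(base_partition)
    for transform in lineage:
        data = [transform(x) for x in data]
    return data

# Base data for one partition, plus the transformations applied to it.
base = [1, 2, 3, 4]
lineage = [lambda x: x + 10, lambda x: x * 2]

derived = rebuild_partition(base, lineage)    # computed once in normal operation
recovered = rebuild_partition(base, lineage)  # after a "node failure": recompute,
                                              # rather than restarting the whole job
print(recovered)  # [22, 24, 26, 28]
```

Checkpointing trades the other way: it writes a materialized copy to stable storage so a long lineage never has to be replayed from the very beginning.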

3.14. Cost

Hadoop MapReduce: Comparing the two in terms of cost, MapReduce is the cheaper option.

Apache Spark: It is more costly than Hadoop MapReduce, since it requires a lot of RAM to run in memory, which increases the cost of the cluster.

3.15. Security

Hadoop MapReduce: Thanks to Kerberos and Access Control Lists (ACLs), a traditional file-permission model, Hadoop MapReduce is more secure.

Apache Spark: It supports only authentication through a shared secret. Therefore, Spark is a little less secure in comparison to MapReduce.

3.16. License

Hadoop MapReduce: It is licensed under Apache License 2.0.

Apache Spark: Like Hadoop, it is licensed under Apache License 2.0.

3.17. OS support

Hadoop MapReduce: It is cross-platform.

Apache Spark: It is also cross-platform.

3.18. Category

Hadoop MapReduce: Hadoop MapReduce is a basic data processing engine.

Apache Spark: It is a data analytics engine, and hence a popular choice for data scientists.

3.19. Community

Hadoop MapReduce: Much of its community has shifted to Spark.

Apache Spark: Spark has a very strong community. Since it is one of the most active projects at Apache.

3.20. Scalability

Hadoop MapReduce: We can keep adding nodes to the cluster; the largest known Hadoop cluster has around 14,000 nodes. Hence, MapReduce is highly scalable.

Apache Spark: We can also keep adding nodes to the cluster; the largest known Spark cluster has around 8,000 nodes. Hence, Spark is also highly scalable.

3.21. SQL support

Hadoop MapReduce: By using Apache Hive, users can run SQL queries on Hadoop.

Apache Spark: By using Spark SQL, users can run SQL queries.
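Both approaches boil down to the same idea: expose the data through a table abstraction and query it declaratively instead of hand-coding map and reduce steps. As a local, Spark-free illustration of that idea using Python's built-in sqlite3 (the table and column names are made up for the example):

```python
import sqlite3

# Build a tiny in-memory table standing in for a distributed dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("alice", 200), ("bob", 150), ("alice", 300)])

# An aggregation that would need a hand-coded MapReduce job becomes
# one declarative query, which is what Hive and Spark SQL provide at scale.
rows = conn.execute(
    "SELECT user, SUM(bytes) FROM logs GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 500), ('bob', 150)]
```

Hive compiles such queries into MapReduce (or Tez) jobs, while Spark SQL compiles them into Spark jobs with in-memory execution, which is where the speed difference comes from.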

3.22. Machine Learning

Hadoop MapReduce: For machine learning, Hadoop relies on Apache Mahout, a separate tool.

Apache Spark: Spark ships with MLlib, its own machine learning library.

3.23. Lines of Code

Hadoop MapReduce: It is developed in about 120,000 lines of code.

Apache Spark: It has merely about 20,000 lines of code.

3.24. Caching

Hadoop MapReduce: MapReduce cannot cache data in memory for future requirements. As a result, system performance suffers.

Apache Spark: Spark can cache data in memory for further iterations, which increases system performance.
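The effect on iterative jobs is easy to demonstrate: compute an expensive intermediate dataset once, keep it in memory, and reuse it each iteration instead of recomputing it. A minimal sketch in plain Python (illustrative names; in Spark the equivalent is calling `cache()` on an RDD or DataFrame):

```python
expensive_calls = 0

def load_and_transform():
    # Stand-in for an expensive pipeline, e.g. reading from disk and parsing.
    global expensive_calls
    expensive_calls += 1
    return [x * x for x in range(5)]

# Without caching: every iteration recomputes the intermediate data,
# which is effectively what MapReduce does between chained jobs.
for _ in range(3):
    data = load_and_transform()
assert expensive_calls == 3

# With caching: compute once, then reuse the in-memory result each iteration.
expensive_calls = 0
cached = load_and_transform()
for _ in range(3):
    data = cached
print(expensive_calls)  # 1
```

This is exactly why iterative algorithms such as machine learning training loops benefit so much from Spark: the input is loaded and transformed once, then iterated over in memory.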

3.25. Hardware Requirements

Hadoop MapReduce: On commodity hardware, MapReduce runs very well.

Apache Spark: It requires mid- to high-level hardware.

4. Spark vs Hadoop – Conclusion

As a result, we have seen that Spark has excellent performance and is highly cost-effective thanks to in-memory data processing. It is compatible with all of Hadoop's data sources and file formats, and it has friendly APIs available in several languages. So it is a bit difficult to conclude who wins the battle between Spark and MapReduce.

However, Spark looks like the big winner. Yet we cannot use it alone: we still need HDFS to store the data, and we may also want to use HBase, Hive, Pig, Impala, or other Hadoop projects. That means we still need to run Hadoop alongside Spark for a full big data package. With that, we have seen a complete comparison between Hadoop MapReduce and Spark along with their features. We hope we have answered every question on your mind regarding Apache Spark vs Hadoop MapReduce.
