Hadoop Spark Compatibility: Hadoop and Spark, better together
1. Hadoop Spark Compatibility – Objective
This tutorial is all about Hadoop Spark compatibility. Together, Hadoop and Spark form a powerful system that addresses a wide range of Big Data requirements. Spark complements Hadoop with the ability to handle diverse workloads that are not possible with Hadoop's MapReduce alone. Apache Spark was not developed to replace Hadoop; it was developed to complement it. Hadoop's storage layer, HDFS, is a highly reliable storage system, but its processing layer is limited to batch processing. Here Spark comes to the rescue: with it we can handle batch, streaming, interactive, iterative, and graph-processing workloads.
First of all, let us discuss what Hadoop and Apache Spark are. Hadoop is an open-source software framework for distributed storage and processing of very large data sets, which makes it possible to run applications on clusters with thousands of nodes. Apache Spark is a powerful cluster-computing engine, purpose-built for fast computation on Big Data.
This tutorial explains all three modes of using Spark over Hadoop: Standalone, YARN, and SIMR (Spark In MapReduce). To understand them in detail, we will study the launching methods for each mode. In closing, we will also cover how SIMR works.
2. Spark and Hadoop working together
At first, many people are confused by the term "replace", believing that Spark replaced Hadoop. To clarify: Spark does not replace Hadoop; it enhances Hadoop's functionality. From the day Spark appeared, it was designed to read and write data from and to HDFS, as well as other storage systems such as HBase and Amazon S3. Hadoop users can therefore enhance their processing capabilities by combining Hadoop with Spark. Below we discuss Hadoop Spark compatibility in detail.
3. Hadoop – Spark Compatibility
It is as easy as possible for every Hadoop user to benefit from Spark's capabilities. Hadoop Spark compatibility is not affected by whether we run Hadoop 1.x or Hadoop 2.0 (YARN). Whether or not we have privileges to configure the Hadoop cluster, there is a way for us to run Spark. In particular, there are three modes to deploy Spark in a Hadoop cluster: Standalone, YARN, and SIMR.
Let's understand all three ways one by one. The diagram below also illustrates Hadoop Spark compatibility.
Standalone Deployment:
With a standalone deployment, we can statically allocate resources across the whole cluster, or across a subset of machines in a Hadoop cluster, and run Spark side by side with Hadoop MapReduce. The user can then run arbitrary Spark jobs on their HDFS data. The simplicity of this deployment makes it the choice of many Hadoop 1.x users.
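As a rough sketch, a standalone Spark cluster can be brought up next to an existing Hadoop deployment with the scripts shipped in Spark's sbin directory (hostnames here are placeholders; in Spark 3.x the worker script is named start-worker.sh instead of start-slave.sh):

```shell
# On the machine chosen as the Spark master
# (the master web UI defaults to port 8080):
$SPARK_HOME/sbin/start-master.sh

# On each worker machine (a subset of the Hadoop nodes, if desired),
# pointing at the master's URL:
$SPARK_HOME/sbin/start-slave.sh spark://master-host:7077
```

Spark jobs submitted to spark://master-host:7077 then run alongside Hadoop MapReduce on the same machines.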
Hadoop YARN:
This mode allows users to easily integrate Spark into their Hadoop stack and take advantage of the full power of Spark, as well as of other components running on top of it. Hadoop users who are planning to deploy Hadoop YARN can simply run Spark on YARN, and users who have already deployed Hadoop YARN can use Spark just as easily. No pre-installation or administrative access is required.
Spark In MapReduce (SIMR):
Another option, besides standalone deployment, is SIMR. Hadoop users who are not running YARN yet can use SIMR to launch Spark jobs inside MapReduce. With SIMR, users can start working with Spark easily and can use its shell within a couple of minutes of downloading it. SIMR lowers the barrier of deployment and lets everyone play with Spark.
To understand better, let us dive into the launching methods of all three ways one by one:
3.1. Launching of Spark in Standalone Mode
For a standalone cluster, Spark supports two deploy modes: client mode and cluster mode.
Client mode: the driver launches in the same process in which the client submits the application.
Cluster mode: the driver launches from one of the worker processes inside the cluster. The client process exits as soon as it submits the application, without waiting for the application to finish.
For example:
bin/spark-submit --master spark://ubuntu:7077 --class com.df.wc.Wordcount ../sparkwc.jar hdfs://localhost:9000/inp hdfs://localhost:9000/out0004
– Running application in Standalone Mode
If we want to run a Spark application in standalone mode, taking input from HDFS, use a command of the form (class and jar names are placeholders):
$ ./bin/spark-submit --class MyApp --master spark://<master-host>:7077 MyApp.jar hdfs://localhost:9000/input-file-path hdfs://localhost:9000/output-file-path
– Adding the jar
If we launch the application with spark-submit, the application jar is automatically distributed to all worker nodes. Any extra jars should be specified through the --jars flag, using a comma as the delimiter. If the application was submitted with the --supervise flag and exits with a non-zero exit code, standalone cluster mode will restart it automatically.
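For example, a hypothetical submission that ships two extra jars through --jars and enables restart-on-failure with --supervise (the jar and class names are made up for illustration):

```shell
$ ./bin/spark-submit \
    --master spark://ubuntu:7077 \
    --deploy-mode cluster \
    --supervise \
    --jars extra-lib1.jar,extra-lib2.jar \
    --class com.df.wc.Wordcount \
    ../sparkwc.jar hdfs://localhost:9000/inp hdfs://localhost:9000/out0005
```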
3.2. Launching of Spark on YARN
There is a precondition that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory containing the (client-side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and to connect to the YARN ResourceManager.
The configuration in this directory is distributed to the YARN cluster so that all containers used by the application see the same configuration. If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application's configuration.
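A typical session therefore exports the configuration directory before submitting; the path below is only an assumption, so substitute your cluster's actual config location:

```shell
# Point Spark at the client-side Hadoop configuration
# (path is an example; adjust for your installation):
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Any subsequent spark-submit with --master yarn will read
# the ResourceManager address and HDFS settings from there:
./bin/spark-submit --master yarn --deploy-mode cluster \
    --class path.to.your.Class app.jar
```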
There are two deploy modes for launching Spark applications on YARN: cluster mode and client mode.
Cluster mode: the driver runs inside an application master process, which is managed by YARN on the cluster. The client can go away after starting the application.
Client mode: the driver runs in the client process, and the application master is used only for requesting resources from YARN.
In YARN mode, the address of the ResourceManager is taken from the Hadoop configuration, so the master parameter here is simply yarn.
– To launch a Spark application in cluster mode, the command is:
$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
– To launch a Spark application in client mode, the command is the same, with cluster replaced by client:
$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode client
– Debugging Application on YARN
Both the application master and the executors run inside containers on YARN. After an application has completed, YARN has two modes for handling container logs. Let's look at them in detail:
If log aggregation is turned on
If log aggregation is turned on, container logs are copied to HDFS and deleted from the local machine. Later, they can be viewed from anywhere on the cluster with this command:
yarn logs -applicationId <app ID>
This command prints the contents of all log files from the given application.
To view the container log files in HDFS directly, we can use the HDFS shell or API. The directory where the log files are stored is determined by two configuration properties:
yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix
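Assuming the stock Hadoop defaults for these two properties, the aggregated log directory for a given application can be assembled as in this sketch (the user name and application ID are made-up placeholders, and newer Hadoop releases add extra bucket directories to this layout):

```shell
# Defaults for the two YARN properties (override to match your cluster):
REMOTE_LOG_DIR="/tmp/logs"   # yarn.nodemanager.remote-app-log-dir
LOG_DIR_SUFFIX="logs"        # yarn.nodemanager.remote-app-log-dir-suffix

APP_USER="hduser"                          # user who submitted the app
APP_ID="application_1520000000000_0001"    # example application ID

# Aggregated container logs live under <dir>/<user>/<suffix>/<app-id>:
LOG_PATH="${REMOTE_LOG_DIR}/${APP_USER}/${LOG_DIR_SUFFIX}/${APP_ID}"
echo "${LOG_PATH}"

# The directory can then be inspected with the HDFS shell:
# hdfs dfs -ls "${LOG_PATH}"
```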
If log aggregation is not turned on
If log aggregation is not turned on, all logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured as /tmp/logs or $HADOOP_HOME/logs/userlogs, depending on the Hadoop version and installation.
Viewing the logs for a container then requires going to the host that contains them and looking in that directory, where log files are organized into sub-directories by application ID and container ID.
– Adding the jar
As we already know, in cluster mode the driver runs on a different machine than the client, so SparkContext.addJar won't work out of the box with files that are local to the client. To make such files available to SparkContext.addJar, include them with the --jars option in the launch command:
$ ./bin/spark-submit --class my.main.Class \
    --master yarn \
    --deploy-mode cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2
– To make the Spark runtime jars accessible from the YARN side, specify spark.yarn.archive or spark.yarn.jars.
– If neither is specified, Spark will create a zip file containing all jars under $SPARK_HOME/jars and upload it to the distributed cache.
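To avoid re-uploading the runtime jars on every submission, they can be staged once in HDFS and referenced via spark.yarn.jars; a sketch, with illustrative paths and class names:

```shell
# One-time step: stage the Spark runtime jars in HDFS:
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put $SPARK_HOME/jars/* /spark/jars/

# Reference them at submit time
# (alternatively, set spark.yarn.jars in conf/spark-defaults.conf):
./bin/spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.jars="hdfs:///spark/jars/*.jar" \
    --class path.to.your.Class app.jar
```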
3.3 Launching of Spark in MapReduce (SIMR)
SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out of the box. Any user can run Spark directly on top of Hadoop MapReduce v1, without administrative rights and without installing Spark or Scala on any of the nodes. It only requires HDFS access and MapReduce v1. SIMR is open-sourced under the Apache license.
In addition, a user can simply download the SIMR package (3 files) that matches their Hadoop cluster and immediately start using Spark. The files include:
SIMR runtime script: simr
SIMR includes an interactive Spark shell, which lets users work with a shell backed by the computational power of the cluster. Once SIMR is downloaded, the shell can be launched through the simr script (for example, ./simr --shell).
To run a Spark program, it must be bundled along with its dependencies into a jar, and the job launched through SIMR using the following command-line syntax:
./simr jar_file main_class parameters
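For instance, a hypothetical word-count job (the jar and class names are made up; %spark_url% is a placeholder that SIMR substitutes with the Spark master URL at launch time):

```shell
./simr sparkwc.jar com.df.wc.Wordcount %spark_url% \
    hdfs://localhost:9000/inp hdfs://localhost:9000/out0004
```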
– How does SIMR work
Using SIMR, the user can interact with the driver program. SIMR runs a relay server on the master mapper and a relay client on the machine that launched SIMR. Input and output between the client and the driver go back and forth between the client and the master mapper, and HDFS is used to achieve this.
4. Hadoop Spark Compatibility – Conclusion
This Hadoop Spark compatibility tutorial has shown that Hadoop needs a diverse processing engine to process its data, while Spark needs a reliable storage system that is fault-tolerant and distributed. Hence, for Big Data professionals, this is the best combination for solving Big Data problems more efficiently than ever.
Reference – Apache Spark
If this tutorial was beneficial for you, don't hesitate to write to us.