SparkContext In Apache Spark: Entry Point to Spark Core
1. SparkContext – Objective
This tutorial gives information on the main entry point to spark core i.e. Apache Spark SparkContext. Apache Spark is a powerful cluster computing engine, therefore, it is designed for fast computation of big data. Here, we will learn what is Apache Spark SparkContext. Furthermore, we will discuss the process to create SparkContext Class in Spark and the facts that how to stop SparkContext in Spark. To get this concept deeply, we will also study various functions of SparkContext in Spark.
2. SparkContext – Introduction
We can say Apache Spark SparkContext is a heart of spark application. It is the Main entry point to Spark Functionality. Generating, SparkContext is a most important task for Spark Driver Application and set up internal services and also constructs a connection to Spark execution environment. This is essentially a client of Spark’s execution environment, that acts as a master of Spark Application.
Moreover, once we create Apache Spark SparkContext we can use it in following ways. We can make RDDs (Resilient distributed datasets) from it, which is also used to obtain broadcast variables. Broadcast variables mean, which allow the programmer to cache a read-only variable and allows reading on each machine than shipping copy of it with jobs. Furthermore, we can create accumulator using SparkContext. It allows to access spark services and helps to run jobs until we stop SparkContext by own.
In addition, SparkContext allows accessing spark cluster manager, spark Standalone, Yarn, and Apache Mesos any one of them can be selected as a cluster manager.
If we use YARN (head node), and node manager (worker node), it may work to allocate containers for executors. As resources are available then, it will allocate memory and cores. It distributes according to the configuration. If we use sparks manager (head node), and spark slave (worker node) will work to allocate resources.
3. SparkContext – Ways to create Class
The first step to create Apache Spark SparkContext. It is the main entry point to spark functionality, is to make sparkconf. We have some configuration parameters. Those parameters we pass to sparkcontext through spark driver application and these parameters explain the properties of the application. We configure the parameters according to functionalities we need. Spark use some of them to allocate resources on a cluster by executors. Resources like cluster, like the number, memory size, and cores.
In addition, it helps us to access spark cluster in a better way. Due to sparkcontext objects, we can use functions such as textfile, sequence file and parallelize. There are following context in which it can run are local, yarn-client, Mesos URL and Spark URL.
As we discussed earlier, once we create sparkcontext. We can use it to create RDDs, broadcast variable, accumulator, and run jobs. Until we stop sparkcontext we can carry most of all these things.
4. SparkContext – How to stop
There is a condition that only a single sparkcontext can be active for single JVM. Before creating a new sparkcontext we need to stop () the active one.
While we apply this command we get a message written below:
INFO SparkContext: Successfully stopped SparkContext
This message signifies that running sparkcontext stopped successfully.
5. SparkContext – Various tasks
SparkContext offers following functions. Those are list-up below:
5.1. We have following functions to get the current status of Spark Application
SparkENV is a runtime environment with spark’s public services. For a running spark application, it holds all the running services objects. It may also include serializer, block manager, map output tracker, Akka actor system and also interacts to establish a distributed computing platform for spark application.
– To access SparkEnv
In spark, each application is configured separately with spark properties. Properties also manage all the applications. We can also set these properties manually on sparkconf. We can configure following common properties, such as master URL, application name, arbitrary key-value pair and configured through set method.
– Deployment environment (as master URL)
We can categorize deployment environment in two ways, such as local and clustered. We can also run spark in local mode. This is non-distributed single – JVM deployment mode. All the execution components are present in same single JVM. It includes driver, executor, local scheduler backend, and master.
Hence, a local mode is the only mode where drivers are useful for execution. This mode is very appropriate for testing, debugging. Also, helps in demonstration purposes. Due to there is no earlier setup require to launch spark applications. Local mode is also called as spark in-process or a local version of Spark.
5.2 Spark in Cluster mode
Spark can run in cluster mode also. The following cluster managers (task schedulers or resource managers) are currently supported:
1.Spark’s own built-in standalone cluster manager
– Application name
This function gives the value to the mandatory spark.app.name setting.
– Unique identifier of the execution attempt
This application provides the unique identifier of a spark application.
– Deploy Mode
This mode returns the current value of spark.submit.deployMode setting or client if not set.
– Default level of parallelism
It is the user that initiated the sparkcontext instance.
5.3. We have following functions to set the configuration
– Master URL
This method helps to returns back the current value of spark.master. That is deployment environment in use.
– Local properties – Creating Logical Job Groups
To form logical groups of jobs by means of properties, we use local properties. These create job launched from the different thread. That belongs to the single logic group. We can also set local properties. That may affect spark Jobs submitted from the thread, as spark fair scheduler pool.
We can also use our own custom properties. Properties can propagate through to worker tasks. We can access thereby TaskContext.getLocalProperty.
– Default Logging level
In a spark application, default logging level helps to set the root login level.
5.4. We have following functions to access various services
– Task Scheduler
It helps to manage (i.e. reads or writes) task scheduler internal property.
We use livelistnerbus to declare application-wide events in a spark application. It makes a path to pass listener events to register spark listeners.
– Block Manager
It is a key-value store for blocks of data in spark. Also, acts as a local cache. It runs on every “node” in a spark, i.e. the driver and executors.
– Scheduler Backend
To support various cluster managers, it works as a pluggable interface. The first one is Apache Mesos, second is Hadoop YARN and the third one is Spark Standalone.
– Shuffle Manager
It is the pluggable mechanism for shuffle systems. On the driver and executors, it tracks shuffle dependencies for ShuffleMapStage.
5.5. Provides a function To Cancel a job
It simply means to cancel a Spark job (with an optional reason) by requesting DAG Scheduler.
5.6. Provides a function Closure Cleaning
Whenever we call an action, Spark cleans up the closure. That is the body of the action before it is serialized. This method is closure cleaning method. In addition, it not only cleans the closure but also does it transitively. That means referenced closures are cleaned transitively.
5.7. Provides a function To cancel a stage
It simply means to cancel a Spark stage (with an optional reason). Actually, by requesting DAGScheduler.
5.8. Provides a function To Register Spark listener
We can register a custom Spark Listener Interface by using add Spark Listener method.
5.9. It allows Programmable Dynamic allocation
SparkContext offers the several methods as API developer for dynamic allocation of executors:
- Request Executors
- Kill Executors
- Request Total Executors
5.10. It provides a function To access persistent RDD
This provides the collection of spark RDDs. Those have considered themselves as persistent by using the cache.
5.11. It provides a function to un-persist RDDs
Unpersist removes the RDD from the master’s block manager internal persistent RDD pamapping.
6. SparkContext – Conclusion
Hence, we have discussed, an introduction of SparkContext, a creation of SparkContext, how to stop it and it’s various tasks. However, provides the various functions in sparks, such as getting the current status, set the configuration, cancel a job, cancel a stage and much more. As a result, we get SparkContext is the main entry point to spark core and it acts as a heart of a spark application.
Reference – Apache Spark
We would love to hear your comments.