Apache Spark Partitioning and Spark Partition

by TechVidvan Team

In a distributed system, partitioning means dividing a large dataset into parts and storing those parts across the nodes of a cluster. In this blog post, we explain Apache Spark partitioning in detail.

We will also look at how to create partitions in Spark and at the different types of Spark partitioning, namely hash partitioning and range partitioning.

Finally, we will learn how to set the partitioning of data in Apache Spark.

Spark Partition – What is Partition in Spark?

Before data can be divided into partitions, it has to be stored somewhere. In Apache Spark, we store data in the form of RDDs (Resilient Distributed Datasets). An RDD is a collection of data items that can be so large that it does not fit on a single node.

Thus, the data has to be divided into partitions across several nodes. Spark partitions RDDs automatically and also distributes the partitions among the nodes automatically.

In Spark, a partition is an atomic chunk of data. Simply put, it is a logical division of the data, stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark, and an RDD is a collection of partitions.

Spark Partition – Why Use a Partitioner?

In cluster computing, keeping network traffic low is a central challenge. Later transformations on an RDD can shuffle a fair amount of data across the network.

Partitioning therefore becomes important when the data is key-value oriented. When identical keys, or keys in the same range, sit in the same partition, less data has to be shuffled, and processing becomes substantially faster.

Transformations that require shuffling data across worker nodes benefit greatly from partitioning, for example cogroup, groupBy, groupByKey and many more. These operations also need a lot of I/O.

By applying partitioning, we can reduce the number of I/O operations considerably, which speeds up data processing. This works because Spark relies on the principle of data locality.

Data locality means that worker nodes process the data that is nearest to them. So partitioning reduces network I/O, and data processing becomes a lot faster.

Spark Partition – Partitioning in Spark

To see this in practice, let's create an RDD of 20 integers with 4 partitions:

integer_RDD = sc.parallelize(range(20), 4)
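To make the idea more concrete, here is a minimal plain-Python sketch of how 20 integers might be cut into 4 contiguous slices. It does not use Spark. The helper name slice_into_partitions is hypothetical, and the even, contiguous slicing is an assumption made for illustration rather than a statement about Spark's internals.

```python
def slice_into_partitions(items, num_partitions):
    """Cut a sequence into num_partitions contiguous, near-equal slices."""
    items = list(items)
    n = len(items)
    partitions = []
    for i in range(num_partitions):
        # Integer arithmetic keeps slice sizes within one element of each other.
        start = (i * n) // num_partitions
        end = ((i + 1) * n) // num_partitions
        partitions.append(items[start:end])
    return partitions


partitions = slice_into_partitions(range(20), 4)
for index, part in enumerate(partitions):
    print(index, part)
```

On a live SparkContext, integer_RDD.getNumPartitions() returns the partition count and integer_RDD.glom().collect() returns the actual contents of each partition.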

Spark Partition – How many partitions should a Spark RDD have?

There is no fixed rule for how many partitions an RDD should have; too many and too few are both possible. We decide the number of partitions on the basis of the cluster configuration and the application.

If we increase the number of partitions, each partition holds less data, or possibly no data at all. Spark runs one concurrent task per partition of an RDD, up to the total number of cores in the cluster.

When reading a file from HDFS, Spark creates one partition for each block of the file by default. The default block size is 64 MB in older Hadoop versions and 128 MB since Hadoop 2. However, when we create an RDD we can pass a second argument that defines the number of partitions we want; for textFile, Spark treats this value as a minimum.

val rdd = sc.textFile("file.txt", 6)

In the above line of code, we create an RDD named rdd from a text file with 6 partitions. Suppose that the cluster has 5 cores and that each partition takes 5 minutes to process. With 6 partitions, 5 partitions are processed in parallel because there are 5 cores.

The 6th partition is processed after 5 minutes, once one of the 5 cores is free, so the entire job finishes in 10 minutes. While the 6th partition is being processed, the other 4 cores remain idle.
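The arithmetic in this example can be checked with a small plain-Python sketch. The function estimate_runtime is a hypothetical helper, and it assumes that every partition takes exactly the same time and that scheduling overhead is zero.

```python
import math


def estimate_runtime(num_partitions, num_cores, minutes_per_partition):
    """Return (total_minutes, idle_core_minutes) for equal-length tasks."""
    # Tasks run in "waves": each wave processes at most num_cores partitions.
    waves = math.ceil(num_partitions / num_cores)
    total_minutes = waves * minutes_per_partition
    busy_core_minutes = num_partitions * minutes_per_partition
    idle_core_minutes = num_cores * total_minutes - busy_core_minutes
    return total_minutes, idle_core_minutes


total, idle = estimate_runtime(num_partitions=6, num_cores=5, minutes_per_partition=5)
print(total, idle)  # 10 minutes in total, 20 core-minutes idle
```

With 5 partitions instead of 6, the same model gives 5 minutes and no idle time, which is why a partition count that is a multiple of the core count is attractive.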

1. Spark partitions number

A simple way to choose the number of partitions for an RDD is to make it equal to the number of cores in the cluster. All partitions are then processed in parallel, and the resources are used optimally.

The number of partitions of an RDD can always be checked through the RDD's partitions method. For the RDD that we created above, it reports 6 partitions.

scala> rdd.partitions.size

Output = 6

If an RDD has too many partitions, task scheduling may take more time than the actual execution. Having too few partitions is not beneficial either, because some of the worker nodes could just sit idle, resulting in less concurrency.

Too few partitions may also lead to poor resource utilization and to data skew: the data in a single partition might be skewed, so one worker node ends up doing more work than the others. Hence, deciding the number of partitions always involves a trade-off.

2. Guidelines for the number of partitions in Spark

The number of partitions usually lies between 100 and 10,000. The lower and upper bounds should be determined from the size of the cluster and of the data.

The lower bound is 2 × the number of cores in the cluster.
The upper bound is reached when tasks take less than about 100 ms to execute. If execution time is shorter than that, the partitioned data might be too small; in other words, the application might be spending more time scheduling tasks than running them.
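The two guidelines above can be written down as a pair of small checks. This is only a sketch: the function names lower_bound_partitions and tasks_too_small are hypothetical, and the 100 ms threshold is taken directly from the guideline rather than measured.

```python
def lower_bound_partitions(total_cores):
    """Lower bound from the guideline: twice the number of cores."""
    return 2 * total_cores


def tasks_too_small(task_duration_ms, min_task_ms=100):
    """True when a task finishes faster than the guideline threshold."""
    return task_duration_ms < min_task_ms


print(lower_bound_partitions(16))  # at least 32 partitions on 16 cores
print(tasks_too_small(40))         # True: consider fewer, larger partitions
print(tasks_too_small(250))        # False: task length is fine
```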

Spark Partition – Properties of Spark Partitioning

Tuples in the same partition are guaranteed to be on the same machine.
Each node in the cluster can contain one or more partitions.
The total number of partitions is configurable. By default, it is set to the total number of cores on all the executor nodes.

Spark Partition – Types

Hash Partitioning in Spark
Range Partitioning in Spark

1. Hash Partitioning in Apache Spark

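Hash partitioning attempts to spread the data evenly across the partitions on the basis of a key. To determine the partition, Spark uses the key's Object.hashCode method:

partition = key.hashCode() % numPartitions

Spark takes the non-negative remainder, so keys with a negative hash code still map to a valid partition.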
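The formula can be tried out in plain Python without a cluster. In this sketch, hash_partition and assign are hypothetical helper names, and integer keys are assumed to hash to themselves so that the result is easy to follow.

```python
from collections import defaultdict


def hash_partition(hash_code, num_partitions):
    """Map a hash code to a partition index in [0, num_partitions)."""
    # Python's % yields a non-negative result for a positive modulus,
    # so negative hash codes also land in a valid partition.
    return hash_code % num_partitions


def assign(keys, num_partitions):
    """Group keys by the partition index they hash to."""
    buckets = defaultdict(list)
    for key in keys:
        # Assumption: the hash code of an integer key is the integer itself.
        buckets[hash_partition(key, num_partitions)].append(key)
    return dict(buckets)


layout = assign(range(10), 3)
print(layout)
```

Keys that share a remainder end up together, so all records for one key always land in the same partition.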

2. Range Partitioning in Apache Spark

Some RDDs have keys that follow a particular ordering. For such RDDs, range partitioning is an efficient partitioning technique. With this method, tuples whose keys fall within the same range appear on the same machine.

A range partitioner partitions keys based on their ordering and on a set of sorted key ranges.
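As a rough illustration of the idea, the following plain-Python sketch assigns keys to partitions by comparing them against a sorted list of upper bounds. The name range_partition and the hard-coded bounds are assumptions for this example; Spark's own range partitioner derives its boundaries by sampling the data.

```python
from bisect import bisect_left


def range_partition(key, upper_bounds):
    """Return the index of the first range whose upper bound is >= key."""
    # Keys larger than every bound fall into the last partition.
    return bisect_left(upper_bounds, key)


upper_bounds = [100, 200]  # three ranges: <=100, 101..200, >200
keys = [5, 100, 101, 150, 200, 201, 999]

range_layout = {}
for key in keys:
    range_layout.setdefault(range_partition(key, upper_bounds), []).append(key)
print(range_layout)
```

Because the bounds are sorted, neighbouring keys stay in the same partition, which is what makes range partitioning useful for ordered data.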

Both partitioning techniques suit a wide variety of Spark use cases. Still, Spark allows users to fine-tune how their RDD is partitioned by supplying a custom partitioner object.

Custom partitioning is only available for pair RDDs, that is, RDDs of key-value pairs.
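A custom partitioner is essentially a function from a key to a partition index. The sketch below is plain Python and hypothetical throughout: domain_partitioner, the example URLs and the simple character-sum hash are all made up for illustration. It routes every URL from the same host to the same partition.

```python
from urllib.parse import urlparse


def domain_partitioner(url, num_partitions):
    """Send every URL from the same host to the same partition."""
    host = urlparse(url).netloc
    # A simple deterministic hash, so the result does not depend on
    # Python's per-process string hash randomization.
    host_hash = sum(ord(ch) for ch in host)
    return host_hash % num_partitions


pairs = [
    ("http://example.com/a", 1),
    ("http://example.com/b", 2),
    ("http://example.org/x", 3),
]
for url, value in pairs:
    print(domain_partitioner(url, 4), url, value)
```

In PySpark, a function of the key can be passed as the second argument of partitionBy(numPartitions, partitionFunc). In Scala, a custom partitioner extends org.apache.spark.Partitioner and implements numPartitions and getPartition.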

Spark Partition – Set data partitioning in Spark

We can create RDDs with specific partitioning in two ways:

By providing an explicit partitioner. To do that, call the partitionBy method on an RDD.
By applying transformations that return RDDs with specific partitioners. RDD operations that hold and propagate a partitioner are:

join
leftOuterJoin
rightOuterJoin
groupByKey
reduceByKey
foldByKey
sortByKey
partitionBy
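To see why a known partitioner matters for operations such as reduceByKey, consider this plain-Python sketch. It is not Spark code; partition_by and reduce_by_key_locally are hypothetical helpers, and integer keys are assumed. Once the pairs are partitioned by key, each partition can be reduced on its own, without moving records between partitions.

```python
def partition_by(pairs, num_partitions):
    """Group (key, value) pairs into partitions by key % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[key % num_partitions].append((key, value))
    return partitions


def reduce_by_key_locally(partition):
    """Sum values per key using only the records in one partition."""
    totals = {}
    for key, value in partition:
        totals[key] = totals.get(key, 0) + value
    return totals


pairs = [(1, 10), (2, 20), (1, 5), (3, 7), (2, 1), (4, 4)]
partitions = partition_by(pairs, 2)
local_totals = [reduce_by_key_locally(p) for p in partitions]
print(local_totals)
```

Every key shows up in exactly one partition's result, so the per-partition totals are already the final totals and no further exchange of data is needed.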

Conclusion

In summary, we have seen how Spark partitioning helps process data a lot faster. Partitioning considerably reduces the number of I/O operations and minimizes shuffling. Hence, Apache Spark partitioning proves very beneficial for data processing.