
Apache Spark RDD: Spark’s Core Abstraction

Apache Spark RDD Tutorial

Apache Spark RDDs make life easier for developers by making their work more efficient. An RDD is an immutable collection of objects distributed across the nodes of a cluster.

It is partitioned across the nodes of the cluster, so we can run parallel operations on every node. Its immutability helps maintain consistency when we perform further computations.

In this section, we will go through what an Apache Spark RDD is, why it is needed, examples of Spark RDDs, and when to use them.

What is Apache Spark RDD?

Apache Spark RDD stands for Resilient Distributed Dataset. It is Spark's core API (application programming interface): a well-partitioned collection of data elements spread across the cluster. We can perform operations on an RDD, as well as on data in external storage, to form new RDDs from it.

These operations fall into two categories: transformations and actions. A transformation creates a new dataset from an existing one; because RDDs are immutable, transforming data always produces a new RDD rather than modifying the original.

Actions, on the other hand, are operations that return a value to the driver program. Transformations on a Resilient Distributed Dataset are evaluated lazily: they are only applied when an action is called.
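As a minimal sketch of this distinction (assuming a Spark shell, where a SparkContext is already available as `sc`; the variable names are ours for illustration), the snippet below chains two transformations and then triggers them with an action:

```scala
// Assumes sc: SparkContext (provided automatically in spark-shell).
val numbers = sc.parallelize(1 to 100)

// Transformations: each returns a new, immutable RDD; nothing runs yet.
val squares     = numbers.map(n => n * n)
val evenSquares = squares.filter(_ % 2 == 0)

// Action: triggers the chain of transformations and returns a value
// to the driver program.
println(s"Even squares: ${evenSquares.count()}")
```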

We can categorize operations as coarse-grained or fine-grained. A coarse-grained operation applies to all the objects in the dataset at once; a fine-grained operation applies to a smaller subset.

We generally apply coarse-grained operations, since they work across the entire cluster simultaneously. We can also control how an RDD is partitioned and divide it manually. RDDs give users this control, and an RDD can be saved in cache memory.

Users may also persist an RDD in memory, so it can be reused efficiently across parallel operations.
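A sketch of that reuse, again assuming a SparkContext named `sc`; `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`:

```scala
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq("spark", "rdd", "spark", "cache"))
  .map(word => (word, 1))

// Keep the RDD in memory so later actions reuse it instead of
// recomputing it from the source collection.
pairs.persist(StorageLevel.MEMORY_ONLY)   // pairs.cache() is equivalent

println(pairs.count())                             // computes and caches
println(pairs.reduceByKey(_ + _).collect().toSeq)  // served from the cache
```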

The name itself indicates an RDD's properties: it is Resilient (able to recover from failures), Distributed (its data is partitioned across the nodes of a cluster), and a Dataset (a collection of data elements).

There are three ways to create RDDs, as shown in the sketch after this list:


–  Parallelizing an existing collection in the driver program

–  Referencing a dataset in an external storage system

–  Creating a new Resilient Distributed Dataset from an already existing RDD
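A sketch of all three ways, assuming a SparkContext `sc`; the HDFS path is a placeholder, not a real file:

```scala
// 1. Parallelize an existing collection from the driver program,
//    here split into 4 partitions for parallel computation.
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 4)
println(fromCollection.getNumPartitions)   // 4

// 2. Reference a dataset in an external storage system
//    (any Hadoop-compatible URI; this path is hypothetical).
val fromStorage = sc.textFile("hdfs:///data/input.txt")

// 3. Create a new RDD from an already existing one via a transformation.
val derived = fromCollection.map(_ * 10)
```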

If any failure occurs, an RDD can recover itself, which is what makes it fault tolerant. When we apply transformations to an RDD, Spark builds a logical execution plan; this logical execution plan is known as the lineage graph.

In the process, we may lose an RDD if a fault arises on a machine. By re-applying the same computations recorded in the lineage graph on another node, we can recover the same dataset again.
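Spark lets us inspect this lineage directly: `toDebugString` prints the chain of transformations that would be replayed to rebuild a lost partition. A minimal sketch:

```scala
val base    = sc.parallelize(1 to 10)
val chained = base.map(_ + 1).filter(_ % 2 == 0)

// Prints the lineage graph: the filter, the map, and the parallelized
// collection they derive from. If a partition of `chained` is lost,
// Spark replays exactly this chain on another node to recover it.
println(chained.toDebugString)
```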

Why is Spark RDD needed?

Pointing up the necessity of RDDs: Hadoop MapReduce writes intermediate results to disk between jobs, while RDDs keep data in memory across operations, and their lineage-based recovery provides fault tolerance without replicating the data itself.

Resilient Distributed Datasets: When to Use?

Considering the situations where RDDs fit best: when we need low-level control over how the data is transformed, when the data is unstructured (such as free text or media streams), or when we prefer functional transformations over a schema-based API.

Conclusion

As a result, we have seen that Apache Spark RDD succeeds where Hadoop MapReduce lacked speed. In-memory processing is another area where RDDs move things forward.

Immutability has also proven beneficial for development. We can perform parallel operations on data partitioned across the cluster, and the resulting faster computation is fruitful for us.
