
Apache Spark RDD: Spark’s Core Abstraction

Apache Spark RDD Tutorial

Apache Spark RDDs make life easier for developers by making their work more efficient. An RDD is an immutable collection of objects distributed across the nodes of a cluster.

It is partitioned across the nodes of the cluster, so we can run parallel operations on every node. Its immutability helps maintain consistency when we perform further computations.

In this section, we will go through what an Apache Spark RDD is, why it is needed, examples of Spark RDDs, and when to use them.

What is Apache Spark RDD?

Apache Spark RDD stands for Resilient Distributed Dataset. It is Spark's core API (application programming interface): a well-partitioned collection of data elements spread across the cluster. We can perform operations on an RDD, as well as on data in external storage, to form new RDDs from it.

These operations fall into two categories: transformations and actions. A transformation creates a new dataset from an existing one; because RDDs are immutable, transforming data always produces a new RDD rather than modifying the original.

Actions, on the other hand, are operations that return a value to the driver program. Transformations on a Resilient Distributed Dataset are evaluated lazily: they are only applied when an action is called.
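As a minimal sketch of this distinction (assuming a Spark shell, where a SparkContext is already available as `sc`; the variable names are ours for illustration), the snippet below chains two transformations and then triggers them with an action:

```scala
// Assumes sc: SparkContext (provided automatically in spark-shell).
val numbers = sc.parallelize(1 to 100)

// Transformations: each returns a new, immutable RDD; nothing runs yet.
val squares     = numbers.map(n => n * n)
val evenSquares = squares.filter(_ % 2 == 0)

// Action: triggers the chain of transformations and returns a value
// to the driver program.
println(s"Even squares: ${evenSquares.count()}")
```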

We can categorize operations as coarse-grained or fine-grained. A coarse-grained operation applies to all the objects in the dataset at once; a fine-grained operation applies to a smaller subset.

We generally apply coarse-grained operations, since they work across the entire cluster simultaneously. We can also control how an RDD is partitioned and divide it manually. RDDs give users this control, and an RDD can be saved in cache memory.

Users may also persist an RDD in memory, so it can be reused efficiently across parallel operations.
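A sketch of that reuse, again assuming a SparkContext named `sc`; `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`:

```scala
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq("spark", "rdd", "spark", "cache"))
  .map(word => (word, 1))

// Keep the RDD in memory so later actions reuse it instead of
// recomputing it from the source collection.
pairs.persist(StorageLevel.MEMORY_ONLY)   // pairs.cache() is equivalent

println(pairs.count())                             // computes and caches
println(pairs.reduceByKey(_ + _).collect().toSeq)  // served from the cache
```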

The name itself indicates an RDD's properties: it is Resilient (able to recover from failures), Distributed (its data is partitioned across the nodes of a cluster), and a Dataset (a collection of data elements).

There are three ways to create RDDs, as shown in the sketch after this list:


–  Parallelizing an existing collection in the driver program

–  Referencing a dataset in an external storage system

–  Creating a new Resilient Distributed Dataset from an already existing RDD
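A sketch of all three ways, assuming a SparkContext `sc`; the HDFS path is a placeholder, not a real file:

```scala
// 1. Parallelize an existing collection from the driver program,
//    here split into 4 partitions for parallel computation.
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 4)
println(fromCollection.getNumPartitions)   // 4

// 2. Reference a dataset in an external storage system
//    (any Hadoop-compatible URI; this path is hypothetical).
val fromStorage = sc.textFile("hdfs:///data/input.txt")

// 3. Create a new RDD from an already existing one via a transformation.
val derived = fromCollection.map(_ * 10)
```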

If any failure occurs, an RDD can recover itself, which is what makes it fault tolerant. When we apply transformations to an RDD, Spark builds a logical execution plan; this logical execution plan is known as the lineage graph.

In the process, we may lose an RDD if a fault arises on a machine. By re-applying the same computations recorded in the lineage graph on another node, we can recover the same dataset again.
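Spark lets us inspect this lineage directly: `toDebugString` prints the chain of transformations that would be replayed to rebuild a lost partition. A minimal sketch:

```scala
val base    = sc.parallelize(1 to 10)
val chained = base.map(_ + 1).filter(_ % 2 == 0)

// Prints the lineage graph: the filter, the map, and the parallelized
// collection they derive from. If a partition of `chained` is lost,
// Spark replays exactly this chain on another node to recover it.
println(chained.toDebugString)
```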

Why is Spark RDD needed?

Pointing up the necessity of RDDs: Hadoop MapReduce writes intermediate results to disk between jobs, while RDDs keep data in memory across operations, and their lineage-based recovery provides fault tolerance without replicating the data itself.

Resilient Distributed Datasets: When to Use?

Considering the situations where RDDs fit best: when we need low-level control over how the data is transformed, when the data is unstructured (such as free text or media streams), or when we prefer functional transformations over a schema-based API.

Conclusion

As a result, we have seen that Apache Spark RDD succeeds where Hadoop MapReduce lacked speed. In-memory processing is another area where RDDs move things forward.

Immutability has also proven beneficial for development. We can perform parallel operations on data partitioned across the cluster, and the resulting faster computation is fruitful for us.
