Apache Spark Lazy Evaluation in Spark RDD
1. Spark Lazy Evaluation – Objective
In this blog, we will cover one of the important features of RDDs: Spark lazy evaluation. A Spark RDD (Resilient Distributed Dataset) is a partitioned collection of immutable objects distributed across the cluster. Here, we will discuss what lazy evaluation is, the reasons for using this technique, how Spark implements it, and finally the advantages of Spark lazy evaluation.
2. Spark Lazy Evaluation – Introduction
What is lazy evaluation?
The word “lazy” itself indicates the meaning: ‘not right away’. Lazy evaluation means Spark evaluates something only when we require it. Spark does not execute each operation immediately; execution does not start until we trigger an action. Once an action is called, all the preceding transformations execute in one go.
This approach is more efficient than executing each operation eagerly, one at a time, which would force Spark to materialize an intermediate result for every step and delay the process. To overcome that overhead, Spark uses lazy evaluation: it recognizes that a chain of transformations can be combined into a single execution plan and run together. Simply put, no transformation is executed until we trigger an action, and we need to call an action every time we want the pipeline to actually run.
2.1 DAG in Spark
Spark maintains a record of every operation performed through a DAG. DAG stands for Directed Acyclic Graph; it keeps track of each step through its arrangement of vertices and edges. So, when we call any action, Spark consults the DAG, and the operations it has recorded drive the execution process forward.
If we talk about Hadoop MapReduce, we need to minimize the number of MapReduce passes ourselves, which is only possible by manually clubbing operations together. In Spark, by contrast, the DAG already maintains a record of all the operations and clubs them into a single execution graph, rather than building a separate one for each operation. This is a key difference between Hadoop MapReduce and Apache Spark.
So, by waiting until the last moment to execute our code, Spark enhances performance and boosts the efficiency of processing across the cluster.
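The clubbing idea can be illustrated in plain Python (no Spark needed): chained generators fuse a map and a filter into a single pass over the data, instead of materializing an intermediate result for each step:

```python
def one_pass(data):
    # Each element flows through map-then-filter before the next element
    # is read, so the two operations execute together in a single pass.
    doubled = (x * 2 for x in data)        # lazy: nothing computed yet
    kept = (x for x in doubled if x > 4)   # lazy: still nothing computed
    return list(kept)                      # consuming the chain runs it once

print(one_pass([1, 2, 3, 4]))  # [6, 8]
```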
3. Advantages of Spark Lazy Evaluation
There are several perks of lazy evaluation, listed below:
3.1. Reduces Complexities
Since we do not execute every operation eagerly but run the entire pipeline only once, lazy evaluation allows us to work with infinite data structures: an action is triggered only when a result is required, and it demands only the values it needs. This saves time as well as reducing space complexity, because intermediate results need not be materialized.
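The “infinite data structure” point can be seen with a plain-Python analogy: a lazy sequence can be conceptually infinite as long as we only ever demand a finite prefix of it:

```python
import itertools

# itertools.count() is an infinite lazy stream; squaring it is also lazy.
squares = (n * n for n in itertools.count(1))

# Demanding only the first five values forces only five computations.
first_five = list(itertools.islice(squares, 5))
print(first_five)  # [1, 4, 9, 16, 25]
```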
3.2. Saves Resources
Due to lazy evaluation, the system works more efficiently with fewer resources: it also decreases the number of queries issued.
3.3. Improves Manageability
By manageability, we mean how a program is organized. Lazy evaluation lets us organize a program as many smaller operations while keeping execution efficient, since grouping the operations reduces the number of passes over the data.
3.4. Saves Computation and Increases Speed
Because calculations are not performed the instant they are requested, lazy evaluation spares us from a bundle of unnecessary work. Only the values that are actually needed get computed, which saves time and also speeds up the process.
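A small plain-Python counter makes the saving concrete: with a lazy pipeline, only the values actually demanded are ever computed, even over a very large input:

```python
import itertools

calls = 0

def expensive(x):
    # Stand-in for a costly per-record computation; counts invocations.
    global calls
    calls += 1
    return x * x

# Lazily map expensive() over a million records...
lazy = (expensive(x) for x in range(1_000_000))

# ...but demand only the first 3 results: expensive() runs just 3 times.
wanted = list(itertools.islice(lazy, 3))
print(wanted, calls)  # [0, 1, 4] 3
```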
4. Spark Lazy Evaluation – Conclusion
As a result, we have seen that Spark lazy evaluation plays a very important role in the execution process and brings several benefits. It enhances the efficiency of the system by reducing the number of operation executions, and it also uplifts the performance of the system. In addition, it keeps track of operations through the DAG, which optimizes the process. Lazy evaluation is a key building block of Spark RDD operations.
Reference – Apache Spark
If this blog was beneficial for you, write to us in the comment box.