Persistence And Caching Mechanism In Apache Spark

In this article, we will learn about Spark RDD persistence and caching in detail. These are optimization techniques used to speed up Spark computations. We will go through why we need RDD persistence and caching, and what the benefits of RDD persistence in Spark are.

We will also see which storage levels are available for persisted RDDs. Along with that, we will study how to un-persist an RDD in Spark.


Understanding the Persistence and Caching Mechanism in RDDs

Spark RDD persistence and caching are optimization techniques for both iterative and interactive Spark computations. Iterative computations reuse intermediate results across multiple stages of a multistage application. Interactive computations allow a two-way flow of information, for example when a user queries the same dataset repeatedly from a shell or notebook.

These mechanisms save intermediate results so that upcoming stages can reuse them. The results can be stored in memory (preferred) or on disk (less preferred, because of its slower access speed). We can cache an RDD using the cache() operation, and persist an RDD using the persist() operation.
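Here is a minimal Scala sketch of both calls. It assumes a SparkContext named sc, such as the one spark-shell provides, and the datasets are purely illustrative:

```scala
import org.apache.spark.storage.StorageLevel

val numbers = sc.parallelize(1 to 1000000)

// cache() stores the RDD in memory at the default level (MEMORY_ONLY)
val squares = numbers.map(n => n.toLong * n).cache()

// persist() does the same, but lets us pick an explicit storage level
val cubes = numbers.map(n => n.toLong * n * n).persist(StorageLevel.MEMORY_AND_DISK)

// the first action materializes and stores the partitions;
// later actions reuse them instead of recomputing the lineage
squares.count()
squares.sum()   // served from the cached partitions
```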

Let us now look at Spark RDD persistence and caching one by one:

1. RDD Persistence Mechanism

By default, an RDD is recomputed from its lineage every time an action is called on it. We can overcome this by persisting the RDD: once it is persisted, calling an action no longer triggers recomputation. When we call the persist() method, each node stores the partitions of the RDD that it computes.

To persist an RDD, we use the persist() method. Apache Spark can be used from Scala, Java, Python, and other languages. In Java and Scala, persist() stores the data in the JVM as deserialized objects by default.

In Python, on the other hand, persisted data is always serialized before being stored (serialized here means one byte array per partition). In every language, there are options to store the data in memory, on disk, or in a combination of the two.

The actual persistence takes place on the first action called on the RDD. Spark provides multiple storage options, such as memory or disk, along with replication levels for the persisted data.

With persist(), the resulting RDD can be stored at any of several storage levels. One thing to remember: once a storage level has been assigned to an RDD, it cannot be changed; Spark throws an exception if we try.
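A minimal sketch of this rule, again assuming a SparkContext sc as in spark-shell; the exception shown is the one Spark raises when a different level is assigned to an already-persisted RDD:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)
rdd.count()  // first action: partitions are now stored

// attempting to assign a different level to an already-persisted RDD fails:
// rdd.persist(StorageLevel.DISK_ONLY)
// => java.lang.UnsupportedOperationException: Cannot change storage level
//    of an RDD after it was already assigned a level

// calling persist() again with the SAME level is allowed (it is a no-op)
rdd.persist(StorageLevel.MEMORY_ONLY)
```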


2. Spark Cache Mechanism

The cache mechanism is used to speed up applications that access the same RDD several times.

cache() is a synonym for persist(MEMORY_ONLY); in other words, cache() is nothing but persist() with the default storage level MEMORY_ONLY.

When to use caching

The cache mechanism is useful in the following situations:

  • When we reuse an RDD in iterative machine learning applications (see the sketch after this list)
  • When we reuse an RDD in standalone Spark applications
  • When RDD computations are expensive. Caching helps reduce the cost of recovery if an executor fails.
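As an illustration of the iterative case, here is a toy gradient-descent-style loop in Scala. The input path, parsing logic, and update rule are all made up for illustration, and sc is assumed to be a live SparkContext:

```scala
// Without cache(), every iteration would re-read and re-parse the file
val points = sc.textFile("data/points.txt")   // hypothetical input path
  .map { line =>
    val Array(x, y) = line.split(",").map(_.toDouble)
    (x, y)
  }
  .cache()

var weight = 0.0
for (_ <- 1 to 10) {
  // each iteration runs an action over the SAME cached RDD,
  // so the parsing work above is performed only once
  val gradient = points.map { case (x, y) => (y - weight * x) * x }.mean()
  weight += 0.1 * gradient
}
```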

3. Difference between Spark RDD Persistence and Caching

The difference between these operations is purely syntactic; there is only one distinction between the cache() and persist() methods. When we apply cache(), the resulting RDD can be stored only at the default storage level, which is MEMORY_ONLY.

When we apply persist(), the resulting RDD can be stored at any of the available storage levels. As discussed above, cache() is a synonym for persist(MEMORY_ONLY): cache() is simply persist() with the default storage level.
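The equivalence is easy to verify. Here is a minimal Scala sketch, assuming a SparkContext sc as provided by spark-shell:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)

// cache() and persist(MEMORY_ONLY) assign the same storage level
val a = rdd.map(_ * 2).cache()
val b = rdd.map(_ * 2).persist(StorageLevel.MEMORY_ONLY)
println(a.getStorageLevel == b.getStorageLevel)  // true

// only persist() can assign any other level
val c = rdd.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK_SER)
```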

Need for the Persistence Mechanism

Persistence allows us to use the same RDD multiple times in Apache Spark. Without it, every time we reuse an RDD, calling an action triggers a full re-evaluation of its lineage.

This repeated evaluation consumes a great deal of time and memory, especially in iterative algorithms that need to look at the same data many times. The persistence techniques were introduced to overcome this problem of repeated computation.
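A small Scala sketch of the problem and its fix, assuming a SparkContext sc; the Thread.sleep stands in for genuinely expensive work:

```scala
def expensive(n: Int): Int = { Thread.sleep(1); n * n }  // stand-in for costly work

val raw = sc.parallelize(1 to 1000).map(expensive)

// without persistence, EVERY action re-runs the whole lineage:
raw.count()   // runs expensive() for all elements
raw.sum()     // ...and runs it all over again

val kept = sc.parallelize(1 to 1000).map(expensive).cache()
kept.count()  // computes once and stores the partitions
kept.sum()    // reuses the stored partitions, no recomputation
```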

Benefits of RDD Persistence in Spark

Using RDD persistence in Apache Spark is beneficial for the following reasons:

  • It makes applications faster: we can access the same RDD multiple times without recomputing it.
  • It is time-efficient: work that previously had to be repeated across several processes now runs only once.
  • It is cost-effective: previously we could not reuse the same RDD and had to pay for many recomputations, which was expensive; with persistence and caching we can use the same RDD again and again.
  • Likewise, as discussed earlier, RDD persistence reduces execution time, speeding up the application while using memory sensibly.

Storage levels of Persisted RDDs

When we apply the persist() method, the RDD is stored at one of the following storage levels:


1.  MEMORY_ONLY (Default level)

This is the default level. The RDD is stored as deserialized objects in the available memory of the cluster's JVMs. If there is insufficient memory, some partitions are not cached and are recomputed the next time they are needed. This option uses only memory, not the disk.

2. MEMORY_AND_DISK

With this option, the RDD is stored as deserialized objects in memory. If the whole RDD does not fit, the remaining partitions are stored on disk and read back from there when needed. This option uses both memory and disk.

3.  MEMORY_ONLY_SER

With this option, the RDD is stored as serialized Java objects in memory, one byte array per partition. This is much more space-efficient and saves memory, at the cost of extra CPU time to deserialize the data when it is read. Partitions that do not fit are not cached and are recomputed as required. This option does not use the disk.

4.  MEMORY_AND_DISK_SER

This option is similar to MEMORY_ONLY_SER, except that partitions that do not fit in memory are saved to disk rather than recomputed. It uses both memory and disk storage.

5.  DISK_ONLY

This option stores the RDD partitions only on disk, making no use of memory for storage.
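The following Scala sketch shows how these levels are selected (assuming a SparkContext sc). Since a storage level cannot be changed once assigned, the alternatives appear as comments:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000).map(_.toString)

rdd.persist(StorageLevel.MEMORY_ONLY)            // default: deserialized, memory only
// rdd.persist(StorageLevel.MEMORY_AND_DISK)     // spill partitions that do not fit to disk
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)     // serialized: compact, but extra CPU to read
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized, with disk spill-over
// rdd.persist(StorageLevel.DISK_ONLY)           // disk only

rdd.count()  // the chosen level takes effect on the first action
```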

How to un-persist RDD in Spark

When cached data exceeds the available memory, Spark automatically evicts old data. It does this using an LRU (Least Recently Used) policy: the algorithm tracks which partitions are used frequently and drops the ones used least recently.

Besides this automatic eviction, we can remove an RDD from the cache ourselves, using the RDD.unpersist() method.
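A minimal sketch of manual un-persisting in Scala, again assuming a SparkContext sc:

```scala
val rdd = sc.parallelize(1 to 1000).cache()
rdd.count()        // materializes the cached partitions

// release the stored partitions explicitly;
// pass unpersist(blocking = true) to wait until the blocks are deleted
rdd.unpersist()

// after unpersist(), the next action recomputes the RDD from its lineage
rdd.count()
```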

Conclusion

Hence, Spark RDD persistence and caching are optimization techniques that store the results of RDD evaluation. These mechanisms save intermediate results so that upcoming stages can reuse them.

These results can be stored in memory, on disk, or in a combination of both.
