Fault Tolerance in Spark: Self-Recovery Property

This article is all about the fault tolerance property in Spark. Self-recovery is one of the most powerful features of the Spark platform: at any stage of failure, an RDD can recover its lost data on its own.

In this blog, we will learn about Spark fault tolerance, Apache Spark high availability, and how Spark handles fault tolerance in detail.

Fault Tolerance in Apache Spark


Apache Spark's fault tolerance property means that an RDD is capable of handling any loss that occurs. Here, fault refers to failure: if a partition of an RDD is lost due to a bug or a node failure, the RDD can recover it on its own.

We need redundancy to restore lost data. Redundant data plays an important role in the self-recovery process: ultimately, lost data is recovered from it.

In addition, while working with Spark we apply various transformations to RDDs. These transformations build up a logical execution plan covering all the operations to be executed. This logical execution plan is also known as the lineage graph.

In the process, we may lose a partition of an RDD when a fault arises on a machine. By applying the same computations recorded in the lineage graph on another node, we can recover the same dataset again.
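
To make the lineage graph concrete, here is a minimal sketch in Scala (the application name, the local master, and the sample data are illustrative assumptions, not taken from this article). It builds a short chain of transformations and prints the lineage Spark records for it with `RDD.toDebugString`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LineageDemo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Each transformation only extends the lineage graph;
    // nothing is computed until an action is called.
    val lines  = sc.parallelize(Seq("a b", "b c", "c a"))
    val words  = lines.flatMap(_.split(" "))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // Print the lineage Spark would replay to rebuild a lost partition.
    println(counts.toDebugString)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```

If a partition of `counts` is lost, Spark re-runs exactly this recorded chain of transformations on the surviving input data to rebuild it.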

Hence, this process is called fault tolerance, or self-recovery.

Fault Tolerance in Spark – How Spark Handles Fault Tolerance

If any node processing data crashes, that results in a fault in the cluster. Recall that an RDD is logically partitioned and, at any point in time, each node operates on one partition.

The operations being performed are a series of Scala functions, executed on that partition of the RDD. This series of operations is merged together into a DAG (Directed Acyclic Graph), which keeps track of the operations performed.
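
As a small illustration of this partitioning (a sketch, assuming a SparkContext named `sc` is already available, as in spark-shell), `mapPartitionsWithIndex` makes visible which logical partition each element lives in; each task runs such a function on exactly one partition:

```scala
// Create an RDD explicitly split into 4 logical partitions.
val rdd = sc.parallelize(1 to 12, numSlices = 4)

// Tag every element with the index of its partition.
// Each task applies this function to exactly one partition.
val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, elements) =>
  elements.map(value => (partitionIndex, value))
}

tagged.collect().foreach(println)  // (0,1), (0,2), (0,3), (1,4), ...
```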

If a node crashes in the middle of an operation, the cluster manager detects the failed node and tries to assign another node to continue the processing from the same point.

The new node operates on the same partition of the RDD and applies the same series of operations. Thanks to this, there is effectively no data loss, and processing continues smoothly.

There are basically two reasons why RDDs are fault tolerant in Apache Spark. They are:

  • RDDs are immutable in nature. Each RDD remembers the lineage of deterministic operations that were used on a fault-tolerant input dataset to create it.
  • If any partition of an RDD is lost due to a worker node failure, that partition can be recomputed from the original fault-tolerant dataset using the lineage.

To strengthen fault tolerance further, the data behind RDDs can additionally be replicated among multiple nodes in the cluster.
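
At the RDD level, one way to opt into such replication (again a sketch, assuming an existing `sc`) is to persist with one of the replicated storage levels, such as `MEMORY_ONLY_2`, which stores each partition on two nodes:

```scala
import org.apache.spark.storage.StorageLevel

val numbers = sc.parallelize(1 to 1000000)

// The "_2" storage levels keep each partition on two cluster nodes,
// so a single node failure does not force a recompute from lineage.
numbers.persist(StorageLevel.MEMORY_ONLY_2)

println(numbers.sum())  // the first action materializes and replicates the data
```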

There are two types of data that need to be recovered in the event of a failure:

  • Data received and replicated
  • Data received but buffered for replication

To understand better, let’s study them in detail:

– Data received and replicated

In this case, the received data has already been replicated to another node. So, if a failure occurs, we can retrieve the data from the replica for further use.

– Data received but buffered for replication

In this case, there is no replicated copy of the data yet. The only way to recover it is to get it again from the source.
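
Both cases arise with Spark Streaming receivers. As a sketch (the socket source on `localhost:9999` and the batch interval are illustrative assumptions), the storage level passed to the receiver controls whether incoming data is replicated before it is processed:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the receiver, at least one for processing.
val conf = new SparkConf().setAppName("ReceiverDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

// MEMORY_AND_DISK_SER_2 replicates each received block to a second node,
// so data that has finished replicating survives a single node failure.
val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER_2)

lines.count().print()

ssc.start()
ssc.awaitTermination()
```

Data still buffered for replication at the moment of failure is the second case above: it has to be fetched from the source again.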

A cluster manager such as Apache Mesos can create and maintain backup masters for Spark. Hence, it helps make the Spark master fault tolerant.

Mesos is open-source software that works between the application layer and the operating system. It helps manage applications in large clustered environments.

Fault Tolerance in Spark – Apache Spark High Availability

A system is highly available if its downtime is tolerable. No system has zero downtime, so perfect availability is an imaginary notion for any system. Say a machine has an uptime of 97.7%; then the probability of its downtime is 0.023.

Similarly, if we run three such systems simultaneously, the probability that all of them are down at the same time is 0.023 * 0.023 * 0.023, that is 0.000012167. By this reasoning, the uptime of the combined system is about 99.999%, which is a highly acceptable uptime guarantee.
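
The same arithmetic as a small Scala sketch (plain numbers, nothing Spark-specific):

```scala
val singleUptime    = 0.977                        // one machine: 97.7% uptime
val singleDowntime  = 1 - singleUptime             // 0.023
val clusterDowntime = math.pow(singleDowntime, 3)  // all three down at once
val clusterUptime   = 1 - clusterDowntime

println(f"cluster downtime: $clusterDowntime%.9f")         // 0.000012167
println(f"cluster uptime:   ${clusterUptime * 100}%.4f%%") // ~ 99.9988%
```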

Conclusion

Hence, we have seen that Spark RDDs are fault tolerant in nature: they track the operations applied to them so that lost data can be rebuilt at the time of failure. Spark's fault tolerance property also enhances efficiency, since lost data is recovered from redundant data instead of being rebuilt from scratch.

As a result, in case of any failure, there is no need to restart the application from scratch. This also helps improve the performance of the system.