HDFS Erasure Coding in Big Data Hadoop

This blog is all about HDFS Erasure Coding. We will discuss the concept of erasure coding in Hadoop, the problems with the old replication scheme, and the two algorithms used for Hadoop erasure coding: the XOR algorithm and the Reed-Solomon algorithm.

Finally, we will look at the architecture and the advantages of erasure coding in Hadoop HDFS.

Problems with the Old Replication Scheme

HDFS Erasure Coding is a new feature introduced to reduce storage overhead by approximately 50% compared to 3x replication. Hadoop HDFS replicates each block three times for various purposes. Replication is a very simple form of redundancy that shields against DataNode failure.

Along with its pros, replication has a serious con: it is very expensive. 3x replication incurs a 200% overhead in storage space and other resources. For datasets with low I/O activity, the additional replicas are rarely accessed during normal operation, yet they still consume the same resources.

This is the reason Hadoop erasure coding came into existence. It provides the same level of fault tolerance while using far less storage space, with roughly 50% storage overhead.

When comparing different storage schemes, the important considerations are fault tolerance and storage efficiency.

In N-way replication, there is N-1 fault tolerance with 1/N storage efficiency.
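To make the comparison concrete, here is a minimal Python sketch of these two quantities; the 3-way replication and 6+3 (6 data cells, 3 parity cells) figures are only illustrative examples:

```python
# A minimal sketch comparing fault tolerance and storage efficiency of
# N-way replication with an erasure-coded layout of k data + m parity cells.
# The concrete numbers below (3-way replication, 6+3 EC) are illustrative only.

def replication_scheme(n):
    """N-way replication: survives N-1 failures, stores N copies of each block."""
    return {"fault_tolerance": n - 1, "storage_efficiency": 1 / n}

def erasure_coding_scheme(k, m):
    """Erasure coding: k data cells + m parity cells, survives up to m failures."""
    return {"fault_tolerance": m, "storage_efficiency": k / (k + m)}

print("3-way replication:", replication_scheme(3))        # 2 failures, ~33% efficiency (200% overhead)
print("6+3 erasure coding:", erasure_coding_scheme(6, 3))  # 3 failures, ~67% efficiency (50% overhead)
```

For 3-way replication this gives a fault tolerance of 2 at about 33% storage efficiency (200% overhead), while a 6+3 erasure coding group tolerates 3 failures at about 67% efficiency (50% overhead).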

What is HDFS Erasure Coding in Hadoop?

HDFS erasure coding uses the same idea as RAID. RAID implements EC through striping: the data is logically divided into blocks, and these blocks are stored on different disks. For each stripe of blocks, a parity block is calculated and stored. This is the encoded form of the data, and through the parity the data can be recovered from errors.

For fault tolerance, EC extends a message with redundant data. HDFS erasure coding operates on uniformly sized data cells. The codec takes a number of data cells as input and produces parity cells as output.

This whole process is called encoding. The data cells and parity cells together are called an erasure coding group. The process of reconstructing a lost data cell from the remaining cells is known as decoding.
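As a rough illustration of striping, the Python sketch below cuts a byte stream into fixed-size cells and deals them round-robin across k data blocks. The cell size and k used here are tiny, made-up values (real HDFS cells are much larger, on the order of 1 MB), and the parity computation that would follow each full stripe is left out:

```python
# A minimal sketch of how striping splits a byte stream into fixed-size cells
# and lays them out round-robin across k data blocks. CELL_SIZE and K are
# made-up illustrative values, far smaller than what HDFS actually uses.

CELL_SIZE = 4   # bytes per cell (illustrative)
K = 3           # number of data blocks in the striping group (illustrative)

def stripe(data: bytes, cell_size: int, k: int):
    """Split data into cells and distribute them round-robin over k blocks."""
    cells = [data[i:i + cell_size] for i in range(0, len(data), cell_size)]
    blocks = [[] for _ in range(k)]
    for index, cell in enumerate(cells):
        blocks[index % k].append(cell)
    return blocks

payload = b"ABCDEFGHIJKLMNOPQRSTUVWX"
for i, block in enumerate(stripe(payload, CELL_SIZE, K)):
    print(f"data block {i}: {block}")
```

After every k data cells written this way, the codec would compute the corresponding parity cells to complete one erasure coding group.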

Two algorithms available for HDFS Erasure Coding are as follows:

a) XOR Algorithm

It is the simplest implementation of Hadoop erasure coding.

Let's assume X, Y, and Z are data cells. The parity cell is the XOR of these three data cells: X ⊕ Y ⊕ Z. The XOR operation generates only one parity cell, and if any one cell is lost, it can be recovered from the remaining data cells and the parity cell.

XOR is very limited: since it produces only one parity cell, it can tolerate only one failure in a group of size n.

With XOR, the fault tolerance is 1 and the storage efficiency is (n-1)/n when the group size is n.
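Here is a minimal Python sketch of the XOR scheme described above, using three small made-up data cells; it shows both encoding (computing the parity cell) and decoding (rebuilding one lost cell):

```python
# A minimal sketch of XOR-based parity over three equal-sized data cells:
# parity = X xor Y xor Z, and any single lost cell can be rebuilt by
# XOR-ing the parity with the surviving cells.

def xor_cells(*cells: bytes) -> bytes:
    """XOR equal-length byte strings together, byte by byte."""
    result = bytearray(len(cells[0]))
    for cell in cells:
        for i, b in enumerate(cell):
            result[i] ^= b
    return bytes(result)

x, y, z = b"AAAA", b"BBBB", b"CCCC"   # made-up data cells
parity = xor_cells(x, y, z)           # encoding: one parity cell for the group

# Suppose cell y is lost; decode it from the remaining cells plus the parity.
recovered_y = xor_cells(x, z, parity)
assert recovered_y == y
print("recovered:", recovered_y)
```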

b) Reed-Solomon Algorithm

Reed-Solomon addresses the limitation of the XOR operation. It uses linear algebra to generate multiple parity cells. RS uses two parameters, k and m: k is the number of data cells and m is the number of parity cells.

RS works by multiplying the k data cells with a generator matrix (GT) to produce an extended codeword with k data cells and m parity cells. A storage failure can be recovered by multiplying the inverse of the generator matrix with the extended codeword, as long as k out of the k + m cells are available.

With Reed-Solomon, the fault tolerance is up to m cells and the storage efficiency is k/(k+m), where k is the number of data cells and m is the number of parity cells.
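The sketch below shows the same generator-matrix idea in Python with NumPy, but over ordinary floating-point numbers instead of the Galois field arithmetic a real Reed-Solomon codec uses; the values of k, m, the data cells, and the parity rows are made up purely for illustration:

```python
# A hedged sketch of the Reed-Solomon idea using plain linear algebra over
# the reals with NumPy. Real RS codecs work over a Galois field (e.g. GF(2^8));
# floating point is used here only to show the structure:
# codeword = G * data, and any k surviving cells let you re-solve for the data.

import numpy as np

k, m = 3, 2                       # k data cells, m parity cells (illustrative)

# Generator matrix G ((k+m) x k): identity on top keeps the data cells as-is,
# Vandermonde-style rows underneath produce the parity cells.
G = np.vstack([np.eye(k),
               [[1, 1, 1],        # parity row: evaluation point 1
                [1, 2, 4]]])      # parity row: evaluation point 2

data = np.array([5.0, 7.0, 2.0])  # the k data cells (made-up values)
codeword = G @ data               # encoding: k data cells + m parity cells

# Suppose cells 0 and 3 are lost; any k surviving rows of G remain invertible,
# so solving that k x k system recovers the original data cells.
surviving = [1, 2, 4]
recovered = np.linalg.solve(G[surviving], codeword[surviving])
assert np.allclose(recovered, data)
print("recovered data cells:", recovered)
```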

Design Decision and Architecture

EC striping has several advantages: it enables online erasure coding, writing data immediately in EC form without a separate conversion phase, and it naturally handles small files, since the cells of even a small file are spread across multiple DataNodes.

EC added many new components to HDFS, including NameNode extensions to manage striped block groups, client extensions to read and write striped data, and DataNode extensions to reconstruct failed erasure-coded blocks in the background.

Benefits of Erasure Coding

Erasure coding provides the same fault tolerance as replication while roughly halving the storage overhead: around 50% instead of the 200% overhead of 3x replication. This makes it especially attractive for large, rarely accessed (cold) datasets.

Conclusion

In conclusion, we can say that HDFS erasure coding reduces the storage overhead by about 50%. EC reduces the overhead because redundancy is stored as compact parity cells rather than full replicas. Hence, this HDFS feature empowers Apache Hadoop functionality.

If you have any query or suggestion related to erasure coding in HDFS, please leave a comment in the section below.
