Introduction to Distributed Cache in Hadoop

In this tutorial, we will provide you with a detailed description of Distributed Cache in Hadoop. First, we will briefly understand what Hadoop is, and then we will see what Distributed Cache in Hadoop is.

We will also cover the working and implementation of Hadoop Distributed Cache. Finally, we will look at the advantages and disadvantages of distributed caching in Hadoop.

Introduction to Hadoop

 

Before we start with Distributed Cache, let us first discuss what Hadoop is.

Hadoop is an open-source, Java-based programming framework. It supports the processing and storage of extremely large datasets in a distributed environment. Hadoop follows a master-slave topology.

The master is the NameNode and the slaves are the DataNodes. A DataNode stores the actual data in HDFS and performs read and write operations as requested by clients, while the NameNode stores the metadata.

In Apache Hadoop, data chunks are processed in parallel across the DataNodes, using a program written by the user. If we want to access some file from all the DataNodes, then we put that file into the distributed cache.

What is Distributed Cache in Hadoop?

Distributed Cache in Hadoop is a facility provided by the MapReduce framework. It caches the files needed by applications, such as read-only text files, archives, and JAR files.

Once we have cached a file for our job, Apache Hadoop will make it available on each DataNode where map/reduce tasks are running. Thus, we can access the file from all the DataNodes in our MapReduce job.
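Inside a map or reduce task, a cached file is then read from the task's local disk. Below is a minimal sketch (the mapper class name and the use made of the file are illustrative, not part of any standard library) using the classic org.apache.hadoop.filecache.DistributedCache API that this tutorial follows:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper that loads a cached lookup file in setup().
public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // getLocalCacheFiles returns the local, on-disk paths of the files
        // that the framework copied to this node before the task started.
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles != null && cacheFiles.length > 0) {
            BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Use each line of the cached file, e.g. to build a lookup table.
                }
            } finally {
                reader.close();
            }
        }
    }
}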

Size of Distributed Cache

By default, the distributed cache size is 10 GB. If we want to adjust the size of the distributed cache, we can do so using the local.cache.size configuration property.
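For instance, assuming the classic property name local.cache.size (whose value is given in bytes), the limit could be raised programmatically as sketched below; the 20 GB value is only an illustration, and the same property can also be set in mapred-site.xml:

import org.apache.hadoop.conf.Configuration;

// Illustrative snippet: raise the local distributed-cache size limit.
public class CacheSizeConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // local.cache.size is expressed in bytes; the default is 10 GB.
        conf.setLong("local.cache.size", 20L * 1024L * 1024L * 1024L);
        System.out.println("local.cache.size = " + conf.get("local.cache.size"));
    }
}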

Implementation

An application which is going to use distributed cache to distribute a file:

  • Should first ensure that the file is available (a quick way to check this is sketched after this list).
  • Should also make sure that the file can be accessed via a URL, which can be either hdfs:// or http://.
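As a sketch of the first check, the file's presence in HDFS can be verified with the FileSystem API before it is registered with the cache. The class name below is hypothetical, the path is the one used in the steps that follow, and it assumes fs.defaultFS in the configuration points at your cluster:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: verify that a file exists in HDFS before
// registering it with the distributed cache.
public class CacheFileCheck {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path cacheFile = new Path("/user/dataflair/lib/jar_file.jar");
        if (fs.exists(cacheFile)) {
            System.out.println("File is available for caching: " + cacheFile);
        } else {
            System.err.println("File not found; copy it to HDFS first: " + cacheFile);
        }
    }
}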

After the above validation, if the file is present at the specified URL, the Hadoop user declares it to be a cache file for the distributed cache. The Hadoop MapReduce job will then copy the cache file to all the nodes before any tasks start on those nodes.

Follow the process below:

a) Copy the requisite file to HDFS:

$ hdfs dfs -put jar_file.jar /user/dataflair/lib/jar_file.jar

b) Set up the application's JobConf:

DistributedCache.addFileToClassPath(new Path("/user/dataflair/lib/jar_file.jar"), conf);

c) Add it in the driver class.
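A minimal driver sketch is shown below; the class name is hypothetical, the mapper and reducer are omitted, and it only shows where the step (b) call fits before the job is submitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class showing where the cache call fits.
public class CacheDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Step (b): register the JAR from step (a) with the distributed
        // cache so it lands on the classpath of every task.
        DistributedCache.addFileToClassPath(
                new Path("/user/dataflair/lib/jar_file.jar"), conf);

        Job job = Job.getInstance(conf, "distributed cache example");
        job.setJarByClass(CacheDriver.class);
        // Mapper, reducer, and key/value classes omitted; set your own here.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}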

Advantages of Distributed Cache

  • No single point of failure – Since the distributed cache runs across many nodes, the failure of a single node does not result in a complete failure of the cache.
  • Data consistency – It tracks the modification timestamps of cache files, which must not change while a job is executing. Using a hashing algorithm, the cache engine can always determine on which node a particular key-value pair resides. Since there is always a single state of the cache cluster, the cache is never inconsistent.
  • Stores complex data – It distributes simple, read-only text files as well as complex types like JARs and archives. These archives are then un-archived at the slave node.

Disadvantage of Distributed Cache

A Distributed Cache in Hadoop has overhead that makes it slower than an in-process cache:

a) Object serialization – It must serialize objects, but the serialization mechanism has two main problems:

  • Very bulky – Serialization stores the complete class name, cluster, and assembly details. It also stores references to other instances in member variables. All of this makes serialization very bulky.
  • Very slow – Serialization uses reflection to inspect type information at runtime. Reflection is a very slow process compared to pre-compiled code.

Conclusion

In conclusion, we can say that Distributed Cache is a facility provided by the MapReduce framework. It caches files needed by applications, including read-only text files, archives, and JAR files.

By default, the distributed cache size is 10 GB. If you found this blog helpful, or you have any query related to Distributed Cache in Hadoop, feel free to share it with us.