{"id":1999,"date":"2017-10-05T05:58:04","date_gmt":"2017-10-05T05:58:04","guid":{"rendered":"http:\/\/techvidvan.com\/tutorials\/?p=366"},"modified":"2017-10-05T05:58:04","modified_gmt":"2017-10-05T05:58:04","slug":"distributed-cache-in-hadoop","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/","title":{"rendered":"Introduction to Distributed Cache in Hadoop"},"content":{"rendered":"<p>This tutorial describes the Distributed Cache in <strong>Hadoop<\/strong>. We first briefly review what Hadoop is, then look at what the Distributed Cache in Hadoop is.<\/p>\n<p>We also cover how the Hadoop\u00a0Distributed Cache works and how to use it, and finish with the advantages and disadvantages of distributed caching in Hadoop.<\/p>\n<h3>Introduction to Hadoop<\/h3>\n<p>The Distributed Cache is a mechanism that the MapReduce framework provides to cache files needed by applications. It can cache read-only text\/data files as well as more complex types such as archives and jar files.<\/p>\n<p>Before we start with the Distributed Cache, let us first discuss what Hadoop is.<\/p>\n<p><strong>Hadoop<\/strong> is an open-source, Java-based programming framework. It supports processing and storage of extremely large datasets in a distributed environment. Hadoop follows a master-slave topology.<\/p>\n<p>The master is the NameNode and the slaves are the DataNodes. A DataNode stores the actual data in <strong>HDFS<\/strong> and performs read and write operations as requested by clients. 
The NameNode stores the metadata.<\/p>\n<p>In Apache Hadoop, chunks of data are processed in parallel across the DataNodes by a program written by the user. If a file needs to be accessible from all the DataNodes, we put that file into the distributed cache.<\/p>\n<h3>What is Distributed Cache in Hadoop?<\/h3>\n<p><strong>Distributed Cache<\/strong>\u00a0in Hadoop is a facility provided by the MapReduce framework to cache files needed by applications. It can cache read-only text files, archives, jar files, and so on.<\/p>\n<p>Once we have cached a file for our job, Apache Hadoop makes it available on every DataNode where map\/reduce tasks are running. Thus, tasks on any of those nodes can read the file during our MapReduce job.<\/p>\n<h4>Size of Distributed Cache<\/h4>\n<p>By default, the distributed cache size is 10 GB. To adjust it, set the <strong>local.cache.size<\/strong> property.<\/p>\n<h4>Implementation<\/h4>\n<p>An application that uses the distributed cache to distribute a file:<\/p>\n<ul>\n<li>Should first ensure that the file is available.<\/li>\n<li>Should also make sure that the file can be accessed via a URL. The URL can be either <strong>hdfs:\/\/<\/strong> or <strong>http:\/\/<\/strong>.<\/li>\n<\/ul>\n<p>Once the file is confirmed to be present at the given URL, the Hadoop user registers it with the distributed cache as a cache file. 
The Hadoop MapReduce job then copies the cache file to all the nodes before any tasks start on those nodes.<\/p>\n<p>Follow the process below:<\/p>\n<h5><strong>a) Copy the requisite file to HDFS:<\/strong><\/h5>\n<p>$ hdfs dfs -put jar_file.jar \/user\/dataflair\/lib\/jar_file.jar<\/p>\n<h5><strong>b) Set up the application\u2019s JobConf:<\/strong><\/h5>\n<p>DistributedCache.addFileToClassPath(new Path(\"\/user\/dataflair\/lib\/jar_file.jar\"), conf);<\/p>\n<h5><strong>c) Add it in the Driver class.<\/strong><\/h5>\n<h3>Advantages of Distributed Cache<\/h3>\n<ul>\n<li><strong>No single point of failure &#8211; <\/strong>The distributed cache runs across many nodes, so the failure of a single node does not result in a complete failure of the cache.<\/li>\n<li><strong>Data consistency &#8211; <\/strong>It tracks the modification timestamps of cache files, and the files should not be changed while a job is executing. Using a hashing algorithm, the cache engine can always determine on which node a particular key-value pair resides. Because the cache cluster always has a single state, it is never inconsistent.<\/li>\n<li><strong>Stores complex data &#8211;<\/strong> It distributes simple, read-only text files as well as complex types such as jars and archives. The archives are then un-archived on the slave nodes.<\/li>\n<\/ul>\n<h3>Disadvantage of Distributed Cache<\/h3>\n<p>A Distributed Cache in Hadoop has overhead that makes it slower than an in-process cache:<\/p>\n<p><strong>a) Object serialization<\/strong> &#8211; It must serialize objects, and the serialization mechanism has two main problems:<\/p>\n<ul>\n<li><strong>Very bulky<\/strong> &#8211; Serialization stores the complete class name, cluster, and assembly details. It also stores references to other instances in member variables. All this makes the serialized data very bulky.<\/li>\n<li><strong>Very slow<\/strong> &#8211; Serialization uses reflection to inspect type information at runtime. 
Reflection is very slow compared to pre-compiled code.<\/li>\n<\/ul>\n<h3>Conclusion<\/h3>\n<p>In conclusion, the Distributed Cache is a facility provided by the MapReduce framework. It caches files needed by applications, including read-only text files, archives, and jar files.<\/p>\n<p>By default, the distributed cache size is 10 GB. If you found this blog helpful, or if you have any query related to Distributed Cache in Hadoop, feel free to share it with us.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this tutorial we will provide you a detailed description of a Distributed Cache in Hadoop. First of all we will briefly understand what is Hadoop, then we will see what is Distributed Cache&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73108,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[544],"tags":[538,457,539,622,541,623],"class_list":["post-1999","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-hadoop","tag-apache-hadoop","tag-big-data","tag-big-data-hadoop","tag-distributed-cache-in-hadoop","tag-hadoop","tag-hadoop-distributed-cache"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Introduction to Distributed Cache in Hadoop - TechVidvan<\/title>\n<meta name=\"description\" content=\"Hadoop tutorial for Distributed Cache in Hadoop,Hadoop distributed cache implementation, Distributed Cache size, benefits &amp; limitations of Distributed cache\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta 
property=\"og:title\" content=\"Introduction to Distributed Cache in Hadoop - TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Hadoop tutorial for Distributed Cache in Hadoop,Hadoop distributed cache implementation, Distributed Cache size, benefits &amp; limitations of Distributed cache\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2017-10-05T05:58:04+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Distributed-Cache-in-Hadoop-01.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Introduction to Distributed Cache in Hadoop - TechVidvan","description":"Hadoop tutorial for Distributed Cache in Hadoop,Hadoop distributed cache implementation, Distributed Cache size, benefits & limitations of Distributed cache","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/","og_locale":"en_US","og_type":"article","og_title":"Introduction to Distributed Cache in Hadoop - TechVidvan","og_description":"Hadoop tutorial for Distributed Cache in Hadoop,Hadoop distributed cache implementation, Distributed Cache size, benefits & limitations of Distributed cache","og_url":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2017-10-05T05:58:04+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Distributed-Cache-in-Hadoop-01.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"Introduction to Distributed Cache in Hadoop","datePublished":"2017-10-05T05:58:04+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/"},"wordCount":719,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Distributed-Cache-in-Hadoop-01.jpg","keywords":["apache hadoop","big data","big data hadoop","Distributed cache in hadoop","hadoop","hadoop distributed cache"],"articleSection":["Hadoop Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/","url":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/","name":"Introduction to Distributed Cache in Hadoop - TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Distributed-Cache-in-Hadoop-01.jpg","datePublished":"2017-10-05T05:58:04+00:00","description":"Hadoop tutorial for Distributed Cache in 
Hadoop,Hadoop distributed cache implementation, Distributed Cache size, benefits & limitations of Distributed cache","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Distributed-Cache-in-Hadoop-01.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Distributed-Cache-in-Hadoop-01.jpg","width":1200,"height":628,"caption":"Hadoop Distributed Cache"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/distributed-cache-in-hadoop\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"Introduction to Distributed Cache in Hadoop"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan 
Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/1999","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=1999"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/1999\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73108"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=1999"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=1999"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=1999"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}