{"id":2010,"date":"2018-01-06T12:36:02","date_gmt":"2018-01-06T12:36:02","guid":{"rendered":"https:\/\/techvidvan.com\/tutorials\/?p=655"},"modified":"2018-01-06T12:36:02","modified_gmt":"2018-01-06T12:36:02","slug":"persistence-and-caching-mechanism","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/","title":{"rendered":"Persistence And Caching Mechanism In Apache Spark"},"content":{"rendered":"<p>In this article, we will learn about the Spark RDD persistence and caching\u00a0mechanisms in detail. These are optimization techniques for Spark computations. We will look at why we need RDD persistence and caching and what benefits they bring.<\/p>\n<p>We will also go through the storage levels available for persisted RDDs and see how to un-persist an RDD in Spark.<\/p>\n<p><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/RDD-Persistence-And-Caching-Mechanism-01.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-73224\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/RDD-Persistence-And-Caching-Mechanism-01.jpg\" alt=\"RDD Persistence And Caching Mechanism\" width=\"1200\" height=\"628\" \/><\/a><\/p>\n<h3>Understanding Persistence And Caching Mechanism in RDD<\/h3>\n<p>Spark RDD persistence and caching are optimization techniques. They are useful for iterative as well as interactive Spark computations. <em>Iterative computations<\/em> reuse intermediate results across multiple computations in multistage applications. <em>Interactive computations<\/em> involve a two-way flow of information between the user and the application, such as repeated ad hoc queries on the same data.<\/p>\n<p>These mechanisms save intermediate results so that later stages can reuse them. A persisted RDD can be stored in memory, on disk, or both. Memory is preferred, while disk is the less preferred option because of its slower access speed. 
We can cache an RDD with the cache() operation and persist it with the persist() operation.<\/p>\n<p>Let us look at Spark RDD persistence and caching one by one in detail:<\/p>\n<h4>1. RDD Persistence Mechanism<\/h4>\n<p>By default, Spark recomputes an RDD every time we call an action on it, because RDDs are evaluated lazily. Persisting the RDD avoids this: once it is persisted, later actions reuse the stored data instead of recomputing it. When we call the persist() method, each node stores the partitions it computes.<\/p>\n<p>To persist an RDD, we use the persist() method. Spark can be used from Scala, Python, Java and other languages. In Java and Scala, persist() with the default storage level stores the data in the JVM (Java virtual machine) as deserialized objects.<\/p>\n<p>In Python, calling persist() always serializes the data before storing it. Serialized here means one byte array per partition. We can choose to store the data in memory, on disk, or in a combination of both.<\/p>\n<p>The actual persistence takes place during the first action call on the RDD. Spark provides multiple storage options, such as memory or disk, and the storage level also determines whether the persisted data is replicated.<\/p>\n<p>When we apply the persist() method, the resulting RDD can be stored at different storage levels. One thing to remember is that we cannot change the storage level of an RDD once a level has been assigned to it.<\/p>\n<h4>2. 
Spark Cache Mechanism<\/h4>\n<p>The cache mechanism speeds up applications that access the same RDDs several times.<\/p>\n<p>cache() is a synonym for persist() or persist(MEMORY_ONLY). In other words, cache() is simply persist() with the default storage level, MEMORY_ONLY.<\/p>\n<h5>When to use caching<\/h5>\n<p>Caching is useful in the following situations:<\/p>\n<ul>\n<li>When we re-use an RDD in iterative machine learning applications<\/li>\n<li>When we re-use an RDD in standalone Spark applications<\/li>\n<li>When RDD computations are expensive. Caching then also reduces the cost of recovery if an executor fails.<\/li>\n<\/ul>\n<h4>3. Difference between Spark RDD Persistence and caching<\/h4>\n<p>The difference between cache() and persist() is purely syntactic, and it is the only difference between the two methods. With cache(), the resulting RDD can be stored only at the default storage level, MEMORY_ONLY.<\/p>\n<p>With persist(), we can choose among different storage levels for the resulting RDD. As discussed above, cache() is a synonym for persist() or persist(MEMORY_ONLY), that is, persist() with the default storage level MEMORY_ONLY.<\/p>\n<h3>Need of Persistence Mechanism<\/h3>\n<p>Persistence allows us to use the same RDD multiple times in Apache Spark. Without it, every action we call on an RDD triggers a fresh evaluation of that RDD.<\/p>\n<p>This repeated evaluation consumes a lot of time as well as memory. Iterative algorithms in particular read the same data many times, so the cost adds up quickly. 
To overcome this problem of repeated computation, Spark provides the persistence techniques.<\/p>\n<h3>Benefits of RDD Persistence in Spark<\/h3>\n<p>RDD persistence in Apache Spark is beneficial for the following reasons:<\/p>\n<ul>\n<li>It speeds up applications. We can access the same RDDs multiple times without recomputing them.<\/li>\n<li>It is time efficient. Without persistence, much time is spent repeating the same computations. With it, that time drops and work gets done more efficiently.<\/li>\n<li>It is cost-effective. Without persistence, the same RDD has to be computed again for every use, which is expensive. With persistence and caching, we can use the same RDDs again and again.<\/li>\n<li>It lessens the overall execution time of the job, since repeated work is avoided.<\/li>\n<\/ul>\n<h3>Storage levels of Persisted RDDs<\/h3>\n<p>When we apply the persist() method, the RDD is stored at the storage level we choose. The storage levels are:<\/p>\n<p><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Storage-levels-of-Persisted-RDDs-01.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-73311\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Storage-levels-of-Persisted-RDDs-01.jpg\" alt=\" Persistence And Caching Mechanism - Storage levels of Persisted RDDs\" width=\"1200\" height=\"628\" \/><\/a><\/p>\n<h4>1.\u00a0 MEMORY_ONLY (Default level)<\/h4>\n<p>This is the default level. It stores the RDD in the available cluster memory (JVM) as deserialized objects. 
If there is not enough memory, some partitions may not be cached, and those partitions are recomputed the next time we need them. This option uses only memory, not disk.<\/p>\n<h4>2. MEMORY_AND_DISK<\/h4>\n<p>With this option, the RDD is stored as deserialized objects. If the RDD does not fit in the cluster memory, the remaining partitions are stored on disk.<\/p>\n<p>Those partitions are read back from disk when needed. This option uses memory as well as disk.<\/p>\n<h4>3.\u00a0 MEMORY_ONLY_SER<\/h4>\n<p>With this option, the RDD is stored as serialized Java objects in memory. Serialized means <em>one byte array per partition<\/em>, which is much more space efficient and saves memory.<\/p>\n<p>As with MEMORY_ONLY, some partitions may not fit in memory and so are not cached; they are recomputed when required. This option does not use the disk.<\/p>\n<h4>4.\u00a0 MEMORY_AND_DISK_SER<\/h4>\n<p>This option is similar to MEMORY_ONLY_SER. The difference is that partitions that do not fit in memory are saved to disk. This option uses both memory and disk storage.<\/p>\n<h4>5.\u00a0 DISK_ONLY<\/h4>\n<p>This option stores the RDD only on disk.<\/p>\n<h3>How to un-persist RDD in Spark<\/h3>\n<p>When cached data exceeds the available memory, Spark automatically evicts old data. It does so with an LRU policy, where LRU stands for <strong>Least Recently Used<\/strong>: the partitions that were used least recently are dropped first.<\/p>\n<p>Eviction either happens automatically, or we can remove an RDD from the cache ourselves by calling the RDD.unpersist() method.<\/p>\n<h3>Conclusion<\/h3>\n<p>Hence, Spark RDD persistence and caching are optimization techniques that store the results of RDD evaluation. 
These mechanisms save results so that upcoming stages can reuse them.<\/p>\n<p>The persisted RDDs can be stored in memory, on disk, or both.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this article, we will learn about spark RDD persistence and caching\u00a0mechanism in detail. These are optimization techniques we use for spark computations. We will go through why do we need spark RDD persistence&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73224,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[614],"tags":[725,726,727,728,729,730,731,732],"class_list":["post-2010","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","tag-caching-and-persistence","tag-caching-vs-persistence-in-apache-spark","tag-choosing-rdd-persistence-and-caching-with-spark","tag-persistence-and-caching-in-spark-rdd","tag-rdd-persistence-caching-mechanism","tag-rdd-persistence-and-caching-mechanism-in-apache-spark","tag-understanding-persistance-machanism","tag-understanding-spark-caching"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Persistence And Caching Mechanism In Apache Spark - TechVidvan<\/title>\n<meta name=\"description\" content=\"Persistence And Caching Mechanism in Spark RDD covers introduction, needs, benefits, storage levels, when to use RDD persistence and how to unpersist RDD.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" 
content=\"Persistence And Caching Mechanism In Apache Spark - TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Persistence And Caching Mechanism in Spark RDD covers introduction, needs, benefits, storage levels, when to use RDD persistence and how to unpersist RDD.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-06T12:36:02+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-Persistence-And-Caching-Mechanism-01.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Persistence And Caching Mechanism In Apache Spark - TechVidvan","description":"Persistence And Caching Mechanism in Spark RDD covers introduction, needs, benefits, storage levels, when to use RDD persistence and how to unpersist RDD.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/","og_locale":"en_US","og_type":"article","og_title":"Persistence And Caching Mechanism In Apache Spark - TechVidvan","og_description":"Persistence And Caching Mechanism in Spark RDD covers introduction, needs, benefits, storage levels, when to use RDD persistence and how to unpersist RDD.","og_url":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2018-01-06T12:36:02+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-Persistence-And-Caching-Mechanism-01.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"Persistence And Caching Mechanism In Apache Spark","datePublished":"2018-01-06T12:36:02+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/"},"wordCount":1183,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-Persistence-And-Caching-Mechanism-01.jpg","keywords":["Caching and Persistence","Caching vs Persistence in Apache Spark","Choosing RDD Persistence and Caching with Spark","persistence and caching in spark RDD","RDD Persistence &amp; Caching Mechanism","RDD Persistence and Caching Mechanism in Apache Spark","understanding persistance machanism","Understanding Spark Caching"],"articleSection":["Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/","url":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/","name":"Persistence And Caching Mechanism In Apache Spark - 
TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-Persistence-And-Caching-Mechanism-01.jpg","datePublished":"2018-01-06T12:36:02+00:00","description":"Persistence And Caching Mechanism in Spark RDD covers introduction, needs, benefits, storage levels, when to use RDD persistence and how to unpersist RDD.","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-Persistence-And-Caching-Mechanism-01.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-Persistence-And-Caching-Mechanism-01.jpg","width":1200,"height":628,"caption":"RDD Persistence And Caching Mechanism"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/persistence-and-caching-mechanism\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"Persistence And Caching Mechanism In Apache Spark"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan 
Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2010","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=2010"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2010\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73224"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=2010"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=2010"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=2010"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}