{"id":2031,"date":"2018-01-13T12:34:59","date_gmt":"2018-01-13T12:34:59","guid":{"rendered":"https:\/\/techvidvan.com\/tutorials\/?p=812"},"modified":"2018-01-13T12:34:59","modified_gmt":"2018-01-13T12:34:59","slug":"spark-streaming-checkpoint","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/","title":{"rendered":"A Quick Guide On Apache Spark Streaming Checkpoint"},"content":{"rendered":"<p>This document covers Spark Streaming checkpointing. We start with what a streaming checkpoint is and how it helps to achieve fault tolerance. There are two types of Spark checkpointing: reliable checkpointing and local checkpointing.<\/p>\n<p>In this Spark Streaming tutorial, we will learn both types in detail. We will also compare checkpointing with persist() in Spark.<\/p>\n<div id=\"attachment_73297\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Spark-Streaming-Checkpoint-in-Apache-Spark-Copy.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-73297\" class=\"wp-image-73297 size-full\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Spark-Streaming-Checkpoint-in-Apache-Spark-Copy.jpg\" alt=\"apache spark streaming checkpoint\" width=\"1200\" height=\"628\" \/><\/a><p id=\"caption-attachment-73297\" class=\"wp-caption-text\">Spark Streaming Checkpoint in Apache Spark<\/p><\/div>\n<h3>What is Spark Streaming Checkpoint<\/h3>\n<p>Checkpointing is the process of writing received records to fault-tolerant storage such as HDFS at checkpoint intervals. A streaming application must operate 24\/7. Hence, it must be resilient to failures unrelated to the application logic, such as system failures and JVM crashes.<\/p>\n<p>Checkpointing creates fault-tolerant stream processing pipelines.
So, input DStreams can restore the pre-failure streaming state and continue stream processing.<\/p>\n<p>In Spark Streaming, DStreams can checkpoint input data at specified time intervals. To make recovery possible, the application needs to checkpoint enough information to a fault-tolerant storage system that it can recover from failures. Two types of data are checkpointed.<\/p>\n<h4>1. Metadata checkpointing<\/h4>\n<p>We use it to recover from the failure of the node running the driver of the streaming application. Metadata includes:<\/p>\n<ol>\n<li style=\"list-style-type: none\">\n<ul>\n<li><strong>Configuration<\/strong> &#8211; The configuration used to create the streaming application.<\/li>\n<li><strong>DStream operations<\/strong> &#8211; The set of DStream operations that define the streaming application.<\/li>\n<li><strong>Incomplete batches<\/strong> &#8211; Batches whose jobs are queued but have not completed yet.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<h4>2. Data checkpointing<\/h4>\n<p>The generated RDDs are saved to reliable storage. This is necessary for some stateful transformations that combine data across multiple batches, because the RDDs generated in such transformations depend on the RDDs of previous batches.<\/p>\n<p>This causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time, the intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage, which cuts off the dependency chains.<\/p>\n<p>In other words, metadata checkpointing is primarily needed for recovery from driver failures, whereas data or RDD checkpointing is necessary even for basic functioning if we use stateful transformations.<\/p>\n<h3>When to enable Checkpointing in Spark Streaming<\/h3>\n<p>Checkpointing must be enabled for applications with any of the following requirements:<\/p>\n<h4>1. While we use stateful transformations<\/h4>\n<p>The checkpoint directory must be provided to allow for periodic RDD checkpointing.
This applies when the application uses stateful transformations such as updateStateByKey or reduceByKeyAndWindow (with an inverse function).<\/p>\n<h4>2. Recovering from failures of the driver running the application<\/h4>\n<p>To recover with progress information, we use metadata checkpoints.<\/p>\n<p><b>Note:\u00a0<\/b>Simple streaming applications without the stateful transformations mentioned above can run without enabling checkpointing. In that case, the recovery from driver failures will be partial.<\/p>\n<p>Also, remember that some received but unprocessed data may get lost. This is often acceptable, and many run Spark Streaming applications in this way.<\/p>\n<h3>Marking StreamingContext as Checkpointed<\/h3>\n<p>To persist checkpoint data, we use the \u201cStreamingContext.checkpoint\u201d method. It sets up an HDFS-compatible checkpoint directory.<\/p>\n<h3>For Example<\/h3>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">ssc.checkpoint(\"_checkpoint\")<\/pre>\n<h3>Types of Checkpointing in Spark Streaming<\/h3>\n<p>Apache Spark checkpointing falls into two categories:<\/p>\n<h4>1. Reliable Checkpointing<\/h4>\n<p>In reliable checkpointing, the actual RDD is saved to a reliable distributed file system, e.g. HDFS. We need to call the following method to set the checkpoint directory:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">SparkContext.setCheckpointDir(directory: String)<\/pre>\n<p>While running on a cluster, the directory must be an HDFS path. Otherwise, the driver may try to recover the checkpointed RDD from its own local file system, which fails because the checkpoint files are actually on the executors\u2019 machines.<\/p>\n<h4>2. Local Checkpointing<\/h4>\n<p>Local checkpointing truncates the RDD lineage graph, which is useful in Spark Streaming or GraphX. In local checkpointing, we persist the RDD to local storage in the executors.<\/p>\n<h3>Difference between Spark Checkpointing and Persist<\/h3>\n<p>Spark checkpointing and persist differ in several ways.
Let\u2019s discuss them one by one.<\/p>\n<h4>Persist<\/h4>\n<ul>\n<li>When we persist an RDD with the DISK_ONLY storage level, the RDD is stored on disk, so subsequent uses of the RDD do not go back to recomputing the lineage.<\/li>\n<li>After persist() is called, Spark still remembers the lineage of the RDD, even though it doesn\u2019t use it.<\/li>\n<li>As soon as the job run is complete, Spark clears the cache and also destroys all the files.<\/li>\n<\/ul>\n<h4>Checkpointing<\/h4>\n<ul>\n<li>Through checkpointing, RDDs get stored in HDFS, and the lineage that created them is deleted.<\/li>\n<li>Unlike the cache, the checkpoint file is not deleted upon completing the job run.<\/li>\n<li>Checkpointing an RDD results in double computation: the RDD is computed once for the job and again to write the checkpoint files, unless it is cached first.<\/li>\n<\/ul>\n<h3>Spark Streaming Checkpoint &#8211; Conclusion<\/h3>\n<p>In this Spark Streaming Checkpoint tutorial, we saw that checkpointing lets a Spark Streaming application achieve fault tolerance for its streaming data.<\/p>\n<p>Moreover, unlike with the persist method, the checkpoint files are not removed when the job is complete. Hence, an RDD in Apache Spark should be checkpointed if its computation takes a long time, if its computing chain is too long, or if it depends on too many RDDs.<\/p>\n<p>Checkpointing also helps to avoid unbounded increases in recovery time. Ultimately, checkpointing can improve the recovery performance of the system.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This document aims at a Spark Streaming Checkpoint, we will start with what is a streaming checkpoint, how streaming checkpoint helps to achieve fault tolerance. There are two types of spark checkpoint i.e.
reliable&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73297,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[614],"tags":[860,915,916,917,648,918,919,920,865,866,921,922,869],"class_list":["post-2031","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","tag-apache-spark-streaming","tag-checkpointing--spark-streaming","tag-checkpointing-in-spark","tag-spark-checkpoint","tag-spark-streaming","tag-spark-streaming-checkpoint","tag-spark-streaming-checkpoint-in-apache-spark","tag-spark-streaming-checkpoints-for-dstreams","tag-spark-streaming-examples","tag-spark-streaming-tutorial","tag-streaming","tag-streaming-checkpoint-in-apache-spark-quick-guide","tag-streaming-in-spark"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>A Quick Guide On Apache Spark Streaming Checkpoint - TechVidvan<\/title>\n<meta name=\"description\" content=\"Spark Streaming Checkpoint- what is spark streaming checkpoint, when to enable checkpoint in spark, marking streamingcontext as a checkpointed in spark, types of spark checkpointing, difference between spark checkpointing vs persist\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A Quick Guide On Apache Spark Streaming Checkpoint - TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Spark Streaming Checkpoint- what is spark streaming checkpoint, when to enable checkpoint in spark, marking streamingcontext as a checkpointed in spark, types of spark checkpointing, 
difference between spark checkpointing vs persist\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-13T12:34:59+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-Streaming-Checkpoint-in-Apache-Spark-Copy.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"A Quick Guide On Apache Spark Streaming Checkpoint - TechVidvan","description":"Spark Streaming Checkpoint- what is spark streaming checkpoint, when to enable checkpoint in spark, marking streamingcontext as a checkpointed in spark, types of spark checkpointing, difference between spark checkpointing vs persist","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/","og_locale":"en_US","og_type":"article","og_title":"A Quick Guide On Apache Spark Streaming Checkpoint - TechVidvan","og_description":"Spark Streaming Checkpoint- what is spark streaming checkpoint, when to enable checkpoint in spark, marking streamingcontext as a checkpointed in spark, types of spark checkpointing, difference between spark checkpointing vs persist","og_url":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2018-01-13T12:34:59+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-Streaming-Checkpoint-in-Apache-Spark-Copy.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"A Quick Guide On Apache Spark Streaming Checkpoint","datePublished":"2018-01-13T12:34:59+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/"},"wordCount":815,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-Streaming-Checkpoint-in-Apache-Spark-Copy.jpg","keywords":["Apache Spark streaming","Checkpointing \u00b7 Spark Streaming","checkpointing in spark","Spark Checkpoint","spark streaming","spark streaming checkpoint","Spark Streaming Checkpoint in Apache Spark","Spark streaming checkpoints for DStreams","spark streaming examples","Spark streaming tutorial","streaming","Streaming Checkpoint in Apache Spark: Quick Guide","streaming in spark"],"articleSection":["Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/","url":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/","name":"A Quick Guide On Apache Spark Streaming Checkpoint - 
TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-Streaming-Checkpoint-in-Apache-Spark-Copy.jpg","datePublished":"2018-01-13T12:34:59+00:00","description":"Spark Streaming Checkpoint- what is spark streaming checkpoint, when to enable checkpoint in spark, marking streamingcontext as a checkpointed in spark, types of spark checkpointing, difference between spark checkpointing vs persist","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-Streaming-Checkpoint-in-Apache-Spark-Copy.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-Streaming-Checkpoint-in-Apache-Spark-Copy.jpg","width":1200,"height":628,"caption":"apache spark streaming checkpoint"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/spark-streaming-checkpoint\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"A Quick Guide On Apache Spark Streaming Checkpoint"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan 
Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2031","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=2031"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2031\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73297"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=2031"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=2031"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=2031"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}