{"id":2034,"date":"2018-01-18T12:39:10","date_gmt":"2018-01-18T12:39:10","guid":{"rendered":"https:\/\/techvidvan.com\/tutorials\/?p=868"},"modified":"2018-01-18T12:39:10","modified_gmt":"2018-01-18T12:39:10","slug":"apache-spark-stage","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/","title":{"rendered":"Apache Spark Stage- Physical Unit Of Execution"},"content":{"rendered":"<p>In Apache Spark, a stage is a <em>physical unit of execution<\/em>, that is, a step in a physical execution plan. This document covers what a Spark stage is and the two types of stages in Spark: ShuffleMapStage and ResultStage.<\/p>\n<p>We will also look at the method Spark uses to create a new stage attempt.<\/p>\n<div id=\"attachment_73320\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Tasks-and-submitting-a-job-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-73320\" class=\"size-full wp-image-73320\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Tasks-and-submitting-a-job-01.jpg\" alt=\"tasks and submitting a job\" width=\"1200\" height=\"628\" \/><\/a><p id=\"caption-attachment-73320\" class=\"wp-caption-text\">Spark Stage- Tasks and Submitting A Job<\/p><\/div>\n<h3>Stage in Spark<\/h3>\n<p>In Apache Spark, a stage is a physical unit of execution, that is, a step in a <em>physical execution plan<\/em>. It is a set of parallel tasks, one task per partition. In other words, each job is divided into smaller sets of tasks, and these sets are called stages.<\/p>\n<p>Stages generally depend on one another, much like the map and reduce stages in\u00a0MapReduce. 
In short, a Spark job is a computation that is sliced into stages.<\/p>\n<p>Each stage is uniquely identified by its id. Whenever it creates a stage, DAGScheduler increments the internal counter nextStageId, which tracks the number of stage submissions.<\/p>\n<p>A stage can have many parent stages that it depends on, but it can only work on the partitions of a single RDD. The boundaries of a stage are marked by shuffle dependencies.<\/p>\n<p>Submitting a stage therefore triggers execution of its chain of dependent parent stages. Every stage also has a firstJobId, the id of the first job that submitted the stage.<\/p>\n<div id=\"attachment_73317\" style=\"width: 1090px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Submitting-a-job-triggers-execution-of-the-stage-and-its-parent-stages-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-73317\" class=\"size-full wp-image-73317\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Submitting-a-job-triggers-execution-of-the-stage-and-its-parent-stages-01.jpg\" alt=\"Submitting a Job Trigger Execution of the stage\" width=\"1080\" height=\"1080\" \/><\/a><p id=\"caption-attachment-73317\" class=\"wp-caption-text\">Spark Stage \u2013 Submitting a Job Triggers Execution of the Stage<\/p><\/div>\n<h3>Types of Spark Stage<\/h3>\n<p>In Spark, stages are of two types:<\/p>\n<p>1. ShuffleMapStage<\/p>\n<p>2. ResultStage<\/p>\n<p>Let&#8217;s discuss each type of stage in detail:<\/p>\n<h4><b>1. ShuffleMapStage<\/b><\/h4>\n<p>In the physical execution of a DAG, a\u00a0ShuffleMapStage is an intermediate stage. It produces data for one or more other stages. 
It also writes map output files for a shuffle.<\/p>\n<p>In <em>adaptive query planning \/ adaptive scheduling<\/em>, a ShuffleMapStage can also be the final stage in a job, and it can be submitted independently as a Spark job.<\/p>\n<p>A ShuffleMapStage saves map output files when executed. Those files can later be fetched by reduce tasks. The ShuffleMapStage is considered ready when all map outputs are available. Output locations can sometimes be missing, which means that partitions have not been computed yet or have been lost.<\/p>\n<p>To track how many shuffle map outputs are available, these stages use the outputLocs and _numAvailableOutputs internal registries. In the DAG of stages, a ShuffleMapStage is an input for the stages that follow it; it is the map side of a shuffle dependency.<\/p>\n<p>A ShuffleMapStage can contain multiple <b>pipelined operations<\/b>, for example map and filter, before the shuffle operation. A single ShuffleMapStage can also be shared across different jobs.<\/p>\n<div id=\"attachment_73096\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/DAGScheduler-and-Stages-for-a-job-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-73096\" class=\"size-full wp-image-73096\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/DAGScheduler-and-Stages-for-a-job-01.jpg\" alt=\"DAGScheduler and Job Stages\" width=\"1200\" height=\"628\" \/><\/a><p id=\"caption-attachment-73096\" class=\"wp-caption-text\">Spark Stage- DAGScheduler and Job Stages<\/p><\/div>\n<h4>2. ResultStage<\/h4>\n<p>A ResultStage is a stage that executes a Spark action in a user program by running a function on an RDD. It is the final stage in a job and applies a function to one or many partitions of the target RDD. 
In doing so, it computes the result of the action.<\/p>\n<div id=\"attachment_73136\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Graph-of-Stages-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-73136\" class=\"size-full wp-image-73136\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Graph-of-Stages-01.jpg\" alt=\" Graph of Stages\" width=\"1200\" height=\"628\" \/><\/a><p id=\"caption-attachment-73136\" class=\"wp-caption-text\">Spark Stage \u2013 Graph of Stages<\/p><\/div>\n<h3>Getting StageInfo For Most Recent Attempt<\/h3>\n<p>The latestInfo method returns the StageInfo for the most recent attempt of a stage.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">latestInfo: StageInfo<\/pre>\n<h3>Stage Contract<\/h3>\n<p>Stage is a private[scheduler] abstract contract.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">abstract class Stage {\n  def findMissingPartitions(): Seq[Int]\n}<\/pre>\n<h3>Method to Create New Spark Stage<\/h3>\n<p>A new stage attempt is created with the following method:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">makeNewStageAttempt(\n  numPartitionsToCompute: Int,\n  taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): Unit<\/pre>\n<p>It creates a new TaskMetrics and registers the internal accumulators using the RDD\u2019s SparkContext. It uses the same RDD that was defined when the stage was created.<\/p>\n<p>It then sets latestInfo to a StageInfo built from the stage, nextAttemptId, numPartitionsToCompute, and taskLocalityPreferences. 
It also increments the nextAttemptId counter.<\/p>\n<p><strong>Note:<\/strong><b> \u00a0<\/b>This method is used only when DAGScheduler submits missing tasks for a stage.<\/p>\n<h3>Conclusion<\/h3>\n<p>In this blog, we have covered the concept of stages in Spark, the two stage types, and how a new stage attempt is created. We hope this document answered your questions about stages in Apache Spark.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In Apache Spark, a stage is a physical unit of execution. We can say, it is a step in a physical execution plan. In this document, we will learn the whole concept of spark&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73320,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[614],"tags":[942,943,944,945,753],"class_list":["post-2034","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","tag-apache-spark-stage-physical-unit-of-execution","tag-how-are-stages-split-into-tasks-in-spark","tag-spark-stages-discription","tag-stage-in-spark","tag-understanding-your-apache-spark-application-through-visualization"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Apache Spark Stage- Physical Unit Of Execution - TechVidvan<\/title>\n<meta name=\"description\" content=\"Spark Stage- what is spark stage, types of stage in spark, spark stage contract, methods to create new stage in spark, DAGScheduler in Spark,ResultStage,ShuffleMapStage\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache Spark Stage- Physical Unit Of Execution - 
TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Spark Stage- what is spark stage, types of stage in spark, spark stage contract, methods to create new stage in spark, DAGScheduler in Spark,ResultStage,ShuffleMapStage\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-18T12:39:10+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Tasks-and-submitting-a-job-01.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Apache Spark Stage- Physical Unit Of Execution - TechVidvan","description":"Spark Stage- what is spark stage, types of stage in spark, spark stage contract, methods to create new stage in spark, DAGScheduler in Spark,ResultStage,ShuffleMapStage","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/","og_locale":"en_US","og_type":"article","og_title":"Apache Spark Stage- Physical Unit Of Execution - TechVidvan","og_description":"Spark Stage- what is spark stage, types of stage in spark, spark stage contract, methods to create new stage in spark, DAGScheduler in Spark,ResultStage,ShuffleMapStage","og_url":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2018-01-18T12:39:10+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Tasks-and-submitting-a-job-01.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"Apache Spark Stage- Physical Unit Of Execution","datePublished":"2018-01-18T12:39:10+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/"},"wordCount":709,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Tasks-and-submitting-a-job-01.jpg","keywords":["Apache Spark Stage: Physical Unit Of Execution","How are stages split into tasks in Spark?","Spark Stages discription","Stage in Spark","Understanding your Apache Spark Application Through Visualization"],"articleSection":["Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/","url":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/","name":"Apache Spark Stage- Physical Unit Of Execution - TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Tasks-and-submitting-a-job-01.jpg","datePublished":"2018-01-18T12:39:10+00:00","description":"Spark Stage- what is spark stage, 
types of stage in spark, spark stage contract, methods to create new stage in spark, DAGScheduler in Spark,ResultStage,ShuffleMapStage","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Tasks-and-submitting-a-job-01.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Tasks-and-submitting-a-job-01.jpg","width":1200,"height":628,"caption":"Spark Stage- Tasks and Submitting A Job"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-stage\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"Apache Spark Stage- Physical Unit Of Execution"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan 
Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2034","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=2034"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2034\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73320"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=2034"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=2034"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=2034"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}