{"id":2029,"date":"2018-01-16T11:41:15","date_gmt":"2018-01-16T11:41:15","guid":{"rendered":"https:\/\/techvidvan.com\/tutorials\/?p=790"},"modified":"2018-01-16T11:41:15","modified_gmt":"2018-01-16T11:41:15","slug":"spark-dstream","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/","title":{"rendered":"Spark DStream: Abstraction of Spark Streaming"},"content":{"rendered":"<p>Spark DStream (<strong>Discretized Stream<\/strong>) is the basic abstraction of Spark Streaming. In this blog, we will learn the concept of DStream in Spark, we will learn what is DStream, operations of DStream such as stateless and stateful transformations and output operation.<\/p>\n<div id=\"attachment_73043\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Apache-Spark-Dstream-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-73043\" class=\"size-full wp-image-73043\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Apache-Spark-Dstream-01.jpg\" alt=\"Spark DStream - Introduction\" width=\"1200\" height=\"628\" \/><\/a><p id=\"caption-attachment-73043\" class=\"wp-caption-text\">Introduction of Apache Spark DStream<\/p><\/div>\n<h3>What is Discretized Stream (DStream)<\/h3>\n<h4><b>Introduction of Spark Streaming<\/b><\/h4>\n<p>The core Spark API&#8217;s extension is what we call a \u201cSpark Streaming\u201d. It enables high-throughput,\u00a0scalability, fault-tolerant stream processing of live data streams. Ingestion of data is possible from many sources like Kafka, Flume, Kinesis, or TCP sockets.<\/p>\n<p>We can also process it by using complex algorithms expressed with high-level functions, such as map, reduce, join and window. Afterwards, data which is already processed can be pushed out to filesystems. In addition, we can apply spark\u2019s machine learning and graph processing on data streams.<\/p>\n<h4><b>DStream<\/b><\/h4>\n<p>The basic abstraction of Spark Streaming is Spark DStream (Discretized Stream). It is a continuous sequence of spark RDDs, that represents a continuous stream of data, that is Spark\u2019s abstraction of an <em>immutable<\/em>, distributed dataset.<\/p>\n<p>Creation of DStreams can be possible from live data, such as data from HDFS, Kafka or Flume. Also, generation of Dstream is possible by transformation existing DStreams using operations, such as map, window, and reduceByKey and window.<\/p>\n<p>Each DStream periodically generates a Spark RDD, while a spark streaming program is running, that RDD is either generated by live data or by transforming the RDD generated by a parent DStream.<\/p>\n<p>There are few basic properties of DStreams:<\/p>\n<ul>\n<li>Record of other DStreams that the DStream depends on.<\/li>\n<li>Time duration at which DStream generates an RDD.<\/li>\n<li>The function which we use to generate an RDD after each time interval.<\/li>\n<\/ul>\n<p>Operation applied on Spark DStream translates to operations on underlying RDDs. Spark engine computes these underlying RDD transformations, that offers developer a high-level API for convenience. Hence, DStream simplifies working with streaming data.<\/p>\n<h3>Input DStreams and Receivers<\/h3>\n<p>The stream of input data received from streaming sources is represented as DStream, which are input DStream. With every input DStream object, a receiver (Scala doc, Java doc) object is associated, that receives the data from a source. Also, stores it in spark\u2019s memory for processing.<\/p>\n<p>Following are the two types of built-in streaming sources.<\/p>\n<ul>\n<li><b> Basic sources:<\/b> Those are directly available in the StreamingContext API are basic sources, such as file systems, and socket connections.<\/li>\n<li><b> Advanced sources:<\/b> Advanced sources, those are available through extra utility classes, for example, Kafka, Flume, Kinesis and much more. These sources require linking against extra dependencies.<\/li>\n<\/ul>\n<p>We can receive many streams of data in parallel by creating multiple input DStream. This process will create many receivers, so we will receive many data streams at the same time. Although, a spark executor\/worker is a long-running task.<\/p>\n<p>Hence, it occupies one of the cores allocated to the spark streaming application. Importantly, we need to allocate enough cores to the streaming application, that helps to process the received data also to run the receiver.<\/p>\n<h4>1. Basic Sources<\/h4>\n<p>Some of these basic sources are as follows-<\/p>\n<p><strong>a. File Streams:<\/strong><\/p>\n<p>We can create DStreams in:<\/p>\n<ul>\n<li>Scala<\/li>\n<li>Java<\/li>\n<li>Python<\/li>\n<\/ul>\n<p><b>For Scala:<\/b><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)<\/pre>\n<p><b>For Java: <\/b><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">streamingContext.fileStream&lt;KeyClass, ValueClass, InputFormatClass&gt;(dataDirectory);<\/pre>\n<p><b>For Python: <\/b><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">streamingContext.textFileStream(dataDirectory)<\/pre>\n<p>The directory datadirectory will be monitor by spark streaming. Streaming also process any files created in that directory.<\/p>\n<p>There are few conditions:<\/p>\n<ol>\n<li>There should be same data format for all the files.<\/li>\n<li>A file must create automatically in datadirectory, either by moving or renaming them into data directory.<\/li>\n<li>If files moved once, the files must not be changed. Because if files are being continuously appended, the new data will not be read.<\/li>\n<\/ol>\n<p><strong>b. Streams based on Custom Receivers:<\/strong><\/p>\n<p>We can create DStreams with data streams received by custom receivers.<\/p>\n<p><strong>c. The queue of RDDs as a Stream: <\/strong><\/p>\n<p>One can also create a DStream based on a queue of RDDs to test a spark streaming application with test data. We can create it by using the following method:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">streamingContext.queueStream(queueOfRDDs)<\/pre>\n<h4>2. Advance Sources<\/h4>\n<p>There are so many advance sources, some of them are:<\/p>\n<ul>\n<li><b>Kafka:<\/b> Kafka broker versions 0.8.2.1 is compatible with spark streaming 2.2.0.<\/li>\n<li><b>Flume:<\/b> \u00a0Flume 1.6.0 is compatible with spark streaming 2.2.0.<\/li>\n<li><b>Kinesis:<\/b> Kinesis client library 1.2.1 is compatible with spark streaming 2.2.0.<\/li>\n<\/ul>\n<h3>Apache Spark DStream Operations<\/h3>\n<p>Spark DStream supports two types of operations:<\/p>\n<p>1. Transformations<\/p>\n<p>2. Output operations<\/p>\n<h4>1. Transformation<\/h4>\n<p>Transformation in spark DStream are two categories:<\/p>\n<ol>\n<li>Stateless Transformations<\/li>\n<li>Stateful Transformations<\/li>\n<\/ol>\n<p><strong>a. Stateless Transformations<\/strong><\/p>\n<p>In stateless transformations, we don&#8217;t need data of previous batches for processing. As a result, these are simple RDD transformations. We apply it to each batch means every RDD in a DStream. Some common RDD transformations for example map(), filter(), reduceByKey() etc.<\/p>\n<p>Although, each transformation applies to each spark RDD. So, it seems like applying it to the whole stream as each DStream is a collection of many RDDs (batches) in spark. We can combine data from many DStreams within each time step in this transformation.<\/p>\n<p>DStreams comes with an advanced operator called <strong>transform()<\/strong>. We use transform(), if in case stateless transformations are insufficient. The transform() allow operating on the RDDs inside them.<\/p>\n<p>It also allows any arbitrary RDD-to-RDD function to act on the DStream. To produce a new stream, this function gets called on each batch of data in the stream to produce a new stream.<\/p>\n<p><strong>b. Stateful Transformations<\/strong><\/p>\n<div id=\"attachment_73306\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Stateful-Transformations-in-Spark-Streaming-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-73306\" class=\"size-full wp-image-73306\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Stateful-Transformations-in-Spark-Streaming-01.jpg\" alt=\"spark dstream - stateful transformation\" width=\"1200\" height=\"628\" \/><\/a><p id=\"caption-attachment-73306\" class=\"wp-caption-text\">Stateful Transformation In Spark DStream<\/p><\/div>\n<p>For computation of current batch results, it uses intermediate results from previous batches. These operations on DStreams track data across time. <span class=\"complexword\">Therefore<\/span>, it makes use of some data from previous batches to generate the results for a new batch.<\/p>\n<p><span class=\"adverb\">Importantly<\/span>, two main operations such as windowed operation and updateStateByKey(). The windowed operation is which act over a sliding window of time periods. While updateStateByKey() is which used to track state across events for each key.<\/p>\n<div class=\"\">\n<h4 class=\"public-DraftStyleDefault-block public-DraftStyleDefault-ltr\">2. Output Operation<\/h4>\n<\/div>\n<p>After transformation on data, output operation is performed on that data in streaming. Now, as debugging of our program is done. Afterwards, by using output operation we can save our output. Output operations like print(), save() and much more.<\/p>\n<p>Therefore, by save operation, we take directory to save file into and an optional suffix. Also, by print() operation we take in the first 10 elements from each batch of DStream and prints result.<\/p>\n<h3>Conclusion<\/h3>\n<p>By study above information on Spark DStream, RDD is Spark\u2019s core abstraction and DStream is streaming&#8217;s high-level abstraction. It is a continuous sequence of RDDs, that represents a continuous stream of data.<\/p>\n<p>Hence, we can <span class=\"complexword\">obtain<\/span> DStream from input DStream like Kafka, Flume &amp; much more. Also, we can apply the transformation on the existing DStream to get a new DStream.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Spark DStream (Discretized Stream) is the basic abstraction of Spark Streaming. In this blog, we will learn the concept of DStream in Spark, we will learn what is DStream, operations of DStream such as&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73043,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[614],"tags":[901,902,903,904,905,906,907],"class_list":["post-2029","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","tag-apache-spark-dstream-discretized-streams","tag-discretized-streams-dstreams","tag-discretized-streams-dstreams--spark-streaming","tag-discretized-streams-an-efficient-and-fault-tolerant-model","tag-discretized-streams-fault-tolerant-streaming","tag-spark-dstream-abstraction-of-spark-streaming","tag-spark-streaming-dstream"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Spark DStream: Abstraction of Spark Streaming - TechVidvan<\/title>\n<meta name=\"description\" content=\"Spark DStream (Discretized Stream)-spark streaming,what is Dstream,input DStream and receivers,Spark DStream operations-Stateless,stateful Transformation\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Spark DStream: Abstraction of Spark Streaming - TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Spark DStream (Discretized Stream)-spark streaming,what is Dstream,input DStream and receivers,Spark DStream operations-Stateless,stateful Transformation\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-16T11:41:15+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Dstream-01.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Spark DStream: Abstraction of Spark Streaming - TechVidvan","description":"Spark DStream (Discretized Stream)-spark streaming,what is Dstream,input DStream and receivers,Spark DStream operations-Stateless,stateful Transformation","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/","og_locale":"en_US","og_type":"article","og_title":"Spark DStream: Abstraction of Spark Streaming - TechVidvan","og_description":"Spark DStream (Discretized Stream)-spark streaming,what is Dstream,input DStream and receivers,Spark DStream operations-Stateless,stateful Transformation","og_url":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2018-01-16T11:41:15+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Dstream-01.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"Spark DStream: Abstraction of Spark Streaming","datePublished":"2018-01-16T11:41:15+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/"},"wordCount":1108,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Dstream-01.jpg","keywords":["Apache Spark DStream (Discretized Streams)","Discretized Streams (DStreams)","Discretized Streams (DStreams) \u00b7 Spark Streaming","Discretized Streams: An Efficient and Fault-Tolerant Model","Discretized Streams: Fault-Tolerant Streaming","Spark DStream: abstraction of Spark Streaming","spark.streaming.DStream"],"articleSection":["Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/spark-dstream\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/","url":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/","name":"Spark DStream: Abstraction of Spark Streaming - TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Dstream-01.jpg","datePublished":"2018-01-16T11:41:15+00:00","description":"Spark DStream (Discretized Stream)-spark streaming,what is Dstream,input DStream and receivers,Spark DStream operations-Stateless,stateful Transformation","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/spark-dstream\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Dstream-01.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Dstream-01.jpg","width":1200,"height":628,"caption":"Introduction of Apache Spark DStream"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/spark-dstream\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"Spark DStream: Abstraction of Spark Streaming"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2029","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=2029"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2029\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73043"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=2029"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=2029"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=2029"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}