{"id":2009,"date":"2018-01-06T12:22:26","date_gmt":"2018-01-06T12:22:26","guid":{"rendered":"https:\/\/techvidvan.com\/tutorials\/?p=652"},"modified":"2018-01-06T12:22:26","modified_gmt":"2018-01-06T12:22:26","slug":"ways-to-create-rdd-in-spark","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/","title":{"rendered":"Ways To Create RDD In Spark with Examples"},"content":{"rendered":"<p>An RDD is a read-only, partitioned collection of records that lets developers work with distributed data efficiently. In this article, we will learn about the several ways to create an RDD in Spark.<\/p>\n<p>They are: 1. Using a parallelized collection, 2. From external datasets &amp; 3. From existing Apache Spark RDDs. To understand these concepts in depth, we will walk through a few examples of each method.<\/p>\n<h3>Ways to Create RDD in Spark<\/h3>\n<p>Spark RDDs are the core abstraction of Apache Spark. RDD stands for <strong>Resilient Distributed Datasets<\/strong>. We can think of RDDs as the backbone of Apache Spark. They are immutable in nature and support self-recovery, i.e. fault tolerance, which is the resilient property of RDDs.<\/p>\n<p>They are logically partitioned collections of objects which are usually stored in memory. RDDs can be operated on in parallel. We can create RDDs by performing operations either on other RDDs or on data in storage.<\/p>\n<p>Users can also control two aspects of an RDD manually: whether to <strong>cache<\/strong> it and how to <strong>divide<\/strong> it into partitions. Users may persist an RDD in memory so that it can be reused efficiently across parallel operations. RDDs are a read-only partitioned collection of records.<\/p>\n<p>We cannot modify an RDD once it has been created. This immutability helps RDDs avoid race conditions and cope with other failure scenarios.<\/p>\n<p>There are two types of operations we can perform on RDDs. 
They are transformations, which create a new dataset from an existing RDD, and actions.<\/p>\n<p>Actions return a value to the driver program after the computation on the dataset completes. In short, a transformation returns a new RDD, whereas an action returns a value.<\/p>\n<p>After learning about Apache Spark RDD, we will move forward towards the creation of RDDs.<\/p>\n<p>The ways to create an RDD in Spark are:<\/p>\n<p>1. Using a parallelized collection.<\/p>\n<p>2. From external datasets (referencing a dataset in an external storage system).<\/p>\n<p>3. From existing Apache Spark RDDs.<\/p>\n<p>Next, we will look at each of these ways to create an RDD in detail.<\/p>\n<h4>1. Using Parallelized collection<\/h4>\n<p>RDDs are generally created by the parallelizing method. We take an existing collection in our driver program, which may be written in Scala, Python or Java, and call SparkContext\u2019s parallelize() method on it.<\/p>\n<p>This is the most basic way to create an RDD and is typically used in the very first stages of working with Spark. It creates an RDD very quickly, so we can start applying operations to it right away. To use this method, we need the entire dataset on one machine.<\/p>\n<p>Due to this property, this process is rarely used outside of testing and prototyping.<\/p>\n<p>Consider the following example of the sortByKey() method. In this program, the values to be sorted are taken from a parallelized collection:<\/p>\n<h4>&#8211; For Example:<\/h4>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">\/\/ Build a pair RDD of (day, number) from a local collection\nval data = spark.sparkContext.parallelize(Seq((\"sun\",1),(\"mon\",2),(\"tue\",3),(\"wed\",4),(\"thu\",5)))\n\n\/\/ Sort the pairs by key, i.e. alphabetically by day name\nval sorted = data.sortByKey()\n\n\/\/ Bring the results to the driver so they print in sorted order\nsorted.collect().foreach(println)\n<\/pre>\n<p>The number of partitions into which a dataset is cut is a key point for parallelized collections. Spark runs one task for each partition. 
As we discussed earlier, we can also divide an RDD manually.<\/p>\n<p>That means we can set the number of partitions ourselves. To do so, we pass the number of partitions as the second parameter of the parallelize method.<\/p>\n<h4>&#8211; For Example:<\/h4>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">sc.parallelize(data, 20)<\/pre>\n<p>Here we set the number of partitions to 20 ourselves.<\/p>\n<p>We can see one more example below. In it, we apply the parallelize method and also specify the number of partitions ourselves:<\/p>\n<h4>&#8211; For Example:<\/h4>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">\/\/ Create an RDD with 4 partitions\nval rdd1 = spark.sparkContext.parallelize(Array(\"sun\",\"mon\",\"tue\",\"wed\",\"thu\",\"fri\"),4)\n\n\/\/ Reduce the number of partitions from 4 to 3\nval result = rdd1.coalesce(3)\n\nresult.foreach(println)\n<\/pre>\n<p>So, this is a beginner-level method to create our own RDDs. It also helps to create RDDs very quickly.<\/p>\n<p>Users generally use this method to create datasets of their own.<\/p>\n<h4>2. From external datasets (Referencing a dataset in external storage system)<\/h4>\n<p>Spark can create RDDs from any storage source supported by Hadoop, including our local file system. Apache Spark supports sequence files, text files, and any other Hadoop input format.<\/p>\n<p>We can create text file RDDs with SparkContext\u2019s textFile method. This method takes a URI for the file (either a local path on the machine or a hdfs:\/\/, s3n:\/\/, etc. URI) and reads the whole file as a collection of lines.<\/p>\n<p>If we use a path on the local file system, the file must also be available at the same path on the worker nodes.<\/p>\n<p>We can either copy the file to all worker nodes or use a network-mounted shared file system.<\/p>\n<p>To load a dataset from an external storage system, we can use the DataFrameReader interface. 
External storage systems include file systems and key-value stores. The reader supports many file formats, such as:<\/p>\n<h5>a. CSV (String Path) Example<\/h5>\n<p>In this example, we provide the path of a CSV file, and the reader returns a Dataset&lt;Row&gt; as a result.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">import org.apache.spark.sql.SparkSession\n\nobject DataFormat {\n\n  def main(args: Array[String]): Unit = {\n\n    val spark = SparkSession.builder.appName(\"ExtDataEx1\").master(\"local\").getOrCreate()\n\n    \/\/ Read the CSV file and convert the result to an RDD of rows\n    val dataRDD = spark.read.csv(\"path\/of\/csv\/file\").rdd\n\n  }\n}<\/pre>\n<p><strong>Note<\/strong> \u2013 In this example, the .rdd method is used. We use it to convert the Dataset&lt;Row&gt; to an RDD&lt;Row&gt;.<\/p>\n<h5>b. json (String Path) Example<\/h5>\n<p>In this example, we provide the path of a JSON file (one object per line), and the reader returns a Dataset&lt;Row&gt; as a result.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">val dataRDD = spark.read.json(\"path\/of\/json\/file\").rdd<\/pre>\n<h5>c. textFile (String Path) Example<\/h5>\n<p>In this example, we provide the path of a text file, and the reader returns a Dataset of strings as a result.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">val dataRDD = spark.read.textFile(\"path\/of\/text\/file\").rdd<\/pre>\n<h4>3. From existing Apache Spark RDDs<\/h4>\n<p>As we discussed earlier, an RDD is immutable, so we cannot change anything in it. Instead, we create a different RDD from the existing RDD. This process of creating another dataset from an existing one is called a transformation.<\/p>\n<p>As a result, a transformation always produces a new RDD. As RDDs are immutable, no changes take place in them once they are created. 
This property maintains consistency over the cluster.<\/p>\n<p>Some of the transformations we can apply to an RDD are map, filter, distinct, flatMap etc. (count, by contrast, is an action).<\/p>\n<h4>&#8211; For Example:<\/h4>\n<p>In this example, we build an RDD of words from a parallelized collection and then create a new RDD from it.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">val words = spark.sparkContext.parallelize(Seq(\"sun\", \"rises\", \"in\", \"the\", \"east\", \"and\", \"sets\", \"in\", \"the\", \"west\"))\n\n\/\/ Pair each word with its first character\nval wordPair = words.map(w =&gt; (w.charAt(0), w))\n\nwordPair.foreach(println)<\/pre>\n<p><b>Note <\/b>\u2013 In the above example, the RDD \u201cwordPair\u201d is created from the existing RDD \u201cwords\u201d using the map() transformation. The result pairs each word with its starting character.<\/p>\n<h3>Conclusion<\/h3>\n<p>Hence, we have learned the ways to <em>generate Spark RDD<\/em> in depth: from a parallelized collection, from external datasets and from existing Apache Spark RDDs. We also saw that we can manually <strong>cache<\/strong> an RDD and <strong>divide<\/strong> it into partitions.<\/p>\n<p>So these datasets are no longer difficult for us to operate on. This will enhance our efficiency while working with resilient distributed datasets.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>An RDD is a read-only, partitioned collection of records that lets developers work with distributed data efficiently. In this article, we will learn about the several ways to create an RDD in Spark. 
There are&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73353,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[614],"tags":[663,664,665,666,667,658,668,669],"class_list":["post-2009","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","tag-existing-apache-spark","tag-external-datasets","tag-how-to-create-spark-rdd","tag-parallelized-collection","tag-possible-ways-to-create-rdd-in-spark","tag-resilient-distributed-datasets","tag-tranformation-and-action","tag-ways-to-create-spark-rdd"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Ways To Create RDD In Spark with Examples - TechVidvan<\/title>\n<meta name=\"description\" content=\"Ways to create RDD in spark - create Spark RDD with spark parallelized collection, external datasets, and existing apache spark. Learn with spark examples.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Ways To Create RDD In Spark with Examples - TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Ways to create RDD in spark - create Spark RDD with spark parallelized collection, external datasets, and existing apache spark. 
Learn with spark examples.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-06T12:22:26+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Ways-to-Create-rdd-in-spark.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Ways To Create RDD In Spark with Examples - TechVidvan","description":"Ways to create RDD in spark - create Spark RDD with spark parallelized collection, external datasets, and existing apache spark. Learn with spark examples.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/","og_locale":"en_US","og_type":"article","og_title":"Ways To Create RDD In Spark with Examples - TechVidvan","og_description":"Ways to create RDD in spark - create Spark RDD with spark parallelized collection, external datasets, and existing apache spark. 
Learn with spark examples.","og_url":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2018-01-06T12:22:26+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Ways-to-Create-rdd-in-spark.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"Ways To Create RDD In Spark with Examples","datePublished":"2018-01-06T12:22:26+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/"},"wordCount":1071,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Ways-to-Create-rdd-in-spark.jpg","keywords":["Existing Apache Spark","external datasets","how to create spark rdd","parallelized collection","possible ways to create RDD in Spark","Resilient Distributed Datasets","Tranformation and Action","Ways to create Spark RDD"],"articleSection":["Spark 
Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/","url":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/","name":"Ways To Create RDD In Spark with Examples - TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Ways-to-Create-rdd-in-spark.jpg","datePublished":"2018-01-06T12:22:26+00:00","description":"Ways to create RDD in spark - create Spark RDD with spark parallelized collection, external datasets, and existing apache spark. Learn with spark examples.","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Ways-to-Create-rdd-in-spark.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Ways-to-Create-rdd-in-spark.jpg","width":1200,"height":628,"caption":"Ways to Create rdd in spark"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/ways-to-create-rdd-in-spark\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"Ways To Create RDD In Spark with 
Examples"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2009","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=2009"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2009\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73353"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=2009"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=2009"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=2009"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}