{"id":2026,"date":"2018-01-11T11:00:24","date_gmt":"2018-01-11T11:00:24","guid":{"rendered":"https:\/\/techvidvan.com\/tutorials\/?p=840"},"modified":"2018-01-11T11:00:24","modified_gmt":"2018-01-11T11:00:24","slug":"apache-spark-paired-rdd","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/","title":{"rendered":"Apache Spark Paired RDD: Creation &amp; Operations"},"content":{"rendered":"<p>In Apache Spark, an RDD of <em>key-value<\/em> pairs is known as a <strong>paired RDD<\/strong>. In this blog, we will learn about paired RDDs in Spark in detail.<\/p>\n<p>To understand them in depth, we will look at how to create paired RDDs and at the operations available on them, namely the transformations and actions of Spark RDDs.<\/p>\n<p>The transformations include groupByKey, reduceByKey, join, and leftOuterJoin\/rightOuterJoin, while the actions include countByKey. But first, here is a brief introduction to RDDs in Spark.<\/p>\n<p><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Spark-RDD.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-73282\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Spark-RDD.jpg\" alt=\"spark RDD\" width=\"1200\" height=\"628\" \/><\/a><\/p>\n<h3>Introduction &#8211; Spark RDDs<\/h3>\n<p><strong>RDD<\/strong> stands for Resilient Distributed Dataset, the core abstraction and fundamental data structure of Spark. RDDs in Spark are <em>immutable<\/em>, distributed collections of objects. Each RDD is divided into logical partitions.<\/p>\n<p>Each partition may be computed on a different node of the cluster. Spark RDDs can contain any type of Scala, Python, or Java object, including user-defined classes.<\/p>\n<p>An RDD is a read-only, partitioned collection of records. 
Spark RDDs are fault-tolerant collections of elements that can be operated on in parallel. There are generally three ways to create Spark RDDs: from data in stable storage, from other RDDs, and by parallelizing an existing collection in the driver program. By using RDDs, it is possible to achieve faster and more efficient MapReduce operations.<\/p>\n<h3>Introduction &#8211; Apache Spark Paired RDD<\/h3>\n<p>Spark paired RDDs are RDDs whose elements are key-value pairs. A <strong>key-value pair<\/strong> (KVP) consists of two linked data items. The key is the <em>identifier<\/em>, while the value is the <em>data corresponding<\/em> to that key.<\/p>\n<p>Most Spark operations work on RDDs containing any type of object, but a few special operations are available only on RDDs of key-value pairs. The most common ones are distributed \u201cshuffle\u201d operations, such as grouping or aggregating the elements by a key.<\/p>\n<p>In Scala, these operations are automatically available on RDDs containing Tuple2 objects. The key-value pair operations are defined in the PairRDDFunctions class, which automatically wraps around an RDD of tuples.<\/p>\n<p><strong>For example:<\/strong><\/p>\n<p>The following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">val lines = sc.textFile(\"data1.txt\")\n\/\/ Pair each line with an initial count of 1\nval pairs = lines.map(s =&gt; (s, 1))\n\/\/ Sum the counts of identical lines\nval counts = pairs.reduceByKey((a, b) =&gt; a + b)<\/pre>\n<p>We could also use <b>counts.sortByKey()<\/b>, for example, to sort the pairs alphabetically.<\/p>\n<h3>Importance of Apache Spark Paired RDD<\/h3>\n<p>In many programs, pair RDDs are a useful <em>building block<\/em>. They expose operations that allow us to act on each key in parallel. 
They also help to regroup data across the network.<\/p>\n<p>For instance, pair RDDs have a reduceByKey() method that aggregates data separately for each key, and a join() method that merges two RDDs by grouping elements with the same key. It is common to extract fields from an RDD, representing, for instance, an event time, customer ID, or other identifier, and to use those fields as keys in pair RDD operations.<\/p>\n<h3>Creating Paired RDD in Spark<\/h3>\n<p>We can create pair RDDs by running a map() function that returns key-value pairs. The procedure to build key-value RDDs differs by language.<\/p>\n<ul>\n<li>\n<h4>In Python language<\/h4>\n<\/li>\n<\/ul>\n<p>For the functions on keyed data to work, we need to return an RDD composed of tuples. The following creates a pair RDD in Python, using the first word of each line as the key:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">pairs = lines.map(lambda x: (x.split(\" \")[0], x))<\/pre>\n<ul>\n<li>\n<h4>In Scala language<\/h4>\n<\/li>\n<\/ul>\n<p>As in the Python example, we need to return tuples for the functions on keyed data to be available. An implicit conversion on RDDs of tuples provides the extra key-value functions.<\/p>\n<p>The following creates a pair RDD in Scala, using the first word of each line as the key:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">val pairs = lines.map(x =&gt; (x.split(\" \")(0), x))\n<\/pre>\n<ul>\n<li>\n<h4>In Java language<\/h4>\n<\/li>\n<\/ul>\n<p>Java doesn\u2019t have a built-in tuple type, so Spark\u2019s Java API has users create tuples using the scala.Tuple2 class. Users can construct a new tuple in Java by writing new Tuple2(elem1, elem2). 
Its elements can then be accessed with the _1() and _2() methods.<\/p>\n<p>In addition, when creating paired RDDs in Java, we need to call special versions of Spark\u2019s functions. For example, the mapToPair() function should be used in place of the basic map() function.<\/p>\n<p>The following creates a pair RDD in Java, using the first word of each line as the key:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">PairFunction&lt;String, String, String&gt; keyData =\n    new PairFunction&lt;String, String, String&gt;() {\n        \/\/ Key each line by its first word\n        public Tuple2&lt;String, String&gt; call(String x) {\n            return new Tuple2&lt;String, String&gt;(x.split(\" \")[0], x);\n        }\n    };\nJavaPairRDD&lt;String, String&gt; pairs = lines.mapToPair(keyData);<\/pre>\n<h3>Some Interesting Spark Paired RDD &#8211; Operations<\/h3>\n<h4>1. Transformation Operations<\/h4>\n<p>All the transformations available to standard RDDs are also available to pair RDDs, and the same rules from <em>\u201cpassing functions to Spark\u201d<\/em> apply.<\/p>\n<p>Since pair RDDs contain tuples, we need to pass functions that operate on tuples rather than on individual elements. Some of the transformation methods are listed here. 
For example:<\/p>\n<ul>\n<li>\n<h5>groupByKey()<\/h5>\n<\/li>\n<\/ul>\n<p>Groups all the values with the same key.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">rdd.groupByKey()\n<\/pre>\n<ul>\n<li>\n<h5>reduceByKey(func)<\/h5>\n<\/li>\n<\/ul>\n<p>Combines values with the same key.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">rdd.reduceByKey((x, y) =&gt; x + y)\n<\/pre>\n<ul>\n<li>\n<h5>combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)<\/h5>\n<\/li>\n<\/ul>\n<p>Combines values with the same key, using a different result type.<\/p>\n<ul>\n<li>\n<h5>mapValues(func)<\/h5>\n<\/li>\n<\/ul>\n<p>Applies a function to each value of a pair RDD without changing the key.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">rdd.mapValues(x =&gt; x + 1)\n<\/pre>\n<ul>\n<li>\n<h5>keys()<\/h5>\n<\/li>\n<\/ul>\n<p>Returns an RDD of just the keys.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">rdd.keys()\n<\/pre>\n<ul>\n<li>\n<h5>values()<\/h5>\n<\/li>\n<\/ul>\n<p>Returns an RDD of just the values.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">rdd.values()<\/pre>\n<ul>\n<li>\n<h5>sortByKey()<\/h5>\n<\/li>\n<\/ul>\n<p>Returns an RDD sorted by the key.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">rdd.sortByKey()<\/pre>\n<h4>2. Action Operations<\/h4>\n<p>As with transformations, all the actions available on base RDDs are also available on pair RDDs. Pair RDDs additionally offer some actions that take advantage of the key\/value nature of the data. Some of them are listed below. 
For example:<\/p>\n<ul>\n<li>\n<h5>countByKey()<\/h5>\n<\/li>\n<\/ul>\n<p>Counts the number of elements for each key.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">rdd.countByKey()<\/pre>\n<ul>\n<li>\n<h5>collectAsMap()<\/h5>\n<\/li>\n<\/ul>\n<p>Collects the result as a map to provide easy lookup.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">rdd.collectAsMap()<\/pre>\n<ul>\n<li>\n<h5>lookup(key)<\/h5>\n<\/li>\n<\/ul>\n<p>Returns all values associated with the provided key.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">rdd.lookup(key)<\/pre>\n<h3>Conclusion<\/h3>\n<p>We have seen how to work with key\/value data in Spark and how to use the specialized functions and operations available on pair RDDs. We hope this article has answered your questions about Spark paired RDDs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In Apache Spark, Key-value pairs are known as paired RDD. In this blog, we will learn what are paired RDDs in Spark in detail. 
To understand in deep, we will focus on following methods&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73282,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[614],"tags":[839,840,841,842,843,844,845,846,847],"class_list":["post-2026","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","tag-action-operation-in-spark","tag-apache-spark-paired-rdd-transformations-and-actions","tag-pair-rdds-transformations-and-actions","tag-paired-rdd-in-spark","tag-rdd-in-spark","tag-spark-rdd-operations","tag-spark-pairrddfunctions","tag-transformation-operation-in-spark","tag-working-with-keyvalue-pairs"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Apache Spark Paired RDD: Creation &amp; Operations - TechVidvan<\/title>\n<meta name=\"description\" content=\"Spark Paired RDD tutorial-introduction &amp; importance of Spark Paired RDD, creation of paired RDD in spark and Transformation and action operations in RDD\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache Spark Paired RDD: Creation &amp; Operations - TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Spark Paired RDD tutorial-introduction &amp; importance of Spark Paired RDD, creation of paired RDD in spark and Transformation and action operations in RDD\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta 
property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-11T11:00:24+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-RDD.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Apache Spark Paired RDD: Creation &amp; Operations - TechVidvan","description":"Spark Paired RDD tutorial-introduction & importance of Spark Paired RDD, creation of paired RDD in spark and Transformation and action operations in RDD","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/","og_locale":"en_US","og_type":"article","og_title":"Apache Spark Paired RDD: Creation &amp; Operations - TechVidvan","og_description":"Spark Paired RDD tutorial-introduction & importance of Spark Paired RDD, creation of paired RDD in spark and Transformation and action operations in 
RDD","og_url":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2018-01-11T11:00:24+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-RDD.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"Apache Spark Paired RDD: Creation &amp; Operations","datePublished":"2018-01-11T11:00:24+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/"},"wordCount":997,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-RDD.jpg","keywords":["action operation in spark","Apache Spark Paired RDD: Transformations and Actions","Pair RDDs: Transformations and Actions","Paired RDD in Spark","RDD in Spark","Spark RDD Operations","spark.PairRDDFunctions","Transformation operation in spark","Working with Key\/Value Pairs"],"articleSection":["Spark 
Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/","url":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/","name":"Apache Spark Paired RDD: Creation &amp; Operations - TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-RDD.jpg","datePublished":"2018-01-11T11:00:24+00:00","description":"Spark Paired RDD tutorial-introduction & importance of Spark Paired RDD, creation of paired RDD in spark and Transformation and action operations in RDD","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-RDD.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-RDD.jpg","width":1200,"height":628,"caption":"spark RDD"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-paired-rdd\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"Apache Spark Paired RDD: Creation &amp; 
Operations"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2026","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=2026"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2026\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73282"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=2026"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=2026"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=2026"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}