{"id":2024,"date":"2018-01-13T04:30:16","date_gmt":"2018-01-13T04:30:16","guid":{"rendered":"https:\/\/techvidvan.com\/tutorials\/?p=777"},"modified":"2018-01-13T04:30:16","modified_gmt":"2018-01-13T04:30:16","slug":"apache-spark-rdd-vs-dataframe","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/","title":{"rendered":"Comparison between Apache Spark RDD vs DataFrame"},"content":{"rendered":"<p>Apache Spark is evolving at a rapid pace, through both <em>changes<\/em> and <em>additions<\/em> to its core APIs. One of the most disruptive areas of change has been the representation of data sets.<\/p>\n<p>In this blog, we will compare two of these data set APIs, Spark RDD and DataFrame, and walk through the feature-wise differences between RDD and DataFrame in Spark.<\/p>\n<p>We will also give a brief introduction to both Spark APIs, DataFrame and RDD. RDD and DataFrame differ on various features, such as data representation, immutability, interoperability, and many more.<\/p>\n<p>To make the comparison practical, we will also illustrate where to use Spark RDD vs DataFrame.<\/p>\n<h3>Introduction to Spark APIs: DataFrame and RDD<\/h3>\n<p>To understand the comparison well, it is important to introduce each API first. Let\u2019s study them one by one:<\/p>\n<h4>1. Spark RDD<\/h4>\n<p>Apache Spark revolves around the idea of the <strong>RDD<\/strong>, which stands for Resilient Distributed Dataset. An RDD is a fault-tolerant collection of elements that can be operated on in parallel, and it is the fundamental data structure of Spark.<\/p>\n<p>Basically, it is a read-only, partitioned collection of records. Moreover, it supports in-memory computations on large clusters in a fault-tolerant manner.<\/p>\n<p>This set of data is spread across multiple machines in a cluster, with an API to let us act on it. From any data source, e.g. 
text files, a database via JDBC, etc., an RDD can come. RDDs can also easily handle data with no predefined structure.<\/p>\n<h4>2. DataFrame<\/h4>\n<p>A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or an R\/Python data frame. Furthermore, Spark introduced the <strong>Catalyst optimizer<\/strong> along with the DataFrame API.<\/p>\n<p>To build an extensible query optimizer, Catalyst leverages advanced programming language features. In Spark, a DataFrame allows developers to impose a structure onto a distributed collection of data. It also provides a higher-level abstraction.<\/p>\n<h3>Comparison between Spark RDD vs DataFrame<\/h3>\n<p>To understand Apache Spark RDD vs DataFrame in depth, we will compare them on the basis of different features. Let\u2019s discuss them one by one:<\/p>\n<h4>1. Release of DataSets<\/h4>\n<p><b>RDD &#8211; <\/b>The <em>RDD API<\/em> has been part of Spark since its earliest releases and was available in the Spark 1.0 release.<\/p>\n<p><b>DataFrame- <\/b>The Spark 1.3 release introduced the <em>DataFrame<\/em> API as a new data set abstraction.<\/p>\n<h4>2. Data Formats<\/h4>\n<p><b>RDD- <\/b>Through RDDs, we can process structured as well as unstructured data. However, the user needs to specify the <em>schema<\/em> of the ingested data, because an RDD cannot infer it on its own.<\/p>\n<p><b>DataFrame- <\/b>In a DataFrame, data is organized into named <em>columns<\/em>. Through DataFrames, we can process structured and semi-structured data efficiently. It also allows <em>Spark to manage the schema<\/em>.<\/p>\n<h4>3. Data Representations<\/h4>\n<p><b>RDD- <\/b>It is a distributed collection of data elements spread across many machines in the cluster. RDDs are a set of Scala or Java objects representing data.<\/p>\n<p><b>DataFrame- <\/b>As we discussed above, in a DataFrame data is organized into named columns. 
It is conceptually the same as a table in a relational database.<\/p>\n<h4>4. Compile-Time Type Safety<\/h4>\n<p><b>RDD- <\/b>RDDs support an <em>object-oriented programming<\/em> style with compile-time type safety.<\/p>\n<p><b>DataFrame- <\/b>If we try to access a column that is not present in the table, then an error occurs only at runtime. DataFrames do not provide compile-time type safety in such cases.<\/p>\n<h4>5. Immutability and Interoperability<\/h4>\n<p><b>RDD- <\/b>RDDs are <em>immutable<\/em> in nature. That means we cannot change an RDD once it is created; we can only create a new RDD by applying a transformation to an existing one. Due to immutability, all the computations performed are consistent in nature. If an RDD is in tabular format, we can move from RDD to DataFrame with the <b>toDF()<\/b> method. We can also do the reverse with the <b>.rdd<\/b> method.<\/p>\n<p><b>DataFrame- <\/b>After transforming into a DataFrame, one cannot regenerate the domain object. For example, if we create a test DataFrame from an RDD of a test class, we cannot recover the original RDD of the test class; <b>.rdd<\/b> returns an RDD of generic Row objects instead.<\/p>\n<h4>6. Data Sources API<\/h4>\n<p><b>RDD- <\/b>An RDD can come from any data source, e.g. text files, a database via JDBC, etc. RDDs can also easily handle data with no predefined structure.<\/p>\n<p><b>DataFrame- <\/b>The data source API allows processing data in different formats, such as AVRO, CSV, and JSON, and from storage systems such as HDFS, Hive tables, and MySQL.<\/p>\n<h4>7. Optimization<\/h4>\n<p><b>RDD- <\/b>There is no built-in optimization engine for RDDs. Developers optimize each RDD themselves, on the basis of its attributes.<\/p>\n<p><b>DataFrame- <\/b>Optimization of DataFrames takes place through the Catalyst optimizer. 
DataFrames use the Catalyst tree transformation framework in 4 phases:<\/p>\n<ul>\n<li>Analysis<\/li>\n<li>Logical plan optimization<\/li>\n<li>Physical planning<\/li>\n<li>Code generation, to compile parts of the query to Java bytecode.<\/li>\n<\/ul>\n<h4>8. Serialization<\/h4>\n<p><b>RDD- <\/b>Spark uses Java serialization by default whenever it needs to distribute data over a cluster. Serializing individual Scala and Java objects is expensive. It also requires sending both data and structure between nodes.<\/p>\n<p><b>DataFrame- <\/b>With DataFrames, Spark can serialize data into off-heap storage in a binary format. Because Spark understands the schema, it can then perform many transformations directly on this off-heap memory. Moreover, there is no need to use Java serialization to encode the data.<\/p>\n<h4>9. Efficiency\/Memory use<\/h4>\n<p><b>RDD- <\/b>Efficiency decreases when serialization is performed individually on each Java and Scala object. It also takes a lot of time.<\/p>\n<p><b>DataFrame- <\/b>Using off-heap memory for serialization reduces the overhead. Spark also generates bytecode so that many operations can be performed on that serialized data. As a result, there is no need for deserialization for small operations.<\/p>\n<h4>10. Lazy Evaluation<\/h4>\n<p><b>RDD- <\/b>Spark evaluates RDDs lazily; it does not compute their results right away. Instead, Spark remembers the transformations applied to some base data set. The transformations are computed only when an action requires a result to be returned to the driver program.<\/p>\n<p><b>DataFrame- <\/b>Similarly, Spark evaluates DataFrames lazily, so computation happens only when an action is called.<\/p>\n<h4>11. Language Support<\/h4>\n<p><b>RDD- <\/b>APIs for RDDs are available in Java, Scala, and Python, with limited support in R. 
As a result, developers have flexibility in choosing a language.<\/p>\n<p><b>DataFrame- <\/b>The DataFrame API is available in 4 languages: Java, Scala, Python, and R.<\/p>\n<h4>12. Schema Projection<\/h4>\n<p><b>RDD- <\/b>The RDD API uses schema projection explicitly. Therefore, a user needs to define the schema manually.<\/p>\n<p><b>DataFrame- <\/b>With DataFrames, there is often no need to specify a schema. Generally, Spark discovers the schema automatically.<\/p>\n<h4>13. Aggregation<\/h4>\n<p><b>RDD- <\/b>The RDD API is slower when performing simple grouping and aggregation operations.<\/p>\n<p><b>DataFrame- <\/b>DataFrames are faster for exploratory analysis and for creating aggregated statistics on data.<\/p>\n<h4>14. Usage<\/h4>\n<p><b>RDD- <\/b>We use RDDs when we want low-level transformations and actions, and when the data is unstructured, such as media streams or streams of text.<\/p>\n<p><b>DataFrame- <\/b>We use DataFrames when we need a high level of abstraction and the data is structured or semi-structured.<\/p>\n<h3>Conclusion<\/h3>\n<p>As a result, we have seen that the RDDs of Apache Spark offer low-level functionality and control, whereas DataFrames offer high-level, domain-specific operations, save space, and execute at high speed.<\/p>\n<p>Therefore, DataFrames can increase the efficiency of the system. Ultimately, we have discussed the comparison between Spark RDD vs DataFrame in detail.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>At a rapid pace, Apache Spark is evolving either on the basis of changes or on the basis of additions to core APIs. 
The most disruptive areas of change we have seen are a&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73227,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[614],"tags":[833,834,835,836,837,838],"class_list":["post-2024","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","tag-apache-spark-rdd-vs-dataframe","tag-apache-spark-rdd-vs-dataframe-vs-dataset","tag-comparision-between-dataframe-vs-rdd-apache-spark","tag-dataframe-vs-spark-rdd","tag-difference-between-spark-rdd-and-dataframe","tag-spark-dataframe-vs-spark-rdd"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Comparision between Apache Spark RDD vs DataFrame - TechVidvan<\/title>\n<meta name=\"description\" content=\"Spark RDD vs DataFrame-What is Spark RDD,What is spark dataframe,comparison between RDD &amp; dataframe in spark with features of Spark RDD &amp; dataframe in Spark\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Comparision between Apache Spark RDD vs DataFrame - TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Spark RDD vs DataFrame-What is Spark RDD,What is spark dataframe,comparison between RDD &amp; dataframe in spark with features of Spark RDD &amp; dataframe in Spark\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta property=\"article:publisher\" 
content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-13T04:30:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-vs-data-frame-01.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Comparision between Apache Spark RDD vs DataFrame - TechVidvan","description":"Spark RDD vs DataFrame-What is Spark RDD,What is spark dataframe,comparison between RDD & dataframe in spark with features of Spark RDD & dataframe in Spark","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/","og_locale":"en_US","og_type":"article","og_title":"Comparision between Apache Spark RDD vs DataFrame - TechVidvan","og_description":"Spark RDD vs DataFrame-What is Spark RDD,What is spark dataframe,comparison between RDD & dataframe in spark with features of Spark RDD & dataframe in 
Spark","og_url":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2018-01-13T04:30:16+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-vs-data-frame-01.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"Comparision between Apache Spark RDD vs DataFrame","datePublished":"2018-01-13T04:30:16+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/"},"wordCount":1156,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-vs-data-frame-01.jpg","keywords":["Apache Spark : RDD vs DataFrame","Apache Spark RDD vs DataFrame vs DataSet","Comparision between DataFrame vs RDD: Apache Spark","dataframe vs Spark RDD","difference between spark RDD and Dataframe","spark dataframe vs spark RDD"],"articleSection":["Spark 
Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/","url":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/","name":"Comparision between Apache Spark RDD vs DataFrame - TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-vs-data-frame-01.jpg","datePublished":"2018-01-13T04:30:16+00:00","description":"Spark RDD vs DataFrame-What is Spark RDD,What is spark dataframe,comparison between RDD & dataframe in spark with features of Spark RDD & dataframe in Spark","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-vs-data-frame-01.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/RDD-vs-data-frame-01.jpg","width":1200,"height":628,"caption":"comparison between RDD and dataframes"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-rdd-vs-dataframe\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"Comparision between 
Apache Spark RDD vs DataFrame"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2024","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=2024"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2024\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73227"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=2024"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=2024"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=2024"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}