{"id":2022,"date":"2018-01-13T04:28:26","date_gmt":"2018-01-13T04:28:26","guid":{"rendered":"https:\/\/techvidvan.com\/tutorials\/?p=750"},"modified":"2018-01-13T04:28:26","modified_gmt":"2018-01-13T04:28:26","slug":"apache-spark-sql-dataframe","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/","title":{"rendered":"Introduction on Apache Spark SQL DataFrame"},"content":{"rendered":"<p>Spark SQL is <em>the Spark module for structured data processing<\/em>.<\/p>\n<p>In this Spark DataFrame tutorial, we introduce the Spark SQL DataFrame: why we need DataFrames over RDDs, how to create a DataFrame, and features of DataFrames in Spark SQL such as custom memory management and optimized execution plans.<\/p>\n<p>We also highlight the limitations of Spark SQL DataFrames.<\/p>\n<h3>Introduction to Spark SQL DataFrame<\/h3>\n<p>A DataFrame is a <strong>dataset<\/strong> organized into named <em>columns<\/em>. We can construct a DataFrame from a wide array of sources, such as structured data files, Hive tables, external databases, or existing RDDs. A DataFrame is conceptually equivalent to a table in a relational database or a data frame in R\/Python, but with richer optimizations under the hood.<\/p>\n<p>A DataFrame is used for processing <em>large amounts of structured data<\/em>. It contains <strong>rows<\/strong> with a schema, and the schema is a <em>description<\/em> of the structure of the data. It offers more than an RDD while retaining the core features of RDDs.<\/p>\n<p>Several features are shared with RDDs, such as distributed computing, immutability, in-memory processing, and resilience. A DataFrame also provides a higher-level abstraction: it lets users impose a structure onto a distributed collection of data.<\/p>\n<p>Its API is available in several languages: Scala, Java, Python, and R. 
In Scala and Java, a DataFrame is represented by a <em>Dataset of Rows<\/em>.<\/p>\n<p>In the Scala API, DataFrame is simply a type alias of Dataset[Row], whereas in the Java API users need to use Dataset&lt;Row&gt; to represent a DataFrame.<\/p>\n<h3>Why DataFrame?<\/h3>\n<p>A natural question is: if we already have RDDs, why do we need DataFrames? As discussed earlier,<em> the DataFrame is one step ahead of the RDD<\/em>. RDDs have the following limitations:<\/p>\n<ul>\n<li>\n<h4>Limitations of RDD<\/h4>\n<\/li>\n<\/ul>\n<ol>\n<li>RDDs have <em>no built-in optimization engine<\/em>.<\/li>\n<li>RDDs <em>carry no schema<\/em>, so Spark cannot take advantage of the structure of structured data.<\/li>\n<\/ol>\n<p>These drawbacks are why the Spark SQL DataFrame comes into the picture.<em> The DataFrame overcomes these limitations<\/em> by providing custom memory management and an optimized execution plan, neither of which is available with RDDs. Let\u2019s discuss them in detail:<\/p>\n<h4>1. Custom Memory Management:<\/h4>\n<p>With custom memory management, data is stored in <strong>off-heap memory<\/strong> in a <strong>binary format<\/strong>; this is part of Project Tungsten. Data held this way incurs no garbage-collection overhead, and because the schema is known, there is <em>no expensive Java serialization<\/em>.<\/p>\n<h4>2. Optimized Execution Plan:<\/h4>\n<p>This is also known as the <strong>query optimizer<\/strong>. An optimized execution plan is created for the execution of a query. 
Only once the optimized plan has been created does the final execution take place on RDDs.<\/p>\n<h3>Features of DataFrame<\/h3>\n<p>DataFrames have several features:<\/p>\n<ul>\n<li>DataFrames can process data of very different <em>sizes<\/em>, from <em>kilobytes<\/em> to <em>petabytes<\/em>, on anything from a single-node cluster to a large cluster.<\/li>\n<li>A DataFrame is a distributed collection of data organized into named columns, similar to a table in an RDBMS.<\/li>\n<li>They support<em> different data formats<\/em> and sources, such as Avro, CSV, Elasticsearch, and Cassandra, as well as <em>storage systems<\/em> like HDFS, Hive tables, and MySQL.<\/li>\n<li>Optimization is handled by the <strong>Catalyst optimizer<\/strong>, which provides a general library for representing trees and uses it in the following four phases:<\/li>\n<li>Analyzing a logical plan to resolve references.<\/li>\n<li><em>Logical plan optimization<\/em>.<\/li>\n<li><em>Physical planning<\/em>.<\/li>\n<li>Code generation to compile parts of a query to Java bytecode.<\/li>\n<li>Through Spark Core, DataFrames can be integrated with<em> big data tools<\/em> and frameworks.<\/li>\n<li>The DataFrame API is available in <em>several languages<\/em>: Python, Java, Scala, and R.<\/li>\n<li>It is compatible with<em> Hive<\/em>. 
It is possible to run unmodified Hive queries on an existing Hive warehouse.<\/li>\n<\/ul>\n<h3>Creating DataFrames<\/h3>\n<p>There are several ways to create a DataFrame:<\/p>\n<div id=\"attachment_73349\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Ways-to-creat-Dataframe-in-Spark-01-Copy.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-73349\" class=\"wp-image-73349 size-full\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Ways-to-creat-Dataframe-in-Spark-01-Copy.jpg\" alt=\"Create Dataframe In Spark SQL\" width=\"1200\" height=\"628\" \/><\/a><p id=\"caption-attachment-73349\" class=\"wp-caption-text\">Ways To Create Dataframe In Spark<\/p><\/div>\n<ul>\n<li>We can create it from <em>different data formats<\/em>, for example by loading data from JSON or CSV.<\/li>\n<li>We can also create it by <em>loading data from an existing RDD<\/em>.<\/li>\n<\/ul>\n<p>Using a <strong>SparkSession<\/strong>, an application can create a DataFrame from an<em> existing RDD<\/em>, from a Hive table, or from Spark data sources.<\/p>\n<p>SparkSession is the entry point to Spark\u2019s functionality. To create a basic SparkSession, we can use the following command:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">SparkSession.builder().getOrCreate()<\/pre>\n<p>Through the DataFrame interface, Spark SQL can operate on a variety of data sources. We can register a DataFrame as a temporary view in Spark SQL; running SQL queries over its data requires such a view.<\/p>\n<h3>Limitations of SparkSQL DataFrames<\/h3>\n<p>DataFrames in Spark SQL also have some limitations:<\/p>\n<ul>\n<li>The DataFrame API offers <em>no compile-time type safety<\/em>. 
Hence, because the structure is not known to the compiler, mistakes such as referring to a nonexistent column surface only at runtime.<\/li>\n<li>We can <em>convert a domain object<\/em> into a DataFrame, but <em>once<\/em> we do, we cannot regenerate the original domain object from it.<\/li>\n<\/ul>\n<h3>Conclusion<\/h3>\n<p>We have seen that the DataFrame API differs from the RDD API. It suits developers who are familiar with building query plans, but because of its limitations it is not ideal for the majority of developers.<\/p>\n<p>It also avoids the <em>garbage-collection cost<\/em> of constructing individual objects for each row in the dataset. As a result, the DataFrame API in Spark SQL improves the performance and scalability of Spark.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Spark SQL is Spark module that works for structured data processing. In this spark dataframe tutorial, we will learn the detailed introduction on Spark SQL DataFrame, why we need SQL DataFrame over RDD, how&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73286,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[614],"tags":[821,822,823,824,825,826,827,828,829,830,831,832],"class_list":["post-2022","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","tag-complete-guide-on-dataframe","tag-dataframe","tag-dataframe-in-apache-spark","tag-dataframes-in-sparksql","tag-introducing-dataframes-in-apache-spark","tag-introduction-on-apache-spark-sql-dataframe","tag-spark-dataframe-example","tag-spark-sql-and-dataframes","tag-spark-sql-dataframe-tutorial-an-introduction-to-dataframe","tag-spark-sql-dataframes","tag-sparksql-dataframes","tag-tutorial-spark-sql-and-dataframes"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Introduction on Apache Spark SQL DataFrame - TechVidvan<\/title>\n<meta 
name=\"description\" content=\"Spark SQL DataFrame-Introduction to SparkSQL dataframe,Why SQL DataFrame,Features of DataFrame in spark SQL,Creation of SparkSQL DataFrames, its Limitations\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Introduction on Apache Spark SQL DataFrame - TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Spark SQL DataFrame-Introduction to SparkSQL dataframe,Why SQL DataFrame,Features of DataFrame in spark SQL,Creation of SparkSQL DataFrames, its Limitations\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-13T04:28:26+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-SQL-DataFrame-01.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Introduction on Apache Spark SQL DataFrame - TechVidvan","description":"Spark SQL DataFrame-Introduction to SparkSQL dataframe,Why SQL DataFrame,Features of DataFrame in spark SQL,Creation of SparkSQL DataFrames, its Limitations","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/","og_locale":"en_US","og_type":"article","og_title":"Introduction on Apache Spark SQL DataFrame - TechVidvan","og_description":"Spark SQL DataFrame-Introduction to SparkSQL dataframe,Why SQL DataFrame,Features of DataFrame in spark SQL,Creation of SparkSQL DataFrames, its Limitations","og_url":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2018-01-13T04:28:26+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-SQL-DataFrame-01.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"Introduction on Apache Spark SQL DataFrame","datePublished":"2018-01-13T04:28:26+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/"},"wordCount":889,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-SQL-DataFrame-01.jpg","keywords":["Complete Guide on DataFrame","dataframe","dataframe in Apache Spark","dataframes in sparksql","Introducing DataFrames in Apache Spark","Introduction on Apache Spark SQL DataFrame","spark dataframe example","Spark SQL and DataFrames","Spark SQL DataFrame Tutorial - An Introduction to DataFrame","Spark SQL DataFrames","SparkSQL dataframes","Tutorial : Spark SQL and DataFrames"],"articleSection":["Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/","url":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/","name":"Introduction on Apache Spark SQL DataFrame - 
TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-SQL-DataFrame-01.jpg","datePublished":"2018-01-13T04:28:26+00:00","description":"Spark SQL DataFrame-Introduction to SparkSQL dataframe,Why SQL DataFrame,Features of DataFrame in spark SQL,Creation of SparkSQL DataFrames, its Limitations","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-SQL-DataFrame-01.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Spark-SQL-DataFrame-01.jpg","width":1200,"height":628,"caption":"how to create apache spark sql dataframe"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-sql-dataframe\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"Introduction on Apache Spark SQL DataFrame"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan 
Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2022","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=2022"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2022\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73286"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=2022"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=2022"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=2022"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}