{"id":2005,"date":"2018-01-03T17:46:01","date_gmt":"2018-01-03T17:46:01","guid":{"rendered":"https:\/\/techvidvan.com\/tutorials\/?p=622"},"modified":"2018-01-03T17:46:01","modified_gmt":"2018-01-03T17:46:01","slug":"spark-tutorial","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/","title":{"rendered":"Spark Tutorial &#8211; Apache Spark Introduction for Beginners"},"content":{"rendered":"<p>In this Spark tutorial, we will cover what Apache Spark is, Spark terminology, the Spark ecosystem components, and RDDs. Nowadays, whenever we talk about Big Data, one name comes up again and again:\u00a0<strong>&#8220;Apache Spark&#8221;<\/strong>.<\/p>\n<p>We will discuss why you should learn Apache Spark, how Spark handles big data efficiently, and why the industry is focusing on it. We will also cover Spark\u2019s history to understand its evolution.<\/p>\n<p>In closing, we will also look at Apache Spark architecture and deployment modes. This post is meant as an Apache Spark quickstart tutorial for beginners.<\/p>\n<h3>What is Apache Spark?<\/h3>\n<p>Apache Spark is a powerful <em>cluster computing engine<\/em>. It is purpose-built for fast computation in the Big Data world. Spark builds on the <strong>Hadoop<\/strong> MapReduce model and extends it to support more types of computation efficiently, such as interactive queries and stream processing.<\/p>\n<p>The standout feature of Apache Spark is in-memory cluster computing, which increases the processing speed of an application.<\/p>\n<p>Spark covers a wide range of workloads, for example streaming and batch applications, iterative algorithms, and interactive queries. 
Because one engine handles all of them, Spark also reduces the management burden of maintaining separate tools.<\/p>\n<p>Spark provides high-level APIs in <strong>Java<\/strong>, <strong>Scala<\/strong>, <strong>Python<\/strong> and <strong>R<\/strong>. Spark itself is written in Scala. Its speed on large-scale data is what brought it to prominence: for some workloads, Spark can run up to 100 times faster than Hadoop MapReduce when the data is held in memory.<\/p>\n<p>It can be up to 10 times faster even when the data is read from disk. Spark is highly compatible with Hadoop. Apache Spark does not have its own file management system, so it can be integrated with Hadoop and process existing HDFS data.<\/p>\n<h3>Evolution of Apache Spark<\/h3>\n<p>Apache Spark began as a research project at UC Berkeley\u2019s AMPLab, where Matei Zaharia first developed it in 2009. Spark was open-sourced under a BSD license in 2010.<\/p>\n<p>In 2013, Spark was donated to the Apache Software Foundation. It has been a top-level Apache project since February 2014.<\/p>\n<h3>Why is Apache Spark remarkable nowadays?<\/h3>\n<p>Spark has been a step ahead of Hadoop MapReduce in several respects, and those features explain why it remains in high demand. One reason for its popularity is speed.<\/p>\n<p>As discussed above, Apache Spark can process data up to 100 times faster than Hadoop MapReduce for in-memory workloads, so it handles very large amounts of data in a short span of time. Spark can also use fewer resources than Hadoop for the same job, which makes it cost-effective.<\/p>\n<p>Another area where Spark has the upper hand is compatibility. It works with several resource managers. 
It is commonly run on Hadoop YARN, just as MapReduce is.<\/p>\n<p>However, it can also work with other cluster managers such as Mesos, or run in its own standalone mode.<\/p>\n<p>In addition, a major reason is that Spark supports near-real-time stream processing as well as batch processing in the same engine. That meets a fundamental demand of the industry.<\/p>\n<p>There is also a need for an engine that can perform in-memory processing. Spark offers in-memory computation, which further increases its demand.<\/p>\n<h3>Apache Spark Ecosystem Components<\/h3>\n<p>As we know, Spark offers fast computation and easy development. This is made possible by the following components of Spark. Let&#8217;s study them one by one.<\/p>\n<p>They are:<\/p>\n<p><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Apache-Spark-Ecosystem-Copy-01.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-73046\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Apache-Spark-Ecosystem-Copy-01.jpg\" alt=\"Apache Spark Ecosystem\" width=\"1200\" height=\"628\" \/><\/a><\/p>\n<h4>1. Apache Spark Core<\/h4>\n<p>Apache Spark Core is the platform on which all other Spark functionality is built. It is the underlying general execution engine for Spark. Spark Core provides in-memory computation and can reference datasets in external storage systems.<\/p>\n<h4>2. Apache Spark SQL<\/h4>\n<p>Spark SQL is a component on top of Spark Core that introduces a new data abstraction. That abstraction was originally called SchemaRDD and has been known as DataFrame since Spark 1.3. It supports both structured and semi-structured data.<\/p>\n<h4>3. Apache Spark Streaming<\/h4>\n<p>Real-time processing in Spark is possible because of Spark Streaming, which performs streaming analytics. Spark Streaming divides the incoming data into mini-batches and performs micro-batch processing on them.<\/p>\n<p>It supports DStreams. 
A DStream is fundamentally a series of RDDs used to process real-time data.<\/p>\n<h4>4. MLlib (Machine Learning Library)<\/h4>\n<p>MLlib is Spark\u2019s machine learning library. It consists of common learning algorithms and utilities, including classification, regression, clustering and more.<\/p>\n<p>It also benefits from in-memory data processing, which greatly improves the performance of iterative algorithms.<\/p>\n<h4>5. GraphX<\/h4>\n<p>GraphX is a distributed graph-processing framework on top of Spark. It makes it possible to process graph data at scale.<\/p>\n<h4>6. SparkR<\/h4>\n<p>SparkR is a combination of Spark and R. The key idea behind SparkR is that we can explore different techniques from within R. It enhances functionality by combining the usability of R with the scalability of Spark.<\/p>\n<h3>RDD: Core Abstraction of Apache Spark<\/h3>\n<p>RDD stands for <strong>Resilient Distributed Dataset<\/strong>. It is the fundamental unit of data in Apache Spark. An RDD is a collection of elements distributed across cluster nodes. An important property is that RDDs support parallel operations.<\/p>\n<p>Spark RDDs are immutable. We cannot change an RDD, but we can generate a new one by transforming an existing RDD.<\/p>\n<p>We can create an RDD in Spark in several ways:<\/p>\n<p>1. Parallelized collections<\/p>\n<p>2. External datasets<\/p>\n<p>3. Existing RDDs<\/p>\n<p><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Transformations-and-Actions-01-min.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-73338\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Transformations-and-Actions-01-min.jpg\" alt=\"Spark Tutorial- RDD transformations and actions\" width=\"1200\" height=\"628\" \/><\/a><\/p>\n<p><strong>RDD offers two types of operations:<\/strong><\/p>\n<p><strong>1. 
Transformation<\/strong><\/p>\n<p><strong>2. Action<\/strong><\/p>\n<h4>Transformations:<\/h4>\n<p>We cannot change an RDD in place, but we can transform one. This process returns a new RDD and leaves the original unchanged.<\/p>\n<p>A few of the transformation functions are map, filter and flatMap.<\/p>\n<h4>Actions:<\/h4>\n<p>An action runs a computation on an RDD and returns a value to the driver program. It may also write the result to an external datastore.<\/p>\n<p>A few of the action operations are reduce and collect.<\/p>\n<h3>Conclusion<\/h3>\n<p>As a result, we have seen how Spark came to dominate the Big Data world. It is a powerful framework that takes Big Data processing to a new level in the industry. Spark provides a collection of technologies that increase the efficiency of the system.<\/p>\n<p>It is also a powerful engine that is easy to use. Hence, with its phenomenal speed, Spark has become very beneficial to developers of this era.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this Spark tutorial, we will focus on what is Apache Spark, Spark terminologies, Spark ecosystem components as well as RDD. Now-a-days, whenever we talk about Big Data, only one word strike us &#8211;&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73069,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[614],"tags":[637,638,639,640,641,642,643],"class_list":["post-2005","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","tag-apache-spark","tag-apache-spark-rdd","tag-apache-spark-tutorial","tag-learn-apache-spark","tag-spark","tag-spark-machine-learning","tag-spark-tutorial"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Spark Tutorial - Apache Spark Introduction for Beginners - TechVidvan<\/title>\n<meta name=\"description\" content=\"Spark Tutorial. 
What is Apache Spark, Why Apache Spark, Spark introduction, Spark Ecosystem Components. This free Apache Spark tutorial explains Next gen Big Data tool, which is lightning fast &amp; can handle diverse workload.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Spark Tutorial - Apache Spark Introduction for Beginners - TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Spark Tutorial. What is Apache Spark, Why Apache Spark, Spark introduction, Spark Ecosystem Components. This free Apache Spark tutorial explains Next gen Big Data tool, which is lightning fast &amp; can handle diverse workload.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-03T17:46:01+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Tutorial-01.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Spark Tutorial - Apache Spark Introduction for Beginners - TechVidvan","description":"Spark Tutorial. What is Apache Spark, Why Apache Spark, Spark introduction, Spark Ecosystem Components. This free Apache Spark tutorial explains Next gen Big Data tool, which is lightning fast & can handle diverse workload.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/","og_locale":"en_US","og_type":"article","og_title":"Spark Tutorial - Apache Spark Introduction for Beginners - TechVidvan","og_description":"Spark Tutorial. What is Apache Spark, Why Apache Spark, Spark introduction, Spark Ecosystem Components. This free Apache Spark tutorial explains Next gen Big Data tool, which is lightning fast & can handle diverse workload.","og_url":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2018-01-03T17:46:01+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Tutorial-01.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"Spark Tutorial &#8211; Apache Spark Introduction for Beginners","datePublished":"2018-01-03T17:46:01+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/"},"wordCount":1015,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Tutorial-01.jpg","keywords":["apache spark","Apache Spark RDD","apache spark tutorial","learn Apache Spark","Spark","Spark Machine Learning","spark tutorial"],"articleSection":["Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/","url":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/","name":"Spark Tutorial - Apache Spark Introduction for Beginners - TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Tutorial-01.jpg","datePublished":"2018-01-03T17:46:01+00:00","description":"Spark Tutorial. What is Apache Spark, Why Apache Spark, Spark introduction, Spark Ecosystem Components. 
This free Apache Spark tutorial explains Next gen Big Data tool, which is lightning fast & can handle diverse workload.","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Tutorial-01.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/Apache-Spark-Tutorial-01.jpg","width":1200,"height":628,"caption":"Apache Spark Tutorial"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/spark-tutorial\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"Spark Tutorial &#8211; Apache Spark Introduction for Beginners"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan 
Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2005","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=2005"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2005\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73069"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=2005"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=2005"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=2005"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}