{"id":2007,"date":"2018-01-05T10:23:12","date_gmt":"2018-01-05T10:23:12","guid":{"rendered":"https:\/\/techvidvan.com\/tutorials\/?p=638"},"modified":"2018-01-05T10:23:12","modified_gmt":"2018-01-05T10:23:12","slug":"apache-spark-terminologies","status":"publish","type":"post","link":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/","title":{"rendered":"Apache Spark Terminologies and Key Concepts"},"content":{"rendered":"<p>This article covers core Apache Spark concepts, including Apache Spark terminologies.<\/p>\n<p>It introduces the terms used in Apache Spark, such as action, stage, task, RDD, DataFrame, Dataset, and Spark session, with a focus on clarity.<\/p>\n<p>Apache Spark is a popular big data tool that provides a powerful, unified engine for data researchers. This blog gives beginners a summary of the important Apache Spark terminologies.<\/p>\n<p><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/terminologies-of-spark-01-2.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-73325\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/terminologies-of-spark-01-2.jpg\" alt=\"\" width=\"1200\" height=\"628\" \/><\/a><\/p>\n<h3>List of essential key terms: Apache Spark Terminologies<\/h3>\n<h4>1. Apache Spark<\/h4>\n<p>Apache Spark is an <strong>open-source processing<\/strong> engine and an alternative to Hadoop MapReduce. For in-memory workloads, it can run up to 100 times faster than Hadoop MapReduce, and on disk it can run up to 10 times faster.<\/p>\n<p>It handles large-scale data analytics with ease of use and supports in-memory computation.<\/p>\n<p>Spark provides APIs in Java, Scala, Python, R, and SQL. It runs on Hadoop YARN, Apache Mesos, and its own standalone cluster manager.<\/p>\n<h4>2. 
Working with the Spark Engine<\/h4>\n<p>The Spark engine is a <em>fast<\/em> and <em>general engine<\/em> for big data processing. It is responsible for scheduling jobs on the cluster, and it also distributes and monitors data applications across the cluster.<\/p>\n<h4>3. RDD (Resilient Distributed Datasets)<\/h4>\n<p>An RDD is Spark\u2019s core abstraction: a distributed collection of objects. It is an <em>immutable<\/em> dataset, so it cannot be changed once it is created. Its data can be stored in memory or on disk across the cluster.<\/p>\n<p>The data is logically partitioned across the cluster, which allows operations to run in parallel. Because RDDs cannot be changed in place, we work with them through two kinds of operations, <em>transformations and actions<\/em>. A transformation produces a new RDD from an existing one, while an action computes a result from an RDD.<\/p>\n<p>Furthermore, RDDs are fault tolerant. If a failure occurs, Spark can rebuild the lost data automatically using the lineage graph.<\/p>\n<h4>4. Partitions<\/h4>\n<p>Partitioning is used to speed up data processing. A <strong>partition<\/strong> is a smaller, logical unit of data, and partitioning means dividing the data into such logical units so that they can be processed in parallel.<\/p>\n<h4>5. Cluster Manager<\/h4>\n<p>The cluster manager runs as an external service that provides resources to each application. It makes it possible to run Spark across the distributed nodes of a cluster. Spark supports the following cluster managers.<\/p>\n<p>The first is the <em>Apache Spark Standalone<\/em> cluster manager, the second is <em>Apache Mesos<\/em>, and the third is <em>Hadoop YARN<\/em>.<\/p>\n<p>These cluster managers differ in scheduling, security, and monitoring, and each has its own benefits. We can select any of them according to our needs and goals.<\/p>\n<h4>6. Worker Node<\/h4>\n<p>A worker node is also referred to as a <strong>slave node<\/strong>. Any node that can run application code in the cluster is a worker node. 
In other words, any node that runs the program in the cluster is a worker node.<\/p>\n<h4>7. Application<\/h4>\n<p>An application is a user program built on Apache Spark. It consists of a <strong>driver program<\/strong> and executors on the cluster.<\/p>\n<h4>8. Executor<\/h4>\n<p><a href=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Spark-Physical-Cluster-01.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-73280\" src=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/sites\/2\/2019\/11\/Spark-Physical-Cluster-01.jpg\" alt=\"Spark Physical Cluster\" width=\"1200\" height=\"628\" \/><\/a><\/p>\n<p>Each application has its own executors. They generally run on worker nodes and carry out the tasks.<\/p>\n<p>In other words, an executor is a process launched for an application on a worker node. It runs tasks and keeps data in memory or on disk.<\/p>\n<h4>9. Task<\/h4>\n<p>A task is a unit of work that is sent to an executor.<\/p>\n<h4>10. Stage<\/h4>\n<p>A job is the parallel computation that Spark starts in response to an action. Each job is divided into smaller sets of tasks, which are known as <strong>stages<\/strong>.<\/p>\n<h4>11. Driver Program<\/h4>\n<p>The driver program is the process running the main() function of the application. It also creates the SparkContext. Depending on the deploy mode, it runs either on the machine that submits the application or on a node inside the cluster. It also declares the transformations and actions on RDDs.<\/p>\n<h4>12. Action<\/h4>\n<p>An action is an <em>operation<\/em> applied to an RDD that performs a computation and sends the result back to the driver program. Examples include reduce, count, first, and many more.<\/p>\n<h4>13. Lazy Evaluation<\/h4>\n<p>Lazy evaluation optimizes the overall <em>data processing workflow<\/em>. It means that execution does not start until we trigger an action. All transformations in Spark are lazy.<\/p>\n<h4>14. 
DataFrame<\/h4>\n<p>A DataFrame is an <em>immutable distributed<\/em> collection of data, like an RDD. Unlike an RDD, its data is organized into named columns, like a table in a relational database. This design makes processing large datasets even easier.<\/p>\n<p>It allows developers to impose a structure on a distributed collection of data, which provides a higher-level abstraction.<\/p>\n<h4>15. Datasets<\/h4>\n<p>The Dataset API lets users express transformations on domain objects. It also benefits from the performance of the robust Spark SQL execution engine.<\/p>\n<h4>16. MLlib<\/h4>\n<p>MLlib is the general machine learning library available in Apache Spark. It provides simplicity, scalability, and easy integration with other tools.<\/p>\n<p>It is designed to take advantage of Spark\u2019s scalability, language compatibility, and speed.<\/p>\n<h4>17. Spark SQL<\/h4>\n<p>Spark SQL is a <em>Spark module<\/em> for working with structured data. It also supports workloads that combine SQL queries with complex, algorithm-based analytics.<\/p>\n<h4>18. Spark Context<\/h4>\n<p>The SparkContext holds the connection to the Spark cluster manager. Spark applications run as independent sets of processes on the cluster, coordinated by the SparkContext in the driver program.<\/p>\n<h4>19. ML pipelines<\/h4>\n<p>Running <em>machine learning algorithms<\/em> involves a sequence of tasks. An ML pipeline chains these tasks together, including pre-processing, feature extraction, model fitting, and validation stages.<\/p>\n<h4>20. GraphX<\/h4>\n<p>GraphX is the component in Apache Spark for graphs and graph-parallel computation. It extends the Spark RDD with a Graph abstraction.<\/p>\n<p>This <strong>abstraction<\/strong> is a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX also introduces a set of fundamental operators.<\/p>\n<h4>21. Spark Streaming<\/h4>\n<p>Spark Streaming is an extension of core Spark that allows real-time data processing. 
The key abstraction of Spark Streaming is the Discretized Stream, also called a <strong>DStream<\/strong>. It represents a stream of data divided into small batches.<\/p>\n<h3>Conclusion<\/h3>\n<p>This tutorial sums up some of the important Apache Spark terminologies. It shows how these terms play a vital role in Apache Spark computations and helps us understand Spark in more depth.<\/p>\n<p>This blog covers the key terminologies of Apache Spark to help you learn the concepts efficiently.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article covers core Apache Spark concepts, including Apache Spark Terminologies. Ultimately, it is an introduction to all the terms used in Apache Spark with focus and clarity in mind like Action, Stage, task,&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":73325,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[614],"tags":[650,651,652,653,654,655,656],"class_list":["post-2007","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-spark","tag-apache-spark-key-terms","tag-apache-spark-terminologies-and-concepts-you-must-know","tag-apche-spark","tag-important-keywords-on-apache-spark","tag-spark-data-frame","tag-spark-datasets","tag-spark-rdd"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Apache Spark Terminologies and Key Concepts - TechVidvan<\/title>\n<meta name=\"description\" content=\"Apache Spark terminologies state the terms used in Apche Spark like Action, Stage, task, Spark RDD, Spark Dataframe, Spark Datasets, Spark session etc.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/\" \/>\n<meta 
property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache Spark Terminologies and Key Concepts - TechVidvan\" \/>\n<meta property=\"og:description\" content=\"Apache Spark terminologies state the terms used in Apche Spark like Action, Stage, task, Spark RDD, Spark Dataframe, Spark Datasets, Spark session etc.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/\" \/>\n<meta property=\"og:site_name\" content=\"TechVidvan\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/TechVidvan\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-05T10:23:12+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/terminologies-of-spark-01-2.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"TechVidvan Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:site\" content=\"@vidvantech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"TechVidvan Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Apache Spark Terminologies and Key Concepts - TechVidvan","description":"Apache Spark terminologies state the terms used in Apche Spark like Action, Stage, task, Spark RDD, Spark Dataframe, Spark Datasets, Spark session etc.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/","og_locale":"en_US","og_type":"article","og_title":"Apache Spark Terminologies and Key Concepts - TechVidvan","og_description":"Apache Spark terminologies state the terms used in Apche Spark like Action, Stage, task, Spark RDD, Spark Dataframe, Spark Datasets, Spark session etc.","og_url":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/","og_site_name":"TechVidvan","article_publisher":"https:\/\/www.facebook.com\/TechVidvan\/","article_published_time":"2018-01-05T10:23:12+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/terminologies-of-spark-01-2.jpg","type":"image\/jpeg"}],"author":"TechVidvan Team","twitter_card":"summary_large_image","twitter_creator":"@vidvantech","twitter_site":"@vidvantech","twitter_misc":{"Written by":"TechVidvan Team","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/#article","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/"},"author":{"name":"TechVidvan Team","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22"},"headline":"Apache Spark Terminologies and Key Concepts","datePublished":"2018-01-05T10:23:12+00:00","mainEntityOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/"},"wordCount":973,"commentCount":0,"publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/terminologies-of-spark-01-2.jpg","keywords":["apache spark key terms","Apache Spark Terminologies and Concepts You Must Know","Apche Spark","Important keywords on Apache Spark","Spark Data frame","Spark Datasets","spark rdd"],"articleSection":["Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/","url":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/","name":"Apache Spark Terminologies and Key Concepts - 
TechVidvan","isPartOf":{"@id":"https:\/\/techvidvan.com\/tutorials\/#website"},"primaryImageOfPage":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/#primaryimage"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/#primaryimage"},"thumbnailUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/terminologies-of-spark-01-2.jpg","datePublished":"2018-01-05T10:23:12+00:00","description":"Apache Spark terminologies state the terms used in Apche Spark like Action, Stage, task, Spark RDD, Spark Dataframe, Spark Datasets, Spark session etc.","breadcrumb":{"@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/#primaryimage","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/terminologies-of-spark-01-2.jpg","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2019\/11\/terminologies-of-spark-01-2.jpg","width":1200,"height":628,"caption":"terminologies of spark"},{"@type":"BreadcrumbList","@id":"https:\/\/techvidvan.com\/tutorials\/apache-spark-terminologies\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/techvidvan.com\/tutorials\/"},{"@type":"ListItem","position":2,"name":"Apache Spark Terminologies and Key Concepts"}]},{"@type":"WebSite","@id":"https:\/\/techvidvan.com\/tutorials\/#website","url":"https:\/\/techvidvan.com\/tutorials\/","name":"TechVidvan 
Blogs","description":"","publisher":{"@id":"https:\/\/techvidvan.com\/tutorials\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/techvidvan.com\/tutorials\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/techvidvan.com\/tutorials\/#organization","name":"TechVidvan","url":"https:\/\/techvidvan.com\/tutorials\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/","url":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","contentUrl":"https:\/\/techvidvan.com\/tutorials\/wp-content\/uploads\/2024\/03\/techvidvan-logo-200x50-1.webp","width":200,"height":50,"caption":"TechVidvan"},"image":{"@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/TechVidvan\/","https:\/\/x.com\/vidvantech"]},{"@type":"Person","@id":"https:\/\/techvidvan.com\/tutorials\/#\/schema\/person\/e9c26e74dd3d87421f7ada9433b8cd22","name":"TechVidvan Team","description":"The TechVidvan Team delivers practical, beginner-friendly tutorials on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our experts are here to help you upskill and excel in today\u2019s tech industry."}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2007","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/comments?post=2007"}],"version-history":[{"count":0,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/posts\/2007\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media\/73325"}],"wp:attachment":[{"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/media?parent=2007"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/categories?post=2007"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techvidvan.com\/tutorials\/wp-json\/wp\/v2\/tags?post=2007"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}