Spark Streaming: Architecture, Working, and Operations

There are many distributed stream processing engines available, so a natural question arises: why Apache Spark Streaming, and what are its unique benefits? This tutorial is devoted entirely to the topic of Spark Streaming.

To understand the topic better, we will start with the basics of Spark Streaming, look at examples, and see why it is needed in Spark. Moreover, we will learn how streaming works in Spark, the operations Spark Streaming supports, and its streaming sources.

By the end, we will round off the topic with the advantages of Spark Streaming over Hadoop and Storm.

What is Spark Streaming?

Spark Streaming is generally known as an extension of the core Spark API. It is a unified engine that natively supports both batch and streaming workloads, enabling scalable, high-throughput, fault-tolerant processing of live data streams.

This sets it apart from other systems, which either support only streaming or offer similar batch and streaming APIs that nevertheless compile internally to different engines.

Spark Streaming, in contrast, offers a single execution engine and a unified programming model for batch and streaming. This gives it advantages over traditional streaming systems in four major aspects:

  • It recovers very quickly from failures and stragglers.
  • It makes better use of resources and balances load more evenly.
  • Streaming data can be combined with static datasets and queried interactively.
  • It integrates with advanced processing libraries such as SQL, machine learning, and graph processing.

Data can be ingested from many sources, such as Apache Flume, Kinesis, Kafka, or TCP sockets. It can then be processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window.

Afterwards, the processed data can be pushed out to databases, filesystems, and live dashboards. We can even apply Spark's machine learning and graph processing algorithms to data streams.

Internal Working

First, Spark Streaming receives live input data streams and divides the data into batches. The Spark engine then processes those batches to generate the final results, also in batches.
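
As a minimal sketch in Scala (the local master and app name here are just placeholders), the batch interval passed to StreamingContext is what controls this discretization: every interval's worth of incoming data becomes one batch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Local master with two threads: at least one for the receiver
// and one for processing the batches.
val conf = new SparkConf().setMaster("local[2]").setAppName("BatchingDemo")

// Every 5 seconds of incoming data becomes one batch (one RDD).
val ssc = new StreamingContext(conf, Seconds(5))
```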


A DStream, or discretized stream, is the high-level abstraction Spark Streaming builds on; it represents a continuous stream of data. DStreams can be created from input data streams coming from sources such as Kafka, Flume, and Kinesis.

We can also create one by applying high-level operations to other DStreams. Internally, a DStream is just a sequence of Spark RDDs, and we can write Spark Streaming programs in Scala, Java, or Python.
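
To make this concrete, here is a minimal Scala sketch (the host and port are placeholders) that builds a DStream from a TCP socket and applies high-level operations to it, continuing from the StreamingContext created above:

```scala
// Each line arriving on the socket lands in the current batch's RDD.
val lines = ssc.socketTextStream("localhost", 9999)

// High-level operations on a DStream apply per batch, i.e. per RDD.
val words  = lines.flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

counts.print()          // show the first elements of each batch

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until stopped or an error occurs
```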

Why is Streaming in Spark Needed?

Traditional stream processing systems are designed around a continuous operator model for data processing. Such a system works as follows:

  • First, streaming data is received from data sources (e.g. live logs, IoT device data) and sent into a data ingestion system such as Apache Kafka or Amazon Kinesis.
  • The data is then processed in parallel across the cluster.
  • Finally, the results are delivered to downstream systems such as HBase, Cassandra, or Kafka.

Each node in a set of workers runs one or more continuous operators. A continuous operator processes the streaming data one record at a time and forwards the records to other operators in the pipeline.

Source operators receive data from the ingestion systems, and sink operators deliver the output to downstream systems. Continuous operators are a natural and very simple model, but at larger scale and with complex real-time analytics, this traditional design runs into several challenges:

– Unification of Batch, Streaming, and Interactive Workloads

Being able to query streaming data interactively, or combine it with static datasets, turns out to be very attractive, but this was not easy in continuous operator systems, which were not designed to spin up new operators for ad-hoc queries. A single engine that can combine batch, streaming, and interactive queries is necessary.

– Advanced Analytics with Machine learning and SQL Queries

Complex workloads require continuously learning and updating data models, as well as querying the streaming data with SQL. Having one abstraction across all of these analytic tasks makes the developer's job much easier.

– Fast Failure and Straggler Recovery

The system must be able to recover from failures automatically to deliver timely results. In traditional systems, the static allocation of continuous operators to worker nodes makes this challenging.

– Load Balancing

Uneven allocation of processing load between workers causes bottlenecks in continuous operator systems. The system must be able to adapt its resource allocation based on the workload.

Why Spark Streaming?

Hadoop has high latency, which makes it unsuitable for near-real-time processing needs. In most cases, therefore, Hadoop was used for batch processing while Storm was used for stream processing.

Maintaining two separate stacks leads to larger code size, more bugs to fix, greater development effort, and other issues.

Ultimately, Spark Streaming fixes all of those issues. It provides a scalable, efficient, resilient, and integrated system, offering a single execution engine and a unified programming model for batch and streaming.

A major reason for its rapid adoption is precisely this unification of distinct data processing capabilities: developers can use a single framework to satisfy all their processing needs. In addition, through Spark SQL, streaming data can be combined with static data sources.

It also uses an architecture called discretized streams, which brings both the rich libraries of Spark and the fault tolerance of the Spark engine to stream processing.

Architecture of Spark Streaming: Discretized Streams

As we know, a continuous operator processes streaming data one record at a time. Instead of processing one record at a time, Spark Streaming discretizes the data into tiny micro-batches: its receivers accept data in parallel and buffer it in the memory of Spark's worker nodes.

Then a latency-optimized engine runs short tasks to process the batches and output the results. This differs from the traditional continuous model, where computation is statically allocated to a node.

Instead, tasks are assigned to workers based on data locality and available resources, which enables better load balancing and faster fault recovery.

Moreover, a DStream is just a sequence of Spark RDDs, Spark's fault-tolerant dataset abstraction. This means we can process streaming data with any Spark library or code.

Benefits of Discretized Stream Processing

1. Fast failure and straggler recovery

In traditional systems, recomputing lost data after a node failure is not easy: the failed operator has to be restarted on another node. In Spark, the computation is discretized into small tasks that can run anywhere without affecting correctness.

Therefore, failed tasks can be spread over all the nodes of the cluster to perform the recomputation in parallel, making recovery from failure faster than in the traditional approach.

2. Unification of batch, streaming and interactive analytics

Since a DStream is just a series of Spark RDDs, batch and streaming workloads interoperate seamlessly. We can apply arbitrary Spark functions to each batch of streaming data, and because the batches are stored in Spark workers' memory, they can also be queried interactively on demand.

3. Advanced analytics like machine learning and interactive SQL

We can also integrate it with advanced processing libraries such as SQL, machine learning, and graph processing. The RDDs generated by DStreams can be converted into DataFrames and then queried with SQL.
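
A common pattern for this is to use foreachRDD and convert each batch's RDD into a DataFrame; the sketch below follows that pattern (the view name `words_view` and the reuse of the `words` DStream from the earlier sketch are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

// `words` is the DStream[String] from the earlier word-count sketch.
words.foreachRDD { rdd =>
  // Lazily create (or reuse) a SparkSession bound to the same SparkContext.
  val spark = SparkSession.builder
    .config(rdd.sparkContext.getConf)
    .getOrCreate()
  import spark.implicits._

  // Convert this batch's RDD to a DataFrame and register a temp view.
  val wordsDF = rdd.toDF("word")
  wordsDF.createOrReplaceTempView("words_view")

  // Query the current batch with plain SQL.
  spark.sql("SELECT word, COUNT(*) AS total FROM words_view GROUP BY word").show()
}
```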

4. Performance

The ability to batch data and leverage the Spark engine leads to higher throughput than record-at-a-time systems, while Spark Streaming's latencies remain as low as a few hundred milliseconds.

Working of Spark Streaming

The data stream is converted into batches and exposed as DStreams; internally these are just a series of RDDs, which we process using Spark APIs, and the results are returned in batches. Spark Streaming APIs are available in Scala, Java, and Python.

Computations that maintain state based on the data arriving in the stream are called stateful computations; for these, Spark Streaming offers window operations.
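
For example, a sliding word count over the last 30 seconds, recomputed every 10 seconds, can be expressed with reduceByKeyAndWindow. This sketch reuses the `words` DStream from earlier; note that both durations must be multiples of the batch interval:

```scala
// One (word, 1) pair per word in the stream.
val pairs = words.map(word => (word, 1))

// Window length 30 s, slide interval 10 s: every 10 seconds we get
// the counts over the most recent 30 seconds of data.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // combine counts inside the window
  Seconds(30),                // window duration
  Seconds(10)                 // slide duration
)

windowedCounts.print()
```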

Spark Streaming Sources

Every input DStream is associated with a receiver object, which receives the data from a source and stores it in Spark's memory for processing.

There are two types of built-in streaming sources:

  • Basic sources

Sources directly available in the StreamingContext API, such as file systems and socket connections.

  • Advanced sources

Sources available through extra utility classes, such as Kafka, Flume, Kinesis, and many more. These require linking against extra dependencies; the sketch below shows the basic kind.
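
For illustration, here are two basic sources created directly from the StreamingContext (the port and HDFS path are placeholders); an advanced source such as Kafka would instead need the corresponding utility class and an extra artifact, e.g. spark-streaming-kafka, on the classpath:

```scala
// Basic source 1: a TCP socket; each line becomes one record.
val socketStream = ssc.socketTextStream("localhost", 9999)

// Basic source 2: a monitored directory; new files moved into it
// are read as they appear.
val fileStream = ssc.textFileStream("hdfs://namenode:8020/incoming/logs")
```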

There are two categories of receivers, based on their reliability:

  • Reliable Receiver

A reliable receiver sends an acknowledgment to the source once the data has been received and stored in Spark with replication.

  • Unreliable Receiver

An unreliable receiver does not send an acknowledgment to the source. We can use one for sources where we do not want or need the complexity of acknowledgments, as in the sketch below.
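
To show what this looks like in code, here is a sketch of a custom receiver modeled on Spark's documented custom receiver pattern (the class name, host, and port are hypothetical). Because it hands each record to Spark via store() without ever acknowledging the source, it behaves as an unreliable receiver:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Reads lines from a socket and stores them in Spark, replicated
// in memory and on disk, but never acknowledges the source.
class LineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive data on a separate thread so onStart() returns quickly.
    new Thread("Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: receive() exits once the receiver is stopped.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)  // hand the record to Spark; no ack to the source
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

// Usage: val custom = ssc.receiverStream(new LineReceiver("localhost", 9999))
```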

Spark Streaming Operations

Spark Streaming supports two types of operations:

1. Transformation Operations in Spark

As with RDDs, transformations allow the data from the input DStream to be modified. Spark Streaming offers many of the transformations available on normal Spark RDDs. A few of them are:

map(), repartition(numPartitions), filter(), flatMap(), union(otherStream), reduce(), count(), countByValue(), and many more.
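
A few of these in action, as a sketch continuing the socket stream above (the filter predicate and partition count are arbitrary):

```scala
// Keep only non-empty lines and split them into words.
val cleaned   = lines.filter(line => line.nonEmpty)
val tokens    = cleaned.flatMap(_.split(" "))

val perBatch  = tokens.count()         // DStream with one count per batch
val histogram = tokens.countByValue()  // DStream of (word, occurrences) pairs

// Redistribute the data across 4 partitions before an expensive stage.
val rebalanced = tokens.repartition(4)
```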

2. Output Operations in Apache Spark

A DStream's data can be pushed out to external systems, such as databases, through output operations. Some of the output operations are:

print(), saveAsTextFiles(prefix, [suffix]) (which writes files named "prefix-TIME_IN_MS[.suffix]"), saveAsObjectFiles(prefix, [suffix]), and many more.
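
For example, continuing the sketch above (the HDFS prefixes are placeholders):

```scala
// Print the first ten elements of every batch on the driver.
histogram.print()

// Write each batch as text files named "counts-<TIME_IN_MS>.txt".
histogram.saveAsTextFiles("hdfs://namenode:8020/out/counts", "txt")

// Write each batch as SequenceFiles of serialized objects.
histogram.saveAsObjectFiles("hdfs://namenode:8020/out/objects", "bin")
```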

Conclusion

As a result, Spark Streaming overcomes the issues of traditional streaming systems. We can now handle batch and streaming workloads in a single engine.

Hence, it is very attractive for developers to use a single framework to meet all their processing needs, and adopting Spark Streaming also enhances system efficiency and performance.