Apache Spark Paired RDD: Creation & Operations

1. Objective

In Apache Spark, Key-value pairs are known as paired RDD. In this blog, we will learn what are paired RDDs in Spark in detail. To understand in deep, we will focus on following methods of creating spark paired RDD in and operations on paired RDDs in spark, such as transformations and actions in Spark RDD. The transformation such as groupByKey, reduceByKey, join, leftOuterJoin/rightOuterJoin, while, actions like countByKey. But at first, we will learn brief introduction on RDDs in Spark.

Spark Paired RDD

Apache Spark Paired RDD

2. Introduction – Spark RDDs

RDD refers to Resilient Distributed Datasets, core abstraction and a fundamental data structure of Spark. RDDs in spark are immutable as well as the distributed collection of objects. In RDD, each dataset is divided into logical partitions. That each partition may be computed on different nodes of the cluster. Spark RDDs can contain user-defined classes. Also, includes any type of Scala, python or java objects.

It is a read-only, partitioned collection of records. Spark RDDs are the fault-tolerant collection of elements and it can be operated in parallel. There are generally three ways to create spark RDDs. Data in stable storage, other RDDs, and parallelizing existing collection in driver program. By using RDD, it is possible to achieve faster and efficient MapReduce operations.

3. Introduction – Apache Spark Paired RDD

Spark Paired RDDs are defined as the RDD containing a key-value pair. There is two linked data item in a key-value pair (KVP). We can say the key is the identifier, while the value is the data corresponding to the key value.

Spark Paired RDD

Paired RDD in Spark

In addition,  most of the Spark operations work on RDDs containing any type of objects. But on RDDs of key-value pairs, a few special operations are available. For example, distributed “shuffle” operations, such as grouping or aggregating the elements by a key.

These operations are automatically available on RDDs containing Tuple2 objects, in Scala. In the Pair RDD functions class, the key-value pair operations are available. That wraps around an RDD of tuples.

For example:

In this code we are using the reduceByKey operation on key-value pairs. We will count how many times each line of text occurs in a file:

val lines1 = sc.textFile("data1.txt")
val pairs1 = lines.map(s => (s, 1))
val counts1 = pairs.reduceByKey((a, b) => a + b)

There is one more method counts.sortByKey() we can use.

4. Importance of Apache Spark Paired RDD?

In many programs, pair RDDs of Apache Spark are a useful building block. Operations that allow us to act on each key in parallel, it exposes those operations. Also, helps to regroup the data across the network. For instance, in spark paired RDDs reduceByKey() method aggregate data separately for each key and a join() method, which merges two RDDs together by grouping elements with the same key. It is very normal to extract fields from an RDD. For example, representing, for instance, an event time, customer ID, or another identifier. Also, use those fields in spark pair RDD operations as keys.

5. Creating Paired RDD in Spark

By running a map() function that returns key or value pairs, we can create spark pair RDDs. On the basis of language, the procedure to build the key-value RDDs differs.

  • In Python language

For the functions of keyed data to work, we need to return an RDD composed of tuples. Furthermore, for creating a pair RDD in spark using the first word as the key in Python programming language.

pairs = lines.map(lambda x: (x.split(” “)[0], x))
  • In Scala language

We also need to return tuples as shown in the previous example. Moreover, this will make functions on keyed data to be available. To provide the extra key or value functions, an implicit conversion on RDDs of tuples exists.

Furthermore, to create apache spark pair RDD, by using the first word as the keyword

val pairs = lines.map(x => (x.split(” “)(0), x))
  • In Java language

It doesn’t have a built-in function of tuple function. So, by using the scala, only spark’s java API has users create tuples.Tuple2 class. Although, users can construct a new tuple by writing new Tuple2(elem1, elem2) in java. Also, can access its relevant elements with the _1() and _2() methods.

In addition, when you are creating paired RDDs in Spark, we need to call special versions of spark’s functions in java. For example, in place of the basic map() function the mapToPair () function should be used.

To create a Spark pair RDD, using the first word as the keyword

PairFunction<String, String, String> keyData =

new PairFunction<String, String, String>() {

public Tuple2<String, String> call(String x) {

return new Tuple2(x.split(” “)[0], x);

}

};

JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);

6. Some Interesting Spark Paired RDD – Operations

6.1.Transformation Operations

All the transformations available to standard RDDs, Pair RDDs are allowed to use them. Even it can apply same rules from “passing functions to spark”. As there are tuples available in spark paired RDDs, we need to pass functions that operate on tuples, rather than on individual elements. Some of the transformation methods are listed here. For example:

  • groupByKey

Basically, it groups all the values with the same key.

rdd.groupByKey()
  • reduceByKey(fun)

It uses to combine values with the same key.

add.reduceByKey( (x, y) => x + y)
  • combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)

By using a different result type, combine values with the same key.

  • mapValues(func)

Without changing the key, apply a function to each value of a pair RDD of spark.

rdd.mapValues(x => x+1)
  • keys()

Basically, Keys() returns a spark RDD of just the keys.

rdd.keys()
  • values()

Generally, values() returns an RDD of just the values.

rdd.values()
  • sortByKey()

Basically, sortByKey returns an RDD sorted by the key.

rdd.sortByKey()

6.2. Action Operations

Like transformations, actions available on spark pair RDDs are similar to base RDD. Basically, there are some additional actions available on pair RDDs of spark.  Moreover, those leverages the advantage of the key/value nature of the data. Some of them are listed below. For example,

  • countByKey()

For each key, it helps to count the number of elements.

rdd.countByKey()
  • collectAsMap()

Basically, it helps to collect the result as a map to provide easy lookup.

rdd.collectAsMap()
  • lookup(key)

Basically, lookup(key) returns all values associated with the provided key.

rdd.lookup()

7. Apache Spark Paired RDD – Conclusion

Hence,  we have seen how to work with Spark key/value data. Also, how to use the specialized functions and operations available in spark. Finally,  we hope this article has given all your answers regarding spark paired RDDs.

Best books for learning Spark.

Reference – Apache Spark

This tutorial is informative or not, give us your feedback.

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *