Apache Pig Operators with Syntax and Examples

1. Apache Pig Operators Tutorial

Apache Pig provides a large set of operators. In this article, “Introduction to Apache Pig Operators”, we will discuss the main types of Apache Pig operators: Diagnostic, Grouping & Joining, Combining & Splitting, Filtering, and Sorting. Each type has its own operators, so we will discuss each of them along with its syntax and examples.

2. Introduction to Apache Pig Operators

Apache Pig offers a large set of operators for performing several types of operations. The types of Apache Pig operators are:

  1. Diagnostic Operators
  2. Grouping & Joining
  3. Combining & Splitting
  4. Filtering
  5. Sorting

So, let’s discuss each type of Apache Pig Operators in detail.

3. Types of Pig Operators

i. Diagnostic Operators: Apache Pig Operators

We use diagnostic operators to verify the result of a LOAD statement and of the statements that follow it. There are four diagnostic operators:

  1. Dump operator
  2. Describe operator
  3. Explain operator
  4. Illustrate operator

Further, we will discuss each operator of Pig Latin in depth.

a. Dump Operator

To run Pig Latin statements and display the results on the screen, we use the Dump operator. Generally, we use it for debugging.

  • Syntax

The syntax of the Dump operator is:

grunt> Dump Relation_Name;

  • Example

In the following example, a dump is performed after each statement.

A = LOAD 'Employee' AS (name:chararray, age:int, gpa:float);
DUMP A;
(Shubham,18,4.0F)
(Pulkit,19,3.7F)
(Shreyash,20,3.9F)
(Mehul,22,3.8F)
(Rishabh,20,4.0F)
B = FILTER A BY name matches 'S.+';
DUMP B;
(Shubham,18,4.0F)
(Shreyash,20,3.9F)

b. Describe operator

To view the schema of a relation, we use the describe operator.

  • Syntax

The syntax of the describe operator is:

grunt> Describe Relation_name;

  • Example

Let’s suppose we have a file Employee_data.txt in HDFS with the following content.

001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai

Using the LOAD operator, we have read it into a relation named Employee.

grunt> Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee_data.txt' USING PigStorage(',')
  as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let’s describe the relation named Employee and verify its schema.

grunt> describe Employee;

  • Output

Executing the above Pig Latin statement produces the following output.

Employee: {id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray}

c. Explain operator

To display the logical, physical, and MapReduce execution plans of a relation, we use the explain operator.

  • Syntax

The syntax of the explain operator is:

grunt> explain Relation_name;

  • Example

Let’s suppose we have a file Employee_data.txt in HDFS with the following content:

001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai

Using the LOAD operator, we have read it into a relation named Employee.

grunt> Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee_data.txt' USING PigStorage(',')
  as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let’s use the explain operator on the relation named Employee.

grunt> explain Employee;

  • Output

It will produce the following output.

grunt> explain Employee;
2015-10-05 11:32:43,660 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator,
GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter,
MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer,
PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
Employee: (Name: LOStore Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
|
|---Employee: (Name: LOForEach Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
   | |
   | (Name: LOGenerate[false,false,false,false,false] Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)ColumnPrune:InputUids=[34, 35, 32, 33, 31]ColumnPrune:OutputUids=[34, 35, 32, 33, 31]
   | |   |
   | |   (Name: Cast Type: int Uid: 31)
   | |   |
   | |   |---id:(Name: Project Type: bytearray Uid: 31 Input: 0 Column: (*))
   | |   |
   | |   (Name: Cast Type: chararray Uid: 32)
   | |   |
   | |   |---firstname:(Name: Project Type: bytearray Uid: 32 Input: 1 Column: (*))
   | |   |
   | |   (Name: Cast Type: chararray Uid: 33)
   | |   |
   | |   |---lastname:(Name: Project Type: bytearray Uid: 33 Input: 2 Column: (*))
   | |   |
   | |   (Name: Cast Type: chararray Uid: 34)
   | |   |
   | |   |---phone:(Name: Project Type: bytearray Uid: 34 Input: 3 Column: (*))
   | |   |
   | |   (Name: Cast Type: chararray Uid: 35)
   | |   |
   | |   |---city:(Name: Project Type: bytearray Uid: 35 Input: 4 Column: (*))
   | |
   | |---(Name: LOInnerLoad[0] Schema: id#31:bytearray)
   | |
   | |---(Name: LOInnerLoad[1] Schema: firstname#32:bytearray)
   | |
   | |---(Name: LOInnerLoad[2] Schema: lastname#33:bytearray)
   | |
   | |---(Name: LOInnerLoad[3] Schema: phone#34:bytearray)
   | |
   | |---(Name: LOInnerLoad[4] Schema: city#35:bytearray)
   |
   |---Employee: (Name: LOLoad Schema: id#31:bytearray,firstname#32:bytearray,lastname#33:bytearray,phone#34:bytearray,city#35:bytearray)RequiredFields:null
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
Employee: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---Employee: New For Each(false,false,false,false,false)[bag] - scope-35
   | |
   | Cast[int] - scope-21
   | |
   | |---Project[bytearray][0] - scope-20
   | |  
   | Cast[chararray] - scope-24
   | |
   | |---Project[bytearray][1] - scope-23
   | |
   | Cast[chararray] - scope-27
   | |  
   | |---Project[bytearray][2] - scope-26
   | |  
   | Cast[chararray] - scope-30
   | |  
   | |---Project[bytearray][3] - scope-29
   | |
   | Cast[chararray] - scope-33
   | |
   | |---Project[bytearray][4] - scope-32
   |
   |---Employee: Load(hdfs://localhost:9000/pig_data/Employee_data.txt:PigStorage(',')) - scope-19
2015-10-05 11:32:43,682 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-10-05 11:32:43,684 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2015-10-05 11:32:43,685 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan                                   
#--------------------------------------------------
MapReduce node scope-37
Map Plan
Employee: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---Employee: New For Each(false,false,false,false,false)[bag] - scope-35
   | |
   | Cast[int] - scope-21
   | |
   | |---Project[bytearray][0] - scope-20
   | |
   | Cast[chararray] - scope-24
   | |
   | |---Project[bytearray][1] - scope-23
   | |
   | Cast[chararray] - scope-27
   | |
   | |---Project[bytearray][2] - scope-26
   | |
   | Cast[chararray] - scope-30
   | |  
   | |---Project[bytearray][3] - scope-29
   | |
   | Cast[chararray] - scope-33
   | |
   | |---Project[bytearray][4] - scope-32
   |
   |---Employee: Load(hdfs://localhost:9000/pig_data/Employee_data.txt:PigStorage(',')) - scope-19--------
Global sort: false
----------------

d. Illustrate operator

This operator gives you the step-by-step execution of a sequence of statements.

  • Syntax

The syntax of the illustrate operator is:

grunt> illustrate Relation_name;

  • Example

Let’s suppose we have a file Employee_data.txt in HDFS with the following content:

001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai

Using the LOAD operator, we have read it into a relation named Employee.

grunt> Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee_data.txt' USING PigStorage(',')
  as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let’s illustrate the relation named Employee.

grunt> illustrate Employee;

  • Output

Executing the above statement produces the following output.

grunt> illustrate Employee;
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: Employee[1,10] C:  R:
Employee id:int firstname:chararray lastname:chararray phone:chararray city:chararray
002 Ankur Dutta 9848022338 Kolkata

ii. Grouping & Joining: Apache Pig Operators

There are four grouping and joining operators:

  1. Group Operator
  2. Cogroup Operator
  3. Join Operator
  4. Cross operator

Let’s discuss them in depth:

a. Group Operator

To group the data in one or more relations, we use the GROUP operator.

  • Syntax

The syntax of the group operator is:

grunt> Group_data = GROUP Relation_name BY key;

  • Example

Let’s suppose that we have a file named Employee_details.txt in the HDFS directory /pig_data/.

Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai

We have loaded this file into Apache Pig under the relation name Employee_details.

grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, let’s group the records/tuples in the relation by age.

grunt> group_data = GROUP Employee_details by age;

  • Verification

Then, using the DUMP operator, verify the relation group_data.

grunt> Dump group_data;

  • Output

The output displays the contents of the relation named group_data. We can observe that the resulting schema has two columns:

The first is age, the key by which the relation is grouped.

The second is a bag that contains the group of tuples, that is, the Employee records with the respective age.

(21,{(4,Prerna,Tripathi,21,9848022330,Pune),(1,mehul,chourey,21,9848022337,Hyderabad)})
(22,{(3,Shubham,Sengar,22,9848022339,Delhi),(2,Ankur,Dutta,22,9848022338,Kolkata)})
(23,{(6,Monika,sharma,23,9848022335,Chennai),(5,Sagar,Joshi,23,9848022336,Bhubaneswar)})
(24,{(8,Roshan,Shaikh,24,9848022333,Chennai),(7,pulkit,pawar,24,9848022334,trivandrum)})

After grouping the data, we can see the schema of the resulting relation using the describe command.

grunt> Describe group_data;
group_data: {group: int,Employee_details: {(id: int,firstname: chararray,
              lastname: chararray,age: int,phone: chararray,city: chararray)}}

Similarly, using the illustrate command, we can get a sample illustration of the schema.

grunt> illustrate group_data;

The output is:

group_data group:int Employee_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}
21 {(4,Prerna,Tripathi,21,9848022330,Pune),(1,mehul,chourey,21,9848022337,Hyderabad)}
22 {(2,Ankur,Dutta,22,9848022338,Kolkata),(3,Shubham,Sengar,22,9848022339,Delhi)}

  • Grouping by Multiple Columns

Further, let’s group the relation by age and city.

grunt> group_multiple = GROUP Employee_details by (age, city);

Now, using the Dump operator, we can verify the content of the relation named group_multiple.

grunt> Dump group_multiple;
((21,Pune),{(4,Prerna,Tripathi,21,9848022330,Pune)})
((21,Hyderabad),{(1,mehul,chourey,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Shubham,Sengar,22,9848022339,Delhi)})
((22,Kolkata),{(2,Ankur,Dutta,22,9848022338,Kolkata)})
((23,Chennai),{(6,Monika,sharma,23,9848022335,Chennai)})
((23,Bhubaneswar),{(5,Sagar,Joshi,23,9848022336,Bhubaneswar)})
((24,Chennai),{(8,Roshan,Shaikh,24,9848022333,Chennai)})
((24,trivandrum),{(7,pulkit,pawar,24,9848022334,trivandrum)})

  • Group All

With GROUP ... ALL, all the tuples of a relation are placed into a single group whose key is the value all.

grunt> group_all = GROUP Employee_details ALL;

Now, verify the content of the relation group_all.

grunt> Dump group_all;
(all,{(8,Roshan,Shaikh,24,9848022333,Chennai),(7,pulkit,pawar,24,9848022334,trivandrum),
(6,Monika,sharma,23,9848022335,Chennai),(5,Sagar,Joshi,23,9848022336,Bhubaneswar),
(4,Prerna,Tripathi,21,9848022330,Pune),(3,Shubham,Sengar,22,9848022339,Delhi),
(2,Ankur,Dutta,22,9848022338,Kolkata),(1,mehul,chourey,21,9848022337,Hyderabad)})
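
Grouping is usually followed by an aggregation over each bag. The following is a minimal sketch, assuming the relations group_data and group_all from the examples above are still defined in the Grunt session; the relation names age_counts and total are chosen here only for illustration.

```pig
-- Count the employees in each age group.
-- "group" is the key column produced by GROUP; the bag is named after the input relation.
age_counts = FOREACH group_data GENERATE group AS age, COUNT(Employee_details) AS employees;
DUMP age_counts;

-- GROUP ... ALL yields a single group, so this counts every tuple in the relation.
total = FOREACH group_all GENERATE COUNT(Employee_details) AS employees;
DUMP total;
```

With the eight-record sample file, each age group should hold two employees and the total should be eight.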

b. Cogroup Operator

The COGROUP operator works more or less in the same way as the GROUP operator. The difference is one of usage: we normally use the group operator with one relation, whereas we use the cogroup operator in statements involving two or more relations.

  • Grouping Two Relations using Cogroup

Let’s suppose we have two files namely Employee_details.txt and Clients_details.txt in the HDFS directory /pig_data/.

Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai

Clients_details.txt
001,Kajal,22,new york
002,Vaishnavi,23,Kolkata
003,Twinkle,23,Tokyo
004,Manish,25,London
005,Purva,23,Bhubaneswar
006,Vishal,22,Chennai

We have loaded these files into Pig under the relation names Employee_details and Clients_details respectively.

grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
grunt> Clients_details = LOAD 'hdfs://localhost:9000/pig_data/Clients_details.txt' USING PigStorage(',')
  as (id:int, name:chararray, age:int, city:chararray);

Now, let’s group the records/tuples of the relations Employee_details and Clients_details by the key age.

grunt> cogroup_data = COGROUP Employee_details by age, Clients_details by age;

  • Verification

Using the DUMP operator, verify the relation cogroup_data.

grunt> Dump cogroup_data;

  • Output

It will produce the following output, displaying the contents of the relation named cogroup_data.

(21,{(4,Prerna,Tripathi,21,9848022330,Pune),(1,mehul,chourey,21,9848022337,Hyderabad)},
  {})
(22,{(3,Shubham,Sengar,22,9848022339,Delhi),(2,Ankur,Dutta,22,9848022338,Kolkata)},
  {(6,Vishal,22,Chennai),(1,Kajal,22,new york)})
(23,{(6,Monika,sharma,23,9848022335,Chennai),(5,Sagar,Joshi,23,9848022336,Bhubaneswar)},
  {(5,Purva,23,Bhubaneswar),(3,Twinkle,23,Tokyo),(2,Vaishnavi,23,Kolkata)})
(24,{(8,Roshan,Shaikh,24,9848022333,Chennai),(7,pulkit,pawar,24,9848022334,trivandrum)},
  {})
(25,{},
  {(4,Manish,25,London)})

Here, the cogroup operator groups the tuples from each relation according to age, where each group corresponds to a particular age value.

For example, consider the first tuple of the result, which is grouped by age 21. It contains two bags:

The first bag holds all the tuples from the first relation (Employee_details in this case) having age 21.

The second bag holds all the tuples from the second relation (Clients_details in this case) having age 21.

When a relation has no tuples with a given age value, COGROUP returns an empty bag for it. Here, Clients_details has no tuples with age 21, so the second bag is empty.
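
The two bags of each group can be inspected separately. The following is a small sketch, assuming the relation cogroup_data from above is still defined; the names cogroup_counts and in_both are hypothetical. It counts the tuples each relation contributes per age and then keeps only the ages that occur in both relations.

```pig
-- One row per age: how many employees and how many clients have that age.
cogroup_counts = FOREACH cogroup_data GENERATE
    group AS age,
    COUNT(Employee_details) AS employees,
    COUNT(Clients_details) AS clients;
DUMP cogroup_counts;

-- Keep only the groups in which neither bag is empty.
in_both = FILTER cogroup_data BY NOT IsEmpty(Employee_details) AND NOT IsEmpty(Clients_details);
DUMP in_both;
```

With the sample data, only the ages 22 and 23 appear in both relations.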

c. Join Operator

To combine records from two or more relations, we use the JOIN operator. While performing a join operation, we declare one field (or a group of fields) from each relation as keys. When these keys match, the two tuples are matched and combined; otherwise, in an inner join, the records are dropped. There are several types of joins:

  1. Self-join
  2. Inner-join
  3. Outer-join − left join, right join, and full join
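
The join types listed above can be sketched as follows. This is only an illustration, assuming the relations Employee_details and Clients_details loaded in the Cogroup example; the result names and the second alias Employee_copy are hypothetical.

```pig
-- Inner join: only ages present in both relations produce output tuples.
inner_data = JOIN Employee_details BY age, Clients_details BY age;

-- Outer joins: unmatched tuples are kept and padded with nulls.
left_data  = JOIN Employee_details BY age LEFT OUTER,  Clients_details BY age;
right_data = JOIN Employee_details BY age RIGHT OUTER, Clients_details BY age;
full_data  = JOIN Employee_details BY age FULL OUTER,  Clients_details BY age;

-- Self-join: Pig needs two distinct aliases, so the same file is loaded a second time.
Employee_copy = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
self_data = JOIN Employee_details BY city, Employee_copy BY city;

DUMP inner_data;
```

In the joined relation, fields that exist on both sides are disambiguated with the relation name, for example Employee_details::age and Clients_details::age.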

d. Cross Operator

It computes the cross-product of two or more relations.

  • Syntax

The syntax of the CROSS operator is:

grunt> Relation3_name = CROSS Relation1_name, Relation2_name;

  • Example

Let’s suppose we have two files, namely Users.txt and orders.txt, in the /pig_data/ directory of HDFS.

Users.txt
1,Sanjeev,32,Ahmedabad,2000.00
2,Ankit,25,Delhi,1500.00
3,Raj,23,Kota,2000.00
4,Sumit,25,Mumbai,6500.00
5,Pankaj,27,Bhopal,8500.00
6,Vishnu,22,MP,4500.00
7,Ravi,24,Indore,10000.00

orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

We have loaded these two files into Pig as the relations Users and orders.

grunt> Users = LOAD 'hdfs://localhost:9000/pig_data/Users.txt' USING PigStorage(',')
  as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
  as (oid:int, date:chararray, customer_id:int, amount:int);

Now, let’s get the cross-product of these two relations using the CROSS operator.

grunt> cross_data = CROSS Users, orders;

  • Verification

Now, using the DUMP operator, verify the relation cross_data.

grunt> Dump cross_data;

  • Output

It will produce the following output, displaying the contents of the relation cross_data.

(7,Ravi,24,Indore,10000,103,2008-05-20 00:00:00,4,2060)
(7,Ravi,24,Indore,10000,101,2009-11-20 00:00:00,2,1560)
(7,Ravi,24,Indore,10000,100,2009-10-08 00:00:00,3,1500)
(7,Ravi,24,Indore,10000,102,2009-10-08 00:00:00,3,3000)
(6,Vishnu,22,MP,4500,103,2008-05-20 00:00:00,4,2060)
(6,Vishnu,22,MP,4500,101,2009-11-20 00:00:00,2,1560)
(6,Vishnu,22,MP,4500,100,2009-10-08 00:00:00,3,1500)
(6,Vishnu,22,MP,4500,102,2009-10-08 00:00:00,3,3000)
(5,Pankaj,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060)
(5,Pankaj,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560)
(5,Pankaj,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500)
(5,Pankaj,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000)
(4,Sumit,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(4,Sumit,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560)
(4,Sumit,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500)
(4,Sumit,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000)
(3,Raj,23,Kota,2000,103,2008-05-20 00:00:00,4,2060)
(3,Raj,23,Kota,2000,101,2009-11-20 00:00:00,2,1560)
(3,Raj,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,Raj,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(2,Ankit,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060)
(2,Ankit,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(2,Ankit,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Ankit,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000)
(1,Sanjeev,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060)
(1,Sanjeev,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560)
(1,Sanjeev,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500)
(1,Sanjeev,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)

iii. Combining & Splitting: Apache Pig Operators

These are of two types:

  1. Union
  2. Split

a. Union Operator

To merge the content of two relations, we use the UNION operator of Pig Latin. For the result to have a consistent schema, the two relations should have the same columns with the same data types.

  • Syntax

The syntax of the UNION operator is:

grunt> Relation_name3 = UNION Relation_name1, Relation_name2;

  • Example

Let’s suppose we have two files, namely Employee_data1.txt and Employee_data2.txt, in the /pig_data/ directory of HDFS.

Employee_data1.txt
001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai

Employee_data2.txt
7,Prachi,Yadav,9848022334,trivendram
8,Avikal,Singh,9848022333,Chennai

We have loaded these two files into Pig as the relations Employee1 and Employee2.

grunt> Employee1 = LOAD 'hdfs://localhost:9000/pig_data/Employee_data1.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> Employee2 = LOAD 'hdfs://localhost:9000/pig_data/Employee_data2.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Using the UNION operator, let’s now merge the contents of these two relations.

grunt> Employee = UNION Employee1, Employee2;

  • Verification

Now, using the DUMP operator, verify the relation Employee.

grunt> Dump Employee;

  • Output

It will produce the following output, displaying the contents of the relation Employee.

(1,mehul,chourey,9848022337,Hyderabad)
(2,Ankur,Dutta,9848022338,Kolkata)
(3,Shubham,Sengar,9848022339,Delhi)
(4,Prerna,Tripathi,9848022330,Pune)
(5,Sagar,Joshi,9848022336,Bhubaneswar)
(6,Monika,sharma,9848022335,Chennai)
(7,Prachi,Yadav,9848022334,trivendram)
(8,Avikal,Singh,9848022333,Chennai)
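
When the two relations do not list their fields in the same order, or one of them lacks a field, Pig Latin offers the ONSCHEMA keyword, which matches fields by name instead of by position. The following is a hedged sketch; the relation Employee3, its file name Employee_data3.txt, and its reduced schema are assumptions made up for this illustration.

```pig
-- Hypothetical third file that has no phone column.
Employee3 = LOAD 'hdfs://localhost:9000/pig_data/Employee_data3.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, city:chararray);

-- Fields are aligned by name; phone is null for the tuples coming from Employee3.
Employee_all = UNION ONSCHEMA Employee1, Employee3;
DESCRIBE Employee_all;
```

ONSCHEMA requires every input relation to have a declared schema, which is the case for both relations here.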

b. Split Operator

To split a relation into two or more relations, we use the SPLIT operator.

  • Syntax

The syntax of the SPLIT operator is:

grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);

  • Example

Let’s suppose that we have a file named Employee_details.txt in the HDFS directory /pig_data/ as shown below.

Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai

We have loaded this file into Pig under the relation name Employee_details.

grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, let’s split the relation into two:

The first lists the employees of age less than 23.

The second lists the employees whose age is greater than 22 and less than 25.

grunt> SPLIT Employee_details into Employee_details1 if age<23, Employee_details2 if (22<age and age<25);

  • Verification

Using the DUMP operator, verify the relations Employee_details1 and Employee_details2.

grunt> Dump Employee_details1;
grunt> Dump Employee_details2;

  • Output

It will produce the following output, displaying the contents of the relations Employee_details1 and Employee_details2 respectively.

grunt> Dump Employee_details1;
(1,mehul,chourey,21,9848022337,Hyderabad)
(2,Ankur,Dutta,22,9848022338,Kolkata)
(3,Shubham,Sengar,22,9848022339,Delhi)
(4,Prerna,Tripathi,21,9848022330,Pune)
grunt> Dump Employee_details2;
(5,Sagar,Joshi,23,9848022336,Bhubaneswar)
(6,Monika,sharma,23,9848022335,Chennai)
(7,pulkit,pawar,24,9848022334,trivandrum)
(8,Roshan,Shaikh,24,9848022333,Chennai)
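
A tuple that satisfies none of the SPLIT conditions is dropped, and a tuple that satisfies several conditions lands in several relations. If every remaining tuple should be collected in one catch-all relation, recent Pig versions (0.10 and later, as far as we know) accept an OTHERWISE branch. A minimal sketch, with the relation names younger and others chosen only for illustration:

```pig
-- Employees under 23 go to "younger"; all tuples that fail the condition go to "others".
SPLIT Employee_details INTO younger IF age < 23, others OTHERWISE;

DUMP younger;
DUMP others;
```

With the sample file, younger should hold the four employees aged 21 and 22, and others the four employees aged 23 and 24.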

iv. Filtering: Apache Pig Operators

These are of three types:

  1. Filter
  2. Distinct
  3. Foreach

Now, let’s discuss each in detail:

a. Filter Operator

To select the required tuples from a relation based on a condition, we use the FILTER operator.

  • Syntax

The syntax of the FILTER operator is:

grunt> Relation2_name = FILTER Relation1_name BY (condition);

  • Example

Let’s suppose we have a file named Employee_details.txt in the HDFS directory /pig_data/.

Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai

We have loaded this file into Pig under the relation name Employee_details.

grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, to get the details of the employees who belong to the city Chennai, let’s use the Filter operator.

grunt> filter_data = FILTER Employee_details BY city == 'Chennai';

  • Verification

Using the DUMP operator, verify the relation filter_data.

grunt> Dump filter_data;

  • Output

It will produce the following output, displaying the contents of the relation filter_data.

(6,Monika,sharma,23,9848022335,Chennai)
(8,Roshan,Shaikh,24,9848022333,Chennai)
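
Filter conditions can be combined with AND, OR and NOT, and string fields can be tested against a regular expression with matches. The following sketch assumes the relation Employee_details loaded above; the names filter_multi and filter_name are hypothetical.

```pig
-- Employees aged 22 or more who live in Chennai or Delhi.
filter_multi = FILTER Employee_details BY (age >= 22) AND (city == 'Chennai' OR city == 'Delhi');
DUMP filter_multi;

-- Employees whose first name starts with S; matches must cover the whole string.
filter_name = FILTER Employee_details BY firstname matches 'S.*';
DUMP filter_name;
```

With the sample file, the first filter should keep Shubham, Monika and Roshan, and the second should keep Shubham and Sagar.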

b. Distinct Operator

To remove redundant (duplicate) tuples from a relation, we use the DISTINCT operator.

  • Syntax

The syntax of the DISTINCT operator is:

grunt> Relation_name2 = DISTINCT Relation_name1;

  • Example

Let’s suppose that we have a file named Employee_details.txt in the HDFS directory /pig_data/ as shown below.

Employee_details.txt
001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai
006,Monika,sharma,9848022335,Chennai

We have loaded this file into Pig under the relation name Employee_details.

grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Now, using the DISTINCT operator, let’s remove the redundant (duplicate) tuples from the relation named Employee_details and store the result in another relation named distinct_data.

grunt> distinct_data = DISTINCT Employee_details;

  • Verification

Using the DUMP operator, verify the relation distinct_data.

grunt> Dump distinct_data;

  • Output

It will produce the following output, displaying the contents of the relation distinct_data.

(1,mehul,chourey,9848022337,Hyderabad)
(2,Ankur,Dutta,9848022338,Kolkata)
(3,Shubham,Sengar,9848022339,Delhi)
(4,Prerna,Tripathi,9848022330,Pune)
(5,Sagar,Joshi,9848022336,Bhubaneswar)
(6,Monika,sharma,9848022335,Chennai)
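
DISTINCT compares whole tuples, so to find the distinct values of a single field we first project that field. A short sketch, assuming the relation Employee_details loaded above; the names cities and distinct_cities are chosen only for illustration.

```pig
-- Project the city column, then remove the duplicate city values.
cities = FOREACH Employee_details GENERATE city;
distinct_cities = DISTINCT cities;
DUMP distinct_cities;
```

With the sample file, this should leave one tuple for each of the six cities.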

c. Foreach Operator

To generate specified data transformations based on the column data, we use the FOREACH operator.

  • Syntax

The syntax of the FOREACH operator is:

grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

  • Example

Let’s suppose we have a file named Employee_details.txt in the HDFS directory /pig_data/.

Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai

We have loaded this file into Pig under the relation name Employee_details.

grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, using the foreach operator, let’s get the id, age, and city values of each employee from the relation Employee_details and store them in another relation named foreach_data.

grunt> foreach_data = FOREACH Employee_details GENERATE id,age,city;

  • Verification

Using the DUMP operator, verify the relation foreach_data.

grunt> Dump foreach_data;

  • Output

It will produce the following output, displaying the contents of the relation foreach_data.

(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhubaneswar)
(6,23,Chennai)
(7,24,trivandrum)
(8,24,Chennai)
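
Besides selecting columns, FOREACH ... GENERATE can compute new ones from expressions and built-in functions. The following is a minimal sketch, assuming the relation Employee_details loaded above; the relation name foreach_expr and the output field names are hypothetical.

```pig
foreach_expr = FOREACH Employee_details GENERATE
    id,
    CONCAT(firstname, CONCAT(' ', lastname)) AS fullname,  -- nested two-argument CONCAT
    UPPER(city) AS city_upper,                             -- built-in string function
    age + 1 AS age_next_year;                              -- arithmetic on an int field
DUMP foreach_expr;
```

The AS clauses give the generated columns names, so later statements can refer to them by name instead of by position.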

v. Sorting: Apache Pig Operators

These are of two types:

  1. Order By
  2. Limit

Let’s discuss both in detail:

a. ORDER BY operator

To display the contents of a relation in a sorted order based on one or more fields, we use the ORDER BY operator.

  • Syntax

The syntax of the ORDER BY operator is:

grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);

  • Example

Let’s suppose we have a file named Employee_details.txt in the HDFS directory /pig_data/.

Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai

We have loaded this file into Pig under the relation name Employee_details.

grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, using the ORDER BY operator, let’s sort the relation in descending order of the employees’ age and store the result in another relation named order_by_data.

grunt> order_by_data = ORDER Employee_details BY age DESC;

  • Verification

Using the DUMP operator, verify the relation order_by_data.

grunt> Dump order_by_data;

  • Output

It will produce the following output, displaying the contents of the relation order_by_data.

(8,Roshan,Shaikh,24,9848022333,Chennai)
(7,pulkit,pawar,24,9848022334,trivandrum)
(6,Monika,sharma,23,9848022335,Chennai)
(5,Sagar,Joshi,23,9848022336,Bhubaneswar)
(3,Shubham,Sengar,22,9848022339,Delhi)
(2,Ankur,Dutta,22,9848022338,Kolkata)
(4,Prerna,Tripathi,21,9848022330,Pune)
(1,mehul,chourey,21,9848022337,Hyderabad)
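
In the output above, the order of employees with the same age is not determined by the statement. To make it explicit, ORDER BY accepts several fields, each with its own direction. A short sketch, assuming the relation Employee_details loaded above; the name order_multi is hypothetical.

```pig
-- Sort by age, oldest first; break ties by first name in ascending order.
order_multi = ORDER Employee_details BY age DESC, firstname ASC;
DUMP order_multi;
```

Note that chararray values are compared case-sensitively, so a lowercase name such as pulkit sorts after capitalized names.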

b. LIMIT operator

To get a limited number of tuples from a relation, we use the LIMIT operator.

  • Syntax

The syntax of the LIMIT operator is:

grunt> Result = LIMIT Relation_name number_of_tuples;

  • Example

Let’s suppose that we have a file named Employee_details.txt in the HDFS directory /pig_data/ as shown below.

Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai

We have loaded this file into Pig under the relation name Employee_details.

grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',')
  as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, using the LIMIT operator, let’s take four tuples from the relation and store them in another relation named limit_data.

grunt> limit_data = LIMIT Employee_details 4;

  • Verification

Using the DUMP operator, verify the relation limit_data.

grunt> Dump limit_data;

  • Output

It will produce the following output, displaying the contents of the relation limit_data.

(1,mehul,chourey,21,9848022337,Hyderabad)
(2,Ankur,Dutta,22,9848022338,Kolkata)
(3,Shubham,Sengar,22,9848022339,Delhi)
(4,Prerna,Tripathi,21,9848022330,Pune)
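
On its own, LIMIT gives no guarantee about which tuples are returned. When a specific set is wanted, for example the oldest employees, a common pattern is to sort first and apply LIMIT directly to the sorted relation. A minimal sketch, assuming the relation Employee_details loaded above; the names sorted_by_age and oldest_four are chosen only for illustration.

```pig
-- Sort by age in descending order, then keep the first four tuples of the sorted relation.
sorted_by_age = ORDER Employee_details BY age DESC;
oldest_four = LIMIT sorted_by_age 4;
DUMP oldest_four;
```

With the sample file, this should return the four employees aged 24 and 23.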

This was all about Apache Pig Operators.

4. Conclusion: Apache Pig Operators

In this article, we have seen the Apache Pig operators in detail, along with their examples. If you have any query, feel free to share it.
