Apache Pig Operators with Syntax and Examples
Apache Pig provides a rich set of operators. In this article, “Introduction to Apache Pig Operators”, we will discuss all the types of Apache Pig operators in detail, such as Diagnostic Operators, Grouping & Joining, Combining & Splitting, and more, along with their subtypes.
So, here we will discuss each type of Apache Pig operator in depth, along with syntax and examples.
What are Apache Pig Operators?
Apache Pig offers a large set of operators for performing several types of operations. Let’s discuss the types of Apache Pig operators:
- Diagnostic Operators
- Grouping & Joining
- Combining & Splitting
- Filtering
- Sorting
So, let’s discuss each type of Apache Pig Operators in detail.
Types of Pig Operators
i. Diagnostic Operators: Apache Pig Operators
Basically, we use Diagnostic Operators to verify the execution of the LOAD statement. There are four different types of diagnostic operators:
- Dump operator
- Describe operator
- Explain operator
- Illustrate operator
Further, we will discuss each operator of Pig Latin in depth.
a. Dump Operator
In order to run Pig Latin statements and display the results on the screen, we use the Dump operator. Generally, we use it for debugging purposes.
- Syntax
So, the syntax of the Dump operator is:
grunt> Dump Relation_Name;
- Example
Here is an example in which a dump is performed after each statement.
A = LOAD 'Employee' AS (name:chararray, age:int, gpa:float);
DUMP A;
(Shubham,18,4.0F)
(Pulkit,19,3.7F)
(Shreyash,20,3.9F)
(Mehul,22,3.8F)
(Rishabh,20,4.0F)
B = FILTER A BY gpa >= 3.8F;
DUMP B;
(Shubham,18,4.0F)
(Shreyash,20,3.9F)
(Mehul,22,3.8F)
(Rishabh,20,4.0F)
b. Describe operator
To view the schema of a relation, we use the describe operator.
- Syntax
So, the syntax of the describe operator is −
grunt> Describe Relation_name
- Example
Let’s suppose we have a file Employee_data.txt in HDFS. Its content is:
001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai
Also, using the LOAD operator, we have read it into a relation Employee.
grunt> Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Further, let’s describe the relation named Employee and verify its schema.
grunt> describe Employee
- Output
It will produce the following output, after execution of the above Pig Latin statement.
Employee: {id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray}
c. Explain operator
To display the logical, physical, and MapReduce execution plans of a relation, we use the explain operator.
- Syntax
So, the syntax of the explain operator is-
grunt> explain Relation_name;
- Example
Let’s suppose we have a file Employee_data.txt in HDFS. Its content is:
001,mehul,chourey,9848022337,Hyderabad 002,Ankur,Dutta,9848022338,Kolkata 003,Shubham,Sengar,9848022339,Delhi 004,Prerna,Tripathi,9848022330,Pune 005,Sagar,Joshi,9848022336,Bhubaneswar 006,Monika,sharma,9848022335,Chennai
Also, using the LOAD operator, we have read it into a relation Employee.
grunt> Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Further, using the explain operator, let’s explain the relation named Employee.
grunt> explain Employee;
- Output
It will produce the following output.
$ explain Employee;
2015-10-05 11:32:43,660 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
Employee: (Name: LOStore Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
|
|---Employee: (Name: LOForEach Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
    |   |
    |   (Name: LOGenerate[false,false,false,false,false] Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)ColumnPrune:InputUids=[34, 35, 32, 33, 31]ColumnPrune:OutputUids=[34, 35, 32, 33, 31]
    |   |   |
    |   |   (Name: Cast Type: int Uid: 31)
    |   |   |
    |   |   |---id:(Name: Project Type: bytearray Uid: 31 Input: 0 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 32)
    |   |   |
    |   |   |---firstname:(Name: Project Type: bytearray Uid: 32 Input: 1 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 33)
    |   |   |
    |   |   |---lastname:(Name: Project Type: bytearray Uid: 33 Input: 2 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 34)
    |   |   |
    |   |   |---phone:(Name: Project Type: bytearray Uid: 34 Input: 3 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 35)
    |   |   |
    |   |   |---city:(Name: Project Type: bytearray Uid: 35 Input: 4 Column: (*))
    |   |
    |   |---(Name: LOInnerLoad[0] Schema: id#31:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[1] Schema: firstname#32:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[2] Schema: lastname#33:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[3] Schema: phone#34:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[4] Schema: city#35:bytearray)
|
|---Employee: (Name: LOLoad Schema: id#31:bytearray,firstname#32:bytearray,lastname#33:bytearray,phone#34:bytearray,city#35:bytearray)RequiredFields:null
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
Employee: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---Employee: New For Each(false,false,false,false,false)[bag] - scope-35
    |   |
    |   Cast[int] - scope-21
    |   |
    |   |---Project[bytearray][0] - scope-20
    |   |
    |   Cast[chararray] - scope-24
    |   |
    |   |---Project[bytearray][1] - scope-23
    |   |
    |   Cast[chararray] - scope-27
    |   |
    |   |---Project[bytearray][2] - scope-26
    |   |
    |   Cast[chararray] - scope-30
    |   |
    |   |---Project[bytearray][3] - scope-29
    |   |
    |   Cast[chararray] - scope-33
    |   |
    |   |---Project[bytearray][4] - scope-32
|
|---Employee: Load(hdfs://localhost:9000/pig_data/Employee_data.txt:PigStorage(',')) - scope-19
2015-10-05 11:32:43,682 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-10-05 11:32:43,684 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2015-10-05 11:32:43,685 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-37
Map Plan
Employee: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---Employee: New For Each(false,false,false,false,false)[bag] - scope-35
    |   |
    |   Cast[int] - scope-21
    |   |
    |   |---Project[bytearray][0] - scope-20
    |   |
    |   Cast[chararray] - scope-24
    |   |
    |   |---Project[bytearray][1] - scope-23
    |   |
    |   Cast[chararray] - scope-27
    |   |
    |   |---Project[bytearray][2] - scope-26
    |   |
    |   Cast[chararray] - scope-30
    |   |
    |   |---Project[bytearray][3] - scope-29
    |   |
    |   Cast[chararray] - scope-33
    |   |
    |   |---Project[bytearray][4] - scope-32
|
|---Employee: Load(hdfs://localhost:9000/pig_data/Employee_data.txt:PigStorage(',')) - scope-19
--------
Global sort: false
----------------
d. Illustrate operator
This operator gives you the step-by-step execution of a sequence of statements.
- Syntax
So, the syntax of the illustrate operator is-
grunt> illustrate Relation_name;
- Example
Let’s suppose we have a file Employee_data.txt in HDFS. Its content is:
001,mehul,chourey,9848022337,Hyderabad 002,Ankur,Dutta,9848022338,Kolkata 003,Shubham,Sengar,9848022339,Delhi 004,Prerna,Tripathi,9848022330,Pune 005,Sagar,Joshi,9848022336,Bhubaneswar 006,Monika,sharma,9848022335,Chennai
Also, using the LOAD operator, we have read it into a relation Employee.
grunt> Employee = LOAD 'hdfs://localhost:9000/pig_data/Employee_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Further, let’s illustrate the relation named Employee as follows.
grunt> illustrate Employee;
- Output
We will get the following output, on executing the above statement.
grunt> illustrate Employee;
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: Employee[1,10] C: R:
----------------------------------------------------------------------------------------------
| Employee | id:int | firstname:chararray | lastname:chararray | phone:chararray | city:chararray |
----------------------------------------------------------------------------------------------
|          | 002    | Ankur               | Dutta              | 9848022338      | Kolkata        |
----------------------------------------------------------------------------------------------
ii. Grouping & Joining: Apache Pig Operators
There are four types of Grouping and Joining operators:
- Group operator
- Cogroup operator
- Join operator
- Cross operator
Let’s discuss them in depth:
a. Group Operator
To group the data in one or more relations, we use the GROUP operator.
- Syntax
So, the syntax of the group operator is:
grunt> Group_data = GROUP Relation_name BY column_name;
- Example
Let’s suppose that we have a file named Employee_details.txt in the HDFS directory /pig_data/.
Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai
Also, with the relation name Employee_details, we have loaded this file into Apache Pig.
grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Further, let’s group the records/tuples in the relation by age.
grunt> group_data = GROUP Employee_details by age;
- Verification
Then, using the DUMP operator, verify the relation group_data.
grunt> Dump group_data;
- Output
Hence, we will get output displaying the contents of the relation named group_data. We can observe that the resulting schema has two columns:
- The first is age, by which we have grouped the relation.
- The second is a bag, which contains the group of tuples (the Employee records) with the respective age.
(21,{(4,Prerna,Tripathi,21,9848022330,Pune),(1,mehul,chourey,21,9848022337,Hyderabad)})
(22,{(3,Shubham,Sengar,22,9848022339,Delhi),(2,Ankur,Dutta,22,9848022338,Kolkata)})
(23,{(6,Monika,sharma,23,9848022335,Chennai),(5,Sagar,Joshi,23,9848022336,Bhubaneswar)})
(24,{(8,Roshan,Shaikh,24,9848022333,Chennai),(7,pulkit,pawar,24,9848022334,trivandrum)})
Thus, after grouping the data, we can see the schema of the relation using the describe command.
grunt> Describe group_data;
group_data: {group: int,Employee_details: {(id: int,firstname: chararray,lastname: chararray,age: int,phone: chararray,city: chararray)}}
Similarly, using the illustrate command we can get the sample illustration of the schema.
grunt> illustrate group_data;
The output is −
--------------------------------------------------------------------------------------------------
| group_data | group:int | Employee_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)} |
--------------------------------------------------------------------------------------------------
|            | 21        | {(4,Prerna,Tripathi,21,9848022330,Pune),(1,mehul,chourey,21,9848022337,Hyderabad)} |
|            | 22        | {(2,Ankur,Dutta,22,9848022338,Kolkata),(3,Shubham,Sengar,22,9848022339,Delhi)} |
--------------------------------------------------------------------------------------------------
- Grouping by Multiple Columns
Further, let’s group the relation by age and city.
grunt> group_multiple = GROUP Employee_details by (age, city);
Now, using the Dump operator, we can verify the content of the relation named group_multiple.
grunt> Dump group_multiple;
((21,Pune),{(4,Prerna,Tripathi,21,9848022330,Pune)})
((21,Hyderabad),{(1,mehul,chourey,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Shubham,Sengar,22,9848022339,Delhi)})
((22,Kolkata),{(2,Ankur,Dutta,22,9848022338,Kolkata)})
((23,Chennai),{(6,Monika,sharma,23,9848022335,Chennai)})
((23,Bhubaneswar),{(5,Sagar,Joshi,23,9848022336,Bhubaneswar)})
((24,Chennai),{(8,Roshan,Shaikh,24,9848022333,Chennai)})
((24,trivandrum),{(7,pulkit,pawar,24,9848022334,trivandrum)})
- Group All
Using the ALL keyword, we can also group all the tuples of a relation into a single group.
grunt> group_all = GROUP Employee_details All;
Hence, verify the content of the relation group_all.
grunt> Dump group_all;
(all,{(8,Roshan,Shaikh,24,9848022333,Chennai),(7,pulkit,pawar,24,9848022334,trivandrum),(6,Monika,sharma,23,9848022335,Chennai),(5,Sagar,Joshi,23,9848022336,Bhubaneswar),(4,Prerna,Tripathi,21,9848022330,Pune),(3,Shubham,Sengar,22,9848022339,Delhi),(2,Ankur,Dutta,22,9848022338,Kolkata),(1,mehul,chourey,21,9848022337,Hyderabad)})
b. Cogroup Operator
It works in much the same way as the GROUP operator. The difference is that we normally use the GROUP operator with one relation, whereas we use the COGROUP operator in statements involving two or more relations.
- Grouping Two Relations using Cogroup
Let’s suppose we have two files namely Employee_details.txt and Clients_details.txt in the HDFS directory /pig_data/.
Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai
Clients_details.txt
001,Kajal,22,new york
002,Vaishnavi,23,Kolkata
003,Twinkle,23,Tokyo
004,Manish,25,London
005,Purva,23,Bhubaneswar
006,Vishal,22,Chennai
Also, with the relation names Employee_details and Clients_details respectively we have loaded these files into Pig.
grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
grunt> Clients_details = LOAD 'hdfs://localhost:9000/pig_data/Clients_details.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
Hence, with the key age, let’s group the records/tuples of the relations Employee_details and Clients_details.
grunt> cogroup_data = COGROUP Employee_details by age, Clients_details by age;
- Verification
Using the DUMP operator, verify the relation cogroup_data.
grunt> Dump cogroup_data;
- Output
Now, displaying the contents of the relation named cogroup_data, it will produce the following output.
(21,{(4,Prerna,Tripathi,21,9848022330,Pune),(1,mehul,chourey,21,9848022337,Hyderabad)},{})
(22,{(3,Shubham,Sengar,22,9848022339,Delhi),(2,Ankur,Dutta,22,9848022338,Kolkata)},{(6,Vishal,22,Chennai),(1,Kajal,22,new york)})
(23,{(6,Monika,sharma,23,9848022335,Chennai),(5,Sagar,Joshi,23,9848022336,Bhubaneswar)},{(5,Purva,23,Bhubaneswar),(3,Twinkle,23,Tokyo),(2,Vaishnavi,23,Kolkata)})
(24,{(8,Roshan,Shaikh,24,9848022333,Chennai),(7,pulkit,pawar,24,9848022334,trivandrum)},{})
(25,{},{(4,Manish,25,London)})
So, here the cogroup operator groups the tuples from each relation according to age, where each group represents a particular age value.
For example, consider the first tuple of the result. It is grouped by age 21 and contains two bags:
- The first bag holds all the tuples from the first relation (Employee_details in this case) having age 21.
- The second bag contains all the tuples from the second relation (Clients_details in this case) having age 21.
In case a relation doesn’t have any tuples with the given age value, it returns an empty bag.
c. Join Operator
Basically, to combine records from two or more relations, we use the JOIN operator. While performing a join operation, we declare one field (or a group of fields) from each relation as keys. When these keys match, the two particular tuples are matched; otherwise the records are dropped.
There are several types of joins, such as:
- Self join
- Inner join
- Outer join − left join, right join, and full join
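As a minimal sketch, here is how these joins might look in Pig Latin, reusing the Employee_details and Clients_details relations loaded in the cogroup example above; the relation names join_data, outer_left, Employee_copy, and self_join_data are our own choices:

```pig
-- Inner join: keep only tuples whose city appears in both relations.
grunt> join_data = JOIN Employee_details BY city, Clients_details BY city;
grunt> Dump join_data;

-- Left outer join: keep every Employee_details tuple, with nulls
-- on the Clients_details side where no city matches.
grunt> outer_left = JOIN Employee_details BY city LEFT OUTER, Clients_details BY city;

-- Self join: Pig requires the same data to be loaded under two
-- different aliases before joining a relation with itself.
grunt> Employee_copy = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt'
       USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,
       age:int, phone:chararray, city:chararray);
grunt> self_join_data = JOIN Employee_details BY id, Employee_copy BY id;
```

Each joined tuple concatenates the fields of the matching tuples from both relations.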
d. Cross Operator
It computes the cross-product of two or more relations.
- Syntax
So, the syntax of the CROSS operator is:
grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
- Example
Let’s suppose we have two files namely Users.txt and orders.txt in the /pig_data/ directory of HDFS.
Users.txt
1,Sanjeev,32,Ahmedabad,2000.00
2,Ankit,25,Delhi,1500.00
3,Raj,23,Kota,2000.00
4,Sumit,25,Mumbai,6500.00
5,Pankaj,27,Bhopal,8500.00
6,Vishnu,22,MP,4500.00
7,Ravi,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
Also, with the relations Users and orders, we have loaded these two files into Pig.
grunt> Users = LOAD 'hdfs://localhost:9000/pig_data/Users.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);
Using the cross operator on these two relations, let’s get the cross-product of these two relations.
grunt> cross_data = CROSS Users, orders;
- Verification
Now, using the DUMP operator, verify the relation cross_data.
grunt> Dump cross_data;
- Output
Displaying the contents of the relation cross_data, it will produce the following output.
(7,Ravi,24,Indore,10000,103,2008-05-20 00:00:00,4,2060)
(7,Ravi,24,Indore,10000,101,2009-11-20 00:00:00,2,1560)
(7,Ravi,24,Indore,10000,100,2009-10-08 00:00:00,3,1500)
(7,Ravi,24,Indore,10000,102,2009-10-08 00:00:00,3,3000)
(6,Vishnu,22,MP,4500,103,2008-05-20 00:00:00,4,2060)
(6,Vishnu,22,MP,4500,101,2009-11-20 00:00:00,2,1560)
(6,Vishnu,22,MP,4500,100,2009-10-08 00:00:00,3,1500)
(6,Vishnu,22,MP,4500,102,2009-10-08 00:00:00,3,3000)
(5,Pankaj,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060)
(5,Pankaj,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560)
(5,Pankaj,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500)
(5,Pankaj,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000)
(4,Sumit,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(4,Sumit,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560)
(4,Sumit,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500)
(4,Sumit,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000)
(3,Raj,23,Kota,2000,103,2008-05-20 00:00:00,4,2060)
(3,Raj,23,Kota,2000,101,2009-11-20 00:00:00,2,1560)
(3,Raj,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,Raj,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(2,Ankit,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060)
(2,Ankit,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(2,Ankit,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Ankit,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000)
(1,Sanjeev,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060)
(1,Sanjeev,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560)
(1,Sanjeev,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500)
(1,Sanjeev,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)
iii. Combining & Splitting: Apache Pig Operators
These are of two types:
- Union
- Split
a. Union Operator
To merge the content of two relations, we use the UNION operator of Pig Latin. Also, make sure, to perform UNION operation on two relations, their columns and domains must be identical.
- Syntax
So, the syntax of the UNION operator is:
grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
- Example
Let’s suppose we have two files namely Employee_data1.txt and Employee_data2.txt in the /pig_data/ directory of HDFS.
Employee_data1.txt
001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai
Employee_data2.txt
7,Prachi,Yadav,9848022334,trivendram
8,Avikal,Singh,9848022333,Chennai
Also, with the relations Employee1 and Employee2 we have loaded these two files into Pig.
grunt> Employee1 = LOAD 'hdfs://localhost:9000/pig_data/Employee_data1.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> Employee2 = LOAD 'hdfs://localhost:9000/pig_data/Employee_data2.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Using the UNION operator, let’s now merge the contents of these two relations.
grunt> Employee = UNION Employee1, Employee2;
- Verification
Now, using the DUMP operator, verify the relation Employee.
grunt> Dump Employee;
- Output
Now, displaying the contents of the relation Employee, it will display the following output.
(1,mehul,chourey,9848022337,Hyderabad)
(2,Ankur,Dutta,9848022338,Kolkata)
(3,Shubham,Sengar,9848022339,Delhi)
(4,Prerna,Tripathi,9848022330,Pune)
(5,Sagar,Joshi,9848022336,Bhubaneswar)
(6,Monika,sharma,9848022335,Chennai)
(7,Prachi,Yadav,9848022334,trivendram)
(8,Avikal,Singh,9848022333,Chennai)
b. Split Operator
To split a relation into two or more relations, we use the SPLIT operator.
- Syntax
So, the syntax of the SPLIT operator is-
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
- Example
Let’s suppose that we have a file named Employee_details.txt in the HDFS directory /pig_data/ as shown below.
Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai
Also, with the relation name Employee_details, we have loaded this file into Pig.
Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Now, let’s split the relation into two:
- the first listing the employees of age less than 23,
- the second listing the employees of age from 23 to 25.
SPLIT Employee_details into Employee_details1 if age<23, Employee_details2 if (age>22 and age<=25);
- Verification
Using the DUMP operator, verify the relations Employee_details1 and Employee_details2.
grunt> Dump Employee_details1;
grunt> Dump Employee_details2;
- Output
By displaying the contents of the relations Employee_details1 and Employee_details2 respectively, it will produce the following output.
grunt> Dump Employee_details1;
(1,mehul,chourey,21,9848022337,Hyderabad)
(2,Ankur,Dutta,22,9848022338,Kolkata)
(3,Shubham,Sengar,22,9848022339,Delhi)
(4,Prerna,Tripathi,21,9848022330,Pune)
grunt> Dump Employee_details2;
(5,Sagar,Joshi,23,9848022336,Bhubaneswar)
(6,Monika,sharma,23,9848022335,Chennai)
(7,pulkit,pawar,24,9848022334,trivandrum)
(8,Roshan,Shaikh,24,9848022333,Chennai)
iv. Filtering: Apache Pig Operators
These are of three types:
- Filter
- Distinct
- Foreach
Now, let’s discuss, each in detail:
a. Filter Operator
To select the required tuples from a relation based on a condition, we use the FILTER operator.
- Syntax
So, the syntax of the FILTER operator is:
grunt> Relation2_name = FILTER Relation1_name BY (condition);
- Example
Let’s suppose we have a file named Employee_details.txt in the HDFS directory /pig_data/
Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai
Also, with the relation name Employee_details we have loaded this file into Pig.
grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Now, to get the details of the employees who belong to the city Chennai, let’s use the FILTER operator.
grunt> filter_data = FILTER Employee_details BY city == 'Chennai';
- Verification
Using the DUMP operator, verify the relation filter_data.
grunt> Dump filter_data;
- Output
By displaying the contents of the relation filter_data, it will produce the following output.
(6,Monika,sharma,23,9848022335,Chennai)
(8,Roshan,Shaikh,24,9848022333,Chennai)
b. Distinct Operator
To remove redundant (duplicate) tuples from a relation, we use the DISTINCT operator.
- Syntax
So, the syntax of the DISTINCT operator is:
grunt> Relation_name2 = DISTINCT Relation_name1;
- Example
Let’s suppose that we have a file named Employee_details.txt in the HDFS directory /pig_data/ as shown below.
Employee_details.txt
001,mehul,chourey,9848022337,Hyderabad
002,Ankur,Dutta,9848022338,Kolkata
002,Ankur,Dutta,9848022338,Kolkata
003,Shubham,Sengar,9848022339,Delhi
003,Shubham,Sengar,9848022339,Delhi
004,Prerna,Tripathi,9848022330,Pune
005,Sagar,Joshi,9848022336,Bhubaneswar
006,Monika,sharma,9848022335,Chennai
006,Monika,sharma,9848022335,Chennai
Also, with the relation name Employee_details, we have loaded this file into Pig
grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Now, using the DISTINCT operator, remove the redundant (duplicate) tuples from the relation named Employee_details, and store the result in another relation named distinct_data.
grunt> distinct_data = DISTINCT Employee_details;
- Verification
Using the DUMP operator, verify the relation distinct_data.
grunt> Dump distinct_data;
- Output
By displaying the contents of the relation distinct_data, it will produce the following output.
(1,mehul,chourey,9848022337,Hyderabad)
(2,Ankur,Dutta,9848022338,Kolkata)
(3,Shubham,Sengar,9848022339,Delhi)
(4,Prerna,Tripathi,9848022330,Pune)
(5,Sagar,Joshi,9848022336,Bhubaneswar)
(6,Monika,sharma,9848022335,Chennai)
c. Foreach Operator
To generate specified data transformations based on the column data, we use the FOREACH operator.
- Syntax
So, the syntax of the FOREACH operator is:
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
- Example
Let’s suppose we have a file named Employee_details.txt in the HDFS directory /pig_data/.
Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai
Also, with the relation name Employee_details, we have loaded this file into Pig.
grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
Now, using the FOREACH operator, let’s get the id, age, and city values of each employee from the relation Employee_details and store the result in another relation named foreach_data.
grunt> foreach_data = FOREACH Employee_details GENERATE id,age,city;
- Verification
Also, using the DUMP operator, verify the relation foreach_data.
grunt> Dump foreach_data;
- Output
By displaying the contents of the relation foreach_data, it will produce the following output.
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhubaneswar)
(6,23,Chennai)
(7,24,trivandrum)
(8,24,Chennai)
v. Sorting: Apache Pig Operators
These are of two types:
- Order By
- Limit
Let’s discuss both in detail:
a. ORDER BY operator
To display the contents of a relation in a sorted order based on one or more fields, we use the ORDER BY operator.
- Syntax
So, the syntax of the ORDER BY operator is-
grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);
- Example
Let’s suppose we have a file named Employee_details.txt in the HDFS directory /pig_data/.
Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai
Also with the relation name Employee_details, we have loaded this file into Pig.
grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
Now, using the ORDER BY operator, let’s sort the relation in descending order on the basis of the age of the employees, and store the result into another relation named order_by_data.
grunt> order_by_data = ORDER Employee_details BY age DESC;
- Verification
Further, using the DUMP operator verify the relation order_by_data.
grunt> Dump order_by_data;
- Output
By displaying the contents of the relation order_by_data, it will produce the following output.
(8,Roshan,Shaikh,24,9848022333,Chennai)
(7,pulkit,pawar,24,9848022334,trivandrum)
(6,Monika,sharma,23,9848022335,Chennai)
(5,Sagar,Joshi,23,9848022336,Bhubaneswar)
(3,Shubham,Sengar,22,9848022339,Delhi)
(2,Ankur,Dutta,22,9848022338,Kolkata)
(4,Prerna,Tripathi,21,9848022330,Pune)
(1,mehul,chourey,21,9848022337,Hyderabad)
b. LIMIT operator
In order to get a limited number of tuples from a relation, we use the LIMIT operator.
- Syntax
So, the syntax of the LIMIT operator is-
grunt> Result = LIMIT Relation_name number_of_tuples;
- Example
Assume that we have a file named Employee_details.txt in the HDFS directory /pig_data/ as shown below.
Employee_details.txt
001,mehul,chourey,21,9848022337,Hyderabad
002,Ankur,Dutta,22,9848022338,Kolkata
003,Shubham,Sengar,22,9848022339,Delhi
004,Prerna,Tripathi,21,9848022330,Pune
005,Sagar,Joshi,23,9848022336,Bhubaneswar
006,Monika,sharma,23,9848022335,Chennai
007,pulkit,pawar,24,9848022334,trivandrum
008,Roshan,Shaikh,24,9848022333,Chennai
Also, with the relation name Employee_details, we have loaded this file into Pig.
grunt> Employee_details = LOAD 'hdfs://localhost:9000/pig_data/Employee_details.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
Now, using the LIMIT operator, let’s get only the first four tuples of the relation Employee_details and store them into another relation named limit_data.
grunt> limit_data = LIMIT Employee_details 4;
- Verification
Further, using the DUMP operator, verify the relation limit_data.
grunt> Dump limit_data;
- Output
By displaying the contents of the relation limit_data, it will produce the following output.
(1,mehul,chourey,21,9848022337,Hyderabad)
(2,Ankur,Dutta,22,9848022338,Kolkata)
(3,Shubham,Sengar,22,9848022339,Delhi)
(4,Prerna,Tripathi,21,9848022330,Pune)
This was all on Apache Pig Operators.
Conclusion: Apache Pig Operators
As a result, we have seen all the Apache Pig operators in detail, along with their examples. However, if any query occurs, feel free to ask in the comments.