Apache Pig Tutorial

Before 2006, MapReduce programs could be written only in the Java programming language.

Developers had to keep the map, sort/shuffle, and reduce fundamentals in mind while writing a program, even when they only needed common operations such as joins and filters. The challenges kept building up when maintaining, optimizing, and extending the code, and production time increased as a result. Data flow in MapReduce was also quite rigid, where the output of one task could only be used as the input of another. To overcome these issues, Pig was developed in late 2006 by Yahoo researchers. It later became an Apache open-source project. Pig provides another language, besides Java, in which MapReduce programs can be written.


What is Pig in Hadoop?

Pig is a scripting platform that runs on Hadoop clusters and is designed to process and analyze large datasets. Pig is extensible, self-optimizing, and easy to program.

Programmers can use Pig to write data transformations without knowing Java. Pig uses both structured and unstructured data as input to perform analytics and uses HDFS to store the results.

Pig - Example

Yahoo scientists use grid tools to scan through petabytes of data. Many of them write scripts to test a theory or gain deeper insights; however, in the data factory, data may not be in a standardized state. This makes Pig a good option, as it supports data with partial or unknown schemas as well as semi-structured or unstructured data.

Components of Pig

Pig has two major components:

  • Pig Latin script language
  • A runtime engine

Pig Latin script language

The Pig Latin script is a procedural data-flow language. It contains syntax and commands that can be applied to implement business logic. Examples of Pig Latin commands are LOAD and STORE.
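For instance, a minimal Pig Latin script (the file name and fields here are hypothetical) loads a dataset, applies a piece of business logic, and stores the result:

data = LOAD 'input.txt' AS (name, age);  -- read records into a relation
adults = FILTER data BY age > 18;  -- apply business logic
STORE adults INTO 'output';  -- write the result to the file system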

A runtime engine

The runtime engine is a compiler that produces sequences of MapReduce programs. It uses HDFS to store and retrieve data. It is also used to interact with the Hadoop system (HDFS and MapReduce).

The runtime engine parses, validates, and compiles the script operations into a sequence of MapReduce jobs.


How Pig Works and Stages of Pig Operations

Pig operations can be explained in the following three stages:

Stage 1: Load data and write Pig script

In this stage, data is loaded and the Pig script is written.

A = LOAD 'myfile' AS (x, y, z);  -- load the input with a three-field schema
B = FILTER A BY x > 0;  -- keep only the records where x is positive
C = GROUP B BY x;  -- group the filtered records by x
D = FOREACH C GENERATE group, COUNT(B);  -- count the records in each group
STORE D INTO 'output';  -- write the result to the file system

Stage 2: Pig Operations

In the second stage, the Pig execution engine parses and validates the script. If the script passes these checks, it is optimized, and a logical and physical plan is generated for execution.

The plan is submitted to Hadoop as a MapReduce job. Pig monitors the status of the job using the Hadoop API and reports it to the client.

Stage 3: Execution of the plan

In the final stage, results are either dumped to the screen or stored in HDFS, depending on the user's command.

Let us now understand a few salient features of Pig.

Salient Features of Pig

Developers and analysts like to use Pig as it offers many features. Some of the features are as follows:

  • Provision for step-by-step procedural control and the ability to operate directly over files
  • Schemas that, though optional, can be assigned dynamically
  • Support for User Defined Functions (UDFs) and various data types

Data Model in Pig

As part of its data model, Pig supports four basic types.

  1. Atom: It is a simple atomic value, such as an int, long, double, or chararray (string).
  2. Tuple: It is a sequence of fields that can be of any data type.
  3. Bag: It is a collection of tuples of potentially varying structures and can contain duplicates.
  4. Map: It is an associative array.

The key must be a chararray, but the value can be of any type. By default, Pig treats undeclared fields as bytearrays, which are collections of uninterpreted bytes. Pig can infer a field’s type based on the use of operators that expect a certain type of field. It can also use User Defined Functions, or UDFs, with a known or explicitly set return type. Furthermore, it can infer the field type based on schema information provided by a LOAD function or explicitly declared using an AS clause. Please note that type conversion is lazy, which means the data type is enforced only at the point of execution.
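As a short sketch of how these rules play out (the file name and fields are hypothetical), compare a typed LOAD with an untyped one:

-- fields declared with explicit types via the AS clause
students = LOAD 'students.dat' AS (name:chararray, age:int, gpa:double);

-- undeclared fields default to bytearray; a cast forces the type,
-- but the conversion is only enforced lazily, at execution time
raw = LOAD 'students.dat' AS (name, age, gpa);
adults = FILTER raw BY (int)age >= 18;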

Nested Data Model

Pig Latin has a fully nestable data model with atomic values, tuples, bags (or lists), and maps. This means one data type can be nested within another, as shown in the diagram below.

[Diagram: Pig Latin Nested Data Model]
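As a hypothetical illustration of this nesting, the schema below declares a bag of tuples and a map inside a single record:

orders = LOAD 'orders.dat' AS (
    c_id:int,
    items:bag{t:tuple(sku:chararray, qty:int)},  -- a bag of tuples
    attrs:map[]  -- an associative array
);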

The advantage is that this model is more natural to programmers than flat tuples. It also avoids expensive joins. Now we will look at the different execution modes Pig works in.


Pig Execution Modes

Pig works in two execution modes: Local and MapReduce.

Local mode

In local mode, the Pig engine takes input from the local file system, and the output is stored in the same file system. Local mode is illustrated in the diagram below.

[Diagram: Pig Execution Mode - Local mode]

MapReduce mode

In MapReduce mode, the Pig engine interacts directly with HDFS and executes jobs on MapReduce, as shown in the diagram below.

[Diagram: Pig Execution Mode - MapReduce mode]
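The execution mode is selected with the -x flag when the Pig shell is launched:

$ pig -x local        # read from and write to the local file system
$ pig -x mapreduce    # default mode: use HDFS and run jobs on MapReduce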

Let us now look into interactive modes of Pig.

Pig Interactive Modes

The two modes in which a Pig Latin program can be written are Interactive and Batch.

Interactive mode

Interactive mode means coding and executing the script line by line, as the example below illustrates.

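In this mode, statements are typed one at a time at the Grunt shell prompt, and each relation can be inspected as you go (the relation and file names here are hypothetical):

grunt> A = LOAD 'myfile' AS (x, y, z);
grunt> B = FILTER A BY x > 0;
grunt> DUMP B;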

Batch mode

In batch mode, the whole script is coded in a file with the extension .pig, and the file is executed directly, as shown below.

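For example, the same statements can be saved to a file, say myscript.pig (a hypothetical name), and submitted in one go:

$ pig -x mapreduce myscript.pig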

Since we have already learned about Hive and Impala, which work on SQL, let us now see how Pig is different from SQL.

Pig vs. SQL

Given below are some differences between Pig and SQL.

  • Definition: Pig is a scripting language used to interact with HDFS, whereas SQL is a query language used to interact with databases residing in the database engine.
  • Query style: Pig offers a step-by-step execution style, whereas SQL offers a single-block execution style.
  • Evaluation: Pig performs lazy evaluation, which means that data is processed only when a STORE or DUMP command is encountered, whereas SQL evaluates a query immediately.
  • Pipeline splits: Pipeline splits are supported in Pig; in SQL, a query must be run twice for its result to be materialized as an intermediate result.

Now that we have gone through the differences between Pig and SQL, let us understand them further with an example.

Pig vs. SQL - Example

The example given below will help you understand a SQL command and its equivalent Pig script.

Track customers in Texas who spend more than $2,000.

SQL

SELECT c.c_id, SUM(s.amount) AS CTotal
FROM customers c
JOIN sales s ON c.c_id = s.c_id
WHERE c.city = 'Texas'
GROUP BY c.c_id
HAVING SUM(s.amount) > 2000
ORDER BY CTotal DESC;

The SQL query selects c_id and CTotal, the sum of the amounts, from the customers table joined with the sales table on c_id, keeping only rows where c.city is 'Texas'. It groups by c_id, retains only the groups whose summed amount exceeds 2,000, and orders the result by CTotal in descending order.

Pig

customers = LOAD '/data/customer.dat' AS (c_id, name, city);  -- load customers with a schema
sales = LOAD '/data/sales.dat' AS (s_id, c_id, date, amount);  -- load sales with a schema
customersTX = FILTER customers BY city == 'Texas';  -- keep only Texas customers
joined = JOIN customersTX BY c_id, sales BY c_id;  -- join on the c_id column
grouped = GROUP joined BY customersTX::c_id;  -- group the joined records per customer
summed = FOREACH grouped GENERATE group, SUM(joined.sales::amount) AS total;  -- total per customer
spenders = FILTER summed BY total > 2000;  -- keep customers who spend more than $2,000
sorted = ORDER spenders BY total DESC;  -- sort in descending order
DUMP sorted;  -- display the result

The Pig version builds the same result step by step. You create two relations, customers and sales, loading each with its schema. You then filter the customers by location, in this case Texas. Both relations are joined on the c_id column, and the amounts are summed per c_id. Finally, you isolate the customers who spend more than $2,000 and sort them in descending order.

In the next section of this Apache Pig tutorial, you will learn how to load and store data in the Pig engine using the command console.


Loading and Storing Methods in Pig

In order to load and store data in the Pig engine, we use the loading and storing methods explained below.

Loading

Loading refers to reading data from the file system into a Pig relation. This is done using the keyword LOAD, with the loaded data assigned to a relation variable, as shown in the example below.

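A minimal sketch of the LOAD syntax, assuming a comma-delimited file at a hypothetical HDFS path:

movies = LOAD '/data/movies.csv' USING PigStorage(',')
    AS (id:int, title:chararray, rating:double);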

A series of transformation statements processes the data.

Storing

Storing refers to writing output to the file system. This is done using the keyword STORE, followed by the name of the variable whose data is to be stored, along with the storage location, as shown in the example below.


You can use the keyword DUMP to display the output on the screen instead.
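Continuing the sketch above, the result can either be persisted with STORE or printed with DUMP:

STORE movies INTO '/output/movies' USING PigStorage(',');  -- write to the file system
DUMP movies;  -- display the output on the screen instead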

Pig Script Interpretation

Pig processes Pig Latin statements in the following manner:

  • Pig validates the syntax and semantics of all statements.
  • It type-checks the fields against the schema.
  • It verifies references and performs limited optimization before execution.
  • When Pig encounters a DUMP or STORE, it executes the statements.

A Pig Latin script’s execution plan consists of logical, optimized logical, physical, and MapReduce plans, as shown in the diagram below.

[Diagram: Pig Latin script execution plan]
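You can inspect these plans for any relation with the EXPLAIN command; for example, for the relation D from the earlier script:

grunt> EXPLAIN D;  -- prints the logical, physical, and MapReduce plans for D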

Various Operations Performed by Developers

Some of the operations performed by Big Data and Hadoop developers are listed below; a short sketch follows the list.

  • Filtering: Filtering refers to selecting records based on a conditional clause, such as grade or pay.
  • Transforming: Transforming refers to reshaping the data to extract the logical fields needed.
  • Grouping: Grouping refers to generating groups of meaningful data.
  • Sorting: Sorting refers to arranging the data in ascending or descending order.
  • Combining: Combining refers to performing a union of the data stored in two or more relations.
  • Splitting: Splitting refers to separating the data into two or more relations based on a logical condition.
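The following sketch shows each of these operations in Pig Latin; the employee files, fields, and thresholds are hypothetical:

emp = LOAD 'emp.dat' AS (name:chararray, dept:chararray, grade:int, pay:double);
temp = LOAD 'temp.dat' AS (name:chararray, dept:chararray, grade:int, pay:double);
seniors = FILTER emp BY grade > 5;  -- filtering
annual = FOREACH emp GENERATE name, pay * 12 AS yearly_pay;  -- transforming
by_dept = GROUP emp BY dept;  -- grouping
ranked = ORDER emp BY pay DESC;  -- sorting
staff = UNION emp, temp;  -- combining
SPLIT emp INTO high IF pay > 5000.0, low IF pay <= 5000.0;  -- splitting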

In the next section of this Pig tutorial, we will look at some Pig commands that are frequently used by analysts.

Pig Commands

Given below are some frequently used Pig commands and their functions.

  • load: Reads data from the file system
  • store: Writes data to the file system
  • foreach: Applies expressions to each record and outputs one or more records
  • filter: Applies a predicate and removes records that do not return true
  • group/cogroup: Collects records with the same key from one or more inputs
  • join: Joins two or more inputs based on a key
  • order: Sorts records based on a key
  • distinct: Removes duplicate records
  • union: Merges data sets
  • split: Splits data into two or more sets based on filter conditions
  • stream: Sends all records through a user-provided binary
  • dump: Writes output to stdout
  • limit: Limits the number of records

Getting Datasets for Pig Development

Listed below are some popular URLs from which you can download datasets for Pig development.

  • Books: http://www.gutenberg.org/ (for example, war_and_peace.text)
  • Wikipedia database: https://dumps.wikimedia.org/enwiki/
  • Open datasets on Amazon S3: https://aws.amazon.com/datasets/
  • National climate data: http://cdo.ncdc.noaa.gov/qclcd_ascii

To summarize the tutorial:

  • Pig in Hadoop is a high-level data-flow scripting language with two major components: a runtime engine and the Pig Latin language.
  • Pig runs in two execution modes: local and MapReduce.
  • The Pig engine can be installed by downloading it from a mirror linked on the website pig.apache.org.
  • Three prerequisites must be met before setting up the environment for Pig Latin: all Hadoop services are running properly, Pig is completely installed and configured, and all required datasets are uploaded to HDFS.


Next Step to Success

To learn more and get an in-depth understanding of Hadoop, you can enroll in the Big Data Engineer Master’s Program. This program, offered in collaboration with IBM, provides online training on the popular skills required for a successful career in data engineering. Master the Hadoop Big Data framework, leverage the functionality of AWS services, and use the database management tool MongoDB to store data.
