Sqoop Tutorial: Your Guide to Managing Big Data on Hadoop the Right Way

The Hadoop ecosystem consists of various tools, each tied to a different specialty. One such tool is Sqoop, which is used to load data from relational database management systems (RDBMS) into Hadoop and to export it back to the RDBMS. Simply put, Sqoop helps professionals work with large amounts of data in Hadoop.

This Sqoop tutorial gives you an in-depth walkthrough for using the Sqoop tool in Hadoop to manage Big Data. It digs into everything from the basics of Sqoop and its architecture, to how to actually use it. 

Let us now begin our Sqoop tutorial by understanding what exactly Sqoop is.

What is Sqoop and Why Use Sqoop?

Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such as relational databases (MS SQL Server, MySQL).


To process data using Hadoop, the data first needs to be loaded into Hadoop clusters from several sources. However, it turned out that the process of loading data from several heterogeneous sources was extremely challenging. The problems administrators encountered included:

  1. Maintaining data consistency
  2. Ensuring efficient utilization of resources
  3. Loading bulk data to Hadoop was not possible
  4. Loading data using scripts was slow

The solution was Sqoop. Using Sqoop in Hadoop helped overcome all the challenges of the traditional approach; it could load bulk data from an RDBMS to Hadoop with ease.

Now that we've understood what Sqoop is and why it is needed, let's learn about the features of Sqoop as the next topic in this Sqoop tutorial.

Sqoop Features

Sqoop has several features that make it helpful in the Big Data world:

  1. Parallel Import/Export

    Sqoop uses the YARN framework to import and export data. This provides fault tolerance on top of parallelism.
  2. Import Results of an SQL Query

    Sqoop enables us to import the results returned from an SQL query into HDFS.
  3. Connectors For All Major RDBMS Databases

    Sqoop provides connectors for multiple RDBMSs, such as MySQL and Microsoft SQL Server.
  4. Kerberos Security Integration

    Sqoop supports the Kerberos computer network authentication protocol, which enables nodes communicating over an insecure network to authenticate users securely.
  5. Provides Full and Incremental Load

    Sqoop can load the entire table or parts of the table with a single command; hence, it supports both full and incremental loads (example commands illustrating some of these features follow this list).
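For illustration, here is a minimal sketch of import commands that exercise a few of these features. The connection string, database, table, column, and directory names below are placeholders and will differ in your environment:

sqoop import --connect jdbc:mysql://dbhost:3306/sales --username user --password pass --query 'SELECT id, name FROM customers WHERE $CONDITIONS' --split-by id -m 4 --target-dir /user/hadoop/customers // Imports the result of an SQL query using 4 parallel mappers; $CONDITIONS is required so Sqoop can split the query

sqoop import --connect jdbc:mysql://dbhost:3306/sales --username user --password pass --table customers --incremental append --check-column id --last-value 1000 --target-dir /user/hadoop/customers // Incremental load: imports only rows whose id is greater than 1000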

After going through the features of Sqoop as a part of this Sqoop tutorial, let us understand the Sqoop architecture. 


Sqoop Architecture

Now, let’s dive deep into the architecture of Sqoop, step by step:

1. The client submits the import/export command to import or export data.

2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse, document-based systems, and a relational database. We have a connector for each of these; connectors help to work with a range of accessible databases.


3. Multiple mappers perform map tasks to load the data onto HDFS.


4. Similarly, numerous map tasks will export the data from HDFS back to the RDBMS using the Sqoop export command (a simple illustration follows).
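As a simple illustration of steps 3 and 4, assume a MySQL table named employees and an already existing target table employees_copy; the connection details and paths below are placeholders:

sqoop import --connect jdbc:mysql://dbhost:3306/companydb --username user --password pass --table employees -m 4 --target-dir /user/hadoop/employees // Four map tasks import the table in parallel

hadoop fs -ls /user/hadoop/employees // Shows part-m-00000 through part-m-00003, one output file per map task

sqoop export --connect jdbc:mysql://dbhost:3306/companydb --username user --password pass --table employees_copy --export-dir /user/hadoop/employees // Map tasks push the files in HDFS back to the RDBMS table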

This Sqoop tutorial now gives you an insight into Sqoop import.

Sqoop Import

The Sqoop import mechanism works as follows:

  1. In this example, a company’s data is present in the RDBMS. The import request is submitted to Sqoop, which then performs an introspection of the database to gather metadata (such as primary key information).
  2. It then submits a map-only job. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS. 

A few of the commonly used arguments in Sqoop import are listed below:

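  1. --connect <jdbc-uri>: JDBC connection string of the source database
  2. --username / --password: Credentials used to connect to the database
  3. --table: Table to import
  4. --columns: Subset of columns to import
  5. --where: WHERE clause used to filter the imported rows
  6. --query: Free-form SQL query whose results are imported
  7. --target-dir: HDFS directory into which the data is written
  8. -m, --num-mappers: Number of parallel map tasks
  9. --split-by: Column used to split the work among mappers
  10. --incremental, --check-column, --last-value: Control incremental imports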

In this Sqoop tutorial, you have learned about Sqoop import; now let's dive in to understand Sqoop export.

Sqoop Export

Let us understand the Sqoop export mechanism stepwise:


  1. The first step is to gather the metadata through introspection.
  2. Sqoop then divides the input dataset into splits and uses individual map tasks to push the splits to the RDBMS. 

Let’s now have a look at a few of the arguments used in Sqoop export:

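  1. --connect <jdbc-uri>: JDBC connection string of the target database
  2. --username / --password: Credentials used to connect to the database
  3. --table: Target table to populate in the RDBMS
  4. --export-dir: HDFS directory containing the data to export
  5. --update-key: Column(s) used to match existing rows for updates
  6. --update-mode: updateonly (default) or allowinsert, to update existing rows and optionally insert new ones
  7. --input-fields-terminated-by: Field delimiter of the input files in HDFS
  8. -m, --num-mappers: Number of parallel map tasks
  9. --staging-table: Intermediate table used to stage the data before inserting it into the target table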

After understanding Sqoop import and export, the next section in this Sqoop tutorial covers the processing that takes place in Sqoop.

Sqoop Processing

Processing takes place step by step, as shown below:

  1. Sqoop runs in the Hadoop cluster.
  2. It imports data from the RDBMS or NoSQL database to HDFS. 
  3. It uses mappers to slice the incoming data into multiple splits and loads the data into HDFS.
  4. It exports data back into the RDBMS while ensuring that the schema of the data in the database is maintained.

Next, let us take a look at the Sqoop demo as a part of this Sqoop tutorial.


Demo on Sqoop

For this Sqoop demo, we will be using the Cloudera QuickStart VM. Once you are in the Cloudera QuickStart VM, you will see a GUI. It opens to the Hue GUI by default; if it doesn’t, click on Hue. Under Hue, go to Query, then choose Editor, and finally choose Sqoop.

You will then get a screen where you can use the Sqoop editor in Hue. For this demo, however, we will switch over to the command line, as the CLI runs faster on many machines. Go ahead and open the command line.

For this hands-on demo, we will be using Oracle VirtualBox Manager and the Cloudera QuickStart VM. Here, we have the databases in MySQL, so to get into the MySQL server, open a terminal in Cloudera, and type:

mysql -u root -pcloudera // This depends on your setup. Here, we have typed the QuickStart VM’s default username (root) and password (cloudera). 

After typing the above commands, you will get into MySQL. Let’s look into a few quick commands here:

show databases; // This will display a list of 12 databases present in MySQL

After typing the above command, MySQL displays the list of databases it contains.

As seen in the output, MySQL comes with a standard set of databases. We will use the retail_db database for our demo. To use this database and check the tables in it:

use retail_db;

By typing the command above, that database will be set as the default in MySQL. To check the list of tables present inside the retail_db database, you have to type the following command:

show tables; 

The above command provides a list of the six tables present inside the database.

Now, we will use the “departments” table from the “retail_db” database. For that, you have to type the following:

select * from departments; // Displays the contents of the departments table. 

After typing the above command, MySQL displays the contents of the departments table.

After this, open another terminal in Cloudera and type the following:

hostname -f //Checks the hostname

sqoop list-databases --connect jdbc:mysql://localhost/ --password cloudera --username root; //Lists the databases using Sqoop

After typing the above command, Sqoop displays the list of databases.

As seen above, the databases match. Now to list the tables present in retail_db using Sqoop, type the following command:

sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root; //Lists the tables present in retail_db using Sqoop

After typing the above command, Sqoop displays the tables present in retail_db; they match the tables listed earlier in MySQL.

After this, we have to connect to the MySQL database and import the departments table to HDFS. Let’s go ahead and run our first import command:

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments; //Executes Map Tasks at the back end

After typing the above command, Sqoop launches a map-only MapReduce job, and the console output shows the map tasks running to completion.

After the code is executed, we can check the imported data in the HDFS web UI (on the QuickStart VM, typically the NameNode UI at localhost:50070). Now, you can go back to the terminal window and type the following:

hadoop fs -ls /user/cloudera //Checks if importing data into HDFS was successful

hadoop fs -cat /user/cloudera/departments/part*  //Views the contents of the department table

After typing the above commands, you can confirm that the departments directory was created under /user/cloudera and view the contents of the imported data.

The next command to be typed is as follows:

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --target-dir /user/cloudera/dept1; //Imports the departments data to a target directory

This process will take some time to run. After that, type the following to view the contents of the departments data inside the dept1 directory:

hadoop fs -cat /user/cloudera/dept1/part*

The output displays the contents of the departments data stored in the dept1 directory.

Let us now move on to filtering data; we can use the following import command to filter specific rows using the where clause:

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments -m 3 --where "department_id>4" --target-dir /user/cloudera/dept2;

The above command will only return the rows with department_id greater than four and will store them in dept2. After running the above command, we have to view the contents of the dept2 directory, and for that, you will have to type the following:

hadoop fs -cat /user/cloudera/dept2/part*

The output after filtering will contain only the rows with department_id greater than four.

Now that we have imported the data, the next stage is to export the data. For that, we have to go back to the MySQL server and create a table to export into. We will create a new table, dept, and export the departments data to it. Create the table below under the MySQL terminal:

create table dept(department_id int not null auto_increment, department_name varchar(45) not null, primary key(department_id));

Now we have to export the departments table to dept table using the following command:

sqoop export --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table dept --export-dir /user/cloudera/dept2;

The above command has to be run in the regular terminal (where Sqoop is available), not at the MySQL prompt. After the command is executed, go back to your MySQL terminal and type:

select * from dept;  

The above command lets you check the contents of the dept table under the MySQL terminal and confirms that the export was successful.

Now, this brings us to the end of this demo on Sqoop. As you saw in this Sqoop tutorial, the commands are easy to understand and implement. 

After learning all about Sqoop, from its basics to its architecture, in this Sqoop tutorial, you can enroll in our Caltech Post Graduate Program in Data Science to get an in-depth understanding of Hadoop installation and configuration, computational frameworks for processing Big Data, Hadoop administrator activities, cluster management with Sqoop, Flume, Pig, Hive, Impala, and Cloudera.

About the Author

Shruti M

Shruti is an engineer and a technophile. She works on several trending technologies. Her hobbies include reading, dancing and learning new languages. Currently, she is learning the Japanese language.
