How To Install Hadoop On Ubuntu

If you've read our previous blogs on Hadoop, you might understand how important it is. For all of you first-time readers, let's brief you on Hadoop before we get started on our guide to installing Hadoop on Ubuntu. 

Big Data is a term that goes hand in hand with Hadoop. Big data, as the name suggests, refers to data sets too massive to be stored in traditional databases. To manage big data, we have Hadoop, a framework for storing big data in a distributed way and processing it in parallel.

If you're going to work on big data, you need to have Hadoop installed on your system. So, how do you install an Apache Hadoop cluster on Ubuntu? There are various distributions of Hadoop: you could set up an Apache Hadoop cluster, which is the core distribution, a Cloudera distribution of Hadoop, or even a Hortonworks distribution (Hortonworks was acquired by Cloudera in 2018). In this blog post, we'll learn how to set up an Apache Hadoop cluster, the internals of setting up a cluster, and the different configuration properties for a cluster setup.

Looking forward to becoming a Hadoop Developer? Check out the Big Data Hadoop Certification Training Course and get certified today.

Setting Up Multiple Machines

First, you need to set up multiple machines, and for that, you must download a Linux disc image. Here, we're using Ubuntu for the cluster setup. You can download it from the web by searching for "Ubuntu disc image iso file download."

[Screenshot: search result for the Ubuntu ISO download]


After you click on the above link, your screen will look like this:

[Screenshot: Ubuntu 16.04 (Xenial Xerus) download page]

Once you've downloaded Oracle VM VirtualBox and the Linux disc image, you're ready to set up machines within your virtualization software, which can then be used to set up a cluster.

Your screen will then look like this:

[Screenshot: Oracle VM VirtualBox Manager window]

To set up a machine, click on New, give it a name, and choose Linux as the type and Ubuntu (64-bit) as the version. At times you might face an issue of not finding the Ubuntu (64-bit) option, and in such a case, you'd have to enable virtualization in your BIOS settings. After choosing the specifications mentioned above, click Next. 

In the next step, you'll allocate RAM depending on how much memory your host machine has. Here, 1.5 GB is chosen. Click Next and select Create a virtual hard disk now, and then click Create. In the next step, choose VMDK and click Next.

The next screen will ask you how you want your hard disk to be allocated. You can choose Dynamically allocated, which means the virtual disk consumes space on the host disk only as you store data on it. Click Next, and now you must set the size of the hard disk. Here, we've provided 20 GB, as that is more than sufficient for machines that will host the Apache Hadoop cluster.

Once again, click on Create. You've now given the basic settings for your machine; however, it doesn't have an installation disc assigned to it yet. Now, click on Settings, then System, where you can increase or decrease the RAM and assign more CPU to your machine. Click on Storage, then click on Empty, and from the drop-down on the right side select the disc image you downloaded earlier. Your screen will look like this:

[Screenshot: VirtualBox Storage settings with the Ubuntu disc image selected]

After Storage, click on Network. For a Hadoop cluster setup, every machine needs its own IP address, so choose Bridged Adapter, then click OK. You're now done with the settings for this machine. Click Start; this boots the machine from the disc image you added and lets you begin setting up your first Ubuntu machine.

Now, wait for a couple of seconds. You'll see an option to try Ubuntu or install Ubuntu; go ahead with installing the Ubuntu distribution of Linux on the machine. While this happens in the background, you can quickly set up one more machine. Click on New, give it a different name, choose Linux, click Next, and give the required RAM. Click Next and choose Create a virtual hard disk now, then click Create. In the next step, choose VMDK and click Next. Now choose Dynamically allocated, click Next to set the size of the hard disk, and click Create.

Once this is done, repeat the steps: click on Settings, choose System/Processor, and give it two CPU cores. Then click on Storage/Empty and choose the disc image. Now, click on Network and choose Bridged Adapter. Once this is done, click OK, and now we can start the second machine as well. This is how you can set up two machines in parallel, and once the machines are ready, you can install the relevant packages and Hadoop to set up the Hadoop cluster.

Now let's look at the m1 machine, which will slowly come up; your screen will show the dialog "Connection Established." After this, the screen will show various installation options. Click on Install Ubuntu, choose Download updates while installing Ubuntu, and click Continue. You'll then see an option that says Erase disk and install Ubuntu; select it, click Install Now, and choose Continue. Next, choose the time zone (your city/country will be displayed) and click Continue. Choose your language and continue. After this step, fill in your credentials.


Download Cloudera QuickStart VM

After performing the above steps, you'd have to wait for a while until the pop up asks you to restart the machine. Go through the same steps for the second machine, as well. Wait until both your machines are set up. Meanwhile, you can download the Cloudera QuickStart VM by looking for it on Google. Select the option illustrated below:


After clicking the link, select your platform and choose VirtualBox. After this, provide your details so you can download the QuickStart VM. Once you have the zip file downloaded, unzip it so you can set up a single-node Cloudera cluster. You can set up the two machines individually or by cloning as well; here, we've shown how to do it individually. Once the Ubuntu installation is done, it will look like this:

[Screenshot: Ubuntu installation complete dialog]

Now, click on Restart.

Similarly, your second machine will also show the above dialog, and you'll need to restart that too. When you look at the first machine, it will ask you to remove the installation medium, so press Enter. This will start your machine. The same applies to the second machine, so let's click on machine 1 and set it up.

First, download some packages for both machines so they can be used for the Hadoop cluster. When the first machine is up, click on the password box, enter the password you set up, and hit Enter. Do the same for the second machine. Go back to machine one and wait for the utilities to come up. Once it's ready, you can start configuring your machine. Take care to configure both machines consistently, so that setting up the cluster later becomes easy.

Check the top-right corner to see if the first machine is connected to the internet. Once it's connected, click the search icon in the top-left corner, type "terminal", and press Enter. Your terminal will open up, and you can start by typing the commands below:

ifconfig  // Displays the IP address of the machine 

sudo su   // Log in as root; enter your password when prompted

vi /etc/hosts // Edit the hosts file to map IP addresses to hostnames

192.168.0.116 m1 // IP address and hostname of the current machine

192.168.0.217 m2 // IP address and hostname of the second machine

Perform the same steps on the second machine as well and save the file. Now check whether machine two can ping the first machine by typing the following on machine two:

ping m1  

Do the same from machine one by typing ping m2, and you'll see that both work perfectly well. In addition to this, you need to disable the firewall by typing:

ufw disable  

Similarly, use the same command on machine two. Remember that neither machine has Java or the other packages installed yet, so let's start installing packages now.

apt-get install wget  // Installs wget, which is used to download files from the internet

apt-get install openssh-server  // Installs the ssh server so the machines can later access each other over ssh without a password

Download Java and Hadoop

Let's open up the browser on machine one and search for Oracle JDK 1.8. Click on the link below:

[Screenshot: Oracle JDK 8 download page]

Once the link opens, accept the license agreement. From the various options displayed, choose a stable version; here, we're choosing the Linux x64 tar file, as we have 64-bit machines. This should download your Java-related tar file.

Additionally, we also need a Hadoop package, so open a different browser and search for archive.apache.org. Click on the link below:

[Screenshot: archive.apache.org listing]

Search for Hadoop, then click on hadoop. Many versions of Hadoop will be displayed; select a stable one. Here, we're choosing hadoop-2.6.5/. After this, download the Hadoop-related tar file by clicking on hadoop-2.6.5.tar.gz. Now both Java and Hadoop are downloading.

It's not required to download Java and Hadoop on the second machine, as we have the tar files and the ssh setup that will help us copy them from machine one to machine two. While the download is happening, go to machine one's terminal and give the user sudo privileges by typing:

visudo  // Edit the sudoers file

hdc ALL=(ALL:ALL) ALL  // Add this below the root line; it gives the hdc user the same privileges as root. After this, save the file and exit.

ls /home/hdc/Downloads/ // The download path for Java and Hadoop

cd /usr/lib // Change directory 

mkdir jvm  // Create directory

cd jvm  // Switch to jvm

tar -xvf /home/hdc/Downloads/jdk-8u201-linux-x64.tar.gz   // Untar (unpack) the JDK package into /usr/lib/jvm

Once the above steps are performed, you'll have Java installed on the machine. 

ls -all   // Your jdk will be displayed

ls jdk1.8.0_201/bin   // Displays the Java related programs

You will now have your Java path. Then type:

cd jdk1.8.0_201/bin/    // Change directory. Then press enter and copy the new path

cd    // Go back to the home directory, where the bash file will be updated

ls -all 

vi .bashrc   // Add the following export lines at the end of the file

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_201

export PATH=$PATH:$JAVA_HOME/bin     

Copy the last two export lines so that we can later add them for the hdc user as well. Save the file and refresh it using:

source .bashrc   // Refresh the file

apt-get install vim   // Provides a good layout when you're editing your files 

vi .bashrc

su - hdc // Updating path for hdc user

vi .bashrc // Then give the copied Java path and save 

source .bashrc   // Refresh again

java -version // The Java version should show up even when you're logged in as the hdc user

Once the Java setup is complete, you'll need to untar the Hadoop-related directory. This needs to be performed in a particular location. 

cd /usr/local   // Change location

sudo tar -xvf /home/hdc/Downloads/hadoop-2.6.5.tar.gz // Hadoop related tar file which you want to unpack

Once this is done, you'll have a Hadoop directory created under /usr/local. To keep the Hadoop path simple, create a link and change the ownership of anything that starts with hadoop to the hdc user. You can follow the steps below:

[Screenshot: creating the Hadoop link and changing ownership]
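The exact commands from that screenshot aren't reproduced here, but a minimal sketch, assuming the hadoop-2.6.5 directory extracted under /usr/local and the hdc user created earlier, would be:

cd /usr/local

sudo ln -s hadoop-2.6.5 hadoop   // Create a short "hadoop" link pointing to the versioned directory

sudo chown -R hdc:hdc /usr/local/hadoop*   // Give the hdc user ownership of everything starting with hadoop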

First set the Hadoop path for the user:

cd 

vi .bashrc 

Now you can add some export lines, which will allow you to run Hadoop commands from anywhere. Once that is done, save the file. Follow these steps:

[Screenshot: Hadoop export lines added to .bashrc]
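The export lines from the screenshot aren't reproduced here; assuming the /usr/local/hadoop link created above, they would look roughly like this:

export HADOOP_HOME=/usr/local/hadoop   // Base directory of the Hadoop install, via the link created earlier

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin   // Makes the hadoop, hdfs, and start/stop scripts available from anywhere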

After this, refresh it. If everything is fine, you should be able to check the Hadoop version:

source .bashrc   // Refresh

hadoop version // Check the Hadoop version

So we've set up Java and unpacked Hadoop. The same things must be repeated on machine two, but before that, we have to set up passwordless ssh access between the two machines.

ssh-keygen -t rsa -P ""   // Sets up an ssh key pair (public and private key) with an empty passphrase

ls -all .ssh 

For passwordless ssh to work, our public key needs to be transferred to the other machine. So now we have to go to the second machine and generate an ssh key on it as well. As we did on the first machine, install vim here too, and make sure the hdc user has sudo privileges. Meanwhile, open another terminal and set up the ssh key for this machine.


Now you need to copy the public key from machine one to two and vice versa. For this, type the following command on the first machine:

ssh-copy-id -i $HOME/.ssh/id_rsa.pub hdc@m2   

Then say yes and enter your password. This process should have copied your public key. Repeat the same command for the second machine by replacing hdc@m2 with hdc@m1.

We also need to copy our public keys to the same machine, as shown below:

[Screenshot: copying the public key to the same machine]
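The command in that screenshot isn't reproduced here; on machine two, copying its own public key to itself would look something like this:

ssh-copy-id -i $HOME/.ssh/id_rsa.pub hdc@m2   // Run on machine two: adds its own public key to its authorized_keys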

Perform the same steps on the first machine. Once this is done, we have two machines that can not only ping each other but also connect to each other through ssh. Now we need to install Java on the second machine by copying the tar file from the first. The steps below show how to copy the Java and Hadoop files to the second machine.

[Screenshot: copying the Java and Hadoop tar files to the second machine]
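The exact commands aren't shown in the text; a sketch using scp, assuming the tar files are still in machine one's Downloads folder and should land in hdc's home directory on m2, would be:

scp /home/hdc/Downloads/jdk-8u201-linux-x64.tar.gz hdc@m2:/home/hdc/   // Copy the JDK tar file to machine two

scp /home/hdc/Downloads/hadoop-2.6.5.tar.gz hdc@m2:/home/hdc/   // Copy the Hadoop tar file to machine two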

After this, go to the second machine and create a directory. Perform the steps below to get the copied tar file and later unpack it.

[Screenshot: creating the directory on machine two and unpacking the tar file]
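On machine two, the steps would be along these lines, assuming the JDK tar file was copied into /home/hdc:

sudo mkdir /usr/lib/jvm   // Create the jvm directory, as we did on machine one

cd /usr/lib/jvm

sudo tar -xvf /home/hdc/jdk-8u201-linux-x64.tar.gz   // Unpack the copied JDK tar file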

Then perform similar steps as you did on machine one and unpack Hadoop. Create a link, change the ownership, and update the bashrc. In addition to this, copy the bashrc file from machine one to machine two. Perform the following steps on your machine one, while you're logged in as hdc:

[Screenshot: copying the .bashrc file from machine one to machine two]
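The command from the screenshot isn't reproduced here; copying the .bashrc file from machine one to machine two could be done like this:

scp ~/.bashrc hdc@m2:/home/hdc/.bashrc   // Copy the updated .bashrc to the hdc user's home directory on machine two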

Go back to machine two, and check if your Java and Hadoop versions are shown correctly:

[Screenshot: java -version and hadoop version output on machine two]

Now we have two machines which have Java and Hadoop packages, and these can be used to set up the Hadoop cluster once we start editing the config files found in the below path on both machines:

[Screenshot: terminal showing the path to the Hadoop configuration files]
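For a Hadoop 2.x tarball install, the config files live under etc/hadoop inside the Hadoop directory; assuming the /usr/local/hadoop link created earlier, the path would be:

cd /usr/local/hadoop/etc/hadoop   // Contains core-site.xml, hdfs-site.xml, mapred-site.xml.template, yarn-site.xml, slaves, and more

ls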

Now we're ready to bring up our two-node cluster for Hadoop. Here on machine one, we'll have a NameNode, DataNode, ResourceManager, and a NodeManager running. On the second machine, we'll have a DataNode, NodeManager, and SecondaryNameNode running.  

To set up the cluster, you need to edit the config files. Sample config files are available on a GitHub link, and we'll copy their contents from there. Shown below are the important config files that we need to update:

[Screenshot: the main config files: core-site.xml, mapred-site.xml, hdfs-site.xml, yarn-site.xml, and slaves]

Go to machine one and type the following:

[Screenshot: opening the core-site file in the editor]

As seen above, the first config file to update is the core-site file. Once you hit Enter, you'll see that there is no configuration specified yet. In Apache Hadoop, we have to download the Hadoop-related package, edit the configs, do the formatting, and only then can we start the cluster.

So now, we navigate to the GitHub link and get the contents of the core-site file, which specifies the HDFS path, the machine on which the NameNode runs, and the port the NameNode will listen on. You need to scroll down and select a few properties, as we're setting up a simple cluster. We have chosen the properties below:

[Screenshot: core-site properties on GitHub]

Copy the properties, go back to your terminal, and paste them. Then change the machine name to m1 and save the file.

[Screenshot: core-site file updated with the machine name m1]
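The exact properties from the GitHub link aren't reproduced here, but a minimal core-site.xml matching this description would look like the sketch below. The port 9000 is an assumption; use whatever port the copied properties specify.

<configuration>
  <!-- HDFS path: the host and port the NameNode listens on -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://m1:9000</value>
  </property>
</configuration>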

So now, the first config file is edited. The next file we have to edit is mapred-site; we need to rename it and then edit it to specify which processing layer we're using.

[Screenshot: renaming and opening the mapred-site file]
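The command in that screenshot isn't reproduced here; the rename step would be roughly:

mv mapred-site.xml.template mapred-site.xml   // The tarball ships only a template, so rename it first

vi mapred-site.xml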

Go to the link again and pick up the contents of the mapred-site file; the processing layer we'll be using here is YARN. Follow the same steps as previously. The screen will look like this:

[Screenshot: mapred-site file configured to use YARN]
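A minimal mapred-site.xml along these lines tells Hadoop to use YARN as the processing layer:

<configuration>
  <!-- Run MapReduce jobs on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>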

The third file is the hdfs-site file, which specifies the replication factor, where the NameNode stores its metadata on disk, and so on. Go to the GitHub site and perform the same steps: copy the various properties and paste them in the terminal. Since we plan to run two DataNodes, we're going with a replication factor of two. You have to rename the paths as seen below and save the file.

[Screenshot: hdfs-site file with the replication factor and storage paths]
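The properties from the GitHub link aren't reproduced here; a minimal hdfs-site.xml matching the description (a replication factor of two plus the metadata and data paths) could look like this. The /usr/local/hadoop_store paths are example assumptions; use whichever paths you configure.

<configuration>
  <!-- Two DataNodes, so two replicas of each block -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- Where the NameNode stores its metadata on disk (example path) -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <!-- Where the DataNodes store HDFS blocks (example path) -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>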

The next file we need to edit is yarn-site. This specifies the ResourceManager, the resource tracker addresses, and other properties. Repeat the process: go to the GitHub site, choose a few properties, and copy them to the machine. Make the changes as shown below:

[Screenshot: yarn-site file with the ResourceManager properties]
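A minimal yarn-site.xml matching the layout described above, with the ResourceManager on m1, might look like this; the shuffle service property is the usual companion setting for running MapReduce on YARN:

<configuration>
  <!-- The ResourceManager runs on machine one -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>m1</value>
  </property>
  <!-- Auxiliary shuffle service needed by MapReduce jobs -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>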

The rest of the properties can be left as they are; save the file. Now YARN is also done. The next file to edit is the slaves file, which tells Hadoop which machines the DataNodes and NodeManagers run on. With that, the five main files have been edited.

[Screenshot: slaves file listing the worker machines]
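Since DataNodes and NodeManagers run on both machines in this plan, the slaves file would simply list both hostnames:

m1
m2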

You can also edit hadoop-env.sh, but it's not mandatory. After this, we copy the files to m2 and edit them on m2 as well. The command for that is shown below:

[Screenshot: command to copy the config files to machine two]
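The command in that screenshot isn't reproduced here; copying the edited config files to the same location on machine two could be done with scp, assuming the same /usr/local/hadoop layout on both machines:

scp /usr/local/hadoop/etc/hadoop/*.xml /usr/local/hadoop/etc/hadoop/slaves hdc@m2:/usr/local/hadoop/etc/hadoop/   // Copy the edited config files to machine two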

Now that all the files are copied to machine two, we can update them accordingly. You can do this without logging into machine two by following these commands:

[Screenshot: editing machine two's config files over ssh]
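One way to inspect and edit machine two's files without logging into it is to run commands over ssh; for example (the -t flag gives ssh an interactive terminal so vi can run):

ssh -t hdc@m2 vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml   // Edit machine two's hdfs-site file remotely

ssh hdc@m2 rm /usr/local/hadoop/etc/hadoop/mapred-site.xml.template   // Remove the leftover template file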

You can look into the various files and see which ones need changes. The core-site and mapred-site files don't need any changes. In addition, you can remove any template files. The hdfs-site file will have the following changes:

[Screenshot: changes to the hdfs-site file on machine two]

We don't make any changes to the yarn-site and slaves files, as what we have is fine. So now all the config files are edited on both machines. Remember to create the directories mentioned in the config files. Once the steps below are performed, the Hadoop cluster is set up:

[Screenshot: creating the directories mentioned in the config files]
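If you used paths like the example hdfs-site.xml shown earlier, creating the directories would look something like this (run on the appropriate machines):

sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode   // On machine one, for the NameNode metadata

sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode   // On both machines, for the DataNode blocks

sudo chown -R hdc:hdc /usr/local/hadoop_store   // Let the hdc user write to these directories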

But before we can start using the cluster, we have to format the NameNode, which creates the initial metadata in the metadata path, using the following command:

[Screenshot: formatting the NameNode]
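The formatting command itself isn't shown above; with the Hadoop binaries on the PATH, it is:

hdfs namenode -format   // Run once, on machine one only, to create the initial NameNode metadata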

It's advised not to perform the formatting more than once. Once the formatting is done and the initial metadata is created, we can use the start-all.sh command to start the cluster.
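A quick sketch of starting and verifying the cluster; jps lists the running Java daemons on each machine:

start-all.sh   // Starts the HDFS and YARN daemons across the cluster

jps   // On m1 you should see NameNode, DataNode, ResourceManager, and NodeManager; on m2, DataNode, NodeManager, and SecondaryNameNode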

Congratulations, you have just set up your Hadoop cluster on Ubuntu!

How skilled are you with the concepts of Big Data? Take this Big Data and Hadoop Developer Practice Test and assess yourself!

Master the Concepts of Hadoop

Businesses are now capable of making better decisions by gaining actionable insights through Big Data analytics. The Hadoop architecture is a major aspect, but only one part, of the entire Hadoop ecosystem.

Learn more about other aspects of Big Data with Simplilearn's Big Data Hadoop Certification Training Course. Apart from gaining hands-on experience with tools like HDFS, YARN, MapReduce, Hive, Impala, Pig, and HBase, you can also start your journey towards achieving Cloudera's CCA175 Big Data certification.

About the Author

Shruti M

Shruti is an engineer and a technophile. She works on several trending technologies. Her hobbies include reading, dancing and learning new languages. Currently, she is learning the Japanese language.
