Hadoop is an open-source distributed data-processing framework that handles the storage and processing of vast volumes of data for applications running in a clustered environment. Hadoop can manage structured, semi-structured, and unstructured data. It gives users more flexibility for collecting, processing, and analyzing data than traditional relational database management systems (RDBMS).
Hadoop gives big data professionals a platform for advanced analytics such as predictive analytics, data mining, and machine learning. Its core components include the Hadoop Distributed File System (HDFS), which stores, manipulates, and distributes data across different nodes, and Hadoop YARN, a job scheduling and resource management framework.
What Is a Hadoop Cluster?
A Hadoop cluster is designed to store and analyze large amounts of structured, semi-structured, and unstructured data in a distributed environment. It is often referred to as a shared-nothing system because the only thing that is shared between the nodes is the network itself. Typically, one machine in the cluster is designated as the master (running a daemon such as NameNode or JobTracker). The rest of the machines in the cluster act as workers (running daemons such as DataNode or TaskTracker). JobTracker and TaskTracker are the Hadoop 1 names; later versions use YARN's ResourceManager and NodeManager instead. Hadoop clusters are highly resilient: each block of data stored on one node is replicated to other nodes in the cluster to prevent data loss.
[Figure: a typical simple Hadoop cluster diagram]
The Architecture of a Hadoop Cluster
A cluster architecture is a system of interconnected nodes that work together to run an application. In a cluster architecture, user requests are divided among two or more computer systems, so a single user request is handled and delivered by two or more nodes. In Hadoop, a master node maintains knowledge of what's in the distributed file system (DFS) and hosts two daemons, NameNode and ResourceManager. Worker nodes store the actual data and provide the processing power to run jobs using the DataNode and NodeManager daemons.
The Prerequisites for Setting Up a Cluster
There are some prerequisites before you configure a Hadoop cluster and run MapReduce. First, you must set up Java and SSH keys. Then, the first server you configure is the master.
To configure the master, follow these steps:
1. Open the /etc/hosts file on your local machine and add this line: 192.168.2.30 hadoop1
2. Connect to the server (ideally running a Linux distribution such as Ubuntu) and ensure that you can successfully sign in and sign out
3. After signing out, generate an SSH key pair using this command: $> ssh-keygen -t rsa -b 2048
You will be able to see public and private keys in the .ssh folder after executing the above command.
4. Copy the public key to the server using ssh-copy-id with this command: $> ssh-copy-id hadoop@hadoop1
You will be prompted to type the password for the hadoop1 master node. After entering the password, you will be able to connect to the node without the need for a password.
5. Connect to hadoop1 directly using this command: $> ssh hadoop@hadoop1
You will now be signed in to hadoop1, with a prompt such as hadoop@hadoop1>
6. Configure the repository for installing Java using these commands:
$> sudo add-apt-repository ppa:webupd8team/java
$> sudo apt update
7. Install Java 8 using this command: $> sudo apt install oracle-java8-installer
8. Verify that you are using JDK 1.8 with this command: $> javac -version
9. Before installing Hadoop, upgrade any libraries that have changed since the Java install using this command: $> sudo apt upgrade
Then remove any packages that are no longer needed from your machine: $> sudo apt autoremove
You may need to repeat these steps on each node before configuring it as a Hadoop node.
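Because the steps above are repeated on every node, it can help to confirm that the required tools are present before continuing. The following is a minimal sketch rather than part of the original procedure: the helper names `have` and `check_prereqs` are hypothetical, and the tool list is an assumption drawn from the commands used above.

```shell
#!/bin/sh
# Sketch: report which prerequisite tools are available on this machine.
# The helper names and the tool list are assumptions, not part of Hadoop.

# have CMD: succeed if CMD is found on the PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# check_prereqs CMD...: print one status line per tool and return
# the number of missing tools as the exit status.
check_prereqs() {
    missing=0
    for tool in "$@"; do
        if have "$tool"; then
            echo "ok:      $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

check_prereqs ssh ssh-keygen ssh-copy-id java javac ||
    echo "Install the missing tools before continuing."
```

Running the script on each node before installing Hadoop gives a quick view of what still needs to be set up there.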
How to Set Up a Single-Node Cluster
Before setting up a full-scale multi-node cluster, let's take a look at configuring a single-node cluster.
1. Download Hadoop from Apache by visiting https://hadoop.apache.org/releases.html and finding the latest version of the binary, which will take you to the tar location.
2. Copy the above link and sign in to the single Hadoop server
Use wget http://mirrors.estointernet.in/apache/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz to download the binary.
3. Extract the tar file using tar -xzf hadoop-3.1.2.tar.gz so the files get unpacked, then switch to the hadoop-3.1.2 directory
4. In your .bashrc file, set the Hadoop home path and the related environment variables as follows (this assumes the extracted directory has been renamed or moved to /home/hadoop/hadoop):
export HADOOP_HOME=/home/hadoop/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
5. Update the Hadoop environment file with the JAVA_HOME path as well:
/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
6. Update the default file system property (fs.defaultFS in Hadoop 3) in core-site.xml, which is in the same directory
7. Update hdfs-site.xml, also in the same directory, with the replication setting (dfs.replication)
8. Format the NameNode of the single-node Hadoop system using this command: ./hadoop/bin/hadoop namenode -format
9. Start the configured Hadoop daemons using the following command; messages are printed for each daemon as it starts:
$> start-all.sh
10. You can also verify which daemons are running using the JDK's jps command
$> jps
11. You can stop the server using the stop-all.sh script
$> stop-all.sh
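The configuration values for core-site.xml and hdfs-site.xml in steps 6 and 7 are not shown above. As a hedged sketch, the script below writes minimal versions of both files into a scratch directory. The NameNode address hdfs://localhost:9000 and a replication value of 1 are assumptions (they are the values commonly used for single-node setups), and `write_site_xml` is a hypothetical helper, not a Hadoop tool.

```shell
#!/bin/sh
# Sketch: write minimal single-node core-site.xml and hdfs-site.xml files.
# Assumed values (not from the article): hdfs://localhost:9000 and
# a replication factor of 1.

# write_site_xml FILE NAME VALUE: write a Hadoop configuration file
# that contains a single property.
write_site_xml() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Write into a scratch directory so a real installation is not touched.
conf_dir=$(mktemp -d)
write_site_xml "$conf_dir/core-site.xml" fs.defaultFS hdfs://localhost:9000
write_site_xml "$conf_dir/hdfs-site.xml" dfs.replication 1
cat "$conf_dir/core-site.xml" "$conf_dir/hdfs-site.xml"
```

After reviewing the generated files, they could be copied into the Hadoop configuration directory. For a multi-node cluster, the same helper could be called with the master's hostname in the address and a higher replication value.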
How to Set Up a Multi-Node Cluster
Now that you know how to set up Hadoop on a single node, you can set up a multi-node cluster. Let's configure three Hadoop worker nodes called hadoop2, hadoop3, and hadoop4. You should have a Linux server with Java installed on all three boxes before configuring Hadoop.
1. Add the master and worker nodes 1, 2, and 3 to the /etc/hosts file on the master, with IP addresses like the ones below:
192.168.2.30 hadoop1
192.168.2.32 hadoop2
192.168.2.34 hadoop3
192.168.2.36 hadoop4
2. Copy the SSH public key from hadoop1 to the other three nodes using the following commands (you will be asked for each node's password once) so you can then sign in to these machines seamlessly:
ssh-copy-id hadoop2
ssh-copy-id hadoop3
ssh-copy-id hadoop4
3. Edit /hadoop/etc/hadoop/core-site.xml on hadoop1 (the master) so that the default file system address points to the master node, hadoop1, instead of the single-node value
4. Edit /hadoop/etc/hadoop/hdfs-site.xml to raise the replication value from the single-node setting to three, for the three nodes
5. Edit /hadoop/etc/hadoop/workers to include all four nodes. By default, it contains only localhost; replace that with:
hadoop1
hadoop2
hadoop3
hadoop4
6. Install Hadoop on the other three nodes using the following commands (make sure you are in the directory where you extracted the Hadoop download from the Apache website):
$> scp -rq hadoop hadoop2:/home/hadoop
$> scp -rq hadoop hadoop3:/home/hadoop
$> scp -rq hadoop hadoop4:/home/hadoop
7. Copy the .bashrc file to the other three nodes using the following commands:
$> scp -rq .bashrc hadoop2:/home/hadoop
$> scp -rq .bashrc hadoop3:/home/hadoop
$> scp -rq .bashrc hadoop4:/home/hadoop
8. If you are setting this up for the first time, format the NameNode using this command: $> hadoop namenode -format
9. Start the cluster using this command: $> start-all.sh
You can verify that the daemons have started by running the jps command below.
$> jps
You can sign in to hadoop2 through ssh to see if its daemons are running.
$> ssh hadoop2
$> jps
You can then use stop-all.sh to stop all nodes.
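The per-node copy commands in the steps above can be collected into one loop. The following is a minimal sketch under stated assumptions: the hostnames and target paths follow the steps above, while the `run` and `distribute` helpers and the `DRY_RUN` switch are hypothetical. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: distribute the SSH key, the Hadoop directory, and .bashrc from
# the master to each worker. Prints the commands unless DRY_RUN=0 is set.
# The helper names and the DRY_RUN switch are assumptions.

WORKERS="hadoop2 hadoop3 hadoop4"
DRY_RUN=${DRY_RUN:-1}

# run CMD...: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# distribute HOST: perform the three copy steps for one worker.
distribute() {
    run ssh-copy-id "$1"
    run scp -rq hadoop "$1:/home/hadoop"
    run scp -rq .bashrc "$1:/home/hadoop"
}

for host in $WORKERS; do
    distribute "$host"
done
```

Reviewing the printed commands first, and only then rerunning with DRY_RUN=0 from the directory that contains the hadoop folder, reduces the chance of copying to the wrong path.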
How to Expand a Cluster
Now that there's a master node called hadoop1 and three worker nodes called hadoop2, hadoop3, and hadoop4, let's add another node, hadoop5, to the existing cluster.
1. Stop all Hadoop nodes running in the environment using this command: $> stop-all.sh
2. Edit the /etc/hosts file (with vi, emacs, or another editor) to add the new node:
$> vi /etc/hosts
192.168.2.30 hadoop1
192.168.2.32 hadoop2
192.168.2.34 hadoop3
192.168.2.36 hadoop4
<add the below line>
192.168.2.38 hadoop5
3. Copy the SSH public key to the hadoop5 node using this command so you can sign in:
$> ssh-copy-id hadoop5
Check that you can connect to the hadoop5 server after successfully executing this command: $> ssh hadoop5
4. Edit the workers file at hadoop/etc/hadoop/workers
Add this line:
hadoop5
5. Copy the Hadoop directory over to the hadoop5 node using this command:
$> scp -rq hadoop hadoop5:/home/hadoop
6. Copy the updated workers file to all other nodes through the following commands:
$> scp hadoop/etc/hadoop/workers hadoop2:/home/hadoop/hadoop/etc/hadoop
$> scp hadoop/etc/hadoop/workers hadoop3:/home/hadoop/hadoop/etc/hadoop
$> scp hadoop/etc/hadoop/workers hadoop4:/home/hadoop/hadoop/etc/hadoop
7. Copy the .bashrc file over to the hadoop5 node using this command:
$> scp .bashrc hadoop5:/home/hadoop
8. Start Hadoop using the start script: $> start-all.sh
9. Verify that it's running by connecting to the hadoop5 node using the following commands:
$> ssh hadoop5
$> jps
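When nodes are added more than once, it is easy to list the same host twice in the workers file. The sketch below shows one way to make that edit repeatable; the `add_worker` helper is hypothetical, and the script works on a scratch file rather than a real hadoop/etc/hadoop/workers.

```shell
#!/bin/sh
# Sketch: add a hostname to a workers file only if it is not already listed.
# The add_worker helper and the scratch file are assumptions.

# add_worker FILE HOST: append HOST to FILE unless an identical line exists.
add_worker() {
    if ! grep -qxF "$2" "$1" 2>/dev/null; then
        echo "$2" >> "$1"
    fi
}

# Demonstrate on a scratch copy of the four-node workers file.
workers_file=$(mktemp)
printf '%s\n' hadoop1 hadoop2 hadoop3 hadoop4 > "$workers_file"

add_worker "$workers_file" hadoop5
add_worker "$workers_file" hadoop5   # second call leaves the file unchanged
cat "$workers_file"
```

The `-x` and `-F` options make grep match the whole line as a fixed string, so a host such as hadoop5 is not confused with a longer name like hadoop50.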
How to Run a MapReduce Job in the Hadoop Cluster
Now that the cluster is set up, let's run a small MapReduce program to count the occurrences of each word in a text file in the Hadoop cluster.
1. Copy a text file (a.txt) containing some text to the root directory of HDFS using this command: $> hadoop fs -copyFromLocal a.txt /
2. Run the wordcount program from the examples jar that ships with Hadoop, using this command (adjust the version in the jar name to match your installation): $> hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar wordcount /a.txt /wordcnt-output
3. The above command generates two files, which can be listed through this fs command: $> hadoop fs -ls /wordcnt-output
The two files generated are listed below. The _SUCCESS file is an empty marker indicating that the job ran successfully.
/wordcnt-output/_SUCCESS
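/wordcnt-output/part-r-00000
The part-r-00000 file contains each word along with its count.
4. To sort the words by count and view the most used word, which appears last in the output, copy part-r-00000 to the local file system (for example, with hadoop fs -get) and use this command: $> sort -t$'\t' -nk 2,2 part-r-00000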
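To see what the wordcount job computes without a cluster, the same result can be approximated locally with standard tools. This is only an illustrative sketch: the sample text is made up, and the pipeline mimics the tab-separated layout of part-r-00000 rather than running MapReduce.

```shell
#!/bin/sh
# Sketch: build a local file shaped like the wordcount output and
# find the most frequent word. The sample text is invented.

TAB=$(printf '\t')
work_dir=$(mktemp -d)
printf 'big data big cluster\ndata big\n' > "$work_dir/a.txt"

# Put one word per line, count each distinct word, and write
# "word<TAB>count" lines, the same layout as part-r-00000.
tr -s ' ' '\n' < "$work_dir/a.txt" | sort | uniq -c |
    awk -v OFS='\t' '{ print $2, $1 }' > "$work_dir/part-r-00000"

# Sort numerically on the second field, highest count first,
# and keep only the top line.
sort -t "$TAB" -k 2,2nr "$work_dir/part-r-00000" | head -n 1
```

For this sample the top line is the word big with a count of 3. Sorting in reverse numeric order puts the most used word first, which avoids scrolling to the end of a long output file.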