What is Hadoop Architecture and its Components?

HDFS cluster is based on the Hadoop Distributed File System (HDFS). Designed for use on commodity hardware, the storage system is scalable, fault-tolerant, and rack-aware. HDFS distinguishes itself from other distributed file systems in several ways.

Hadoop is a framework permitting the storage of large volumes of data on node systems. The Hadoop architecture  allows parallel processing of data using several components:

  • Hadoop HDFS to store data across slave machines
  • Hadoop YARN for resource management in the Hadoop cluster
  • Hadoop MapReduce to process data in a distributed fashion
  • Zookeeper to ensure synchronization across a cluster

This article lets you understand the various Hadoop components that make the Hadoop architecture.

Hadoop HDFS

The Hadoop Distributed File System (HDFS) is Hadoop’s storage layer. Housed on multiple servers, data is divided into blocks based on file size. These blocks are then randomly distributed and stored across slave machines.

HDFS in Hadoop Architecture divides large data into different blocks. Replicated three times by default, each block contains 128 MB of data. Replications operate under two rules:

  1. Two identical blocks cannot be placed on the same DataNode
  2. When a cluster is rack aware, all the replicas of a block cannot be placed on the same rack

 datanode

In this example, blocks A, B, C, and D are replicated three times and placed on different racks. If DataNode 7 crashes, we still have two copies of block C data on DataNode 4 of Rack 1 and DataNode 9 of Rack 3.

There are three components of the Hadoop Distributed File System:  

  1. NameNode (a.k.a. masternode): Contains metadata in RAM and disk
  2. Secondary NameNode: Contains a copy of NameNode’s metadata on disk
  3. Slave Node: Contains the actual data in the form of blocks

Learn Job Critical Skills To Help You Grow!

Post Graduate Program In Data EngineeringExplore Program
Learn Job Critical Skills To Help You Grow!

NameNode

NameNode is the master server. In a non-high availability cluster, there can be only one NameNode. In a high availability cluster, there is a possibility of two NameNodes, and if there are two NameNodes there is no need for a secondary NameNode. 

NameNode holds metadata information on the various DataNodes, their locations, the size of each block, etc. It also helps to execute file system namespace operations, such as opening, closing, renaming files and directories.

Secondary NameNode

The secondary NameNode server is responsible for maintaining a copy of the metadata in the disk. The main purpose of the secondary NameNode is to create a new NameNode in case of failure.

In a high availability cluster, there are two NameNodes: active and standby. The secondary NameNode performs a similar function to the standby NameNode.

Hadoop Cluster - Rack Based Architecture

We know that in a rack-aware cluster, nodes are placed in racks and each rack has its own rack switch. Rack switches are connected to a core switch, which ensures a switch failure will not render a rack unavailable.

HDFS Read and Write Mechanism

HDFS Read and Write mechanisms are parallel activities. To read or write a file in HDFS, a client must interact with the namenode. The namenode checks the privileges of the client and gives permission to read or write on the data blocks.

Datanodes

Datanodes store and maintain the blocks. While there is only one namenode, there can be multiple datanodes, which are responsible for retrieving the blocks when requested by the namenode. Datanodes send the block reports to the namenode every 10 seconds; in this way, the namenode receives information about the datanodes stored in its RAM and disk.

Let us now discuss the next component of the Hadoop architecture - Hadoop YARN.

Hadoop YARN

Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management layer of Hadoop and is responsible for resource allocation and job scheduling. Introduced in the Hadoop 2.0 version, YARN is the middle layer between HDFS and MapReduce in the Hadoop architecture. 

The elements of YARN include:

  • ResourceManager (one per cluster)
  • ApplicationMaster (one per application)
  • NodeManagers (one per node)

Resource Manager

Resource Manager manages the resource allocation in the cluster and is responsible for tracking how many resources are available in the cluster and each node manager’s contribution. It has two main components:

  1. Scheduler: Allocating resources to various running applications and scheduling resources based on the requirements of the application; it doesn’t monitor or track the status of the applications
  2. Application Manager: Accepting job submissions from the client or monitoring and restarting application masters in case of failure

Application Master

Application Master manages the resource needs of individual applications and interacts with the scheduler to acquire the required resources. It connects with the node manager to execute and monitor tasks.

Node Manager

Node Manager tracks running jobs and sends signals (or heartbeats) to the resource manager to relay the status of a node. It also monitors each container’s resource utilization.

Container

Container houses a collection of resources like RAM, CPU, and network bandwidth. Allocations are based on what YARN has calculated for the resources. The container provides the rights to an application to use specific resource amounts.

Steps to Running an Application in YARN

  1. Client submits an application to the ResourceManager
  2. ResourceManager allocates a container
  3. ApplicationMaster contacts the related NodeManager because it needs to use the containers
  4. NodeManager launches the container 
  5. Container executes the ApplicationMaster

Now that you know about YARN, let us continue with the next important component of Hadoop architecture called MapReduce. 

Learn Job Critical Skills To Help You Grow!

Post Graduate Program In Data EngineeringExplore Program
Learn Job Critical Skills To Help You Grow!

MapReduce

MapReduce is a framework conducting distributed and parallel processing of large volumes of data. Written using a number of programming languages, it has two main phases: Map Phase and Reduce Phase.

Map Phase 

Map Phase stores data in the form of blocks. Data is read, processed and given a key-value pair in this phase. It is responsible for running a particular task on one or multiple splits or inputs.

Reduce Phase

The reduce Phase receives the key-value pair from the map phase. The key-value pair is then aggregated into smaller sets and an output is produced. Processes such as shuffling and sorting occur in the reduce phase.

The mapper function handles the input data and runs a function on every input split (known as map tasks). There can be one or multiple map tasks based on the size of the file and the configuration setup. Data is then sorted, shuffled, and moved to the reduce phase, where a reduce function aggregates the data and provides the output.

MapReduce Job Execution

  • The input data is stored in the HDFS and read using an input format. 
  • The file is split into multiple chunks based on the size of the file and the input format. 
  • The default chunk size is 128 MB but can be customized. 
  • The record reader reads the data from the input splits and forwards this information to the mapper. 
  • The mapper breaks the records in every chunk into a list of data elements (or key-value pairs). 
  • The combiner works on the intermediate data created by the map tasks and acts as a mini reducer to reduce the data. 
  • The partitioner decides how many reduce tasks will be required to aggregate the data. 
  • The data is then sorted and shuffled based on their key-value pairs and sent to the reduce function. 
  • Based on the output format decided by the reduce function, the output data is then stored on the HDFS.

Hadoop Common

Hadoop Common is a crucial part of the Hadoop ecosystem, providing essential utilities and libraries that support the core components of Hadoop. It includes Java libraries and files necessary for the functioning of HDFS, YARN, and MapReduce. For instance, when you run a MapReduce job, Hadoop Common supplies the necessary libraries to execute the job across a distributed cluster, ensuring that the job can process data efficiently.

In addition to its utility functions, Hadoop Common plays a key role in managing hardware failures within a Hadoop cluster. Given the large scale of Hadoop deployments, hardware failures are a common occurrence. Hadoop Common addresses this issue by providing automatic failure recovery features. For example, if a node in the cluster fails, Hadoop Common can automatically reassign tasks to other available nodes. This minimizes disruptions and maintains system performance, allowing for continuous data processing without requiring manual intervention.

Advantages of Hadoop Architecture

Here are the main advantages of Hadoop architecture and how it efficiently manages and processes large data sets:

  • Scalability and Cost Efficiency

Hadoop efficiently manages large data volumes by distributing them across numerous affordable servers. This scalability allows businesses to handle extensive data processing without high costs, unlike traditional databases that struggle with expansion.

  • Handling Growing Data

Traditional databases can become costly and inefficient with rapidly growing data sets. Hadoop avoids this issue by storing and processing all data, preventing valuable information from being lost due to budget constraints.

  • Versatility with Data Types

Hadoop can process both structured and unstructured data from various sources, such as social media and emails. This flexibility enables businesses to gain insights from diverse data types, whether for log analysis, data warehousing, or fraud detection.

  • Efficient Processing

Hadoop’s distributed file system reduces storage costs and speeds up processing by keeping data and tools on the same servers. This setup allows for the rapid handling of large amounts of unstructured data.

  • Fault Tolerance

Hadoop ensures data reliability through duplication across multiple nodes. If one node fails, another can provide access to the data, maintaining continuity and protecting against hardware issues.

Disadvantages of Hadoop Architecture

Apart from the advantages, there are several disadvantages to consider with Hadoop architecture:

  • Security Concerns

Managing security in Hadoop can be quite complex. The system does not provide straightforward security models, making it difficult to set up and configure properly. This complexity can leave data vulnerable if not handled correctly. Additionally, Hadoop lacks built-in encryption for storage and network communication, which can be a significant drawback, especially for sensitive information.

  • Java Vulnerabilities

Hadoop relies on Java, which has been widely exploited by cybercriminals. Since Java is a common target for attacks, this dependency can introduce security risks. Organizations using Hadoop need to implement additional safeguards to protect against these vulnerabilities.

  • Issues with Small Files

Hadoop’s Distributed File System (HDFS) excels at handling large volumes of data but is not efficient with small files. When dealing with small amounts of data, HDFS can be less effective and may lead to increased overhead and slower performance.

  • Keeping Up with Updates

To avoid issues, it's crucial for organizations to use the latest stable version of Hadoop. This can be challenging and may require relying on third-party vendors for updates and support, which adds to the complexity and potential costs.

  • Need for Additional Tools

While Hadoop is powerful for data processing, it may not be enough on its own. Companies might miss out on valuable insights and efficiencies if they don't use additional tools or platforms that complement Hadoop’s capabilities for better data collection, aggregation, and integration.

History of Hadoop

Hadoop was initiated in 2002 by Doug Cutting and Mike Cafarella, who were working on the Apache Nutch project, a web crawler. The challenge of managing large data volumes led to the development of Hadoop. This was influenced by Google's 2003 release of the Google File System (GFS) and their 2004 MapReduce paper, which detailed methods for efficient data processing. In 2005, Cutting and Cafarella created the Nutch Distributed File System (NDFS) with MapReduce, setting the stage for Hadoop.

In 2006, Cutting, now at Yahoo, launched Hadoop with the Hadoop Distributed File System (HDFS), naming it after his son's toy elephant. By 2007, Yahoo was running clusters of 1,000 machines. Hadoop gained recognition in 2008 for sorting 1 terabyte of data on a 900-node cluster in just 209 seconds. Since then, Hadoop has evolved with key updates like Hadoop 2.0 in 2013 and Hadoop 3.0 in 2017, incorporating additional tools and gaining broad industry adoption.

Master the Concepts of the Hadoop Framework

Businesses are now capable of making better decisions by gaining actionable insights through big data analytics. The Hadoop Architecture is a major, but one aspect of the entire Hadoop ecosystem. Now that you have learned about HDFC architecture and components of hadoop, maser the A-Z of it and also learn more about other aspects of Big Data with Simplilearn's PCP Data Engineering Course. Apart from gaining hands-on experience with tools like HDFS, YARN, MapReduce, Hive, Impala, Pig, and HBase, you can also start your journey towards achieving Cloudera’s CCA175 Hadoop certification. Explore and enroll today!

FAQs

1. What is the use of Hadoop architecture?

Hadoop architecture is designed for managing and processing large datasets. It distributes data storage and computation across multiple servers, allowing for parallel processing. This approach helps in efficiently handling big data applications by breaking down complex tasks into smaller, manageable pieces that run simultaneously across a cluster of machines.

2. Why is Hadoop important?

Hadoop is important because it scales easily across thousands of servers, making it cost-effective for handling massive data volumes. Its distributed file system provides rapid data access and includes built-in fault tolerance, ensuring that applications continue running even if some servers fail. This reliability and scalability are crucial for managing big data efficiently.

3. What are the main purposes of Hadoop?

Hadoop is primarily used for storing and processing large volumes of data across clusters of computers. It supports big data applications by enabling scalable data management and advanced analytics. Common uses include predictive analytics, data mining, and machine learning, making it a key tool for data science and complex data processing tasks.

About the Author

SimplilearnSimplilearn

Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies.

View More
  • Acknowledgement
  • PMP, PMI, PMBOK, CAPM, PgMP, PfMP, ACP, PBA, RMP, SP, OPM3 and the PMI ATP seal are the registered marks of the Project Management Institute, Inc.