Sunday, July 1, 2018

HDFS - Rack Awareness

The placement of DataNodes in a Hadoop cluster plays a pivotal role: it has a large impact on the performance of the Hadoop file system.

What is Rack Awareness?

Rack awareness means understanding the cluster topology: it describes how the nodes are distributed across the racks of the cluster. Normally, the latency between DataNodes on the same rack is lower than the latency between DataNodes on different racks.
In a Hadoop cluster, HDFS block placement uses rack awareness for fault tolerance by placing one block replica on a different rack. This keeps the data available in the event of a network switch failure or a partition within the cluster.
In large Hadoop clusters, to reduce network traffic while reading or writing HDFS files, the NameNode chooses DataNodes that are on the same rack as the client node making the read/write request, or on a nearby rack.
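The idea of "same rack or a nearby rack" can be made concrete with a small sketch. It assumes each node is identified by a topology path such as '/rack1/host1' and measures closeness as the number of hops through a tree of racks and hosts. The function names and sample paths are hypothetical and only illustrate the idea; this is not Hadoop's own implementation.

```python
def distance(path_a, path_b):
    """Count tree hops between two nodes given paths like '/rack1/host1'."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    # Count how many leading path components the two nodes share.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Hops up from each node to the closest shared ancestor.
    return (len(a) - common) + (len(b) - common)


def sort_by_proximity(client, replicas):
    """Order replica locations from nearest to farthest from the client."""
    return sorted(replicas, key=lambda r: distance(client, r))


if __name__ == "__main__":
    client = "/rack1/host1"
    replicas = ["/rack2/host3", "/rack1/host2", "/rack1/host1"]
    print(sort_by_proximity(client, replicas))
```

With these paths, a replica on the client's own node has distance 0, one on the same rack has distance 2, and one on another rack has distance 4, so a reader would be pointed at the local or same-rack copy first.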

Rack Configuration :-

Hadoop manages the DataNodes inside the racks using rack IDs. The Hadoop daemons obtain the rack ID of each worker node in the cluster by invoking a Java class. The topology information is returned in the form '/myrack/myhost', where 'myrack' is the rack ID and 'myhost' is the individual host.
Suppose we have a topology path of '/192.168.1.23/192.1.1.52'. Here 192.168.1.23 refers to the rack ID and 192.1.1.52 refers to the individual host identifier.
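Besides a Java class, Hadoop can also resolve racks through an external script named in the `net.topology.script.file.name` property. The script receives one or more host addresses as arguments and prints a rack ID for each. Below is a minimal sketch of such a script. The addresses and rack names in `RACK_TABLE` are made up for illustration, and a real cluster would load its own mapping.

```python
import sys

# Rack assigned to hosts that are missing from the table.
DEFAULT_RACK = "/default-rack"

# Hypothetical host-to-rack mapping, for illustration only.
RACK_TABLE = {
    "192.168.1.10": "/rack1",
    "192.168.1.11": "/rack1",
    "192.168.2.10": "/rack2",
}


def resolve_racks(hosts, table=RACK_TABLE):
    """Return one rack ID per host, in the same order as the input."""
    return [table.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print one rack ID for each host passed on the command line.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

Falling back to a default rack for unknown hosts keeps the script from failing when a new node joins before the table is updated.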

Replica Placement via Rack Awareness :-

The position of DataNodes inside the racks plays an important role in the performance and reliability of HDFS. No more than one replica is placed on the same node, and no more than two replicas are placed on the same rack. This policy exists mainly to avoid data loss when a rack fails. The aggregate bandwidth between nodes on the same rack is much greater than that between nodes on different racks.
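The two placement rules above can be expressed as a small check. This is only a sketch of the constraints as stated here, not the NameNode's actual placement code. The function name, node names, and rack names are hypothetical.

```python
from collections import Counter


def is_valid_placement(replicas, rack_of, max_per_rack=2):
    """Check a list of replica nodes against the two placement rules.

    replicas: node names holding a copy of the block.
    rack_of:  dict mapping each node name to its rack ID.
    """
    # Rule 1: no more than one replica on the same node.
    if len(set(replicas)) != len(replicas):
        return False
    # Rule 2: no more than max_per_rack replicas on the same rack.
    per_rack = Counter(rack_of[node] for node in replicas)
    return all(count <= max_per_rack for count in per_rack.values())


if __name__ == "__main__":
    rack_of = {"n1": "/rack1", "n2": "/rack1", "n3": "/rack1", "n4": "/rack2"}
    print(is_valid_placement(["n1", "n2", "n4"], rack_of))  # two racks used
    print(is_valid_placement(["n1", "n2", "n3"], rack_of))  # all on one rack
```

A placement that spreads three replicas over two racks passes the check, while one that puts all three on a single rack does not, since a failure of that rack would take every copy offline.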
