Tuesday, June 26, 2018

HDFS - Replication Factor

Replication Factor is a property in HDFS that signifies the number of replicas of each block that are kept in the cluster. We already know that the data in the Hadoop distributed file system is stored on commodity hardware, which can fail at any time. To maintain high data availability, we need to keep replicas of the data. The default replication factor in HDFS is 3.
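As a quick check, the replication factor of an existing file can be printed with the stat command of the HDFS shell (the path below is just a placeholder):

hadoop fs -stat %r /user/data/sample.txt

The %r format option makes stat print the replication factor set on that file.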

When the number of replicas of a block is below the replication factor, the block is called an under-replicated block.
                             Replication < Replication Factor

Example :- Let us assume that we have 10 TB of storage and the data size is 4 TB. With the default replication factor of 3, we would need 4 TB x 3 = 12 TB to store every replica, which is more than the 10 TB available. In such a case, not all blocks can have all their replicas. This is a scenario where we have under-replicated data.

When the number of replicas of a block is more than the replication factor, the block is called an over-replicated block.
                            Replication  > Replication Factor

Over-replicated blocks normally occur when a crashed DataNode comes back online after the NameNode has already re-replicated its blocks elsewhere; the NameNode then removes the excess replicas.
We can check file details such as the replication factor, corrupt files, under-replicated blocks and many more with the help of the hdfs fsck / command. The / here denotes the root directory, so the whole file system is checked.
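Running hdfs fsck / prints a summary of the file system health. The output below is only an illustrative sample with made-up numbers, but the fields shown are the ones fsck reports:

hdfs fsck /
 Total size:                    4398046511104 B
 Total blocks (validated):      32768 (avg. block size 134217728 B)
 Minimally replicated blocks:   32768 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       512 (1.56 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     2.97
 Corrupt blocks:                0
 Missing replicas:              1024 (1.04 %)
The filesystem under path '/' is HEALTHY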
The default replication factor is configured in the file hdfs-site.xml, which is available inside the Hadoop installation directory (under etc/hadoop in Hadoop 2.x). Note that changing this value only affects files written after the change; existing files keep the replication factor they were created with.

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Block Replication</description>
</property>
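Instead of editing the configuration file cluster-wide, the default can also be overridden for a single write using the generic -D option of the hadoop fs command (the file names below are just placeholders):

hadoop fs -D dfs.replication=2 -put localfile.txt /user/data/

This stores the new file with 2 replicas while leaving the cluster-wide default untouched.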


We can also set the replication factor on a per file or per directory basis via the below command :-
hadoop fs -setrep -w 2 /dir/

When the path is a directory, the command recursively changes the replication factor of every file under it. The -w flag makes the command wait until the replication actually reaches the target, which can take a while on a large directory.
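We can then verify the change with hadoop fs -ls, where the second column of the listing is the replication factor of each file (the file name below is just an example):

hadoop fs -ls /dir/
-rw-r--r--   2 hdfs supergroup  134217728 2018-06-26 10:15 /dir/part-00000

The 2 after the permission bits confirms the new replication factor.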