HDFS Disk Balancer Introduction, Operations & Features

In this Hadoop tutorial, we are going to cover the HDFS Disk Balancer in detail. First we will discuss what the Disk Balancer in Hadoop is, then we will look at how it operates.

We will also discuss the Intra-DataNode Disk Balancer in Hadoop and its algorithm. Finally, we will cover the features of the HDFS Disk Balancer in detail.

Introduction to HDFS Disk Balancer

HDFS Disk Balancer is a command-line tool that distributes data uniformly across all disks of a DataNode. It is completely different from the Balancer, which takes care of cluster-wide data balancing.

HDFS may not always distribute data uniformly across the disks of a DataNode, for the following reasons:

  • A large number of writes and deletes
  • Disk replacement

This can lead to significant skew within a DataNode. The existing HDFS Balancer cannot handle this, because it concerns itself with inter-DataNode skew, not intra-DataNode skew.

So, new Intra-DataNode balancing functionality was introduced to deal with this situation. It is invoked via the HDFS Disk Balancer CLI (the hdfs diskbalancer command).

Disk Balancer works against a given datanode and moves blocks from one disk to another.

Operation of Disk Balancer

HDFS Disk Balancer works by creating a plan (a set of statements) and executing that plan on the DataNode. The plan describes how much data should move between two disks.

A plan consists of multiple move steps. Each move step has a source disk, a destination disk, and a number of bytes to move. A plan can be executed against an operational DataNode.
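
To make the idea of a plan concrete, here is a minimal sketch in Python. It is only an illustration of the concept: the MoveStep class, the apply_plan function, and the disk paths are hypothetical names we made up, and this is not the plan format that Hadoop itself uses.

```python
from dataclasses import dataclass


@dataclass
class MoveStep:
    source_disk: str       # volume to take blocks from
    destination_disk: str  # volume to move blocks to
    bytes_to_move: int     # amount of data this step should move


def apply_plan(used_bytes, plan):
    """Return the per-disk usage after applying every move step in order."""
    result = dict(used_bytes)
    for step in plan:
        if step.bytes_to_move > result[step.source_disk]:
            raise ValueError(f"{step.source_disk} holds less than the requested bytes")
        result[step.source_disk] -= step.bytes_to_move
        result[step.destination_disk] += step.bytes_to_move
    return result


GB = 1024 ** 3
# Hypothetical DataNode: two well-filled disks and one freshly replaced, empty disk.
used = {"/data/disk1": 800 * GB, "/data/disk2": 700 * GB, "/data/disk3": 0}
plan = [
    MoveStep("/data/disk1", "/data/disk3", 300 * GB),
    MoveStep("/data/disk2", "/data/disk3", 200 * GB),
]
after = apply_plan(used, plan)
print({disk: size // GB for disk, size in after.items()})
# {'/data/disk1': 500, '/data/disk2': 500, '/data/disk3': 500}
```

After both steps run, every disk in this made-up example holds the same amount of data.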

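In older Hadoop releases, the HDFS Disk Balancer is not enabled by default (newer releases may enable it out of the box, so check hdfs-default.xml for your version). To enable it, set dfs.disk.balancer.enabled to true in hdfs-site.xml.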
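
The setting above would typically look like the following fragment, placed inside the configuration element of hdfs-site.xml. Only the property name comes from this tutorial; how and when the DataNode picks up the change depends on your deployment.

```xml
<property>
  <name>dfs.disk.balancer.enabled</name>
  <value>true</value>
</property>
```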

HDFS Intra-DataNode DiskBalancer

When a user writes a new block to HDFS, the DataNode uses a volume-choosing policy to choose the disk for the block. Below are two such policies:

  • Round-robin – This policy distributes the new blocks in a uniform way across the available disks.
  • Available space – This policy writes data to the disk that has more free space by percentage.

By default HDFS DataNode uses the Round-robin policy.

A DataNode can still end up with significantly imbalanced volumes due to massive file deletion and addition in HDFS. It is even possible that the available-space-based volume-choosing policy leads to less efficient disk I/O.

For example, every new write will go to a newly added empty disk while the other disks sit idle, thus creating a bottleneck on the new disk.

To reduce the data imbalance issue, the Apache Hadoop community developed several offline scripts. HDFS-1312 also introduced an online disk balancer, which re-balances the volumes on a running DataNode based on various metrics.
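
The two policies and the bottleneck described above can be sketched in a few lines of Python. This is a simplified model under our own assumptions, not Hadoop's actual policy classes: the Volume class and the helper functions are hypothetical, capacity is counted in blocks, and the available-space chooser simply picks the disk with the highest free-space percentage.

```python
from itertools import cycle


class Volume:
    def __init__(self, name, capacity, used=0):
        self.name = name
        self.capacity = capacity  # capacity in blocks (simplification)
        self.used = used

    def free_ratio(self):
        return (self.capacity - self.used) / self.capacity


def round_robin_chooser(volumes):
    """Return a function that hands out the volumes in turn."""
    rotation = cycle(volumes)
    return lambda: next(rotation)


def available_space_chooser(volumes):
    """Return a function that picks the volume with the highest free-space percentage."""
    return lambda: max(volumes, key=Volume.free_ratio)


def write_blocks(volumes, choose, block_count):
    """Write blocks one by one and count how many land on each volume."""
    placed = {v.name: 0 for v in volumes}
    for _ in range(block_count):
        volume = choose()
        volume.used += 1
        placed[volume.name] += 1
    return placed


def fresh_volumes():
    # Two nearly full disks plus one newly added empty disk.
    return [Volume("disk1", 100, 80), Volume("disk2", 100, 80), Volume("disk3", 100, 0)]


vols = fresh_volumes()
rr_counts = write_blocks(vols, round_robin_chooser(vols), 30)
vols = fresh_volumes()
as_counts = write_blocks(vols, available_space_chooser(vols), 30)
print(rr_counts)  # {'disk1': 10, 'disk2': 10, 'disk3': 10}
print(as_counts)  # {'disk1': 0, 'disk2': 0, 'disk3': 30}
```

In this toy run, round-robin keeps spreading writes evenly and so never closes the gap between the old disks and the new one, while the available-space chooser sends every write to the new disk.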

Abilities of HDFS Disk Balancer

1. Data spread report

Users can measure how data is spread across disks and nodes through the following metrics.

a) Volume data density or Intra-node data density

This metric computes how much data is on a node and what the ideal storage on each volume is.

The ideal storage is the total data on that node divided by the total disk capacity of that node. The volume data density is then the difference between the ideal storage and dfsUsedRatio, the used fraction of that volume.

Ideal storage = total used / total capacity
Volume data density = ideal storage – dfsUsedRatio

  • Positive value – This indicates that the disk is under-utilized.
  • Negative value – This indicates that the disk is over-utilized.
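
As a worked example of the formula above, the following Python sketch computes the volume data density for a made-up DataNode with three equally sized disks. The function name and the sample numbers are our own assumptions for illustration.

```python
def volume_data_density(volumes):
    """volumes maps a volume name to (used, capacity) in the same unit."""
    total_used = sum(used for used, _ in volumes.values())
    total_capacity = sum(capacity for _, capacity in volumes.values())
    ideal_storage = total_used / total_capacity
    # Density of one volume = ideal storage - that volume's used ratio.
    return {
        name: ideal_storage - used / capacity
        for name, (used, capacity) in volumes.items()
    }


# Hypothetical node: (used GB, capacity GB) per disk.
sample = {"disk1": (75, 100), "disk2": (50, 100), "disk3": (25, 100)}
vdd = volume_data_density(sample)
print(vdd)  # {'disk1': -0.25, 'disk2': 0.0, 'disk3': 0.25}
```

Here the ideal storage is 150 / 300 = 0.5, so disk1 (75% used) comes out negative and over-utilized, while disk3 (25% used) comes out positive and under-utilized.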

b) Node data density or inter-node data density

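Now that we have calculated the volume data density, we can easily compare nodes to see which ones in the data center need balancing. Node data density is the sum of the absolute values of the volume data densities on a node, so a higher value means a more skewed node.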
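
A short Python sketch can show how nodes are compared. It takes node data density to be the sum of the absolute values of the volume data densities on a node, which is our understanding of the HDFS-1312 design; the node names and numbers are invented.

```python
def node_data_density(volume_densities):
    """Sum of absolute volume data densities; 0 means a perfectly balanced node."""
    return sum(abs(d) for d in volume_densities)


# Hypothetical volume data densities per DataNode.
nodes = {
    "dn1": [-0.25, 0.0, 0.25],
    "dn2": [0.0, 0.0, 0.0],
    "dn3": [-0.5, 0.25, 0.25],
}
densities = {name: node_data_density(values) for name, values in nodes.items()}
print(densities)  # {'dn1': 0.5, 'dn2': 0.0, 'dn3': 1.0}
```

In this example dn3 is the most skewed node and dn2 needs no balancing at all.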

c) Reports

Now that we have the volume data density and the node data density, the disk balancer can report the top 20 nodes in the cluster that have the most skewed data distribution.
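
Such a report boils down to sorting nodes by their node data density. The sketch below assumes the densities have already been computed; top_skewed_nodes is a hypothetical helper and not part of the Disk Balancer CLI.

```python
def top_skewed_nodes(node_densities, top=20):
    """Return (node, density) pairs, most skewed first, limited to `top` entries."""
    ranked = sorted(node_densities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical node data densities.
node_densities = {"dn1": 0.5, "dn2": 0.0, "dn3": 1.0, "dn4": 0.25}
report = top_skewed_nodes(node_densities, top=2)
print(report)  # [('dn3', 1.0), ('dn1', 0.5)]
```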

2. Balance data between volumes while the DataNode is alive

HDFS Disk balancer has the ability to move data from one volume to another.
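
To give a feel for how the amount of data to move could be decided, here is a simplified greedy sketch in Python. It is our own illustration and not Hadoop's actual planner: plan_moves and its threshold parameter are hypothetical, and it only looks at used space and capacity.

```python
def plan_moves(volumes, threshold=0):
    """volumes maps name -> (used, capacity); returns (source, destination, amount) steps."""
    total_used = sum(used for used, _ in volumes.values())
    total_capacity = sum(capacity for _, capacity in volumes.values())
    ideal = total_used / total_capacity
    # Positive surplus: the volume holds more than its ideal share; negative: less.
    surplus = {name: used - ideal * cap for name, (used, cap) in volumes.items()}
    steps = []
    while True:
        source = max(surplus, key=surplus.get)       # most over-utilized volume
        destination = min(surplus, key=surplus.get)  # most under-utilized volume
        amount = int(min(surplus[source], -surplus[destination]))
        if amount <= threshold:
            break  # remaining imbalance is small enough to ignore
        steps.append((source, destination, amount))
        surplus[source] -= amount
        surplus[destination] += amount
    return steps


# Hypothetical node: (used GB, capacity GB) per disk, disk3 was just replaced.
layout = {"disk1": (800, 1000), "disk2": (700, 1000), "disk3": (0, 1000)}
moves = plan_moves(layout)
print(moves)  # [('disk1', 'disk3', 300), ('disk2', 'disk3', 200)]
```

Each iteration pairs the most over-utilized volume with the most under-utilized one, so the steps shrink the imbalance until every volume sits at the ideal storage ratio.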

Conclusion

In conclusion, Disk Balancer is the tool that distributes data evenly across all disks of a DataNode. It works by creating a plan (a set of statements) and executing that plan on the DataNode.

The DataNode itself uses the Round-robin or Available space policy to choose the disk for each new block, and the HDFS Disk Balancer corrects the imbalance that can still build up over time. If you found this blog helpful or have any questions, please share them with us in the comment section. We will be glad to answer them.