[Original] Big Data Basics: Hadoop (3) hdfs diskbalancer

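When the disks inside a single HDFS DataNode become unevenly filled (for example, after a new disk is added), diskbalancer has to be run manually. The help for the plan subcommand is: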

# hdfs diskbalancer -help plan
usage: hdfs diskbalancer -plan <hostname> [options]
Creates a plan that describes how much data should be moved between disks.
 
 
    --bandwidth <arg>             Maximum disk bandwidth (MB/s) in integer
                                  to be consumed by diskBalancer. e.g. 10
                                  MB/s.
    --maxerror <arg>              Describes how many errors can be
                                  tolerated while copying between a pair
                                  of disks.
    --out <arg>                   Local path of file to write output to,
                                  if not specified defaults will be used.
    --plan <arg>                  Hostname, IP address or UUID of datanode
                                  for which a plan is created.
    --thresholdPercentage <arg>   Percentage of data skew that is
                                  tolerated before disk balancer starts
                                  working. For example, if total data on a
                                  2 disk node is 100 GB then disk balancer
                                  calculates the expected value on each
                                  disk, which is 50 GB. If the tolerance
                                  is 10% then data on a single disk needs
                                  to be more than 60 GB (50 GB + 10%
                                  tolerance value) for Disk balancer to
                                  balance the disks.
    --v                           Print out the summary of the plan on
                                  console

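The help text for thresholdPercentage is ambiguous: it reads as though balancing is driven by the absolute amount of data on each disk. The code shows what actually happens: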

org.apache.hadoop.hdfs.server.diskbalancer.datamodel.DiskBalancerVolumeSet

/**
 * Computes Volume Data Density. Adding a new volume changes
 * the volumeDataDensity for all volumes. So we throw away
 * our priority queue and recompute everything.
 *
 * we discard failed volumes from this computation.
 *
 * totalCapacity = totalCapacity of this volumeSet
 * totalUsed = totalDfsUsed for this volumeSet
 * idealUsed = totalUsed / totalCapacity
 * dfsUsedRatio = dfsUsedOnAVolume / Capacity On that Volume
 * volumeDataDensity = idealUsed - dfsUsedRatio
 */
public void computeVolumeDataDensity() {
  long totalCapacity = 0;
  long totalUsed = 0;
  sortedQueue.clear();
 
  // when we plan to re-distribute data we need to make
  // sure that we skip failed volumes.
  for (DiskBalancerVolume volume : volumes) {
    if (!volume.isFailed() && !volume.isSkip()) {
 
      if (volume.computeEffectiveCapacity() < 0) {
        skipMisConfiguredVolume(volume);
        continue;
      }
 
      totalCapacity += volume.computeEffectiveCapacity();
      totalUsed += volume.getUsed();
    }
  }
 
  if (totalCapacity != 0) {
    this.idealUsed = truncateDecimals(totalUsed /
        (double) totalCapacity);
  }
 
  for (DiskBalancerVolume volume : volumes) {
    if (!volume.isFailed() && !volume.isSkip()) {
      double dfsUsedRatio =
          truncateDecimals(volume.getUsed() /
              (double) volume.computeEffectiveCapacity());
 
      volume.setVolumeDataDensity(this.idealUsed - dfsUsedRatio);
      sortedQueue.add(volume);
    }
  }
}
 
 
/**
 * Computes whether we need to do any balancing on this volume Set at all.
 * It checks if any disks are out of threshold value
 *
 * @param thresholdPercentage - threshold - in percentage
 *
 * @return true if balancing is needed false otherwise.
 */
public boolean isBalancingNeeded(double thresholdPercentage) {
  double threshold = thresholdPercentage / 100.0d;
 
  if(volumes == null || volumes.size() <= 1) {
    // there is nothing we can do with a single volume.
    // so no planning needed.
    return false;
  }
 
  for (DiskBalancerVolume vol : volumes) {
    boolean notSkip = !vol.isFailed() && !vol.isTransient() && !vol.isSkip();
    Double absDensity =
        truncateDecimals(Math.abs(vol.getVolumeDataDensity()));
 
    if ((absDensity > threshold) && notSkip) {
      return true;
    }
  }
  return false;
}

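Two methods matter here:

computeVolumeDataDensity: computes the data density of each volume, defined as the usage ratio of the whole volume set (idealUsed) minus the usage ratio of that volume (dfsUsedRatio). A negative density means the volume is fuller than the set average, a positive one means it is emptier.
isBalancingNeeded: decides whether the volume set needs balancing, i.e. whether the absolute value of any volume's data density exceeds the configured threshold (thresholdPercentage / 100).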

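So balancing is driven by each disk's usage ratio (used / capacity), not by the absolute amount of space used. The two only coincide when all disks have the same capacity, as in the help text's example.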
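
To make the difference concrete, here is a minimal sketch that replays the same arithmetic on made-up numbers. The class DensitySketch, its nested Volume type and the capacities are hypothetical and are not part of Hadoop. The sketch also leaves out truncateDecimals and the failed/skip/transient checks, so treat it as an illustration of the formula rather than a copy of the real implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Hadoop code.
class DensitySketch {

  // Simplified stand-in for DiskBalancerVolume: capacity and used space in GB.
  static final class Volume {
    final long capacityGb;
    final long usedGb;

    Volume(long capacityGb, long usedGb) {
      this.capacityGb = capacityGb;
      this.usedGb = usedGb;
    }
  }

  // idealUsed = total used / total capacity over the whole volume set.
  static double idealUsed(List<Volume> volumes) {
    long totalCapacity = 0;
    long totalUsed = 0;
    for (Volume v : volumes) {
      totalCapacity += v.capacityGb;
      totalUsed += v.usedGb;
    }
    return totalCapacity == 0 ? 0.0 : totalUsed / (double) totalCapacity;
  }

  // volumeDataDensity = idealUsed - dfsUsedRatio, as in the Javadoc above.
  static double density(Volume v, double idealUsed) {
    return idealUsed - v.usedGb / (double) v.capacityGb;
  }

  // True if any volume's |density| exceeds thresholdPercentage / 100.
  static boolean isBalancingNeeded(List<Volume> volumes, double thresholdPercentage) {
    if (volumes == null || volumes.size() <= 1) {
      return false; // nothing to balance with a single volume
    }
    double threshold = thresholdPercentage / 100.0d;
    double ideal = idealUsed(volumes);
    for (Volume v : volumes) {
      if (Math.abs(density(v, ideal)) > threshold) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // 500 GB vs 2000 GB used, but both disks are 50% full.
    List<Volume> sameRatio = Arrays.asList(new Volume(1000, 500), new Volume(4000, 2000));
    System.out.println(isBalancingNeeded(sameRatio, 10)); // false

    // 500 GB used on each disk, but 50% full vs 12.5% full.
    List<Volume> sameAbsolute = Arrays.asList(new Volume(1000, 500), new Volume(4000, 500));
    System.out.println(isBalancingNeeded(sameAbsolute, 10)); // true
  }
}
```

In the first case the absolute usage differs by 1500 GB, yet every density is 0, so no plan is needed. In the second case both disks hold the same 500 GB, but idealUsed is 0.2 and the smaller disk's density is about -0.3, which exceeds the 10% threshold.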
