hdfs 的 dataNode 启动卡死的排错过程

在启动hdfs的时候发现,有一个dataNode一直启动不起来,查看日志发现Time to add replicas to map for block pool 这样的日志,表示hdfs正在扫描数据盘,把数据文件名打包上传给nameNode,但是数据盘有3个,才扫描了两个,卡在第三个数据盘扫描出了问题  
 1 2021-07-24 11:37:50,731 INFO  checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data01/hadoop/hdfs/data/current
 2 2021-07-24 11:37:50,738 INFO  checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data01/hadoop/hdfs/data/current
 3 2021-07-24 11:37:50,739 INFO  checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data02/hadoop/hdfs/data/current
 4 2021-07-24 11:37:50,740 INFO  checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data02/hadoop/hdfs/data/current
 5 2021-07-24 11:37:50,740 INFO  checker.ThrottledAsyncChecker (ThrottledAsyncChecker.java:schedule(122)) - Scheduling a check for /cloud/data03/hadoop/hdfs/data/current
 6 2021-07-24 11:37:50,740 INFO  checker.DatasetVolumeChecker (DatasetVolumeChecker.java:checkAllVolumes(210)) - Scheduled health check for volume /cloud/data03/hadoop/hdfs/data/current
 7 2021-07-24 11:37:50,740 INFO  impl.FsDatasetImpl (FsDatasetImpl.java:addBlockPool(2635)) - Adding block pool BP-1188018203-192.168.50.56-1615989660288
 8 2021-07-24 11:37:50,741 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current...
 9 2021-07-24 11:37:50,741 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data02/hadoop/hdfs/data/current...
10 2021-07-24 11:37:50,741 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(392)) - Scanning block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current...
11 2021-07-24 11:37:50,757 INFO  impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data01/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5512085090304
12 2021-07-24 11:37:50,757 INFO  impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data03/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5497424637952
13 2021-07-24 11:37:50,760 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data01/hadoop/hdfs/data/current: 19ms
14 2021-07-24 11:37:50,760 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data03/hadoop/hdfs/data/current: 19ms
15 2021-07-24 11:37:50,773 INFO  impl.FsDatasetImpl (BlockPoolSlice.java:loadDfsUsed(251)) - Cached dfsUsed found for /cloud/data02/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current: 5591791809340
16 2021-07-24 11:37:50,773 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(397)) - Time taken to scan block pool BP-1188018203-192.168.50.56-1615989660288 on /cloud/data02/hadoop/hdfs/data/current: 32ms
17 2021-07-24 11:37:50,774 INFO  impl.FsDatasetImpl (FsVolumeList.java:addBlockPool(423)) - Total time to scan all replicas for block pool BP-1188018203-192.168.50.56-1615989660288: 33ms
18 2021-07-24 11:37:50,776 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current...
19 2021-07-24 11:37:50,776 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data02/hadoop/hdfs/data/current...
20 2021-07-24 11:37:50,776 INFO  impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data01/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist
21 2021-07-24 11:37:50,776 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(188)) - Adding replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current...
22 2021-07-24 11:37:50,776 INFO  impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data02/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist
23 2021-07-24 11:37:50,777 INFO  impl.BlockPoolSlice (BlockPoolSlice.java:readReplicasFromCache(738)) - Replica Cache file: /cloud/data03/hadoop/hdfs/data/current/BP-1188018203-192.168.50.56-1615989660288/current/replicas doesn't exist
24 2021-07-24 11:39:00,774 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(193)) - Time to add replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data03/hadoop/hdfs/data/current: 69998ms
25 2021-07-24 11:39:02,396 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(193)) - Time to add replicas to map for block pool BP-1188018203-192.168.50.56-1615989660288 on volume /cloud/data01/hadoop/hdfs/data/current: 71620ms

 

  然后就自然会怀疑io,通过iostst -kxd 2 查看到第三个数据盘的io都在98~100%之间 hdfs 的 dataNode 启动卡死的排错过程

 

 

  然后通过iotop -oP 找到了消耗io最高的进程,好家伙,发现居然有3个进程在对这个数据路径进行du,不卡死才怪,应该是由于刚刚我频繁重启dataNode,重启时,之前的du还未结束杀死,就又增加了一个du进程,马上手动杀死这三个进程,然后重启dataNode,成功 hdfs 的 dataNode 启动卡死的排错过程

 

 

  但是在偶尔的重启中也发现,第三个数据盘的du时快时慢,偶尔会卡死,所以怀疑硬件有些问题,后面准备和运维工程师进行沟通

上一篇:HDFS集群数据不均衡处理


下一篇:Hadoop的三大核心组件