Caused by: java.io.IOException: On-disk size without header provided is 6
Preface
The error logs in this post are pasted in fairly full detail, so please read through them patiently.
Where the problem came from:
I wrote a Spark program that calls the HBase client API to scan data out of a table in batches. The first batches exported normally, but once a certain batch was reached, the job started failing repeatedly.
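For context, here is a minimal sketch of the kind of batch scan job involved. It is only illustrative: the table name packet_v2 comes from the error logs below, while the object name, row-key range and every other setting are placeholders rather than the real production code.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.{Base64, Bytes}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of a batch export: scan one row-key range of packet_v2 per run.
object HBaseBatchScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-batch-scan"))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "packet_v2")   // table name taken from the error log

    // One "batch" = one row-key range; the keys below are made up.
    val scan = new Scan()
    scan.setStartRow(Bytes.toBytes("batch-start-key"))
    scan.setStopRow(Bytes.toBytes("batch-stop-key"))
    conf.set(TableInputFormat.SCAN, Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))

    // Each Spark task scans its slice of the range against the region servers.
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(s"rows in this batch: ${rdd.count()}")
    sc.stop()
  }
}
```

The error thrown by the failing batch: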
WARN TaskSetManager: Lost task 8.0 in stage 79.0 (TID 38487, cnbjsjqpsgjdn171, executor 4): org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
Tue Jun 29 21:10:51 CST 2021, RpcRetryingCaller{globalStartTime=1624972251344, pause=100, retries=35}, java.io.IOException: java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://nameservice1/hbase/data/default/packet_v2/2b827f7067bbba7cf08dcb187d643c44/cf/7199e17027454ed99f637eb41142e8da, compression=lzo, cacheConf=blockCache=LruBlockCache{blockCount=8851, currentSize=625112608, freeSize=9585108448, maxSize=10210221056, heapSize=625112608, minSize=9699709952, minFactor=0.95, multiSize=4849854976, multiFactor=0.5, singleSize=2424927488, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=2uy5q_1624691294000_1/cf:data/1624691294413/Put, lastKey=2v2l9_1624707579000_4/cf:verify/1624707579921/Put, avgKeyLen=39, avgValueLen=77, entries=1537704, length=66621977, cur=null] to key 2v1gr_1451577600000/cf:/LATEST_TIMESTAMP/DeleteFamily/vlen=0/seqid=0
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:218)
at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:350)
at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:199)
at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2120)
at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2110)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5617)
at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2637)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2623)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2604)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2392)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33648)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163)
Caused by: java.io.IOException: On-disk size without header provided is 62081, but block header contains 0. Block offset: 55383835, data starts with: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
at org.apache.hadoop.hbase.io.hfile.HFileBlock.validateOnDiskSizeWithoutHeader(HFileBlock.java:526)
at org.apache.hadoop.hbase.io.hfile.HFileBlock.access$700(HFileBlock.java:92)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1705)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1548)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:446)
at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:643)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:593)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:297)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:200)
... 14 more
Problem analysis
Seeing RpcRetryingCaller in the error, my first thought was that the RPC timeout I had configured was too small, so I compared HBase's read/write request volume and general load around that time and did find noticeable fluctuations. (The chart below is not the HBase screenshot from when the error occurred; it is only meant as a reference for what to look at.)
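For reference, these are the client-side timeout knobs I mean. The keys are the standard HBase 1.x client settings; the values here are only illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration

object HBaseClientTimeouts {
  // Build an HBase client Configuration with more generous timeouts (values are examples).
  def withLargerTimeouts(): Configuration = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.rpc.timeout", "120000")                   // timeout of a single RPC, in ms
    conf.set("hbase.client.scanner.timeout.period", "120000") // scanner lease / next() timeout, in ms
    conf.set("hbase.client.operation.timeout", "300000")      // upper bound for one whole client operation, in ms
    conf.set("hbase.client.retries.number", "35")             // matches the attempts=35 seen in the log
    conf
  }
}
```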
So I changed the Spark code to raise the RPC timeout along those lines and ran the job again, and the same error appeared once more. The error log:
21/06/29 10:17:54 WARN TaskSetManager: Lost task 8.0 in stage 39.0 (TID 18287, cnbjsjqpsgjdn61, executor 12): org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
Tue Jun 29 10:08:43 CST 2021, RpcRetryingCaller{globalStartTime=1624932523329, pause=100, retries=35}, java.io.IOException: java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://nameservice1/hbase/data/default/packet_v2/08edc75f91a5465776db18092d3035ce/cf/14261d5166da4af59247f94b0fa77582, compression=lzo, cacheConf=blockCache=LruBlockCache{blockCount=8770, currentSize=615286456, freeSize=9594934600, maxSize=10210221056, heapSize=615286456, minSize=9699709952, minFactor=0.95, multiSize=4849854976, multiFactor=0.5, singleSize=2424927488, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=98877a5b6a834c4bbca9706fa9eb696e_1624155462000_1/cf:data/1624155462557/Put, lastKey=988d3aeca4ce43ce9836e4d600415330_1624180198000_4/cf:verify/1624180198123/Put, avgKeyLen=66, avgValueLen=79, entries=719284, length=31701109, cur=null] to key 9889f92362d64e608afa838e402b8f68_1451577600000/cf:/LATEST_TIMESTAMP/DeleteFamily/vlen=0/seqid=0
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:218)
at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:350)
at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:199)
at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2120)
at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2110)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5617)
at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2637)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2623)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2604)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2392)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33648)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163)
Caused by: java.io.IOException: On-disk size without header provided is 18784, but block header contains 0. Block offset: 12257382, data starts with: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
at org.apache.hadoop.hbase.io.hfile.HFileBlock.validateOnDiskSizeWithoutHeader(HFileBlock.java:526)
at org.apache.hadoop.hbase.io.hfile.HFileBlock.access$700(HFileBlock.java:92)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1705)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1548)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:446)
at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:643)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:593)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:297)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:200)
... 14 more
Comparing the two logs, they fail in exactly the same way. Both contain
Could not seek StoreFileScanner
which tells us the HBase API really did go to HBase to pull the data; the problem is that the region server could not seek to that block. Add to that the other part of the error:
On-disk size without header provided is 62081, but block header contains 0. Block offset: 55383835, data starts with: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
In other words, the reader expected a 62081-byte block at that offset, but what came back from disk was all zeros. We can make a first guess that the problem is related to the disks storing the HBase data.
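As a side note, one way to confirm that the file itself is unreadable, rather than the RPC being flaky, is to scan the suspect HFile directly. The sketch below assumes the HBase 1.x HFile reader API (matching the HFileReaderV2 in the stack trace); if the file is corrupt, the scan fails part-way through with the same kind of IOException.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.hfile.{CacheConfig, HFile}

// Read every block of an HFile; pass the hdfs:// path from the error log as args(0).
object HFileReadCheck {
  def main(args: Array[String]): Unit = {
    val conf: Configuration = HBaseConfiguration.create()
    val path = new Path(args(0))
    val reader = HFile.createReader(FileSystem.get(path.toUri, conf), path, new CacheConfig(conf), conf)
    val scanner = reader.getScanner(false, false) // no block cache, no positional read
    var entries = 0L
    if (scanner.seekTo()) {
      entries += 1
      while (scanner.next()) entries += 1 // an IOException here points at a corrupt block
    }
    println(s"read $entries of ${reader.getEntries} entries without error")
    reader.close()
  }
}
```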
Looking at the HDFS files being read when the two errors occurred:
File read in the first error:
hdfs://nameservice1/hbase/data/default/packet_v2/2b827f7067bbba7cf08dcb187d643c44/cf/7199e17027454ed99f637eb41142e8da
File read in the second error:
hdfs://nameservice1/hbase/data/default/packet_v2/08edc75f91a5465776db18092d3035ce/cf/14261d5166da4af59247f94b0fa77582
Open the Hadoop web UI and check which DataNodes in the cluster actually hold the blocks of these two files.
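If the web UI is not handy, the same information can also be pulled programmatically. A small sketch using the Hadoop FileSystem API, with the first file path copied from the log:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Print which DataNodes hold each HDFS block of the suspect HFile.
object HFileBlockLocations {
  def main(args: Array[String]): Unit = {
    val path = new Path("hdfs://nameservice1/hbase/data/default/packet_v2/" +
      "2b827f7067bbba7cf08dcb187d643c44/cf/7199e17027454ed99f637eb41142e8da")
    val fs = FileSystem.get(path.toUri, new Configuration())
    val status = fs.getFileStatus(path)
    // One BlockLocation per HDFS block; getHosts lists the DataNodes holding each replica.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { loc =>
      println(s"offset=${loc.getOffset} length=${loc.getLength} hosts=${loc.getHosts.mkString(",")}")
    }
    fs.close()
  }
}
```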
Comparing the two results, cnbjsjqpsgjdn224 appears in both, which means both files have replicas stored on that server. To verify that cnbjsjqpsgjdn224 really was the problem, I ran my batch HBase export script several more times, and every file that triggered the error again traced back to cnbjsjqpsgjdn224. We can therefore conclude that the problem sits on that node. Cloudera Manager confirmed the idea from another angle: the host is also one of the HBase RegionServer nodes, and since reads prefer the nearest replica, it fits that all of the corrupt reads point to it. In short, the data stored on this node has gone bad.
Problem resolution:
- Use the “Cloudera Management Service” to put hd01 into maintenance mode
- Use the “Cloudera Management Service” to stop all roles on hd01
- Back up the critical data on the partition (HDFS DataNode data is replicated, so we chose not to back it up)
- Unmount the “/dev/mapper/ds-data” partition mounted at “/data”
- Run e2fsck on the partition to repair the damaged blocks
For the repair itself you can also refer to the following two posts and adapt them to your own situation:
https://www.cmdschool.org/archives/10817
https://blog.csdn.net/shekey92/article/details/46895357
I hope this helps with your problem. If you have any questions, leave a comment and I will reply as soon as I see it.