A record of troubleshooting an HBase crash

Troubleshooting the HBase outage

It started while using Spark to write database data into HBase: in the middle of an otherwise normal run, three Spark jobs suddenly appeared to hang. Checking their logs showed the program stuck at the following point:
2021-10-18 18:23:58,158 INFO jdbc.Utils: Supplied authorities: 192.168.xx.xx:10000
2021-10-18 18:23:58,158 INFO jdbc.Utils: Resolved authority: 192.168.xx.xx:10000
2021-10-18 18:24:24,234 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on node25:33917 in memory (size: 4.8 KiB, free: 366.3 MiB)
2021-10-18 18:24:24,253 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on node25:42263 in memory (size: 4.8 KiB, free: 366.3 MiB)
2021-10-18 18:24:24,264 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on node25:33917 in memory (size: 7.3 KiB, free: 366.3 MiB)
2021-10-18 18:24:24,268 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on node25:42263 in memory (size: 7.3 KiB, free: 366.3 MiB)

At first I put the problem down to the data itself, suspecting that data skew had caused the process to hang. So I killed the running job with yarn application -kill ${application_id} and re-ran the load against an empty table as a test. The same hang appeared, so the initial suspicion of data skew did not hold, and I went back to investigating.
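For reference, this is roughly how a hung job can be located and killed from the command line; the application id below is only a placeholder, not the real one:

# list the running YARN applications to find the hung Spark job
yarn application -list -appStates RUNNING

# kill it once identified (placeholder id)
yarn application -kill application_1634550000000_0001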
While digging into this I also came across an article on dealing with data skew (a Spark performance tuning write-up); leaving the pointer here for reference.
Continuing the investigation, the log then threw the following error:

2021-10-18 18:44:35,265 ERROR yarn.ApplicationMaster: User class threw exception: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1219044: callTimeout=1200000, callDuration=1238974: Failed to find location, tableName=hbase:meta, row=xxxxx, reload=true row 'xxxx' on table 'hbase:meta' at null 
	at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:148)
	at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3078)
	at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3070)
	at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:445)
	at org.apache.hadoop.hive.hbase.HBaseMetaHook.preCreateTable(HBaseMetaHook.java:135)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:832)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:822)
	at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:212)
	at com.sun.proxy.$Proxy34.createTable(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2773)
	at com.sun.proxy.$Proxy34.createTable(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:921)
	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937)
	at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4954)
	at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:428)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:205)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2664)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2335)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:2011)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1709)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1703)
	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
	at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:224)
	at org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
	at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:316)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
	at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:329)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1238974: Failed to find location, tableName=hbase:meta, row=xxxxx, reload=true row 'xxxxx' on table 'hbase:meta' at null
	at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:148)
	at org.apache.hadoop.hbase.client.HTable.get(HTable.java:395)
	at org.apache.hadoop.hbase.client.HTable.get(HTable.java:369)
	at org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1108)
	at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:427)
	at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:448)
	at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:445)
	at org.apache.hadoop.hbase.client.RpcRetryingCallable.call(RpcRetryingCallable.java:58)
	at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:106)
	... 40 more
Caused by: java.io.IOException: Failed to find location, tableName=hbase:meta, row=xxxxx, reload=true
	at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:222)
	at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:104)
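The error shows the client could not even locate hbase:meta, which points at the HBase service rather than at the Spark job. A quick sanity check at this point is to look at the HBase daemons and cluster status directly; a minimal sketch (the log path is an assumption and depends on the installation):

# are the HMaster / HRegionServer JVMs still running?
jps | grep -E 'HMaster|HRegionServer'

# ask the cluster for a status summary
echo "status 'simple'" | hbase shell

# tail the master log for the real failure reason
tail -n 100 /opt/hbase/logs/hbase-*-master-*.log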

At this point I found that the HBase service itself had stopped abnormally. When I tried to restart HBase, it unexpectedly refused to come back: the HMaster failed to start every time. Its startup log showed the following:

2021-10-18 20:17:12,663 WARN  [main-SendThread(192.168.17.23:2181)] zookeeper.ClientCnxn: Session 0x0 for server 192.168.17.23/192.168.17.23:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

2021-10-18 20:17:26,533 INFO  [main] zookeeper.ZooKeeper: Session: 0x0 closed
2021-10-18 20:17:26,533 ERROR [main] regionserver.HRegionServer: Failed construction RegionServer
org.apache.hadoop.hbase.ZooKeeperConnectionException: master:160000x0, quorum=192.168.17.22:2181,192.168.17.23:2181,192.168.17.24:2181, baseZNode=/hbase Unexpected KeeperException creating base node
        at org.apache.hadoop.hbase.zookeeper.ZKWatcher.createBaseZNodes(ZKWatcher.java:192)
        at org.apache.hadoop.hbase.zookeeper.ZKWatcher.<init>(ZKWatcher.java:167)
        at org.apache.hadoop.hbase.zookeeper.ZKWatcher.<init>(ZKWatcher.java:119)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:652)
        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:494)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2937)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:236)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:140)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2955)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:547)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:526)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1197)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1175)
        at org.apache.hadoop.hbase.zookeeper.ZKWatcher.createBaseZNodes(ZKWatcher.java:183)
        ... 14 more

So the problem was found: HBase could not connect to ZooKeeper, which is why it failed to start. However, ps -ef|grep zookeeper showed the ZooKeeper processes were running, so the failure was not simply ZooKeeper being down. I then noticed the java.io.IOException: Connection reset by peer, which means ZooKeeper itself was dropping the connection, typically because too many socket connections had been opened against it. So I checked the ZooKeeper connection count:

 netstat -anl|grep 2181|grep -i 'zk_ip'|grep ESTABLISHED|wc -l
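As a cross-check, the limit involved here (maxClientCnxns) is enforced per client IP, so it is also worth counting connections grouped by client, or asking ZooKeeper directly with its four-letter cons command (on ZooKeeper 3.5+ the four-letter words may need to be whitelisted via 4lw.commands.whitelist):

# count ESTABLISHED connections to port 2181, grouped by client IP
netstat -an | grep ':2181' | grep ESTABLISHED | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn

# list client connections as seen by the ZooKeeper server itself
echo cons | nc 192.168.17.23 2181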

The count came back as 240, while ZooKeeper's default maxClientCnxns limit is only 60 connections per client IP, so I decided to raise ZooKeeper's maximum allowed client connections.
Add the following parameter to ZooKeeper's zoo.cfg:

# the maximum number of client connections.
# increase this if you need to handle more clients
maxClientCnxns=2000

Then restart the ZooKeeper cluster, run rmr /hbase in the ZooKeeper shell to clear HBase's znode, and restart HBase; the service came back up normally.
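For completeness, this is roughly the command sequence, assuming a typical ZooKeeper/HBase layout (the script paths are assumptions; deleting /hbase removes the state HBase keeps in ZooKeeper, which is rebuilt on the next startup):

# on every ZooKeeper node, restart so the new maxClientCnxns takes effect
/opt/zookeeper/bin/zkServer.sh restart

# open a ZooKeeper shell and remove HBase's znode
/opt/zookeeper/bin/zkCli.sh -server 192.168.17.23:2181
# inside the zkCli shell (use deleteall /hbase on ZooKeeper 3.5+):
rmr /hbase

# restart HBase once the znode is gone
/opt/hbase/bin/stop-hbase.sh
/opt/hbase/bin/start-hbase.sh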
