Problem
I wrote a Java program locally that operates on the cloud-hosted HDFS file system; running an ls against it works without any problem.
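For context, here is a minimal sketch of that kind of listing program. The NameNode URI and the path are placeholders (not taken from the original setup), so substitute your own values:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS value.
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop001:8020"), conf);
        // The HDFS equivalent of "ls": list the entries under a directory.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}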
I then wrote another Java program locally that connects to the cloud HDFS and runs a MapReduce job; it fails with the errors below.
Fragment 1: right as the job starts (map 0% reduce 0%), a Connection refused is reported.
2020-10-31 09:32:09,858 INFO [org.apache.hadoop.mapreduce.Job] - map 0% reduce 0%
2020-10-31 09:32:11,120 WARN [org.apache.hadoop.hdfs.BlockReaderFactory] - I/O error constructing remote block reader.
java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:635)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:143)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:183)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Fragment 2: Failed to connect to /127.0.0.1:50010 for block, i.e. the client cannot reach any service at the address 127.0.0.1:50010.
2020-10-31 09:32:11,123 WARN [org.apache.hadoop.hdfs.DFSClient] - Failed to connect to /127.0.0.1:50010 for block,
add to deadNodes and continue. java.net.ConnectException: Connection refused: no further information
java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:635)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:143)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:183)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
127.0.0.1 is the loopback (localhost) IP, but what is port 50010? I logged into the cloud server and checked with netstat: the process listening on that port is the DataNode. (Port 50010 is the default DataNode data-transfer port, dfs.datanode.address, in Hadoop 2.x.)
Initial Analysis
First, a quick recap of the MapReduce flow:
The client (our local Java program) submits a job (the MapReduce operation) to the Job Tracker; the Job Tracker then interacts with the DataNodes, keeps updating the job status, and finally writes the result back to HDFS.
So what is the Job Tracker here? (In the Hadoop ecosystem, job scheduling is normally YARN's role.) Since neither my local machine nor the cloud server is running YARN, the job must be handled by a local, "dummy" Job Tracker, which is exactly what the LocalJobRunner frames in the stack traces above point to.
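To make the "dummy Job Tracker" idea concrete: when mapreduce.framework.name is left at its default value of local, Hadoop runs the whole job inside the client JVM via LocalJobRunner. A quick way to check which runner will be used (a small diagnostic sketch, not part of the original program):

import org.apache.hadoop.conf.Configuration;

public class RunnerCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "local" (the default) means LocalJobRunner: map/reduce tasks run in this JVM.
        // "yarn" would submit the job to a ResourceManager instead.
        System.out.println(conf.get("mapreduce.framework.name", "local"));
    }
}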
Going back to the log:
2020-10-31 09:32:08,672 INFO [org.apache.hadoop.mapreduce.lib.input.FileInputFormat] - Total input paths to process : 1
2020-10-31 09:32:08,694 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - number of splits:1
2020-10-31 09:32:08,747 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - Submitting tokens for job: job_local1490971204_0001
2020-10-31 09:32:08,850 INFO [org.apache.hadoop.mapreduce.Job] - The url to track the job: http://localhost:8080/
2020-10-31 09:32:08,851 INFO [org.apache.hadoop.mapreduce.Job] - Running job: job_local1490971204_0001
As the log shows, the local Java program computed the input split and then submitted the job to the local Job Tracker (note the job_local prefix in the job ID). Opening the tracking URL http://localhost:8080/ fails, which makes sense: nothing is serving that page in local mode (with YARN, the ResourceManager web UI would provide the tracking page).
Putting this together, a preliminary conclusion: the Java program can talk to the NameNode (it read the input metadata and computed the split), but it cannot establish a connection to the DataNode, because the DataNode address handed back for the block is 127.0.0.1:50010, which from my local machine points at my own machine rather than at the cloud server.
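A quick way to double-check this diagnosis from the local machine is to probe the two addresses directly. This is a hedged sketch; the cloud hostname is a placeholder, and 50010 is the port from the log:

import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    static void probe(String host, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 3000); // 3-second timeout
            System.out.println(host + ":" + port + " is reachable");
        } catch (Exception e) {
            System.out.println(host + ":" + port + " is NOT reachable: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        probe("127.0.0.1", 50010);        // the address the client was told to use; refused locally
        probe("your-cloud-host", 50010);  // placeholder: the cloud server's public hostname or IP
    }
}

If the second probe succeeds while the first fails, the DataNode itself is fine and the problem is purely the address being handed out.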
What we need to change: have the DataNode be advertised by a proper hostname rather than an IP, so that the local Java program can resolve it to an address it can actually reach.
Forcing the Use of Hostnames
First, add the following line to the Java program:
configuration.set("dfs.client.use.datanode.hostname", "true");
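For context, here is roughly where that line sits in a job driver. Everything apart from the dfs.client.use.datanode.hostname call is an illustrative sketch (identity mapper/reducer, placeholder NameNode address and paths), not the original program:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughJob {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Placeholder NameNode address; use your cluster's fs.defaultFS value.
        configuration.set("fs.defaultFS", "hdfs://hadoop001:8020");
        // The fix discussed here: dial DataNodes by hostname, not by the IP they registered with.
        configuration.set("dfs.client.use.datanode.hostname", "true");

        Job job = Job.getInstance(configuration, "pass-through");
        job.setJarByClass(PassThroughJob.class);
        job.setMapperClass(Mapper.class);    // identity mapper, just to keep the sketch runnable
        job.setReducerClass(Reducer.class);  // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/input"));     // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}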
Second, on the HDFS side, edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml and add the following property:
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
  <description>
    Whether clients should use datanode hostnames when connecting to datanodes.
  </description>
</property>
This tells the client to always connect to DataNodes by hostname. (To be safe, set it in both places; sometimes setting only one of them turns out not to be enough.)
Restart and try again: still failing.
But no need to panic: the console log shows the change did have an effect. The client is now trying to reach localhost, whereas before it was 127.0.0.1.
2020-10-31 10:06:28,077 WARN [org.apache.hadoop.hdfs.DFSClient] - Failed to connect to localhost/127.0.0.1:50010 for block,
add to deadNodes and continue. java.net.ConnectException: Connection refused: no further information
java.net.ConnectException: Connection refused: no further information
So the remaining problem is that the DataNode is (incorrectly) registered under the hostname localhost.
Fixing the Hostname
First, let's check on the server what the hostname actually is. The DataNode log contains the following line:
2020-10-31 09:31:43,492 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname is localhost
It really is localhost!
Next, I went back and checked the $HADOOP_HOME/etc/hadoop/slaves configuration, which lists hadoop001.
Then I looked at /etc/hosts, which was configured like this:
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.19.183.99 maxc maxc
127.0.0.1 maxc maxc
127.0.0.1 localhost localhost
0.0.0.0 hadoop000 hadoop000
0.0.0.0 hadoop001 hadoop001
Running the hostname command shows that the Linux server's real hostname is maxc.
So the machine's real hostname is maxc, and we created two extra "virtual" hostnames, hadoop000 and hadoop001, for running Hadoop.
My guess: Hadoop resolves hadoop001 via /etc/hosts to the IP 0.0.0.0 and then tries to work out the machine's real hostname from that, but because the entries are misconfigured (hadoop001 mapped to 0.0.0.0, and maxc also mapped to 127.0.0.1), the lookup ends up at localhost.
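One way to see what name resolution actually hands back to the JVM on the server is the small diagnostic sketch below (run on the cloud machine; it shows what Java resolves, which may differ from the exact lookup path the DataNode uses):

import java.net.InetAddress;

public class WhoAmI {
    public static void main(String[] args) throws Exception {
        InetAddress local = InetAddress.getLocalHost();
        // Prints whatever /etc/hosts (or DNS) resolves for this machine's own hostname;
        // with a misconfigured hosts file this is where localhost/127.0.0.1 can sneak in.
        System.out.println("hostname:  " + local.getHostName());
        System.out.println("canonical: " + local.getCanonicalHostName());
        System.out.println("address:   " + local.getHostAddress());
    }
}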
I changed it as follows:
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.19.183.99 maxc maxc
172.19.183.99 hadoop001 hadoop001
172.19.183.99 hadoop000 hadoop000
127.0.0.1 localhost localhost
Restart and try again: the problem is solved.
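As a quick sanity check after editing /etc/hosts (a hedged sketch, run on the server), forward resolution of the Hadoop hostnames should now land on the real address instead of 0.0.0.0:

import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        // With the corrected /etc/hosts, both names should map to 172.19.183.99.
        System.out.println("hadoop001 -> " + InetAddress.getByName("hadoop001").getHostAddress());
        System.out.println("hadoop000 -> " + InetAddress.getByName("hadoop000").getHostAddress());
    }
}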
Summary
First, the log showed that the DataNode was being advertised by the localhost IP, so we forced client-to-DataNode communication to use hostnames. Second, we fixed the DataNode's hostname itself, changing it from localhost to the hostname we actually want.
Lesson learned: when deploying for real, don't give DataNodes "alias" hostnames; just use the machine's actual hostname and save yourself the trouble.
References
- Hadoop Map/Reduce execution flow explained: https://www.jianshu.com/p/352db00b6d7a
- Hadoop configuration file parameters explained: https://blog.csdn.net/starskyboy/article/details/80879697
- HDFS on a cloud host cannot be accessed from outside the network: https://my.oschina.net/gordonnemo/blog/3017724