Because the error came from a map task, my first suspicion was that too few map tasks were causing the OOM, so I adjusted the parameters to increase the number of maps. The problem persisted, so it clearly had nothing to do with the number of maps.
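(The post does not say which parameters were changed; a minimal sketch of the usual way to raise the map count in Hadoop 2.x is to shrink the input split size, as below. The property name is from stock Hadoop 2.x and the 128 MB value is only a placeholder, not taken from the original setup.)
<!-- Hypothetical example: smaller input splits produce more map tasks -->
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>134217728</value> <!-- cap each split at 128 MB -->
</property>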
Using the job id, I searched the corresponding logs in JobHistory and located the failed task id and its host, then found the problematic container id from the task log.
Since containers are allocated by the RM (ResourceManager), the RM log shows how the container was assigned. For example:
2014-05-06 16:00:00,632 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_1399267192386_43455_01_000037 of capacity <memory:1536, vCores:1> on host xxxx:44614, which currently has 4 containers, <memory:6144, vCores:4> used and <memory:79872, vCores:42> available
This line shows the container id, the host, and the memory and CPU allocated to the container.
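As an aside, the <memory:1536, vCores:1> capacity shown there is the container size requested per map task, not the JVM heap itself; on a stock MRv2 setup it would normally come from a property like the one below (the value is inferred from the log line above, not quoted from the original post):
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value> <!-- physical memory limit the NodeManager enforces per map container -->
</property>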
Next, check the NM (NodeManager) log for the corresponding container:
2014-05-05 10:14:47,001 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1399203487215_21532_01_000035 by user hdfs
2014-05-05 10:14:47,001 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs IP=10.201.203.111 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1399203487215_21532 CONTAINERID=container_1399203487215_21532_01_000035
2014-05-05 10:14:47,001 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1399203487215_21532_01_000035 to application application_1399203487215_21532
2014-05-05 10:14:47,055 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from NEW to LOCALIZING
2014-05-05 10:14:47,058 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1399203487215_21532_01_000035
2014-05-05 10:14:47,060 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /home/vipshop/hard_disk/10/yarn/local/nmPrivate/container_1399203487215_21532_01_000035.tokens. Credentials list:
2014-05-05 10:14:47,412 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from LOCALIZING to LOCALIZED
2014-05-05 10:14:47,454 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from LOCALIZED to RUNNING
2014-05-05 10:14:47,493 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /home/vipshop/hard_disk/6/yarn/local/usercache/hdfs/appcache/application_1399203487215_21532/container_1399203487215_21532_01_000035/default_container_executor.sh]
2014-05-05 10:14:48,827 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying from /home/vipshop/hard_disk/10/yarn/local/nmPrivate/container_1399203487215_21532_01_000035.tokens to /home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/application_1399203487215_21532/container_1399203487215_21532_01_000035.tokens
2014-05-05 10:14:49,169 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1399203487215_21532_01_000035
2014-05-05 10:14:49,305 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 66.7 MB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
2014-05-05 10:14:53,063 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 984.1 MB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
2014-05-05 10:14:56,379 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 984.5 MB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
.......
2014-05-05 10:19:26,823 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 21209 for container-id container_1399203487215_21532_01_000035: 1.1 GB of 1.5 GB physical memory used; 2.1 GB of 3.1 GB virtual memory used
2014-05-05 10:19:27,459 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs IP=10.201.203.111 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1399203487215_21532 CONTAINERID=container_1399203487215_21532_01_000035
2014-05-05 10:19:27,459 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from RUNNING to KILLING
2014-05-05 10:19:27,459 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1399203487215_21532_01_000035
2014-05-05 10:19:27,800 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1399203487215_21532_01_000035 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
So although the container was allocated 1.5 GB, the task was killed when it had used only 1.1 GB ("1.1 GB of 1.5 GB physical memory used"), with more than 400 MB still free. The overall task memory was therefore not too small; this looks more like a PermGen problem (the default is 64 MB).
Update the mapred settings as follows:
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1280m -Xms1280m -Xmn256m -XX:SurvivorRatio=6 -XX:MaxPermSize=128m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1280m -Xms1280m -Xmn256m -XX:SurvivorRatio=6 -XX:MaxPermSize=128m</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1280m -Xms1280m -Xmn256m -XX:SurvivorRatio=6 -XX:MaxPermSize=128m</value>
  <final>true</final>
</property>
Re-running the job with these settings succeeded; the 1280 MB heap plus 128 MB PermGen still leaves some headroom inside the 1536 MB container seen in the RM log.
In fact, for Java OOM problems like this, the best approach is to print GC information and dump the heap, then analyze the dump with a tool such as MAT.
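As a rough sketch of that advice (assuming a pre-Java-8 JVM with PermGen, as in this post; the dump file name is only an illustration), GC logging and an on-OOM heap dump can be enabled through the same java.opts properties, and the resulting .hprof file opened in MAT:
<property>
  <name>mapreduce.map.java.opts</name>
  <!-- GC output goes to the task's stdout log; the dump lands in the container working directory -->
  <value>-Xmx1280m -Xms1280m -Xmn256m -XX:SurvivorRatio=6 -XX:MaxPermSize=128m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./map_oom.hprof</value>
</property>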
This article was republished from 菜菜光's 51CTO blog. Original link: http://blog.51cto.com/caiguangguang/1407424. For reprints, please contact the original author.