Troubleshooting a Hive job whose process died of OOM because it ran with a single map task

The relevant part of the SQL is as follows:

select
  '20200607' as log_date,
  COUNT(distinct if(event_id = 'app.onepass-login.0.0.pv'
    AND (get_json_object(extended_fields, '$.refer_click') in ('main.homepage.avatar-nologin.all.click')), buvid, null)) as aaa,
  xxxx
  xxxx
FROM xxx.hongcan_onepass_appctr_d
WHERE log_date = '20200607' and (app_id = 1 AND platform = 1)
GROUP BY log_date

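The queried table partition is 167 MB, and the HDFS block size is 128 MB.

However, the job ran with only one map task and failed. The log clearly showed an out-of-memory error.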
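As a quick sanity check on these numbers: a 167 MB file stored with 128 MB blocks occupies two HDFS blocks, so one might have expected two map tasks rather than one. Below is a minimal sketch of that arithmetic. The function name is hypothetical, and the block count is only a rough first guess at the number of mappers, since the input format decides the actual splits.

```python
MB = 1024 * 1024


def hdfs_block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of HDFS blocks needed to store a file of the given size."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    # Ceiling division using integers only, to avoid float rounding.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


# 167 MB partition file, 128 MB HDFS block size (figures from this post).
print(hdfs_block_count(167 * MB, 128 * MB))  # 2
```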


To investigate, the only option was to read the Hive source code.

Part of the job's log output is shown below:

Query ID = hdfs_20200610165438_c314dfd5-c046-46c5-9a25-0a467be937a6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
20/06/10 16:56:02 INFO Configuration.deprecation: mapred.submit.replication is deprecated. Instead, use mapreduce.client.submit.file.replication
20/06/10 16:56:14 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
20/06/10 16:56:33 INFO input.FileInputFormat: Total input files to process : 1
20/06/10 16:56:41 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
20/06/10 16:56:41 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev Unknown build revisionscripts/get_build_revision.sh: 21: scripts/get_build_revision.sh: [[: not found
]
ERROR: transport error 202: recv error: Connection timed out
20/06/10 17:24:22 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 2, size left: 0
20/06/10 17:24:23 INFO mapreduce.JobSubmitter: number of splits:1
20/06/10 17:24:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1591697494533_103857
20/06/10 17:24:24 INFO impl.YarnClientImpl: Submitted application application_1591697494533_103857
20/06/10 17:24:24 INFO mapreduce.Job: The url to track the job: http://xxx:8088/proxy/application_1591697494533_103857/
Starting Job = job_1591697494533_103857, Tracking URL = http://xxx:8088/proxy/application_1591697494533_103857/
Kill Command = /data/service/hadoop/bin/hadoop job  -kill job_1591697494533_103857
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
20/06/10 17:24:32 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2020-06-10 17:24:32,431 Stage-1 map = 0%,  reduce = 0%
The JobSubmitter class:

The size of the returned inputSplitShims array is the number of map tasks.


Note where maxSize comes from.


maxSize is then passed along.


Here blockToNodes is the collection of file blocks produced by the splitting.


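Now for the key point.

Here maxSize is the value configured for mapreduce.input.fileinputformat.split.maxsize, which is the maximum input size for a single map task. A file larger than this is split.

The source contains a comment explaining that in Hadoop 2.x, mapreduce.input.fileinputformat.split.maxsize is used to control the number of input splits: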

 

 

/**
   * The desired number of input splits produced for each partition. When the
   * input files are large and few, we want to split them into many splits,
   * so as to increase the parallelizm of loading the splits. Try also two
   * other parameters, mapred.min.split.size and mapred.max.split.size for
   * hadoop 1.x, or mapreduce.input.fileinputformat.split.minsize and
   * mapreduce.input.fileinputformat.split.maxsize in hadoop 2.x to
   * control the number of input splits.
   */

 

 

 

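left is the size of the file, which is 167 MB here.

The cause is now clear: this parameter is set to 256 MB in our configuration and the file is 167 MB, so myLength = Math.min(maxSize, left); takes the entire 167 MB.

The file goes through in a single pass and comes back as one chunk.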
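The effect of that one line can be reproduced with a small model. The sketch below is a simplified illustration of the min(maxSize, left) loop, not the actual Hadoop code: the function name is hypothetical, treating a limit of 0 as "no limit" is an assumption of the sketch, and the real input format does further work (such as grouping chunks by node and rack) that is ignored here.

```python
from typing import List

MB = 1024 * 1024


def chunk_lengths(left: int, max_size: int) -> List[int]:
    """Cut `left` bytes into chunks with myLength = min(maxSize, left)."""
    if max_size < 0:
        raise ValueError("max_size must not be negative")
    lengths = []
    while left > 0:
        # Assumption of this sketch: max_size == 0 means "no limit".
        my_length = left if max_size == 0 else min(max_size, left)
        lengths.append(my_length)
        left -= my_length
    return lengths


file_size = 167 * MB

# A 256 MB limit is larger than the 167 MB file: a single chunk comes back.
print(len(chunk_lengths(file_size, 256 * MB)))  # 1

# A much smaller limit, e.g. 100000 bytes, yields many chunks in this model.
print(len(chunk_lengths(file_size, 100000)))  # 1752
```

The chunk count in this model is not necessarily the final number of map tasks, but it shows why a limit larger than the file can never produce more than one.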


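In this case we had the user set mapreduce.input.fileinputformat.split.maxsize=100000 (the value is in bytes), which raised the number of map tasks above one and resolved the problem.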
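If a specific degree of parallelism is wanted instead of a very small fixed value, the limit can be derived from the file size. The following is a minimal sketch under that assumption: the helper name and the target of 4 map tasks are hypothetical illustrations, and the number of mappers actually launched still depends on how the input format builds its splits.

```python
MB = 1024 * 1024


def max_split_size_for(file_size_bytes: int, target_maps: int) -> int:
    """Smallest limit (in bytes) that cuts the file into at most target_maps pieces."""
    if target_maps <= 0:
        raise ValueError("target_maps must be positive")
    # Ceiling division: each piece holds at most this many bytes.
    return -(-file_size_bytes // target_maps)


# 167 MB file from this post, with a hypothetical target of 4 map tasks.
size = max_split_size_for(167 * MB, 4)
print(f"set mapreduce.input.fileinputformat.split.maxsize={size};")
```

The printed line is a Hive session setting that can be issued before the query.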

 

 

 
