一个稍微复杂的Hive SQL,在执行过程中发现某个Task非常慢,怎么去定位这个Task是属于哪段SQL逻辑呢
对于如下逻辑的一个SQL,在Spark引擎和Tez引擎中执行DAG的比较。在两个引擎下,执行慢的逻辑是一样的,分别看两种引擎的执行情况。
WITH tbl_a AS ( SELECT a, b, name, SUM(pv) AS pv FROM ( SELECT a,b,name , count(1) AS pv FROM dataware.a WHERE dt = '{@date}' GROUP BY a,b,name UNION ALL SELECT a,b,name , count(1) AS pv FROM dateware.b WHERE dt = '{@date}' GROUP BY a,b,name UNION ALL SELECT a,b,name , count(1) AS pv FROM dateware.c WHERE dt = '{@date}' GROUP BY a,b,name UNION ALL SELECT a,b,name , count(1) AS pv FROM dataware.d WHERE dt = '{@date}' GROUP BY a,b,name ) a GROUP BY a, b, name ), tbl_b AS ( SELECT * FROM ( SELECT a, ,b ,row_number() OVER (PARTITION BY app ORDER BY min_time DESC) AS row_num FROM dataware.demo_a WHERE dt = '{@date}' ) tmp WHERE row_num = 1 ), tbl_c AS ( SELECT a, b, user_id FROM dataware.demo_b WHERE dt = '{@date}' ), tbl_d AS ( SELECT user_id , identity as role FROM dataware.demo_c WHERE dt = '{@date}' ) INSERT OVERWRITE TABLE dataware.result PARTITION ( dt = '{@date}') SELECT a, b, name, other_fields, , SUM(pv) AS pv FROM tbl_a LEFT JOIN tbl_b ON tbl_a.a = tbl_b.a AND tbl_a.b = tbl_b.b LEFT JOIN tbl_c ON tbl_a.a = tbl_c.a AND tbl_a.b = tbl_c.b LEFT JOIN tbl_d ON tbl_c.user_id = tbl_d.user_id GROUP BY a, b, name, other_fields
Spark引擎当中,从执行结果上看。执行慢的Stage分别是Stage1和Stage8。刚入门要想看出来Stage1和Stage8分别对应哪部分SQL逻辑有一定困难。
在Spark引擎中,我总结可能会有如下办法找到对应SQL逻辑:
1、根据经验将SQL分开。例子中union all的逻辑很好对应。
Stage1是读取tbl_b表阶段
2、也可以单独执行某个SQL逻辑段,看看【单独执行SQL逻辑段的STAGE】与【整体Stage哪个部分】相符。
Stage7对应读取dataware.demo_a表逻辑
3、如果是个读取文件的过程,一般Task的并行度与文件块个数是对应的。所以查看这个表的block块,确定Stage对应SQL逻辑段
Stage5对应是读取dataware.demo_b表逻辑
4、参考Tez引擎
Tez引擎执行情况:
Tez引擎执行情况:下图按Vertices的开始时间排序。看出Map8和Reduce9阶段耗时长。
在 Graphical View中可以很容易看到Map8和Reduce9对应的读取tbl_b的阶段