基于Hive 2.1.x
首先直接做关联,查看执行计划
-- 先连接后过滤
explain
select
count(1)
from student t1 inner join student t2
on t1.id = t2.id and t2.name = 'shanghai'
where
t1.city = 'beijing';
1 STAGE DEPENDENCIES:
2 Stage-5 is a root stage
3 Stage-2 depends on stages: Stage-5
4 Stage-0 depends on stages: Stage-2
5
6 STAGE PLANS:
7 Stage: Stage-5
8 Map Reduce Local Work
9 Alias -> Map Local Tables:
10 t1
11 Fetch Operator
12 limit: -1
13 Alias -> Map Local Operator Tree:
14 t1
15 TableScan
16 alias: t1
17 filterExpr: (id is not null and (city = 'beijing')) (type: boolean)
18 Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
19 Filter Operator
20 predicate: (id is not null and (city = 'beijing')) (type: boolean)
21 Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
22 HashTable Sink Operator
23 keys:
24 0 id (type: int)
25 1 id (type: int)
26
27 Stage: Stage-2
28 Map Reduce
29 Map Operator Tree:
30 TableScan
31 alias: t2
32 filterExpr: ((name = 'shanghai') and id is not null) (type: boolean)
33 Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
34 Filter Operator
35 predicate: ((name = 'shanghai') and id is not null) (type: boolean)
36 Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
37 Map Join Operator
38 condition map:
39 Inner Join 0 to 1
40 keys:
41 0 id (type: int)
42 1 id (type: int)
43 Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
44 Group By Operator
45 aggregations: count(1)
46 mode: hash
47 outputColumnNames: _col0
48 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
49 Reduce Output Operator
50 sort order:
51 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
52 value expressions: _col0 (type: bigint)
53 Local Work:
54 Map Reduce Local Work
55 Reduce Operator Tree:
56 Group By Operator
57 aggregations: count(VALUE._col0)
58 mode: mergepartial
59 outputColumnNames: _col0
60 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
61 File Output Operator
62 compressed: false
63 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
64 table:
65 input format: org.apache.hadoop.mapred.SequenceFileInputFormat
66 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
67 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
68
69 Stage: Stage-0
70 Fetch Operator
71 limit: -1
72 Processor Tree:
再使用子查询进行先过滤,再连接。
explain
select
count(1)
from
(select * from student where city = 'beijing') t1
inner join student t2
on t1.id = t2.id and t2.name = 'shanghai';
1 STAGE DEPENDENCIES:
2 Stage-5 is a root stage
3 Stage-2 depends on stages: Stage-5
4 Stage-0 depends on stages: Stage-2
5
6 STAGE PLANS:
7 Stage: Stage-5
8 Map Reduce Local Work
9 Alias -> Map Local Tables:
10 t1:student
11 Fetch Operator
12 limit: -1
13 Alias -> Map Local Operator Tree:
14 t1:student
15 TableScan
16 alias: student
17 filterExpr: ((city = 'beijing') and id is not null) (type: boolean)
18 Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
19 Filter Operator
20 predicate: ((city = 'beijing') and id is not null) (type: boolean)
21 Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
22 Select Operator
23 expressions: id (type: int)
24 outputColumnNames: _col0
25 Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
26 HashTable Sink Operator
27 keys:
28 0 _col0 (type: int)
29 1 id (type: int)
30
31 Stage: Stage-2
32 Map Reduce
33 Map Operator Tree:
34 TableScan
35 alias: t2
36 filterExpr: ((name = 'shanghai') and id is not null) (type: boolean)
37 Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
38 Filter Operator
39 predicate: ((name = 'shanghai') and id is not null) (type: boolean)
40 Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
41 Map Join Operator
42 condition map:
43 Inner Join 0 to 1
44 keys:
45 0 _col0 (type: int)
46 1 id (type: int)
47 Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
48 Group By Operator
49 aggregations: count(1)
50 mode: hash
51 outputColumnNames: _col0
52 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
53 Reduce Output Operator
54 sort order:
55 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
56 value expressions: _col0 (type: bigint)
57 Local Work:
58 Map Reduce Local Work
59 Reduce Operator Tree:
60 Group By Operator
61 aggregations: count(VALUE._col0)
62 mode: mergepartial
63 outputColumnNames: _col0
64 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
65 File Output Operator
66 compressed: false
67 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
68 table:
69 input format: org.apache.hadoop.mapred.SequenceFileInputFormat
70 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
71 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
72
73 Stage: Stage-0
74 Fetch Operator
75 limit: -1
76 Processor Tree:
77 ListSink
从中我们可以清晰的看到,两者的执行计划和执行效率其实都差不多。应该是Hive 引擎自动做了谓词下推的优化机制。