What is aggregation
Aggregation means splitting elements into groups according to some rule and then computing something over each group.
For example: group the rows by color, then compute a sum over each group.
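A minimal SQL sketch of that example (the items table and its data are invented purely for illustration):
create table items (color text, amount numeric);
insert into items values ('red', 1), ('red', 2), ('blue', 3), ('green', 5);

-- group by color, then sum the amounts within each group
select color, sum(amount) from items group by color;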
Aggregation algorithms in PostgreSQL
- GroupAggregate: sort the data first, then apply the aggregate function to each group in turn.
- HashAggregate: keep the aggregate state for each key in a hash table.
Comparing the two algorithms
GroupAggregate
GroupAggregate sorts the data before aggregating, then aggregates group by group, and its output comes out in sorted order. (In PostgreSQL this path is influenced by enable_sort.)
HashAggregate
HashAggregate needs no sort. When the number of groups is small it is much faster than GroupAggregate, but it needs more memory. It can only handle relatively simple aggregates; aggregates such as count(distinct ...) cannot be done with it.
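Which of the two the planner picks can be nudged with the enable_hashagg and enable_sort settings; a quick sketch using the hypothetical items table above:
-- discourage hashing: the planner falls back to sort + GroupAggregate
set enable_hashagg = off;
explain select color, sum(amount) from items group by color;

-- back to the default: with few groups the planner typically prefers HashAggregate
reset enable_hashagg;
explain select color, sum(amount) from items group by color;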
Comparing the speed of the two algorithms
Create 9 test tables
create table t_1 (branch_id bigint, amount numeric);
create table t_10 (branch_id bigint, amount numeric);
create table t_100 (branch_id bigint, amount numeric);
create table t_1000 (branch_id bigint, amount numeric);
create table t_10000 (branch_id bigint, amount numeric);
create table t_100000 (branch_id bigint, amount numeric);
create table t_1000000 (branch_id bigint, amount numeric);
create table t_10000000 (branch_id bigint, amount numeric);
create table t_100000000 (branch_id bigint, amount numeric);
Each table holds 100 million rows; what differs is the number of groups per table (i.e. the number of distinct branch_id values).
insert into t_1 select mod(i, 1), random() from generate_series(1,100000000) s(i);
insert into t_10 select mod(i, 10), random() from generate_series(1,100000000) s(i);
insert into t_100 select mod(i, 100), random() from generate_series(1,100000000) s(i);
...
vacuum analyze t_1;
vacuum analyze t_10;
vacuum analyze t_100;
...
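The benchmark query itself is not shown here; presumably it is a plain per-group sum run against each table, something along these lines:
-- repeat for t_1 .. t_100000000 and compare the timings
explain analyze select branch_id, sum(amount) from t_100000 group by branch_id;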
The results show that when the number of groups is below 10^7, hashagg beats groupagg across the board, although hashagg's runtime also climbs gradually as the number of groups grows. Once the number of groups reaches 10^8, the plan is forced onto groupagg even with enable_sort turned off.
In most cases hashagg is more efficient than groupagg, mainly because the sort step is expensive. Essentially hashagg trades space (memory) for time: with enough memory this works fine, but with insufficient memory it is easy to OOM (when PostgreSQL judges that memory is insufficient, it forces the plan onto groupagg).
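The memory budget behind that decision is work_mem: if the planner estimates that the hash table for the grouping will not fit into work_mem, it falls back to the sorted plan. A rough way to watch the switch happen (the values are only examples):
set work_mem = '1MB';     -- tiny budget: expect GroupAggregate on the tables with many groups
explain select branch_id, sum(amount) from t_10000000 group by branch_id;

set work_mem = '2GB';     -- generous budget: HashAggregate becomes affordable
explain select branch_id, sum(amount) from t_10000000 group by branch_id;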
How it works
Application
Where possible, try to get groupagg plans to run as hashagg. In most cases the planner picks the better algorithm on its own, but in some specific cases the query has to be rewritten by hand.
Table t1
create table t1 (branch_id bigint, amount numeric);
insert into t1 select mod(i, 1000), random() from generate_series(1,10000000) s(i);
SQL
select branch_id,count(distinct amount) from t1 group by branch_id;
postgres=# explain analyze select branch_id,count(distinct amount) from t1 group by branch_id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1326385.18..1401396.05 rows=1000 width=16) (actual time=2524.329..8132.612 rows=1000 loops=1)
Group Key: branch_id
-> Sort (cost=1326385.18..1351385.47 rows=10000115 width=19) (actual time=2517.526..4227.713 rows=10000000 loops=1)
Sort Key: branch_id
Sort Method: quicksort Memory: 1174466kB
-> Seq Scan on t1 (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.015..824.249 rows=10000000 loops=1)
Planning time: 0.105 ms
Execution time: 8234.834 ms
(8 rows)
Now we want this to run as hashagg, so we force sorting off, but the plan still does not change; the fact is that hashagg cannot handle aggregates like count(distinct ...).
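The exact command is not shown in the text; presumably it is the enable_sort switch. The huge costs in the plan below are the penalty cost the planner adds to disabled paths:
set enable_sort = off;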
postgres=# explain analyze select branch_id,count(distinct amount) from t1 group by branch_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=10001326385.18..10001401396.05 rows=1000 width=16) (actual time=2581.177..7953.331 rows=1000 loops=1)
Group Key: branch_id
-> Sort (cost=10001326385.18..10001351385.47 rows=10000115 width=19) (actual time=2574.178..4126.988 rows=10000000 loops=1)
Sort Key: branch_id
Sort Method: quicksort Memory: 1174466kB
-> Seq Scan on t1 (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.015..813.226 rows=10000000 loops=1)
Planning time: 0.082 ms
Execution time: 8034.338 ms
Rewrite the statement
select branch_id,count(*) from (select branch_id,amount from t1 group by branch_id,amount)_ group by branch_id;
postgres=# explain analyze select branch_id,count(*) from (select branch_id,amount from t1 group by branch_id,amount)_ group by branch_id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=458272.50..458274.50 rows=200 width=16) (actual time=9544.638..9544.763 rows=1000 loops=1)
Group Key: t1.branch_id
-> HashAggregate (cost=213696.73..311527.04 rows=9783031 width=19) (actual time=5200.136..7961.864 rows=9999971 loops=1)
Group Key: t1.branch_id, t1.amount
-> Seq Scan on t1 (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.022..890.532 rows=10000000 loops=1)
Planning time: 0.197 ms
Execution time: 9810.855 ms
The plan now uses hashagg, yet the runtime did not go down. Why? By the reasoning above it should have. Looking at the plan more carefully, the rewritten statement performs two hashagg passes, each taking roughly 4 s. Because the selectivity of amount is very high (almost every value is distinct), the first hashagg barely reduces the row count, so the second hashagg has to process about as many rows as the first and also takes around 4 s. This rewrite therefore only pays off when the selectivity of amount is low, so that the second hashagg has far less work to do. When the values are highly selective, the original statement costs one sort + one groupagg, while the rewritten one costs two hashaggs. Let's lower the selectivity of amount and look at the effect again.
update t1 set amount = branch_id;
postgres=# explain analyze select branch_id,count(distinct amount) from t1 group by branch_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=10002533944.49..10002672603.94 rows=1000 width=16) (actual time=3143.167..7858.782 rows=1000 loops=1)
Group Key: branch_id
-> Sort (cost=10002533944.49..10002580160.97 rows=18486593 width=19) (actual time=3137.152..4561.355 rows=10000000 loops=1)
Sort Key: branch_id
Sort Method: quicksort Memory: 861967kB
-> Seq Scan on t1 (cost=0.00..302614.93 rows=18486593 width=19) (actual time=613.417..1537.574 rows=10000000 loops=1)
Planning time: 0.134 ms
Execution time: 7925.181 ms
(8 rows)
postgres=# explain analyze select branch_id,count(*) from (select branch_id,amount from t1 group by branch_id,amount)_ group by branch_id;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=847179.99..847181.99 rows=200 width=16) (actual time=3987.460..3987.611 rows=1000 loops=1)
Group Key: t1.branch_id
-> HashAggregate (cost=395047.90..575900.73 rows=18085284 width=19) (actual time=3821.612..3986.812 rows=1000 loops=1)
Group Key: t1.branch_id, t1.amount
-> Seq Scan on t1 (cost=0.00..302614.93 rows=18486593 width=19) (actual time=73.574..919.854 rows=10000000 loops=1)
Planning time: 0.155 ms
Execution time: 4438.999 ms
(7 rows)
Once the selectivity of amount drops, the first hashagg still takes about 4 s, but the second hashagg is clearly much cheaper, so the rewritten statement now runs faster than the original.
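Before reaching for this rewrite, it helps to measure how selective the column really is; a rough check (the threshold is a judgement call):
select count(distinct amount)::numeric / count(*) as amount_selectivity from t1;
-- close to 1: almost every amount is distinct, the rewrite buys little
-- close to 0: few distinct amounts, the second hashagg becomes cheap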
We can also take advantage of PostgreSQL 10 parallelism. The original statement cannot be parallelised, but the rewritten one can. Here we set the degree of parallelism to 4; with parallelism enabled, the default plan is sort + groupagg.
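The settings used are not shown in the text; on PostgreSQL 10 the per-query degree of parallelism is normally driven by max_parallel_workers_per_gather (the value below is only an example, and other parallel-related settings may also need adjusting), with the plan presumably coming from the rewritten statement above:
set max_parallel_workers_per_gather = 4;
explain analyze select branch_id,count(*) from (select branch_id,amount from t1 group by branch_id,amount)_ group by branch_id;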
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=408450.41..937838.27 rows=200 width=16) (actual time=831.472..1365.904 rows=1000 loops=1)
Group Key: t1.branch_id
-> Group (cost=408450.41..922858.98 rows=998486 width=12) (actual time=830.731..1365.377 rows=1000 loops=1)
Group Key: t1.branch_id, t1.amount
-> Gather Merge (cost=408450.41..902889.26 rows=3993944 width=12) (actual time=830.730..1364.782 rows=5000 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Group (cost=407950.35..426671.97 rows=998486 width=12) (actual time=794.142..1317.058 rows=1000 loops=5)
Group Key: t1.branch_id, t1.amount
-> Sort (cost=407950.35..414190.89 rows=2496215 width=12) (actual time=794.139..1002.323 rows=2000000 loops=5)
Sort Key: t1.branch_id, t1.amount
Sort Method: quicksort Memory: 197526kB
-> Parallel Seq Scan on t1 (cost=0.00..142711.15 rows=2496215 width=12) (actual time=35.558..190.117 rows=2000000 loops=5)
Planning time: 0.119 ms
Execution time: 1379.119 ms
From what we have seen, the relative efficiency of hashagg and groupagg still depends on many factors, but in general hashagg is the more efficient of the two. When rewriting count(distinct ...) statements in this way, check first whether the selectivity of the value column is high.