【重新发现PostgreSQL之美】- 24 滑动窗口分析 2000x

背景


场景:
游戏、社交、电商场景.
流失用户、新增用户、UV计算.
滑动计算, 任意窗口.

挑战:
数据量大、计算量大
传统方案需要记录明细, 才能支持滑动计算

PG解决方案:
采用HLL类型不需要存储明细, 支持滑动, 交、并、差计算.

例子


传统方案:

数据构造:

ts date,  -- 日期
gid int,  -- 维度1 , 例如城市
uid int8  -- 用户ID
);

写入1亿条记录, 跨度15天:

select current_date+(random()*15)::int, random()*100, random()*800000000
from generate_series(1,100000000);
public | t    | table | postgres | unlogged    | heap          | 4224 MB |

1、查询某一天的UV

count
---------
3326250
(1 row)
Time: 7268.339 ms (00:07.268)

2、查询某连续7天的UV

count
----------
42180699
(1 row)
Time: 25291.414 ms (00:25.291)

3、查询某一天相比前一天的新增用户数

count
---------
6610943
(1 row)
Time: 19969.067 ms (00:19.969)

4、查询某一天相比前一天的流失用户数

count
---------
3298421
(1 row)
Time: 19434.652 ms (00:19.435)

5、查询某7天相比前7天的新增用户数

count
----------
42945970
(1 row)
Time: 90321.223 ms (01:30.321)

6、查询某7天相比前7天的流失用户数

count
----------
39791334
(1 row)
Time: 93443.917 ms (01:33.444)

7、查询某14天的UV

count
----------
85126669
(1 row)
Time: 48258.861 ms (00:48.259)

PG 解决方案:

数据构造:

ts date,  -- 日期
gid int,  -- 维度1 , 例如城市
uid hll  -- 用户IDs
);

每天每个GID一条. 不需要原始数据.

insert into pt
select ts, gid, hll_add_agg(hll_hash_bigint(uid)) from
t group by ts,gid;
INSERT 0 1616
Time: 37344.032 ms (00:37.344)
public | pt   | table | postgres | unlogged    | heap          | 2208 kB |

1、查询某一天的UV

?column?
--------------------
3422975.3781066863
(1 row)
Time: 1.530 ms

2、查询某连续7天的UV

?column?
-------------------
42551621.27768603
(1 row)
Time: 4.910 ms

3、查询某一天相比前一天的新增用户数

a as ( select hll_union_agg(uid) uid from pt where ts=current_date+1 ),
b as ( select hll_union_agg(uid) uid from pt where ts=current_date )
select (# hll_union(a.uid,b.uid)) - (# b.uid) from a,b;
?column?
-------------------
6731386.388893194
(1 row)
Time: 2.330 ms

4、查询某一天相比前一天的流失用户数

a as ( select hll_union_agg(uid) uid from pt where ts=current_date+1 ),
b as ( select hll_union_agg(uid) uid from pt where ts=current_date )
select (# hll_union(a.uid,b.uid)) - (# a.uid) from a,b;
?column?
-------------------
3290109.808110645
(1 row)
Time: 2.469 ms

5、查询某7天相比前7天的新增用户数

a as ( select hll_union_agg(uid) uid from pt where ts>=current_date+7 and ts<current_date+14 ),
b as ( select hll_union_agg(uid) uid from pt where ts>=current_date and ts<current_date+7 )
select (# hll_union(a.uid,b.uid)) - (# b.uid) from a,b;
?column?
--------------------
42096480.700727895
(1 row)
Time: 8.762 ms

6、查询某7天相比前7天的流失用户数

a as ( select hll_union_agg(uid) uid from pt where ts>=current_date+7 and ts<current_date+14 ),
b as ( select hll_union_agg(uid) uid from pt where ts>=current_date and ts<current_date+7 )
select (# hll_union(a.uid,b.uid)) - (# a.uid) from a,b;
?column?
--------------------
38055266.104507476
(1 row)
Time: 8.758 ms

7、查询某14天的UV

?column?
-------------------
84648101.97841392
(1 row)
Time: 8.739 ms

总结

方法

传统方法

4224 MB

PG 解决方案

2 MB

测试case

传统方法速度

hll 速度

hll 精度

1、查询某一天的UV

7268 ms

1 ms

97.17%

2、查询某连续7天的UV

25291 ms

4 ms

99.13%

3、查询某一天相比前一天的新增用户数

19969 ms

2 ms

98.21%

4、查询某一天相比前一天的流失用户数

19434 ms

2 ms

100.25%

5、查询某7天相比前7天的新增用户数

90321 ms

8 ms

102.02%

6、查询某7天相比前7天的流失用户数

93443 ms

8 ms

104.56%

7、查询某14天的UV

48258 ms

8 ms

100.57%

关于精度:

https://hub.fastgit.org/citusdata/postgresql-hll

 

上一篇:大数据时代:基于微软案例数据库数据挖掘知识点总结(Microsoft 时序算法——结果预算+下期彩票预测篇)


下一篇:【重新发现PostgreSQL之美】- 33 刘姥姥进大观园