背景
场景:
游戏、社交、电商场景.
流失用户、新增用户、UV计算.
滑动计算, 任意窗口.
挑战:
数据量大、计算量大
传统方案需要记录明细, 才能支持滑动计算
PG解决方案:
采用HLL类型不需要存储明细, 支持滑动, 交、并、差计算.
例子
传统方案:
数据构造:
ts date, -- 日期
gid int, -- 维度1 , 例如城市
uid int8 -- 用户ID
);
写入1亿条记录, 跨度15天:
select current_date+(random()*15)::int, random()*100, random()*800000000
from generate_series(1,100000000);
public | t | table | postgres | unlogged | heap | 4224 MB |
1、查询某一天的UV
count
---------
3326250
(1 row)
Time: 7268.339 ms (00:07.268)
2、查询某连续7天的UV
count
----------
42180699
(1 row)
Time: 25291.414 ms (00:25.291)
3、查询某一天相比前一天的新增用户数
count
---------
6610943
(1 row)
Time: 19969.067 ms (00:19.969)
4、查询某一天相比前一天的流失用户数
count
---------
3298421
(1 row)
Time: 19434.652 ms (00:19.435)
5、查询某7天相比前7天的新增用户数
count
----------
42945970
(1 row)
Time: 90321.223 ms (01:30.321)
6、查询某7天相比前7天的流失用户数
count
----------
39791334
(1 row)
Time: 93443.917 ms (01:33.444)
7、查询某14天的UV
count
----------
85126669
(1 row)
Time: 48258.861 ms (00:48.259)
PG 解决方案:
数据构造:
ts date, -- 日期
gid int, -- 维度1 , 例如城市
uid hll -- 用户IDs
);
每天每个GID一条. 不需要原始数据.
insert into pt
select ts, gid, hll_add_agg(hll_hash_bigint(uid)) from
t group by ts,gid;
INSERT 0 1616
Time: 37344.032 ms (00:37.344)
public | pt | table | postgres | unlogged | heap | 2208 kB |
1、查询某一天的UV
?column?
--------------------
3422975.3781066863
(1 row)
Time: 1.530 ms
2、查询某连续7天的UV
?column?
-------------------
42551621.27768603
(1 row)
Time: 4.910 ms
3、查询某一天相比前一天的新增用户数
a as ( select hll_union_agg(uid) uid from pt where ts=current_date+1 ),
b as ( select hll_union_agg(uid) uid from pt where ts=current_date )
select (# hll_union(a.uid,b.uid)) - (# b.uid) from a,b;
?column?
-------------------
6731386.388893194
(1 row)
Time: 2.330 ms
4、查询某一天相比前一天的流失用户数
a as ( select hll_union_agg(uid) uid from pt where ts=current_date+1 ),
b as ( select hll_union_agg(uid) uid from pt where ts=current_date )
select (# hll_union(a.uid,b.uid)) - (# a.uid) from a,b;
?column?
-------------------
3290109.808110645
(1 row)
Time: 2.469 ms
5、查询某7天相比前7天的新增用户数
a as ( select hll_union_agg(uid) uid from pt where ts>=current_date+7 and ts<current_date+14 ),
b as ( select hll_union_agg(uid) uid from pt where ts>=current_date and ts<current_date+7 )
select (# hll_union(a.uid,b.uid)) - (# b.uid) from a,b;
?column?
--------------------
42096480.700727895
(1 row)
Time: 8.762 ms
6、查询某7天相比前7天的流失用户数
a as ( select hll_union_agg(uid) uid from pt where ts>=current_date+7 and ts<current_date+14 ),
b as ( select hll_union_agg(uid) uid from pt where ts>=current_date and ts<current_date+7 )
select (# hll_union(a.uid,b.uid)) - (# a.uid) from a,b;
?column?
--------------------
38055266.104507476
(1 row)
Time: 8.758 ms
7、查询某14天的UV
?column?
-------------------
84648101.97841392
(1 row)
Time: 8.739 ms
总结
方法 |
存储空间 |
传统方法 |
4224 MB |
PG 解决方案 |
2 MB |
测试case |
传统方法速度 |
hll 速度 |
hll 精度 |
1、查询某一天的UV |
7268 ms |
1 ms |
97.17% |
2、查询某连续7天的UV |
25291 ms |
4 ms |
99.13% |
3、查询某一天相比前一天的新增用户数 |
19969 ms |
2 ms |
98.21% |
4、查询某一天相比前一天的流失用户数 |
19434 ms |
2 ms |
100.25% |
5、查询某7天相比前7天的新增用户数 |
90321 ms |
8 ms |
102.02% |
6、查询某7天相比前7天的流失用户数 |
93443 ms |
8 ms |
104.56% |
7、查询某14天的UV |
48258 ms |
8 ms |
100.57% |
关于精度:
https://hub.fastgit.org/citusdata/postgresql-hll