日常需求中经常会会遇到重叠交叉区间统计的需求,比如电商场景中某个品牌的打折活动,第一次活动活动时间为2021-08-01到 2021-08-09,第二次活动时间为 2021-08-05到 2021-08-15其中5号到9号为重复天数,这些天只统计一次,即该品牌打折活动共计15天。
接下来以某公司的需求的简化版为例进行需求分析
需求:每日服务器的在线人数(享受服务的人数)
order_id(单号) |
stime(服务器开始日期) |
etime(服务器结束日期) |
X01 |
2021-05-01 |
2021-06-02 |
X02 |
2021-05-28 |
2021-06-01 |
X03 |
2021-06-13 |
2021-07-13 |
... |
... |
... |
汇总成如下表:
date(日期) |
count(人数) |
2021-01-01 |
45 |
2021-01-02 |
90 |
... |
... |
分析:对每个单号的开始时间和结束日期进行展开,然后对日期去重
- 生成一系列数据,并计算每个单号的日期差
with tb1 as ( select order_id, stime, etime, datediff(etime,stime) diff from values('X01','2021-05-01','2021-06-02'), ('X02','2021-05-28','2021-06-01'), ('X03','2021-06-13','2021-07-13') t(order_id,stime,etime) )
- 使用repeat或者rpad函数对每个单号的开始时间进行增值
--eg: --返回ababab select repeat('ab',3); --返回abcde12121 select rpad('abcde', 10, '12');
tb2 as ( select order_id, stime, etime, diff, substr(repeat(stime||',',diff),1,11*diff-1) re_t from tb1 )
- 使用posexplode()方法生成序列
tb3 as ( select order_id, stime, etime, diff, re_time, inx, start_time from tb2 lateral view posexplode(split(re_time,",")) t as inx,start_time ) --展示数据如下 order_id stime etime diff re_time inx start_time X01 2021-05-01 2021-06-02 32 2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01 0 2021-05-01 X01 2021-05-01 2021-06-02 32 2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01 1 2021-05-01 X01 2021-05-01 2021-06-02 32 2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01 2 2021-05-01 .....
- 使用date_add将生成的序列与每个单号的初始时间相加,得到每个单号的在线时间
tb4 as ( select order_id, stime, etime, diff, inx, start_time, date_add(start_time,inx) as online_time from tb3 ) --展示数据如下: order_id stime etime diff inx start_time online_time X01 2021-05-01 2021-06-02 32 0 2021-05-01 2021-05-01 X01 2021-05-01 2021-06-02 32 1 2021-05-01 2021-05-02 X01 2021-05-01 2021-06-02 32 2 2021-05-01 2021-05-03 X01 2021-05-01 2021-06-02 32 3 2021-05-01 2021-05-04 X01 2021-05-01 2021-06-02 32 4 2021-05-01 2021-05-05 X01 2021-05-01 2021-06-02 32 5 2021-05-01 2021-05-06 X01 2021-05-01 2021-06-02 32 6 2021-05-01 2021-05-07 X01 2021-05-01 2021-06-02 32 7 2021-05-01 2021-05-08 X01 2021-05-01 2021-06-02 32 8 2021-05-01 2021-05-09 X01 2021-05-01 2021-06-02 32 9 2021-05-01 2021-05-10 ...
- 按在线日期统计每日在线人数
select online_time,count(*) ct from tb4 group by online_time;
若不熟悉posexplode方法的童鞋,可以使用row_number函数进行代替也能达到相同的效果。利用上述思路可以很好的解决业务中关于合并重叠区间或时间段的问题。