MaxCompute 使用SQL进行重叠交叉区间问题分析

日常需求中经常会会遇到重叠交叉区间统计的需求,比如电商场景中某个品牌的打折活动,第一次活动活动时间为2021-08-01到 2021-08-09,第二次活动时间为 2021-08-05到 2021-08-15其中5号到9号为重复天数,这些天只统计一次,即该品牌打折活动共计15天。

    接下来以某公司的需求的简化版为例进行需求分析

需求:每日服务器的在线人数(享受服务的人数)

order_id(单号)

stime(服务器开始日期)

etime(服务器结束日期)

X01

2021-05-01

2021-06-02

X02

2021-05-28

2021-06-01

X03

2021-06-13

2021-07-13

...

...

...

汇总成如下表:

date(日期)

count(人数)

2021-01-01

45

2021-01-02

90

...

...

分析:对每个单号的开始时间和结束日期进行展开,然后对日期去重

  1. 生成一系列数据,并计算每个单号的日期差
with tb1 as (
    select 
        order_id,
        stime,
        etime,
        datediff(etime,stime) diff
    from values('X01','2021-05-01','2021-06-02'),
               ('X02','2021-05-28','2021-06-01'),
               ('X03','2021-06-13','2021-07-13')
               t(order_id,stime,etime)
)
  1. 使用repeat或者rpad函数对每个单号的开始时间进行增值
--eg:
--返回ababab 
select repeat('ab',3);
--返回abcde12121
select rpad('abcde', 10, '12');
tb2 as (
    select 
        order_id,
        stime,
        etime,
        diff,
        substr(repeat(stime||',',diff),1,11*diff-1) re_t
    from tb1
)
  1. 使用posexplode()方法生成序列
tb3 as (
    select 
        order_id,
        stime,
        etime,
        diff,
        re_time,
        inx,
        start_time
    from tb2 lateral view posexplode(split(re_time,",")) t as inx,start_time
)
--展示数据如下
order_id    stime   etime   diff    re_time inx start_time
X01 2021-05-01  2021-06-02  32  2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01 0   2021-05-01
X01 2021-05-01  2021-06-02  32  2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01 1   2021-05-01
X01 2021-05-01  2021-06-02  32  2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01,2021-05-01 2   2021-05-01
.....
  1. 使用date_add将生成的序列与每个单号的初始时间相加,得到每个单号的在线时间
tb4 as (
    select 
        order_id,
        stime,
        etime,
        diff,
        inx,
        start_time,
        date_add(start_time,inx) as online_time
    from tb3
)
--展示数据如下:
order_id    stime   etime   diff    inx start_time  online_time
X01 2021-05-01  2021-06-02  32  0   2021-05-01  2021-05-01
X01 2021-05-01  2021-06-02  32  1   2021-05-01  2021-05-02
X01 2021-05-01  2021-06-02  32  2   2021-05-01  2021-05-03
X01 2021-05-01  2021-06-02  32  3   2021-05-01  2021-05-04
X01 2021-05-01  2021-06-02  32  4   2021-05-01  2021-05-05
X01 2021-05-01  2021-06-02  32  5   2021-05-01  2021-05-06
X01 2021-05-01  2021-06-02  32  6   2021-05-01  2021-05-07
X01 2021-05-01  2021-06-02  32  7   2021-05-01  2021-05-08
X01 2021-05-01  2021-06-02  32  8   2021-05-01  2021-05-09
X01 2021-05-01  2021-06-02  32  9   2021-05-01  2021-05-10
...
  1. 按在线日期统计每日在线人数
select online_time,count(*) ct from tb4 group by online_time;

若不熟悉posexplode方法的童鞋,可以使用row_number函数进行代替也能达到相同的效果。利用上述思路可以很好的解决业务中关于合并重叠区间或时间段的问题。



上一篇:MaxCompute 使用SQL进行连续区间问题分析


下一篇:MaxCompute SQL使用小技巧之多列转多行