MaxCompute 使用SQL进行连续区间问题分析

连续区间问题在业务分析中也很常见:连续N天登陆、车辆持续运行时长、系统状态报告连续日期等等等,接下来以真实需求的简化版进行分析

案例:某线上作业每天都会运行一次,且独立于其他的任务,运行结果会保留到一张表中run_status_result_1d;运行状态只有两个值:failed/succeeded,数据如下:

run_date

status

...

...

2021-08-01

succeeded

2021-08-02

succeeded

2021-08-03

succeeded

2021-08-04

succeeded

2021-08-05

succeeded

2021-08-06

failed

2021-08-07

failed

2021-08-08

failed

2021-08-09

succeeded

2021-08-10

succeeded

2021-08-11

succeeded

...

...

需求:作业运行状态的连续日期 结果数据展示如下:

status

start_date

end_date

succeeded

2021-08-01

2021-08-05

failed

2021-08-06

2021-08-07

...

...

...

分析:利用sum(if)组合进行分析

  1. 生成一些数据
with tb1 as (
    select 
        run_date,
        status
    from values('2021-08-01','succeeded'),
               ('2021-08-02','succeeded'),
               ('2021-08-03','succeeded'),
               ('2021-08-04','succeeded'), 
               ('2021-08-05','succeeded'), 
               ('2021-08-06','failed'), 
               ('2021-08-07','failed'), 
               ('2021-08-08','failed'), 
               ('2021-08-09','succeeded'), 
               ('2021-08-10','succeeded'), 
               ('2021-08-11','succeeded'), 
               ('2021-08-12','succeeded')
               t(run_date,status)
)
  1. 利用lag函数拿到上一条运行状态
tb2 as (
    select 
        run_date,
        status,
        lag(status,1,status) over(order by run_date) lag_status
    from tb1
)
  1. 利用当前数据状态与上一条状态的异同进行分分组
tb3 as (
    select 
        run_date,
        status,
        lag_status,
        sum(if(status!=lag_status,1,0)) over(order by run_date) flag
    from tb2
)
--数据分组效果如下:
run_date        status      lag_status  flag
2021-08-01  succeeded   succeeded       0
2021-08-02  succeeded   succeeded       0
2021-08-03  succeeded   succeeded       0
2021-08-04  succeeded   succeeded       0
2021-08-05  succeeded   succeeded       0
2021-08-06  failed      succeeded       1
2021-08-07  failed      failed          1
2021-08-08  failed      failed          1
2021-08-09  succeeded   failed          2
2021-08-10  succeeded   succeeded       2
2021-08-11  succeeded   succeeded       2
2021-08-12  succeeded   succeeded       2
  1. 对上述数据按flag求最大时间和最小时间即为结果
select 
    min(status) as status,
    min(run_date) as start_date,
    max(run_date) as end_date
from tb3
group by flag;
--结果如下:
status      start_date  end_date
succeeded   2021-08-01  2021-08-05
failed      2021-08-06  2021-08-08
succeeded   2021-08-09  2021-08-12

对于案例中使用sum(if)组合,在业务需求分析中经常使用,也可以简化很多代码,还未使用的小伙伴,可以尝试使用下

拜了个拜

上一篇:Shell脚本循环控制| 学习笔记


下一篇:MaxCompute 使用SQL进行重叠交叉区间问题分析