每天分享一个sql,帮助大家找到sql的快乐
需求:统计每个月访问量和累计月访问量
建表语句
CREATE TABLE visit(
userId string,
visitDate string,
visitCount INT
)
ROW format delimited FIELDS TERMINATED BY "\t";
数据
INSERT INTO TABLE visit VALUES
( 'u01', '2017/1/21', 5 ),
( 'u02', '2017/1/23', 6 ),
( 'u03', '2017/1/22', 8 ),
( 'u04', '2017/1/20', 3 ),
( 'u01', '2017/1/23', 6 ),
( 'u01', '2017/2/21', 8 ),
( 'u02', '2017/1/23', 6 ),
( 'u01', '2017/2/22', 4 );
实现
select
tmp2.userId,
tmp2.month,
tmp2.total_month,
sum(tmp2.total_month) over(partition by userId rows between unbounded preceding and current row)
from
(select
tmp1.userId,
tmp1.month,
sum(tmp1.visitCount) as total_month
from
(select
userId,
date_format(replace(visitDate,"/","-"),"yyyy-MM") as month,
visitCount
from
test_sql.test1
)tmp1
group by tmp1.userId,tmp1.month
order by tmp1.userId,tmp1.month
)tmp2
;
结果
tmp2.userid tmp2.month tmp2.total_month sum_window_0
u04 2017-01 3 3
u03 2017-01 8 11
u02 2017-01 12 23
u01 2017-01 11 34
u01 2017-02 12 46
Time taken: 32.593 seconds, Fetched: 5 row(s)
分析
1、从需求来看,需要计算每个月的访问结果和截至当月累计和,并一起展示,必然用到开窗,这个在hive sql系列一讲过了
2、累计效果需要用到rows范围
3、这里日期格式不能直接用,需要用replace函数,这个函数作用最易于想到
4、这里实现步骤,先对日期转格式,在开窗直接计算
扩展
select date_format("2019/12/12","yyyy-MM");
select split("2019/12/12","/");
select replace("2019/12/12","/","-");
select month(replace("2019/12/12","/","-"));
select date_format(replace("2019/12/12","/","-"),"yyyy-MM");
--从开始到当前月累计
sum(tmp2.total_month) over(order by tmp2.month rows between unbounded preceding and current row)
知识点
rows是行数的意思,后面跟的是函数的范围
between是行数在什么范围
unbounded是行数的起点,这里可以将unbounded替换成1,那就是相邻上一个月的意思
preceding是前面的意思
current row是当前行的意思
连一起的意思是参与计算的行数是从起点开始直到当前行,这就是开窗的数据范围,再加上窗口之上的sum函数,
就是统计从起点到当前的和