id1 | id2 |
1 | 2 |
1 | 3 |
1 | 4 |
1 | 5 |
1 | 6 |
2 | 6 |
2 | 7 |
3 | 8 |
3 | 9 |
要求输出结果
id1 | feiend_bfb |
1 | 55.56% |
2 | 33.33% |
3 | 33.33% |
4 | 11.11% |
5 | 11.11% |
6 | 22.22% |
7 | 11.11% |
8 | 11.11% |
9 | 11.11% |
数据准备
create table temp_friends_0302 ( id1 int,--'用户', id2 int -- '用户好友' ) row format delimited fields terminated by '\t' stored as orc tblproperties ("orc.compress" = "snappy"); insert into temp_friends_0302 values (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 6), (2, 7), (3, 8), (3, 9); select * from temp_friends_0302;
解决方案
其实本题的用意在于求用户数以及用户对应的好友数
1、通过函数计算每个用户的好友数,id1 id2 是相互的,需要union 一下
select id1, id2 from temp_friends_0302 union all select id2 id1, id1 id2 from temp_friends_0302;
所有操作应该以此数据集作为底层数据
计算好友用户数select id1, size(collect_set(id2)) friend_num from (select id1, id2 from temp_friends_0302 union all select id2 id1, id1 id2 from temp_friends_0302) t group by id1;
2、计算用户总数
select id1, sum(1) over () user_num, friend_num from (select id1, size(collect_set(id2)) friend_num from (select id1, id2 from temp_friends_0302 union all select id2 id1, id1 id2 from temp_friends_0302) t group by id1) t;
3、根据第二部结果计算最总结果:用户拥有的朋友总数除以平台上的用户总数,然后乘以100转换成一个百分比。
select id1, cast(friend_num * 100 / t_num as decimal(5, 2)) || '%' feiend_bfb from (select id1, sum(1) over () t_num, friend_num from (select id1, size(collect_set(id2)) friend_num from (select id1, id2 from temp_friends_0302 union all select id2 id1, id1 id2 from temp_friends_0302) t group by id1) t) t;
总结
主要使用了 size(collect_set())函数,collect_set() 具有剔重效果,所以无需考虑重复数据,另外需要注意用户好友是相互的,所以原表需要union一下。