1 需求描述
统计硅谷影音视频网站的常规指标,各种 TopN 指标: -- 统计视频观看数 Top10 -- 统计视频类别热度 Top10 -- 统计出视频观看数最高的 20 个视频的所属类别以及类别包含 Top20 视频的个数 -- 统计视频观看数 Top50 所关联视频的所属类别排序 -- 统计每个类别中的视频热度 Top10,以 Music 为例 -- 统计每个类别视频观看数 Top10 -- 统计上传视频最多的用户 Top10 以及他们上传的视频观看次数在前 20 的视频2 数据结构
1)视频表 2)用户表3 准备工作
3.1 准备表 1)需要准备的表 创建原始数据表:gulivideo_ori,gulivideo_user_ori, 创建最终表:gulivideo_orc,gulivideo_user_orc 2)创建原始数据表: (1)gulivideo_oricreate table gulivideo_ori( videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string>) row format delimited fields terminated by "\t" collection items terminated by "&" stored as textfile;(2)创建原始数据表: gulivideo_user_ori
create table gulivideo_user_ori( uploader string, videos int, friends int) row format delimited fields terminated by "\t" stored as textfile;2)创建 orc 存储格式带 snappy 压缩的表: (1)gulivideo_orc
create table gulivideo_orc( videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string>) stored as orc tblproperties("orc.compress"="SNAPPY");(2)gulivideo_user_orc
create table gulivideo_user_orc( uploader string, videos int, friends int) row format delimited fields terminated by "\t" stored as orc tblproperties("orc.compress"="SNAPPY");(3)向 ori 表插入数据
load data local inpath "/opt/module/data/video" into table gulivideo_ori; load data local inpath "/opt/module/user" into table gulivideo_user_ori;(4)向 orc 表插入数据
insert into table gulivideo_orc select * from gulivideo_ori; insert into table gulivideo_user_orc select * from gulivideo_user_ori;3.2 安装 Tez 引擎(了解)