day06

2023-12-31 10:13:28

今天学

习过程

和小结

一、 Hive查询操作

1.1 全表和特定列查询

1、全表查询：select * from emp；一般不使用select *，因为会降低效率，在企业中根据需要查询列。

2、特定列查询：select 列名，列名… from emp；这种查询在企业中常用。

1.2 列别名

1、重命名一个列

2、便于计算

3、语句中紧跟列名，也可以加关键词 AS

例如：select enamel as name， depton dn from emp；

1.3 运算符

+ - * / % & | ^ ~

1.4 函数

1、总行数：count ：select count（*） cnt from emp；

2、max：select max(sal) max_sal from emp；

3、min：select min(sal) min_sal from emp;

4、sum：select sun(sal) sum_sal from emp;

5、avg：select avg(sal) avg_sal from emp;

1.5 Limit语句

用于限制返回的行数

Select * from emp limit 8；

1.6 Where语句

1、紧跟from

2、过滤不满足条件的数据

如：select * from emp where sal>1000；

1.7 比较运算符

Between/in/is null

1.8 Like和rlike

1、表示类似的值

2、%代表零个或对个字符，_代表一个字符

Rlike

1、支持Java中的正则表达式

1.9 逻辑运算

And/or/not

（1）查询薪水大于1000，部门是30

hive (default)> select * from emp where sal>1000 and deptno=30;

（2）查询薪水大于1000，或者部门是30

hive (default)> select * from emp where sal>1000 or deptno=30;

（3）查询除了20部门和30部门以外的员工信息

hive (default)> select * from emp where deptno not IN(30, 20);

1.10 笛卡尔积

（1）省略连接条件

（2）连接条件无效

（3）所有表中的所有行互相连接

项目中一般不使用笛卡尔积。尽量避免笛卡尔积，join的时候不加on条件，或者无效的on条件，Hive只能使用1个reducer来完成笛卡尔积。

1.11 排序

1、全局排序：order by（asc、desc）

2、Mapreduce内部排序

Sort By：每个Reducer内部进行排序，对全局结果集来说不是排序。

设置reduce个数

hive (default)> set mapreduce.job.reduces=3;

查看设置reduce个数

hive (default)> set mapreduce.job.reduces;

根据部门编号降序查看员工信息

hive (default)> select * from emp sort by empno desc;

将查询结果导入到文件中（按照部门编号降序排序）

hive (default)> insert overwrite local directory '/opt/module/datas/sortby-result'

select * from emp sort by deptno desc;

3、分区排序

Distribute By：类似MR中partition，进行分区，结合sort by使用。

二、压缩和存储

1) 文件存储格式

Hive支持的存储数的格式主要有：TEXTFILE 、SEQUENCEFILE、ORC、PARQUET。

TEXTFILE和SEQUENCEFILE的存储格式都是基于行存储的；

ORC和PARQUET是基于列式存储的。

2) 主流文件存储格式对比实验

压缩特点：ORC>Parquet>textFile

查询速度测试：查询速度相近

三、性能优化

1、Fetch抓取是指，Hive中对某些情况的查询可以不必使用MapReduce计算。例如：SELECT * FROM employees;在这种情况下，Hive可以简单地读取employee对应的存储目录下的文件，然后输出查询结果到控制台。

设置hive.fetch.task.conversion为more，则可以设置是否执行mapreduce程序

2、本地模式

用户可以通过设置hive.exec.mode.local.auto的值为true，来让Hive在适当的时候自动启动这个优化。

3、表的优化

大表小表join：将key相对分散，并且数据量小的表放在join的左边，这样可以有效减少内存溢出错误发生的几率；再进一步，可以使用map join让小的维度表（1000条以下的记录条数）先进内存。在map端完成reduce。

在新版的hive中已经对大表和小表进行了优化

4、动态分区调整

（1）开启动态分区功能（默认true，开启）

hive.exec.dynamic.partition=true

（2）设置为非严格模式（动态分区的模式，默认strict，表示必须指定至少一个分区为静态分区，nonstrict模式表示允许所有的分区字段都可以使用动态分区。）

hive.exec.dynamic.partition.mode=nonstrict

（3）在所有执行MR的节点上，最大一共可以创建多少个动态分区。

hive.exec.max.dynamic.partitions=1000

（4）在每个执行MR的节点上，最大可以创建多少个动态分区。该参数需要根据实际的数据来设定。比如：源数据中包含了一年的数据，即day字段有365个值，那么该参数就需要设置成大于365，如果使用默认值100，则会报错。

hive.exec.max.dynamic.partitions.pernode=100

（5）整个MR Job中，最大可以创建多少个HDFS文件。

hive.exec.max.created.files=100000

（6）当有空分区生成时，是否抛出异常。一般不需要设置。

hive.error.on.empty.partition=false

四、 Hive案例-谷粒影音

创建表：gulivideo_ori，gulivideo_user_ori，

创建表：gulivideo_orc，gulivideo_user_orc

gulivideo_ori：

create table gulivideo_ori(

videoId string,

uploader string,

age int,

category array<string>,

length int,

views int,

rate float,

ratings int,

comments int,

relatedId array<string>)

row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as textfile;

gulivideo_user_ori：

create table gulivideo_user_ori(

uploader string,

videos int,

friends int)

row format delimited

fields terminated by "\t"

stored as textfile;

然后把原始数据插入到orc表中

gulivideo_orc：

create table gulivideo_orc(

videoId string,

uploader string,

age int,

category array<string>,

length int,

views int,

rate float,

ratings int,

comments int,

relatedId array<string>)

clustered by (uploader) into 8 buckets

row format delimited fields terminated by "\t"

collection items terminated by "&"

stored as orc;

gulivideo_user_orc：

create table gulivideo_user_orc(

uploader string,

videos int,

friends int)

row format delimited

fields terminated by "\t"

stored as orc;

10.4.1 统计视频观看数Top10

思路：使用order by按照views字段做一个全局排序即可，同时我们设置只显示前10条。

最终代码：

select

videoId,

uploader,

age,

category,

length,

views,

rate,

ratings,

comments

from

gulivideo_orc

order by

views

desc limit

10;

10.4.2 统计视频类别热度Top10

思路：

1) 即统计每个类别有多少个视频，显示出包含视频最多的前10个类别。

2) 我们需要按照类别group by聚合，然后count组内的videoId个数即可。

3) 因为当前表结构为：一个视频对应一个或多个类别。所以如果要group by类别，需要先将类别进行列转行(展开)，然后再进行count即可。

4) 最后按照热度排序，显示前10条。

最终代码：

select

category_name as category,

count(t1.videoId) as hot

from (

select

videoId,

category_name

from

gulivideo_orc lateral view explode(category) t_catetory as category_name) t1

group by

t1.category_name

order by

hot

desc limit

10;

10.4.3 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

思路：

1) 先找到观看数最高的20个视频所属条目的所有信息，降序排列

2) 把这20条信息中的category分裂出来(列转行)

3) 最后查询视频分类名称和该分类下有多少个Top20的视频

最终代码：

select

category_name as category,

count(t2.videoId) as hot_with_views

from (

select

videoId,

category_name

from (

select

from

gulivideo_orc

order by

views

desc limit

20) t1 lateral view explode(category) t_catetory as category_name) t2

group by

category_name

order by

hot_with_views

desc;

10.4.4 统计视频观看数Top50所关联视频的所属类别Rank

思路：

1) 查询出观看数最多的前50个视频的所有信息(当然包含了每个视频对应的关联视频)，记为临时表t1

t1：观看数前50的视频

select

from

gulivideo_orc

order by

views

desc limit

50;

2) 将找到的50条视频信息的相关视频relatedId列转行，记为临时表t2

t2：将相关视频的id进行列转行操作

select

explode(relatedId) as videoId

from

t1;

3) 将相关视频的id和gulivideo_orc表进行inner join操作

t5：得到两列数据，一列是category，一列是之前查询出来的相关视频id

(select

distinct(t2.videoId),

t3.category

from

inner join

gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name;

4) 按照视频类别进行分组，统计每组视频个数，然后排行

最终代码：

select

category_name as category,

count(t5.videoId) as hot

from (

select

videoId,

category_name

from (

select

distinct(t2.videoId),

t3.category

from (

select

explode(relatedId) as videoId

from (

select

from

gulivideo_orc

order by

views

desc limit

50) t1) t2

inner join

gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name) t5

group by

category_name

order by

hot

desc;

统计每个类别中的视频热度Top10，以Music为例

思路：

1) 要想统计Music类别中的视频热度Top10，需要先找到Music类别，那么就需要将category展开，所以可以创建一张表用于存放categoryId展开的数据。

2) 向category展开的表中插入数据。

3) 统计对应类别（Music）中的视频热度。

最终代码：

创建表类别表：

create table gulivideo_category(

videoId string,

uploader string,

age int,

categoryId string,

length int,

views int,

rate float,

ratings int,

comments int,

relatedId array<string>)

row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as orc;

向类别表中插入数据：

insert into table gulivideo_category

select

videoId,

uploader,

age,

categoryId,

length,

views,

rate,

ratings,

comments,

relatedId

from

gulivideo_orc lateral view explode(category) catetory as categoryId;

统计Music类别的Top10（也可以统计其他）

select

videoId,

views

from

gulivideo_category

where

categoryId = "Music"

order by

views

desc limit

10;

统计每个类别中视频流量Top10，以Music为例

思路：

1) 创建视频类别展开表（categoryId列转行后的表）

2) 按照ratings排序即可

最终代码：

select

videoId,

views,

ratings

from

gulivideo_category

where

categoryId = "Music"

order by

ratings

desc limit

10;

统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

思路：

1) 先找到上传视频最多的10个用户的用户信息

select

from

gulivideo_user_orc

order by

videos

desc limit

10;

2) 通过uploader字段与gulivideo_orc表进行join，得到的信息按照views观看次数进行排序即可。

最终代码：

select

t2.videoId,

t2.views,

t2.ratings,

t1.videos,

t1.friends

from (

select

from

gulivideo_user_orc

order by

videos desc

limit

10) t1

join

gulivideo_orc t2

t1.uploader = t2.uploader

order by

views desc

limit

20;

统计每个类别视频观看数Top10

思路：

1) 先得到categoryId展开的表数据

2) 子查询按照categoryId进行分区，然后分区内排序，并生成递增数字，该递增数字这一列起名为rank列

3) 通过子查询产生的临时表，查询rank值小于等于10的数据行即可。

最终代码：

select

t1.*

from (

select

videoId,

categoryId,

views,

row_number() over(partition by categoryId order by views desc) rank from gulivideo_category) t1

where

rank <= 10;

码农公寓

10.4.1 统计视频观看数Top10

10.4.2 统计视频类别热度Top10

10.4.3 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

10.4.4 统计视频观看数Top50所关联视频的所属类别Rank

统计每个类别中的视频热度Top10，以Music为例

统计每个类别中视频流量Top10，以Music为例

统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

统计每个类别视频观看数Top10

相关文章