Unzip user.zip
[root@hadoop1 test]# unzip user.zip -d /test/bigdatacase/dataset
Archive:  user.zip
  inflating: /test/bigdatacase/dataset/raw_user.csv
  inflating: /test/bigdatacase/dataset/small_user.csv
List the two extracted files, then take a quick look at the first few lines of raw_user.csv:
[root@hadoop1 dataset]# ll
总用量 1025444
-rw-r--r--. 1 root root 1036519392 3月   3 2016 raw_user.csv
-rw-r--r--. 1 root root   13530111 11月 26 2016 small_user.csv
[root@hadoop1 dataset]# head -5 raw_user.csv
user_id,item_id,behavior_type,user_geohash,item_category,time
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
[root@hadoop1 dataset]#
Each record contains the following 6 fields:
user_id (user ID)
item_id (item ID)
behavior_type (the action: browse, favorite, add to cart, or purchase, encoded as 1, 2, 3, 4 respectively)
user_geohash (geohash of the user's location; many records have no value here, so the preprocessing script below drops this field entirely)
item_category (item category)
time (timestamp of the record)
I. Preprocessing the dataset
1. Delete the first line of each file (the field-name header)
The first line of both raw_user.csv and small_user.csv is the header row with the field names. When the data is later loaded into the Hive data warehouse this header is not wanted, so the preprocessing starts by deleting the first line.
[root@hadoop1 dataset]# sed -i '1d' raw_user.csv
[root@hadoop1 dataset]# head -5 raw_user.csv
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
10001082,151466952,1,,5232,2014-12-12 11
[root@hadoop1 dataset]#
Do the same for small_user.csv:
[root@hadoop1 dataset]# sed -i '1d' small_user.csv
[root@hadoop1 dataset]# head -5 small_user.csv
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
10001082,151466952,1,,5232,2014-12-12 11
[root@hadoop1 dataset]#
All of the following steps use the small dataset small_user.csv, which saves time. Once the whole pipeline runs end to end, the large dataset raw_user.csv can be run through it as a final test.
2. Preprocess the fields
The next preprocessing step adds an id field to every record (so that each record is unique), adds a province field (used later for visualization), and drops the user_geohash field (it is not needed in the analysis that follows).
Create a script file pre_deal.sh and put it in the dataset directory, i.e. in the same directory as small_user.csv:
[root@hadoop1 dataset]# cd /test/bigdatacase/dataset
[root@hadoop1 dataset]# vim pre_deal.sh
#!/bin/bash
# The first argument passed to pre_deal.sh is the input file name
infile=$1
# The second argument passed to pre_deal.sh is the output file name
outfile=$2
#awk -F "," 'processing logic' $infile > $outfile
awk -F "," 'BEGIN{
        srand();
        id=0;
        Province[0]="山东";Province[1]="山西";Province[2]="河南";Province[3]="河北";Province[4]="陕西";Province[5]="内蒙古";Province[6]="上海市";
        Province[7]="北京市";Province[8]="重庆市";Province[9]="天津市";Province[10]="福建";Province[11]="广东";Province[12]="广西";Province[13]="云南";
        Province[14]="浙江";Province[15]="贵州";Province[16]="*";Province[17]="*";Province[18]="江西";Province[19]="湖南";Province[20]="湖北";
        Province[21]="黑龙江";Province[22]="吉林";Province[23]="辽宁";Province[24]="江苏";Province[25]="甘肃";Province[26]="青海";Province[27]="四川";
        Province[28]="安徽";Province[29]="宁夏";Province[30]="海南";Province[31]="香港";Province[32]="澳门";Province[33]="*";
    }
    {
        id=id+1;
        value=int(rand()*34);
        print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value]
    }' $infile > $outfile
In this script, the -F option tells awk that the fields of each record are separated by commas.
srand() seeds the random number generator. id is the new field added to the dataset; it is an auto-increment value that grows by 1 for every record, which guarantees that each record is unique.
The script also adds a province field to every record so the data can be visualized later. The Province[] array stores the province names; while scanning the dataset, for each record the statement value=int(rand()*34) generates a random integer between 0 and 33, which is used as an index into Province[] to pick the province name that is appended to that record.
substr($6,1,10) truncates the time field to just the year, month, and day, so that it can later be stored as a DATE.
Each record awk reads has 6 fields, and the 6th field is the timestamp. substr($6,1,10) takes the value of the 6th field and keeps only its first 10 characters. The 6th field looks like "2014-12-08 18" (i.e. 18:00 on 2014-12-08), so after substr($6,1,10) the hour is dropped and only the date remains.
In the statement print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value], the 4th field of each record is discarded; as noted earlier, the user_geohash field is removed during preprocessing. The generated file is tab ("\t") delimited.
Run the script to preprocess small_user.csv:
[root@hadoop1 dataset]# ./pre_deal.sh small_user.csv user_table.txt
-bash: ./pre_deal.sh: 权限不够
# add execute permission
[root@hadoop1 dataset]# chmod +x pre_deal.sh
[root@hadoop1 dataset]# ./pre_deal.sh small_user.csv user_table.txt
[root@hadoop1 dataset]# head -10 user_table.txt
1    10001082    285259775    1    4076     2014-12-08    香港
2    10001082    4368907      1    5503     2014-12-12    山西
3    10001082    4368907      1    5503     2014-12-12    湖北
4    10001082    53616768     1    9762     2014-12-02    河北
5    10001082    151466952    1    5232     2014-12-12    江西
6    10001082    53616768     4    9762     2014-12-02    广东
7    10001082    290088061    1    5503     2014-12-12    *
8    10001082    298397524    1    10894    2014-12-12    湖北
9    10001082    32104252     1    6513     2014-12-12    甘肃
10   10001082    323339743    1    10894    2014-12-12    江西
3. Load the data into the data warehouse
To get the data in user_table.txt into the Hive data warehouse, first upload user_table.txt to the distributed file system HDFS, then create an external table in Hive on top of it to complete the import.
Upload user_table.txt from the local Linux file system to HDFS, under the HDFS directory "/bigdatacase/dataset": create a new directory bigdatacase under the HDFS root and a subdirectory dataset inside it.
[root@hadoop1 dataset]# hdfs dfs -mkdir -p /bigdatacase/dataset
[root@hadoop1 dataset]# hdfs dfs -put /test/bigdatacase/dataset/user_table.txt /bigdatacase/dataset
Check the result:
[root@hadoop1 dataset]# hdfs dfs -cat /bigdatacase/dataset/user_table.txt | head -10
1    10001082    285259775    1    4076     2014-12-08    香港
2    10001082    4368907      1    5503     2014-12-12    山西
3    10001082    4368907      1    5503     2014-12-12    湖北
4    10001082    53616768     1    9762     2014-12-02    河北
5    10001082    151466952    1    5232     2014-12-12    江西
6    10001082    53616768     4    9762     2014-12-02    广东
7    10001082    290088061    1    5503     2014-12-12    *
8    10001082    298397524    1    10894    2014-12-12    湖北
9    10001082    32104252     1    6513     2014-12-12    甘肃
10   10001082    323339743    1    10894    2014-12-12    江西
cat: Unable to write to output stream.
[root@hadoop1 dataset]#
Create the external table
First create a database, then create an external table bigdata_user in it with the columns (id, uid, item_id, behavior_type, item_category, visit_date, province). Enter the following commands at the hive prompt:
hive (default)> CREATE DATABASE dblab;
OK
hive (default)> use dblab;
OK
Time taken: 0.692 seconds
CREATE EXTERNAL TABLE bigdata_user(
    id INT,
    uid STRING,
    item_id STRING,
    behavior_type INT,
    item_category STRING,
    visit_data DATE,
    province STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/bigdatacase/dataset';
The data under the HDFS directory "/bigdatacase/dataset" has now been loaded into the Hive data warehouse. Let's take a look.
Note that the column visit_date DATE was accidentally written as visit_data DATE in the DDL above, so rename the column:
hive (dblab)> ALTER TABLE bigdata_user CHANGE visit_data visit_date date;
OK
Time taken: 9.862 seconds
hive (dblab)> select * from bigdata_user limit 10;
OK
bigdata_user.id    bigdata_user.uid    bigdata_user.item_id    bigdata_user.behavior_type    bigdata_user.item_category    bigdata_user.visit_date    bigdata_user.province
1    10001082    285259775    1    4076     2014-12-08    香港
2    10001082    4368907      1    5503     2014-12-12    山西
3    10001082    4368907      1    5503     2014-12-12    湖北
4    10001082    53616768     1    9762     2014-12-02    河北
5    10001082    151466952    1    5232     2014-12-12    江西
6    10001082    53616768     4    9762     2014-12-02    广东
7    10001082    290088061    1    5503     2014-12-12    *
8    10001082    298397524    1    10894    2014-12-12    湖北
9    10001082    32104252     1    6513     2014-12-12    甘肃
10   10001082    323339743    1    10894    2014-12-12    江西
Time taken: 12.896 seconds, Fetched: 10 row(s)
hive (dblab)> select behavior_type from bigdata_user limit 10;
OK
behavior_type
1
1
1
1
1
4
1
1
1
1
Time taken: 6.927 seconds, Fetched: 10 row(s)
Querying the table through beeline shows the output in a cleaner, more complete form:
0: jdbc:hive2://hadoop001:10000> select * from bigdata_user limit 10;
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
| bigdata_user.id  | bigdata_user.uid  | bigdata_user.item_id  | bigdata_user.behavior_type  | bigdata_user.item_category  | bigdata_user.visit_date  | bigdata_user.province  |
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
| 1                | 10001082          | 285259775             | 1                           | 4076                        | 2014-12-08               | 香港                    |
| 2                | 10001082          | 4368907               | 1                           | 5503                        | 2014-12-12               | 山西                    |
| 3                | 10001082          | 4368907               | 1                           | 5503                        | 2014-12-12               | 湖北                    |
| 4                | 10001082          | 53616768              | 1                           | 9762                        | 2014-12-02               | 河北                    |
| 5                | 10001082          | 151466952             | 1                           | 5232                        | 2014-12-12               | 江西                    |
| 6                | 10001082          | 53616768              | 4                           | 9762                        | 2014-12-02               | 广东                    |
| 7                | 10001082          | 290088061             | 1                           | 5503                        | 2014-12-12               | *                      |
| 8                | 10001082          | 298397524             | 1                           | 10894                       | 2014-12-12               | 湖北                    |
| 9                | 10001082          | 32104252              | 1                           | 6513                        | 2014-12-12               | 甘肃                    |
| 10               | 10001082          | 323339743             | 1                           | 10894                       | 2014-12-12               | 江西                    |
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
10 rows selected (1.055 seconds)
0: jdbc:hive2://hadoop001:10000> select behavior_type from bigdata_user limit 10;
+----------------+
| behavior_type  |
+----------------+
| 1              |
| 1              |
| 1              |
| 1              |
| 1              |
| 4              |
| 1              |
| 1              |
| 1              |
| 1              |
+----------------+
10 rows selected (6.337 seconds)
0: jdbc:hive2://hadoop001:10000>
Inspect the full definition of bigdata_user with show create table bigdata_user;
Inspect the table's basic structure with desc bigdata_user;
hive (dblab)> show create table bigdata_user;
OK
createtab_stmt
CREATE EXTERNAL TABLE `bigdata_user`(
  `id` int,
  `uid` string,
  `item_id` string,
  `behavior_type` int,
  `item_category` string,
  `visit_date` date,
  `province` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://192.168.1.11:8020/bigdatacase/dataset'
TBLPROPERTIES (
  'bucketing_version'='2',
  'transient_lastDdlTime'='1605450738')
Time taken: 35.032 seconds, Fetched: 22 row(s)
hive (dblab)>
hive (dblab)> desc bigdata_user;
OK
col_name    data_type    comment
id                      int
uid                     string
item_id                 string
behavior_type           int
item_category           string
visit_date              date
province                string
Time taken: 14.897 seconds, Fetched: 7 row(s)
hive (dblab)>
II. Simple query analysis
Start by testing a few simple statements.
Show the item behavior type of the first 10 records:
SELECT behavior_type FROM bigdata_user LIMIT 10;
hive (dblab)> SELECT behavior_type FROM bigdata_user LIMIT 10;
OK
behavior_type
1
1
1
1
1
4
1
1
1
1
Time taken: 10.666 seconds, Fetched: 10 row(s)
Query the date and item category for the first 20 records:
SELECT visit_date,item_category FROM bigdata_user LIMIT 20;
hive (dblab)> SELECT visit_date,item_category FROM bigdata_user LIMIT 20;
OK
visit_date    item_category
2014-12-08    4076
2014-12-12    5503
2014-12-12    5503
2014-12-02    9762
2014-12-12    5232
2014-12-02    9762
2014-12-12    5503
2014-12-12    10894
2014-12-12    6513
2014-12-12    10894
2014-12-12    2825
2014-11-28    2825
2014-12-15    3200
2014-12-03    10576
2014-11-20    10576
2014-12-13    10576
2014-12-08    10576
2014-12-14    7079
2014-12-02    6669
2014-12-12    5232
Time taken: 7.74 seconds, Fetched: 20 row(s)
Queries on the table can also use nested subqueries:
hive (dblab)> SELECT e.bh, e.it FROM (SELECT behavior_type AS bh, item_category AS it FROM bigdata_user) AS e LIMIT 20;
OK
e.bh    e.it
1    4076
1    5503
1    5503
1    9762
1    5232
4    9762
1    5503
1    10894
1    6513
1    10894
1    2825
1    2825
1    3200
1    10576
1    10576
1    10576
1    10576
1    7079
1    6669
1    5232
Time taken: 5.872 seconds, Fetched: 20 row(s)
III. Record count analysis
(1) Use the aggregate function count() to find how many rows the table contains
SELECT COUNT(*) FROM bigdata_user;
hive (dblab)> SELECT COUNT(*) FROM bigdata_user; Query ID = root_20201115231503_8cce31a9-81cb-4ef3-92ca-94bafb2086b4 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-15 23:17:28,850 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0001, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0001/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0001 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2020-11-15 23:20:28,863 Stage-1 map = 0%, reduce = 0% 2020-11-15 23:21:29,392 Stage-1 map = 0%, reduce = 0% 2020-11-15 23:22:27,725 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.91 sec 2020-11-15 23:23:10,606 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.72 sec MapReduce Total cumulative CPU time: 11 seconds 970 msec Ended Job = job_1605437253399_0001 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.97 sec HDFS Read: 15603831 HDFS Write: 106 SUCCESS Total MapReduce CPU Time Spent: 11 seconds 970 msec OK _c0 300000 Time taken: 501.726 seconds, Fetched: 1 row(s)
(2) Add DISTINCT inside the function to count how many distinct uid values there are
SELECT COUNT(DISTINCT uid) FROM bigdata_user;
hive (dblab)> SELECT COUNT(DISTINCT uid) FROM bigdata_user; Query ID = root_20201115232536_31d143aa-c5ed-4b53-b0a6-8c64f14f8634 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-15 23:25:58,752 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0002, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0002/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0002 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2020-11-15 23:26:38,676 Stage-1 map = 0%, reduce = 0% 2020-11-15 23:27:38,820 Stage-1 map = 0%, reduce = 0% 2020-11-15 23:28:20,002 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.33 sec 2020-11-15 23:29:05,715 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 9.77 sec 2020-11-15 23:29:11,177 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 10.77 sec MapReduce Total cumulative CPU time: 10 seconds 770 msec Ended Job = job_1605437253399_0002 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 10.77 sec HDFS Read: 15599879 HDFS Write: 103 SUCCESS Total MapReduce CPU Time Spent: 10 seconds 770 msec OK _c0 270 Time taken: 253.202 seconds, Fetched: 1 row(s)
(3) Count how many records are not duplicated (to exclude possible order brushing by customers)
SELECT COUNT(*) FROM (
    SELECT uid,item_id,behavior_type,item_category,visit_date,province
    FROM bigdata_user
    GROUP BY uid,item_id,behavior_type,item_category,visit_date,province
    HAVING COUNT(*)=1
) a;
hive (dblab)> SELECT COUNT(*) FROM (SELECT > uid,item_id,behavior_type,item_category,visit_date,province > FROM > bigdata_user > GROUP BY > uid,item_id,behavior_type,item_category,visit_date,province > HAVING COUNT(*)=1)a; Query ID = root_20201116003743_ffd71e21-1df6-4876-b80e-1672d4c3f19d Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-16 00:38:51,966 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0004, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0004/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0004 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2020-11-16 00:40:19,298 Stage-1 map = 0%, reduce = 0% 2020-11-16 00:41:12,139 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 11.81 sec 2020-11-16 00:41:13,194 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 13.35 sec 2020-11-16 00:42:13,270 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 13.35 sec 2020-11-16 00:42:43,105 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 17.76 sec 2020-11-16 00:42:49,495 Stage-1 map = 100%, reduce = 74%, Cumulative CPU 22.34 sec 2020-11-16 00:42:55,171 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 25.5 sec MapReduce Total cumulative CPU time: 25 seconds 680 msec Ended Job = job_1605437253399_0004 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-16 00:43:09,951 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0005, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0005/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0005 Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1 2020-11-16 00:43:51,371 Stage-2 map = 0%, reduce = 0% 2020-11-16 00:44:11,773 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.71 sec 2020-11-16 00:45:12,075 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.71 sec 2020-11-16 00:45:41,184 Stage-2 map = 100%, reduce = 67%, Cumulative CPU 7.7 sec 2020-11-16 00:45:47,993 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 8.67 sec MapReduce Total cumulative CPU time: 8 seconds 670 msec Ended Job = job_1605437253399_0005 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 25.68 sec HDFS Read: 15606542 HDFS Write: 117 SUCCESS Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 8.67 sec HDFS Read: 6997 HDFS Write: 106 SUCCESS Total MapReduce CPU Time Spent: 34 seconds 350 msec OK _c0 284368 Time taken: 494.126 seconds, Fetched: 1 row(s)
IV. Conditional queries on key fields
1. Queries conditioned on a value range of a key field. A WHERE clause narrows the scope of a query and makes the analysis more precise; the following examples show how.
(1) Count how many item-browsing records there were between 2014-12-10 and 2014-12-13
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > behavior_type='1'
            > AND visit_date<'2014-12-13'
            > AND visit_date>'2014-12-10';
Query ID = root_20201116090806_6b2a5b9c-f996-4caf-baa7-4bea6812cfda
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number>
In order to set a constant number of reducers: set mapreduce.job.reduces=<number>
2020-11-16 09:11:29,381 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0008, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0008/
Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 09:16:53,597 Stage-1 map = 0%, reduce = 0%
2020-11-16 09:17:08,350 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.95 sec
2020-11-16 09:17:34,025 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 16.15 sec
MapReduce Total cumulative CPU time: 16 seconds 350 msec
Ended Job = job_1605437253399_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 16.35 sec HDFS Read: 15606581 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 350 msec
OK
_c0
26329
Time taken: 629.506 seconds, Fetched: 1 row(s)
(2) Using the day of the month as the unit, show the site's sales activity for each day n; the query below counts the distinct users who made a purchase on each day
SELECT COUNT(distinct uid), day(visit_date) FROM bigdata_user WHERE behavior_type='4' GROUP BY day(visit_date);
hive (dblab)> SELECT COUNT(distinct uid), day(visit_date) FROM bigdata_user WHERE > behavior_type='4' GROUP BY day(visit_date); Query ID = root_20201116115335_d26e8b96-2962-4e18-8e18-adbc7324d127 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-16 11:55:06,219 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0009, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0009/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0009 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2020-11-16 11:56:13,054 Stage-1 map = 0%, reduce = 0% 2020-11-16 11:56:35,088 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.08 sec 2020-11-16 11:57:03,002 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.01 sec MapReduce Total cumulative CPU time: 14 seconds 10 msec Ended Job = job_1605437253399_0009 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.01 sec HDFS Read: 15601205 HDFS Write: 618 SUCCESS Total MapReduce CPU Time Spent: 14 seconds 10 msec OK _c0 _c1 37 1 48 2 42 3 38 4 42 5 33 6 42 7 36 8 34 9 40 10 43 11 98 12 39 13 43 14 42 15 44 16 42 17 66 18 38 19 50 20 33 21 34 22 32 23 47 24 34 25 31 26 30 27 34 28 39 29 38 30 Time taken: 212.641 seconds, Fetched: 30 row(s)
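The query above reports distinct buyers per day. If the figure of interest is instead the number of purchase records per day of the month (closer to "items sold"), a COUNT(*) variant such as the sketch below (same table assumed) would give it:

-- Sketch: number of purchase records for each day of the month
SELECT day(visit_date), COUNT(*)
FROM bigdata_user
WHERE behavior_type='4'
GROUP BY day(visit_date);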
2. Queries conditioned on a specific value of a key field, analyzing the other data
For a given date and destination region, count the goods dispatched to that region on that day (i.e. the purchase records for that province and date):
SELECT COUNT(*) FROM bigdata_user WHERE province='江西' AND visit_date='2014-12-12' AND behavior_type='4';
hive (dblab)> SELEct COUNT(*) FROM bigdata_user WHERE > province='江西' AND visit_date='2014-12-12' AND behavior_type='4'; Query ID = root_20201116120451_c13ae714-6cf1-400c-b9c0-d6afdd25ac7a Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-16 12:05:01,440 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0010, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0010/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0010 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2020-11-16 12:06:43,726 Stage-1 map = 0%, reduce = 0% 2020-11-16 12:07:13,387 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.12 sec 2020-11-16 12:07:31,429 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.12 sec MapReduce Total cumulative CPU time: 14 seconds 140 msec Ended Job = job_1605437253399_0010 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 14.14 sec HDFS Read: 15606342 HDFS Write: 102 SUCCESS Total MapReduce CPU Time Spent: 14 seconds 140 msec OK _c0 12 Time taken: 166.15 seconds, Fetched: 1 row(s)
V. Analysis based on user behavior
1. Query the purchase rate or browse rate on a given day
For example, first count the purchases on 2014-12-11:
SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11' AND behavior_type='4';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11' > AND behavior_type='4'; Query ID = root_20201116122136_fbb9be2d-7c88-409d-ada8-0e24ca4e912d Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-16 12:22:11,860 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0011, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0011/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0011 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2020-11-16 12:24:31,024 Stage-1 map = 0%, reduce = 0% 2020-11-16 12:24:40,319 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.37 sec 2020-11-16 12:25:19,783 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 16.99 sec MapReduce Total cumulative CPU time: 16 seconds 990 msec Ended Job = job_1605437253399_0011 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 16.99 sec HDFS Read: 15605958 HDFS Write: 102 SUCCESS Total MapReduce CPU Time Spent: 16 seconds 990 msec OK _c0 69 Time taken: 242.638 seconds, Fetched: 1 row(s)
Then count the total number of clicks on the site on that same day, 2014-12-11:
SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11';
Dividing the purchase count by the click count obtained from the two queries above gives the purchase rate for that day.
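The two counts can also be combined into a single statement; the sketch below (assuming the same bigdata_user table) computes the purchase rate for 2014-12-11 directly:

-- Sketch: purchase rate on 2014-12-11 = purchases / all actions that day
SELECT SUM(CASE WHEN behavior_type=4 THEN 1 ELSE 0 END) / COUNT(*) AS purchase_rate
FROM bigdata_user
WHERE visit_date='2014-12-11';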
2. Query the share of one user's clicks in all click actions on the site on a given day (click actions include browsing, adding to cart, favoriting, and purchasing)
//Count the number of times user 10001082 clicked the site on 2014-12-12
SELECT COUNT(*) FROM bigdata_user WHERE uid=10001082 AND visit_date='2014-12-12';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE > uid=10001082 AND visit_date='2014-12-12'; Query ID = root_20201116124039_d60d642d-b253-49a4-a5d6-b938e307157b Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-16 12:40:52,056 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0013, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0013/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0013 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2020-11-16 12:41:17,591 Stage-1 map = 0%, reduce = 0% 2020-11-16 12:41:44,422 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.98 sec 2020-11-16 12:42:10,097 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.17 sec MapReduce Total cumulative CPU time: 15 seconds 170 msec Ended Job = job_1605437253399_0013 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 15.17 sec HDFS Read: 15606459 HDFS Write: 102 SUCCESS Total MapReduce CPU Time Spent: 15 seconds 170 msec OK _c0 69 Time taken: 95.548 seconds, Fetched: 1 row(s) hive (dblab)>
//Count the clicks on the site by all users on that day
SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-12';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE > visit_date='2014-12-12'; Query ID = root_20201116124543_a447db6c-52cd-4be9-adb6-1a57f5163eb3 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-16 12:45:48,156 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0014, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0014/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0014 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2020-11-16 12:48:59,555 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.29 sec 2020-11-16 12:50:06,680 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.29 sec 2020-11-16 12:51:44,664 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 11.4 sec 2020-11-16 12:51:59,859 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 16.82 sec MapReduce Total cumulative CPU time: 17 seconds 280 msec Ended Job = job_1605437253399_0014 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 17.28 sec HDFS Read: 15605296 HDFS Write: 105 SUCCESS Total MapReduce CPU Time Spent: 17 seconds 280 msec OK _c0 17494 Time taken: 454.833 seconds, Fetched: 1 row(s)
Dividing the result of the first query by the second gives the required ratio.
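As before, the ratio can be computed in a single query; a minimal sketch assuming the same table and user:

-- Sketch: share of all clicks on 2014-12-12 made by user 10001082
SELECT SUM(CASE WHEN uid=10001082 THEN 1 ELSE 0 END) / COUNT(*) AS user_click_share
FROM bigdata_user
WHERE visit_date='2014-12-12';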
3. Given a range for the number of items purchased, find the ids of the users whose purchases on a given day fall in that range
//Find the ids of users who made more than 5 purchases on the site on a given day
SELECT uid FROM bigdata_user WHERE behavior_type='4' AND visit_date='2014-12-12' GROUP BY uid having count(behavior_type='4')>5;
hive (dblab)> SELECT uid FROM bigdata_user WHERE > behavior_type='4' AND visit_date='2014-12-12' > GROUP BY uid having count(behavior_type='4')>5; Query ID = root_20201116130147_72230e9f-b748-4851-974c-1ca5dcceb37c Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-16 13:07:39,493 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0015, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0015/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0015 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2020-11-16 13:09:16,947 Stage-1 map = 0%, reduce = 0% 2020-11-16 13:10:02,481 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.51 sec 2020-11-16 13:10:33,388 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 17.71 sec MapReduce Total cumulative CPU time: 17 seconds 710 msec Ended Job = job_1605437253399_0015 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 17.71 sec HDFS Read: 15607239 HDFS Write: 478 SUCCESS Total MapReduce CPU Time Spent: 17 seconds 710 msec OK uid 100226515 100300684 100555417 100605 10095384 10142625 101490976 101982646 102011320 102030700 102079825 102349447 102612580 102650143 103082347 103139791 103794013 103995979 Time taken: 533.86 seconds, Fetched: 18 row(s)
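A note on the HAVING clause above: count(behavior_type='4') counts every row that reaches the HAVING stage, and the WHERE clause has already restricted those rows to purchases, so it behaves the same as COUNT(*). A more explicit, equivalent form would be:

-- Equivalent form using COUNT(*)
SELECT uid FROM bigdata_user
WHERE behavior_type='4' AND visit_date='2014-12-12'
GROUP BY uid
HAVING COUNT(*)>5;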
VI. Real-time user query analysis
Count how many times users from each region browsed the site
//Create a new table to store the results
CREATE TABLE scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive (dblab)> CREATE TABLE scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday'
            > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
OK
Time taken: 47.899 seconds
//Load the data
INSERT OVERWRITE TABLE scan SELECT province,COUNT(behavior_type) FROM bigdata_user WHERE behavior_type='1' GROUP BY province;
//Take a look
SELECT * FROM scan;
hive (dblab)> INSERT OVERWRITE TABLE scan SELECT province,COUNT(behavior_type) > FROM bigdata_user WHERE behavior_type='1' GROUP BY province; Query ID = root_20201116133805_d502f968-f966-4de9-ac2b-075942cd23dc Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-16 13:38:36,530 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0016, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0016/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0016 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1 2020-11-16 13:39:13,851 Stage-1 map = 0%, reduce = 0% 2020-11-16 13:39:36,587 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.79 sec 2020-11-16 13:39:54,796 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.42 sec MapReduce Total cumulative CPU time: 13 seconds 420 msec Ended Job = job_1605437253399_0016 Loading data to table dblab.scan Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> 2020-11-16 13:40:02,593 INFO [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 Starting Job = job_1605437253399_0017, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0017/ Kill Command = /program/hadoop-3.2.1/bin/mapred job -kill job_1605437253399_0017 Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1 2020-11-16 13:40:34,772 Stage-3 map = 0%, reduce = 0% 2020-11-16 13:41:35,425 Stage-3 map = 0%, reduce = 0% 2020-11-16 13:42:27,059 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 5.0 sec 2020-11-16 13:42:57,783 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 9.3 sec MapReduce Total cumulative CPU time: 9 seconds 300 msec Ended Job = job_1605437253399_0017 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 13.42 sec HDFS Read: 15608313 HDFS Write: 913 SUCCESS Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 9.3 sec HDFS Read: 11053 HDFS Write: 515 SUCCESS Total MapReduce CPU Time Spent: 22 seconds 720 msec OK _col0 _col1 Time taken: 336.34 seconds
hive (dblab)> SELECT * FROM scan;
OK
scan.province    scan.scan
上海市    8394
云南      8357
内蒙古    8146
北京市    8328
*        8186
吉林      8397
四川      8412
天津市    8557
宁夏      8414
安徽      8222
山东      8404
山西      8342
广东      8321
广西      8265
*        8311
江苏      8356
江西      8344
河北      8356
河南      8475
浙江      8321
海南      8417
湖北      8231
湖南      8390
澳门      8342
甘肃      8367
福建      8426
*        8355
贵州      8236
辽宁      8316
重庆市    8197
陕西      8364
青海      8384
香港      8203
黑龙江    8380
Time taken: 8.221 seconds, Fetched: 34 row(s)
hive (dblab)>
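With the scan table populated, the browse count for any single region can now be looked up directly; a small example (the province value is only illustrative):

-- Sketch: browse count for one region
SELECT scan FROM scan WHERE province='上海市';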