A simple Hive project walkthrough

Unzip user.zip

[root@hadoop1 test]# unzip user.zip -d /test/bigdatacase/dataset
Archive:  user.zip
  inflating: /test/bigdatacase/dataset/raw_user.csv  
  inflating: /test/bigdatacase/dataset/small_user.csv  

List the two extracted files and take a look at the first few lines of raw_user.csv:

[root@hadoop1 dataset]# ll
总用量 1025444
-rw-r--r--. 1 root root 1036519392 3月   3 2016 raw_user.csv
-rw-r--r--. 1 root root   13530111 11月 26 2016 small_user.csv
[root@hadoop1 dataset]# head -5 raw_user.csv
user_id,item_id,behavior_type,user_geohash,item_category,time
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
[root@hadoop1 dataset]# 

Each record contains six fields; the fields in the dataset and their meanings are as follows:

user_id (user ID)
item_id (item ID)
behavior_type (the action taken: browse, favorite, add to cart, or purchase, encoded as 1, 2, 3, 4 respectively)
user_geohash (geohash of the user's location; it is missing in some records, so the preprocessing script later drops this field entirely)
item_category (item category)
time (when the record was generated)

 

I. Preprocessing the dataset

1. Delete the first line of each file (the header row)

The first line of both raw_user.csv and small_user.csv holds the field names. The header row is not needed when the data is imported into the Hive data warehouse, so we remove it as part of preprocessing.

[root@hadoop1 dataset]# sed -i '1d' raw_user.csv
[root@hadoop1 dataset]# head -5 raw_user.csv
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
10001082,151466952,1,,5232,2014-12-12 11
[root@hadoop1 dataset]# 

Do the same for small_user.csv:

[root@hadoop1 dataset]# sed -i '1d' small_user.csv
[root@hadoop1 dataset]# head -5 small_user.csv
10001082,285259775,1,97lk14c,4076,2014-12-08 18
10001082,4368907,1,,5503,2014-12-12 12
10001082,4368907,1,,5503,2014-12-12 12
10001082,53616768,1,,9762,2014-12-02 15
10001082,151466952,1,,5232,2014-12-12 11
[root@hadoop1 dataset]# 

From here on we work with the small dataset small_user.csv to save time. Once the whole pipeline runs end to end, you can repeat it with the full raw_user.csv dataset.

 

2. Preprocess the fields

Next we do some field-level preprocessing: add an id field to each record (so that every record is unique), add a province field (for the later visualization), and drop the user_geohash field (it is not needed in the analysis that follows).

Create a script file pre_deal.sh and put it in the dataset directory, alongside small_user.csv:

[root@hadoop1 dataset]# cd /test/bigdatacase/dataset
[root@hadoop1 dataset]# vim pre_deal.sh
#!/bin/bash
# Input file: the first argument passed to pre_deal.sh
infile=$1
# Output file: the second argument passed to pre_deal.sh
outfile=$2


#awk -F "," 'processing logic' $infile > $outfile
awk -F "," 'BEGIN{
                srand();
                id=0;
                Province[0]="山东";Province[1]="山西";Province[2]="河南";Province[3]="河北";Province[4]="陕西";Province[5]="内蒙古";Province[6]="上海市";
                Province[7]="北京市";Province[8]="重庆市";Province[9]="天津市";Province[10]="福建";Province[11]="广东";Province[12]="广西";Province[13]="云南"; 
                Province[14]="浙江";Province[15]="贵州";Province[16]="*";Province[17]="*";Province[18]="江西";Province[19]="湖南";Province[20]="湖北";
                Province[21]="黑龙江";Province[22]="吉林";Province[23]="辽宁"; Province[24]="江苏";Province[25]="甘肃";Province[26]="青海";Province[27]="四川";
                Province[28]="安徽"; Province[29]="宁夏";Province[30]="海南";Province[31]="香港";Province[32]="澳门";Province[33]="*";
        }
        {
                id=id+1;
                value=int(rand()*34);       
                print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value]
        }' $infile > $outfile

In this script, the -F option tells awk that the fields within each record are separated by commas.

srand() seeds the random number generator. id is a new field added to the dataset; it increments by 1 for every record, so each record is guaranteed to be unique.

A province field is also added to each record to support the later visualization. The Province[] array stores the province names; for every record in the input file, value=int(rand()*34) generates a random integer between 0 and 33, which is used as an index into Province[] to pick the province name appended to that record.

substr($6,1,10) extracts the year-month-day part of the time field, so that it can later be stored as a DATE.

Each record awk reads has six fields, the sixth being the time field. substr($6,1,10) takes the value of the sixth field and keeps its first 10 characters; since the field looks like "2014-12-08 18" (i.e. 18:00 on December 8, 2014), the hour is dropped and only the date remains.

In the statement print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value], the fourth field of each record is discarded; as mentioned above, the user_geohash field is removed during preprocessing. The generated file is tab ("\t") delimited.

 

Run the script to preprocess small_user.csv:

[root@hadoop1 dataset]# ./pre_deal.sh small_user.csv user_table.txt
-bash: ./pre_deal.sh: 权限不够

# add execute permission
[root@hadoop1 dataset]# chmod +x pre_deal.sh 
[root@hadoop1 dataset]# ./pre_deal.sh small_user.csv user_table.txt
[root@hadoop1 dataset]# head -10 user_table.txt 
1   10001082    285259775   1   4076    2014-12-08  香港
2   10001082    4368907 1   5503    2014-12-12  山西
3   10001082    4368907 1   5503    2014-12-12  湖北
4   10001082    53616768    1   9762    2014-12-02  河北
5   10001082    151466952   1   5232    2014-12-12  江西
6   10001082    53616768    4   9762    2014-12-02  广东
7   10001082    290088061   1   5503    2014-12-12  *
8   10001082    298397524   1   10894   2014-12-12  湖北
9   10001082    32104252    1   6513    2014-12-12  甘肃
10  10001082    323339743   1   10894   2014-12-12  江西

3. Import the data into the data warehouse

To load the data in user_table.txt into the Hive data warehouse, first upload user_table.txt to HDFS, then create an external table in Hive on top of it to complete the import.

Upload user_table.txt from the local Linux filesystem to the "/bigdatacase/dataset" directory in HDFS: create a new bigdatacase directory under the HDFS root, with a dataset subdirectory inside it.

[root@hadoop1 dataset]# hdfs dfs -mkdir -p /bigdatacase/dataset
[root@hadoop1 dataset]# hdfs dfs -put /test/bigdatacase/dataset/user_table.txt /bigdatacase/dataset

Check the result:

[root@hadoop1 dataset]# hdfs dfs -cat /bigdatacase/dataset/user_table.txt | head -10
1   10001082    285259775   1   4076    2014-12-08  香港
2   10001082    4368907 1   5503    2014-12-12  山西
3   10001082    4368907 1   5503    2014-12-12  湖北
4   10001082    53616768    1   9762    2014-12-02  河北
5   10001082    151466952   1   5232    2014-12-12  江西
6   10001082    53616768    4   9762    2014-12-02  广东
7   10001082    290088061   1   5503    2014-12-12  *
8   10001082    298397524   1   10894   2014-12-12  湖北
9   10001082    32104252    1   6513    2014-12-12  甘肃
10  10001082    323339743   1   10894   2014-12-12  江西
cat: Unable to write to output stream.
[root@hadoop1 dataset]# 

Create the external table

First create a database, then create an external table bigdata_user in it with the columns (id, uid, item_id, behavior_type, item_category, visit_date, province). Enter the following at the hive prompt:

hive (default)> CREATE DATABASE dblab;
OK
hive (default)> use dblab;
OK
Time taken: 0.692 seconds
CREATE EXTERNAL TABLE bigdata_user(
id INT,
uid STRING,
item_id STRING,
behavior_type INT,
item_category STRING,
visit_data DATE,
province STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
LOCATION '/bigdatacase/dataset';

The data under "/bigdatacase/dataset" in HDFS has now been loaded into the Hive data warehouse. Let's take a look.

 

visit_date DATE was accidentally typed as visit_data DATE here, so rename the column:

hive (dblab)> ALTER TABLE bigdata_user CHANGE visit_data visit_date date;
OK
Time taken: 9.862 seconds
hive (dblab)> select * from bigdata_user limit 10;
OK
bigdata_user.id bigdata_user.uid    bigdata_user.item_id    bigdata_user.behavior_type  bigdata_user.item_category  bigdata_user.visit_date bigdata_user.province
1   10001082    285259775   1   4076    2014-12-08  香港
2   10001082    4368907 1   5503    2014-12-12  山西
3   10001082    4368907 1   5503    2014-12-12  湖北
4   10001082    53616768    1   9762    2014-12-02  河北
5   10001082    151466952   1   5232    2014-12-12  江西
6   10001082    53616768    4   9762    2014-12-02  广东
7   10001082    290088061   1   5503    2014-12-12  *
8   10001082    298397524   1   10894   2014-12-12  湖北
9   10001082    32104252    1   6513    2014-12-12  甘肃
10  10001082    323339743   1   10894   2014-12-12  江西
Time taken: 12.896 seconds, Fetched: 10 row(s)
hive (dblab)> select behavior_type from bigdata_user limit 10;
OK
behavior_type
1
1
1
1
1
4
1
1
1
1
Time taken: 6.927 seconds, Fetched: 10 row(s)

 

Querying the table through beeline gives more complete, nicely formatted output:

0: jdbc:hive2://hadoop001:10000> select * from bigdata_user limit 10;
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
| bigdata_user.id  | bigdata_user.uid  | bigdata_user.item_id  | bigdata_user.behavior_type  | bigdata_user.item_category  | bigdata_user.visit_date  | bigdata_user.province  |
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
| 1                | 10001082          | 285259775             | 1                           | 4076                        | 2014-12-08               | 香港                     |
| 2                | 10001082          | 4368907               | 1                           | 5503                        | 2014-12-12               | 山西                     |
| 3                | 10001082          | 4368907               | 1                           | 5503                        | 2014-12-12               | 湖北                     |
| 4                | 10001082          | 53616768              | 1                           | 9762                        | 2014-12-02               | 河北                     |
| 5                | 10001082          | 151466952             | 1                           | 5232                        | 2014-12-12               | 江西                     |
| 6                | 10001082          | 53616768              | 4                           | 9762                        | 2014-12-02               | 广东                     |
| 7                | 10001082          | 290088061             | 1                           | 5503                        | 2014-12-12               | *                     |
| 8                | 10001082          | 298397524             | 1                           | 10894                       | 2014-12-12               | 湖北                     |
| 9                | 10001082          | 32104252              | 1                           | 6513                        | 2014-12-12               | 甘肃                     |
| 10               | 10001082          | 323339743             | 1                           | 10894                       | 2014-12-12               | 江西                     |
+------------------+-------------------+-----------------------+-----------------------------+-----------------------------+--------------------------+------------------------+
10 rows selected (1.055 seconds)
0: jdbc:hive2://hadoop001:10000> select behavior_type from bigdata_user limit 10;
+----------------+
| behavior_type  |
+----------------+
| 1              |
| 1              |
| 1              |
| 1              |
| 1              |
| 4              |
| 1              |
| 1              |
| 1              |
| 1              |
+----------------+
10 rows selected (6.337 seconds)
0: jdbc:hive2://hadoop001:10000> 

 

 

Check the full definition of bigdata_user with show create table bigdata_user;

Check the table's basic structure with desc bigdata_user;

hive (dblab)> show create table bigdata_user;
OK
createtab_stmt
CREATE EXTERNAL TABLE `bigdata_user`(
  `id` int, 
  `uid` string, 
  `item_id` string, 
  `behavior_type` int, 
  `item_category` string, 
  `visit_date` date, 
  `province` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'='\t', 
  'serialization.format'='\t') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://192.168.1.11:8020/bigdatacase/dataset'
TBLPROPERTIES (
  'bucketing_version'='2', 
  'transient_lastDdlTime'='1605450738')
Time taken: 35.032 seconds, Fetched: 22 row(s)
hive (dblab)> 
hive (dblab)> desc bigdata_user;
OK
col_name    data_type   comment
id                      int                                         
uid                     string                                      
item_id                 string                                      
behavior_type           int                                         
item_category           string                                      
visit_date              date                                        
province                string                                      
Time taken: 14.897 seconds, Fetched: 7 row(s)
hive (dblab)> 

 

II. Simple query analysis

First, test a few simple queries:

View the behavior type for the first 10 records:

SELECT behavior_type FROM bigdata_user LIMIT 10;

hive (dblab)> SELECT behavior_type FROM bigdata_user LIMIT 10;
OK
behavior_type
1
1
1
1
1
4
1
1
1
1
Time taken: 10.666 seconds, Fetched: 10 row(s)

Query the visit date and item category for the first 20 records:

SELECT visit_date,item_category FROM bigdata_user LIMIT 20;

hive (dblab)> SELECT visit_date,item_category FROM bigdata_user LIMIT 20;
OK
visit_date  item_category
2014-12-08  4076
2014-12-12  5503
2014-12-12  5503
2014-12-02  9762
2014-12-12  5232
2014-12-02  9762
2014-12-12  5503
2014-12-12  10894
2014-12-12  6513
2014-12-12  10894
2014-12-12  2825
2014-11-28  2825
2014-12-15  3200
2014-12-03  10576
2014-11-20  10576
2014-12-13  10576
2014-12-08  10576
2014-12-14  7079
2014-12-02  6669
2014-12-12  5232
Time taken: 7.74 seconds, Fetched: 20 row(s)

Queries can also use nested subqueries:

hive (dblab)> SELECT e.bh, e.it FROM (SELECT behavior_type AS bh, item_category AS it FROM bigdata_user) AS e LIMIT 20;
OK
e.bh    e.it
1   4076
1   5503
1   5503
1   9762
1   5232
4   9762
1   5503
1   10894
1   6513
1   10894
1   2825
1   2825
1   3200
1   10576
1   10576
1   10576
1   10576
1   7079
1   6669
1   5232
Time taken: 5.872 seconds, Fetched: 20 row(s)

 

III. Row count statistics

(1) Use the aggregate function count() to find how many rows the table contains

SELECT COUNT(*) FROM bigdata_user;

hive (dblab)> SELECT COUNT(*) FROM bigdata_user;
Query ID = root_20201115231503_8cce31a9-81cb-4ef3-92ca-94bafb2086b4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-15 23:17:28,850 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0001, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0001/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-15 23:20:28,863 Stage-1 map = 0%,  reduce = 0%
2020-11-15 23:21:29,392 Stage-1 map = 0%,  reduce = 0%
2020-11-15 23:22:27,725 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.91 sec
2020-11-15 23:23:10,606 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.72 sec
MapReduce Total cumulative CPU time: 11 seconds 970 msec
Ended Job = job_1605437253399_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 11.97 sec   HDFS Read: 15603831 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 970 msec
OK
_c0
300000
Time taken: 501.726 seconds, Fetched: 1 row(s)

(2) Add DISTINCT inside the function to count how many distinct uid values there are

SELECT COUNT(DISTINCT uid) FROM bigdata_user;

hive (dblab)> SELECT COUNT(DISTINCT uid) FROM bigdata_user;
Query ID = root_20201115232536_31d143aa-c5ed-4b53-b0a6-8c64f14f8634
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-15 23:25:58,752 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0002, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0002/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-15 23:26:38,676 Stage-1 map = 0%,  reduce = 0%
2020-11-15 23:27:38,820 Stage-1 map = 0%,  reduce = 0%
2020-11-15 23:28:20,002 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.33 sec
2020-11-15 23:29:05,715 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 9.77 sec
2020-11-15 23:29:11,177 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 10.77 sec
MapReduce Total cumulative CPU time: 10 seconds 770 msec
Ended Job = job_1605437253399_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 10.77 sec   HDFS Read: 15599879 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 770 msec
OK
_c0
270
Time taken: 253.202 seconds, Fetched: 1 row(s)

(3) Count how many rows are unique (to filter out possible repeat/fake orders)

SELECT COUNT(*) FROM (SELECT
uid,item_id,behavior_type,item_category,visit_date,province 
FROM
bigdata_user
GROUP BY
uid,item_id,behavior_type,item_category,visit_date,province 
HAVING COUNT(*)=1)a;
hive (dblab)> SELECT COUNT(*) FROM (SELECT
            > uid,item_id,behavior_type,item_category,visit_date,province 
            > FROM
            > bigdata_user
            > GROUP BY
            > uid,item_id,behavior_type,item_category,visit_date,province 
            > HAVING COUNT(*)=1)a;
Query ID = root_20201116003743_ffd71e21-1df6-4876-b80e-1672d4c3f19d
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 00:38:51,966 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0004, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0004/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 00:40:19,298 Stage-1 map = 0%,  reduce = 0%
2020-11-16 00:41:12,139 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 11.81 sec
2020-11-16 00:41:13,194 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 13.35 sec
2020-11-16 00:42:13,270 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 13.35 sec
2020-11-16 00:42:43,105 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 17.76 sec
2020-11-16 00:42:49,495 Stage-1 map = 100%,  reduce = 74%, Cumulative CPU 22.34 sec
2020-11-16 00:42:55,171 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 25.5 sec
MapReduce Total cumulative CPU time: 25 seconds 680 msec
Ended Job = job_1605437253399_0004
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 00:43:09,951 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0005, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0005/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0005
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-11-16 00:43:51,371 Stage-2 map = 0%,  reduce = 0%
2020-11-16 00:44:11,773 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.71 sec
2020-11-16 00:45:12,075 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 2.71 sec
2020-11-16 00:45:41,184 Stage-2 map = 100%,  reduce = 67%, Cumulative CPU 7.7 sec
2020-11-16 00:45:47,993 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 8.67 sec
MapReduce Total cumulative CPU time: 8 seconds 670 msec
Ended Job = job_1605437253399_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 25.68 sec   HDFS Read: 15606542 HDFS Write: 117 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 8.67 sec   HDFS Read: 6997 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 350 msec
OK
_c0
284368
Time taken: 494.126 seconds, Fetched: 1 row(s)

 

IV. Conditional query analysis

1. Queries conditioned on a value range. A WHERE clause narrows the scope of the analysis and makes it more precise; the examples below illustrate this.

(1) Query how many item views there were between December 10, 2014 and December 13, 2014

hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE 
            > behavior_type='1'
            > AND visit_date<'2014-12-13'
            > AND visit_date>'2014-12-10';
Query ID = root_20201116090806_6b2a5b9c-f996-4caf-baa7-4bea6812cfda
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 09:11:29,381 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0008, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0008/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 09:16:53,597 Stage-1 map = 0%,  reduce = 0%
2020-11-16 09:17:08,350 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 9.95 sec
2020-11-16 09:17:34,025 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 16.15 sec
MapReduce Total cumulative CPU time: 16 seconds 350 msec
Ended Job = job_1605437253399_0008
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 16.35 sec   HDFS Read: 15606581 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 350 msec
OK
_c0
26329
Time taken: 629.506 seconds, Fetched: 1 row(s)

(2) Using the day of the month as the unit, show for each day how many distinct users made a purchase

SELECT COUNT(distinct uid), day(visit_date) FROM bigdata_user WHERE
behavior_type='4' GROUP BY day(visit_date);

 

hive (dblab)> SELECT COUNT(distinct uid), day(visit_date) FROM bigdata_user WHERE
            > behavior_type='4' GROUP BY day(visit_date);
Query ID = root_20201116115335_d26e8b96-2962-4e18-8e18-adbc7324d127
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 11:55:06,219 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0009, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0009/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 11:56:13,054 Stage-1 map = 0%,  reduce = 0%
2020-11-16 11:56:35,088 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 8.08 sec
2020-11-16 11:57:03,002 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 14.01 sec
MapReduce Total cumulative CPU time: 14 seconds 10 msec
Ended Job = job_1605437253399_0009
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 14.01 sec   HDFS Read: 15601205 HDFS Write: 618 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 10 msec
OK
_c0 _c1
37  1
48  2
42  3
38  4
42  5
33  6
42  7
36  8
34  9
40  10
43  11
98  12
39  13
43  14
42  15
44  16
42  17
66  18
38  19
50  20
33  21
34  22
32  23
47  24
34  25
31  26
30  27
34  28
39  29
38  30
Time taken: 212.641 seconds, Fetched: 30 row(s)

2. Queries conditioned on specific field values

For a given date and province, count the orders shipped to that province on that day:

SELECT COUNT(*) FROM bigdata_user WHERE 
province='江西' AND visit_date='2014-12-12' AND behavior_type='4';
hive (dblab)> SELEct COUNT(*) FROM bigdata_user WHERE 
            > province='江西' AND visit_date='2014-12-12' AND behavior_type='4';
Query ID = root_20201116120451_c13ae714-6cf1-400c-b9c0-d6afdd25ac7a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 12:05:01,440 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0010, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0010/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 12:06:43,726 Stage-1 map = 0%,  reduce = 0%
2020-11-16 12:07:13,387 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 9.12 sec
2020-11-16 12:07:31,429 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.12 sec
MapReduce Total cumulative CPU time: 14 seconds 140 msec
Ended Job = job_1605437253399_0010
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 14.14 sec   HDFS Read: 15606342 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 140 msec
OK
_c0
12
Time taken: 166.15 seconds, Fetched: 1 row(s)

V. Analysis based on user behavior

1. Query the purchase or browse rate of items on a given day

For example:

SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11'
AND behavior_type='4';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE visit_date='2014-12-11'
            > AND behavior_type='4';
Query ID = root_20201116122136_fbb9be2d-7c88-409d-ada8-0e24ca4e912d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 12:22:11,860 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0011, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0011/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 12:24:31,024 Stage-1 map = 0%,  reduce = 0%
2020-11-16 12:24:40,319 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 8.37 sec
2020-11-16 12:25:19,783 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 16.99 sec
MapReduce Total cumulative CPU time: 16 seconds 990 msec
Ended Job = job_1605437253399_0011
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 16.99 sec   HDFS Read: 15605958 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 990 msec
OK
_c0
69
Time taken: 242.638 seconds, Fetched: 1 row(s)

Query how many clicks the shop received on 2014-12-11:

SELECT COUNT(*) FROM bigdata_user WHERE
 visit_date='2014-12-11';

The two queries above give the number of purchases and the number of clicks for the day; dividing the first by the second gives that day's purchase rate.
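
The same calculation can also be done in a single statement, letting Hive do the division itself. This is only a sketch against the bigdata_user table created above; the purchase_rate alias is made up for illustration:

-- Sketch: purchase rate on 2014-12-11 = purchase records / all click records that day
SELECT SUM(CASE WHEN behavior_type = 4 THEN 1 ELSE 0 END) / COUNT(*) AS purchase_rate
FROM bigdata_user
WHERE visit_date = '2014-12-11';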

2. Query the share of one user's clicks on a given day out of all click actions on that day (clicks include browsing, adding to cart, favoriting, and purchasing)

//Number of times user 10001082 clicked the site on 2014-12-12

SELECT COUNT(*) FROM bigdata_user WHERE
uid=10001082 AND visit_date='2014-12-12';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > uid=10001082 AND visit_date='2014-12-12';
Query ID = root_20201116124039_d60d642d-b253-49a4-a5d6-b938e307157b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 12:40:52,056 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0013, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0013/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0013
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 12:41:17,591 Stage-1 map = 0%,  reduce = 0%
2020-11-16 12:41:44,422 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 8.98 sec
2020-11-16 12:42:10,097 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.17 sec
MapReduce Total cumulative CPU time: 15 seconds 170 msec
Ended Job = job_1605437253399_0013
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 15.17 sec   HDFS Read: 15606459 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 170 msec
OK
_c0
69
Time taken: 95.548 seconds, Fetched: 1 row(s)
hive (dblab)> 

//Number of times all users clicked the site on that day

SELECT COUNT(*) FROM bigdata_user WHERE
visit_date='2014-12-12';
hive (dblab)> SELECT COUNT(*) FROM bigdata_user WHERE
            > visit_date='2014-12-12';
Query ID = root_20201116124543_a447db6c-52cd-4be9-adb6-1a57f5163eb3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 12:45:48,156 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0014, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0014/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 12:48:59,555 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 9.29 sec
2020-11-16 12:50:06,680 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 9.29 sec
2020-11-16 12:51:44,664 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 11.4 sec
2020-11-16 12:51:59,859 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 16.82 sec
MapReduce Total cumulative CPU time: 17 seconds 280 msec
Ended Job = job_1605437253399_0014
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 17.28 sec   HDFS Read: 15605296 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 280 msec
OK
_c0
17494
Time taken: 454.833 seconds, Fetched: 1 row(s)

Dividing the results of the two queries above gives the desired ratio.
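
Here too, the ratio can be computed in one statement if preferred. Again this is only a sketch against the same bigdata_user table, with a made-up alias user_click_ratio:

-- Sketch: share of user 10001082's clicks among all clicks on 2014-12-12
SELECT SUM(CASE WHEN uid = '10001082' THEN 1 ELSE 0 END) / COUNT(*) AS user_click_ratio
FROM bigdata_user
WHERE visit_date = '2014-12-12';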

3. Given a purchase-count threshold, find the ids of users who bought more than that many times on a given day

//User ids with more than 5 purchases on the site on a given day

SELECT uid FROM bigdata_user WHERE
behavior_type='4' AND visit_date='2014-12-12'
GROUP BY uid having count(behavior_type='4')>5;
hive (dblab)> SELECT uid FROM bigdata_user WHERE
            > behavior_type='4' AND visit_date='2014-12-12'
            > GROUP BY uid having count(behavior_type='4')>5;
Query ID = root_20201116130147_72230e9f-b748-4851-974c-1ca5dcceb37c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 13:07:39,493 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0015, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0015/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 13:09:16,947 Stage-1 map = 0%,  reduce = 0%
2020-11-16 13:10:02,481 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.51 sec
2020-11-16 13:10:33,388 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 17.71 sec
MapReduce Total cumulative CPU time: 17 seconds 710 msec
Ended Job = job_1605437253399_0015
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 17.71 sec   HDFS Read: 15607239 HDFS Write: 478 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 710 msec
OK
uid
100226515
100300684
100555417
100605
10095384
10142625
101490976
101982646
102011320
102030700
102079825
102349447
102612580
102650143
103082347
103139791
103794013
103995979
Time taken: 533.86 seconds, Fetched: 18 row(s)

VI. Real-time user query analysis

Number of times users from each region browsed the site

//Create a new table to store the results

CREATE TABLE scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive (dblab)> CREATE TABLE scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday' 
            > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
OK
Time taken: 47.899 seconds
//Insert the data
INSERT OVERWRITE TABLE scan SELECT province,COUNT(behavior_type)
FROM bigdata_user WHERE behavior_type='1' GROUP BY province;

//Check the result
SELECT * FROM scan;
hive (dblab)> INSERT OVERWRITE TABLE scan SELECT province,COUNT(behavior_type)
            > FROM bigdata_user WHERE behavior_type='1' GROUP BY province;
Query ID = root_20201116133805_d502f968-f966-4de9-ac2b-075942cd23dc
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 13:38:36,530 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0016, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0016/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-16 13:39:13,851 Stage-1 map = 0%,  reduce = 0%
2020-11-16 13:39:36,587 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.79 sec
2020-11-16 13:39:54,796 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 13.42 sec
MapReduce Total cumulative CPU time: 13 seconds 420 msec
Ended Job = job_1605437253399_0016
Loading data to table dblab.scan
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2020-11-16 13:40:02,593 INFO  [6d9dd317-47d9-447c-bba6-8b3d0a9f3ae9 main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Starting Job = job_1605437253399_0017, Tracking URL = http://192.168.1.12:8089/proxy/application_1605437253399_0017/
Kill Command = /program/hadoop-3.2.1/bin/mapred job  -kill job_1605437253399_0017
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2020-11-16 13:40:34,772 Stage-3 map = 0%,  reduce = 0%
2020-11-16 13:41:35,425 Stage-3 map = 0%,  reduce = 0%
2020-11-16 13:42:27,059 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 5.0 sec
2020-11-16 13:42:57,783 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 9.3 sec
MapReduce Total cumulative CPU time: 9 seconds 300 msec
Ended Job = job_1605437253399_0017
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 13.42 sec   HDFS Read: 15608313 HDFS Write: 913 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 9.3 sec   HDFS Read: 11053 HDFS Write: 515 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 720 msec
OK
_col0   _col1
Time taken: 336.34 seconds
hive (dblab)> SELECT * FROM scan;
OK
scan.province   scan.scan
上海市 8394
云南  8357
内蒙古 8146
北京市 8328
*  8186
吉林  8397
四川  8412
天津市 8557
宁夏  8414
安徽  8222
山东  8404
山西  8342
广东  8321
广西  8265
*  8311
江苏  8356
江西  8344
河北  8356
河南  8475
浙江  8321
海南  8417
湖北  8231
湖南  8390
澳门  8342
甘肃  8367
福建  8426
*  8355
贵州  8236
辽宁  8316
重庆市 8197
陕西  8364
青海  8384
香港  8203
黑龙江 8380
Time taken: 8.221 seconds, Fetched: 34 row(s)
hive (dblab)> 
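
Because the per-province counts are now materialized in the scan table, follow-up questions run quickly against the small result set. For example (a sketch, not part of the original steps), the five provinces with the most browse actions:

-- Sketch: top 5 provinces by browse count, read from the pre-aggregated scan table
SELECT province, scan FROM scan ORDER BY scan DESC LIMIT 5;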

 
