大数据开发之Hive优化篇3-Hive的统计信息

备注:
Hive 版本 2.1.1

文章目录

一.Hive统计信息概述

统计数据,比如表或分区的行数和特定列的直方图,在很多方面都很重要。统计数据的一个关键用例是查询优化。统计数据作为成本函数的输入,使优化器可以比较不同的方案并进行选择。统计数据有时可以满足用户查询的目的。用户可以通过只查询存储的统计信息而不是触发长时间运行的执行计划来快速获得某些查询的答案。有些例子是关于用户年龄分布的分位数,人们使用的前10个应用,以及不同的会话数。

1.1 表和分区的统计信息

支持统计的第一个里程碑是支持表和分区级统计。表和分区统计信息现在存储在Hive Metastore中,无论是新创建的表还是现有的表。分区目前支持以下统计信息:
1.表行数
2.表的文件数
3.表的大小

1.2 列的统计信息

Hive 2.3 开始会自动收集列的统计信息

number of distinct values
number of NULL values
number of TRUES/FALSE values
min/max values
avg/max of column length

1.3 Top K 统计信息

除了分区统计外,Hive表的列级top K值也可以估算。
如果用户没有指定倾斜,则最倾斜列的名称和top K值存储在分区或非分区表的倾斜信息中。这适用于新创建的表和现有的表。

二.Hive 统计信息收集方法

对于已经存在的表/或分区,用户可以发出ANALYZE命令来收集统计信息并将其写入Hive MetaStore。下面描述了该命令的语法

语法:

ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]  -- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
  COMPUTE STATISTICS 
  [FOR COLUMNS]          -- (Note: Hive 0.10.0 and later.)
  [CACHE METADATA]       -- (Note: Hive 2.1.0 and later.)
  [NOSCAN];

ANALYZE TABLE Table1 COMPUTE STATISTICS NOSCAN;//只统计文件数和文件大小,不扫描文件行数,执行较快

代码:

-- 收集表的统计信息
ANALYZE TABLE ods_fact_sale COMPUTE STATISTICS;

-- 收集表的列的统计信息
ANALYZE TABLE ods_fact_sale COMPUTE STATISTICS FOR COLUMNS;

-- 收集分区的统计信息
ANALYZE TABLE ods_fact_sale_partion PARTITION(sale_date='2010-04-12') COMPUTE STATISTICS;

-- 收集分区列的统计信息
ANALYZE TABLE ods_fact_sale_partion PARTITION(sale_date='2010-04-12') COMPUTE STATISTICS FOR COLUMNS;

测试记录

hive> 
    > ANALYZE TABLE ods_fact_sale COMPUTE STATISTICS;
Query ID = root_20201224154859_39e3d480-f8dd-4497-b211-407c27fa7e7c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/24 15:49:00 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1608780340033_0005, Tracking URL = http://hp3:8088/proxy/application_1608780340033_0005/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1608780340033_0005
Hadoop job information for Stage-0: number of mappers: 117; number of reducers: 0
2020-12-24 15:49:08,266 Stage-0 map = 0%,  reduce = 0%
2020-12-24 15:49:16,660 Stage-0 map = 2%,  reduce = 0%, Cumulative CPU 11.27 sec
2020-12-24 15:49:22,859 Stage-0 map = 3%,  reduce = 0%, Cumulative CPU 22.04 sec
2020-12-24 15:49:29,038 Stage-0 map = 5%,  reduce = 0%, Cumulative CPU 32.83 sec
2020-12-24 15:49:35,199 Stage-0 map = 7%,  reduce = 0%, Cumulative CPU 43.44 sec
2020-12-24 15:49:41,345 Stage-0 map = 9%,  reduce = 0%, Cumulative CPU 53.97 sec
2020-12-24 15:49:47,505 Stage-0 map = 10%,  reduce = 0%, Cumulative CPU 64.55 sec
2020-12-24 15:49:52,647 Stage-0 map = 11%,  reduce = 0%, Cumulative CPU 69.8 sec
2020-12-24 15:49:53,675 Stage-0 map = 12%,  reduce = 0%, Cumulative CPU 75.07 sec
2020-12-24 15:49:58,800 Stage-0 map = 13%,  reduce = 0%, Cumulative CPU 80.21 sec
2020-12-24 15:49:59,829 Stage-0 map = 14%,  reduce = 0%, Cumulative CPU 85.45 sec
2020-12-24 15:50:04,947 Stage-0 map = 15%,  reduce = 0%, Cumulative CPU 96.56 sec
2020-12-24 15:50:11,096 Stage-0 map = 17%,  reduce = 0%, Cumulative CPU 106.75 sec
2020-12-24 15:50:17,247 Stage-0 map = 19%,  reduce = 0%, Cumulative CPU 117.44 sec
2020-12-24 15:50:23,399 Stage-0 map = 21%,  reduce = 0%, Cumulative CPU 127.82 sec
2020-12-24 15:50:29,550 Stage-0 map = 22%,  reduce = 0%, Cumulative CPU 138.23 sec
2020-12-24 15:50:35,692 Stage-0 map = 24%,  reduce = 0%, Cumulative CPU 148.83 sec
2020-12-24 15:50:40,822 Stage-0 map = 25%,  reduce = 0%, Cumulative CPU 154.07 sec
2020-12-24 15:50:41,848 Stage-0 map = 26%,  reduce = 0%, Cumulative CPU 159.35 sec
2020-12-24 15:50:47,994 Stage-0 map = 27%,  reduce = 0%, Cumulative CPU 170.39 sec
2020-12-24 15:50:53,132 Stage-0 map = 29%,  reduce = 0%, Cumulative CPU 180.88 sec
2020-12-24 15:50:59,262 Stage-0 map = 31%,  reduce = 0%, Cumulative CPU 191.42 sec
2020-12-24 15:51:05,406 Stage-0 map = 32%,  reduce = 0%, Cumulative CPU 201.98 sec
2020-12-24 15:51:11,556 Stage-0 map = 34%,  reduce = 0%, Cumulative CPU 212.58 sec
2020-12-24 15:51:17,693 Stage-0 map = 36%,  reduce = 0%, Cumulative CPU 223.13 sec
2020-12-24 15:51:23,842 Stage-0 map = 38%,  reduce = 0%, Cumulative CPU 233.67 sec
2020-12-24 15:51:29,992 Stage-0 map = 39%,  reduce = 0%, Cumulative CPU 244.13 sec
2020-12-24 15:51:35,113 Stage-0 map = 41%,  reduce = 0%, Cumulative CPU 254.69 sec
2020-12-24 15:51:41,273 Stage-0 map = 43%,  reduce = 0%, Cumulative CPU 265.25 sec
2020-12-24 15:51:47,409 Stage-0 map = 44%,  reduce = 0%, Cumulative CPU 276.11 sec
2020-12-24 15:51:53,551 Stage-0 map = 45%,  reduce = 0%, Cumulative CPU 281.31 sec
2020-12-24 15:51:54,582 Stage-0 map = 46%,  reduce = 0%, Cumulative CPU 286.5 sec
2020-12-24 15:51:59,690 Stage-0 map = 47%,  reduce = 0%, Cumulative CPU 291.84 sec
2020-12-24 15:52:00,712 Stage-0 map = 48%,  reduce = 0%, Cumulative CPU 297.0 sec
2020-12-24 15:52:05,824 Stage-0 map = 49%,  reduce = 0%, Cumulative CPU 302.48 sec
2020-12-24 15:52:06,847 Stage-0 map = 50%,  reduce = 0%, Cumulative CPU 307.62 sec
2020-12-24 15:52:12,988 Stage-0 map = 51%,  reduce = 0%, Cumulative CPU 317.61 sec
2020-12-24 15:52:17,076 Stage-0 map = 52%,  reduce = 0%, Cumulative CPU 323.3 sec
2020-12-24 15:52:20,140 Stage-0 map = 53%,  reduce = 0%, Cumulative CPU 328.78 sec
2020-12-24 15:52:24,232 Stage-0 map = 54%,  reduce = 0%, Cumulative CPU 334.28 sec
2020-12-24 15:52:25,258 Stage-0 map = 55%,  reduce = 0%, Cumulative CPU 339.48 sec
2020-12-24 15:52:29,355 Stage-0 map = 56%,  reduce = 0%, Cumulative CPU 344.79 sec
2020-12-24 15:52:35,498 Stage-0 map = 57%,  reduce = 0%, Cumulative CPU 355.16 sec
2020-12-24 15:52:37,545 Stage-0 map = 58%,  reduce = 0%, Cumulative CPU 360.44 sec
2020-12-24 15:52:41,640 Stage-0 map = 59%,  reduce = 0%, Cumulative CPU 365.71 sec
2020-12-24 15:52:43,685 Stage-0 map = 60%,  reduce = 0%, Cumulative CPU 370.86 sec
2020-12-24 15:52:47,798 Stage-0 map = 61%,  reduce = 0%, Cumulative CPU 376.1 sec
2020-12-24 15:52:49,838 Stage-0 map = 62%,  reduce = 0%, Cumulative CPU 381.33 sec
2020-12-24 15:52:55,969 Stage-0 map = 63%,  reduce = 0%, Cumulative CPU 391.97 sec
2020-12-24 15:53:00,060 Stage-0 map = 64%,  reduce = 0%, Cumulative CPU 397.24 sec
2020-12-24 15:53:02,113 Stage-0 map = 65%,  reduce = 0%, Cumulative CPU 402.45 sec
2020-12-24 15:53:06,201 Stage-0 map = 66%,  reduce = 0%, Cumulative CPU 407.61 sec
2020-12-24 15:53:08,249 Stage-0 map = 67%,  reduce = 0%, Cumulative CPU 412.86 sec
2020-12-24 15:53:11,316 Stage-0 map = 68%,  reduce = 0%, Cumulative CPU 417.91 sec
2020-12-24 15:53:17,447 Stage-0 map = 69%,  reduce = 0%, Cumulative CPU 428.45 sec
2020-12-24 15:53:19,490 Stage-0 map = 70%,  reduce = 0%, Cumulative CPU 433.8 sec
2020-12-24 15:53:23,582 Stage-0 map = 71%,  reduce = 0%, Cumulative CPU 438.97 sec
2020-12-24 15:53:25,630 Stage-0 map = 72%,  reduce = 0%, Cumulative CPU 444.21 sec
2020-12-24 15:53:29,723 Stage-0 map = 73%,  reduce = 0%, Cumulative CPU 449.4 sec
2020-12-24 15:53:31,772 Stage-0 map = 74%,  reduce = 0%, Cumulative CPU 454.85 sec
2020-12-24 15:53:37,908 Stage-0 map = 75%,  reduce = 0%, Cumulative CPU 465.22 sec
2020-12-24 15:53:41,998 Stage-0 map = 76%,  reduce = 0%, Cumulative CPU 470.45 sec
2020-12-24 15:53:44,041 Stage-0 map = 77%,  reduce = 0%, Cumulative CPU 475.78 sec
2020-12-24 15:53:48,134 Stage-0 map = 78%,  reduce = 0%, Cumulative CPU 481.32 sec
2020-12-24 15:53:50,180 Stage-0 map = 79%,  reduce = 0%, Cumulative CPU 486.63 sec
2020-12-24 15:53:56,320 Stage-0 map = 80%,  reduce = 0%, Cumulative CPU 497.34 sec
2020-12-24 15:54:00,415 Stage-0 map = 81%,  reduce = 0%, Cumulative CPU 502.9 sec
2020-12-24 15:54:02,463 Stage-0 map = 82%,  reduce = 0%, Cumulative CPU 508.21 sec
2020-12-24 15:54:05,545 Stage-0 map = 83%,  reduce = 0%, Cumulative CPU 513.57 sec
2020-12-24 15:54:07,595 Stage-0 map = 84%,  reduce = 0%, Cumulative CPU 518.97 sec
2020-12-24 15:54:11,685 Stage-0 map = 85%,  reduce = 0%, Cumulative CPU 524.27 sec
2020-12-24 15:54:17,818 Stage-0 map = 86%,  reduce = 0%, Cumulative CPU 534.97 sec
2020-12-24 15:54:19,864 Stage-0 map = 87%,  reduce = 0%, Cumulative CPU 540.16 sec
2020-12-24 15:54:23,958 Stage-0 map = 88%,  reduce = 0%, Cumulative CPU 545.62 sec
2020-12-24 15:54:27,029 Stage-0 map = 89%,  reduce = 0%, Cumulative CPU 551.63 sec
2020-12-24 15:54:30,099 Stage-0 map = 90%,  reduce = 0%, Cumulative CPU 556.73 sec
2020-12-24 15:54:33,169 Stage-0 map = 91%,  reduce = 0%, Cumulative CPU 561.88 sec
2020-12-24 15:54:39,298 Stage-0 map = 92%,  reduce = 0%, Cumulative CPU 572.27 sec
2020-12-24 15:54:42,360 Stage-0 map = 93%,  reduce = 0%, Cumulative CPU 577.92 sec
2020-12-24 15:54:45,429 Stage-0 map = 94%,  reduce = 0%, Cumulative CPU 583.33 sec
2020-12-24 15:54:48,489 Stage-0 map = 95%,  reduce = 0%, Cumulative CPU 588.64 sec
2020-12-24 15:54:51,558 Stage-0 map = 96%,  reduce = 0%, Cumulative CPU 593.87 sec
2020-12-24 15:54:53,604 Stage-0 map = 97%,  reduce = 0%, Cumulative CPU 598.99 sec
2020-12-24 15:54:59,745 Stage-0 map = 98%,  reduce = 0%, Cumulative CPU 609.31 sec
2020-12-24 15:55:02,809 Stage-0 map = 99%,  reduce = 0%, Cumulative CPU 614.51 sec
2020-12-24 15:55:05,876 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 619.68 sec
MapReduce Total cumulative CPU time: 10 minutes 19 seconds 680 msec
Ended Job = job_1608780340033_0005
MapReduce Jobs Launched: 
Stage-Stage-0: Map: 117   Cumulative CPU: 619.68 sec   HDFS Read: 31436744307 HDFS Write: 10062 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 10 minutes 19 seconds 680 msec
OK
Time taken: 368.167 seconds
hive> 
hive> 
    > 
    > ANALYZE TABLE ods_fact_sale COMPUTE STATISTICS FOR COLUMNS;
Query ID = root_20201224155646_487db6aa-b2a9-4db5-8fdf-fac6877e848b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
20/12/24 15:56:46 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1608780340033_0006, Tracking URL = http://hp3:8088/proxy/application_1608780340033_0006/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1608780340033_0006
Hadoop job information for Stage-0: number of mappers: 117; number of reducers: 1
2020-12-24 15:56:54,278 Stage-0 map = 0%,  reduce = 0%
2020-12-24 15:57:10,723 Stage-0 map = 1%,  reduce = 0%, Cumulative CPU 30.15 sec
2020-12-24 15:57:15,871 Stage-0 map = 2%,  reduce = 0%, Cumulative CPU 39.04 sec
2020-12-24 15:57:33,311 Stage-0 map = 3%,  reduce = 0%, Cumulative CPU 72.97 sec
2020-12-24 15:57:47,659 Stage-0 map = 4%,  reduce = 0%, Cumulative CPU 91.68 sec
2020-12-24 15:57:51,755 Stage-0 map = 5%,  reduce = 0%, Cumulative CPU 113.86 sec
2020-12-24 15:58:07,130 Stage-0 map = 6%,  reduce = 0%, Cumulative CPU 143.87 sec
2020-12-24 15:58:10,211 Stage-0 map = 7%,  reduce = 0%, Cumulative CPU 150.91 sec
2020-12-24 15:58:27,598 Stage-0 map = 8%,  reduce = 0%, Cumulative CPU 184.59 sec
2020-12-24 15:58:28,621 Stage-0 map = 9%,  reduce = 0%, Cumulative CPU 188.36 sec
2020-12-24 15:58:44,993 Stage-0 map = 10%,  reduce = 0%, Cumulative CPU 221.78 sec
2020-12-24 15:59:00,341 Stage-0 map = 11%,  reduce = 0%, Cumulative CPU 240.4 sec
2020-12-24 15:59:04,436 Stage-0 map = 12%,  reduce = 0%, Cumulative CPU 262.05 sec
2020-12-24 15:59:18,756 Stage-0 map = 13%,  reduce = 0%, Cumulative CPU 292.13 sec
2020-12-24 15:59:21,821 Stage-0 map = 14%,  reduce = 0%, Cumulative CPU 298.56 sec
2020-12-24 15:59:39,211 Stage-0 map = 15%,  reduce = 0%, Cumulative CPU 331.74 sec
2020-12-24 15:59:54,542 Stage-0 map = 16%,  reduce = 0%, Cumulative CPU 350.1 sec
2020-12-24 15:59:57,607 Stage-0 map = 17%,  reduce = 0%, Cumulative CPU 368.58 sec
2020-12-24 16:00:12,965 Stage-0 map = 18%,  reduce = 0%, Cumulative CPU 401.74 sec
2020-12-24 16:00:16,031 Stage-0 map = 19%,  reduce = 0%, Cumulative CPU 408.59 sec
2020-12-24 16:00:33,412 Stage-0 map = 20%,  reduce = 0%, Cumulative CPU 442.08 sec
2020-12-24 16:00:34,435 Stage-0 map = 21%,  reduce = 0%, Cumulative CPU 445.34 sec
2020-12-24 16:00:51,813 Stage-0 map = 22%,  reduce = 0%, Cumulative CPU 478.77 sec
2020-12-24 16:01:06,138 Stage-0 map = 23%,  reduce = 0%, Cumulative CPU 497.09 sec
2020-12-24 16:01:10,219 Stage-0 map = 24%,  reduce = 0%, Cumulative CPU 519.28 sec
2020-12-24 16:01:25,557 Stage-0 map = 25%,  reduce = 0%, Cumulative CPU 549.24 sec
2020-12-24 16:01:28,622 Stage-0 map = 26%,  reduce = 0%, Cumulative CPU 556.3 sec
2020-12-24 16:01:45,998 Stage-0 map = 27%,  reduce = 0%, Cumulative CPU 592.85 sec
2020-12-24 16:02:00,308 Stage-0 map = 28%,  reduce = 0%, Cumulative CPU 607.83 sec
2020-12-24 16:02:04,394 Stage-0 map = 29%,  reduce = 0%, Cumulative CPU 629.88 sec
2020-12-24 16:02:19,733 Stage-0 map = 30%,  reduce = 0%, Cumulative CPU 659.87 sec
2020-12-24 16:02:22,801 Stage-0 map = 31%,  reduce = 0%, Cumulative CPU 666.97 sec
2020-12-24 16:02:40,183 Stage-0 map = 32%,  reduce = 0%, Cumulative CPU 704.06 sec
2020-12-24 16:02:54,495 Stage-0 map = 33%,  reduce = 0%, Cumulative CPU 719.11 sec
2020-12-24 16:02:58,584 Stage-0 map = 34%,  reduce = 0%, Cumulative CPU 741.54 sec
2020-12-24 16:03:13,914 Stage-0 map = 35%,  reduce = 0%, Cumulative CPU 771.76 sec
2020-12-24 16:03:16,984 Stage-0 map = 36%,  reduce = 0%, Cumulative CPU 778.54 sec
2020-12-24 16:03:34,393 Stage-0 map = 38%,  reduce = 0%, Cumulative CPU 815.18 sec
2020-12-24 16:03:52,782 Stage-0 map = 39%,  reduce = 0%, Cumulative CPU 852.05 sec
2020-12-24 16:04:08,124 Stage-0 map = 40%,  reduce = 0%, Cumulative CPU 881.99 sec
2020-12-24 16:04:11,186 Stage-0 map = 41%,  reduce = 0%, Cumulative CPU 889.25 sec
2020-12-24 16:04:25,491 Stage-0 map = 42%,  reduce = 0%, Cumulative CPU 918.95 sec
2020-12-24 16:04:28,561 Stage-0 map = 43%,  reduce = 0%, Cumulative CPU 925.56 sec
2020-12-24 16:04:46,956 Stage-0 map = 44%,  reduce = 0%, Cumulative CPU 962.47 sec
2020-12-24 16:05:02,296 Stage-0 map = 45%,  reduce = 0%, Cumulative CPU 992.46 sec
2020-12-24 16:05:04,341 Stage-0 map = 46%,  reduce = 0%, Cumulative CPU 995.63 sec
2020-12-24 16:05:19,663 Stage-0 map = 47%,  reduce = 0%, Cumulative CPU 1029.36 sec
2020-12-24 16:05:22,731 Stage-0 map = 48%,  reduce = 0%, Cumulative CPU 1036.26 sec
2020-12-24 16:05:41,123 Stage-0 map = 50%,  reduce = 0%, Cumulative CPU 1073.25 sec
2020-12-24 16:05:55,414 Stage-0 map = 51%,  reduce = 0%, Cumulative CPU 1101.63 sec
2020-12-24 16:06:08,694 Stage-0 map = 52%,  reduce = 0%, Cumulative CPU 1122.99 sec
2020-12-24 16:06:20,947 Stage-0 map = 53%,  reduce = 0%, Cumulative CPU 1147.63 sec
2020-12-24 16:06:32,191 Stage-0 map = 54%,  reduce = 0%, Cumulative CPU 1166.2 sec
2020-12-24 16:06:39,344 Stage-0 map = 55%,  reduce = 0%, Cumulative CPU 1184.95 sec
2020-12-24 16:06:50,600 Stage-0 map = 56%,  reduce = 0%, Cumulative CPU 1203.09 sec
2020-12-24 16:07:04,895 Stage-0 map = 57%,  reduce = 0%, Cumulative CPU 1236.75 sec
2020-12-24 16:07:12,045 Stage-0 map = 58%,  reduce = 0%, Cumulative CPU 1255.5 sec
2020-12-24 16:07:26,355 Stage-0 map = 59%,  reduce = 0%, Cumulative CPU 1278.24 sec
2020-12-24 16:07:34,532 Stage-0 map = 60%,  reduce = 0%, Cumulative CPU 1296.89 sec
2020-12-24 16:07:44,755 Stage-0 map = 61%,  reduce = 0%, Cumulative CPU 1315.35 sec
2020-12-24 16:07:52,929 Stage-0 map = 62%,  reduce = 0%, Cumulative CPU 1334.18 sec
2020-12-24 16:08:07,246 Stage-0 map = 63%,  reduce = 0%, Cumulative CPU 1367.6 sec
2020-12-24 16:08:17,461 Stage-0 map = 64%,  reduce = 0%, Cumulative CPU 1386.29 sec
2020-12-24 16:08:28,700 Stage-0 map = 65%,  reduce = 0%, Cumulative CPU 1408.92 sec
2020-12-24 16:08:39,950 Stage-0 map = 66%,  reduce = 0%, Cumulative CPU 1427.21 sec
2020-12-24 16:08:47,100 Stage-0 map = 67%,  reduce = 0%, Cumulative CPU 1445.91 sec
2020-12-24 16:08:57,315 Stage-0 map = 68%,  reduce = 0%, Cumulative CPU 1464.37 sec
2020-12-24 16:09:12,646 Stage-0 map = 69%,  reduce = 0%, Cumulative CPU 1498.27 sec
2020-12-24 16:09:19,802 Stage-0 map = 70%,  reduce = 0%, Cumulative CPU 1516.72 sec
2020-12-24 16:09:33,093 Stage-0 map = 71%,  reduce = 0%, Cumulative CPU 1538.74 sec
2020-12-24 16:09:41,276 Stage-0 map = 72%,  reduce = 0%, Cumulative CPU 1557.18 sec
2020-12-24 16:09:51,511 Stage-0 map = 73%,  reduce = 0%, Cumulative CPU 1575.76 sec
2020-12-24 16:09:59,677 Stage-0 map = 74%,  reduce = 0%, Cumulative CPU 1594.37 sec
2020-12-24 16:10:13,990 Stage-0 map = 75%,  reduce = 0%, Cumulative CPU 1627.61 sec
2020-12-24 16:10:27,269 Stage-0 map = 76%,  reduce = 0%, Cumulative CPU 1649.8 sec
2020-12-24 16:10:35,446 Stage-0 map = 77%,  reduce = 0%, Cumulative CPU 1668.07 sec
2020-12-24 16:10:45,677 Stage-0 map = 78%,  reduce = 0%, Cumulative CPU 1686.3 sec
2020-12-24 16:10:53,843 Stage-0 map = 79%,  reduce = 0%, Cumulative CPU 1705.04 sec
2020-12-24 16:11:08,149 Stage-0 map = 80%,  reduce = 0%, Cumulative CPU 1738.48 sec
2020-12-24 16:11:18,364 Stage-0 map = 81%,  reduce = 0%, Cumulative CPU 1756.89 sec
2020-12-24 16:11:29,612 Stage-0 map = 82%,  reduce = 0%, Cumulative CPU 1779.3 sec
2020-12-24 16:11:36,773 Stage-0 map = 82%,  reduce = 27%, Cumulative CPU 1780.01 sec
2020-12-24 16:11:48,014 Stage-0 map = 83%,  reduce = 27%, Cumulative CPU 1798.76 sec
2020-12-24 16:11:54,149 Stage-0 map = 83%,  reduce = 28%, Cumulative CPU 1798.83 sec
2020-12-24 16:12:05,387 Stage-0 map = 84%,  reduce = 28%, Cumulative CPU 1817.22 sec
2020-12-24 16:12:23,780 Stage-0 map = 85%,  reduce = 28%, Cumulative CPU 1836.05 sec
2020-12-24 16:12:56,426 Stage-0 map = 86%,  reduce = 28%, Cumulative CPU 1869.77 sec
2020-12-24 16:13:00,512 Stage-0 map = 86%,  reduce = 29%, Cumulative CPU 1873.22 sec
2020-12-24 16:13:14,810 Stage-0 map = 87%,  reduce = 29%, Cumulative CPU 1888.29 sec
2020-12-24 16:13:35,239 Stage-0 map = 88%,  reduce = 29%, Cumulative CPU 1910.24 sec
2020-12-24 16:13:53,620 Stage-0 map = 89%,  reduce = 29%, Cumulative CPU 1928.79 sec
2020-12-24 16:13:54,635 Stage-0 map = 89%,  reduce = 30%, Cumulative CPU 1928.87 sec
2020-12-24 16:14:12,001 Stage-0 map = 90%,  reduce = 30%, Cumulative CPU 1947.33 sec
2020-12-24 16:14:29,365 Stage-0 map = 91%,  reduce = 30%, Cumulative CPU 1965.5 sec
2020-12-24 16:15:03,046 Stage-0 map = 92%,  reduce = 30%, Cumulative CPU 1999.01 sec
2020-12-24 16:15:07,115 Stage-0 map = 92%,  reduce = 31%, Cumulative CPU 2002.1 sec
2020-12-24 16:15:20,393 Stage-0 map = 93%,  reduce = 31%, Cumulative CPU 2017.13 sec
2020-12-24 16:15:41,846 Stage-0 map = 94%,  reduce = 31%, Cumulative CPU 2038.78 sec
2020-12-24 16:15:59,229 Stage-0 map = 95%,  reduce = 31%, Cumulative CPU 2056.97 sec
2020-12-24 16:16:01,272 Stage-0 map = 95%,  reduce = 32%, Cumulative CPU 2057.0 sec
2020-12-24 16:16:17,622 Stage-0 map = 96%,  reduce = 32%, Cumulative CPU 2075.28 sec
2020-12-24 16:16:35,986 Stage-0 map = 97%,  reduce = 32%, Cumulative CPU 2093.48 sec
2020-12-24 16:17:08,664 Stage-0 map = 98%,  reduce = 32%, Cumulative CPU 2126.8 sec
2020-12-24 16:17:13,761 Stage-0 map = 98%,  reduce = 33%, Cumulative CPU 2130.1 sec
2020-12-24 16:17:27,033 Stage-0 map = 99%,  reduce = 33%, Cumulative CPU 2145.07 sec
2020-12-24 16:17:47,469 Stage-0 map = 100%,  reduce = 33%, Cumulative CPU 2166.34 sec
2020-12-24 16:17:49,517 Stage-0 map = 100%,  reduce = 100%, Cumulative CPU 2168.5 sec
MapReduce Total cumulative CPU time: 36 minutes 8 seconds 500 msec
Ended Job = job_1608780340033_0006
MapReduce Jobs Launched: 
Stage-Stage-0: Map: 117  Reduce: 1   Cumulative CPU: 2168.5 sec   HDFS Read: 31437098919 HDFS Write: 2984 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 36 minutes 8 seconds 500 msec
OK
Time taken: 1264.65 seconds
hive> 
hive> 
    > ANALYZE TABLE ods_fact_sale_partion PARTITION(sale_date='2010-04-12') COMPUTE STATISTICS;
Partition test.ods_fact_sale_partion{sale_date=2010-04-12} stats: [numFiles=4, numRows=767830000, totalSize=990474835, rawDataSize=77550830000, numFilesErasureCoded=0]
OK
Time taken: 0.489 seconds
hive> ANALYZE TABLE ods_fact_sale_partion PARTITION(sale_date='2010-04-12') COMPUTE STATISTICS FOR COLUMNS;
Query ID = root_20201224162258_acae26de-8c00-4767-98df-fc0d02253dd2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 15
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
20/12/24 16:22:58 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1608780340033_0007, Tracking URL = http://hp3:8088/proxy/application_1608780340033_0007/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1608780340033_0007
Hadoop job information for Stage-0: number of mappers: 5; number of reducers: 15
2020-12-24 16:23:05,719 Stage-0 map = 0%,  reduce = 0%
2020-12-24 16:24:06,180 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU 116.75 sec
2020-12-24 16:25:06,499 Stage-0 map = 0%,  reduce = 0%, Cumulative CPU 239.54 sec
2020-12-24 16:25:36,119 Stage-0 map = 7%,  reduce = 0%, Cumulative CPU 300.82 sec
2020-12-24 16:26:13,916 Stage-0 map = 27%,  reduce = 0%, Cumulative CPU 377.86 sec
2020-12-24 16:26:22,081 Stage-0 map = 47%,  reduce = 0%, Cumulative CPU 391.87 sec
2020-12-24 16:27:22,279 Stage-0 map = 47%,  reduce = 0%, Cumulative CPU 511.26 sec
2020-12-24 16:27:46,793 Stage-0 map = 60%,  reduce = 0%, Cumulative CPU 564.64 sec
2020-12-24 16:28:47,033 Stage-0 map = 60%,  reduce = 0%, Cumulative CPU 684.1 sec
2020-12-24 16:28:49,069 Stage-0 map = 67%,  reduce = 0%, Cumulative CPU 690.29 sec
2020-12-24 16:29:49,324 Stage-0 map = 67%,  reduce = 0%, Cumulative CPU 813.4 sec
2020-12-24 16:30:14,845 Stage-0 map = 73%,  reduce = 0%, Cumulative CPU 868.87 sec
2020-12-24 16:31:03,816 Stage-0 map = 87%,  reduce = 0%, Cumulative CPU 969.82 sec
2020-12-24 16:31:20,153 Stage-0 map = 87%,  reduce = 2%, Cumulative CPU 989.16 sec
2020-12-24 16:32:20,399 Stage-0 map = 87%,  reduce = 2%, Cumulative CPU 1051.43 sec
2020-12-24 16:32:25,510 Stage-0 map = 100%,  reduce = 2%, Cumulative CPU 1055.99 sec
2020-12-24 16:32:26,534 Stage-0 map = 100%,  reduce = 7%, Cumulative CPU 1057.8 sec
2020-12-24 16:32:29,601 Stage-0 map = 100%,  reduce = 13%, Cumulative CPU 1059.94 sec
2020-12-24 16:32:30,623 Stage-0 map = 100%,  reduce = 20%, Cumulative CPU 1061.97 sec
2020-12-24 16:32:33,687 Stage-0 map = 100%,  reduce = 27%, Cumulative CPU 1064.31 sec
2020-12-24 16:32:34,708 Stage-0 map = 100%,  reduce = 33%, Cumulative CPU 1066.24 sec
2020-12-24 16:32:36,755 Stage-0 map = 100%,  reduce = 40%, Cumulative CPU 1068.62 sec
2020-12-24 16:32:38,794 Stage-0 map = 100%,  reduce = 47%, Cumulative CPU 1070.93 sec
2020-12-24 16:32:40,834 Stage-0 map = 100%,  reduce = 53%, Cumulative CPU 1072.96 sec
2020-12-24 16:32:41,874 Stage-0 map = 100%,  reduce = 60%, Cumulative CPU 1075.33 sec
2020-12-24 16:32:44,930 Stage-0 map = 100%,  reduce = 67%, Cumulative CPU 1077.72 sec
2020-12-24 16:32:45,947 Stage-0 map = 100%,  reduce = 73%, Cumulative CPU 1080.01 sec
2020-12-24 16:32:49,012 Stage-0 map = 100%,  reduce = 80%, Cumulative CPU 1082.34 sec
2020-12-24 16:32:50,034 Stage-0 map = 100%,  reduce = 87%, Cumulative CPU 1084.35 sec
2020-12-24 16:32:53,099 Stage-0 map = 100%,  reduce = 93%, Cumulative CPU 1086.93 sec
2020-12-24 16:32:54,117 Stage-0 map = 100%,  reduce = 100%, Cumulative CPU 1089.4 sec
MapReduce Total cumulative CPU time: 18 minutes 9 seconds 400 msec
Ended Job = job_1608780340033_0007
MapReduce Jobs Launched: 
Stage-Stage-0: Map: 5  Reduce: 15   Cumulative CPU: 1089.4 sec   HDFS Read: 980630019 HDFS Write: 3666 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 18 minutes 9 seconds 400 msec
OK
Time taken: 597.274 seconds
hive> 

三.查看表和列的统计信息

3.1 desc命令查看统计信息

语法:

DESCRIBE formatted TABLE1;
DESCRIBE formatted TABLE1 PARTITION(ds='2008-04-09', hr=11);

代码

-- 查看普通表 统计信息
desc formatted ods_fact_sale;
-- 查看分区表统计信息
desc formatted ods_fact_sale_partion PARTITION(sale_date='2010-04-12');

测试记录

hive> 
    > 
    > desc formatted ods_fact_sale;
OK
# col_name              data_type               comment             
                 
id                      bigint                                      
sale_date               string                                      
prod_name               string                                      
sale_nums               int                                         
                 
# Detailed Table Information             
Database:               test                     
OwnerType:              USER                     
Owner:                  root                     
CreateTime:             Wed Nov 25 19:14:05 CST 2020     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale    
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"id\":\"true\",\"prod_name\":\"true\",\"sale_date\":\"true\",\"sale_nums\":\"true\"}}
        comment                 Imported by sqoop on 2020/11/25 19:14:01
        numFiles                4                   
        numRows                 767830000           
        rawDataSize             30653263662         
        totalSize               31421093662         
        transient_lastDdlTime   1608796507          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \u0001              
        line.delim              \n                  
        serialization.format    \u0001              
Time taken: 0.072 seconds, Fetched: 37 row(s)
hive> 
    > desc formatted ods_fact_sale_partion PARTITION(sale_date='2010-04-12');
OK
# col_name              data_type               comment             
                 
id                      bigint                                      
prod_name               string                                      
sale_nums               int                                         
                 
# Partition Information          
# col_name              data_type               comment             
                 
sale_date               string                                      
                 
# Detailed Partition Information                 
Partition Value:        [2010-04-12]             
Database:               test                     
Table:                  ods_fact_sale_partion    
CreateTime:             Tue Dec 22 17:53:40 CST 2020     
LastAccessTime:         UNKNOWN                  
Location:               hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale_partion/sale_date=2010-04-12       
Partition Parameters:            
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"id\":\"true\",\"prod_name\":\"true\",\"sale_nums\":\"true\"}}
        numFiles                4                   
        numRows                 767830000           
        rawDataSize             77550830000         
        totalSize               990474835           
        transient_lastDdlTime   1608798129          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde        
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat         
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.126 seconds, Fetched: 36 row(s)
hive> 

3.2 从元数据查看hive表统计信息

非分区表的statics信息存在hive元数据表TABLE_PARAMS中
分区表的statics信息存在hive元数据表PARTITION_PARAMS 中

对于非分区表列的statics信息存在hive元数据表TAB_COL_STATS中
对于非分区表列的statics信息存在hive元数据表PART_COL_STATS中

代码

-- 查看普通表的统计信息
select tp.*
  from DBS d
 inner join TBLS t
  on d.db_id = t.DB_ID
 inner join TABLE_PARAMS tp
  on t.TBL_ID = tp.TBL_ID
 where d.name = 'test'
   and t.TBL_NAME = 'ods_fact_sale';

-- 查看分区表的统计信息
select PM.*
  from DBS d
 inner join TBLS t
  on d.db_id = t.DB_ID
 inner join TABLE_PARAMS tp
  on t.TBL_ID = tp.TBL_ID
 INNER JOIN PARTITIONS P
  ON t.TBL_ID = P.TBL_ID
 INNER JOIN PARTITION_PARAMS PM
  ON P.PART_ID = PM.PART_ID
 where d.name = 'test'
   and t.TBL_NAME = 'ods_fact_sale_partion'
   AND P.PART_NAME = 'sale_date=2010-04-12'
;


-- 查看普通表列的统计信息
select *
  from TAB_COL_STATS tc
where tc.DB_NAME = 'test'
  and tc.TABLE_NAME = 'ods_fact_sale';

-- 查看分区表列的统计信息
select *
  from PART_COL_STATS PC
 where pc.DB_NAME = 'test'
   and pc.TABLE_NAME = 'ods_fact_sale_partion'
   and pc.PARTITION_NAME = 'sale_date=2010-04-12'
;

测试记录

mysql> -- 查看普通表的统计信息
mysql> select tp.*
    ->   from DBS d
    ->  inner join TBLS t
    ->   on d.db_id = t.DB_ID
    ->  inner join TABLE_PARAMS tp
    ->   on t.TBL_ID = tp.TBL_ID
    ->  where d.name = 'test'
    ->    and t.TBL_NAME = 'ods_fact_sale';
+--------+-----------------------+--------------------------------------------------------------------------------------------------------------+
| TBL_ID | PARAM_KEY             | PARAM_VALUE                                                                                                  |
+--------+-----------------------+--------------------------------------------------------------------------------------------------------------+
|   3014 | COLUMN_STATS_ACCURATE | {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","prod_name":"true","sale_date":"true","sale_nums":"true"}} |
|   3014 | comment               | Imported by sqoop on 2020/11/25 19:14:01                                                                     |
|   3014 | numFiles              | 4                                                                                                            |
|   3014 | numFilesErasureCoded  | 0                                                                                                            |
|   3014 | numRows               | 767830000                                                                                                    |
|   3014 | rawDataSize           | 30653263662                                                                                                  |
|   3014 | totalSize             | 31421093662                                                                                                  |
|   3014 | transient_lastDdlTime | 1608796507                                                                                                   |
+--------+-----------------------+--------------------------------------------------------------------------------------------------------------+
8 rows in set (0.00 sec)

mysql> 
mysql> -- 查看分区表的统计信息
mysql> select PM.*
    ->   from DBS d
    ->  inner join TBLS t
    ->   on d.db_id = t.DB_ID
    ->  inner join TABLE_PARAMS tp
    ->   on t.TBL_ID = tp.TBL_ID
    ->  INNER JOIN PARTITIONS P
    ->   ON t.TBL_ID = P.TBL_ID
    ->  INNER JOIN PARTITION_PARAMS PM
    ->   ON P.PART_ID = PM.PART_ID
    ->  where d.name = 'test'
    ->    and t.TBL_NAME = 'ods_fact_sale_partion'
    ->    AND P.PART_NAME = 'sale_date=2010-04-12'
    -> ;
+---------+-----------------------+-------------------------------------------------------------------------------------------+
| PART_ID | PARAM_KEY             | PARAM_VALUE                                                                               |
+---------+-----------------------+-------------------------------------------------------------------------------------------+
|   21450 | COLUMN_STATS_ACCURATE | {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","prod_name":"true","sale_nums":"true"}} |
|   21450 | numFiles              | 4                                                                                         |
|   21450 | numFilesErasureCoded  | 0                                                                                         |
|   21450 | numRows               | 767830000                                                                                 |
|   21450 | rawDataSize           | 77550830000                                                                               |
|   21450 | totalSize             | 990474835                                                                                 |
|   21450 | transient_lastDdlTime | 1608798129                                                                                |
+---------+-----------------------+-------------------------------------------------------------------------------------------+
7 rows in set (0.00 sec)

mysql> 
mysql> -- 查看普通表列的统计信息
mysql> select *
    ->   from TAB_COL_STATS tc
    -> where tc.DB_NAME = 'test'
    ->   and tc.TABLE_NAME = 'ods_fact_sale';
+-------+---------+---------------+-------------+-------------+--------+----------------+-----------------+-------------------+------------------+-----------------------+------------------------+-----------+---------------+-------------+-------------+-----------+------------+---------------+
| CS_ID | DB_NAME | TABLE_NAME    | COLUMN_NAME | COLUMN_TYPE | TBL_ID | LONG_LOW_VALUE | LONG_HIGH_VALUE | DOUBLE_HIGH_VALUE | DOUBLE_LOW_VALUE | BIG_DECIMAL_LOW_VALUE | BIG_DECIMAL_HIGH_VALUE | NUM_NULLS | NUM_DISTINCTS | AVG_COL_LEN | MAX_COL_LEN | NUM_TRUES | NUM_FALSES | LAST_ANALYZED |
+-------+---------+---------------+-------------+-------------+--------+----------------+-----------------+-------------------+------------------+-----------------------+------------------------+-----------+---------------+-------------+-------------+-----------+------------+---------------+
|     1 | test    | ods_fact_sale | id          | bigint      |   3014 |              1 |       787621597 |              NULL |             NULL | NULL                  | NULL                   |         0 |     939949594 |        NULL |        NULL |      NULL |       NULL |    1608797870 |
|     2 | test    | ods_fact_sale | sale_date   | string      |   3014 |           NULL |            NULL |              NULL |             NULL | NULL                  | NULL                   |         0 |           858 |     21.0000 |          21 |      NULL |       NULL |    1608797870 |
|     3 | test    | ods_fact_sale | prod_name   | string      |   3014 |           NULL |            NULL |              NULL |             NULL | NULL                  | NULL                   |         0 |            12 |      5.1111 |           6 |      NULL |       NULL |    1608797870 |
|     4 | test    | ods_fact_sale | sale_nums   | int         |   3014 |              2 |              99 |              NULL |             NULL | NULL                  | NULL                   |         0 |            63 |        NULL |        NULL |      NULL |       NULL |    1608797870 |
+-------+---------+---------------+-------------+-------------+--------+----------------+-----------------+-------------------+------------------+-----------------------+------------------------+-----------+---------------+-------------+-------------+-----------+------------+---------------+
4 rows in set (0.00 sec)

mysql> 
mysql> 
mysql> select *
    ->   from PART_COL_STATS pc
    ->  where pc.DB_NAME = 'test'
    ->    and pc.TABLE_NAME = 'ods_fact_sale_partion'
    ->    and pc.PARTITION_NAME = 'sale_date=2010-04-12'
    -> ;
+-------+---------+-----------------------+----------------------+-------------+-------------+---------+----------------+-----------------+-------------------+------------------+-----------------------+------------------------+-----------+---------------+-------------+-------------+-----------+------------+---------------+
| CS_ID | DB_NAME | TABLE_NAME            | PARTITION_NAME       | COLUMN_NAME | COLUMN_TYPE | PART_ID | LONG_LOW_VALUE | LONG_HIGH_VALUE | DOUBLE_HIGH_VALUE | DOUBLE_LOW_VALUE | BIG_DECIMAL_LOW_VALUE | BIG_DECIMAL_HIGH_VALUE | NUM_NULLS | NUM_DISTINCTS | AVG_COL_LEN | MAX_COL_LEN | NUM_TRUES | NUM_FALSES | LAST_ANALYZED |
+-------+---------+-----------------------+----------------------+-------------+-------------+---------+----------------+-----------------+-------------------+------------------+-----------------------+------------------------+-----------+---------------+-------------+-------------+-----------+------------+---------------+
|     1 | test    | ods_fact_sale_partion | sale_date=2010-04-12 | id          | bigint      |   21450 |              1 |       787621597 |              NULL |             NULL | NULL                  | NULL                   |         0 |     939949594 |        NULL |        NULL |      NULL |       NULL |    1608798775 |
|     2 | test    | ods_fact_sale_partion | sale_date=2010-04-12 | prod_name   | string      |   21450 |           NULL |            NULL |              NULL |             NULL | NULL                  | NULL                   |         0 |            12 |      5.1111 |           6 |      NULL |       NULL |    1608798775 |
|     3 | test    | ods_fact_sale_partion | sale_date=2010-04-12 | sale_nums   | int         |   21450 |              2 |              99 |              NULL |             NULL | NULL                  | NULL                   |         0 |            63 |        NULL |        NULL |      NULL |       NULL |    1608798775 |
+-------+---------+-----------------------+----------------------+-------------+-------------+---------+----------------+-----------------+-------------------+------------------+-----------------------+------------------------+-----------+---------------+-------------+-------------+-----------+------------+---------------+
3 rows in set (0.00 sec)

mysql> 

参考

1.https://cwiki.apache.org/confluence/display/Hive/StatsDev

上一篇:Python坦克大战源代码


下一篇:Dynamics CRM - C# Delete Plugin