1. Overview
ClickHouse (prior to version 20.6) does not provide an EXPLAIN statement directly, but the execution plan can be recovered from the server logs by raising the log level streamed back to the client to debug or trace:
clickhouse-client -h localhost --send_logs_level=trace <<<"SQL语句" >/dev/null
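Equivalently, the log level can be raised for a single interactive session; send_logs_level is an ordinary settable setting in clickhouse-client (a minimal sketch):
Clickhouse> SET send_logs_level = 'trace';
Clickhouse> select count() from datasets.hits_v1;
After the SET, every subsequent query in the session streams its server-side log lines back to the client.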
We will use the hits_v1 table from the official ClickHouse example datasets to illustrate. The table is defined as follows:
Clickhouse> show create table datasets.hits_v1\G
SHOW CREATE TABLE datasets.hits_v1
Row 1:
──────
statement: CREATE TABLE datasets.hits_v1
(
`WatchID` UInt64,
`JavaEnable` UInt8,
`Title` String,
`GoodEvent` Int16,
`EventTime` DateTime,
`EventDate` Date,
`CounterID` UInt32,
`ClientIP` UInt32,
`ClientIP6` FixedString(16),
`RegionID` UInt32,
......
`RequestNum` UInt32,
`RequestTry` UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192
1 rows in set. Elapsed: 0.009 sec.
Current parts of the table:
Clickhouse> select partition,name from system.parts where table='hits_v1' and active;
SELECT
partition,
name
FROM system.parts
WHERE (table = 'hits_v1') AND active
┌─partition─┬─name───────────┐
│ 201403 │ 201403_10_18_2 │
└───────────┴────────────────┘
1 rows in set. Elapsed: 0.008 sec.
Background merges have already combined all the data into a single part (note: a single part, not a single partition; everything here belongs to the 201403 partition).
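(As an aside, this merged state can also be forced by hand; OPTIMIZE ... FINAL triggers an unscheduled merge of all active parts:)
Clickhouse> optimize table datasets.hits_v1 final;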
To recreate a state with multiple parts, truncate the table and re-import the data:
Clickhouse> truncate table hits_v1;
TRUNCATE TABLE hits_v1
Ok.
0 rows in set. Elapsed: 0.088 sec.
Import the data:
# cat hits_v1.tsv | clickhouse-client --query "INSERT INTO datasets.hits_v1 FORMAT TSV" --max_insert_block_size=100000
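Once the import finishes, a quick sanity check of the row count (it should match the "Read ... rows" figure in the trace logs below):
Clickhouse> select count() from datasets.hits_v1;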
Check the table's parts again:
Clickhouse> select partition,name from system.parts where table='hits_v1' and active=1;
SELECT
partition,
name
FROM system.parts
WHERE (table = 'hits_v1') AND (active = 1)
┌─partition─┬─name───────────┐
│ 201403 │ 201403_19_24_1 │
│ 201403 │ 201403_25_30_1 │
│ 201403 │ 201403_31_31_0 │
│ 201403 │ 201403_32_32_0 │
│ 201403 │ 201403_33_33_0 │
│ 201403 │ 201403_34_34_0 │
│ 201403 │ 201403_35_35_0 │
│ 201403 │ 201403_36_36_0 │
│ 201403 │ 201403_37_37_0 │
│ 201403 │ 201403_38_38_0 │
│ 201403 │ 201403_39_39_0 │
│ 201403 │ 201403_40_40_0 │
└───────────┴────────────────┘
12 rows in set. Elapsed: 0.040 sec.
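Each part name follows the pattern <partition>_<min_block>_<max_block>_<level>, so 201403_19_24_1 is a level-1 merge covering insert blocks 19 through 24. To see how many rows and marks each part holds, system.parts can be queried a little further:
Clickhouse> select name, rows, marks from system.parts where table='hits_v1' and active;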
1. Full table scan over all columns:
# clickhouse-client -h localhost --send_logs_level=trace <<<"select * from datasets.hits_v1" >/dev/null
[hadoop.example.com] 2020.07.14 14:31:26.747392 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Debug> datasets.hits_v1 (SelectExecutor): Key condition: unknown
[hadoop.example.com] 2020.07.14 14:31:26.747460 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Debug> datasets.hits_v1 (SelectExecutor): MinMax index condition: unknown
[hadoop.example.com] 2020.07.14 14:31:26.747521 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Debug> datasets.hits_v1 (SelectExecutor): Selected 6 parts by date, 6 parts by key, 1086 marks to read from 6 ranges
[hadoop.example.com] 2020.07.14 14:31:26.751436 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Trace> datasets.hits_v1 (SelectExecutor): Reading approx. 8896512 rows with 4 streams
[hadoop.example.com] 2020.07.14 14:31:26.753085 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
[hadoop.example.com] 2020.07.14 14:32:40.191365 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Information> executeQuery: Read 8873898 rows, 7.88 GiB in 73.449650215 sec., 120816 rows/sec., 109.85 MiB/sec.
[hadoop.example.com] 2020.07.14 14:32:40.191461 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Debug> MemoryTracker: Peak memory usage (for query): 528.56 MiB.
In the output above, the lines worth focusing on are the following (host, timestamp, and query-id prefixes trimmed):
<Debug> datasets.hits_v1 (SelectExecutor): Key condition: unknown
<Debug> datasets.hits_v1 (SelectExecutor): MinMax index condition: unknown
<Debug> datasets.hits_v1 (SelectExecutor): Selected 6 parts by date, 6 parts by key, 1086 marks to read from 6 ranges
<Trace> datasets.hits_v1 (SelectExecutor): Reading approx. 8896512 rows with 4 streams
<Trace> InterpreterSelectQuery: FetchColumns -> Complete
<Information> executeQuery: Read 8873898 rows, 7.88 GiB in 73.449650215 sec., 120816 rows/sec., 109.85 MiB/sec.
<Debug> MemoryTracker: Peak memory usage (for query): 528.56 MiB.
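Note that all of these lines, including the final executeQuery and MemoryTracker summaries, are only printed once the query has actually run to completion; since the result is discarded anyway, on a very large table it is worth capping it with LIMIT while profiling:
# clickhouse-client -h localhost --send_logs_level=trace <<<"select * from datasets.hits_v1 limit 100" >/dev/null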
2. Full table scan over a single column:
# clickhouse-client -h localhost --send_logs_level=trace <<<"select WatchID from datasets.hits_v1" >/dev/null
<Debug> executeQuery: (from [::1]:58630) select WatchID from datasets.hits_v1
<Trace> ContextAccess (default): Access granted: SELECT(WatchID) ON datasets.hits_v1
<Debug> datasets.hits_v1 (SelectExecutor): Key condition: unknown
<Debug> datasets.hits_v1 (SelectExecutor): MinMax index condition: unknown
<Debug> datasets.hits_v1 (SelectExecutor): Selected 6 parts by date, 6 parts by key, 1086 marks to read from 6 ranges
<Trace> datasets.hits_v1 (SelectExecutor): Reading approx. 8896512 rows with 4 streams
<Trace> InterpreterSelectQuery: FetchColumns -> Complete
<Information> executeQuery: Read 8873898 rows, 67.70 MiB in 0.568247033 sec., 15616268 rows/sec., 119.14 MiB/sec.
<Debug> MemoryTracker: Peak memory usage (for query): 18.90 MiB.
Compare the last two lines with the previous query: since only the WatchID column is read, the data volume drops from 7.88 GiB to 67.70 MiB,
and peak memory usage drops from 528.56 MiB to 18.90 MiB.
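The same figures can also be pulled after the fact from system.query_log (a sketch; assumes query logging is enabled, which it is by default):
Clickhouse> select read_rows, formatReadableSize(read_bytes) AS read, formatReadableSize(memory_usage) AS mem from system.query_log where type = 'QueryFinish' order by event_time desc limit 1;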
3. Using the partition index:
# clickhouse-client -h localhost --send_logs_level=trace <<<"select WatchID from datasets.hits_v1 where EventDate='2014-03-18'" >/dev/null
<Debug> executeQuery: (from [::1]:58632) select WatchID from datasets.hits_v1 where EventDate='2014-03-18'
<Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "EventDate = '2014-03-18'" moved to PREWHERE
<Trace> ContextAccess (default): Access granted: SELECT(WatchID, EventDate) ON datasets.hits_v1
<Debug> datasets.hits_v1 (SelectExecutor): Key condition: (column 1 in [16147, 16147])
<Debug> datasets.hits_v1 (SelectExecutor): MinMax index condition: (column 0 in [16147, 16147])
<Debug> datasets.hits_v1 (SelectExecutor): Selected 6 parts by date, 6 parts by key, 853 marks to read from 70 ranges
<Trace> datasets.hits_v1 (SelectExecutor): Reading approx. 6987776 rows with 4 streams
<Trace> InterpreterSelectQuery: FetchColumns -> Complete
<Information> executeQuery: Read 6965162 rows, 65.35 MiB in 0.202466468 sec., 34401558 rows/sec., 322.75 MiB/sec.
<Debug> MemoryTracker: Peak memory usage (for query): 27.02 MiB.
1. The WHERE clause was automatically rewritten into PREWHERE.
2. The partition (minmax) index was applied: MinMax index condition: (column 0 in [16147, 16147]).
3. It could not prune any parts, however: all 6 parts belong to the same 201403 partition, so all 6 were selected.
Because the leading primary key column (CounterID) is unconstrained, the primary index can only prune weakly on EventDate, and the query still reads 853 marks spread over 70 ranges inside those 6 parts (down from 1086 marks for the full scan):
853 marks to read from 70 ranges
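The value 16147 appearing in both index conditions is simply the Date '2014-03-18' in its internal representation, the number of days since 1970-01-01, which is easy to verify:
Clickhouse> select toUInt16(toDate('2014-03-18')) as day_num;
This returns 16147.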
4. Using the primary key index:
# clickhouse-client -h localhost --send_logs_level=trace <<<"select WatchID from datasets.hits_v1 where EventDate='2014-03-18' and CounterID=67141" >/dev/null
<Debug> executeQuery: (from [::1]:58634) select WatchID from datasets.hits_v1 where EventDate='2014-03-18' and CounterID=67141
<Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "EventDate = '2014-03-18'" moved to PREWHERE
<Trace> ContextAccess (default): Access granted: SELECT(WatchID, EventDate, CounterID) ON datasets.hits_v1
<Debug> datasets.hits_v1 (SelectExecutor): Key condition: (column 1 in [16147, 16147]), (column 0 in [67141, 67141]), and, (column 1 in [16147, 16147]), and
<Debug> datasets.hits_v1 (SelectExecutor): MinMax index condition: (column 0 in [16147, 16147]), unknown, and, (column 0 in [16147, 16147]), and
<Debug> datasets.hits_v1 (SelectExecutor): Selected 6 parts by date, 6 parts by key, 6 marks to read from 6 ranges
<Trace> datasets.hits_v1 (SelectExecutor): Reading approx. 49152 rows with 4 streams
<Trace> InterpreterSelectQuery: FetchColumns -> Complete
<Information> executeQuery: Read 49152 rows, 598.09 KiB in 0.033530217 sec., 1465901 rows/sec., 17.42 MiB/sec.
<Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
The primary key index is now engaged:
Key condition: (column 1 in [16147, 16147]), (column 0 in [67141, 67141]), and, (column 1 in [16147, 16147]), and
Mark ranges scanned:
Selected 6 parts by date, 6 parts by key, 6 marks to read from 6 ranges
Estimated number of rows read into memory:
Reading approx. 49152 rows with 4 streams
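The estimate is an upper bound built from whole marks: 6 marks × index_granularity (8192 rows per mark) = 49152 rows. The exact number of matching rows can be checked directly:
Clickhouse> select count() from datasets.hits_v1 where EventDate = '2014-03-18' and CounterID = 67141;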
Summary:
- Setting the ClickHouse log level to debug or trace makes the execution plan visible through the server logs.
- The SQL statement must actually be executed before ClickHouse prints these log lines; for a large table, use LIMIT to reduce the amount of data returned (as in the full-scan example above).
- Avoid SELECT * queries that read every column.
- Use the available indexes (partition index, primary index, secondary/data-skipping indexes) wherever possible to avoid full table scans.
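Finally, as noted at the top, this log-based workaround is only needed before version 20.6; from 20.6 onwards ClickHouse provides a native EXPLAIN statement (shown here only as a pointer for newer releases):
Clickhouse> explain select WatchID from datasets.hits_v1 where EventDate = '2014-03-18';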