【clickhouse】ClickHouse SQL execution plan

1. Overview

Reposted from: clickhouse SQL execution plan

ClickHouse (before version 20.6) does not provide an EXPLAIN statement directly, but you can get equivalent information from the server logs: set the log level to debug or trace and the execution plan is printed while the query runs.
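For reference: ClickHouse 20.6 and later ship a native EXPLAIN statement, so on a newer server you can simply run something like:

Clickhouse> EXPLAIN SELECT count() FROM datasets.hits_v1;

Everything below uses the log-based workaround.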

clickhouse-client -h localhost --send_logs_level=trace  <<<"SQL statement" >/dev/null
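The result rows go to stdout and the log lines go to stderr, so >/dev/null discards the data and leaves only the plan on the terminal. If you want just the interesting lines, a variant like this works (a sketch; the grep pattern is only a suggestion):

clickhouse-client -h localhost --send_logs_level=trace <<<"SQL statement" 2>&1 >/dev/null | grep -E 'SelectExecutor|executeQuery'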

We will use the hits_v1 table from ClickHouse's official example datasets to illustrate. The table is defined as follows:

Clickhouse> show create table datasets.hits_v1\G
 
SHOW CREATE TABLE datasets.hits_v1
 
Row 1:
──────
statement: CREATE TABLE datasets.hits_v1
(
    `WatchID` UInt64,
    `JavaEnable` UInt8,
    `Title` String,
    `GoodEvent` Int16,
    `EventTime` DateTime,
    `EventDate` Date,
    `CounterID` UInt32,
    `ClientIP` UInt32,
    `ClientIP6` FixedString(16),
    `RegionID` UInt32,
......
    `RequestNum` UInt32,
    `RequestTry` UInt8
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192
 
1 rows in set. Elapsed: 0.009 sec. 
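Two clauses of this definition drive everything we will see in the logs (annotation added, not part of the original output):

-- PARTITION BY toYYYYMM(EventDate): each part stores a MinMax index on the
--   partition key, which enables partition pruning ("parts by date").
-- ORDER BY (CounterID, EventDate, intHash32(UserID)): the sparse primary key;
--   one mark is written every index_granularity = 8192 rows ("marks"/"ranges").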
 

Partition layout of the table:

Clickhouse> select partition,name from system.parts where table='hits_v1' and active;
 
SELECT 
    partition,
    name
FROM system.parts
WHERE (table = 'hits_v1') AND active
 
┌─partition─┬─name───────────┐
│ 201403    │ 201403_10_18_2 │
└───────────┴────────────────┘
 
1 rows in set. Elapsed: 0.008 sec. 

The background merges have already combined everything into a single part.
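system.parts can also show how large each part is; a convenient variant (all columns below exist in system.parts):

Clickhouse> SELECT partition, name, rows, marks, formatReadableSize(bytes_on_disk) AS size
            FROM system.parts WHERE table = 'hits_v1' AND active;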

To recreate a multi-part layout, truncate the table and reload the data:

Clickhouse> truncate table hits_v1;
 
TRUNCATE TABLE hits_v1
 
Ok.
 
0 rows in set. Elapsed: 0.088 sec.

Import the data:

# cat hits_v1.tsv | clickhouse-client --query "INSERT INTO datasets.hits_v1 FORMAT TSV" --max_insert_block_size=100000 

Check the table's parts again:

Clickhouse> select partition,name from system.parts where table='hits_v1' and active=1;
 
SELECT 
    partition,
    name
FROM system.parts
WHERE (table = 'hits_v1') AND (active = 1)
 
┌─partition─┬─name───────────┐
│ 201403    │ 201403_19_24_1 │
│ 201403    │ 201403_25_30_1 │
│ 201403    │ 201403_31_31_0 │
│ 201403    │ 201403_32_32_0 │
│ 201403    │ 201403_33_33_0 │
│ 201403    │ 201403_34_34_0 │
│ 201403    │ 201403_35_35_0 │
│ 201403    │ 201403_36_36_0 │
│ 201403    │ 201403_37_37_0 │
│ 201403    │ 201403_38_38_0 │
│ 201403    │ 201403_39_39_0 │
│ 201403    │ 201403_40_40_0 │
└───────────┴────────────────┘
 
12 rows in set. Elapsed: 0.040 sec. 
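These fresh parts will be merged again in the background over time. If you ever need to force them back into a single part, as in the first listing, OPTIMIZE ... FINAL does it (here we deliberately keep the 12 parts so the examples below have something to prune):

Clickhouse> OPTIMIZE TABLE datasets.hits_v1 FINAL;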
 

1. Full table scan, all columns:

# clickhouse-client -h localhost --send_logs_level=trace  <<<"select * from datasets.hits_v1" >/dev/null
 
[hadoop.example.com] 2020.07.14 14:31:26.747392 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Debug> datasets.hits_v1 (SelectExecutor): Key condition: unknown
[hadoop.example.com] 2020.07.14 14:31:26.747460 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Debug> datasets.hits_v1 (SelectExecutor): MinMax index condition: unknown
[hadoop.example.com] 2020.07.14 14:31:26.747521 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Debug> datasets.hits_v1 (SelectExecutor): Selected 6 parts by date, 6 parts by key, 1086 marks to read from 6 ranges
[hadoop.example.com] 2020.07.14 14:31:26.751436 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Trace> datasets.hits_v1 (SelectExecutor): Reading approx. 8896512 rows with 4 streams
[hadoop.example.com] 2020.07.14 14:31:26.753085 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
[hadoop.example.com] 2020.07.14 14:32:40.191365 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Information> executeQuery: Read 8873898 rows, 7.88 GiB in 73.449650215 sec., 120816 rows/sec., 109.85 MiB/sec.
[hadoop.example.com] 2020.07.14 14:32:40.191461 [ 15000 ] {ef471d40-0b3f-43f1-936e-0176907c0310} <Debug> MemoryTracker: Peak memory usage (for query): 528.56 MiB.
 

In the output above, only the right-hand part of each line matters; with the timestamps, thread IDs and query IDs stripped it reads:

<Debug> datasets.hits_v1 (SelectExecutor): Key condition: unknown                                                      
<Debug> datasets.hits_v1 (SelectExecutor): MinMax index condition: unknown                                             
<Debug> datasets.hits_v1 (SelectExecutor): Selected 6 parts by date, 6 parts by key, 1086 marks to read from 6 ranges  
<Trace> datasets.hits_v1 (SelectExecutor): Reading approx. 8896512 rows with 4 streams                                 
<Trace> InterpreterSelectQuery: FetchColumns -> Complete                                                               
<Information> executeQuery: Read 8873898 rows, 7.88 GiB in 73.449650215 sec., 120816 rows/sec., 109.85 MiB/sec.        
<Debug> MemoryTracker: Peak memory usage (for query): 528.56 MiB.                            
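The row estimate is simply mark count × index_granularity:

1086 marks × 8192 rows/mark = 8,896,512 rows  ("Reading approx. 8896512 rows")

Both index conditions are "unknown", so nothing could be pruned: all 6 parts and all of their marks are read.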

2. Full table scan, single column:

#  clickhouse-client -h localhost --send_logs_level=trace  <<<"select WatchID from datasets.hits_v1" >/dev/null
 
<Debug> executeQuery: (from [::1]:58630) select WatchID from datasets.hits_v1 
<Trace> ContextAccess (default): Access granted: SELECT(WatchID) ON datasets.hits_v1
<Debug> datasets.hits_v1 (SelectExecutor): Key condition: unknown
<Debug> datasets.hits_v1 (SelectExecutor): MinMax index condition: unknown
<Debug> datasets.hits_v1 (SelectExecutor): Selected 6 parts by date, 6 parts by key, 1086 marks to read from 6 ranges
<Trace> datasets.hits_v1 (SelectExecutor): Reading approx. 8896512 rows with 4 streams
<Trace> InterpreterSelectQuery: FetchColumns -> Complete
<Information> executeQuery: Read 8873898 rows, 67.70 MiB in 0.568247033 sec., 15616268 rows/sec., 119.14 MiB/sec.
<Debug> MemoryTracker: Peak memory usage (for query): 18.90 MiB.

Note the last two lines: because only the WatchID column is read, the data volume drops from 7.88 GiB to 67.70 MiB, and peak memory usage drops from 528.56 MiB to 18.90 MiB.
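The 67.70 MiB is exactly what one UInt64 column should cost:

8,873,898 rows × 8 bytes (UInt64) = 70,991,184 bytes ≈ 67.70 MiB

This is column-oriented storage at work: only the WatchID column files are touched.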
3. Using the partition index:

#  clickhouse-client -h localhost --send_logs_level=trace  <<<"select WatchID from datasets.hits_v1 where EventDate='2014-03-18'" >/dev/null
 
<Debug> executeQuery: (from [::1]:58632) select WatchID from datasets.hits_v1 where EventDate='2014-03-18' 
<Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "EventDate = '2014-03-18'" moved to PREWHERE
<Trace> ContextAccess (default): Access granted: SELECT(WatchID, EventDate) ON datasets.hits_v1
<Debug> datasets.hits_v1 (SelectExecutor): Key condition: (column 1 in [16147, 16147])
<Debug> datasets.hits_v1 (SelectExecutor): MinMax index condition: (column 0 in [16147, 16147])
<Debug> datasets.hits_v1 (SelectExecutor): Selected 6 parts by date, 6 parts by key, 853 marks to read from 70 ranges
<Trace> datasets.hits_v1 (SelectExecutor): Reading approx. 6987776 rows with 4 streams
<Trace> InterpreterSelectQuery: FetchColumns -> Complete
<Information> executeQuery: Read 6965162 rows, 65.35 MiB in 0.202466468 sec., 34401558 rows/sec., 322.75 MiB/sec.
<Debug> MemoryTracker: Peak memory usage (for query): 27.02 MiB.

Three things to note:

1. The WHERE clause was rewritten into a PREWHERE clause by the optimizer.
2. The partition (MinMax) index was used: column 0 (EventDate) is restricted to [16147, 16147], and all 6 parts matched by date.
3. Because the first primary-key column (CounterID) is unconstrained, the primary index prunes very little: the query still reads 853 marks from 70 ranges across the 6 selected parts.

(See the note below on where the value 16147 comes from.)
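The value 16147 in both conditions is the Date stored as days since the Unix epoch:

1970-01-01 + 16147 days = 2014-03-18

so EventDate = '2014-03-18' is compared as column 0 in [16147, 16147] against the partition MinMax index, and as column 1 (EventDate is the second ORDER BY column) against the primary key.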
4. Using the primary key index:

#  clickhouse-client -h localhost --send_logs_level=trace  <<<"select WatchID from datasets.hits_v1 where EventDate='2014-03-18' and CounterID=67141" >/dev/null 
 
<Debug> executeQuery: (from [::1]:58634) select WatchID from datasets.hits_v1 where EventDate='2014-03-18' and CounterID=67141 
<Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "EventDate = '2014-03-18'" moved to PREWHERE
<Trace> ContextAccess (default): Access granted: SELECT(WatchID, EventDate, CounterID) ON datasets.hits_v1
<Debug> datasets.hits_v1 (SelectExecutor): Key condition: (column 1 in [16147, 16147]), (column 0 in [67141, 67141]), and, (column 1 in [16147, 16147]), and
<Debug> datasets.hits_v1 (SelectExecutor): MinMax index condition: (column 0 in [16147, 16147]), unknown, and, (column 0 in [16147, 16147]), and
<Debug> datasets.hits_v1 (SelectExecutor): Selected 6 parts by date, 6 parts by key, 6 marks to read from 6 ranges
<Trace> datasets.hits_v1 (SelectExecutor): Reading approx. 49152 rows with 4 streams
<Trace> InterpreterSelectQuery: FetchColumns -> Complete
<Information> executeQuery: Read 49152 rows, 598.09 KiB in 0.033530217 sec., 1465901 rows/sec., 17.42 MiB/sec.
<Debug> MemoryTracker: Peak memory usage (for query): 0.00 B.
 
 

The primary key index is used:

Key condition: (column 1 in [16147, 16147]), (column 0 in [67141, 67141]), and, (column 1 in [16147, 16147]), and
MarkRanges scanned:
Selected 6 parts by date, 6 parts by key, 6 marks to read from 6 ranges
 

Estimated rows read into memory:

Reading approx. 49152 rows with 4 streams
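Once more, the estimate is mark count × index_granularity:

6 marks × 8192 rows/mark = 49,152 rows

With both CounterID (column 0) and EventDate (column 1) constrained, the primary index narrows the scan from 853 marks down to just 6.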

Summary:

  1. Setting the ClickHouse log level to debug or trace lets you inspect the execution plan.
  2. The SQL must actually be executed before ClickHouse logs the plan; for a large table, add a LIMIT to cut down the data returned (see the sketch after this list).
  3. Avoid SELECT * full-column queries.
  4. Use the available indexes (partition index, primary index, secondary indexes) wherever possible to avoid full table scans.
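For point 2, a minimal sketch of the LIMIT trick:

# clickhouse-client -h localhost --send_logs_level=trace <<<"select WatchID from datasets.hits_v1 limit 100" >/dev/null

The SelectExecutor lines still show the full index analysis, but far less data is returned.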