1. 概述
注意:
sample子句只能用于MergeTree系列引擎的数据表,并且在create table的时候就声明sample by 抽样表达式。
sample子句提供了近似计算
的功能,能够实现数据采样的功能,使查询仅仅返回采样数据而不是全部数据,从而有效减少查询负载。
sample子句的采样设计是一种幂等设计,即在数据发生变化的时候使用相同的采样规则能返回相同的数据
。这种特性非常适合那些可以接受近似查询结果的场景。
官方提供了如下的使用场景:
- When you have strict timing requirements (like <100ms) but you can’t justify the cost of additional hardware
resources to meet them. - When your raw data is not accurate, so approximation doesn’t noticeably degrade the quality.
- Business requirements target approximate results (for cost-effectiveness, or to market exact results to
premium users).
Clickhouse> create table clicks(CounterID UInt64,EventDate DATE, UserID UInt64) engine=MergeTree()
order by (CounterID,intHash32(UserID)) sample by intHash32(UserID);
CREATE TABLE clicks
(
`CounterID` UInt64,
`EventDate` DATE,
`UserID` UInt64
)
ENGINE = MergeTree()
ORDER BY (CounterID, intHash32(UserID))
SAMPLE BY intHash32(UserID)
插入测试数据:
Clickhouse> insert into clicks select CounterID,EventDate,UserID from hits_v1;
INSERT INTO clicks SELECT
CounterID,
EventDate,
UserID
FROM hits_v1
Ok.
0 rows in set. Elapsed: 1.003 sec. Processed 8.87 million rows, 124.23 MB (8.85 million rows/s., 123.88 MB/s.)
clicks表的定义按照intHash32(UserID) 分布后的结果采样查询。
声明Sample KEY的时候有两点需要注意:
sample by 所声明的表达式必须同时包含在主键的声明内
-
sample key必须UInt类型,若不是可以定义但是查询的时候会抛出异常
。
SAMPLE 子句支持三种格式:
1.sample k
k表示因子系数,采样因子,取值范围【0,1】,若在0–1之间的小数则表示采样,若为0或者1则等同于不采样。
select CounterID from clicks sample 0.1
等同于:
select CounterID from clicks sample 1/10
查询获取近似结果:
Clickhouse> select count() from clicks;
SELECT count()
FROM clicks
┌─count()─┐
│ 8873898 │
└─────────┘
1 rows in set. Elapsed: 0.003 sec.
Clickhouse> select count() from clicks sample 0.1;
SELECT count()
FROM clicks
SAMPLE 1 / 10
┌─count()─┐
│ 839889 │
└─────────┘
1 rows in set. Elapsed: 0.029 sec. Processed 5.89 million rows, 94.27 MB (201.86 million rows/s., 3.23 GB/s.)
Clickhouse> select CounterID,_sample_factor from clicks sample 0.1 limit 2;
SELECT
CounterID,
_sample_factor
FROM clicks
SAMPLE 1 / 10
LIMIT 2
┌─CounterID─┬─_sample_factor─┐
│ 57 │ 10 │
│ 57 │ 10 │
└───────────┴────────────────┘
2 rows in set. Elapsed: 0.012 sec.
可以通过虚拟字段_sample_factor 查询采样系数。
2.sample n
n表示采样的样本数量。n表示至少采样多少行数据。n=1表示不使用采样,n的范围从2到表的总行数。
Clickhouse> select count() from clicks sample 10000;
SELECT count()
FROM clicks
SAMPLE 10000
┌─count()─┐
│ 9251 │
└─────────┘
1 rows in set. Elapsed: 0.025 sec. Processed 5.48 million rows, 87.72 MB (223.47 million rows/s., 3.58 GB/s.)
Clickhouse> select count()*any(_sample_factor) from clicks sample 10000;
SELECT count() * any(_sample_factor)
FROM clicks
SAMPLE 10000
┌─multiply(count(), any(_sample_factor))─┐
│ 8154379.059200001 │
└────────────────────────────────────────┘
1 rows in set. Elapsed: 0.024 sec. Processed 5.48 million rows, 54.82 MB (229.44 million rows/s., 2.29 GB/s.)
Clickhouse> select CounterID,_sample_factor from clicks sample 10000 limit 2;
SELECT
CounterID,
_sample_factor
FROM clicks
SAMPLE 10000
LIMIT 2
┌─CounterID─┬────_sample_factor─┐
│ 1294 │ 881.4592000000001 │
└───────────┴───────────────────┘
┌─CounterID─┬────_sample_factor─┐
│ 1366 │ 881.4592000000001 │
└───────────┴───────────────────┘
2 rows in set. Elapsed: 0.041 sec. Processed 7.69 thousand rows, 123.01 KB (187.84 thousand rows/s., 3.01 MB/s.)
数据采样的范围是一个近似值,采样数据的最小粒度有index_granularity 索引粒度决定的。
若设置一个小于索引粒度或者较小的n值没有意义。
3. sample k offset n
表示按照因子系数和偏移量采样。
Clickhouse> select CounterID,_sample_factor from clicks sample 0.4 offset 0.5 limit 1;
SELECT
CounterID,
_sample_factor
FROM clicks
SAMPLE 4 / 10 OFFSET 5 / 10
LIMIT 1
┌─CounterID─┬─_sample_factor─┐
│ 57 │ 2.5 │
└───────────┴────────────────┘
1 rows in set. Elapsed: 0.017 sec.
Clickhouse> select CounterID,_sample_factor from clicks sample 0.6 offset 0.5 limit 1;
SELECT
CounterID,
_sample_factor
FROM clicks
SAMPLE 6 / 10 OFFSET 5 / 10
LIMIT 1
┌─CounterID─┬─────_sample_factor─┐
│ 57 │ 1.6666666666666667 │
└───────────┴────────────────────┘
1 rows in set. Elapsed: 0.007 sec.
当采样因子溢出(offset 的值+sample的值大于1)则溢出的数据则自动阶段。