ClickHouse使用
ClickHouse是一个面向列存储的OLAP分析数据库,以其强大的分析速度而闻名。有关ClickHouse的介绍可以参考其官网说明[1]。本文主要介绍它的基本使用。
1. 安装
使用的环境为2台 AWS EC2,操作系统为Amazon Linux2。使用的ClickHouse为最新的stable版本v21.2.5.5-stable [2]。
export LATEST_VERSION=21.2.5.5 curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-common-static-$LATEST_VERSION.tgz curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-common-static-dbg-$LATEST_VERSION.tgz curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-server-$LATEST_VERSION.tgz curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-client-$LATEST_VERSION.tgz tar -xzvf clickhouse-common-static-$LATEST_VERSION.tgz sudo clickhouse-common-static-$LATEST_VERSION/install/doinst.sh tar -xzvf clickhouse-common-static-dbg-$LATEST_VERSION.tgz sudo clickhouse-common-static-dbg-$LATEST_VERSION/install/doinst.sh tar -xzvf clickhouse-server-$LATEST_VERSION.tgz sudo clickhouse-server-$LATEST_VERSION/install/doinst.sh sudo /etc/init.d/clickhouse-server start tar -xzvf clickhouse-client-$LATEST_VERSION.tgz sudo clickhouse-client-$LATEST_VERSION/install/doinst.sh
2. 初次使用
2.1. 数据
使用官网提供的数据:Yandex.Metrica的匿名数据。它是在ClickHouse成为开源之前作为生产环境运行的第一个服务:
curl https://datasets.clickhouse.tech/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv curl https://datasets.clickhouse.tech/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv # 上传到s3 aws s3 sync ./ s3://xxx-clickhouse/data/
2.2. 建表
与其他数据库一样,clickhouse也自带一个default数据库。这里先创建一个tutorial数据库:
ip-10-0-4-69.cn-north-1.compute.internal :) create database if not exists tutorial > CREATE DATABASE IF NOT EXISTS tutorial
建表语句必须指定3个关键事情:
- 表名
- 表结构:列名以及对应数据类型
- 表引擎及其设置:决定了对此表的查询操作是如何在物理层执行的所有细节
Yandex.Metrica 是一个网络分析服务,样本数据集不包括其全部功能,因此只有2个表可以创建:
- hits表:包含所有用户在服务所涵盖的所有网站上完成的每个操作
- visits表:包含预先构建的会话,而不是单个操作
建表语句:
CREATE TABLE tutorial.hits_v1 ( `WatchID` UInt64, `JavaEnable` UInt8, `Title` String, `GoodEvent` Int16, `EventTime` DateTime, `EventDate` Date, `CounterID` UInt32, `ClientIP` UInt32, `ClientIP6` FixedString(16), `RegionID` UInt32, `UserID` UInt64, `CounterClass` Int8, `OS` UInt8, `UserAgent` UInt8, `URL` String, `Referer` String, `URLDomain` String, `RefererDomain` String, `Refresh` UInt8, `IsRobot` UInt8, `RefererCategories` Array(UInt16), `URLCategories` Array(UInt16), `URLRegions` Array(UInt32), `RefererRegions` Array(UInt32), `ResolutionWidth` UInt16, `ResolutionHeight` UInt16, `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8, `FlashMinor2` String, `NetMajor` UInt8, `NetMinor` UInt8, `UserAgentMajor` UInt16, `UserAgentMinor` FixedString(2), `CookieEnable` UInt8, `JavascriptEnable` UInt8, `IsMobile` UInt8, `MobilePhone` UInt8, `MobilePhoneModel` String, `Params` String, `IPNetworkID` UInt32, `TraficSourceID` Int8, `SearchEngineID` UInt16, `SearchPhrase` String, `AdvEngineID` UInt8, `IsArtifical` UInt8, `WindowClientWidth` UInt16, `WindowClientHeight` UInt16, `ClientTimeZone` Int16, `ClientEventTime` DateTime, `SilverlightVersion1` UInt8, `SilverlightVersion2` UInt8, `SilverlightVersion3` UInt32, `SilverlightVersion4` UInt16, `PageCharset` String, `CodeVersion` UInt32, `IsLink` UInt8, `IsDownload` UInt8, `IsNotBounce` UInt8, `FUniqID` UInt64, `HID` UInt32, `IsOldCounter` UInt8, `IsEvent` UInt8, `IsParameter` UInt8, `DontCountHits` UInt8, `WithHash` UInt8, `HitColor` FixedString(1), `UTCEventTime` DateTime, `Age` UInt8, `Sex` UInt8, `Income` UInt8, `Interests` UInt16, `Robotness` UInt8, `GeneralInterests` Array(UInt16), `RemoteIP` UInt32, `RemoteIP6` FixedString(16), `WindowName` Int32, `OpenerName` Int32, `HistoryLength` Int16, `BrowserLanguage` FixedString(2), `BrowserCountry` FixedString(2), `SocialNetwork` String, `SocialAction` String, `HTTPError` UInt16, `SendTiming` Int32, `DNSTiming` Int32, `ConnectTiming` Int32, `ResponseStartTiming` Int32, `ResponseEndTiming` Int32, `FetchTiming` Int32, `RedirectTiming` Int32, `DOMInteractiveTiming` Int32, `DOMContentLoadedTiming` Int32, `DOMCompleteTiming` Int32, `LoadEventStartTiming` Int32, `LoadEventEndTiming` Int32, `NSToDOMContentLoadedTiming` Int32, `FirstPaintTiming` Int32, `RedirectCount` Int8, `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `ParamPrice` Int64, `ParamOrderID` String, `ParamCurrency` FixedString(3), `ParamCurrencyID` UInt16, `GoalsReached` Array(UInt32), `OpenstatServiceName` String, `OpenstatCampaignID` String, `OpenstatAdID` String, `OpenstatSourceID` String, `UTMSource` String, `UTMMedium` String, `UTMCampaign` String, `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8, `RefererHash` UInt64, `URLHash` UInt64, `CLID` UInt32, `YCLID` UInt64, `ShareService` String, `ShareURL` String, `ShareTitle` String, `ParsedParams` Nested( Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), `IslandID` FixedString(16), `RequestNum` UInt32, `RequestTry` UInt8 ) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) CREATE TABLE tutorial.visits_v1 ( `CounterID` UInt32, `StartDate` Date, `Sign` Int8, `IsNew` UInt8, `VisitID` UInt64, `UserID` UInt64, `StartTime` DateTime, `Duration` UInt32, `UTCStartTime` DateTime, `PageViews` Int32, `Hits` Int32, `IsBounce` UInt8, `Referer` String, `StartURL` String, `RefererDomain` String, `StartURLDomain` String, `EndURL` String, `LinkURL` String, `IsDownload` UInt8, `TraficSourceID` Int8, `SearchEngineID` UInt16, `SearchPhrase` String, `AdvEngineID` UInt8, `PlaceID` Int32, `RefererCategories` Array(UInt16), `URLCategories` Array(UInt16), `URLRegions` Array(UInt32), `RefererRegions` Array(UInt32), `IsYandex` UInt8, `GoalReachesDepth` Int32, `GoalReachesURL` Int32, `GoalReachesAny` Int32, `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `MobilePhoneModel` String, `ClientEventTime` DateTime, `RegionID` UInt32, `ClientIP` UInt32, `ClientIP6` FixedString(16), `RemoteIP` UInt32, `RemoteIP6` FixedString(16), `IPNetworkID` UInt32, `SilverlightVersion3` UInt32, `CodeVersion` UInt32, `ResolutionWidth` UInt16, `ResolutionHeight` UInt16, `UserAgentMajor` UInt16, `UserAgentMinor` UInt16, `WindowClientWidth` UInt16, `WindowClientHeight` UInt16, `SilverlightVersion2` UInt8, `SilverlightVersion4` UInt16, `FlashVersion3` UInt16, `FlashVersion4` UInt16, `ClientTimeZone` Int16, `OS` UInt8, `UserAgent` UInt8, `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8, `NetMajor` UInt8, `NetMinor` UInt8, `MobilePhone` UInt8, `SilverlightVersion1` UInt8, `Age` UInt8, `Sex` UInt8, `Income` UInt8, `JavaEnable` UInt8, `CookieEnable` UInt8, `JavascriptEnable` UInt8, `IsMobile` UInt8, `BrowserLanguage` UInt16, `BrowserCountry` UInt16, `Interests` UInt16, `Robotness` UInt8, `GeneralInterests` Array(UInt16), `Params` Array(String), `Goals` Nested( ID UInt32, Serial UInt32, EventTime DateTime, Price Int64, OrderID String, CurrencyID UInt32), `WatchIDs` Array(UInt64), `ParamSumPrice` Int64, `ParamCurrency` FixedString(3), `ParamCurrencyID` UInt16, `ClickLogID` UInt64, `ClickEventID` Int32, `ClickGoodEvent` Int32, `ClickEventTime` DateTime, `ClickPriorityID` Int32, `ClickPhraseID` Int32, `ClickPageID` Int32, `ClickPlaceID` Int32, `ClickTypeID` Int32, `ClickResourceID` Int32, `ClickCost` UInt32, `ClickClientIP` UInt32, `ClickDomainID` UInt32, `ClickURL` String, `ClickAttempt` UInt8, `ClickOrderID` UInt32, `ClickBannerID` UInt32, `ClickMarketCategoryID` UInt32, `ClickMarketPP` UInt32, `ClickMarketCategoryName` String, `ClickMarketPPName` String, `ClickAWAPSCampaignName` String, `ClickPageName` String, `ClickTargetType` UInt16, `ClickTargetPhraseID` UInt64, `ClickContextType` UInt8, `ClickSelectType` Int8, `ClickOptions` String, `ClickGroupBannerID` Int32, `OpenstatServiceName` String, `OpenstatCampaignID` String, `OpenstatAdID` String, `OpenstatSourceID` String, `UTMSource` String, `UTMMedium` String, `UTMCampaign` String, `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8, `FirstVisit` DateTime, `PredLastVisit` Date, `LastVisit` Date, `TotalVisits` UInt32, `TraficSource` Nested( ID Int8, SearchEngineID UInt16, AdvEngineID UInt8, PlaceID UInt16, SocialSourceNetworkID UInt8, Domain String, SearchPhrase String, SocialSourcePage String), `Attendance` FixedString(16), `CLID` UInt32, `YCLID` UInt64, `NormalizedRefererHash` UInt64, `SearchPhraseHash` UInt64, `RefererDomainHash` UInt64, `NormalizedStartURLHash` UInt64, `StartURLDomainHash` UInt64, `NormalizedEndURLHash` UInt64, `TopLevelDomain` UInt64, `URLScheme` UInt64, `OpenstatServiceNameHash` UInt64, `OpenstatCampaignIDHash` UInt64, `OpenstatAdIDHash` UInt64, `OpenstatSourceIDHash` UInt64, `UTMSourceHash` UInt64, `UTMMediumHash` UInt64, `UTMCampaignHash` UInt64, `UTMContentHash` UInt64, `UTMTermHash` UInt64, `FromHash` UInt64, `WebVisorEnabled` UInt8, `WebVisorActivity` UInt32, `ParsedParams` Nested( Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), `Market` Nested( Type UInt8, GoalID UInt32, OrderID String, OrderPrice Int64, PP UInt32, DirectPlaceID UInt32, DirectOrderID UInt32, DirectBannerID UInt32, GoodID String, GoodName String, GoodQuantity Int32, GoodPrice Int64), `IslandID` FixedString(16) ) ENGINE = CollapsingMergeTree(Sign) PARTITION BY toYYYYMM(StartDate) ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID) SAMPLE BY intHash32(UserID)
可以看到,hits_v1 使用的是MergeTree引擎;visits_v1使用的是Collapsing引擎。两者的partition格式均为toYYYYMM(EventDate)。
2.3. 导入数据并查询
导入本地数据:
clickhouse-client --query "INSERT INTO tutorial.hits_v1 FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv clickhouse-client --query "INSERT INTO tutorial.visits_v1 FORMAT TSV" --max_insert_block_size=100000 < visits_v1.tsv
优化表:
OPTIMIZE TABLE tutorial.hits_v1 FINAL OPTIMIZE TABLE tutorial.visits_v1 FINAL
示例查询:
SELECT StartURL AS URL, AVG(Duration) AS AvgDuration FROM tutorial.visits_v1 WHERE (StartDate >= '2014-03-23') AND (StartDate <= '2014-03-30') GROUP BY URL ORDER BY AvgDuration DESC LIMIT 10 10 rows in set. Elapsed: 0.088 sec. Processed 1.45 million rows, 114.85 MB (16.56 million rows/s., 1.31 GB/s.)
SELECT sum(Sign) AS visits, sumIf(Sign, has(Goals.ID, 1105530)) AS goal_visits, (100. * goal_visits) / visits AS goal_percent FROM tutorial.visits_v1 WHERE (CounterID = 912887) AND (toYYYYMM(StartDate) = 201403) AND (domain(StartURL) = 'yandex.ru') 1 rows in set. Elapsed: 0.012 sec. Processed 13.05 thousand rows, 2.88 MB (1.10 million rows/s., 242.38 MB/s.)
从返回速度来看,基本上是立即返回,处理时间仅用 0.088 和 0.012 秒。
3. 为什么ClickHouse如此快
从各种公开文档来看,ClickHouse如此之快的原因主要有2点:
- 列式存储数据库
- 使用向量化引擎
3.1. 列式存储
列式存储与行式存储的区别已经有大量公开文档进行详细说明,在此不再赘述。简单来说,列式存储的优势在于:
- 只提取所需要的列的信息,避免了扫描不需要的其他列信息
- 对数据压缩的友好型:因为同一列拥有同样的数据类型和现实语义,重复项的可能性更高
这两点优势提供的是:
- 减少了数据扫描范围:有效减少了所需扫描的数据量
- 减少了数据传输的大小:数据压缩率越高,则数据体量越小,在网络中传输的数据量更少,所以对网络带宽和磁盘IO的压力也就越小,速度也就越快。
3.2. 向量化执行
向量化执行是合理利用CPU指令集的方式,它的必备条件是CPU支持SIMD(Single Instruction Multiple Data)指令,此指令的作用是:单条指定一次性操作多条数据。
在Stack Overflow[3] 上对此有一个较为具体的说明:
许多CPU都有“vector”或“SIMD”指令集,可以将同一个操作同时应用到2条、4条或是更多的数据条目上。向量化(vectorization)就是重写循环的操作,在一个循环中(例如while循环),一个长度为N的数组需要循环N次才能处理完。若是使用向量化操作,假设它一次能够处理4条数据,则对于长度为N的数据,仅需要N/4的时间即能处理完毕。
更具体的例子是,假设有以下循环语句:
for (int i=0; i<16; ++i) C[i] = A[i] + B[i];
传统处理方式是:一次循环处理A[i] 与 B[i] 相加,并赋值给C[i]。
此循环可以继续展开为:
for (int i=0; i<16; i+=4) { C[i] = A[i] + B[i]; C[i+1] = A[i+1] + B[i+1]; C[i+2] = A[i+2] + B[i+2]; C[i+3] = A[i+3] + B[i+3]; }
对此使用向量化操作,则可以表示为:
for (int i=0; i<16; i+=4) addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);
此处addFourThingsAtOnceAndStoreResult() 为一个向量化操作,可以在一次循环中,同时处理4条数据。若是大家有了解过python 中 numpy的向量化操作,相信对此会有更深的了解。
3.3. 持续优化
根据朱凯[4] 在其书中提到的观点,ClickHouse如此之快的原因还包含:
- 开发人员注意到各种影响性能的细节,并进行优化,一点一滴的积累,使得性能越来越好
- 针对不同场景使用了最优的算法,使性能最优化
- 若是出现了更合适、更快的算法,开发人员会立即进行验证,若是效果理想则保留使用,否则将其抛弃
- ClickHouse更新迭代非常频繁,开发人员一直在对此进行不断的改进,追求更佳的性能
References
[1] https://clickhouse.tech/docs/zh/
[2] https://github.com/ClickHouse/ClickHouse/releases/tag/v21.1.8.30-stable
[3] https://*.com/questions/1422149/what-is-vectorization
[4] 朱凯,ClickHouse原理解析与应用实践,机器工业出版社,2020年