【重新发现PostgreSQL之美】- 9 面向多值列的倒排索引GIN|RUM

2022-04-22 13:40:42

背景

场景:

通用业务, 分词查询诉求.

挑战:

传统数据库没有分词、实时全文检索索引功能, 需要将数据同步到搜索引擎, 这种解决方案的弊端:

研发成本增加、

软硬件成本增加、

系统问题增多(同步延迟问题、同步异常问题、同步一致性问题)、

开发灵活性下降(无法同时过滤分词条件与表的其他条件, 需要业务层交换数据)

同时过滤分词条件与表的其他条件后, 无法有效的按RANK排序分词相似性

PG 解决方案:

1、倒排索引GIN:

支持多值类型的按元素检索: tsvector, array, json, xml, hstore, 任意字段组合搜索

一对多的数据模型

2、增强倒排索引RUM, RANK 加速方案:

RUM索引在posting list里面, 每个行号后面附加addon内容(文本向量的对应位置信息), 同时支持自定义addon信息.

addon的内容优势: 不需要回表搜索tuple内容. 降低IO, 提高性能.

文档

https://github.com/postgrespro/rum/blob/master/rum--1.3.sql

https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-RANKING

如何计算rank :

0 (the default) ignores the document length

1 divides the rank by 1 + the logarithm of the document length

2 divides the rank by the document length

4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)

8 divides the rank by the number of unique words in document

16 divides the rank by 1 + the logarithm of the number of unique words in document

32 divides the rank by itself + 1

例子 GIN

分词搜索, 同时按rank排序返回

SELECT title, ts_rank_cd(textsearch, query) AS rank

FROM apod, to_tsquery('neutrino|(dark & matter)') query

WHERE query @@ textsearch

ORDER BY rank DESC

LIMIT 10;

title | rank

-----------------------------------------------+----------

Neutrinos in the Sun | 3.1

The Sudbury Neutrino Detector | 2.4

A MACHO View of Galactic Dark Matter | 2.01317

Hot Gas and Dark Matter | 1.91171

The Virgo Cluster: Hot Plasma and Dark Matter | 1.90953

Rafting for Solar Neutrinos | 1.9

NGC 4650A: Strange Galaxy and Dark Matter | 1.85774

Hot Gas and Dark Matter | 1.6123

Ice Fishing for Cosmic Neutrinos | 1.6

Weak Lensing Distorts the Universe | 0.818218

GIN索引: 分词检索在索引内完成, 但是rank排序需要回表, 将所有tuple找到后计算rank, 然后排序

GIN 数据和索引结构演示:

数据:

行号1: hello world

行号2: digoal hello

行号3: apple alibaba

行号4: aliyun apple

行号5: hello aliyun

GIN 倒排索引结构:

token tree:

hello,world,digoal,apple,aliyun,alibaba

posting list|tree:

hello: 行号1, 行号2, 行号5

world: 行号1

digoal: 行号2

apple: 行号3, 行号4

aliyun: 行号4, 行号5

alibaba: 行号3

例子 RUM

如何解决RANK需要回表的性能问题:

数据和索引结构演示:

数据:

行号1: hello world

行号2: digoal hello

行号3: apple alibaba

行号4: aliyun apple

行号5: hello aliyun

RUM 倒排索引结构:

token tree:

hello,world,digoal,apple,aliyun,alibaba

posting list|tree:

hello: 行号1 hello出现的位置, 行号2 hello出现的位置, 行号5 hello出现的位置

world: 行号1 world出现的位置

digoal: 行号2 digoal出现的位置

apple: 行号3 apple出现的位置, 行号4 apple出现的位置

aliyun: 行号4 aliyun出现的位置, 行号5 aliyun出现的位置

alibaba: 行号3 alibaba出现的位置

rum和gin对比例子

SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');

to_tsvector

-----------------------------------------------------

'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

生成随机文本的例子

postgres=# select to_tsvector( string_agg(chr((array[32,97,98,99,100,101])[ceil(random()*6)]),'')) from generate_series(1,100);

to_tsvector

----------------------------------------------------------------------------------------------------------------------------------------------------

'abbcb':9 'bdc':4 'beaeeddddbdbdb':11 'beeaba':12 'dbdbb':5 'dccbdbbebd':1 'dcccbbbecac':3 'dccec':10 'decaabc':6 'ebacaececd':2 'ecbcb':7 'edd':8

(1 row)

create or replace function fu() returns tsvector as $$

select to_tsvector( string_agg(chr((array[32,97,98,99,100,101])[ceil(random()*6)]),'')) from generate_series(1,100);

$$ language sql strict volatile;

CREATE FUNCTION

写入111万随机文本数据

postgres=# create unlogged table tsv (id serial8 primary key, tsv tsvector);

CREATE TABLE

postgres=# insert into tsv (tsv) select fu() from generate_series(1,10000);

INSERT 0 10000

Time: 382.318 ms

postgres=# insert into tsv (tsv) select fu() from generate_series(1,100000);

INSERT 0 100000

Time: 3804.033 ms (00:03.804)

postgres=# insert into tsv (tsv) select fu() from generate_series(1,1000000);

INSERT 0 1000000

Time: 37887.880 ms (00:37.888)

postgres=# select * from tsv limit 10;

id | tsv

----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1 | 'adb':1 'bbebebdcd':5 'c':4 'ca':6 'cabddbecbcdabcdcbabbbdb':9 'cabeaaea':8 'cbbadaebdebedabdd':10 'd':2,12 'dac':11 'ddbaadebeb':7 'eceaabddb':3

2 | 'acdaaaa':10 'ad':6 'adcdcebeaeecc':4 'bcbdac':8 'bed':12 'bedaacabdbeceecaa':2 'cbabcddcda':7 'cecdec':1 'd':11 'dbbbcbacbd':9 'edddebcacea':5

3 | 'acececccbabadbdeeec':11 'aeadacbecad':8 'bdbadbadceedcadaeb':9 'bddab':4 'cebedbedb':3 'd':1,7 'da':2 'dda':12 'ddaacb':13 'ec':10 'eeadbeca':6

4 | 'aadccceb':18 'ae':5 'bcbbceda':3 'bdcd':13 'cabeac':12 'cb':2 'cbadccb':15 'cbeaba':8 'cbeb':6 'ccbbdbcaa':14 'cecbaeecb':7 'db':16 'dead':4 'e':1,11

5 | 'ab':2 'bc':12 'cadbaebeeabeec':7 'cce':6 'dab':9 'dadbccabbd':4 'dadddb':3 'db':1 'eb':10 'ecaadac':11 'eddbebcbabdccacbccde':8 'edddbcbde':5

6 | 'ae':11 'ba':9 'bb':12 'bcdd':7 'bd':15 'bdec':4 'beada':10 'caa':1 'caacc':14 'cabcbedadadceadbdbecec':8 'ddcdbaeeecad':13 'de':2 'dea':6 'deeaabba':3 'eeaad':5

7 | 'abaddcb':13 'acbedbbdceadcdda':10 'bcece':11 'cacbedc':3 'ccdb':2 'cd':7 'db':6 'dbebdd':12 'deb':8 'eaecba':4 'ecabeb':1 'ededdbbdacad':9 'eecd':5

8 | 'aaccccc':5 'aada':13 'abbddcbbcb':3 'b':7 'bae':1 'bb':4 'bcabceeabc':16 'bdbac':2 'bddaceeb':11 'cdcdceec':15 'cddea':9 'cecaa':8 'dbee':10 'e':14 'ee':6 'eedeba':12

9 | 'abcbcbd':1 'abdaa':12 'adaecaa':7 'caaabac':5 'caeeeebdc':6 'ccdd':9 'ddeab':10 'eaeebbbcadcbaac':4 'ebcddadeacb':11 'ebeeb':3 'ecbbcecaa':8 'ee':2

10 | 'aabadcab':14 'abaceeebdbeaaadbedccc':15 'b':16 'bcccdd':13 'bceedb':11 'c':2 'cc':12 'ccc':9 'cedac':4 'dadbba':7 'daee':8 'dd':10 'dddde':17 'e':6 'ecbea':5 'ed':1 'eeec':3

(10 rows)

搜索用例

postgres=# explain (analyze,verbose,timing,costs,buffers)

select *, ts_rank(tsv , to_tsquery('abc & c')) from tsv

where tsv @@ to_tsquery('abc & c')

order by ts_rank(tsv, to_tsquery('abc & c')) desc

limit 10;

1、GIN 索引, 需要回表计算rank:

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------

Limit (cost=9763.40..9763.42 rows=10 width=215) (actual time=39.030..39.033 rows=10 loops=1)

Output: id, tsv, (ts_rank(tsv, to_tsquery('abc & c'::text)))

Buffers: shared hit=7336

-> Sort (cost=9763.40..9782.45 rows=7622 width=215) (actual time=39.029..39.031 rows=10 loops=1)

Output: id, tsv, (ts_rank(tsv, to_tsquery('abc & c'::text)))

Sort Key: (ts_rank(tsv.tsv, to_tsquery('abc & c'::text))) DESC

Sort Method: top-N heapsort Memory: 30kB

Buffers: shared hit=7336

-> Index Scan using idx_tsv_1 on public.tsv (cost=11.25..9598.69 rows=7622 width=215) (actual time=20.434..37.754 rows=7531 loops=1)

Output: id, tsv, ts_rank(tsv, to_tsquery('abc & c'::text))

Index Cond: (tsv.tsv @@ to_tsquery('abc & c'::text))

Buffers: shared hit=7336

Planning:

Buffers: shared hit=2

Planning Time: 0.199 ms

Execution Time: 39.427 ms

(16 rows)

Time: 40.055 ms

postgres=# select *, ts_rank_cd(tsv , to_tsquery('abc & c'),32) from tsv where tsv @@ to_tsquery('abc & c') order by ts_rank_cd(tsv, to_tsquery('abc & c'),32) desc limit 10;

id | tsv | ts_rank_cd

--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------

386419 | 'abc':9,12 'abceadba':3 'b':8,16 'badeebdebbb':4 'bdd':20 'bea':2 'c':10,14 'cad':7 'cbedbbdeb':5 'ccdadcda':6 'd':1,15 'ddeceb':13 'deabdd':11 'e':18 | 0.16666667

568798 | 'abc':12 'addb':14 'aecdcecdbcebdbceeebaebca':4 'c':11,13 'cceddac':10 'd':1,3 'dab':6 'dcaddeeb':7 'de':8 'ea':5 'eadbadeacb':15 'edbb':2 'eecddada':9 | 0.16666667

246969 | 'aa':3 'aabbdaaea':15 'abc':6 'adaecaceec':8 'aea':9 'aebb':20 'beed':14 'c':4,5,7,17 'dbaa':18 'dbe':12 'dcdda':13 'dd':16 'ebadeec':10 'ebcce':1 'ebdbedc':19 | 0.16666667

276756 | 'aba':13 'abc':8 'aeecdceedaacdec':16 'b':5,6 'babba':12 'beadbcdeeeecbbbde':15 'c':7,9 'cabaacadb':11 'cb':10 'dcaebdbabddd':1 'dedaadd':14 'e':2 'ebdb':4 | 0.16666667

534425 | 'aa':7 'abc':19 'b':2 'baebdacbc':13 'c':9,18,20 'cabcdaaada':5 'cc':17 'cd':4 'ce':21 'd':12 'dcadccbcddc':15 'dcbcacec':11 'dea':1 'e':14 'ea':6 'eceeeadc':16 'edd':10 'ede':3 | 0.16666667

545684 | 'abc':11 'c':10,12 'caebdacdeabbccadec':3 'cbecc':2 'ccace':15 'cdb':13 'cddaccbacbcb':8 'cecadbbecda':5 'ddaab':6 'e':4,14 'eaeccedbdbc':1 'ebdeb':7 | 0.16666667

137518 | 'abc':13 'accebbedeabbdaa':2 'c':4,9,12,14 'cba':7 'cc':16 'ce':10 'dbbeebbdbd':18 'dce':8 'ddaeab':17 'eabadc':3 'eac':1 'eacc':11 'ecabdcddbecaebb':15 'ed':5 | 0.16666667

227131 | 'abc':2 'abeaacaeceaed':7 'b':12 'baae':13 'bddccbabcbbcaa':9 'c':1,3,11 'caeabd':8 'cbadea':5 'ccaacdeeabccabc':10 'dcac':6 'eadaebbcccac':4 'ed':14 | 0.16666667

264722 | 'abc':4 'acaabbeacacdbcaec':10 'ad':1 'ada':2 'c':3,5,6 'cbbcbcadee':13 'd':14 'dbdc':8 'decaceaaa':7 'deec':11 'ebcdaecea':12 'ecc':16 'ecdbebbbca':9 | 0.16666667

634783 | 'aaaba':11 'aba':10 'abbc':5 'abc':14 'ad':2 'adca':16 'bdbbb':6 'c':9,13,15 'cbacccadacbbdeda':1 'cbb':17 'd':4 'dcbd':3 'debdadd':12 'dedadeaeecacaceceebdd':8 'ea':7 | 0.16666667

(10 rows)

Time: 63.782 ms

2、RUM 索引, 不需要回表计算rank:

postgres=# explain (analyze,verbose,timing,costs,buffers) select *,tsv <=> to_tsquery('abc & c') from tsv where tsv @@ to_tsquery('abc & c') order by tsv <=> to_tsquery('abc & c') limit 10;

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------

Limit (cost=11.25..23.91 rows=10 width=215) (actual time=35.455..35.547 rows=10 loops=1)

Output: id, tsv, ((tsv <=> to_tsquery('abc & c'::text)))

Buffers: shared hit=1086

-> Index Scan using idx_tsv_1 on public.tsv (cost=11.25..8899.62 rows=7021 width=215) (actual time=35.453..35.526 rows=10 loops=1)

Output: id, tsv, (tsv <=> to_tsquery('abc & c'::text))

Index Cond: (tsv.tsv @@ to_tsquery('abc & c'::text))

Order By: (tsv.tsv <=> to_tsquery('abc & c'::text))

Buffers: shared hit=1086

Planning:

Buffers: shared hit=1

Planning Time: 0.187 ms

Execution Time: 36.395 ms

(12 rows)

Time: 37.025 ms

postgres=# select *,tsv <=> to_tsquery('abc & c') from tsv where tsv @@ to_tsquery('abc & c') order by tsv <=> to_tsquery('abc & c') limit 10;

id | tsv | ?column?

---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------

534425 | 'aa':7 'abc':19 'b':2 'baebdacbc':13 'c':9,18,20 'cabcdaaada':5 'cc':17 'cd':4 'ce':21 'd':12 'dcadccbcddc':15 'dcbcacec':11 'dea':1 'e':14 'ea':6 'eceeeadc':16 'edd':10 'ede':3 | 4.112335

1082907 | 'abc':9 'aeac':12 'bac':5 'baddaecd':6 'bc':7 'c':4,8,10 'd':3 'dadebcd':2 'dbebecabcaaeedaacdaaceebbbadeeeaad':11 'eaab':1 'ebaaebdaaaedc':13 'ed':14 | 4.112335

264722 | 'abc':4 'acaabbeacacdbcaec':10 'ad':1 'ada':2 'c':3,5,6 'cbbcbcadee':13 'd':14 'dbdc':8 'decaceaaa':7 'deec':11 'ebcdaecea':12 'ecc':16 'ecdbebbbca':9 | 4.112335

568798 | 'abc':12 'addb':14 'aecdcecdbcebdbceeebaebca':4 'c':11,13 'cceddac':10 'd':1,3 'dab':6 'dcaddeeb':7 'de':8 'ea':5 'eadbadeacb':15 'edbb':2 'eecddada':9 | 4.112335

545684 | 'abc':11 'c':10,12 'caebdacdeabbccadec':3 'cbecc':2 'ccace':15 'cdb':13 'cddaccbacbcb':8 'cecadbbecda':5 'ddaab':6 'e':4,14 'eaeccedbdbc':1 'ebdeb':7 | 4.112335

634783 | 'aaaba':11 'aba':10 'abbc':5 'abc':14 'ad':2 'adca':16 'bdbbb':6 'c':9,13,15 'cbacccadacbbdeda':1 'cbb':17 'd':4 'dcbd':3 'debdadd':12 'dedadeaeecacaceceebdd':8 'ea':7 | 4.112335

743324 | 'abc':14 'aeaddecbbd':12 'c':13,15,21 'ca':19 'ccca':1 'cdbbebdb':17 'cdc':2 'cdeeea':11 'd':4 'dadb':7 'db':8 'dbbcb':10 'dcdbeab':5 'ece':3 'ed':18 'eedcddddb':9 | 4.112335

905287 | 'abc':15 'ad':12 'b':10 'babebaca':17 'bbeebbedaacc':18 'bccdbc':1 'bcebb':9 'c':14,16 'cebaaaea':2 'd':13 'dc':7,11 'dca':4 'ddaebdddbc':8 'ddddebdbd':5 'e':6 'ebddb':3 | 4.112335

826801 | 'abc':4 'accbbe':11 'aebeedcebdcb':7 'b':1,9 'ba':13 'bbae':14 'bc':16 'bcecdeceeebad':17 'c':3,5 'cbcbbdeb':6 'd':15 'dcadbebce':12 'ebeceaabad':8 'ed':10 'ee':2 | 4.112335

246969 | 'aa':3 'aabbdaaea':15 'abc':6 'adaecaceec':8 'aea':9 'aebb':20 'beed':14 'c':4,5,7,17 'dbaa':18 'dbe':12 'dcdda':13 'dd':16 'ebadeec':10 'ebcce':1 'ebdbedc':19 | 4.112335

(10 rows)

Time: 38.362 ms

例子2: RUM addon 其他字段信息到posting tree

rum_TYPE_ops

For types: int2, int4, int8, float4, float8, money, oid, time, timetz, date,

interval, macaddr, inet, cidr, text, varchar, char, bytea, bit, varbit,

numeric, timestamp, timestamptz

Supported operations: <, <=, =, >=, > for all types and

<=>, <=| and |=> for int2, int4, int8, float4, float8, money, oid,

timestamp and timestamptz types.

Supports ordering by <=>, <=| and |=> operators. Can be used with

rum_tsvector_addon_ops, rum_tsvector_hash_addon_ops' and rum_anyarray_addon_ops` operator classes.

rum_tsvector_addon_ops

For type: tsvector

This operator class stores tsvector lexems with any supported by module

field. There is the example.

Let us assume we have the table:

CREATE TABLE tsts (id int, t tsvector, d timestamp);

\copy tsts from 'rum/data/tsts.data'

CREATE INDEX tsts_idx ON tsts USING rum (t rum_tsvector_addon_ops, d)

WITH (attach = 'd', to = 't');

Now we can execute the following queries:

EXPLAIN (costs off)

SELECT id, d, d <=> '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;

QUERY PLAN

-----------------------------------------------------------------------------------

Limit

-> Index Scan using tsts_idx on tsts

Index Cond: (t @@ '''wr'' & ''qh'''::tsquery)

Order By: (d <=> 'Mon May 16 14:21:25 2016'::timestamp without time zone)

(4 rows)

SELECT id, d, d <=> '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;

id | d | ?column?

-----+---------------------------------+---------------

355 | Mon May 16 14:21:22.326724 2016 | 2.673276

354 | Mon May 16 13:21:22.326724 2016 | 3602.673276

371 | Tue May 17 06:21:22.326724 2016 | 57597.326724

406 | Wed May 18 17:21:22.326724 2016 | 183597.326724

415 | Thu May 19 02:21:22.326724 2016 | 215997.326724

(5 rows)

Warning: Currently RUM has bogus behaviour when one creates an index using ordering over pass-by-reference additional information. This is due to the fact that posting trees have fixed length right bound and fixed length non-leaf posting items. It isn't allowed to create such indexes.

rum_tsvector_hash_addon_ops

For type: tsvector

This operator class stores hash of tsvector lexems with any supported by module

field.

Doesn't support prefix search.

rum_anyarray_addon_ops

For type: anyarray

This operator class stores anyarrray elements with any supported by module field.

rum_anyarray_addon_ops 的例子

postgres=# create extension rum;

CREATE EXTENSION

create table tbl (

id serial8 primary key,

a int[],

n int,

crt_time timestamp

);

create index idx_tbl_1 on tbl using rum (a rum_anyarray_addon_ops, n) with (attach='n', to='a');

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where a @> array[1,2,3] and n <= 1;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------

Index Scan using idx_tbl_1 on public.tbl (cost=15.40..38.90 rows=20 width=53) (actual time=130.694..130.711 rows=20 loops=1)

Output: id, a, n, crt_time

Index Cond: ((tbl.a @> '{1,2,3}'::integer[]) AND (tbl.n <= 1))

Buffers: shared hit=1768

Planning:

Buffers: shared hit=1

Planning Time: 0.092 ms

Execution Time: 130.735 ms

(8 rows)

码农公寓

背景

相关文章