对盗图、盗文、盗墓深恶痛绝吗?PostgreSQL结合余弦、线性相关算法 在文本、图片、数组相似 等领域的应用 - 2 smlar插件详解

标签

PostgreSQL , 文本分析 , cosine , smlar , 相似性 , simlar , tf , idf , tf-idf , tag


背景

以2个例子作为开始,

例1

在数据库中有两条这样的记录

"I want a dog"  // 狗  
"I want a chihuahua"  // 吉娃娃狗  

然后使用这样的查询条件进行查询

"dog|chihuahua"  

很显然,两条记录都会被匹配到,但是哪条记录应该排在前面呢?

例2

在搜索引擎中搜索"狗|吉娃娃狗"

哪个会排在前面呢?试试就知道了,翻到第二页,以防被广告冲昏头脑

对盗图、盗文、盗墓深恶痛绝吗?PostgreSQL结合余弦、线性相关算法 在文本、图片、数组相似 等领域的应用 - 2 smlar插件详解

很显然吉娃娃被排在了前面,为什么呢?

其实都是tf-idf算法起到的作用,因为在全局文本中,狗比吉娃娃出现的次数多,所以根据idf的算法 ( log(总文本数/包含此词的文本数) ) 狗的idf比吉娃娃的低。

那么当tf相同时,很显然tf*idf取决于idf的大小。

关于tf与idf的概念,请参考

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

那么作为支持全文检索的PostgreSQL数据库,默认支持TF-IDF吗?

PostgreSQL 默认不使用idf计算rank

我们在ts_rank, ts_rank_cd的代码中可以了解到,PostgreSQL这两个函数并不关心idf。

其实在网上也有人问过这样的问题

http://*.com/questions/18296444/does-postgresql-use-tf-idf

Within the ts_rank function, there is no native method to rank results using their global (corpus) frequency. The rank algorithm does however rank based on frequency within the document:  

http://www.postgresql.org/docs/9.3/static/textsearch-controls.html  

So if I search for "dog|chihuahua" the following two documents would have the same rank despite the relatively lower frequency of the word "chihuahua":  

"I want a dog"  // 狗  
"I want a chihuahua"  // 奇瓦瓦狗  

However, the following line would get ranked higher than the previous two lines above, because it contains the stemmed token "dog" twice in the document:  

"dog lovers have an average of 1.5 dogs"  

In short: higher term frequency(TF) within the document results in a higher rank, but a lower term frequency in the corpus has no impact.  

One caveat: the text search does ignore stop-words, so you will not match on ultra high frequency words like "the","a","of","for" etc (assuming you have correctly set your language)  
Postgres does not use TF-IDF as a similarity measure among documents.  

ts_rank is higher if a document contains query terms more frequently. It does not take into account the global frequency of the term.  

ts_rank_cd is higher if a document contains query terms closer together and more frequently. It does not take into account the global frequency of the term.  

There is an extension from the text search creators called smlar, that lets you calculate the similarity between arrays using TF-IDF.   

It also lets you turn tsvectors into arrays, and supports fast indexing.  

PostgreSQL内置的rank计算方法并不关心IDF,而仅仅计算当前文本的词频。

目前PostgreSQL通过ts_rank与ts_rank_cd计算tsquery与tsvector的相关性,算法详见

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

如何让PostgreSQL计算rank时关心IDF?

如何让PostgreSQL计算rank时关心IDF?详见此文

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

smlar插件介绍

smlar插件支持多种相似度计算公式(算法),cosine(default), tfidf, overlap。同时还提供了自定义公式计算相似度的函数。

函数接口

1. 计算数组的相似度

float4 smlar(anyarray, anyarray)  
    - computes similary of two arrays. Arrays should be the same type.  

2. 计算自定义复合数组(元素,权重)的数组的相似度,可以计算所有元素,同时也允许只计算重叠元素的部分(当权重不同时,相似度不同,例如cosine算法)。

或者我们把它理解为包含tfidf的加权数组,比如 [('中国', 0.1), ('日本', 0.1), ('海盗', 0.9)]

在文本相似度分析中很有用。

float4 smlar(anyarray, anyarray, bool useIntersect)  
    -  computes similary of two arrays of composite types. Composite type looks like:  
        CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);  
       useIntersect option points to use only intersected elements in denominator  
       see an exmaples in sql/composite_int4.sql or sql/composite_text.sql  

例子

postgres=# create type tp as (c1 text, c2 float4);
CREATE TYPE

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.1), ('中国',1.1),('china',1)]::_tp);
  smlar   
----------
 0.816497
(1 row)

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.1), ('中国',1.1),('china',1)]::_tp,true);
 smlar 
-------
     1
(1 row)

postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('你好',2.2), ('中国',1.1),('china',1)]::_tp,true);
  smlar   
----------
 0.999822
(1 row)

与顺序无关
postgres=# select smlar(array[('你好',2.1), ('中国',1.1)]::_tp, array[('中国',1.1),('你好',2.2),('china',1)]::_tp,true);
  smlar   
----------
 0.999822
(1 row)

注意

elog(ERROR, "GiST  doesn't support composite (weighted) type");

3. 通过自定义公式,公式中包含3个变量N.i, N.a, N.b,计算两个数组的相似性,你可以自定义算法。

float4 smlar( anyarray a, anyarray b, text formula );  
    - computes similary of two arrays by given formula, arrays should   
    be the same type.   
    Predefined variables in formula:  
      N.i   - number of common elements in both array (intersection)  
      N.a   - number of uniqueelements in first array  
      N.b   - number of uniqueelements in second array  
    Example:  
    smlar('{1,4,6}'::int[], '{5,4,6}' )  
    smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)' )  
    That calls are equivalent.  

操作符接口

1. 判断两个数组是否相似,(当相似值大于limit值时, limit值通过smlar.threshold参数设置)

anyarray % anyarray  
    - returns true if similarity of that arrays is greater than limit  

float4 show_smlar_limit()  - deprecated  
    - shows the limit for % operation  

float4 set_smlar_limit(float4) - deprecated  
    - sets the limit for % operation  

Use instead of show_smlar_limit/set_smlar_limit GUC variable   
smlar.threshold (see below)  

转换函数接口

1. 将tsvector类型转换为text array类型

text[] tsvector2textarray(tsvector)  
    - transforms tsvector type to text array  

2. 对数组内的元素排序并去除重复元素

anyarray array_unique(anyarray)  
    - sort and unique array  

3. 判断数组内是否包含某元素,包含时返回1.0, 不包含时返回0。

float4 inarray(anyarray, anyelement)  
    - returns zero if second argument does not present in a first one  
      and 1.0 in opposite case  

4. 类似三目操作符,判断数组内是否包含某元素,包含时返回第三个参数的值,不包含时返回第四个参数值。

float4 inarray(anyarray, anyelement, float4, float4)  
    - returns fourth argument if second argument does not present in   
      a first one and third argument in opposite case  

参数

1. 相似度LIMIT值,当相似性低于这个值时,%操作符(计算两个数组是否相似) 将返回false。

smlar.threshold  FLOAT  
    Array's with similarity lower than threshold are not similar   
    by % operation  

2. 是否持久化idf表

smlar.persistent_cache BOOL  
    Cache of global stat is stored in transaction-independent memory  

3. 默认的相似性计算公式

smlar.type  STRING  
    Type of similarity formula: cosine(default), tfidf, overlap  

源码,其中tfidf计算需要用到统计idf的表。

        switch(getSmlType())  
        {  
                case ST_TFIDF:  
                        PG_RETURN_FLOAT4( TFIDFSml(sa, sb) );  
                        break;  
                case ST_COSINE:  
                        {  
                                int                             cnt;  
                                double                  power;  

                                power = ((double)(sa->nelems)) * ((double)(sb->nelems));  
                                cnt = numOfIntersect(sa, sb);  

                                PG_RETURN_FLOAT4(  ((double)cnt) / sqrt( power ) );  
                        }  
                        break;  
                case ST_OVERLAP:  
                        {  
                                float4 res = (float4)numOfIntersect(sa, sb);  

                                PG_RETURN_FLOAT4(res);  
                        }  
                        break;  
                default:  
                        elog(ERROR,"Unsupported formula type of similarity");  
        }  

算法详见

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

4. 当相似性计算公式为tfidf时,需要设置这个参数,并且需要一张统计信息表,记录每个词的idf,以及总共有多少文本。

smlar.stattable STRING  
    Name of table stored set-wide statistic. Table should be   
    defined as  

    CREATE TABLE table_name (  
        value   data_type UNIQUE,  -- 词  
        ndoc    int4 (or bigint)  NOT NULL CHECK (ndoc>0)  -- 该词一共出现在几个文本中  
    );  

    And row with null value means total number of documents.  
    See an examples in sql/*g.sql files  

    Note: used on for smlar.type = 'tfidf'  

5. 当相似性计算公式为tfidf时的配置。

TF(词频)计算方法设置:出现次数、1+log(出现次数)、常数1

smlar.tf_method STRING  
    Calculation method for term frequency. Values:  
        "n"     - simple counting of entries (default)  
        "log"   - 1 + log(n)  
        "const" - TF is equal to 1  
    Note: used on for smlar.type = 'tfidf'  

注意 :

GIN index supports only smlar.tf_method = \"const\""

6. 当相似性计算公式为tfidf时的配置。

idf是否+1后再取LOG

smlar.idf_plus_one BOOL  
    If false (default), calculate idf as log(d/df),  
    if true - as log(1+d/df)  
    Note: used on for smlar.type = 'tfidf'  

建议的参数配置

Module provides several GUC variables smlar.threshold, it's highly  
recommended to add to postgesql.conf:  

custom_variable_classes = 'smlar'       # list of custom variable class names  
smlar.threshold = 0.6                   #or any other value > 0 and < 1  

and other smlar.* variables  

索引op class

smlar插件的核心,实际上是计算两个数组(任意类型的数组)的相似性,当然为了提高速度,它也支持索引。

不同类型对应的ops如下,在创建索引是需要使用它们

GiST/GIN support for  %  and  &&  operations for:  
  Array Type   |  GIN operator class  | GiST operator class    
---------------+----------------------+----------------------  
 bit[]         | _bit_sml_ops         |   
 bytea[]       | _bytea_sml_ops       | _bytea_sml_ops  
 char[]        | _char_sml_ops        | _char_sml_ops  
 cidr[]        | _cidr_sml_ops        | _cidr_sml_ops  
 date[]        | _date_sml_ops        | _date_sml_ops  
 float4[]      | _float4_sml_ops      | _float4_sml_ops  
 float8[]      | _float8_sml_ops      | _float8_sml_ops  
 inet[]        | _inet_sml_ops        | _inet_sml_ops  
 int2[]        | _int2_sml_ops        | _int2_sml_ops  
 int4[]        | _int4_sml_ops        | _int4_sml_ops  
 int8[]        | _int8_sml_ops        | _int8_sml_ops  
 interval[]    | _interval_sml_ops    | _interval_sml_ops  
 macaddr[]     | _macaddr_sml_ops     | _macaddr_sml_ops  
 money[]       | _money_sml_ops       |   
 numeric[]     | _numeric_sml_ops     | _numeric_sml_ops  
 oid[]         | _oid_sml_ops         | _oid_sml_ops  
 text[]        | _text_sml_ops        | _text_sml_ops  
 time[]        | _time_sml_ops        | _time_sml_ops  
 timestamp[]   | _timestamp_sml_ops   | _timestamp_sml_ops  
 timestamptz[] | _timestamptz_sml_ops | _timestamptz_sml_ops  
 timetz[]      | _timetz_sml_ops      | _timetz_sml_ops  
 varbit[]      | _varbit_sml_ops      |   
 varchar[]     | _varchar_sml_ops     | _varchar_sml_ops  

smlar 实现 tfidf相似性的 例子

默认的cosine算法的例子参考

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

接下来看看如何使用tfidf

1. 安装smlar插件

/* see http://blog.databasepatterns.com/2014/07/postgresql-install-smlar-extension.html */  
create extension if not exists smlar;  

2. 创建测试表

/* your basic table to search against: */  
create table documents (  
  document_id int primary key,  
  body text not null  
);  

3. 导入测试数据,注意可以直接到http://www.mockaroo.com勾选生成测试数据

/*   
   I created 100,000 "lorem ipsum" documents here http://www.mockaroo.com/c5418bd0  
   In retrospect, not a great choice due to the small number of unique words used to generate the dataset  
*/  
copy documents   
from program 'curl "http://www.mockaroo.com/c5418bd0/download?count=100000&key=5b15a410"'   
with (format csv, header true);  

copy documents   
from '/home/digoal/test.csv'   
with (format csv, header true);  

4. 创建统计表, 记录每个词在多少篇文本中出现过,以及总的文本数。

对于日常的使用,我们可以从词库中导出词组与idf,并生成文本总数,与词出现的次数,导入这张表

比如从scws, jieba分词等词库,导出并导入该表

/* this table holds document frequencies (# of docs in which a term appears) for the documents.body column: */  
create table documents_body_stats (  
  value text unique,  
  ndoc int not null  
);  

5. 使用本地数据,生成idf,(实际生产时,可以忽略,建议使用从词库导入的IDF)

/* used ts_stat for convenience, not ideal, but good for quick n dirty: */  
insert into documents_body_stats  
  select  
    word,  
    ndoc  
  from   
    ts_stat( 'select to_tsvector(''simple'', body) from documents' );  

6. 插入文本数

/* the smlar tfdif table needs the total document count as well. It's added as a row with null in the value column: */  
insert into documents_body_stats values   
  (null, (select count(*) from documents) );  

7. 将文本转换为text数组的函数

/* turn documents into array of words. you could also use tsvector2textarray( to_tsvector(...) ) : */  
create or replace function tokenize(text) returns text[] language sql strict immutable as $$  
 select regexp_split_to_array( lower($1), '[^[:alnum:]]' );  
$$;  

注意tsvector转换为数组时,会丢失重复值

postgres=# select tsvector2textarray(to_tsvector('i am digoal digoal')),  to_tsvector('i am digoal digoal');  
 tsvector2textarray | to_tsvector    
--------------------+--------------  
 {digoal}           | 'digoal':3,4  
(1 row)  

8. 创建表达式索引

/* use smlar's text array opclass. gist is a little more flexible than gin in this case (allows 'n' tf_method): */  
create index on documents using gist ( tokenize(body) _text_sml_ops ); --24 seconds  

9. 参数设置,使用tfidf公式计算数组相似度

/* setup smlar: */  
set smlar.type = 'tfidf';  
set smlar.stattable = 'documents_body_stats';  
set smlar.tf_method = 'n';  
set smlar.threshold = 0.4;  

10. 查询

当TFIDF similarity >= smlar.threshold时,返回。

/* the query */  
select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  


postgres=# explain (analyze,verbose,timing,costs,buffers) select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  
                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=2.18..2.18 rows=1 width=237) (actual time=7.647..7.647 rows=0 loops=1)  
   Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
   Buffers: shared hit=79  
   ->  Sort  (cost=2.18..2.18 rows=1 width=237) (actual time=7.647..7.647 rows=0 loops=1)  
         Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
         Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])) DESC  
         Sort Method: quicksort  Memory: 25kB  
         Buffers: shared hit=79  
         ->  Index Scan using documents_tokenize_idx on public.documents  (cost=0.14..2.17 rows=1 width=237) (actual time=7.641..7.641 rows=0 loops=1)  
               Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])  
               Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) % '{fringilla,malesuada,euismod}'::text[])  
               Rows Removed by Index Recheck: 61  
               Buffers: shared hit=79  
 Planning time: 0.148 ms  
 Execution time: 7.703 ms  
(15 rows)  

我们可以把相似度调高,从而排除更多的记录,提高查询效率

postgres=# set smlar.threshold = 0.8;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select  
  *,  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] )  
from  
  documents  
where  
  tokenize(body) % '{fringilla,malesuada,euismod}'::text[] -- where TFIDF similarity >= smlar.threshold  
order by  
  smlar( tokenize(body), '{fringilla,malesuada,euismod}'::text[] ) desc  
limit 10;  
                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=2.18..2.18 rows=1 width=237) (actual time=1.051..1.051 rows=0 loops=1)  
   Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
   Buffers: shared hit=29  
   ->  Sort  (cost=2.18..2.18 rows=1 width=237) (actual time=1.049..1.049 rows=0 loops=1)  
         Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[]))  
         Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])) DESC  
         Sort Method: quicksort  Memory: 25kB  
         Buffers: shared hit=29  
         ->  Index Scan using documents_tokenize_idx on public.documents  (cost=0.14..2.17 rows=1 width=237) (actual time=1.042..1.042 rows=0 loops=1)  
               Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,malesuada,euismod}'::text[])  
               Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) % '{fringilla,malesuada,euismod}'::text[])  
               Rows Removed by Index Recheck: 1  
               Buffers: shared hit=29  
 Planning time: 0.190 ms  
 Execution time: 1.113 ms  
(15 rows)  

如果你不需要对结果按相似度排序的话,也可以把order by去掉,提升性能。  

MADlib训练文本idf

Basic steps for clustering documents can vary a bit depending on your
precise goals, but one example would be:

1. Often there is an initial first level of processing to handle
tokenization, word stemming, filter stopwords etc.

There is no right or wrong way of handling this step and there is a fair
amount of art to doing it well.

2. Build feature vectors for each document.
There are a variety of methods for handling this, one common example
would be to build tf-idfs (http://en.wikipedia.org/wiki/Tfidf).

For which you need the following metrics:
a) word count for each document
b) how many documents word occur across all documents
c) documents count

  1. Having produced a feature vectors for your documents you can then call
    the kmeans function in MADlib to perform the actual clustering (
    http://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html).
    Given a tf-idf
    feature vector the most common distance function is cosine similarity,
    though other distance functions may make sense depending on your use case.

http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html

rum也在支持tfidf了。

参考

《文本(关键词)分析 - TF(Term Frequency 词频) IDF(Inverse Document Frequency 逆向文本频率)》

《从相似度算法谈起 - Effective similarity search in PostgreSQL》

http://www.sigaev.ru/

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=blob;f=README;h=8fa81c51d749cc714e52257a15edf547f0b26edf;hb=8d02df18c0bbfd6ccba94a9582499ec8746047e5

http://www.pgcon.org/2012/schedule/events/443.en.html

http://www.pgcon.org/2012/schedule/attachments/252_smlar-2012.pdf

http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/

上一篇:Knight_day15


下一篇:php 7.2 编译安装