上一片博文说明了全文索引的原理以及一些参数设置及如何创建全文索引。
MySQL数据库支持全文索引的查询,其语法如下:
MATCH (col1, col2,...) AGAINST (expr )
: { IN NATURAL LANGUAGE MODE | IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION | IN BOOLEAN MODE | WITH QUERY EXPANSION } MySQL数据库通过mathc()...against()语法支持全文检索的查询,match指定了需要查询的列,against指定了使用何种方法进行查询。
NATURAL LANGUAGE
全文检索通过match函数进行查询,默认采用natural language模式,其表示查询带有指定word的文档。
在上一片博客中,创建了一个表,以及在表中插入了数据,并且创建了全文索引,表中内容如下:
mysql> select * from tb3;+------------+-------------------------------------+| FTS_DOC_ID | body |+------------+-------------------------------------+| 1 | pLease porridge in the pot || 2 | please say sorry || 4 | some like it hot, some like it cold || 5 | i like coding || 6 | fuck the company |+------------+-------------------------------------+5 rows in set (0.00 sec)
我们查询body中含有“please”单词的记录。
#不使用全文索引时的情况如下 mysql tb3 body "please FTS_DOC_ID body pLease porridge the pot please say sorry rows ( explain tb3 body "please id select_type partitions type possible_keys key_len ref rows filtered Extra SIMPLE tb3 Using row , warning ( tb3 match(body) against("please" FTS_DOC_ID body pLease porridge the pot please say sorry rows ( explain tb3 match(body) against("please" id select_type partitions type possible_keys key_len ref rows filtered Extra SIMPLE tb3 fulltext ft_index ft_index const Using ; Ft_hints: sorted
由两次执行计划可以看到,使用全文索引过滤率为100%,仅扫描了一行;而不使用全文索引的时候过滤率仅为16.67%,并且是全表扫描。
在where条件中使用mathc函数,查询返回的结果是根据相关性进行降序排序的,即相关性最高的结果放在第一位。相关性的值是一个非负的浮点数,0表示没有任何相关性。根据MySQL官方文档可知,其相关性的计算依据是以下4个条件:
- word是否在文档中出现。
- word在文档中出现的次数。
- word在索引列中的数量。
- 多少个文档包含该word。
mysql tb3 row affected ( Duplicates: Warnings: tb3 match(body) against("please" FTS_DOC_ID body please please please pLease porridge the pot please say sorry rows ( () tb3 match(body) against("please" () row ( ((MATCH (body) AGAINST ("please" NATURAL LANGUAGE MODE),, row (
此外还可以使用SQL语句查看对应的相关性:
mysql> select fts_doc_id, body, match(body) against("please" in natural language mode) as relevence from tb3;+------------+-------------------------------------+---------------------+| fts_doc_id | body | relevence |+------------+-------------------------------------+---------------------+| 1 | pLease porridge in the pot | 0.13540691137313843 || 2 | please say sorry | 0.13540691137313843 || 4 | some like it hot, some like it cold | 0 || 5 | i like coding | 0 || 6 | fuck the company | 0 || 8 | please please please | 0.4062207341194153 |+------------+-------------------------------------+---------------------+6 rows in set (0.01 sec)
对于innodb存储引擎的全文索引中,还需要考虑一下因素。
- 查询的word字段在stopword列中,忽略该字符串的查询。
- 查询的word字符长度是否在区间[innodb_ft_min_token_size, innodb_ft_max_token_size]之间。
mysql show variables "innodb_ft_min Variable_name Value innodb_ft_min_token_size row ( show variables "innodb_ft_max Variable_name Value innodb_ft_max_token_size row ( sec) #这两个参数用于控制innodb存储引擎查询字符的长度,当长度小于innodb_ft_min_token_size,或者长度大于innodb_ft_max_token_size时,会忽略该次的搜索。
Boolean
MySQL数据库允许使用IN BOOLEAN MODE修饰符来进行全文检索。当使用该修饰符时,查询字符串的前后字符会有特殊的含义。
mysql> select * from tb3 where match(body) against ("+like -hot" in boolean mode);+------------+---------------+| FTS_DOC_ID | body |+------------+---------------+| 5 | i like coding |+------------+---------------+1 row in set (0.00 sec) #上面这个要求查询含有like但是没有hot字符的文档。
Boolean全文索引支持以下几种操作:
- +: 表示该word必须存在
- -:表示该word必须被排除
- (no operator): 表示该word是可选的,但是如果出现,其相关性会更高。
- @distance表示查询的多个单词之间的距离是否在distance之内,distance的单位是字节。这种全文素银的查询也称为Proximity Search.
- >表示出现该单词增加相关性
- <表示出现该单词降低相关性
- *表示以0个或多个字符。
- “表示短语
tb3 match (body) against ("please hot" FTS_DOC_ID body it hot, it cold please please please pLease porridge the pot please say sorry rows ( tb3 match(body) against ( FTS_DOC_ID body please say sorry row ( tb3 match(body) against ( ( fts_doc_id, body, match(body) against("please say pot" boolean mode) relevence fts_doc_id body relevence pLease porridge the pot please say sorry it hot, it cold i coding fuck the company please please please rows ( tb3 match(body) against ( FTS_DOC_ID body it hot, it cold please say sorry rows (
query expansion
MySQL数据库还支持全文索引的扩展查询。这种查询通常是在查询的关键词太短,用户需要implied knowledge(隐含知识)时进行。例如,对于单词database的查询,用户可能希望查询不仅仅包含database文档,可能还指那些包含MySQL,oracle,DB2,RDBMS的单词,而这时可以使用query expansion模式开启全文检索的implied knowledge.
通常在查询短语中添加WITH QUERY EXPANSION或IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION也可以开启blind query expansion.该查询分为两个阶段。
- 第一阶段: 根据搜索的单词进行全文索引查询。
- 第二阶段:根据第一阶段产生的分词再进行一次全文检索查询。
UNSIGNED auto_increment ( articles(title, body) mysql tricks",". Never run mysqld the following IBM IBM..."); mysql> select * from articles; #查询数据 +----+-----------------------+-----------------------------------------+ | id | title | body | +----+-----------------------+-----------------------------------------+ | 17 | mysql tutorial | DBMS stands for DATABASE.... | | 18 | How to use mysql well | After you went through a ... | | 19 | optimizing mysql | in this tutorial we will SHOW.... | | 20 | 1001 mysql tricks | 1. Never run mysqld as root... | | 21 | mysql VS pgsql | In the following database comparison... | | 22 | mysql security | When configured properly....,mysql... | | 23 | tunning DB2 | for IBM database | | 24 | IBM History | DB2 history for IBM... | +----+-----------------------+-----------------------------------------+ 8 rows in set (0.00 sec)
在这个例子中,并没有显示创建FTS_DOC_ID列,因此innodb存储引擎会自动创建该列,并添加唯一索引(都是隐藏的)。此外,又创建了列title和body的联合索引。
mysql> select * from articles where match(title, body) against("database" ); #默认是使用in natural language mode!+----+----------------+-----------------------------------------+| id | title | body |+----+----------------+-----------------------------------------+| 17 | mysql tutorial | DBMS stands for DATABASE.... || 21 | mysql VS pgsql | In the following database comparison... || 23 | tunning DB2 | for IBM database |+----+----------------+-----------------------------------------+3 rows in set (0.00 sec) mysql> select * from articles where match(title, body) against("database" with query expansion);+----+-----------------------+-----------------------------------------+| id | title | body |+----+-----------------------+-----------------------------------------+| 21 | mysql VS pgsql | In the following database comparison... || 17 | mysql tutorial | DBMS stands for DATABASE.... || 23 | tunning DB2 | for IBM database || 24 | IBM History | DB2 history for IBM... || 19 | optimizing mysql | in this tutorial we will SHOW.... || 22 | mysql security | When configured properly....,mysql... || 18 | How to use mysql well | After you went through a ... || 20 | 1001 mysql tricks | 1. Never run mysqld as root... |+----+-----------------------+-----------------------------------------+8 rows in set (0.01 sec)
因为query expansion会带来许多非相关性的查询,因此使用的时候,需要特别注意。