Elasticsearch 分词器

2022-06-16 01:14:13

无论是内置的分析器（analyzer），还是自定义的分析器（analyzer），都由三种构件块组成的：character filters ， tokenizers ， token filters。

内置的analyzer将这些构建块预先打包到适合不同语言和文本类型的analyzer中。

Character filters （字符过滤器）

字符过滤器以字符流的形式接收原始文本，并可以通过添加、删除或更改字符来转换该流。

举例来说，一个字符过滤器可以用来把阿拉伯数字（٠‎١٢٣٤٥٦٧٨‎٩）‎转成成Arabic-Latin的等价物（0123456789）。

一个分析器可能有0个或多个字符过滤器，它们按顺序应用。

（PS：类似Servlet中的过滤器，或者拦截器，想象一下有一个过滤器链）

Tokenizer （分词器）

一个分词器接收一个字符流，并将其拆分成单个token （通常是单个单词），并输出一个token流。例如，一个whitespace分词器当它看到空白的时候就会将文本拆分成token。它会将文本“Quick brown fox!”转换为[Quick, brown, fox!]

（PS：Tokenizer 负责将文本拆分成单个token ，这里token就指的就是一个一个的单词。就是一段文本被分割成好几部分，相当于Java中的字符串的 split ）

分词器还负责记录每个term的顺序或位置，以及该term所表示的原单词的开始和结束字符偏移量。（PS：文本被分词后的输出是一个term数组）

一个分析器必须只能有一个分词器

Token filters （token过滤器）

token过滤器接收token流，并且可能会添加、删除或更改tokens。

例如，一个lowercase token filter可以将所有的token转成小写。stop token filter可以删除常用的单词，比如 the 。synonym token filter可以将同义词引入token流。

不允许token过滤器更改每个token的位置或字符偏移量。

一个分析器可能有0个或多个token过滤器，它们按顺序应用。

小结&回顾

analyzer（分析器）是一个包，这个包由三部分组成，分别是：character filters （字符过滤器）、tokenizer（分词器）、token filters（token过滤器）

一个analyzer可以有0个或多个character filters

一个analyzer有且只能有一个tokenizer

一个analyzer可以有0个或多个token filters

character filter 是做字符转换的，它接收的是文本字符流，输出也是字符流

tokenizer 是做分词的，它接收字符流，输出token流（文本拆分后变成一个一个单词，这些单词叫token）

token filter 是做token过滤的，它接收token流，输出也是token流

由此可见，整个analyzer要做的事情就是将文本拆分成单个单词，文本 ----> 字符 ----> token

这就好比是拦截器

1. 测试分析器

analyze API 是一个工具，可以帮助我们查看分析的过程。（PS：类似于执行计划）

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'

{

  "analyzer": "whitespace",

  "text":     "The quick brown fox."

}

'

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'

{

  "tokenizer": "standard",

  "filter":  [ "lowercase", "asciifolding" ],

  "text":      "Is this déja vu?"

}

'

输出：

{

    "tokens":[

        {

            "token":"The",

            "start_offset":,

            "end_offset":,

            "type":"word",

            "position":

        },

        {

            "token":"quick",

            "start_offset":,

            "end_offset":,

            "type":"word",

            "position":

        },

        {

            "token":"brown",

            "start_offset":,

            "end_offset":,

            "type":"word",

            "position":

        },

        {

            "token":"fox.",

            "start_offset":,

            "end_offset":,

            "type":"word",

            "position":

        }

    ]

}

可以看到，对于每个term，记录了它的位置和偏移量

2. Analyzer

2.1. 配置内置的分析器

内置的分析器不用任何配置就可以直接使用。当然，默认配置是可以更改的。例如，standard分析器可以配置为支持停止字列表:

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'

{

  "settings": {

    "analysis": {

      "analyzer": {

        "std_english": {

          "type":      "standard",

          "stopwords": "_english_"

        }

      }

    }

  },

  "mappings": {

    "_doc": {

      "properties": {

        "my_text": {

          "type":     "text",

          "analyzer": "standard",

          "fields": {

            "english": {

              "type":     "text",

              "analyzer": "std_english"

            }

          }

        }

      }

    }

  }

}

'

在这个例子中，我们基于standard分析器来定义了一个std_englisth分析器，同时配置为删除预定义的英语停止词列表。后面的mapping中，定义了my_text字段用standard，my_text.english用std_english分析器。因此，下面两个的分词结果会是这样的：

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'

{

  "field": "my_text",

  "text": "The old brown cow"

}

'

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'

{

  "field": "my_text.english",

  "text": "The old brown cow"

}

'

第一个由于用的standard分析器，因此分词的结果是：[ the, old, brown, cow ]

第二个用std_english分析的结果是：[ old, brown, cow ]

2.2. Standard Analyzer （默认）

如果没有特别指定的话，standard 是默认的分析器。它提供了基于语法的标记化（基于Unicode文本分割算法），适用于大多数语言。

例如：

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'

{

  "analyzer": "standard",

  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."

}

'

上面例子中，那段文本将会输出如下terms：

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

2.2.1. 配置

标准分析器接受下列参数：

max_token_length ：最大token长度，默认255
stopwords ：预定义的停止词列表，如_english_ 或包含停止词列表的数组，默认是 _none_
stopwords_path ：包含停止词的文件路径

2.2.2. 示例配置

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'

{

  "settings": {

    "analysis": {

      "analyzer": {

        "my_english_analyzer": {

          "type": "standard",

          "max_token_length": ,

          "stopwords": "_english_"

        }

      }

    }

  }

}

'

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'

{

  "analyzer": "my_english_analyzer",

  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."

}

'

以上输出下列terms:

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]

2.2.3. 定义

standard分析器由下列两部分组成：

Tokenizer

Standard Tokenizer

Token Filters

Standard Token Filter
Lower Case Token Filter
Stop Token Filter （默认被禁用）

你还可以自定义

curl -X PUT "localhost:9200/standard_example" -H 'Content-Type: application/json' -d'

{

  "settings": {

    "analysis": {

      "analyzer": {

        "rebuilt_standard": {

          "tokenizer": "standard",

          "filter": [

            "lowercase"

          ]

        }

      }

    }

  }

}

'

2.3. Simple Analyzer

simple 分析器当它遇到只要不是字母的字符，就将文本解析成term，而且所有的term都是小写的。例如：

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'

{

  "analyzer": "simple",

  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."

}

'

输入结果如下：

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

2.3.1. 自定义

curl -X PUT "localhost:9200/simple_example" -H 'Content-Type: application/json' -d'

{

  "settings": {

    "analysis": {

      "analyzer": {

        "rebuilt_simple": {

          "tokenizer": "lowercase",

          "filter": [

          ]

        }

      }

    }

  }

}

'

2.4. Whitespace Analyzer

whitespace 分析器，当它遇到空白字符时，就将文本解析成terms

示例：

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'

{

  "analyzer": "whitespace",

  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."

}

'

输出结果如下：

[ The, , QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

2.5. Stop Analyzer

stop 分析器和 simple 分析器很像，唯一不同的是，stop 分析器增加了对删除停止词的支持。默认用的停止词是 _englisht_

（PS：意思是，假设有一句话“this is a apple”，并且假设“this” 和 “is”都是停止词，那么用simple的话输出会是[ this , is , a , apple ]，而用stop输出的结果会是[ a , apple ]，到这里就看出二者的区别了，stop 不会输出停止词，也就是说它不认为停止词是一个term）

（PS：所谓的停止词，可以理解为分隔符）

2.5.1. 示例输出

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'

{

    "analyzer": "stop",

    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."

}

'

输出

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

2.5.2. 配置

stop 接受以下参数：

stopwords ：一个预定义的停止词列表（比如，_englisht_）或者是一个包含停止词的列表。默认是 _english_
stopwords_path ：包含停止词的文件路径。这个路径是相对于Elasticsearch的config目录的一个路径

2.5.3. 示例配置

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'

{

  "settings": {

    "analysis": {

      "analyzer": {

        "my_stop_analyzer": {

          "type": "stop",

          "stopwords": ["the", "over"]

        }

      }

    }

  }

}

'

上面配置了一个stop分析器，它的停止词有两个：the 和 over

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'

{

  "analyzer": "my_stop_analyzer",

  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."

}

'

基于以上配置，这个请求输入会是这样的：

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]

2.6. Pattern Analyzer

用Java正则表达式来将文本分割成terms，默认的正则表达式是\W+（非单词字符）

2.6.1. 示例输出

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'

{

  "analyzer": "pattern",

  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."

}

'

由于默认按照非单词字符分割，因此输出会是这样的：

[ the, , quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

2.6.2. 配置

pattern 分析器接受如下参数：

pattern ：一个Java正则表达式，默认 \W+
flags ： Java正则表达式flags。比如：CASE_INSENSITIVE 、COMMENTS
lowercase ：是否将terms全部转成小写。默认true
stopwords ：一个预定义的停止词列表，或者包含停止词的一个列表。默认是 _none_
stopwords_path ：停止词文件路径

2.6.3. 示例配置

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'

{

  "settings": {

    "analysis": {

      "analyzer": {

        "my_email_analyzer": {

          "type":      "pattern",

          "pattern":   "\\W|_",

          "lowercase": true

        }

      }

    }

  }

}

'

上面的例子中配置了按照非单词字符或者下划线分割，并且输出的term都是小写

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'

{

  "analyzer": "my_email_analyzer",

  "text": "John_Smith@foo-bar.com"

}

'

因此，基于以上配置，本例输出如下：

[ john, smith, foo, bar, com ]

2.7. Language Analyzers

支持不同语言环境下的文本分析。内置（预定义）的语言有：arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

2.8. 自定义Analyzer

前面也说过，一个分析器由三部分构成：

zero or more character filters
a tokenizer
zero or more token filters

2.8.1. 实例配置

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'

{

  "settings": {

    "analysis": {

      "analyzer": {

        "my_custom_analyzer": {

          "type":      "custom",

          "tokenizer": "standard",

          "char_filter": [

            "html_strip"

          ],

          "filter": [

            "lowercase",

            "asciifolding"

          ]

        }

      }

    }

  }

}

'

3. Tokenizer

3.1. Standard Tokenizer

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'

{

  "tokenizer": "standard",

  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."

}

'

4. 中文分词器

4.1. smartCN

一个简单的中文或中英文混合文本的分词器

这个插件提供 smartcn analyzer 和 smartcn_tokenizer tokenizer，而且不需要配置

# 安装

bin/elasticsearch-plugin install analysis-smartcn

# 卸载

bin/elasticsearch-plugin remove analysis-smartcn

下面测试一下

可以看到，“今天天气真好”用smartcn分析器的结果是：

[ 今天 ， 天气 ， 真 ， 好 ]

如果用standard分析器的话，结果会是：

[ 今 ，天 ，气 ， 真 ， 好 ]

4.2. IK分词器

下载对应的版本，这里我下载6.5.3

然后，在Elasticsearch的plugins目录下建一个ik目录，将刚才下载的文件解压到该目录下

最后，重启Elasticsearch

接下来，还是用刚才那句话来测试一下

输出结果如下：

{

    "tokens": [

        {

            "token": "今天天气",

            "start_offset": ,

            "end_offset": ,

            "type": "CN_WORD",

            "position":

        },

        {

            "token": "今天",

            "start_offset": ,

            "end_offset": ,

            "type": "CN_WORD",

            "position":

        },

        {

            "token": "天天",

            "start_offset": ,

            "end_offset": ,

            "type": "CN_WORD",

            "position":

        },

        {

            "token": "天气",

            "start_offset": ,

            "end_offset": ,

            "type": "CN_WORD",

            "position":

        },

        {

            "token": "真好",

            "start_offset": ,

            "end_offset": ,

            "type": "CN_WORD",

            "position":

        }

    ]

}

显然比smartcn要更好一点

5. 参考

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

https://github.com/medcl/elasticsearch-analysis-ik

码农公寓

相关文章