Analyzer
An analyzer is made up of three parts:
Character Filters, Tokenizers, and Token Filters
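Character Filters preprocess the character stream before it reaches the tokenizer. The official documentation gives two examples: converting Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or stripping HTML elements such as <b> from the text.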
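Both cases can be tried directly with the _analyze API. The requests below are a minimal sketch: the built-in html_strip and mapping character filters are real, but the sample texts and the choice of tokenizers are assumptions made for illustration.

```console
# Strip HTML markup before tokenizing; the tokens should be "hello" and "world"
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<b>hello</b> world"
}

# Map Hindu-Arabic numerals to their Arabic-Latin equivalents
GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0", "١ => 1", "٢ => 2", "٣ => 3", "٤ => 4",
        "٥ => 5", "٦ => 6", "٧ => 7", "٨ => 8", "٩ => 9"
      ]
    }
  ],
  "text": "My license plate is ٢٥٠١٥"
}
```

The keyword tokenizer in the second request keeps the whole input as a single token, which makes the effect of the character filter easy to see: the token text should read "My license plate is 25015".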
Tokenizers split the text into tokens. Commonly used tokenizers include whitespace and standard.
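The difference between the two tokenizers shows up as soon as the text contains punctuation. A small comparison, with a sample sentence chosen only for illustration:

```console
# whitespace splits on whitespace only, so punctuation stays attached:
# the tokens should be "Hello," and "world!"
GET _analyze
{
  "tokenizer": "whitespace",
  "text": "Hello, world!"
}

# standard splits on word boundaries and drops most punctuation:
# the tokens should be "Hello" and "world"
GET _analyze
{
  "tokenizer": "standard",
  "text": "Hello, world!"
}
```

Neither tokenizer changes case; that is left to token filters such as lowercase.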
Token filters
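- Standard Token Filter: currently does nothing. (It was removed in Elasticsearch 7.0, so it only exists in earlier versions.)
- ASCII Folding Token Filter: a token filter of type asciifolding that converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.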
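A quick way to see ASCII folding at work is to run accented text through the filter. This is a sketch with an illustrative sample text:

```console
# Accented letters are folded to their ASCII equivalents;
# the tokens should be "acai", "a", "la" and "carte"
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["asciifolding"],
  "text": "açaí à la carte"
}
```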
- Length Token Filter
The length filter removes tokens that are too long or too short.
min sets the minimum token length
max sets the maximum token length
Usage:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [{ "type": "length", "min": 1, "max": 3 }],
  "text": "this is a test"
}
Result (this and test are dropped because their four characters exceed max):
"tokens": [
  { "token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
  { "token": "a", "start_offset": 8, "end_offset": 9, "type": "<ALPHANUM>", "position": 2 }
]
- Lowercase Token Filter: normalizes token text to lowercase
- Uppercase Token Filter: normalizes token text to uppercase
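Both case filters can be checked with the same input. The sample text here is an assumption chosen for illustration:

```console
# lowercase: the tokens should be "hello" and "world"
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Hello WORLD"
}

# uppercase: the tokens should be "HELLO" and "WORLD"
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["uppercase"],
  "text": "Hello WORLD"
}
```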
- Stop Token Filter: removes the listed stop words from the token stream. Input:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [{ "type": "stop", "stopwords": ["this", "a"] }],
  "text": ["this is a test"]
}
Output (the stop words this and a have been filtered out):
"tokens": [
  { "token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
  { "token": "test", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 }
]
- Stemmer Token Filter: provides access to almost all of the available stemming token filters through a single unified interface. Usage:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "my_stemmer"]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "light_german"
        }
      }
    }
  }
}
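To see the stemmer applied, the analyzer can be called through the index-scoped _analyze endpoint. This is a minimal sketch, assuming the my_index index from the request above has been created successfully; the German sample text is an assumption chosen for illustration.

```console
# Run sample text through my_analyzer (standard tokenizer, lowercase, light_german stemmer)
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Die Häuser und die Kinder"
}
```

Each returned token has passed through the filter chain in the order listed in the analyzer definition, so the stemmer operates on already lowercased tokens.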
- Synonym Token Filter: handles synonyms during analysis
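A synonym filter is usually defined in the index settings, in the same way as the stemmer filter. The sketch below is illustrative only: the index name synonym_index, the analyzer name my_synonym_analyzer, the filter name my_synonyms, and the synonym rules themselves are all hypothetical.

```console
PUT /synonym_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      },
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["quick, fast", "jump => leap"]
        }
      }
    }
  }
}
```

A comma-separated rule such as "quick, fast" treats the terms as equivalent, while a rule with => replaces the terms on the left with the terms on the right.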
- Reverse Token Filter: reverses the characters of each token. Example:
Request:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["reverse"],
  "text": ["hello world"]
}
Result:
"tokens": [
  { "token": "olleh", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 0 },
  { "token": "dlrow", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 }
]