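I. Normalization
Normalization runs after tokenization. It covers case conversion, removal of filler words and stop words (for example is, an), and reduction of plural forms to singular.
Each analyzer applies its own normalization strategy, as illustrated in the accompanying figure.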
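To make the three normalization steps concrete, here is a minimal Python sketch. It does not call Elasticsearch; `normalize_tokens`, the two-word stop list, and the crude plural rule are all assumptions made for illustration, and real token filters (lowercase, stop, stemmer) behave more carefully.

```python
# Stop words taken from the two examples in the text; real stop lists are much longer.
STOP_WORDS = {"is", "an"}


def normalize_tokens(tokens):
    """Lowercase each token, drop stop words, and crudely strip a plural 's'."""
    normalized = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        # Naive plural handling (an illustrative assumption, not a real stemmer).
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        normalized.append(token)
    return normalized


print(normalize_tokens(["Teacher", "is", "an", "Expert", "on", "Dogs"]))
# → ['teacher', 'expert', 'on', 'dog']
```

The input is already a list of tokens, which mirrors the point that normalization happens after the text has been split.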
II. Char filter
A char filter (character filter) processes the raw text before it is tokenized.
1. HTML Strip
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      }
    }
  }
}
The escaped_tags setting lists the HTML tags that should be kept; here, a tags are left in place while all other tags are stripped.
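As a rough picture of what this filter does to the text, the following Python sketch removes tags, keeps the ones listed in `escaped_tags`, and decodes HTML entities. It is only an approximation built on a regular expression: `strip_html` is a hypothetical helper, and details of the real html_strip filter, such as replacing block-level tags with line breaks, are not reproduced.

```python
import html
import re

# Matches an opening or closing tag and captures the tag name.
TAG_PATTERN = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_html(text, escaped_tags=()):
    """Remove HTML tags except those in escaped_tags, then decode entities."""
    keep = {tag.lower() for tag in escaped_tags}

    def replace_tag(match):
        # Leave the tag untouched when its name is in the keep list.
        return match.group(0) if match.group(1).lower() in keep else ""

    return html.unescape(TAG_PATTERN.sub(replace_tag, text))


sample = '<p>I&apos;m so <a href="x">happy</a>!</p>'
print(strip_html(sample, escaped_tags=["a"]))
# → I'm so <a href="x">happy</a>!
```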
2. Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "滚 => *",
            "垃 => *",
            "圾 => *"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      }
    }
  }
}
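The effect of these three mappings can be previewed locally with a short Python sketch. `apply_mappings` is a hypothetical helper, and the sample sentence is made up; `str.translate` is enough here only because every key is a single character, whereas multi-character keys would need longest-match replacement.

```python
# The same three rules as in the mapping char filter above.
MAPPINGS = {"滚": "*", "垃": "*", "圾": "*"}


def apply_mappings(text, mappings=MAPPINGS):
    """Replace each mapped character with its substitute."""
    # str.maketrans accepts a dict whose keys are single characters.
    return text.translate(str.maketrans(mappings))


print(apply_mappings("你就是个垃圾!滚"))
# → 你就是个**!*
```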
3. Pattern Replace (regular-expression replacement)
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      }
    }
  }
}
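The pattern above keeps the first three and last four digits of an 11-digit run and masks the middle four. A quick way to check the regular expression before putting it in an index is to try it in Python, as sketched below. `mask_digits` and the sample number are made up for illustration; note that Python writes group references as `\1` where the Elasticsearch (Java) replacement uses `$1`.

```python
import re

# Same pattern as the char filter: 3 digits, 4 digits to hide, 4 digits.
PATTERN = re.compile(r"(\d{3})\d{4}(\d{4})")


def mask_digits(text):
    """Mask the middle four digits of any 11-digit run in the text."""
    return PATTERN.sub(r"\1****\2", text)


print(mask_digits("13812345678"))
# → 138****5678
```

Text that contains fewer than eleven consecutive digits does not match and passes through unchanged.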
III. Tokenizer
A tokenizer's main job is to split text into tokens. The default tokenizer is standard.
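The char filter examples in the previous section use the keyword tokenizer, which emits the whole input as a single token, while standard splits text on word boundaries. The Python sketch below contrasts the two. Both functions are hypothetical stand-ins: the real standard tokenizer follows Unicode text segmentation rules, and the `\w+` regular expression here only approximates it for simple English input.

```python
import re


def keyword_tokenize(text):
    """Emit the entire input as one token, like the keyword tokenizer."""
    return [text]


def standard_like_tokenize(text):
    """Very rough stand-in for the standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


sample = "The 2 QUICK Brown-Foxes"
print(keyword_tokenize(sample))
# → ['The 2 QUICK Brown-Foxes']
print(standard_like_tokenize(sample))
# → ['The', '2', 'QUICK', 'Brown', 'Foxes']
```

Neither function changes case: lowercasing belongs to the normalization step, not to the tokenizer.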