es - Elasticsearch custom analyzers - built-in token filters - 5

There is no perfect program in the world, but that does not discourage us, because writing programs is a process of constantly pursuing perfection.

A custom analyzer is composed of three parts (a composition sketch follows this outline):

  1. Character filters:
        1. Purpose: add, remove, or transform characters
        2. Count: zero or more
        3. Built-in character filters:
            1. HTML Strip Character Filter: strips HTML tags
            2. Mapping Character Filter: replaces characters via a mapping table
            3. Pattern Replace Character Filter: replaces via regular expressions
  2. Tokenizer:
        1. Purpose:
            1. splits the text into tokens
            2. records token order and position (used by phrase queries)
            3. records each token's start and end character offsets (used for highlighting)
            4. records the token type (used for classification)
        2. Count: exactly one
        3. Categories:
            1. Word-oriented:
                1. Standard
                2. Letter
                3. Lowercase
                4. Whitespace
                5. UAX URL Email
                6. Classic
                7. Thai
            2. Partial word:
                1. N-Gram
                2. Edge N-Gram
            3. Structured text:
                1. Keyword
                2. Pattern
                3. Simple Pattern
                4. Char Group
                5. Simple Pattern Split
                6. Path
  3. Token filters:
        1. Purpose: add, remove, or transform tokens
        2. Count: zero or more
        3. Built-in token filters:
            1. apostrophe
            2. asciifolding
            3. cjk bigram
            4. cjk width
            5. classic
            6. common grams
            7. conditional
            8. decimal digit
            9. delimited payload
            10. dictionary decompounder
            11. edge ngram
            12. elision
            13. fingerprint
            14. flatten_graph
            15. hunspell
            16. hyphenation decompounder
            17. keep types
            18. keep words
            19. keyword marker

Today we demonstrate filters 16-19 from the list above; number 16 (hyphenation decompounder) could not be made to work.

# hyphenation decompounder token filter
# Purpose : uses hyphenation patterns to find subwords; subwords found in the word list are emitted
# Options :
#   1. hyphenation_patterns_path
#   2. word_list
#   3. word_list_path
#   4. max_subword_size
#   5. min_subword_size
#   6. min_word_size
#   7. only_longest_match
# I could not get a working result - it is a configuration problem with the XML patterns file; if anyone knows how, please leave feedback
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [{
    "type" : "hyphenation_decompounder",
    "hyphenation_patterns_path" : "hyph/hyphenation_patterns.xml",
    "word_list" : ["hello", "good", "me"]
  }],
  "text": ["thisismy-hello-good"]
}

# Result (the filter had no effect; the token passed through unsplit)
{
  "tokens" : [
    {
      "token" : "thisismy-hello-good",
      "start_offset" : 0,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    }
  ]
}
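
If the XML patterns file cannot be sorted out, the closely related dictionary_decompounder filter performs brute-force subword matching against word_list alone and needs no patterns file. A sketch under that assumption - any subwords found in the list should be emitted alongside the original token:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [{
    "type"      : "dictionary_decompounder",
    "word_list" : ["hello", "good", "me"]
  }],
  "text": ["thisismy-hello-good"]
}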
# keep types token filter
# Purpose : keep or remove tokens of the specified types
# Options :
#   1. types : the token types to match
#   2. mode  : whether to keep or remove the listed types
#      1. include : keep them
#      2. exclude : remove them
# Tokenizer : standard - whitespace will not work, since it marks every token as type "word" rather than <NUM>/<ALPHANUM>
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
    "type"  : "keep_types",
    "types" : ["<NUM>"],
    "mode"  : "exclude"
  }],
  "text": ["1 hello 2 good 3 me"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "me",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
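For completeness, flipping mode to include should invert the demo above and keep only the <NUM> tokens (a sketch; the expected tokens are 1, 2, and 3):

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
    "type"  : "keep_types",
    "types" : ["<NUM>"],
    "mode"  : "include"
  }],
  "text": ["1 hello 2 good 3 me"]
}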
# keep words token filter
# Purpose : keep only the specified words
# Options :
#   1. keep_words      : the words to keep
#   2. keep_words_path : path to a file listing the words to keep
#   3. keep_words_case : lowercase the keep words, default false
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
    "type" : "keep",
    "keep_words" : ["Hello", "me"],
    "keep_words_case" : true
  }],
  "text": ["hello good me"]
}

# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "me",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
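In practice the keep list usually lives in a file rather than inline, which is what keep_words_path is for. A sketch of registering the filter in index settings - the index name keep_words_demo, the filter and analyzer names, and the file path analysis/keep_words.txt are placeholders; the path is resolved relative to the Elasticsearch config directory:

PUT /keep_words_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "my_keep_filter": {
          "type"            : "keep",
          "keep_words_path" : "analysis/keep_words.txt"
        }
      },
      "analyzer": {
        "my_keep_analyzer": {
          "type"      : "custom",
          "tokenizer" : "standard",
          "filter"    : ["my_keep_filter"]
        }
      }
    }
  }
}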

# keyword marker token filter
# Purpose : marks the given tokens as keywords so that stemming filters leave them unstemmed
# Options :
#   1. ignore_case      : ignore case, default false
#   2. keywords         : list of keywords
#   3. keywords_path    : path to a keywords file
#   4. keywords_pattern : a Java regular expression matching the tokens to mark
# stemmer is the stemming token filter; keyword_marker must appear before it in the chain
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [{
    "type"        : "keyword_marker",
    "keywords"    : ["hello", "Gooding", "me"],
    "ignore_case" : true
  }, "stemmer"],
  "text": ["gooding hello meing"]
}

# Result
{
  "tokens" : [
    {
      "token" : "gooding",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hello",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "me",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "word",
      "position" : 2
    }
  ]
}
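
The keywords_pattern option, listed above but not demonstrated, protects every token matching a Java regular expression instead of a fixed list (it cannot be combined with keywords or keywords_path). A sketch - one would expect helloing to pass through unstemmed while gooding is stemmed:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [{
    "type"             : "keyword_marker",
    "keywords_pattern" : "hello.*"
  }, "stemmer"],
  "text": ["helloing gooding"]
}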
