Error: org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried
# Cause: the node lock is already held by a running Elasticsearch process
# Kill the running process, then start Elasticsearch again
kill -9 `ps -ef | grep [e]lasticsearch | grep [j]ava | awk '{print $2}'`
elasticsearch
Multi-field feature
Exact matching on vendor names
Add a keyword sub-field
Use different analyzers
Different languages
Searching on a pinyin field
Different analyzers can also be specified for indexing and for search (see the mapping sketch below)
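A minimal mapping sketch of these ideas, assuming a hypothetical index named products and that the analysis-pinyin plugin is installed; the field names, the english/standard analyzer pairing, and the pinyin sub-field are illustrative, not taken from the original notes:
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "title": {
        "type": "text",
        "analyzer": "english",
        "search_analyzer": "standard",
        "fields": {
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin"
          }
        }
      }
    }
  }
}
With this mapping, company stays full-text searchable while company.keyword supports exact matching, e.g.:
GET products/_search
{
  "query": {
    "term": { "company.keyword": "Apple Store" }
  }
}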
Exact Values vs. Full Text
Exact Values: numbers, dates, and concrete strings (e.g. "Apple Store")
keyword in Elasticsearch
Full Text: unstructured text data
text in Elasticsearch
Exact Values do not need to be analyzed
Elasticsearch builds an inverted index for every field
Exact Values get no special analysis at index time (compare the two _analyze calls below)
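A quick way to see the difference in the Kibana console: the keyword analyzer keeps the exact value as a single token, while standard splits and lowercases it.
POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}

POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}
The first call returns the single token "Apple Store"; the second returns "apple" and "store".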
Custom analyzers
When the analyzers that ship with Elasticsearch do not meet your needs, you can define a custom analyzer by combining the following components:
Character Filter
Tokenizer
Token Filter
Character Filter
Processes the text before it reaches the Tokenizer, e.g. to add, remove, or replace characters. Multiple Character Filters can be configured. They affect the position and offset information seen by the Tokenizer.
Some built-in Character Filters:
HTML strip - removes HTML tags
Mapping - string replacement
Pattern replace - regex-based replacement
Tokenizer
Splits the original text into terms (tokens) according to a set of rules
Tokenizers built into Elasticsearch:
whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy
You can also write a plugin in Java to implement your own Tokenizer
Token Filters
Adds, modifies, or removes the terms produced by the Tokenizer
Built-in Token Filters:
lowercase / stop / synonym (adds synonyms; see the sketch below)
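The synonym filter is not covered by the examples that follow, so here is a minimal sketch; the inline rule "ny, new york" is an illustrative assumption (in practice synonym rules usually live in the index settings or a synonyms file):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["ny, new york"]
    }
  ],
  "text": "Flying to NY"
}
Because lowercase runs before synonym, the token ny matches the rule and is expanded, so the output contains ny, new, and york alongside flying and to.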
Setting up a Custom Analyzer
Submit a request that strips HTML tags
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
Response
{
"tokens" : [
{
"token" : "hello world",
"start_offset" : 3,
"end_offset" : 18,
"type" : "word",
"position" : 0
}
]
}
Using a char filter to replace hyphens
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
Response
{
"tokens" : [
{
"token" : "123_456",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "I_test",
"start_offset" : 9,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "test_990",
"start_offset" : 17,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "650_555_1234",
"start_offset" : 26,
"end_offset" : 38,
"type" : "<NUM>",
"position" : 3
}
]
}
Using a char filter to replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": ["I am felling :)", "Feeling :( today"]
}
Response
{
"tokens" : [
{
"token" : "I",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "am",
"start_offset" : 2,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "felling",
"start_offset" : 5,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "happy",
"start_offset" : 13,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "Feeling",
"start_offset" : 16,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 104
},
{
"token" : "sad",
"start_offset" : 24,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 105
},
{
"token" : "today",
"start_offset" : 27,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 106
}
]
}
Regex replacement with pattern_replace
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
Response
{
"tokens" : [
{
"token" : "www.elastic.co",
"start_offset" : 0,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
Splitting a path by directory hierarchy
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/ymruan/a/b"
}
Response
{
"tokens" : [
{
"token" : "/usr",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "/usr/ymruan",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 0
},
{
"token" : "/usr/ymruan/a",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 0
},
{
"token" : "/usr/ymruan/a/b",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
}
]
}
whitespace tokenizer with the stop filter
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
Response
{
"tokens" : [
{
"token" : "The",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "rain",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "Spain",
"start_offset" : 12,
"end_offset" : 17,
"type" : "word",
"position" : 3
},
{
"token" : "falls",
"start_offset" : 18,
"end_offset" : 23,
"type" : "word",
"position" : 4
},
{
"token" : "mainly",
"start_offset" : 24,
"end_offset" : 30,
"type" : "word",
"position" : 5
},
{
"token" : "plain.",
"start_offset" : 38,
"end_offset" : 44,
"type" : "word",
"position" : 8
}
]
}
After adding lowercase, "The" is lowercased and then removed as a stopword
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
Response
{
"tokens" : [
{
"token" : "rain",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "spain",
"start_offset" : 12,
"end_offset" : 17,
"type" : "word",
"position" : 3
},
{
"token" : "falls",
"start_offset" : 18,
"end_offset" : 23,
"type" : "word",
"position" : 4
},
{
"token" : "mainly",
"start_offset" : 24,
"end_offset" : 30,
"type" : "word",
"position" : 5
},
{
"token" : "plain.",
"start_offset" : 38,
"end_offset" : 44,
"type" : "word",
"position" : 8
}
]
}
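The _analyze calls above only test components ad hoc. To use such a combination for real, register it as a custom analyzer in the index settings; a minimal sketch, with the index name blogs and the analyzer name my_custom_analyzer as illustrative assumptions:
PUT blogs
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

GET blogs/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<b>The rain in Spain</b>"
}
This returns only rain and spain: the HTML tags are stripped, the tokens are lowercased, and the stopwords "the" and "in" are dropped.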