Getting Started with Elasticsearch: Installing the IK Analyzer from Scratch

2023-01-25 18:25:05

Background

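I needed to run statistical analysis with aggregations in ES, but the values of the field being aggregated are Chinese, and ES's default analyzer handles Chinese poorly: it splits a complete Chinese word into a series of individual characters and aggregates on those, which is clearly not what I wanted. Here is an example: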

POST http://192.168.80.133:9200/my_index_name/my_type_name/_search

{

	"size": 0,

	"query" : {

		"range" : {

			"time": {

			    "gte": 1513778040000,

			    "lte": 1513848720000

			}

		}

    },

    "aggs": {

		"keywords": {

		    "terms": {"field": "keywords"},

			"aggs": {

			    "emotions": {

				    "terms": {"field": "emotion"}

                }

			}

		}

    }

}

Output:

{

	"took": 22,

	"timed_out": false,

	"_shards": {

		"total": 5,

		"successful": 5,

		"failed": 0

	},

	"hits": {

		"total": 32,

		"max_score": 0.0,

		"hits": []

	},

	"aggregations": {

		"keywords": {

			"doc_count_error_upper_bound": 0,

			"sum_other_doc_count": 0,

			"buckets": [

				{

					"key": "力",  # the complete word has been split into individual characters

					"doc_count": 2,

					"emotions": {

						"doc_count_error_upper_bound": 0,

						"sum_other_doc_count": 0,

						"buckets": [

							{

								"key": -1,

								"doc_count": 1

							},

							{

								"key": 0,

								"doc_count": 1

							}

						]

					}

				},

				{

					"key": "动",

					"doc_count": 2,

					"emotions": {

						"doc_count_error_upper_bound": 0,

						"sum_other_doc_count": 0,

						"buckets": [

							{

								"key": -1,

								"doc_count": 1

							},

							{

								"key": 0,

								"doc_count": 1

							}

						]

					}

				}

			]

		}

	}

}
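
The effect above can be reproduced without a cluster. The following is a minimal sketch, not the analyzer's real implementation: `char_tokens` only approximates the one-token-per-character behavior described here, and `terms_doc_counts` and the two sample documents are hypothetical names and data introduced for illustration.

```python
from collections import Counter


def char_tokens(text):
    """Approximate the behavior described above: one token per Chinese character."""
    return [ch for ch in text if not ch.isspace()]


def terms_doc_counts(docs, tokenize):
    """Count, for each token, how many documents contain it.

    This mirrors the doc_count of a terms aggregation on an analyzed field:
    buckets are built from tokens, not from the original field values.
    """
    counts = Counter()
    for keywords in docs:
        seen = set()
        for keyword in keywords:
            seen.update(tokenize(keyword))
        counts.update(seen)  # each document contributes at most 1 per token
    return counts


# Hypothetical sample: two documents that both carry the keyword "动力".
docs = [["动力"], ["动力"]]

per_character = terms_doc_counts(docs, char_tokens)
whole_word = terms_doc_counts(docs, lambda keyword: [keyword])
print(dict(per_character))
print(dict(whole_word))
```

With per-character tokens the buckets are keyed by single characters, while keeping each keyword as one token yields a single bucket for the whole word, which is the result the rest of this post works toward.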

Since ES's default analyzer handles Chinese so poorly, is there an analyzer that supports Chinese? And if so, how is it used?

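For the first question, a quick Google search gave me the answer: an analyzer with Chinese support already exists, and it is open source. It is called IK Analysis for Elasticsearch; see https://github.com/medcl/elasticsearch-analysis-ik.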

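In the spirit of not reinventing the wheel, I decided to try it right away and see how well it works. So how is the IK analyzer used? It is an ES plugin: install it and configure ES accordingly.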

Installing the IK analyzer

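My ES version is 2.4.1, so the IK version to download is 1.10.1. (Note: you must download the IK release that matches your ES version, otherwise it will not work.)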
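
Because the pairing matters, it can help to derive the download URL from the ES version instead of typing it by hand. This is a small sketch under stated assumptions: the lookup table contains only the one pair given in this post, and `ik_version_for`, `release_url`, and `es_version_from_root` are hypothetical helper names. The node's version is read from the `version.number` field of the JSON that `GET /` returns.

```python
# Only the pairing stated in this post; add further pairs once you have verified them.
IK_FOR_ES = {"2.4.1": "1.10.1"}

RELEASE_BASE = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def es_version_from_root(info):
    """Extract the version string from the JSON document returned by GET / on a node."""
    return info["version"]["number"]


def ik_version_for(es_version):
    """Return the IK release recorded for this ES version, or fail loudly."""
    try:
        return IK_FOR_ES[es_version]
    except KeyError:
        raise ValueError(f"no verified IK release recorded for ES {es_version}") from None


def release_url(ik_version):
    """Build the release archive URL, following the pattern of the URL used in this post."""
    return f"{RELEASE_BASE}/v{ik_version}/elasticsearch-analysis-ik-{ik_version}.zip"


root_info = {"version": {"number": "2.4.1"}}  # trimmed example of a GET / response
print(release_url(ik_version_for(es_version_from_root(root_info))))
```

Failing on an unknown version is deliberate: guessing a "close enough" IK release is exactly the mistake the note above warns against.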

1. Download and build IK

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.10.1/elasticsearch-analysis-ik-1.10.1.zip

unzip elasticsearch-analysis-ik-1.10.1.zip

cd elasticsearch-analysis-ik-1.10.1

mvn clean package

This produces the packaged file elasticsearch-analysis-ik-1.10.1.zip in the elasticsearch-analysis-ik-1.10.1/target/releases directory.

2. Install the IK plugin in ES

Copy the packaged IK plugin, elasticsearch-analysis-ik-1.10.1.zip, into the ES/plugins directory and unzip it there.

unzip elasticsearch-analysis-ik-1.10.1.zip

rm -rf elasticsearch-analysis-ik-1.10.1.zip # be sure to delete the zip after unzipping, otherwise ES reports an error on startup

Restart ES.
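
After the restart, one way to check that the analyzer is actually available is to send a sample string to the `_analyze` endpoint and look at the tokens that come back. The sketch below assumes the node accepts a JSON body with `analyzer` and `text` on `_analyze` (check the documentation for your ES version), and the helper names are hypothetical. It only builds the request and parses a response; nothing is sent until `analyze` is called.

```python
import json
import urllib.request

ES_URL = "http://192.168.80.133:9200"  # the host used throughout this post


def build_analyze_request(text, analyzer="ik_smart", base_url=ES_URL):
    """Build (but do not send) an _analyze request for the given text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_from_response(response):
    """Pull the token strings out of an _analyze response document."""
    return [entry["token"] for entry in response.get("tokens", [])]


def analyze(text, analyzer="ik_smart"):
    """Send the request to the node and return the list of tokens."""
    request = build_analyze_request(text, analyzer)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return tokens_from_response(json.load(resp))
```

Calling `analyze("动力")` against the node should return whole words instead of single characters if the plugin loaded; an error naming the analyzer usually means the plugin was not picked up.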

Using the IK analyzer

Once the IK analyzer is installed, it can be used in ES.

Step 1: Create the index

PUT http://192.168.80.133:9200/my_index_name

Step 2: Add a mapping for the document field that will be used

The documents I store in ES have the following format:

{

    "nagtive_kw": [],

    "is_all": false,

    "emotion": 0,

    "focuce": false,

    "keywords": ["动力","外观","油耗"],  // the aggregation runs on the keywords field

    "source": "汽车之家",

    "time": -1,

    "machine_emotion": 0,

    "title": "no title",

    "spider": "qczj_index",

    "content": {},

    "url": "http://xxx",

    "brand": "宝马",

    "series": "宝马1系",

    "model": "2017款"

}

The aggregation runs on the keywords field, so add a mapping for that field:

POST http://192.168.80.133:9200/my_index_name/my_type_name/_mapping

{

	"properties": {

		"keywords": { # make the keywords field use the IK analyzer

			"type": "string",

			"store": "no",

			"analyzer": "ik_smart",

			"search_analyzer": "ik_smart",

			"boost": 8

		}

	}

}

Note: I hit a small snag while setting the mapping. Following the IK project page, I set the type of "keywords" to "text", which produced an error:

POST http://192.168.80.133:9200/my_index_name/my_type_name/_mapping

{

	"properties": {

		"keywords": {

			"type": "text", # the text type is not supported in version 2.4.1

			"store": "no",

			"analyzer": "ik_smart",

			"search_analyzer": "ik_smart",

			"boost": 8

		}

	}

}

Error:

{

	"error": {

		"root_cause": [

			{

				"type": "mapper_parsing_exception",

				"reason": "No handler for type [text] declared on field [keywords]"

			}

		],

		"type": "mapper_parsing_exception",

		"reason": "No handler for type [text] declared on field [keywords]"

	},

	"status": 400

}

This is because my ES version, 2.4.1, is fairly old: the text type was only introduced in ES 5.0, so it is not supported here. In ES 2.4.1, use the string type instead.
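
If the same mapping has to work across clusters of different ages, the field type can be chosen from the ES version. This is a minimal sketch based only on the rule stated above (string before 5.0, text from 5.0 on); `analyzed_field_type` and `keywords_mapping` are hypothetical helper names, and the mapping is reduced to the analyzer-related settings.

```python
def analyzed_field_type(es_version):
    """Pick the analyzed field type: "text" exists from ES 5.0, older versions use "string"."""
    major = int(es_version.split(".")[0])
    return "text" if major >= 5 else "string"


def keywords_mapping(es_version, analyzer="ik_smart"):
    """Build the mapping body for the keywords field for a given ES version."""
    return {
        "properties": {
            "keywords": {
                "type": analyzed_field_type(es_version),
                "analyzer": analyzer,
                "search_analyzer": analyzer,
            }
        }
    }


print(keywords_mapping("2.4.1"))
```

The returned dictionary can be serialized with `json.dumps` and used as the body of the `_mapping` request.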

Step 3: Add a document

POST http://192.168.80.133:9200/my_index_name/my_type_name/

{

    "nagtive_kw": ["动力","外观","油耗"],

    "is_all": false,

    "emotion": 0,

    "focuce": false,

    "keywords": ["动力","外观","油耗"],  // the aggregation runs on the keywords field

    "source": "汽车之家",

    "time": -1,

    "machine_emotion": 0,

    "title": "从动次打次吃大餐",

    "spider": "qczj_index",

    "content": {},

    "url": "http://xxx",

    "brand": "宝马",

    "series": "宝马1系",

    "model": "2017款"

}
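
The aggregation query in this post filters on the time field with bounds such as 1513778040000, which look like epoch milliseconds. Assuming that is how time is stored (the sample document only shows the placeholder value -1), the bounds can be built from readable datetimes. The helper names below are hypothetical, and expressing the bounds in UTC is a choice made for this sketch.

```python
from datetime import datetime, timezone


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(moment.timestamp() * 1000)


def time_range_query(start, end, field="time"):
    """Build a range query clause with inclusive epoch-millisecond bounds."""
    return {
        "range": {
            field: {
                "gte": to_epoch_millis(start),
                "lte": to_epoch_millis(end),
            }
        }
    }


# The two bounds used by the queries in this post, expressed in UTC.
start = datetime(2017, 12, 20, 13, 54, tzinfo=timezone.utc)
end = datetime(2017, 12, 21, 9, 32, tzinfo=timezone.utc)
query = time_range_query(start, end)
print(query)
```

A document only falls into the aggregation if its time value lies inside this window, so a placeholder such as -1 would be filtered out.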

Step 4: Run the aggregation

POST http://192.168.80.133:9200/my_index_name/my_type_name/_search

{

	"size": 0,

	"query" : {

		"range" : {

			"time": {

			    "gte": 1513778040000,

			    "lte": 1513848720000

			}

		}

    },

    "aggs": {

		"keywords": {

		    "terms": {"field": "keywords"},

			"aggs": {

			    "emotions": {

				    "terms": {"field": "emotion"}

                }

			}

		}

    }

}

Output:

{

	"took": 22,

	"timed_out": false,

	"_shards": {

		"total": 5,

		"successful": 5,

		"failed": 0

	},

	"hits": {

		"total": 32,

		"max_score": 0.0,

		"hits": []

	},

	"aggregations": {

		"keywords": {

			"doc_count_error_upper_bound": 0,

			"sum_other_doc_count": 0,

			"buckets": [

				{

					"key": "动力",     # the complete word is no longer split into individual characters

					"doc_count": 2,

					"emotions": {

						"doc_count_error_upper_bound": 0,

						"sum_other_doc_count": 0,

						"buckets": [

							{

								"key": -1,

								"doc_count": 1

							},

							{

								"key": 0,

								"doc_count": 1

							}

						]

					}

				}

			]

		}

	}

}
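
For further processing, the nested buckets are easier to handle as flat rows. The following sketch walks a response shaped like the one above and yields one (keyword, emotion, doc_count) row per inner bucket; `flatten_keyword_emotions` is a hypothetical helper, and `sample` is a trimmed copy of the output shown here.

```python
def flatten_keyword_emotions(response):
    """Turn the nested keywords -> emotions buckets into (keyword, emotion, count) rows."""
    rows = []
    for keyword_bucket in response["aggregations"]["keywords"]["buckets"]:
        for emotion_bucket in keyword_bucket["emotions"]["buckets"]:
            rows.append(
                (keyword_bucket["key"], emotion_bucket["key"], emotion_bucket["doc_count"])
            )
    return rows


# Trimmed copy of the aggregation output shown above.
sample = {
    "aggregations": {
        "keywords": {
            "buckets": [
                {
                    "key": "动力",
                    "doc_count": 2,
                    "emotions": {
                        "buckets": [
                            {"key": -1, "doc_count": 1},
                            {"key": 0, "doc_count": 1},
                        ]
                    },
                }
            ]
        }
    }
}

rows = flatten_keyword_emotions(sample)
for keyword, emotion, count in rows:
    print(keyword, emotion, count)
```

The aggregation names "keywords" and "emotions" are the ones chosen in the request, so the lookup keys must change if the request uses different names.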

References

http://www.cnblogs.com/xing901022/p/5910139.html How to install Chinese analyzers (IK + pinyin) in Elasticsearch

https://elasticsearch.cn/question/47 A question about aggregations (aggs)

https://github.com/medcl/elasticsearch-analysis-ik/issues/276 "No handler for type [text] declared on field [content]" when creating a mapping (#276)

http://blog.csdn.net/guo_jia_liang/article/details/52980716 Learning Elasticsearch 2.4 (3): installing plugins in Elasticsearch 2.4 in detail
