Conclusion
The score is computed within a single shard, not across the whole index.
The tf/idf family of similarities (BM25 here, as the explain output below shows) depends on docFreq and docCount, and both statistics are taken from the local shard.
Once documents land on different shards, those statistics will almost never match, so even identical documents end up with different scores.
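To see why numerically, plug per-shard statistics into the idf formula that appears verbatim in the explain output of step 5 (log is the natural logarithm):

idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))

On a shard where 1 of 1 documents contains the term: log(1 + 0.5 / 1.5) ≈ 0.2876821
On a shard where 2 of 2 documents contain the term: log(1 + 0.5 / 2.5) ≈ 0.1823216

Since tfNorm works out to 1.0 here and the query "语文" is analyzed into two terms (语 and 文), the final scores are 2 × 0.2876821 = 0.5753642 and 2 × 0.1823216 ≈ 0.3646431 — exactly the two values in the search output of step 4.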
Test steps
1. Create the index with a custom mapping
PUT luzhe
{
"settings": {
"index": {
"number_of_shards": 2
},
"analysis": {
"filter": {
"my_ngram": {
"type": "ngram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"ngram_analyzer": {
"filter": [
"lowercase",
"my_ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"analyzer": "ngram_analyzer",
"term_vector": "with_positions_offsets",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 50,
"type": "keyword"
}
}
}
}
}
}
}
Note:
- index.number_of_shards = 2 is set so that, with only a few documents, the shards quickly end up holding unequal document counts.
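To check what ngram_analyzer actually produces (and to see where fieldLength = 4.0 in the explain output of step 5 comes from), the _analyze API can be run against the index:

GET luzhe/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "大学语文"
}

The standard tokenizer splits the CJK string into four single-character tokens (大, 学, 语, 文) before the ngram filter runs, so each indexed title contributes a field length of 4.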
2. Insert documents (three are enough)
PUT luzhe/_doc/1
{
"title": "大学语文"
}
PUT luzhe/_doc/2
{
"title": "大学语文"
}
PUT luzhe/_doc/3
{
"title": "大学语文"
}
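Elasticsearch routes each document to a shard as shard = hash(_routing) % number_of_shards, where _routing defaults to the document _id, so a handful of documents will usually split unevenly across two shards. If all three ids happen to hash to the same shard, either insert a few more documents or pin one with the routing parameter; a hypothetical fourth document:

PUT luzhe/_doc/4?routing=a
{
"title": "大学语文"
}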
3. Check the document count on each shard
GET luzhe/_doc/_search?preference=_shards:0
GET luzhe/_doc/_search?preference=_shards:1
Note: in my test environment, shard 0 happens to hold one document and shard 1 holds two.
Any split works, as long as both shards are non-empty and hold different numbers of documents.
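Alternatively, the _cat API reports the document count of each shard directly (the docs column of the output):

GET _cat/shards/luzhe?v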
4. Run the test
GET luzhe/_search
{
"query": {
"match": {
"title": "语文"
}
}
}
Output:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "luzhe",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.5753642,
"_source" : {
"title" : "大学语文"
}
},
{
"_index" : "luzhe",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.36464313,
"_source" : {
"title" : "大学语文"
}
},
{
"_index" : "luzhe",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.36464313,
"_source" : {
"title" : "大学语文"
}
}
]
}
}
5. Explain the score
GET luzhe/_doc/3/_explain
{
"query": {
"match": {
"title": "语文"
}
}
}
Output:
{
"_index" : "luzhe",
"_type" : "_doc",
"_id" : "3",
"matched" : true,
"explanation" : {
"value" : 0.5753642,
"description" : "sum of:",
"details" : [
{
"value" : 0.2876821,
"description" : "weight(title:语 in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.2876821,
"description" : "weight(title:文 in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
Comparing _explain output across ids shows different docFreq and docCount per shard: doc 3 sits alone on shard 0 (docFreq = 1, docCount = 1, idf ≈ 0.2876821), while docs 1 and 2 share shard 1 (docFreq = 2, docCount = 2, idf ≈ 0.1823216). The per-shard statistics, not the documents themselves, are what make the scores differ.
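If consistent scores are required, the standard remedy is search_type=dfs_query_then_fetch, which makes Elasticsearch collect global term statistics from all shards before scoring, at the cost of an extra round trip:

GET luzhe/_search?search_type=dfs_query_then_fetch
{
"query": {
"match": {
"title": "语文"
}
}
}

With global statistics, the three identical documents should all receive the same score.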