elasticsearch

2024-01-01 09:24:04

Indices

Create

Automatic creation

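curl -X PUT "localhost:9200/user?pretty"

This creates an index named user.
The pretty parameter asks Elasticsearch to return pretty-printed JSON.

When a document is written to an index that does not exist, the index is created automatically.
With this mechanism there is no need to define mappings by hand: Elasticsearch infers each field's type from the document content.
The inference is sometimes wrong, for example for geographic location data.
When a type is inferred incorrectly, some features may not work properly, such as range queries.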
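To make the inference problem concrete, here is a rough Python sketch of the documented dynamic-mapping rules (JSON integers become long, floating-point numbers float, objects object, and strings either date or text). The function name infer_type and the sample document are made up for illustration, and datetime.fromisoformat is only a stand-in for Elasticsearch's real date detection, so treat this as an approximation rather than the actual algorithm.

```python
from datetime import datetime


def infer_type(value):
    """Approximate the field type that dynamic mapping would pick for a JSON value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first non-null element
        for item in value:
            if item is not None:
                return infer_type(item)
        return None
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # simplified stand-in for date detection
            return "date"
        except ValueError:
            return "text"  # the real server also adds a keyword sub-field
    return None


# Hypothetical document: the location is meant to be a geo_point
doc = {
    "name": "Jane Doe",
    "age": 20,
    "location": {"lat": 41.12, "lon": -71.34},
    "created": "2024-01-01",
}
inferred = {field: infer_type(value) for field, value in doc.items()}
print(inferred)
```

In this sketch the location field comes out as object rather than geo_point, which is the kind of mismatch that an explicit mapping avoids.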

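Custom creation

Use a mapping. A mapping is similar to a schema in a relational database and mainly covers the following:

Defining the names of the fields in the index
Defining the data type of each field, such as string, number, or boolean
Configuring the inverted index for a field, such as whether it is analyzed and which analyzer to use
Note: starting with 7.x, an index has a single mapping type, which defaults to _doc.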

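Main mapping types:

Type	Description
text	Analyzed by default and supports full-text search (the string type was removed in 5.x; use text or keyword instead).
keyword	Not analyzed. doc_values is enabled by default to speed up aggregations and sorting, at the cost of extra disk space and I/O; disable doc_values if the field is never aggregated or sorted on.
numeric (e.g. long, integer, double)	If the field is only used for filtering and never in range queries, keyword performs better. Numeric doc_values also compress better than string ones.
array	Elasticsearch has no explicit array type; write the values inside '[]' when indexing. All elements of '[]' must have the same type.
range	Indexes a range of values; numeric ranges, date ranges, and IP ranges are currently supported.
boolean	Accepts true and false, which may also be given as the strings "true" and "false".
date	Supports millisecond precision and parses dates according to the specified format; stored internally as a long.
geo_point	Stores a latitude/longitude pair.
ip	Stores IP addresses, which makes later CIDR and range queries on the field convenient.
nested	A special object type for arrays of objects in which each inner object can be queried independently.
object	A JSON object. In an array of objects the inner objects are flattened and cannot be queried independently; use nested for that.

Added around 7.x:
alias: does not physically exist; it is an alternate name mapped to an existing field, and searching it returns essentially the same results as searching the real field.
date_nanos: another date type with nanosecond precision, used much like date.
rank_features (listed as features in the original notes): stores numeric feature vectors. Values must be strictly positive (no zero or negative values), and the field can only be queried with the rank_feature query. It mainly prepares for machine-learning-related features.
dense_vector and sparse_vector (listed as vector in the original notes): store feature arrays as dense or sparse vectors, again mainly in preparation for machine-learning-related features.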

curl -X PUT "localhost:9200/user" -H "Content-Type: application/json" -d '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      },
      "isteacher": {
        "type": "boolean"
      },
      "createdate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}'

A mapping can also be added to an existing index like this:

curl -X PUT "localhost:9200/user/_mapping" -H "Content-Type: application/json" -d '{
  "properties": {
    "name": {
      "type": "text",
      "index": true
    },
    "sex": {
      "type": "keyword",
      "index": true
    },
    "tel": {
      "type": "keyword",
      "index": false
    }
  }
}'

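Field parameters

Parameter	Description
index	Whether the field is indexed; if false, the field cannot be searched directly.
type	The field type.
doc_values	true or false. Columnar storage designed for fast aggregation and sorting; it can be disabled for fields that are never used that way.
ignore_malformed	Whether to ignore malformed data. If true, a value with the wrong format or type is ignored and the other fields are indexed normally; if false, the whole document is rejected as soon as one value is invalid.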
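As a small illustration of how these parameters sit inside a mapping, the sketch below builds a request body as a Python dict and serializes it to JSON. The field names tel, age, and name are borrowed from the examples in this post; the particular combination of parameters is an assumption chosen only to show each option once.

```python
import json

# Hypothetical mapping: each field demonstrates one of the parameters above
mappings = {
    "properties": {
        # Stored but neither searchable nor usable for sorting/aggregations
        "tel": {"type": "keyword", "index": False, "doc_values": False},
        # A malformed value such as "abc" is skipped instead of rejecting the document
        "age": {"type": "integer", "ignore_malformed": True},
        # Analyzed full-text field with default settings
        "name": {"type": "text"},
    }
}

# JSON body that could be sent with PUT /<index>
payload = json.dumps({"mappings": mappings}, indent=2)
print(payload)
```

Python's False and True are serialized as JSON false and true, so the dict can be passed to the client or pasted into a curl body unchanged.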

For more (including the settings parameters), see the related articles listed at the end of this post.

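Creating with Python

Note: an analyzer that you specify (such as IK) has to be installed separately. Do not wrap the mapping in a _doc type any more, otherwise the request fails with an error.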

from elasticsearch import Elasticsearch
# from elasticsearch import AsyncElasticsearch

# A full URL is accepted by both the 7.x and 8.x clients
es = Elasticsearch("http://localhost:9200")
# es = AsyncElasticsearch()

body = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2
    },
    "mappings": {
        # No "_doc" wrapper: "properties" goes directly under "mappings"
        "properties": {
            "id": {
                "type": "integer",
            },
            "text": {
                "type": "text",
                # "analyzer": "ik_max_word",  # IK analyzer for Chinese text; needs the IK plugin
                "index": False
            },
            "userId": {
                "type": "long",
            },
            "reprinted": {
                "type": "keyword",
            },
        }
    }
}

# Create the index
es.indices.create(index="test", body=body)

Specifying an analyzer

curl -X PUT "localhost:9200/user" -H "Content-Type: application/json" -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "ik": {
                    "tokenizer": "ik_max_word"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "content": {
                "type": "text",
                "analyzer": "ik_max_word"
            }
        }
    }
}'

Update

curl -X PUT "localhost:9200/user/_mapping" -H "Content-Type: application/json" -d '{
  "properties": {
    "name": {
      "type": "text",
      "index": true
    },
    "sex": {
      "type": "keyword",
      "index": true
    },
    "tel": {
      "type": "keyword",
      "index": false
    }
  }
}'

Query

curl:

curl -X GET "localhost:9200/_cat/indices?v"

Alternatively:

curl -X GET "localhost:9200/index_name"
# index_name is the name of the index to look up; its definition is returned if it exists

python:


from elasticsearch import Elasticsearch

es = Elasticsearch("http://127.0.0.1:9200")

index_name = 'student'

# exists() returns a truthy value when the index is present
if es.indices.exists(index=index_name):
    print('index exists')
else:
    print('index does not exist')

Delete

curl:

curl -X DELETE  '127.0.0.1:9200/user'

python:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://127.0.0.1:9200")

es.indices.delete(index='student')

Documents

Add

When an id is specified, the operation is idempotent, so PUT is used.
curl:

curl -X PUT "localhost:9200/customer/_doc/1" -H 'Content-Type: application/json' -d '{ "name": "Jane Doe", "age": 20 }'

If no id is specified ("localhost:9200/customer/_doc/"), an id is generated automatically. That operation is not idempotent, so POST is used.
curl:

curl -X POST "localhost:9200/customer/_doc/" -H 'Content-Type: application/json' -d '{ "name": "Jane Doe", "age": 20 }'
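The choice between PUT and POST can be written down as a tiny helper. This is only a sketch of the rule described here; index_request is a made-up name and nothing is actually sent to a server.

```python
def index_request(index, doc_id=None):
    """Return the (HTTP method, path) pair for indexing one document."""
    if doc_id is None:
        # No id: the server generates one, so repeating the call creates duplicates
        return "POST", f"/{index}/_doc"
    # Explicit id: repeating the call overwrites the same document
    return "PUT", f"/{index}/_doc/{doc_id}"


with_id = index_request("customer", 1)
without_id = index_request("customer")
print(with_id, without_id)
```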

A document can also be created through "localhost:9200/customer/_create/<id>", which fails if a document with that id already exists:

curl -X PUT "localhost:9200/customer/_create/3" -H 'Content-Type: application/json' -d '{ "name": "Jane Doe", "age": 20 }'

python:

from elasticsearch import Elasticsearch

index_name = 'my_index'

es = Elasticsearch('http://127.0.0.1:9200')

# Index (create or overwrite) the document with id 1
es.index(index=index_name, id='1', body={
    'name': '法外狂徒-张三',
    'id': 1,
})

Delete

curl

curl -X DELETE "localhost:9200/customer/_doc/2?pretty"

python

from elasticsearch import Elasticsearch

index_name = 'my_index'

es = Elasticsearch('http://127.0.0.1:9200')
es.delete(index=index_name, id=1)

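Update

curl

There are two ways to modify a document: full replacement and partial update. They differ in the request method (the former uses PUT, the latter POST), and the URL and request body differ as well.

# Full replacement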
curl -X PUT "localhost:9200/user/_doc/1?pretty" -H 'Content-Type: application/json' -d '{"name":"张三"}'

Partial update

curl -X POST "localhost:9200/user/_update/1?pretty" -H 'Content-Type: application/json' -d'
{
  "script" : "ctx._source.age += 5"
}'
# ctx._source refers to the current source document

# Or
curl -X POST "localhost:9200/user/_update/1?pretty" -H 'Content-Type: application/json' -d '{"doc":{"name":"张三"}}'
# doc holds the fields to merge into the existing document
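The difference between the two styles is easiest to see on plain dictionaries. The sketch below imitates the effect on the stored _source without contacting a server; full_replace and partial_update are illustrative names, and the merge shown is shallow, whereas the real _update API also merges nested objects.

```python
def full_replace(stored, new_doc):
    """PUT /<index>/_doc/<id>: the stored source is replaced entirely."""
    return dict(new_doc)


def partial_update(stored, doc):
    """POST /<index>/_update/<id> with {"doc": ...}: only the given fields change."""
    merged = dict(stored)
    merged.update(doc)
    return merged


stored = {"name": "Jane Doe", "age": 20}
replaced = full_replace(stored, {"name": "张三"})
updated = partial_update(stored, {"name": "张三"})
print(replaced)  # the age field is gone
print(updated)   # the age field is kept
```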

python

from elasticsearch import Elasticsearch

index_name = 'my_index'

es = Elasticsearch('http://127.0.0.1:9200')

# Partial update: only the fields under "doc" are changed
es.update(index=index_name, id=1, body={"doc": {"name": "张三"}})

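Query

A query can be written as curl -X GET "localhost:9200/test/_search?q=name:lczmx", but this is not recommended; sending the query as a JSON body is better.

Primary-key lookup: fetch one document by its id

curl -X GET "localhost:9200/test/_doc/1?pretty"

For full-text search, use curl -X GET "localhost:9200/test/_search"; see the DSL section.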

DSL

DSL is the query syntax of Elasticsearch. A search is sent in the following form:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} }
}
'

Response

It looks like this:

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "lczmx"
        }
      }
    ]
  }
}

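took: how long Elasticsearch took to run the search, in milliseconds
timed_out: whether the search timed out
_shards: how many shards were searched, and how many of them succeeded or failed
hits: the search results
hits.total: the total number of documents that match the search criteria
hits.hits: the array of actual results (the first 10 documents by default)
hits.sort: the sort key of each result (absent when sorting by score)
hits._score and max_score: these can be ignored for now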
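To show how these fields are usually read in client code, here is a short sketch that walks the sample response as a plain Python dict. The helper name summarize is made up; with the official client the response can be indexed in the same way.

```python
# The sample response from above, as a Python dict
response = {
    "took": 12,
    "timed_out": False,
    "_shards": {"total": 3, "successful": 3, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {"_index": "test", "_type": "_doc", "_id": "1",
             "_score": 1.0, "_source": {"name": "lczmx"}},
        ],
    },
}


def summarize(resp):
    """Pull the commonly used fields out of a search response."""
    hits = resp["hits"]
    return {
        "took_ms": resp["took"],
        "timed_out": resp["timed_out"],
        "total": hits["total"]["value"],
        "ids": [hit["_id"] for hit in hits["hits"]],
        "sources": [hit["_source"] for hit in hits["hits"]],
    }


summary = summarize(response)
print(summary)
```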

query

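The DSL uses query to describe how to search. The main kinds of query are listed below.

Exact queries: term

term
Use term to match a single exact value in a field.
The following finds the records whose biz_id is 1909190023901225: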

curl -XGET "localhost:9200/xyerp/_search" -H 'Content-Type: application/json' -d '
{
 "query": {
     "term": {
       "biz_id": "1909190023901225"
      }
 }
}'

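The query can be optimized further. Because it is an exact query, no relevance score needs to be computed; documents only have to be included or excluded. Wrapping the term query in constant_score runs it in non-scoring mode and gives every match the same score of 1. The recommended form is: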

{  
    "query" : {  
        "constant_score" : {  
             "filter" : {  
                "term" : {  
                    "biz_id" : "1909190023901225"  
                }  
            }  
        }  
    }  
}

terms
Use terms, followed by an array, to match any of several values in a field.

{
    "query":{
        "terms":{
            "biz_id":["1909190023901225"]
        }
    }
}

With constant_score in non-scoring mode, the recommended form is:

{  
    "query" : {  
        "constant_score" : {  
             "filter" : {  
                "terms" : {  
                    "biz_id" : ["1909190023901225","e1909190111365113"]  
                }  
            }  
        }  
    }  
}
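The two constant_score bodies differ only in whether they use term or terms, so they can be generated by one helper. This is a sketch: exact_filter is a hypothetical name, and the function only builds the request body.

```python
def exact_filter(field, values):
    """Build a non-scoring exact-match query for one value or a list of values."""
    if isinstance(values, (list, tuple)):
        clause = {"terms": {field: list(values)}}  # any of several values
    else:
        clause = {"term": {field: values}}  # one exact value
    return {"query": {"constant_score": {"filter": clause}}}


single = exact_filter("biz_id", "1909190023901225")
multiple = exact_filter("biz_id", ["1909190023901225", "e1909190111365113"])
print(single)
print(multiple)
```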

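term on multiple fields

The top-level query has to be a single object, so several term clauses are combined with bool:

{
	"query": {
		"bool": {
			"must": [
				{ "term": { "biz_id": "1909190023901225" } },
				{ "term": { "name": "lczmx" } }
			]
		}
	}
}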

Match queries: match

match_all

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}
'

Equivalent to:

SELECT account_number, balance FROM bank

match

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "account_number": 20 } }
}
'

Equivalent to:

SELECT * FROM bank WHERE account_number = 20

For a string value:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "address": "mill" } }
}
'

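Roughly equivalent to the following (match compares analyzed terms rather than substrings):

SELECT * FROM bank WHERE address LIKE '%mill%'

Terms separated by spaces, as in { "match": { "address": "mill link" } }, are combined with OR.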

multi_match

{
    "query":{
        "multi_match":{
            "query":"2501",
            "fields":["merchant_id","_id"]
        }
    }
}

match_phrase
Matches the whole phrase: all terms must appear, adjacent and in the same order.

curl -X GET "localhost:9200/test/_search?pretty" -H "Content-Type: application/json" -d '{"query":{
"match_phrase":{"name": "lcz"}}}'

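bool queries

A bool query has four kinds of clause: must, should, must_not, and filter. Each takes an array of conditions.

must: has to match, comparable to AND. Contributes to the score.
must_not: has to not match, comparable to NOT. Commonly used for filtering; does not contribute to the score.
should: optional match, comparable to OR. When the bool query has no must or filter clause, at least one should condition has to match. Contributes to the score.
filter: filtering clause that has to match but does not contribute to the score.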

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
'
# address has to match mill or lane

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "name": "lczmx" } }
      ]
    }
  }
}
'

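# Both the address and the name condition have to match

bool is often used together with filter and range; see the examples in the next subsection.

filter queries

A filter does not compute relevance scores and its results can be cached, so it runs very fast.

The following finds the documents whose status is active: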


{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "status": "active"
        }
      }
    }
  }
}

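filter is also often combined with a range query. The options available for range are:

gt: greater than
lt: less than
gte: greater than or equal to
lte: less than or equal to

The following queries the trades of merchant_id 2501 that finished in September 2019: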

{
  "query": {
    "bool": {
      "must": {
        "term": {
          "merchant_id": "2501"
        }
      },
      "filter": {
        "range": {
          "trade_finished_time": {
            "gte": "2019-09-01T00:00:00",
            "lte": "2019-09-30T23:59:59"
          }
        }
      }
    }
  }
}

In the following query, must does the matching, filter does the filtering, and range defines the date range:

{    
    "query": {    
        "bool": {    
            "must": [    
                {   
                    "match": {   
                        "title": "Search"   
                        }  
                },  
                {   
                    "match": {   
                    "content": "Elasticsearch"   
                    }  
                }  
            ],    
            "filter": [  
                {   
                    "term": {   
                        "status": "1"   
                        }  
                },  
                {   
                    "range": {   
                        "publish_date": {   
                        "gte": "2015-01-01"   
                        }  
                    }  
                }  
            ]  
        }  
     }  
}
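When such bodies are assembled in application code, a small builder keeps the nesting manageable. The sketch below is one possible way to do it; bool_query and its parameter names are hypothetical, and the function only produces the request body for the search shown above.

```python
def bool_query(must=None, should=None, must_not=None, filter_=None):
    """Build a search body with a bool query, omitting unused clause types."""
    clauses = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter_,  # trailing underscore avoids shadowing the builtin
    }
    return {"query": {"bool": {name: value for name, value in clauses.items() if value}}}


body = bool_query(
    must=[
        {"match": {"title": "Search"}},
        {"match": {"content": "Elasticsearch"}},
    ],
    filter_=[
        {"term": {"status": "1"}},
        {"range": {"publish_date": {"gte": "2015-01-01"}}},
    ],
)
print(body)
```

The must clauses decide the score, while the filter clauses only narrow the result set.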

sort

Specifies how the results are sorted.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "sort": {
    "account_number": { "order": "asc" }
  }
}
'
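# Sorts by account_number in ascending order; use desc for descending

size, from

The from parameter (0-based) specifies the offset of the first document to return, and size specifies how many documents to return starting from that offset. This is useful for pagination.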

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}
'

For pagination: from = (page number - 1) * documents per page
For example, with 10 documents per page, from for page 2 is (2 - 1) * 10 = 10
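The paging arithmetic can be wrapped in a helper so that callers think in page numbers. A minimal sketch, assuming 1-based page numbers; page_params is an illustrative name.

```python
def page_params(page, size=10):
    """Translate a 1-based page number into the from/size search parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be at least 1")
    return {"from": (page - 1) * size, "size": size}


first_page = page_params(1)
second_page = page_params(2)

# Merge the paging parameters into a search body
body = {"query": {"match_all": {}}, **second_page}
print(body)
```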

`_source`

Specifies which fields to return.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}
'

Equivalent to:

SELECT account_number, balance FROM bank

aggs

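Runs aggregations.
Each aggregation needs a name of your choosing and the aggregation function to apply. If only the statistics are needed, set size to 0 so that no hits are returned.
Grouping: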

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{
    "aggs":{
        "name_group": {
            "terms": {
                "field": "name"
                }
            }
        
    }
    
}'

# name_group is just a name chosen by the user

Average:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'{
    "aggs":{
        "name_avg": {
            "avg": {
                "field": "price"
                }
            }
    }
}'
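Both aggregation bodies have the same shape, so they can be produced by one function that also applies the size of 0 mentioned earlier. This is a sketch; agg_body is a made-up helper name and the function only builds the request body.

```python
def agg_body(name, kind, field):
    """Build a search body that runs a single aggregation and returns no hits."""
    return {
        "size": 0,  # statistics only, skip the matching documents
        "aggs": {name: {kind: {"field": field}}},
    }


group_body = agg_body("name_group", "terms", "name")
avg_body = agg_body("name_avg", "avg", "price")
print(group_body)
print(avg_body)
```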

highlight

Specifies which fields are highlighted.

curl -X GET "localhost:9200/test/_search?pretty" -H "Content-Type: application/json" -d '{
    "query": {"match": {
        "name": "lczmx"
    }},
    "highlight":{
        "fields": {
            "name": {}
        }
    }
}'

# fields lists the fields to highlight; the value of each field is {}
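In the response, each hit that matched carries a highlight object that maps the field name to a list of fragments, with matches wrapped in <em> tags by default. The sketch below reads those fragments from a hand-written sample response; the sample data and the helper name highlights are assumptions for illustration.

```python
# Hypothetical, trimmed response for the highlight query above
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "1",
                "_source": {"name": "lczmx"},
                "highlight": {"name": ["<em>lczmx</em>"]},
            },
        ]
    }
}


def highlights(resp, field):
    """Map each document id to its highlighted fragments for one field."""
    result = {}
    for hit in resp["hits"]["hits"]:
        # A hit without a highlight for this field gets an empty list
        result[hit["_id"]] = hit.get("highlight", {}).get(field, [])
    return result


fragments = highlights(sample_response, "name")
print(fragments)
```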

python

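For working with Elasticsearch from Python, the elasticsearch-dsl module is mainly used:
[elasticsearch-dsl configuration, usage, and queries] https://www.cnblogs.com/xiao-xue-di/p/14108238.html
Query operation examples
 Includes using elasticsearch-dsl in class form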

Configuration

elasticsearch.yml

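Setting	Description
`cluster.name`	Name of the Elasticsearch cluster; the default is elasticsearch. Changing it to a meaningful name is recommended.
`node.name`	Node name. ES assigns a default name; setting a meaningful one is recommended for easier management.
`path.conf`	Path of the configuration files. For tar or zip installs the default is the config folder under the ES root directory; for rpm installs it is /etc/elasticsearch.
`path.data`	Path where index data is stored. The default is the data folder under the ES root directory; several paths can be given, separated by commas.
`path.logs`	Path where log files are stored. The default is the logs folder under the ES root directory.
`path.plugins`	Path where plugins are stored. The default is the plugins folder under the ES root directory.
`bootstrap.memory_lock`	Set to true to lock the memory used by ES and keep it from being swapped out.
`network.host`	Sets both bind_host and publish_host; 0.0.0.0 allows access from external networks.
`http.port`	HTTP port for external requests; the default is 9200.
`transport.tcp.port`	Port for communication between cluster nodes.
`discovery.zen.ping.timeout`	Timeout for automatic node discovery; the default is 3 seconds. It can be increased when network latency is high.
`discovery.zen.minimum_master_nodes`	Minimum number of master-eligible nodes, computed as (master_eligible_nodes / 2) + 1. For example, with 3 master-eligible nodes it should be set to 2. This setting applies to versions before 7.x.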
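The quorum formula in the last row uses integer division, which is easy to get wrong by hand for even node counts. A minimal sketch of the calculation; the function name simply mirrors the setting name.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum size: (master_eligible_nodes / 2) + 1 with integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("at least one master-eligible node is required")
    return master_eligible_nodes // 2 + 1


for nodes in (1, 2, 3, 5):
    print(nodes, minimum_master_nodes(nodes))
```

With 2 master-eligible nodes the quorum is also 2, so losing either node blocks master election; this is why odd node counts are the usual choice.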

Related articles: introduction to elasticsearch and elasticsearch_dsl

https://blog.csdn.net/lzxlfly/article/details/102771175
https://blog.csdn.net/qq_36697880/category_9318969.html
https://www.letianbiji.com/elasticsearch/es7-add-update-doc.html
https://www.kaifaxueyuan.com/server/elasticsearch7/elasticsearch-index.html
https://blog.csdn.net/onlineness/article/details/102788802
