How Elasticsearch document IDs are generated

2024-01-12 10:46:10

Manually specified

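Whether manually specifying the document id is appropriate depends on the application:

This approach is typically used when importing data into ES from another system: the unique identifier that each record already has becomes the id of the document in ES. For example, suppose we are building the search feature of an e-commerce site, or employee lookup for an OA system. The data first exists in the internal database of the website or IT system, so it already has a database primary key (auto-increment, UUID, or a business number). When that data is imported into Elasticsearch, reusing the existing database primary key as the document id is a good fit.

If instead we are building a system whose only primary data store is Elasticsearch, the data may have no id when it is produced and goes straight into ES. Manually specifying the document id is a poor fit here, because there is no natural id to use. In that case, let Elasticsearch generate the id automatically, as described below.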
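
As a rough illustration of the import case, the sketch below turns database rows into the newline-delimited body accepted by the Elasticsearch bulk API, reusing each row's primary key as `_id`. The row fields (`id`, `name`), the index name `products`, and the helper `build_bulk_body` are assumptions made for this example, not part of the original setup.

```python
import json

def build_bulk_body(index, rows, key="id"):
    """Build an NDJSON bulk body that reuses each row's primary key as _id."""
    lines = []
    for row in rows:
        # Action line: the database primary key becomes the document _id.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(row[key])}}))
        # Source line: the remaining columns become the document body.
        lines.append(json.dumps({k: v for k, v in row.items() if k != key}))
    # The bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical rows, as they might come back from a database query.
rows = [{"id": 101, "name": "keyboard"}, {"id": 102, "name": "mouse"}]
body = build_bulk_body("products", rows)
```

Because the id is derived from the primary key, importing the same row again targets the same document instead of creating a duplicate.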

# PUT /index/type/id

PUT /test_index/test_type/2
{
  "test_content": "my test"
}

{
  "_index" : "test_index",
  "_type" : "test_type",
  "_id" : "2",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Automatically generated


# POST /index/type

PUT test_index/test_type
{
  "test_content": "my test automated document id"
}

{
  "error" : "Incorrect HTTP method for uri [/test_index/test_type?pretty=true] and method [PUT], allowed: [POST]",
  "status" : 405
}

POST /test_index/test_type
{
  "test_content": "my test"
}

{
  "_index" : "test_index",
  "_type" : "test_type",
  "_id" : "A7Ma5XYB_s8SuYmy2Xg0",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}

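Two document-creation requests could execute at exactly the same moment (a low-probability event), just on different Elasticsearch nodes. If the _id generation algorithm were not good enough, could two nodes assign the same _id to two different documents?

In practice, no.

The GUID algorithm is designed so that, in a distributed environment, the _id values created by different nodes at the same moment do not conflict (and ids created on the same node do not conflict either).

Elasticsearch's automatic _id generation therefore ensures that two different documents do not end up with the same _id.

Note that auto-generating an ID uses POST, not PUT; when specifying the ID manually, either PUT or POST works.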
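
The method rule can be summarized in a few lines of Python. This is only a sketch: `index_request` is a hypothetical helper that builds the method and path seen in the console requests above, and it does not send anything to a cluster.

```python
def index_request(index, doc_type, doc_id=None):
    """Return the (method, path) pair for indexing one document."""
    if doc_id is None:
        # No id supplied: only POST is accepted, and Elasticsearch picks the _id.
        return "POST", f"/{index}/{doc_type}"
    # Id supplied: PUT is used here, although POST to the same path also works.
    return "PUT", f"/{index}/{doc_type}/{doc_id}"

auto = index_request("test_index", "test_type")
manual = index_request("test_index", "test_type", 2)
```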

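Also, I ran the hands-on steps in this section on a virtual machine provided by cloud.elastic.co. While preparing for certification, I think it is worth considering buying about two months of the service; buying on Alibaba Cloud is another option.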

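An auto-generated id is 20 characters long, URL-safe, Base64-encoded, and a GUID, designed so that ids generated in parallel across a distributed system do not collide.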
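
To make the "20 characters, URL-safe, Base64" description concrete, here is a minimal Python sketch that packs 15 bytes into such an id. It is not Elasticsearch's implementation; the layout (6 timestamp bytes, 6 node bytes, 3 sequence bytes) and the names `_node`, `_sequence`, and `make_id` are assumptions chosen to show how a timestamp plus a node-specific part plus a counter avoids collisions without coordination.

```python
import base64
import itertools
import os
import time

_node = os.urandom(6)            # stand-in for a per-node identifier (assumption)
_sequence = itertools.count()    # per-process counter

def make_id():
    """Return a 20-character URL-safe Base64 id built from 15 bytes."""
    millis = int(time.time() * 1000) & 0xFFFFFFFFFFFF   # 6 bytes of timestamp
    seq = next(_sequence) & 0xFFFFFF                    # 3 bytes of sequence
    raw = millis.to_bytes(6, "big") + _node + seq.to_bytes(3, "big")
    # 15 bytes encode to exactly 20 Base64 characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(raw).decode("ascii")

ids = [make_id() for _ in range(1000)]
```

Ids produced in the same millisecond still differ because the sequence bytes change on every call, and ids from different nodes differ in the node bytes.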

GUID

The following text is from *

A universally unique identifier (UUID) is a 128-bit number used to identify information in computer systems. The term globally unique identifier (GUID) is also used, typically in software created by Microsoft.

And Coding Horror says

Each globally unique ID is like a beautiful snowflake: every one a unique item waiting to be born.

GUID Pros

Unique across every table, every database, every server

Allows easy merging of records from different databases

Allows easy distribution of databases across multiple servers

You can generate IDs anywhere, instead of having to roundtrip to the database

Most replication scenarios require GUID columns anyway

GUID Cons

It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful

Cumbersome to debug where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}'

The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
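
The size figures quoted above can be checked with Python's standard `uuid` module. This is just a quick sanity check; the variable names are illustrative.

```python
import uuid

guid = uuid.uuid4()              # random (version 4) UUID from the standard library
guid_bytes = len(guid.bytes)     # raw size of the identifier in bytes
int_bytes = 4                    # a traditional 4-byte integer key
ratio = guid_bytes // int_bytes  # how many times larger the GUID is
text_length = len(str(guid))     # length of the hyphenated hex form
```

A UUID occupies 16 bytes (128 bits), which is where the "4 times larger" figure comes from, and its usual text form is 36 characters long.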

Personally, I am not very fond of GUIDs.

Author: escray | Reposted from: InfoQ
