TiDB Distributed Transactions


Percolator

Bigtable provides lookup and update operations on each row, and Bigtable row transactions enable atomic read-modify-write operations on individual rows.

Prewrite

The transaction’s constructor asks the timestamp oracle for a start timestamp.
In the first phase of commit (“prewrite”), we try to lock all the cells being written. (To handle client failure, we designate one lock arbitrarily as the primary.) The transaction reads metadata to check for conflicts in each cell being written. There are two kinds of conflicting metadata:

  1. If the transaction sees another write record after its start timestamp, it aborts; this is the write-write conflict that snapshot isolation guards against.
  2. If the transaction sees another lock at any timestamp, it also aborts. It’s possible that the other transaction is just being slow to release its lock after having already committed below our start timestamp, but we consider this unlikely, so we abort.

If there is no conflict, we write the lock and the data to each cell at the start timestamp.

Commit

If no cells conflict, the transaction may commit and proceed to the second phase. At the beginning of the second phase, the client obtains the commit timestamp from the timestamp oracle. Then, at each cell (starting with the primary), the client releases its lock and makes its write visible to readers by replacing the lock with a write record. The write record indicates to readers that committed data exists in this cell; it contains a pointer to the start timestamp where readers can find the actual data. Once the primary’s write is visible, the transaction must commit since it has made a write visible to readers.

Get

A Get() operation first checks for a lock in the timestamp range [0, start timestamp], which is the range of timestamps visible in the transaction’s snapshot. If a lock is present, another transaction is concurrently writing this cell, so the reading transaction must wait until the lock is released. If no conflicting lock is found, Get() reads the latest write record in that timestamp range and returns the data item corresponding to that write record.

Clean Up

Transaction processing is complicated by the possibility of client failure. If a client fails while a transaction is being committed, locks will be left behind. Percolator must clean up those locks or they will cause future transactions to hang indefinitely. Percolator takes a lazy approach to cleanup: when a transaction A encounters a conflicting lock left behind by transaction B, A may determine that B has failed and erase its locks.

It is very difficult for A to be perfectly confident in its judgment that B has failed; as a result we must avoid a race between A cleaning up B’s transaction and a not-actually-failed B committing the same transaction. Performing either a cleanup or commit operation requires modifying the primary lock; since this modification is performed under a Bigtable row transaction, only one of the cleanup or commit operations will succeed. Specifically: before B commits, it must check that it still holds the primary lock and replace it with a write record. Before A erases B’s lock, A must check the primary to ensure that B has not committed; if the primary lock is still present, then it can safely erase the lock.

When a client crashes during the second phase of commit, a transaction will be past the commit point (it has written at least one write record) but will still have locks outstanding. We must perform roll-forward on these transactions. A transaction that encounters a lock can distinguish between the two cases by inspecting the primary lock:

  1. If the primary lock has been replaced by a write record, the transaction which wrote the lock must have committed and the lock must be rolled forward. To roll forward, the transaction performing the cleanup replaces the stranded lock with a write record, as the original transaction would have done.
  2. Otherwise it should be rolled back (since we always commit the primary first, we can be sure that it is safe to roll back if the primary is not committed).

TiDB Transaction Isolation Levels

Encoding

KV

TiDB assigns a TableID to every table, an IndexID to every index, and a RowID to every row (for a non-clustered-index table, the RowID is the implicit _tidb_rowid allocated internally by TiDB; for a clustered-index table, the RowID is the user-supplied primary key column data). The TableID is unique across the whole cluster, while the IndexID and RowID are unique within a table; all of these IDs are of type int64.
Every row of data is encoded into a Key-Value pair according to the following rule:

Key: tablePrefix{tableID}_recordPrefixSep{rowID}
Value: [col1, col2, col3, col4]

Here, tablePrefix and recordPrefixSep in the Key are fixed string constants used to distinguish this data from other data within the KV space.

Index data is encoded into Key-Value pairs according to the following rule:

Key: tablePrefix{tableID}_indexPrefixSep{indexID}_indexedColumnsValue
Value: rowID

Index data must also distinguish between Unique Indexes and non-Unique Indexes. A Unique Index can follow the encoding rule above. But for a non-Unique Index, that encoding cannot produce a unique Key, because tablePrefix{tableID}_indexPrefixSep{indexID} is identical across the whole index and several rows may share the same ColumnsValue. So the encoding of non-Unique Indexes is adjusted slightly:

Key: tablePrefix{tableID}_indexPrefixSep{indexID}_indexedColumnsValue_rowID
Value: null
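
As a minimal sketch of these layouts in Go (assuming sign-bit-flipped big-endian integer encoding; the real implementation in TiDB's tablecodec package uses a richer memcomparable encoding, and the helper names here are illustrative):

package main

import (
	"encoding/binary"
	"fmt"
)

var (
	tablePrefix     = []byte{'t'}
	recordPrefixSep = []byte("_r")
	indexPrefixSep  = []byte("_i")
)

// encodeInt64 makes an int64 roughly memcomparable by flipping the sign
// bit and writing it big-endian, so byte order matches numeric order.
func encodeInt64(v int64) []byte {
	var b [8]byte
	binary.BigEndian.PutUint64(b[:], uint64(v)^(1<<63))
	return b[:]
}

// EncodeRecordKey builds tablePrefix{tableID}_recordPrefixSep{rowID}.
func EncodeRecordKey(tableID, rowID int64) []byte {
	key := append([]byte{}, tablePrefix...)
	key = append(key, encodeInt64(tableID)...)
	key = append(key, recordPrefixSep...)
	return append(key, encodeInt64(rowID)...)
}

// EncodeIndexKey builds the index key; for a non-unique index the rowID
// is appended as well to make the key unique.
func EncodeIndexKey(tableID, indexID int64, indexedValues []byte, rowID int64, unique bool) []byte {
	key := append([]byte{}, tablePrefix...)
	key = append(key, encodeInt64(tableID)...)
	key = append(key, indexPrefixSep...)
	key = append(key, encodeInt64(indexID)...)
	key = append(key, indexedValues...)
	if !unique {
		key = append(key, encodeInt64(rowID)...)
	}
	return key
}

func main() {
	fmt.Printf("%q\n", EncodeRecordKey(42, 1))
}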

MVCC

Before MVCC, you could think of TiKV like this:

Key1 -> Value
Key2 -> Value
……
KeyN -> Value

With MVCC, TiKV's keys are arranged like this:

Key1-Version3 -> Value
Key1-Version2 -> Value
Key1-Version1 -> Value
……
Key2-Version4 -> Value
Key2-Version3 -> Value
Key2-Version2 -> Value
Key2-Version1 -> Value
……
KeyN-Version2 -> Value
KeyN-Version1 -> Value
……

Note that for multiple versions of the same key, we place larger (newer) version numbers first and smaller ones later. This way, when a user fetches a Value by Key + Version, the Key and Version can be combined into the MVCC key, Key-Version, and a single Seek(Key-Version) locates the first position greater than or equal to this Key-Version.
Our approach to combining user keys and timestamps is:

  1. Encode the user key into a memcomparable form.
  2. Bitwise invert the timestamp (an unsigned int64) and encode it into big-endian bytes.
  3. Append the encoded timestamp to the encoded key.

For example, key “key1” and timestamp 3 will be encoded as “key1\x00\x00\x00\x00\xfb\xff\xff\xff\xff\xff\xff\xff\xfc”, where the first 9 bytes are the memcomparable-encoded key and the remaining 8 bytes are the inverted timestamp in big-endian. In this way, different versions of the same key are always adjacent in RocksDB, and for each key, newer versions always sort before older ones.
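
A small Go sketch of this ordering property (encodeMVCCKey is a hypothetical helper, not TiKV's API; it assumes the key part is already memcomparable-encoded):

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeMVCCKey appends the bitwise-inverted timestamp, big-endian, to an
// already-encoded key, so newer versions of a key sort first.
func encodeMVCCKey(encodedKey []byte, ts uint64) []byte {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], ^ts)
	return append(append([]byte{}, encodedKey...), buf[:]...)
}

func main() {
	k3 := encodeMVCCKey([]byte("key1"), 3)
	k5 := encodeMVCCKey([]byte("key1"), 5)
	// Version 5 is newer than version 3, so its encoded key sorts first.
	fmt.Println(bytes.Compare(k5, k3) < 0) // prints: true
}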

In practice, however, the implementation uses Get rather than Seek (this seems to describe an older implementation?).

MetaKey --> all version information of the key
DataKey(key + version_1) --> Value_v1
DataKey(key + version_2) --> Value_v2

The core idea is to read the meta key first, find the appropriate visible version from it, and then read the data key. Since all these keys share the same prefix, the read amplification in practice is acceptable.
Here is a pseudocode implementation of MVCCGet (which returns the value at the largest version of a key that is less than or equal to version):

MVCCGet(key, version) {
    versions = kv.Get(key)   // read the meta key: all versions of this key, newest first
    targetVer = nil
    for ver in versions {
        if ver <= version {  // first version not newer than the requested one
            targetVer = ver
            break
        }
    }
    if targetVer == nil {
        return nil           // no version is visible at `version`
    }
    return kv.Get(mvccEncode(key, targetVer)), targetVer
}

CF

RocksDB provides a feature named Column Family (hereafter referred to as CF). An instance of RocksDB may have multiple CFs; each CF is a separate key namespace with its own LSM-Tree. However, different CFs in the same RocksDB instance share a common WAL, which makes it possible to write to different CFs atomically.
We divide a RocksDB instance into three CFs: CF_DEFAULT, CF_LOCK and CF_WRITE, which correspond to Percolator’s data column, lock column and write column respectively. There is an extra CF named CF_RAFT used to save some Raft metadata, but that is beside our topic.

        CF_DEFAULT             CF_LOCK      CF_WRITE
Key     {user_key}{start_ts}   user_key     {user_key}{commit_ts}
Value   user_value             lock_info    write_info
  • User Key (user_key) is the key written or read by a TiKV client (such as TiDB); User Value (user_value) is the value corresponding to that User Key.
  • lock_info contains the lock type, primary key, timestamp, TTL, and so on.
  • write_info contains the write type, start_ts, and so on. In TiKV, records in CF_WRITE have four different types: Put, Delete, Rollback and Lock (see the sketch below).
  • Only Put records need a corresponding value in CF_DEFAULT.
  • When rolling back transactions, we don’t simply remove the lock but write a Rollback record in CF_WRITE.
  • Unlike Percolator’s lock, the Lock type of write records in TiKV is produced by queries like SELECT … FOR UPDATE in TiDB. The keys affected by such a query are not merely read: the read itself is part of a write operation. To guarantee snapshot isolation, we make it act like a write operation (even though it writes no data) to ensure the keys are locked and won’t change before the transaction commits.
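
As a rough illustration, the layout above can be modeled with Go types like the following (the names are hypothetical, not TiKV's actual Rust/kvproto definitions):

type LockType int
type WriteType int

const (
	WritePut      WriteType = iota // has a corresponding value in CF_DEFAULT
	WriteDelete                    // the key was deleted at this version
	WriteRollback                  // marks a rolled-back transaction (commit_ts == start_ts)
	WriteLock                      // from SELECT ... FOR UPDATE; writes no data
)

// CF_LOCK maps user_key -> LockInfo.
type LockInfo struct {
	Type       LockType
	PrimaryKey []byte // the transaction's primary lock
	StartTS    uint64
	TTL        uint64
}

// CF_WRITE maps {user_key}{commit_ts} -> WriteInfo.
type WriteInfo struct {
	Type    WriteType
	StartTS uint64 // where in CF_DEFAULT the actual data lives (for Put)
}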

MVCC Data Reads

  1. Check whether a lock exists for this row in CF_LOCK with a timestamp in the range [0, startTs]. If one exists, another transaction currently holds the row; if the lock has already timed out, try to clean it up, otherwise wait for it to time out or for the owning transaction to release it.
  2. Read the row's latest data as of startTs (see the sketch below): read CF_WRITE with timestamps in the range [0, startTs], take the value found there and convert it into a timestamp t, then read the data at version t from CF_DEFAULT.
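
A toy, single-threaded Go sketch of the two steps above, with in-memory maps standing in for the three CFs (it ignores write types, lock TTLs and lock resolution):

// One CF_WRITE record; writes[key] holds records newest first.
type writeRec struct {
	CommitTS uint64
	StartTS  uint64
}

type store struct {
	locks    map[string]uint64            // CF_LOCK: user_key -> lock's start_ts
	writes   map[string][]writeRec        // CF_WRITE: user_key -> records, newest first
	defaults map[string]map[uint64][]byte // CF_DEFAULT: user_key -> start_ts -> value
}

// mvccGet returns (nil, true) when the caller must wait on a lock.
func (s *store) mvccGet(key string, startTS uint64) (value []byte, locked bool) {
	// Step 1: a lock with timestamp in [0, startTS] means another
	// transaction may be writing this key; wait or resolve it.
	if lockTS, ok := s.locks[key]; ok && lockTS <= startTS {
		return nil, true
	}
	// Step 2: find the newest write record with commit_ts <= startTS,
	// then read CF_DEFAULT at that record's start_ts.
	for _, w := range s.writes[key] {
		if w.CommitTS <= startTS {
			return s.defaults[key][w.StartTS], false
		}
	}
	return nil, false // nothing visible in this snapshot
}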

Latch

Latches enable atomic read-modify-write operations on individual User Keys, like Bigtable row transactions.

As we know, Percolator's transaction algorithm is built on BigTable's support for single-row transactions: both prewrite and commit need an atomic read followed by a write. In TiKV, every write operation (WriteBatch) sent to the engine is applied atomically. But atomic writes alone are certainly not enough; otherwise the following could happen:

  1. Transaction A tries to prewrite key1; after reading, it finds no lock.
  2. Transaction B tries to prewrite key1; after reading, it also finds no lock.
  3. Transaction A writes its prewrite.
  4. Transaction B writes its prewrite.

In that case, the lock written by transaction A is overwritten, yet A believes it wrote the lock successfully. If A then commits, data consistency is broken because one of A's locks has been lost.

The way the Scheduler schedules transactions avoids this situation. The Scheduler contains a module called Latches, which holds many slots. Before a task that performs writes starts, it hashes each key it touches, and each key falls into one slot of the Latches; the task then tries to lock these slots, and only after acquiring them does it proceed to take a snapshot and run its read-write flow. As a result, if two tasks need to write the same key, they must lock the same slot in the Latches and are therefore mutually exclusive.
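
A minimal Go sketch of the Latches idea, assuming a fixed slot count and plain mutexes (real TiKV queues waiting commands on each latch instead of blocking OS threads):

import (
	"hash/fnv"
	"sort"
	"sync"
)

type Latches struct {
	slots []sync.Mutex
}

func NewLatches(n int) *Latches { return &Latches{slots: make([]sync.Mutex, n)} }

func (l *Latches) slotFor(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % len(l.slots)
}

// Lock acquires the slots covering all keys of a command. Two commands
// that touch the same key hash to the same slot, so they are mutually
// exclusive.
func (l *Latches) Lock(keys []string) []int {
	seen := map[int]bool{}
	var acquired []int
	for _, k := range keys {
		if s := l.slotFor(k); !seen[s] {
			seen[s] = true
			acquired = append(acquired, s)
		}
	}
	sort.Ints(acquired) // lock slots in a fixed order to avoid lock-order deadlock
	for _, s := range acquired {
		l.slots[s].Lock()
	}
	return acquired
}

func (l *Latches) Unlock(acquired []int) {
	for _, s := range acquired {
		l.slots[s].Unlock()
	}
}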

Latch source code

Optimistic Transactions

Overall Flow

Prewrite

  1. The transaction starts. The client obtains the current timestamp (start_ts) from TSO.
  2. Select one row as the primary row, the others as the secondary rows.
  3. Check whether there is any commit located at or after start_ts (check if there is any record in CF_WRITE whose commit_ts is greater than or equal to start_ts) and whether there is another lock on this row (check if there is a lock in CF_LOCK). Either situation is a conflict; if either happens, the prewrite fails and rollback will be called.
  4. Write the lock into CF_LOCK, and write the data, keyed with start_ts, into CF_DEFAULT.
  5. Repeat the steps above on the secondary rows (a combined sketch of both phases follows the Commit steps below).

Commit

  1. Obtain the commit timestamp commit_ts from TSO.
  2. Check whether the lock on the primary row still exists.
    • The key should hold a lock belonging to this transaction. If it does, simply proceed with the write below.
    • If it does not, look in CF_WRITE for the record whose start_ts equals this transaction's start_ts:
      • The key has already been committed successfully. This can happen, for example, when the client retries after a network problem kept it from receiving the success response; it can also happen when another transaction that encountered the lock committed it first (roll-forward). In this case, return success without doing anything (for idempotency).
      • The transaction has been rolled back (the CF_WRITE record's type is Rollback). For example, if the commit cannot succeed for a long time due to network problems, the transaction may be rolled back by another transaction once the lock's TTL expires.
  3. Delete the lock in CF_LOCK and write a record into CF_WRITE at commit_ts.
  4. Repeat the steps above on the secondary rows.
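
Continuing the toy store model from the MVCC read sketch, here is a single-key sketch of both phases (illustrative only; it assumes fmt is imported and omits the primary key, TTLs and write types that real TiKV records):

// prewrite performs the conflict checks and writes the lock and data.
func (s *store) prewrite(key string, startTS uint64, value []byte) error {
	for _, w := range s.writes[key] {
		if w.CommitTS >= startTS { // a commit at/after start_ts: write conflict
			return fmt.Errorf("write conflict on %q", key)
		}
	}
	if _, locked := s.locks[key]; locked { // a lock at any timestamp: conflict
		return fmt.Errorf("%q is locked", key)
	}
	s.locks[key] = startTS // write the lock into CF_LOCK ...
	if s.defaults[key] == nil {
		s.defaults[key] = map[uint64][]byte{}
	}
	s.defaults[key][startTS] = value // ... and the data into CF_DEFAULT at start_ts
	return nil
}

// commit replaces the transaction's lock with a write record at commit_ts.
func (s *store) commit(key string, startTS, commitTS uint64) error {
	if ts, ok := s.locks[key]; !ok || ts != startTS {
		// The lock is gone: the caller must inspect CF_WRITE to learn
		// whether the transaction was already committed or rolled back.
		return fmt.Errorf("lock on %q not found", key)
	}
	delete(s.locks, key) // delete the lock in CF_LOCK ...
	// ... and prepend a write record (records are kept newest first).
	s.writes[key] = append([]writeRec{{CommitTS: commitTS, StartTS: startTS}}, s.writes[key]...)
	return nil
}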

Rollback

One type of CF_WRITE record is Rollback. Such a record marks a rolled-back transaction, with its commit_ts set equal to its start_ts. When rolling back a transaction, we do not simply remove the lock in CF_LOCK; we also write a Rollback record into CF_WRITE.

In some cases, TiKV may still receive a prewrite request belonging to a transaction after that transaction has been rolled back. For example, the request may have lingered on the network for a long time; or, because prewrite requests are sent in parallel, one client thread may receive a conflict response, cancel the other threads' pending requests and call rollback, just as one thread's prewrite request has already gone out. If a key receives a prewrite from the same transaction after being rolled back, it must not succeed. Given the Rollback behavior above, a prewrite arriving after the rollback fails directly on this prewrite check:

Check whether there is any commit located at or after start_ts (check if there is any record in CF_WRITE whose commit_ts is greater than or equal to start_ts).
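
In the toy model, the Rollback marker falls out of this same check: because its commit_ts equals start_ts, a late prewrite from the same transaction now fails the commit_ts >= start_ts test in the prewrite sketch above (the model omits the record type; real TiKV tags the record as Rollback in CF_WRITE):

func (s *store) rollback(key string, startTS uint64) {
	// Remove the transaction's lock if it is still present.
	if ts, ok := s.locks[key]; ok && ts == startTS {
		delete(s.locks, key)
	}
	// Leave a Rollback marker with commit_ts == start_ts, so a late
	// prewrite from the same transaction fails its conflict check.
	s.writes[key] = append([]writeRec{{CommitTS: startTS, StartTS: startTS}}, s.writes[key]...)
}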

Clean Up

Deal with leftover locks (Percolator's lazy cleanup, as described above).

Drawbacks

  • Two-phase commit involves many network round trips.
  • When a transaction is too large, there are the following problems:
    All writes are kept in the client's memory before commit, so TiDB's memory usage surges and can easily OOM.
    The probability that first-phase writes conflict with other transactions rises sharply, and transactions block one another.

Retry Mechanism

Async Commit & Single Region 1PC

https://tikv.org/deep-dive/distributed-transaction/optimized-percolator/#calculated-commit-timestamp
https://pingcap.com/zh/blog/async-commit-principle

Pessimistic Transactions

Initial Idea

Move the locking (optimistic locking) done in the Prewrite phase forward to the DML execution phase.
But the optimistic locks written during Prewrite block reads.

Overall Flow

  • When executing DML, add pessimistic locks to the keys being modified.
  • The transaction commits as in Percolator: Prewrite rewrites the pessimistic locks into Percolator's optimistic locks, and the presence of the pessimistic locks guarantees that Prewrite will succeed.
  • When a reader encounters a pessimistic lock, it can be sure that the lock's transaction has not reached the Commit phase (it has not even done Prewrite), so reads are not blocked.

Pessimistic Locks

  1. Execute the DML to obtain the keys that need to be modified.
  2. Add pessimistic locks to those keys.
  3. The constraint checks done at Prewrite time are moved forward to the pessimistic-lock phase (see the sketch below).
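
Continuing the same toy model, a hypothetical pessimistic-lock step; for_update_ts is the timestamp at which the DML reads (real TiKV also handles lock waiting, TTLs and deadlock detection):

func (s *store) acquirePessimisticLock(key string, startTS, forUpdateTS uint64) error {
	// The Prewrite-time conflict check moves here: newer committed data
	// means TiDB must re-read and re-execute the DML.
	for _, w := range s.writes[key] {
		if w.CommitTS > forUpdateTS {
			return fmt.Errorf("newer commit on %q: retry the DML", key)
		}
	}
	if owner, locked := s.locks[key]; locked && owner != startTS {
		return fmt.Errorf("%q locked by another transaction: wait", key)
	}
	s.locks[key] = startTS // same lock format, but no data is written here;
	// note: real TiKV readers can ignore pessimistic locks, unlike the toy mvccGet
	return nil
}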

The lock format is the same as the optimistic lock's. Writing a pessimistic lock does not write any data; the data is still written during Prewrite.

Encountering Newer Data

Session 1                                                               Session 2
MySQL root@127.0.0.1:myc_test> begin;                                   MySQL root@127.0.0.1:myc_test> begin;
Query OK, 0 rows affected                                               Query OK, 0 rows affected
Time: 0.001s                                                            Time: 0.001s
MySQL root@127.0.0.1:myc_test> select * from test;                      MySQL root@127.0.0.1:myc_test> select * from test;
+----+-------+                                                          +----+-------+
| id | score |                                                          | id | score |
+----+-------+                                                          +----+-------+
| 1  | 2     |                                                          | 1  | 2     |
+----+-------+                                                          +----+-------+
1 row in set                                                            1 row in set
Time: 0.010s                                                            Time: 0.010s
MySQL root@127.0.0.1:myc_test> update test set score = 3 where id = 1;
Query OK, 1 row affected
Time: 0.001s
MySQL root@127.0.0.1:myc_test> commit;
Query OK, 0 rows affected
Time: 0.023s
MySQL root@127.0.0.1:myc_test> select * from test;
+----+-------+
| id | score |
+----+-------+
| 1  | 3     |
+----+-------+
1 row in set
Time: 0.004s

                                                                        MySQL root@127.0.0.1:myc_test> select * from test;
                                                                        +----+-------+
                                                                        | id | score |
                                                                        +----+-------+
                                                                        | 1  | 2     |
                                                                        +----+-------+
                                                                        1 row in set
                                                                        Time: 0.007s
                                                                        MySQL root@127.0.0.1:myc_test> update test set score = 4 where id = 1;
                                                                        Query OK, 1 row affected
                                                                        Time: 0.001s
                                                                        MySQL root@127.0.0.1:myc_test> select * from test;
                                                                        +----+-------+
                                                                        | id | score |
                                                                        +----+-------+
                                                                        | 1  | 4     |
                                                                        +----+-------+
                                                                        1 row in set
                                                                        Time: 0.010s
                                                                        MySQL root@127.0.0.1:myc_test> commit;
                                                                        Query OK, 0 rows affected
                                                                        Time: 0.023s
                                                                        MySQL root@127.0.0.1:myc_test> select * from test;
                                                                        +----+-------+
                                                                        | id | score |
                                                                        +----+-------+
                                                                        | 1  | 4     |
                                                                        +----+-------+
                                                                        1 row in set
                                                                        Time: 0.005s

DML execution must be based on the latest committed data, so the DML needs to be re-executed.

Encountering Another Transaction's Lock

The transaction first waits for the lock inside TiKV (the Waiter Manager), and then TiDB retries.

Commit

  1. Prewrite rewrites the pessimistic locks into optimistic locks and writes the data; it only needs to check that the pessimistic lock still exists and belongs to this transaction.
  2. The rest is the same as Percolator.

Waiting for Locks

When a transaction encounters a lock, it waits for the lock inside TiKV.
Lock waiting has a timeout, 3 seconds by default and configurable.
When the lock is released (by commit or rollback), the waiting transactions are woken up.
When multiple transactions are waiting for the same lock, responses are returned in order of the transactions' start_ts, so the transaction with the smaller start_ts is more likely to acquire the lock.

Deadlock

  • When acquiring a pessimistic lock runs into another transaction's lock, the transaction must wait for the lock to be released.
  • Transactions waiting on each other cause deadlocks, which can span different nodes, so deadlock detection is implemented inside TiKV.
  • One TiKV instance acts as the deadlock detector leader; the other TiKV instances send their wait-for information to the leader (a toy sketch follows).
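
A toy centralized wait-for check in Go (assuming every transaction waits on at most one other at a time; a real detector must also remove edges, tolerate leader changes, and handle fan-out):

type detector struct {
	waitFor map[uint64]uint64 // waiter's start_ts -> start_ts of the txn it waits on
}

// detect reports whether adding the edge waiter -> holder would close a
// cycle (deadlock). If not, the edge is recorded.
func (d *detector) detect(waiter, holder uint64) bool {
	for cur, ok := holder, true; ok; cur, ok = d.waitFor[cur] {
		if cur == waiter {
			return true // holder (transitively) waits on waiter: deadlock
		}
	}
	d.waitFor[waiter] = holder
	return false
}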

TiDB Has No Gap Locks

When the rows matching the filter condition cannot be guaranteed to be unique:

  • MySQL locks every row the filter condition could cover: range locks, up to a full-table lock.

  • TiDB locks only the rows it has actually read.

The specific comparison is as follows (note: id is the table's primary key):
[figure: MySQL vs. TiDB locking comparison table]

References

https://pingcap.com/zh/blog/tidb-internal-2
https://docs.pingcap.com/zh/tidb/stable/clustered-indexes/
https://pingcap.com/zh/blog/tidb-transaction-model
https://pingcap.com/zh/blog/tikv-source-code-reading-13
https://pingcap.com/zh/blog/tikv-source-code-reading-11
https://pingcap.com/zh/blog/tikv-source-code-reading-12
https://en.pingcap.com/blog/multi-version-concurrency-control-in-tikv#mvcc
https://tikv.org/deep-dive/distributed-transaction/percolator/
https://pingcap.com/zh/blog/how-do-we-build-tidb
https://pingcap.com/zh/blog/percolator-and-txn
https://pingcap.com/zh/blog/best-practice-optimistic-transaction
https://github.com/pingcap/presentations/blob/master/Infra-Meetup/Infra-Meetup-117-zhaolei-TiDB%20%E6%82%B2%E8%A7%82%E9%94%81.pdf
https://pingcap.com/zh/blog/tidb-4.0-pessimistic-lock
https://pingcap.com/zh/blog/pessimistic-transaction-the-new-features-of-tidb
