version: postgresql10.17
0 总结
总结
- leaf page
- 0级页面(最底层页面)有右兄弟的话,第一条是右兄弟的第一条(应该是本页的最后一条,注意这条不属于本页,是下一页最小值得复制)注意:0级页面的最左页面也有这个copy
- 0级页面(最底层页面)没有右兄弟的话,第一条就是起始数据
- root / branch
- root/branch页面的data存的是指向页面的最小值(正常是第二条,第一条一般是指向页面的右兄弟的第二条“当前页面最小值”的copy)。特殊的是root/branch指向最左页面的data不记录信息。
- 怎么找到中间节点(branch/root)的最左和最右?
- 最左页面的data是空的
- 最右节点的指向的leaf的第一条就是当前页面的最小值(其他页面第一条保存右sibling最小值的copy)
--------
| meta |
--------
| root |
--------
| left branch | branch | branch | ..... | branch |
--------------------------------------------------------
| leaf | leaf | leaf | leaf | leaf | right leaf |
--------------------------------------------------------
! left branch的data为空
! branch的data是指向页面的最小值
! right leaf的第一条是当前页面的最小值
! leaf的第一条是右sibling的最小值得copy
1 相关系统表
支持哪些索引类型?
postgres=# \d pg_am
Table "pg_catalog.pg_am"
Column | Type | Collation | Nullable | Default
-----------+---------+-----------+----------+---------
amname | name | | not null |
amhandler | regproc | | not null |
amtype | "char" | | not null |
postgres=# select * from pg_am;
amname | amhandler | amtype
--------+-------------+--------
btree | bthandler | i
hash | hashhandler | i
gist | gisthandler | i
gin | ginhandler | i
spgist | spghandler | i
brin | brinhandler | i
ps. 9.6前这张系统表有很多method的属性信息,新版本这些属性保存在C代码中了。
amhandler的相关函数在pg_proc中,都是C函数。
select proname, prosrc from pg_proc where oid in (330,331,332,333,334,335);
proname | prosrc
-------------+-------------
bthandler | bthandler
hashhandler | hashhandler
gisthandler | gisthandler
ginhandler | ginhandler
spghandler | spghandler
brinhandler | brinhandler
例子:
postgres=# \d t2
Table "public.t2"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | not null |
info | integer | | |
Indexes:
"pk_t2_a" PRIMARY KEY, btree (id)
系统表含义:
postgres=# select * from pg_index where indexrelid='pk_t2_a'::regclass;
-[ RECORD 1 ]--+-------
indexrelid | 126817
indrelid | 126814
indnatts | 1 -- 列数
indisunique | t -- 唯一索引
indisprimary | t -- 主键索引
indisexclusion | f -- 支持exclusion constraint
indimmediate | t -- 强制每次insert时检查唯一性
indisclustered | f -- 根据这个索引做了聚簇
indisvalid | t -- 是否用于查询
indcheckxmin | f -- 查询时需要检查索引元组xmin<事务的xmin才能用
indisready | t -- 索引是否可以进行插入、更新
indislive | t -- f表示索引正在被drop,其他操作应该忽略这个索引
indisreplident | f -- ALTER TABLE ... REPLICA IDENTITY USING INDEX
indkey | 1 -- 用哪几个列建的
indcollation | 0
indclass | 1978
indoption | 0
indexprs |
indpred |
-- indisreplident订阅了update和delete,需要指定REPLICA IDENTITY才能执行更新删除
postgres=# select oid,* from pg_class order by oid desc limit 1;
-[ RECORD 1 ]-------+--------
oid | 126817
relname | pk_t2_a
relnamespace | 2200
reltype | 0
reloftype | 0
relowner | 10
relam | 403
relfilenode | 126817
reltablespace | 0
relpages | 1
reltuples | 0
relallvisible | 0
reltoastrelid | 0
relhasindex | f
relisshared | f
relpersistence | p
relkind | i
relnatts | 1
relchecks | 0
relhasoids | f
relhaspkey | f
relhasrules | f
relhastriggers | f
relhassubclass | f
relrowsecurity | f
relforcerowsecurity | f
relispopulated | t
relreplident | n
relispartition | f
relfrozenxid | 0
relminmxid | 0
relacl |
reloptions |
relpartbound |
2 数据结构
|-----------------------------------------------------|
| linp0(page header) | linp1 | linp2 | |
|-----------------------------------------------------|
| |
| |
|-----------------------------------------------------|
| | itup2 | itup1 | special space(BTMetaPageData)|
|-----------------------------------------------------|
BTPageOpaqueData在special space中,维护B-linked-tree的结构
typedef struct BTPageOpaqueData
{
BlockNumber btpo_prev; /* 左兄弟的块号,便于反向扫描 */
BlockNumber btpo_next; /* 右兄弟的块号,正向扫描*/
union
{
uint32 level; /* 在索引树中的层级level*/
TransactionId xact; /* 删除无用的ID */
} btpo;
uint16 btpo_flags; /* 页面类型 */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
/* Bits defined in btpo_flags */
#define BTP_LEAF (1 << 0) /* 有该标志表明有叶子页面 */
#define BTP_ROOT (1 << 1) /* 根页面*/
#define BTP_DELETED (1 << 2) /* 从索引树中删除的页面 */
#define BTP_META (1 << 3) /* 元页面*/
#define BTP_HALF_DEAD (1 << 4) /* 空页面,但仍保留*/
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples,LP——DEAD元组 */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
// BTMetaPageData
typedef struct BTMetaPageData
{
uint32 btm_magic; /* should contain BTREE_MAGIC */
uint32 btm_version; /* should contain BTREE_VERSION */
BlockNumber btm_root; /* current root location */
uint32 btm_level; /* tree level of the root page */
BlockNumber btm_fastroot; /* current "fast" root location */
uint32 btm_fastlevel; /* tree level of the "fast" root page */
} BTMetaPageData;
3 观察存储结构
总结
- 0级页面(最底层页面)有右兄弟的话,第一条是右兄弟的第一条(应该是本页的最后一条,注意这条不属于本页,是下一页最小值得复制)注意:0级页面的最左页面也有这个copy
- 0级页面(最底层页面)没有右兄弟的话,第一条就是起始数据
- root页面的data存的是指向页面的最小值,(正常是第二条,第一条存右slibing最小值copy),这个第一条一般是指向页面的右兄弟的第二条(当前页面最小值)的copy。特殊的是指向最左页面的data不记录信息。
观察存储结构:
----------------
| meta page |
----------------
| root page |
----------------
创建测试表
create table t3(id int primary key, info text);
meta page
测试表的第0个页面是meta page,指向root page
postgres=# \d t3
Table "public.t3"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | not null |
info | text | | |
Indexes:
"t3_pkey" PRIMARY KEY, btree (id)
postgres=# select * from bt_page_stats('t3_pkey',0);
ERROR: block 0 is a meta page
postgres=# select * from bt_page_items('t3_pkey',0);
ERROR: block 0 is a meta page
postgres=# select * from bt_metap('t3_pkey');
magic | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
340322 | 2 | 1 | 0 | 1 | 0
场景一:100条|有meta page|有root|无branch|无leaf
create table t3(id int primary key, info text);
insert into t3 select generate_series(1,100), md5(random()::text);
level = 0
select * from bt_metap('t3_pkey');
btpo = 等级
btpo_flags = 类型
postgres=# select * from bt_page_stats('t3_pkey',1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 100 | 0 | 16 | 8192 | 6148 | 0 | 0 | 0 | 3
(1 row)
bt_page_stats数据都是从SpecialSpage的BTPageOpaque中取的,含义可以参考BTPageOpaque数据结构
bt_page_stats
GetBTPageStatistics
BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
...
stat->btpo_prev = opaque->btpo_prev;
stat->btpo_next = opaque->btpo_next;
stat->btpo.level = opaque->btpo.level;
stat->btpo_flags = opaque->btpo_flags;
stat->btpo_cycleid = opaque->btpo_cycleid;
...
https://developer.aliyun.com/article/53701
层次可以从bt_page_stats的btpo得到,代表当前index page所处的层级。
注意层级并不是唯一的,例如btpo=3的层级,可能有分几个档。
btpo=0级表示最底层,处于这个层级的index pages存储的items(ctid)是指向heap page的。
类别和层级不挂钩,类别里面又可以有多个层级,但是只有层级=0的index page存储的ctid内容才是指向heap page的
其他层级index page存储的ctid内容都是指向同层级其他index page(双向链表),或者指下级的index page。
0层结构,只有meta和root页。
root页最多可以存储的item数,取决于索引字段数据的长度、以及索引页的大小。
当前
btpo = 等级 = 0
btpo_flags = 类型 = 3 = BTP_LEAF | BTP_ROOT
#define BTP_LEAF (1 << 0) /* leaf page, i.e. not internal page */
#define BTP_ROOT (1 << 1) /* root page (has no parent) */
#define BTP_DELETED (1 << 2) /* page has been deleted from tree */
#define BTP_META (1 << 3) /* meta-page */
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
btpo知道这是一个0级节点,0级节点指向的是heap tuple
postgres=# select * from bt_page_items('t3_pkey',1) limit 10;
itemoffset | ctid | itemlen | nulls | vars | data
------------+--------+---------+-------+------+-------------------------
1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
3 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
4 | (0,4) | 16 | f | f | 04 00 00 00 00 00 00 00
5 | (0,5) | 16 | f | f | 05 00 00 00 00 00 00 00
6 | (0,6) | 16 | f | f | 06 00 00 00 00 00 00 00
7 | (0,7) | 16 | f | f | 07 00 00 00 00 00 00 00
8 | (0,8) | 16 | f | f | 08 00 00 00 00 00 00 00
9 | (0,9) | 16 | f | f | 09 00 00 00 00 00 00 00
10 | (0,10) | 16 | f | f | 0a 00 00 00 00 00 00 00
postgres=# select * from t3 where ctid='(0,10)';
id | info
----+----------------------------------
10 | 986dd76dcec6d1b705ea163fe6e11f37
场景二:1000条|有meta|有root|无branch|有leaf
create table t4(id int primary key, info text);
insert into t4 select generate_series(1,1000), md5(random()::text);
level = 1(只有leaf、meta和root不计入)
select * from bt_metap('t4_pkey');
1、root页面ID=3
postgres=# select * from bt_metap('t4_pkey');
magic | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
340322 | 2 | 3 | 1 | 3 | 1
2、页面stats
-- btpo=0:已经是最底层了
-- btpo_flags=1:BTP_LEAF
postgres=# select * from bt_page_stats('t4_pkey',1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1
-- btpo=0:已经是最底层了
-- btpo_flags=1:BTP_LEAF
postgres=# select * from bt_page_stats('t4_pkey',2);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
2 | l | 367 | 0 | 16 | 8192 | 808 | 1 | 4 | 0 | 1
-- btpo=1:不是最底层
-- btpo_flags=2:BTP_ROOT
postgres=# select * from bt_page_stats('t4_pkey',3);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
3 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 1 | 2
-- btpo=0:已经是最底层了
-- btpo_flags=1:BTP_LEAF
postgres=# select * from bt_page_stats('t4_pkey',4);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
4 | l | 268 | 0 | 16 | 8192 | 2788 | 2 | 0 | 0 | 1
postgres=# select * from bt_page_stats('t4_pkey',5);
ERROR: block number out of range
3、root页面items
指向3个页面,data存储的是这个leaf page存储的最小值
postgres=# select * from bt_page_items('t4_pkey',3);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------
1 | (1,1) | 8 | f | f |
2 | (2,1) | 16 | f | f | 6f 01 00 00 00 00 00 00
3 | (4,1) | 16 | f | f | dd 02 00 00 00 00 00 00
第一个0级页面:
postgres=# select * from bt_page_items('t4_pkey',1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (3,7) | 16 | f | f | 6f 01 00 00 00 00 00 00
2 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
3 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
4 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
...
365 | (3,4) | 16 | f | f | 6c 01 00 00 00 00 00 00
366 | (3,5) | 16 | f | f | 6d 01 00 00 00 00 00 00
367 | (3,6) | 16 | f | f | 6e 01 00 00 00 00 00 00
第二个0级页面:
postgres=# select * from bt_page_items('t4_pkey',2);
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (6,13) | 16 | f | f | dd 02 00 00 00 00 00 00
2 | (3,7) | 16 | f | f | 6f 01 00 00 00 00 00 00
3 | (3,8) | 16 | f | f | 70 01 00 00 00 00 00 00
4 | (3,9) | 16 | f | f | 71 01 00 00 00 00 00 00
...
365 | (6,10) | 16 | f | f | da 02 00 00 00 00 00 00
366 | (6,11) | 16 | f | f | db 02 00 00 00 00 00 00
367 | (6,12) | 16 | f | f | dc 02 00 00 00 00 00 00
第四个0级页面:
postgres=# select * from bt_page_items('t4_pkey',4);
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (6,13) | 16 | f | f | dd 02 00 00 00 00 00 00
2 | (6,14) | 16 | f | f | de 02 00 00 00 00 00 00
3 | (6,15) | 16 | f | f | df 02 00 00 00 00 00 00
4 | (6,16) | 16 | f | f | e0 02 00 00 00 00 00 00
...
266 | (8,38) | 16 | f | f | e6 03 00 00 00 00 00 00
267 | (8,39) | 16 | f | f | e7 03 00 00 00 00 00 00
268 | (8,40) | 16 | f | f | e8 03 00 00 00 00 00 00
场景三:10000000条|有meta|有root|有branch|有leaf
总结:
- ctid的第一个数字表示页面ID,第二个数字是偏移
- btpo_flags = 0是branch page、btpo_flags=2是root page
create table t5(id int primary key, info text);
insert into t5 select trunc(random()*10000000), md5(random()::text) from generate_series(1,1000000) on conflict on constraint t5_pkey do nothing;
level = 2(有branch和leaf)root是412
select * from bt_metap('t5_pkey');
magic | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
340322 | 2 | 412 | 2 | 412 | 2
查看root(data存的是左兄弟的最大值)
btpo = 2(在第二层)
btpo_flags = 2 = BTP_ROOT
postgres=# select * from bt_page_stats('t5_pkey', 412);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
412 | r | 11 | 0 | 15 | 8192 | 7936 | 0 | 0 | 2 | 2
postgres=# select * from bt_page_items('t5_pkey', 412);
itemoffset | ctid | itemlen | nulls | vars | data
------------+----------+---------+-------+------+-------------------------
1 | (3,1) | 8 | f | f |
2 | (2714,1) | 16 | f | f | 66 07 0a 00 00 00 00 00
3 | (1233,1) | 16 | f | f | a0 8c 16 00 00 00 00 00
4 | (2381,1) | 16 | f | f | 0c 62 23 00 00 00 00 00
5 | (583,1) | 16 | f | f | 4d 71 30 00 00 00 00 00
6 | (2286,1) | 16 | f | f | 6b 80 3d 00 00 00 00 00
7 | (1062,1) | 16 | f | f | 59 fb 4b 00 00 00 00 00
8 | (2046,1) | 16 | f | f | de b2 5a 00 00 00 00 00
9 | (411,1) | 16 | f | f | d6 df 6a 00 00 00 00 00
10 | (2006,1) | 16 | f | f | 06 6d 79 00 00 00 00 00
11 | (1380,1) | 16 | f | f | 8c 34 8a 00 00 00 00 00
这样看起来几个branch page是3、2714、1233、2381等11个页面。
postgres=# select * from bt_page_stats('t5_pkey', 3);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
3 | i | 214 | 0 | 15 | 8192 | 3876 | 0 | 2714 | 1 | 0
postgres=# select * from bt_page_stats('t5_pkey', 2714);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
2714 | i | 273 | 0 | 15 | 8192 | 2696 | 3 | 1233 | 1 | 0
下面看一下branch的内容,例如3页面:
postgres=# select * from bt_page_items('t5_pkey', 3);
itemoffset | ctid | itemlen | nulls | vars | data
------------+----------+---------+-------+------+-------------------------
1 | (1852,1) | 16 | f | f | 66 07 0a 00 00 00 00 00
2 | (1,1) | 8 | f | f |
3 | (3176,1) | 16 | f | f | 5c 08 00 00 00 00 00 00
4 | (1612,1) | 16 | f | f | ef 11 00 00 00 00 00 00
5 | (3045,1) | 16 | f | f | 2d 1a 00 00 00 00 00 00
6 | (724,1) | 16 | f | f | e2 24 00 00 00 00 00 00
7 | (2822,1) | 16 | f | f | ae 2d 00 00 00 00 00 00
8 | (1369,1) | 16 | f | f | 96 38 00 00 00 00 00 00
9 | (2612,1) | 16 | f | f | ac 42 00 00 00 00 00 00
10 | (315,1) | 16 | f | f | 0d 4e 00 00 00 00 00 00
11 | (2458,1) | 16 | f | f | d6 58 00 00 00 00 00 00
...
...
branch data保存的是什么?
查看leaf页面1852的内容,第一条是右兄弟最小值的复制;第二条是当前最小值。
所以data是保存指向页面的最小数据。
postgres=# select * from bt_page_items('t5_pkey', 1852);
itemoffset | ctid | itemlen | nulls | vars | data
------------+------------+---------+-------+------+-------------------------
1 | (151,77) | 16 | f | f | 2b 18 0a 00 00 00 00 00
2 | (1231,33) | 16 | f | f | 66 07 0a 00 00 00 00 00 <-----------
3 | (4098,24) | 16 | f | f | 67 07 0a 00 00 00 00 00
4 | (2373,32) | 16 | f | f | 73 07 0a 00 00 00 00 00
5 | (7888,95) | 16 | f | f | 9e 07 0a 00 00 00 00 00
怎么找到最小值?
查看root发现只有(3,1)没有data,data本应保存右兄弟的最小值。没有说明(3,1)是第一个页面。
postgres=# select * from bt_page_items('t5_pkey', 412);
itemoffset | ctid | itemlen | nulls | vars | data
------------+----------+---------+-------+------+-------------------------
1 | (3,1) | 8 | f | f |
2 | (2714,1) | 16 | f | f | 66 07 0a 00 00 00 00 00
3 | (1233,1) | 16 | f | f | a0 8c 16 00 00 00 00 00
4 | (2381,1) | 16 | f | f | 0c 62 23 00 00 00 00 00
查看branch
postgres=# select * from bt_page_items('t5_pkey', 3);
itemoffset | ctid | itemlen | nulls | vars | data
------------+----------+---------+-------+------+-------------------------
1 | (1852,1) | 16 | f | f | 66 07 0a 00 00 00 00 00
2 | (1,1) | 8 | f | f |
3 | (3176,1) | 16 | f | f | 5c 08 00 00 00 00 00 00
4 | (1612,1) | 16 | f | f | ef 11 00 00 00 00 00 00
5 | (3045,1) | 16 | f | f | 2d 1a 00 00 00 00 00 00
6 | (724,1) | 16 | f | f | e2 24 00 00 00 00 00 00
7 | (2822,1) | 16 | f | f | ae 2d 00 00 00 00 00 00
8 | (1369,1) | 16 | f | f | 96 38 00 00 00 00 00 00
9 | (2612,1) | 16 | f | f | ac 42 00 00 00 00 00 00
查看leaf,第一条是右sibling的最小值,第二条是当前页面的最小值。
postgres=# select * from bt_page_items('t5_pkey', 1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+------------+---------+-------+------+-------------------------
1 | (212,57) | 16 | f | f | 5c 08 00 00 00 00 00 00
2 | (7238,62) | 16 | f | f | 2a 00 00 00 00 00 00 00
3 | (430,12) | 16 | f | f | 2b 00 00 00 00 00 00 00
4 | (5535,81) | 16 | f | f | 3e 00 00 00 00 00 00 00
5 | (1278,47) | 16 | f | f | 3f 00 00 00 00 00 00 00
6 | (4255,45) | 16 | f | f | 46 00 00 00 00 00 00 00
查看数据
postgres=# select * from bt_page_items('t5_pkey', 1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+------------+---------+-------+------+-------------------------
1 | (212,57) | 16 | f | f | 5c 08 00 00 00 00 00 00
2 | (7238,62) | 16 | f | f | 2a 00 00 00 00 00 00 00
3 | (430,12) | 16 | f | f | 2b 00 00 00 00 00 00 00
4 | (5535,81) | 16 | f | f | 3e 00 00 00 00 00 00 00
5 | (1278,47) | 16 | f | f | 3f 00 00 00 00 00 00 00
6 | (4255,45) | 16 | f | f | 46 00 00 00 00 00 00 00
postgres=# select * from t5 where ctid='(7238,62)';
id | info
----+----------------------------------
42 | faca6c2f77cfb6eea5e9ee6bf2740578
postgres=# select min(id) from t5;
min
-----
42
ref
https://developer.aliyun.com/article/53701
下一篇开始翻源码