前言:
PostgreSQL既然提供了主从流复制的架构,那么肯定就需要有相应的主备切换的方案,目前PostgreSQL官方提供了多种方案,并且第三方工具也提供了很多可能性。
相关信息:
环境:异步流复制
主库:10.25.15.77
备库:10.25.15.78
版本:PostgreSQL 12
大致流程如下:
- 关闭主库
- 激活备库
- 修改原主库的primary_conninfo信息
- 启动原主库
1.查看目前的复制状态:
postgres=# select usename,application_name,client_addr,sync_state from pg_stat_replication;
usename | application_name | client_addr | sync_state
---------+------------------+-------------+------------
repuser | walreceiver | 10.25.15.78 | async
通过如下命令确认当前是不是主库:
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f #f表示主库,t表示备库
2.关闭主库
建议使用-m fast模式关闭,因为shutdown之前会进行checkpoint确保所有已提交事务落盘,并且WAL也会同步到备库上pg_ctl stop -m fast
3.激活备库
在备库服务器上,执行如下命令来激活备库,也就是将备库提升为新的主库
[postgres@ndcnvx508 data]# pg_ctl promote
waiting for server to promote.... done
server promoted
注:
此时新主库其实已经可以写入数据了
此时日志会有如下记录:
2020-06-17 16:16:38.662 CST [24332] LOG: received promote request
2020-06-17 16:16:38.662 CST [24332] LOG: redo done at 0/11000028
2020-06-17 16:16:38.662 CST [24332] LOG: last completed transaction was at log time 2020-06-17 16:09:48.113029+08
2020-06-17 16:16:38.663 CST [24332] LOG: selected new timeline ID: 7
2020-06-17 16:16:38.691 CST [24332] LOG: archive recovery complete
2020-06-17 16:16:38.698 CST [24330] LOG: database system is ready to accept connections
4.修改原主库配置
此时原主库处于关闭状态,查看配置文件postgresql.auto.conf
是否有如下内容,如果没有请设置上
primary_conninfo = ‘user=repuser password=admin123host=10.25.15.78 port=1921
sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any’
注:
host的IP地址要写新主库的地址
查看数据目录$PGDATA下是否有standby.signal
,如果没有就直接touch新建一个,因为PG12版本以后删除了recovery.conf文件,而使用standby.signal
标识文件,表示数据库处于 standby
模式
[postgres@ndcnvx507 data]# ls -l standby.signal
ls: cannot access standby.signal: No such file or directory
[postgres@ndcnvx507 data]# touch standby.signal
[postgres@ndcnvx507 data]# ls -l standby.signal
-rw------- 1 postgres postgresgrp 0 Jun 17 16:26 standby.signal
5.启动原主库
pg_ctl start
此时会有如下日志记录($PGDATA/log/postgresql-XXX.log):
2020-06-17 16:34:00.722 CST [31159] LOG: database system was shut down at 2020-06-17 16:14:20 CST
2020-06-17 16:34:00.722 CST [31159] LOG: entering standby mode
2020-06-17 16:34:00.724 CST [31159] LOG: consistent recovery state reached at 0/110000A0
2020-06-17 16:34:00.724 CST [31159] LOG: invalid record length at 0/110000A0: wanted 24, got 0
2020-06-17 16:34:00.725 CST [31157] LOG: database system is ready to accept read only connections
2020-06-17 16:34:00.733 CST [31163] LOG: fetching timeline history file for timeline 7 from primary server
2020-06-17 16:34:00.735 CST [31163] LOG: started streaming WAL from primary at 0/11000000 on timeline 6
2020-06-17 16:34:00.735 CST [31163] LOG: replication terminated by primary server
2020-06-17 16:34:00.735 CST [31163] DETAIL: End of WAL reached on timeline 6 at 0/110000A0.
2020-06-17 16:34:00.735 CST [31159] LOG: new target timeline is 7
2020-06-17 16:34:00.736 CST [31163] LOG: restarted WAL streaming at 0/11000000 on timeline 7
2020-06-17 16:34:00.762 CST [31159] LOG: redo starts at 0/110000A0
6.验证
此时在新主库上查看状态:
postgres=# select usename,application_name,client_addr,sync_state from pg_stat_replication;
usename | application_name | client_addr | sync_state
---------+------------------+-------------+------------
repuser | walreceiver | 10.25.15.77 | async
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
(1 row)
同时,如果主库10.25.15.78上有了WAL发送进程,备库10.25.15.77上有了WAL接收进程,即说明failover切换成功
也可以通过在新主库上insert数据来判断是否数据可以同步成功,这里就不演示了。
另:
如果在激活备库前忘记关闭主库,那么可能会导致有些WAL没有同步过去,从而使原主库在切换成备库后无法继续同步WAL日志流数据,且此时,主备库上均看不到WAL的发送和接收进程存在,这个时候,我们可以重做备库,但如果数据库很大的情况下就非常耗费时间了,生产环境不太现实。
我们可以使用工具pg_rewind
来重新同步新主库的数据,其作用不是全量同步数据,而是只复制变化了的数据。
具体步骤:
关闭原主库pg_ctl stop -m fast
使用pg_rewind进行同步
pg_rewind --target-pgdata $PGDATA --source-server='host=10.25.15.78 port=1921 user=root password=admin123 dbname=postgres' -P
日志如下:
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/5013580 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/50134D0 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 83 MB (total source directory size is 104 MB)
85872/85872 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!
之后注意如下两点:
注意检查一下postgresql.auto.conf
配置文件内容是否正确,不正确的话请自行改正
检查数据目录$PGDATA下文件standby.signal
是否存在,不在的话请自行创建
启动原主库pg_ctl start
思考:
虽然上述pg_rewind实现了原主库到备库的转换,但是原主库上是否会有数据丢失?如果想要保证这些数据不丢失怎么办?如果解决?我们后续找到方法后再进一步更新
参考网址:
how to failover in postgres12: https://www.enterprisedb.com/postgres-tutorials/how-manage-replication-and-failover-postgres-version-12-without-recoveryconf
《Postgres实战》作者博客文章: https://postgres.fun/20110824223133.html