208、PostgreSQL12 如何手动failover

前言:

PostgreSQL既然提供了主从流复制的架构,那么肯定就需要有相应的主备切换的方案,目前PostgreSQL官方提供了多种方案,并且第三方工具也提供了很多可能性。

相关信息:

环境:异步流复制
主库:10.25.15.77
备库:10.25.15.78
版本:PostgreSQL 12

大致流程如下:

  • 关闭主库
  • 激活备库
  • 修改原主库的primary_conninfo信息
  • 启动原主库

1.查看目前的复制状态:

postgres=# select usename,application_name,client_addr,sync_state from pg_stat_replication;
usename | application_name | client_addr | sync_state
---------+------------------+-------------+------------
repuser | walreceiver      | 10.25.15.78 | async

通过如下命令确认当前是不是主库:
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f            #f表示主库,t表示备库

2.关闭主库

建议使用-m fast模式关闭,因为shutdown之前会进行checkpoint确保所有已提交事务落盘,并且WAL也会同步到备库上
pg_ctl stop -m fast

3.激活备库

在备库服务器上,执行如下命令来激活备库,也就是将备库提升为新的主库

[postgres@ndcnvx508 data]# pg_ctl promote
waiting for server to promote.... done
server promoted

注:此时新主库其实已经可以写入数据了

此时日志会有如下记录:
2020-06-17 16:16:38.662 CST [24332] LOG:  received promote request
2020-06-17 16:16:38.662 CST [24332] LOG:  redo done at 0/11000028
2020-06-17 16:16:38.662 CST [24332] LOG:  last completed transaction was at log time 2020-06-17 16:09:48.113029+08
2020-06-17 16:16:38.663 CST [24332] LOG:  selected new timeline ID: 7
2020-06-17 16:16:38.691 CST [24332] LOG:  archive recovery complete
2020-06-17 16:16:38.698 CST [24330] LOG:  database system is ready to accept connections

4.修改原主库配置

此时原主库处于关闭状态,查看配置文件postgresql.auto.conf是否有如下内容,如果没有请设置上
primary_conninfo = ‘user=repuser password=admin123host=10.25.15.78 port=1921 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any’

注:host的IP地址要写新主库的地址

查看数据目录$PGDATA下是否有standby.signal,如果没有就直接touch新建一个,因为PG12版本以后删除了recovery.conf文件,而使用standby.signal 标识文件,表示数据库处于 standby 模式

[postgres@ndcnvx507 data]# ls -l standby.signal
ls: cannot access standby.signal: No such file or directory
[postgres@ndcnvx507 data]# touch standby.signal
[postgres@ndcnvx507 data]# ls -l standby.signal
-rw------- 1 postgres postgresgrp 0 Jun 17 16:26 standby.signal

5.启动原主库

pg_ctl start

此时会有如下日志记录($PGDATA/log/postgresql-XXX.log):
2020-06-17 16:34:00.722 CST [31159] LOG:  database system was shut down at 2020-06-17 16:14:20 CST
2020-06-17 16:34:00.722 CST [31159] LOG:  entering standby mode
2020-06-17 16:34:00.724 CST [31159] LOG:  consistent recovery state reached at 0/110000A0
2020-06-17 16:34:00.724 CST [31159] LOG:  invalid record length at 0/110000A0: wanted 24, got 0
2020-06-17 16:34:00.725 CST [31157] LOG:  database system is ready to accept read only connections
2020-06-17 16:34:00.733 CST [31163] LOG:  fetching timeline history file for timeline 7 from primary server
2020-06-17 16:34:00.735 CST [31163] LOG:  started streaming WAL from primary at 0/11000000 on timeline 6
2020-06-17 16:34:00.735 CST [31163] LOG:  replication terminated by primary server
2020-06-17 16:34:00.735 CST [31163] DETAIL:  End of WAL reached on timeline 6 at 0/110000A0.
2020-06-17 16:34:00.735 CST [31159] LOG:  new target timeline is 7
2020-06-17 16:34:00.736 CST [31163] LOG:  restarted WAL streaming at 0/11000000 on timeline 7
2020-06-17 16:34:00.762 CST [31159] LOG:  redo starts at 0/110000A0

6.验证

此时在新主库上查看状态:

postgres=#  select usename,application_name,client_addr,sync_state from pg_stat_replication;
usename | application_name | client_addr | sync_state
---------+------------------+-------------+------------
repuser | walreceiver      | 10.25.15.77 | async

postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
f
(1 row)

同时,如果主库10.25.15.78上有了WAL发送进程,备库10.25.15.77上有了WAL接收进程,即说明failover切换成功
也可以通过在新主库上insert数据来判断是否数据可以同步成功,这里就不演示了。

另:
如果在激活备库前忘记关闭主库,那么可能会导致有些WAL没有同步过去,从而使原主库在切换成备库后无法继续同步WAL日志流数据,且此时,主备库上均看不到WAL的发送和接收进程存在,这个时候,我们可以重做备库,但如果数据库很大的情况下就非常耗费时间了,生产环境不太现实。
我们可以使用工具pg_rewind来重新同步新主库的数据,其作用不是全量同步数据,而是只复制变化了的数据。
具体步骤:
关闭原主库
pg_ctl stop -m fast

使用pg_rewind进行同步

 pg_rewind --target-pgdata $PGDATA --source-server='host=10.25.15.78 port=1921 user=root password=admin123 dbname=postgres' -P

日志如下:
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/5013580 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/50134D0 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 83 MB (total source directory size is 104 MB)
85872/85872 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!

之后注意如下两点:

注意检查一下postgresql.auto.conf配置文件内容是否正确,不正确的话请自行改正
检查数据目录$PGDATA下文件standby.signal是否存在,不在的话请自行创建
启动原主库
pg_ctl start

思考:

虽然上述pg_rewind实现了原主库到备库的转换,但是原主库上是否会有数据丢失?如果想要保证这些数据不丢失怎么办?如果解决?我们后续找到方法后再进一步更新

参考网址:

how to failover in postgres12:  https://www.enterprisedb.com/postgres-tutorials/how-manage-replication-and-failover-postgres-version-12-without-recoveryconf
《Postgres实战》作者博客文章: https://postgres.fun/20110824223133.html

上一篇:oracle 19c使用dgmgrl来执行switchover和failover切换


下一篇:Java-Readable