背景
为了防止坏块问题, checkpoint后第一次修改的page需要将整个page写入wal日志. 即full page write(FPW).
有了fpw, 可以修复坏块, 例如pg_basebackup, 在线拷贝, 拷贝到partial block也没关系, 使用fp可以修复.
但是fpw的引入会导致wal日志变大, 特别是更新频繁的场景. 在checkpoint频率较高时更为明显.
checkpoint频度越高, 恢复需要replay的wal日志越少, 恢复速度越快, 例如数据库崩溃恢复会变得很快. 所以就有了矛盾:
- 拉长checkpoint周期可以降低开启fpw带来的性能影响, 但是会导致崩溃恢复的时间变长.
- 缩短checkpoint周期可以降低崩溃恢复的时间, 但是会导致WAL写入大量FPW日志, 影响数据库写性能.
PolarDB 通过standby来支持坏块修复, 可以关闭FPW功能. 从而解决了以上矛盾, 性能提升非常明显. 同时:
- Oracle也是用的类似方法, 但是Oracle通过checksum来判断一个page是否已损坏, 这种方法存在小概率的问题: 损坏的块和正常的块算出来的checksum没有发生变化, 从而导致block永久损坏. 为什么会出现这种情况呢, 8KB的块算出一个8字节的checksum, 存在hash value冲突, page 的内容不同但是checksum相同.
- PolarDB 所有第一次修改的page都从 hot standby拉取, 从而不存在这个小概率问题.
在高频率checkpoint的情况下, 关闭fpw写负载性能可提升30%.
对比测试
8c64g 2TB nvme ssd
rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm yum -y install coreutils glib2 lrzsz dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc gcc-c++ make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex openjade bzip2 git iotop lvm2 perf centos-release-scl yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm yum install -y postgresql14*
parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%FREE mkfs.ext4 /dev/vdb1 -m 0 -O extent,uninit_bg -b 4096 -L lv01
vi /etc/fstab LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0
mkdir /data01 mount -a
vi /etc/sysctl.conf # add by digoal.zhou fs.aio-max-nr = 1048576 fs.file-max = 76724600 # 可选:kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p # /data01/corefiles 事先建好,权限777,如果是软链接,对应的目录修改为777 kernel.sem = 4096 2147483647 2147483646 512000 # 信号量, ipcs -l 或 -u 查看,每16个进程一组,每组信号量需要17个信号量。 kernel.shmall = 107374182 # 所有共享内存段相加大小限制 (建议内存的80%),单位为页。 kernel.shmmax = 274877906944 # 最大单个共享内存段大小 (建议为内存一半), >9.2的版本已大幅降低共享内存的使用,单位为字节。 kernel.shmmni = 819200 # 一共能生成多少共享内存段,每个PG数据库集群至少2个共享内存段 net.core.netdev_max_backlog = 10000 net.core.rmem_default = 262144 # The default setting of the socket receive buffer in bytes. net.core.rmem_max = 4194304 # The maximum receive socket buffer size in bytes net.core.wmem_default = 262144 # The default setting (in bytes) of the socket send buffer. net.core.wmem_max = 4194304 # The maximum send socket buffer size in bytes. net.core.somaxconn = 4096 net.ipv4.tcp_max_syn_backlog = 4096 net.ipv4.tcp_keepalive_intvl = 20 net.ipv4.tcp_keepalive_probes = 3 net.ipv4.tcp_keepalive_time = 60 net.ipv4.tcp_mem = 8388608 12582912 16777216 net.ipv4.tcp_fin_timeout = 5 net.ipv4.tcp_synack_retries = 2 net.ipv4.tcp_syncookies = 1 # 开启SYN Cookies。当出现SYN等待队列溢出时,启用cookie来处理,可防范少量的SYN攻击 net.ipv4.tcp_timestamps = 1 # 减少time_wait net.ipv4.tcp_tw_recycle = 0 # 如果=1则开启TCP连接中TIME-WAIT套接字的快速回收,但是NAT环境可能导致连接失败,建议服务端关闭它 net.ipv4.tcp_tw_reuse = 1 # 开启重用。允许将TIME-WAIT套接字重新用于新的TCP连接 net.ipv4.tcp_max_tw_buckets = 262144 net.ipv4.tcp_rmem = 8192 87380 16777216 net.ipv4.tcp_wmem = 8192 65536 16777216 net.nf_conntrack_max = 1200000 net.netfilter.nf_conntrack_max = 1200000 vm.dirty_background_bytes = 409600000 # 系统脏页到达这个值,系统后台刷脏页调度进程 pdflush(或其他) 自动将(dirty_expire_centisecs/100)秒前的脏页刷到磁盘 # 默认为10%,大内存机器建议调整为直接指定多少字节 vm.dirty_expire_centisecs = 3000 # 比这个值老的脏页,将被刷到磁盘。3000表示30秒。 vm.dirty_ratio = 95 # 如果系统进程刷脏页太慢,使得系统脏页超过内存 95 % 时,则用户进程如果有写磁盘的操作(如fsync, fdatasync等调用),则需要主动把系统脏页刷出。 # 有效防止用户进程刷脏页,在单机多实例,并且使用CGROUP限制单实例IOPS的情况下非常有效。 vm.dirty_writeback_centisecs = 100 # pdflush(或其他)后台刷脏页进程的唤醒间隔, 100表示1秒。 vm.swappiness = 0 # 不使用交换分区 vm.mmap_min_addr = 65536 vm.overcommit_memory = 0 # 在分配内存时,允许少量over malloc, 如果设置为 1, 则认为总是有足够的内存,内存较少的测试环境可以使用 1 . vm.overcommit_ratio = 90 # 当overcommit_memory = 2 时,用于参与计算允许指派的内存大小。 vm.swappiness = 0 # 关闭交换分区 vm.zone_reclaim_mode = 0 # 禁用 numa, 或者在vmlinux中禁止. net.ipv4.ip_local_port_range = 40000 65535 # 本地自动分配的TCP, UDP端口号范围 fs.nr_open=20480000 # 单个进程允许打开的文件句柄上限 # 以下参数请注意 # vm.extra_free_kbytes = 4096000 # vm.min_free_kbytes = 2097152 # vm.min_free_kbytes 建议每32G内存分配1G vm.min_free_kbytes # 如果是小内存机器,以上两个值不建议设置 # vm.nr_hugepages = 66536 # 建议shared buffer设置超过64GB时 使用大页,页大小 /proc/meminfo Hugepagesize # vm.lowmem_reserve_ratio = 1 1 1 # 对于内存大于64G时,建议设置,否则建议默认值 256 256 32
sysctl -p
vi /etc/security/limits.d/20-nproc.conf # nofile超过1048576的话,一定要先将sysctl的fs.nr_open设置为更大的值,并生效后才能继续设置nofile. * soft nofile 1024000 * hard nofile 1024000 * soft nproc unlimited * hard nproc unlimited * soft core unlimited * hard core unlimited * soft memlock unlimited * hard memlock unlimited
chmod +x /etc/rc.d/rc.local
vi /etc/rc.local touch /var/lock/subsys/local if test -f /sys/kernel/mm/transparent_hugepage/enabled; then echo never > /sys/kernel/mm/transparent_hugepage/enabled fi echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
useradd postgres su - postgres wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2 tar -jxvf postgresql-snapshot.tar.bz2 cd postgresql-15devel ./configure --prefix=/home/postgres/pg15 make world -j 16 make install-world
cd ~ vi .bash_profile # 追加 export PS1="$USER@`/bin/hostname -s`-> " export PGPORT=1921 export PGDATA=/data01/pg15_$PGPORT/pg_root export LANG=en_US.utf8 export PGHOME=/home/postgres/pg15 export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH export DATE=`date +"%Y%m%d%H%M"` export PATH=$PGHOME/bin:$PATH:. export MANPATH=$PGHOME/share/man:$MANPATH export PGHOST=$PGDATA export PGUSER=postgres export PGDATABASE=postgres alias rm='rm -i' alias ll='ls -lh' unalias vi
su - root mkdir /data01/pg15_1921 chown postgres:postgres /data01/pg15_1921
su - postgres initdb -D $PGDATA -U postgres -E UTF8 --lc-collate=C --lc-ctype=en_US.utf8
cd $PGDATA vi postgresql.conf listen_addresses = '0.0.0.0' port = 1921 max_connections = 1000 superuser_reserved_connections = 13 unix_socket_directories = '., /tmp' tcp_keepalives_idle = 60 tcp_keepalives_interval = 10 tcp_keepalives_count = 6 shared_buffers = 16GB maintenance_work_mem = 1GB dynamic_shared_memory_type = posix vacuum_cost_delay = 0 bgwriter_delay = 10ms bgwriter_lru_maxpages = 1000 bgwriter_lru_multiplier = 5.0 effective_io_concurrency = 0 max_parallel_workers_per_gather = 0 wal_level = replica fsync = on synchronous_commit = on full_page_writes = on wal_writer_delay = 10ms wal_writer_flush_after = 1MB max_wal_size = 1GB # 较高频率checkpoint, 对比fpw明显的影响 min_wal_size = 80MB random_page_cost = 1.1 effective_cache_size = 64GB log_destination = 'csvlog' logging_collector = on log_truncate_on_rotation = on log_checkpoints = on log_timezone = 'Asia/Shanghai' autovacuum_vacuum_cost_delay = 0ms vacuum_freeze_table_age = 750000000 vacuum_multixact_freeze_table_age = 750000000 datestyle = 'iso, mdy' timezone = 'Asia/Shanghai' lc_messages = 'en_US.utf8' lc_monetary = 'en_US.utf8' lc_numeric = 'en_US.utf8' lc_time = 'en_US.utf8' default_text_search_config = 'pg_catalog.english'
pg_ctl start
pgbench -i -s 5000
pgbench -M prepared -n -r -P 1 -c 16 -j 16 -T 120
开启full page write 时, tpc-b读写性能
transaction type: <builtin: TPC-B (sort of)> scaling factor: 5000 query mode: prepared number of clients: 16 number of threads: 16 duration: 120 s number of transactions actually processed: 1210084 latency average = 1.586 ms latency stddev = 1.181 ms initial connection time = 8.439 ms tps = 10072.852551 (without initial connection time) statement latencies in milliseconds: 0.001 \set aid random(1, 100000 * :scale) 0.000 \set bid random(1, 1 * :scale) 0.000 \set tid random(1, 10 * :scale) 0.000 \set delta random(-5000, 5000) 0.077 BEGIN; 0.681 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.111 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 0.117 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; 0.107 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 0.091 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 0.399 END;
关闭full page write 时, tpc-b读写性能
transaction type: <builtin: TPC-B (sort of)> scaling factor: 5000 query mode: prepared number of clients: 16 number of threads: 16 duration: 120 s number of transactions actually processed: 1569385 latency average = 1.223 ms latency stddev = 0.970 ms initial connection time = 9.154 ms tps = 13070.045981 (without initial connection time) statement latencies in milliseconds: 0.001 \set aid random(1, 100000 * :scale) 0.000 \set bid random(1, 1 * :scale) 0.000 \set tid random(1, 10 * :scale) 0.000 \set delta random(-5000, 5000) 0.075 BEGIN; 0.402 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.113 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 0.114 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; 0.111 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 0.096 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 0.310 END;
关闭fpw后, 性能提升了30%.