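Case description:
In production environments, security policy often forbids SSH trust relationships between hosts for the root user. As a result, when a KingbaseES R6 repmgr cluster is started with the sys_monitor.sh script, the nodes cannot reach each other over ssh and the cluster fails to start. This case uses es_server and es_client to establish trusted connections between users in place of ssh access.
Test database version: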
test=# select version();
version
----------------------------------------------------------------------------------------------------------------------
KingbaseES V008R006C003B0010 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
(1 row)
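As shown in the figure below, because no trusted connection can be established for the root user, sys_monitor.sh cannot start the cluster:
I. Configure and start es_server (all nodes)
es_server configuration:
Start es_server: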
[kingbase@node3 bin]$ ./esHAmodel.sh start
[kingbase@node3 bin]$ ps -ef |grep es_server
kingbase 28024 1 0 15:18 pts/2 00:00:00 /home/kingbase/cluster/R6HA/KHA/kingbase/bin/es_server
[kingbase@node3 bin]$ netstat -an |grep 8890
tcp 0 0 0.0.0.0:8890 0.0.0.0:* LISTEN
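The netstat check above can be scripted so that every node is verified the same way. The following is a minimal sketch, not part of the KingbaseES tooling: is_listening is a hypothetical helper that reads "netstat -an" output on standard input and succeeds only if a socket in the LISTEN state is bound to the given port (8890 in this environment).

```shell
# Hypothetical helper (not shipped with KingbaseES): succeed if the
# "netstat -an" output on stdin contains a LISTEN socket whose local
# address (4th column) ends in ":<port>".
is_listening() {
    awk -v port="$1" '
        $NF == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

Typical use would be: netstat -an | is_listening 8890 && echo "es_server port is open"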
Test the connection to es_server:
[kingbase@node3 bin]$ ./es_client --help
es-client
Usage:
es-client [OPTION...] -o
Options:
-U, --username=NAME username for ES authentication
-h, --host=HOSTNAME ES Server host
-p, --port=PORT ES Server port number
-W, --password password
-d, --debug enable debug message (optional)
-?, --help print this help
-o, --option use user-define cmd: like "ls ."
[kingbase@node3 bin]$ ./es_client -h 192.168.7.248 -U kingbase -W 123456 -o "hostname"
node1
[kingbase@node3 bin]$ ./es_client -h 192.168.7.249 -U kingbase -W 123456 -o "hostname"
node2
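The two manual es_client calls above can be wrapped in a loop when there are more nodes to check. This is only a sketch: check_es_nodes is a hypothetical helper, and it uses nothing beyond the -h, -U, -W and -o options listed in the es_client help output. The client path, user, password and IP addresses are passed in as arguments rather than hard-coded.

```shell
# Hypothetical helper: run "hostname" on every listed node through
# es_client and report which nodes answered.
# Arguments: <path to es_client> <user> <password> <ip> [<ip> ...]
check_es_nodes() {
    client=$1; user=$2; password=$3
    shift 3
    failed=0
    for ip in "$@"; do
        if name=$("$client" -h "$ip" -U "$user" -W "$password" -o "hostname"); then
            echo "$ip -> $name"
        else
            echo "$ip -> unreachable" >&2
            failed=1
        fi
    done
    # Non-zero exit status if any node could not be reached
    return "$failed"
}
```

For the environment in this article the call would look like: check_es_nodes ./es_client kingbase 123456 192.168.7.248 192.168.7.249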
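II. Configure repmgr.conf to support bmj-mode connections
As shown in the figure below, the sys_monitor.sh script communicates through es_server and es_client when bmj is switched on, so repmgr.conf must be modified to enable bmj communication (on_bmj=on).
Configure repmgr.conf (all nodes):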
[kingbase@node3 bin]$ cat ../etc/repmgr.conf
# Enable bmj
on_bmj=on
node_id=3
node_name=node243
promote_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr standby follow -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2'
log_file='/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log'
data_directory='/home/kingbase/cluster/R6HA/KHA/kingbase/data'
sys_bindir='/home/kingbase/cluster/R6HA/KHA/kingbase/bin'
ssh_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22'
reconnect_attempts=2
reconnect_interval=3
failover='automatic'
recovery='automatic'
monitoring_history='no'
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.240/24'
net_device='enp0s3'
ipaddr_path='/sbin'
arping_path='/sbin'
synchronous='quorum'
repmgrd_pid_file='/home/kingbase/cluster/R6HA/KHA/kingbase/hamgrd.pid'
ping_path='/usr/bin'
#priority=0
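Because the setting has to be present on every node, it may help to check for it mechanically before starting the cluster. The sketch below assumes only what the listing above shows, namely that the switch is written as an uncommented on_bmj=on line; bmj_enabled is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given repmgr.conf contains an
# uncommented "on_bmj=on" line (optional spaces and single quotes
# around the value are tolerated).
bmj_enabled() {
    grep -Eq "^[[:space:]]*on_bmj[[:space:]]*=[[:space:]]*'?on'?[[:space:]]*\$" "$1"
}
```

Example: bmj_enabled ../etc/repmgr.conf || echo "on_bmj is not enabled in repmgr.conf"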
III. Test starting the cluster with sys_monitor.sh
[kingbase@node3 bin]$ ./sys_monitor.sh restart
2021-03-01 15:25:58 Ready to stop all DB ...
sh: /etc/cron.d/KINGBASECRON: Permission denied
sh: /etc/cron.d/KINGBASECRON: Permission denied
sh: /etc/cron.d/KINGBASECRON: Permission denied
2021-03-01 15:25:59 begin to stop repmgrd on "[192.168.7.248]".
2021-03-01 15:25:59 repmgrd on "[192.168.7.248]" stop success.
2021-03-01 15:25:59 begin to stop repmgrd on "[192.168.7.243]".
2021-03-01 15:25:59 repmgrd on "[192.168.7.243]" stop success.
2021-03-01 15:25:59 begin to stop repmgrd on "[192.168.7.249]".
2021-03-01 15:25:59 repmgrd on "[192.168.7.249]" stop success.
2021-03-01 15:25:59 begin to stop DB on "[192.168.7.248]".
waiting for server to shut down.... done
server stopped
2021-03-01 15:26:00 DB on "[192.168.7.248]" stop success.
2021-03-01 15:26:00 begin to stop DB on "[192.168.7.249]".
waiting for server to shut down.... done
server stopped
2021-03-01 15:26:00 DB on "[192.168.7.249]" stop success.
2021-03-01 15:26:00 begin to stop DB on "[192.168.7.243]".
waiting for server to shut down..... done
server stopped
2021-03-01 15:26:01 DB on "[192.168.7.243]" stop success.
2021-03-01 15:26:01 Done.
2021-03-01 15:26:02 Ready to start all DB ...
2021-03-01 15:26:02 begin to start DB on "[192.168.7.243]".
waiting for server to start.... done
server started
2021-03-01 15:26:02 execute to start DB on "[192.168.7.243]" success, connect to check it.
2021-03-01 15:26:03 DB on "[192.168.7.243]" start success.
2021-03-01 15:26:03 Try to ping trusted_servers on host 192.168.7.248 ...
2021-03-01 15:26:05 Try to ping trusted_servers on host 192.168.7.243 ...
2021-03-01 15:26:07 Try to ping trusted_servers on host 192.168.7.249 ...
2021-03-01 15:26:09 begin to start DB on "[192.168.7.248]".
waiting for server to start.... done
server started
2021-03-01 15:26:10 execute to start DB on "[192.168.7.248]" success, connect to check it.
2021-03-01 15:26:11 DB on "[192.168.7.248]" start success.
2021-03-01 15:26:11 begin to start DB on "[192.168.7.249]".
waiting for server to start.... done
server started
2021-03-01 15:26:12 execute to start DB on "[192.168.7.249]" success, connect to check it.
2021-03-01 15:26:13 DB on "[192.168.7.249]" start success.
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------
1 | node248 | standby | ! running | node243 | default | 100 | 23 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
2 | node249 | witness | * running | node243 | default | 0 | 1 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
3 | node243 | primary | * running | | default | 100 | 23 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
WARNING: following issues were detected
- node "node248" (ID: 1) is running but the repmgr node record is inactive
2021-03-01 15:26:13 The primary DB is started.
WARNING: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary
2021-03-01 15:26:37 Success to load virtual ip [192.168.7.240/24] on primary host [192.168.7.243].
2021-03-01 15:26:37 Try to ping vip on host 192.168.7.248 ...
2021-03-01 15:26:39 Try to ping vip on host 192.168.7.243 ...
2021-03-01 15:26:41 Try to ping vip on host 192.168.7.249 ...
2021-03-01 15:26:43 begin to start repmgrd on "[192.168.7.248]".
2021-03-01 15:26:43 repmgrd on "[192.168.7.248]" already started.
2021-03-01 15:26:43 begin to start repmgrd on "[192.168.7.243]".
2021-03-01 15:26:43 repmgrd on "[192.168.7.243]" already started.
2021-03-01 15:26:43 begin to start repmgrd on "[192.168.7.249]".
2021-03-01 15:26:43 repmgrd on "[192.168.7.249]" already started.
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node248 | standby | running | node243 | running | 3589 | no | 0 second(s) ago
2 | node249 | witness | * running | node243 | running | 23739 | no | 0 second(s) ago
3 | node243 | primary | * running | | running | 30496 | no | n/a
sh: /etc/cron.d/KINGBASECRON: Permission denied
sh: /etc/logrotate.d/kingbase: Permission denied
chown: changing ownership of ‘/etc/logrotate.d/kingbase’: Operation not permitted
chmod: changing permissions of ‘/etc/logrotate.d/kingbase’: Operation not permitted
sh: /etc/cron.d/KINGBASECRON: Permission denied
sh: /etc/logrotate.d/kingbase: Permission denied
chown: changing ownership of ‘/etc/logrotate.d/kingbase’: Operation not permitted
chmod: changing permissions of ‘/etc/logrotate.d/kingbase’: Operation not permitted
sh: /etc/cron.d/KINGBASECRON: Permission denied
sh: /etc/logrotate.d/kingbase: Permission denied
chown: changing ownership of ‘/etc/logrotate.d/kingbase’: Operation not permitted
chmod: changing permissions of ‘/etc/logrotate.d/kingbase’: Operation not permitted
2021-03-01 15:26:44 Done.
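As shown in the figure below, sys_monitor.sh hits permission errors when it accesses "/etc/cron.d/KINGBASECRON" and "/etc/logrotate.d/kingbase" during startup:
Notes:
1) /etc/cron.d/KINGBASECRON is the scheduled task created when the repmgr cluster starts; it is used to start the repmgrd process.
2) /etc/logrotate.d/kingbase is the configuration file used to rotate the hamgr.log and kbha.log logs.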
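In the restart output the permission errors are fused together with the regular log lines, which makes them easy to miss. As a rough aid, the sketch below pulls the affected paths out of a saved copy of that output. denied_paths is a hypothetical helper, and it assumes GNU or BSD grep, since the -o option is not required by POSIX.

```shell
# Hypothetical helper: read sys_monitor.sh output on stdin and print each
# absolute path that appears on a line reporting "Permission denied" or
# "Operation not permitted", once per path.
denied_paths() {
    grep -E 'Permission denied|Operation not permitted' \
        | grep -oE '/[A-Za-z0-9_./-]+' \
        | LC_ALL=C sort -u
}
```

With the output saved to a file such as restart.log (a name chosen here for illustration), "denied_paths < restart.log" would list the files whose permissions need attention.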
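Configuration related to /etc/cron.d/KINGBASECRON in the sys_monitor.sh script:
Configuration related to /etc/logrotate.d/kingbase in the sys_monitor.sh script:
1) Change the permissions on the /etc/cron.d/KINGBASECRON file (as shown in the figure below) (all nodes)
2) Change the permissions on /etc/logrotate.d/kingbase (all nodes)
Change the owner of the kingbase file (all nodes):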
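The exact commands for this step are not reproduced in the text. The appendix listing shows /etc/logrotate.d/kingbase owned by kingbase:kingbase with mode -rw-r--r--, so one plausible way to reach that state is sketched below. This is an assumption rather than the original procedure: give_file_to is a hypothetical helper, it is meant to be run as root on each node, and the owner and group are the cluster user from this article.

```shell
# Assumption: executed as root on every node. Creates the file if it is
# missing, then hands it to the given owner and group.
# Arguments: <owner> <group> <file>
give_file_to() {
    owner=$1; group=$2; file=$3
    touch "$file" && chown "$owner:$group" "$file"
}
```

For example: give_file_to kingbase kingbase /etc/logrotate.d/kingbase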
Comment out the statements in the sys_monitor.sh script that change the owner and permissions of the kingbase configuration file:
function init_log_rotate()
{
_host="$1"
_final_target_file="/etc/logrotate.d/kingbase"
eval _rep_log_file=`grep log_file ${rep_conf} | awk -F '=' '{print $2}'`
execute_command ${super_user} $host "\
echo -e '# Generate by sys_monitor.sh at `date`\n\
${kbha_file} {\n\
weekly\n\
maxsize 100M\n\
su ${execute_user} ${execute_user}\n\
create 0600 ${execute_user} ${execute_user}\n\
rotate 3\n\
copytruncate\n\
dateext\n\
}\n\
${_rep_log_file} {\n\
weekly\n\
maxsize 100M\n\
su ${execute_user} ${execute_user}\n\
create 0600 ${execute_user} ${execute_user}\n\
rotate 3\n\
copytruncate\n\
dateext\n\
}\n\
' > ${_final_target_file}"
#execute_command ${super_user} $host "chown ${super_user}:${super_user} ${_final_target_file}"
#execute_command ${super_user} $host "chmod 644 ${_final_target_file}"
As shown in the figure below:
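Since the same change has to be made on every node, the manual edit could be replaced by a small filter. The sketch below assumes the two statements look like the ones in the excerpt above (an execute_command call whose quoted command starts with chown or chmod and refers to _final_target_file). comment_out_perm_changes is a hypothetical helper that writes the modified script to standard output and leaves the original file untouched.

```shell
# Hypothetical helper: read sys_monitor.sh on stdin and prefix "#" to the
# execute_command lines that chown or chmod ${_final_target_file}.
# Lines that are already commented out are left as they are.
comment_out_perm_changes() {
    sed -E '/^[[:space:]]*execute_command .*"ch(own|mod) .*_final_target_file/ s/^([[:space:]]*)/\1#/'
}
```

A cautious way to use it is to write to a new file first, for example "comment_out_perm_changes < sys_monitor.sh > sys_monitor.sh.new", and compare the two with diff before replacing the script.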
IV. Test cluster startup
[kingbase@node3 bin]$ ./sys_monitor.sh restart
2021-03-01 15:52:08 Ready to stop all DB ...
2021-03-01 15:52:08 begin to stop repmgrd on "[192.168.7.248]".
2021-03-01 15:52:08 repmgrd on "[192.168.7.248]" stop success.
2021-03-01 15:52:08 begin to stop repmgrd on "[192.168.7.243]".
2021-03-01 15:52:08 repmgrd on "[192.168.7.243]" stop success.
2021-03-01 15:52:08 begin to stop repmgrd on "[192.168.7.249]".
2021-03-01 15:52:08 repmgrd on "[192.168.7.249]" stop success.
2021-03-01 15:52:08 begin to stop DB on "[192.168.7.248]".
waiting for server to shut down..... done
server stopped
2021-03-01 15:52:09 DB on "[192.168.7.248]" stop success.
2021-03-01 15:52:09 begin to stop DB on "[192.168.7.249]".
waiting for server to shut down.... done
server stopped
2021-03-01 15:52:10 DB on "[192.168.7.249]" stop success.
2021-03-01 15:52:10 begin to stop DB on "[192.168.7.243]".
waiting for server to shut down..... done
server stopped
2021-03-01 15:52:12 DB on "[192.168.7.243]" stop success.
2021-03-01 15:52:12 Done.
2021-03-01 15:52:12 Ready to start all DB ...
2021-03-01 15:52:12 begin to start DB on "[192.168.7.243]".
waiting for server to start.... done
server started
2021-03-01 15:52:12 execute to start DB on "[192.168.7.243]" success, connect to check it.
2021-03-01 15:52:13 DB on "[192.168.7.243]" start success.
2021-03-01 15:52:13 Try to ping trusted_servers on host 192.168.7.248 ...
2021-03-01 15:52:15 Try to ping trusted_servers on host 192.168.7.243 ...
2021-03-01 15:52:17 Try to ping trusted_servers on host 192.168.7.249 ...
2021-03-01 15:52:19 begin to start DB on "[192.168.7.248]".
waiting for server to start.... done
server started
2021-03-01 15:52:20 execute to start DB on "[192.168.7.248]" success, connect to check it.
2021-03-01 15:52:21 DB on "[192.168.7.248]" start success.
2021-03-01 15:52:21 begin to start DB on "[192.168.7.249]".
waiting for server to start.... done
server started
2021-03-01 15:52:21 execute to start DB on "[192.168.7.249]" success, connect to check it.
2021-03-01 15:52:22 DB on "[192.168.7.249]" start success.
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------
1 | node248 | standby | running | node243 | default | 100 | 23 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
2 | node249 | witness | * running | node243 | default | 0 | 1 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
3 | node243 | primary | * running | | default | 100 | 23 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
2021-03-01 15:52:22 The primary DB is started.
WARNING: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary
2021-03-01 15:52:46 Success to load virtual ip [192.168.7.240/24] on primary host [192.168.7.243].
2021-03-01 15:52:46 Try to ping vip on host 192.168.7.248 ...
2021-03-01 15:52:48 Try to ping vip on host 192.168.7.243 ...
2021-03-01 15:52:50 Try to ping vip on host 192.168.7.249 ...
2021-03-01 15:52:52 begin to start repmgrd on "[192.168.7.248]".
[2021-03-01 15:54:17] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 15:54:17] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 15:52:52 repmgrd on "[192.168.7.248]" start success.
2021-03-01 15:52:52 begin to start repmgrd on "[192.168.7.243]".
[2021-03-01 15:52:52] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 15:52:52] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 15:52:52 repmgrd on "[192.168.7.243]" start success.
2021-03-01 15:52:52 begin to start repmgrd on "[192.168.7.249]".
[2021-03-01 14:50:47] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 14:50:47] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 15:52:53 repmgrd on "[192.168.7.249]" start success.
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node248 | standby | running | node243 | running | 13909 | no | 0 second(s) ago
2 | node249 | witness | * running | node243 | running | 28830 | no | n/a
3 | node243 | primary | * running | | running | 6643 | no | n/a
2021-03-01 15:52:53 Done.
As shown in the figure below, the cluster starts normally.
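The status tables printed by sys_monitor.sh can also be checked by a script instead of by eye. The sketch below is a hypothetical helper, unhealthy_nodes, that locates the Status column (and the repmgrd column, when the table has one) from the header row and prints every node whose status is neither "running" nor "* running", or whose repmgrd is not "running". Treating a "!" prefix as a problem is an assumption based on the first run in this article, where "! running" came with a warning that the node record was inactive.

```shell
# Hypothetical helper: read sys_monitor.sh output on stdin and print
# "<node name>: <status>" for every table row that does not look healthy.
unhealthy_nodes() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        trim($1) == "ID" {
            # Header row: find the columns by name for this table
            status_col = daemon_col = 0
            for (i = 1; i <= NF; i++) {
                if (trim($i) == "Status")  status_col = i
                if (trim($i) == "repmgrd") daemon_col = i
            }
            next
        }
        status_col && trim($1) ~ /^[0-9]+$/ {
            status = trim($status_col)
            bad = (status != "running" && status != "* running")
            if (daemon_col && trim($daemon_col) != "running") bad = 1
            if (bad) print trim($2) ": " status
        }
    '
}
```

Applied to a saved copy of the first restart in section III it would report node248 with status "! running"; for the run above it should print nothing.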
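Appendix: Handling the /etc/logrotate.d/kingbase permission failure
As shown in the figure below, starting the cluster with the sys_monitor.sh script produces the following errors:
Solution: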
[root@node3 ~]# which chmod
/usr/bin/chmod
[root@node3 ~]# which chown
/usr/bin/chown
[root@node3 ~]# ls -lh /usr/bin/chown
-rwxr-xr-x. 1 root root 62K Nov 20 2015 /usr/bin/chown
[root@node3 ~]# ls -lh /usr/bin/chmod
-rwxr-xr-x. 1 root root 58K Nov 20 2015 /usr/bin/chmod
[root@node3 ~]# chmod u+s /usr/bin/chown
[root@node3 ~]# chmod u+s /usr/bin/chmod
[root@node3 ~]# ls -lh /usr/bin/chmod
-rwsr-xr-x. 1 root root 58K Nov 20 2015 /usr/bin/chmod
[root@node3 ~]# ls -lh /usr/bin/chown
-rwsr-xr-x. 1 root root 62K Nov 20 2015 /usr/bin/chown
[root@node3 ~]# ls -lh /etc/logrotate.d/kingbase
-rw-r--r--. 1 kingbase kingbase 492 Mar 1 15:52 /etc/logrotate.d/kingbase
[root@node3 ~]# su - kingbase
Last login: Mon Mar 1 15:51:39 CST 2021 on pts/1
Last failed login: Mon Mar 1 15:58:21 CST 2021 from :0 on :0
There was 1 failed login attempt since the last successful login.
[kingbase@node3 ~]$ chown root.root /etc/logrotate.d/kingbase
[kingbase@node3 ~]$ ls -lh /etc/logrotate.d/kingbase
-rw-r--r--. 1 root root 492 Mar 1 15:52 /etc/logrotate.d/kingbase
[kingbase@node3 ~]$ chown kingbase.kingbase /etc/logrotate.d/kingbase
[kingbase@node3 ~]$ ls -lh /etc/logrotate.d/kingbase
-rw-r--r--. 1 kingbase kingbase 492 Mar 1 15:52 /etc/logrotate.d/kingbase
# Run "sh /etc/logrotate.d/kingbase" manually
[kingbase@node3 bin]$ sh /etc/logrotate.d/kingbase
/etc/logrotate.d/kingbase: line 2: /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../kbha.log: Permission denied
/etc/logrotate.d/kingbase: line 3: weekly: command not found
/etc/logrotate.d/kingbase: line 4: maxsize: command not found
[kingbase@node3 kingbase]$ chmod u+x kbha.log
[kingbase@node3 kingbase]$ sh /etc/logrotate.d/kingbase
/etc/logrotate.d/kingbase: line 2: /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../kbha.log: Text file busy
/etc/logrotate.d/kingbase: line 3: weekly: command not found
/etc/logrotate.d/kingbase: line 4: maxsize: command not found
Password:
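Even after the steps above, starting the cluster through the sys_monitor.sh script still produced the "sh /etc/logrotate.d/kingbase" error; the problem was resolved only after the sys_monitor.sh script was modified.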
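The "command not found" messages in the manual test are what a shell prints when it tries to execute logrotate directives such as weekly and maxsize as commands, because /etc/logrotate.d/kingbase is a logrotate configuration rather than a shell script. A check that matches the file type is logrotate's debug mode (-d), which parses a configuration and reports what it would do without rotating anything. The sketch below wraps that call; validate_logrotate is a hypothetical helper, and the optional second argument exists only so that a different logrotate binary can be substituted.

```shell
# Hypothetical helper: dry-run a logrotate configuration with "logrotate -d".
# Arguments: <config file> [<logrotate binary>, default "logrotate"]
validate_logrotate() {
    conf=$1
    tool=${2:-logrotate}
    "$tool" -d "$conf"
}
```

Example: validate_logrotate /etc/logrotate.d/kingbase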