KingbaseES R6 repmgr cluster: case study where root SSH trust cannot be established on general-purpose hosts

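Case description:
In production environments, security policy may forbid SSH trust relationships for the root user between hosts. In a KingbaseES R6 repmgr cluster, this means that when the cluster is started through the sys_monitor.sh script, the nodes cannot reach each other over SSH, and cluster startup fails. This case uses es_server and es_client to establish trusted connections between users, replacing SSH access.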

Test database version:

test=# select version();
                                                       version                                                        
----------------------------------------------------------------------------------------------------------------------
 KingbaseES V008R006C003B0010 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
(1 row)

Because no trust relationship can be established for the root user, sys_monitor.sh cannot start the cluster:
[Figure: sys_monitor.sh startup failure]

I. Configure and start es_server (all nodes)

es_server configuration:
[Figure: es_server configuration]

Start es_server:

[kingbase@node3 bin]$ ./esHAmodel.sh start
[kingbase@node3 bin]$ ps -ef |grep es_server
kingbase 28024     1  0 15:18 pts/2    00:00:00 /home/kingbase/cluster/R6HA/KHA/kingbase/bin/es_server

[kingbase@node3 bin]$ netstat -an |grep 8890
tcp        0      0 0.0.0.0:8890            0.0.0.0:*               LISTEN  
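The netstat check above can be scripted so that it is easy to repeat on every node. The following is a minimal sketch and not part of the KingbaseES tooling: the function name is_listening is made up for illustration, and it assumes `netstat -an` output in the Linux layout shown above (local address in the fourth column, state in the last one).

```shell
# is_listening PORT
# Reads `netstat -an` output on stdin and succeeds if some socket is in
# LISTEN state on the given local port.
# Hypothetical helper; assumes the Linux net-tools column layout.
is_listening() {
    awk -v suffix=":$1" '
        $NF == "LISTEN" && $4 ~ (suffix "$") { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

A possible use on a node is `netstat -an | is_listening 8890 && echo "port 8890 is listening"`, where 8890 is the port seen in the output above.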

Test the connection to es_server:

[kingbase@node3 bin]$ ./es_client --help
es-client 
Usage:
es-client [OPTION...] -o
Options:
  -U, --username=NAME    username for ES authentication
  -h, --host=HOSTNAME    ES Server host
  -p, --port=PORT        ES Server port number
  -W, --password         password
  -d, --debug            enable debug message (optional)
  -?, --help             print this help

  -o, --option           use user-define cmd: like "ls ."

[kingbase@node3 bin]$ ./es_client -h 192.168.7.248 -U kingbase -W 123456 -o "hostname"
node1

[kingbase@node3 bin]$ ./es_client -h 192.168.7.249 -U kingbase -W 123456 -o "hostname"
node2
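The two manual checks above can be wrapped in a loop over all nodes. This is a sketch under assumptions: the function name check_es_nodes and the ES_CLIENT, ES_USER and ES_PASSWORD variables are invented here, only the -h, -U, -W and -o options from the help text above are used, and it is assumed (not confirmed by the source) that es_client exits with a non-zero status when the connection or the command fails.

```shell
# Hypothetical settings; adjust to the local installation.
ES_CLIENT="${ES_CLIENT:-./es_client}"
ES_USER="${ES_USER:-kingbase}"
ES_PASSWORD="${ES_PASSWORD:-}"

# check_es_nodes CMD HOST...
# Runs CMD on every HOST through es_client and prints one result line per host.
# Returns 1 if any host failed (assumes es_client reports failure via exit status).
check_es_nodes() {
    local cmd="$1"
    shift
    local failed=0 host out
    for host in "$@"; do
        if out=$("$ES_CLIENT" -h "$host" -U "$ES_USER" -W "$ES_PASSWORD" -o "$cmd" 2>&1); then
            printf '%s: OK (%s)\n' "$host" "$out"
        else
            printf '%s: FAILED (%s)\n' "$host" "$out"
            failed=1
        fi
    done
    return "$failed"
}
```

For the cluster in this article it could be called as `ES_PASSWORD=... check_es_nodes hostname 192.168.7.243 192.168.7.248 192.168.7.249`.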

II. Configure repmgr.conf to use bmj-mode connections

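In the sys_monitor.sh script, if bmj is on, the nodes communicate through es_server and es_client instead of SSH, so repmgr.conf must be modified to enable bmj communication (on_bmj=on).
[Figure: the bmj check in sys_monitor.sh]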
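The switch described above can be illustrated with a small dry-run sketch. It is not the code from sys_monitor.sh: conf_get and remote_cmd_preview are hypothetical names, the key is assumed to be on_bmj as it appears in the repmgr.conf listing in this article, and the function only prints the command line that would be used rather than running it.

```shell
# conf_get FILE KEY
# Prints the value of KEY from a key=value file, stripping optional single quotes.
conf_get() {
    local file="$1" key="$2"
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" \
        | tail -n 1 \
        | sed "s/^'\(.*\)'\$/\1/"
}

# remote_cmd_preview CONF HOST USER CMD
# Prints (does not run) the es_client command when on_bmj=on, else the ssh command.
remote_cmd_preview() {
    local conf="$1" host="$2" user="$3" cmd="$4"
    if [ "$(conf_get "$conf" on_bmj)" = "on" ]; then
        printf '%s\n' "es_client -h $host -U $user -W <password> -o \"$cmd\""
    else
        printf '%s\n' "ssh $user@$host \"$cmd\""
    fi
}
```

Printing the command instead of executing it keeps the sketch safe to try against a copy of repmgr.conf before touching a live cluster.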

Configure repmgr.conf (all nodes):

[kingbase@node3 bin]$ cat ../etc/repmgr.conf
# Enable bmj
on_bmj=on
node_id=3
node_name=node243
promote_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr  standby follow  -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2'
 
log_file='/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log'
data_directory='/home/kingbase/cluster/R6HA/KHA/kingbase/data'
sys_bindir='/home/kingbase/cluster/R6HA/KHA/kingbase/bin'
ssh_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22'
reconnect_attempts=2
reconnect_interval=3
failover='automatic'
recovery='automatic'
monitoring_history='no'
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.240/24'
net_device='enp0s3'
ipaddr_path='/sbin'
arping_path='/sbin'
synchronous='quorum'
repmgrd_pid_file='/home/kingbase/cluster/R6HA/KHA/kingbase/hamgrd.pid'
ping_path='/usr/bin'
#priority=0


III. Test starting the cluster with sys_monitor.sh

[kingbase@node3 bin]$ ./sys_monitor.sh restart
2021-03-01 15:25:58 Ready to stop all DB ...
sh: /etc/cron.d/KINGBASECRON: Permission denied
sh: /etc/cron.d/KINGBASECRON: Permission denied
sh: /etc/cron.d/KINGBASECRON: Permission denied
2021-03-01 15:25:59 begin to stop repmgrd on "[192.168.7.248]".
2021-03-01 15:25:59 repmgrd on "[192.168.7.248]" stop success.
2021-03-01 15:25:59 begin to stop repmgrd on "[192.168.7.243]".
2021-03-01 15:25:59 repmgrd on "[192.168.7.243]" stop success.
2021-03-01 15:25:59 begin to stop repmgrd on "[192.168.7.249]".
2021-03-01 15:25:59 repmgrd on "[192.168.7.249]" stop success.
2021-03-01 15:25:59 begin to stop DB on "[192.168.7.248]".
waiting for server to shut down.... done
server stopped2021-03-01 15:26:00 DB on "[192.168.7.248]" stop success.
2021-03-01 15:26:00 begin to stop DB on "[192.168.7.249]".
waiting for server to shut down.... done
server stopped2021-03-01 15:26:00 DB on "[192.168.7.249]" stop success.
2021-03-01 15:26:00 begin to stop DB on "[192.168.7.243]".
waiting for server to shut down..... done
server stopped2021-03-01 15:26:01 DB on "[192.168.7.243]" stop success.
2021-03-01 15:26:01 Done.
2021-03-01 15:26:02 Ready to start all DB ...
2021-03-01 15:26:02 begin to start DB on "[192.168.7.243]".
waiting for server to start.... done
server started2021-03-01 15:26:02 execute to start DB on "[192.168.7.243]" success, connect to check it.
2021-03-01 15:26:03 DB on "[192.168.7.243]" start success.
2021-03-01 15:26:03 Try to ping trusted_servers on host 192.168.7.248 ...
2021-03-01 15:26:05 Try to ping trusted_servers on host 192.168.7.243 ...
2021-03-01 15:26:07 Try to ping trusted_servers on host 192.168.7.249 ...
2021-03-01 15:26:09 begin to start DB on "[192.168.7.248]".
waiting for server to start.... done
server started2021-03-01 15:26:10 execute to start DB on "[192.168.7.248]" success, connect to check it.
2021-03-01 15:26:11 DB on "[192.168.7.248]" start success.
2021-03-01 15:26:11 begin to start DB on "[192.168.7.249]".
waiting for server to start.... done
server started2021-03-01 15:26:12 execute to start DB on "[192.168.7.249]" success, connect to check it.
2021-03-01 15:26:13 DB on "[192.168.7.249]" start success.
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node248 | standby | ! running | node243  | default  | 100      | 23       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
 2  | node249 | witness | * running | node243  | default  | 0        | 1        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
 3  | node243 | primary | * running |          | default  | 100      | 23       | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
WARNING: following issues were detected
  - node "node248" (ID: 1) is running but the repmgr node record is inactive
2021-03-01 15:26:13 The primary DB is started.
WARNING: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary
2021-03-01 15:26:37 Success to load virtual ip [192.168.7.240/24] on primary host [192.168.7.243].
2021-03-01 15:26:37 Try to ping vip on host 192.168.7.248 ...
2021-03-01 15:26:39 Try to ping vip on host 192.168.7.243 ...
2021-03-01 15:26:41 Try to ping vip on host 192.168.7.249 ...
2021-03-01 15:26:43 begin to start repmgrd on "[192.168.7.248]".
2021-03-01 15:26:43 repmgrd on "[192.168.7.248]" already started.
2021-03-01 15:26:43 begin to start repmgrd on "[192.168.7.243]".
2021-03-01 15:26:43 repmgrd on "[192.168.7.243]" already started.
2021-03-01 15:26:43 begin to start repmgrd on "[192.168.7.249]".
2021-03-01 15:26:43 repmgrd on "[192.168.7.249]" already started.
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node248 | standby |   running | node243  | running | 3589  | no      | 0 second(s) ago    
 2  | node249 | witness | * running | node243  | running | 23739 | no      | 0 second(s) ago    
 3  | node243 | primary | * running |          | running | 30496 | no      | n/a                
sh: /etc/cron.d/KINGBASECRON: Permission denied
sh: /etc/logrotate.d/kingbase: Permission denied
chown: changing ownership of ‘/etc/logrotate.d/kingbase’: Operation not permitted
chmod: changing permissions of ‘/etc/logrotate.d/kingbase’: Operation not permitted
sh: /etc/cron.d/KINGBASECRON: Permission denied
sh: /etc/logrotate.d/kingbase: Permission denied
chown: changing ownership of ‘/etc/logrotate.d/kingbase’: Operation not permitted
chmod: changing permissions of ‘/etc/logrotate.d/kingbase’: Operation not permitted
sh: /etc/cron.d/KINGBASECRON: Permission denied
sh: /etc/logrotate.d/kingbase: Permission denied
chown: changing ownership of ‘/etc/logrotate.d/kingbase’: Operation not permitted
chmod: changing permissions of ‘/etc/logrotate.d/kingbase’: Operation not permitted
2021-03-01 15:26:44 Done.

When the sys_monitor.sh script starts the cluster, it reports permission errors on the files "/etc/cron.d/KINGBASECRON" and "/etc/logrotate.d/kingbase":
[Figure: permission errors reported by sys_monitor.sh]

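Notes:

1)/etc/cron.d/KINGBASECRON is the scheduled task created when the repmgr cluster starts; it is used to start the repmgrd process.
2)/etc/logrotate.d/kingbase is the configuration file used to rotate the hamgr.log and kbha.log logs.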
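Before running sys_monitor.sh as a non-root user, it can help to check whether that user is able to write the two files named above. The sketch below is a hypothetical pre-flight check (the names can_write and preflight are not from the original); it tests write access to the file itself, or to its directory when the file does not exist yet.

```shell
# can_write PATH
# Succeeds if the current user can write PATH, or can create it when it is missing.
can_write() {
    local f="$1"
    if [ -e "$f" ]; then
        [ -w "$f" ]
    else
        [ -w "$(dirname "$f")" ]
    fi
}

# preflight PATH...
# Prints OK or DENIED for each path; returns 1 if any path is not writable.
preflight() {
    local rc=0 f
    for f in "$@"; do
        if can_write "$f"; then
            printf 'OK %s\n' "$f"
        else
            printf 'DENIED %s\n' "$f"
            rc=1
        fi
    done
    return "$rc"
}
```

Run as the kingbase user on each node, for example `preflight /etc/cron.d/KINGBASECRON /etc/logrotate.d/kingbase`; a DENIED line corresponds to the "Permission denied" messages in the log above.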

The part of sys_monitor.sh that handles /etc/cron.d/KINGBASECRON:
[Figure: /etc/cron.d/KINGBASECRON code in sys_monitor.sh]

The part of sys_monitor.sh that handles /etc/logrotate.d/kingbase:
[Figure: /etc/logrotate.d/kingbase code in sys_monitor.sh]

1) Change the permissions of the /etc/cron.d/KINGBASECRON file (all nodes):
[Figure: permission change for /etc/cron.d/KINGBASECRON]

2) Change the permissions of /etc/logrotate.d/kingbase (all nodes):
[Figure: permission change for /etc/logrotate.d/kingbase]

Change the owner of the kingbase file (all nodes):
[Figure: owner change for /etc/logrotate.d/kingbase]

Comment out the statements in the sys_monitor.sh script that change the owner and permissions of the kingbase configuration file:

function init_log_rotate()
{
_host="$1"
_final_target_file="/etc/logrotate.d/kingbase"
eval _rep_log_file=`grep log_file ${rep_conf} | awk -F '=' '{print $2}'`
execute_command ${super_user} $host "\
echo -e '# Generate by sys_monitor.sh at `date`\n\
${kbha_file} {\n\
        weekly\n\
        maxsize 100M\n\
        su ${execute_user} ${execute_user}\n\
        create 0600 ${execute_user} ${execute_user}\n\
        rotate 3\n\
        copytruncate\n\
        dateext\n\
}\n\
${_rep_log_file} {\n\
        weekly\n\
        maxsize 100M\n\
        su ${execute_user} ${execute_user}\n\
        create 0600 ${execute_user} ${execute_user}\n\
        rotate 3\n\
        copytruncate\n\
        dateext\n\
}\n\
' > ${_final_target_file}"
#execute_command ${super_user} $host "chown ${super_user}:${super_user} ${_final_target_file}"
#execute_command ${super_user} $host "chmod 644 ${_final_target_file}"
}

[Figure: the modified init_log_rotate function in sys_monitor.sh]
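When the same change has to be made on several nodes, the two statements can be commented out with sed instead of by hand. This is only a sketch: comment_out_perm_reset is a hypothetical helper, and the pattern assumes the statements look exactly like the two lines quoted above (an execute_command call whose quoted command is chown or chmod on ${_final_target_file}). It would also match any other such call in the script, so the result should be compared with the backup before use.

```shell
# comment_out_perm_reset SCRIPT
# Keeps a backup as SCRIPT.bak, then prefixes '#' to every line that calls
# execute_command with a chown/chmod on ${_final_target_file}.
# Lines that are already commented out are left untouched.
comment_out_perm_reset() {
    local script="$1"
    cp -p "$script" "$script.bak" || return 1
    sed -E '/^[[:space:]]*execute_command .*"(chown|chmod) .*\$\{_final_target_file\}"/ s/^/#/' \
        "$script.bak" > "$script"
}
```

After running `comment_out_perm_reset sys_monitor.sh`, `diff sys_monitor.sh.bak sys_monitor.sh` shows exactly which lines changed.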

IV. Test cluster startup

[kingbase@node3 bin]$ ./sys_monitor.sh restart
2021-03-01 15:52:08 Ready to stop all DB ...
2021-03-01 15:52:08 begin to stop repmgrd on "[192.168.7.248]".
2021-03-01 15:52:08 repmgrd on "[192.168.7.248]" stop success.
2021-03-01 15:52:08 begin to stop repmgrd on "[192.168.7.243]".
2021-03-01 15:52:08 repmgrd on "[192.168.7.243]" stop success.
2021-03-01 15:52:08 begin to stop repmgrd on "[192.168.7.249]".
2021-03-01 15:52:08 repmgrd on "[192.168.7.249]" stop success.
2021-03-01 15:52:08 begin to stop DB on "[192.168.7.248]".
waiting for server to shut down..... done
server stopped2021-03-01 15:52:09 DB on "[192.168.7.248]" stop success.
2021-03-01 15:52:09 begin to stop DB on "[192.168.7.249]".
waiting for server to shut down.... done
server stopped2021-03-01 15:52:10 DB on "[192.168.7.249]" stop success.
2021-03-01 15:52:10 begin to stop DB on "[192.168.7.243]".
waiting for server to shut down..... done
server stopped2021-03-01 15:52:12 DB on "[192.168.7.243]" stop success.
2021-03-01 15:52:12 Done.
2021-03-01 15:52:12 Ready to start all DB ...
2021-03-01 15:52:12 begin to start DB on "[192.168.7.243]".
waiting for server to start.... done
server started2021-03-01 15:52:12 execute to start DB on "[192.168.7.243]" success, connect to check it.
2021-03-01 15:52:13 DB on "[192.168.7.243]" start success.
2021-03-01 15:52:13 Try to ping trusted_servers on host 192.168.7.248 ...
2021-03-01 15:52:15 Try to ping trusted_servers on host 192.168.7.243 ...
2021-03-01 15:52:17 Try to ping trusted_servers on host 192.168.7.249 ...
2021-03-01 15:52:19 begin to start DB on "[192.168.7.248]".
waiting for server to start.... done
server started2021-03-01 15:52:20 execute to start DB on "[192.168.7.248]" success, connect to check it.
2021-03-01 15:52:21 DB on "[192.168.7.248]" start success.
2021-03-01 15:52:21 begin to start DB on "[192.168.7.249]".
waiting for server to start.... done
server started2021-03-01 15:52:21 execute to start DB on "[192.168.7.249]" success, connect to check it.
2021-03-01 15:52:22 DB on "[192.168.7.249]" start success.
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node248 | standby |   running | node243  | default  | 100      | 23       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
 2  | node249 | witness | * running | node243  | default  | 0        | 1        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
 3  | node243 | primary | * running |          | default  | 100      | 23       | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=2
2021-03-01 15:52:22 The primary DB is started.
WARNING: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary
2021-03-01 15:52:46 Success to load virtual ip [192.168.7.240/24] on primary host [192.168.7.243].
2021-03-01 15:52:46 Try to ping vip on host 192.168.7.248 ...
2021-03-01 15:52:48 Try to ping vip on host 192.168.7.243 ...
2021-03-01 15:52:50 Try to ping vip on host 192.168.7.249 ...
2021-03-01 15:52:52 begin to start repmgrd on "[192.168.7.248]".
[2021-03-01 15:54:17] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 15:54:17] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 15:52:52 repmgrd on "[192.168.7.248]" start success.
2021-03-01 15:52:52 begin to start repmgrd on "[192.168.7.243]".
[2021-03-01 15:52:52] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 15:52:52] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 15:52:52 repmgrd on "[192.168.7.243]" start success.
2021-03-01 15:52:52 begin to start repmgrd on "[192.168.7.249]".
[2021-03-01 14:50:47] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 14:50:47] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 15:52:53 repmgrd on "[192.168.7.249]" start success.
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node248 | standby |   running | node243  | running | 13909 | no      | 0 second(s) ago    
 2  | node249 | witness | * running | node243  | running | 28830 | no      | n/a                
 3  | node243 | primary | * running |          | running | 6643  | no      | n/a                
2021-03-01 15:52:53 Done.

The cluster starts normally:
[Figure: cluster started normally]
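The difference between the two runs is visible in the Status column: the first run showed "! running" for node248 together with a warning, while the second shows no "!" marker. A check of that column can be sketched as follows. The helper name flag_nodes is hypothetical, and the rule it applies (flag a node when its status starts with "!" or does not end in "running") is an assumption based only on the tables printed in this article.

```shell
# flag_nodes
# Reads cluster status table text on stdin and prints the name of every node
# whose Status starts with "!" or does not end in "running".
# Returns 1 if any node was flagged.
flag_nodes() {
    awk -F'|' '
        $1 ~ /^ *[0-9]+ *$/ && NF >= 4 {
            name = $2;   gsub(/^ +| +$/, "", name)
            status = $4; gsub(/^ +| +$/, "", status)
            if (status ~ /^!/ || status !~ /running$/) { print name; bad = 1 }
        }
        END { exit bad ? 1 : 0 }
    '
}
```

It can be fed the saved output of sys_monitor.sh, for example `./sys_monitor.sh restart 2>&1 | tee start.log | flag_nodes`; lines that are not table rows are ignored.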

Appendix: handling the /etc/logrotate.d/kingbase permission error

Starting the cluster with the sys_monitor.sh script produced the following error:
[Figure: error output from sys_monitor.sh]

Solution:

[root@node3 ~]# which chmod
/usr/bin/chmod
[root@node3 ~]# which chown
/usr/bin/chown

[root@node3 ~]# ls -lh /usr/bin/chown
-rwxr-xr-x. 1 root root 62K Nov 20  2015 /usr/bin/chown
[root@node3 ~]# ls -lh /usr/bin/chmod
-rwxr-xr-x. 1 root root 58K Nov 20  2015 /usr/bin/chmod

[root@node3 ~]# chmod u+s /usr/bin/chown
[root@node3 ~]# chmod u+s /usr/bin/chmod

[root@node3 ~]# ls -lh /usr/bin/chmod
-rwsr-xr-x. 1 root root 58K Nov 20  2015 /usr/bin/chmod
[root@node3 ~]# ls -lh /usr/bin/chown
-rwsr-xr-x. 1 root root 62K Nov 20  2015 /usr/bin/chown

[root@node3 ~]# ls -lh /etc/logrotate.d/kingbase 
-rw-r--r--. 1 kingbase kingbase 492 Mar  1 15:52 /etc/logrotate.d/kingbase

[root@node3 ~]# su - kingbase
Last login: Mon Mar  1 15:51:39 CST 2021 on pts/1
Last failed login: Mon Mar  1 15:58:21 CST 2021 from :0 on :0
There was 1 failed login attempt since the last successful login.
[kingbase@node3 ~]$ chown root.root /etc/logrotate.d/kingbase 
[kingbase@node3 ~]$ ls -lh  /etc/logrotate.d/kingbase 
-rw-r--r--. 1 root root 492 Mar  1 15:52 /etc/logrotate.d/kingbase
[kingbase@node3 ~]$ chown kingbase.kingbase /etc/logrotate.d/kingbase 
[kingbase@node3 ~]$ ls -lh  /etc/logrotate.d/kingbase 
-rw-r--r--. 1 kingbase kingbase 492 Mar  1 15:52 /etc/logrotate.d/kingbase

# Manually run "sh /etc/logrotate.d/kingbase"
[kingbase@node3 bin]$ sh /etc/logrotate.d/kingbase
/etc/logrotate.d/kingbase: line 2: /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../kbha.log: Permission denied
/etc/logrotate.d/kingbase: line 3: weekly: command not found
/etc/logrotate.d/kingbase: line 4: maxsize: command not found

[kingbase@node3 kingbase]$ chmod u+x kbha.log
[kingbase@node3 kingbase]$ sh /etc/logrotate.d/kingbase
/etc/logrotate.d/kingbase: line 2: /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../kbha.log: Text file busy
/etc/logrotate.d/kingbase: line 3: weekly: command not found
/etc/logrotate.d/kingbase: line 4: maxsize: command not found
Password: 

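Even after the steps above, starting the cluster through the sys_monitor.sh script still reported the "sh: /etc/logrotate.d/kingbase" error. The problem was resolved only after modifying the sys_monitor.sh script.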
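The "weekly: command not found" messages in the manual test appear because /etc/logrotate.d/kingbase is a logrotate configuration file rather than a shell script, so running it with sh does not validate it. One way to check such a file is logrotate's debug mode (-d), which parses the configuration and reports what it would do without rotating anything. The wrapper below is a sketch; the name dry_run_logrotate and the LOGROTATE variable are illustrative and not from the original.

```shell
# dry_run_logrotate FILE
# Parses FILE with logrotate in debug mode (-d); no log is rotated.
# LOGROTATE may point at a different binary (hypothetical knob, e.g. for testing).
dry_run_logrotate() {
    "${LOGROTATE:-logrotate}" -d "$1"
}
```

For example: `dry_run_logrotate /etc/logrotate.d/kingbase`.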
