一:背景介绍
MHA(Master HA)是一款开源的MySQL的高可用工具,能在MySQL主从复制的基础上,实现自动化主服务器故障转移。虽然MHA试图从宕机的主服务器上保存二进制日志,但并不是总是可行的。例如,如果主服务器硬件故障或无法通过ssh访问,MHA没法保存二进制日志,只进行故障转移而丢失最新数据。
MHA监控复制架构的主服务器,一旦检测到主服务器故障,就会自动进行故障转移。即使有些从服务器没有收到最新的relay log,MHA自动从最新的从服务器上识别差异的relay log并把这些日志应用到其他从服务器上,因此所有的从服务器保持一致性了。还可以设置优先级指定其中的一台slave作为master的候选,由于MHA在slaves之间修复一致性,因此可以将slave变成新的master,其他的slave都以其作为新master。
MHA有两部分组成,MHA Manager(管理节点)和MHA Node(数据节点):
1:MHA Manager可以单独部署在一*立机器上管理多个master-slave集群,也可以部署在一台slave上。MHA Manager探测集群的node节点,当发现master出现故障的时它可以自动将具有最新数据的slave提升为新的master,然后将所有其它的slave导向新的master上。整个故障转移过程对应用程序是透明的。
2:MHA node运行在每台MySQL服务器上(master/slave/manager),它通过监控具备解析和清理logs功能的脚本来加快故障转移的。
优点:
找出同步最成功的一台从服务器(也就是与主服务器数据最接近的那台从服务器)自动切换成主服务器。
如果主机还能够访问,从主服务器上找回最新从机与主机间的数据差异。
在每一台从服务器上操作,确定他们缺少哪些events,并分别进行补充。
将最新的一台从服务器提升为主服务器后,将其它从服务器重新指向新的主服务器。
缺点:
当群集内的数据库进行故障转移时,对外提供服务的虚拟IP也进行转移。
MHA管理进程需要以后台守护进程的方式运行,并有监控机制保证MHA管理进程的正常运行。
有监控机制保证当主机出现故障时,MHA能确定进行成功的Failover。
当故障主机恢复后,能重新回到群集中,并成为新的Slave,自动实现重新同步。
由于主机和从机上备份策略不同,进行故障转移后,自动调整cron中的调度(例如全备份)。
二:下载:https://github.com/yoshinorim/mha4mysql-manager
https://code.google.com/p/mysql-master-ha/downloads/list?q=label:Featured
https://code.google.com/p/mysql-master-ha/wiki/Downloads?tm=2 #(5.6)
wget https://mysql-master-ha.googlecode.com/files/mha4mysql-manager-0.55.tar.gz
wget https://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.54.tar.gz
版本说明:https://code.google.com/p/mysql-master-ha/wiki/ReleaseNotes
三:安装
https://code.google.com/p/mysql-master-ha/wiki/Installation#Installing_MHA_Manager
安装完成之后会产生一些相关工具:
Manager工具:
- masterha_check_ssh : 检查MHA的SSH配置。
- masterha_check_repl : 检查MySQL复制。
- masterha_manager : 启动MHA。
- masterha_check_status : 检测当前MHA运行状态。
- masterha_master_monitor : 监测master是否宕机。
- masterha_master_switch : 控制故障转移(自动或手动)。
- masterha_conf_host : 添加或删除配置的server信息。 Node工具:
- save_binary_logs : 保存和复制master的二进制日志。
- apply_diff_relay_logs : 识别差异的中继日志事件并应用于其它slave。
- filter_mysqlbinlog : 去除不必要的ROLLBACK事件(MHA已不再使用这个工具)。
- purge_relay_logs : 清除中继日志(不会阻塞SQL线程)。
【本文是通过源码安装】
1,MHA Manager 安装
## Install dependent Perl modules
# Config::Tiny
## perl -MCPAN -e "install Config::Tiny"
# Log::Dispatch
## perl -MCPAN -e "install Log::Dispatch"
# Parallel::ForkManager
## perl -MCPAN -e "install Parallel::ForkManager"
## Installing MHA Manager
$ tar -zxf mha4mysql-manager-X.Y.tar.gz
$ perl Makefile.PL
*** Module::AutoInstall version 1.03
*** Checking for Perl dependencies...
[Core Features]
- DBI ...loaded. (1.616)
- DBD::mysql ...loaded. (4.020)
- Time::HiRes ...loaded. (1.972101)
- Config::Tiny ...loaded. (2.20)
- Log::Dispatch ...loaded. (2.41)
- Parallel::ForkManager ...loaded. (1.06)
- MHA::NodeConst ...loaded. (0.54) 先装Node,这里才能通过。
*** Module::AutoInstall configuration finished.
Checking if your kit is complete...
Looks good
Generating a Unix-style Makefile
Writing Makefile for mha4mysql::manager
Writing MYMETA.yml and MYMETA.json
$ make
$ sudo make install
2,MHA Node 安装
## Install DBD::mysql if not installed
$ tar -zxf mha4mysql-node-X.Y.tar.gz
$ perl Makefile.PL
*** Module::AutoInstall version 1.03
*** Checking for Perl dependencies...
[Core Features]
- DBI ...loaded. (1.616)
- DBD::mysql ...loaded. (4.020)
*** Module::AutoInstall configuration finished.
Checking if your kit is complete...
Looks good
Generating a Unix-style Makefile
Writing Makefile for mha4mysql::node
Writing MYMETA.yml and MYMETA.json
$ make
$ sudo make install
上面红色标明的不能出现miss,否则后面运行就会有问题。要是miss则需要执行(ubuntu):
sudo apt-get install libdbd-mysql-perl
其他的系统可以看这里。
在安装时候要设置好locale,否则安装也会有问题,可以先运行:
export LC_ALL="zh_CN.UTF-8"
在安装之前需要安装一些依赖的包,如:gcc、iputils-arping等。到此,MHA安装结束了。
三:测试
必要条件:
1)三台主机:201(M/node)、202(S/node)、204(S/manager)。
2)MHA按照上面的方法在三台上已经安装完毕,MySQL的Replication已经搭建完成,复制帐号三台都已经创建。
MySQL上需要设置:set global relay_log_purge=0,或则配置文件里修改。
--relay-log-purge={|} 禁用或启用不再需要中继日志时是否自动清空它们。默认值为1(启用)。这是一个全局变量,可以用SET GLOBAL Relay_log_purge动态更改。
3)MHA环境需要三台主机需要相互信任:实现三台主机之间相互免密钥登录。
[]
用ssh-keygen创建公钥,一直默认回车,最后会在.ssh/下面生成id_rsa.pub
ssh-keygen -t rsa
把id_rsa.pub 文件复制到另外2台机子并导入公钥:
cat id_rsa.pub >> /root/.ssh/authorized_keys 当然也可以直接复制里面的内容再粘帖到authorized_keys中。 这样204可以直接访问到其他2台机子了
ssh 192.168.200.201
ssh 192.168.200.202 同理在另外2台机子上面操作,让他们之间实现无密码登入。 要是root登陆的话,需要修改ssh的配置文件:
vi /etc/ssh/sshd_config 中的 PermitRootLogin ,设置为YES,再重启SSH。
注意:Manager 要是装到某一台MySQL上,则需要自己和自己无密码登入:ssh 192.168.200.204,单独到一台服务器则不需要。
4)MHA的配置
Manager上的配置
mkdir -p /usr/local/masterha/app1 #创建数据文件目录
mkdir /etc/masterha #创建配置文件目录 安装时解压出来的samples中有2个文件:
conf 是存放配置文件的模块
scripts是存放运行时的一些命令,可以把这几个名字移到/user/bin 中。用户下面灰色背景的几个参数中。
添加MHA的配置文件(Manager)
样本:(51-M,52-S,53-S,VIP:50)
[server default]
user=zjy
password=#
ssh_user=root
repl_user=rep
repl_password=rep123456
ping_interval= manager_workdir=/usr/local/masterha/mha_data
manager_log=/usr/local/masterha/manager.log
remote_workdir=/usr/local/masterha/mha_data master_ip_failover_script=/usr/bin/master_ip_failover
#master_ip_online_change_script=/usr/bin/master_ip_online_change
#shutdown_script=/usr/bin/power_manager
report_script =/usr/bin/send_report.py
#secondary_check_script= masterha_secondary_check -s remote_host1 -s remote_host2 [server1]
hostname=192.168.200.51
master_binlog_dir=/var/log/mysql
candidate_master=
check_repl_delay= [server2]
hostname=192.168.200.52
master_binlog_dir=/var/log/mysql
#candidate_master= [server3]
hostname=192.168.200.53
master_binlog_dir=/var/log/mysql
#candidate_master=
修改后的配置文件:[vi /etc/masterha/app1.cnf]
[server default]
user=root #mysql用戶名,
#需要的权限:Super,select,create,insert,update,delete,drop,reload
password= #mysql密码
ssh_user=root #ssh免密钥登录的帐号名
repl_user=rep #mysql复制帐号
repl_password= #mysql复制密码
ping_interval= #ping间隔,用来检测master是否正常,默认是3秒,尝试三次没有回应的时候自动进行failover manager_workdir=/usr/local/masterha/app1 #数据目录,主要该目录的权限,需要有创建的权限
manager_log=/usr/local/masterha/manager.log #日志文件
remote_workdir=/usr/local/masterha/app1 #另外2台机子在运行时候需要创建的目录,注意ssh-keygen帐号的权限问题 #master_ip_failover_script=/usr/bin/master_ip_failover
#master_ip_online_change_script=/usr/bin/master_ip_online_change
#shutdown_script=/usr/bin/power_manager
#report_script=/usr/bin/send_report
#secondary_check_script= masterha_secondary_check -s remote_host1 -s remote_host2
[server1]
hostname=192.168.200.201
master_binlog_dir=/var/lib/mysql #binlog目录
candidate_master= #master机宕掉后,优先启用这台作为新master
check_repl_delay=0 #默认情况下如果一个slave落后master 100M的relay logs的话,MHA将不会选择该slave作为一个新的master,因为对于这个slave的恢复需要花费很长时间,通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时,这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master
[server2]
hostname=192.168.200.202
master_binlog_dir=/var/log/mysql
candidate_master=
[server3] hostname=192.168.200.204
master_binlog_dir=/var/log/mysql
#candidate_master=
进行测试:
◆ masterha_check_ssh工具验证ssh信任登录是否成功
root@db4:/usr/local/masterha# masterha_check_ssh --conf=/etc/masterha/app1.cnf
Wed Jun :: - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Jun :: - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Wed Jun :: - [info] Reading server configurations from /etc/masterha/app1.cnf..
Wed Jun :: - [info] Starting SSH connection tests..
Wed Jun :: - [debug]
Wed Jun :: - [debug] Connecting via SSH from root@192.168.200.201(192.168.200.201:) to root@192.168.200.202(192.168.200.202:)..
Wed Jun :: - [debug] ok.
Wed Jun :: - [debug] Connecting via SSH from root@192.168.200.201(192.168.200.201:) to root@192.168.200.204(192.168.200.204:)..
Wed Jun :: - [debug] ok.
Wed Jun :: - [debug]
Wed Jun :: - [debug] Connecting via SSH from root@192.168.200.202(192.168.200.202:) to root@192.168.200.201(192.168.200.201:)..
Wed Jun :: - [debug] ok.
Wed Jun :: - [debug] Connecting via SSH from root@192.168.200.202(192.168.200.202:) to root@192.168.200.204(192.168.200.204:)..
Wed Jun :: - [debug] ok.
Wed Jun :: - [debug]
Wed Jun :: - [debug] Connecting via SSH from root@192.168.200.204(192.168.200.204:) to root@192.168.200.201(192.168.200.201:)..
Wed Jun :: - [debug] ok.
Wed Jun :: - [debug] Connecting via SSH from root@192.168.200.204(192.168.200.204:) to root@192.168.200.202(192.168.200.202:)..
Wed Jun :: - [debug] ok.
Wed Jun :: - [info] All SSH connection tests passed successfully.
◆ masterha_check_repl工具验证mysql复制是否成功
root@db4:/usr/local/masterha# masterha_check_repl --conf=/etc/masterha/app1.cnf
...
...
192.168.200.201 (current master)
+--192.168.200.202
+--192.168.200.204 Wed Jun :: - [info] Checking replication health on 192.168.200.202..
Wed Jun :: - [info] ok.
Wed Jun :: - [info] Checking replication health on 192.168.200.204..
Wed Jun :: - [info] ok.
Wed Jun :: - [warning] master_ip_failover_script is not defined.
Wed Jun :: - [warning] shutdown_script is not defined.
Wed Jun :: - [info] Got exit code (Not master dead). MySQL Replication Health is OK.
◆ masterha_manager 启动MHA manager
root@db4:~# nohup masterha_manager --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log >&
查看日志文件(/masterha/app1/manager.log) Cleaning up test file(s).. done.
Wed Jun :: - [info] Slaves settings check done.
Wed Jun :: - [info]
192.168.200.201 (current master)
+--192.168.200.202
+--192.168.200.204 Wed Jun :: - [warning] master_ip_failover_script is not defined.
Wed Jun :: - [warning] shutdown_script is not defined.
Wed Jun :: - [info] Set master ping interval seconds.
Wed Jun :: - [info] Set secondary check script: masterha_secondary_check -s remote_host1 -s remote_host2
Wed Jun :: - [info] Starting ping health check on 192.168.200.201(192.168.200.201:)..
Wed Jun :: - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
参数:
--ignore_last_failover:在缺省情况下,如果MHA检测到连续发生宕机,且两次宕机间隔不足8小时的话,则不会进行Failover,之所以这样限制是为了避免ping-pong效应。该参数代表忽略上次MHA触发切换产生的文件,默认情况下,MHA发生切换后会在日志目录,也就是上面我设置的/data产生app1.failover.complete文件,下次再次切换的时候如果发现该目录下存在该文件将不允许触发切换,除非在第一次切换后收到删除该文件
--remove_dead_master_conf: 该参数代表当发生主从切换后,老的主库将会从配置文件中移除,一般情况下不需要开启。即:
开启程序
nohup masterha_manager --ignore_last_failover --remove_dead_master_conf --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log >& +++
故障切换
From:
192.168.200.202 (current master)
+--192.168.200.201
+--192.168.200.204 To:
192.168.200.201 (new master)
+--192.168.200.204
+++ 不先处理故障的机子,可直接运行程序,不加该参数会直接报错:
[error][/usr/local/share/perl/5.14./MHA/ServerManager.pm, ln443] Server 192.168.200.201(192.168.200.201:) is dead, but must be alive! Check server settings. nohup masterha_manager --ignore_last_failover --remove_dead_master_conf --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log >& 可以正常运行,但是配置文件的[server2]选项组被删除
◆ 测试master(201)宕机后,是否会自动切换?这里注意的是确保 /usr/local/masterha/app1/app1.failover.complete 没有这类文件存在。否则报:
remove /usr/local/masterha/app1/app1.failover.complete and run this script again,需要加上参数:--ignore_last_failover
nohup masterha_manager --ignore_last_failover --conf=/etc/masterha/app1.cnf > /tmp/mha_manager.log >&
...
...
Wed Jun :: - [info] Reading server configurations from /etc/masterha/app1.cnf..
Wed Jun :: - [info] Dead Servers:
Wed Jun :: - [info] 192.168.200.201(192.168.200.201:)
Wed Jun :: - [info] Alive Servers:
Wed Jun :: - [info] 192.168.200.202(192.168.200.202:)
Wed Jun :: - [info] 192.168.200.204(192.168.200.204:)
Wed Jun :: - [info] Alive Slaves:
Wed Jun :: - [info] 192.168.200.202(192.168.200.202:) Version=5.5.-0ubuntu0.12.04.-log (oldest major version between slaves) log-bin:enabled
Wed Jun :: - [info] Replicating from 192.168.200.201(192.168.200.201:)
....
....
Wed Jun :: - [info] Master is down!
Wed Jun :: - [info] Starting master failover.
....
....
进行5个阶段
Creating /usr/local/masterha/app1 if not exists.. ok.
...
Dumping binlog format description event, from position to .. ok.
Wed Jun :: - [info] scp from root@192.168.200.201:/usr/local/masterha/app1/saved_master_binlog_from_192.168.200.201_3306_20140625234550.binlog to local:/usr/local/masterha/app1/saved_master_binlog_from_192.168.200.201_3306_20140625234550.binlog succeeded.
...
Wed Jun :: - [info] New master is 192.168.200.202(192.168.200.202:)
Wed Jun :: - [info] Starting master failover..
Wed Jun :: - [info]
From:
192.168.200.201 (current master)
+--192.168.200.202
+--192.168.200.204 To:
192.168.200.202 (new master)
+--192.168.200.204
...
Applying log files succeeded.
Wed Jun :: - [info] All relay logs were successfully applied.
Wed Jun :: - [info] Getting new master's binlog name and position..
Wed Jun :: - [info] mysql-bin.:
Wed Jun :: - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.200.202', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=107, MASTER_USER='rep', MASTER_PASSWORD='xxx';
Wed Jun :: - [info] ** Finished master recovery successfully.
...
Wed Jun :: - [info] Executed CHANGE MASTER.
Wed Jun :: - [info] Slave started.
Wed Jun :: - [info] All new slave servers recovered successfully.
...
----- Failover Report ----- app1: MySQL Master failover 192.168.200.201 to 192.168.200.202 succeeded Master 192.168.200.201 is down! Check MHA Manager logs at db4:/usr/local/masterha/manager.log for details. Started automated(non-interactive) failover.
The latest slave 192.168.200.202(192.168.200.202:) has all relay logs for recovery.
Selected 192.168.200.202 as a new master.
192.168.200.202: OK: Applying all logs succeeded.
192.168.200.204: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.200.204: OK: Applying all logs succeeded. Slave started, replicating from 192.168.200.202.
192.168.200.202: Resetting slave info succeeded.
Master failover to 192.168.200.202(192.168.200.202:) completed successfully.
上面只给出了一些比较重要的信息,更多的信息见配置文件指定的日志文件。大致的信息:
主(201)关闭后,查找最新的slave,通过scp把二进制日志应用过去,然后最近的从202执行了reset slave,成了新主,204从202开始同步。
老的主(201)开启之后,需要手动重新同步到新主。通过日志里的:
CHANGE MASTER TO MASTER_HOST='192.168.200.202', MASTER_PORT=, MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=, MASTER_USER='rep', MASTER_PASSWORD='xxx';
再重新开启MHA程序,这里需要注意:
一旦发生切换管理进程(Manager)将会退出,无法进行再次测试,需将故障数据库解决掉之后,重新change加入到MHA环境中来,并且要保证app1.failover.complete不存在或则加上--ignore_last_failover参数忽略,才能再次开启管理进程。
◆ masterha_master_switch工具进行主从切换,在手动切换的同时需要保证没有启用MHA自动切换功能。
root@db4:/usr/local/masterha# masterha_master_switch --master_state=alive --conf=/etc/masterha/app1.cnf
Thu Jun :: - [info] MHA::MasterRotate version 0.55.
Thu Jun :: - [info] Starting online master switch..
...
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.200.202(192.168.200.202:)? (YES/no): yes
...
From:
192.168.200.202 (current master)
+--192.168.200.201
+--192.168.200.204 To:202消失
192.168.200.201 (new master)
+--192.168.200.204 Starting master switch from 192.168.200.202(192.168.200.202:3306) to 192.168.200.201(192.168.200.201:3306)? (yes/NO): yes
...
master_ip_online_change_script is not defined. If you do not disable writes on the current master manually, applications keep writing on the current master. Is it ok to proceed? (yes/NO): yes
...
Thu Jun :: - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.200.201', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin-3306.000001', MASTER_LOG_POS=107, MASTER_USER='rep', MASTER_PASSWORD='xxx';
...
Thu Jun :: - [info] All new slave servers switched successfully.
Thu Jun :: - [info] 192.168.200.201: Resetting slave info succeeded.
Thu Jun :: - [info] Switching master to 192.168.200.201(192.168.200.201:) completed successfully.
加粗部分说明主从切换后的架构变化,也会在命令行提示你是否CHANGE到新主。切换之后老主(202)将会被独立出来。想要把202重新加入到复制中,则只需要执行打印出来的CHANGE MASTER 语句。也可以指定一个IP作为主:
masterha_master_switch --master_state=alive --conf=/etc/masterha/app1.cnf --new_master_host=192.168.200.201
指定201为新主,切换完之后,老主被独立。
◆ masterha_check_status 工具 查看MHA程序的状态:
root@db4:/usr/local/masterha# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(:NOT_RUNNING).
root@db4:/usr/local/masterha# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 monitoring program is now on initialization phase(:INITIALIZING_MONITOR). Wait for a while and try checking again.
root@db4:/usr/local/masterha# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:29232) is running(0:PING_OK), master:192.168.200.201
通过上面的信息可以看到MHA程序是否正在运行。
◆ masterha_stop 工具关闭Manager进程。
root@db4:/usr/local/masterha# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
◆ 配置VIP:master_ip_failover_script=/usr/bin/master_ip_failover。在用虚拟IP的时候,需要在开启MHA程序之前要把虚拟IP先设置到主上去,否则MHA是不会自己的去设置VIP,第一次设置VIP之后,后续脚本的故障转移等功能会自动的对VIP进行切换。
设置:
ifconfig eth0: 192.168.100.xxx
移除:
ifconfig eth0: down
脚本样本:(51-M,52-S,53-S,VIP:50)
#!/usr/bin/env perl # Copyright (C) DeNA Co.,Ltd.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc.,
# Franklin Street, Fifth Floor, Boston, MA - USA ## Note: This is a sample script and is not complete. Modify the script based on your environment. use strict;
use warnings FATAL => 'all'; use Getopt::Long; my (
$command, $ssh_user, $orig_master_host, $orig_master_ip,
$orig_master_port, $new_master_host, $new_master_ip, $new_master_port
); my $vip = '192.168.200.50/24';
my $key = '0';
my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down";
my $ssh_Bcast_arp = "/usr/bin/arping -c 3 -A 192.168.200.50"; #ARP回复模式,更新邻居。要是不加则服务器会自动等到vip缓存失效,期间VIP会有一定时间的不可用。 GetOptions(
'command=s' => \$command,
'ssh_user=s' => \$ssh_user,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
); exit &main(); sub main { print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n"; if ( $command eq "stop" || $command eq "stopssh" ) { my $exit_code = ;
eval {
print "Disabling the VIP on old master: $orig_master_host \n";
&stop_vip();
$exit_code = ;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) { my $exit_code = ;
eval {
print "Enabling the VIP - $vip on the new master - $new_master_host \n";
&start_vip();
&start_arp();
$exit_code = ;
};
if ($@) {
warn $@;
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {
print "Checking the Status of the script.. OK \n";
exit ;
} else {
&usage();
exit ;
}
} sub start_vip() {
`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub start_arp() {
`ssh $ssh_user\@$new_master_host \" $ssh_Bcast_arp \"`;
}
sub stop_vip() {
`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
} sub usage {
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
在使用上面的arping 时候,需要安装(3台都需要安装):
apt-get install iputils-arping hash -r 更新缓存路径
关于arping 请参考:http://blog.sina.com.cn/s/blog_88b1c39501011sr1.html
◆ 邮件、短信通知:report_script=/usr/bin/send_report
脚本样本:(51-M,52-S,53-S,VIP:50)
#!/usr/bin/env python
#-*- encoding:utf-8 -*-
#-------------------------------------------------------------------------------
# Name: send_report.py
# Author: zhoujy
#----------------------------------------------
import os
import sys
import time
import datetime
import smtplib
import subprocess
import fileinput
import getopt
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.Utils import COMMASPACE, formatdate reload(sys)
sys.setdefaultencoding('utf8') def send_mail(to, subject, text, from_mail, server="localhost"):
message = MIMEMultipart()
message['From'] = from_mail
message['To'] = COMMASPACE.join(to)
message['Date'] = formatdate(localtime=True)
message['Subject'] = subject
message.attach(MIMEText(text,_charset='utf-8'))
smtp = smtplib.SMTP(server)
smtp.sendmail(from_mail, to, message.as_string())
smtp.close() if __name__ == "__main__":
opts,args = getopt.getopt(sys.argv[1:],"h",["orig_master_host=","new_master_host=","new_slave_hosts=","conf=","subject=","body=","app_vip=","new_master_ssh_port=","ssh_user="])
# print opts,args
for lines in opts:
key,values = lines
if key == '--orig_master_host':
orig_master_host = values
if key == '--new_master_host':
new_master_host = values
if key == '--new_slave_hosts':
new_slave_hosts = values
if key == '--subject':
subject = values
if key == '--body':
body = values
# text = sys.stdin.read()
mail_list = ['zjy@xxx.com']
send_mail(mail_list, subject.encode("utf8"), body, "MHA_Monitor@smtp.dxy.cn", server="192.168.220.251")
关于getopt 参数请参考:http://blog.sina.com.cn/s/blog_88b1c39501011sr1.html
到此MySQL HA已经搭建完成,下面开始针对其进行崩溃转移测试。
◆ 测试:
Master:51,Slave:52,53,VIP:50
工具:sysbench
1:生成测试数据:
sysbench --test=oltp --mysql-table-engine=innodb --mysql-host=192.168.200.50 --mysql-db=mha_test --oltp-table-size= --mysql-user=zjy --mysql-password=# prepare
2:在生成测试数据期间,在52上执行:关闭其IO线程,模拟延迟。53保持正常。
stop slave io_thread;
3:测试数据生成完毕之后执行:模拟主的并发等压力测试
sysbench --num-threads= --test=oltp --mysql-table-engine=innodb --mysql-host=192.168.200.50 --mysql-db=mha_test --oltp-table-size= --mysql-user=zjy --mysql-password=aaaaa# run
4:同时开启52的IO线程:继续同步M上的数据。
start slave io_thread;
5:关闭主库mysql进程,模拟主库发生故障,进行自动failover操作:
/etc/init.d/mysql stop
6:最后查看日志:manager.log
Wed Oct :: - [warning] Got error on MySQL select ping: (MySQL server has gone away)
Wed Oct :: - [info] Executing SSH check script: save_binary_logs --command=test --start_pos= --binlog_dir=/var/log/mysql --output_file=/usr/local/masterha/mha_data/save_binary_logs_test --manager_version=0.55 --binlog_prefix=mysql-bin51
Wed Oct :: - [info] HealthCheck: SSH to 192.168.200.51 is reachable.
Wed Oct :: - [warning] Got error on MySQL connect: (Can't connect to MySQL server on '192.168.200.51' (111))
Wed Oct :: - [warning] Connection failed time(s)..
Wed Oct :: - [warning] Got error on MySQL connect: (Can't connect to MySQL server on '192.168.200.51' (111))
Wed Oct :: - [warning] Connection failed time(s)..
Wed Oct :: - [warning] Got error on MySQL connect: (Can't connect to MySQL server on '192.168.200.51' (111))
Wed Oct :: - [warning] Connection failed time(s)..
Wed Oct :: - [warning] Master is not reachable from health checker!
Wed Oct :: - [warning] Master 192.168.200.51(192.168.200.51:) is not reachable!
Wed Oct :: - [warning] SSH is reachable.
Wed Oct :: - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/mha.cnf again, and trying to connect to all servers to check server status..
Wed Oct :: - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Oct :: - [info] Reading application default configurations from /etc/masterha/mha.cnf..
Wed Oct :: - [info] Reading server configurations from /etc/masterha/mha.cnf..
Wed Oct :: - [info] Dead Servers:
Wed Oct :: - [info] 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info] Alive Servers:
Wed Oct :: - [info] 192.168.200.52(192.168.200.52:)
Wed Oct :: - [info] 192.168.200.53(192.168.200.53:)
Wed Oct :: - [info] Alive Slaves:
Wed Oct :: - [info] 192.168.200.52(192.168.200.52:) Version=5.5.-0ubuntu0.12.04.-log (oldest major version between slaves) log-bin:enabled
Wed Oct :: - [info] Replicating from 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info] 192.168.200.53(192.168.200.53:) Version=5.5.-0ubuntu0.12.04.-log (oldest major version between slaves) log-bin:enabled
Wed Oct :: - [info] Replicating from 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info] Checking slave configurations..
Wed Oct :: - [info] read_only= is not set on slave 192.168.200.52(192.168.200.52:).
Wed Oct :: - [warning] relay_log_purge= is not set on slave 192.168.200.52(192.168.200.52:).
Wed Oct :: - [info] read_only= is not set on slave 192.168.200.53(192.168.200.53:).
Wed Oct :: - [warning] relay_log_purge= is not set on slave 192.168.200.53(192.168.200.53:).
Wed Oct :: - [info] Checking replication filtering settings..
Wed Oct :: - [info] Replication filtering check ok.
Wed Oct :: - [info] Master is down!
Wed Oct :: - [info] Terminating monitoring script.
Wed Oct :: - [info] Got exit code (Master dead).
Wed Oct :: - [info] MHA::MasterFailover version 0.55.
Wed Oct :: - [info] Starting master failover.
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase : Configuration Check Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] Dead Servers:
Wed Oct :: - [info] 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info] Checking master reachability via mysql(double check)..
Wed Oct :: - [info] ok.
Wed Oct :: - [info] Alive Servers:
Wed Oct :: - [info] 192.168.200.52(192.168.200.52:)
Wed Oct :: - [info] 192.168.200.53(192.168.200.53:)
Wed Oct :: - [info] Alive Slaves:
Wed Oct :: - [info] 192.168.200.52(192.168.200.52:) Version=5.5.-0ubuntu0.12.04.-log (oldest major version between slaves) log-bin:enabled
Wed Oct :: - [info] Replicating from 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info] 192.168.200.53(192.168.200.53:) Version=5.5.-0ubuntu0.12.04.-log (oldest major version between slaves) log-bin:enabled
Wed Oct :: - [info] Replicating from 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info] ** Phase : Configuration Check Phase completed.
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase : Dead Master Shutdown Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] Forcing shutdown so that applications never connect to the current master..
Wed Oct :: - [info] Executing master IP deactivatation script:
Wed Oct :: - [info] /usr/bin/master_ip_failover --orig_master_host=192.168.200.51 --orig_master_ip=192.168.200.51 --orig_master_port= --command=stopssh --ssh_user=root IN SCRIPT TEST====/sbin/ifconfig eth0: down==/sbin/ifconfig eth0: 192.168.200.50/=== Disabling the VIP on old master: 192.168.200.51
Wed Oct :: - [info] done.
Wed Oct :: - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Wed Oct :: - [info] * Phase : Dead Master Shutdown Phase completed.
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase : Master Recovery Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase 3.1: Getting Latest Slaves Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] The latest binary log file/position on all slaves is mysql-bin51.:
Wed Oct :: - [info] Latest slaves (Slaves that received relay log files to the latest):
Wed Oct :: - [info] 192.168.200.52(192.168.200.52:) Version=5.5.-0ubuntu0.12.04.-log (oldest major version between slaves) log-bin:enabled
Wed Oct :: - [info] Replicating from 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info] 192.168.200.53(192.168.200.53:) Version=5.5.-0ubuntu0.12.04.-log (oldest major version between slaves) log-bin:enabled
Wed Oct :: - [info] Replicating from 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info] The oldest binary log file/position on all slaves is mysql-bin51.:
Wed Oct :: - [info] Oldest slaves:
Wed Oct :: - [info] 192.168.200.52(192.168.200.52:) Version=5.5.-0ubuntu0.12.04.-log (oldest major version between slaves) log-bin:enabled
Wed Oct :: - [info] Replicating from 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info] 192.168.200.53(192.168.200.53:) Version=5.5.-0ubuntu0.12.04.-log (oldest major version between slaves) log-bin:enabled
Wed Oct :: - [info] Replicating from 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] Fetching dead master's binary logs..
Wed Oct :: - [info] Executing command on the dead master 192.168.200.51(192.168.200.51:): save_binary_logs --command=save --start_file=mysql-bin51. --start_pos= --binlog_dir=/var/log/mysql --output_file=/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog --handle_raw_binlog= --disable_log_bin= --manager_version=0.55
Creating /usr/local/masterha/mha_data if not exists.. ok.
Concat binary/relay logs from mysql-bin51. pos to mysql-bin51. EOF into /usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog ..
Dumping binlog format description event, from position to .. ok.
Dumping effective binlog data from /var/log/mysql/mysql-bin51. position to tail().. ok.
Concat succeeded.
Wed Oct :: - [info] scp from root@192.168.200.51:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog to local:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog succeeded.
Wed Oct :: - [info] HealthCheck: SSH to 192.168.200.52 is reachable.
Wed Oct :: - [info] HealthCheck: SSH to 192.168.200.53 is reachable.
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase 3.3: Determining New Master Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Wed Oct :: - [info] All slaves received relay logs to the same position. No need to resync each other.
Wed Oct :: - [warning] Slave 192.168.200.52(192.168.200.52:) SQL Thread delays too much. Latest log file:mysql-bin51.:, Current log file:mysql-bin51.:. This server is not selected as a new master because recovery will take long time.
Wed Oct :: - [info] Searching new master from slaves..
Wed Oct :: - [info] Candidate masters from the configuration file:
Wed Oct :: - [info] Non-candidate masters:
Wed Oct :: - [info] 192.168.200.52(192.168.200.52:) Version=5.5.-0ubuntu0.12.04.-log (oldest major version between slaves) log-bin:enabled
Wed Oct :: - [info] Replicating from 192.168.200.51(192.168.200.51:)
Wed Oct :: - [info] Searching from all slaves which have received the latest relay log events..
Wed Oct :: - [info] New master is 192.168.200.53(192.168.200.53:)
Wed Oct :: - [info] Starting master failover..
Wed Oct :: - [info]
From:
192.168.200.51 (current master)
+--192.168.200.52
+--192.168.200.53 To:
192.168.200.53 (new master)
+--192.168.200.52
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Wed Oct :: - [info] Sending binlog..
Wed Oct :: - [info] scp from local:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog to root@192.168.200.53:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog succeeded.
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase 3.4: Master Log Apply Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Wed Oct :: - [info] Starting recovery on 192.168.200.53(192.168.200.53:)..
Wed Oct :: - [info] Generating diffs succeeded.
Wed Oct :: - [info] Waiting until all relay logs are applied.
Wed Oct :: - [info] done.
Wed Oct :: - [info] Getting slave status..
Wed Oct :: - [info] This slave(192.168.200.53)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin51.000080:102992428). No need to recover from Exec_Master_Log_Pos.
Wed Oct :: - [info] Connecting to the target slave host 192.168.200.53, running recover script..
Wed Oct :: - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='zjy' --slave_host=192.168.200.53 --slave_ip=192.168.200.53 --slave_port= --apply_files=/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog --workdir=/usr/local/masterha/mha_data --target_version=5.5.-0ubuntu0.12.04.-log --timestamp= --handle_raw_binlog= --disable_log_bin= --manager_version=0.55 --slave_pass=xxx
Wed Oct :: - [info]
Applying differential binary/relay log files /usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog on 192.168.200.53:. This may take long time...
Applying log files succeeded.
Wed Oct :: - [info] All relay logs were successfully applied.
Wed Oct :: - [info] Getting new master's binlog name and position..
Wed Oct :: - [info] mysql-bin53.:
Wed Oct :: - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.200.53', MASTER_PORT=, MASTER_LOG_FILE='mysql-bin53.000048', MASTER_LOG_POS=, MASTER_USER='rep', MASTER_PASSWORD='xxx';
Wed Oct :: - [info] Executing master IP activate script:
Wed Oct :: - [info] /usr/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.200.51 --orig_master_ip=192.168.200.51 --orig_master_port= --new_master_host=192.168.200.53 --new_master_ip=192.168.200.53 --new_master_port= --new_master_user='zjy' --new_master_password='aaa#'
Unknown option: new_master_user
Unknown option: new_master_password IN SCRIPT TEST====/sbin/ifconfig eth0: down==/sbin/ifconfig eth0: 192.168.200.50/=== Enabling the VIP - 192.168.200.50/ on the new master - 192.168.200.53
bash: /usr/bin/arping: No such file or directory
Wed Oct :: - [info] OK.
Wed Oct :: - [info] ** Finished master recovery successfully.
Wed Oct :: - [info] * Phase : Master Recovery Phase completed.
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase : Slaves Recovery Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] -- Slave diff file generation on host 192.168.200.52(192.168.200.52:) started, pid: . Check tmp log /usr/local/masterha/mha_data/192.168..52_3306_20141015162314.log if it takes time..
Wed Oct :: - [info]
Wed Oct :: - [info] Log messages from 192.168.200.52 ...
Wed Oct :: - [info]
Wed Oct :: - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Wed Oct :: - [info] End of log messages from 192.168.200.52.
Wed Oct :: - [info] -- 192.168.200.52(192.168.200.52:) has the latest relay log events.
Wed Oct :: - [info] Generating relay diff files from the latest slave succeeded.
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Wed Oct :: - [info]
Wed Oct :: - [info] -- Slave recovery on host 192.168.200.52(192.168.200.52:) started, pid: . Check tmp log /usr/local/masterha/mha_data/192.168..52_3306_20141015162314.log if it takes time..
Wed Oct :: - [info]
Wed Oct :: - [info] Log messages from 192.168.200.52 ...
Wed Oct :: - [info]
Wed Oct :: - [info] Sending binlog..
Wed Oct :: - [info] scp from local:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog to root@192.168.200.52:/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog succeeded.
Wed Oct :: - [info] Starting recovery on 192.168.200.52(192.168.200.52:)..
Wed Oct :: - [info] Generating diffs succeeded.
Wed Oct :: - [info] Waiting until all relay logs are applied.
Wed Oct :: - [info] done.
Wed Oct :: - [info] Getting slave status..
Wed Oct :: - [info] This slave(192.168.200.52)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin51.000080:102992428). No need to recover from Exec_Master_Log_Pos.
Wed Oct :: - [info] Connecting to the target slave host 192.168.200.52, running recover script..
Wed Oct :: - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='zjy' --slave_host=192.168.200.52 --slave_ip=192.168.200.52 --slave_port= --apply_files=/usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog --workdir=/usr/local/masterha/mha_data --target_version=5.5.-0ubuntu0.12.04.-log --timestamp= --handle_raw_binlog= --disable_log_bin= --manager_version=0.55 --slave_pass=xxx
Wed Oct :: - [info]
Applying differential binary/relay log files /usr/local/masterha/mha_data/saved_master_binlog_from_192.168.200.51_3306_20141015162314.binlog on 192.168.200.52:. This may take long time...
Applying log files succeeded. Wed Oct :: - [info] All relay logs were successfully applied.
Wed Oct :: - [info] Resetting slave 192.168.200.52(192.168.200.52:) and starting replication from the new master 192.168.200.53(192.168.200.53:)..
Wed Oct :: - [info] Executed CHANGE MASTER.
Wed Oct :: - [info] Slave started.
Wed Oct :: - [info] End of log messages from 192.168.200.52.
Wed Oct :: - [info] -- Slave recovery on host 192.168.200.52(192.168.200.52:) succeeded.
Wed Oct :: - [info] All new slave servers recovered successfully.
Wed Oct :: - [info]
Wed Oct :: - [info] * Phase : New master cleanup phase..
Wed Oct :: - [info]
Wed Oct :: - [info] Resetting slave info on the new master..
Wed Oct :: - [info] 192.168.200.53: Resetting slave info succeeded.
Wed Oct :: - [info] Master failover to 192.168.200.53(192.168.200.53:) completed successfully.
Wed Oct :: - [info] ----- Failover Report ----- mha: MySQL Master failover 192.168.200.51 to 192.168.200.53 succeeded Master 192.168.200.51 is down! Check MHA Manager logs at ha1:/usr/local/masterha/manager.log for details. Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.200.51.
The latest slave 192.168.200.52(192.168.200.52:) has all relay logs for recovery.
Selected 192.168.200.53 as a new master.
192.168.200.53: OK: Applying all logs succeeded.
192.168.200.53: OK: Activated master IP address.
192.168.200.52: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.200.52: OK: Applying all logs succeeded. Slave started, replicating from 192.168.200.53.
192.168.200.53: Resetting slave info succeeded.
Master failover to 192.168.200.53(192.168.200.53:) completed successfully.
Wed Oct :: - [info] Sending mail..
从上面的输出可以看出整个MHA的切换过程,共包括以下的步骤:
1.配置文件检查阶段,这个阶段会检查整个集群配置文件配置
2.宕机的master处理,这个阶段包括虚拟ip摘除操作,主机关机操作等
3.复制dead maste和最新slave相差的relay log,并保存到MHA Manger具体的目录下
4.识别含有最新更新的slave
5.应用从master保存的二进制日志事件(binlog events)
6.提升一个slave为新的master进行复制
7.使其他的slave连接新的master进行复制
可以通过Failover Report可以看到 Failover 成功,这里需要注意的是:在51宕机之后,52(延迟很大)进行追赶的时候,虚拟IP会暂时保留在原先的地方,只要追赶上去之后VIP就会切到最新的Slave上(53)。并且进行一次切换,masterha_manager 就自动停止,需要再次手动开启才能再进行监控。上面的测试说明:延迟越久,切换也越久。即使没有延迟也至少要3s以上的切换时间,因为在M发生宕机的时候需要3s的重连验证。
更多信息:
https://code.google.com/p/mysql-master-ha/w/list
http://www.cnblogs.com/gomysql/p/3675429.html
参数: