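1. Failover
Article (34) noted that failover_state gets set to SENTINEL_FAILOVER_STATE_UPDATE_CONFIG. Note that this state is not handled by sentinelFailoverStateMachine, the function discussed earlier: the five states that function handles do not include it. Besides this one, there is one more state that function does not handle: SENTINEL_FAILOVER_STATE_NONE.
The full set of failover states is listed below: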
/* Failover machine different states. */
#define SENTINEL_FAILOVER_STATE_NONE 0 /* No failover in progress. */
#define SENTINEL_FAILOVER_STATE_WAIT_START 1 /* Wait for failover_start_time*/
#define SENTINEL_FAILOVER_STATE_SELECT_SLAVE 2 /* Select slave to promote */
#define SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE 3 /* Slave -> Master */
#define SENTINEL_FAILOVER_STATE_WAIT_PROMOTION 4 /* Wait slave to change role */
#define SENTINEL_FAILOVER_STATE_RECONF_SLAVES 5 /* SLAVEOF newmaster */
#define SENTINEL_FAILOVER_STATE_UPDATE_CONFIG 6 /* Monitor promoted slave. */
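There are seven states in total. The state with value 0 (SENTINEL_FAILOVER_STATE_NONE) means that no failover is in progress. Values 1 through 5 are the states handled by sentinelFailoverStateMachine, which was analyzed earlier. The last state, with value 6, is the one mentioned in article (34).
A failover is complete only once this state has been reset to 0 and the related fields have been returned to their no-failover values. So even though the Redis servers could already serve requests normally at the end of the analysis in article (34), the failover process itself had not yet finished.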
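The split of responsibilities described above can be written down as a small sketch. The state values are copied from the listing; the two helper functions are hypothetical names introduced only for illustration and do not exist in sentinel.c.

```c
#include <stdbool.h>

/* State values copied from the listing above. */
#define SENTINEL_FAILOVER_STATE_NONE 0
#define SENTINEL_FAILOVER_STATE_WAIT_START 1
#define SENTINEL_FAILOVER_STATE_SELECT_SLAVE 2
#define SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE 3
#define SENTINEL_FAILOVER_STATE_WAIT_PROMOTION 4
#define SENTINEL_FAILOVER_STATE_RECONF_SLAVES 5
#define SENTINEL_FAILOVER_STATE_UPDATE_CONFIG 6

/* Hypothetical helper: true for the five states (1..5) that, according to
 * the text, are handled by sentinelFailoverStateMachine. */
bool handled_by_state_machine(int state) {
    return state >= SENTINEL_FAILOVER_STATE_WAIT_START &&
           state <= SENTINEL_FAILOVER_STATE_RECONF_SLAVES;
}

/* Hypothetical helper: the failover only counts as finished once the
 * state is back to NONE, not when it reaches UPDATE_CONFIG. */
bool failover_finished(int state) {
    return state == SENTINEL_FAILOVER_STATE_NONE;
}
```

With these helpers, SENTINEL_FAILOVER_STATE_UPDATE_CONFIG is neither handled by the state machine nor finished, which is exactly the gap the rest of this article covers.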
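The SENTINEL_FAILOVER_STATE_UPDATE_CONFIG state is handled in sentinelHandleDictOfRedisInstances, which was mentioned in article (26). It is called from Sentinel's sentinelTimer function, and internally it calls an important function, sentinelHandleRedisInstance. The full body of sentinelHandleDictOfRedisInstances is as follows: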
/* Perform scheduled operations for all the instances in the dictionary.
 * Recursively call the function against dictionaries of slaves. */
void sentinelHandleDictOfRedisInstances(dict *instances) {
    dictIterator *di;
    dictEntry *de;
    sentinelRedisInstance *switch_to_promoted = NULL;
    /* There are a number of things we need to perform against every master. */
    di = dictGetIterator(instances);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        sentinelHandleRedisInstance(ri);
        if (ri->flags & SRI_MASTER) {
            sentinelHandleDictOfRedisInstances(ri->slaves);
            sentinelHandleDictOfRedisInstances(ri->sentinels);
            if (ri->failover_state == SENTINEL_FAILOVER_STATE_UPDATE_CONFIG) {
                switch_to_promoted = ri;
            }
        }
    }
    if (switch_to_promoted)
        sentinelFailoverSwitchToPromotedSlave(switch_to_promoted);
    dictReleaseIterator(di);
}
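The state we are analyzing, SENTINEL_FAILOVER_STATE_UPDATE_CONFIG, appears in the innermost if statement of the while loop. The logic is simple: if a master is in this state, it is stored in the local variable switch_to_promoted, and once the loop has finished, the body of the if (switch_to_promoted) statement runs. That body calls a single function: sentinelFailoverSwitchToPromotedSlave.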
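The shape of this loop, "remember the candidate while iterating, act on it after the loop", can be sketched in isolation. Everything below is a toy model: toy_master, toy_switch_to_promoted and toy_handle_masters are hypothetical stand-ins, an array replaces the dict, and the switch itself is reduced to copying a port number.

```c
#include <stddef.h>

#define STATE_NONE 0
#define STATE_UPDATE_CONFIG 6

/* Hypothetical, heavily simplified stand-in for sentinelRedisInstance. */
typedef struct {
    int failover_state;
    int port;           /* port of the current master */
    int promoted_port;  /* port of the slave chosen for promotion */
} toy_master;

/* Stand-in for sentinelFailoverSwitchToPromotedSlave: adopt the promoted
 * slave's address and mark the failover as finished. */
void toy_switch_to_promoted(toy_master *m) {
    m->port = m->promoted_port;
    m->failover_state = STATE_NONE;
}

/* Mirrors the loop shape in the listing: only record the master during
 * iteration, then perform the switch once the loop is done.
 * Returns the master that was switched, or NULL if there was none. */
toy_master *toy_handle_masters(toy_master *masters, size_t n) {
    toy_master *switch_to_promoted = NULL;
    for (size_t i = 0; i < n; i++) {
        if (masters[i].failover_state == STATE_UPDATE_CONFIG)
            switch_to_promoted = &masters[i];
    }
    if (switch_to_promoted) toy_switch_to_promoted(switch_to_promoted);
    return switch_to_promoted;
}
```

As in the original listing, a single pointer is kept, so at most one master is switched per call. A plausible reason for deferring the switch is that it rebuilds the master's internal state, which is safer to do outside the iteration, but the source itself does not state the motivation.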
sentinelFailoverSwitchToPromotedSlave is defined as follows:
/* This function is called when the slave is in
 * SENTINEL_FAILOVER_STATE_UPDATE_CONFIG state. In this state we need
 * to remove it from the master table and add the promoted slave instead. */
void sentinelFailoverSwitchToPromotedSlave(sentinelRedisInstance *master) {
    sentinelRedisInstance *ref = master->promoted_slave ?
                                 master->promoted_slave : master;
    sentinelEvent(LL_WARNING,"+switch-master",master,"%s %s %d %s %d",
        master->name, master->addr->ip, master->addr->port,
        ref->addr->ip, ref->addr->port);
    sentinelResetMasterAndChangeAddress(master,ref->addr->ip,ref->addr->port);
}
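First look at the assignment to the local variable ref. It takes the value of master->promoted_slave, falling back to master itself when that field is NULL. The promoted_slave field appeared earlier, in the handling of the SELECT_SLAVE state: handling that state calls sentinelFailoverSelectSlave, which picks one of the slaves to become the new master and records the chosen instance in promoted_slave.
Next comes the call to sentinelResetMasterAndChangeAddress, which is defined as follows: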
/* Reset the specified master with sentinelResetMaster(), and also change
 * the ip:port address, but take the name of the instance unmodified.
 *
 * This is used to handle the +switch-master event.
 *
 * The function returns C_ERR if the address can't be resolved for some
 * reason. Otherwise C_OK is returned. */
int sentinelResetMasterAndChangeAddress(sentinelRedisInstance *master, char *ip, int port) {
    sentinelAddr *oldaddr, *newaddr;
    sentinelAddr **slaves = NULL;
    int numslaves = 0, j;
    dictIterator *di;
    dictEntry *de;
    newaddr = createSentinelAddr(ip,port);
    if (newaddr == NULL) return C_ERR;
    /* Make a list of slaves to add back after the reset.
     * Don't include the one having the address we are switching to. */
    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);
        if (sentinelAddrIsEqual(slave->addr,newaddr)) continue;
        slaves = zrealloc(slaves,sizeof(sentinelAddr*)*(numslaves+1));
        slaves[numslaves++] = createSentinelAddr(slave->addr->ip,
                                                 slave->addr->port);
    }
    dictReleaseIterator(di);
    /* If we are switching to a different address, include the old address
     * as a slave as well, so that we'll be able to sense / reconfigure
     * the old master. */
    if (!sentinelAddrIsEqual(newaddr,master->addr)) {
        slaves = zrealloc(slaves,sizeof(sentinelAddr*)*(numslaves+1));
        slaves[numslaves++] = createSentinelAddr(master->addr->ip,
                                                 master->addr->port);
    }
    /* Reset and switch address. */
    sentinelResetMaster(master,SENTINEL_RESET_NO_SENTINELS);
    oldaddr = master->addr;
    master->addr = newaddr;
    master->o_down_since_time = 0;
    master->s_down_since_time = 0;
    /* Add slaves back. */
    for (j = 0; j < numslaves; j++) {
        sentinelRedisInstance *slave;
        slave = createSentinelRedisInstance(NULL,SRI_SLAVE,slaves[j]->ip,
                    slaves[j]->port, master->quorum, master);
        releaseSentinelAddr(slaves[j]);
        if (slave) sentinelEvent(LL_NOTICE,"+slave",slave,"%@");
    }
    zfree(slaves);
    /* Release the old address at the end so we are safe even if the function
     * gets the master->addr->ip and master->addr->port as arguments. */
    releaseSentinelAddr(oldaddr);
    sentinelFlushConfig();
    return C_OK;
}
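This function clears the old master's data and writes in the new master's. First, createSentinelAddr builds an address from the new master's ip and port. Then the while loop copies the addresses of all of the old master's slaves, except the one being promoted, into the temporary array slaves. Next, the if (!sentinelAddrIsEqual(newaddr,master->addr)) block appends the old master's own address to slaves as well. After that, sentinelResetMaster is called; it is defined as follows: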
/* Reset the state of a monitored master:
 * 1) Remove all slaves.
 * 2) Remove all sentinels.
 * 3) Remove most of the flags resulting from runtime operations.
 * 4) Reset timers to their default value. For example after a reset it will be
 *    possible to failover again the same master ASAP, without waiting the
 *    failover timeout delay.
 * 5) In the process of doing this undo the failover if in progress.
 * 6) Disconnect the connections with the master (will reconnect automatically).
 */
#define SENTINEL_RESET_NO_SENTINELS (1<<0)
void sentinelResetMaster(sentinelRedisInstance *ri, int flags) {
    serverAssert(ri->flags & SRI_MASTER);
    dictRelease(ri->slaves);
    ri->slaves = dictCreate(&instancesDictType,NULL);
    if (!(flags & SENTINEL_RESET_NO_SENTINELS)) {
        dictRelease(ri->sentinels);
        ri->sentinels = dictCreate(&instancesDictType,NULL);
    }
    instanceLinkCloseConnection(ri->link,ri->link->cc);
    instanceLinkCloseConnection(ri->link,ri->link->pc);
    ri->flags &= SRI_MASTER;
    if (ri->leader) {
        sdsfree(ri->leader);
        ri->leader = NULL;
    }
    ri->failover_state = SENTINEL_FAILOVER_STATE_NONE;
    ri->failover_state_change_time = 0;
    ri->failover_start_time = 0; /* We can failover again ASAP. */
    ri->promoted_slave = NULL;
    sdsfree(ri->runid);
    sdsfree(ri->slave_master_host);
    ri->runid = NULL;
    ri->slave_master_host = NULL;
    ri->link->act_ping_time = mstime();
    ri->link->last_ping_time = 0;
    ri->link->last_avail_time = mstime();
    ri->link->last_pong_time = mstime();
    ri->role_reported_time = mstime();
    ri->role_reported = SRI_MASTER;
    if (flags & SENTINEL_GENERATE_EVENT)
        sentinelEvent(LL_WARNING,"+reset-master",ri,"%@");
}
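As its comment says, this function resets the master in six steps: 1) remove all recorded slaves; 2) remove all recorded sentinels; 3) clear most of the flags set by runtime operations; 4) reset the timers to their default values; 5) undo any failover in progress; 6) close the connections. Note that step 2 is skipped when SENTINEL_RESET_NO_SENTINELS is passed, which is what sentinelResetMasterAndChangeAddress does, so the known sentinels are kept during the switch.
Looking at the code, it consists mostly of field assignments, and most of these fields were covered in earlier articles: slaves, which stores the slave instances; sentinels, which stores the other sentinels; failover_state, which records the failover state; and so on.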
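Two of the lines in sentinelResetMaster are bit operations that are easy to misread, so here is a minimal sketch of what they do. The TOY_ constants and both helper functions are hypothetical; the bit positions chosen for the down flags are illustrative and are not taken from the listing.

```c
/* Illustrative flag bits (assumed values, prefixed TOY_ to make that clear). */
#define TOY_SRI_MASTER (1<<0)
#define TOY_SRI_S_DOWN (1<<3)
#define TOY_SRI_O_DOWN (1<<4)
#define TOY_RESET_NO_SENTINELS (1<<0)

/* Same operation as `ri->flags &= SRI_MASTER`: every bit except the
 * master role bit is cleared, which drops all runtime flags at once. */
int toy_clear_runtime_flags(int flags) {
    return flags & TOY_SRI_MASTER;
}

/* Same test as `if (!(flags & SENTINEL_RESET_NO_SENTINELS))`: the
 * sentinels dictionary is only rebuilt when the caller did NOT ask to
 * keep it. Returns 1 when the sentinels would be dropped. */
int toy_should_drop_sentinels(int reset_flags) {
    return !(reset_flags & TOY_RESET_NO_SENTINELS);
}
```

A master flagged as both subjectively and objectively down therefore comes out of the reset with only its role bit set, and a caller passing the no-sentinels flag keeps its sentinel list.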
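Back in sentinelResetMasterAndChangeAddress: after sentinelResetMaster returns, master->addr is set to the new address, and o_down_since_time and s_down_since_time are reset to 0, which clears the objective and subjective down timestamps.
Then the for loop turns each address saved in the temporary slaves array back into a slave instance, via createSentinelRedisInstance with the master as its parent, so that it is recorded among the new master's slaves.
Finally, sentinelFlushConfig writes the updated state to the configuration file.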
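To tie the steps together, the following is a minimal, self-contained model of the address switch. It is a sketch under strong assumptions: toy_addr, toy_instance and toy_switch_address are hypothetical names, fixed-size arrays replace the dict and the zrealloc-grown list, and connections, events and the config flush are left out.

```c
#include <string.h>

#define TOY_MAX_SLAVES 8

/* Hypothetical simplified address and instance types. */
typedef struct { char ip[46]; int port; } toy_addr;

typedef struct {
    toy_addr addr;                    /* current master address */
    toy_addr slaves[TOY_MAX_SLAVES];  /* known slave addresses */
    int numslaves;
} toy_instance;

static int toy_addr_equal(const toy_addr *a, const toy_addr *b) {
    return a->port == b->port && strcmp(a->ip, b->ip) == 0;
}

/* Models the flow of sentinelResetMasterAndChangeAddress.
 * Returns 0 on success, -1 if the ip is too long or the list would overflow. */
int toy_switch_address(toy_instance *m, const char *ip, int port) {
    toy_addr newaddr, keep[TOY_MAX_SLAVES];
    int n = 0;

    if (strlen(ip) >= sizeof(newaddr.ip)) return -1;
    memset(&newaddr, 0, sizeof(newaddr));
    strcpy(newaddr.ip, ip);
    newaddr.port = port;

    /* 1. Remember every slave except the one being promoted. */
    for (int j = 0; j < m->numslaves; j++) {
        if (toy_addr_equal(&m->slaves[j], &newaddr)) continue;
        keep[n++] = m->slaves[j];
    }
    /* 2. If the address really changes, the old master becomes a slave. */
    if (!toy_addr_equal(&newaddr, &m->addr)) {
        if (n == TOY_MAX_SLAVES) return -1;
        keep[n++] = m->addr;
    }
    /* 3. Reset (here: just empty the slave list) and switch the address. */
    m->numslaves = 0;
    m->addr = newaddr;
    /* 4. Add the remembered slaves back under the new master. */
    for (int j = 0; j < n; j++) m->slaves[m->numslaves++] = keep[j];
    return 0;
}
```

For a master at 10.0.0.1 with slaves 10.0.0.2 and 10.0.0.3, switching to 10.0.0.2 leaves a master at 10.0.0.2 whose slaves are 10.0.0.3 and the old master 10.0.0.1. Copying the new address before touching the instance plays the same role as releasing oldaddr last in the real function: the ip argument may point into the very address being replaced.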