Redis Cluster Analysis (35)

1. Failover

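Part (34) noted that failover_state ends up set to SENTINEL_FAILOVER_STATE_UPDATE_CONFIG. Note that this state is not handled by sentinelFailoverStateMachine, the function discussed earlier: the five states that function handles do not include it. Besides this state there is one more, SENTINEL_FAILOVER_STATE_NONE.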

The complete set of states in the failover procedure is:

/* Failover machine different states. */
#define SENTINEL_FAILOVER_STATE_NONE 0  /* No failover in progress. */
#define SENTINEL_FAILOVER_STATE_WAIT_START 1  /* Wait for failover_start_time*/
#define SENTINEL_FAILOVER_STATE_SELECT_SLAVE 2 /* Select slave to promote */
#define SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE 3 /* Slave -> Master */
#define SENTINEL_FAILOVER_STATE_WAIT_PROMOTION 4 /* Wait slave to change role */
#define SENTINEL_FAILOVER_STATE_RECONF_SLAVES 5 /* SLAVEOF newmaster */
#define SENTINEL_FAILOVER_STATE_UPDATE_CONFIG 6 /* Monitor promoted slave. */

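There are seven states in total. The state with value 0 (SENTINEL_FAILOVER_STATE_NONE) means that no failover is in progress. Values 1 through 5 are the states handled by sentinelFailoverStateMachine, analyzed earlier. The last one, value 6, is the state mentioned in part (34).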

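A failover is complete only once this state has been reset to 0 and the related fields have been restored to their no-failover values. So even though the Redis servers could already serve requests normally at the end of the analysis in part (34), the failover procedure had not finished yet.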
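
The split between the states can be sketched in a few lines of C. This is only an illustration: the state values are copied from the listing above, while handled_by_state_machine and failover_finished are hypothetical helper names that do not exist in sentinel.c.

```c
#include <stdbool.h>

/* State values copied from the listing above. */
#define SENTINEL_FAILOVER_STATE_NONE 0
#define SENTINEL_FAILOVER_STATE_WAIT_START 1
#define SENTINEL_FAILOVER_STATE_SELECT_SLAVE 2
#define SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE 3
#define SENTINEL_FAILOVER_STATE_WAIT_PROMOTION 4
#define SENTINEL_FAILOVER_STATE_RECONF_SLAVES 5
#define SENTINEL_FAILOVER_STATE_UPDATE_CONFIG 6

/* Hypothetical helper: true for states 1..5, the ones dispatched by
 * sentinelFailoverStateMachine. */
bool handled_by_state_machine(int state) {
    return state >= SENTINEL_FAILOVER_STATE_WAIT_START &&
           state <= SENTINEL_FAILOVER_STATE_RECONF_SLAVES;
}

/* Hypothetical helper: the failover only counts as finished once the
 * state is back to NONE. */
bool failover_finished(int state) {
    return state == SENTINEL_FAILOVER_STATE_NONE;
}
```

With these helpers, SENTINEL_FAILOVER_STATE_UPDATE_CONFIG is neither handled by the state machine nor finished, which is exactly the gap the rest of this part fills.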

The SENTINEL_FAILOVER_STATE_UPDATE_CONFIG state is handled in sentinelHandleDictOfRedisInstances, which was mentioned in part (26). It is called by the Sentinel's sentinelTimer function, and internally it calls an important function, sentinelHandleRedisInstance. The full body of sentinelHandleDictOfRedisInstances is:

/* Perform scheduled operations for all the instances in the dictionary.
 * Recursively call the function against dictionaries of slaves. */
void sentinelHandleDictOfRedisInstances(dict *instances) {
    dictIterator *di;
    dictEntry *de;
    sentinelRedisInstance *switch_to_promoted = NULL;

    /* There are a number of things we need to perform against every master. */
    di = dictGetIterator(instances);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);

        sentinelHandleRedisInstance(ri);
        if (ri->flags & SRI_MASTER) {
            sentinelHandleDictOfRedisInstances(ri->slaves);
            sentinelHandleDictOfRedisInstances(ri->sentinels);
            if (ri->failover_state == SENTINEL_FAILOVER_STATE_UPDATE_CONFIG) {
                switch_to_promoted = ri;
            }
        }
    }
    if (switch_to_promoted)
        sentinelFailoverSwitchToPromotedSlave(switch_to_promoted);
    dictReleaseIterator(di);
}

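The state under analysis, SENTINEL_FAILOVER_STATE_UPDATE_CONFIG, appears in the innermost if statement of the loop. The logic is simple: if a master is in that state, it is recorded in the local variable switch_to_promoted; once the loop ends, the if (switch_to_promoted) check passes and its body runs. That body makes a single call, to sentinelFailoverSwitchToPromotedSlave.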
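
The record-then-act pattern can be sketched on its own. The following is a simplified illustration, not code from Redis: master_t and pick_switch_candidate are hypothetical names, and a plain array stands in for the dict. It also makes one consequence of the loop visible: because switch_to_promoted is overwritten on every match, only the last matching master of a pass is kept.

```c
#include <stddef.h>

#define SENTINEL_FAILOVER_STATE_UPDATE_CONFIG 6

/* Hypothetical, heavily reduced stand-in for a monitored master. */
typedef struct {
    const char *name;
    int failover_state;
} master_t;

/* Scan all masters and remember the one to switch, mirroring how the
 * loop above assigns switch_to_promoted. The caller acts on the result
 * only after the scan is over. Returns NULL when nothing matches. */
master_t *pick_switch_candidate(master_t *masters, size_t n) {
    master_t *switch_to_promoted = NULL;
    for (size_t i = 0; i < n; i++) {
        if (masters[i].failover_state == SENTINEL_FAILOVER_STATE_UPDATE_CONFIG)
            switch_to_promoted = &masters[i]; /* later matches overwrite earlier ones */
    }
    return switch_to_promoted;
}
```
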
The body of sentinelFailoverSwitchToPromotedSlave is as follows:

/* This function is called when the slave is in
 * SENTINEL_FAILOVER_STATE_UPDATE_CONFIG state. In this state we need
 * to remove it from the master table and add the promoted slave instead. */
void sentinelFailoverSwitchToPromotedSlave(sentinelRedisInstance *master) {
    sentinelRedisInstance *ref = master->promoted_slave ?
                                 master->promoted_slave : master;

    sentinelEvent(LL_WARNING,"+switch-master",master,"%s %s %d %s %d",
        master->name, master->addr->ip, master->addr->port,
        ref->addr->ip, ref->addr->port);

    sentinelResetMasterAndChangeAddress(master,ref->addr->ip,ref->addr->port);
}

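Start with the first statement, which assigns ref. The value comes from the master's promoted_slave field, falling back to the master itself when that field is NULL. This field appeared earlier, in the handling of the SELECT_SLAVE state: that handler calls sentinelFailoverSelectSlave, which picks one of the slaves to become the new master and records the chosen instance in promoted_slave.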
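
The fallback in that assignment is easy to check in isolation. Below is a minimal sketch under stated assumptions: instance is a hypothetical struct that keeps only three fields of sentinelRedisInstance, and switch_target is a hypothetical name for the conditional expression.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for sentinelRedisInstance. */
typedef struct instance {
    const char *ip;
    int port;
    struct instance *promoted_slave; /* NULL when no slave was selected */
} instance;

/* Same selection as the first statement of
 * sentinelFailoverSwitchToPromotedSlave: prefer the promoted slave,
 * fall back to the master itself. */
const instance *switch_target(const instance *master) {
    return master->promoted_slave ? master->promoted_slave : master;
}
```

When promoted_slave is NULL the function returns the master, so the address passed on to sentinelResetMasterAndChangeAddress is the master's own address.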

Next comes the call to sentinelResetMasterAndChangeAddress at the end of the function. Its body is:

/* Reset the specified master with sentinelResetMaster(), and also change
 * the ip:port address, but take the name of the instance unmodified.
 *
 * This is used to handle the +switch-master event.
 *
 * The function returns C_ERR if the address can't be resolved for some
 * reason. Otherwise C_OK is returned.  */
int sentinelResetMasterAndChangeAddress(sentinelRedisInstance *master, char *ip, int port) {
    sentinelAddr *oldaddr, *newaddr;
    sentinelAddr **slaves = NULL;
    int numslaves = 0, j;
    dictIterator *di;
    dictEntry *de;

    newaddr = createSentinelAddr(ip,port);
    if (newaddr == NULL) return C_ERR;

    /* Make a list of slaves to add back after the reset.
     * Don't include the one having the address we are switching to. */
    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);

        if (sentinelAddrIsEqual(slave->addr,newaddr)) continue;
        slaves = zrealloc(slaves,sizeof(sentinelAddr*)*(numslaves+1));
        slaves[numslaves++] = createSentinelAddr(slave->addr->ip,
                                                 slave->addr->port);
    }
    dictReleaseIterator(di);

    /* If we are switching to a different address, include the old address
     * as a slave as well, so that we'll be able to sense / reconfigure
     * the old master. */
    if (!sentinelAddrIsEqual(newaddr,master->addr)) {
        slaves = zrealloc(slaves,sizeof(sentinelAddr*)*(numslaves+1));
        slaves[numslaves++] = createSentinelAddr(master->addr->ip,
                                                 master->addr->port);
    }

    /* Reset and switch address. */
    sentinelResetMaster(master,SENTINEL_RESET_NO_SENTINELS);
    oldaddr = master->addr;
    master->addr = newaddr;
    master->o_down_since_time = 0;
    master->s_down_since_time = 0;

    /* Add slaves back. */
    for (j = 0; j < numslaves; j++) {
        sentinelRedisInstance *slave;

        slave = createSentinelRedisInstance(NULL,SRI_SLAVE,slaves[j]->ip,
                    slaves[j]->port, master->quorum, master);
        releaseSentinelAddr(slaves[j]);
        if (slave) sentinelEvent(LL_NOTICE,"+slave",slave,"%@");
    }
    zfree(slaves);

    /* Release the old address at the end so we are safe even if the function
     * gets the master->addr->ip and master->addr->port as arguments. */
    releaseSentinelAddr(oldaddr);
    sentinelFlushConfig();
    return C_OK;
}

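This function clears the old master's data and writes in the new master's. First, createSentinelAddr builds the new master's ip and port. Next, the loop over master->slaves copies the address of every slave of the old master (except the new master) into the temporary array slaves. Then, if the new address differs from the old master's address, the old master's address is appended to slaves as well. After that, the function calls sentinelResetMaster, whose body is: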

/* Reset the state of a monitored master:
 * 1) Remove all slaves.
 * 2) Remove all sentinels.
 * 3) Remove most of the flags resulting from runtime operations.
 * 4) Reset timers to their default value. For example after a reset it will be
 *    possible to failover again the same master ASAP, without waiting the
 *    failover timeout delay.
 * 5) In the process of doing this undo the failover if in progress.
 * 6) Disconnect the connections with the master (will reconnect automatically).
 */

#define SENTINEL_RESET_NO_SENTINELS (1<<0)
void sentinelResetMaster(sentinelRedisInstance *ri, int flags) {
    serverAssert(ri->flags & SRI_MASTER);
    dictRelease(ri->slaves);
    ri->slaves = dictCreate(&instancesDictType,NULL);
    if (!(flags & SENTINEL_RESET_NO_SENTINELS)) {
        dictRelease(ri->sentinels);
        ri->sentinels = dictCreate(&instancesDictType,NULL);
    }
    instanceLinkCloseConnection(ri->link,ri->link->cc);
    instanceLinkCloseConnection(ri->link,ri->link->pc);
    ri->flags &= SRI_MASTER;
    if (ri->leader) {
        sdsfree(ri->leader);
        ri->leader = NULL;
    }
    ri->failover_state = SENTINEL_FAILOVER_STATE_NONE;
    ri->failover_state_change_time = 0;
    ri->failover_start_time = 0; /* We can failover again ASAP. */
    ri->promoted_slave = NULL;
    sdsfree(ri->runid);
    sdsfree(ri->slave_master_host);
    ri->runid = NULL;
    ri->slave_master_host = NULL;
    ri->link->act_ping_time = mstime();
    ri->link->last_ping_time = 0;
    ri->link->last_avail_time = mstime();
    ri->link->last_pong_time = mstime();
    ri->role_reported_time = mstime();
    ri->role_reported = SRI_MASTER;
    if (flags & SENTINEL_GENERATE_EVENT)
        sentinelEvent(LL_WARNING,"+reset-master",ri,"%@");
}

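As its comment says, this function resets the master in six main steps: 1. remove all recorded slaves; 2. remove all recorded Sentinels; 3. clear most of the flags produced by runtime operations; 4. reset the timers to their default values; 5. undo the failover; 6. close the connections. Note that step 2 is skipped when the SENTINEL_RESET_NO_SENTINELS flag is passed, which is what sentinelResetMasterAndChangeAddress does.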

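Looking at the code, it consists mostly of field assignments, and most of the fields were covered in earlier parts: slaves, which stores the slaves; sentinels, which stores the other Sentinels; failover_state, which tracks the failover state; and so on.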
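
One assignment that is easy to misread is ri->flags &= SRI_MASTER, which implements step 3. A small sketch of what the mask does is shown below. The flag values are illustrative assumptions for this example only (the real SRI_* constants are defined in sentinel.c and may differ between versions), and reset_flags is a hypothetical helper.

```c
/* Illustrative flag values for this sketch only; the real SRI_*
 * constants are defined in sentinel.c and may differ. */
#define SRI_MASTER (1<<0)
#define SRI_S_DOWN (1<<3)
#define SRI_O_DOWN (1<<4)
#define SRI_FAILOVER_IN_PROGRESS (1<<6)

/* Hypothetical helper equivalent to "flags &= SRI_MASTER": every bit
 * except SRI_MASTER is cleared, and SRI_MASTER is kept only if it was
 * already set. */
int reset_flags(int flags) {
    return flags & SRI_MASTER;
}
```

So a master that was flagged as subjectively down, objectively down, and in failover comes out of the reset with only SRI_MASTER set.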

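Back in sentinelResetMasterAndChangeAddress: after sentinelResetMaster returns, master->addr is set to the new address, and o_down_since_time and s_down_since_time are zeroed to reset the objective and subjective down state.
The for loop then recreates each address saved in the temporary array slaves as a slave instance of the new master.
Finally, sentinelFlushConfig flushes the configuration file.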
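
The way the slave list is rebuilt around the reset can be sketched without any Redis types. This is a simplified illustration under stated assumptions: addr is a hypothetical stand-in for sentinelAddr, slaves_after_switch is a hypothetical name, the output array is allocated once with malloc instead of grown with zrealloc, and addresses are compared with strcmp.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for sentinelAddr. */
typedef struct {
    char ip[46];
    int port;
} addr;

int addr_equal(const addr *a, const addr *b) {
    return a->port == b->port && strcmp(a->ip, b->ip) == 0;
}

/* Compute the addresses that become slaves of the new master:
 * every old slave except the one being promoted, plus the old master
 * when the address actually changes. Returns a malloc'ed array that
 * the caller frees and stores its length in *count; returns NULL with
 * *count == 0 if the allocation fails. */
addr *slaves_after_switch(const addr *old_master, const addr *old_slaves,
                          int n, const addr *new_master, int *count) {
    /* At most n old slaves plus the old master. */
    addr *out = malloc(sizeof(addr) * (size_t)(n + 1));
    *count = 0;
    if (out == NULL) return NULL;

    for (int j = 0; j < n; j++) {
        if (addr_equal(&old_slaves[j], new_master)) continue; /* skip the promoted slave */
        out[(*count)++] = old_slaves[j];
    }
    if (!addr_equal(new_master, old_master))
        out[(*count)++] = *old_master; /* the old master is tracked as a slave */
    return out;
}
```

For an old master 10.0.0.1:6379 with slaves 10.0.0.2:6379 and 10.0.0.3:6379, promoting 10.0.0.2 yields the list 10.0.0.3:6379 followed by 10.0.0.1:6379, which matches the order in which the function above adds the slaves back.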
