1. Subjective and Objective Down
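Parts (27) and (28) covered how a Sentinel discovers replicas and other Sentinels. The remaining three features (subjective/objective down detection, leader election, and failover) are closely related. Down detection comes first and leads step by step into the other two.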
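Subjective down (SDOWN) and objective down (ODOWN) are flags a Sentinel uses to mark the running state of other servers. Subjective down applies to every other server, whereas objective down is evaluated only for masters. In the sentinelHandleRedisInstance function mentioned in (26), subjective down is handled by sentinelCheckSubjectivelyDown, whose body is as follows: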
/* Is this instance down from our point of view? */
void sentinelCheckSubjectivelyDown(sentinelRedisInstance *ri) {
    mstime_t elapsed = 0;

    if (ri->link->act_ping_time)
        elapsed = mstime() - ri->link->act_ping_time;
    else if (ri->link->disconnected)
        elapsed = mstime() - ri->link->last_avail_time;

    /* Check if we are in need for a reconnection of one of the
     * links, because we are detecting low activity.
     *
     * 1) Check if the command link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have a
     *    pending ping for more than half the timeout. */
    if (ri->link->cc &&
        (mstime() - ri->link->cc_conn_time) >
        SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        ri->link->act_ping_time != 0 && /* There is a pending ping... */
        /* The pending ping is delayed, and we did not receive
         * error replies as well. */
        (mstime() - ri->link->act_ping_time) > (ri->down_after_period/2) &&
        (mstime() - ri->link->last_pong_time) > (ri->down_after_period/2))
    {
        instanceLinkCloseConnection(ri->link,ri->link->cc);
    }

    /* 2) Check if the pubsub link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have no
     *    activity in the Pub/Sub channel for more than
     *    SENTINEL_PUBLISH_PERIOD * 3.
     */
    if (ri->link->pc &&
        (mstime() - ri->link->pc_conn_time) >
        SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        (mstime() - ri->link->pc_last_activity) > (SENTINEL_PUBLISH_PERIOD*3))
    {
        instanceLinkCloseConnection(ri->link,ri->link->pc);
    }

    /* Update the SDOWN flag. We believe the instance is SDOWN if:
     *
     * 1) It is not replying.
     * 2) We believe it is a master, it reports to be a slave for enough time
     *    to meet the down_after_period, plus enough time to get two times
     *    INFO report from the instance. */
    if (elapsed > ri->down_after_period ||
        (ri->flags & SRI_MASTER &&
         ri->role_reported == SRI_SLAVE &&
         mstime() - ri->role_reported_time >
         (ri->down_after_period+SENTINEL_INFO_PERIOD*2)))
    {
        /* Is subjectively down */
        if ((ri->flags & SRI_S_DOWN) == 0) {
            sentinelEvent(LL_WARNING,"+sdown",ri,"%@");
            ri->s_down_since_time = mstime();
            ri->flags |= SRI_S_DOWN;
        }
    } else {
        /* Is subjectively up */
        if (ri->flags & SRI_S_DOWN) {
            sentinelEvent(LL_WARNING,"-sdown",ri,"%@");
            ri->flags &= ~(SRI_S_DOWN|SRI_SCRIPT_KILL_SENT);
        }
    }
}
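The function looks long but is actually simple, and it has two parts. The first part is the pair of if statements that test ri->link->cc and ri->link->pc. They check whether the two links created in (27), the command link and the Pub/Sub link, are still healthy, and close any link that is not. The second part is the final if statement, the one beginning with elapsed > ri->down_after_period, which decides whether the instance is subjectively down.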
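Note that the subjective-down check itself sends no commands to the server. Commands are sent by the functions analyzed in (28); this check only draws a conclusion from the replies already received.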
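The decision rule at the end of sentinelCheckSubjectivelyDown can be isolated into a small, self-contained sketch. The struct instance_view, the function is_subjectively_down, and the constant INFO_PERIOD_MS are hypothetical names introduced here for illustration. They stand in for the fields of sentinelRedisInstance and for SENTINEL_INFO_PERIOD, and the 10000 ms value is an assumption rather than something stated in this article.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;

/* Assumed stand-in for SENTINEL_INFO_PERIOD, in milliseconds. */
#define INFO_PERIOD_MS 10000

/* Hypothetical, stripped-down view of the fields the real check reads. */
typedef struct {
    mstime_t act_ping_time;      /* when the pending PING was sent, 0 if none */
    mstime_t last_avail_time;    /* last time the link was known to be usable */
    bool     disconnected;       /* true if the link is currently down */
    mstime_t down_after_period;  /* timeout before the instance counts as down */
    bool     is_master;          /* we monitor this instance as a master */
    bool     reports_slave;      /* its latest role report says "slave" */
    mstime_t role_reported_time; /* when that role report was recorded */
} instance_view;

/* Mirrors only the final condition of the real function; no I/O happens. */
bool is_subjectively_down(const instance_view *v, mstime_t now) {
    mstime_t elapsed = 0;

    if (v->act_ping_time)
        elapsed = now - v->act_ping_time;
    else if (v->disconnected)
        elapsed = now - v->last_avail_time;

    /* Condition 1: the instance has not replied for too long. */
    bool not_replying = elapsed > v->down_after_period;

    /* Condition 2: a supposed master keeps reporting itself as a slave. */
    bool wrong_role = v->is_master && v->reports_slave &&
        (now - v->role_reported_time) >
        (v->down_after_period + INFO_PERIOD_MS * 2);

    return not_replying || wrong_role;
}
```

Passing the current time in as a parameter, instead of calling mstime() inside, is a choice made for this sketch so that the rule can be exercised with fixed timestamps.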
With subjective down covered, we now turn to objective down.
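Before reading the source, it helps to understand the mechanism behind objective down. In Sentinel mode the Sentinels form a cluster. When one Sentinel in the cluster considers the master down, the master has not necessarily failed: the network between that Sentinel and the master may be broken while the master can still reach the other servers and serve requests normally. To guard against this, the cluster treats the master as really down, that is objectively down, only when a certain number of Sentinels all consider it subjectively down. That number comes from the configuration file: it is quorum, the last argument of the sentinel monitor directive.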
sentinel monitor <master-name> <ip> <redis-port> <quorum>
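As a concrete illustration, the fragment below assumes a deployment of three Sentinels watching one master. The master name mymaster, the address, and the numeric values are placeholders chosen for this example, not values taken from the article.

```
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 30000
```

With a quorum of 2, the master becomes objectively down only once two of the three Sentinels consider it down. The down-after-milliseconds directive supplies the timeout that the subjective-down check reads as down_after_period.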
With the mechanism in mind, let's look at the Redis source in detail. Back in sentinelHandleRedisInstance, objective down is handled by sentinelCheckObjectivelyDown. Its implementation is as follows:
/* Is this instance down according to the configured quorum?
 *
 * Note that ODOWN is a weak quorum, it only means that enough Sentinels
 * reported in a given time range that the instance was not reachable.
 * However messages can be delayed so there are no strong guarantees about
 * N instances agreeing at the same time about the down state. */
void sentinelCheckObjectivelyDown(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;
    unsigned int quorum = 0, odown = 0;

    if (master->flags & SRI_S_DOWN) {
        /* Is down for enough sentinels? */
        quorum = 1; /* the current sentinel. */
        /* Count all the other sentinels. */
        di = dictGetIterator(master->sentinels);
        while((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *ri = dictGetVal(de);

            if (ri->flags & SRI_MASTER_DOWN) quorum++;
        }
        dictReleaseIterator(di);
        if (quorum >= master->quorum) odown = 1;
    }

    /* Set the flag accordingly to the outcome. */
    if (odown) {
        if ((master->flags & SRI_O_DOWN) == 0) {
            sentinelEvent(LL_WARNING,"+odown",master,"%@ #quorum %d/%d",
                quorum, master->quorum);
            master->flags |= SRI_O_DOWN;
            master->o_down_since_time = mstime();
        }
    } else {
        if (master->flags & SRI_O_DOWN) {
            sentinelEvent(LL_WARNING,"-odown",master,"%@");
            master->flags &= ~SRI_O_DOWN;
        }
    }
}
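Like the subjective-down function, this one only checks whether the conditions for objective down are met. Communication with other servers does not happen here.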
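The code has two parts. The first is the if (master->flags & SRI_S_DOWN) block: only when this Sentinel already considers the master subjectively down does it count the votes and decide whether odown should be set. The second is the if (odown) statement, which sets or clears the SRI_O_DOWN flag according to that result.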
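The objective-down check inside the if (master->flags & SRI_S_DOWN) block is also simple. The count starts at 1 for the current Sentinel. The code then reads master->sentinels: as mentioned in (28), each Sentinel that is discovered is instantiated and stored in a dictionary, and that dictionary is master->sentinels. A while loop iterates over the Sentinels in the dictionary, and for each one flagged SRI_MASTER_DOWN, meaning it also considers the master down, quorum is incremented by 1. Finally the counted quorum is compared with the configured master->quorum. If the count is greater than or equal to the configured value, the master is objectively down.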
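The vote count can be sketched on its own, with the dictionary replaced by a plain array. The function is_objectively_down and its parameters are hypothetical names for this illustration. Each array entry plays the role of the SRI_MASTER_DOWN flag on one peer Sentinel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when enough Sentinels agree that the master is down.
 * self_sdown        : this Sentinel already flags the master as SDOWN
 * peer_reports_down : one entry per known peer Sentinel
 * required_quorum   : the quorum configured for this master */
bool is_objectively_down(bool self_sdown, const bool *peer_reports_down,
                         size_t n_peers, unsigned int required_quorum) {
    /* Without our own SDOWN verdict, no votes are counted at all. */
    if (!self_sdown) return false;

    unsigned int votes = 1; /* the current Sentinel */
    for (size_t i = 0; i < n_peers; i++) {
        if (peer_reports_down[i]) votes++;
    }

    /* Greater than or equal, matching the >= in the Redis code. */
    return votes >= required_quorum;
}
```

For example, with two peers of which one reports the master down, a quorum of 2 is met (the local vote plus one peer) while a quorum of 3 is not.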