在YARN上开发长服务,需要注意fault-tolerance,本篇文章对appmaster的平滑重启的一个参数做了解析,如何设置可以有助于达到appmaster平滑重启。
在yarn-site.xml有个参数
/**
* The maximum number of application attempts.
* It's a global setting for all application masters.
*/
yarn.resourcemanager.am.max-attempts
一个全局的appmaster重试次数的限制,yarn提交应用时,还可以为单独一个应用设置最大重试次数
/**
* Set the number of max attempts of the application to be submitted. WARNING:
* it should be no larger than the global number of max attempts in the Yarn
* configuration.
* @param maxAppAttempts the number of max attempts of the application
* to be submitted.
*/
@Public
@Stable
public abstract void setMaxAppAttempts(int maxAppAttempts);
当attempt失败时,如果设置keepContainersAcrossAppAttempts了,resource manager会决定上个attempt的container是否仍然保留着。
boolean keepContainersAcrossAppAttempts = false;
switch (finalAttemptState) {
case FINISHED:
{
appEvent = new RMAppFinishedAttemptEvent(applicationId,
appAttempt.getDiagnostics());
}
break;
case KILLED:
{
// don't leave the tracking URL pointing to a non-existent AM
appAttempt.setTrackingUrlToRMAppPage();
appAttempt.invalidateAMHostAndPort();
appEvent =
new RMAppFailedAttemptEvent(applicationId,
RMAppEventType.ATTEMPT_KILLED,
"Application killed by user.", false);
}
break;
case FAILED:
{
// don't leave the tracking URL pointing to a non-existent AM
appAttempt.setTrackingUrlToRMAppPage();
appAttempt.invalidateAMHostAndPort(); if (appAttempt.submissionContext
.getKeepContainersAcrossApplicationAttempts()
&& !appAttempt.submissionContext.getUnmanagedAM()) {
// See if we should retain containers for non-unmanaged applications
if (!appAttempt.shouldCountTowardsMaxAttemptRetry()) {
// Premption, hardware failures, NM resync doesn't count towards
// app-failures and so we should retain containers.
keepContainersAcrossAppAttempts = true;
} else if (!appAttempt.maybeLastAttempt) {
// Not preemption, hardware failures or NM resync.
// Not last-attempt too - keep containers.
keepContainersAcrossAppAttempts = true;
}
}
appEvent =
new RMAppFailedAttemptEvent(applicationId,
RMAppEventType.ATTEMPT_FAILED, appAttempt.getDiagnostics(),
keepContainersAcrossAppAttempts); }
}
关注appAttempt.maybeLastAttempt这个变量,rs如何判断是否这次attempt是最后一次呢?
private void createNewAttempt() {
ApplicationAttemptId appAttemptId =
ApplicationAttemptId.newInstance(applicationId, attempts.size() + );
RMAppAttempt attempt =
new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService,
submissionContext, conf,
// The newly created attempt maybe last attempt if (number of
// previously failed attempts(which should not include Preempted,
// hardware error and NM resync) + 1) equal to the max-attempt
// limit.
maxAppAttempts == (getNumFailedAppAttempts() + ), amReq);
attempts.put(appAttemptId, attempt);
currentAttempt = attempt;
}
在每次构造新的attempt时候,maxAppAttempts == (getNumFailedAppAttempts() + 1)会决定,已经失败的次数+1,是否已经达到了maxAppAttempts的限制了。
而maxAppAttempts这个参数是由global和individual两个配置取min,决定的。
int globalMaxAppAttempts = conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS);
int individualMaxAppAttempts = submissionContext.getMaxAppAttempts();
if (individualMaxAppAttempts <= ||
individualMaxAppAttempts > globalMaxAppAttempts) {
this.maxAppAttempts = globalMaxAppAttempts;
LOG.warn("The specific max attempts: " + individualMaxAppAttempts
+ " for application: " + applicationId.getId()
+ " is invalid, because it is out of the range [1, "
+ globalMaxAppAttempts + "]. Use the global max attempts instead.");
} else {
this.maxAppAttempts = individualMaxAppAttempts;
}
总结:
如果希望appmaster可以达到不断重启,而且可以接管之前的container,需要把yarn.resourcemanager.am.max-attempts这个参数尽量调大,比如设置为10000,并且提交app时候设置submit context的最大次数,以及刷新窗口,这样基本就可以满足长服务应用在yarn上面的运行需求了。