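I. Preface
In my opinion, the one resource nobody learning Flink should miss is the Flink community's blog series; its articles do not disappoint. Strongly recommended: https://ververica.cn/developers-resources/.
This is my first attempt at a source-code-reading article, and I try to tie the underlying principles to the flow of the source implementation. A few points are still unclear to me. Covering everything in one post would take too long, yet I am afraid of forgetting the details later, so I am writing down what I have now and will fill in the gaps after reading more of the source. Apologies to readers who see this incomplete version!
Corrections are welcome in the comments if anything here is inaccurate.
This source code series is based on Flink 1.9.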
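II. How Per-job Submission Works
The overall flow of submitting a job in Flink on Yarn mode is shown in the figure below (from the Flink community; see Ref [1] for the link).
Figure 1. Architecture of the Flink Runtime layer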
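2.1. Overview of the Runtime layer architecture
Flink uses the classic master-slave model: in the figure, the AM (ApplicationMaster) is the master and the TaskManagers are the slaves.
Inside the AM, the Dispatcher receives jobs submitted by the client and starts the corresponding JobManager; the JobManager accepts the job, assigns tasks, and manages the TaskManagers; the ResourceManager requests and allocates resources.
One point to note: Flink has its own ResourceManager and TaskManager. Even in on Yarn mode, Flink keeps its own resource management architecture. Some components share their names with Yarn's, but here Yarn is only the resource provider; in standalone mode the resource provider would be physical or virtual machines instead.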
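2.2. Overall flow of submitting a job in Per-job mode with Flink on Yarn:
1) Running the Flink program plays the role of the client. It optimizes the code into a JobGraph and asks the ApplicationsManager inside Yarn's ResourceManager for resources to start the AM (ApplicationMaster). The AM runs on a node managed by a Yarn NodeManager;
2) Once the AM is up, it starts the Dispatcher and the ResourceManager. The Dispatcher starts the JobManager, and the ResourceManager starts the slotManager, which manages and allocates slots;
3) The JobManager asks the ResourceManager (RM) for resources to run the job. At first no TaskManager has been started, so the RM requests resources from Yarn and starts TaskManagers in the containers it receives. The slots of each started TaskManager register with the slotManager, which assigns them to the tasks that need resources, that is, it reports the slot information to the JobManager, and the JobManager then submits the tasks to the corresponding slots for execution. The biggest difference between the session mode and the Per-job mode of Flink on Yarn is that in session mode the RM has already requested a fixed amount of resources from Yarn when a job is submitted, so its TaskManagers are already running.
The detailed resource allocation process is shown in the figure below:
Figure 2. Slot management, from Ref [1]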
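The slot request flow from step 3 can be made concrete with a small toy model. This is only a sketch under my own assumptions: the class ToySlotManager and all of its methods are hypothetical and are not Flink's SlotManager API. It only illustrates the ordering "request slot, no slot available, ask the resource provider for a container, TaskManager registers its slots, pending requests are served".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of the slot request flow; all names here are hypothetical. */
public class SlotFlowSketch {

    /** Stand-in for a slot manager: tracks free slots and pending requests. */
    static class ToySlotManager {
        private final Queue<String> freeSlots = new ArrayDeque<>();
        private final Queue<String> pendingTasks = new ArrayDeque<>();
        private final List<String> assignments = new ArrayList<>();
        private int containersRequested = 0;

        /** Called on behalf of the JobManager when a task needs a slot. */
        void requestSlot(String task) {
            String slot = freeSlots.poll();
            if (slot != null) {
                assignments.add(task + "->" + slot);
            } else {
                // No free slot yet: park the request and ask the resource
                // provider (Yarn) for a container. One container request per
                // pending task is a simplification of this sketch.
                pendingTasks.add(task);
                containersRequested++;
            }
        }

        /** Called when a freshly started TaskManager registers its slots. */
        void registerTaskManager(String tmId, int numSlots) {
            for (int i = 0; i < numSlots; i++) {
                String slot = tmId + "/slot-" + i;
                String task = pendingTasks.poll();
                if (task != null) {
                    assignments.add(task + "->" + slot);
                } else {
                    freeSlots.add(slot);
                }
            }
        }

        List<String> assignments() { return assignments; }
        int containersRequested() { return containersRequested; }
        int freeSlotCount() { return freeSlots.size(); }
    }

    public static void main(String[] args) {
        ToySlotManager slotManager = new ToySlotManager();
        slotManager.requestSlot("map-0");              // no TaskManager yet
        slotManager.requestSlot("map-1");
        slotManager.registerTaskManager("tm-1", 3);    // containers arrived
        System.out.println(slotManager.assignments());
        System.out.println("free slots: " + slotManager.freeSlotCount());
    }
}
```

In session mode, as described above, the TaskManagers would already be registered before the first requestSlot() call, so requests would be served from the free slots directly.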
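For a more detailed analysis I strongly recommend Ref [2], written by a Flink expert at Alibaba; much of the source analysis later in this post follows that article.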
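III. A Brief Source Code Analysis
Job submission command:
./flink run -m yarn-cluster ./flinkExample.jar
1. Client-side job submission
The entry class of the flink script is org.apache.flink.client.cli.CliFrontend.
1) The main() method of CliFrontend loads the Flink configuration and some global settings and then, based on the command-line argument run, calls run() -> runProgram() -> deployJobCluster(). The code is as follows: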
private <T> void runProgram(
        CustomCommandLine<T> customCommandLine,
        CommandLine commandLine,
        RunOptions runOptions,
        PackagedProgram program) throws ProgramInvocationException, FlinkException {
    final ClusterDescriptor<T> clusterDescriptor = customCommandLine.createClusterDescriptor(commandLine);
    try {
        final T clusterId = customCommandLine.getClusterId(commandLine);
        final ClusterClient<T> client;
        // directly deploy the job if the cluster is started in job mode and detached
        if (clusterId == null && runOptions.getDetachedMode()) {
            int parallelism = runOptions.getParallelism() == -1 ? defaultParallelism : runOptions.getParallelism();
            // Build the JobGraph
            final JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, configuration, parallelism);
            final ClusterSpecification clusterSpecification = customCommandLine.getClusterSpecification(commandLine);
            // Submit the job to Yarn
            client = clusterDescriptor.deployJobCluster(
                clusterSpecification,
                jobGraph,
                runOptions.getDetachedMode());
            logAndSysout("Job has been submitted with JobID " + jobGraph.getJobID());
            // ......
        } else {
            // ......
        }
        // ...... (the rest of the method is omitted in this excerpt)
2) Submitting the job calls deployJobCluster() in YarnClusterDescriptor, which calls deployInternal() in AbstractYarnClusterDescriptor. This method blocks until the ApplicationMaster/JobManager has been deployed on Yarn. The key call inside it is startAppMaster(), as shown below:
protected ClusterClient<ApplicationId> deployInternal(
        ClusterSpecification clusterSpecification,
        String applicationName,
        String yarnClusterEntrypoint,
        @Nullable JobGraph jobGraph,
        boolean detached) throws Exception {
    // 1. Verify that the cluster is reachable
    // 2. Check whether security authentication is enabled for the user
    // 3. Check that the configuration and vcores satisfy what the Flink cluster requests
    // 4. Check that the specified queue exists
    // 5. Check that there is enough memory for the Flink JobManager and TaskManager
    // ......
    // Entry
    ApplicationReport report = startAppMaster(
        flinkConfiguration,
        applicationName,
        yarnClusterEntrypoint,
        jobGraph,
        yarnClient,
        yarnApplication,
        validClusterSpecification);
    // 6. Obtain the port and address of the Flink cluster
    // ......
}
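3) AbstractYarnClusterDescriptor.startAppMaster() mainly uploads the configuration and related files to distributed storage such as HDFS and submits the application to Yarn. The annotated source is as follows: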
public ApplicationReport startAppMaster(
        Configuration configuration,
        String applicationName,
        String yarnClusterEntrypoint,
        JobGraph jobGraph,
        YarnClient yarnClient,
        YarnClientApplication yarnApplication,
        ClusterSpecification clusterSpecification) throws Exception {
    // ......
    // 1. Upload logback.xml and log4j.properties from the conf directory
    // 2. Upload the jars under the FLINK_PLUGINS_DIR and FLINK_LIB_DIR environment variables
    addEnvironmentFoldersToShipFiles(systemShipFiles);
    // ......
    // 3. Configure application high availability by setting the number of AM attempts (default 1)
    // 4. Upload ship files and user jars
    // 5. Set slots and heap memory for the TaskManager
    // 6. Upload flink-conf.yaml
    // 7. Serialize the JobGraph and upload it
    // 8. Check login credentials
    // ......
    // Build the Java command that launches the AM container
    final ContainerLaunchContext amContainer = setupApplicationMasterContainer(
        yarnClusterEntrypoint,
        hasLogback,
        hasLog4j,
        hasKrb5,
        clusterSpecification.getMasterMemoryMB());
    // 9. Bind the launch parameters, classpath and environment variables for the AM
    // ......
    final String customApplicationName = customName != null ? customName : applicationName;
    // 10. Set the application name, the application type and the ContainerLaunchContext
    appContext.setApplicationName(customApplicationName);
    appContext.setApplicationType(applicationType != null ? applicationType : "Apache Flink");
    appContext.setAMContainerSpec(amContainer);
    appContext.setResource(capability);
    if (yarnQueue != null) {
        appContext.setQueue(yarnQueue);
    }
    setApplicationNodeLabel(appContext);
    setApplicationTags(appContext);
    // 11. Delete yarnFilesDir if the deployment fails
    // add a hook to clean up in case deployment fails
    Thread deploymentFailureHook = new DeploymentFailureHook(yarnClient, yarnApplication, yarnFilesDir);
    Runtime.getRuntime().addShutdownHook(deploymentFailureHook);
    LOG.info("Submitting application master " + appId);
    // Entry
    yarnClient.submitApplication(appContext);
    LOG.info("Waiting for the cluster to be allocated");
    final long startTime = System.currentTimeMillis();
    ApplicationReport report;
    YarnApplicationState lastAppState = YarnApplicationState.NEW;
    // 12. Block until the application is running
    loop: while (true) {
        // ......
        // Fetch the application report through YarnClient every 250 ms
        Thread.sleep(250);
    }
    // ......
    // 13. Remove the shutdown hook once the deployment has succeeded
    // since deployment was successful, remove the hook
    ShutdownHookUtil.removeShutdownHook(deploymentFailureHook, getClass().getSimpleName(), LOG);
    return report;
}
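Steps 11 to 13 follow a common pattern: register a cleanup hook, poll until the application is running, and remove the hook on success. The sketch below shows that pattern with plain JDK calls only. The AppState enum, the state supplier, and the method names are assumptions made for illustration; they stand in for YarnApplicationState and the YarnClient report and are not the Yarn or Flink API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

/** Sketch of "hook, poll, unhook"; the types here are hypothetical stand-ins. */
public class DeployWaitSketch {

    /** Simplified subset of application states, for illustration only. */
    enum AppState { NEW, SUBMITTED, ACCEPTED, RUNNING, FAILED, KILLED }

    /** Polls the supplier until RUNNING; throws if a failure state is seen first. */
    static AppState waitUntilRunning(Supplier<AppState> stateSupplier, long pollIntervalMs) {
        AppState last = AppState.NEW;
        while (true) {
            AppState current = stateSupplier.get();
            if (current != last) {
                System.out.println("Application state changed: " + last + " -> " + current);
                last = current;
            }
            switch (current) {
                case RUNNING:
                    return current;
                case FAILED:
                case KILLED:
                    throw new IllegalStateException("Deployment ended in state " + current);
                default:
                    try {
                        Thread.sleep(pollIntervalMs);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while waiting", e);
                    }
            }
        }
    }

    /** Registers a cleanup hook, waits for RUNNING, then removes the hook. */
    static boolean deploy(Supplier<AppState> stateSupplier, Runnable cleanup, long pollIntervalMs) {
        Thread failureHook = new Thread(cleanup, "deployment-failure-hook");
        Runtime.getRuntime().addShutdownHook(failureHook);
        // If this throws, the hook stays registered and runs when the JVM exits.
        waitUntilRunning(stateSupplier, pollIntervalMs);
        // Deployment succeeded, so the cleanup hook is no longer needed.
        return Runtime.getRuntime().removeShutdownHook(failureHook);
    }

    public static void main(String[] args) {
        Iterator<AppState> states =
                Arrays.asList(AppState.NEW, AppState.ACCEPTED, AppState.RUNNING).iterator();
        boolean hookRemoved = deploy(states::next, () -> System.out.println("cleaning up"), 10L);
        System.out.println("hook removed: " + hookRemoved);
    }
}
```

Leaving the hook in place on failure is the point of the design: the uploaded files are only cleaned up if the client exits before the deployment has succeeded.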
4) The entry point of the submission is YarnClientImpl.submitApplication(), which in turn calls ApplicationClientProtocolPBClientImpl.submitApplication(). Its code is as follows:
public SubmitApplicationResponse submitApplication(SubmitApplicationRequest request) throws YarnException, IOException {
    // Extract the protobuf message
    SubmitApplicationRequestProto requestProto = ((SubmitApplicationRequestPBImpl)request).getProto();
    try {
        // Send the message to the server and wrap the result in a response
        return new SubmitApplicationResponsePBImpl(this.proxy.submitApplication((RpcController)null, requestProto));
    } catch (ServiceException var4) {
        RPCUtil.unwrapAndThrowException(var4);
        return null;
    }
}
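The message then travels over RPC to the server, where ApplicationClientProtocolPBServiceImpl.submitApplication() handles it. That method rebuilds the request and, through ClientRMService.submitApplication(), hands the application request to the RMAppManager on Yarn (Yarn's allocation process will be covered in a separate series of posts).
This completes the client-side flow: the application request has reached Yarn's ResourceManager. The next part focuses on the Flink cluster startup flow.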
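2. Flink cluster startup flow
1) ClientRMService.submitApplication() first checks whether the application has already been submitted (by applicationId), whether the Yarn queue is empty, and so on, and then submits the request to the RMAppManager (the component inside the YARN RM that manages application lifecycles). On success it logs Application with id {applicationId.getId()} submitted by user {user}. The details are as follows: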
public SubmitApplicationResponse submitApplication(
        SubmitApplicationRequest request) throws YarnException {
    ApplicationSubmissionContext submissionContext = request
        .getApplicationSubmissionContext();
    ApplicationId applicationId = submissionContext.getApplicationId();
    // ApplicationSubmissionContext needs to be validated for safety - only
    // those fields that are independent of the RM's configuration will be
    // checked here, those that are dependent on RM configuration are validated
    // in RMAppManager.
    // 1. Check whether the application has already been submitted
    // 2. Check whether the queue is null; if so, use the default queue (default)
    // 3. Check whether an application name is set; if not, use the default (N/A)
    // 4. Check whether an application type is set; if not, use the default (YARN);
    //    if it is set and longer than the allowed length (20), it is truncated
    // ......
    try {
        // call RMAppManager to submit application directly
        rmAppManager.submitApplication(submissionContext,
            System.currentTimeMillis(), user);
        // Submission succeeded
        LOG.info("Application with id " + applicationId.getId() +
            " submitted by user " + user);
        RMAuditLogger.logSuccess(user, AuditConstants.SUBMIT_APP_REQUEST,
            "ClientRMService", applicationId);
    } catch (YarnException e) {
        // A failure results in an exception being thrown
    }
    // ......
}
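The defaulting rules in checks 2 to 4 are easy to state as a small function. The following is a sketch of those rules as described in the comments above, not the actual ClientRMService code: the class, the Submission holder, and the constant names are hypothetical, while the default values ("default", "N/A", "YARN") and the length limit of 20 are the ones listed in the comments.

```java
/** Sketch of the submission defaulting rules; all names are hypothetical. */
public class SubmissionDefaultsSketch {

    static final String DEFAULT_QUEUE = "default";
    static final String DEFAULT_NAME = "N/A";
    static final String DEFAULT_TYPE = "YARN";
    static final int MAX_TYPE_LENGTH = 20;

    /** Immutable holder for the three normalized fields. */
    static final class Submission {
        final String queue;
        final String name;
        final String type;

        Submission(String queue, String name, String type) {
            this.queue = queue;
            this.name = name;
            this.type = type;
        }
    }

    /** Fills in defaults for missing fields and truncates an overlong type. */
    static Submission normalize(String queue, String name, String type) {
        String q = (queue == null) ? DEFAULT_QUEUE : queue;
        String n = (name == null) ? DEFAULT_NAME : name;
        String t;
        if (type == null) {
            t = DEFAULT_TYPE;
        } else if (type.length() > MAX_TYPE_LENGTH) {
            t = type.substring(0, MAX_TYPE_LENGTH);
        } else {
            t = type;
        }
        return new Submission(q, n, t);
    }

    public static void main(String[] args) {
        Submission s = normalize(null, "flinkExample", "Apache Flink");
        System.out.println(s.queue + " / " + s.name + " / " + s.type);
    }
}
```

Since the Flink client sets the type to "Apache Flink" unless a custom type is configured (see step 10 of startAppMaster above), the truncation branch only matters for custom types longer than 20 characters.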
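2) RMAppManager.submitApplication() mainly creates the RMApp and requests the AM container from the ResourceScheduler. Everything from here until the AM container is started on a NodeManager is done by Yarn itself. The details are not analyzed here and are left for a later post; only the entry point is given: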
protected void submitApplication(
        ApplicationSubmissionContext submissionContext, long submitTime,
        String user) throws YarnException {
    ApplicationId applicationId = submissionContext.getApplicationId();
    // 1. Create the RMApp; an exception is thrown if the applicationId already exists
    RMAppImpl application =
        createAndPopulateNewRMApp(submissionContext, submitTime, user);
    ApplicationId appId = submissionContext.getApplicationId();
    // The security mode is either simple or kerberos, set in the configuration file
    if (UserGroupInformation.isSecurityEnabled()) {
        // Kerberos mode
        // ......
    } else {
        // Simple mode
        // Dispatcher is not yet started at this time, so these START events
        // enqueued should be guaranteed to be first processed when dispatcher
        // gets started.
        // 2. Submit the application to the ResourceScheduler (the pluggable scheduler)?
        //    (I have not verified this step yet)
        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppEvent(applicationId, RMAppEventType.START));
    }
}
3) In Per-job mode, the entry point that the AM container loads and runs is the main() method of YarnJobClusterEntrypoint. The annotated source is as follows:
public static void main(String[] args) {
    // startup checks and logging
    // 1. Log environment information such as the user, environment variables, Java version and JVM options
    EnvironmentInformation.logEnvironmentInfo(LOG, YarnJobClusterEntrypoint.class.getSimpleName(), args);
    // 2. Register handlers for the various signals: they are written to the log
    SignalHandler.register(LOG);
    // 3. Install the JVM shutdown safeguard hook: keeps JVM exit from being blocked by other shutdown hooks
    JvmShutdownSafeguard.installAsShutdownHook(LOG);
    Map<String, String> env = System.getenv();
    final String workingDirectory = env.get(ApplicationConstants.Environment.PWD.key());
    Preconditions.checkArgument(
        workingDirectory != null,
        "Working directory variable (%s) not set",
        ApplicationConstants.Environment.PWD.key());
    try {
        // 4. Log information about the user that Yarn runs as
        YarnEntrypointUtils.logYarnEnvironmentInformation(env, LOG);
    } catch (IOException e) {
        LOG.warn("Could not log YARN environment information.", e);
    }
    // 5. Load the Flink configuration
    Configuration configuration = YarnEntrypointUtils.loadConfiguration(workingDirectory, env, LOG);
    YarnJobClusterEntrypoint yarnJobClusterEntrypoint = new YarnJobClusterEntrypoint(
        configuration,
        workingDirectory);
    // 6. Entry: create and start the internal services
    ClusterEntrypoint.runClusterEntrypoint(yarnJobClusterEntrypoint);
}
4) The subsequent call chain in ClusterEntrypoint is runClusterEntrypoint() -> startCluster() -> runCluster(). It is fairly simple, so the focus here is on runCluster():
// ClusterEntrypoint.java
private void runCluster(Configuration configuration) throws Exception {
    synchronized (lock) {
        initializeServices(configuration);
        // write host information into configuration
        configuration.setString(JobManagerOptions.ADDRESS, commonRpcService.getAddress());
        configuration.setInteger(JobManagerOptions.PORT, commonRpcService.getPort());
        // 1. Create the factory for the Dispatcher and ResourceManager components;
        //    the JobGraph is rebuilt from the local file along this path
        final DispatcherResourceManagerComponentFactory<?> dispatcherResourceManagerComponentFactory = createDispatcherResourceManagerComponentFactory(configuration);
        // 2. Entry: start the cluster components, passing in the RpcService, HAServices, BlobServer,
        //    HeartbeatServices, MetricRegistry, ExecutionGraphStore, etc.
        clusterComponent = dispatcherResourceManagerComponentFactory.create(
            configuration,
            commonRpcService,
            haServices,
            blobServer,
            heartbeatServices,
            metricRegistry,
            archivedExecutionGraphStore,
            new RpcMetricQueryServiceRetriever(metricRegistry.getMetricQueryServiceRpcService()),
            this);
        // ......
    }
}
5) The create() method starts many Flink components; the ones most relevant to job submission are the Dispatcher and the ResourceManager. The code is as follows:
public DispatcherResourceManagerComponent<T> create(
        Configuration configuration,
        RpcService rpcService,
        HighAvailabilityServices highAvailabilityServices,
        BlobServer blobServer,
        HeartbeatServices heartbeatServices,
        MetricRegistry metricRegistry,
        ArchivedExecutionGraphStore archivedExecutionGraphStore,
        MetricQueryServiceRetriever metricQueryServiceRetriever,
        FatalErrorHandler fatalErrorHandler) throws Exception {
    LeaderRetrievalService dispatcherLeaderRetrievalService = null;
    LeaderRetrievalService resourceManagerRetrievalService = null;
    WebMonitorEndpoint<U> webMonitorEndpoint = null;
    ResourceManager<?> resourceManager = null;
    JobManagerMetricGroup jobManagerMetricGroup = null;
    T dispatcher = null;
    try {
        dispatcherLeaderRetrievalService = highAvailabilityServices.getDispatcherLeaderRetriever();
        resourceManagerRetrievalService = highAvailabilityServices.getResourceManagerLeaderRetriever();
        final LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever = new RpcGatewayRetriever<>(
            rpcService,
            DispatcherGateway.class,
            DispatcherId::fromUuid,
            10,
            Time.milliseconds(50L));
        final LeaderGatewayRetriever<ResourceManagerGateway> resourceManagerGatewayRetriever = new RpcGatewayRetriever<>(
            rpcService,
            ResourceManagerGateway.class,
            ResourceManagerId::fromUuid,
            10,
            Time.milliseconds(50L));
        final ExecutorService executor = WebMonitorEndpoint.createExecutorService(
            configuration.getInteger(RestOptions.SERVER_NUM_THREADS),
            configuration.getInteger(RestOptions.SERVER_THREAD_PRIORITY),
            "DispatcherRestEndpoint");
        final long updateInterval = configuration.getLong(MetricOptions.METRIC_FETCHER_UPDATE_INTERVAL);
        final MetricFetcher metricFetcher = updateInterval == 0
            ? VoidMetricFetcher.INSTANCE
            : MetricFetcherImpl.fromConfiguration(
                configuration,
                metricQueryServiceRetriever,
                dispatcherGatewayRetriever,
                executor);
        webMonitorEndpoint = restEndpointFactory.createRestEndpoint(
            configuration,
            dispatcherGatewayRetriever,
            resourceManagerGatewayRetriever,
            blobServer,
            executor,
            metricFetcher,
            highAvailabilityServices.getWebMonitorLeaderElectionService(),
            fatalErrorHandler);
        log.debug("Starting Dispatcher REST endpoint.");
        webMonitorEndpoint.start();
        final String hostname = getHostname(rpcService);
        jobManagerMetricGroup = MetricUtils.instantiateJobManagerMetricGroup(
            metricRegistry,
            hostname,
            ConfigurationUtils.getSystemResourceMetricsProbingInterval(configuration));
        // 1. Returns a new YarnResourceManager
        /* Call path: AbstractDispatcherResourceManagerComponentFactory
            -> ActiveResourceManagerFactory
            -> YarnResourceManagerFactory
        */
        ResourceManager<?> resourceManager1 = resourceManagerFactory.createResourceManager(
            configuration,
            ResourceID.generate(),
            rpcService,
            highAvailabilityServices,
            heartbeatServices,
            metricRegistry,
            fatalErrorHandler,
            new ClusterInformation(hostname, blobServer.getPort()),
            webMonitorEndpoint.getRestBaseUrl(),
            jobManagerMetricGroup);
        resourceManager = resourceManager1;
        final HistoryServerArchivist historyServerArchivist = HistoryServerArchivist.createHistoryServerArchivist(configuration, webMonitorEndpoint);
        // 2. The JobGraph instance is deserialized here; returns a new MiniDispatcher
        dispatcher = dispatcherFactory.createDispatcher(
            configuration,
            rpcService,
            highAvailabilityServices,
            resourceManagerGatewayRetriever,
            blobServer,
            heartbeatServices,
            jobManagerMetricGroup,
            metricRegistry.getMetricQueryServiceGatewayRpcAddress(),
            archivedExecutionGraphStore,
            fatalErrorHandler,
            historyServerArchivist);
        log.debug("Starting ResourceManager.");
        // Start the ResourceManager, which goes through the following stages:
        // leader election -> (in ResourceManager.java)
        //   -> grantLeadership(...)
        //   -> tryAcceptLeadership(...)
        //   -> slotManager startup
        resourceManager.start();
        resourceManagerRetrievalService.start(resourceManagerGatewayRetriever);
        log.debug("Starting Dispatcher.");
        // Start the Dispatcher, which goes through the following stages:
        // leader election -> (in Dispatcher.java) grantLeadership -> tryAcceptLeadershipAndRunJobs
        //   -> createJobManagerRunner -> startJobManagerRunner -> jobManagerRunner.start()
        //
        // -> (in JobManagerRunner.java) start() -> leaderElectionService.start(...)
        // -> grantLeadership(...) -> verifyJobSchedulingStatusAndStartJobManager(...)
        // -> startJobMaster(leaderSessionId); this startJobMaster presumably starts the JobManager
        //
        // -> (in JobManagerRunner.java) jobMasterService.start(...)
        // -> (in JobMaster.java) startJobExecution(...)
        //   -> {startJobMasterServices(), which starts the slotPool -> resourceManagerLeaderRetriever.start(...)}
        // -> startJobExecution(...) ->
        dispatcher.start();
        dispatcherLeaderRetrievalService.start(dispatcherGatewayRetriever);
        return createDispatcherResourceManagerComponent(
            dispatcher,
            resourceManager,
            dispatcherLeaderRetrievalService,
            resourceManagerRetrievalService,
            webMonitorEndpoint,
            jobManagerMetricGroup);
    } catch (Exception exception) {
        // clean up all started components
        // ......
    }
}
6) After this, the slotPool in the JobManager requests resources from the SlotManager, which in turn requests them from Yarn's ResourceManager. Once resources are granted, TaskManagers are started and their slot information is registered with the slotManager and the slotPool. The details are not expanded here and are left for later analysis.
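The call chains noted in the comments of create() share one shape: a component's start() only joins a leader election, and the real work happens in the grantLeadership callback. The toy below replays that ordering for the ResourceManager and Dispatcher paths. It is a sketch under my own assumptions: Contender, ImmediateElection and ToyComponent are hypothetical types, and the election here grants leadership immediately, which a real HA setup does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Toy replay of leadership-driven startup; all types are hypothetical. */
public class LeaderStartupSketch {

    /** Shape of a leader-election callback, for illustration only. */
    interface Contender {
        void grantLeadership(UUID leaderSessionId);
    }

    /** Election stub that immediately elects whoever starts it. */
    static class ImmediateElection {
        void start(Contender contender) {
            contender.grantLeadership(UUID.randomUUID());
        }
    }

    /** A component whose real work only begins once leadership is granted. */
    static class ToyComponent implements Contender {
        private final String name;
        private final List<String> log;
        private final Runnable onLeadership;

        ToyComponent(String name, List<String> log, Runnable onLeadership) {
            this.name = name;
            this.log = log;
            this.onLeadership = onLeadership;
        }

        void start(ImmediateElection election) {
            log.add(name + ".start");
            election.start(this);
        }

        @Override
        public void grantLeadership(UUID leaderSessionId) {
            log.add(name + ".grantLeadership");
            onLeadership.run();
        }
    }

    /** Starts the ResourceManager first, then the Dispatcher, and returns the event log. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        ImmediateElection election = new ImmediateElection();
        ToyComponent jobManagerRunner = new ToyComponent(
                "JobManagerRunner", log, () -> log.add("JobMaster.startJobExecution"));
        ToyComponent resourceManager = new ToyComponent(
                "ResourceManager", log, () -> log.add("SlotManager.start"));
        ToyComponent dispatcher = new ToyComponent(
                "Dispatcher", log, () -> jobManagerRunner.start(election));
        resourceManager.start(election);
        dispatcher.start(election);
        return log;
    }

    public static void main(String[] args) {
        for (String event : run()) {
            System.out.println(event);
        }
    }
}
```

The nesting is the part worth noticing: the Dispatcher's leadership callback is what starts the JobManagerRunner, which then runs its own election before the JobMaster begins executing the job.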
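IV. Summary
This post still has many gaps. Filling them in will take further reading of the source and a clearer understanding of the design. Later I will also check the analysis against the detailed logs of a job submitted in Flink's Per-job mode.
If anything here is unclear, I strongly suggest reading the articles below; if anything is wrong, please point it out in the comments. Thank you!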
Ref
[1]https://files.alicdn.com/tpsservice/7bb8f513c765b97ab65401a1b78c8cb8.pdf
[2]https://yq.aliyun.com/articles/719262?spm=a2c4e.11153940.0.0.3ea9469ei7H3Wx#