Cluster Mode Overview
This document gives a short overview of how Spark runs on clusters, to make it easier to understand the components involved. Read through the application submission guide to learn about launching applications on a cluster.
Components
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Put another way, a Spark application consists of a set of independent processes: one driver program process plus multiple executor processes.
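As a minimal sketch of the driver side (the application name and the computation are made up for illustration; the master URL is normally supplied at submit time), the driver program is simply the process that runs main() and creates the SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // "MyApp" is a placeholder application name.
    val conf = new SparkConf().setAppName("MyApp")
    val sc = new SparkContext(conf)

    // Work defined here in the driver is broken into tasks that run on the executors.
    val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
    println(s"even numbers: $evens")

    sc.stop()
  }
}
```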
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
In other words: the driver program contains the SparkContext. To run a job on the cluster, the SparkContext asks the cluster manager for resources, for example acquiring two executors on two different nodes. Once it is connected and has obtained those resources, i.e. executors running on nodes across the cluster, the SparkContext sends the application code (the JAR or Python files passed to it) to those executors. Finally, the SparkContext sends tasks to the executors to run.
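A hedged sketch of how this connection is configured: the master URL selects the cluster manager, and the jar list names the application code that gets shipped to the executors. The host names, port, jar path, and memory setting below are placeholders, not values from this document:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder host name, port, jar path, and executor memory size.
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://master-host:7077")   // standalone master; "yarn" or "mesos://host:5050" select the other managers
  .setJars(Seq("/path/to/my-app.jar"))     // application code to be sent to the executors
  .set("spark.executor.memory", "2g")      // resources requested per executor

val sc = new SparkContext(conf)
```

In practice the master URL and jars are more often passed on the spark-submit command line, so the same application code can run unchanged against different cluster managers.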
There are several useful things to note about this architecture:
1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
Two additional observations on this point: executors belonging to different applications can coexist on the same node; if two Spark applications each place an executor on a given node, those executors run in separate JVMs and do not interfere with each other. And besides a disk-based external storage system, a memory-speed virtual distributed storage system such as Alluxio can also be used to share data between applications.
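To illustrate data sharing through external storage, here is a rough sketch (the HDFS path and application names are hypothetical) of two separate applications, each with its own SparkContext, that exchange data only through the storage system:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Application A: writes its result to external storage (the HDFS path is a placeholder).
object WriterApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WriterApp"))
    sc.parallelize(Seq("a" -> 1, "b" -> 2))
      .saveAsTextFile("hdfs://namenode:8020/shared/results")
    sc.stop()
  }
}

// Application B: a separate SparkContext cannot see WriterApp's RDDs,
// so it reads the shared data back from the external store.
object ReaderApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReaderApp"))
    val shared = sc.textFile("hdfs://namenode:8020/shared/results")
    println(shared.count())
    sc.stop()
  }
}
```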