Spark on k8s

Preface

Spark has supported running on Kubernetes since version 2.3. This article describes how to run Spark on Alibaba Cloud Container Service for Kubernetes.

Prerequisites

1. You have purchased an Alibaba Cloud Container Service for Kubernetes cluster. Purchase link: the Kubernetes console. The cluster type used in this example is Managed Kubernetes.
2. The Spark image has been built. The Dockerfile used to build the image in this example is:

# Base image
FROM registry.cn-beijing.aliyuncs.com/acs/spark:v2.4.0

# Maintainer
LABEL maintainer "guangcheng.zgc@alibaba-inc.com"

# Copy the jar into the specified directory
COPY ./spark-examples-0.0.1-SNAPSHOT.jar /opt/spark/examples/jars/

The registry address of the built image is: registry.cn-beijing.aliyuncs.com/bill_test/image_test:v2.4.0
3. In this example, the Spark job is started by submitting a yaml file with the k8s client.

Each step is described in detail below.

Building the Spark image

Building the Spark image requires Docker installed locally. This example covers installation on a Mac. Run:

brew cask install docker

After installation, run the docker command; output like the following indicates a successful install:

Usage:    docker [OPTIONS] COMMAND

A self-sufficient runtime for containers

Options:
      --config string      Location of client config files (default "/Users/bill.zhou/.docker")
  -D, --debug              Enable debug mode
  -H, --host list          Daemon socket(s) to connect to
  -l, --log-level string   Set the logging level ("debug"|"info"|"warn"|"error"|"fatal") (default "info")
      --tls                Use TLS; implied by --tlsverify
      --tlscacert string   Trust certs signed only by this CA (default "/Users/bill.zhou/.docker/ca.pem")
      --tlscert string     Path to TLS certificate file (default "/Users/bill.zhou/.docker/cert.pem")
      --tlskey string      Path to TLS key file (default "/Users/bill.zhou/.docker/key.pem")
      --tlsverify          Use TLS and verify the remote
  -v, --version            Print version information and quit

Building a docker image requires a Dockerfile. The Dockerfile for this example is created as follows:

# Enter the working directory:
cd /Users/bill.zhou/dockertest
# Copy the test jar into this directory:
cp /Users/jars/spark-examples-0.0.1-SNAPSHOT.jar ./
# Create the Dockerfile
vi Dockerfile

Put the following content into the Dockerfile:

# Base image
FROM registry.cn-beijing.aliyuncs.com/acs/spark:v2.4.0

# Maintainer
LABEL maintainer "guangcheng.zgc@alibaba-inc.com"

# Copy the jar into the specified directory
COPY ./spark-examples-0.0.1-SNAPSHOT.jar /opt/spark/examples/jars/

This example builds on an existing base image, registry.cn-beijing.aliyuncs.com/acs/spark:v2.4.0, and then adds its own test jar, spark-examples-0.0.1-SNAPSHOT.jar.
Once the Dockerfile is written, build the image with:

docker build /Users/bill.zhou/dockertest/ -t registry.cn-beijing.aliyuncs.com/bill_test/image_test:v2.4.0

After the build, push the image to the registry:

# Log in with your Alibaba Cloud account first
docker login --username=zhouguangcheng007@aliyun.com registry.cn-beijing.aliyuncs.com
# Push the image
docker push registry.cn-beijing.aliyuncs.com/bill_test/image_test:v2.4.0

With the image built and pushed, you can start using the Spark image.
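As an optional sanity check (a sketch, assuming the push above succeeded and you are still logged in to the registry), you can pull the image back and confirm the jar landed where the Dockerfile's COPY put it:

```shell
# Pull the image back from the registry
docker pull registry.cn-beijing.aliyuncs.com/bill_test/image_test:v2.4.0
# List the jars directory inside the image; the copied
# spark-examples-0.0.1-SNAPSHOT.jar should appear in the output
docker run --rm registry.cn-beijing.aliyuncs.com/bill_test/image_test:v2.4.0 \
  ls /opt/spark/examples/jars/
```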

Submitting jobs to the k8s cluster

In this example, yaml files are submitted to k8s with the k8s client kubectl.
First, purchase an ECS instance (in the same VPC as the k8s cluster) and install the k8s client kubectl on it. For installation instructions, see the kubectl installation guide.
After installation, configure the cluster credentials to get access to the k8s cluster. To obtain the credentials, open the cluster's "Basic Information" page in the console and copy the KubeConfig content shown there.
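The credential setup can be sketched as follows (the actual KubeConfig content comes from your own cluster's page in the console):

```shell
# Create the kubectl config directory
mkdir -p ~/.kube
# Paste the KubeConfig content copied from the cluster's
# "Basic Information" page into this file
vi ~/.kube/config
# Verify that the cluster is reachable
kubectl get nodes
```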
Then submit the Spark job by following these steps:

## Install the CRDs
kubectl apply -f manifest/spark-operator-crds.yaml
## Install the operator's service account and RBAC policy
kubectl apply -f manifest/spark-operator-rbac.yaml
## Install the service account and RBAC policy for Spark jobs
kubectl apply -f manifest/spark-rbac.yaml
## Install spark-on-k8s-operator
kubectl apply -f manifest/spark-operator.yaml
## Submit the spark-pi job
kubectl apply -f examples/spark-pi.yaml

The manifest files referenced above can be downloaded from the open-source spark-on-k8s-operator project; use the latest release.
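For example, assuming the project is still hosted under its original GitHub organization (the repository location is an assumption and may have moved since this was written), the manifests can be fetched with:

```shell
# Fetch the operator project, which contains the manifest/ and
# examples/ directories used in the kubectl commands above
git clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator.git
cd spark-on-k8s-operator
```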

After the job finishes, you can inspect the run logs with the following commands:

# List pods; -n specifies the namespace
kubectl get pod -n spark-operator
# View the pod logs
kubectl logs spark-pi-driver -n spark-operator
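Besides the driver logs, the operator also records the job state on the SparkApplication object itself, so the following should work as well (a sketch, assuming the spark-pi job from the steps above):

```shell
# Show the overall state of the SparkApplication (e.g. RUNNING, COMPLETED)
kubectl get sparkapplications spark-pi -n spark-operator
# Show detailed status and events for the job
kubectl describe sparkapplication spark-pi -n spark-operator
```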

Output like the following indicates the job ran successfully:

2019-07-23 11:55:54 INFO  SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1161
2019-07-23 11:55:54 INFO  DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:17) (first 15 tasks are for partitions Vector(0, 1))
2019-07-23 11:55:54 INFO  TaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2019-07-23 11:55:55 INFO  TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, 172.20.1.9, executor 1, partition 0, PROCESS_LOCAL, 7878 bytes)
2019-07-23 11:55:55 INFO  TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, 172.20.1.9, executor 1, partition 1, PROCESS_LOCAL, 7878 bytes)
2019-07-23 11:55:57 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 172.20.1.9:36662 (size: 1256.0 B, free: 117.0 MB)
2019-07-23 11:55:57 INFO  TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 2493 ms on 172.20.1.9 (executor 1) (1/2)
2019-07-23 11:55:57 INFO  TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 2789 ms on 172.20.1.9 (executor 1) (2/2)
2019-07-23 11:55:57 INFO  TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2019-07-23 11:55:58 INFO  DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:21) finished in 5.393 s
2019-07-23 11:55:58 INFO  DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:21, took 6.501405 s
**Pi is roughly 3.142955714778574**
2019-07-23 11:55:58 INFO  AbstractConnector:318 - Stopped Spark@49096b06{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-07-23 11:55:58 INFO  SparkUI:54 - Stopped Spark web UI at http://spark-test-1563882878789-driver-svc.spark-operator-t01.svc:4040
2019-07-23 11:55:58 INFO  KubernetesClusterSchedulerBackend:54 - Shutting down all executors
2019-07-23 11:55:58 INFO  KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asking each executor to shut down
2019-07-23 11:55:58 WARN  ExecutorPodsWatchSnapshotSource:87 - Kubernetes client has been closed (this is expected if the application is shutting down.)
2019-07-23 11:55:59 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-07-23 11:55:59 INFO  MemoryStore:54 - MemoryStore cleared
2019-07-23 11:55:59 INFO  BlockManager:54 - BlockManager stopped
2019-07-23 11:55:59 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-07-23 11:55:59 INFO  OutputCommitCoordinator$

Finally, look at the key parts of the spark-pi.yaml file.

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-operator
spec:
  type: Scala
  mode: cluster
  image: "registry.cn-beijing.aliyuncs.com/acs/spark:v2.4.0"  # registry path of the image to run
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi # entry class to run
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"  # jar containing the class; this path is inside the image
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  # Resources for the Spark driver
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  # Resources for the executors
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

yaml is the standard format for submitting resources to k8s; this file defines the image to run and the resources for the Spark driver and executors.
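To run the custom image built earlier instead of the stock example, the same spec fields would point at your own registry path and jar. A sketch (the mainClass shown is the example's; replace it with whatever entry class your own jar actually contains):

```yaml
spec:
  type: Scala
  mode: cluster
  # The custom image pushed earlier
  image: "registry.cn-beijing.aliyuncs.com/bill_test/image_test:v2.4.0"
  imagePullPolicy: Always
  # Replace with the entry class inside your own jar
  mainClass: org.apache.spark.examples.SparkPi
  # Path of the jar inside the image, as placed by the Dockerfile's COPY
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples-0.0.1-SNAPSHOT.jar"
```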

Summary

For an introduction to k8s containers, see: Container Service for Kubernetes.
For more on Spark on k8s, see:
Spark in action on Kubernetes - Playground setup and architecture overview
Spark in action on Kubernetes - How Spark Operator works
