Trying Out Alibaba Cloud Container Service for Kubernetes 1.16: A Shared TensorFlow Lab (Part 2), Elastic Scaling for Shared GPUs

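The previous article, "Trying Out Alibaba Cloud Container Service for Kubernetes 1.16: A Shared TensorFlow Lab", described how to use the cGPU solution to share and isolate GPU resources.
This article covers elastic scaling based on cGPU resources.
Note: the steps below build on the environment from the previous article. See that article for how to set the environment up.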

Configure the auto scaling group

  1. In the "Clusters" list, open the "More" drop-down menu of the target cluster and select "Auto Scaling".
  2. Configure the basic "scale-in rules", then click "Create Scaling Group" and select "Shared GPU Instance".
  3. Select the instance type you need. This example uses the specification "ecs.gn6i-c4g1.xlarge". The labels for scaled-out nodes are already set by default: "cgpu: true, workload_type: gpushare".
  4. Click OK. The auto scaling group is now configured.

Trigger scale-out

Save the following content as mem_deployment.yaml, then run kubectl apply -f mem_deployment.yaml to initialize the environment.

---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
          - name: PASSWORD
            value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook
  type: LoadBalancer

Run kubectl scale --replicas 7 deploy/tf-notebook to raise the replica count to 7, which triggers the auto scaling group to scale out.

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl scale --replicas 7 deploy/tf-notebook
deployment.extensions/tf-notebook scaled
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get pod -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
tf-notebook-7cf4575d78-dc2fr   0/1     Pending   0          19s   <none>         <none>                         <none>           <none>
tf-notebook-7cf4575d78-jm2cb   0/1     Pending   0          19s   <none>         <none>                         <none>           <none>
tf-notebook-7cf4575d78-lmn5w   0/1     Pending   0          19s   <none>         <none>                         <none>           <none>
tf-notebook-7cf4575d78-n9ldb   1/1     Running   0          19s   172.20.64.39   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-rzgtl   1/1     Running   0          19s   172.20.64.40   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-vzxvb   1/1     Running   0          58m   172.20.64.36   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-w6spt   0/1     Pending   0          19s   <none>         <none>                         <none>           <none>

# Provisioning the new nodes takes some time...

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get pod -o wide
NAME                           READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE   READINESS GATES
tf-notebook-7cf4575d78-dc2fr   1/1     Running   0          2m10s   172.20.67.21   cn-zhangjiakou.192.168.3.198   <none>           <none>
tf-notebook-7cf4575d78-jm2cb   1/1     Running   0          2m10s   172.20.67.20   cn-zhangjiakou.192.168.3.198   <none>           <none>
tf-notebook-7cf4575d78-lmn5w   1/1     Running   0          2m10s   172.20.67.79   cn-zhangjiakou.192.168.3.199   <none>           <none>
tf-notebook-7cf4575d78-n9ldb   1/1     Running   0          2m10s   172.20.64.39   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-rzgtl   1/1     Running   0          2m10s   172.20.64.40   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-vzxvb   1/1     Running   0          60m     172.20.64.36   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-w6spt   1/1     Running   0          2m10s   172.20.67.22   cn-zhangjiakou.192.168.3.198   <none>           <none>
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get node -L cgpu,workload_type
NAME                           STATUS   ROLES    AGE    VERSION            CGPU   WORKLOAD_TYPE
cn-zhangjiakou.192.168.0.138   Ready    master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112   Ready    master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113   Ready    <none>   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115   Ready    master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.184   Ready    <none>   8d     v1.16.6-aliyun.1   true
cn-zhangjiakou.192.168.3.189   Ready    <none>   7d9h   v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198   Ready    <none>   134m   v1.16.6-aliyun.1   true   gpushare
cn-zhangjiakou.192.168.3.199   Ready    <none>   129m   v1.16.6-aliyun.1   true   gpushare
jumper(⎈ |zjk-gpu:default)➜  ~ arena top node -s -d

NAME:       cn-zhangjiakou.192.168.3.184
IPADDRESS:  192.168.3.184

NAME                          NAMESPACE  GPU0(Allocated)
tf-notebook-7cf4575d78-n9ldb  default    4
tf-notebook-7cf4575d78-rzgtl  default    4
tf-notebook-7cf4575d78-vzxvb  default    4
Allocated :                   12 (85%)
Total :                       14
----------------------------------------------------------------------------------------------------------------------------------

NAME:       cn-zhangjiakou.192.168.3.198
IPADDRESS:  192.168.3.198

NAME                          NAMESPACE  GPU0(Allocated)
tf-notebook-7cf4575d78-dc2fr  default    4
tf-notebook-7cf4575d78-jm2cb  default    4
tf-notebook-7cf4575d78-w6spt  default    4
Allocated :                   12 (85%)
Total :                       14
----------------------------------------------------------------------------------------------------------------------------------

NAME:       cn-zhangjiakou.192.168.3.199
IPADDRESS:  192.168.3.199

NAME                          NAMESPACE  GPU0(Allocated)
tf-notebook-7cf4575d78-lmn5w  default    4
Allocated :                   4 (28%)
Total :                       14
----------------------------------------------------------------------------------------------------------------------------------


Allocated/Total GPU Memory In GPUShare Node:
28/42 (GiB) (66%)

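As shown above, raising the replica count to 7 scaled out two additional GPU nodes, both labeled "cgpu: true, workload_type: gpushare".
The arena output shows that 28 of the 42 GiB of GPU memory on the shared GPU nodes is allocated.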
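The node count follows from simple arithmetic on the numbers in the arena output above: each node reports 14 GiB of GPU memory and each replica requests 4 GiB. Here is a minimal shell sketch of that calculation, assuming one GPU per node and that pods are packed by `aliyun.com/gpu-mem` alone. The variable names are illustrative and do not belong to any tool.

```shell
#!/bin/sh
# Values taken from the arena output and the Deployment above.
gpu_mem_per_node=14   # GiB of GPU memory reported per node
gpu_mem_per_pod=4     # aliyun.com/gpu-mem requested by each replica
replicas=7

# Integer division: whole pods that fit on one node.
pods_per_node=$((gpu_mem_per_node / gpu_mem_per_pod))
# Ceiling division: nodes needed to hold all replicas.
nodes_needed=$(( (replicas + pods_per_node - 1) / pods_per_node ))
# GPU memory allocated versus total across those nodes.
allocated=$((replicas * gpu_mem_per_pod))
total=$((nodes_needed * gpu_mem_per_node))

echo "pods per node: $pods_per_node"
echo "nodes needed:  $nodes_needed"
echo "allocated:     $allocated/$total GiB ($((allocated * 100 / total))%)"
```

Since one cGPU node was already in the cluster, needing three nodes in total means two new ones, which matches the scale-out observed above.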

Trigger scale-in

The output above confirms that shared GPU nodes scale out as expected. Next, release the resources to verify that shared GPU nodes also scale in.

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl scale --replicas 1 deploy/tf-notebook
deployment.extensions/tf-notebook scaled
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl  get pod -o wide
NAME                           READY   STATUS        RESTARTS   AGE    IP             NODE                           NOMINATED NODE   READINESS GATES
tf-notebook-7cf4575d78-dc2fr   1/1     Terminating   0          4m7s   172.20.67.21   cn-zhangjiakou.192.168.3.198   <none>           <none>
tf-notebook-7cf4575d78-jm2cb   1/1     Terminating   0          4m7s   172.20.67.20   cn-zhangjiakou.192.168.3.198   <none>           <none>
tf-notebook-7cf4575d78-lmn5w   1/1     Terminating   0          4m7s   172.20.67.79   cn-zhangjiakou.192.168.3.199   <none>           <none>
tf-notebook-7cf4575d78-n9ldb   1/1     Terminating   0          4m7s   172.20.64.39   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-rzgtl   1/1     Terminating   0          4m7s   172.20.64.40   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-vzxvb   1/1     Running       0          62m    172.20.64.36   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-w6spt   1/1     Terminating   0          4m7s   172.20.67.22   cn-zhangjiakou.192.168.3.198   <none>           <none>
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get node
NAME                           STATUS   ROLES    AGE    VERSION
cn-zhangjiakou.192.168.0.138   Ready    master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112   Ready    master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113   Ready    <none>   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115   Ready    master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.184   Ready    <none>   8d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.189   Ready    <none>   7d8h   v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198   Ready    <none>   78m    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.199   Ready    <none>   73m    v1.16.6-aliyun.1
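# At this point the newly added machines are still Ready. They will be removed in the next scale-in cycle, after a delay that depends on the scale-in settings of the auto scaling group.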

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl  get node
NAME                           STATUS     ROLES    AGE    VERSION
cn-zhangjiakou.192.168.0.138   Ready      master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112   Ready      master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113   Ready      <none>   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115   Ready      master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.184   Ready      <none>   8d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.189   Ready      <none>   7d9h   v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198   NotReady   <none>   142m   v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.199   NotReady   <none>   137m   v1.16.6-aliyun.1

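As shown above, some time after the replica count is lowered, the newly added machines are reclaimed. This cluster uses the ECS swift mode ("极速模式"), so the nodes show as NotReady rather than disappearing outright. Swift mode makes the next startup faster, at the cost of a small storage fee.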
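To confirm that scale-in has happened without reading the node table by eye, you can filter on the STATUS column. Here is a minimal sketch, assuming the default `kubectl get node` column layout shown above, with STATUS in the second column. `count_not_ready` is a hypothetical helper name, not part of kubectl.

```shell
#!/bin/sh
# Count nodes whose STATUS column starts with NotReady.
# Reads `kubectl get node` output on stdin and skips the header row.
count_not_ready() {
  awk 'NR > 1 && $2 ~ /^NotReady/ { n++ } END { print n + 0 }'
}

# Sample rows copied from the output above, so the sketch runs without a cluster.
sample='NAME                           STATUS     ROLES    AGE    VERSION
cn-zhangjiakou.192.168.3.184   Ready      <none>   8d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198   NotReady   <none>   142m   v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.199   NotReady   <none>   137m   v1.16.6-aliyun.1'

printf '%s\n' "$sample" | count_not_ready
```

Against a live cluster, the same function would be used as `kubectl get node | count_not_ready`.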

References

Node auto scaling: https://help.aliyun.com/document_detail/119099.html
