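The previous article, 《尝鲜阿里云容器服务Kubernetes 1.16,共享TensorFlow实验室》 ("Trying out Alibaba Cloud Container Service for Kubernetes 1.16: a shared TensorFlow lab"), described how to share and isolate GPU resources with the cGPU solution.
This article covers elastic scaling of cGPU resources.
PS: The steps below build on the environment from the previous article; see that article for how to set it up.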
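Configuring the auto scaling group
- In the cluster list, open the "More" drop-down menu of the target cluster and select "Auto Scaling".
- Configure the basic scale-in rules, click "Create Scaling Group", and select "Shared GPU instance".
- Select the instance type you need; this example uses "ecs.gn6i-c4g1.xlarge". Nodes provisioned by this group carry the default labels "cgpu: true, workload_type: gpushare".
- Click OK to finish configuring the scaling group.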
Triggering scale-out
Save the following content as mem_deployment.yaml, then run kubectl apply -f mem_deployment.yaml
to initialize the environment.
---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
        - name: PASSWORD
          value: mypassw0rd
# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook
  type: LoadBalancer
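The manifest sets the same value for the aliyun.com/gpu-mem request and limit. Kubernetes does not allow extended resources to be overcommitted, so when both are given they have to match. The following is a minimal sketch of that check, not part of the original setup: the helper name is hypothetical, and treating every domain-qualified resource name (one containing "/") as an extended resource is a simplifying assumption.

```python
# Hypothetical helper (not part of Kubernetes or arena): flags extended
# resources whose request differs from the limit in a container spec.
def extended_resource_mismatches(resources: dict) -> list:
    """Return the names of extended resources where request != limit."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    mismatched = []
    for name, limit in limits.items():
        # Assumption: a name containing "/" (e.g. "aliyun.com/gpu-mem")
        # is treated as an extended resource; cpu and memory are skipped.
        if "/" not in name:
            continue
        if name in requests and requests[name] != limit:
            mismatched.append(name)
    return mismatched


# The resources section of the tf-notebook container from the manifest above.
tf_notebook_resources = {
    "limits": {"aliyun.com/gpu-mem": 4},
    "requests": {"aliyun.com/gpu-mem": 4},
}

print(extended_resource_mismatches(tf_notebook_resources))
```

For the manifest in this article the list is empty, because both values are 4.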
Run kubectl scale --replicas 7 deploy/tf-notebook
to raise the replica count to 7, which triggers the scaling group to scale out.
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl scale --replicas 7 deploy/tf-notebook
deployment.extensions/tf-notebook scaled
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tf-notebook-7cf4575d78-dc2fr 0/1 Pending 0 19s <none> <none> <none> <none>
tf-notebook-7cf4575d78-jm2cb 0/1 Pending 0 19s <none> <none> <none> <none>
tf-notebook-7cf4575d78-lmn5w 0/1 Pending 0 19s <none> <none> <none> <none>
tf-notebook-7cf4575d78-n9ldb 1/1 Running 0 19s 172.20.64.39 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-rzgtl 1/1 Running 0 19s 172.20.64.40 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-vzxvb 1/1 Running 0 58m 172.20.64.36 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-w6spt 0/1 Pending 0 19s <none> <none> <none> <none>
# Provisioning the new nodes takes a while...
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tf-notebook-7cf4575d78-dc2fr 1/1 Running 0 2m10s 172.20.67.21 cn-zhangjiakou.192.168.3.198 <none> <none>
tf-notebook-7cf4575d78-jm2cb 1/1 Running 0 2m10s 172.20.67.20 cn-zhangjiakou.192.168.3.198 <none> <none>
tf-notebook-7cf4575d78-lmn5w 1/1 Running 0 2m10s 172.20.67.79 cn-zhangjiakou.192.168.3.199 <none> <none>
tf-notebook-7cf4575d78-n9ldb 1/1 Running 0 2m10s 172.20.64.39 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-rzgtl 1/1 Running 0 2m10s 172.20.64.40 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-vzxvb 1/1 Running 0 60m 172.20.64.36 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-w6spt 1/1 Running 0 2m10s 172.20.67.22 cn-zhangjiakou.192.168.3.198 <none> <none>
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get node -L cgpu,workload_type
NAME                           STATUS   ROLES    AGE    VERSION            CGPU   WORKLOAD_TYPE
cn-zhangjiakou.192.168.0.138   Ready    master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112   Ready    master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113   Ready    <none>   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115   Ready    master   19d    v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.184   Ready    <none>   8d     v1.16.6-aliyun.1   true
cn-zhangjiakou.192.168.3.189   Ready    <none>   7d9h   v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198   Ready    <none>   134m   v1.16.6-aliyun.1   true   gpushare
cn-zhangjiakou.192.168.3.199   Ready    <none>   129m   v1.16.6-aliyun.1   true   gpushare
jumper(⎈ |zjk-gpu:default)➜ ~ arena top node -s -d
NAME: cn-zhangjiakou.192.168.3.184
IPADDRESS: 192.168.3.184
NAME NAMESPACE GPU0(Allocated)
tf-notebook-7cf4575d78-n9ldb default 4
tf-notebook-7cf4575d78-rzgtl default 4
tf-notebook-7cf4575d78-vzxvb default 4
Allocated : 12 (85%)
Total : 14
----------------------------------------------------------------------------------------------------------------------------------
NAME: cn-zhangjiakou.192.168.3.198
IPADDRESS: 192.168.3.198
NAME NAMESPACE GPU0(Allocated)
tf-notebook-7cf4575d78-dc2fr default 4
tf-notebook-7cf4575d78-jm2cb default 4
tf-notebook-7cf4575d78-w6spt default 4
Allocated : 12 (85%)
Total : 14
----------------------------------------------------------------------------------------------------------------------------------
NAME: cn-zhangjiakou.192.168.3.199
IPADDRESS: 192.168.3.199
NAME NAMESPACE GPU0(Allocated)
tf-notebook-7cf4575d78-lmn5w default 4
Allocated : 4 (28%)
Total : 14
----------------------------------------------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In GPUShare Node:
28/42 (GiB) (66%)
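As shown above, raising the replica count to 7 provisioned two additional GPU nodes, both labeled "cgpu: true, workload_type: gpushare".
The arena output shows that 28 of the 42 GiB of GPU memory is allocated.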
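The numbers in the output can be reproduced with some simple arithmetic. The sketch below is only an illustration under stated assumptions: each node exposes 14 GiB of shareable GPU memory on a single GPU (the "Total : 14" line), each pod requests 4 GiB, a pod cannot span nodes, no other GPU workloads are present, and one cGPU node already exists. That arena truncates its percentages rather than rounding them is inferred from the output, not confirmed.

```python
import math

GPU_MEM_PER_NODE_GIB = 14  # "Total : 14" per node in the arena output
GPU_MEM_PER_POD_GIB = 4    # aliyun.com/gpu-mem requested by each pod
EXISTING_CGPU_NODES = 1    # cn-zhangjiakou.192.168.3.184 existed before scaling


def nodes_needed(replicas: int,
                 node_gib: int = GPU_MEM_PER_NODE_GIB,
                 pod_gib: int = GPU_MEM_PER_POD_GIB) -> int:
    """Number of GPU nodes needed when a pod cannot span nodes."""
    pods_per_node = node_gib // pod_gib  # 14 // 4 = 3; 2 GiB per node stays unused
    return math.ceil(replicas / pods_per_node)


def percent_allocated(allocated_gib: int, total_gib: int) -> int:
    """Allocation percentage, truncated (assumed to match arena's display)."""
    return allocated_gib * 100 // total_gib


total_nodes = nodes_needed(7)
new_nodes = max(0, total_nodes - EXISTING_CGPU_NODES)
print(total_nodes, new_nodes)  # 3 2

# Per-node and cluster-wide figures from the arena output.
print(percent_allocated(12, 14),
      percent_allocated(4, 14),
      percent_allocated(28, 42))  # 85 28 66
```

Seven replicas at three pods per node need three nodes, which is why two nodes were added to the existing one. The 2 GiB left over on each full node cannot hold another 4 GiB pod.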
Triggering scale-in
The above shows that shared GPU resources scale out as expected. Next, release the resources to verify scale-in for shared GPU nodes.
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl scale --replicas 1 deploy/tf-notebook
deployment.extensions/tf-notebook scaled
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tf-notebook-7cf4575d78-dc2fr 1/1 Terminating 0 4m7s 172.20.67.21 cn-zhangjiakou.192.168.3.198 <none> <none>
tf-notebook-7cf4575d78-jm2cb 1/1 Terminating 0 4m7s 172.20.67.20 cn-zhangjiakou.192.168.3.198 <none> <none>
tf-notebook-7cf4575d78-lmn5w 1/1 Terminating 0 4m7s 172.20.67.79 cn-zhangjiakou.192.168.3.199 <none> <none>
tf-notebook-7cf4575d78-n9ldb 1/1 Terminating 0 4m7s 172.20.64.39 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-rzgtl 1/1 Terminating 0 4m7s 172.20.64.40 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-vzxvb 1/1 Running 0 62m 172.20.64.36 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-w6spt 1/1 Terminating 0 4m7s 172.20.67.22 cn-zhangjiakou.192.168.3.198 <none> <none>
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get node
NAME STATUS ROLES AGE VERSION
cn-zhangjiakou.192.168.0.138 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113 Ready <none> 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.184 Ready <none> 8d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.189 Ready <none> 7d8h v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198 Ready <none> 78m v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.199 Ready <none> 73m v1.16.6-aliyun.1
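# At this point the newly provisioned nodes are still Ready. They are removed in the next scale-in cycle; how long that takes depends on the scale-in settings of the scaling group.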
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get node
NAME STATUS ROLES AGE VERSION
cn-zhangjiakou.192.168.0.138 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113 Ready <none> 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.184 Ready <none> 8d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.189 Ready <none> 7d9h v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198 NotReady <none> 142m v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.199 NotReady <none> 137m v1.16.6-aliyun.1
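As shown above, after the replica count is lowered, the newly provisioned nodes are released after a while. This setup uses the ECS swift mode (极速模式), so the nodes show as NotReady instead of disappearing outright. Swift mode makes the next scale-out start faster, at the cost of a small storage fee.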
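Instead of rerunning kubectl get node by hand, the scale-in state can be read from the JSON form of the same command. The sketch below is one possible way to do it, not the article's method. It assumes the auto-scaled nodes carry the workload_type=gpushare label described earlier; both function names are hypothetical, and the sample data is illustrative rather than captured output.

```python
import json
import subprocess


def gpushare_node_readiness(node_list: dict) -> dict:
    """Map each node labeled workload_type=gpushare to True if it is Ready."""
    readiness = {}
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        if metadata.get("labels", {}).get("workload_type") != "gpushare":
            continue
        conditions = node.get("status", {}).get("conditions", [])
        # A node is Ready only when its "Ready" condition has status "True";
        # "False" and "Unknown" both show up as NotReady in kubectl.
        readiness[metadata.get("name", "")] = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
    return readiness


def fetch_nodes() -> dict:
    """Fetch the node list from the current kubectl context as a dict."""
    result = subprocess.run(
        ["kubectl", "get", "node", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Illustrative sample shaped like `kubectl get node -o json` output.
sample_nodes = {
    "items": [
        {
            "metadata": {"name": "cn-zhangjiakou.192.168.3.184",
                         "labels": {"cgpu": "true"}},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"name": "cn-zhangjiakou.192.168.3.198",
                         "labels": {"cgpu": "true",
                                    "workload_type": "gpushare"}},
            "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
        },
    ]
}

for name, ready in sorted(gpushare_node_readiness(sample_nodes).items()):
    print(name, "Ready" if ready else "NotReady")
```

Against a live cluster, pass fetch_nodes() in place of sample_nodes. With swift mode, auto-scaled nodes that report NotReady have been stopped rather than removed.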