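Differences between k8s 1.5 and k8s 1.9
An earlier attempt to install kubernetes 1.5.2 failed because of a docker package conflict. While reading up on how to install newer versions, I found that newer kubernetes packages no longer bundle docker; you have to install the docker service yourself first.
The machine already had Docker version 17.12.0-ce, build c97c6d6 installed.
Installing kubernetes (kubernetes.x86_64 1.5.2-0.7.git269f928.el7) on top of it then failed:
Error: docker-ce conflicts with 2:docker-1.12.6-71.git3e8e77d.el7.centos.1.x86_64
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
Suspecting a version problem, I searched online for how to install a newer release and found this:
"After kubernetes 1.6, however, installation became rather tedious: it needs certificates and all kinds of authentication, which is unfriendly to people new to kubernetes. If you follow the official documentation to install a local 'cluster', I think it will certainly not run, unless you can get past the GFW and also know how to keep adjusting parameters."
In other words, installation from k8s 1.6 onward may differ considerably from earlier versions. Because Google is blocked, many docker images have to be downloaded in advance.
The following three articles install k8s 1.7.5; following them failed because the docker images were missing.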
https://www.cnblogs.com/liangDream/p/7358847.html
http://www.bubuko.com/infodetail-2375091.html
https://www.kubernetes.org.cn/3063.html
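docker installation problems
Choosing a docker version
kubernetes 1.9.0 supports docker up to 17.03. The installed 17.12 is too new and has to be downgraded.
List of Docker versions supported by Kubernetes: http://blog.csdn.net/csdn_duomaomao/article/details/79171027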
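The version limit above can be checked before installing anything. The following is a minimal sketch, assuming the 17.03 limit quoted above; the function name version_le is made up for illustration, and only the major and minor fields are compared.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 <= version $2.
# Only the first two dot-separated fields (major.minor) are compared,
# so "17.03.2-ce" counts as 17.03.
version_le() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 < y[1] + 0)
        exit !(x[2] + 0 <= y[2] + 0)
    }'
}

# Usage (assumes docker is installed; 17.03 is the limit quoted in the text):
#   installed=$(docker version --format '{{.Server.Version}}')
#   version_le "$installed" 17.03 || echo "docker $installed is too new"
```

Comparing the fields numerically avoids the trap of a plain string comparison, where "17.3" and "17.03" would be treated as different versions.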
Removing docker
[root@tensorflow0 hdzhou]# yum remove docker \
docker-common \
docker-selinux \
docker-engine
=========================================================================================
 Package               Arch      Version                      Repository          Size
=========================================================================================
Removing:
 container-selinux     noarch    2:2.36-1.gitff95335.el7      @extras             34 k
Removing for dependencies:
 docker-ce             x86_64    17.12.0.ce-1.el7.centos      installed          123 M
 nvidia-docker2        noarch    2.0.2-1.docker17.12.0.ce     @nvidia-docker     2.3 k
Transaction Summary
=========================================================================================
Remove  1 Package (+2 Dependent packages)
docker fails to start
Feb 26 16:42:00 tensorflow0 dockerd[8717]: time="2018-02-26T16:42:00.315096986+08:00" level=info msg="libcontainerd: new containerd process, pid: 8725"
Feb 26 16:42:01 tensorflow0 dockerd[8717]: time="2018-02-26T16:42:01.319051277+08:00" level=error msg="[graphdriver] prior storage driver overlay2 failed: driver not supported"
Feb 26 16:42:01 tensorflow0 dockerd[8717]: Error starting daemon: error initializing graphdriver: driver not supported
Feb 26 16:42:01 tensorflow0 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Feb 26 16:42:01 tensorflow0 systemd[1]: Failed to start Docker Application Container Engine.
Fix: the old data directory was created with the overlay2 driver, which the newly installed docker cannot use here. Move it aside so that docker starts with a fresh directory:
sudo mv /var/lib/docker /var/lib/docker.old
k8s installation problems
Installing the rpms
rpm -ivh socat-1.7.3.2-2.el7.x86_64.rpm
rpm -ivh kubernetes-cni-0.6.0-0.x86_64.rpm kubelet-1.9.0-0.x86_64.rpm
rpm -ivh kubectl-1.9.0-0.x86_64.rpm
rpm -ivh kubeadm-1.9.0-0.x86_64.rpm
Removing the rpms
rpm -e <package name> --nodeps
e.g.:
rpm -e socat-1.7.3.2-2.el7.x86_64 --nodeps
rpm -e kubernetes-cni-0.6.0-0.x86_64 --nodeps
rpm -e kubelet-1.9.0-0.x86_64 --nodeps
rpm -e kubectl-1.9.0-0.x86_64 --nodeps
rpm -e kubeadm-1.9.0-0.x86_64 --nodeps
Viewing error messages
cat /var/log/messages
journalctl -xeu kubelet
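After kubelet starts, it is normal for the ca file to be missing; it is generated later when kubeadm init runs.
It is also normal for kubelet to keep restarting at this point: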
The kubelet is now restarting every few seconds, as it waits in a crashloop for kubeadm to tell it what to do. This crashloop is expected and normal, please proceed with the next step and the kubelet will start running normally.
Initializing the cluster
kubeadm init --kubernetes-version=v1.9.0 --pod-network-cidr=10.244.0.0/16
Be sure to record the kubeadm join line that kubeadm init prints; it is different every time.
e.g.:
kubeadm join --token 5ce44e.47b6dc4e4b66980f 192.168.1.138:6443 --discovery-token-ca-cert-hash sha256:9d7eac82d66744405c783de5403e1f2bb7191b4c1b350d721b7b8570c62ff83a
Getting the token again
kubeadm token list
or
kubeadm token create
A token expires after 24 hours; once it has expired, create a new one.
To get the sha256 hash, run on the master node:
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
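The token and the hash from the two steps above go into the join command in the same form as the example printed by kubeadm init. A small sketch of that assembly; the function name join_command is an illustration only, and it prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: prints a kubeadm join command line in the same form
# as the one shown by kubeadm init.
#   $1 = token, $2 = master address:port, $3 = hex sha256 of the CA public key
join_command() {
    printf 'kubeadm join --token %s %s --discovery-token-ca-cert-hash sha256:%s\n' "$1" "$2" "$3"
}

# Usage on the master (commands taken from the text above):
#   token=$(kubeadm token create)
#   hash=$(openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //')
#   join_command "$token" 192.168.1.138:6443 "$hash"
```

Printing the line first makes it easy to check the address and token before pasting it on a node.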
kubeadm init
[root@tensorflow0 etc]# kubeadm init --kubernetes-version=v1.9.0 --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.9.0
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks.
[preflight] Some fatal errors occurred:
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
[root@tensorflow0 etc]#
Append --ignore-preflight-errors 'Swap' to the command, or --ignore-preflight-errors all (not a good idea).
"Port 2379 is in use" appears because kubeadm reset was not run before initializing again.
Inspecting errors:
kubectl get pod kube-proxy-d2p7p -o wide --namespace=kube-system
kubectl describe pod kube-proxy-d2p7p --namespace=kube-system
Modify the kubelet configuration and start kubelet (on all nodes)
Note: keep watching the log output in /var/log/messages; you will see kubelet repeatedly failing to start.
cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
Edit 10-kubeadm.conf and change the cgroup-driver setting:
[root@centos7-base-ok]# cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_EXTRA_ARGS
Environment="KUBELET_SWAP_ARGS=--fail-swap-on=false"
Since 1.8, kubelet refuses to start if swap is enabled on the machine; --fail-swap-on defaults to true. You can either set it to false in the kubelet configuration with the line above (it only takes effect if $KUBELET_SWAP_ARGS is also added to the ExecStart line) or simply turn swap off on the machine, as described below.
Change "--cgroup-driver=systemd" to "--cgroup-driver=cgroupfs".
Note that kubelet's --cgroup-driver must match docker's cgroup driver. Check docker's with docker info | grep Cgroup; it may be either systemd or cgroupfs.
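The consistency check just described can be scripted. This is a rough sketch, assuming docker info prints a "Cgroup Driver: ..." line and that the drop-in file contains a --cgroup-driver=... flag as in the listing above; both function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `docker info` output on stdin and prints the
# value of the "Cgroup Driver" line.
docker_cgroup_driver() {
    awk -F': *' '/Cgroup Driver/ { print $2; exit }'
}

# Hypothetical helper: prints the first --cgroup-driver value found in the
# kubelet drop-in file given as $1.
kubelet_cgroup_driver() {
    sed -n 's/.*--cgroup-driver=\([a-z]*\).*/\1/p' "$1" | head -n 1
}

# Usage:
#   d=$(docker info 2>/dev/null | docker_cgroup_driver)
#   k=$(kubelet_cgroup_driver /etc/systemd/system/kubelet.service.d/10-kubeadm.conf)
#   [ "$d" = "$k" ] || echo "mismatch: docker=$d kubelet=$k"
```

If the two values differ, edit the drop-in so that it matches docker, then reload systemd and restart kubelet.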
Restart kubelet
[root@centos7-base-ok]# systemctl restart kubelet
[preflight] Running pre-flight checks.
[preflight] Some fatal errors occurred:
Turning off swap
swapoff -a
Turning off swap permanently
Edit /etc/fstab and comment out the swap line with #.
https://zhidao.baidu.com/question/2011273820596440908.html
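Commenting out the swap line can also be done with sed instead of by hand. A minimal sketch, assuming the swap entry has the word swap as a whitespace-separated field; the function name is made up, and it is worth running it on a copy of /etc/fstab first.

```shell
#!/bin/sh
# Hypothetical helper: comments out every not-yet-commented line of the
# file $1 that has "swap" as a whitespace-delimited field.
# A backup of the original is kept next to it as "$1.bak".
disable_swap_in_fstab() {
    sed -i.bak '/^[[:space:]]*#/!{/[[:space:]]swap[[:space:]]/s/^/#/;}' "$1"
}

# Usage:
#   cp /etc/fstab /tmp/fstab.test && disable_swap_in_fstab /tmp/fstab.test
#   diff /etc/fstab /tmp/fstab.test    # review, then apply to /etc/fstab
```

Lines that already start with # are skipped, so running it twice does not stack up comment markers.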
Removing etcd
yum erase etcd
Move the etcd data directory out of the way: mv /var/lib/etcd /var/lib/etcd.bak
The connection to the server localhost:8080 was refused - did you specify the right host or port?
export KUBECONFIG=/etc/kubernetes/admin.conf
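Without KUBECONFIG, kubectl falls back to localhost:8080, but the API server listens on port 6443, not 8080. The export above points kubectl at the admin config.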
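The export only lasts for the current shell, so it is usually written to a profile file as well. A small sketch that adds the line only when it is not already there; the function name and the choice of profile file are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: appends the KUBECONFIG export to the profile file $1
# unless exactly that line is already present.
persist_kubeconfig() {
    line='export KUBECONFIG=/etc/kubernetes/admin.conf'
    grep -qxF "$line" "$1" 2>/dev/null || printf '%s\n' "$line" >> "$1"
}

# Usage:
#   persist_kubeconfig ~/.bash_profile && . ~/.bash_profile
```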
runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
kube-dns fails to start
kube-system po/kube-dns-6f4fd4bdf-p5x4k 0/3 Pending 0 14m
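Some guides suggest editing /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
and removing $KUBELET_NETWORK_ARGS. Do not do this.
If dns is broken, run kubeadm reset and start over: initialize the master first, then set up the flannel network, and join the other machines only once that works.
Restart: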
systemctl daemon-reload && systemctl restart kubelet
kubeadm reset
kubeadm init --kubernetes-version=v1.9.0 --pod-network-cidr=10.244.0.0/16
kube-proxy fails to start
Same cause as "no IP addresses available" below:
the cluster was started many times and the virtual IPs were used up.
no IP addresses available
E1216 23:50:16.116098 28152 pod_workers.go:186] Error syncing pod 6f5b9673-e2b5-11e7-a0f5-001e67d35991 ("kube-dns-6f4fd4bdf-xrj4w_kube-system(6f5b9673-e2b5-11e7-a0f5-001e67d35991)"), skipping: failed to "CreatePodSandbox" for "kube-dns-6f4fd4bdf-xrj4w_kube-system(6f5b9673-e2b5-11e7-a0f5-001e67d35991)" with CreatePodSandboxError: "CreatePodSandbox for pod "kube-dns-6f4fd4bdf-xrj4w_kube-system(6f5b9673-e2b5-11e7-a0f5-001e67d35991)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "kube-dns-6f4fd4bdf-xrj4w_kube-system" network: failed to allocate for range 0: no IP addresses available in range set: 10.244.0.1-10.244.0.254"
The cluster was started many times and the virtual IPs were used up. Clean up as follows:
kubeadm reset
rm -rf /var/lib/cni/flannel/*
rm -rf /var/lib/cni/networks/cbr0/*
ip link delete cni0
ip link delete flannel.1
Reboot! After many rounds of kubeadm reset, some network state may be left behind; a reboot clears it.
https://github.com/kubernetes/kubernetes/issues/57280
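Both problems are network related: a broken virtual network keeps the services from starting normally. The cause is running kubeadm reset and restarting flannel (or another network plugin) many times. reset may not clean up completely, so after several resets problems such as IP exhaustion appear. The fix is to run reset first, then delete the directories and interfaces listed above and reboot the machine (the reboot may not be necessary). Usually this is only needed on the machine that reports the error, but it can also be done on every machine. Then initialize the k8s cluster again.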
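The cleanup sequence can be collected into one function. This is only a sketch of the steps from the text: the names run and clean_cni_state are made up, the paths are the flannel defaults seen above, and by default it only prints the commands (set DRY_RUN=0 to really execute them as root).

```shell
#!/bin/sh
# Hypothetical helper: prints the command when DRY_RUN is unset or 1,
# executes it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: reset kubeadm state and remove leftover CNI/flannel
# state on the machine that reports the error.
clean_cni_state() {
    run kubeadm reset
    run rm -rf /var/lib/cni/flannel /var/lib/cni/networks/cbr0
    run ip link delete cni0
    run ip link delete flannel.1
}

# Usage:
#   clean_cni_state             # show what would be done
#   DRY_RUN=0 clean_cni_state   # actually do it, then reboot if needed
```

Defaulting to a dry run is a deliberate choice here, since the commands delete state and network interfaces.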
pod ContainerCreating
Listing the pods shows one that cannot start:
default po/httpd-68f9d7648d-5f9gt 0/1 ContainerCreating 0 1m tensorflow0
describe reports that sandbox creation failed:
Warning FailedCreatePodSandBox 20s (x12 over 54s) kubelet, tensorflow0 Failed create pod sandbox.
Normal SandboxChanged 20s (x12 over 53s) kubelet, tensorflow0 Pod sandbox changed, it will be killed and re-created.
Check the kubelet status on the machine where the pod cannot start.
It shows:
Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24
which has the same cause as:
Error while adding to cni network: failed to allocate for range 0: no IP addresses available in range set: 10.244.2.1-10.244.2.254
[root@tensorflow0 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Thu 2018-03-22 14:49:29 CST; 4min 12s ago
Docs: http://kubernetes.io/docs/
Main PID: 3873 (kubelet)
Memory: 45.0M
CGroup: /system.slice/kubelet.service
├─ 3873 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --feature-gates=DevicePlugins=true --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --network-plugin=cni -...
├─11665 /opt/cni/bin/flannel
└─11670 /opt/cni/bin/bridge
Mar 22 14:53:35 tensorflow0 kubelet[3873]: E0322 14:53:35.990200 3873 kuberuntime_manager.go:647] createPodSandbox for pod "httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to ...
Mar 22 14:53:35 tensorflow0 kubelet[3873]: E0322 14:53:35.990287 3873 pod_workers.go:186] Error syncing pod 39f66066-2d9d-11e8-bf17-98eecb73f4db ("httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)"), skipping: failed to "Cre...bf17-98eecb73f4db)"
Mar 22 14:53:37 tensorflow0 kubelet[3873]: W0322 14:53:37.041536 3873 pod_container_deletor.go:77] Container "73c43b8766686c64d31bdd0533604d1d349ebe08f95d7463d23ebdffe377113e" not found in pod's containers
Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.621047 3873 cni.go:259] Error adding network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24
Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.621083 3873 cni.go:227] Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24
Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809286 3873 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "httpd-68f9d7648d-5f9gt_default" net...t from 10.244.2.1/24
Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809337 3873 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to s...
Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809360 3873 kuberuntime_manager.go:647] createPodSandbox for pod "httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to ...
Mar 22 14:53:39 tensorflow0 kubelet[3873]: E0322 14:53:39.809424 3873 pod_workers.go:186] Error syncing pod 39f66066-2d9d-11e8-bf17-98eecb73f4db ("httpd-68f9d7648d-5f9gt_default(39f66066-2d9d-11e8-bf17-98eecb73f4db)"), skipping: failed to "Cre...bf17-98eecb73f4db)"
Mar 22 14:53:40 tensorflow0 kubelet[3873]: W0322 14:53:40.063548 3873 pod_container_deletor.go:77] Container "f1b063e5245c7a5c8527d1426858781c6554bcb06d987c7f472cfd0c41290110" not found in pod's containers
Hint: Some lines were ellipsized, use -l to show in full.
Fix:
Remove cni-flannel, stop the cluster, and clean up the environment.
rm -rf /var/lib/cni/flannel/ && rm -rf /var/lib/cni/networks/cbr0/ && ip link delete cni0
rm -rf /var/lib/cni/networks/cni0/*
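Cleaning up only the machine that reported the error is enough.
Joining nodes
A node may join without any error yet not show up on the master. This happens when kubelet failed to start on the node; the cgroup-driver has to be changed there too.
Restart kubelet
Run kubeadm join xxx again
Error: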
[preflight] Running pre-flight checks.
[preflight] Some fatal errors occurred:
Deleting the files that already exist fixes it.
Run kubeadm reset before kubeadm join.
Removing nodes
On the master, run kubectl delete node {nodename}
e.g.:
kubectl delete node tensorflow0
On the node, run kubeadm reset
once the master has performed the delete above.
k8s + gpu
https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes
Note that default-runtime must be set to nvidia in the docker daemon configuration.
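The setting goes into docker's daemon.json. The sketch below writes the configuration described in the NVIDIA README linked above; the function name is made up, the runtime path assumes nvidia-docker2 installed it at /usr/bin/nvidia-container-runtime, and the function overwrites the target file, so merge by hand if a daemon.json with other settings already exists.

```shell
#!/bin/sh
# Hypothetical helper: writes a docker daemon.json to the path $1 that
# makes nvidia the default runtime. Overwrites the file.
write_nvidia_daemon_json() {
    cat > "$1" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Usage (as root):
#   write_nvidia_daemon_json /etc/docker/daemon.json && systemctl restart docker
```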
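Containers are not started on the virtual network
The virtual subnet was set to 10.244.0.0/16.
Containers should start on the virtual subnet, one IP per container, but in environment 2 that is not the case. Without it a container's IP cannot be specified reliably, and distributed tf jobs cannot run.
The solution is the same as for "kube-dns fails to start".
Enabling the kernel settings below is recommended. I saw no problems without them, but occasionally the value inexplicably went back to 0 and errors appeared, so it is better to configure them explicitly.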
echo 'net.bridge.bridge-nf-call-iptables=1' >> /etc/sysctl.conf
sysctl -p
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
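Appending with echo >> adds a duplicate line every time it is repeated. The sketch below replaces an existing assignment instead; the function name is made up, and the key is used as a grep pattern, so the dots in it match any character.

```shell
#!/bin/sh
# Hypothetical helper: sets "key = value" in the sysctl config file $3,
# replacing any existing assignment of the same key.
#   $1 = key, $2 = value, $3 = file
set_sysctl_conf() {
    key=$1
    value=$2
    file=$3
    touch "$file"
    # Keep every line except existing assignments of this key.
    grep -v "^[[:space:]]*$key[[:space:]]*=" "$file" > "$file.tmp" || true
    printf '%s = %s\n' "$key" "$value" >> "$file.tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}

# Usage (as root):
#   set_sysctl_conf net.bridge.bridge-nf-call-iptables 1 /etc/sysctl.conf
#   sysctl -p
```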
Cannot connect to the cluster
[root@tensorflow1 influxdb]# kubectl get all -o wide -n kube-system
error: {batch cronjobs} matches multiple kinds [batch/v1beta1, Kind=CronJob batch/v2alpha1, Kind=CronJob]
The cause was that the k8s settings in ~/.bash_profile had been lost.
nvidia-device-plugin-daemonset fails to start
Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:337: running prestart hook 1 caused "error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=12545 /data1/docker/overlay/10be1d599f91da020b7bfced8058533bb6129b637871ea61e0547ecb8758b3a2/merged]\n Error in `/usr/bin/nvidia-container-cli': double free or corruption (!prev): 0x000055c6961daa10 \n======= Backtrace: =========\n/lib64/libc.so.6(+0x7c619)[0x7f5aa0af0619]\n/usr/lib64/nvidia/libcuda.so.1(+0x2edd7c)[0x7f5a9fb77d7c]\n/usr/lib64/nvidia/libcuda.so.1(+0x2eddc3)[0x7f5a9fb77dc3]\n/usr/lib64/nvidia/libcuda.so.1
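The gpu turned out to be already in use. After clearing whatever was using it, the daemonset started without problems.
Reposted from CSDN: 安装k8s 1.9.0 实践:问题集锦 (Installing k8s 1.9.0 in practice: a collection of problems)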