1. Where the name K8S comes from
K8S is an abbreviation of Kubernetes: the 8 replaces the eight characters "ubernete" between the leading K and the trailing s.
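The abbreviation rule can be illustrated with a small shell function. This is only a sketch; the function name numeronym is made up for illustration and is not part of any tool used later.

```shell
# numeronym WORD: keep the first and last characters and replace the
# characters in between with their count.
numeronym() {
  word=$1
  len=${#word}
  # Words of one or two characters have nothing to abbreviate.
  if [ "$len" -le 2 ]; then
    printf '%s\n' "$word"
    return 0
  fi
  first=$(printf '%s' "$word" | cut -c1)
  last=$(printf '%s' "$word" | cut -c"$len")
  printf '%s%d%s\n' "$first" "$((len - 2))" "$last"
}

numeronym kubernetes   # prints k8s
```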
2. K8S single-node deployment in practice
Environment:
- ubuntu 16.04
- GPU driver 418.56
- docker 18.06
- k8s 1.13.5
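I. Set up the environment
First back up the apt source configuration: cp /etc/apt/sources.list /etc/apt/sources.list.cp
Edit it with vim /etc/apt/sources.list and replace the contents with the Aliyun mirror: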
deb-src http://archive.ubuntu.com/ubuntu xenial main restricted
deb http://mirrors.aliyun.com/ubuntu/ xenial main restricted
deb-src http://mirrors.aliyun.com/ubuntu/ xenial main restricted multiverse universe
deb http://mirrors.aliyun.com/ubuntu/ xenial-updates main restricted
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-updates main restricted multiverse universe
deb http://mirrors.aliyun.com/ubuntu/ xenial universe
deb http://mirrors.aliyun.com/ubuntu/ xenial-updates universe
deb http://mirrors.aliyun.com/ubuntu/ xenial multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-updates multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-backports main restricted universe multiverse
deb http://archive.canonical.com/ubuntu xenial partner
deb-src http://archive.canonical.com/ubuntu xenial partner
deb http://mirrors.aliyun.com/ubuntu/ xenial-security main restricted
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-security main restricted multiverse universe
deb http://mirrors.aliyun.com/ubuntu/ xenial-security universe
deb http://mirrors.aliyun.com/ubuntu/ xenial-security multiverse
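Update the package index: apt-get update
Automatically repair broken packages: apt --fix-broken install
Upgrade packages (this can be skipped on GPU machines, because it may upgrade the GPU driver and cause problems): apt-get upgrade
Disable the firewall: ufw disable
Install the SELinux utilities: apt install selinux-utils
SELinux configuration:
setenforce 0
vim /etc/selinux/config
SELINUX=disabled
Configure networking: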
tee /etc/sysctl.d/k8s.conf <<-'EOF'
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
modprobe br_netfilter
Apply the settings and check that the IPv4 and IPv6 configuration has taken effect: sysctl --system
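To confirm the three kernel settings one by one rather than reading through the sysctl --system output, a check along the following lines could be used. It is a sketch: check_sysctl is a hypothetical helper name, and the optional third argument (the directory to read from, /proc/sys by default) exists only so the function can be tried against a scratch directory.

```shell
# check_sysctl KEY WANT [ROOT]: compare a kernel parameter with the
# expected value by reading ROOT/<key with dots as slashes>.
check_sysctl() {
  key=$1
  want=$2
  root=${3:-/proc/sys}
  path=$root/$(printf '%s' "$key" | tr . /)
  if [ ! -r "$path" ]; then
    # Typically means the br_netfilter module is not loaded.
    echo "MISSING $key"
    return 1
  fi
  got=$(cat "$path")
  if [ "$got" = "$want" ]; then
    echo "OK $key = $got"
  else
    echo "WRONG $key = $got (want $want)"
    return 1
  fi
}

# Usage on the real host:
#   check_sysctl net.bridge.bridge-nf-call-ip6tables 1
#   check_sysctl net.bridge.bridge-nf-call-iptables 1
#   check_sysctl net.ipv4.ip_forward 1
```

A MISSING result for the two bridge keys usually means modprobe br_netfilter has not been run yet.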
Configure iptables:
iptables -P FORWARD ACCEPT
vim /etc/rc.local
/usr/sbin/iptables -P FORWARD ACCEPT
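Permanently disable the swap partition (this comments out the swap entries in /etc/fstab and takes effect on the next boot; swapoff -a turns swap off for the current session): sed -i 's/.*swap.*/#&/' /etc/fstab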
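The sed pattern above comments out every line that contains the word swap, including lines that are already comments. A narrower variant is sketched below; comment_out_swap is a hypothetical helper name, and it assumes the usual fstab layout in which the third whitespace-separated field is the filesystem type.

```shell
# comment_out_swap FILE: prefix "#" to every active fstab entry whose
# filesystem type (third field) is "swap"; leave other lines untouched.
comment_out_swap() {
  fstab=$1
  tmp=$(mktemp) || return 1
  awk '!/^[ \t]*#/ && $3 == "swap" { print "#" $0; next } { print }' \
    "$fstab" > "$tmp" && cat "$tmp" > "$fstab"
  status=$?
  rm -f "$tmp"
  return $status
}

# Usage (as root), after backing up the file:
#   cp /etc/fstab /etc/fstab.bak
#   comment_out_swap /etc/fstab
#   swapoff -a
```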
II. Install Docker
Run the following commands:
apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
apt-get purge docker-ce docker docker-engine docker.io && rm -rf /var/lib/docker
apt-get autoremove docker-ce docker docker-engine docker.io
apt-get install -y docker-ce=18.06.3~ce~3-0~ubuntu
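The exact version string passed to apt-get install has to match what the repository offers. One way to look it up is to filter the output of apt-cache madison docker-ce, as sketched here. pick_version is a hypothetical helper name, and the sketch assumes the usual "package | version | source" column layout of the madison output.

```shell
# pick_version PREFIX: read `apt-cache madison <pkg>` output on stdin and
# print the first version string that starts with PREFIX.
pick_version() {
  prefix=$1
  awk -F'|' -v p="$prefix" '
    { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v) }   # trim the version column
    index(v, p) == 1 { print v; exit }
  '
}

# Usage:
#   version=$(apt-cache madison docker-ce | pick_version 18.06)
#   apt-get install -y docker-ce="$version"
```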
Start Docker and enable it at boot: systemctl enable docker && systemctl start docker
Docker configuration:
vim /etc/docker/daemon.json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "100m",
"max-file": "10"
},
"insecure-registries": ["http://k8s.gcr.io"],
"data-root": "",
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
The above is the configuration for a machine with a GPU. The configuration for a machine without a GPU:
{
"registry-mirrors":[
"https://registry.docker-cn.com"
],
"storage-driver":"overlay2",
"log-driver":"json-file",
"log-opts":{
"max-size":"100m"
},
"exec-opts":[
"native.cgroupdriver=systemd"
],
"insecure-registries":["http://k8s.gcr.io"],
"live-restore":true
}
Reload systemd and restart Docker so that the new configuration takes effect:
systemctl daemon-reload && systemctl restart docker && docker info
III. Install K8S
Setup before pulling the images:
apt-get update && apt-get install -y apt-transport-https curl
curl -s https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | apt-key add -
tee /etc/apt/sources.list.d/kubernetes.list <<-'EOF'
deb https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial main
EOF
Update the package index and install:
apt-get update
apt-get purge kubelet=1.13.5-00 kubeadm=1.13.5-00 kubectl=1.13.5-00
apt-get autoremove kubelet=1.13.5-00 kubeadm=1.13.5-00 kubectl=1.13.5-00
apt-get install -y kubelet=1.13.5-00 kubeadm=1.13.5-00 kubectl=1.13.5-00
apt-mark hold kubelet kubeadm kubectl
Enable the service at boot and start it:
systemctl enable kubelet && sudo systemctl start kubelet
Pull the K8S images. Because gcr.io is not reachable from this network, download them from the registry.cn-hangzhou.aliyuncs.com mirror instead:
docker pull registry.cn-hangzhou.aliyuncs.com/gg-gcr-io/kube-apiserver:v1.13.5
docker pull registry.cn-hangzhou.aliyuncs.com/gg-gcr-io/kube-controller-manager:v1.13.5
docker pull registry.cn-hangzhou.aliyuncs.com/gg-gcr-io/kube-scheduler:v1.13.5
docker pull registry.cn-hangzhou.aliyuncs.com/gg-gcr-io/kube-proxy:v1.13.5
docker pull registry.cn-hangzhou.aliyuncs.com/kuberimages/pause:3.1
docker pull registry.cn-hangzhou.aliyuncs.com/kuberimages/etcd:3.2.24
docker pull registry.cn-hangzhou.aliyuncs.com/kuberimages/coredns:1.2.6
Tag the images:
docker tag registry.cn-hangzhou.aliyuncs.com/gg-gcr-io/kube-apiserver:v1.13.5 k8s.gcr.io/kube-apiserver:v1.13.5
docker tag registry.cn-hangzhou.aliyuncs.com/gg-gcr-io/kube-controller-manager:v1.13.5 k8s.gcr.io/kube-controller-manager:v1.13.5
docker tag registry.cn-hangzhou.aliyuncs.com/gg-gcr-io/kube-scheduler:v1.13.5 k8s.gcr.io/kube-scheduler:v1.13.5
docker tag registry.cn-hangzhou.aliyuncs.com/gg-gcr-io/kube-proxy:v1.13.5 k8s.gcr.io/kube-proxy:v1.13.5
docker tag registry.cn-hangzhou.aliyuncs.com/kuberimages/pause:3.1 k8s.gcr.io/pause:3.1
docker tag registry.cn-hangzhou.aliyuncs.com/kuberimages/etcd:3.2.24 k8s.gcr.io/etcd:3.2.24
docker tag registry.cn-hangzhou.aliyuncs.com/kuberimages/coredns:1.2.6 k8s.gcr.io/coredns:1.2.6
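The fourteen pull and tag commands above follow one pattern, so they can be driven from a loop. The following is a sketch using the same image names and versions as listed above; the function names and the DOCKER variable (which lets the commands be printed with DOCKER=echo instead of executed) are assumptions made for illustration.

```shell
MIRROR=registry.cn-hangzhou.aliyuncs.com

# pull_and_tag SRC DST: pull SRC from the mirror and tag it as DST.
pull_and_tag() {
  ${DOCKER:-docker} pull "$1" && ${DOCKER:-docker} tag "$1" "$2"
}

# mirror_all: pull every image kubeadm v1.13.5 expects and re-tag it
# under k8s.gcr.io; stop at the first failure.
mirror_all() {
  for name in kube-apiserver kube-controller-manager kube-scheduler kube-proxy; do
    pull_and_tag "$MIRROR/gg-gcr-io/$name:v1.13.5" "k8s.gcr.io/$name:v1.13.5" || return 1
  done
  for image in pause:3.1 etcd:3.2.24 coredns:1.2.6; do
    pull_and_tag "$MIRROR/kuberimages/$image" "k8s.gcr.io/$image" || return 1
  done
}

# Dry run (prints the commands):  DOCKER=echo mirror_all
# Real run:                       mirror_all
```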
IV. Initialize with kubeadm
Initialize K8S with kubeadm, substituting your own host IP:
kubeadm init --kubernetes-version=v1.13.5 --pod-network-cidr=10.244.0.0/16 --service-cidr=10.16.0.0/16 --apiserver-advertise-address=${masterIp} | tee kubeadm-init.log
If the host IP is not known in advance, you can instead initialize from a YAML file that refers to the API server by hostname:
vi /etc/hosts
10.10.5.100 k8s.api.server
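When this step is scripted rather than done in an editor, the hosts entry should be added only once. A possible sketch follows; ensure_hosts_entry is a hypothetical helper name, and the optional third argument is there so the function can be tried on a copy of the file first.

```shell
# ensure_hosts_entry IP NAME [FILE]: append "IP NAME" to FILE unless an
# active (non-comment) line already lists NAME as a hostname.
ensure_hosts_entry() {
  ip=$1
  name=$2
  file=${3:-/etc/hosts}
  if awk -v n="$name" '
       !/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
       END { exit !found }
     ' "$file"; then
    return 0   # already present, nothing to do
  fi
  printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Usage (as root):
#   ensure_hosts_entry 10.10.5.100 k8s.api.server
```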
vi kube-init.yaml
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: v1.13.5
imageRepository: registry.aliyuncs.com/google_containers
apiServer:
  certSANs:
  - "k8s.api.server"
controlPlaneEndpoint: "k8s.api.server:6443"
networking:
  serviceSubnet: "10.1.0.0/16"
  podSubnet: "10.244.0.0/16"
HA version:
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: v1.13.5
imageRepository: registry.aliyuncs.com/google_containers
apiServer:
  certSANs:
  - "api.k8s.com"
controlPlaneEndpoint: "api.k8s.com:6443"
etcd:
  external:
    endpoints:
    - https://ETCD_0_IP:2379
    - https://ETCD_1_IP:2379
    - https://ETCD_2_IP:2379
networking:
  serviceSubnet: 10.1.0.0/16
  podSubnet: 10.244.0.0/16
Note: the apiVersion uses the kubeadm API group because the cluster is initialized with kubeadm. Finally, run the following to initialize:
kubeadm init --config=kube-init.yaml
If a problem occurs, fix it, run kubeadm reset, and then run the init again. For more options, run:
kubeadm --help
V. When deployment goes wrong
First remove the node (cluster version):
kubectl drain <node name> --delete-local-data --force --ignore-daemonsets
kubectl delete node <node name>
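To clear the init configuration, run the following on the node being removed (note: whenever init or join fails with any error, this command can be used to roll back):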
kubeadm reset
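VI. Troubleshooting
If problems appear after initialization, first check the container status and the network with the following commands: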
sudo docker ps -a | grep kube | grep -v pause
sudo docker logs CONTAINERID
sudo docker images && systemctl status -l kubelet
netstat -nlpt
kubectl describe ep kubernetes
kubectl describe svc kubernetes
kubectl get svc kubernetes
kubectl get ep
netstat -nlpt | grep apiser
vi /var/log/syslog
VII. Configure K8S API server access credentials (kubeconfig) for the current user
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
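The same three steps can be wrapped in one function, for example when several users on the master need kubectl access. This is a sketch: install_kubeconfig and the SUDO variable are illustrative names, and tightening the file mode to 600 is an addition of this sketch (the kubeconfig contains cluster-admin credentials), not something the steps above do.

```shell
# install_kubeconfig [SRC] [DIR]: copy the admin kubeconfig into DIR/config
# and hand ownership to the calling user. Privileged steps go through
# ${SUDO-sudo}; set SUDO= (empty) when SRC is already readable.
install_kubeconfig() {
  src=${1:-/etc/kubernetes/admin.conf}
  dir=${2:-$HOME/.kube}
  mkdir -p "$dir" &&
    ${SUDO-sudo} cp "$src" "$dir/config" &&
    ${SUDO-sudo} chown "$(id -u):$(id -g)" "$dir/config" &&
    chmod 600 "$dir/config"
}

# Usage on the master:
#   install_kubeconfig
#   kubectl get nodes
```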
VIII. Network plugin
kubectl apply -f https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/rbac-kdd.yaml
wget https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml
vi calico.yaml
- name: CALICO_IPV4POOL_IPIP
  value: "off"
- name: CALICO_IPV4POOL_CIDR
  value: "10.244.0.0/16"
kubectl apply -f calico.yaml
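The two values edited in calico.yaml can also be set without opening an editor. The sketch below assumes that in the downloaded manifest each value: line sits directly under its - name: line, which is worth checking with grep -A1 CALICO_IPV4POOL calico.yaml before relying on it. patch_calico is a hypothetical helper name.

```shell
# patch_calico FILE CIDR: turn IPIP off and set the pool CIDR by rewriting
# the "value:" line that immediately follows each "- name:" line.
patch_calico() {
  file=$1
  cidr=$2
  tmp=$(mktemp) || return 1
  sed -e '/name: CALICO_IPV4POOL_IPIP/{n;s|value:.*|value: "off"|;}' \
      -e "/name: CALICO_IPV4POOL_CIDR/{n;s|value:.*|value: \"$cidr\"|;}" \
      "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return $status
}

# Usage:
#   patch_calico calico.yaml 10.244.0.0/16
#   kubectl apply -f calico.yaml
```

The CIDR passed here should match the --pod-network-cidr (or podSubnet) used at kubeadm init.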
On a single node, allow pods to be scheduled on the master with:
kubectl taint nodes --all node-role.kubernetes.io/master-
To forbid scheduling pods on the master again:
kubectl taint nodes k8s node-role.kubernetes.io/master=true:NoSchedule
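That completes the single-node deployment. If your project ships an all-in-one appliance that combines software and hardware, you are done here. Remember that on a single node you must allow pods to be scheduled on the master!
Next up: the cluster version!
Taking the machine deployed above as the master node, continue with: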
scp /etc/kubernetes/admin.conf $nodeUser@$nodeIp:/home/$nodeUser
scp /etc/kubernetes/pki/etcd/* $nodeUser@$nodeIp:/home/$nodeUser/etcd
kubeadm token generate
kubeadm token create $token_name --print-join-command --ttl=0
kubeadm join $masterIP:6443 --token $token_name --discovery-token-ca-cert-hash $hash
On the node machines, if CUDA is needed, the following resources may help:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu-installation
https://blog.csdn.net/u012235003/article/details/54575758
https://blog.csdn.net/qq_39670011/article/details/90404111
The actual steps:
vim /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
update-initramfs -u
Reboot Ubuntu and check that nouveau has been disabled (the command should print nothing):
lsmod | grep nouveau
apt-get remove --purge 'nvidia*'
Download the CUDA installer from https://developer.nvidia.com/cuda-downloads
sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev
Install CUDA by running the installer:
sh cuda_10.1.168_418.67_linux.run
At the installer prompts:
accept
select "Install" / Enter
select "Yes"
echo 'export PATH=/usr/local/cuda-10.1/bin:$PATH' >> ~/.bashrc
echo 'export PATH=/usr/local/cuda-10.1/NsightCompute-2019.3:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Reboot the machine and check whether CUDA was installed successfully.
Check whether any "nvidia*" devices exist:
cd /dev && ls -al
If there are none, create an nv.sh script:
vi nv.sh
#!/bin/bash
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi
chmod +x nv.sh && bash nv.sh
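The device count that nv.sh derives from lspci can be checked on its own before any device nodes are created. The sketch below repeats the same filtering (NVIDIA lines that are either a 3D controller or a VGA compatible controller) as a function that reads lspci output on stdin; count_nvidia_gpus is a hypothetical name.

```shell
# count_nvidia_gpus: read `lspci` output on stdin and print how many
# NVIDIA "3D controller" or "VGA compatible controller" entries it has.
count_nvidia_gpus() {
  grep -i nvidia | grep -c -E '3D controller|VGA compatible controller'
}

# Usage:
#   lspci | count_nvidia_gpus
```

nv.sh creates /dev/nvidia0 through /dev/nvidia(N-1) for a count of N, plus /dev/nvidiactl.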
Reboot the machine again and check the CUDA version:
nvcc -V
Build and run the samples:
cd /usr/local/cuda-10.1/samples && make
cd /usr/local/cuda-10.1/samples/bin/x86_64/linux/release && ./deviceQuery
If the output contains "Result = PASS", CUDA was installed successfully.
Set up nvidia-docker:
vim /etc/docker/daemon.json
{
"runtimes":{
"nvidia":{
"path":"nvidia-container-runtime",
"runtimeArgs":[]
}
},
"registry-mirrors":["https://registry.docker-cn.com"],
"storage-driver":"overlay2",
"default-runtime":"nvidia",
"log-driver":"json-file",
"log-opts":{
"max-size":"100m"
},
"exec-opts": ["native.cgroupdriver=systemd"],
"insecure-registries": [$harborRgistry],
"live-restore": true
}
Restart Docker:
sudo systemctl daemon-reload && sudo systemctl restart docker && docker info
Check that nvidia-docker was installed successfully:
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
On the node machine, switch to the node user: su $nodeUser
Configure K8S API server access credentials (kubeconfig) for the current node user:
mkdir -p $HOME/.kube
cp -i admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
mkdir -p $HOME/etcd
sudo rm -rf /etc/kubernetes
sudo mkdir -p /etc/kubernetes/pki/etcd
sudo cp /home/$nodeUser/etcd/* /etc/kubernetes/pki/etcd
sudo kubeadm join $masterIP:6443 --token $token_name --discovery-token-ca-cert-hash $hash
For example:
sudo kubeadm join 192.168.8.116:6443 --token vyi4ga.foyxqr2iz9i391q3 --discovery-token-ca-cert-hash sha256:929143bcdaa3e23c6faf20bc51ef6a57df02edf9df86cedf200320a9b4d3220a
Check that the node has joined the master: kubectl get node
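A join that fails because of a mistyped token or hash leaves the node half-configured and needs a kubeadm reset, so it can be worth checking the two values before running kubeadm join. The sketch below checks only their shape: a bootstrap token is six lowercase alphanumeric characters, a dot, and sixteen more, and the CA certificate hash is sha256: followed by 64 hex digits. The function names are made up for illustration, and a well-formed value can still be expired or wrong.

```shell
# valid_token TOKEN: succeed if TOKEN has the kubeadm bootstrap token shape.
valid_token() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# valid_ca_hash HASH: succeed if HASH looks like sha256:<64 hex digits>.
valid_ca_hash() {
  printf '%s\n' "$1" | grep -Eq '^sha256:[0-9a-f]{64}$'
}

# Usage:
#   valid_token "$token_name" && valid_ca_hash "$hash" &&
#     sudo kubeadm join $masterIP:6443 --token "$token_name" \
#       --discovery-token-ca-cert-hash "$hash"
```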