Debugging an OOM/pod-restarting issue in Kubernetes
Recently, a customer reported an issue where a pod kept restarting after being migrated from one node to another, and the Java process inside the pod exited abnormally.
After some troubleshooting, we identified the root cause: an OOM kill triggered by the LimitRange of the namespace. The kernel killed the newly created process as soon as the JVM's memory usage exceeded the default limit. In this article, I will walk through the troubleshooting method step by step; it should work as a common problem-determination (PD) approach for most application OOM/pod-restarting issues in Kubernetes.
Failed to start the Java/Jetty application
kubectl logs console-54dc5566b4-nq6gs -n test
The Jetty process failed to start without producing any application logs:
Starting Jetty: -Djava.io.tmpdir=/tmp/jetty
start-stop-daemon -S -p/var/run/jetty.pid -croot -d/var/lib/jetty -b -m -a /usr/bin/java -- -Djava.io.tmpdir=/tmp/jetty -Xms512m -Xmx1g -Djetty.logs=/usr/local/jetty/logs -Dspring.profiles.active=StandaloneLogin,Agility,NotAliCloud,NotPublicCloud -Djetty.home=/usr/local/jetty -Djetty.base=/var/lib/jetty -Djava.io.tmpdir=/tmp/jetty -jar /usr/local/jetty/start.jar jetty.state=/var/lib/jetty/jetty.state jetty-started.xml start-log-file=/usr/local/jetty/logs/start.log
jetty.state=/var/lib/jetty/jetty.state
FAILED Thu Mar 14 09:43:55 UTC 2019
tail: cannot open '/var/lib/jetty/logs/*.log' for reading: No such file or directory
tail: no files remaining
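If the container has already been restarted, the logs of the previous instance can still be retrieved with kubectl's --previous flag:

kubectl logs console-54dc5566b4-nq6gs -n test --previous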
The console pod keeps being re-created on the Kubernetes cluster
kubectl get events --all-namespaces | grep console
test 1m 1m 1 console-54dc5566b4-sx2r6.158bc8b1f2a076ce Pod spec.containers{console} Normal Killing kubelet, k8s003 Killing container with id docker://console:Container failed liveness probe.. Container will be killed and recreated.
test 1m 6m 2 console-54dc5566b4-hx6wb.158bc86c4379c4e7 Pod spec.containers{console} Normal Started kubelet, k8s001 Started container
test 1m 6m 2 console-54dc5566b4-hx6wb.158bc86c355ab395 Pod spec.containers{console} Normal Created kubelet, k8s001 Created container
test 1m 6m 2 console-54dc5566b4-hx6wb.158bc86c2fe32c76 Pod spec.containers{console} Normal Pulled kubelet, k8s001 Container image "registry.cn-hangzhou.aliyuncs.com/eric-dev/console:0.9-62f837e" already present on machine
test 1m 1m 1 console-54dc5566b4-hx6wb.158bc8b87083e752 Pod spec.containers{console} Normal Killing kubelet, k8s001 Killing container with id docker://console:Container failed liveness probe.. Container will be killed and recreated.
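Instead of grepping the whole event stream, the events can also be narrowed down to a single pod with a field selector (using the pod name from the output above):

kubectl get events -n test --field-selector involvedObject.name=console-54dc5566b4-hx6wb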
Determine an OOM from pod state
kubectl get pod console-54dc5566b4-hx6wb -n test -o yaml | grep reason -C5
    lastState:
      terminated:
        containerID: docker://90e5c9e618f3e745ebf510b8f215da3a165e3d03be58e0369e27c1773e75ef70
        exitCode: 137
        finishedAt: 2019-03-14T09:29:51Z
        reason: OOMKilled
        startedAt: 2019-03-14T09:24:51Z
    name: console
    ready: true
    restartCount: 3
    state:
kubectl get pod console-54dc5566b4-hx6wb -n test -o jsonpath='{.status.containerStatuses[].lastState}'
map[terminated:map[exitCode:137 reason:OOMKilled startedAt:2019-03-14T09:24:51Z finishedAt:2019-03-14T09:29:51Z containerID:docker://90e5c9e618f3e745ebf510b8f215da3a165e3d03be58e0369e27c1773e75ef70]]
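To print only the termination reason (handy when checking many pods or scripting the check), the jsonpath can be narrowed further; with the state above it prints OOMKilled:

kubectl get pod console-54dc5566b4-hx6wb -n test -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'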
Detect the OOM through the system log to validate the assumption
The following kernel messages indicate a Java OOM kill caused by the cgroup memory setting (note mem_cgroup_oom_synchronize in the call trace):
# grep oom /var/log/messages
/var/log/messages:2019-03-14T09:15:17.541049+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949064] java invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=968
/var/log/messages:2019-03-14T09:15:17.541117+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949153] [<ffffffff81191de4>] oom_kill_process+0x214/0x3f0
/var/log/messages:2019-03-14T09:15:17.541119+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949171] [<ffffffff811f9481>] mem_cgroup_oom_synchronize+0x2f1/0x310
/var/log/messages:2019-03-14T09:15:17.541147+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.950571] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
# grep oom /var/log/warn
2019-03-14T09:15:17.541049+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949064] java invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=968
2019-03-14T09:15:17.541117+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949153] [<ffffffff81191de4>] oom_kill_process+0x214/0x3f0
2019-03-14T09:15:17.541119+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949171] [<ffffffff811f9481>] mem_cgroup_oom_synchronize+0x2f1/0x310
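On nodes where /var/log/messages or /var/log/warn is not available, the same kernel records can usually be read from the kernel ring buffer or the journal:

dmesg -T | grep -i oom
journalctl -k | grep -i oom-killer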
Root cause
kubectl get pod console-54dc5566b4-hx6wb -n test -o yaml | grep limits -A4
    limits:
      memory: 512Mi
    requests:
      memory: 256Mi
In this case, the console application pod inherited these limits from the default limits of the test namespace. Note that the container starts the JVM with -Xmx1g, so the heap alone can grow past the 512Mi cgroup limit, at which point the kernel kills the process.
kubectl describe pod console-54dc5566b4-hx6wb -n test | grep limit
Annotations: kubernetes.io/limit-ranger=LimitRanger plugin set: memory request for container console; memory limit for container console
kubectl get limitrange -n test
NAME CREATED AT
mem-limit-range 2019-03-14T09:04:10Z
kubectl describe ns test
Name: test
Labels: <none>
Annotations: <none>
Status: Active
No resource quota.
Resource Limits
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
---- -------- --- --- --------------- ------------- -----------------------
Container memory - - 256Mi 512Mi -
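The LimitRange manifest itself was not captured here, but based on the describe output above it most likely looked similar to this sketch (256Mi default request and 512Mi default limit per container):

apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: test
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 256Mi
    default:
      memory: 512Mi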
Action to fix the OOM issue
After fixing the limits setting and recreating the pod, the application became healthy.
kubectl delete limitrange mem-limit-range -n test
kubectl delete pod console-54dc5566b4-hx6wb -n test
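Deleting the LimitRange removes the namespace-wide default entirely. A more targeted alternative is to declare resources explicitly in the console Deployment so the memory limit leaves headroom above the JVM's -Xmx1g heap plus non-heap overhead; the 1536Mi value below is only an assumption and should be sized to the actual workload:

resources:
  requests:
    memory: 1Gi
  limits:
    memory: 1536Mi

An explicit resources block on the container takes precedence, since the LimitRanger plugin only injects defaults when no request or limit is specified.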