Debugging OOM/Pod Restarting Issues in Kubernetes

Recently, a customer reported an issue where a pod kept restarting after migrating from node to node, and the Java process inside the pod exited abnormally.
After some troubleshooting, we identified the root cause: an OOM kill triggered by the LimitRange of the namespace; the kernel killed the newly created process as soon as the JVM's memory usage exceeded the default limit. In this article, I will walk through the troubleshooting method step by step; it should serve as a common problem-determination (PD) approach for most application OOM/pod-restarting issues in Kubernetes.

Failed to start the Java/Jetty application

kubectl logs console-54dc5566b4-nq6gs -n test

The Jetty process failed to start and left no application log behind:

Starting Jetty: -Djava.io.tmpdir=/tmp/jetty
start-stop-daemon -S -p/var/run/jetty.pid -croot -d/var/lib/jetty -b -m -a /usr/bin/java -- -Djava.io.tmpdir=/tmp/jetty -Xms512m -Xmx1g -Djetty.logs=/usr/local/jetty/logs -Dspring.profiles.active=StandaloneLogin,Agility,NotAliCloud,NotPublicCloud -Djetty.home=/usr/local/jetty -Djetty.base=/var/lib/jetty -Djava.io.tmpdir=/tmp/jetty -jar /usr/local/jetty/start.jar jetty.state=/var/lib/jetty/jetty.state jetty-started.xml start-log-file=/usr/local/jetty/logs/start.log
jetty.state=/var/lib/jetty/jetty.state
FAILED Thu Mar 14 09:43:55 UTC 2019
tail: cannot open '/var/lib/jetty/logs/*.log' for reading: No such file or directory
tail: no files remaining
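
Because the container is killed and re-created, it can also help to look at the logs of the previous container instance. A minimal example, reusing the pod name from above (the previous instance may simply show the same failed startup):

kubectl logs console-54dc5566b4-nq6gs -n test --previous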

The console pod keeps being re-created on the Kubernetes cluster

kubectl get events --all-namespaces | grep console

test      1m        1m        1         console-54dc5566b4-sx2r6.158bc8b1f2a076ce   Pod       spec.containers{console}   Normal    Killing   kubelet, k8s003   Killing container with id docker://console:Container failed liveness probe.. Container will be killed and recreated.
test      1m        6m        2         console-54dc5566b4-hx6wb.158bc86c4379c4e7   Pod       spec.containers{console}   Normal    Started   kubelet, k8s001   Started container
test      1m        6m        2         console-54dc5566b4-hx6wb.158bc86c355ab395   Pod       spec.containers{console}   Normal    Created   kubelet, k8s001   Created container
test      1m        6m        2         console-54dc5566b4-hx6wb.158bc86c2fe32c76   Pod       spec.containers{console}   Normal    Pulled    kubelet, k8s001   Container image "registry.cn-hangzhou.aliyuncs.com/eric-dev/console:0.9-62f837e" already present on machine
test      1m        1m        1         console-54dc5566b4-hx6wb.158bc8b87083e752   Pod       spec.containers{console}   Normal    Killing   kubelet, k8s001   Killing container with id docker://console:Container failed liveness probe.. Container will be killed and recreated.
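
To narrow the event stream down to a single pod, a field selector on the involved object can be used. A minimal sketch, reusing one of the pod names above:

kubectl get events -n test --field-selector involvedObject.name=console-54dc5566b4-hx6wb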

Determine an OOM kill from the pod state

kubectl get pod console-54dc5566b4-hx6wb -n test -o yaml | grep reason -C5

lastState:
      terminated:
        containerID: docker://90e5c9e618f3e745ebf510b8f215da3a165e3d03be58e0369e27c1773e75ef70
        exitCode: 137
        finishedAt: 2019-03-14T09:29:51Z
        reason: OOMKilled
        startedAt: 2019-03-14T09:24:51Z
    name: console
    ready: true
    restartCount: 3
    state:

kubectl get pod console-54dc5566b4-hx6wb -n test -o jsonpath='{.status.containerStatuses[].lastState}'

map[terminated:map[exitCode:137 reason:OOMKilled startedAt:2019-03-14T09:24:51Z finishedAt:2019-03-14T09:29:51Z containerID:docker://90e5c9e618f3e745ebf510b8f215da3a165e3d03be58e0369e27c1773e75ef70]]
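
Exit code 137 is 128 + 9, i.e. the process was terminated by SIGKILL, which together with reason OOMKilled points at the kernel OOM killer. The exit code alone can also be pulled out with a more specific jsonpath (a small sketch using the same pod):

kubectl get pod console-54dc5566b4-hx6wb -n test -o jsonpath='{.status.containerStatuses[].lastState.terminated.exitCode}'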

Detect the OOM through the system log to validate the assumption

The following kernel messages indicate a Java OOM kill caused by the cgroup memory limit:

# grep oom /var/log/messages

/var/log/messages:2019-03-14T09:15:17.541049+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949064] java invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=968
/var/log/messages:2019-03-14T09:15:17.541117+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949153]  [<ffffffff81191de4>] oom_kill_process+0x214/0x3f0
/var/log/messages:2019-03-14T09:15:17.541119+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949171]  [<ffffffff811f9481>] mem_cgroup_oom_synchronize+0x2f1/0x310
/var/log/messages:2019-03-14T09:15:17.541147+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.950571] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name

# grep oom /var/log/warn

2019-03-14T09:15:17.541049+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949064] java invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=968
2019-03-14T09:15:17.541117+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949153]  [<ffffffff81191de4>] oom_kill_process+0x214/0x3f0
2019-03-14T09:15:17.541119+00:00 iZbp185dy2o3o6lnlo4f07Z kernel: [8040341.949171]  [<ffffffff811f9481>] mem_cgroup_oom_synchronize+0x2f1/0x310
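
The mem_cgroup_oom_synchronize frame confirms this is a cgroup (memory limit) OOM rather than the whole node running out of memory. The effective limit can be checked from inside the container; a sketch assuming cgroup v1 (on cgroup v2 the file is /sys/fs/cgroup/memory.max), where 536870912 bytes is exactly 512Mi:

kubectl exec console-54dc5566b4-hx6wb -n test -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes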

Root cause:

kubectl get pod console-54dc5566b4-hx6wb -n test -o yaml | grep limits -A4

limits:
  memory: 512Mi
requests:
  memory: 256Mi

In this case, the console pod inherited these limits from the default limits of the test namespace. Note that the 512Mi container limit is smaller than the JVM maximum heap of 1g (-Xmx1g), so the process was bound to exceed its cgroup limit.

kubectl describe pod console-54dc5566b4-hx6wb -n test | grep limit

Annotations:        kubernetes.io/limit-ranger=LimitRanger plugin set: memory request for container console; memory limit for container console

kubectl get limitrange -n test

NAME              CREATED AT
mem-limit-range   2019-03-14T09:04:10Z

kubectl describe ns test

Name:         test
Labels:       <none>
Annotations:  <none>
Status:       Active

No resource quota.

Resource Limits
 Type       Resource  Min  Max  Default Request  Default Limit  Max Limit/Request Ratio
 ----       --------  ---  ---  ---------------  -------------  -----------------------
 Container  memory    -    -    256Mi            512Mi          -
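
For reference, a LimitRange producing the defaults shown above would look roughly like the sketch below; the values are reconstructed from the describe output, and the actual manifest in the cluster may differ:

apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: test
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 256Mi    # applied as the container's memory request
    default:
      memory: 512Mi    # applied as the container's memory limit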

Action to fix the OOM issue

After removing the LimitRange and re-creating the pod, the application became healthy:

kubectl delete limitrange mem-limit-range -n test
kubectl delete pod console-54dc5566b4-hx6wb -n test
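
Alternatively, instead of deleting the LimitRange, the deployment can declare explicit resources large enough for the JVM. The block below is only a sketch: the values are an assumption sized for -Xms512m/-Xmx1g plus non-heap overhead, not a setting taken from the actual deployment:

resources:
  requests:
    memory: 1Gi        # at least the initial heap plus some headroom
  limits:
    memory: 1536Mi     # above -Xmx1g to leave room for metaspace, threads, etc.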