【原创】用户空间死循环导致memcached访问变慢问题排查


有同事发现,某业务访问 memcached 时,时不时查询速度会变慢,正常情况 几十毫秒 ,而慢的时候会达到 几百毫秒 。该同事百思不得其解,以为 memcached 出了什么奇怪的问题。 

拿到问题后,简单思考一些,给出如下问题排查思路: 
  • memcached 是基于 libevent 实现的多线程服务程序,理论上讲能够轻松应对大量连接访问;
  • 业务访问模型比较简单,没有复杂的使用场景,且访问速度变慢是随机发生的,并没有什么规律;
所以我的最终结论,怀疑是否由其他业务的运行情况异常,导致当前业务针对 memcached 的访问变慢。 

按照这个思路,首先 查看 memcached 实时运行情况,可以看到 memcached 启动了 6 个线程在跑,6 个线程分布(运行)在 5 个 CPU 上。 

在如下输出信息中,可以看到有几个 CPU 使用比较高,达到了 50% 以上。 
[root@gfsd1 ~]# top -Hp 30402 

top - 14:17:13 up 76 days, 22:06,  9 users,  load average: 1.72, 2.93, 4.01
Tasks:   6 total,   1 running,   5 sleeping,   0 stopped,   0 zombie
Cpu0  : 30.2%us, 17.3%sy,  0.0%ni, 52.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  6.3%us, 11.2%sy,  0.0%ni, 82.2%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu2  : 15.8%us, 11.7%sy,  0.0%ni, 72.2%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu3  : 14.5%us, 26.2%sy,  0.0%ni, 56.1%id,  0.0%wa,  0.5%hi,  2.8%si,  0.0%st
Cpu4  : 10.9%us, 46.6%sy,  0.0%ni, 40.9%id,  0.5%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu5  : 10.2%us,  6.5%sy,  0.0%ni, 81.8%id,  0.4%wa,  0.4%hi,  0.7%si,  0.0%st
Cpu6  : 32.5%us, 25.3%sy,  0.0%ni, 37.1%id,  4.1%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu7  :  9.8%us, 18.0%sy,  0.0%ni, 71.7%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Mem:  16333656k total, 15608956k used,   724700k free,   247856k buffers
Swap:  2097144k total,    40584k used,  2056560k free, 11275076k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
30402 root      20   0  333m  16m  648 S  0.0  0.1   1:58.13 7 memcached
30404 root      20   0  333m  16m  648 S  0.0  0.1  15:38.50 4 memcached
30405 root      20   0  333m  16m  648 S  0.0  0.1  12:17.44 1 memcached
30406 root      20   0  333m  16m  648 R  0.0  0.1  14:27.60 3 memcached
30408 root      20   0  333m  16m  648 S  0.0  0.1  16:46.25 5 memcached
30409 root      20   0  333m  16m  648 S  0.0  0.1   0:00.00 1 memcached
重新执行 top 观察是否其他进程存在异常,结果发现名为 sus 的进程 CPU 占用达到 100% 。 查看其对应的线程运行情况如下 
[root@gfsd1 ~]# top -Hp 23775

top - 14:17:12 up 76 days, 22:06,  9 users,  load average: 1.72, 2.93, 4.01
Tasks:  19 total,   5 running,  14 sleeping,   0 stopped,   0 zombie
Cpu0  : 21.1%us, 13.0%sy,  0.0%ni, 65.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 13.4%us, 30.8%sy,  0.0%ni, 54.7%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu2  : 10.9%us, 12.3%sy,  0.0%ni, 76.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 18.8%us, 13.1%sy,  0.0%ni, 64.5%id,  0.0%wa,  0.4%hi,  3.3%si,  0.0%st
Cpu4  : 14.5%us, 24.7%sy,  0.0%ni, 59.5%id,  0.4%wa,  0.0%hi,  0.9%si,  0.0%st
Cpu5  : 12.4%us, 12.7%sy,  0.0%ni, 73.9%id,  0.4%wa,  0.4%hi,  0.4%si,  0.0%st
Cpu6  :  7.8%us, 14.1%sy,  0.0%ni, 74.5%id,  3.1%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu7  : 63.6%us, 16.5%sy,  0.0%ni, 18.8%id,  0.0%wa,  0.0%hi,  1.1%si,  0.0%st
Mem:  16333656k total, 15610728k used,   722928k free,   247856k buffers
Swap:  2097144k total,    40584k used,  2056560k free, 11275044k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
23805 root      20   0  209m  29m 2812 R 92.6  0.2   5431:48 0 CenterCtrl
23808 root      20   0  209m  29m 2812 R  1.3  0.2  30:49.91 5 DBPoolTask_1
23800 root      20   0  209m  29m 2812 R  0.3  0.2  31:08.73 1 OspTimerTask
23914 root      20   0  209m  29m 2812 S  0.3  0.2  30:09.36 2 DBPoolTask_3
23922 root      20   0  209m  29m 2812 S  0.3  0.2  30:35.01 2 DBPoolTask_4
23925 root      20   0  209m  29m 2812 S  0.3  0.2  30:44.61 2 DBPoolTask_5
23929 root      20   0  209m  29m 2812 S  0.3  0.2  30:50.71 2 DBPoolTask_6
23775 root      20   0  209m  29m 2812 R  0.0  0.2   0:18.83 7 sus
23798 root      20   0  209m  29m 2812 S  0.0  0.2   0:25.06 5 OspLogTask
23799 root      20   0  209m  29m 2812 S  0.0  0.2   0:08.22 2 OspDispatchTask
23801 root      20   0  209m  29m 2812 S  0.0  0.2   1:22.25 3 OspTeleDaemon
23802 root      20   0  209m  29m 2812 S  0.0  0.2   0:16.50 2 OspPostDaemon
23803 root      20   0  209m  29m 2812 S  0.0  0.2   2:17.48 0 OspNodeMan
23806 root      20   0  209m  29m 2812 S  0.0  0.2   0:00.00 7 ConMgrSsn
23812 root      20   0  209m  29m 2812 S  0.0  0.2   0:00.01 3 DevSsn
23813 root      20   0  209m  29m 2812 S  0.0  0.2   0:31.29 3 ConnectSA
23814 root      20   0  209m  29m 2812 S  0.0  0.2   0:00.00 6 FileSend
23892 root      20   0  209m  29m 2812 R  0.0  0.2  31:07.47 1 DBPoolTask_2
28070 root      20   0  209m  29m 2812 S  0.0  0.2   0:03.71 6 ConnectADPLib
另外一次 top 输出可以清楚看到其中一个 CPU 被跑满。 
top - 14:38:50 up 76 days, 22:28,  9 users,  load average: 1.31, 2.03, 2.61
Tasks:  19 total,   6 running,  13 sleeping,   0 stopped,   0 zombie
Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  1.6%us,  1.3%sy,  0.0%ni, 97.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  3.0%us,  2.4%sy,  0.0%ni, 94.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  6.5%us,  3.1%sy,  0.0%ni, 90.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  :  3.3%us,  9.1%sy,  0.0%ni, 87.3%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu5  :  2.4%us,  2.8%sy,  0.0%ni, 94.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu6  :  1.1%us,  1.8%sy,  0.0%ni, 97.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  3.4%us,  1.7%sy,  0.0%ni, 94.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16333656k total, 15647840k used,   685816k free,   247856k buffers
Swap:  2097144k total,    40584k used,  2056560k free, 11280028k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
23805 root      20   0  210m  29m 2812 R 99.9  0.2   5453:08 0 CenterCtrl
23925 root      20   0  210m  29m 2812 S  0.3  0.2  30:51.16 4 DBPoolTask_5
23775 root      20   0  210m  29m 2812 S  0.0  0.2   0:19.17 1 sus
23798 root      20   0  210m  29m 2812 S  0.0  0.2   0:25.09 5 OspLogTask
23799 root      20   0  210m  29m 2812 S  0.0  0.2   0:08.39 3 OspDispatchTask
23800 root      20   0  210m  29m 2812 R  0.0  0.2  31:14.61 7 OspTimerTask
23801 root      20   0  210m  29m 2812 S  0.0  0.2   1:22.50 1 OspTeleDaemon
23802 root      20   0  210m  29m 2812 S  0.0  0.2   0:16.80 3 OspPostDaemon
23803 root      20   0  210m  29m 2812 S  0.0  0.2   2:18.64 2 OspNodeMan
23806 root      20   0  210m  29m 2812 S  0.0  0.2   0:00.00 7 ConMgrSsn
23808 root      20   0  210m  29m 2812 R  0.0  0.2  30:56.07 6 DBPoolTask_1
23812 root      20   0  210m  29m 2812 S  0.0  0.2   0:00.01 3 DevSsn
23813 root      20   0  210m  29m 2812 S  0.0  0.2   0:31.29 3 ConnectSA
23814 root      20   0  210m  29m 2812 S  0.0  0.2   0:00.00 6 FileSend
23892 root      20   0  210m  29m 2812 R  0.0  0.2  31:13.82 7 DBPoolTask_2
23914 root      20   0  210m  29m 2812 R  0.0  0.2  30:15.19 2 DBPoolTask_3
23922 root      20   0  210m  29m 2812 R  0.0  0.2  30:41.05 2 DBPoolTask_4
23929 root      20   0  210m  29m 2812 S  0.0  0.2  30:56.35 2 DBPoolTask_6
28070 root      20   0  210m  29m 2812 S  0.0  0.2   0:03.74 5 ConnectADPLib
查看占用 CPU 100% 的线程的运行状态,发现一直处于析构函数的执行中。 
[root@gfsd1 ~]# pstack 23805
Thread 1 (process 23805):
#0  0x08081d75 in CMqMsgParse::~CMqMsgParse() ()
#1  0x0804dc49 in CCenterCtrlInst::DaemonProcPowerOn(CMessage*) ()
#2  0x00141607 in OspAppEntry () from /opt/mcu/sodir/libosp.so
#3  0x001569c6 in OspTaskTemplateFunc(void*) () from /opt/mcu/sodir/libosp.so
#4  0x00d26b39 in start_thread () from /lib/libpthread.so.0
#5  0x00e2ed6e in clone () from /lib/libc.so.6
[root@gfsd1 ~]# pstack 23805
Thread 1 (process 23805):
#0  0x08081d75 in CMqMsgParse::~CMqMsgParse() ()
#1  0x0804dc49 in CCenterCtrlInst::DaemonProcPowerOn(CMessage*) ()
#2  0x00141607 in OspAppEntry () from /opt/mcu/sodir/libosp.so
#3  0x001569c6 in OspTaskTemplateFunc(void*) () from /opt/mcu/sodir/libosp.so
#4  0x00d26b39 in start_thread () from /lib/libpthread.so.0
#5  0x00e2ed6e in clone () from /lib/libc.so.6
[root@gfsd1 ~]#
观察线程 23805 的运行时状态信息,可以看到 nonvoluntary_ctxt_switches 的值一直在快速增长,对比 strace -p 23805 没有任何输出,进一步说明该线程处于用户态死循环中。 
[root@gfsd1 ~]# cd /proc/23775/task/
[root@gfsd1 task]# cd 23805
[root@gfsd1 23805]# cat status |grep "ctx"
voluntary_ctxt_switches:        22
nonvoluntary_ctxt_switches:     4145539
[root@gfsd1 23805]# cat status |grep "ctx"
voluntary_ctxt_switches:        22
nonvoluntary_ctxt_switches:     4145585
...
[root@gfsd1 23805]# cat status |grep "ctx"
voluntary_ctxt_switches:        22
nonvoluntary_ctxt_switches:     4145937
[root@gfsd1 23805]# cat status |grep "ctx"
voluntary_ctxt_switches:        22
nonvoluntary_ctxt_switches:     4173266
[root@gfsd1 23805]#

下面给出一个 memcached 和死循环业务跑才同一个 CPU 的信息输出 
[root@gfsd1 ~]# top -d 3 -Hp 26606

Tasks:  19 total,   1 running,  18 sleeping,   0 stopped,   0 zombie
Cpu0  :  3.2%us,  5.7%sy,  0.0%ni, 91.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 96.4%us,  0.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  3.0%si,  0.0%st
Cpu2  :  9.0%us,  7.6%sy,  0.0%ni, 82.3%id,  0.7%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  : 16.7%us, 13.3%sy,  0.0%ni, 67.8%id,  0.0%wa,  0.4%hi,  1.9%si,  0.0%st
Cpu4  :  9.5%us,  8.0%sy,  0.0%ni, 82.1%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu5  :  7.1%us, 11.8%sy,  0.0%ni, 80.7%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu6  :  4.6%us,  7.0%sy,  0.0%ni, 87.7%id,  0.4%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu7  : 11.5%us, 10.3%sy,  0.0%ni, 77.8%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Mem:  16333656k total, 15678016k used,   655640k free,   246752k buffers
Swap:  2097144k total,    41476k used,  2055668k free, 11363572k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
26613 root      20   0  210m  21m 2760 R 99.8  0.1  35:01.73 1 CenterCtrl
26609 root      20   0  210m  21m 2760 S  1.0  0.1   0:12.04 4 OspTimerTask
26612 root      20   0  210m  21m 2760 S  0.7  0.1   0:01.10 6 OspNodeMan
26620 root      20   0  210m  21m 2760 S  0.7  0.1   0:09.62 2 DBPoolTask_2
26615 root      20   0  210m  21m 2760 S  0.3  0.1   0:09.86 0 DBPoolTask_1
26624 root      20   0  210m  21m 2760 S  0.3  0.1   0:09.54 4 DBPoolTask_4
26628 root      20   0  210m  21m 2760 S  0.3  0.1   0:11.16 4 DBPoolTask_5
26652 root      20   0  210m  21m 2760 S  0.3  0.1   0:11.24 4 DBPoolTask_6
26606 root      20   0  210m  21m 2760 S  0.0  0.1   0:00.59 2 sus
26607 root      20   0  210m  21m 2760 S  0.0  0.1   0:00.06 4 OspLogTask
26608 root      20   0  210m  21m 2760 S  0.0  0.1   0:00.17 7 OspDispatchTask
26610 root      20   0  210m  21m 2760 S  0.0  0.1   0:00.58 0 OspTeleDaemon
26611 root      20   0  210m  21m 2760 S  0.0  0.1   0:00.26 4 OspPostDaemon
26614 root      20   0  210m  21m 2760 S  0.0  0.1   0:00.00 5 ConMgrSsn
26616 root      20   0  210m  21m 2760 S  0.0  0.1   0:00.00 2 DevSsn
26617 root      20   0  210m  21m 2760 S  0.0  0.1   0:00.00 1 ConnectSA
26618 root      20   0  210m  21m 2760 S  0.0  0.1   0:00.00 5 FileSend
26621 root      20   0  210m  21m 2760 S  0.0  0.1   0:10.81 7 DBPoolTask_3
26869 root      20   0  210m  21m 2760 S  0.0  0.1   0:00.03 5 ConnectADPLib
可以看到死循环线程跑在了 cpu1 上。 
[root@gfsd1 ~]# top -d 3 -Hp 30402

top - 15:33:10 up 76 days, 23:22,  9 users,  load average: 3.95, 3.01, 2.83
Tasks:   6 total,   0 running,   6 sleeping,   0 stopped,   0 zombie
Cpu0  :  3.2%us,  5.7%sy,  0.0%ni, 91.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 96.4%us,  0.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  3.0%si,  0.0%st
Cpu2  :  9.0%us,  7.6%sy,  0.0%ni, 82.3%id,  0.7%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  : 16.7%us, 13.3%sy,  0.0%ni, 67.8%id,  0.0%wa,  0.4%hi,  1.9%si,  0.0%st
Cpu4  :  9.5%us,  8.3%sy,  0.0%ni, 81.8%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu5  :  7.1%us, 11.8%sy,  0.0%ni, 80.7%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu6  :  4.6%us,  7.0%sy,  0.0%ni, 87.7%id,  0.4%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu7  : 11.5%us, 10.3%sy,  0.0%ni, 77.8%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Mem:  16333656k total, 15678016k used,   655640k free,   246752k buffers
Swap:  2097144k total,    41476k used,  2055668k free, 11363572k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
30402 root      20   0  333m  16m  648 S  0.0  0.1   1:59.04 0 memcached
30404 root      20   0  333m  16m  648 S  0.0  0.1  15:50.61 0 memcached
30405 root      20   0  333m  16m  648 S  0.0  0.1  12:47.98 3 memcached
30406 root      20   0  333m  16m  648 S  0.0  0.1  14:43.97 2 memcached
30408 root      20   0  333m  16m  648 S  0.0  0.1  16:58.21 2 memcached
30409 root      20   0  333m  16m  648 S  0.0  0.1   0:00.00 1 memcached
与此同时,memcached 其中一个线程也跑在了 cpu1 上,所以,若此时业务请求恰好被该线程处理,则会出现处理速度变慢的问题。 

到此,问题基本上清楚,在将问题业务处理掉后,memcached 又和从前一般快了~~ 
上一篇:使用VisualVM查看Java Heap Dump


下一篇:mysql数据库编写存储过程实现批量插入1000万条数据