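I recently helped analyze a memory leak in a .NET process, and I would like to share how the investigation went.
Symptom: the customer's server-side .NET process showed CPU spikes at minute-level intervals, climbing close to 100% and then falling back.
Figure 1
Analysis: the support engineer captured a dump of the process with procdump.exe. The trigger was configured so that a dump is written once the process's CPU usage stays above 80% for 1 second.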
procdump.exe -ma -s 1 -c 80 10672 f:\aliyun
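The trigger condition ("above 80% for 1 second") can be pictured with a small sketch. This is only a simplified model of a sustained-threshold trigger, assuming one CPU sample per second; it is not ProcDump's actual implementation, and the function name is made up for illustration.

```python
def first_sustained_breach(samples, threshold=80, seconds=1):
    """Return the index of the sample at which CPU has been above
    `threshold` for `seconds` consecutive samples, or None if never.

    Assumption: `samples` holds one CPU percentage per second.
    """
    run = 0  # length of the current streak of samples above the threshold
    for i, cpu in enumerate(samples):
        run = run + 1 if cpu > threshold else 0
        if run >= seconds:
            return i
    return None


if __name__ == "__main__":
    # With "-s 1 -c 80", a single sample above 80% is enough to fire.
    print(first_sustained_breach([50, 88, 60]))
    # A longer window ignores short spikes.
    print(first_sustained_breach([85, 90, 70, 81, 82, 83], seconds=3))
```

With a 1-second window the dump captures whatever happens to be running at the first sample above the threshold, which is worth keeping in mind when the spike itself lasts only a short time.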
Loading Dump File [E:\temp\201127\GameServer.exe_201119_171245.dmp\GameServer.exe_201119_171245.dmp]
User Mini Dump File with Full Memory: Only application data is available
Comment: '
*** e:\soft\procdump\procdump.exe -ma -s 1 -c 80 10672 f:\aliyun
*** Process exceeded 80% CPU (system scale) for 1 second. Value: 88%. Hottest Thread: 4196 (0x1064).'
************* Path validation summary **************
Response Time (ms) Location
Deferred srv*F:\symbols*https://msdl.microsoft.com/download/symbols
Symbol search path is: srv*F:\symbols*https://msdl.microsoft.com/download/symbols
Executable search path is:
Windows 8.1 Version 9600 MP (32 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
6.3.9600.18217 (winblue_ltsb.160124-0053)
Machine Name:
Debug session time: Thu Nov 19 17:12:45.000 2020 (UTC + 8:00)
System Uptime: 38 days 3:36:48.460
Process Uptime: 0 days 0:33:22.000
At the moment the dump was captured, the sampled system CPU load was as high as 91%.
0:059> .loadby sos clr
0:059> !threadpool
CPU utilization: 91%
Worker Thread: Total: 57 Running: 3 Idle: 49 MaxLimit: 32767 MinLimit: 32
Work Request in Queue: 0
--------------------------------------
Number of Timers: 1
--------------------------------------
Completion Port Thread:Total: 89 Free: 88 MaxFree: 64 CurrentLimit: 89 MaxLimit: 1000 MinLimit: 65
Looking at the instant the dump was captured, only a handful of threads were actually using the CPU.
1) Thread 37
0:059> ~37s
mscorlib_ni!System.IO.FileStream.WriteFileNative(Microsoft.Win32.SafeHandles.SafeFileHandle, Byte[], Int32, Int32, System.Threading.NativeOverlapped*, Int32 ByRef)$##600184B+0x86:
00007ffb`c5923d76 48894de0 mov qword ptr [rbp-20h],rcx ss:000000e2`16b0ee20=000000e3cde2fc10
0:037> kL
# Child-SP RetAddr Call Site
00 000000e2`16b0edf0 00007ffb`c5923cc2 mscorlib_ni!System.IO.FileStream.WriteFileNative(Microsoft.Win32.SafeHandles.SafeFileHandle, Byte[], Int32, Int32, System.Threading.NativeOverlapped*, Int32 ByRef)$##600184B+0x86
01 000000e2`16b0ee50 00007ffb`c5923aa7 mscorlib_ni!System.IO.FileStream.WriteCore(Byte[], Int32, Int32)$##600183D+0x62
02 000000e2`16b0eec0 00007ffb`c5923a34 mscorlib_ni!System.IO.FileStream.FlushInternalBuffer()$##600182F+0x57
03 000000e2`16b0ef00 00007ffb`c58d4f4c mscorlib_ni!System.IO.FileStream.Flush(Boolean)$##600182E+0x24
04 000000e2`16b0ef40 00007ffb`69ccac7b mscorlib_ni!System.IO.StreamWriter.Flush(Boolean, Boolean)$##60019BE+0x8c
05 000000e2`16b0efa0 00007ffb`69ccaaae 0x00007ffb`69ccac7b
06 000000e2`16b0efe0 00007ffb`69ca103e 0x00007ffb`69ccaaae
07 000000e2`16b0f030 00007ffb`c58eca72 0x00007ffb`69ca103e
08 000000e2`16b0f070 00007ffb`c58ec904 mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)$##6003A95+0x162
09 000000e2`16b0f140 00007ffb`c58ec8c2 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)$##6003A94+0x14
0a 000000e2`16b0f170 00007ffb`c5926472 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##6003A93+0x52
0b 000000e2`16b0f1c0 00007ffb`c8fd6793 mscorlib_ni!System.Threading.ThreadHelper.ThreadStart()$##6003B8E+0x52
0c 000000e2`16b0f200 00007ffb`c8fd6665 clr!CallDescrWorkerInternal+0x83
0d 000000e2`16b0f240 00007ffb`c8fd736d clr!CallDescrWorkerWithHandler+0x4e
0e 000000e2`16b0f280 00007ffb`c90bbf59 clr!MethodDescCallSite::CallTargetWorker+0xf8
0f 000000e2`16b0f380 00007ffb`c8fd7ce5 clr!ThreadNative::KickOffThread_Worker+0x109
10 000000e2`16b0f5e0 00007ffb`c8fd7c60 clr!Frame::Push+0x59
11 000000e2`16b0f620 00007ffb`c8fd7b9e clr!FillInRegTypeMap+0x198
12 000000e2`16b0f720 00007ffb`c8fd7d1f clr!FillInRegTypeMap+0xc1
13 000000e2`16b0f7b0 00007ffb`c90bbe3b clr!FillInRegTypeMap+0x47
14 000000e2`16b0f810 00007ffb`c919159f clr!ThreadNative::KickOffThread+0xdb
15 000000e2`16b0f8e0 00007ffb`d90d13d2 clr!Thread::intermediateThreadProc+0x86
16 000000e2`16b0fa20 00007ffb`d92254f4 kernel32!BaseThreadInitThunk+0x22
17 000000e2`16b0fa50 00000000`00000000 ntdll!RtlUserThreadStart+0x34
0:037> ub rip
mscorlib_ni!System.IO.FileStream.WriteFileNative(Microsoft.Win32.SafeHandles.SafeFileHandle, Byte[], Int32, Int32, System.Threading.NativeOverlapped*, Int32 ByRef)$##600184B+0x6c:
00007ffb`c5923d5c c9 leave
00007ffb`c5923d5d 4903d1 add rdx,r9
00007ffb`c5923d60 4533c9 xor r9d,r9d
00007ffb`c5923d63 4c894c2420 mov qword ptr [rsp+20h],r9
00007ffb`c5923d68 4c8d4de8 lea r9,[rbp-18h]
00007ffb`c5923d6c 448bc0 mov r8d,eax
00007ffb`c5923d6f e8e40bf2ff call mscorlib_ni!System.Runtime.Remoting.Activation.ActivationServices.GetActivator()$##6005B45 (mscorlib_ni+0x434958) (00007ffb`c5844958)
00007ffb`c5923d74 33c9 xor ecx,ecx
2) Thread 39
0:042> ~39s
mscorlib_ni!System.Text.UTF8Encoding.GetBytes(Char*, Int32, Byte*, Int32, System.Text.EncoderNLS)$##600675A+0x1df:
00007ffb`c5963cef f7c280ff80ff test edx,0FF80FF80h
0:039> kL
# Child-SP RetAddr Call Site
00 000000e2`16d0ee40 00007ffb`c58d50be mscorlib_ni!System.Text.UTF8Encoding.GetBytes(Char*, Int32, Byte*, Int32, System.Text.EncoderNLS)$##600675A+0x1df
01 000000e2`16d0eed0 00007ffb`c58d4f17 mscorlib_ni!System.Text.EncoderNLS.GetBytes(Char[], Int32, Int32, Byte[], Int32, Boolean)$##6006608+0x11e
02 000000e2`16d0ef60 00007ffb`69ccac7b mscorlib_ni!System.IO.StreamWriter.Flush(Boolean, Boolean)$##60019BE+0x57
03 000000e2`16d0efc0 00007ffb`69ccaaae 0x00007ffb`69ccac7b
04 000000e2`16d0f000 00007ffb`69ca103e 0x00007ffb`69ccaaae
05 000000e2`16d0f050 00007ffb`c58eca72 0x00007ffb`69ca103e
06 000000e2`16d0f090 00007ffb`c58ec904 mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)$##6003A95+0x162
07 000000e2`16d0f160 00007ffb`c58ec8c2 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)$##6003A94+0x14
08 000000e2`16d0f190 00007ffb`c5926472 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##6003A93+0x52
09 000000e2`16d0f1e0 00007ffb`c8fd6793 mscorlib_ni!System.Threading.ThreadHelper.ThreadStart()$##6003B8E+0x52
0a 000000e2`16d0f220 00007ffb`c8fd6665 clr!CallDescrWorkerInternal+0x83
0b 000000e2`16d0f260 00007ffb`c8fd736d clr!CallDescrWorkerWithHandler+0x4e
0c 000000e2`16d0f2a0 00007ffb`c90bbf59 clr!MethodDescCallSite::CallTargetWorker+0xf8
0d 000000e2`16d0f3a0 00007ffb`c8fd7ce5 clr!ThreadNative::KickOffThread_Worker+0x109
0e 000000e2`16d0f600 00007ffb`c8fd7c60 clr!Frame::Push+0x59
0f 000000e2`16d0f640 00007ffb`c8fd7b9e clr!FillInRegTypeMap+0x198
10 000000e2`16d0f740 00007ffb`c8fd7d1f clr!FillInRegTypeMap+0xc1
11 000000e2`16d0f7d0 00007ffb`c90bbe3b clr!FillInRegTypeMap+0x47
12 000000e2`16d0f830 00007ffb`c919159f clr!ThreadNative::KickOffThread+0xdb
13 000000e2`16d0f900 00007ffb`d90d13d2 clr!Thread::intermediateThreadProc+0x86
14 000000e2`16d0fb40 00007ffb`d92254f4 kernel32!BaseThreadInitThunk+0x22
15 000000e2`16d0fb70 00000000`00000000 ntdll!RtlUserThreadStart+0x34
0:039> ub rip
mscorlib_ni!System.Text.UTF8Encoding.GetBytes(Char*, Int32, Byte*, Int32, System.Text.EncoderNLS)$##600675A+0x1c0:
00007ffb`c5963cd0 4889442438 mov qword ptr [rsp+38h],rax
00007ffb`c5963cd5 e979050000 jmp mscorlib_ni!System.Text.UTF8Encoding.GetBytes(Char*, Int32, Byte*, Int32, System.Text.EncoderNLS)$##600675A+0x743 (00007ffb`c5964253)
00007ffb`c5963cda 488b4c2440 mov rcx,qword ptr [rsp+40h]
00007ffb`c5963cdf 448b19 mov r11d,dword ptr [rcx]
00007ffb`c5963ce2 488b4c2440 mov rcx,qword ptr [rsp+40h]
00007ffb`c5963ce7 8b4904 mov ecx,dword ptr [rcx+4]
00007ffb`c5963cea 418bd3 mov edx,r11d
00007ffb`c5963ced 0bd1 or edx,ecx
3) Thread 52
0:039> ~52s
clr!SVR::gc_heap::background_mark_simple1+0x48:
00007ffb`c9162f38 488bd7 mov rdx,rdi
0:052> kL
# Child-SP RetAddr Call Site
00 000000e2`17e0f140 00007ffb`c91631ae clr!SVR::gc_heap::background_mark_simple1+0x48
01 000000e2`17e0f1b0 00007ffb`c9163f14 clr!SVR::gc_heap::background_mark_simple+0x91
02 000000e2`17e0f1e0 00007ffb`c91628b4 clr!SVR::gc_heap::background_drain_mark_list+0x50
03 000000e2`17e0f210 00007ffb`c934f660 clr!SVR::gc_heap::background_mark_phase+0x3bf
04 000000e2`17e0f2a0 00007ffb`c9162244 clr! ?? ::FNODOBFM::`string'+0x8082a
05 000000e2`17e0f2f0 00007ffb`c919159f clr!SVR::gc_heap::bgc_thread_function+0x132
06 000000e2`17e0f340 00007ffb`d90d13d2 clr!Thread::intermediateThreadProc+0x86
07 000000e2`17e0fb80 00007ffb`d92254f4 kernel32!BaseThreadInitThunk+0x22
08 000000e2`17e0fbb0 00000000`00000000 ntdll!RtlUserThreadStart+0x34
0:052> ub rip
clr!SVR::gc_heap::background_mark_simple1+0x26:
00007ffb`c9162f16 488bfa mov rdi,rdx
00007ffb`c9162f19 488bd9 mov rbx,rcx
00007ffb`c9162f1c 4c8989f01e0000 mov qword ptr [rcx+1EF0h],r9
00007ffb`c9162f23 4d8d04c1 lea r8,[r9+rax*8]
00007ffb`c9162f27 4c89442478 mov qword ptr [rsp+78h],r8
00007ffb`c9162f2c 4533db xor r11d,r11d
00007ffb`c9162f2f 4885ff test rdi,rdi
00007ffb`c9162f32 0f84e1010000 je clr!SVR::gc_heap::background_mark_simple1+0x901 (00007ffb`c9163119)
4) Thread 113
0:052> ~113s
MSVCR120_CLR0400!memset+0x23:
00007ffb`c8f0f8b3 f3aa rep stos byte ptr [rdi]
0:113> kL
# Child-SP RetAddr Call Site
00 000000e2`1c7fd580 000000ec`7cb76810 MSVCR120_CLR0400!memset+0x23
01 000000e2`1c7fd588 00007ffb`c918f750 0x000000ec`7cb76810
02 000000e2`1c7fd590 00007ffb`c918f3d2 clr!SVR::gc_heap::adjust_limit_clr+0xe0
03 000000e2`1c7fd5e0 00007ffb`c914625f clr!SVR::gc_heap::allocate_small+0x3ae
04 000000e2`1c7fd6a0 00007ffb`c58e0e5c clr!JIT_New+0x61f
*** WARNING: Unable to verify checksum for System.Core.ni.dll
*** ERROR: Module load completed but symbols could not be loaded for System.Core.ni.dll
05 000000e2`1c7fdae0 00007ffb`c36fb3ae mscorlib_ni!System.Collections.Generic.List`1[System.__Canon].System.Collections.Generic.IEnumerable.GetEnumerator()$##60039A3+0x4c
06 000000e2`1c7fdb40 00007ffb`6ad21c2d System_Core_ni+0x2db3ae
07 000000e2`1c7fdbb0 00007ffb`6ad21544 0x00007ffb`6ad21c2d
08 000000e2`1c7fdc00 00007ffb`6ad0a240 0x00007ffb`6ad21544
09 000000e2`1c7fdc60 00007ffb`6ad09cfb 0x00007ffb`6ad0a240
0a 000000e2`1c7fdcc0 00007ffb`6ad0920c 0x00007ffb`6ad09cfb
0b 000000e2`1c7fdd00 00007ffb`6ad07790 0x00007ffb`6ad0920c
0c 000000e2`1c7fdd40 00007ffb`6ace5a30 0x00007ffb`6ad07790
0d 000000e2`1c7fdda0 00007ffb`6ace38e8 0x00007ffb`6ace5a30
0e 000000e2`1c7fde90 00007ffb`6ace1a7b 0x00007ffb`6ace38e8
0f 000000e2`1c7fdf80 00007ffb`6ace1407 0x00007ffb`6ace1a7b
10 000000e2`1c7fe080 00007ffb`6a80981e 0x00007ffb`6ace1407
11 000000e2`1c7fe0b0 00007ffb`6a8081db 0x00007ffb`6a80981e
12 000000e2`1c7fe110 00007ffb`c58eca72 0x00007ffb`6a8081db
...
20 000000e2`1c7fe710 00007ffb`c8fd6665 clr!CallDescrWorkerInternal+0x83
21 000000e2`1c7fe750 00007ffb`c8fd736d clr!CallDescrWorkerWithHandler+0x4e
22 000000e2`1c7fe790 00007ffb`c8fdaf69 clr!MethodDescCallSite::CallTargetWorker+0xf8
23 000000e2`1c7fe890 00007ffb`c8fd7ce5 clr!QueueUserWorkItemManagedCallback+0x2a
24 000000e2`1c7fe980 00007ffb`c8fd7c60 clr!Frame::Push+0x59
25 000000e2`1c7fe9c0 00007ffb`c8fd7b9e clr!FillInRegTypeMap+0x198
26 000000e2`1c7feac0 00007ffb`c8fd7d1f clr!FillInRegTypeMap+0xc1
27 000000e2`1c7feb50 00007ffb`c8fdaa70 clr!FillInRegTypeMap+0x47
28 000000e2`1c7febb0 00007ffb`c8fd82b8 clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0xa0
29 000000e2`1c7fed30 00007ffb`c8fd8195 clr!ThreadpoolMgr::ExecuteWorkRequest+0x64
2a 000000e2`1c7fed60 00007ffb`c919159f clr!ThreadpoolMgr::WorkerThreadStart+0xf5
2b 000000e2`1c7fee00 00007ffb`d90d13d2 clr!Thread::intermediateThreadProc+0x86
2c 000000e2`1c7ffbc0 00007ffb`d92254f4 kernel32!BaseThreadInitThunk+0x22
2d 000000e2`1c7ffbf0 00000000`00000000 ntdll!RtlUserThreadStart+0x34
0:113> ub rip
MSVCR120_CLR0400!memset+0x6:
00007ffb`c8f0f896 4983f810 cmp r8,10h
00007ffb`c8f0f89a 0f825c010000 jb MSVCR120_CLR0400!memset+0x16c (00007ffb`c8f0f9fc)
00007ffb`c8f0f8a0 0fba25b08e0a0001 bt dword ptr [MSVCR120_CLR0400!_favor (00007ffb`c8fb8758)],1
00007ffb`c8f0f8a8 730e jae MSVCR120_CLR0400!memset+0x28 (00007ffb`c8f0f8b8)
00007ffb`c8f0f8aa 57 push rdi
00007ffb`c8f0f8ab 488bf9 mov rdi,rcx
00007ffb`c8f0f8ae 8bc2 mov eax,edx
00007ffb`c8f0f8b0 498bc8 mov rcx,r8
However, this machine is a 32-core server:
0:113> !cpuid
CP F/M/S Manufacturer MHz
0 6,5,7 2500
1 6,5,7 2500
2 6,5,7 2500
3 6,5,7 2500
4 6,5,7 2500
5 6,5,7 2500
6 6,5,7 2500
7 6,5,7 2500
8 6,5,7 2500
9 6,5,7 2500
10 6,5,7 2500
11 6,5,7 2500
12 6,5,7 2500
13 6,5,7 2500
14 6,5,7 2500
15 6,5,7 2500
16 6,5,7 2500
17 6,5,7 2500
18 6,5,7 2500
19 6,5,7 2500
20 6,5,7 2500
21 6,5,7 2500
22 6,5,7 2500
23 6,5,7 2500
24 6,5,7 2500
25 6,5,7 2500
26 6,5,7 2500
27 6,5,7 2500
28 6,5,7 2500
29 6,5,7 2500
30 6,5,7 2500
31 6,5,7 2500
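So the few threads above cannot, by themselves, drive the server's CPU that high. At the moment the dump was captured, this process's CPU usage was in fact not high, so we cannot find the cause of the high CPU directly by analyzing thread behavior in this dump.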
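The reasoning is simple arithmetic: each runnable thread can occupy at most one logical CPU at a time. A minimal sketch, assuming the four busy threads shown above and the 32 logical CPUs reported by !cpuid (the helper name is made up for illustration):

```python
def max_cpu_share(busy_threads, logical_cpus):
    """Upper bound on total CPU usage (in percent) that `busy_threads`
    fully busy threads can cause on a machine with `logical_cpus` CPUs."""
    return 100.0 * min(busy_threads, logical_cpus) / logical_cpus


if __name__ == "__main__":
    # Threads 37, 39, 52 and 113 on a 32-core server.
    print(max_cpu_share(4, 32))
```

Even if all four threads were spinning at full speed, they would account for only about an eighth of the machine, far from the 80% trigger.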
We did notice that the dump itself is very large, about 20 GB, and that the vast majority of this memory is .NET managed memory.
0:113> !address -summary
Mapping file section regions...
Mapping module regions...
Mapping PEB regions...
Mapping TEB and stack regions...
Mapping heap regions...
Mapping page heap regions...
Mapping other regions...
Mapping stack trace database regions...
Mapping activation context regions...
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 166 7ff5`c19eb000 ( 127.960 TB) 99.97%
<unknown> 999 a`1af16000 ( 40.421 GB) 98.65% 0.03%
Stack 862 0`10240000 ( 258.250 MB) 0.62% 0.00%
Image 712 0`0a0b6000 ( 160.711 MB) 0.38% 0.00%
Heap 61 0`08fef000 ( 143.934 MB) 0.34% 0.00%
TEB 284 0`00238000 ( 2.219 MB) 0.01% 0.00%
Other 9 0`001d1000 ( 1.816 MB) 0.00% 0.00%
PEB 1 0`00001000 ( 4.000 kB) 0.00% 0.00%
--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_PRIVATE 1845 a`31a9f000 ( 40.776 GB) 99.52% 0.03%
MEM_IMAGE 1048 0`0ab32000 ( 171.195 MB) 0.41% 0.00%
MEM_MAPPED 35 0`02034000 ( 32.203 MB) 0.08% 0.00%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE 166 7ff5`c19eb000 ( 127.960 TB) 99.97%
MEM_RESERVE 655 5`2804b000 ( 20.625 GB) 50.34% 0.02%
MEM_COMMIT 2273 5`165ba000 ( 20.349 GB) 49.66% 0.02%
--- Protect Summary (for commit) - RgnCount ----------- Total Size -------- %ofBusy %ofTotal
PAGE_READWRITE 1252 5`0a7de000 ( 20.164 GB) 49.21% 0.02%
PAGE_EXECUTE_READ 74 0`073c5000 ( 115.770 MB) 0.28% 0.00%
PAGE_READONLY 360 0`02d12000 ( 45.070 MB) 0.11% 0.00%
PAGE_WRITECOPY 175 0`0102e000 ( 16.180 MB) 0.04% 0.00%
PAGE_EXECUTE_READWRITE 79 0`005d8000 ( 5.844 MB) 0.01% 0.00%
PAGE_READWRITE|PAGE_GUARD 284 0`00582000 ( 5.508 MB) 0.01% 0.00%
PAGE_EXECUTE_WRITECOPY 36 0`0016f000 ( 1.434 MB) 0.00% 0.00%
PAGE_NOACCESS 11 0`0000b000 ( 44.000 kB) 0.00% 0.00%
PAGE_EXECUTE 2 0`00003000 ( 12.000 kB) 0.00% 0.00%
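It is therefore well worth looking at how these .NET managed objects behave in memory.
We can see that some object types in the customer's own namespace (ShowHand, anonymized) have reached counts in the millions: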
0:113> !dumpheap -stat
...
00007ffb6a1fab68 1062320 95485536 ShowHand.ConfigData.CommonPropertyInfo[]
00007ffb6ac2c4c0 1310734 104858720 ShowHand.ProjectU.Common.LBStaticActorProcessingAchievement
00007ffb6ac20db0 1633682 109318576 ShowHand.ProjectU.Common.IPropertiesProvider[]
00007ffb6ac0ad88 253100 114645888 ShowHand.ProjectU.Common.ICommonActorComp[]
00007ffbc5b16948 1809532 116227168 System.String
00007ffb6a5a84a8 1076959 120619408 ShowHand.ProjectU.Common.BattleGrid
00007ffb6b04bf40 2549394 122370912 behaviac.Action+ActionTask
00007ffb6a1f8a50 5651702 135640848 ShowHand.ConfigData.CommonPropertyInfo
00007ffb6b04ae10 1986359 143017848 behaviac.Selector+SelectorTask
00007ffb6abe65f8 31814 151783472 ShowHand.ProjectU.Common.IGameEventPipeListener[]
00007ffb6b04bdc8 3176057 152450736 behaviac.Assignment+AssignmentTask
00007ffb6ac03478 14993 155447424 System.Collections.Generic.Dictionary`2+Entry[[System.Int32, mscorlib],[ShowHand.ProjectU.Common.WayPointInfo, CommonDefine]][]
00007ffb6b0ba198 3976124 159044960 ShowHand.ProjectU.Common.BattleGrid+GridLink
00007ffb6afe0ef8 228402 164720576 ShowHand.ProjectU.Common.BattleGrid[]
00007ffb6afe0698 344120 185049680 ShowHand.ProjectU.Common.BattleGridInfo4Select[]
00007ffb6abda5f0 4078096 195748608 ShowHand.ProjectU.Common.WayPointInfo
00007ffb6b04bac0 4306198 206697504 behaviac.Condition+ConditionTask
00007ffb6b04bc88 2827987 226238960 behaviac.ReferencedBehavior+ReferencedBehaviorTask
00007ffb6aba7b88 5975977 239039080 ShowHand.ProjectU.Common.ProcessingMissionInfo
00007ffb6abebab0 6016203 240648120 ShowHand.ProjectU.Common.GameEventIdDefine[]
00007ffb6abe98e0 6050943 242037720 System.Collections.Generic.List`1[[ShowHand.ProjectU.Common.GameEventIdDefine, CommonDefine]]
00007ffb6afe0c78 515019 275971336 ShowHand.ProjectU.Common.BattleGridInfo4Attack[]
...
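When the `!dumpheap -stat` listing is long, it can help to aggregate it by namespace outside the debugger. The following is a minimal sketch, assuming the four-column layout shown above (MethodTable, count, total size, type name) on a 64-bit target; the function names and the sample are illustrative, with the three sample rows copied from the listing.

```python
import re
from collections import defaultdict

# One data row: 16-hex-digit MethodTable, object count, total bytes, type name.
ROW = re.compile(r"^([0-9a-fA-F]{16})\s+(\d+)\s+(\d+)\s+(.+)$")


def parse_dumpheap_stat(text):
    """Return (mt, count, total_size, type_name) tuples; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            mt, count, size, name = m.groups()
            rows.append((mt, int(count), int(size), name))
    return rows


def totals_by_prefix(rows, prefixes):
    """Sum object count and bytes per type-name prefix.

    Note: generic types such as List`1[[ShowHand...]] start with "System."
    and are therefore not attributed to the customer namespace here.
    """
    totals = defaultdict(lambda: [0, 0])
    for _mt, count, size, name in rows:
        for prefix in prefixes:
            if name.startswith(prefix):
                totals[prefix][0] += count
                totals[prefix][1] += size
                break
    return {p: tuple(v) for p, v in totals.items()}


SAMPLE = """\
00007ffb6a1f8a50 5651702 135640848 ShowHand.ConfigData.CommonPropertyInfo
00007ffb6b04bac0 4306198 206697504 behaviac.Condition+ConditionTask
00007ffb6aba7b88 5975977 239039080 ShowHand.ProjectU.Common.ProcessingMissionInfo
"""

if __name__ == "__main__":
    rows = parse_dumpheap_stat(SAMPLE)
    for prefix, (count, size) in totals_by_prefix(rows, ["ShowHand.", "behaviac."]).items():
        print(prefix, count, size)
```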
So it is worth digging into whether there is a managed memory leak here.
Looking at the 32 GC heaps, we can see that they are indeed in very poor health:
0:113> !eeheap -gc
Number of GC Heaps: 32
------------------------------
Heap 0 (000000e275cd3290)
generation 0 starts at 0x000000e29cfa59a8
generation 1 starts at 0x000000e29b152070
generation 2 starts at 0x000000e277b31000
ephemeral segment allocation context: none
segment begin allocated size
000000e277b30000 000000e277b31000 000000e29d2959c0 0x257649c0(628509120)
Large object heap starts at 0x000000ea77b31000
segment begin allocated size
000000ea77b30000 000000ea77b31000 000000ea7811ccc8 0x5ebcc8(6208712)
Heap Size: Size: 0x25d50688 (634717832) bytes.
------------------------------
Heap 1 (000000e275cd6620)
generation 0 starts at 0x000000e2d363e938
generation 1 starts at 0x000000e2d182a800
generation 2 starts at 0x000000e2b7b31000
ephemeral segment allocation context: none
segment begin allocated size
000000e2b7b30000 000000e2b7b31000 000000e2d5478678 0x1d947678(496268920)
Large object heap starts at 0x000000ea87b31000
segment begin allocated size
000000ea87b30000 000000ea87b31000 000000ea87fa0e68 0x46fe68(4652648)
Heap Size: Size: 0x1ddb74e0 (500921568) bytes.
------------------------------
Heap 2 (000000e275cda830)
generation 0 starts at 0x000000e31f5516c8
generation 1 starts at 0x000000e31d26b3b8
generation 2 starts at 0x000000e2f7b31000
ephemeral segment allocation context: none
segment begin allocated size
000000e2f7b30000 000000e2f7b31000 000000e321b9c7b0 0x2a06b7b0(705083312)
Large object heap starts at 0x000000ea97b31000
segment begin allocated size
000000ea97b30000 000000ea97b31000 000000ea98130d60 0x5ffd60(6290784)
Heap Size: Size: 0x2a66b510 (711374096) bytes.
------------------------------
Heap 3 (000000e275cdf480)
generation 0 starts at 0x000000e35263b398
generation 1 starts at 0x000000e350719af8
generation 2 starts at 0x000000e337b31000
ephemeral segment allocation context: none
segment begin allocated size
000000e337b30000 000000e337b31000 000000e3529b73b0 0x1ae863b0(451437488)
Large object heap starts at 0x000000eaa7b31000
segment begin allocated size
000000eaa7b30000 000000eaa7b31000 000000eaa7f90b30 0x45fb30(4586288)
Heap Size: Size: 0x1b2e5ee0 (456023776) bytes.
------------------------------
The data for the remaining 28 heaps is omitted. Taking GC heap 0 as an example:
Its Gen2 has already reached about 593 MB (593629296 bytes).
0:059> ? 0x000000e29b152070-0x000000e277b31000
Evaluate expression: 593629296 = 00000000`23621070
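Gen0 and gen1, by contrast, are only a few MB and a few tens of MB. A distribution with tiny gen0 and gen1 and a huge gen2 is very abnormal, and suggests that there may be managed objects that the GC cannot reclaim.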
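The generation sizes for heap 0 can be derived from the addresses printed by `!eeheap -gc`. This is a minimal sketch, assuming the layout the listing shows for heap 0: a single small-object segment in which gen2, gen1 and gen0 follow one another and gen0 ends at the segment's "allocated" address. The function name is made up for illustration.

```python
def generation_sizes(gen2_start, gen1_start, gen0_start, allocated_end):
    """Sizes in bytes of gen2, gen1 and gen0 for one contiguous segment.

    Assumption: the generations are laid out back to back as
    [gen2_start, gen1_start), [gen1_start, gen0_start), [gen0_start, allocated_end).
    """
    return {
        "gen2": gen1_start - gen2_start,
        "gen1": gen0_start - gen1_start,
        "gen0": allocated_end - gen0_start,
    }


# Addresses copied from the "Heap 0" block of the !eeheap -gc output.
HEAP0 = generation_sizes(
    gen2_start=0x000000E277B31000,
    gen1_start=0x000000E29B152070,
    gen0_start=0x000000E29CFA59A8,
    allocated_end=0x000000E29D2959C0,
)

if __name__ == "__main__":
    for name, size in HEAP0.items():
        print(f"{name}: {size} bytes ({size / 1e6:.1f} MB)")
```

The three values add up to the 628509120 bytes that the listing reports for the segment, with gen2 taking roughly 94% of it.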
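Looking at the objects in gen2 of heap 0, we can see that within this single heap, some types in the customer's namespace have as many as 200,000 instances.
0:059> !dumpheap -stat 0x000000e277b31000 0x000000e29b152070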
00007ffb6a5a84a8 35146 3936352 ShowHand.ProjectU.Common.BattleGrid
00007ffb6b04ae10 62375 4491000 behaviac.Selector+SelectorTask
00007ffb6a1f8a50 190065 4561560 ShowHand.ConfigData.CommonPropertyInfo
00007ffb6b04bdc8 99780 4789440 behaviac.Assignment+AssignmentTask
00007ffb6b0ba198 129947 5197880 ShowHand.ProjectU.Common.BattleGrid+GridLink
00007ffb6ac03478 508 5266944 System.Collections.Generic.Dictionary`2+Entry[[System.Int32, mscorlib],[ShowHand.ProjectU.Common.WayPointInfo, CommonDefine]][]
00007ffb6afe0698 10768 5802880 ShowHand.ProjectU.Common.BattleGridInfo4Select[]
00007ffb6b04bac0 133954 6429792 behaviac.Condition+ConditionTask
00007ffb6abda5f0 138160 6631680 ShowHand.ProjectU.Common.WayPointInfo
00007ffb6b04bc88 87804 7024320 behaviac.ReferencedBehavior+ReferencedBehaviorTask
00007ffb6aba7b88 204523 8180920 ShowHand.ProjectU.Common.ProcessingMissionInfo
00007ffb6abebab0 205528 8221120 ShowHand.ProjectU.Common.GameEventIdDefine[]
00007ffb6abe98e0 207062 8282480 System.Collections.Generic.List`1[[ShowHand.ProjectU.Common.GameEventIdDefine, CommonDefine]]
00007ffb6afe0c78 16152 8657472 ShowHand.ProjectU.Common.BattleGridInfo4Attack[]
00007ffb6b04af38 222880 8915200 System.Collections.Generic.List`1[[behaviac.BehaviorTask, BehaviacRuntime]]
00007ffb6b0ea0c0 5248 9385248 System.Collections.Generic.Dictionary`2+Entry[[System.String, mscorlib],[FixMath.NET.Fix64, Fix64]][]
00007ffbc5aea7f0 121480 9718400 System.Collections.Generic.Dictionary`2[[System.Int32, mscorlib],[System.Int32, mscorlib]]
00007ffb6b04b730 147779 10640088 behaviac.Sequence+SequenceTask
Taking ShowHand.ProjectU.Common.ProcessingMissionInfo as an example, we pick some objects of this class at random and examine what roots them.
0:059> !dumpheap -mt 00007ffb6aba7b88 0x000000e277b31000 0x000000e29b152070
...
000000e27c723178 00007ffb6aba7b88 40
000000e27c723348 00007ffb6aba7b88 40
000000e27c723500 00007ffb6aba7b88 40
000000e27c7236b8 00007ffb6aba7b88 40
000000e27c723888 00007ffb6aba7b88 40
000000e27c723a40 00007ffb6aba7b88 40
000000e27c723bf8 00007ffb6aba7b88 40
000000e27c723dc8 00007ffb6aba7b88 40
000000e27c723f80 00007ffb6aba7b88 40
000000e27c724138 00007ffb6aba7b88 40
000000e27c724308 00007ffb6aba7b88 40
000000e27c7244c0 00007ffb6aba7b88 40
000000e27c724678 00007ffb6aba7b88 40
000000e27c724848 00007ffb6aba7b88 40
000000e27c724a00 00007ffb6aba7b88 40
000000e27c724bb8 00007ffb6aba7b88 40
000000e27c724d88 00007ffb6aba7b88 40
000000e27c724f40 00007ffb6aba7b88 40
000000e27c7250f8 00007ffb6aba7b88 40
000000e27c7252c8 00007ffb6aba7b88 40
000000e27c725480 00007ffb6aba7b88 40
000000e27c725638 00007ffb6aba7b88 40
000000e27c725808 00007ffb6aba7b88 40
000000e27c7259c0 00007ffb6aba7b88 40
000000e27c725b78 00007ffb6aba7b88 40
000000e27c725d48 00007ffb6aba7b88 40
000000e27c725f00 00007ffb6aba7b88 40
000000e27c7260b8 00007ffb6aba7b88 40
000000e27c726288 00007ffb6aba7b88 40
000000e27c726440 00007ffb6aba7b88 40
000000e27c7265f8 00007ffb6aba7b88 40
000000e27c7267c8 00007ffb6aba7b88 40
...
We pick two objects at random, 000000e27c723348 and 000000e27c723f80, and look at their reference chains:
0:059> !gcroot 000000e27c723348
Thread 2f78:
000000e216b0efa0 00007ffb69ccac7b log4net.Appender.FileAppender.Append(log4net.Core.LoggingEvent)
rbp+10: 000000e216b0efe0
-> 000000e377b3f738 log4net.Appender.AsyncRollingFileAppender
-> 000000e377b3f850 System.Collections.Concurrent.ConcurrentQueue`1[[log4net.Core.LoggingEvent, log4net]]
-> 000000e623ca3f08 System.Collections.Concurrent.ConcurrentQueue`1+Segment[[log4net.Core.LoggingEvent, log4net]]
-> 000000e623ca3f48 log4net.Core.LoggingEvent[]
-> 000000e9555e8218 log4net.Core.LoggingEvent
-> 000000e377b3b310 log4net.Repository.Hierarchy.Hierarchy
-> 000000e377b527a0 log4net.Repository.LoggerRepositoryShutdownEventHandler
-> 000000e377b526c8 log4net.Core.WrapperMap
-> 000000e377b526f0 System.Collections.Hashtable
-> 000000e377b52740 System.Collections.Hashtable+bucket[]
-> 000000e377b527e0 System.Collections.Hashtable
-> 000000e377b52830 System.Collections.Hashtable+bucket[]
-> 000000e377b522e8 log4net.Repository.Hierarchy.DefaultLoggerFactory+LoggerImpl
-> 000000e377b3e720 log4net.Repository.Hierarchy.RootLogger
-> 000000e377b46980 log4net.Util.AppenderAttachedImpl
-> 000000e377b469a0 log4net.Appender.AppenderCollection
-> 000000e377b51130 log4net.Appender.IAppender[]
-> 000000e377b469c0 log4net.Appender.AsyncRollingFileAppender
-> 000000e377b46b70 System.Threading.Thread
-> 000000e6b7b33cb8 System.Runtime.Remoting.Contexts.Context
-> 000000e277b31560 System.AppDomain
-> 000000e6b7b67160 System.UnhandledExceptionEventHandler
-> 000000e277b32a60 ShowHand.ProjectU.GameServer.GameServer
-> 000000e377b53750 ShowHand.ServerBase.PlayerContextManager
-> 000000e377b53a48 System.Collections.Concurrent.ConcurrentDictionary`2[[System.UInt64, mscorlib],[ShowHand.ServerBase.IManagedContext, ServerBase]]
-> 000000e9c04a04d0 System.Collections.Concurrent.ConcurrentDictionary`2+Tables[[System.UInt64, mscorlib],[ShowHand.ServerBase.IManagedContext, ServerBase]]
-> 000000ec47cb0218 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[System.UInt64, mscorlib],[ShowHand.ServerBase.IManagedContext, ServerBase]][]
-> 000000e9c048e440 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[System.UInt64, mscorlib],[ShowHand.ServerBase.IManagedContext, ServerBase]]
-> 000000e707424da8 ShowHand.ProjectU.GameServer.GameServerPlayerContext
-> 000000e707424cf0 ShowHand.NetSharp.Client
-> 000000e707425610 System.Threading.SemaphoreSlim
-> 000000e707425668 System.Threading.SemaphoreSlim+TaskNode
-> 000000e685d6a2e0 System.Collections.Generic.List`1[[System.Object, mscorlib]]
-> 000000e6c2a468d0 System.Object[]
-> 000000e2d362b088 System.Threading.Tasks.TaskFactory+CompleteOnInvokePromise
-> 000000e707425840 System.Action
-> 000000e707425820 System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner
-> 000000e7074258d0 ShowHand.NetSharp.Client+d__21
-> 000000e707425880 System.Threading.Tasks.Task`1[[System.Threading.Tasks.VoidTaskResult, mscorlib]]
-> 000000e707425980 System.Action
-> 000000e707425960 System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner
-> 000000e7074259c0 ShowHand.NetSharp.Endpoint+d__8
-> 000000e2b7b495f8 ShowHand.NetSharp.Endpoint
-> 000000e2b7b49630 System.Collections.Concurrent.ConcurrentDictionary`2[[ShowHand.NetSharp.Client, NetSharp],[System.Boolean, mscorlib]]
-> 000000e5bccc1f00 System.Collections.Concurrent.ConcurrentDictionary`2+Tables[[ShowHand.NetSharp.Client, NetSharp],[System.Boolean, mscorlib]]
-> 000000eb47bd1150 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[ShowHand.NetSharp.Client, NetSharp],[System.Boolean, mscorlib]][]
-> 000000e5bccbbee8 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[ShowHand.NetSharp.Client, NetSharp],[System.Boolean, mscorlib]]
-> 000000e43c0e1d48 ShowHand.NetSharp.Client
-> 000000e43c0e2cb8 ShowHand.ProjectU.GameServer.GameServerPlayerContext
-> 000000e43c0e2e50 System.Collections.Generic.List`1[[ShowHand.ProjectU.Common.Components.Server.IServerDataSection, ProjectU.LogicComp]]
-> 000000e27c6b4f48 ShowHand.ProjectU.Common.Components.Server.IServerDataSection[]
-> 000000e27c6a12f8 ShowHand.ProjectU.Common.Components.Server.MissionDataSection
-> 000000e27c6a1330 System.Collections.Generic.List`1[[ShowHand.ProjectU.Common.ProcessingMissionInfo, CommonDefine]]
-> 000000e27c71d400 ShowHand.ProjectU.Common.ProcessingMissionInfo[]
-> 000000e27c723348 ShowHand.ProjectU.Common.ProcessingMissionInfo
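We can see that 000000e27c723348 is indirectly referenced from ShowHand.NetSharp.Client. 000000e27c723f80 and all the other ShowHand.ProjectU.Common.ProcessingMissionInfo objects in gen2 are likewise indirectly referenced from ShowHand.NetSharp.Client, and all of these references come from the same ShowHand.NetSharp.Endpoint object, 000000e2b7b495f8. So we analyze this object next.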
0:059> !do 000000e2b7b495f8
Name: ShowHand.NetSharp.Endpoint
MethodTable: 00007ffb69f641d0
EEClass: 00007ffb69f53fd8
Size: 56(0x38) bytes
File: e:\ServerRelease\Server\GameServer\NetSharp.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ffbc5b19288 400002d 28 System.Int32 1 instance 65536 k__BackingField
00007ffb698a8d80 400002e 8 ...pointEventHandler 0 instance 000000e277b32a60 m_endPointEventHandler
00007ffbc4bf2088 400002f 10 ...ckets.TcpListener 0 instance 000000e2b7b4ad60 m_listener
00007ffb69f64078 4000030 2c System.Int32 1 instance 0 m_state
00007ffbc5b27b38 4000031 18 ...eading.Tasks.Task 0 instance 000000e2b7b4af70 m_mainTask
00007ffb69f64888 4000032 20 ...olean, mscorlib]] 0 instance 000000e2b7b49630 m_clientActiveness
0:059> !do 000000e2b7b49630
Name: System.Collections.Concurrent.ConcurrentDictionary`2[[ShowHand.NetSharp.Client, NetSharp],[System.Boolean, mscorlib]]
MethodTable: 00007ffb69f64888
EEClass: 00007ffb69e90240
Size: 64(0x40) bytes
File: C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ffb69d33f48 4001830 8 ...olean, mscorlib]] 0 instance 000000e5bccc1f00 m_tables
00007ffbc5b34300 4001831 10 ...Canon, mscorlib]] 0 instance 0000000000000000 m_comparer
00007ffbc5b21f28 4001832 30 System.Boolean 1 instance 1 m_growLockArray
00007ffbc5b19288 4001833 20 System.Int32 1 instance 0 m_keyRehashCount
00007ffbc5b19288 4001834 24 System.Int32 1 instance 32 m_budget
00007ffbc65fc070 4001835 18 ...ean, mscorlib]][] 0 instance 0000000000000000 m_serializationArray
00007ffbc5b19288 4001836 28 System.Int32 1 instance 0 m_serializationConcurrencyLevel
00007ffbc5b19288 4001837 2c System.Int32 1 instance 0 m_serializationCapacity
00007ffbc5b21f28 400183b 10 System.Boolean 1 static
0:059> !do 000000e5bccc1f00
Name: System.Collections.Concurrent.ConcurrentDictionary`2+Tables[[ShowHand.NetSharp.Client, NetSharp],[System.Boolean, mscorlib]]
MethodTable: 00007ffb69f65790
EEClass: 00007ffb69e90968
Size: 48(0x30) bytes
File: C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
MT Field Offset Type VT Attr Value Name
0000000000000000 400341d 8 SZARRAY 0 instance 000000eb47bd1150 m_buckets
00007ffbc5b16fc0 400341e 10 System.Object[] 0 instance 000000e737ef1220 m_locks
00007ffbc5b19220 400341f 18 System.Int32[] 0 instance 000000e5bcc76a00 m_countPerLock
00007ffbc5b34300 4003420 20 ...Canon, mscorlib]] 0 instance 000000e2b7b41d18 m_comparer
0:059> !do 000000eb47bd1150
Name: System.Collections.Concurrent.ConcurrentDictionary`2+Node[[ShowHand.NetSharp.Client, NetSharp],[System.Boolean, mscorlib]][]
MethodTable: 00007ffb69f65610
EEClass: 00007ffbc54daa00
Size: 266240(0x41000) bytes
Array: Rank 1, Number of elements 33277, Type CLASS (Print Array)
Fields:
None
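The element count and byte size reported for the bucket array are consistent with each other, which is a quick sanity check on the `!do` output. A minimal sketch, assuming the usual 64-bit CLR layout for a one-dimensional array of references: a 24-byte header (object header, MethodTable pointer, length plus padding) followed by one 8-byte pointer per element. The constant and function names are made up for illustration.

```python
ARRAY_HEADER_BYTES = 24  # assumed x64 layout: header + MethodTable pointer + length/padding
POINTER_BYTES = 8        # one object reference per element on x64


def ref_array_size(length):
    """Expected size in bytes of a reference-type array with `length` elements."""
    return ARRAY_HEADER_BYTES + POINTER_BYTES * length


def ref_array_length(size):
    """Element count implied by a reference-type array of `size` bytes."""
    payload = size - ARRAY_HEADER_BYTES
    if payload < 0 or payload % POINTER_BYTES:
        raise ValueError("size does not match the assumed array layout")
    return payload // POINTER_BYTES


if __name__ == "__main__":
    print(ref_array_size(33277))      # compare with "Size: 266240(0x41000) bytes"
    print(ref_array_length(0x41000))  # compare with "Number of elements 33277"
```

Note that this is the number of bucket slots in the ConcurrentDictionary, not necessarily the exact number of entries: a slot can be empty or hold a chain of several nodes.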
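We can see that a single ShowHand.NetSharp.Endpoint object references tens of thousands of ShowHand.NetSharp.Client objects (the bucket array of its m_clientActiveness dictionary has 33277 slots), and these tens of thousands of ShowHand.NetSharp.Client objects in turn indirectly reference tens of millions of the customer's own objects of various types. Because they are still referenced, these objects survive garbage collection and are eventually promoted to Gen2.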
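The retention mechanism can be illustrated with a small model. This is a hypothetical sketch in Python, not the customer's code: the class and method names only mirror the types seen in the dump, and whether the real Endpoint ever removes entries from m_clientActiveness is not known from the dump. It shows only that an object stays reachable for as long as a long-lived dictionary holds it as a key.

```python
import gc
import weakref


class Client:
    """Stand-in for a connected client and the data hanging off it."""

    def __init__(self, missions):
        self.missions = missions


class Endpoint:
    """Stand-in for a long-lived endpoint that tracks its clients."""

    def __init__(self):
        self.client_activeness = {}  # client -> bool, like m_clientActiveness

    def on_connect(self, client):
        self.client_activeness[client] = True

    def on_disconnect(self, client):
        # Dropping the entry is what makes the client collectable.
        self.client_activeness.pop(client, None)


def is_collected_after_disconnect(remove_entry):
    """Return True if the client is reclaimed once the caller drops it."""
    endpoint = Endpoint()
    client = Client(missions=[object() for _ in range(3)])
    probe = weakref.ref(client)  # observes the client without keeping it alive
    endpoint.on_connect(client)
    if remove_entry:
        endpoint.on_disconnect(client)
    del client
    gc.collect()
    return probe() is None


if __name__ == "__main__":
    print(is_collected_after_disconnect(remove_entry=True))
    print(is_collected_after_disconnect(remove_entry=False))
```

In the second case the client and everything it references stay alive as long as the endpoint does, which is the same shape as the reference chain reported by !gcroot.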
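This is unhealthy behavior: compared with gen0 and gen1 collections, a gen2 collection is far more expensive. To be rigorous, we cannot treat the analysis above as conclusive evidence that ties it to the CPU jitter described at the beginning (although that possibility does exist, since large-scale gen2 GCs can cause high CPU). Even so, the excess of gen2 objects does need to be addressed. Whether or not it is directly related to the CPU jitter, it poses a serious risk to the healthy operation of the program.
We recommend that the customer, based on the analysis above and their own business logic, consider whether this situation is reasonable. If it is not, they should consider optimizing the program accordingly.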
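We are the SRE team of Alibaba Cloud Intelligence Global Technical Services. We strive to be a technology-based, service-oriented engineering team that safeguards the high availability of business systems. We provide professional, systematic SRE services that help customers make better use of the cloud, build more stable and reliable business systems on it, and improve business stability. We hope to share more techniques that help enterprise customers move to the cloud, use it well, and run their cloud workloads more stably and reliably. You are welcome to join the Alibaba Cloud SRE Technical Academy circle on DingTalk to talk with other cloud users about all things cloud platform.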