【问题描述】:
前段时间有一项目反馈,常用的审批功能有时的响应较慢,多个管理员功能不定期的出现客户端无响应的状况,并且管理员功能一旦出现卡死,也会影响到普通的业务用户致使很多用户无法操作。该系统为混合场景,管理员功能使用的CS客户端,业务用户使用的是WEB。
【环境信息】:
应用服务器:Windows Server 2008 R2 IIS7 .NET 4.0
数据库服务器:AIX Oracle 10.2.0.4
【处理过程】:
经过跟踪发现两个情况:
- 确实有部分待审单据响应很慢,可以反复重现,貌似跟单据有关(慢的单据反复打开都会很慢,其他单据则一直没有类似问题),感觉很奇怪;
- 管理员功能发生了客户端不响应的情况后,查看对应数据库端的性能视图,没有SQL阻塞信息,也没有明显的执行时间很长、效率很低的SQL;
在出现系统挂起时,抓取应用服务器的dump文件后,发现存在明显的线程阻塞:
- 105号线程阻塞了其他4个线程{(9-1)/2},而105号线程在等待数据库的返回,OracleLob,这是个特别的数据类型,需要引起警惕了;
- 根据线程堆栈:OperateDataBaseManager.GetPacekage(System.String),我们可以很快定位到程序代码及对应的SQL脚本;
- 根据windbg还可以看到,其他线程都在MetadataCache.GetObject()方法上处于被阻塞状态,查看对应的源代码,发现该方法使用了静态对象作为同步锁对象;
下图是线程阻塞对应windbg的log日志
Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64
Copyright (c) Microsoft Corporation. All rights reserved. Loading Dump File [D:\dumps\w3wp.DMP]
User Mini Dump File with Full Memory: Only application data is available Symbol search path is: SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows 7 Version 7601 (Service Pack 1) MP (8 procs) Free x64
Product: Server, suite: Enterprise TerminalServer SingleUserTS
Machine Name:
Debug session time: Tue Apr 14 14:22:10.000 2015 (UTC + 8:00)
System Uptime: 19 days 20:26:50.903
Process Uptime: 0 days 5:33:15.000
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
......................................
ntdll!NtWaitForSingleObject+0xa:
00000000`7711135a c3 ret
0:000> .load D:\dumps\sos.dll
0:000> !tp
CPU utilization: 0%
Worker Thread: Total: 29 Running: 6 Idle: 23 MaxLimit: 800 MinLimit: 8
Work Request in Queue: 0
--------------------------------------
Number of Timers: 43
--------------------------------------
Completion Port Thread:Total: 1 Free: 1 MaxFree: 16 CurrentLimit: 0 MaxLimit: 800 MinLimit: 8
0:000> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner
47 0000000005532488 9 1 000000001385b1e0 93a0 105 000000017ffefbe0 System.Object
-----------------------------
Total 6125
CCW 4
RCW 11
ComClassFactory 0
Free 4333
0:000> ~105s
ntdll!NtWaitForSingleObject+0xa:
00000000`7711135a c3 ret
0:105> !clrstack
OS Thread Id: 0x93a0 (105)
Child SP IP Call Site
000000000fbac038 000000007711135a [NDirectMethodFrameStandalone: 000000000fbac038] System.Data.Common.UnsafeNativeMethods.OCILobRead(System.Data.OracleClient.OciHandle, System.Data.OracleClient.OciHandle, System.Data.OracleClient.OciHandle, UInt32 ByRef, UInt32, IntPtr, UInt32, IntPtr, IntPtr, UInt16, CHARSETFORM)
000000000fbabfb0 000007fef3455640 DomainNeutralILStubClass.IL_STUB_PInvoke(System.Data.OracleClient.OciHandle, System.Data.OracleClient.OciHandle, System.Data.OracleClient.OciHandle, UInt32 ByRef, UInt32, IntPtr, UInt32, IntPtr, IntPtr, UInt16, CHARSETFORM)*** WARNING: Unable to verify checksum for System.Data.OracleClient.ni.dll 000000000fbac0f0 000007fef344ea9b System.Data.OracleClient.TracedNativeMethods.OCILobRead(System.Data.OracleClient.OciHandle, System.Data.OracleClient.OciHandle, System.Data.OracleClient.OciHandle, Int32 ByRef, UInt32, IntPtr, UInt32, UInt16, CHARSETFORM)
000000000fbac1d0 000007fef347c63d System.Data.OracleClient.OracleLob.Read(Byte[], Int32, Int32)
000000000fbac2a0 000007fef347d483 System.Data.OracleClient.OracleLob.get_Value()
000000000fbac320 000007fef34721b6 System.Data.OracleClient.OracleColumn.GetValue(System.Data.OracleClient.NativeBuffer_RowBuffer)
000000000fbac3d0 000007fef346efd2 System.Data.OracleClient.OracleDataReader.GetValues(System.Object[])
000000000fbac420 000007fedd16faec System.Data.ProviderBase.SchemaMapping.LoadDataRow()*** WARNING: Unable to verify checksum for System.Data.ni.dll 000000000fbac460 000007fedd1674fd System.Data.Common.DataAdapter.FillLoadDataRow(System.Data.ProviderBase.SchemaMapping)
000000000fbac4f0 000007fedd1673d4 System.Data.Common.DataAdapter.FillFromReader(System.Data.DataSet, System.Data.DataTable, System.String, System.Data.ProviderBase.DataReaderContainer, Int32, Int32, System.Data.DataColumn, System.Object)
000000000fbac5b0 000007fedd167271 System.Data.Common.DataAdapter.Fill(System.Data.DataSet, System.String, System.Data.IDataReader, Int32, Int32)
000000000fbac670 000007fedd167ef6 System.Data.Common.DbDataAdapter.FillInternal(System.Data.DataSet, System.Data.DataTable[], Int32, Int32, System.String, System.Data.IDbCommand, System.Data.CommandBehavior)
000000000fbac720 000007fedd167cc4 System.Data.Common.DbDataAdapter.Fill(System.Data.DataSet, Int32, Int32, System.String, System.Data.IDbCommand, System.Data.CommandBehavior)
000000000fbac7d0 000007fedd167ac2 System.Data.Common.DbDataAdapter.Fill(System.Data.DataSet)
000000000fbac880 000007ff002393ee xxxxxx.Platform.Core.DataAccess.Database.GetDataSet(System.String, System.String, System.Data.IDbDataParameter[], Boolean)*** ERROR: Module load completed but symbols could not be loaded for xxxxxx.Platform.Core.DataAccess.dll 000000000fbac950 000007ff002391cd xxxxxx.Platform.Core.DataAccess.Database.GetDataSet(System.String, System.Data.IDbDataParameter[], Boolean)
000000000fbac990 000007ff0023906a xxxxxx.Platform.Core.DataAccess.Database.ExecuteDataSet(System.String, System.Data.IDbDataParameter[])
000000000fbac9f0 000007ff00238810 xxxxxx.Platform.Core.DataAccess.Database.ExecuteDataSet(System.String, System.Object[])
000000000fbaca60 000007ff01566749 xxxxxx.Platform.Workflow.Server.Designer.OperateDataBaseManager.GetPacekage(System.String)*** WARNING: Unable to verify checksum for xxxxxx.Platform.Workflow.Server.dll
*** ERROR: Module load completed but symbols could not be loaded for xxxxxx.Platform.Workflow.Server.dll 000000000fbacab0 000007ff01566548 xxxxxx.Platform.Workflow.Server.Designer.OperateDataBaseManager.Load(System.String)
000000000fbacb00 000007ff0024bbfe xxxxxx.Platform.Resource.ResourceManager.MetadataService.GetCheckedinVersionContent(System.String, System.Data.DataRow, System.String ByRef)*** ERROR: Module load completed but symbols could not be loaded for xxxxxx.Platform.Resource.MetadataManager.dll 000000000fbacb60 000007ff0024b842 xxxxxx.Platform.Resource.ResourceManager.MetadataService.GetMetadataContentString(System.String, System.String ByRef, System.String)
000000000fbacbe0 000007ff0024b53b xxxxxx.Platform.Resource.ResourceManager.MetadataService.GetMetadataContentFromDB(System.String)
000000000fbacc40 000007ff00243d1a xxxxxx.Platform.Resource.ResourceManager.MetadataCache.GetObject(System.String, xxxxxx.Platform.AppFramework.Service.xxxState)
000000000fbaccb0 000007ff00fb036b xxxxxx.Platform.Resource.MetadataManager.MetadataRestFul.TryGetNewVersionContent(System.IO.BinaryReader, System.IO.BinaryWriter)
000000000fbad418 000007fef98d10b4 [DebuggerU2MCatchHandlerFrame: 000000000fbad418]
000000000fbad500 000007fef98d10b4 [CustomGCFrame: 000000000fbad500]
000000000fbad4c8 000007fef98d10b4 [GCFrame: 000000000fbad4c8]
000000000fbad458 000007fef98d10b4 [GCFrame: 000000000fbad458]
000000000fbad828 000007fef98d10b4 [HelperMethodFrame_PROTECTOBJ: 000000000fbad828] System.RuntimeMethodHandle._InvokeMethodFast(System.IRuntimeMethodInfo, System.Object, System.Object[], System.SignatureStruct ByRef, System.Reflection.MethodAttributes, System.RuntimeType)
000000000fbad970 000007fef895587f System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo, Boolean)*** WARNING: Unable to verify checksum for mscorlib.ni.dll 000000000fbadab0 000007fef89a0426 System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)
000000000fbadb00 000007ff00b4f3b3 xxxxxx.Platform.AppFramework.Service.xxxRestfulContext.InternalInvoke(System.String, System.Reflection.MethodInfo, System.Object, System.Object[], Int32[] ByRef, System.Object[] ByRef)*** ERROR: Module load completed but symbols could not be loaded for xxxxxx.Platform.AppFramework.RestfulService.dll 000000000fbadb60 000007ff00b4ebf0 xxxxxx.Platform.AppFramework.Service.xxxRestfulContext.Invoke(System.String, System.String, System.String, Boolean, System.Object[], Int32[] ByRef, System.Object[] ByRef)
000000000fbadc40 000007ff00c28ae8 xxxxxx.Platform.AppFramework.Service.xxxRestfulContext.Invoke(System.String, System.String, System.String, Boolean, System.Object[])
000000000fbadca0 000007ff00c2876b xxxxxx.Platform.AppFramework.RESTFulWebService.xxxHttpWebHandler.InvokeStream(System.IO.BinaryReader, System.Web.HttpContext, Boolean)*** ERROR: Module load completed but symbols could not be loaded for xxxxxx.Platform.AppFramework.RESTFulWebService.dll 000000000fbadd80 000007ff00b4de0c xxxxxx.Platform.AppFramework.RESTFulWebService.xxxHttpWebHandler.ProcessRequest(System.Web.HttpContext)
000000000fbade70 000007fef1eb5c25 System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()*** WARNING: Unable to verify checksum for System.Web.ni.dll 000000000fbadf30 000007fef1e8337a System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
000000000fbadfe0 000007fef1e97030 System.Web.HttpApplication+ApplicationStepManager.ResumeSteps(System.Exception)
000000000fbae0a0 000007fef1e82879 System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object)
000000000fbae100 000007fef1e8777c System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
000000000fbae180 000007fef1e86477 System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
000000000fbae2b0 000007fef25ed6dd DomainNeutralILStubClass.IL_STUB_COMtoCLR(Int64, Int32, Int32 ByRef)
000000000fbae5e8 000007fef993e93e [GCFrame: 000000000fbae5e8]
000000000fbae628 000007fef993e93e [ContextTransitionFrame: 000000000fbae628]
000000000fbae668 000007fef993e93e [GCFrame: 000000000fbae668]
000000000fbae850 000007fef993e93e [ComMethodFrame: 000000000fbae850]
0:105> ~* e!clrstack
OS Thread Id: 0x2aec (0)
Unable to walk the managed stack. The current thread is likely not a
managed thread. You can run !threads to get a list of managed threads in
the process
OS Thread Id: 0x15f0 (106)
Child SP IP Call Site
000000000ed5df70 00000000771118ca [GCFrame: 000000000ed5df70]
000000000ed5e1c8 00000000771118ca [GCFrame: 000000000ed5e1c8]
000000000ed5e178 00000000771118ca [HelperMethodFrame: 000000000ed5e178] System.Threading.Monitor.Enter(System.Object)
000000000ed5e2d0 000007ff00243ce8 xxxxxx.Platform.Resource.ResourceManager.MetadataCache.GetObject(System.String, xxxxxx.Platform.AppFramework.Service.GSPState)
000000000ed5e340 000007ff00d40c8d xxxxxx.Platform.Resource.ResourceManager.DefaultValueManager.IsFormMobileForm(System.String)
000000000ed5e380 000007ff00d40ac0 xxxxxx.Platform.Resource.ResourceManager.DefaultValueManager.GetDefaultValueXmlExactly(System.String, System.String, System.String)
000000000ed5e3d0 000007ff00d409df xxxxxx.Platform.Engine.DataAccess.Common.DAEUtils.GetDefaultValueXml(System.String, System.String, System.String)
000000000ed5e420 000007ff00d40583 xxxxxx.Platform.Engine.DataAccess.GSPDataAccessEngine.NewObjectWithDefault(xxxxxx.Platform.Resource.Metadata.DataModel.GspDataModel, System.String, System.String)
000000000ed5e500 000007ff0091495d xxxxxx.Platform.BizComponent.JSProxy.BizComponentRESTFulService.InvokeInternal(System.String, System.Object[], System.Collections.Generic.Dictionary`2<Int32,System.Object> ByRef, Boolean ByRef)
000000000ed5e590 000007ff0023fe16 xxxxxx.Platform.BizComponent.JSProxy.BizComponentRESTFulService.Get(System.Collections.Generic.Dictionary`2<System.String,System.String>)
000000000ed5e610 000007ff002276d8 xxxxxx.Platform.WebFramework.RestFulService.RESTFulHandlerForWeb.ProcessRequest(System.Web.HttpContext)
000000000ed5e840 000007fef1eb5c25 System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
000000000ed5e900 000007fef1e8337a System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
000000000ed5e9b0 000007fef1e97030 System.Web.HttpApplication+ApplicationStepManager.ResumeSteps(System.Exception)
000000000ed5ea70 000007fef1e82879 System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object)
000000000ed5ead0 000007fef1e8777c System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
000000000ed5eb50 000007fef1e86477 System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
000000000ed5ec80 000007fef25ed6dd DomainNeutralILStubClass.IL_STUB_COMtoCLR(Int64, Int32, Int32 ByRef)
000000000ed5efb8 000007fef993e93e [GCFrame: 000000000ed5efb8]
000000000ed5eff8 000007fef993e93e [ContextTransitionFrame: 000000000ed5eff8]
000000000ed5f038 000007fef993e93e [GCFrame: 000000000ed5f038]
000000000ed5f220 000007fef993e93e [ComMethodFrame: 000000000ed5f220]
下图为阻塞源头105号线程(被数据库返回阻塞),执行SQL对应的程序代码:
下图为线程阻塞的同步程序代码:
至此问题应该比较清楚了,执行SQL时间长的线程阻塞了其他线程,而线程同步锁的粒度太大(仅为了做数据缓存时,防止插入重复值,结果锁定了所有的元数据对象)。但是,该SQL脚本真的有问题吗?
【网络分析】:
将该SQL放到plsql中执行,客户端响应时间很长,经过对不同单据进行测试后发现,当LOB类型的数据超过16K以后问题就会出现,但该SQL在测试环境并没有问题。前面已经跟踪过数据库服务器,没有发现什么问题,因此目标开始转移到应用服务器到数据库服务器的网络环境上。
直接使用ADO.NET和JDBC分别直接访问数据库,均发现响应延迟问题。与客户方IT人员沟通了解到,从应用服务器到数据库经过了一个思科的防火墙。我们就分别在应用服务器、应用服务器端的交换机、数据库服务器端的交换机、数据库服务器,4个点进行网络抓包。经过对比发现,经过防火墙前后两个交换机的数据包明显有问题:有非常明显的包乱序、包数量不一致、window size递减的情况;并且所有经过防火墙的网络包,window size的值均从原来的64k篡改为16k;
如下图:
把这个情况反馈给思科的代理后,他们调整防火墙设置:好像是改为按流量控制,不是特别懂。
至此,问题彻底解决!