同事反映线上服务特别慢,访问接口一直无响应,直到504网关超时。上线赶紧看了看CPU以及内存使用率,发现CPU、内存正常。
查看服务日志有大量断开的管道:
org.springframework.web.servlet.mvc.method.annotation.AbstractMessageConverterMethodProcessor.writeWithMessageConverters(AbstractMessageConverterMethodProcessor.java:272)
at org.springframework.web.servlet.mvc.method.annotation.RequestResponseBodyMethodProcessor.handleReturnValue(RequestResponseBodyMethodProcessor.java:180)
at org.springframework.web.method.support.HandlerMethodReturnValueHandlerComposite.handleReturnValue(HandlerMethodReturnValueHandlerComposite.java:82)
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:119)
at org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver.doResolveHandlerMethodException(ExceptionHandlerExceptionResolver.java:400)
at org.springframework.web.servlet.handler.AbstractHandlerMethodExceptionResolver.doResolveException(AbstractHandlerMethodExceptionResolver.java:61)
at org.springframework.web.servlet.handler.AbstractHandlerExceptionResolver.resolveException(AbstractHandlerExceptionResolver.java:139)
at org.springframework.web.servlet.DispatcherServlet.processHandlerException(DispatcherServlet.java:1255)
at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1062)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1008)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:925)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:974)
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:866)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:635)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:851)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: 断开的管道
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:318)
... 48 common frames omitted
当前tomcat使用nio模型处理网络请求,断开管道(Broken pipe)原因是当客户端发起请求,由于服务端回应慢或者其他原因,服务端返回响应的时候,客户端已经断开了连接。但是服务端仍然向断开的连接返回数据,会产生这个异常。
查看nginx日志,大量的连接超时:
2019/12/27 17:10:56 [error] 8937#0: *3282524 connect() failed (110: Connection timed out) while connecting to upstream, client: ****.****.****.****, server: ****.com, request: "POST /***/*** HTTP/1.1", upstream: "http://****:****/****/****", host: "****:****"
这种情况不能单纯增加各阶段的超时时间或增大句柄数解决,因为没有定位真正原因,但是修改句柄数量治标不治本。
找原因,打印线程状态,如下:
"http-nio-8090-exec-4" #39 daemon prio=5 os_prio=0 tid=0x00007f807443a800 nid=0x1fc3 runnable [0x00007f7f987d4000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
- locked <0x00000007b95b3dc8> (a java.lang.Object)
at oracle.net.nt.TimeoutSocketChannel.read(TimeoutSocketChannel.java:144)
at oracle.net.ns.NIOHeader.readHeaderBuffer(NIOHeader.java:82)
at oracle.net.ns.NIOPacket.readFromSocketChannel(NIOPacket.java:139)
at oracle.net.ns.NIOPacket.readFromSocketChannel(NIOPacket.java:101)
at oracle.net.ns.NIONSDataChannel.readDataFromSocketChannel(NIONSDataChannel.java:80)
at oracle.jdbc.driver.T4CMAREngineNIO.prepareForReading(T4CMAREngineNIO.java:98)
at oracle.jdbc.driver.T4CMAREngineNIO.unmarshalUB1(T4CMAREngineNIO.java:534)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:485)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:252)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:612)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:226)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:59)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:747)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:904)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1082)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3780)
at oracle.jdbc.driver.T4CPreparedStatement.executeInternal(T4CPreparedStatement.java:1343)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3822)
- locked <0x00000007b81b8518> (a oracle.jdbc.driver.T4CConnection)
发现有大量Oracle连接相关,且状态为RUNABLE,线程dump有locked字样,但是这种情况查看oracle数据库是否有死锁,不会发现死锁。
在当前例子中,java只能检测到java程序中的锁,例如synchronized以及线程wait等,它对数据库中的锁一无所知。同样,数据库能检测到数据库实例中产生的锁,它对java线程的锁一无所知。java锁和数据库中的锁是两个完全不同的锁。
在本例线程dump中,表示在地址为0x00000007b81b8518的T4CConnection类型的对象上持有一个锁,只意味着这个线程正在某个连接中执行代码。不表示当前线程被其他线程阻塞,否则当前线程状态应该是WAITING或BLOCKED,不是RUNABLE。它正在运行并从网络套接字读取来自数据库的响应内容。
分析:数据库中SQL执行时间过长,且被大量调用,最终定位到了SQL,因此线上表数据量巨大,执行一次耗时难以忍受的长,而且这类接口在大屏上每隔几秒就刷新一次,导致前几次的接口还没响应,后几次的请求又过来了,项目线程被耗尽,没有多余资源去响应登陆等请求,最终造成504错误。
其余请求打过来之后,因为服务端迟迟不能响应,最终客户端关闭了连接,但是服务器端没有关闭连接,服务器端造成大量CLOSE_WAIT的状态线程,出现断开的连接错误。如果句柄设置过少,还会导致句柄数限制错误。