平台应用开发人员向我们平台报了一个issue,说在测试他们应用的时候遇到一个奇怪的问题,系统报了undefinedexception(开发人员自定义的exception,捕捉了jpa exception之后发现无法处理的情况下就抛出未定义异常),他们贴出了weblogic里面报出的异常栈信息:
<Sep 17, 2014 7:56:45 AM SAST> <Error> <org.hibernate.transaction.JDBCTransaction> <BEA-000000> <Could not toggle autocommit com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 33,177,390 milliseconds ago. The last packet sent successfully to the server was 4 milliseconds ago. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at com.mysql.jdbc.Util.handleNewInstance(Util.java:395) Truncated. see log file for complete stacktrace Caused By: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost. at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3052) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3503) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3492) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4043) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2664) Truncated. see log file for complete stacktrace > Sep 17, 2014 7:56:45 AM org.hibernate.type.NullableType nullSafeSet INFO: could not bind value '...' to parameter: 1; No operations allowed after statement closed. Sep 17, 2014 7:56:45 AM org.hibernate.jdbc.AbstractBatcher closeQueryStatement WARNING: exception clearing maxRows/queryTimeout com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after statement closed. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
同时他们也贴出了mysql的error.log
140917 8:35:37 [Warning] Aborted connection 35011 to db: '' user: '' host: 'z-2-pl-6' (Got an error reading communication packets) 140917 8:35:38 [Warning] Aborted connection 35367 to db: '' user: '' host: 'z-2-pl-4' (Got an error reading communication packets) 140917 8:35:39 [Warning] Aborted connection 35080 to db: '' user: '' host: 'z-2-pl-5' (Got an error reading communication packets)里面出现connection abort,在mysql的官方文档里面有说明可能的原因,最常见的package size,connection_timeout,网络故障。由于我们系统是内网,而且connection的eth是bond模式,数据统计也没有问题,那么最大怀疑点就是connection_timeout,connection_timeout在connection pool里面是一种很常见的行为,所以它有对应的解决方案,不管是哪种JDBC的connection pool,都提供了配置来解决连接池里面的缓存连接被数据库断掉的情况。如果这个问题有三种解决方案:
1. 增加wait_timeout的时间,默认8个小时。
2. 减少Connection pools中connection的lifetime (小于wait_timeout)。
3. 测试Connection pools中connection的有效性。
第一种方案可能会导致真正挂起的连接资源无法得到回收,第二种方案会增加缓冲池中连接的创建次数,所以我们平台默认配置的JDBC连接池选择了第三种方案,从连接池获取连接的时候测试连接的有效性(增加了overhead,据说有3%)。
我详细询问了开发人员异常出现的场景,开发人员说是在系统集成测试的时候,第一天所以的测试都通过了,系统放了一晚,到第二天再来测试的时候,刚开始的几个用例失败了,多测几次之后所有的测试都通过了。很明显就是连接inactive时间太长,被数据库断掉,测试用例存储数据的时候从连接池里面拿出来的连接是无效的,导致测试失败。而且系统连接池配置没有被修改过。
可能大家会很奇怪,已经有了测试连接池中连接有效性的配置,为什么还会出现这种情况呢?这是怎么回事呢?这里贴出我们系统中JDBC连接有效性的测试配置:
这里我们关注的重点是以下3个配置:
我们已经在返回connection的时候做了测试,并且还有2分钟一次的周期性的unused connection测试,不用测试的空闲connection也只是10秒的间隔,不应该出现这种错误才对,这是怎么回事呢?
首先对返回connection做测试有局限性,为了防止系统问题(数据库连接失败,数据库当机),测试连接数有限制
<p>When connectivity to the DBMS is lost, even if only momentarily, some or all of the JDBC connections in a data source typically become defunct. If the data source is configured to test connections on reserve, when an application requests a database connection, WebLogic Server tests the connection, discovers that the connection is dead, and tries to replace it with a new connection to satisfy the request. Ordinarily, when the DBMS comes back online, the refresh process succeeds. However, in some cases and for some modes of failure, testing a dead connection can impose a long delay.</p><p>To minimize this delay, WebLogic data sources include logic that considers <span class="italic">all</span> connections in the data source as dead after a number of consecutive test failures, and closes all connections in the data source. After all connections are closed, when an application requests a connection, the data source creates a connection without first having to test a dead connection. This behavior minimizes the delay for connection requests following the data source's connection pool flush.</p><p>WebLogic Server determines the number of test failures before closing all connections based on the Test Frequency setting for the data source:</p><ul><li><p>If Test Frequency is greater than 0, the number of test failures before closing all connections is set to 2.</p></li><li><p>If Test Frequency is set to 0 (periodic testing is disabled), the number of test failures before closing all connections is set to 25% of the Maximum Capacity for the data source. </p></li></ul>
看起来有点像是连续获取连接都失败了,导致connection失效,返回错误。
可是系统还有周期性的检测连接有效性,理论上是不会导致connection连续多次无效,这里我无法理解,后来上网查询之后,看到一个可能的解释,如果系统中大量连接长时间都无访问,周期性测试可能会有很大一部分连接永远无法测试到(因为系统在测试是也需要保证一定的connection可使用,有部分无法测试到)。
我们的系统在实际环境中出现大量连接同时间断掉导致测试连接有效性失效的可能性为零,只有在测试环境中系统长时间没有使用的时候才可能会遇到这个问题,所以这个问题我们没有提供Solution。
Note: 网上有说还有一种解决方案是设置autoReconnect=true,但是在mysql升级到5X之后没用了,我没试过
记weblogic JDBC 'No operations allowed after statement closed' 缘由