A Serious Problem Caused by Enabling Asynchronous I/O on VxFS

2021-07-04 04:06:29

While running a data migration today I hit a serious problem: the data load hung completely, and in the end we had no choice but to roll back.

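The system uses the VxFS file system. About a month before the production upgrade we ran a small-scale data migration. After reviewing the AWR and ASH reports, and following the ADDM recommendation, we concluded that the slow load speed came down to asynchronous I/O: the production database indeed did not have it enabled, and filesystemio_options was set to none. Once this was confirmed, the customer changed the setting during a routine maintenance window two weeks ago. After the change, iowait rose noticeably. We did not analyze it any further at the time because there was no obvious performance problem. We assumed the higher iowait simply reflected I/O being processed more efficiently with asynchronous I/O enabled, and even took it as an improvement.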

Two days before the upgrade, as part of the pre-migration checks, I did a simple I/O analysis, which included a small test with dd.

The I/O throughput turned out to be very poor, far below what the test environment delivers.

time dd if=/dev/zero bs=1M count=204 of=direct_200M

213909504 bytes (214 MB) copied, 5.27558 seconds, 40.5 MB/s

real 0m5.284s
user 0m0.003s
sys 0m0.031s
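The rate dd reports is just bytes copied divided by elapsed seconds, in decimal megabytes. As a rough sketch, the helper below (dd_rate is a made-up name, not part of dd) recomputes that figure from a saved summary line, which is handy when the pasted output is garbled. It assumes the summary has the shape shown above. The post does not say whether the write bypassed the page cache; GNU dd offers oflag=direct and conv=fdatasync for that, and neither is assumed here.

```shell
#!/bin/sh
# Recompute throughput in decimal MB/s from dd summary lines read on stdin.
# Assumes lines shaped like "<bytes> bytes (...) copied, <seconds> seconds, <rate> MB/s".
# dd_rate is a hypothetical helper name used only for this illustration.
dd_rate() {
    awk '/ copied, / {
        bytes = $1
        secs = $(NF - 3)    # the number just before "seconds," (or "s," in newer GNU dd)
        if (secs > 0) printf "%.1f\n", bytes / secs / 1000000
    }'
}

# The slow production run quoted above:
echo "213909504 bytes (214 MB) copied, 5.27558 seconds, 40.5 MB/s" | dd_rate
```

For a live run, dd writes its summary to stderr, so something like `dd if=/dev/zero bs=1M count=204 of=direct_200M 2>&1 | dd_rate` would be needed.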

Below is the sar record taken at the time.
07:00:01 AM CPU %user %nice %system %iowait %steal %idle
09:20:01 AM all 10.48 0.11 1.76 2.89 0.00 84.76
09:30:01 AM all 10.59 0.10 1.81 2.45 0.00 85.04
09:40:01 AM all 7.91 0.18 1.61 3.20 0.00 87.10
09:50:01 AM all 7.26 0.07 1.66 3.23 0.00 87.78
10:00:01 AM all 7.54 0.13 1.53 3.67 0.00 87.13
10:10:01 AM all 7.78 0.09 1.76 3.92 0.00 86.45
10:20:01 AM all 8.24 0.09 2.27 3.98 0.00 85.43
10:30:01 AM all 7.38 0.08 1.79 5.18 0.00 85.57
10:40:01 AM all 8.14 0.16 2.01 6.36 0.00 83.33
10:50:02 AM all 7.05 0.10 1.74 4.83 0.00 86.29
11:00:01 AM all 7.61 0.09 2.04 5.43 0.00 84.83
11:10:01 AM all 7.22 0.09 1.70 6.22 0.00 84.76
11:20:01 AM all 6.71 0.12 2.10 7.35 0.00 83.72
11:30:01 AM all 9.36 0.10 2.87 5.03 0.00 82.63
11:40:01 AM all 7.26 0.25 1.76 6.08 0.00 84.65
11:50:01 AM all 7.17 0.12 2.40 5.24 0.00 85.07
12:00:01 PM all 6.30 0.10 2.64 5.27 0.00 85.69
Average: all 10.36 0.26 1.14 3.40 0.00 84.83
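To see the iowait trend without reading every row, the samples can be summarised. This is a minimal sketch, assuming the 12-hour sar layout shown above (time, AM/PM, CPU, %user, %nice, %system, %iowait, ...). The function name iowait_summary is made up for the example, and a 24-hour locale would shift the columns by one.

```shell
#!/bin/sh
# Summarise %iowait from sar CPU rows read on stdin.
# Only per-interval rows (third field "all") are counted; the header and
# the "Average:" line have something else in that position and are skipped.
iowait_summary() {
    awk '$3 == "all" {
        n++
        sum += $7
        if (n == 1 || $7 > max) { max = $7; at = $1 " " $2 }
    }
    END {
        if (n) printf "samples=%d mean=%.2f max=%.2f at %s\n", n, sum / n, max, at
    }'
}

# Three of the rows from the record above:
iowait_summary <<'EOF'
07:00:01 AM CPU %user %nice %system %iowait %steal %idle
09:20:01 AM all 10.48 0.11 1.76 2.89 0.00 84.76
10:40:01 AM all 8.14 0.16 2.01 6.36 0.00 83.33
11:20:01 AM all 6.71 0.12 2.10 7.35 0.00 83.72
Average: all 10.36 0.26 1.14 3.40 0.00 84.83
EOF
```

On a live system the same function could be fed from `sar -u`, provided the column layout matches.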
The figures from a month earlier:
Production statistics 20-June-14:
204+0 records in
204+0 records out
213909504 bytes (214 MB) copied, 1.44182 seconds, 148 MB/s
real 0m1.445s
user 0m0.001s
sys 0m0.039s

Test environment:
TEST machine statistics:
204+0 records in
204+0 records out
213909504 bytes (214 MB) copied, 0.550607 seconds, 388 MB/s
real 0m0.595s
user 0m0.001s
sys 0m0.072s

Another data migration server:
TEST2 machine statistics:
213909504 bytes (214 MB) copied, 0.320128 seconds, 668 MB/s

real 0m0.43s
user 0m0.01s
sys 0m0.42s
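Because every run above copies the same 213909504 bytes, the elapsed times can be compared directly. A small sketch of that arithmetic, with slowdown as a made-up helper name:

```shell
#!/bin/sh
# Print how many times longer the current run took than a baseline run.
# $1: baseline elapsed seconds, $2: current elapsed seconds.
# slowdown is a hypothetical helper name used only for this illustration.
slowdown() {
    awk -v base="$1" -v cur="$2" 'BEGIN { if (base > 0) printf "%.2f\n", cur / base }'
}

# Production a month earlier (1.44182 s) versus the current production run (5.27558 s):
slowdown 1.44182 5.27558
```

With the figures quoted in this post, the current production run takes roughly 3.7 times as long as the same test a month earlier.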

The problem was still unresolved by the time of the upgrade, and the data migration eventually had to be rolled back.

During the data merge we forced parallel execution, but top showed pitifully low CPU usage.

A quick dd test showed throughput dropping as low as about 15 MB/s.

Below is the top output from that time.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31130 root 10 -5 0 0 0 S 20.5 0.0 1352:56 [vxfs_thread]
17285 oraccbs1 16 0 18.3g 114m 35m S 12.9 0.0 4:56.41 ora_p026_PRODB
18568 oraccbs1 16 0 18.3g 50m 22m D 7.3 0.0 2:40.81 ora_p056_PRODB
18580 oraccbs1 16 0 18.2g 42m 21m D 4.6 0.0 2:24.26 ora_p062_PRODB
7846 oraccbs1 16 0 18.5g 315m 47m S 4.0 0.1 0:12.23 oraclePRODB (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
18576 oraccbs1 16 0 18.2g 42m 21m D 3.6 0.0 2:25.63 ora_p060_PRODB
11334 tivoliam 15 0 820m 89m 14m S 3.3 0.0 341:18.97 /opt/app/IBM/ITM/lx8266/lz/bin/klzagent
18570 oraccbs1 16 0 18.2g 42m 21m D 3.3 0.0 2:25.69 ora_p057_PRODB
18578 oraccbs1 16 0 18.2g 42m 21m D 3.0 0.0 2:23.12 ora_p061_PRODB
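In the top listing above, several of the parallel slaves show D in the S column, which is uninterruptible sleep and usually means the process is waiting on I/O. A small sketch for pulling those rows out of top output, assuming the same column order as above (state in the 8th field, command in the 12th); d_state_procs is a made-up name:

```shell
#!/bin/sh
# List PID and command of processes in state D from top-style rows on stdin.
# Assumes the column order PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND.
# d_state_procs is a hypothetical helper name used only for this illustration.
d_state_procs() {
    awk '$1 ~ /^[0-9]+$/ && $8 == "D" { print $1, $12 }'
}

# Three rows taken from the listing above:
d_state_procs <<'EOF'
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17285 oraccbs1 16 0 18.3g 114m 35m S 12.9 0.0 4:56.41 ora_p026_PRODB
18568 oraccbs1 16 0 18.3g 50m 22m D 7.3 0.0 2:40.81 ora_p056_PRODB
18580 oraccbs1 16 0 18.2g 42m 21m D 4.6 0.0 2:24.26 ora_p062_PRODB
EOF
```

On a live system the input could come from `top -b -n 1`, as long as the columns match this layout.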

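A little later it was clear that the parallel processes were struggling to start. It took a while before even a few of the related processes showed up.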

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31130 root 10 -5 0 0 0 S 5.6 0.0 1371:05 [vxfs_thread]
11334 tivoliam 15 0 820m 89m 14m S 3.3 0.0 342:47.68 /opt/app/IBM/ITM/lx826
1388 root 18 0 59116 784 568 S 1.6 0.0 0:13.80 sadc 5 1001 -z
4661 oraccbs1 15 0 18.2g 24m 4272 S 1.3 0.0 23:40.35 ora_dbw2_PRODB
27545 oraccbs1 16 0 13428 1844 816 R 1.0 0.0 0:02.35 top -c
2833 root 16 0 127m 71m 3520 S 0.7 0.0 81:28.54 vxconfigd -x syslog
4653 oraccbs1 18 0 18.3g 130m 14m S 0.7 0.0 221:18.30 ora_dia0_PRODB
4663 oraccbs1 15 0 18.2g 24m 3464 S 0.7 0.0 23:48.27 ora_dbw3_PRODB
2598 root 15 0 0 0 0 S 0.3 0.0 5:03.01 [dmp_daemon]
4878 oraccbs1 15 0 18.2g 7140 4396 S 0.3 0.0 67:17.54 ora_mmnl_PRODB
5016 root 10 -5 0 0 0 S 0.3 0.0 0:14.14 [kjournald]
7334 root 18 0 280m 21m 7824 S 0.3 0.0 26:27.83 /opt/VRTSob/bin/vxsvc
19215 root 15 0 37872 9464 2264 S 0.3 0.0 69:47.85 /opt/VRTSvcs/bin/Mount

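In the end, a colleague on the company's Unix team judged this to be a VxFS bug that needs a patch. The work still has to get done, so we will see how things go tonight.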

The first two are major

VXFS version

We had IO performance issues with the very same version of VXFS installed in TRUE 6.0.100.000

Eventually we found we were hitting the following bug which is fixed with version 6.0.3 https://sort.symantec.com/patch/detail/8260

This happened at that site even though it was a fresh install and NOT an upgrade as indicated in the description below.

We did see the very same issues of performance degrading when removing the direct mount option

Hence we recommend installing this patch

SYMPTOM:

Performance degradation is seen after upgrade from SF 5.1SP1RP3 to SF 6.0.1 on Linux

DESCRIPTION:

The degradation in performance is seen because the I/Os are not unplugged before getting delivered to lower layers in the I/O path. These I/Os are unplugged by the OS at a default time, which is 3 milliseconds, which resulted in an additional overhead in completion of I/Os.

RESOLUTION:

Code changes made to explicitly unplug the I/Os before sending them to the lower layer.

* 3254132 (Tracking ID: 3186971)

Power management

Servers should have power management savings disabled, i.e. set to high performance

Make sure C-states are disabled (set to C0)

This is executed at the BIOS level and requires a reboot.
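The setting itself lives in the BIOS, but on Linux the idle states and frequency governor that the kernel currently exposes can be inspected through sysfs before and after the change. The sketch below only reads those files and changes nothing. The function name show_cpu_power is made up, and which files are present depends on the kernel and drivers in use.

```shell
#!/bin/sh
# Print the frequency governor and idle states the kernel exposes for one CPU.
# $1: per-CPU sysfs directory (defaults to cpu0). Missing files are skipped silently.
# show_cpu_power is a hypothetical helper name used only for this illustration.
show_cpu_power() {
    cpu_dir=${1:-/sys/devices/system/cpu/cpu0}
    if [ -r "$cpu_dir/cpufreq/scaling_governor" ]; then
        echo "governor: $(cat "$cpu_dir/cpufreq/scaling_governor")"
    fi
    for state in "$cpu_dir"/cpuidle/state*; do
        [ -r "$state/name" ] || continue
        echo "$(basename "$state"): $(cat "$state/name")"
    done
}

show_cpu_power
```

If deeper states such as C3 or C6 are still listed after the BIOS change, that would be a hint the setting did not take effect as intended.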
