Ever since the database servers were upgraded from Red Hat 4.6 to Red Hat 5.5, TSM backups have occasionally failed with SQL2043N.
Looking up the error:
[db2inst1@limt ~]$ db2 ? SQL2043N
SQL2043N Unable to start a child process or thread.
Explanation:
Unable to start up the child processes or threads required during the
processing of a database utility. There may not be enough available
memory to create the new process or thread. The utility stops
processing.
User response:
Ensure the system limit for number of processes or threads has not been
reached (either increase the limit or reduce the number of processes or
threads already running). Ensure that there is sufficient memory for the
new process or thread. Resubmit the utility command
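The description suggests that the database failed while requesting memory, yet memory should have been plentiful: the server had 16 GB under Red Hat 4.6 and was raised to 64 GB
with the upgrade to Red Hat 5.5, so a shortage seemed unlikely. Since the backup runs every day and an occasional failure was tolerable, I did not pay much attention. Later, while
on duty, I noticed that other systems in our group also hit SQL2043N from time to time, so this did not look like an isolated incident. That evening at home I searched Baidu for
SQL2043N and got an unexpected result. The official site offers the following explanation: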
Problem(Abstract)
ASLR or Address Space Layout Randomization is a feature that is activated by default on some of the newer linux distributions. It is designed to load shared memory objects in random addresses. In DB2, multiple processes map a shared memory object at the same address across the processes. It was found that DB2 cannot guarantee the availability of address for the shared memory object when ASLR is turned on. Important note: DB2 10.1 has been enhanced so that ASLR can be safely enabled.
Symptom
This conflict in the address space means that a process trying to attach a shared memory object to a specific address may not be able to do so, resulting in a failure in shmat subroutine. However, on subsequent retry (using a new process) the shared memory attachment may work. The result is a random set of failures. Some processes that have been known to see this error are: db2pd, db2egcf, and db2vend.
Some of the behaviors seen include the following:
For the db2pd command, it will report no data found even though the instance / database is active:
Database SAMPLE not activated on database partition 0.
For the db2egcf process, used for HA monitoring, the db2egcf may incorrectly determine the instance is down and initiate a failover.
For the db2vend process, backup and log archive methods might fail with an error indicating a child process could not be started:
SQL2043N Unable to start a child process or thread.
Diagnosing the problem
When this problem is suspected, check db2diag.log for the shmat failure like the following. Note that the same error message can also occur for a different cause. Hence, it's important to note the process that reported this error.
FUNCTION: DB2 UDB, SQO Memory Management, sqlocshr, probe:180
MESSAGE : ZRC=0x850F0005=-2062614523=SQLO_NOSEG
"No Storage Available for allocation"
DIA8305C Memory allocation failure occurred.
CALLED : OS, -, shmat OSERR: EINVAL (22)
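The check described above can be roughly automated. The following is a minimal sketch, not an official DB2 tool: it flags log records that contain all of the marker strings from the sample entry. The function name find_shmat_failures is hypothetical, and the assumption that records are separated by blank lines should be verified against a real db2diag.log.

```python
import re

# Marker strings taken from the sample db2diag.log entry quoted above.
MARKERS = ("sqlocshr", "SQLO_NOSEG", "shmat", "EINVAL")


def find_shmat_failures(log_text):
    """Return the records that contain every marker string."""
    # Assumption: records are separated by one or more blank lines.
    records = re.split(r"\n\s*\n", log_text)
    return [r.strip() for r in records if all(m in r for m in MARKERS)]


# First record mirrors the entry quoted above; the second is a made-up
# placeholder that should not match.
SAMPLE = """FUNCTION: DB2 UDB, SQO Memory Management, sqlocshr, probe:180
MESSAGE : ZRC=0x850F0005=-2062614523=SQLO_NOSEG
"No Storage Available for allocation"
DIA8305C Memory allocation failure occurred.
CALLED : OS, -, shmat OSERR: EINVAL (22)

FUNCTION: placeholder entry, probe:1
MESSAGE : unrelated message
"""

if __name__ == "__main__":
    for record in find_shmat_failures(SAMPLE):
        print(record)
        print("-" * 40)
```

Because the same message can have other causes, each matching record still needs to be read by hand to see which process reported it.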
Resolving the problem
1) Disable ASLR temporarily (change is only effective until next boot):
Run "sysctl -w kernel.randomize_va_space=0" as root.
2) Disable ASLR immediately and on all subsequent reboots:
Add the following line to /etc/sysctl.conf:
kernel.randomize_va_space=0
and then run "sysctl -p" as root to make the change take effect immediately.
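To confirm that the persistent setting is really in place, the configuration file can be inspected. The sketch below is only an illustration: configured_value is a hypothetical helper, and it assumes that when a key appears more than once the last assignment is the one that takes effect.

```python
def configured_value(conf_text, key="kernel.randomize_va_space"):
    """Return the last value assigned to key in sysctl.conf-style text, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line.startswith(("#", ";")):
            continue
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip()  # a later assignment overrides an earlier one
    return value


if __name__ == "__main__":
    try:
        with open("/etc/sysctl.conf") as f:
            print("kernel.randomize_va_space =", configured_value(f.read()))
    except OSError as exc:
        print("could not read /etc/sysctl.conf:", exc)
```

A result of None means the file does not set the key at all, so the kernel default would apply after the next reboot.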
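In short, Linux's address randomization can prevent a DB2 process from attaching to a shared memory object at the address it needs. So why does Linux enable this feature at all?
Searching Baidu for the keyword randomize_va_space: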
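The Linux kernel introduced address space layout randomization for security reasons. If the addresses of the stack were always fixed, malicious code could easily
reach its contents through memory overflow exploits. Address space layout randomization places the parts of a process's virtual address space (mainly the start address
of each region) at random locations, which lowers the chance of a successful attack.
The value in /proc/sys/kernel/randomize_va_space controls this: 0 turns off all randomization; 1 turns on randomization of the mmap base, the stack, and the VDSO page;
2 does everything 1 does and also randomizes the heap address. Until heap randomization is turned on, the heap starts immediately after the application's bss segment.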
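As a small illustration of the three values described above, the following sketch reads the current setting and prints what it means. The names ASLR_MODES, describe_aslr, and read_aslr are hypothetical, and the descriptions simply restate the paragraph above.

```python
# Meaning of each value, restated from the description above.
ASLR_MODES = {
    0: "randomization disabled",
    1: "mmap base, stack and VDSO page randomized",
    2: "mode 1 plus heap address randomization",
}


def describe_aslr(raw):
    """Translate the raw file content (e.g. '2\\n') into a description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, "unknown value %d" % value)


def read_aslr(path="/proc/sys/kernel/randomize_va_space"):
    """Read the current setting from procfs (Linux only)."""
    with open(path) as f:
        return describe_aslr(f.read())


if __name__ == "__main__":
    try:
        print(read_aslr())
    except OSError as exc:
        print("could not read the setting:", exc)
```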
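With this in mind, I remembered that db2pd also fails with SQL2043N now and then in everyday use, and then works when run a second time. db2pd obtains its monitoring data by
attaching to DB2's shared memory, which is why it is a lightweight tool with little impact on database performance, and also why it is exposed to the same attach failure.
After setting kernel.randomize_va_space=0 on the servers, the error has not appeared again.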