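Symptom
A file descriptor for an open file was mapped into a memory region with mmap, and the program read and wrote that region. After running for a long time, the program crashed with a memory access error, SIGBUS (Bus error). Inspecting the resulting core dump in GDB showed errors about memory that could not be accessed.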
Problem analysis
According to man mmap, the call fails in the following cases:
ERRORS
EBADF fd is not a valid file descriptor (and MAP_ANONYMOUS was not set).
EACCES A file descriptor refers to a non-regular file. Or MAP_PRIVATE was requested, but fd is not open for reading. Or MAP_SHARED was requested and PROT_WRITE is set, but fd is not open in read/write (O_RDWR) mode. Or PROT_WRITE is set, but the file is append-only.
EINVAL We don't like start or length or offset. (E.g., they are too large, or not aligned on a PAGESIZE boundary.)
ETXTBSY MAP_DENYWRITE was set but the object specified by fd is open for writing.
EAGAIN The file has been locked, or too much memory has been locked.
ENOMEM No memory is available, or the process's maximum number of mappings would have been exceeded.
ENODEV The underlying filesystem of the specified file does not support memory mapping.
Use of a mapped region can result in these signals:
SIGSEGV Attempted write into a region specified to mmap as read-only.
SIGBUS Attempted access to a portion of the buffer that does not correspond to the file (for example, beyond the end of the file, including the case where another process has truncated the file).
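From the description above, SIGBUS is raised either when the accessed part of the buffer lies outside the file, or when the mapped file has been truncated. The error I ran into, however, was not returned by the mmap call itself; it occurred later, while the buffer was being accessed. So how should it be analyzed?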
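Fix and verification
I first sorted out the context of the system involved and worked out which files were mapped with mmap and how their memory regions were used. Then, starting from the core dump of the crash, I used GDB to look at the stacks of the threads that access that file (thread apply all bt prints the backtrace of every thread). It turned out that while one thread was accessing the mmap'd buffer, another thread was reopening the very same file. Checking against the error log confirmed it: a thread had indeed reopened a file that was already mapped.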
I immediately added defensive code and reran the tests, and the problem disappeared completely.
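What the guard looks like depends on the codebase. As a minimal sketch of one possible shape, assume every mapping is tracked in a small structure protected by a mutex, and that the reopen path checks that structure before touching the file. The names mapped_file, mf_map, mf_reopen_truncate and mf_unmap are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one mapped file. */
struct mapped_file {
    pthread_mutex_t lock;
    int fd;
    void *addr;  /* NULL while the file is not mapped */
    size_t len;
};

/* Create or open the file, size it to len bytes and map it.
 * Returns 0 on success, -1 on failure or if it is already mapped. */
int mf_map(struct mapped_file *mf, const char *path, size_t len)
{
    assert(mf != NULL && path != NULL && len > 0);
    int rc = -1;
    pthread_mutex_lock(&mf->lock);
    if (mf->addr != NULL) {
        errno = EBUSY;
    } else {
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd >= 0 && ftruncate(fd, (off_t)len) == 0) {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            if (p != MAP_FAILED) {
                mf->fd = fd;
                mf->addr = p;
                mf->len = len;
                rc = 0;
            }
        }
        if (rc != 0 && fd >= 0)
            close(fd);
    }
    pthread_mutex_unlock(&mf->lock);
    return rc;
}

/* The defensive part: refuse to reopen and truncate the file while a
 * mapping of it is still live. Returns 0 on success, -1 if refused. */
int mf_reopen_truncate(struct mapped_file *mf, const char *path)
{
    int rc = -1;
    pthread_mutex_lock(&mf->lock);
    if (mf->addr != NULL) {
        errno = EBUSY;
    } else {
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd >= 0) {
            close(fd);
            rc = 0;
        }
    }
    pthread_mutex_unlock(&mf->lock);
    return rc;
}

/* Drop the mapping so that a later reopen is allowed again. */
void mf_unmap(struct mapped_file *mf)
{
    pthread_mutex_lock(&mf->lock);
    if (mf->addr != NULL) {
        munmap(mf->addr, mf->len);
        close(mf->fd);
        mf->addr = NULL;
        mf->fd = -1;
        mf->len = 0;
    }
    pthread_mutex_unlock(&mf->lock);
}
```

With a structure initialized as { PTHREAD_MUTEX_INITIALIZER, -1, NULL, 0 }, mf_reopen_truncate returns -1 for as long as the mapping exists and succeeds only after mf_unmap. This only guards code paths that go through these wrappers; another process truncating the file would still need a different safeguard.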
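Summary
When a crash involving mmap shows up, the core dump has to be analyzed with the stacks of all the threads involved in view, not just the one that faulted. Only then can the cause be found and fixed.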