511数据库无法启动故障

数据库mysq服务无法启动

临时解决办法:

经人提醒用mysqld_safe方式直接启动(自打脸1024次)

事件回顾


赞到爆活动粉丝猛增导致数据库主机CPU满载,系统卡顿

解决过程

  1.  将数据库关闭,操作命令:systemctl stop  mysqld
  2. 将数据库重启一次检验是否可正常重启 systemctl start  mysqld可正常启动
  3. 将数据库关闭
  4. 在阿里云控制台将数据库主机扩容至16C/32G高性能计算型ECS
  5. 登陆数据库主机systemctl start  mysqld无法启动,日志报错如下


mysql报错信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=0
max_threads=300
thread_count=0
connection_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 127390 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0xf4c845]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x7dd584]
/lib64/libpthread.so.0(+0xf370)[0x7f814fcb7370]
/usr/lib64/mysql/plugin/server_audit.so(get_db_mysql57+0x12)[0x7f81107123e2]
/usr/lib64/mysql/plugin/server_audit.so(+0xa541)[0x7f8110712541]
/usr/sbin/mysqld[0x7dd6c6]
/usr/sbin/mysqld(_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0x1b9)[0xd3e349]
/usr/sbin/mysqld(_Z18mysql_audit_notifyP3THD30mysql_event_general_subclass_tiPKcm+0x3d6)[0x7de326]
/usr/sbin/mysqld(my_message_sql+0x6f)[0x7cf28f]
/usr/sbin/mysqld(my_error+0xf6)[0xf47246]
/usr/sbin/mysqld(my_register_filename+0x29a)[0xf49b5a]
/usr/sbin/mysqld(my_create+0x6c)[0xf46a9c]
/usr/sbin/mysqld[0x7d0c60]
/usr/sbin/mysqld(_Z11mysqld_mainiPPc+0x16c0)[0x7d7fb0]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f814e697b35]
/usr/sbin/mysqld[0x7cd49d]
The manual page at http: class="bash plain">//dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
2018-05-11T06:49:23.670507Z mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

被上图中的第8行日志

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 127390 K  bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

所迷惑,我们一直在修改上面的几个参数,但是数据库仍然无法启动

根据原因探索

  • 检查/etc/init.d/mysqld系统启动脚本,其中有一段代码是判断PID文件是否存在,如果/etc/my.cnf文件里无配置就会用/var/run/mysqld/mysqld.pid默认配置文件

代码如下

mysqld启动脚本

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
get_mysql_option(){
    result=$(/usr/bin/my_print_defaults "$1" | sed -n "s/^--$2=//p" | tail -n 1)
    if [ -z "$result" ]; then
        # not found, use default
        result="$3"
    fi
}
get_mysql_option mysqld datadir "/var/lib/mysql"
datadir="$result"
get_mysql_option mysqld socket "$datadir/mysql.sock"
socketfile="$result"
get_mysql_option mysqld_safe log-error "/var/log/mysqld.log"
errlogfile="$result"
get_mysql_option mysqld_safe pid-file "/var/run/mysqld/mysqld.pid"
mypidfile="$result"

  • 检查/var/run下并无此文件,因为主机重启后系统会自动将此文件删除

三种解决方案:

原理解析:/etc/init.d/mysqld原理其实就是调用mysqld_safe来启动数据库

  • 方法一:给/etc/my.cnf添加配置让mysql到自己的数据目录下写/mydata/mysqld.pid

    [mysqld_safe]

    log-error=/mydata/mysqld.log

    pid-file=/mydata/mysqld.pid
    结果:systemctl start mysqld会一直卡在那,systemctl不正常结束返回,此时数据库是已经正常起来了,而操作系统命令systemctl找不到/var/run/mysqld/mysqld.pid而卡在那(失败)

  • 方法二:更换启动方式,直接用mysqld_safe --defaults-file=/etc/my.cnf启动,此方法是默认到/mydata目录下创建mysqld.pid文件,但是关闭时用
    mysqladmin  shutdown -uroot   -P6033 -p方式来关闭,还要输入密码较麻烦而且做开机启动也麻烦
  • 方法三:修改/etc/init.d/mysqld启动脚本添加自动创建/var/run/mysqld文件夹

    如果文件夹不存在就自动创建

    1
    2
    3
    4
    5
    6
    test -d /var/run/mysqld
    if [ $? -eq 1  ]
    then
    mkdir /var/run/mysqld/
    chown mysql:mysql /var/run/mysqld
    fi

    主机重启可自动将数据库拉起。

总结

        此次故障处理心得

                <p class="title">心得</p>
                        <span class="aui-icon icon-warning">Icon</span>
            <div class="message-content">
                        <p>1:一直被mysql的报错迷惑不断调试各个buffer的size大小,</p><p>2:忘记了尝试用mysqld_safe来启动</p><p>3:从库没能在第一时间切换成主库,这个在接下来的时候进行方案设计和实施</p><p>4:业务逻辑没有实现读写分离从库一直空闲而主库却已经累死</p><p>5:慢查询超时未告警部署</p><p>6:直接导致丢失大量可能性粉丝</p>
                </div>
</div>

为赞到爆的运营人员的公关和开发人员的应急点赞。

上一篇:《Java数字图像处理:编程技巧与应用实践》——1.5 小结


下一篇:ElasticSearch 单个节点监控