Writing down a kernel crash I ran into recently!
Analysis with the crash utility gives the following:
crash> sys
      KERNEL: vmlinux
    DUMPFILE: kernel_dump_file_debug  [PARTIAL DUMP]
        CPUS: 32
        DATE: Thu Jul  8 16:06:13 2021
      UPTIME: 12 days, 01:19:36
LOAD AVERAGE: 4.57, 5.64, 5.97
       TASKS: 832
    NODENAME: localhost
     RELEASE: 2.6.39-gentoo-r3-wafg2-47137
     VERSION: #18 SMP Wed Dec 30 21:37:53 JST 2020
     MACHINE: x86_64  (2599 Mhz)
      MEMORY: 128 GB
       PANIC: "[1039338.727675] Kernel panic - not syncing: softlockup: hung tasks"
crash> bt
PID: 22501  TASK: ffff881ff4340690  CPU: 1  COMMAND: "xxxxproess"
 #0 [ffff88107fc238b0] machine_kexec at ffffffff810243b6
 #1 [ffff88107fc23920] crash_kexec at ffffffff810773b9
 #2 [ffff88107fc239f0] panic at ffffffff815f35e0
 #3 [ffff88107fc23a70] watchdog_timer_fn at ffffffff81089a38
 #4 [ffff88107fc23aa0] __run_hrtimer.clone.28 at ffffffff8106303a
 #5 [ffff88107fc23ad0] hrtimer_interrupt at ffffffff81063541
 #6 [ffff88107fc23b30] smp_apic_timer_interrupt at ffffffff81020b92
 #7 [ffff88107fc23b50] apic_timer_interrupt at ffffffff815f6553
 #8 [ffff88107fc23bb8] igb_xmit_frame_ring at ffffffffa006a754 [igb]
 #9 [ffff88107fc23c70] igb_xmit_frame at ffffffffa006ada4 [igb]
#10 [ffff88107fc23ca0] dev_hard_start_xmit at ffffffff814d588d
#11 [ffff88107fc23d10] sch_direct_xmit at ffffffff814e87f7
#12 [ffff88107fc23d60] dev_queue_xmit at ffffffff814d5c2e
#13 [ffff88107fc23db0] transmit_skb at ffffffffa0111032 [wafg2]
#14 [ffff88107fc23dc0] forward_skb at ffffffffa01113b4 [wafg2]
#15 [ffff88107fc23df0] dev_rx_skb at ffffffffa0111875 [wafg2]
#16 [ffff88107fc23e40] igb_poll at ffffffffa006d6fc [igb]
#17 [ffff88107fc23f10] net_rx_action at ffffffff814d437a
#18 [ffff88107fc23f60] __do_softirq at ffffffff8104f3bf
#19 [ffff88107fc23fb0] call_softirq at ffffffff815f6d9c
--- <IRQ stack> ---
#20 [ffff881f2ebcfae0] __skb_queue_purge at ffffffff8153af65
#21 [ffff881f2ebcfb00] do_softirq at ffffffff8100d1c4
#22 [ffff881f2ebcfb20] _local_bh_enable_ip.clone.8 at ffffffff8104f311
#23 [ffff881f2ebcfb30] local_bh_enable at ffffffff8104f336
#24 [ffff881f2ebcfb40] inet_csk_listen_stop at ffffffff8152a94b
#25 [ffff881f2ebcfb80] tcp_close at ffffffff8152c8aa
#26 [ffff881f2ebcfbb0] inet_release at ffffffff8154a44d
#27 [ffff881f2ebcfbd0] sock_release at ffffffff814c409f
#28 [ffff881f2ebcfbf0] sock_close at ffffffff814c4111
#29 [ffff881f2ebcfc00] fput at ffffffff810d4c85
#30 [ffff881f2ebcfc50] filp_close at ffffffff810d1ea0
#31 [ffff881f2ebcfc80] put_files_struct at ffffffff8104d4d9
#32 [ffff881f2ebcfcd0] exit_files at ffffffff8104d5b4
#33 [ffff881f2ebcfcf0] do_exit at ffffffff8104d821
#34 [ffff881f2ebcfd70] do_group_exit at ffffffff8104df5c
#35 [ffff881f2ebcfda0] get_signal_to_deliver at ffffffff810570b2
#36 [ffff881f2ebcfe20] do_signal at ffffffff8100ae52
#37 [ffff881f2ebcff20] do_notify_resume at ffffffff8100b47e
#38 [ffff881f2ebcff50] int_signal at ffffffff815f5e63
    RIP: 00007fd9e52e1cdd  RSP: 00007fd9a7cfa370  RFLAGS: 00000293
    RAX: 000000000000001b  RBX: 00000000000000fb  RCX: ffffffffffffffff
    RDX: 000000000000001b  RSI: 00007fd96a77e05e  RDI: 00000000000000fb
    RBP: 00007fd9a8513e80   R8: 00000000007a7880   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000293  R12: 000000000000001b
    R13: 00007fd96a77e05e  R14: 000000000000001b  R15: 0000000000735240
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
First, figure out how the result "Kernel panic - not syncing: softlockup: hung tasks" comes about and what it actually means. In other words, translate this conclusion into plain language!!
Lockups come in two kinds: soft lockup and hard lockup.
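A soft lockup means a kernel bug keeps a CPU looping in kernel mode for more than n seconds (n is a configurable threshold), so other tasks on that CPU never get a chance to run. How it is detected: the kernel runs one watchdog thread per CPU, watchdog/x, which periodically refreshes a per-CPU timestamp; a timer callback (watchdog_timer_fn, frame #3 in the bt output) compares that timestamp with the current time, and if the watchdog thread has not run for longer than the threshold, a soft lockup is reported. Getting a panic here rather than just a warning suggests the kernel was configured to panic on soft lockup (softlockup_panic).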
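To make the timestamp comparison concrete, here is a toy simulation in Python. It is only a sketch of the idea, not kernel code: the class and function names, the 1-second tick, and the 20-second threshold used in the usage note are all assumptions picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class CpuWatchdog:
    """Toy model of one CPU's soft-lockup state (illustrative, not kernel code)."""
    threshold_s: float      # hypothetical threshold, the "n" in the text
    touch_ts: float = 0.0   # last time the watchdog thread refreshed the timestamp

    def watchdog_thread_ran(self, now: float) -> None:
        # The per-CPU watchdog thread got scheduled and refreshed the timestamp.
        self.touch_ts = now

    def timer_check(self, now: float) -> bool:
        # Timer callback: a stale timestamp means the thread has been starved.
        return now - self.touch_ts > self.threshold_s


def simulate(stuck_from: float, end: float, threshold_s: float,
             period_s: float = 1.0):
    """Return the time a soft lockup is first reported, or None if never."""
    wd = CpuWatchdog(threshold_s=threshold_s)
    t = 0.0
    while t <= end:
        if t < stuck_from:
            # Before the CPU gets stuck, the watchdog thread still runs.
            wd.watchdog_thread_ran(t)
        if wd.timer_check(t):
            return t
        t += period_s
    return None
```

For example, `simulate(stuck_from=5, end=100, threshold_s=20)` models a CPU that stops scheduling the watchdog thread at t=5; the timer keeps firing because it runs in interrupt context, and the lockup is reported once the timestamp is more than 20 seconds old. This matches the backtrace, where watchdog_timer_fn is reached through apic_timer_interrupt while the CPU is still busy in softirq processing.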
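A hard lockup happens when a CPU runs with all interrupts disabled for longer than a certain time (a few seconds). During that time interrupts from external devices cannot be serviced, and the kernel treats this as a hard lockup.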
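For contrast, a hard lockup cannot be caught by an ordinary timer, because the timer interrupt itself is blocked. As far as I understand it, the kernel instead checks from NMI context whether the timer interrupt count has advanced since the previous check. The sketch below models just that counter comparison; the class and its method names are made up for illustration, and the attribute names only loosely follow what I believe kernel/watchdog.c uses.

```python
class HardLockupDetector:
    """Toy model of the hard-lockup check (illustrative, not kernel code)."""

    def __init__(self) -> None:
        self.hrtimer_interrupts = 0  # bumped by every timer interrupt
        self.saved = 0               # count seen at the previous NMI check

    def timer_interrupt(self) -> None:
        # Only runs while interrupts are enabled on this CPU.
        self.hrtimer_interrupts += 1

    def nmi_check(self) -> bool:
        # An NMI still fires with interrupts disabled. If the count has not
        # moved since the last check, timer interrupts are not being delivered.
        if self.hrtimer_interrupts == self.saved:
            return True
        self.saved = self.hrtimer_interrupts
        return False
```

In the dump above the timer interrupt clearly was delivered (apic_timer_interrupt is on the stack), which is consistent with this being reported as a soft lockup and not a hard one.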
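So the next step is to find out why nothing else got scheduled on that CPU?? Took a look and honestly, who knows!!! Off to eat -----> to be continued this afternoon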