在实际系统维护过程中,常常需要知道一个进程在做哪些动作,比如想判断一个进程是否hang,我们可以使用strace命令,此命令式用来跟踪一个进程在调用哪些系统函数和信号
通过跟踪xinetd进程演示strace命令:
我们首先找到xinetd进程号,然后使用strace命令attached到xinetd进程,然后再另外一个窗口telnet
[root@limt ~]# ps -ef|grep xinet
root 4314 1 0 10:55 ? 00:00:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
root 4657 3612 0 11:02 pts/1 00:00:00 grep xinet
[root@limt ~]# strace -ff -p 4314
Process 4314 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 1
accept(5, {sa_family=AF_INET6, sin6_port=htons(60177), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 6
setsockopt(6, SOL_IPV6, IPV6_ADDRFORM, [2], 4) = 0
clone(Process 4422 attached child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0db3adfa70) = 4422
[pid 4314] sendto(7, "<30>Dec 22 10:57:26 xinetd[4314]"..., 78, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
[pid 4422] rt_sigaction(SIGPIPE, {SIG_DFL, [PIPE], SA_RESTORER|SA_RESTART, 0x7f0db2a9b920}, <unfinished ...>
[pid 4314] <... sendto resumed> ) = 78
[pid 4422] <... rt_sigaction resumed> {SIG_IGN, [], SA_RESTORER, 0x7f0db2a9b920}, 8) = 0
[pid 4314] close(6) = 0
[pid 4314] poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1 <unfinished ...>
[pid 4422] rt_sigaction(SIGTSTP, {SIG_DFL, [TSTP], SA_RESTORER|SA_RESTART, 0x7f0db2a9b920}, {SIG_IGN, [], SA_RESTORER, 0x7f0db2a9b920}, 8) = 0
[pid 4422] rt_sigaction(SIGTTIN, {SIG_DFL, [TTIN], SA_RESTORER|SA_RESTART, 0x7f0db2a9b920}, {SIG_IGN, [], SA_RESTORER, 0x7f0db2a9b920}, 8) = 0
[pid 4422] rt_sigaction(SIGTTOU, {SIG_DFL, [TTOU], SA_RESTORER|SA_RESTART, 0x7f0db2a9b920}, {SIG_IGN, [], SA_RESTORER, 0x7f0db2a9b920}, 8) = 0
[pid 4422] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 4422] close(3) = 0
[pid 4422] close(4) = 0
[pid 4422] close(0) = 0
[pid 4422] close(1) = 0
[pid 4422] close(2) = 0
[pid 4422] setgid(0) = 0
[pid 4422] open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 0
[pid 4422] fstat(0, {st_mode=S_IFREG|0644, st_size=3785, ...}) = 0
[pid 4422] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0db3b17000
[pid 4422] read(0, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 3785
[pid 4422] close(0) = 0
[pid 4422] munmap(0x7f0db3b17000, 4096) = 0
[pid 4422] open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 0
[pid 4422] read(0, "65536\n", 31) = 6
[pid 4422] close(0) = 0
[pid 4422] open("/etc/group", O_RDONLY|O_CLOEXEC) = 0
[pid 4422] fstat(0, {st_mode=S_IFREG|0644, st_size=1411, ...}) = 0
[pid 4422] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0db3b17000
[pid 4422] lseek(0, 0, SEEK_CUR) = 0
[pid 4422] read(0, "root:x:0:\nbin:x:1:bin,daemon\ndae"..., 4096) = 1411
[pid 4422] read(0, "", 4096) = 0
[pid 4422] close(0) = 0
从上面的输出可以看出4314进程先accept一个socket请求,然后cloned一个子进程4422,然后父进程close刚创建的socket文件描述6,子进程关闭继承过来的文件描述符0,1,2,3,4,,然后子进程读取/etc/passwd文件,为telnet登陆用户验证做准备
在生产系统中就曾经遇到一个问题,最后通过strace命令解决,情况大致为:
有一个批处理作业运行了10几个小时都没有结束,一般情况下几分钟就完成,由于本身对这套系统也不是太熟悉,就先登录系统,查看跑批处理作业用户在运行哪些进程(ps -ef|grep user),发现此用户运行着一个shell脚步,和运行着一个gzip程序,并且gzip程序的父进程为shell脚步的进程,说明shell脚步调用了gzip,那gzip为什么卡着不动啊,查看了那个要压缩的文件也不是很大。然后通过strace文件查看那个gzip进程,发现一直处于read函数状态不动,然后恍然大悟,是不是在等待Y/N输入,立刻去那个文件目录,发现目录下已经有了那个文件的gz文件,所以需要确认是否覆盖,因为是系统自动调起的程序,所以一直卡在后台,至于那个早先生成的gz文件,是因为之前系统上线造成的。