Recently I have been working on building a machine learning platform on top of MLflow. One of our Data Scientists ran into a problem when deploying a PyTorch model, so here is a record of it…
Symptom
When MLflow tried to serve a model trained with a PyTorch RNN, the service would not come up: the gunicorn workers inside it kept restarting endlessly, and every crash dumped the thread stacks and heap into a core file, which at one point nearly ran the production GFS out of space…
Since the service runs inside Kubernetes Pods, the strangest part was that some Pods started without any problem while others did not…
Troubleshooting
Since we had the core dump files in hand, the natural next step was to analyze them with gdb. Opening the core file showed the following error (the exact invocation is sketched after the backtrace):
[New LWP 470]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/conda/bin/python /opt/conda/bin/gunicorn --timeout 60 -b 0.0.0.0:5000 -w 4'.
Program terminated with signal SIGILL, Illegal instruction.
#0 0x00007fc6db828ffa in Xbyak::Operand::Operand(int, Xbyak::Operand::Kind, int, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
(gdb) where
#0 0x00007fc6db828ffa in Xbyak::Operand::Operand(int, Xbyak::Operand::Kind, int, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#1 0x00007fc6d9bb0877 in _GLOBAL__sub_I_verbose.cpp () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#2 0x00007fc71d91879a in call_init (l=<optimized out>, argc=argc@entry=9, argv=argv@entry=0x7ffc9bc85ad8, env=env@entry=0x55bfa73c4d20) at dl-init.c:72
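For reference, loading a core dump like this only needs gdb plus the binary that produced it, here the conda Python interpreter that runs gunicorn; the core file path below is illustrative, not the actual dump name:

# load the interpreter together with the core dump it produced (core path is illustrative)
gdb /opt/conda/bin/python /path/to/core.470
# at the (gdb) prompt, `where` prints the backtrace shown above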
From the backtrace it is clear that the crash happens inside PyTorch (libcaffe2.so) rather than in our own code. A quick Google search turned up many reports of the same error, which is caused by the CPU architecture: the prebuilt libcaffe2.so executes instructions, via the Xbyak JIT assembler visible in frame #0, that not every CPU supports, and on a CPU that lacks them the process dies with SIGILL while the library is still being loaded (note frame #2 is in dl-init.c, i.e. static initialization). To see which instruction-set extensions a CPU actually supports:
cat /proc/cpuinfo | grep flags
Comparing the flags on the good and the bad hosts, and cross-checking against the same-issue-from-github report, confirmed that this model only runs on CPUs with AVX2. That also explains why only some Pods could start: the failing ones had been scheduled onto nodes whose CPUs lack AVX2.
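As a quick sanity check, the sketch below greps for the avx2 flag directly, both on a host and from outside the cluster against a single Pod; the Pod name is hypothetical, and /proc/cpuinfo inside a container still reflects the host CPU:

# prints "avx2" if the CPU advertises AVX2, nothing otherwise
grep -m1 -o avx2 /proc/cpuinfo

# same check against one Pod from outside the cluster (Pod name is hypothetical)
kubectl exec mlflow-serving-0 -- grep -m1 -o avx2 /proc/cpuinfo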
Fix
Upgrading PyTorch to 1.2.0 fixes the issue.
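Assuming the model is served from the same conda environment seen in the backtrace, the upgrade is a one-liner pinning the version that resolved the crash; rebuild the image afterwards so every Pod picks it up:

# inside the image / conda env that MLflow uses to serve the model
pip install --upgrade torch==1.2.0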