Is tensorboard related to PyTorch distributed training hanging?

I have not investigated this in depth, so this is only a record of what I observed:

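The code depends on Detectron2. Launched directly in distributed mode, it hung: GPU utilization sat at 100% with no progress at all. After switching to single-GPU training, it raised an error saying tensorboard could not be found. After running pip install tensorboard, distributed training worked without problems.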
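Since the missing package only surfaced as a readable error in the single-GPU run, one way to catch this kind of problem earlier is to check for the package before any worker process is started. The snippet below is a minimal sketch under that assumption; `missing_modules` and `preflight` are hypothetical helper names, not part of Detectron2 or PyTorch, and this does not explain why the distributed run hung instead of raising an error.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight(names=("tensorboard",)):
    """Exit with a readable message if any required module is absent.

    Intended to be called in the launcher process, before spawning
    distributed workers (hypothetical usage, not the author's code).
    """
    missing = missing_modules(names)
    if missing:
        sys.exit(
            "Missing packages: " + ", ".join(missing)
            + " (try: pip install " + " ".join(missing) + ")"
        )
```

Calling `preflight()` at the top of the training script would stop the job immediately with a clear message if tensorboard is not installed, rather than leaving the cause to be discovered through a single-GPU rerun.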

pip list before the install:

Package       Version
------------- -------------------
certifi       2020.11.8
cloudpickle   1.6.0
future        0.18.2
fvcore        0.1.2.post20201122
numpy         1.19.4
opencv-python 4.4.0.46
Pillow        8.0.1
pip           20.2.4
portalocker   2.0.0
PyYAML        5.3.1
setuptools    50.3.1.post20201107
some-package  0.1
tabulate      0.8.7
termcolor     1.1.0
torch         1.5.0+cu101
torchvision   0.6.0+cu101
tqdm          4.54.0
wheel         0.35.1
yacs          0.1.8

pip list after the install:

Package                Version
---------------------- -------------------
absl-py                0.11.0
cachetools             4.1.1
certifi                2020.11.8
chardet                3.0.4
cloudpickle            1.6.0
future                 0.18.2
fvcore                 0.1.2.post20201122
google-auth            1.23.0
google-auth-oauthlib   0.4.2
grpcio                 1.33.2
idna                   2.10
importlib-metadata     3.1.0
Markdown               3.3.3
numpy                  1.19.4
oauthlib               3.1.0
opencv-python          4.4.0.46
Pillow                 8.0.1
pip                    20.2.4
portalocker            2.0.0
protobuf               3.14.0
pyasn1                 0.4.8
pyasn1-modules         0.2.8
PyYAML                 5.3.1
requests               2.25.0
requests-oauthlib      1.3.0
rsa                    4.6
setuptools             50.3.1.post20201107
six                    1.15.0
some-package           0.1
tabulate               0.8.7
tensorboard            2.4.0
tensorboard-plugin-wit 1.7.0
termcolor              1.1.0
torch                  1.5.0+cu101
torchvision            0.6.0+cu101
tqdm                   4.54.0
urllib3                1.26.2
Werkzeug               1.0.1
wheel                  0.35.1
yacs                   0.1.8
zipp                   3.4.0

 
