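I have not dug into the root cause, so this is just a record of what I observed:
The code depends on Detectron2. Launching distributed training directly hung: GPU utilization sat at 100% and nothing made progress. After switching to single-GPU training, the run failed with an error saying tensorboard could not be found. Once pip install tensorboard finished, distributed training ran without problems.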
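Because the missing module only showed up as a readable error in single-GPU mode, a quick import check before launching a distributed job might save some time. Below is a minimal sketch, not part of the original code. It assumes (unverified here) that the hang was related to the failed tensorboard import, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found.

    Only top-level names (no dots) are intended here; find_spec looks the
    module up without importing it.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_modules(names):
    """Raise a clear error up front instead of failing inside a worker."""
    missing = missing_modules(names)
    if missing:
        raise RuntimeError(
            "Missing modules: " + ", ".join(missing)
            + " (try: pip install " + " ".join(missing) + ")"
        )
```

Calling `require_modules(["tensorboard"])` at the top of the training entry point, before any worker processes are spawned, would turn the missing dependency into an immediate error message.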
pip list before the install:
Package       Version
------------- -------------------
certifi       2020.11.8
cloudpickle   1.6.0
future        0.18.2
fvcore        0.1.2.post20201122
numpy         1.19.4
opencv-python 4.4.0.46
Pillow        8.0.1
pip           20.2.4
portalocker   2.0.0
PyYAML        5.3.1
setuptools    50.3.1.post20201107
some-package  0.1
tabulate      0.8.7
termcolor     1.1.0
torch         1.5.0+cu101
torchvision   0.6.0+cu101
tqdm          4.54.0
wheel         0.35.1
yacs          0.1.8
pip list after the install:
Package                Version
---------------------- -------------------
absl-py                0.11.0
cachetools             4.1.1
certifi                2020.11.8
chardet                3.0.4
cloudpickle            1.6.0
future                 0.18.2
fvcore                 0.1.2.post20201122
google-auth            1.23.0
google-auth-oauthlib   0.4.2
grpcio                 1.33.2
idna                   2.10
importlib-metadata     3.1.0
Markdown               3.3.3
numpy                  1.19.4
oauthlib               3.1.0
opencv-python          4.4.0.46
Pillow                 8.0.1
pip                    20.2.4
portalocker            2.0.0
protobuf               3.14.0
pyasn1                 0.4.8
pyasn1-modules         0.2.8
PyYAML                 5.3.1
requests               2.25.0
requests-oauthlib      1.3.0
rsa                    4.6
setuptools             50.3.1.post20201107
six                    1.15.0
some-package           0.1
tabulate               0.8.7
tensorboard            2.4.0
tensorboard-plugin-wit 1.7.0
termcolor              1.1.0
torch                  1.5.0+cu101
torchvision            0.6.0+cu101
tqdm                   4.54.0
urllib3                1.26.2
Werkzeug               1.0.1
wheel                  0.35.1
yacs                   0.1.8
zipp                   3.4.0
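To see exactly which packages the install pulled in, the two listings can be compared programmatically. This is a rough sketch that assumes each listing has been saved as text with one "name version" pair per line. `parse_pip_list` and `added_packages` are hypothetical helper names, and the sample strings are shortened stand-ins for the full listings.

```python
def parse_pip_list(text):
    """Parse `pip list` text (one 'name version' pair per line) into a dict."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and the dashed separator row.
        if len(parts) != 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        packages[parts[0]] = parts[1]
    return packages


def added_packages(before_text, after_text):
    """Return names present in the second listing but not in the first."""
    before = parse_pip_list(before_text)
    after = parse_pip_list(after_text)
    return sorted(set(after) - set(before), key=str.lower)


# Shortened sample listings, for illustration only.
sample_before = "Package Version\n------- -------\nnumpy 1.19.4\ntorch 1.5.0+cu101\n"
sample_after = sample_before + "tensorboard 2.4.0\n"
new_names = added_packages(sample_before, sample_after)
```

Applied to the two full listings, this would report tensorboard together with the dependencies that came along with it.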