Command used: CUDA_VISIBLE_DEVICES=0,1,2 xxx
Problem encountered: RuntimeError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 23.70 GiB total capacity; 1.40 GiB already allocated; 10.69 MiB free; 1.42 GiB reserved in total by PyTorch)
Solution: CUDA_VISIBLE_DEVICES=1,2 xxx
Takeaways:
1- The process uses the GPUs in the order listed (0, 1, 2). When GPU 0 runs out of memory, it does not automatically fall back to GPU 1 or GPU 2; it raises the error instead (see the sketch below).
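A minimal sketch of why the fix works (assuming PyTorch is installed and the variable is set before any CUDA call): with CUDA_VISIBLE_DEVICES=1,2, the process only sees physical GPUs 1 and 2, and they are renumbered as cuda:0 and cuda:1 inside the process, so existing code that allocates on cuda:0 now lands on a GPU with free memory.

import os
# Must be set before torch initializes CUDA (e.g. before the first .cuda() call).
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

import torch

print(torch.cuda.device_count())        # 2: only the two visible GPUs are counted
x = torch.zeros(1024, device="cuda:0")  # cuda:0 here is physical GPU 1
print(torch.cuda.get_device_name(0))    # name of physical GPU 1

In practice, setting the variable on the command line (as in the solution above) is safer than setting it in code, because it is guaranteed to take effect before any library touches the GPUs.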