Deep Learning Training / GPU Server Hardware Configuration
Current configuration:
CPU
# Number of physical CPUs (sockets)
cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l
# Number of cores per physical CPU
cat /proc/cpuinfo| grep "cpu cores"| uniq
# Number of logical CPUs
cat /proc/cpuinfo| grep "processor"| wc -l
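The three counts above can also be collected in one pass. The following is a minimal sketch, assuming an x86-style `/proc/cpuinfo` with `processor`, `physical id` and `cpu cores` fields; the function name `cpu_summary` and its optional file argument are hypothetical and exist only for illustration.

```shell
# Summarise CPU topology from a cpuinfo-formatted file (default: /proc/cpuinfo).
cpu_summary() {
  local file="${1:-/proc/cpuinfo}"
  awk -F':[ \t]*' '
    /^processor/   { logical++ }          # one entry per logical CPU
    /^physical id/ { sockets[$2] = 1 }    # collect distinct socket ids
    /^cpu cores/   { cores = $2 }         # cores per physical CPU
    END {
      n = 0
      for (id in sockets) n++
      printf "sockets=%d cores_per_socket=%d logical=%d\n", n, cores, logical
    }
  ' "$file"
}
```

Calling `cpu_summary` with no argument reads `/proc/cpuinfo`. If `logical` is larger than `sockets` times `cores_per_socket`, hyper-threading is enabled.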
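Memory modules
# Inspect the installed memory modules
sudo dmidecode --type memory
Below is an excerpt of the output. The maximum capacity of this memory array is 384 GB and it has 6 slots: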
Handle 0x003C, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 384 GB
Error Information Handle: Not Provided
Number Of Devices: 6
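Details for a single slot:
Each populated slot holds a 32 GB module, and two slots per array are populated.
There are 4 such arrays of 6 slots each (4 × 6 = 24 slots). Filling an array to its 384 GB maximum would take 384 / 6 = 64 GB per slot; the current total is 2 × 4 × 32 = 256 GB.
Rated transfer speed to the CPU: 2667 MT/s (mega-transfers per second). Note that the output below shows the module configured to run at 2400 MT/s (Configured Clock Speed).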
Handle 0x003E, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x003C
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: P1_DIMMA1
Bank Locator: P1_Node0_Channel0_Dimm1
Type: DDR4
Type Detail: Synchronous
Speed: 2667 MT/s
Manufacturer: Samsung
Serial Number: 38ED2DAE
Asset Tag: P1_DIMMA1_AssetTag (Date:18/15)
Part Number: M393A4K40CB2-CTD
Rank: 2
Configured Clock Speed: 2400 MT/s
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
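Counting populated slots by eye gets tedious with 24 slots, so the tally can be scripted. This is a rough sketch, assuming each memory device reports a line of the form `Size: 32 GB`, `Size: 32768 MB` or `Size: No Module Installed`, as dmidecode commonly does; other units are not handled, and the function name `summarize_dimms` is hypothetical.

```shell
# Summarise DIMM population from `dmidecode --type memory` text on stdin.
summarize_dimms() {
  awk '
    $1 == "Size:" {
      slots++                              # every memory device has one Size line
      if ($2 ~ /^[0-9]+$/) {               # numeric size means the slot is populated
        populated++
        total_gb += ($3 == "MB") ? $2 / 1024 : $2
      }
    }
    END {
      printf "slots=%d populated=%d total_gb=%d\n", slots, populated, total_gb
    }
  '
}
```

Usage: `sudo dmidecode --type memory | summarize_dimms`. The result can be compared against the manual arithmetic above.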
You can also check the total memory size with free:
$ free -h
IP | CPU | Memory (GB) | System disk (GB) | Data disks | GPU
---|---|---|---|---|---
204 | 2*Intel® Xeon® CPU E5-2650 v4 @ 2.20GHz (12 cores) | 256 | 787 | 3T/3T/1.2T | 10*2080Ti
199 | Intel® Xeon® CPU E5-2650 v4 @ 2.20GHz (12 cores) | 256 | 196 | 1007G | 8*2080Ti
198 | | 256 | 800 | | 10*1080Ti
29 | 2*Intel® Xeon® Gold 5118 CPU @ 2.30GHz (12 cores) | 256 | 393 | 484G/2.0T/4.6T | 8*2080Ti
Failed to initialize NVML: Driver/library version mismatch
Problem:
The driver was not installed correctly. This can happen if the previous driver was installed using the runfile installer and the new driver was installed using package manager, or vice versa. There are probably other scenarios as well.
Remove all previous package manager installs, and all previous runfile installer installs, then reinstall the driver.
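In our case, we had first installed CUDA and the NVIDIA driver from .run files, and later installed nvidia-cuda-toolkit and cuda with apt. This produced a version conflict and the driver mismatch.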
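One way to confirm this kind of conflict is to compare the version of the loaded kernel module with the version of the user-space NVML library. The sketch below assumes that `/proc/driver/nvidia/version` contains an `NVRM version` line with the driver version, and that the library file is named `libnvidia-ml.so.<version>`; all three function names are hypothetical, and the library's location varies by distribution.

```shell
# Extract the driver version from /proc/driver/nvidia/version-style text on stdin.
kernel_module_version() {
  grep '^NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Extract the version suffix from a libnvidia-ml.so.<version> file name.
# Pass the fully versioned file, not the libnvidia-ml.so.1 symlink.
library_version() {
  local name
  name=$(basename "$1")
  printf '%s\n' "${name#libnvidia-ml.so.}"
}

# Report whether the two versions agree; returns non-zero on a mismatch.
check_driver_match() {
  if [ "$1" = "$2" ]; then
    echo "match: $1"
  else
    echo "mismatch: kernel module $1, library $2"
    return 1
  fi
}
```

For example, `check_driver_match "$(kernel_module_version < /proc/driver/nvidia/version)" "$(library_version /path/to/libnvidia-ml.so.X.Y)"`, where the library path is a placeholder to be filled in for the machine at hand.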
Uninstall:
Uninstalling CUDA
To uninstall CUDA that was installed from a .run file:
cd /usr/local/cuda-xx.x/bin/
sudo ./cuda-uninstaller
sudo rm -rf /usr/local/cuda-xx.x
To uninstall CUDA that was installed with apt:
sudo apt-get remove "cuda*" "*cublas*" "*cufft*" "*curand*" "*cusolver*" "*cusparse*" "*npp*" "*nvjpeg*" "nsight*"
Use dpkg to check whether the corresponding packages have been completely removed:
dpkg -l
Search the output for the version you had installed (9.1.85 in my case) to confirm that nothing is left.
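Scanning the full `dpkg -l` listing by hand is error-prone, so the check can be narrowed with a filter. This is a sketch only: the function name `leftover_gpu_packages` is hypothetical, and the name patterns mirror the apt-get command above plus `nvidia`, so they may over-match unrelated packages.

```shell
# Filter `dpkg -l` output on stdin for CUDA/NVIDIA packages still known to dpkg.
# Status "ii" = installed; "rc" = removed, but configuration files remain.
leftover_gpu_packages() {
  awk '
    ($1 == "ii" || $1 == "rc") &&
    tolower($2) ~ /cuda|cublas|cufft|curand|cusolver|cusparse|npp|nvjpeg|nsight|nvidia/ {
      print $1, $2, $3                     # status, package name, version
    }
  '
}
```

Usage: `dpkg -l | leftover_gpu_packages`. Empty output suggests the packages are gone; `rc` entries can be cleared by purging the named package.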
Uninstalling the NVIDIA driver
To uninstall a driver that was installed from a .run file:
sudo /usr/bin/nvidia-uninstall
To remove all previously installed drivers, including those installed with apt:
sudo apt-get --purge remove "*nvidia*"
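Before running the runfile uninstallers, it can help to see which of them are actually present on the machine. The sketch below only prints the commands and does not execute anything. The function name `plan_runfile_uninstall` and its optional root argument are hypothetical; the root exists only so the function can be tried against a scratch directory tree. It assumes CUDA runfile installs live under `/usr/local/cuda-*`, as in the steps above.

```shell
# Print (do not run) the runfile uninstall commands that apply under a root.
plan_runfile_uninstall() {
  local root="${1:-}"
  local dir rel
  for dir in "$root"/usr/local/cuda-*; do
    [ -d "$dir" ] || continue              # skip when the glob matched nothing
    rel=${dir#"$root"}                     # path as it appears on the real system
    if [ -x "$dir/bin/cuda-uninstaller" ]; then
      echo "sudo $rel/bin/cuda-uninstaller"
    fi
    echo "sudo rm -rf $rel"
  done
  if [ -x "$root/usr/bin/nvidia-uninstall" ]; then
    echo "sudo /usr/bin/nvidia-uninstall"
  fi
}
```

Running `plan_runfile_uninstall` with no argument inspects the live system. Review the printed commands before running any of them, since they delete the toolkit directories.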
Install
For installing CUDA and the NVIDIA driver, see:
Installing nvidia-430.64, cuda-10.1, cudnn-7.6.0 and anaconda on an Ubuntu server
References
Others have run into the same problem and solved it in different ways, which may be useful for comparison:
NVIDIA NVML Driver/library version mismatch [closed]
nvidia-smi returns the error 'Failed to initialize NVML: Driver/library version mismatch'
NVIDIA's official documentation describes how to resolve conflicting installations:
Handle Conflicting Installation Methods
The official procedure for uninstalling CUDA and the NVIDIA driver (runfile installs):
Uninstallation