Tesla K20m key specifications
Total amount of global memory: 4800 MBytes (5032706048 bytes)
Total amount of constant memory: 64KB(65536 bytes)
Total amount of shared memory per block: 48KB(49152 bytes)
Total number of registers available per block: 65536
Maximum number of threads per multiprocessor: 2048 (each SM can hold at most 2048 resident threads)
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores (13 SMs with 192 SPs each, 13*192 = 2496 SPs in total; "CUDA cores" are SPs)
Per GPU node:
- 13 SMs with 192 SPs each, for 13*192 = 2496 SPs in total;
- each SM holds at most 2048 resident threads, i.e. at most 2048/32 = 64 warps per SM;
- each block holds at most 1024 threads;
- each block has at most 65536 registers available.
deviceQuery output:
Device 0: "Tesla K20m"
CUDA Driver Version / Runtime Version 6.5 / 6.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 4800 MBytes (5032706048 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores (13 SMs, 192 SPs each)
GPU Clock rate: 706 MHz (0.71 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 1310720 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Texture alignment: 512 bytes
Total amount of constant memory: 64KB (65536 bytes)
Total amount of shared memory per block: 48KB (49152 bytes)
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes (maximum pitch for aligned, pitched device-memory accesses)
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Tesla K20m"
CUDA Driver Version / Runtime Version 6.5 / 6.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 4800 MBytes (5032706048 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Clock rate: 706 MHz (0.71 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 1310720 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 132 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla K20m (GPU0) -> Tesla K20m (GPU1) : No
> Peer access from Tesla K20m (GPU1) -> Tesla K20m (GPU0) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.0, NumDevs = 2, Device0 = Tesla K20m, Device1 = Tesla K20m
Result = PASS
Using nvprof:
First compile the source into an executable with nvcc,
then run:
nvprof ./gpu-fmm
The "profiling result" section lists kernel execution times, while "api calls" lists the time spent in CUDA API calls;
when analyzing kernels, the "profiling result" section is usually the one to look at.
Other performance metrics can also be collected:
achieved_occupancy: the ratio of the average number of active warps per cycle on an SM to the maximum number of warps supported per SM.
nvprof --metrics achieved_occupancy ./gpu-fmm
gld_throughput: global memory load throughput.
nvprof --metrics gld_throughput ./gpu-fmm
gld_efficiency: global memory load efficiency, i.e. the ratio of requested global memory load throughput to required global memory load throughput (how much of the data actually transferred was useful).
nvprof --metrics gld_efficiency ./gpu-fmm
Querying shared memory bank conflicts with nvprof:
nvprof --events shared_ld_bank_conflict,shared_st_bank_conflict ./gpu-fmm