CUDA编程图例
Figure 7. Matrix Multiplication without Shared Memory
Figure 8. Matrix Multiplication with Shared Memory
Figure 20. Examples of Global Memory Accesses. Examples of Global Memory Accesses by a Warp, 4-Byte Word per Thread, and Associated Memory Transactions for Compute Capabilities 3.x and Beyond
Figure 21. Strided Shared Memory Accesses. Examples for devices of compute capability 3.x (in 32-bit mode) or compute capability 5.x and 6.x
Figure 22. Irregular Shared Memory Accesses. Examples for devices of compute capability 3.x, 5.x, or 6.x.
参考链接:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions__throughput-native-arithmetic-instructions