I. General-purpose CPUs versus GPUs
II. CUDA parallel programming concepts
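Block x: a block of threads (a group of threads)
SM: Streaming Multiprocessor
GPU with 2 SMs: a GPU with two multiprocessors (loosely, a "2-core" GPU)
GPU with 4 SMs: a GPU with four multiprocessors (loosely, a "4-core" GPU)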
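As a small aside, the number of SMs on an installed GPU can be read through the CUDA runtime API. The sketch below is only an illustration: the helper name `smCount` is hypothetical, and it assumes the device of interest is device 0 unless another index is passed.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of streaming multiprocessors (SMs)
// on the given device, or -1 if the device cannot be queried.
int smCount(int device) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        return -1;  // no usable device, or an invalid device index
    }
    return prop.multiProcessorCount;  // physical SM count known to the runtime
}
```

On a machine with a working CUDA device, `smCount(0)` should return a positive number. The kernel code itself never needs this value.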
Figure 3 is explained as follows:
At its core are three key abstractions
- a hierarchy of thread groups
- shared memories
- and barrier synchronization
that are simply exposed to the programmer as a minimal set of language extensions.
These 3 abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism.
They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.
This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability.
Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 3, and only the runtime system needs to know the physical multiprocessor count.
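To make the three abstractions concrete, here is a minimal sketch of a per-block sum. It is not taken from the programming guide: the names `blockSum`, `sumOfOnes`, and `kThreadsPerBlock` are made up for illustration, the block size of 256 is an arbitrary power of two, and error checking is left out to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreadsPerBlock = 256;  // illustrative; must be a power of two here

// Thread hierarchy: each block sums its own slice of `in` independently.
// Shared memory + barriers: the threads of one block cooperate on that slice.
__global__ void blockSum(const float* in, float* blockSums, int n) {
    __shared__ float cache[kThreadsPerBlock];  // visible only within one block

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    cache[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();  // barrier: every thread has stored its element

    // Tree reduction inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            cache[tid] += cache[tid + stride];
        }
        __syncthreads();  // barrier after every reduction step
    }
    if (tid == 0) {
        blockSums[blockIdx.x] = cache[0];  // one partial result per block
    }
}

// Hypothetical host helper: sums n ones on the GPU and returns the total.
float sumOfOnes(int n) {
    int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;
    std::vector<float> hostIn(n, 1.0f), hostSums(blocks);

    float *devIn = nullptr, *devSums = nullptr;
    cudaMalloc(&devIn, n * sizeof(float));
    cudaMalloc(&devSums, blocks * sizeof(float));
    cudaMemcpy(devIn, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The blocks may run on any SM, in any order; the code does not care.
    blockSum<<<blocks, kThreadsPerBlock>>>(devIn, devSums, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(hostSums.data(), devSums, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devSums);

    float total = 0.0f;
    for (float s : hostSums) total += s;  // combine per-block results on the host
    return total;
}
```

With a working CUDA device, `sumOfOnes(1000)` launches 4 blocks and should return 1000. Nothing in the kernel refers to the number of SMs, which is the "automatic scalability" the passage describes.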
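Based on the description above, this is what the programmer has to do:
The problem to solve: split it into coarse-grained sub-problems
{
    Coarse-grained sub-problem 1 (can be handled by one block, block 1): split it into fine-grained pieces
    {
        Fine-grained piece 1 (can be handled by one thread in block 1)
        ......
        Fine-grained piece n
    }
    Coarse-grained sub-problem 2 (can be handled by one block, block 2)
    {
        ......
    }
    Coarse-grained sub-problem 3 (can be handled by one block, block 3)
    {
        ......
    }
    ......
    Coarse-grained sub-problem n (can be handled by one block, block n)
    {
        ......
    }
}
Note: this only illustrates a two-level decomposition. Real applications may need more levels, and the same pattern applies at each level.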
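As a rough sketch of the two-level decomposition above, consider adding two vectors: the whole addition is the problem, each block takes one chunk of elements (a coarse-grained sub-problem), and each thread adds one pair of elements (a fine-grained piece). The names `vecAdd` and `addOnGpu` and the block size of 128 are assumptions chosen for illustration, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Fine-grained piece: one thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // blockIdx.x selects the coarse chunk, threadIdx.x the element inside it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may have threads with nothing to do
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: returns a + b computed on the GPU.
// Assumes a and b have the same length.
std::vector<float> addOnGpu(const std::vector<float>& a,
                            const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    if (n == 0) return c;

    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // The decomposition is passed to the runtime in the launch configuration:
    // <<<number of blocks, threads per block>>>.
    int threadsPerBlock = 128;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

For 300 elements this launches 3 blocks of 128 threads, so the last 84 threads of the final block fail the `i < n` check and do nothing.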
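So what questions does that leave for developers (like me)?
1. How to decompose the problem.
2. Once it is decomposed, how to express that decomposition in code for the CUDA runtime environment.
The first question is a matter of experience, which has to be built up over time.
The second is a matter of following the book, and it is what we will study next.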