GPU Hardware and Writing Efficient Code

NVIDIA's Implementation

Warps

  • NVIDIA GPUs execute kernel code in so-called warps.
  • One warp consists of 32 threads with consecutive index numbers:


  • Thus, the programmer divides threads into workgroups, after which the GPU hardware further divides the workgroups into warps.
  • The streaming multiprocessor (SM) contains several processing elements and one or more warp schedulers, which distribute the SM's resources among the warps:


  • Each thread has its own program counter, registers, and private memory.
  • Thus, each thread independently executes the kernel's source code at a logical level.
  • From the hardware perspective, threads belonging to the same warp are executed together.
  • Some threads are automatically disabled when the execution paths of threads within a warp diverge.
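
As a concrete illustration (a minimal sketch, not part of the original material), each thread can derive the warp it belongs to and its position ("lane") within that warp from its local index, assuming a warp size of 32:

__global__ void warp_info(int *warp_id, int *lane_id) {
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;
    warp_id[idx] = threadIdx.x / 32;  // threads 0-31 of a workgroup form warp 0, threads 32-63 form warp 1, ...
    lane_id[idx] = threadIdx.x % 32;  // position of the thread within its warp
}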


Warp Scheduling

  • The warps allocated to a compute unit form a "warp pool".
  • Warps in a warp pool can be in two different states:
    • Ready: At least one thread in the warp is ready to execute a command.
    • Waiting: No thread in the warp is ready to execute the next command.
  • Threads within a warp may be waiting, for example, for the completion of a previous command.
  • Warp schedulers select a set of warps from the warp pool every clock cycle (assuming at least one is available) and issue the selected warps for execution on the processing elements.

  • The warp scheduler can choose an instruction to execute:
    • From among multiple ready warps
    • From among multiple independent instructions


Example of a CC 2.0 device

Compute Capability (CC)

  • NVIDIA reports the capabilities of their GPUs under the concept of Compute Capability, abbreviated CC.
  • The CC indicates various aspects of the GPU, and different CUDA functionalities may be supported on different CCs.
  • The CUDA C++ Programming Guide gives an overview of CCs in its Section 16.
  • For the most part this is not something you need to worry about, but it can be a good idea to check up on the details when targeting a specific device.
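
The compute capability of the device at hand can be queried at runtime. Below is a minimal host-side sketch, assuming the HIP runtime (the CUDA runtime offers the equivalent cudaGetDeviceProperties call); on NVIDIA hardware the major and minor fields give the CC:

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);  // query the properties of device 0
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}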

Compute capability (CC) number

  • Nvidia reports the capabilities of the graphics cards it manufactures as compute capability
  • The CC numbers of the GPUs used in this course are
    • Pirate (Nvidia Tesla K40c / Kepler): 3.5
    • Mako (Nvidia GeForce GTX580 / Fermi): 2.0
  • The CC number indicates, for example, the CUDA capabilities supported by the GPU (and the corresponding OpenCL capabilities) and the main features of the hardware implementation


Compute capability 2.x


Compute units
  • The CC 2.0 compute unit contains:
    • 32 processing elements (CUDA cores)
    • 4 special function units for single precision special functions
    • 8 PEs are combined when the compute unit performs double precision floating point operations
    • 64KB of combined local (shared) memory and L1 cache
  • The compute unit of a CC 2.1 device contains:
    • 48 processing elements
    • 8 special function units for single precision special functions
    • 64KB combined local (shared) memory and L1 cache memory


  • Both CC 2.0 and CC 2.1 devices contain 2 warp schedulers
  • During one clock cycle, each warp scheduler can issue
    • one instruction on CC 2.0 devices and
    • two independent instructions on CC 2.1 devices
  • A single warp scheduler can issue warps to only half of the processing elements
  • The first scheduler handles all odd-numbered warps and the second all even-numbered warps
  • In double precision computation, only one of the two schedulers is active

Global memory
  • Each compute unit has its own L1-level cache and all compute units share a common L2-level cache
  • Threads in the same warp make memory requests together!
  • The L1 cache line width is 128 bytes (L1 can be partially disabled)
  • If memory requests from threads in a warp do not refer to the same 128 byte cache line, the warp memory request is split into several 128 byte chunks
  • L2 cache serves memory requests in 32 byte chunks

Local memory
  • Local memory is divided into 32 banks so that consecutive 32-bit words belong to consecutive banks
  • Each bank can serve 32 bits in two clock cycles
  • If threads belonging to the same warp refer to the same bank at the same time, a so-called bank conflict occurs
  • Bank conflicts are handled in such a way that requests pointing to the same bank are handled sequentially
  • Requests referring to the same 32-bit word do not cause a bank conflict.

Compute capability 3.x


Compute units
  • The CC 3.x compute unit contains:
    • 192 single precision PEs
    • 64 double precision PEs (K40c)
    • 4 warp schedulers
    • 64KB combined local (shared) memory and L1 cache
  • Each warp scheduler can issue two independent instructions from a single ready warp

Global memory
  • By default, memory requests referencing global memory only pass through the L2-level cache
  • The L2-level cache behaves essentially the same as the L2 cache of CC 2.x devices

Local memory
  • Local memory is divided into banks as on CC 2.x devices, but the bank width can be set to either 32 or 64 bits with the cudaDeviceSetSharedMemConfig() function
  • In the former case, local memory is divided into 32 banks so that consecutive 32-bit words belong to consecutive banks; in the latter, consecutive 64-bit words belong to consecutive banks
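
A minimal host-side sketch of the call mentioned above (the enum names are those of the CUDA runtime API):

#include <cuda_runtime.h>

int main() {
    // Select 64-bit shared memory banks; the default is 32-bit banks
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    // ... launch kernels whose local memory accesses are 64 bits wide ...
    return 0;
}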

Compute capability 5.x


Compute units
  • The CC 5.x compute unit contains:
    • 128 single precision PEs
    • Four double precision PEs
    • 4 warp schedulers
    • 24KB L1 cache
    • 64KB local (shared) memory
  • Each warp scheduler can issue one instruction from one of the ready warps

Global memory
  • By default, memory requests referencing global memory only pass through the L2-level cache
  • The L2-level cache behaves essentially the same as the L2 cache of CC 2.x devices

Local memory
  • Local memory is divided into 32 banks, so that consecutive 32-bit words belong to consecutive banks
  • Local memory behaves essentially the same as in CC 2.x devices

AMD's Implementation

Wavefront

  • AMD graphics cards execute kernel code in so-called wavefronts
  • One wavefront consists of 64 threads with consecutive index numbers:

Compute Units

Graphics Core Next Architecture

  • A GCN GPU's compute unit includes:
    • One scalar processing unit
    • Four vector units, each with 16 processing elements
    • L1 cache
    • Local memory / Local Data Share (LDS) (typically 64KB)

Radeon DNA

  • RDNA (Radeon DNA) is AMD's latest gaming-oriented GPU architecture.
  • See AMD's RDNA whitepaper for a detailed description.
  • As of writing, the newest version is RDNA3.

Wavefront Scheduling

  • All vector unit processing elements execute the same instruction in a clock cycle.
  • Typically, a vector unit executes an instruction of the same wavefront over four clock cycles, meaning all 64 threads are effectively executed together.
  • Different compute units and different vector units within a compute unit can execute different instructions in the same clock cycle.

  • The command stream can include scalar instructions for the scalar unit and vector instructions for the vector units.
  • Within a single clock cycle, the compute unit can issue a scalar instruction, a vector instruction, a memory operation, and a branch operation.
  • One vector unit executes a single vector instruction over four clock cycles, so vector instructions are interleaved among the different vector units.
  • Newer AMD GPUs also support wave32, i.e. shorter 32-thread wavefronts.
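
The warp/wavefront width of the device can be checked at runtime; a small sketch assuming the HIP runtime:

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    // typically 32 on NVIDIA GPUs and 64 on GCN/CDNA-based AMD GPUs
    std::printf("warp/wavefront size: %d\n", prop.warpSize);
    return 0;
}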


Architectures

  • AMD does not report compute capabilities for its GPU devices. Instead, one typically refers to the GPU architecture to learn more about the types of processing elements available, e.g. matrix cores.

  • From about 2012 to 2018, AMD GPUs were based on the GCN (Graphics Core Next) architecture.

  • The latest architectures are called RDNA and CDNA (Radeon DNA and Compute DNA).

    • The "DNA" part is sometimes expanded as Digital Next-generation Architecture, although AMD does not appear to use this expansion officially.
    • The latest versions of these architectures are currently RDNA3 and CDNA3.
    • RDNA is optimized for gaming, while CDNA is optimized for computing.
  • Compared to NVIDIA, AMD is unfortunately quite terse in its description of ROCm support and generally only outlines in broad terms what is supported.

  • So certain features of ROCm might or might not work on other AMD GPUs.

  • Some things can naturally be deduced, for example that features relying on matrix cores will not work on cards without said cores.

Utilization of Processing Elements

  • Different instructions require different numbers of clock cycles to execute:
Illustration

  • In practice, CPUs and GPUs execute instructions using pipelining, allowing them to start executing a new instruction every clock cycle:

  • Often, gaps known as "bubbles" are left in the pipeline.
Illustration of a situation where the 3rd command depends on the outcome of the 1st command

  • In the case of GPUs, the situation is even worse, because a bubble affects an entire warp/wavefront:

  • Therefore, it is important that the warp/wavefront scheduler can issue instructions to the processing elements' pipelines every clock cycle.
  • In practice, this is achieved by having significantly more threads than processing elements.
  • Additionally, instruction-level parallelism helps in this regard.
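
For instance (an illustrative sketch, not from the original material), a summation loop can keep two independent accumulators so that the second addition can be issued while the first is still in the pipeline; compilers may perform a similar transformation automatically:

__global__ void sum_ilp(double *res, const double *x, int n) {
    const int idx  = threadIdx.x + blockIdx.x * blockDim.x;
    const int step = gridDim.x * blockDim.x;

    double tmp0 = 0, tmp1 = 0;
    int i;
    for (i = idx; i + step < n; i += 2 * step) {
        tmp0 += x[i];          // these two additions do not
        tmp1 += x[i + step];   // depend on each other
    }
    if (i < n)
        tmp0 += x[i];          // handle a possible leftover element

    res[idx] = tmp0 + tmp1;    // res must have one element per thread
}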

Impact of Warps and Wavefronts

  • Warps and wavefronts execute code together, so all threads within them go through all required execution paths.
  • Below is a graph where the x-axis represents the probability of one thread diverging, and the y-axis represents the probability that the entire warp/wavefront must diverge:


  • Conditional statements can have a significant impact on kernel performance.
  • Therefore, strive to write your code so that threads within the same warp follow the same execution path.
  • For example, the following code snippets are effectively almost equally expensive:

Snippet 1

const int idx = threadIdx.x;
if(idx % 2 == 0)  // If the thread index is even
    y1 = x/d;     // Division operations are expensive
else
    y2 = 1.0;

Snippet 2

y1 = x/d;
y2 = 1.0;
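
If a branch cannot be avoided entirely, its cost can sometimes be reduced by arranging the condition so that all threads of a warp take the same path. A sketch in the spirit of Snippet 1 (illustrative, not from the original material), branching on the warp index instead of the thread index:

const int idx  = threadIdx.x;
const int warp = idx / 32;  // index of the warp within the workgroup
if (warp % 2 == 0)          // all 32 threads of a warp evaluate this identically
    y1 = x/d;               // even-numbered warps
else
    y2 = 1.0;               // odd-numbered warps

This of course changes which threads perform which computation, so it is only applicable when the work assignment can be rearranged accordingly.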

Global Memory Usage

An illustration of the GPU hardware global memory implementation

  • The bits of the memory address determine the cache line, memory controller, and memory bank (illustration):

  • Depending on the graphics card, memory is efficiently utilized by following these guidelines:
    • Threads within the same warp/wavefront should share as many cache lines as possible.
    • Threads within the same warp/wavefront should use the same memory controller
      • ⟹ different warps/wavefronts likely use different memory controllers.
    • Threads within the same warp/wavefront should use the same memory bank
      • ⟹ different warps/wavefronts likely use different memory banks.

  • Examples where the memory references of threads within a warp/wavefront spill over multiple cache lines:

  • The simplest solution is to design the kernel code so that the threads of the same warp/wavefront reference adjacent N-byte words starting from a memory address that is a multiple of 128 or 256:
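
As an illustration (a hedged sketch, not from the original material), the two kernels below differ only in their access pattern: in copy_coalesced the 32 threads of a warp touch 32 adjacent doubles, i.e. two 128-byte cache lines, whereas in copy_strided the same warp touches up to 32 different cache lines when the stride is large:

__global__ void copy_coalesced(double *dst, const double *src, int n) {
    const int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        dst[i] = src[i];           // adjacent threads access adjacent words
}

__global__ void copy_strided(double *dst, const double *src, int n, int stride) {
    const int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i * stride < n)
        dst[i] = src[i * stride];  // adjacent threads are stride words apart
}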

Example: Calculating a Sum


Attempt 1 (DO NOT FOLLOW THIS EXAMPLE!)

  • Attempting to calculate the sum of an array on the GPU
  • The array is divided into parts, with each thread calculating a partial sum
  • The final summation can be done on the CPU


  • A HIP kernel, where each thread calculates a partial sum and the partial sum results are stored in the res array (the size of the partial sum is given by the argument m):
__global__ void sum(double *res, double *x, int n, int m) {
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Calculate the boundaries for the partial sum for each thread
    const int begin = idx * m;
    const int end   = min(n, (idx+1) * m);

    // Each thread calculates the sum of m consecutive numbers
    double tmp = 0;
    for (int i = begin; i < end; i++)
        tmp += x[i];

    // Threads store their own partial sums in the res array
    res[idx] = tmp;
}
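
The host side is not shown in the original material; below is a hedged sketch of how the kernel might be launched and the final summation done on the CPU (assuming <hip/hip_runtime.h> and <vector> are included, n is the array size, and d_x and d_res are device buffers allocated with hipMalloc, with d_x already containing the input):

const int m         = 128;                     // elements summed by each thread
const int n_threads = (n + m - 1) / m;         // one partial sum per thread
const int wg_size   = 256;                     // workgroup size (illustrative)
const int n_groups  = (n_threads + wg_size - 1) / wg_size;

sum<<<n_groups, wg_size>>>(d_res, d_x, n, m);  // launch the kernel above

// Copy the partial sums back and finish the summation on the CPU
std::vector<double> res(n_groups * wg_size);
hipMemcpy(res.data(), d_res, res.size() * sizeof(double), hipMemcpyDeviceToHost);

double total = 0;
for (double r : res)
    total += r;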

  • Schematic diagram of the memory access pattern of the implementation (m = 8):


  • The threads belonging to the same warp now reference memory so that there are 8 \(\times\) 8 = 64 bytes between the references of two consecutive threads.

  • The entire warp's memory access thus scatters across 16 128-byte cache lines!

  • Program output when array size is 43435342 and m = 128:
$ make gpu2 && ./gpu2
g++ -std=c++11 -o gpu2 gpu2.cpp -lOpenCL
Time: 0.025615 s
Flops: 1.6957 GFlops
Sum value: 2527.55
Real value: 2527.55
Diff: 1.56433e-10

A better approach

  • Let's modify the code a bit:
__global__ void sum(double *res, double *x, int n) {
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;
	
    // The size of the global index space (number of threads)
    const int step = gridDim.x * blockDim.x;

    double tmp = 0;

    // Threads start from their respective index and jump by 
    // step elements in each iteration
    for (int i = idx; i < n; i += step)
        tmp += x[i];

    // Threads store their partial sums in the res array
    res[idx] = tmp;
}

  • Simplified schematic illustration of the memory reference structure (8 threads)


  • Threads belonging to the same warp now reference memory so that there are 8 bytes between two references:

  • Therefore, the entire warp memory access spreads across two 128-byte cache lines!
  • Note that the illustration has been simplified!

  • Output of the test program when the size of the array is 43435342 and m = 128:
$ make gpu3 && ./gpu3
g++ -std=c++11 -o gpu3 gpu3.cpp -lOpenCL
Time: 0.00473809 s
Flops: 9.16726 GFlops
Sum value: 4544.81
Real value: 4544.81
Diff: 5.63887e-10
  • The implementation is approximately 5.4 times faster than the first attempt!

About Local Memory Usage

  • Local memory is typically divided into 32 banks, so that consecutive 32/64-bit words belong to consecutive banks.
  • Each bank can serve one memory request at a time (illustration):


  • Memory requests referencing the same memory bank (but different 32/64-bit words) are processed sequentially:

  • The entire warp/wavefront must wait for queued up threads!
  • Situations where the threads of the same warp/wavefront access adjacent 32/64-bit words are usually optimal from the memory banks' point of view (see the sketch below).
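
A small sketch of the difference inside a kernel (illustrative, not from the original material), assuming 32 banks that are 32 bits wide:

// inside a __global__ kernel:
__shared__ float buff[32 * 32];

// Conflict free: thread k of a warp reads buff[k]; consecutive 32-bit words
// map to consecutive banks, so the 32 requests hit 32 different banks.
float a = buff[threadIdx.x];

// 32-way bank conflict: thread k reads buff[32 * k]; all 32 references map
// to the same bank, so they are served one after another.
float b = buff[32 * threadIdx.x];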

Example: Calculating a Sum in Local Memory

  • An array in local memory can be efficiently summed using the following memory reference structure:
The blue dashed line corresponds to the synchronization point

  • Let's go through the previous illustration in more detail.
  • Assume the size of the array is n, and the size of the workgroup is greater than or equal to n.
  • The array is divided into two parts: [0, m-1] and [m, n-1].
  • The value of m is chosen so that
    • the parts are of equal size when n is even, and
    • the first part is one element larger when n is odd.
Illustration

  • Next, threads compute sums such that a thread with a local index k sums elements k and m+k:
Illustration

  • Finally, thread k stores the result in memory location k:
Illustration

  • Notice that we now have a new array to sum, of size m ≈ n/2.
  • We can repeat the same pairwise summation as long as we synchronize the threads first!
  • Otherwise, some threads might still be performing summations from the previous round while others are already performing the next iteration.

  • The entire array can be efficiently summed using the technique described above:
The blue dashed line corresponds to the synchronisation point
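
A minimal sketch of the pairwise reduction described above (assuming the data is already in local memory and the workgroup has at least n threads; this is not the course's reference implementation):

__device__ void local_sum(double *buff, int n) {
    const int k = threadIdx.x;  // local thread index
    while (n > 1) {
        const int m = (n + 1) / 2;   // the first part is one element larger when n is odd
        if (k + m < n)
            buff[k] += buff[k + m];  // thread k sums elements k and m + k
        __syncthreads();             // synchronize before the next round
        n = m;
    }
    // buff[0] now contains the sum of the original n elements
}

In practice this routine would be called from within a kernel after the data has been loaded into local memory.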

Combining Global and Local Summation

  • The summation techniques described above can be combined such that threads first compute partial sums in global memory, after which threads belonging to the same workgroup compute their partial sums together in local memory:
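
The corresponding source code is not included here; below is a hedged sketch of how the combination might look, assuming a power-of-two workgroup size of 256 and the grid-stride pattern from the previous kernel:

#define WG_SIZE 256  // workgroup size, assumed to be a power of two

__global__ void sum(double *res, const double *x, int n) {
    __shared__ double buff[WG_SIZE];  // partial sums of this workgroup
    const int idx  = threadIdx.x + blockIdx.x * blockDim.x;
    const int step = gridDim.x * blockDim.x;

    // Phase 1: each thread computes a partial sum over global memory
    double tmp = 0;
    for (int i = idx; i < n; i += step)
        tmp += x[i];
    buff[threadIdx.x] = tmp;
    __syncthreads();

    // Phase 2: pairwise reduction of the partial sums in local memory
    for (int m = WG_SIZE / 2; m > 0; m /= 2) {
        if (threadIdx.x < m)
            buff[threadIdx.x] += buff[threadIdx.x + m];
        __syncthreads();
    }

    // One thread per workgroup stores the workgroup's partial sum
    if (threadIdx.x == 0)
        res[blockIdx.x] = buff[0];
}

Each workgroup now produces one value in res, so the remaining array (one element per workgroup) can be summed on the CPU or, as described in the next section, entirely on the GPU.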


  • Printout of the test program when the array size is 43435342:
$ make gpu4 && ./gpu4
c++ -std=c++11 -o gpu4 gpu4.cpp -lOpenCL  
Time: 0.00255108 s
Flops: 17.0263 GFlops
Sum value: 1440.41
Real value: 1440.41
Diff: 1.11868e-10
  • So the implementation is about 1.9 times faster than the previous implementation!


Calculating the sum entirely on the GPU

  • The sum can be calculated entirely on the GPU side using two kernels (alternatively, a single kernel using atomic operations; see the sketch at the end of this example):

  • Tip: The workgroup size and the number of workgroups should be chosen so that the number of workgroups in the first kernel equals the workgroup size of the second kernel. For example:
ydin1<<<WG_SIZE_2, WG_SIZE_1>>>(...);
...
ydin2<<<1, WG_SIZE_2>>>(...);
  • In this case, the first kernel computes WG_SIZE_2 \(\times\) WG_SIZE_1 partial sums, which are reduced in local memory down to WG_SIZE_2 partial sums
  • The second kernel can then efficiently compute the remaining sum using a single workgroup!

  • Printout of the test program when the array size is 43435342:
$ make gpu5 && ./gpu5
c++ -std=c++11 -o gpu5 gpu5.cpp -lOpenCL                                  
Time: 0.00261402 s
Flops: 16.6163 GFlops
Sum value: -8903.63
Real value: -8903.63
Diff: 1.81899e-12
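
As mentioned above, the second kernel can also be avoided altogether by letting each workgroup add its partial sum to a single result with an atomic operation. A hedged sketch of this variant (note that atomicAdd for doubles requires CC 6.0 or newer on NVIDIA hardware, and *result must be zeroed before the launch):

__global__ void sum_atomic(double *result, const double *x, int n) {
    __shared__ double buff[256];  // assumes a workgroup size of 256 (a power of two)
    const int idx  = threadIdx.x + blockIdx.x * blockDim.x;
    const int step = gridDim.x * blockDim.x;

    // Partial sums over global memory, as before
    double tmp = 0;
    for (int i = idx; i < n; i += step)
        tmp += x[i];
    buff[threadIdx.x] = tmp;
    __syncthreads();

    // Reduction in local memory, as before
    for (int m = blockDim.x / 2; m > 0; m /= 2) {
        if (threadIdx.x < m)
            buff[threadIdx.x] += buff[threadIdx.x + m];
        __syncthreads();
    }

    // One atomic addition per workgroup replaces the second kernel
    if (threadIdx.x == 0)
        atomicAdd(result, buff[0]);
}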

Summary

  • Utilize many threads and structure your code to leverage instruction-level parallelism on the GPU.
  • Avoid scenarios where threads within the same warp/wavefront access memory in a scattered manner.
  • Especially avoid situations where there are power-of-two byte gaps between references!
  • Avoid situations where execution paths of threads within the same warp/wavefront diverge.
