More on GPU programming, Part 2
Number of Threads and Workgroup Size
- Choosing the number of threads involves several conflicting goals:
- The number of threads should be sufficiently large to keep processing elements busy.
- However, too many threads can introduce unnecessary overhead.
- Not all threads can be accommodated simultaneously due to limited resources (registers, etc.), so some threads may need to queue. Are these queued threads beneficial?
- The number of threads should align with the nature of the program (number of parallel tasks vs. number of threads).
- Choosing the workgroup size involves several considerations:
- The workgroup size should be a multiple of the warp/wavefront size.
- Launching a workgroup incurs overhead, so larger workgroups are generally better (fewer workgroup launches).
- Threads within the same workgroup can communicate with each other, reducing the need for global communication.
- Internal synchronization within a workgroup causes more overhead if the workgroup size is large (more threads waiting on barriers).
- Additional factors influencing workgroup size:
- Resources allocated by threads within the same workgroup (registers, local memory) restrict the maximum size of a workgroup.
- A workgroup that consumes many resources can prevent other workgroups from utilizing computing units concurrently. Therefore, selecting the workgroup size is crucial to effectively utilize processing units.
- The number of workgroups should align with the nature of the program (number of independent parallel tasks vs. number of workgroups).
- In OpenCL, the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE query parameter allows asking for a suitable multiple for the workgroup size (link).
- CUDA provides the cudaOccupancyMaxPotentialBlockSize() and cudaOccupancyMaxPotentialBlockSizeVariableSMem() functions for heuristic selection of the workgroup size (link); a short sketch follows after this list.
- HIP similarly provides hipOccupancyMaxPotentialBlockSize() and other related functions (link).
- The GPU manufacturers' documentation offers valuable insights.
- Various profiling tools from the GPU manufacturers can also be helpful.
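- As a concrete illustration, the following CUDA sketch queries a heuristic block size with the occupancy API (the HIP call has the same form); the axpy kernel used here is only a placeholder:
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, used only so that there is something to query the occupancy for
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    int minGridSize = 0; // Minimum grid size needed to reach full occupancy
    int blockSize = 0;   // Suggested block (workgroup) size
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, axpy);
    // HIP: hipOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, axpy);

    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize; // Enough blocks to cover n elements
    printf("suggested block size = %d, grid size = %d\n", blockSize, gridSize);
    return 0;
}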
More on Using Global Memory
Using a Two-Dimensional Array
- An improperly allocated two-dimensional array can hurt program performance.
- For instance, the end of one row and the beginning of the next row might belong to the same cache line.
- The problem can be solved by allocating extra space at the end of each row.
- This solution wastes some memory, but the speed gain achieved could be worth it.
- The amount of extra space can be chosen so that each row of the array starts at the beginning of a cache line.
- This is called pitched memory or pitched linear memory.
- Caution: Robert Crovella from NVIDIA remarks in a 2017 thread on the CUDA forums that pitched memory was more important on older GPUs, that newer devices rely less on it, and that the extra arithmetic involved in pitching the memory can sometimes even be detrimental these days.
- As always, it makes sense to test whether your application benefits from it or not.
OpenCL
- OpenCL does not provide a ready-made tool for allocating a two-dimensional array.
- The simplest approach is to allocate the array so that the length of each row is a multiple of the warp/wavefront size:
#define CEIL(x,y) (((x)+(y)-1)/(y))        // x/y rounded up
#define ROUND_TO_NEXT(x,n) ((n)*CEIL(x,n)) // Rounds up to the next multiple of n
#define WARPFRONT 32                       // NVIDIA GPU
cl_uint height = 125;
cl_uint width = 432;
// Compute the next multiple of 32, in this case 448
cl_uint ldf = ROUND_TO_NEXT(width, WARPFRONT);
// Allocate a height x ldf int array
cl::Buffer deviceBuffer(context, CL_MEM_READ_WRITE, height*ldf*sizeof(int));
// Enqueue the kernel in the command queue
kernel(cl::EnqueueArgs(...), deviceBuffer, height, width, ldf);
- The variable ldf (row length in elements) is passed to the kernel, which uses it to process the array:
__kernel void kernel(__global int *buffer, uint height, uint width, uint ldf) {
...
int value = buffer[i*ldf+j]; // Same as buffer[i][j]
...
}
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE
- The cache line length can also be queried with the cl::Device::getInfo member function, in which case the row length of the array can be set to a multiple of the cache line length:
#define CEIL(x,y) (((x)+(y)-1)/(y))        // x/y rounded up
#define ROUND_TO_NEXT(x,n) ((n)*CEIL(x,n)) // Rounds up to the next multiple of n
cl_uint cacheLine;
device.getInfo(CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, &cacheLine);
if(cacheLine < 1)
    cacheLine = 1;
cl_uint height = 125;
cl_uint width = 432;
// Compute the next multiple of the cache line length
cl_uint ldf = ROUND_TO_NEXT(width*sizeof(int), cacheLine);
// Allocate an array in which each row is ldf bytes long
cl::Buffer deviceBuffer(context, CL_MEM_READ_WRITE, height*ldf);
kernel(cl::EnqueueArgs(...), deviceBuffer, height, width, ldf);
- In the kernel source code, note that the variable ldf now conveys the row length in bytes. Also note that the buffer is passed as a char pointer.
__kernel void kernel(__global char *buffer, uint height, uint width, uint ldf) {
...
// Compute the start of the row; buffer[i][*]
__global int *line = (__global int*)(buffer + i*ldf);
int value1 = line[j]; // Same as buffer[i][j]
int value2 = *((__global int*)(buffer + i*ldf) + j); // Same as buffer[i][j]
...
}
HIP and CUDA
- HIP provides the hipMallocPitch function for allocating two-dimensional arrays:
- CUDA provides the cudaMallocPitch function for allocating two-dimensional arrays:
hipError_t hipMallocPitch(
    void **devPtr, // Pointer to the beginning of the array
    size_t *ldf,   // Optimal value for the row pitch in bytes
    size_t width,  // Width of the array in _bytes_
    size_t height  // Height of the array in rows
);
// cudaError_t cudaMallocPitch(void **devPtr, size_t *ldf, size_t width, size_t height);
size_t height = 125;
size_t width = 432;
size_t ldf;
char* deviceBuffer;
// Allocate the array; each row has a pitch of ldf bytes
hipMallocPitch((void**)&deviceBuffer, &ldf, width * sizeof(int), height);
//cudaMallocPitch((void**)&deviceBuffer, &ldf, width * sizeof(int), height);
kernel<<<...>>>(deviceBuffer, height, width, ldf);
Without the cast to void **, you might experience an error of the type
error: no matching function for call to 'hipMallocPitch'
hipMallocPitch(&deviceBuffer, &pitch, w * sizeof(int), h);
^~~~~~~~~~~~~~
/.../hip/hip_runtime_api.h:3172:12: note: candidate function not viable: no known
conversion from 'int **' to 'void **' for 1st argument
hipError_t hipMallocPitch(void** ptr, size_t* pitch, size_t width, size_t height);
in which case casting the first argument to void **, as in the example above, fixes the call.
- In HIP, transfers from host memory to global memory can be conveniently done using the hipMemcpy2DAsync function:
- In CUDA, the corresponding function is cudaMemcpy2DAsync:
int *hostBuffer;
hipMallocHost((void**)&hostBuffer, height * width * sizeof(int));
//cudaMallocHost((void**)&hostBuffer, height * width * sizeof(int));
// --- Other synonyms ---
// hipHostMalloc(...)
// cudaHostAlloc(...)
// Fill hostBuffer with data here...
hipError_t err = hipMemcpy2DAsync(
deviceBuffer, // Destination
ldf, // Pitch in bytes for the destination
hostBuffer, // Source
width * sizeof(int), // Pitch in bytes for the source
width * sizeof(int), // Width of the array in bytes
height, // Height of the array
hipMemcpyHostToDevice, // Direction of transfer
0); // Stream
// err = cudaMemcpy2DAsync(...);
- In the kernel source code, note that the ldf variable conveys the row pitch in bytes:
__global__ void kernel(char *buffer, size_t height, size_t width, size_t ldf) {
...
int *line = (int*)(buffer + i * ldf); // Calculate the row
int value1 = line[j]; // Same as buffer[i][j]
int value2 = *((int*)(buffer + i * ldf) + j); // Same as buffer[i][j]
...
}
// Alternatively
__global__ void kernel(int *buffer, size_t height, size_t width, size_t ldf) {
...
size_t stride = ldf/sizeof(int);
int *line = &buffer[i*stride]; // Calculate the row
int value1 = line[j]; // Same as buffer[i][j]
int value2 = buffer[i*stride + j]; // Same as buffer[i][j]
...
}
Three-Dimensional Arrays
- HIP provides the hipMalloc3D function for allocating three-dimensional arrays:
- CUDA provides the cudaMalloc3D function for allocating three-dimensional arrays:
hipError_t hipMalloc3D(
struct hipPitchedPtr *pitchedDevPtr, // Allocated 3D array
struct hipExtent extent // Size information of the array
);
cudaError_t cudaMalloc3D(
struct cudaPitchedPtr *pitchedDevPtr, // Allocated 3D array
struct cudaExtent extent // Size information of the array
);
Here are the relevant data structures.
Note that even though the members are the same, the CUDA and HIP structs are distinct types and cannot be mixed.
struct cudaPitchedPtr {
    void * ptr;   // Pointer to the beginning of the array
    size_t pitch; // Pitch in bytes for one row of the array
    size_t xsize;
    size_t ysize;
};
struct hipPitchedPtr {
    void * ptr;   // Pointer to the beginning of the array
    size_t pitch; // Pitch in bytes for one row of the array
    size_t xsize;
    size_t ysize;
};
// HIP
struct hipExtent {
    size_t width;
    size_t height;
    size_t depth;
};
// CUDA
struct cudaExtent {
    size_t width;
    size_t height;
    size_t depth;
};
Both CUDA and HIP provide the make_cudaExtent and make_hipExtent helper functions with the same syntax:
// HIP
hipExtent make_hipExtent(
size_t w, // Width of the array in bytes
size_t h, // Height of the array in elements
size_t d); // Depth of the array in elements
// CUDA
cudaExtent make_cudaExtent(
size_t w, // Width of the array in bytes
size_t h, // Height of the array in elements
size_t d); // Depth of the array in elements
- The following example allocates a 54 × 23 × 5 array and calls the kernel:
size_t width = 54, height = 23, depth = 5;
// Create a cudaExtent struct for allocating the 3D array
cudaExtent extent = make_cudaExtent(width * sizeof(int), height, depth);
// Allocate the 3D array based on the created cudaExtent
cudaPitchedPtr pitchPointer;
cudaMalloc3D(&pitchPointer, extent);
// Call the kernel
kernel<<<...>>>(pitchPointer, width, height, depth);
__global__ void kernel(
cudaPitchedPtr pitchPointer, size_t width, size_t height, size_t depth) {
...
// Calculate the start of the row (i,j,*)
int *line = (int*)(
(char*)pitchPointer.ptr + (i * pitchPointer.pitch) + (j * pitchPointer.pitch * height));
int value = line[k]; // Same as buffer[i][j][k]
}
Note
- Note that 2D and 3D arrays can still be handled as one-dimensional arrays.
- Thread group indexing can be utilized for array handling:
__global__ void kernel(float *A, int height, int width) {
const int i = blockIdx.x;  // The grid contains height workgroups
const int j = threadIdx.x; // Each workgroup contains at least width threads
float value = 0.0;
if (j < width && i < height)
value = A[i * width + j]; // Each thread reads one element from the array
...
}
Splitting Memory Buffers into Sub-Buffers
- Pointers to HIP or CUDA GPU memory behave like normal pointers.
- Pointer arithmetic can be used to create sub-buffers:
int *buffer;
hipMalloc(&buffer, N*sizeof(int));
//cudaMalloc(&buffer, N*sizeof(int));
// Sub-buffer starting at index N/3
int *subBuffer = buffer + N/3;
kernel<<<...>>>(buffer, ...); // Perform operation on the entire buffer
kernel<<<...>>>(subBuffer, ...); // Perform operation on part of the buffer
OpenCL
- In OpenCL, sub-buffers must be created with the createSubBuffer member function of the cl::Buffer class:
cl::Buffer cl::Buffer::createSubBuffer(
cl_mem_flags flags,
cl_buffer_create_type buffer_create_type,
const void *buffer_create_info,
cl_int *err = NULL)
- Example:
cl::Buffer buffer(context, CL_MEM_READ_WRITE, N*sizeof(int));
cl_buffer_region region;
region.origin = (N/3)*sizeof(int); // The sub-buffer starts at element index N/3 (the offset is given in bytes)
region.size = (N/3)*sizeof(int);   // The sub-buffer contains N/3 elements (the size is given in bytes)
// Note: origin must respect the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN alignment
cl::Buffer subBuffer = buffer.createSubBuffer(
    CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, &region);
Transfers RAM <-> VRAM
- Speeds of different types of memory transfers:
- Local memory: \(\mathcal{O}(10^{12})\) B/s
- Global memory: \(\mathcal{O}(10^{11})\) B/s
- RAM memory: \(\mathcal{O}(10^{10})\) B/s
- RAM-VRAM transfer: \(\mathcal{O}(10^9)\) B/s
- Transferring data from RAM to the GPU's global memory is very costly and should be avoided whenever possible!
- In the case of simple computation operations (e.g., axpy), most of the time may be spent transferring data between RAM and global memory.
- Transferring data back and forth between computations also slows down the computation significantly.
- Therefore, try to:
- Transfer all necessary data to global memory at once,
- Perform as much computation as possible, and
- Transfer the results back to RAM only at the end (see the sketch below).
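- A minimal HIP sketch of this pattern (the step kernel and the iteration count are placeholders): the data is copied to the device once, many kernels are run on it, and only the final result is copied back:
#include <hip/hip_runtime.h>

// Placeholder kernel; in a real application this would be the actual computation
__global__ void step(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *hostData = new float[n]();
    float *deviceData;
    hipMalloc((void**)&deviceData, n * sizeof(float));

    // 1. Transfer all necessary data to global memory at once
    hipMemcpy(deviceData, hostData, n * sizeof(float), hipMemcpyHostToDevice);

    // 2. Perform as much computation as possible without touching RAM
    for (int iter = 0; iter < 100; iter++)
        step<<<(n + 255) / 256, 256>>>(deviceData, n);

    // 3. Transfer the results back to RAM only at the end
    hipMemcpy(hostData, deviceData, n * sizeof(float), hipMemcpyDeviceToHost);

    hipFree(deviceData);
    delete[] hostData;
    return 0;
}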
Page-Locked / Pinned Memory
- Allocating page-locked or pinned memory significantly speeds up data transfers because the PCIe bus's DMA controller can operate more efficiently!
- Reminder: In HIP, pinned memory can be allocated and freed using the hipHostMalloc and hipHostFree functions.
- Reminder: In CUDA, pinned memory can be allocated and freed using the cudaMallocHost and cudaFreeHost functions (a short sketch follows below).
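- A short sketch of the allocation and deallocation calls (HIP shown, CUDA equivalents in comments):
#include <hip/hip_runtime.h>

int main() {
    const int N = 1 << 20;
    int *hostBuffer;

    // Allocate pinned (page-locked) host memory
    hipHostMalloc((void**)&hostBuffer, N * sizeof(int), hipHostMallocDefault);
    // CUDA: cudaMallocHost(&hostBuffer, N * sizeof(int));

    // ... use hostBuffer as the source/destination of RAM <-> VRAM transfers ...

    // Free the pinned memory
    hipHostFree(hostBuffer);
    // CUDA: cudaFreeHost(hostBuffer);
    return 0;
}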
- The OpenCL specification does not impose any requirements on the use of pinned memory.
- AMD's implementation handles large data transfers by pinning the host-side memory area in chunks during the transfer.
- NVIDIA's OpenCL implementation is better documented, but it probably works in the same way.
Memory Mapping
- Memory mapping simplifies the program code and can improve performance in some cases.
- In practice, the CPU and GPU share (at least temporarily) the same memory area.
- Mapping allows data to be transferred only when it is needed.
Principle
- Before memory mapping, the CPU and GPU memory spaces are physically and logically separate.
- Memory mapping unifies the CPU and GPU memory spaces at a logical level.
- The CPU can write the data used for computation to the mapped memory area.
- The GPU can access the contents of the mapped memory area once the kernel is launched.
- The GPU can store the computation result into the mapped memory area.
- The CPU can see the contents of the mapped memory area once the kernel execution is complete.
Practical Implementation
- In practice, the data used for computation may be physically transferred to global memory before the kernel starts.
- In principle, it's also possible that the data is transferred physically only when the GPU needs it for computation.
- Similarly, the computation result can be physically transferred to RAM either all at once or in pieces when the CPU needs parts of the data:
OpenCL
- OpenCL offers three different ways to use mapped memory:
- a CL_MEM_ALLOC_HOST_PTR flag provided when creating the cl::Buffer object,
- a CL_MEM_USE_HOST_PTR flag provided when creating the cl::Buffer object,
- a cl::Buffer object created without additional flags.
- AMD extends the selection with its own CL_MEM_USE_PERSISTENT_MEM_AMD flag.
CL_MEM_ALLOC_HOST_PTR
- Passing the CL_MEM_ALLOC_HOST_PTR flag to the cl::Buffer constructor tells the OpenCL implementation to allocate the memory so that it can be accessed by the host:
cl::Buffer deviceBuffer = cl::Buffer(context,
CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, N*sizeof(int));
- In AMD's OpenCL implementation, the data associated with the mapping is physically located in pinned RAM.
- Memory mapping is done with the enqueueMapBuffer member function of the command queue:
void * cl::CommandQueue::enqueueMapBuffer(
const Buffer& buffer,
cl_bool blocking_map,
cl_map_flags flags,
::size_t offset,
::size_t size,
const VECTOR_CLASS<Event> *events = NULL,
Event *event = NULL,
cl_int *err = NULL);
- The return value is the address of the start of the mapped memory area in the host memory space.
- Setting the blocking_map flag to CL_TRUE makes the mapping operation blocking.
- The flags argument specifies whether the memory buffer is mapped for reading (CL_MAP_READ) or writing (CL_MAP_WRITE).
- Next, we map deviceBuffer into the host memory space using a blocking map operation so that the CPU can write to the mapped memory area:
int *hostPointer = 0;
hostPointer = (int*) queue.enqueueMapBuffer(
deviceBuffer, CL_TRUE, CL_MAP_WRITE, 0, N*sizeof(int));
- The host program can now write to the mapped memory area (hostPointer):
for(int i = 0; i < N; i++)
hostPointer[i] = checkBuffer[i];
- The memory area must be unmapped before the kernel can access the data shared through the mapping.
- Unmapping is done with the enqueueUnmapMemObject member function of the command queue:
cl_int cl::CommandQueue::enqueueUnmapMemObject(
const Memory &memory,
void *mapped_ptr,
const VECTOR_CLASS<Event> *events = NULL,
Event *event = NULL)
- So we unmap the previously created mapping, in which deviceBuffer was mapped into the host program's memory space (hostPointer):
queue.enqueueUnmapMemObject(deviceBuffer, hostPointer);
- After unmapping, we can call the kernel normally:
kernel(cl::EnqueueArgs(...), deviceBuffer, ...);
- Next, we enqueue a blocking command that maps deviceBuffer into the host program's memory space so that the host program can read from the mapped memory area:
hostPointer = (int*) queue.enqueueMapBuffer(
deviceBuffer, CL_TRUE, CL_MAP_READ, 0, N*sizeof(int));
- After mapping, the host program can, for example, check the correctness of the result computed by the GPU:
int rightValues = 0;
for(int i = 0; i < N; i++)
if(hostPointer[i] == checkBuffer[i]+1)
rightValues++;
- Finally, we need to remember to unmap:
queue.enqueueUnmapMemObject(deviceBuffer, hostPointer);
CL_MEM_USE_HOST_PTR
- The memory area on the host side to be used for mapping can also be reserved in advance:
int *hostBuffer = new int[N];
- In this case, hostBuffer must be allocated so that the starting address of the memory area is a multiple of the size of the data type used on the OpenCL device side! The pointer returned by the new operator is usually already suitably aligned for the basic data types. In other cases, it may be necessary to apply some pointer arithmetic.
- The pre-allocated memory buffer can now be handed over to the OpenCL implementation when creating the cl::Buffer object:
cl::Buffer deviceBuffer = cl::Buffer(context,
CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, N*sizeof(int), hostBuffer);
- The CL_MEM_USE_HOST_PTR flag indicates that the memory area referenced by the hostBuffer pointer can be freely used by the OpenCL implementation for mapping operations. The buffer cannot be used for any other purpose!
- The command queue's enqueueMapBuffer member function typically returns the same address that was given when the cl::Buffer object was created. However, the OpenCL specification does not guarantee this, so it is always best to store the address of the mapped memory area in a separate pointer variable:
int *hostPointer = 0;
hostPointer = (int*) queue.enqueueMapBuffer(
deviceBuffer, CL_TRUE, CL_MAP_WRITE, 0, N*sizeof(int));
cl::Buffer object created without extra flags
- A mapped cl::Buffer can also be created without the CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR flags.
- In AMD's OpenCL implementation, unmapping then triggers a transfer operation between RAM and global memory (see the sketch below).
HIP and CUDA
Memory mapping requires synchronization between the CPU and GPU executions.
In the following example, synchronization is achieved using a stream:
// HIP
hipStream_t stream;
hipStreamCreate(&stream);
// CUDA
cudaStream_t stream;
cudaStreamCreate(&stream);
- Synchronization could also be handled without a stream using the hipDeviceSynchronize() or cudaDeviceSynchronize() function.
- Initially, we allocate pinned memory from the host program's memory space:
int *hostPointer;
// HIP
hipHostMalloc(&hostPointer, N*sizeof(int), hipHostMallocMapped);
// CUDA
cudaMallocHost(&hostPointer, N*sizeof(int), cudaHostAllocMapped);
- The hipHostMallocMapped flag tells HIP that the allocated memory area is visible to the device.
- The cudaHostAllocMapped flag tells CUDA that the allocated memory area is visible to the device.
- The host program can write the data used for computation to the allocated memory area as usual:
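- For example (a trivial fill; hostPointer and N come from the allocation above):
// Write the input data through the mapped pointer, just like to any host array
for (int i = 0; i < N; i++)
    hostPointer[i] = i;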
- Synchronization is not mandatory at this stage as long as the kernel is not queued on the stream before the host program has written all the required data to the mapped memory area.
- For older GPUs (Compute Capability < 2.0, no UVA support), we first need to obtain the address of the mapped memory area on the GPU:
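- A minimal sketch of the lookup, using the standard cudaHostGetDevicePointer runtime call:
int *devicePointer;
cudaHostGetDevicePointer((void**)&devicePointer, hostPointer, 0);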
The same function exists in HIP (hipHostGetDevicePointer).
It is possible that the returned pointer is the same as hostPointer. The 0 argument at the end is mandatory; it is simply a placeholder for potential flags in future versions of CUDA and HIP.
- Now, the kernel can be enqueued using the obtained devicePointer:
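- For example (the kernel name, its arguments, and the launch configuration are placeholders):
// The kernel accesses the mapped memory through devicePointer
kernel<<<gridSize, blockSize, 0, stream>>>(devicePointer, N);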
- For newer GPUs, we can directly use hostPointer:
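- For example (same placeholder kernel and launch configuration as above):
// With unified virtual addressing, the mapped host pointer is valid on the device as well
kernel<<<gridSize, blockSize, 0, stream>>>(hostPointer, N);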
- Next, the host program needs to synchronize its execution with the GPU; otherwise the data behind hostPointer would not necessarily be ready for use yet.
- After synchronization, the host program can use hostPointer normally (see the sketch below).
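- A sketch of these last two steps, assuming the kernel was launched on stream and adds 1 to each element (checkBuffer is a placeholder host-side reference array):
// Wait until all work queued on the stream has finished
hipStreamSynchronize(stream);
// CUDA: cudaStreamSynchronize(stream);

// The results are now visible through the mapped pointer
int rightValues = 0;
for (int i = 0; i < N; i++)
    if (hostPointer[i] == checkBuffer[i] + 1)
        rightValues++;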
Using Multiple Streams
- Some of the time spent on transfers between RAM and GPU memory can be hidden by using multiple streams (see the sketch below).
- This way, the GPU's processing elements can compute data in one stream while the DMA controller transfers data in another stream.
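- A HIP sketch of the idea: the input is split into chunks, and each chunk's host-to-device copy, kernel launch, and device-to-host copy are queued on their own stream, so copies and computation from different chunks can overlap (the step kernel and the chunk count are placeholders):
#include <hip/hip_runtime.h>

// Placeholder kernel representing the actual computation
__global__ void step(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main() {
    const int nStreams = 4, chunk = 1 << 20, n = nStreams * chunk;

    int *hostData, *deviceData;
    // Pinned host memory is required for truly asynchronous copies
    hipHostMalloc((void**)&hostData, n * sizeof(int), hipHostMallocDefault);
    hipMalloc((void**)&deviceData, n * sizeof(int));

    hipStream_t streams[nStreams];
    for (int s = 0; s < nStreams; s++)
        hipStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; s++) {
        int offset = s * chunk;
        // Copy this chunk to the device in its own stream
        hipMemcpyAsync(deviceData + offset, hostData + offset, chunk * sizeof(int),
                       hipMemcpyHostToDevice, streams[s]);
        // Compute on the chunk in the same stream
        step<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(deviceData + offset, chunk);
        // Copy the results back, still in the same stream
        hipMemcpyAsync(hostData + offset, deviceData + offset, chunk * sizeof(int),
                       hipMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; s++) {
        hipStreamSynchronize(streams[s]);
        hipStreamDestroy(streams[s]);
    }
    hipHostFree(hostData);
    hipFree(deviceData);
    return 0;
}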