More on GPU programming, Part 2
Number of Threads and Workgroup Size
- Choosing the number of threads involves several conflicting goals:
- The number of threads should be sufficiently large to keep processing elements busy.
- However, too many threads can introduce unnecessary overhead.
- Not all threads can be accommodated simultaneously due to limited resources (registers, etc.), so some threads may need to queue. Are these queued threads beneficial?
- The number of threads should align with the nature of the program (number of parallel tasks vs. number of threads).
- Choosing the workgroup size involves several considerations:
- The workgroup size should be a multiple of the warp/wavefront size.
- Launching a workgroup incurs overhead, so larger workgroups are generally better (fewer workgroup launches).
- Threads within the same workgroup can communicate with each other, reducing the need for global communication.
- Internal synchronization within a workgroup causes more overhead if the workgroup size is large (more threads waiting on barriers).
- Additional factors influencing workgroup size:
- Resources allocated by threads within the same workgroup (registers, local memory) restrict the maximum size of a workgroup.
- A workgroup that consumes many resources can prevent other workgroups from utilizing computing units concurrently. Therefore, selecting the workgroup size is crucial to effectively utilize processing units.
- The number of workgroups should align with the nature of the program (number of independent parallel tasks vs. number of workgroups).
- In OpenCL, the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE query parameter allows asking for a suitable multiple for the workgroup size (link).
- CUDA provides the cudaOccupancyMaxPotentialBlockSize() and cudaOccupancyMaxPotentialBlockSizeVariableSMem() functions for heuristic selection of the workgroup size (link); a short sketch follows after this list.
- HIP similarly provides hipOccupancyMaxPotentialBlockSize() and other related functions (link).
- The GPU manufacturers' documentation offers valuable insights.
- Various profiling tools from the GPU manufacturers can also be helpful.
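- As a concrete illustration, the following CUDA sketch queries a heuristic block size with the occupancy API (the HIP call has the same form); the axpy kernel used here is only a placeholder:
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, used only so that there is something to query the occupancy for
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    int minGridSize = 0; // Minimum grid size needed to reach full occupancy
    int blockSize = 0;   // Suggested block (workgroup) size
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, axpy);
    // HIP: hipOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, axpy);

    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize; // Enough blocks to cover n elements
    printf("suggested block size = %d, grid size = %d\n", blockSize, gridSize);
    return 0;
}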
More on Using Global Memory
Using a Two-Dimensional Array
- An improperly allocated two-dimensional array can hurt program performance.
- For instance, the end of one row and the beginning of the next row might belong to the same cache line.
- The problem can be solved by allocating extra space at the end of each row.
- This solution wastes some memory, but the speed gain achieved could be worth it.
- The amount of extra space can be chosen so that each row of the array starts at the beginning of a cache line.
- This is called pitched memory or pitched linear memory.
- Caution: Robert Crovella from NVIDIA remarks in a 2017 thread on the CUDA forums that pitched memory was more important on older GPUs, that newer devices rely less on it, and that the extra arithmetic involved in pitching the memory can sometimes even be detrimental these days.
- As always, it makes sense to test whether your application benefits from it or not.
OpenCL
- OpenCL does not provide a ready-made tool for allocating a two-dimensional array.
- The simplest approach is to allocate the array so that the length of each row is a multiple of the warp/wavefront size:
#define CEIL(x,y) (((x)+(y)-1)/(y))        // x/y rounded up
#define ROUND_TO_NEXT(x,n) ((n)*CEIL(x,n)) // Rounds up to the next multiple of n
#define WARPFRONT 32                       // NVIDIA GPU
cl_uint height = 125;
cl_uint width = 432;
// Compute the next multiple of 32, in this case 448
cl_uint ldf = ROUND_TO_NEXT(width, WARPFRONT);
// Allocate a height x ldf int array
cl::Buffer deviceBuffer(context, CL_MEM_READ_WRITE, height*ldf*sizeof(int));
// Enqueue the kernel in the command queue
kernel(cl::EnqueueArgs(...), deviceBuffer, height, width, ldf);
- The variable ldf (row length in elements) is passed to the kernel, which uses it to process the array:
__kernel void kernel(__global int *buffer, uint height, uint width, uint ldf) {
...
int value = buffer[i*ldf+j]; // Same as buffer[i][j]
...
}
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE
- The cache line length can also be queried with the cl::Device::getInfo member function, in which case the row length of the array can be set to a multiple of the cache line length:
#define CEIL(x,y) (((x)+(y)-1)/(y))        // x/y rounded up
#define ROUND_TO_NEXT(x,n) ((n)*CEIL(x,n)) // Rounds up to the next multiple of n
cl_uint cacheLine;
device.getInfo(CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, &cacheLine);
if(cacheLine < 1)
    cacheLine = 1;
cl_uint height = 125;
cl_uint width = 432;
// Compute the next multiple of the cache line length
cl_uint ldf = ROUND_TO_NEXT(width*sizeof(int), cacheLine);
// Allocate an array in which each row is ldf bytes long
cl::Buffer deviceBuffer(context, CL_MEM_READ_WRITE, height*ldf);
kernel(cl::EnqueueArgs(...), deviceBuffer, height, width, ldf);
- In the kernel source code, note that the variable ldf now conveys the row length in bytes. Also note that the buffer is passed as a char pointer.
__kernel void kernel(__global char *buffer, uint height, uint width, uint ldf) {
...
// Compute the start of the row; buffer[i][*]
__global int *line = (__global int*)(buffer + i*ldf);
int value1 = line[j]; // Same as buffer[i][j]
int value2 = *((__global int*)(buffer + i*ldf) + j); // Same as buffer[i][j]
...
}
HIP and CUDA
- HIP provides the hipMallocPitch function for allocating two-dimensional arrays:
- CUDA provides the cudaMallocPitch function for allocating two-dimensional arrays:
hipError_t hipMallocPitch(
    void **devPtr, // Pointer to the beginning of the array
    size_t *ldf,   // Optimal value for the row pitch in bytes
    size_t width,  // Width of the array in _bytes_
    size_t height  // Height of the array in rows
);
// cudaError_t cudaMallocPitch(void **devPtr, size_t *ldf, size_t width, size_t height);
size_t height = 125;
size_t width = 432;
size_t ldf;
char* deviceBuffer;
// Allocate the array; each row has a pitch of ldf bytes
hipMallocPitch((void**)&deviceBuffer, &ldf, width * sizeof(int), height);
//cudaMallocPitch((void**)&deviceBuffer, &ldf, width * sizeof(int), height);
kernel<<<...>>>(deviceBuffer, height, width, ldf);
Without the cast to void **, you might experience an error of the type
error: no matching function for call to 'hipMallocPitch'
hipMallocPitch(&deviceBuffer, &pitch, w * sizeof(int), h);
^~~~~~~~~~~~~~
/.../hip/hip_runtime_api.h:3172:12: note: candidate function not viable: no known
conversion from 'int **' to 'void **' for 1st argument
hipError_t hipMallocPitch(void** ptr, size_t* pitch, size_t width, size_t height);
in which case casting the first argument to void **, as in the example above, fixes the call.
- In HIP, transfers from host memory to global memory can be conveniently done using the hipMemcpy2DAsync function:
- In CUDA, the corresponding function is cudaMemcpy2DAsync:
int *hostBuffer;
hipMallocHost((void**)&hostBuffer, height * width * sizeof(int));
//cudaMallocHost((void**)&hostBuffer, height * width * sizeof(int));
// --- Other synonyms ---
// hipHostMalloc(...)
// cudaHostAlloc(...)
// Fill hostBuffer with data here...
hipError_t err = hipMemcpy2DAsync(
deviceBuffer, // Destination
ldf, // Pitch in bytes for the destination
hostBuffer, // Source
width * sizeof(int), // Pitch in bytes for the source
width * sizeof(int), // Width of the array in bytes
height, // Height of the array
hipMemcpyHostToDevice, // Direction of transfer
0); // Stream
// err = cudaMemcpy2DAsync(...);
- In the kernel source code, note that the ldf variable conveys the row pitch in bytes:
__global__ void kernel(char *buffer, size_t height, size_t width, size_t ldf) {
...
int *line = (int*)(buffer + i * ldf); // Calculate the row
int value1 = line[j]; // Same as buffer[i][j]
int value2 = *((int*)(buffer + i * ldf) + j); // Same as buffer[i][j]
...
}
// Alternatively
__global__ void kernel(int *buffer, size_t height, size_t width, size_t ldf) {
...
size_t stride = ldf/sizeof(int);
int *line = &buffer[i*stride]; // Calculate the row
int value1 = line[j]; // Same as buffer[i][j]
int value2 = buffer[i*stride + j]; // Same as buffer[i][j]
...
}
Three-Dimensional Arrays
- HIP provides the hipMalloc3D function for allocating three-dimensional arrays:
- CUDA provides the cudaMalloc3D function for allocating three-dimensional arrays:
hipError_t hipMalloc3D(
struct hipPitchedPtr *pitchedDevPtr, // Allocated 3D array
struct hipExtent extent // Size information of the array
);
cudaError_t cudaMalloc3D(
struct cudaPitchedPtr *pitchedDevPtr, // Allocated 3D array
struct cudaExtent extent // Size information of the array
);
Here are the relevant data structures.
Note that even though the members are the same, the CUDA and HIP structs are distinct types and cannot be mixed.
struct cudaPitchedPtr {
    void * ptr;   // Pointer to the beginning of the array
    size_t pitch; // Pitch in bytes for one row of the array
    size_t xsize;
    size_t ysize;
};
struct hipPitchedPtr {
    void * ptr;   // Pointer to the beginning of the array
    size_t pitch; // Pitch in bytes for one row of the array
    size_t xsize;
    size_t ysize;
};
// HIP
struct hipExtent {
    size_t width;
    size_t height;
    size_t depth;
};
// CUDA
struct cudaExtent {
    size_t width;
    size_t height;
    size_t depth;
};
Both CUDA and HIP provide the make_cudaExtent and make_hipExtent helper functions with the same syntax:
// HIP
hipExtent make_hipExtent(
size_t w, // Width of the array in bytes
size_t h, // Height of the array in elements
size_t d); // Depth of the array in elements
// CUDA
cudaExtent make_cudaExtent(
size_t w, // Width of the array in bytes
size_t h, // Height of the array in elements
size_t d); // Depth of the array in elements
- The following example allocates a 54 × 23 × 5 array and calls the kernel:
size_t width = 54, height = 23, depth = 5;
// Create a cudaExtent struct for allocating the 3D array
cudaExtent extent = make_cudaExtent(width * sizeof(int), height, depth);
// Allocate the 3D array based on the created cudaExtent
cudaPitchedPtr pitchPointer;
cudaMalloc3D(&pitchPointer, extent);
// Call the kernel
kernel<<<...>>>(pitchPointer, width, height, depth);
__global__ void kernel(
cudaPitchedPtr pitchPointer, size_t width, size_t height, size_t depth) {
...
// Calculate the start of the row (i,j,*)
int *line = (int*)(
(char*)pitchPointer.ptr + (i * pitchPointer.pitch) + (j * pitchPointer.pitch * height));
int value = line[k]; // Same as buffer[i][j][k]
}
Note
- Note that 2D and 3D arrays can still be handled as one-dimensional arrays.
- Thread group indexing can be utilized for array handling:
__global__ void kernel(float *A, int height, int width) {
const int i = blockIdx.x;  // The grid contains height workgroups
const int j = threadIdx.x; // Each workgroup contains at least width threads
float value = 0.0;
if (j < width && i < height)
value = A[i * width + j]; // Each thread reads one element from the array
...
}
Splitting Memory Buffers into Sub-Buffers
- Pointers to HIP or CUDA GPU memory behave like normal pointers.
- Pointer arithmetic can be used to create sub-buffers:
int *buffer;
hipMalloc(&buffer, N*sizeof(int));
//cudaMalloc(&buffer, N*sizeof(int));
// Sub-buffer starting at index N/3
int *subBuffer = buffer + N/3;
kernel<<<...>>>(buffer, ...); // Perform operation on the entire buffer
kernel<<<...>>>(subBuffer, ...); // Perform operation on part of the buffer
OpenCL
- In OpenCL, sub-buffers must be created with the createSubBuffer member function of the cl::Buffer class:
cl::Buffer cl::Buffer::createSubBuffer(
cl_mem_flags flags,
cl_buffer_create_type buffer_create_type,
const void *buffer_create_info,
cl_int *err = NULL)
- Example:
cl::Buffer buffer(context, CL_MEM_READ_WRITE, N*sizeof(int));
cl_buffer_region region;
region.origin = (N/3)*sizeof(int); // The sub-buffer starts at element index N/3 (the offset is given in bytes)
region.size = (N/3)*sizeof(int);   // The sub-buffer contains N/3 elements (the size is given in bytes)
// Note: origin must respect the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN alignment
cl::Buffer subBuffer = buffer.createSubBuffer(
    CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, &region);
Transfers RAM <-> VRAM
- Speeds of different types of memory transfers:
- Local memory: \(\mathcal{O}(10^{12})\) B/s
- Global memory: \(\mathcal{O}(10^{11})\) B/s
- RAM memory: \(\mathcal{O}(10^{10})\) B/s
- RAM-VRAM transfer: \(\mathcal{O}(10^9)\) B/s
- Transferring data from RAM to the GPU's global memory is very costly and should be avoided whenever possible!
- In the case of simple computation operations (e.g., axpy), most of the time may be spent transferring data between RAM and global memory.
- Transferring data back and forth between computations also slows down the computation significantly.
- Therefore, try to:
- Transfer all necessary data to global memory at once,
- Perform as much computation as possible, and
- Transfer the results back to RAM only at the end (see the sketch below).
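- A minimal HIP sketch of this pattern (the step kernel and the iteration count are placeholders): the data is copied to the device once, many kernels are run on it, and only the final result is copied back:
#include <hip/hip_runtime.h>

// Placeholder kernel; in a real application this would be the actual computation
__global__ void step(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *hostData = new float[n]();
    float *deviceData;
    hipMalloc((void**)&deviceData, n * sizeof(float));

    // 1. Transfer all necessary data to global memory at once
    hipMemcpy(deviceData, hostData, n * sizeof(float), hipMemcpyHostToDevice);

    // 2. Perform as much computation as possible without touching RAM
    for (int iter = 0; iter < 100; iter++)
        step<<<(n + 255) / 256, 256>>>(deviceData, n);

    // 3. Transfer the results back to RAM only at the end
    hipMemcpy(hostData, deviceData, n * sizeof(float), hipMemcpyDeviceToHost);

    hipFree(deviceData);
    delete[] hostData;
    return 0;
}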
Page-Locked / Pinned Memory
- Allocating page-locked or pinned memory significantly speeds up data transfers because the PCIe bus's DMA controller can operate more efficiently!
- Reminder: In HIP, pinned memory can be allocated and freed using the hipHostMalloc and hipHostFree functions.
- Reminder: In CUDA, pinned memory can be allocated and freed using the cudaMallocHost and cudaFreeHost functions (a short sketch follows below).
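- A short sketch of the allocation and deallocation calls (HIP shown, CUDA equivalents in comments):
#include <hip/hip_runtime.h>

int main() {
    const int N = 1 << 20;
    int *hostBuffer;

    // Allocate pinned (page-locked) host memory
    hipHostMalloc((void**)&hostBuffer, N * sizeof(int), hipHostMallocDefault);
    // CUDA: cudaMallocHost(&hostBuffer, N * sizeof(int));

    // ... use hostBuffer as the source/destination of RAM <-> VRAM transfers ...

    // Free the pinned memory
    hipHostFree(hostBuffer);
    // CUDA: cudaFreeHost(hostBuffer);
    return 0;
}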
- The OpenCL specification does not impose any requirements on the use of pinned memory.
- AMD's implementation handles large data transfers by pinning the host-side memory area in chunks during the transfer.
- NVIDIA's OpenCL implementation is better documented, but it probably works in the same way.
Memory Mapping
- Memory mapping simplifies the program code and can improve performance in some cases.
- In practice, the CPU and GPU share (at least temporarily) the same memory area.
- Mapping allows data to be transferred only when it is needed.
Principle
- Before memory mapping, the CPU and GPU memory spaces are physically and logically separate.
- Memory mapping unifies the CPU and GPU memory spaces at a logical level.
- The CPU can write the data used for computation to the mapped memory area.
- The GPU can access the contents of the mapped memory area once the kernel is launched.
- The GPU can store the computation result into the mapped memory area.
- The CPU can see the contents of the mapped memory area once the kernel execution is complete.
Practical Implementation
- In practice, the data used for computation may be physically transferred to global memory before the kernel starts.
- In principle, it's also possible that the data is transferred physically only when the GPU needs it for computation.
- Similarly, the computation result can be physically transferred to RAM either all at once or in pieces when the CPU needs parts of the data:
OpenCL
- OpenCL offers three different ways to use mapped memory:
- a CL_MEM_ALLOC_HOST_PTR flag provided when creating the cl::Buffer object,
- a CL_MEM_USE_HOST_PTR flag provided when creating the cl::Buffer object,
- a cl::Buffer object created without additional flags.
- AMD extends the selection with its own CL_MEM_USE_PERSISTENT_MEM_AMD flag.
CL_MEM_ALLOC_HOST_PTR
- Passing the CL_MEM_ALLOC_HOST_PTR flag to the cl::Buffer constructor tells the OpenCL implementation to allocate the memory so that it can be accessed by the host:
cl::Buffer deviceBuffer = cl::Buffer(context,
CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, N*sizeof(int));
- In AMD's OpenCL implementation, the data associated with the mapping is physically located in pinned RAM.
- Memory mapping is done with the enqueueMapBuffer member function of the command queue:
void * cl::CommandQueue::enqueueMapBuffer(
const Buffer& buffer,
cl_bool blocking_map,
cl_map_flags flags,
::size_t offset,
::size_t size,
const VECTOR_CLASS<Event> *events = NULL,
Event *event = NULL,
cl_int *err = NULL);
- The return value is the address of the start of the mapped memory area in the host memory space.
- Setting the blocking_map flag to CL_TRUE makes the mapping operation blocking.
- The flags argument specifies whether the memory buffer is mapped for reading (CL_MAP_READ) or writing (CL_MAP_WRITE).
- Next, we map deviceBuffer into the host memory space using a blocking map operation so that the CPU can write to the mapped memory area:
int *hostPointer = 0;
hostPointer = (int*) queue.enqueueMapBuffer(
deviceBuffer, CL_TRUE, CL_MAP_WRITE, 0, N*sizeof(int));
- The host program can now write to the mapped memory area (hostPointer):
for(int i = 0; i < N; i++)
hostPointer[i] = checkBuffer[i];
- The memory area must be unmapped before the kernel can access the data shared through the mapping.
- Unmapping is done with the enqueueUnmapMemObject member function of the command queue:
cl_int cl::CommandQueue::enqueueUnmapMemObject(
const Memory &memory,
void *mapped_ptr,
const VECTOR_CLASS<Event> *events = NULL,
Event *event = NULL)
- So we unmap the previously created mapping, in which deviceBuffer was mapped into the host program's memory space (hostPointer):
queue.enqueueUnmapMemObject(deviceBuffer, hostPointer);
- After unmapping, we can call the kernel normally:
kernel(cl::EnqueueArgs(...), deviceBuffer, ...);
- Next, we enqueue a blocking command that maps deviceBuffer into the host program's memory space so that the host program can read from the mapped memory area:
hostPointer = (int*) queue.enqueueMapBuffer(
deviceBuffer, CL_TRUE, CL_MAP_READ, 0, N*sizeof(int));
- After mapping, the host program can, for example, check the correctness of the result computed by the GPU:
int rightValues = 0;
for(int i = 0; i < N; i++)
if(hostPointer[i] == checkBuffer[i]+1)
rightValues++;
- Finally, we need to remember to unmap:
queue.enqueueUnmapMemObject(deviceBuffer, hostPointer);
CL_MEM_USE_HOST_PTR
- The memory area on the host side to be used for mapping can also be reserved in advance:
int *hostBuffer = new int[N];
- In this case, hostBuffer must be allocated so that the starting address of the memory area is a multiple of the size of the data type used on the OpenCL device side! The pointer returned by the new operator is usually already suitably aligned for the basic data types. In other cases, it may be necessary to apply some pointer arithmetic.
- The pre-allocated memory buffer can now be handed over to the OpenCL implementation when creating the cl::Buffer object:
cl::Buffer deviceBuffer = cl::Buffer(context,
CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, N*sizeof(int), hostBuffer);
- The CL_MEM_USE_HOST_PTR flag indicates that the memory area referenced by the hostBuffer pointer can be freely used by the OpenCL implementation for mapping operations. The buffer cannot be used for any other purpose!
- The command queue's enqueueMapBuffer member function typically returns the same address that was given when the cl::Buffer object was created. However, the OpenCL specification does not guarantee this, so it is always best to store the address of the mapped memory area in a separate pointer variable:
int *hostPointer = 0;
hostPointer = (int*) queue.enqueueMapBuffer(
deviceBuffer, CL_TRUE, CL_MAP_WRITE, 0, N*sizeof(int));
cl::Buffer object created without extra flags
- A mapped cl::Buffer can also be created without the CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR flags.
- In AMD's OpenCL implementation, unmapping then triggers a transfer operation between RAM and global memory (see the sketch below).
HIP and CUDA
Memory mapping requires synchronization between the CPU and GPU executions.
In the following example, synchronization is achieved using a stream:
// HIP
hipStream_t stream;
hipStreamCreate(&stream);
// CUDA
cudaStream_t stream;
cudaStreamCreate(&stream);
- Synchronization could also be handled without a stream using the hipDeviceSynchronize() or cudaDeviceSynchronize() function.
- Initially, we allocate pinned memory from the host program's memory space:
int *hostPointer;
// HIP
hipHostMalloc(&hostPointer, N*sizeof(int), hipHostMallocMapped);
// CUDA
cudaMallocHost(&hostPointer, N*sizeof(int), cudaHostAllocMapped);
- The hipHostMallocMapped flag tells HIP that the allocated memory area is visible to the device.
- The cudaHostAllocMapped flag tells CUDA that the allocated memory area is visible to the device.
- The host program can write the data used for computation to the allocated memory area as usual:
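- For example (a trivial fill; hostPointer and N come from the allocation above):
// Write the input data through the mapped pointer, just like to any host array
for (int i = 0; i < N; i++)
    hostPointer[i] = i;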
- Synchronization is not mandatory at this stage as long as the kernel is not queued on the stream before the host program has written all the required data to the mapped memory area.
- For older GPUs (Compute Capability < 2.0, no UVA support), we first need to obtain the address of the mapped memory area on the GPU:
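- A minimal sketch of the lookup, using the standard cudaHostGetDevicePointer runtime call:
int *devicePointer;
cudaHostGetDevicePointer((void**)&devicePointer, hostPointer, 0);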
The same function exists in HIP (hipHostGetDevicePointer).
It is possible that the returned pointer is the same as hostPointer. The 0 argument at the end is mandatory; it is simply a placeholder for potential flags in future versions of CUDA and HIP.
- Now, the kernel can be enqueued using the obtained devicePointer:
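- For example (the kernel name, its arguments, and the launch configuration are placeholders):
// The kernel accesses the mapped memory through devicePointer
kernel<<<gridSize, blockSize, 0, stream>>>(devicePointer, N);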
- For newer GPUs, we can directly use hostPointer:
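- For example (same placeholder kernel and launch configuration as above):
// With unified virtual addressing, the mapped host pointer is valid on the device as well
kernel<<<gridSize, blockSize, 0, stream>>>(hostPointer, N);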
- Next, the host program needs to synchronize its execution with the GPU; otherwise the data behind hostPointer would not necessarily be ready for use yet.
- After synchronization, the host program can use hostPointer normally (see the sketch below).
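- A sketch of these last two steps, assuming the kernel was launched on stream and adds 1 to each element (checkBuffer is a placeholder host-side reference array):
// Wait until all work queued on the stream has finished
hipStreamSynchronize(stream);
// CUDA: cudaStreamSynchronize(stream);

// The results are now visible through the mapped pointer
int rightValues = 0;
for (int i = 0; i < N; i++)
    if (hostPointer[i] == checkBuffer[i] + 1)
        rightValues++;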
Using Multiple Streams
- Some of the time spent on transfers between RAM and GPU memory can be hidden by using multiple streams (see the sketch below).
- This way, the GPU's processing elements can compute data in one stream while the DMA controller transfers data in another stream.
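- A HIP sketch of the idea: the input is split into chunks, and each chunk's host-to-device copy, kernel launch, and device-to-host copy are queued on their own stream, so copies and computation from different chunks can overlap (the step kernel and the chunk count are placeholders):
#include <hip/hip_runtime.h>

// Placeholder kernel representing the actual computation
__global__ void step(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main() {
    const int nStreams = 4, chunk = 1 << 20, n = nStreams * chunk;

    int *hostData, *deviceData;
    // Pinned host memory is required for truly asynchronous copies
    hipHostMalloc((void**)&hostData, n * sizeof(int), hipHostMallocDefault);
    hipMalloc((void**)&deviceData, n * sizeof(int));

    hipStream_t streams[nStreams];
    for (int s = 0; s < nStreams; s++)
        hipStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; s++) {
        int offset = s * chunk;
        // Copy this chunk to the device in its own stream
        hipMemcpyAsync(deviceData + offset, hostData + offset, chunk * sizeof(int),
                       hipMemcpyHostToDevice, streams[s]);
        // Compute on the chunk in the same stream
        step<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(deviceData + offset, chunk);
        // Copy the results back, still in the same stream
        hipMemcpyAsync(hostData + offset, deviceData + offset, chunk * sizeof(int),
                       hipMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; s++) {
        hipStreamSynchronize(streams[s]);
        hipStreamDestroy(streams[s]);
    }
    hipHostFree(hostData);
    hipFree(deviceData);
    return 0;
}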