The first touch of GPU programming
CUDA vs HIP
There are currently a number of interfaces for GPU programming on the market. The two primary players are CUDA from NVIDIA and HIP from AMD; though technically ROCm is AMD's competitor to CUDA, their open source framework HIP provides an interface which is similar to CUDA and can in fact compile to CUDA devices too.
Manufacturers' support for the different interfaces varies:
| platform \ manufacturer | Nvidia | AMD |
| --- | --- | --- |
| CUDA | Supported | Not supported |
| HIP | Supported | Supported |

GPU programming is also possible with interfaces like OpenACC and OpenCL, and oneAPI for Intel GPUs.
The 2016 version of this course covered OpenCL 1.2, so if you are interested, you can have a look in the old material (only in Finnish).
How to phrase this part?
* ROCm is to AMD what CUDA is to NVIDIA
* HIP is essentially an interface like OpenACC(?), but pretty directly tied to ROCm and CUDA.
* There appear to be ways to compile HIP to oneAPI (https://www.intel.com/content/www/us/en/developer/videos/hip-on-aurora.html)
Overview of the GPU hardware
- Generally, computing hardware is divided into two categories: the host, which is typically a CPU, and devices, which are usually GPUs but could in principle be any unit the host is able to communicate with.
- We may at times refer to, for instance, CUDA devices or ROCm devices, meaning any device that CUDA or ROCm, respectively, compiles to. Primarily this means NVIDIA and AMD GPUs, respectively.
- The acronym GPU is a little ambiguous nowadays though, as there are many so-called GPU devices that are built primarily for computational purposes, and not with graphics rendering in mind.
Host device
- The host device is typically the computer's central processing unit (CPU), and for the purposes of this course devices are assumed to be GPUs.
- The primary role of the host device will be to manage the GPU devices installed in the system.
- The execution of an application starts on the host device like a regular application, but it can then leverage the computational power of GPU devices.
- The host device can allocate part of the GPU's memory for its use and command the GPU devices to perform various operations.
Device
- A device contains one or more compute units, and a single compute unit contains one or more processing elements.
- For example, on the LUMI supercomputer a GPU device contains 220 compute units and 14080 processing elements (called stream processors).
Compute units and processing elements
- Processing elements belonging to the same compute unit are typically closely linked and share resources with each other.
- Computation is performed by processing elements, but a single processing element may not be directly equated with a traditional processor core.
Compute units and processing elements (continued)
- On a general device, processing elements belonging to the same compute unit execute program code either together as a SIMD unit (i.e. each compute unit is a vector processor) or separately as SPMD units (i.e. as separate cores).
- However, the actual hardware implementation does not necessarily fit perfectly into either category. We will return to this issue later when we discuss NVIDIA's and AMD's implementation-level solutions.
GPU applications
- A GPU application behaves like a normal application on the host device, but it also accesses the computing resources of the GPU through some framework to accelerate computations that the GPU may do better than the host CPU.
GPU application (continued)
- Specifically, a GPU application is divided into
- kernels, which run on the device, and a
- host program, which runs on the host device
Kernel
- A kernel can be thought of as a subroutine that can be called to perform computational operations on a GPU device
- The host program is responsible for creating, launching and managing the kernels
Global index space
Before launching a kernel, the host program must define a specific index space.
Once the kernel is launched, this index space defines the global index numbers for a set of threads (work-items).
Thread execution paths
- Each thread starts executing code from the beginning of the kernel, but the index numbering allows the execution paths of the different threads to branch:
- In practice, one should be very careful with `if`-`else` statements in GPU code. NVIDIA and AMD devices group threads into sets called warps and wavefronts, respectively. Within such a group, branching can cause the code to be evaluated serially instead of in parallel.
- More on warps and wavefronts later, but for now simply know that they are subsets of, and different from, the work-groups discussed in the next section.
Work-groups
- Threads are divided into work-groups (also called blocks) as defined by the host program, and each work-group is assigned its own work-group index number.
- In addition, each thread is assigned its own local index number within the work-group.
- Indexing can be one-, two- or three-dimensional
Example of indexing threads
- Below is an example of a situation where 16 threads are divided into four work-groups, each containing four threads:
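- For instance, with a work-group size of 4, a thread's global index is its work-group index times 4 plus its local index (e.g. local index 2 in work-group 3 gives global index 3 * 4 + 2 = 14); as a sketch of the mapping:

work-group index:  0  0  0  0 | 1  1  1  1 | 2  2  2  2 |  3  3  3  3
local index:       0  1  2  3 | 0  1  2  3 | 0  1  2  3 |  0  1  2  3
global index:      0  1  2  3 | 4  5  6  7 | 8  9 10 11 | 12 13 14 15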
Thread groups and subtasks
- Dividing threads into thread groups directs the programmer to divide the task into subtasks that can be solved independently and in parallel:
Thread groups and calculation units
- Only at runtime does the system need to know the actual structure of the device.
- Below, two potential systems are visualized: one with device A and one with device B.
Memory model
- A single thread has access to various memory areas:
- The programmer must determine which memory area the referenced data resides in.
Global memory
- Global memory is an area of memory to which each thread has read and write access, regardless of their work-group affiliation.
- For GPUs, most memory is global memory.
- Global memory is typically implemented off the GPU chip, leading to limited memory bandwidth and latencies in the hundreds of clock cycles.
- For modern GPUs, global memory is accessed through a cache.
Local memory
- Local memory is a memory area visible only to threads within the same work-group.
- Local memory is typically used when threads in the same work-group need to share data.
- In modern GPUs, local memory is integrated into the GPU chip itself, often with each compute unit having its own local memory. This makes local memory an order of magnitude faster than global memory.
- Local memory is very limited: modern GPUs may have around 64 KB of local memory available per compute unit, and using too much within a work-group can impact performance.
- In CUDA and HIP terminology, local memory is referred to as shared memory.
Constant memory
- Constant memory is a memory area visible to all threads, whose contents remain the same during kernel execution
- Memory requests to constant memory often pass through cache.
- In CUDA and HIP, constant memory is declared using `__constant__` and must be declared at compile time.
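A minimal sketch of how this can look in HIP (the names and the scaling kernel are our own, not from the example program):

__constant__ float coefficients[16];   // visible to all threads, read-only inside kernels

__global__ void scale(float *data, int n)
{
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] *= coefficients[idx % 16];
}

// On the host, the contents of the constant array can be set with
// hipMemcpyToSymbol before launching kernels that read it.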
Private memory
- Private memory is an area of memory visible only to an individual thread.
- The compiler may allocate data to private memory when resources are scarce within a compute unit.
- Typically implemented as part of global memory.
- This means that if your kernel allocates too much private memory, it might not fit in the on-chip registers and will spill to global memory instead, which is orders of magnitude slower. So avoid this at all costs!
Relaxed consistency
- Usually the memory model for GPUs is consistent from the perspective of a single thread.
- This means that for instance the order of memory operations is predefined.
- However, the memory model is generally not consistent across threads.
- This means that you cannot expect that information written by one thread to a memory address will be available when another thread attempts to read that same address.
- Using synchronization primitives, some cross-thread consistency can be achieved.
- In CUDA and HIP such synchronization primitives are, for example, `__syncthreads()` and `__threadfence()`, the former being much more common than the latter.
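A minimal sketch (of our own, not from the example program) combining shared (local) memory with `__syncthreads()`, so that values written by one thread are safely visible to the other threads in the same block:

__global__ void reverse_block(int *data)
{
    __shared__ int tile[64];                    // local (shared) memory of this block; assumes blockDim.x == 64
    const int t = threadIdx.x;
    const int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];                   // each thread writes one element
    __syncthreads();                            // make the writes visible to the whole block

    data[base + t] = tile[blockDim.x - 1 - t];  // read an element written by another thread
}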
Our first HIP program
Information about the `hip/hip_runtime.h` installation goes here.
- Let's take a look at the following example program: gitlab:first-hip-program
- The example program includes a comprehensive version (`main.hip.cpp`), which does a bunch of extra things like checking device information, handles all error situations explicitly, and is extensively commented.
- Additionally, there is also a so-called barebones version, which includes only a minimal amount of code.
Kernel code
- A kernel is like a function, but it will be executed individually by each thread on the device.
- Let's have a look at our simple kernel `add_one`.
__global__ void add_one(int *buffer, int n)
{
const int idx = threadIdx.x + blockIdx.x * blockDim.x;
if(idx < n)
buffer[idx]++;
}
- A kernel is written like other functions, with a few exceptions:
- Note that the kernel is marked with the `__global__` keyword, which indicates that the host will call it and it will run on the device. The exception is CUDA Dynamic Parallelism, which we won't be covering.
- The kernel return type is `void`, i.e. it has no return value. Return values are generally passed through the kernel's arguments, or the kernel modifies one of its input variables. In our kernel here, the variable `buffer` is getting modified.
- Inside a kernel we have access to some special constants and functions used to manage threads. In our kernel we use some of these constants to compute the global thread index `idx`:
  - `threadIdx.x` is the local thread index within the block (work-group).
  - `blockIdx.x` is the block index (work-group index).
  - `blockDim.x` is the size of the block (work-group), i.e. how many threads are in each block.
  - Finally, there is also `gridDim.x`, which is the number of blocks in the grid.
- Note that the above all use `.x`; this is the first dimension of the "grid". When configuring threads and blocks for a GPU kernel, they can be indexed in 1, 2 or 3 dimensions; `.y` and `.z` are used for the other directions. Just note that there are certain limitations on the maximum total number of threads in a block, and on the number of blocks in the various directions.
- Regardless of the dimensionality of the grid we use, we can always compute a global index. Here's an example in a 2-dimensional grid
int threadsInBlock = blockDim.x * blockDim.y;
int gIdx = threadIdx.x + threadIdx.y * blockDim.x + threadsInBlock * (blockIdx.x + blockIdx.y * gridDim.x);
The order of counting, i.e. `x` first or `y` first, can be important to ensure that the threads the GPU bundles together for execution have neighboring global indices. This can be important for efficient memory operations on the GPU.
In the last part of the kernel we check whether the global thread index is smaller than `n`, the number of elements in `buffer`, so if we have more threads than the length of the array, the last threads won't get to work on it. Without the check we'd be going out of bounds, which would be bad.
Error handling
- Most HIP runtime calls return an error code of type `hipError_t`. Different functions may return different error codes, which can be looked up in the documentation.
- `hipSuccess` is the general code indicating that a call ran without any issues.
- Examples of other error codes are `hipErrorMemoryAllocation`, `hipErrorInvalidDevice`, `hipErrorInvalidMemcpyDirection`, `hipErrorNotReady` and `hipErrorRuntimeMemory`.
- The `hipError_t` type is simply an enum defined in `hip_runtime.h`, so you can check out the different options there if you like.
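As a small sketch of a common checking pattern (the `HIP_CHECK` macro name is our own choice, not part of HIP):

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap HIP calls and abort with a readable message on failure.
#define HIP_CHECK(call)                                                   \
    do {                                                                  \
        hipError_t err_ = (call);                                         \
        if (err_ != hipSuccess) {                                         \
            std::fprintf(stderr, "HIP error %s at %s:%d\n",               \
                         hipGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Example: HIP_CHECK(hipMalloc(&deviceBuffer, N * sizeof(int)));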
Overview of devices
- A machine may have multiple devices.
- We can request the number of available devices with `hipGetDeviceCount`, as in the example below.
- We can similarly query other information, like the runtime version, the driver version, and a whole slew of different device properties.
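A minimal sketch of such queries using the HIP runtime API (error checking omitted for brevity):

#include <hip/hip_runtime.h>
#include <iostream>

int main()
{
    int deviceCount = 0;
    hipGetDeviceCount(&deviceCount);           // number of available devices
    std::cout << "Devices found: " << deviceCount << std::endl;

    int runtimeVersion = 0, driverVersion = 0;
    hipRuntimeGetVersion(&runtimeVersion);     // HIP runtime version
    hipDriverGetVersion(&driverVersion);       // driver version

    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, 0);         // properties of device 0
    std::cout << "Device 0: " << props.name << std::endl;
    return 0;
}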
Streams and Selecting a Device
Selecting a device to compute on is easy in HIP!
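For instance, a minimal sketch:

hipSetDevice(0);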
- This sets the device with index 0 as the active device; subsequent HIP calls will operate in the context of that device.
- This is only really relevant on machines with multiple GPUs though, as HIP assumes device 0 to be active by default.
Streams are queues on the GPU to which we can schedule tasks.
Creating multiple streams allows us to potentially overlap various independent tasks on the GPU, which may improve performance, as illustrated in the figure below.
H2D and D2H in the image are abbreviations of Host to Device and Device to Host.
Note that not all tasks can be concurrent, and the order of issuing tasks to the scheduler on the GPU can impact concurrency as well. These CUDA Webinar slides (which the above illustration was based on) have some good illustrations of the general concept on slides 16-19.
A stream can be created as follows
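For instance, a minimal sketch:

hipStream_t stream;
hipStreamCreate(&stream);
// ... submit work to the stream ...
hipStreamDestroy(stream);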
The stream may then be supplied to the configuration for a kernel call, making the call execute on that stream.
A number of asynchronous HIP functions also take a stream as an argument, e.g. `hipMemcpyAsync`.
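A minimal sketch of both uses, reusing the `add_one` kernel, the buffers and the launch configuration that appear elsewhere in this material:

hipMemcpyAsync(deviceBuffer, hostBuffer_ptr, N * sizeof(int),
               hipMemcpyHostToDevice, stream);             // copy on the stream
add_one<<<numBlks, thrPrBlk, 0, stream>>>(deviceBuffer, N); // kernel on the same stream
hipStreamSynchronize(stream);                               // wait for the stream to finish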
Memory on the Device: Malloc and Memcpy
In general, we distinguish between memory on the host and memory on the device. As we work with data on devices, we will be moving the data back and forth between the host and the device, since memory on the device is generally not directly accessible from the host.
Copying memory between host and device is a somewhat expensive process though, and should be avoided whenever possible in critical computation where speed is important.
We will cover some basics below, but you can read more about GPU memory here.
Allocation of memory on the device is generally done using
// Header information
hipError_t hipMalloc(void **ptr, size_t size);
// Example
int *deviceBuffer;
hipMalloc(&deviceBuffer, N * sizeof(int));
- This allocates memory for an array of `N` `int`s. `deviceBuffer` is now a pointer to a memory address on the device, and can in general not be dereferenced outside of kernels.
Copying memory to the device uses `hipMemcpy`.
// Header information
hipError_t hipMemcpy(void *dst, const void *src, size_t size, hipMemcpyKind kind);
// Example
int *hostBuffer_ptr = hostBuffer.data(); // assuming hostBuffer is a std::vector<int>
hipMemcpy(deviceBuffer, hostBuffer_ptr, N * sizeof(int), hipMemcpyDefault);
- This copies `N * sizeof(int)` bytes of data from the host memory location `hostBuffer_ptr` to the device memory location `deviceBuffer`.
- The `hipMemcpyDefault` flag essentially tells HIP to figure out for itself, based on the pointers, that this is a transfer from host to device. We could also have given the argument `hipMemcpyHostToDevice` instead. The argument for the latter is that the code would then throw an error if we by accident provided two host memory pointers; with `hipMemcpyDefault` HIP will simply assume we wanted a `hipMemcpyHostToHost` transfer and not complain. But assuming you can manage your pointers properly, `hipMemcpyDefault` is the recommended flag.
- There are a bunch of other `Memcpy` functions; see the documentation here. For example `hipMemcpyAsync`, which takes a stream as an argument, allowing you to overlap memory transfers and computation.
Configuring and launching a kernel
Launching a kernel is simple in HIP. The syntax is
// Full
kernel<<<numberOfBlocks, threadsPerBlock, SharedMemoryInBytes, Stream>>>(arg1, arg2, ...);
// Simple, on the 0 stream and with no shared memory
kernel<<<numberOfBlocks, threadsPerBlock>>>(arg1,arg2,...);
- The arguments in the `<<<...>>>` brackets are the launch configuration.
- If the stream is not given, then HIP defaults to the 0 stream.
- If the shared memory isn't provided, it defaults to 0.
When setting up the configuration we use the `dim3` type:
int maxThreads = 1024;
const dim3 numBlks((N - 1)/maxThreads + 1);
const dim3 thrPrBlk(maxThreads); // same as: (maxThreads, 1, 1)
- The type `dim3` is a `struct` with the `uint32_t` components `x`, `y`, `z`.
- If a component in the `dim3` struct isn't specified, the default value is 1.
- For `numBlks` we compute the minimal number of blocks required. Note that the division is an integer division that discards the remainder, so try to go through the steps and convince yourself that the formulation above will in fact always give the smallest number of blocks whose threads cover all `N` elements.
With the configuration we call our kernel with the syntax given above
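For example, a minimal sketch launching the `add_one` kernel from earlier with the configuration computed above:

add_one<<<numBlks, thrPrBlk>>>(deviceBuffer, N);
hipDeviceSynchronize();   // wait for the kernel to finish before using the result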
The first OpenCL 1.2 program
I imagine we are tossing the rest of this document.
- In this course, the OpenCL interface is mainly used through the C++ wrapper. This choice simplifies the code somewhat and in particular simplifies memory management.
- If a working OpenCL implementation is installed on the computer, the C++ wrapper only needs the `CL/cl.hpp` header file to run; if required, the wrapper can be downloaded from the Khronos group website. More detailed documentation can be found here.
- Let's take a look at the following example program: git /first-opencl-program
- The example program includes an extended version (`main.cpp`) which checks the suitability of the parameters for the OpenCL device used and handles all error situations.
- In addition, the package includes a simplified version which demonstrates the C++ wrapper's C++11 features and exception handling.
- Finally, there is also a so-called barebones version, which includes only the absolute minimum amount of code (no argument validation and no error code handling).
- The `main.cpp` file starts with the inclusion of the `CL/cl.hpp` header file:
#include <CL/cl.hpp>
- All classes and functions related to the C++ wrapper belong to the `cl` namespace.
- The namespace prefix would not need to be explicitly mentioned if we were to use `using namespace cl`. However, we include it in this example for clarity.
- `VECTOR_CLASS` and `STRING_CLASS` in the C++ wrapper refer by default to the data types `std::vector` and `std::string`.
Error handling
- Most OpenCL library calls either
  - return an error code of type `cl_int`, or
  - take as an argument a pointer to an error code variable of type `cl_int`, where the OpenCL call status is written.
- `CL_SUCCESS` (`0`) is a generic all-ok return value.
- Other typical error codes: `CL_DEVICE_NOT_FOUND / -1`, `CL_OUT_OF_RESOURCES / -5`, `CL_INVALID_COMMAND_QUEUE / -37`, `CL_INVALID_KERNEL_ARGS / -52`.
- The numeric values corresponding to the error codes can be found in the `CL/cl.h` file.
- The C++ wrapper can also report errors using exceptions
- Exceptions are disabled by default, but can be enabled as follows:
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
- Exceptions are caught with the `try`-`catch` structure:
try {
...
} catch(const cl::Error &err) {
std::cerr << "Virheviesti: " << err.what() << std::endl;
}
OpenCL platform
- To use the OpenCL implementation installed on a computer, we first need to get access to the OpenCL platform layer
- Through the OpenCL platform, we further access the OpenCL devices installed in the system and their resources.
- In the C++ wrapper, the OpenCL platform is wrapped inside the `cl::Platform` class, which provides a static member function for fetching OpenCL platforms:
static cl_int cl::Platform::get(VECTOR_CLASS<Platform> * platforms)
cl_int err;
std::vector<cl::Platform> platforms;
err = cl::Platform::get(&platforms);
if(err != CL_SUCCESS) {
std::cerr << "OpenCL-alustojen hakeminen epäonnistui." <<
std::endl;
std::cerr << "OpenCL-virhekoodi: " << err << std::endl;
return 1;
}
if(platforms.size() < 1) {
std::cerr << "OpenCL-alustoja ei löytynyt." << std::endl;
return 1;
}
- Next, we could query the OpenCL platform about various things using the `getInfo` member function:
cl_int cl::Platform::getInfo(cl_platform_info name,
STRING_CLASS * param)
- The OpenCL version supported by the platform can be checked with the `CL_PLATFORM_VERSION` flag, and the vendor name can be found with the `CL_PLATFORM_VENDOR` flag.
- The OpenCL platform may also support various standards and implementation-specific extensions; the `CL_PLATFORM_EXTENSIONS` flag returns a list of supported extensions.
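For example, a minimal sketch of such a query, reusing the `platforms` vector from the example above:

std::string version;
platforms[0].getInfo(CL_PLATFORM_VERSION, &version);
std::cout << "Platform version: " << version << std::endl;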
Searching for OpenCL devices
- The OpenCL platform may include multiple OpenCL devices
- In the C++ wrapper, an OpenCL device is wrapped in the `cl::Device` class, and the `cl::Platform` class contains the `getDevices` member function for querying OpenCL devices:
cl_int cl::Platform::getDevices(cl_device_type type,
VECTOR_CLASS<Device> * devices)
- The `type` argument above can take values such as `CL_DEVICE_TYPE_CPU`, `CL_DEVICE_TYPE_GPU` or `CL_DEVICE_TYPE_ACCELERATOR`. In the context of this course, `CL_DEVICE_TYPE_GPU` is of course the most interesting option.
std::vector<cl::Device> devices;
err = platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
if(err != CL_SUCCESS) {
std::cout << "OpenCL-laitteiden hakeminen epäonnistui." <<
std::endl;
std::cerr << "OpenCL-virhekoodi: " << err << std::endl;
return 1;
}
if(devices.size() < 1) {
std::cout << "OpenCL-laitteita ei löytynyt." << std::endl;
return 1;
}
- We could ask `cl::Device` about various things, as we have done in the past:
template <typename T>
cl_int cl::Device::getInfo(cl_device_info name,
T * param)
- The `CL_DEVICE_MAX_WORK_GROUP_SIZE` flag returns the largest possible work-group size.
- The `CL_DEVICE_MAX_MEM_ALLOC_SIZE` flag returns the largest possible single memory allocation.
- The `CL_DEVICE_LOCAL_MEM_SIZE` flag returns the amount of available local memory.
- The `CL_DEVICE_EXTENSIONS` flag returns the extensions supported by the OpenCL device.
OpenCL context
- The OpenCL device resources are managed via the OpenCL context
- In the C++ wrapper, the OpenCL context is wrapped in the `cl::Context` class, whose constructor takes as arguments the OpenCL devices to be included:
cl::Context::Context(VECTOR_CLASS<Device>& devices,
cl_context_properties * properties = NULL,
void (CL_CALLBACK * pfn_notify)(
const char * errorinfo,
const void * private_info,
::size_t cb,
void * user_data) = NULL,
void * user_data = NULL,
cl_int * err = NULL)
std::vector<cl::Device> oneDevice;
oneDevice.push_back(devices[0]);
cl::Context context(oneDevice, 0, 0, 0, &err);
if(err != CL_SUCCESS) {
std::cout << "OpenCL-kontekstin luominen epäonnistui." <<
std::endl;
std::cerr << "OpenCL-virhekoodi: " << err << std::endl;
return 1;
}
- A `cl::Context` can also be created without the `cl::Device` objects:
cl::Context::Context(cl_device_type type,
cl_context_properties * properties = NULL,
void (CL_CALLBACK * pfn_notify)(
const char * errorinfo,
const void * private_info,
::size_t cb,
void * user_data) = NULL,
void * user_data = NULL,
cl_int * err = NULL)
- In this case, all devices corresponding to the `type` argument will be included.
- The OpenCL context information can be accessed using the `getInfo` member function:
template <typename T>
cl_int cl::Context::getInfo(cl_context_info name,
T * param)
- In particular, the `CL_CONTEXT_DEVICES` flag, which returns the OpenCL devices associated with the context, is useful if the `cl::Context` object was created without `cl::Device` objects.
Kernel source code
- Kernels are written in the (limited) C language
- Consider a simple kernel that takes as arguments an array (`buffer`) of `n` integers and adds one to each element:
__kernel void add_one(__global int *buffer, int n) {
const int global_id = get_global_id(0);
if(global_id < n)
buffer[global_id]++;
}
- The `__kernel` keyword indicates that this is a kernel that can be called from the host.
- The `__global` keyword indicates that the array `buffer` is stored in global memory.
- The global index number is obtained by using the `get_global_id` function. The argument to the function determines the index dimension:
const int global_id = get_global_id(0);
- Each thread increments the array element corresponding to its own global index number by one:
if(global_id < n)
buffer[global_id]++;
- The above assumes that there are at least as many threads as there are elements in the array
- In simple cases, kernels can be written into the host program as strings:
const char *kernel =
"__kernel void add_one(__global int *buffer, int n) { \n" \
" const int global_id = get_global_id(0); \n" \n
" if(global_id < n) \n" \
" buffer[global_id]++; \n" \
"} \n";
- After this, the source code for the kernels can be wrapped inside a `cl::Program::Sources` object:
cl::Program::Sources sources;
sources.push_back(
cl::Program::Sources::value_type(kernel, strlen(kernel)));
- The kernel source code can also be stored in a separate file. OpenCL does not support this directly, but for example the `xxd` tool available in Linux environments is perfectly suitable for this purpose.
- Let's assume that the kernel source code is written to a separate `kernel.cl` file. In this case, the command `xxd -i kernel.cl > kernel.cl.dat` would produce the following file:
unsigned char kernel_cl[] = {
0x0a, 0x2f, 0x2f, 0x20, 0x59, 0x64, 0x69, 0x6e, 0x2c, ...
0x6b, 0x61, 0x20, 0x6f, 0x74, 0x74, 0x61, 0x61, 0x20, ...
...
0x2b, 0x3b, 0x20, 0x0a, 0x7d
};
unsigned int kernel_cl_len = 857;
- Now we could include the source code as part of the host program:
#include "kernel.cl.dat"
...
cl::Program::Sources sources;
sources.push_back(cl::Program::Sources::value_type(
(const char*) kernel_cl, kernel_cl_len));
- In addition to embedding the string in the host program's source code or using the `xxd` tool, the kernel source code could also be loaded from a separate text file into a string at runtime.
Compiling kernel source code
- To compile the kernel source code, we need a program object, which encapsulates the associated OpenCL context, the kernel source code/binary, the compiled version of the kernel source code, compilation logs and kernel "handles".
- In the C++ wrapper, the program object is wrapped inside the `cl::Program` class, whose constructor takes as arguments the associated OpenCL context and the kernel source code:
cl::Program::Program(const Context& context,
const Sources& sources,
cl_int * err = NULL)
cl::Program program(context, sources, &err);
if(err != CL_SUCCESS) {
std::cout << "Ohjelma-objektin luominen epäonnistui." <<
std::endl;
std::cerr << "OpenCL-virhekoodi: " << err << std::endl;
return 1;
}
- The source code of the kernels attached to a program object can be compiled using the `build` member function of the `cl::Program` class:
cl_int cl::Program::build(const VECTOR_CLASS<Device>& devices,
const char * options = NULL,
void (CL_CALLBACK * pfn_notify)(
cl_program,
void * user_data) = NULL,
void * data = NULL)
- The output of the compilation process and possible errors can be retrieved by passing the `CL_PROGRAM_BUILD_LOG` flag to the `getBuildInfo` member function:
template <typename T>
cl_int cl::Program::getBuildInfo(cl_program_build_info name,
T * param)
err = program.build(oneDevice, 0);
if(err != CL_SUCCESS) {
std::string log;
program.getBuildInfo(oneDevice[0],
CL_PROGRAM_BUILD_LOG, &log);
std::cout <<
"OpenCL compiler output:" << std::endl <<
log << std::endl;
std::cout <<
"Compiling the kernel source code failed." <<
std::endl;
std::cerr << "OpenCL error code: " << err << std::endl;
return 1;
}
- If the source code of the kernels is stored in a single string, we could omit the creation of the `cl::Program::Sources` object altogether and pass the source code directly as a string to the `cl::Program` constructor:
cl::Program::Program(const Context& context,
const STRING_CLASS& source,
bool build,
cl_int * err = NULL)
- If, in addition, we pass the value `CL_TRUE` as the `build` argument, the kernel source code is compiled already when the `cl::Program` object is created, so we would not need to call the `build` member function separately.
Creating a kernel object
- A kernel object encapsulates the properties associated with a single kernel
- In the C++ wrapper, a kernel object is encapsulated inside the `cl::Kernel` class, whose constructor takes the program object and the kernel name as arguments:
cl::Kernel::Kernel(const Program& program,
const char * name,
cl_int * err = NULL)
cl::Kernel kernel(program, "add_one", &err);
if(err != CL_SUCCESS) {
std::cout << "Ytimen luominen epäonnistui." << std::endl;
std::cerr << "OpenCL-virhekoodi: " << err << std::endl;
return 1;
}
- We could also use the C++11 features of the wrapper to create the kernel object:
template<typename T0,
typename T1 = detail::NullType,
...,
typename T31 = detail::NullType>
struct make_kernel::detail::functionImplementation<T0, T1, ..., T31>
cl::make_kernel::make_kernel(const Program &program,
const STRING_CLASS name,
cl_int *err = NULL)
- In the example program, a kernel object could be created as follows:
auto kernel = cl::make_kernel<cl::Buffer&, int>(program, "add_one");
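A minimal sketch of launching the kernel through this functor (it assumes the command queue `queue` and the index space ranges `globalDim` and `localDim` that are created later in this material):

kernel(cl::EnqueueArgs(queue, globalDim, localDim), deviceBuffer, N);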
Creating a command queue
- The host application places kernel launch commands into an OpenCL command queue.
- The runtime system automatically executes the kernels in the given order.
- The command queue can also be used to enqueue, for example, data transfer commands and events, which allow the status of the command queue to be monitored in real time.
- The command queue is encapsulated inside the `cl::CommandQueue` class, whose constructor takes as arguments the OpenCL context and the OpenCL device to which the command queue is to be attached:
cl::CommandQueue::CommandQueue(
const Context& context,
const Device& device,
cl_command_queue_properties properties = 0,
cl_int * err = NULL)
cl::CommandQueue queue(context, oneDevice[0], 0, &err);
if(err != CL_SUCCESS) {
std::cout << "Komentojonon luominen epäonnistui." << std::endl;
std::cerr << "OpenCL-virhekoodi: " << err << std::endl;
return 1;
}
Creating a buffer object
- The global and constant memory of an OpenCL device is managed by means of Buffer objects
- The buffer object is encapsulated inside the `cl::Buffer` class, whose constructor takes as arguments the associated OpenCL context, a `flags` flag variable, and the size of the buffer:
cl::Buffer::Buffer(
const Context& context,
cl_mem_flags flags,
::size_t size,
void * host_ptr = NULL,
cl_int * err = NULL)
cl::Buffer deviceBuffer(context, CL_MEM_READ_WRITE, N*sizeof(int), 0, &err);
if(err != CL_SUCCESS) {
std::cout << "Muistin varaaminen epäonnistui epäonnistui." << std::endl;
std::cerr << "OpenCL-virhekoodi: " << err << std::endl;
return 1;
}
Writing data to the buffer
- The `enqueueWriteBuffer` member function of the command queue enqueues a command to write data from host memory into the buffer object.
- Setting the `blocking_write` flag to `CL_TRUE` makes the call blocking, i.e. the function returns only when the transfer is complete.
cl_int cl::CommandQueue::enqueueWriteBuffer(
const Buffer& buffer,
cl_bool blocking_write,
::size_t offset,
::size_t size,
const void * ptr,
const VECTOR_CLASS<Event> * events = NULL,
Event * event = NULL)
err = queue.enqueueWriteBuffer(deviceBuffer, CL_FALSE, 0, N*sizeof(int),
hostBuffer, 0, 0);
if(err != CL_SUCCESS) {
std::cout << "Isäntälaite -> OpenCL-laite -siirtokäskyn asettaminen " \
"komentojonoon epäonnistui." << std::endl;
std::cerr << "OpenCL-virhekoodi: " << err << std::endl;
}
Setting kernel arguments
- Kernel arguments are set one by one using the `setArg` member function:
template <typename T>
cl_int cl::Kernel::setArg(cl_uint index, T value)
err = kernel.setArg(0, deviceBuffer);
if(err != CL_SUCCESS) {
std::cout << "Setting the kernel's 1st argument failed." <<
std::endl;
std::cerr << "OpenCL error code: " << err << std::endl;
return 1;
}
err = kernel.setArg(1, N);
if(err != CL_SUCCESS) {
std::cout << "Setting the kernel's 2nd argument failed." <<
std::endl;
std::cerr << "OpenCL error code: " << err << std::endl;
return 1;
}
Starting the kernel
- The kernel launch command is enqueued with the `enqueueNDRangeKernel` member function.
- The size of the global index space is set with the `global` argument, and the work-group size is set with the `local` argument.
cl_int cl::CommandQueue::enqueueNDRangeKernel(
const Kernel& kernel,
const NDRange& offset,
const NDRange& global,
const NDRange& local,
const VECTOR_CLASS<Event> * events = NULL,
Event * event = NULL)
- It's a good idea to check the maximum size of the thread group:
::size_t maxWorkGroupSize;
err = oneDevice[0].getInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE, &maxWorkGroupSize);
if(err != CL_SUCCESS) {
std::cout << "Querying the OpenCL device's maximum work-group size " \
"failed." << std::endl;
std::cerr << "OpenCL error code: " << err << std::endl;
return 1;
}
- Calculating the global index space and the work-group size:
::size_t workGroupCount = N/maxWorkGroupSize+1;
cl::NDRange globalDim(workGroupCount*maxWorkGroupSize, 1, 1);
cl::NDRange localDim(maxWorkGroupSize, 1, 1);
err = queue.enqueueNDRangeKernel(kernel, cl::NullRange,
globalDim, localDim, 0, 0);
if(err != CL_SUCCESS) {
std::cout << "Ytimen käynnistyskäskyn asettaminen komentojonoon " \
"epäonnistui." << std::endl;
std::cerr << "OpenCL-virhekoodi: " << err << std::endl;
return 1;
}
Reading data from the buffer
- The `enqueueReadBuffer` member function of the command queue enqueues a command to read data from the buffer object into host device memory.
- Setting the `blocking_read` flag to `CL_TRUE` makes the call blocking, i.e. the function returns only when the transfer is complete.
cl_int cl::CommandQueue::enqueueReadBuffer(
const Buffer& buffer,
cl_bool blocking_read,
::size_t offset,
::size_t size,
const void * ptr,
const VECTOR_CLASS<Event> * events = NULL,
Event * event = NULL)
err = queue.enqueueReadBuffer(deviceBuffer, CL_TRUE, 0, N*sizeof(int),
hostBuffer, 0, 0);
if(err != CL_SUCCESS) {
std::cout << "OpenCL-laite -> Isäntälaite -siirtokäskyn asettaminen " \
"komentojonoon epäonnistui." << std::endl;
std::cerr << "OpenCL-virhekoodi: " << err << std::endl;
return 1;
}