OpenMP

Introduction

The OpenMP compiler extension allows serial code to be parallelized with relatively little extra effort. In the simplest case, a for loop with iterations that are independent of each other can be parallelized as follows:

#pragma omp parallel for
for(int i = 0; i < n; i++) {
    a[i] = i;
}

However, the compiler must support OpenMP; for example, OpenMP support in the gcc/g++ compilers is enabled with the -fopenmp flag:

g++ ... -fopenmp

In some cases, it is necessary to include the omp.h header file:

#include <omp.h>

The pragmas work without the header file, but prototypes for OpenMP subroutines such as omp_get_thread_num() (see Management of threads below) are defined in that header file.
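
For illustration, the pieces above can be combined into a minimal complete program. This is only a sketch: the file name example.cpp and the array size are placeholders, and the output order will vary between runs.

// example.cpp -- compile with: g++ example.cpp -fopenmp
#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 8;
    int a[n];

    #pragma omp parallel for
    for(int i = 0; i < n; i++) {
        a[i] = i;
        // omp_get_thread_num() is declared in omp.h
        printf("iteration %d handled by thread %d\n", i, omp_get_thread_num());
    }

    return 0;
}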

Parallel code block

Parallel-pragma

All OpenMP parallelism takes place within a parallel block of code:

#pragma omp parallel
{
    // This block of code is executed in parallel on several threads
}

Defining a parallel code block spawns a team of threads. The omp parallel for pragma in the introduction is therefore an abbreviation of the following:

#pragma omp parallel
{
    #pragma omp for
    for(int i = 0; i < n; i++) {
        a[i] = i;
    }
}

The omp for pragma divides the for loop into parallel sub-loops that are distributed among the threads.


Next we will look at how to distribute work within an omp parallel region to different threads. There are two main ways to do this:

  • omp sections and omp section limit the scope of a piece of work so that only a single thread executes it. Multiple omp section blocks can then be executed simultaneously by different threads.

While omp sections can still be useful in certain situations, newer versions of OpenMP provide more versatile tools, namely omp task and omp taskgroup.

  • omp taskgroup and omp task let OpenMP balance the workload dynamically over large groups of tasks. Tasks also provide a way to set up dependencies between tasks, so that a dependent task can be scheduled but is not picked up until the computations it depends on have completed.

Sections-pragma

The program code within a parallel block of code can also be divided into sections using the omp sections and omp section pragmas. The subtasks specified by the omp section pragma are automatically distributed among the threads so that each subtask is executed exactly once:

#pragma omp parallel
{
    // All threads execute the following subroutine
    subroutine1();
    
    // Start a block of code consisting of subtasks
    #pragma omp sections
    {
        // Code that appears before the first `omp section` directive is
        // treated as if it were wrapped in an `omp section` block of its own,
        // so this subroutine is executed by only one thread. It is good
        // practice to make this explicit by always writing `#pragma omp section`.
        subroutine2();
    
        #pragma omp section
        {
            // Only one thread shall execute the following commands
            int k = 55;
            subroutine2();
        }
        
        #pragma omp section
        subroutine3(); // Only one thread shall execute the subroutine
    }
}

Note that everything inside an omp sections block must belong to some omp section. Code placed before the first omp section directive is implicitly treated as the first section (and is therefore executed by a single thread, not by all threads), while code after the first omp section directive that is not wrapped in an omp section causes a compile error.

In a situation where you do not have a subroutine that needs to be executed by all threads, but only need sections, you can collapse

#pragma omp parallel 
{
    #pragma omp sections
    {
        // Some sections here
        // ...
    }
}

into the slightly more compact form

#pragma omp parallel sections
{
    // Some sections here
    // ...
}

Note, however, that the very first example in this section cannot be shortened in this way: there is no place in an omp parallel sections block for code that all threads should execute. Any code placed before the first omp section directive forms a single implicit section and is therefore executed sequentially by one thread only:

#pragma omp parallel sections
{
    // These two calls are treated as if they were wrapped in a shared implicit
    // omp section, so they are executed sequentially by a single thread
    subroutine1();
    
    subroutine2();

    #pragma omp section
    {
        // Only one thread shall execute the following commands
        int k = 55;
        subroutine2();
    }
    
    #pragma omp section
    subroutine3(); // Only one thread shall execute the subroutine
}

Task-pragma

Parallel tasks can also be defined directly with a task pragma:

// subroutine1 and subroutine2 are independent tasks and can be
// processed in parallel
#pragma omp parallel 
{
    #pragma omp single nowait
    {
        #pragma omp task
        subroutine1();
        
        #pragma omp task
        subroutine2();
    }
    
    // Wait for the parallel tasks to complete
    #pragma omp taskwait
}

The omp single pragma here ensures that only a single thread queues up the tasks. More on the single pragma below.

Note that this is essentially equivalent to

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        subroutine1();
        
        #pragma omp section
        subroutine2();
    }
}

Dependencies

Some excellent examples are available here: Link to examples

Let us have a look at the first example

#include <stdio.h>
int main() {
   int x = 1;
   #pragma omp parallel
   #pragma omp single
   {
      #pragma omp task shared(x) depend(out: x)
         x = 2;
      #pragma omp task shared(x) depend(in: x)
         printf("x = %d\n", x);
   }
   return 0;
}

Note the depend(...) clause, which imposes an ordering on the tasks. The out: x indicates that the task writes x, and any later task with in: x must wait for that write to complete. Thus this program will always output x = 2.

The shared(x) clause makes explicit that the variable x is shared among the tasks. In an omp task, variables that are not shared in the enclosing context become firstprivate by default, i.e. the task gets its own copy of the variable, in which case an update such as x = 2 would not reach the other task.

In the second example the tasks are switched, but the dependencies are the same.

#include <stdio.h>
int main()
{
   int x = 1;
   #pragma omp parallel
   #pragma omp single
   {
      #pragma omp task shared(x) depend(in: x)
         printf("x = %d\n", x);
      #pragma omp task shared(x) depend(out: x)
         x = 2;
   }
   return 0;
}

Intuitively one might think that this should behave like the previous example, but the order in which the tasks are generated matters: in this example the output is always x = 1, i.e. the print statement is always executed before the x = 2 assignment.


Taskgroup pragma

The omp taskgroup pragma allows you to group a subset of tasks and have them synchronize independently of other tasks. A good example of this can be found here: Link to example
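
Since the linked example may not always be at hand, here is a minimal sketch of the idea (the subroutine names are placeholders, not part of OpenMP): the taskgroup construct waits at its closing brace for the tasks generated inside it (and their descendants), while tasks created outside the group are not waited for at that point.

#pragma omp parallel
#pragma omp single
{
    #pragma omp task
    background_work(); // not part of the task group

    #pragma omp taskgroup
    {
        #pragma omp task
        subtask1();

        #pragma omp task
        subtask2();
    } // waits here until subtask1 and subtask2 (and their descendants) are done

    // background_work() may still be running at this point
    postprocess();
}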

Running a block on a single thread

If necessary, part of a parallel block of code can be executed in a single thread:

#pragma omp parallel
{
    subroutine1(); // All threads execute the subroutine
    
    #pragma omp single
    {
        // Only one thread executes the subroutine. The other threads wait for 
        // the end of the block.
        subroutine2();
    }
    
    // All threads execute the subroutine, but only after the preceding single
    // block has been completed
    subroutine3();
    
}

The omp master pragma behaves very similarly, but the block is always executed by the so-called master thread (the same thread that executes the parts outside the parallel code block), and, unlike with single, the other threads do not wait at the end of the block.
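
As a sketch of the difference (using the same placeholder subroutines as above): with omp master there is no implicit barrier, so if the other threads must not run ahead, an explicit barrier has to be added (see Synchronisation below).

#pragma omp parallel
{
    #pragma omp master
    {
        // Only the master thread (thread 0) executes this block;
        // the other threads do NOT wait here
        subroutine2();
    }

    // Without this explicit barrier the other threads could call
    // subroutine3() before subroutine2() has finished
    #pragma omp barrier

    subroutine3();
}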


Data ownership


Shared and private variables

Variables outside the parallel block are shared by default, i.e. all threads handle the same variable. The variables defined inside the parallel block are private by default, i.e. each thread has its own version of the variable:

int a = 3; // Shared variable
#pragma omp parallel
{
    int b = 5; // Private variable
}

Race condition

Shared variables need to be protected in some situations:

int a = 3; // Shared variable
#pragma omp parallel
{
    int b = a; // Private variable
    b += 5;
    a = b; // Line (*)
}

When run in two threads, the above might do, for example:

  • Both threads behave nicely, where the shared variable a is incremented twice by 5, i.e. after the parallel block a = 13

  • Thread 1 assigns the value of a to b but before it reaches line (*) thread 2 assigns the value of a to b. Both threads thus have b equal to 3 and do their increments by 5. In either order the value 8 is stored into a.

Either outcome is possible. This is a race condition.


Thread 1      Thread 2      Variable a   b in thread 1   b in thread 2
int b = a;    int b = a;    3            3               3
b += 5;       -             3            3 -> 8          3
a = b;        b += 5;       3 -> 8       8               3 -> 8
-             a = b;        8 -> 8       8               8

Read the section Protecting resources below to learn how to avoid this.


Explicitly defining the type of variable

Variables can also be explicitly defined as shared or private:

int a = 3;
int b = 5;
#pragma omp parallel private(a) shared(b)
{
    cout << a << endl; // undefined
    cout << b << endl; // 5
    
    #pragma omp single
    b = 7; // A single thread changes the value of the shared variable
    
    cout << b << endl; // 7
}

The value of the variable a inside the parallel block above is undefined, because each thread has its own variable a, which is not initialized by default. The problem can be solved with the firstprivate keyword, which copies the value of the variable outside the parallel block to the single variable of the same name in the thread:

int a = 3;
int b = 5;
#pragma omp parallel firstprivate(a) shared(b)
{
    cout << a << endl; // 3
    cout << b << endl; // 5
    
    #pragma omp single
    {
        b = 7;  // A single thread changes the value of the variable
        a = 15; // A single thread modifies its own variable
    }

    cout << a << endl; // One thread prints 15, the rest print 3
    cout << b << endl; // 7
}

Reduction-keyword

The reduction keyword combines the behaviour of the private and shared keywords with the atomic pragma (see Atomic operations below):

TYPE VARIABLE = INIT
#pragma omp PRAGMA reduction(OP:VARIABLE)
{
    // Use the variable VARIABLE in some way...
}

The above pseudocode effectively does the following:

TYPE VARIABLE = INIT
#pragma omp PRAGMA
{
    TYPE _VARIABLE = OP_INIT;
    
    // Use the variable _VARIABLE in some way...
    
    #pragma omp atomic
    VARIABLE = VARIABLE OP _VARIABLE;
}

The binary operator OP and the corresponding initial value OP_INIT (which is dependent on the choice of OP) could for example be any of the following:

OP               OP_INIT
+, -, |, ^, ||   0
*, &&            1
&                ~0 (all bits set to 1)

For example, program code

int a = 0;
#pragma omp parallel for reduction(+:a)
for(int i = 0; i < N; i++)
    a += i;

is effectively the same as

int a = 0;
#pragma omp parallel
{
    int _a = 0;

    #pragma omp for nowait
    for(int i = 0; i < N; i++)
        _a += i;
    
    #pragma omp atomic
    a = a + _a;
}

Flush-pragma

The compiler may optimize the code so that a shared variable is kept in a register instead of memory. In some cases this can distort the results, because one thread sees only the memory, not the register "belonging" to another thread:

int a = 5;
#pragma omp parallel sections shared(a)
{
    #pragma omp section
    {
        a = 66;
        #pragma omp flush(a)
    }
    
     #pragma omp section
    {
        #pragma omp flush(a)
        cout << a << endl;
    }
}

The flush pragma forces the compiler to write the variable's value to memory after a write operation and forces the compiler to read the variable's value from memory before a read operation. This kind of tuning is rarely necessary, so it is better to use atomic operations or critical code sections (see below).
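
For comparison, here is a sketch of the same two-section pattern written with atomic operations instead of flush. Note that neither version enforces an ordering between the sections, so the printed value can still be either 5 or 66 depending on which section runs first; the atomic pragmas only make the concurrent read and write well defined.

int a = 5;
#pragma omp parallel sections shared(a)
{
    #pragma omp section
    {
        #pragma omp atomic write
        a = 66;
    }

    #pragma omp section
    {
        int local;
        #pragma omp atomic read
        local = a;
        cout << local << endl; // prints 5 or 66 depending on timing
    }
}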


Management of threads


Setting the number of threads

The number of threads can be selected with the num_threads(<number of threads>) keyword attached to the pragma, for example:

// Run the for-loop with three threads
#pragma omp parallel for num_threads(3)
for(int i = 0; i < n; i++) 
    a[i] = i;

Alternatively, the thread count can be set with the OMP_NUM_THREADS environment variable in Linux (e.g. $ OMP_NUM_THREADS=3 ./program). The thread count can also be set at runtime with a subroutine call:

void omp_set_num_threads(int num_threads);

Other OpenMP subroutines

The number of threads in use can be found out with a subroutine:

int omp_get_num_threads(void);

Each thread has its own index number, which can be accessed by a subroutine:

int omp_get_thread_num(void);

The maximum number of threads that will be used when the num_threads keyword is not given can be obtained with the subroutine:

int omp_get_max_threads(void);
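
As a small sketch of how these subroutines fit together (the printed numbers will of course depend on the machine):

#include <stdio.h>
#include <omp.h>

int main() {
    // Maximum number of threads a parallel region would use by default
    printf("at most %d threads available\n", omp_get_max_threads());

    // Request three threads for subsequent parallel regions
    omp_set_num_threads(3);

    #pragma omp parallel
    {
        // Inside the parallel region every thread has its own index number
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }

    // Outside a parallel region there is only one thread
    printf("outside the parallel region: %d thread(s)\n", omp_get_num_threads());
    return 0;
}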

Protecting resources

In situations where it is essential that only one thread at a time can execute a block of code, either atomic operations or the definition of a critical block of code must be used.


Atomic Operations

The atomic pragma defines the following command as atomic, meaning it will be executed entirely at once or not at all. This prevents other threads from "interrupting" the command's execution. In most cases, the command to be executed atomically must be implementable with a single machine instruction. In other situations, you must use, for example, a critical code section. If you're unsure, do not use atomic operations but stick to using critical sections (see the next subsection). Refer to Section 15.8.4 in the OpenMP specification and Section 9.4 in the API reference to learn more.


Examples of the use of atomic pragma:

// Atomic operation
#pragma omp atomic
v += something;

// Atomic read operation
#pragma omp atomic read
v = x;

// Atomic write operation
#pragma omp atomic write
v = something;

// Atomic update operation (same as `omp atomic`)
#pragma omp atomic update
v++;

// Atomic combined update and read operation
#pragma omp atomic capture
v = x += something;

In particular, the previously occurring race condition can be corrected as follows:

int a = 3; // Shared variable
#pragma omp parallel
{
    #pragma omp atomic
    a += 5; // Only one thread at a time can read and update the value of a
}

Critical code blocks

The critical pragma can be used somewhat more freely, but it is not nearly as fast as the atomic pragma. The command or code block that follows the pragma is automatically protected by a lock, and other threads must wait for the thread inside it to exit the block. An example of using the critical pragma to correct the earlier race condition:

int a = 3; // Shared variable
#pragma omp parallel
{
    #pragma omp critical(name)
    {
        int b = a; // Private variable
        b += 5;
        // No other thread can execute the line (*) in between
        a = b; // Line (*)
    }
}

In the above example, (name) is an optional name for a critical code block. Multiple critical code blocks can have the same name, and only one thread at a time can execute any of the critical blocks with the same name. Critical blocks with the same name therefore share the same lock.
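
A sketch of what sharing a name means in practice (the name balance_lock and the variable balance are only illustrative): both critical blocks below are protected by the same lock, so the increment and the decrement can never interleave, even though they are separate blocks of code.

int balance = 0; // Shared variable

#pragma omp parallel
{
    #pragma omp critical(balance_lock)
    balance += 10;

    // Protected by the same lock as the block above because the name matches
    #pragma omp critical(balance_lock)
    balance -= 5;
}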


Synchronisation

The execution of threads can be synchronized using a barrier as follows:

#pragma omp parallel
{

    subroutine1();
    
    // Barrier
    #pragma omp barrier
    
    // subroutine2 below only starts executing after the preceding barrier has 
    // been reached by all threads
    subroutine2();

}

At the end of each sections, for and single block there is also an implicit barrier (i.e. the threads are automatically synchronized there). This can be disabled with the nowait keyword. Note that nowait cannot be used on the combined parallel for pragma, so the worksharing pragma has to be written separately:

#pragma omp parallel
{
    #pragma omp for nowait
    for(int i = 0; i < n; i++) {
        a[i] = i;
    }
    // Threads can continue here without waiting for the whole loop to finish
}

Nested loops

You can also parallelize two nested loops whose iterations are independent of each other using the collapse keyword:

// Collapse the two nested loops into a single iteration space that is divided among the threads
#pragma omp parallel for collapse(2)
for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++)
        A[i][j] = i + j;

Example

In this chapter, we will walk through an example code for summing an array in parallel using the OpenMP interface. The example code can be downloaded from gitlab.

Serial implementation

Consider the following C/C++ subroutine:

double sum(double *x, int n) {
    double tmp = 0.0;
    for(int i = 0; i < n; i++)
        tmp += x[i];
    return tmp;
}

The subroutine sums the elements of the array x in a for loop. Output of the test program:

Time: 0.0400391 s
Flops: 1.08482 GFlops
Sum value: -2906.08
Real value: -2906.08

According to the output, it took about 40 ms to sum the array (n = 43435342).

Example code

First (unsuccessful) attempt

OpenMP's omp parallel for pragma allows for automatic parallelization of the for loop in cases where the loop iterations are independent of each other. So let's try to use it:

double sum(double *x, int n) {
    
    //
    // This version is meant to show you a WRONG approach!
    // DO NOT TAKE THIS AS AN EXAMPLE!!!
    //

    double tmp = 0.0; // Shared variable
    
    #pragma omp parallel for
    for(int i = 0; i < n; i++)
        tmp += x[i];
    
    return tmp;
}

Above, tmp is a shared variable that is visible to every thread, i.e. all threads modify the same tmp variable in the for loop. Output of the test program:

Time: 0.0093472 s
Flops: 4.64688 GFlops
Sum value: 944.953
Real value: 910.438

So the array summation took about 9 ms, i.e. the parallel execution is about four times faster. Unfortunately, the result is wrong! The reason for the incorrect result is a race condition inside the for loop, i.e. all threads try to change the tmp variable at the same time.

Example code

A workable but slow solution with a critical block

The race condition is typically solved by protecting the contested resource, in this case the tmp variable. Let's try to do this with the omp critical pragma:

double sum(double *x, int n) {
    
    //
    // This version is meant to show you a WRONG approach!
    // DO NOT TAKE THIS AS AN EXAMPLE!!!
    //

    double tmp = 0.0; // Shared variable
    
    #pragma omp parallel for
    for(int i = 0; i < n; i++)
        #pragma omp critical
        tmp += x[i];
    
    return tmp;
}

The omp critical pragma used above defines the code block that follows it as a critical code block. In practice, this means that the block is protected by a lock or semaphore, so that only one thread at a time can execute the protected code block.

Output of the test program:

Time: 7.64137 s
Flops: 0.00568424 GFlops
Sum value: 3746.95
Real value: 3746.95

So the program works correctly, but it is about 190 times slower than the unparallelized version! The slowdown is due to the fact that the critical block is placed inside the for loop, so the associated lock is acquired and released n times and the other threads have to wait. In practice, there is no parallelism in the program code!

Example code

A slightly faster version

Simple code blocks can be protected with an atomic operation. The line tmp += x[i]; is a simple update of a single variable, so it can be protected by the omp atomic pragma:

double sum(double *x, int n) {
    
    //
    // This version is meant to show you a WRONG approach!
    // DO NOT TAKE THIS AS AN EXAMPLE!!!
    //

    double tmp = 0.0;
    
    #pragma omp parallel for
    for(int i = 0; i < n; i++)
        #pragma omp atomic
        tmp += x[i];
    
    return tmp;
}

Test program output:

Time: 3.90907 s                                                                                                                            
Flops: 0.0111114 GFlops
Sum value: 7738.2
Real value: 7738.2

The program code produces the correct result, but it is still about 100 times slower than the unparallelized version! The reason for this is again that the atomic operation is inside the for loop.

Example code

The fast version

So, based on the previous two examples, we want to move the atomic operation out of the for loop. This cannot be done with the omp parallel for pragma, so we need to go back to the plain omp parallel pragma (see Parallel-pragma above):

double sum(double *x, int n) {

    double tmp = 0.0; // Shared variable
    
    // The code block after the `omp parallel`-pragma is executed by multiple threads
    #pragma omp parallel
    {
        // Each thread has its own variable p where their subtotal is stored
        double p = 0.0;
        
        // The `omp for`-pragma automatically splits the for-loop into subtasks.
        // Each thread has its own summation variable p, so the threads don't get
        // their results mixed up. 
        // The atomic operation has been removed from the loop!!
        #pragma omp for nowait
        for(int i = 0; i < n; i++)
            p += x[i];
        
        // Finally, each thread adds its own subtotal to the result in the variable tmp.
        // This way the `omp atomic`-pragma is only executed once per thread!!
        #pragma omp atomic
        tmp += p;
    }
    
    return tmp;
}

Above, the variable tmp is defined outside the omp parallel block, so it is a shared variable. The variable p is in turn defined inside the block, so it is a private variable. Thus each thread has its own variable p, whose value it can change without incurring a race condition!

Output of the test program:

Time: 0.00854421 s
Flops: 5.0836 GFlops
Sum value: 3836.15
Real value: 3836.15

The program produces the correct result and is almost 5 times faster than the unparallelized version!

Example code

Single line version

Program code can be parallelized with one extra line of code:

double sum(double *x, int n) {

    double tmp = 0.0;
    
    #pragma omp parallel for reduction(+:tmp)
    for(int i = 0; i < n; i++)
        tmp += x[i];
    
    return tmp;
}

The above program code is based on the reduction keyword. Using the reduction keyword correctly requires a good command of the concepts discussed above.

Output of the test program:

Time: 0.00862789 s
Flops: 5.03429 GFlops
Sum value: 2112.93
Real value: 2112.93

The reduction version is thus about as fast as the program code of the previous subsection.

Example code
