OpenMP
Introduction
The OpenMP compiler extension allows serial code to be parallelized with relatively little extra effort. In the simplest case, a for
loop with iterations that are independent of each other can be parallelized as follows:
#pragma omp parallel for
for(int i = 0; i < n; i++) {
a[i] = i;
}
However, the compiler must support OpenMP and, for example, OpenMP support for the gcc compiler is enabled with -fopenmp
:
g++ ... -fopenmp
In some cases, it is necessary to include the omp.h
header file:
#include <omp.h>
Pragmas work without the header file, but prototypes for OpenMP subroutines such as omp_get_thread_num() are defined in that header file.
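For example, the following minimal program (a sketch; compile with g++ -fopenmp) uses both a pragma and the omp_get_thread_num() and omp_get_num_threads() subroutines declared in omp.h:
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        // omp_get_thread_num() and omp_get_num_threads() are declared in omp.h
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}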
Parallel code block
Parallel-pragma
All OpenMP parallelism takes place within a parallel block of code:
#pragma omp parallel
{
// This block of code is executed in parallel on several threads
}
Defining a parallel block of code spawns a team of threads. The omp parallel for
pragma in the introduction is therefore an abbreviation of the following:
#pragma omp parallel
{
#pragma omp for
for(int i = 0; i < n; i++) {
a[i] = i;
}
}
The omp for pragma divides the iterations of the for loop among the threads.
Next we will look at how to distribute work within an omp parallel region to different threads. There are two main ways to do this: omp sections and omp task.
An omp section block limits the scope of some work so that only a single thread will execute it. Multiple omp section blocks can then be executed simultaneously by different threads.
While omp sections can still be useful in certain situations, OpenMP has offered a more versatile alternative since version 3.0: the omp task pragma, complemented by the omp taskgroup pragma and task dependencies since version 4.0.
omp task and omp taskgroup provide a way to let OpenMP dynamically balance the workload over a large group of tasks. Tasks also provide a way to set up dependencies between tasks, so a dependent task can be created immediately but is not picked up for execution until the computations it depends on have finished.
Sections-pragma
The program code within a parallel block of code can also be divided into sections using the omp sections
and omp section
pragmas. The subtasks specified by the omp section
pragma are automatically distributed among the threads so that each subtask is executed exactly once:
#pragma omp parallel
{
// All threads execute the following subroutine
subroutine1();
// Start a block of code consisting of subtasks
#pragma omp sections
{
// This subroutine is treated as if it were wrapped in an omp section block of
// its own and is thus executed by only one thread. It is good practice to
// make this explicit by always writing the "#pragma omp section" directive.
subroutine2();
#pragma omp section
{
// Only one thread shall execute the following commands
int k = 55;
subroutine2();
}
#pragma omp section
subroutine3(); // Only one thread shall execute the subroutine
}
}
In a situation where you do not have a subroutine that needs to be executed by all threads, but only need sections, you can collapse
#pragma omp parallel
{
#pragma omp sections
{
// Some sections here
// ...
}
}
into the slightly more compact form
#pragma omp parallel sections
{
// Some sections here
// ...
}
Code placed at the beginning of an omp sections block, before the first omp section directive, is implicitly treated as a section of its own and is therefore executed by only one thread. This is why the first call to subroutine2() in the example at the beginning of this section is executed only once, even though it is not wrapped in an explicit omp section. After the first omp section directive, however, any code that is not wrapped in an omp section block causes a compile error, so it is best to write every section explicitly.
Note, however, that the first example of this section cannot be shortened in this way, because subroutine1() has to be executed by all threads, which is not possible inside a parallel sections block. If the calls are simply moved inside, they become part of the implicit first section and are executed sequentially by a single thread:
#pragma omp parallel sections
{
// These two subroutines are treated as if they were inside a shared implicit
// omp section, so they are executed sequentially by the same single thread.
subroutine1();
subroutine2();
#pragma omp section
{
// Only one thread shall execute the following commands
int k = 55;
subroutine2();
}
#pragma omp section
subroutine3(); // Only one thread shall execute the subroutine
}
Task-pragma
Parallel tasks can also be defined directly with a task pragma:
// subroutine1 and subroutine2 are independent tasks and can
// be processed in parallel
#pragma omp parallel
{
#pragma omp single nowait
{
#pragma omp task
subroutine1();
#pragma omp task
subroutine2();
}
// Wait for the parallel tasks to complete
#pragma omp taskwait
}
The omp single pragma here ensures that only a single thread creates the tasks; the single pragma is discussed in more detail below.
Note that this is essentially equivalent to
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
subroutine1();
#pragma omp section
subroutine2();
}
}
Dependencies
Some excellent examples are available here: Link to examples
Let us have a look at the first example
#include <stdio.h>
int main() {
int x = 1;
#pragma omp parallel
#pragma omp single
{
#pragma omp task shared(x) depend(out: x)
x = 2;
#pragma omp task shared(x) depend(in: x)
printf("x = %d\n", x);
}
return 0;
}
Note the depend(...)
clause. This clause causes an ordering of the tasks. The out: x
indicates that a change to x
will occur, and any subsequently created in: x
task should wait for this change. Thus this program will always output x = 2
.
The shared(x) clause indicates that the variable x should be shared among the tasks. If a variable is instead firstprivate in a task (the default for variables that are not already shared in the enclosing context), the task gets its own copy of the variable and the update x = 2 would not reach the second task.
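To illustrate the difference, here is a small sketch in which the task is explicitly given a firstprivate copy of x; the assignment inside the task then changes only the task's own copy:
#include <stdio.h>

int main(void) {
    int x = 1;
    #pragma omp parallel
    #pragma omp single
    {
        // firstprivate(x): the task gets its own copy of x, initialized to 1.
        // The assignment below changes only that copy, not the x in main().
        #pragma omp task firstprivate(x)
        x = 2;

        #pragma omp taskwait
        printf("x = %d\n", x);   // always prints x = 1
    }
    return 0;
}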
In the second example the tasks are switched, but the dependencies are the same.
#include <stdio.h>
int main()
{
int x = 1;
#pragma omp parallel
#pragma omp single
{
#pragma omp task shared(x) depend(in: x)
printf("x = %d\n", x);
#pragma omp task shared(x) depend(out: x)
x = 2;
}
return 0;
}
Intuitively one might think that this should work like the previous example, but the order in which the tasks are created matters: the task with depend(in: x) is created first, so the task with depend(out: x) must wait for it. In this example the output will therefore always be x = 1, i.e. the print statement is always executed before the x = 2 assignment occurs.
Taskgroup-pragma
The omp taskgroup pragma allows you to group a subset of tasks and have them synchronize independently of other tasks. A good example of this can be found here: Link to example
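As a minimal sketch of the idea (the printed messages are placeholders): the end of a taskgroup waits only for the tasks created inside it, not for tasks created outside of it:
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        printf("background task, not part of the taskgroup\n");

        #pragma omp taskgroup
        {
            #pragma omp task
            printf("grouped task 1\n");
            #pragma omp task
            printf("grouped task 2\n");
        }
        // The taskgroup above waits here until the two grouped tasks (and any
        // tasks they might create) have finished, but it does not wait for the
        // background task created before the group.
        printf("grouped tasks are done\n");
    }
    return 0;
}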
Running a block on a single thread
If necessary, part of a parallel block of code can be executed in a single thread:
#pragma omp parallel
{
subroutine1(); // All threads execute the subroutine
#pragma omp single
{
// Only one thread executes the subroutine. The other threads wait for
// the end of the block.
subroutine2();
}
// All threads execute the subroutine, but only after the previous single block
// has completed (there is an implicit barrier at the end of single)
subroutine3();
}
The pragma omp master
behaves very similarly, but the block is executed each time by the so-called master thread (the same thread that executes the parts outside the parallel code block) and the other threads do not wait at the end of the block.
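For example, a small sketch (the printed messages are placeholders):
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        // All threads execute this line.
        printf("Thread %d starts working\n", omp_get_thread_num());

        #pragma omp master
        {
            // Only the master thread (thread number 0) executes this block.
            // Unlike omp single, there is no implicit barrier at the end,
            // so the other threads continue immediately without waiting.
            printf("Progress report from the master thread\n");
        }

        // Other threads may already be here while the master block runs.
        printf("Thread %d continues\n", omp_get_thread_num());
    }
    return 0;
}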
Data ownership
Shared and private variables
Variables outside the parallel block are shared by default, i.e. all threads handle the same variable. The variables defined inside the parallel block are private by default, i.e. each thread has its own version of the variable:
int a = 3; // Shared variable
#pragma omp parallel
{
int b = 5; // Private variable
}
Race condition
Shared variables need to be protected in some situations:
int a = 3; // Shared variable
#pragma omp parallel
{
int b = a; // Private variable
b += 5;
a = b; // Line (*)
}
When run with two threads, the above might, for example:

- Behave nicely: the shared variable a is incremented twice by 5, i.e. after the parallel block a = 13.
- Interleave: thread 1 assigns the value of a to b, but before it reaches line (*), thread 2 assigns the value of a to b. Both threads thus have b equal to 3 and increment it by 5. In either order, the value 8 is stored into a.

Either outcome is possible. This is a race condition. The second scenario is illustrated step by step below:
Thread 1 | Thread 2 | Variable a | b in thread 1 | b in thread 2 |
---|---|---|---|---|
int b = a; | int b = a; | 3 | 3 | 3 |
b += 5; | - | 3 | 3 -> 8 | 3 |
a = b; | b += 5; | 3 -> 8 | 8 | 3 -> 8 |
- | a = b; | 8 -> 8 | 8 | 8 |
Read the section Protecting resources below to learn how to avoid this.
Explicitly defining the type of variable
Variables can also be explicitly defined as shared or private:
int a = 3;
int b = 5;
#pragma omp parallel private(a) shared(b)
{
cout << a << endl; // undefined
cout << b << endl; // 5
#pragma omp single
b = 7; // A single thread changes the value of the shared variable
cout << b << endl; // 7
}
The value of the variable a
inside the parallel block above is undefined, because each thread has its own variable a
, which is not initialized by default. The problem can be solved with the firstprivate
keyword, which copies the value of the variable defined outside the parallel block into each thread's private variable of the same name:
int a = 3;
int b = 5;
#pragma omp parallel firstprivate(a) shared(b)
{
cout << a << endl; // 3
cout << b << endl; // 5
#pragma omp single
{
b = 7; // A single thread changes the value of the variable
a = 15; // A single thread modifies its own variable
}
cout << a << endl; // One thread prints 15, the rest print 3
cout << b << endl; // 7
}
Reduction-keyword
The reduction
keyword combines the keywords private
, shared
and atomic
(see the atomic pragma below):
TYPE VARIABLE = INIT
#pragma omp PRAGMA reduction(OP:VARIABLE)
{
// Use the variable VARIABLE in some way...
}
Here INIT is the user-supplied initial value of the reduction variable VARIABLE, while each thread's private copy is initialized to the operator-specific identity value OP_INIT, as the expansion below shows.
The above pseudocode effectively does the following:
TYPE VARIABLE = INIT
#pragma omp PRAGMA
{
TYPE _VARIABLE = OP_INIT;
// Use the variable _VARIABLE in some way...
#pragma omp atomic
VARIABLE = VARIABLE OP _VARIABLE;
}
The binary operator OP
and the corresponding initial value OP_INIT
(which is dependent on the choice of OP
) could for example be any of the following:
OP | OP_INIT |
---|---|
+, -, |, ^, || | 0 |
*, && | 1 |
& | ~0 = 1111... |
For example, program code
int a = 0;
#pragma omp parallel for reduction(+:a)
for(int i = 0; i < N; i++)
a += i;
is effectively the same as
int a = 0;
#pragma omp parallel
{
int _a = 0;
#pragma omp for nowait
for(int i = 0; i < N; i++)
_a += i;
#pragma omp atomic
a = a + _a;
}
Flush-pragma
The compiler may optimize the code so that a shared variable is kept in a register instead of memory. This can distort the results in some cases, because one thread sees only the memory, but not the register "belonging" to the other thread:
int a = 5;
#pragma omp parallel sections shared(a)
{
#pragma omp section
{
a = 66;
#pragma omp flush(a)
}
#pragma omp section
{
#pragma omp flush(a)
cout << a << endl;
}
}
The flush
pragma forces the compiler to write the variable's value to memory after a write operation and forces the compiler to read the variable's value from memory before a read operation. This kind of tuning is rarely necessary, so it is better to use atomic operations or critical code sections (see below).
Management of threads
Setting the number of threads
The number of threads can be selected using the num_threads(<number of threads>)
keyword in the context of an omp parallel pragma, for example:
// Run the for-loop with three threads
#pragma omp parallel for num_threads(3)
for(int i = 0; i < n; i++)
a[i] = i;
Alternatively, the thread count can be set with the OMP_NUM_THREADS
environment variable in Linux (e.g. $ OMP_NUM_THREADS=3 ./program
). The thread count can also be set at runtime with a subroutine call:
void omp_set_num_threads(int num_threads);
Other OpenMP subroutines
The number of threads in use can be found out with a subroutine:
int omp_get_num_threads(void);
Each thread has its own index number, which can be accessed by a subroutine:
int omp_get_thread_num(void);
The maximum possible number of threads, when the keyword num_threads
is not given, is obtained by a subroutine:
int omp_get_max_threads(void);
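A minimal sketch combining these subroutines (the printed messages are placeholders):
#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("Maximum number of threads: %d\n", omp_get_max_threads());

    // Request three threads for the following parallel regions
    omp_set_num_threads(3);

    #pragma omp parallel
    {
        // Inside the parallel region each thread has its own index number
        printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}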
Protecting resources
In situations where it is essential that only one thread at a time can execute a block of code, either atomic operations or the definition of a critical block of code must be used.
Atomic Operations
The atomic pragma defines the following command as atomic, meaning it will be executed entirely at once or not at all. This prevents other threads from "interrupting" the command's execution. In most cases, the command to be executed atomically must be implementable with a single machine instruction. In other situations, you must use, for example, a critical code section. If you're unsure, do not use atomic operations but stick to using critical sections (see the next subsection). Refer to Section 15.8.4 in the OpenMP specification and Section 9.4 in the API reference to learn more.
Examples of the use of atomic pragma:
// Atomic operation
#pragma omp atomic
v += something;
// Atomic read operation
#pragma omp atomic read
v = x;
// Atomic write operation
#pragma omp atomic write
v = something;
// Atomic update operation (same as `omp atomic`)
#pragma omp atomic update
v++;
// Atomic combined update and read operation
#pragma omp atomic capture
v = x += something;
In particular, the previously occurring race condition can be corrected as follows:
int a = 3; // Shared variable
#pragma omp parallel
{
#pragma omp atomic
a += 5; // Only one thread at a time reads and updates the value of a
}
Critical code blocks
The critical pragma can be used a little more freely, but it is not nearly as fast as the atomic pragma. The command or code block that comes after the pragma is automatically protected by a lock, and other threads must wait for the thread inside the block to exit it. An example of using the critical pragma to correct the earlier race condition:
int a = 3; // Shared variable
#pragma omp parallel
{
#pragma omp critical(name)
{
int b = a; // Private variable
b += 5;
// No other thread can execute the line (*) in between
a = b; // Line (*)
}
}
In the above example, (name)
is an optional name for a critical code block. Multiple critical code blocks can have the same name, and only one thread at a time can execute any of the critical blocks with the same name. Critical blocks with the same name therefore share the same lock.
Synchronisation
The execution of threads can be synchronized using a barrier as follows:
#pragma omp parallel
{
subroutine1();
// Barrier
#pragma omp barrier
// subroutine2 below only starts executing after the previous barrier has
// been reached by all threads
subroutine2();
}
At the end of each sections, for and single block there is also an implicit synchronization point (i.e. the threads are automatically synchronized at a barrier). This implicit barrier can be removed with the nowait keyword. Note that nowait belongs on the worksharing directive itself (for, sections, single), not on the combined parallel for construct:
#pragma omp parallel
{
#pragma omp for nowait
for(int i = 0; i < n; i++) {
a[i] = i;
}
// Threads that finish their share of the loop continue here immediately
// instead of waiting for the other threads.
}
Nested loops
You can also parallelize two nested loops whose iterations are independent of each other using the collapse
keyword:
// Collapse two levels of loops so that the combined iteration space is divided among the threads
#pragma omp parallel for collapse(2)
for(int i = 0; i < n; i++)
for(int j = 0; j < n; j++)
A[i][j] = i + j;
Example
In this chapter, we will walk through an example code for summing an array in parallel using the OpenMP interface. The example code can be downloaded from gitlab.
Serial implementation
Consider the following C/C++ subroutine:
double sum(double *x, int n) {
double tmp = 0.0;
for(int i = 0; i < n; i++)
tmp += x[i];
return tmp;
}
The subroutine sums the elements of the array x in a for loop. Output of the test program:
Time: 0.0400391 s
Flops: 1.08482 GFlops
Sum value: -2906.08
Real value: -2906.08
According to the output, it took about 40 ms to sum the array (n = 43435342
).
First (unsuccessful) attempt
OpenMP's omp parallel for
pragma allows for automatic parallelization of the for loop in cases where the loop iterations are independent of each other. So let's try to use it:
double sum(double *x, int n) {
//
// This version is meant to show you a WRONG approach!
// DO NOT TAKE THIS AS AN EXAMPLE!!!
//
double tmp = 0.0; // Shared variable
#pragma omp parallel for
for(int i = 0; i < n; i++)
tmp += x[i];
return tmp;
}
Above, tmp is a shared variable that is visible to every thread, i.e. all threads modify the same tmp variable in the for loop. Output of the test program:
Time: 0.0093472 s
Flops: 4.64688 GFlops
Sum value: 944.953
Real value: 910.438
So the array summation took about 9 ms, i.e. the parallel execution is about four times faster. Unfortunately, the result is wrong! The reason for the incorrect result is a race condition occurring inside the for loop, i.e. all threads try to change the tmp
variable at the same time.
A workable but slow solution with a critical block
The race condition is typically solved by protecting the raced resource, in this case the tmp
variable. Let's try to do this with the omp critical
pragma:
double sum(double *x, int n) {
//
// This version is meant to show you a WRONG approach!
// DO NOT TAKE THIS AS AN EXAMPLE!!!
//
double tmp = 0.0; // Shared variable
#pragma omp parallel for
for(int i = 0; i < n; i++)
#pragma omp critical
tmp += x[i];
return tmp;
}
The omp critical
pragma used above defines the following code block as the critical code block. In practice, this means that the block is protected by a lock or semaphore, so that only one thread at a time can execute the protected code block.
Output of the test program:
Time: 7.64137 s
Flops: 0.00568424 GFlops
Sum value: 3746.95
Real value: 3746.95
So the program works correctly, but is about 190 times slower than the unparallelized version! The slowdown is due to the fact that the critical block is placed inside the for loop , so the associated lock is opened and closed n
times and the other threads have to wait. So in practice, there is no parallelism in the program code!
A slightly faster version
Simple operations can be protected with an atomic operation. The line tmp += x[i];
is a simple single variable update operation, so it can be protected by the omp atomic
pragma:
double sum(double *x, int n) {
//
// This version is meant to show you a WRONG approach!
// DO NOT TAKE THIS AS AN EXAMPLE!!!
//
double tmp = 0.0;
#pragma omp parallel for
for(int i = 0; i < n; i++)
#pragma omp atomic
tmp += x[i];
return tmp;
}
Test program output:
Time: 3.90907 s
Flops: 0.0111114 GFlops
Sum value: 7738.2
Real value: 7738.2
The program code produces the correct result, but is still about 100 times slower than the unparallelized version! The reason for this is again that the atomic operation is inside the for loop.
The fast version
So, based on the previous two examples, we want to move the atomic operation out of the for loop. This cannot be done with the combined omp parallel for pragma alone; instead we go back to the plain omp parallel pragma:
double sum(double *x, int n) {
double tmp = 0.0; // Shared variable
// The code block after the `omp parallel`-pragma is executed by multiple threads
#pragma omp parallel
{
// Each thread has its own variable p where their subtotal is stored
double p = 0.0;
// The `omp for`-pragma automatically splits the for-loop into subtasks.
// Each thread has its own summation variable p, so the threads don't get
// their results mixed up.
// The atomic operation has been removed from the loop!!
#pragma omp for nowait
for(int i = 0; i < n; i++)
p += x[i];
// Finally, each thread adds its own subtotal to the result in the variable tmp.
// This way the `omp atomic`-pragma is only executed once per thread!!
#pragma omp atomic
tmp += p;
}
return tmp;
}
Above, the variable tmp
is defined outside the omp parallel
block, so it is a shared variable. The variable p
is in turn defined inside the block, so it is a private variable. Thus each thread has its own variable p
, whose value it can change without incurring a race condition!
Output of the test program:
Time: 0.00854421 s
Flops: 5.0836 GFlops
Sum value: 3836.15
Real value: 3836.15
The program produces the correct result and is almost 5 times faster than the unparallelized version!
Single line version
Program code can be parallelized with one extra line of code:
double sum(double *x, int n) {
double tmp = 0.0;
#pragma omp parallel for reduction(+:tmp)
for(int i = 0; i < n; i++)
tmp += x[i];
return tmp;
}
The above program code is based on the reduction keyword. Using it correctly requires a good command of the concepts discussed above.
Output of the test program:
Time: 0.00862789 s
Flops: 5.03429 GFlops
Sum value: 2112.93
Real value: 2112.93
The reduction version is thus about as fast as the program code in the previous subsection.