OpenMP Tutorial – firstprivate and lastprivate

Here I will consider firstprivate and lastprivate. Recall from one of the earlier entries about private variables: when a variable is declared as private, each thread gets its own memory location in which to store that variable's value while inside the parallel region. When the parallel region ends, this memory is freed and these per-thread copies no longer exist. Consider the following bit of code as an example:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>


int main(void){
    int i;
    int x;
    x=44;
    #pragma omp parallel for private(x)
    for(i=0;i<=10;i++){
        x=i;  /* each thread writes to its own private copy of x */
        printf("Thread number: %d     x: %d\n",omp_get_thread_num(),x);
    }
    printf("x is %d\n", x);  /* the original, shared x is untouched */
    return 0;
}

Yields…

Thread number: 0     x: 0
Thread number: 0     x: 1
Thread number: 0     x: 2
Thread number: 3     x: 9
Thread number: 3     x: 10
Thread number: 2     x: 6
Thread number: 2     x: 7
Thread number: 2     x: 8
Thread number: 1     x: 3
Thread number: 1     x: 4
Thread number: 1     x: 5
x is 44

You’ll notice that x is exactly the value it was before the parallel region.

Suppose we wanted to keep the last value of x after the parallel region. This can be achieved with lastprivate. Replace private(x) with lastprivate(x) and this is the result:

Thread number: 3     x: 9
Thread number: 3     x: 10
Thread number: 1     x: 3
Thread number: 1     x: 4
Thread number: 1     x: 5
Thread number: 0     x: 0
Thread number: 0     x: 1
Thread number: 0     x: 2
Thread number: 2     x: 6
Thread number: 2     x: 7
Thread number: 2     x: 8
x is 10

Notice that it is 10 and not 8. That is to say, what is kept is the value from the sequentially last iteration (i=10), not from the last write that happened to occur (here thread 2 printed x=8 last). Now what if we replace lastprivate(x) with firstprivate(x)? What do you think it will do? This:

Thread number: 3     x: 9
Thread number: 3     x: 10
Thread number: 1     x: 3
Thread number: 1     x: 4
Thread number: 1     x: 5
Thread number: 0     x: 0
Thread number: 0     x: 1
Thread number: 0     x: 2
Thread number: 2     x: 6
Thread number: 2     x: 7
Thread number: 2     x: 8
x is 44

If you were like me, you were expecting to get the value 0, i.e. the value of x on the first iteration. Not so.

firstprivate specifies that each thread should have its own instance of the variable, and that each instance should be initialised with the value the variable had before the parallel construct.

That is, every thread gets its own instance of x and that instance equals 44.
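To see this directly, here is a minimal sketch (my own, not from the original post) that prints each thread's copy of x before anything is assigned to it:

#include <stdio.h>
#include <omp.h>

int main(void){
    int x = 44;
    #pragma omp parallel firstprivate(x)
    {
        /* each thread's private copy of x was initialised to 44 */
        printf("Thread %d starts with x = %d\n", omp_get_thread_num(), x);
    }
    return 0;
}

Every thread prints 44, confirming that firstprivate copies in the value from before the parallel region.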

OpenMP Tutorial – Critical, Atomic and Reduction

Atomic and Critical

  • critical: the enclosed code block will be executed by only one thread at a time, never simultaneously by multiple threads. It is often used to protect shared data from race conditions.
  • atomic: the memory update (write, or read-modify-write) in the next instruction will be performed atomically. It does not make the entire statement atomic; only the memory update is atomic. A compiler might use special hardware instructions for better performance than with critical. A minimal sketch contrasting the two follows.
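Here is a minimal sketch contrasting the two (the names counter and bins are my own, not from the tutorial):

#include <stdio.h>
#include <omp.h>

int main(void){
    int i;
    int counter = 0;
    int bins[4] = {0, 0, 0, 0};
    #pragma omp parallel for
    for(i = 0; i < 1000; i++){
        #pragma omp atomic
        counter++;          /* only this single memory update is atomic */
        #pragma omp critical
        {                   /* the whole block runs one thread at a time */
            bins[i % 4]++;
        }
    }
    printf("counter = %d, bins[0] = %d\n", counter, bins[0]);
    return 0;
}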

Consider this code which numerically approximates pi:


#include <stdio.h>

int main(void){
    double pi,x;
    int i,N;
    pi=0.0;
    N=1000;
    #pragma omp parallel for
    for(i=0;i<N;i++){
        x=(double)i/N;
        pi+=4/(1+x*x);
    }
    pi=pi/N;
    printf("Pi is %f\n",pi);
    return 0;
}

We compile this with gcc main.c -o test, leaving out the -fopenmp option. This means that the #pragma omp parallel for is treated as an unrecognised pragma and simply ignored, so the code runs serially and prints a sensible approximation to pi.

Now compile with the -fopenmp option and run it: the answer comes out wrong, and it changes from run to run. Oh dear... Let's examine what went wrong. By default, and as we have not specified it as private, the variable x is shared. This means all threads use the same memory address for the variable x. Therefore, thread i will compute some value of x and store it at memory address &x, and thread j will then compute its value of x and store it at &x BEFORE thread i has used its value to make its contribution to pi. The threads are all overwriting each other's values of x because they all share the same memory address for x. Our first correction is that x must be made private:
#pragma omp parallel for private(x)

Secondly, we have a “Race Condition” for pi. Let me illustrate this with a simple example. Here is what would ideally happen:

 

  • Thread 1 reads the current value of pi : 0
  • Thread 1 increments the value of pi : 1
  • Thread 1 stores the new value of pi: 1
  • Thread 2 reads the current value of pi: 1
  • Thread 2 increments the value of pi: 2
  • Thread 2 stores the value of pi: 2

What is actually happening is more like this:

  • Thread 1 reads the current value of pi: 0
  • Thread 2 reads the current value of pi: 0
  • Thread 1 increments pi: 1
  • Thread 2 increments pi: 1
  • Thread 1 stores its value of pi: 1
  • Thread 2 stores its value of pi: 1

The way to correct this is to tell the code to execute the read/write of pi one thread at a time. This can be achieved with critical or atomic. Add
#pragma omp atomic
just before pi gets updated and you'll see that it works.
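Putting both corrections together, the parallel loop becomes:

#pragma omp parallel for private(x)
for(i=0;i<N;i++){
    x=(double)i/N;
    #pragma omp atomic
    pi+=4/(1+x*x);    /* the update of pi is now performed atomically */
}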

This scenario, where you update some shared value inside a parallel loop, crops up time and time again, so in the end it was given its own clause. All of the above can be achieved by simply making pi a reduction variable.

Reduction

To make pi a reduction variable the code is changed as follows:

 

#include <stdio.h>

int main(void){
    double pi,x;
    int i,N;
    pi=0.0;
    N=1000;
    #pragma omp parallel for private(x) reduction(+:pi)
    for(i=0;i<N;i++){
        x=(double)i/N;
        pi+=4/(1+x*x);
    }
    pi=pi/N;
    printf("Pi is %f\n",pi);
    return 0;
}

This is simply the quick and neat way of achieving everything we did above: each thread accumulates into its own private copy of pi, and the per-thread copies are combined into the shared pi at the end of the loop.

OpenMP Parallel For

The parallel directive #pragma omp parallel makes the code parallel: it forks the master thread into a number of parallel threads, but it doesn't actually share out the work.
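To see what this means, here is a minimal sketch (my own, not from the post): with plain parallel, every thread executes the whole loop, so the work is replicated rather than divided:

#include <stdio.h>
#include <omp.h>

int main(void){
    #pragma omp parallel
    {
        int i;  /* declared inside the region, so private to each thread */
        for(i = 0; i < 3; i++)
            printf("Thread %d runs iteration %d\n", omp_get_thread_num(), i);
    }
    return 0;
}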
What we are really after is the parallel for directive, which we call a work-sharing construct. Consider

#include <iostream>
#include <omp.h>

using namespace std;

int main(void)
{
    int i;
    int foo;
    #pragma omp parallel for
    for(i=1;i<10;i++){
        #pragma omp critical
        {
            foo=omp_get_thread_num();
            cout << "Loop number: " << i << " " << "Thread number: " << foo << endl;
        }
    }
    return 0;
}

The for directive applies to the for loop immediately following it. Notice how we don't have to enclose a parallel region in curly braces {} after this directive, in contrast to before. This program yields:

[michael@michael lindonslog]$ ./openmp 
Loop number: 1 Thread number: 0
Loop number: 8 Thread number: 3
Loop number: 2 Thread number: 0
Loop number: 3 Thread number: 0
Loop number: 9 Thread number: 3
Loop number: 6 Thread number: 2
Loop number: 4 Thread number: 1
Loop number: 7 Thread number: 2
Loop number: 5 Thread number: 1
[michael@michael lindonslog]$ 

Notice what I said about the order: the iterations are not executed in any particular order. By default, the loop index, i.e. i in this context, is made private by the for directive.

At the end of the parallel for loop there is an implicit barrier, where all threads wait until every thread has finished. There are, however, some rules for the parallel for directive (a conforming loop is sketched after the list):

  1. The loop index, i, is incremented by a fixed amount each iteration, e.g. i++ or i+=step.
  2. The start and end values must not change during the loop.
  3. There must be no “breaks” in the loop where the code steps out of that code block. Function calls are, however, permitted and run as you would expect.
  4. The comparison operators may be <, <=, >= or >.
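For example, the following loop conforms to these rules (work() is a hypothetical function standing in for the loop body):

#pragma omp parallel for
for(i = 0; i < N; i += 2){  /* fixed stride, bounds unchanged, no break */
    work(i);                /* function calls inside the loop are fine */
}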

There may be times when you want to perform some operation in the order of the iterations. This can be achieved with the ordered directive and an ordered clause. Each thread will wait until the previous iteration has finished its ordered section before proceeding with its own.

#include <stdio.h>
#include <omp.h>

int main(void){
    int i,a[10];
    #pragma omp parallel for ordered
    for(i=0;i<10;i++){
        a[i]=expensive_function(i);  /* assumed to be defined elsewhere */
        #pragma omp ordered
        printf("Thread ID: %d    Hello World %d\n",omp_get_thread_num(),i);
    }
    return 0;
}

This will now print out the Hello Worlds in order. N.B. there is a penalty for this: the threads have to wait until the preceding iteration has finished with its ordered section of code. This is only worthwhile if expensive_function() really is expensive.

Parallel Programming in C with OpenMP

OpenMP – Open Specifications for Multi Processing

The central theme of parallel code is that of threads. A serial code starts off as one thread. As soon as the first parallel directive (Fortran) / pragma (C) is encountered, the master thread forks into a number of threads. The code that follows is then executed in parallel, in a manner which can be adjusted using certain options.

To get started with OpenMP, you will need to include the omp header file with
#include <omp.h>
and you will need to add the -fopenmp option when compiling.

To fork the master thread into a number of parallel threads, one writes the following line of code:
#pragma omp parallel
This directive applies only to the following block of code, {…}, and must be structured as such. By default, all variables previously declared are shared, i.e. all threads see the same memory address for a shared variable. This can, however, be declared explicitly by adding shared(var_name). Conversely, you may want to make variables private, that is, each thread gets its own location in memory in which to store the variable. A private variable is visible only to the thread it belongs to, and the additional per-thread copies created for parallelisation are destroyed when the threads merge. There are also reduction variables. More on that later…
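For instance (a minimal fragment of my own; a and b are hypothetical names):

int a = 1, b = 2;
#pragma omp parallel shared(a) private(b)
{
    /* every thread sees the same a; each thread gets its own copy of b,
       which starts out uninitialised (use firstprivate(b) to copy the 2 in) */
    b = a + omp_get_thread_num();
}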

Let's try an example. When you execute your code, it will inherit the OMP_NUM_THREADS environment variable of your terminal. Suppose we want to set the number of threads to 4. We write:

prog@michael-laptop:~$ export OMP_NUM_THREADS=4
prog@michael-laptop:~$ echo $OMP_NUM_THREADS
4
prog@michael-laptop:~$ 

You can also set the number of threads at run time with the omp_set_num_threads() function declared in omp.h.
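For example, calling it before a parallel region:

omp_set_num_threads(4);   /* takes effect for subsequent parallel regions */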

Good. Now here’s our sample code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void){
    printf("Number of threads before parallelisation: %d\n",omp_get_num_threads());
    #pragma omp parallel
    {
        printf("Current thread number: %d\n",omp_get_thread_num());
        if(omp_get_thread_num()==0)
        {
            printf("Number of threads after parallelisation: %d\n",omp_get_num_threads());
        }
    }
    printf("Number of threads after merging threads %d\n",omp_get_num_threads());
    return 0;
}

compile and run:

prog@michael-laptop:~$ g++ openmp.cpp -o test -fopenmp
prog@michael-laptop:~$ ./test
Number of threads before parallelisation: 1
Current thread number: 3
Current thread number: 1
Current thread number: 0
Number of threads after parallelisation: 4
Current thread number: 2
Number of threads after merging threads 1

Next I’ll talk about work sharing…