Compile R and OpenBLAS from Source Guide

1. Download OpenBLAS
2. Download R
2.1 Additional Instructions for DSS Users
3. Setup Validation
4. Benchmark

This guide is intended to aid any R and Linux user who desires a threaded version of BLAS. In particular, I hope it will allow other grad students who, like me, do not have many user privileges on their office computer, to follow suit and exploit multiple cores to speed up their linear algebra computations within R. The following was performed on Scientific Linux 6.4 but should be completely general. If you are an Ubuntu user, there is an elegant and streamlined process for changing BLAS libraries and a recommended post describing it. I use Fedora on my laptop, and the following has also been tested there.

My office computer has a quad-core processor with two threads per core, but I also have access to a departmental computer with 4 sockets and 12 cores per socket (1 thread per core), so it really makes sense to use a threaded version of BLAS. If you are curious about the hardware on your own computer you can run “cat /proc/cpuinfo” or “lscpu”.

Unfortunately my office computer is part of a network on which I do not have permission to change ‘/usr/lib64/R/lib/libRblas.so’. Moreover, R appears to be running serially: if you start up R, get its PID (process ID) from ‘top’ or ‘ps aux | grep R’, and then execute ‘cat /proc/PID/status | grep Threads’, you can see there is only one thread available.

[msl33@cabbage ~]$ cat /proc/13605/status | grep Threads
Threads:	1

(where 13605 was the process ID of my R process). That is using the default R on the network. One could appeal to the network administrator to change things, but they probably won’t, because a parallel BLAS implementation may cause problems for other users who require a serial BLAS, such as those who use the multicore environment to run inherently parallel algorithms (parallel tempering, for instance) rather than using idle cores to speed up the linear algebra. There are also some known conflicts with the multicore package in R. There is, however, nothing stopping you from compiling your own custom R build in your home directory and simply pointing your executable path at it. In addition, you then have the power and freedom to customize R to your needs: at the moment I have some very large matrices which would benefit from a threaded BLAS, but at some point I may want to revert to a tuned serial BLAS such as ATLAS for certain parallel algorithms.

Firstly, go ahead and create a directory in which to keep all your custom software.

[msl33@cabbage ~]$ pwd
/home/grad/msl33
[msl33@cabbage ~]$ mkdir software


Download OpenBLAS

Make a directory “openblas” in the “software” directory.

[msl33@cabbage ~]$ cd software/
[msl33@cabbage software]$ mkdir openblas

Next, grab the tarball from the OpenBLAS homepage. Change directory into where you downloaded the tarball and extract the files from it.

[msl33@cabbage ~]$ cd Downloads/
[msl33@cabbage Downloads]$ tar -xvf xianyi-OpenBLAS-v0.2.9-0-gf773f49.tar.gz 

While this is running, fill a kettle with some water and turn it on, this stage is very important.

Change directory into where you extracted the files and verify that NO_AFFINITY=1 is uncommented in Makefile.rule. If so, proceed and run make.

[msl33@cabbage ~/Downloads]$ cd xianyi-OpenBLAS-347dded/
[msl33@cabbage xianyi-OpenBLAS-347dded]$ cat Makefile.rule | grep NO_AFFINITY
NO_AFFINITY = 1
[msl33@cabbage xianyi-OpenBLAS-347dded]$ make

Now is a good time to “make” some tea with the water prepared earlier. When done successfully one will see
[Screenshot: OpenBLAS build confirmation]
Now, as instructed by the build output, install to the “software” directory made earlier.

[msl33@cabbage xianyi-OpenBLAS-347dded]$ make PREFIX=/home/grad/msl33/software/openblas install
...
Install OK!

In openblas/lib there will be a file “libopenblas.so”, which is needed later. That’s it for OpenBLAS; next we will build R.


Download R

Let’s create an R directory in “software”. Go to the R homepage, then Download, choose a mirror and grab the tarball of the latest version. Download it to your “software” directory and extract it as before with “tar -xvf R-3.1.1.tar.gz”. Once extracted, remove the tarball and change directory into R-3.1.1. Before running the configure script, one might consider some customizations in the name of efficiency. For example, one might raise the optimization level from 2 to 3 and add march or mtune by editing “config.site”, changing “## CFLAGS=” on line 53 to “CFLAGS='-O3 -march=native'”, and making similar changes for FFLAGS and CXXFLAGS. The R Installation and Administration documentation notes that these can produce worthwhile speedups, but warns that the build will be less reliable, with segfaults and numerical errors creeping in. It is safest to leave things as they are, but I’ll take the risk. Now, if you are not using a computer on the Duke Statistical Science network, run the configure script; otherwise, see the additional instructions below before running configure.

[msl33@cabbage R-3.1.1]$ ./configure --prefix=/home/grad/msl33/software/R --enable-R-shlib --enable-BLAS-shlib --enable-memory-profiling --with-tcltk=no


BEGIN ADDITIONAL INSTRUCTIONS FOR DUKE STATISTICAL SCIENCE STUDENTS

[On the DSS computers some further instructions are required to locate headers and libraries. The first time I tried to make on my office computer, I encountered an error: “jni.h” could not be found. The error was resolved by locating the file and then exporting the environment variable JAVA_HOME.

[msl33@cabbage software]$ locate jni.h
/usr/lib/jvm/java-1.7.0-sun-1.7.0.11/include/jni.h
[msl33@cabbage software]$ export JAVA_HOME=/usr/lib/jvm/java-1.7.0-sun-1.7.0.11/

In addition, when running the configure script the readline headers/libs could not be found. We’ll just borrow them from some other software. Add “-I/opt/EPD_Free/include -L/opt/EPD_Free/lib” to CFLAGS, FFLAGS and CXXFLAGS, in addition to any other flags you have set. Also make a lib directory and copy the libraries in.

[msl33@cabbage R-3.1.1]$ mkdir lib
[msl33@cabbage R-3.1.1]$ cp /opt/EPD_Free/lib/libreadline.* lib/
[msl33@cabbage R-3.1.1]$ cp /opt/EPD_Free/lib/libncurses* lib/

Now run the configure line above.]

END ADDITIONAL INSTRUCTIONS FOR DUKE STATISTICAL SCIENCE STUDENTS

Once configure has completed, you’ll see a summary like the one below.
[Screenshot: R configure summary]
Now issue the command “make”; it will take some time. Once make has finished, you can execute “make install” to populate the software/R directory created earlier, but you don’t need to. Change directory to lib, make a backup of libRblas.so and create a symbolic link to the OpenBLAS library built earlier.

[msl33@cabbage ~]$ cd software/R-3.1.1/lib
[msl33@cabbage lib]$ pwd
/home/grad/msl33/software/R-3.1.1/lib
[msl33@cabbage lib]$ mv libRblas.so libRblas.so.keep
[msl33@cabbage lib]$ ln -s /home/grad/msl33/software/openblas/lib/libopenblas.so libRblas.so

That was the last step.


Setup Validation

The R executable in the bin directory should now use OpenBLAS. Note that this is the R executable you now need to run in order to use the custom-built R with OpenBLAS; just typing R in a terminal will load the old /usr/lib64… version, which we students didn’t have permission to alter. You can, however, create an alias in your .bashrc file by inserting the line 'alias R="/home/grad/msl33/software/R-3.1.1/bin/R"'. Now when you type R in a terminal it will load the new R and not the old one. One can check that the R executable links against the correct shared BLAS library with the “ldd” command.

[msl33@cabbage bin]$ pwd
/home/grad/msl33/software/R-3.1.1/bin
[msl33@cabbage bin]$ ./R CMD ldd exec/./R | grep blas
	libRblas.so => /home/grad/msl33/software/R-3.1.1/lib/libRblas.so (0x00007f62e3fb7000)
[msl33@cabbage bin]$ ls -lt ../lib | grep openblas
lrwxrwxrwx  1 msl33 grad      53 Jul 16 15:35 libRblas.so -> /home/grad/msl33/software/openblas/lib/libopenblas.so

In addition, execute “./R” from the “bin” directory (or just R if you set up the alias) and grab the process id.

[msl33@cabbage bin]$ ps aux | grep R | grep software | awk '{print $2}'
2412
[msl33@cabbage bin]$ cat /proc/`ps aux | grep R | grep software | awk '{print $2}'`/status | grep Threads
Threads:	8
[msl33@cabbage bin]$ 

Evidently the R session now has 8 threads available. Finally, let’s perform an eigen-decomposition and watch the CPU usage using top. You’ll see it light up all of your cores.
[Screenshot: top showing all cores in use during the eigen-decomposition]


Benchmark

Using this benchmark, the reference BLAS took 32.1 seconds whilst OpenBLAS took 7.1 seconds.

Parallel Tempering Algorithm with OpenMP / C++

1.1 Parallel Tempering Theory
1.2 Physics Origins
2.1 Intra-Thread Metropolis Move
2.2 Inter-Thread Parallel Tempering
2.3 OpenMP Parallelization
3. Full Code
4. Simulation Study
5. On the Future use of Parallel Tempering with OpenMP

Parallel tempering is one of my favourite sampling algorithms for improving MCMC mixing times. It seems to be used almost exclusively on distributed memory architectures with MPI and remains unexploited on shared memory architectures such as our office computers, which have up to eight cores. I’ve written parallel tempering algorithms in MPI and Rmpi but never in OpenMP, and it turns out that the latter has substantial advantages. I guess when people think of parallel tempering they think of processors communicating with each other via MPI and swapping parameters directly. If you are on a shared memory device, however, you can have processor A simply write to a shared array and have processor B read from it, which saves a lot of aggravating fiddling around with message numbers, blocking/non-blocking calls, deadlocks and so on. Moreover, with OpenMP you can spawn more threads than you have processors, which translates to more parallel MCMC chains in the present context, whereas this becomes troublesome with MPI due to the danger of deadlocks. OpenMP is also much easier to use than MPI: with one line you can fork a serial thread into a desired and hardware-independent number of parallel threads. The code is developed in the sections below.


Parallel Tempering Theory

Each thread simulates an MCMC trajectory from the posterior raised to a fractional power, B. When B=1, the MCMC draws are from the posterior from which we wish to sample. When B=0, the MCMC trajectory is just a random walk. To see this, consider the acceptance probability of the Metropolis move: the density evaluated at the proposed parameters over the density evaluated at the current parameters, all raised to the power of zero, is unity whatever the densities are, so the moves always get accepted. Similarly, if B is close to zero, the acceptance probability is near unity and the distribution from which this MCMC is sampling is quite uniform over the parameter space, so the trajectory explores a relatively larger part of it. As B is increased toward one, the features of the distribution from which we wish to sample become more prominent. Going in the other direction, from B=1 to 0, one commonly says that the posterior is “melted down” and spreads out its mass. The terminology has remained from its origins in statistical physics, where one would simulate particles at a hotter temperature so that they would jostle around more and escape wells in the potential energy. The key to parallel tempering is to use the more diffuse, hotter or melted-down MCMC chains as proposal distributions for the actual cold distribution we wish to sample from. One proceeds by performing a Metropolis-Hastings move, because the proposal distributions are not symmetric. For illustration, thread j uses the hotter thread j+1 as its partner and proposal distribution. Let theta_{j+1}, the current position of thread j+1, be the proposed new position for thread j.
\alpha=\min\left(1,\;\frac{p_{j}(\theta_{j+1})}{p_{j}(\theta_{j})}\,\frac{p_{j+1}(\theta_{j})}{p_{j+1}(\theta_{j+1})}\right)
The second fraction is the Hastings addition to the Metropolis algorithm and is required to satisfy detailed balance for an unsymmetrical proposal distribution. Now realise that
p_{j}=\pi(\theta|Y)^{B_{j}}, \qquad p_{j+1}=\pi(\theta|Y)^{B_{j+1}}
i.e. they are the same distribution raised to different fractional powers. Working now on the log scale, it can be shown that
\log\left(\frac{p_{j}(\theta_{j+1})}{p_{j}(\theta_{j})}\,\frac{p_{j+1}(\theta_{j})}{p_{j+1}(\theta_{j+1})}\right) = (B_{j}-B_{j+1})\left(\log\pi(\theta_{j+1}|Y) - \log\pi(\theta_{j}|Y)\right)
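To see this explicitly, substitute the tempered densities into the ratio and collect the exponents:
\log\frac{\pi(\theta_{j+1}|Y)^{B_{j}}\,\pi(\theta_{j}|Y)^{B_{j+1}}}{\pi(\theta_{j}|Y)^{B_{j}}\,\pi(\theta_{j+1}|Y)^{B_{j+1}}} = B_{j}\left(\log\pi(\theta_{j+1}|Y)-\log\pi(\theta_{j}|Y)\right) - B_{j+1}\left(\log\pi(\theta_{j+1}|Y)-\log\pi(\theta_{j}|Y)\right)
and factoring out the common term gives the stated result.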

Physics Origins

It is at this point that, in order to make things correspond to the earlier physics literature, one sometimes defines the “energy” as
E_{j}=-\log\pi(\theta_{j}|Y).
so that the acceptance probability becomes
\alpha=\min\left(1,\;e^{-(B_{j}-B_{j+1})(E_{j+1}-E_{j})}\right).
It’s not necessary to define this energy; it only establishes an equivalence between statistics and physics. In physics, particles get stuck in the local minima of the energy landscape; in statistics, the MCMC gets stuck in the local peaks of the posterior. The reason for this is that in a canonical ensemble lower energy states are more probable (recall that nature tries to minimize the potential energy, and that force is the negative gradient of the potential energy), so regions of the parameter space with low potential energy, physically, correspond to regions of high probability density, statistically. To be more precise, a result from statistical physics is that the distribution of energy in a canonical ensemble is exponential with scale parameter kT, where k is Boltzmann’s constant and T is temperature. An exponential distribution with this scale parameter is called the Boltzmann distribution by physicists. As the temperature increases, higher energy states become more probable and the particle jumps out of the minima more. If you are a statistician you don’t need to worry about this, but the notation sometimes crops up in the literature. It is also the same acceptance probability as in physics when sampling energies from a Boltzmann distribution. I have decided not to adopt the physics notation for this post.


Intra-Thread Metropolis Move

Each thread, within itself, performs a standard vanilla Metropolis move:

//Propose Candidate Position//
			t1new=t1[rank*nmc+i-1] + normal(stream[rank]);
			t2new=t2[rank*nmc+i-1] + normal(stream[rank]);

			//Calculate log-Density at Newly-Proposed and Current Position//
			lpost_new[rank]=lLikelihood(t1new,t2new) + lprior(t1new,t2new);
			lpost[rank]=lLikelihood(t1[rank*nmc+i-1],t2[rank*nmc+i-1]) + lprior(t1[rank*nmc+i-1],t2[rank*nmc+i-1]);

			//Melt Density and Calculate log-Acceptance Probability//
			lalpha=B[rank]*(lpost_new[rank]-lpost[rank]);

			//Perform Metropolis Accept-Reject Step//
			if( log(u(stream[rank])) < lalpha ){
				//Accept
				//Proposed as Current Position
				t1[rank*nmc+i]=t1new;
				t2[rank*nmc+i]=t2new;
			}else{
				//Do not Accept
				//Propagate Current Position
				t1[rank*nmc+i]=t1[rank*nmc+i-1];
				t2[rank*nmc+i]=t2[rank*nmc+i-1];
			}

A few comments about the variables. “nmc” is the number of MCMC draws I wish to generate. I have two parameters, which I have denoted t1 and t2 because t is closest to theta. Each thread stores its nmc draws of t1 and t2 in a contiguous array in memory of length nmc times the number of threads. “rank” identifies the thread, and “lpost” and “B” are arrays of length equal to the number of threads, storing respectively the log posterior density at each chain’s current position and the fractional melting power. All of these variables are defined at the top of the code.
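As a minimal sketch of that storage scheme (the names mirror the excerpt above; the sizes are illustrative and this is not lifted from the full program):

#include <vector>

int main(){
	const int size = 8;       // number of threads / parallel chains
	const int nmc  = 10000;   // MCMC draws per chain

	// One long array per parameter: chain `rank` owns the contiguous
	// slice of indices [rank*nmc, (rank+1)*nmc).
	std::vector<double> t1(size*nmc), t2(size*nmc);

	// Per-thread quantities, indexed by rank.
	std::vector<double> B(size), lpost(size), lpost_new(size);

	// Draw i of chain `rank` then lives at t1[rank*nmc + i].
	return 0;
}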


Inter-Thread Metropolis-Hastings Move


				if(u(stream[rank]) < 0.5){
					rank_partner=rank+1;
					if(rank_partner < size){
						//Inter-Thread Metropolis-Hastings Part
						lalpha = (B[rank]-B[rank_partner])*(lpost[rank_partner]-lpost[rank]);
						if(log(u(stream[rank])) < lalpha){
							//accept swap
							swap(t1[rank*nmc+i],t1[rank_partner*nmc+i]);
							swap(t2[rank*nmc+i],t2[rank_partner*nmc+i]);
						}

					}
				}

The only additional thing to add is that each chain attempts a swap with its neighbour at each iteration with probability 1/2. There is nothing special about 1/2; you could choose what you like, but there are pros and cons. How this is made parallel in OpenMP is shown below.


OpenMP Parallelization

The OpenMP parallel implementation of the above algorithm is very simple!

#pragma omp parallel private(i,t1new,t2new,rank,lalpha,rank_partner) shared(B, lpost, lpost_new,t1,t2,swapt1,swapt2)
	{
		//Identify Each Thread
		rank=omp_get_thread_num();

		for (i = 1; i < nmc; ++i)
		{

                 //***Intra-Thread Metropolis Part***//
	
#pragma omp barrier      //Synchronise Threads
#pragma omp critical     //Execute Critical Code Block One Thread at a Time.
			{

                 //***Inter-Thread Parallel Tempering Part***//

			}
#pragma omp barrier   //Synchronise Threads
		}
	}

The first parallel pragma simply forks the master thread into a number of threads, whereby each thread executes the following code block independently, i.e. a number of independent parallel MCMC chains. Specifying variables as private means that each thread gets a copy of that variable in its own separate location in memory. Shared is the opposite, although I think variables are shared by default. The barrier pragma means that each thread halts until all threads have reached this point. The critical pragma means the following code block is executed by only one thread at a time. This prevents thread j swapping with thread j+1 whilst thread j+1 is attempting a swap with thread j+2; otherwise nasty things such as race conditions can occur. The last barrier pragma waits for all threads to reach the end, and then the next iteration of the for loop proceeds.


Full code

The full code can be found at the link below. It depends on OpenMP and the TRNG library in order to generate multiple independent streams of random numbers. It takes the number of MCMC draws as a command-line argument.

[michael@michael tempering]$ wget http://www.lindonslog.com/example_code/tempering.cpp
[michael@michael tempering]$ g++ tempering.cpp -fopenmp -o tempering  -ltrng4 -lm
[michael@michael tempering]$ ./tempering 10000
Thread 0 has fractional power 1
Thread 1 has fractional power 0.469117
Thread 2 has fractional power 0.220071
Thread 3 has fractional power 0.103239
Thread 4 has fractional power 0.0484313
Thread 5 has fractional power 0.0227199
Thread 6 has fractional power 0.0106583
Thread 7 has fractional power 0.005
[michael@michael tempering]$


Simulation Study

I chose the likelihood to be a mixture of 5 sharply peaked normal distributions, one at each corner of a (roughly) unit square plus one at the origin, each with variance 0.001. The prior was a normal of variance 1000 centred at the origin. The parallel tempering algorithm was run with 8 threads. The posterior draws and mixing results are below:

[Figure: posterior draws from the parallel tempering algorithm]

[Figure: mixing of the parallel tempering algorithm]


On the Future use of Parallel Tempering with OpenMP

I hope the code exemplifies how easy it is to run parallel MCMC chains with OpenMP. I would argue that the Metropolis moves are the hardest part: if you can write them for a single serial chain, then it is only a few extra steps to run parallel chains and implement the parallel tempering algorithm. My laptop has four cores and my office computer has eight. Given that shared memory devices have an ever-increasing number of cores, it seems to me that parallel tempering is becoming an ever more valuable algorithm for improving the mixing times of MCMC runs. After all, had I not used the extra 3 cores on my laptop, they would have remained idle. If you have extra cores, why not use them! Moreover, with OpenMP you can spawn as many parallel MCMC chains as you desire, avoiding the pitfalls of MPI.

Earl, D. J. & Deem, M. W. (2005). Parallel tempering: theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 7(23), 3910.

OpenMP Tutorial – firstprivate and lastprivate

Here I will consider firstprivate and lastprivate. Recall from one of the earlier entries about private variables that when a variable is declared as private, each thread gets a unique memory location in which to store values for that variable while inside the parallel region. When the parallel region ends, that memory is freed and these copies no longer exist. Consider the following bit of code as an example:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>


int main(void){
int i;
int x;
x=44;
#pragma omp parallel for private(x) 
for(i=0;i<=10;i++){
x=i;
printf("Thread number: %d     x: %d\n",omp_get_thread_num(),x);
}
printf("x is %d\n", x);


}

Yields…

Thread number: 0     x: 0
Thread number: 0     x: 1
Thread number: 0     x: 2
Thread number: 3     x: 9
Thread number: 3     x: 10
Thread number: 2     x: 6
Thread number: 2     x: 7
Thread number: 2     x: 8
Thread number: 1     x: 3
Thread number: 1     x: 4
Thread number: 1     x: 5
x is 44

You’ll notice that x is exactly the value it was before the parallel region.

Suppose we wanted to keep the last value of x after the parallel region. This can be achieved with lastprivate. Replace private(x) with lastprivate(x) and this is the result:

Thread number: 3     x: 9
Thread number: 3     x: 10
Thread number: 1     x: 3
Thread number: 1     x: 4
Thread number: 1     x: 5
Thread number: 0     x: 0
Thread number: 0     x: 1
Thread number: 0     x: 2
Thread number: 2     x: 6
Thread number: 2     x: 7
Thread number: 2     x: 8
x is 10

Notice that it is 10 and not 8. That is to say, it is the value from the last iteration which is kept, not the last operation to execute. Now what if we replace lastprivate(x) with firstprivate(x)? What do you think it will do? This:

Thread number: 3     x: 9
Thread number: 3     x: 10
Thread number: 1     x: 3
Thread number: 1     x: 4
Thread number: 1     x: 5
Thread number: 0     x: 0
Thread number: 0     x: 1
Thread number: 0     x: 2
Thread number: 2     x: 6
Thread number: 2     x: 7
Thread number: 2     x: 8
x is 44

If you were like me, you were expecting to get the value 0, i.e. the value of x on the first iteration. No:

firstprivate specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value it had before the parallel construct.

That is, every thread gets its own instance of x and that instance equals 44.
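For concreteness, here is the full example again with only the clause changed; everything else is identical to the code above:

#include <stdio.h>
#include <omp.h>

int main(void){
int i;
int x;
x=44;
#pragma omp parallel for firstprivate(x)
for(i=0;i<=10;i++){
/* each thread's private copy of x starts out initialized to 44 */
x=i;
printf("Thread number: %d     x: %d\n",omp_get_thread_num(),x);
}
printf("x is %d\n", x);
}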

OpenMP Tutorial – Critical, Atomic and Reduction

Atomic and Critical

  • critical: the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions.
  • atomic: the memory update (write, or read-modify-write) in the next instruction will be performed atomically. It does not make the entire statement atomic; only the memory update is atomic. A compiler might use special hardware instructions for better performance than when using critical.

Consider this code which numerically approximates pi:


#include <stdio.h>
#include <omp.h>

int main(void){
double pi,x;
int i,N;
pi=0.0;
N=1000;
#pragma omp parallel for
for(i=0;i<N;i++){
x=(double)i/N;
pi+=4/(1+x*x);
}
pi=pi/N;
printf("Pi is %f\n",pi);
}

We compile this with gcc main.c -o test, omitting the -fopenmp option; this means that the #pragma omp parallel for will be interpreted as a comment, i.e. ignored. Run it and the serial result is a good approximation of pi.

Now compile with the -fopenmp option and run again: oh dear, the answer is wrong. Let's examine what went wrong. By default, and as we have not specified it as private, the variable x is shared. This means all threads have the same memory address for the variable x. Therefore, thread i will compute some value of x and store it at memory address &x, and thread j will then compute its own value of x and store it at &x BEFORE thread i has used its value to make its contribution to pi. The threads are all overwriting each other's values of x because they all share the same memory address for x. Our first correction is that x must be made private:
#pragma omp parallel for private(x)

Secondly, we have a “Race Condition” for pi. Let me illustrate this with a simple example. Here is what would ideally happen:

 

  • Thread 1 reads the current value of pi : 0
  • Thread 1 increments the value of pi : 1
  • Thread 1 stores the new value of pi: 1
  • Thread 2 reads the current value of pi: 1
  • Thread 2 increments the value of pi: 2
  • Thread 2 stores the value of pi: 2

What is actually happening is more like this:

  • Thread 1 reads the current value of pi: 0
  • Thread 2 reads the current value of pi: 0
  • Thread 1 increments pi: 1
  • Thread 2 increments pi: 1
  • Thread 1 stores its value of pi: 1
  • Thread 2 stores its value of pi: 1

The way to correct this is to tell the code to execute the read/write of pi one thread at a time. This can be achieved with critical or atomic: add “#pragma omp atomic” just before the line where pi gets updated and you’ll see that it works.
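Putting the two fixes together, the corrected code would look something like this (a sketch of the changes just described):

#include <stdio.h>
#include <omp.h>

int main(void){
double pi,x;
int i,N;
pi=0.0;
N=1000;
#pragma omp parallel for private(x)
for(i=0;i<N;i++){
x=(double)i/N;
#pragma omp atomic
pi+=4/(1+x*x);
}
pi=pi/N;
printf("Pi is %f\n",pi);
}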

This scenario, where you are updating some value inside a parallel loop, crops up time and time again, so in the end it was given its own clause. All of the above can be achieved simply by making pi a reduction variable.

Reduction

To make pi a reduction variable the code is changed as follows:

 

#include <stdio.h>
#include <omp.h>

int main(void){
double pi,x;
int i,N;
pi=0.0;
N=1000;
#pragma omp parallel for private(x) reduction(+:pi)
for(i=0;i<N;i++){
x=(double)i/N;
pi+=4/(1+x*x);
}
pi=pi/N;
printf("Pi is %f\n",pi);
}

This is simply a quick and neat way of achieving everything we did above.

OpenMP Parallel For

The parallel directive #pragma omp parallel makes the code parallel, that is, it forks the master thread into a number of parallel threads, but it doesn’t actually share out the work.
What we are really after is the parallel for directive, which we call a work-sharing construct. Consider

#include <iostream>
#include <omp.h>

using namespace std;
int main(void)
{
	int i;
	int foo;
#pragma omp parallel for
  for(i=1;i<10;i++){
#pragma omp critical
{
foo=omp_get_thread_num();
	  cout << "Loop number: "<< i << " " << "Thread number: " << foo << endl;
}
}
}

The for directive applies to the for loop immediately following it. Notice how we don’t have to outline a parallel region with curly braces {} after this directive, in contrast to before. This program yields:

[michael@michael lindonslog]$ ./openmp 
Loop number: 1 Thread number: 0
Loop number: 8 Thread number: 3
Loop number: 2 Thread number: 0
Loop number: 3 Thread number: 0
Loop number: 9 Thread number: 3
Loop number: 6 Thread number: 2
Loop number: 4 Thread number: 1
Loop number: 7 Thread number: 2
Loop number: 5 Thread number: 1
[michael@michael lindonslog]$ 

Notice that the iterations are not executed in order. By default, the loop index (“i” in this context) is made private by the for directive.

At the end of the parallel for loop there is an implicit barrier, where all threads wait until they have all finished. There are, however, some rules for the parallel for directive; a minimal conforming loop is sketched after this list.

  1. The loop index, i, is incremented by a fixed amount each iteration, e.g. i++ or i+=step.
  2. The start and end values must not change during the loop.
  3. There must be no “breaks” in the loop where the code steps out of that code block. Function calls are, however, permitted and run as you would expect.
  4. The comparison operators may be < <= >= >
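For instance, a loop of the following shape satisfies all four rules (a minimal sketch, not taken from any of the programs above):

#include <stdio.h>
#include <omp.h>

int main(void){
int i,a[100];
#pragma omp parallel for
for(i=0;i<100;i+=2){   /* fixed increment, fixed bounds, no break */
a[i]=i*i;              /* each iteration writes a distinct element */
}
printf("a[98] is %d\n",a[98]);
}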

There may be times when you want to perform some operation in the order of the iterations. This can be achieved with the ordered directive and an ordered clause. Each thread will wait until the previous iteration has finished its ordered section before proceeding with its own.

#include <stdio.h>
#include <omp.h>

/* stand-in for some genuinely costly computation */
int expensive_function(int i){
return i*i;
}

int main(void){
int i,a[10];
#pragma omp parallel for ordered
for(i=0;i<10;i++){
a[i]=expensive_function(i);
#pragma omp ordered
printf("Thread ID: %d    Hello World %d\n",omp_get_thread_num(),i);
}
}

This will now print out the Hello Worlds in order. N.B. there is a penalty for this: the threads have to wait until the preceding iteration has finished its ordered section of code. Only if expensive_function() were genuinely expensive would this be worthwhile.

Parallel Programming in C with OpenMP

OpenMP – Open Specifications for Multi Processing

The central theme of parallel code is that of threads. A serial code starts off as one thread. As soon as the first parallel directive (Fortran) / pragma (C) is encountered, the master thread forks into a number of threads. The code that follows is then executed in parallel, in a manner which can be adjusted using certain options.

To get started with OpenMP, you will need to include the omp header file with
#include <omp.h>
and you will need to add the -fopenmp option when compiling.

To fork the master thread into a number of parallel threads, one writes the following line of code:
#pragma omp parallel
This directive applies only to the following block of code, {…}, and must be structured as such. By default, all previously declared variables are shared, i.e. all threads have the same memory address for a shared variable. This can, however, be declared explicitly by adding shared(var_name). Conversely, you may want to make variables private, that is, each thread gets allocated a unique location in memory in which to store that variable. Private variables are only accessible within the thread they belong to, and all the additional copies of the variable created for parallelization are destroyed when the threads merge. There are also reduction variables. More on that later…
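As a small sketch of those clauses (the variable names here are purely illustrative):

#include <stdio.h>
#include <omp.h>

int main(void){
int counter=0;   /* shared: one copy visible to every thread */
int scratch=0;   /* private: each thread gets its own copy inside the region */
#pragma omp parallel shared(counter) private(scratch)
{
scratch=omp_get_thread_num();   /* thread-local value, lost when the region ends */
#pragma omp atomic
counter+=1;                     /* every thread updates the same shared variable */
}
printf("Number of threads that incremented the counter: %d\n",counter);
}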

Let’s try an example. When you execute your code, it will inherit the OMP_NUM_THREADS environment variable of your terminal. Suppose we want to set the number of threads to 4. We write

prog@michael-laptop:~$ export OMP_NUM_THREADS=4
prog@michael-laptop:~$ echo $OMP_NUM_THREADS
4
prog@michael-laptop:~$ 

You can also specify the number of threads at run time with the omp_set_num_threads() function defined in omp.h.
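A minimal sketch of the run-time route (the call overrides OMP_NUM_THREADS for the parallel regions that follow):

#include <stdio.h>
#include <omp.h>

int main(void){
omp_set_num_threads(4);   /* request 4 threads for subsequent parallel regions */
#pragma omp parallel
{
#pragma omp critical
printf("Hello from thread %d of %d\n",omp_get_thread_num(),omp_get_num_threads());
}
}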

Good. Now here’s our sample code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void){
printf("Number of threads before parallisation: %d\n",omp_get_num_threads());
#pragma omp parallel 
{
printf("Current thread number: %d\n",omp_get_thread_num());
if(omp_get_thread_num()==0)
{
printf("Number of threads after parallisation: %d\n",omp_get_num_threads());
}
}
printf("Number of threads after merging threads %d\n",omp_get_num_threads());
}

compile and run:

prog@michael-laptop:~$ g++ openmp.cpp -o test -fopenmp
prog@michael-laptop:~$ ./test
Number of threads before parallisation: 1
Current thread number: 3
Current thread number: 1
Current thread number: 0
Number of threads after parallisation: 4
Current thread number: 2
Number of threads after merging threads 1

Next I’ll talk about work sharing…