## Parallel Tempering Algorithm with OpenMP / C++

Parallel tempering is one of my favourite sampling algorithms for improving MCMC mixing times. It seems to be used almost exclusively on distributed-memory architectures with MPI, and remains unexploited on shared-memory machines such as our office computers, which have up to eight cores. I’ve written parallel tempering algorithms in MPI and Rmpi but never in OpenMP, and it turns out that the latter has substantial advantages. When people think of parallel tempering, they tend to think of processors communicating with each other via MPI and swapping parameters directly. On a shared-memory device, however, processor A can simply write to a shared array and processor B can read from it, which saves a lot of aggro fiddling around with message tags, blocking/non-blocking calls, deadlocks and so on. Moreover, with OpenMP you can spawn more threads than you have processors, which in the present context translates to more parallel MCMC chains, whereas with MPI this becomes troublesome owing to the danger of deadlocks. OpenMP is also much easier to use than MPI: with one line you can fork a serial thread into a desired, hardware-independent number of parallel threads. Before showing the code, a quick review of the theory.

## Parallel Tempering Theory

Each thread simulates an MCMC trajectory from the posterior raised to a fractional power, $B$. When $B=1$, the MCMC draws come from the posterior from which we wish to sample. When $B=0$, the MCMC trajectory is just a realization of a Gaussian random walk. To see this, consider the acceptance probability of the Metropolis move: the density evaluated at the proposed parameters over the density evaluated at the current parameters, all raised to the power of zero, is unity whatever the densities are, so every move is accepted. Similarly, if $B$ is close to zero, the acceptance probability is near unity and the distribution being sampled is quite uniform over the parameter space, so the trajectory explores a relatively large part of it. As $B$ is increased toward one, the features of the distribution from which we wish to sample become more prominent. Going in the other direction, from $B=1$ to $0$, one commonly says that the posterior is “melted down”, spreading out its mass. The terminology has remained from its origins in statistical physics, where one would simulate particles at a hotter temperature so that they jostle around more and escape wells in the potential energy.

The key to parallel tempering is to use the more diffuse, hotter or melted-down MCMC chains as proposal distributions for the cold distribution we actually wish to sample from. One proceeds with a Metropolis-Hastings move because the proposal distributions are not symmetric. For illustration, thread $j$ uses the hotter thread $j+1$ as its partner and proposal distribution: let $\theta_{j+1}$, the current position of thread $j+1$, be the proposed new position for thread $j$.
$\alpha=\min\left(1,\ \frac{p_{j}(\theta_{j+1})}{p_{j}(\theta_{j})}\,\frac{p_{j+1}(\theta_{j})}{p_{j+1}(\theta_{j+1})}\right)$
The second fraction is the Hastings correction to the Metropolis algorithm and is required to satisfy detailed balance with an asymmetric proposal distribution. Now realise that
$p_{j}(\theta)=\pi(\theta|Y)^{B_{j}}, \qquad p_{j+1}(\theta)=\pi(\theta|Y)^{B_{j+1}},$
i.e. they are the same distribution raised to different fractional powers. Working now on the log scale, it can be shown that
$\log\left(\frac{p_{j}(\theta_{j+1})}{p_{j}(\theta_{j})}\,\frac{p_{j+1}(\theta_{j})}{p_{j+1}(\theta_{j+1})}\right)=(B_{j}-B_{j+1})\left(\log\pi(\theta_{j+1}|Y)-\log\pi(\theta_{j}|Y)\right)$
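The missing step is only a line of algebra. Substituting the tempered densities into the two fractions gives

$\log\frac{p_{j}(\theta_{j+1})}{p_{j}(\theta_{j})}=B_{j}\left(\log\pi(\theta_{j+1}|Y)-\log\pi(\theta_{j}|Y)\right), \qquad \log\frac{p_{j+1}(\theta_{j})}{p_{j+1}(\theta_{j+1})}=-B_{j+1}\left(\log\pi(\theta_{j+1}|Y)-\log\pi(\theta_{j}|Y)\right),$

and adding the two terms yields the factor $(B_{j}-B_{j+1})$.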

### Physics Origins

It is at this point that, in order to make things correspond to the earlier physics literature, one sometimes defines the “energy” as
$E_{j}=-\log\pi(\theta_{j}|Y).$
So that the acceptance probability becomes
$\alpha=\min\left(1,\ e^{-(B_{j}-B_{j+1})(E_{j+1}-E_{j})}\right).$
It’s not necessary to define this energy; it merely sets up an equivalence mapping between statistics and physics. In physics, particles get stuck in local minima of the energy landscape; in statistics, the MCMC gets stuck in local peaks of the posterior. The reason is that in a canonical ensemble lower-energy states are more probable (recall that nature tries to minimize the potential energy, and that force is the negative gradient of the potential energy), so regions of the parameter space with low potential energy, physically, correspond to regions of high probability density, statistically. To be more precise, a result from statistical physics is that, for a canonical ensemble, the probability of a state decays exponentially in its energy with scale parameter kT, where k is Boltzmann’s constant and T is temperature; physicists call this the Boltzmann distribution. As the temperature increases, higher-energy states become more probable and the particle jumps out of the minima more often. If you are a statistician you don’t need to worry about this, but the notation crops up in the literature, and the acceptance probability above is exactly the one used in physics when sampling energies from a Boltzmann distribution. I have decided not to adopt the physics notation for this post.
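In symbols, the Boltzmann distribution assigns a state of energy $E$ the density

$p(E)\propto e^{-E/kT},$

and substituting $E=-\log\pi(\theta|Y)$ with $B=1/kT$ recovers exactly the melted density $\pi(\theta|Y)^{B}$ used above, so hot chains (large $T$) correspond to small fractional powers $B$.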

Each thread, within itself, performs a plain vanilla Metropolis move:

// Propose candidate position
t1new = t1[rank*nmc + i-1] + normal(stream[rank]);
t2new = t2[rank*nmc + i-1] + normal(stream[rank]);

// Calculate log-density at the newly-proposed and current positions
lpost_new[rank] = lLikelihood(t1new, t2new) + lprior(t1new, t2new);
lpost[rank]     = lLikelihood(t1[rank*nmc + i-1], t2[rank*nmc + i-1]) + lprior(t1[rank*nmc + i-1], t2[rank*nmc + i-1]);

// Melt the density and calculate the log-acceptance probability
lalpha = B[rank]*(lpost_new[rank] - lpost[rank]);

// Perform Metropolis accept-reject step
if( log(u(stream[rank])) < lalpha ){
    // Accept: proposed becomes current position
    t1[rank*nmc + i] = t1new;
    t2[rank*nmc + i] = t2new;
}else{
    // Reject: propagate current position
    t1[rank*nmc + i] = t1[rank*nmc + i-1];
    t2[rank*nmc + i] = t2[rank*nmc + i-1];
}


A few comments about the variables. nmc is the number of MCMC draws I wish to generate. There are two parameters, which I have denoted t1 and t2 because t is closest to theta. Each thread stores its nmc draws of t1 and t2 in a contiguous array in memory of length nmc times the number of threads. rank identifies the thread, and lpost and B are arrays of length equal to the number of threads, storing, respectively, the log posterior density at each thread's current position and each thread's fractional melting power. All of these variables are defined at the top of the code; a sketch of plausible declarations follows.
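For concreteness, here is a minimal sketch of how these variables might be declared (my own reconstruction; the full code linked below may differ). The geometric temperature ladder is an assumption on my part, chosen because it reproduces the fractional powers printed in the run shown later.

// Plausible declarations for the fragments above (a sketch, not the full code).
// Assumes: #include <omp.h>, <cmath>, <cstdlib>; placed inside main(int argc, char *argv[]).
int nmc  = atoi(argv[1]);              // number of MCMC draws per chain
int size = omp_get_max_threads();      // number of chains = number of threads

double *t1        = new double[size*nmc];  // theta_1 draws, one contiguous block per thread
double *t2        = new double[size*nmc];  // theta_2 draws
double *lpost     = new double[size];      // log posterior at each chain's current position
double *lpost_new = new double[size];      // log posterior at each chain's proposed position
double *B         = new double[size];      // fractional melting powers (temperature ladder)

// Assumed geometric ladder from B=1 (cold) down to B=0.005 (hottest);
// this reproduces the powers 1, 0.469117, 0.220071, ..., 0.005 printed below.
for(int k = 0; k < size; ++k)
    B[k] = (size > 1) ? pow(0.005, (double)k/(size - 1)) : 1.0;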


// With probability 1/2, attempt to swap positions with the hotter neighbour
if( u(stream[rank]) < 0.5 ){
    rank_partner = rank + 1;
    if( rank_partner < size ){
        // Log-acceptance probability for the swap
        lalpha = (B[rank] - B[rank_partner])*(lpost[rank_partner] - lpost[rank]);
        if( log(u(stream[rank])) < lalpha ){
            // Accept: swap the current positions of the two chains
            swap(t1[rank*nmc + i], t1[rank_partner*nmc + i]);
            swap(t2[rank*nmc + i], t2[rank_partner*nmc + i]);
        }
    }
}


The only additional thing to add is that each chain attempts a swap with its hotter neighbour at each iteration with probability 1/2. There is nothing special about 1/2; you could choose what you like, though there is a trade-off: more frequent swap attempts transfer states between chains faster but spend more time in the serialized swap step. How this is parallelized in OpenMP is shown below.

## OpenMP Parallelization

The OpenMP parallel implementation of the above algorithm is very simple!

#pragma omp parallel private(i,t1new,t2new,rank,lalpha,rank_partner) shared(B,lpost,lpost_new,t1,t2,swapt1,swapt2)
{
    rank = omp_get_thread_num();

    for (i = 1; i < nmc; ++i)
    {
        // Metropolis move: the accept-reject block shown earlier goes here,
        // executed by all threads in parallel

        #pragma omp critical     // Execute critical code block one thread at a time
        {
            // Swap step: the neighbour-swap block shown earlier goes here,
            // protected so that no two threads touch the positions simultaneously
        }
    }
}
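To see all the pieces together, here is a condensed, self-contained sketch of the whole algorithm. It is not the full code linked below: I have substituted one C++11 std::mt19937 per thread for the TRNG streams, a toy one-parameter standard-normal log-posterior for lLikelihood and lprior, and I store only each chain's current position. I have also added barriers around the swap phase so that no chain is swapped while its partner is still mid-update; treat it as a sketch of the structure rather than the original implementation.

// Minimal parallel tempering sketch (assumptions: std::mt19937 per thread instead
// of TRNG, a toy standard-normal log-posterior, only current positions stored).
#include <cmath>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>
#include <omp.h>

static double lpi(double th){ return -0.5*th*th; }  // toy log-posterior

int main(){
    const int nmc  = 10000;
    const int size = omp_get_max_threads();
    std::vector<double> t(size, 3.0), lpost(size), B(size);
    for(int k = 0; k < size; ++k)
        B[k] = (size > 1) ? pow(0.005, (double)k/(size - 1)) : 1.0;

    #pragma omp parallel
    {
        const int rank = omp_get_thread_num();
        std::mt19937 gen(1234 + rank);                      // one independent stream per thread
        std::normal_distribution<double> normal(0.0, 1.0);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        lpost[rank] = lpi(t[rank]);

        for(int i = 1; i < nmc; ++i){
            // Within-chain Metropolis move on the melted density
            double tnew = t[rank] + normal(gen);
            double lnew = lpi(tnew);
            if( std::log(u(gen)) < B[rank]*(lnew - lpost[rank]) ){
                t[rank]     = tnew;
                lpost[rank] = lnew;
            }

            #pragma omp barrier     // everyone finishes iteration i before swapping
            #pragma omp critical    // swaps execute one thread at a time
            {
                int rank_partner = rank + 1;
                if( u(gen) < 0.5 && rank_partner < size ){
                    double lalpha = (B[rank] - B[rank_partner])
                                    *(lpost[rank_partner] - lpost[rank]);
                    if( std::log(u(gen)) < lalpha ){
                        std::swap(t[rank], t[rank_partner]);
                        std::swap(lpost[rank], lpost[rank_partner]);
                    }
                }
            }
            #pragma omp barrier     // no thread starts iteration i+1 mid-swap
        }
    }
    printf("Cold chain (B=1) final position: %f\n", t[0]);
    return 0;
}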


## Full code

The full code can be found here. It depends on OpenMP and the TRNG library in order to generate multiple independent streams of random numbers. It takes the number of MCMC draws as a command-line argument.

[michael@michael tempering]$ wget http://www.lindonslog.com/example_code/tempering.cpp
[michael@michael tempering]$ g++ tempering.cpp -fopenmp -o tempering -ltrng4 -lm
[michael@michael tempering]$ ./tempering 10000
Thread 0 has fractional power 1
Thread 1 has fractional power 0.469117
Thread 2 has fractional power 0.220071
Thread 3 has fractional power 0.103239
Thread 4 has fractional power 0.0484313
Thread 5 has fractional power 0.0227199
Thread 6 has fractional power 0.0106583
Thread 7 has fractional power 0.005
[michael@michael tempering]$


## Simulation Study

I chose the likelihood to be a mixture of five sharply peaked normal densities, four located at the corners of a sort-of unit square and one at the origin, each with variance 0.001. The prior was a normal of variance 1000 centred at the origin. The parallel tempering algorithm was run with 8 threads. A hedged sketch of the target is given below; the posterior draws and mixing results follow it.
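The post does not record the exact component locations or weights, so the following log-likelihood is only a plausible reconstruction under stated assumptions: equal weights, components at (±1, ±1) and the origin. A log-sum-exp is used because the peaks are so sharp that the naive mixture density underflows away from the modes.

// Hypothetical reconstruction of the simulation-study target (exact corners
// and weights are assumptions; variances and the prior are as stated above).
#include <cmath>

double lLikelihood(double t1, double t2){
    const double mu[5][2] = {{0,0},{1,1},{1,-1},{-1,1},{-1,-1}};
    const double s2 = 0.001;                 // common component variance
    double lw[5], m = -1e300;
    for(int k = 0; k < 5; ++k){              // log of each (unnormalized) component
        double d1 = t1 - mu[k][0], d2 = t2 - mu[k][1];
        lw[k] = -0.5*(d1*d1 + d2*d2)/s2;
        if(lw[k] > m) m = lw[k];
    }
    double s = 0.0;                          // log-sum-exp for numerical stability
    for(int k = 0; k < 5; ++k) s += exp(lw[k] - m);
    return m + log(s);                       // equal weights and constants dropped
}

double lprior(double t1, double t2){         // normal prior, variance 1000, at origin
    return -0.5*(t1*t1 + t2*t2)/1000.0;
}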

(Figure: posterior draws from parallel tempering.)

(Figure: mixing of the parallel tempering algorithm.)

## On the Future use of Parallel Tempering with OpenMP

I hope the code exemplifies how easy it is to run parallel MCMC chains with OpenMP. I would argue that the Metropolis moves are the hardest part: if you can write them for a single serial chain, then it is only a few extra steps to run parallel chains and implement the parallel tempering algorithm. My laptop has four cores and my office computer has eight. Given that shared-memory devices have an ever-increasing number of cores, parallel tempering seems to be becoming an ever-more valuable algorithm for improving the mixing times of MCMC runs. After all, had I not used the extra 3 cores on my laptop, they would have remained idle. If you have extra cores, why not use them! Moreover, with OpenMP you can spawn as many parallel MCMC chains as you desire, avoiding the pitfalls of MPI.


## OpenMP Tutorial – firstprivate and lastprivate

Here I will consider firstprivate and lastprivate. Recall one of the earlier entries about private variables. When a variable is declared as private, each thread gets a unique memory address of where to store values for that variable while in the parallel region. When the parallel region ends, the memory is freed and these variables no longer exist. Consider the following bit of code as an example:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void){
    int i;
    int x;
    x = 44;
    #pragma omp parallel for private(x)
    for(i = 0; i <= 10; i++){
        x = i;
        printf("Thread number: %d     x: %d\n", omp_get_thread_num(), x);
    }
    printf("x is %d\n", x);
    return 0;
}


Yields…

Thread number: 0     x: 0
x is 44


You’ll notice that x is exactly the value it was before the parallel region.

Suppose we wanted to keep the last value of x after the parallel region. This can be achieved with lastprivate. Replace private(x) with lastprivate(x) and this is the result:

Thread number: 3     x: 9
x is 10


Notice that it is 10 and not 8: it is the value from the sequentially last iteration (i=10) that is kept, not the last value that happened to be written in time. Now what if we replace lastprivate(x) with firstprivate(x)? What do you think it will do? This:

Thread number: 3     x: 9
x is 44


If you were like me, you were expecting to get the value 0, i.e. the value of x on the first iteration. No:

firstprivate specifies that each thread should have its own instance of the variable, and that the instance should be initialized with the value the variable had before the parallel construct.

That is, every thread gets its own instance of x and that instance equals 44.
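To see both clauses in one place, here is a small example of my own (not from the original post): x enters the loop equal to 44 in every thread via firstprivate, and leaves the loop holding the value assigned in the sequentially last iteration via lastprivate.

#include <stdio.h>
#include <omp.h>

int main(void){
    int i;
    int x = 44;
    /* firstprivate: each thread's copy starts at 44;
       lastprivate: after the loop, x holds the value from iteration i=10 */
    #pragma omp parallel for firstprivate(x) lastprivate(x)
    for(i = 0; i <= 10; i++){
        x = x + i;    /* accumulates within each thread's chunk, starting from 44 */
    }
    printf("x is %d\n", x);  /* 44 plus the sum over the chunk containing i=10 */
    return 0;
}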

## Atomic and Critical

• critical: the enclosed code block will be executed by only one thread at a time, and not simultaneously by multiple threads. It is often used to protect shared data from race conditions.
• atomic: the memory update (write, or read-modify-write) in the next instruction will be performed atomically. It does not make the entire statement atomic; only the memory update is. A compiler might use special hardware instructions for better performance than with critical.

Consider this code which numerically approximates pi:


#include <stdio.h>
#include <omp.h>

int main(void){
    double pi, x;
    int i, N;
    pi = 0.0;
    N = 1000;
    #pragma omp parallel for
    for(i = 0; i < N; i++){
        x = (double)i/N;
        pi += 4/(1 + x*x);
    }
    pi = pi/N;
    printf("Pi is %f\n", pi);
    return 0;
}


We compile this with gcc test.c -o test, leaving out the -fopenmp option, which means the #pragma omp parallel for is treated as an unrecognized pragma, i.e. ignored. We run it and this is the result:

prog@michael-laptop:~$ gcc test.c -o test
prog@michael-laptop:~$ ./test
Pi is 3.142592


Now compile with the -fopenmp option and run:

prog@michael-laptop:~$ gcc test.c -o test -fopenmp
prog@michael-laptop:~$ ./test
Pi is 2.785016

Oh dear... Let's examine what went wrong. By default, as we have not specified it as private, the variable x is shared: all threads use the same memory address for x. Therefore, thread i will compute some value of x and store it at memory address &x, and thread j will then compute its own value of x and store it at &x BEFORE thread i has used its value to make its contribution to pi. The threads are all overwriting each other's values of x because they all share the same memory address for x. Our first correction is that x must be made private:
#pragma omp parallel for private(x)

Secondly, we have a "Race Condition" for pi. Let me illustrate this with a simple example. Here is what would ideally happen:

• Thread 1 increments the value of pi: 1
• Thread 1 stores the new value of pi: 1
• Thread 2 increments the value of pi: 2
• Thread 2 stores the value of pi: 2

What is actually happening is more like this:

• Thread 1 increments pi: 1
• Thread 2 increments pi: 1
• Thread 1 stores its value of pi: 1
• Thread 2 stores its value of pi: 1

The way to correct this is to allow only one thread at a time to perform the read/write of pi. This can be achieved with critical or atomic. Add

#pragma omp atomic

just before pi gets updated and you'll see that it works.
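Putting both corrections together, the parallel loop becomes:

#pragma omp parallel for private(x)
for(i = 0; i < N; i++){
    x = (double)i/N;
    #pragma omp atomic      /* serializes only the update of pi */
    pi += 4/(1 + x*x);
}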

This scenario, where some shared value is updated inside a parallel loop, crops up time and time again, so in the end it had its own clause made for it: all of the above can be achieved by simply making pi a reduction variable.

## Reduction

To make pi a reduction variable the code is changed as follows:

#include <stdio.h>
#include <omp.h>

int main(void){
    double pi, x;
    int i, N;
    pi = 0.0;
    N = 1000;
    #pragma omp parallel for private(x) reduction(+:pi)
    for(i = 0; i < N; i++){
        x = (double)i/N;
        pi += 4/(1 + x*x);
    }
    pi = pi/N;
    printf("Pi is %f\n", pi);
    return 0;
}


This is simply the quick and neat way of achieving everything we did above: each thread accumulates its own private copy of pi, and the copies are summed at the end of the loop.
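Conceptually, the reduction clause is doing something like the following hand-written version (a sketch of the equivalent logic, not what the compiler literally emits):

#include <stdio.h>
#include <omp.h>

int main(void){
    double pi = 0.0, x, pi_local;
    int i, N = 1000;
    #pragma omp parallel private(x, pi_local)
    {
        pi_local = 0.0;                    /* per-thread partial sum */
        #pragma omp for
        for(i = 0; i < N; i++){
            x = (double)i/N;
            pi_local += 4/(1 + x*x);       /* no race: pi_local is private */
        }
        #pragma omp critical               /* combine partial sums one thread at a time */
        pi += pi_local;
    }
    pi = pi/N;
    printf("Pi is %f\n", pi);
    return 0;
}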

## OpenMP Parallel For

The parallel directive #pragma omp parallel makes the code parallel, that is, it forks the master thread into a number of parallel threads, but it doesn’t actually share out the work.
What we are really after is the parallel for directive, which we call a work-sharing construct. Consider

#include <iostream>
#include <omp.h>

using namespace std;

int main(void)
{
    int i;
    int foo;
    #pragma omp parallel for private(foo)
    for(i = 1; i < 10; i++){
        foo = omp_get_thread_num();   // record which thread ran this iteration
        #pragma omp critical
        {
            cout << "Loop number: " << i << " " << "Thread number: " << foo << endl;
        }
    }
    return 0;
}


The for directive applies to the for loop immediately following it. Notice how we don't have to delimit a parallel region with curly braces {} after this directive, in contrast to before. Note also that the iterations will not, in general, be executed in order. This program yields:

[michael@michael lindonslog]$ ./openmp
Loop number: 1 Thread number: 0
Loop number: 8 Thread number: 3
Loop number: 2 Thread number: 0
Loop number: 3 Thread number: 0
Loop number: 9 Thread number: 3
Loop number: 6 Thread number: 2
Loop number: 4 Thread number: 1
Loop number: 7 Thread number: 2
Loop number: 5 Thread number: 1
[michael@michael lindonslog]$


Notice what I said about the order: the iterations are shared out among the threads and interleave unpredictably. By default, the loop index, i.e. "i" in this context, is made private by the for directive.

At the end of the parallel for loop there is an implicit barrier, where all threads wait until every one of them has finished. There are, however, some rules for the parallel for directive:

1. The loop index, i, is incremented by a fixed amount each iteration, e.g. i++ or i+=step.
2. The start and end values must not change during the loop.
3. There must be no "breaks" in the loop where the code steps out of that code block. Function calls are, however, permitted and run as you would expect.
4. The comparison operators may be <, <=, >=, >.

There may be times when you want to perform some operation in the order of the iterations. This can be achieved with an ordered directive and an ordered clause. Each thread will wait until the previous iteration has finished its ordered section before proceeding with its own.

#include <stdio.h>
#include <omp.h>

// Stand-in for a costly computation (hypothetical; any expensive function will do)
int expensive_function(int i){ return i*i; }

int main(void){
    int i, a[10];
    #pragma omp parallel for ordered
    for(i = 0; i < 10; i++){
        a[i] = expensive_function(i);
        #pragma omp ordered
        printf("a[%d] = %d\n", i, a[i]);
    }
    return 0;
}


This will now print out the results in order. N.B. there is a penalty for this: the threads have to wait until the preceding iteration has finished with its ordered section of code. Only if expensive_function() were genuinely expensive would this be worthwhile.

## OpenMP – Open Specifications for Multi Processing

The central theme of parallel code is that of threads. A serial code starts off as one thread. As soon as the first parallel directive (Fortran) / pragma (C) is encountered, the master thread forks into a number of threads, and the code that follows is executed in parallel in a manner that can be adjusted using certain options.

To get started with OpenMP, you will need to include the omp header file with
#include <omp.h>
and you will need to add the -fopenmp option when compiling.

To fork the master thread into a number of parallel threads, one writes the following line of code:
#pragma omp parallel
This directive applies only to the block of code, {…}, that immediately follows, and the code must be structured as such. By default, all variables previously declared are shared, i.e. all threads use the same memory address for a shared variable. This can also be declared explicitly by adding shared(var_name). Conversely, you may want to make variables private, that is, each thread gets allocated its own unique location in memory in which to store that variable. A private variable is accessible only within the thread that owns it, and the per-thread copies created for parallelization are destroyed when the threads join back together. There are also reduction variables; more on that later. A minimal illustration of the two clauses follows.
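As a small example of my own: a is declared shared and b private, so every thread sees the same a, while each thread's b is a fresh, uninitialized copy.

#include <stdio.h>
#include <omp.h>

int main(void){
    int a = 1;      /* shared: one memory location seen by all threads */
    int b = 2;      /* private: each thread gets its own uninitialized copy */
    #pragma omp parallel shared(a) private(b)
    {
        b = omp_get_thread_num();   /* must assign before use: private copies start undefined */
        printf("Thread %d sees a = %d and its own b = %d\n", b, a, b);
    }
    printf("After the region, b is %d again\n", b);  /* the original b = 2 is untouched */
    return 0;
}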

Let's try an example. When you execute your code, it will inherit the OMP_NUM_THREADS environment variable of your terminal. Suppose we want to set the number of threads to 4. We write:

prog@michael-laptop:~$ export OMP_NUM_THREADS=4
prog@michael-laptop:~$ echo $OMP_NUM_THREADS
4
prog@michael-laptop:~$


You can also set the number of threads at run time with the omp_set_num_threads() function declared in omp.h.
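For example, called before the parallel region (a minimal fragment):

#include <omp.h>

int main(void){
    omp_set_num_threads(4);   /* overrides OMP_NUM_THREADS for subsequent parallel regions */
    #pragma omp parallel
    {
        /* ...this region runs with 4 threads... */
    }
    return 0;
}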

Good. Now here’s our sample code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void){
    /* Outside a parallel region there is only the master thread */
    printf("Number of threads before parallisation: %d\n", omp_get_num_threads());
    #pragma omp parallel
    {
        /* Each thread in the team reports itself */
        printf("Hello from thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}


compile and run:

prog@michael-laptop:~$ g++ openmp.cpp -o test -fopenmp
prog@michael-laptop:~$ ./test
Number of threads before parallisation: 1