Model Scale Parameterization for MCMC Efficiency

I recently came across a very interesting paper by Y. Yu and X. Meng [1], who present an interweaving strategy between different model parameterizations to improve mixing. It is well known that some model parameterizations perform better than others under certain conditions. Papaspiliopoulos, Roberts and Sköld [2] present a general framework for how to parameterize hierarchical models and provide insight into the conditions under which centered and non-centered parameterizations outperform each other. One is not, however, restricted to reparameterizations of location parameters only, as outlined in the aforementioned paper, and so I decided to experiment with reparameterizations of the scale parameter in a simple hierarchical model with improper priors on the parameters.

Centered Parameterization

Papaspiliopoulos et al. [2] gave a general definition of the centered parameterization (CP) as one in which Y_{i} is independent of \lambda given X_{i}:

\displaystyle Y_{i}|X_{i},\sigma^{2} \sim N(X_{i},\sigma^{2}) \ \ \ \ \ (1)

\displaystyle X_{i}|\sigma^{2},\lambda^{2} \sim N(0,\lambda^{2}\sigma^{2}) \ \ \ \ \ (2)

\displaystyle p( \lambda^{2} ) \propto \frac{1}{\lambda^{2}} \ \ \ \ \ (3)

Full Conditionals

\displaystyle \lambda^{2}|Y_{1:n},X_{1:n},\sigma^{2} \sim \Gamma^{-1}\left( \frac{n}{2}, \frac{\sum_{i}^{n} X_{i}^{2}}{2\sigma^{2}}\right) \ \ \ \ \ (4)

\displaystyle X_{i}|Y_{i},\sigma^{2},\lambda^{2} \sim N\left( \frac{\frac{Y_{i}}{\sigma^{2}}}{\frac{1}{\sigma^{2}}+\frac{1}{\lambda^{2}\sigma^{2}}}, \frac{1}{\frac{1}{\sigma^{2}}+\frac{1}{\lambda^{2}\sigma^{2}}} \right) \ \ \ \ \ (5)

Non-Centered Parameterization

Papaspiliopoulos et al. [2] gave a general definition of the non-centered parameterization (NCP) as one in which \tilde{X}_{i} and \lambda are a priori independent:

\displaystyle Y_{i}|\tilde{X}_{i},\sigma^{2},\lambda \sim N(\lambda \tilde{X}_{i},\sigma^{2}) \ \ \ \ \ (6)

\displaystyle \tilde{X}_{i}|\sigma^{2} \sim N(0,\sigma^{2}) \ \ \ \ \ (7)

\displaystyle p(\lambda) \propto 1 \ \ \ \ \ (8)

Full Conditionals

\displaystyle \lambda|Y_{1:n},\tilde{X}_{1:n},\sigma^{2} \sim N \left( \frac{\sum_{i=1}^{n}\tilde{X}_{i}Y_{i}}{\sum_{i=1}^{n}\tilde{X}_{i}^{2}}, \frac{\sigma^{2}}{\sum_{i=1}^{n}\tilde{X}_{i}^{2}} \right) \ \ \ \ \ (9)

\displaystyle \tilde{X}_{i}|Y_{i},\sigma^{2},\lambda \sim N\left( \frac{\frac{\lambda Y_{i}}{\sigma^{2}}}{\frac{\lambda^{2}}{\sigma^{2}}+\frac{1}{\sigma^{2}}}, \frac{1}{\frac{\lambda^{2}}{\sigma^{2}}+\frac{1}{\sigma^{2}}} \right) \ \ \ \ \ (10)

Interweaving Strategy

Generally, when the CP works well the NCP works poorly, and vice versa. Yaming Yu and Xiao-Li Meng [1] present a way of combining both strategies by interweaving the Gibbs steps of the two parameterizations at each iteration. The details can be read in their paper. I decided to test all three Gibbs samplers with the following R code:

#Generate Data
lam2=0.5
lam=sqrt(lam2)
sig2=1
n=1000
Xt=rnorm(n,0,sqrt(lam2*sig2))
Y=rnorm(n,Xt,sqrt(sig2))
nmc=2000
X=Xt

#Centered Parameterization
cp_lam2=rep(0,nmc)
cp_X=matrix(0,nmc,n)
for(i in 1:nmc){
	inv_lam2=rgamma(1,(n)/2,rate=(t(X)%*%X)/(2*sig2))
	lam2=1/inv_lam2
	X=rnorm(n,(1/(1/sig2 + 1/(sig2*lam2)))*Y/sig2, sqrt(1/(1/sig2 + 1/(sig2*lam2))))
	cp_lam2[i]=lam2
	cp_X[i,]=X
}
mean_cp_X=apply(cp_X,2,mean)

#Non-Centered Parameterization
X=Xt
ncp_lam2=rep(0,nmc)
ncp_X=matrix(0,nmc,n)
for(i in 1:nmc){
	lam=rnorm(1,t(X)%*%Y/(t(X)%*%X), sqrt(sig2/(t(X)%*%X)))
	lam2=lam*lam;
	X=rnorm(n, (1/(1/sig2 + lam2/sig2))*lam*Y/sig2, sqrt(1/(1/sig2+lam2/sig2))  )
	ncp_lam2[i]=lam2
	ncp_X[i,]=X
}
mean_ncp_X=apply(ncp_X,2,mean)

#Interweaving Strategy
int_lam2=rep(0,nmc)
int_X=matrix(0,nmc,n)
for(i in 1:nmc){
	X=rnorm(n,(1/(1/sig2 + 1/(sig2*lam2)))*Y/sig2, sqrt(1/(1/sig2 + 1/(sig2*lam2))))
	inv_lam2=rgamma(1,(n)/2,rate=(t(X)%*%X)/(2*sig2))
	half_lam2=1/inv_lam2
	X=X/sqrt(half_lam2)    #Transform to Xtilde
	lam=rnorm(1,t(X)%*%Y/(t(X)%*%X), sqrt(sig2/(t(X)%*%X)))
	lam2=lam*lam;
	int_lam2[i]=lam2
	int_X[i,]=X
}
mean_int_X=apply(int_X,2,mean)

#Remove Burnin
cp_lam2=cp_lam2[-(1:1000)]
ncp_lam2=ncp_lam2[-(1:1000)]
int_lam2=int_lam2[-(1:1000)]

#Plot Results
par(mfrow=c(3,3))
acf(cp_lam2)
plot(cp_lam2,type="l")
plot(cp_lam2[-length(cp_lam2)],cp_lam2[-1])
acf(ncp_lam2)
plot(ncp_lam2,type="l")
plot(ncp_lam2[-length(ncp_lam2)],ncp_lam2[-1])
acf(int_lam2)
plot(int_lam2,type="l")
plot(int_lam2[-length(int_lam2)],int_lam2[-1])

Results

\lambda=0.3

ACF, trace and lag-one scatter plots of \lambda^{2} for the CP, NCP and interweaved samplers

Interweaving outperforms non-centered, which outperforms centered.

\lambda=6

ACF, trace and lag-one scatter plots of \lambda^{2}; the NCP mixes poorly

Interweaving outperforms centered, which outperforms non-centered.

Discussion

As \lambda gets small, the centered parameterization becomes increasingly autocorrelated and mixes poorly. When \lambda becomes large, the non-centered parameterization becomes increasingly autocorrelated and mixes poorly. The interweaved Gibbs sampler exhibits great mixing in all cases.
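To put a number on the mixing differences, the post-burn-in draws can also be compared by effective sample size. Below is a minimal sketch, assuming the coda package is installed; its effectiveSize function estimates the number of effectively independent draws in each chain.

#Compare mixing via effective sample size (assumes the coda package)
library(coda)
ess=c(CP=effectiveSize(mcmc(cp_lam2)),
      NCP=effectiveSize(mcmc(ncp_lam2)),
      INT=effectiveSize(mcmc(int_lam2)))
print(ess)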

[1] Y. Yu and X. Meng, "To Center or Not to Center: That Is Not the Question–An Ancillarity-Sufficiency Interweaving Strategy (ASIS) for Boosting MCMC Efficiency," Journal of Computational and Graphical Statistics, vol. 20, iss. 3, pp. 531-570, 2011. doi:10.1198/jcgs.2011.203main
[2] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld, "A general framework for the parametrization of hierarchical models," Statistical Science, vol. 22, iss. 1, pp. 59-73, 2007.


Woodbury Matrix Inverse Identity

Application in Conditional Distribution of Multivariate Normal

The Sherman-Morrison-Woodbury matrix inverse identity can be regarded as a transform between Schur complements. That is, given V_{22.1}^{-1} one can obtain V_{11.2}^{-1} by using the Woodbury matrix identity, and vice versa. Recall the Woodbury identity:

V_{11.2}^{-1}=V_{11}^{-1}+V_{11}^{-1}V_{12}V_{22.1}^{-1}V_{21}V_{11}^{-1}

and

V_{22.1}^{-1}=V_{22}^{-1}+V_{22}^{-1}V_{21}V_{11.2}^{-1}V_{12}V_{22}^{-1}

I recently stumbled across a neat application of this whilst deriving full conditionals for a multivariate normal. Recall that if the data are partitioned into two blocks, Y_{1},Y_{2}, then the variance of the conditional distribution Y_{1}|Y_{2} is the Schur complement of the block V_{22} of the total variance matrix V; that is, the conditional variance is V_{11.2}=V_{11}-V_{12}V_{22}^{-1}V_{21}, which is the variance of Y_{1} minus a term corresponding to the reduction in uncertainty about Y_{1} gained from knowledge of Y_{2}. If, however, the conditional variance contains something that itself has the form of a Schur complement, then it may be possible to exploit the Woodbury identity above to simplify the variance term considerably. I came across this when I derived two very different-looking expressions for the conditional distribution and found them to be equivalent by the Woodbury identity. Consider the model

\begin{bmatrix}   Y_{1}\\   Y_{2}  \end{bmatrix}  =  \begin{bmatrix}   X_{1}\\   X_{2}  \end{bmatrix}\beta + \varepsilon

where

\varepsilon \sim N\left( \begin{bmatrix}0\\ 0\end{bmatrix}, \sigma^{2} \begin{bmatrix}I_{1} & 0 \\ 0 & I_{2}\end{bmatrix} \right)

\beta|\sigma^{2} \sim N(0, \sigma^{2}\Lambda^{-1}).

I was seeking the distribution Y_{1}|Y_{2},\sigma^{2} and arrived there through two different paths. The distributions derived looked very different, but they turned out to be equivalent upon considering the Woodbury identity.

Method 1

This simply manipulates properties of the multivariate normal. Marginalizing over \beta one gets

Cov \begin{bmatrix}   Y_{1} \\  Y_{2}  \end{bmatrix} = \begin{bmatrix}   X_{1} \\  X_{2}  \end{bmatrix} Cov(\beta)  \begin{bmatrix}   X_{1}^{T} &  X_{2}^{T}  \end{bmatrix} + Cov(\varepsilon)

Cov \begin{bmatrix}   Y_{1} \\  Y_{2}  \end{bmatrix} = \sigma^{2}\begin{bmatrix}   X_{1}\Lambda^{-1} X_{1}^{T} &  X_{1}\Lambda^{-1} X_{2}^{T} \\  X_{2}\Lambda^{-1} X_{1}^{T} &  X_{2}\Lambda^{-1} X_{2}^{T}  \end{bmatrix}  + \sigma^{2}  \begin{bmatrix}  I_{1} & 0 \\ 0 & I_{2}  \end{bmatrix}

so that the joint distribution is

\begin{bmatrix}   Y_{1}\\   Y_{2}  \end{bmatrix}\Big|\sigma^{2} \sim N \left(  \begin{bmatrix}  0\\  0  \end{bmatrix},  \sigma^{2} \left( \begin{bmatrix}  I_{1}+  X_{1}\Lambda^{-1} X_{1}^{T} &  X_{1}\Lambda^{-1} X_{2}^{T} \\  X_{2}\Lambda^{-1} X_{1}^{T} & I_{2}+ X_{2}\Lambda^{-1} X_{2}^{T}  \end{bmatrix}  \right) \right)

It follows that the conditional distribution is
Y_{1}| Y_{2},\sigma^{2} \sim N \left(  X_{1}\Lambda^{-1} X_{2}^{T} \left[  X_{2}\Lambda^{-1} X_{2}^{T} + I_{2}\right]^{-1} Y_{2},\ \sigma^{2}\left( I_{1} +  X_{1}\Lambda^{-1} X_{1}^{T} -  X_{1}\Lambda^{-1} X_{2}^{T} \left[ I_{2} +  X_{2}\Lambda^{-1} X_{2}^{T} \right]^{-1}  X_{2}\Lambda^{-1} X_{1}^{T}\right)\right).
This looks a bit nasty, but notice that V_{22}^{-1} looks like it too could be a Schur complement of some matrix.

Method 2

An alternative route to this distribution is

f( Y_{1}| Y_{2},\sigma^{2} )=\int f( Y_{1}|\sigma^{2},\beta)\pi(\beta| Y_{2},\sigma^{2} )d\beta

where

\beta_{ }| Y_{2} ,\sigma^{2}\sim N \left( ( X_{2}^{T} X_{2}+\Lambda)^{-1} X_{2}^{T} Y_{2}, \sigma^{2}( X_{2}^{T} X_{2}+\Lambda)^{-1} \right).

It follows that

Y_{1}| Y_{2} ,\sigma^{2} \sim N\left(  X_{1}( X_{2}^{T} X_{2}+\Lambda)^{-1} X_{2}^{T} Y_{2}, \sigma^{2} (I_{1} +  X_{1} ( X_{2}^{T} X_{2}+\Lambda)^{-1} X_{1}^{T}) \right)

which looks different from the distribution obtained through Method 1, although the expression for the variance is a lot neater. They are in fact identical, by the Woodbury identity.

Comparison

Mean (Submitted by Michelle Leigh)

\left[\Lambda+  X_{2}^TI_{2} X_{2}\right]^{-1} X_{2}^T\\  =\{\Lambda^{-1}-\Lambda^{-1} X_{2}^T\left[I_{2}+  X_{2}\Lambda^{-1} X_{2}^T\right]^{-1} X_{2}\Lambda^{-1}\} X_{2}^T\\  =\Lambda^{-1} X_{2}^T\left[I_{2}+  X_{2}\Lambda^{-1} X_{2}^T\right]^{-1}\left[I_{2}+  X_{2}\Lambda^{-1} X_{2}^T\right]-\Lambda^{-1} X_{2}^T\left[I_{2}+  X_{2}\Lambda^{-1} X_{2}^T\right]^{-1} X_{2}\Lambda^{-1} X_{2}^T\\  =\Lambda^{-1} X_{2}^T\left[I_{2}+  X_{2}\Lambda^{-1} X_{2}^T\right]^{-1}I_{2}

So mean1=mean2.

Variance

By the Woodbury Identity it follows that

\Lambda^{-1} - \Lambda^{-1} X_{2}^{T} \left[ I_{2} +  X_{2}\Lambda^{-1} X_{2}^{T} \right]^{-1}  X_{2}\Lambda^{-1} = ( X_{2}^{T}I_{2} X_{2}+\Lambda)^{-1}.

Therefore

X_{1}\Lambda^{-1} X_{1}^{T}- X_{1}\Lambda^{-1} X_{2}^{T} \left[ I_{2}+ X_{2}\Lambda^{-1} X_{2}^{T} \right]^{-1}  X_{2}\Lambda^{-1} X_{1}^{T}={ X_{1} ( X_{2}^{T} X_{2}+\Lambda)^{-1} X_{1}^{T}}\\

and so variance1=variance2. The trick is recognizing the form of the formulas at the top of the page; one can then write the variance as a much neater expression.
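The equivalence of the two variance expressions is also easy to check numerically. Below is a minimal R sketch with arbitrary dimensions and a randomly generated positive definite \Lambda; all of the variable names are mine.

#Numerical check of the Woodbury-based variance equivalence
set.seed(1)
p=3; n2=5
X2=matrix(rnorm(n2*p),n2,p)
A=matrix(rnorm(p*p),p,p)
Lam=crossprod(A)+diag(p)   #A positive definite Lambda
I2=diag(n2)
lhs=solve(Lam)-solve(Lam)%*%t(X2)%*%solve(I2+X2%*%solve(Lam)%*%t(X2))%*%X2%*%solve(Lam)
rhs=solve(t(X2)%*%X2+Lam)
max(abs(lhs-rhs))   #Numerically zero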

Parallel Tempering Algorithm with OpenMP / C++

1.1. Parallel Tempering Theory
1.2. Physics Origins
2.1. Intra-Thread Metropolis Move
2.2. Inter-Thread Parallel Tempering
2.3. OpenMP Parallelization
3. Full Code
4. Simulation Study
5. On the Future use of Parallel Tempering with OpenMP

Parallel tempering is one of my favourite sampling algorithms for improving MCMC mixing times. It seems to be used almost exclusively on distributed-memory architectures using MPI and remains unexploited on shared-memory architectures such as our office computers, which have up to eight cores. I've written parallel tempering algorithms in MPI and Rmpi but never in OpenMP, and it turns out that the latter has substantial advantages. I guess when people think of parallel tempering they think of processors communicating with each other via MPI and swapping parameters directly. If you are on a shared-memory device, however, you can have processor A simply write to a shared array and have processor B read from it, which saves a lot of aggro fiddling around with message numbers, blocking/non-blocking calls, deadlocks and so on. Moreover, with OpenMP you can spawn more threads than you have processors, which translates to more parallel MCMC chains in the present context, whereas this becomes troublesome with MPI due to the danger of deadlocks. OpenMP is also much easier to use than MPI: with one line you can fork a serial thread into a desired, hardware-independent number of parallel threads. The code is presented below, after a brief outline of the theory.


Parallel Tempering Theory

Each thread simulates an MCMC trajectory from the posterior raised to a fractional power, B. When B=1, the MCMC draws are from the posterior from which we wish to sample. When B=0, the MCMC trajectory is just a realization of a random walk. To see this, consider the acceptance probability of the Metropolis move: the density evaluated at the proposed parameters over the density evaluated at the current parameters, all raised to the power of zero, is unity whatever the densities are, so the moves always get accepted. Similarly, if B is close to zero, the acceptance probability is near unity and the distribution from which this MCMC is sampling is quite uniform over the parameter space, so the trajectory explores a relatively larger part of the parameter space. As B is increased toward one, the features of the distribution from which we wish to sample become more prominent. Going in the other direction, from B=1 to 0, one commonly says that the posterior is "melted down", spreading out its mass. The terminology has remained from its origins in statistical physics, where one would simulate particles at a hotter temperature so that they jostle around more and escape wells in the potential energy. The key to parallel tempering is to use the more diffuse, hotter or melted-down MCMC chains as proposal distributions for the actual cold distribution we wish to sample from. Because these proposals are not symmetric, one proceeds with a Metropolis-Hastings move. For illustration, thread j uses the hotter thread j+1 as its partner and proposal distribution: the current position \theta_{j+1} of thread j+1 is proposed as the new position for thread j.
\alpha=min(1,\frac{  p_{j} (\theta_{j+1} )}  {   p_{j}(\theta_{j} ) }  \frac{  p_{j+1} (\theta_{j} )}  {   p_{j+1}(\theta_{j+1} ) }    )
The second fraction is the Hastings addition to the Metropolis algorithm and is required to satisfy detailed balance for an unsymmetrical proposal distribution. Now realise that
p_{j}=\pi(\theta|Y)^{B_{j}}\\  p_{j+1}=\pi(\theta|Y)^{B_{j+1}}
i.e. they are the same distribution raised to different fractional powers. Working now on the log scale, it can be shown that
log \left( \frac{  p_{j} (\theta_{j+1} )}  {   p_{j}(\theta_{j} ) }  \frac{  p_{j+1} (\theta_{j} )}  {   p_{j+1}(\theta_{j+1} ) }    \right) =  (B_{j}-B_{j+1}) \left( log(\pi[\theta_{j+1}|Y]) - log(\pi[\theta_{j}|Y]) \right)
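This identity is easy to check numerically. Here is a quick R sketch with made-up values, writing lpj for log(\pi[\theta_{j}|Y]) and lpj1 for log(\pi[\theta_{j+1}|Y]).

#Check the swap log-acceptance identity with made-up values
Bj=0.8; Bj1=0.4                    #Fractional powers of threads j and j+1
lpj=-3.2; lpj1=-1.7                #Log posterior at theta_j and theta_{j+1}
lhs=Bj*(lpj1-lpj)+Bj1*(lpj-lpj1)   #Log of the product of the two fractions
rhs=(Bj-Bj1)*(lpj1-lpj)
all.equal(lhs,rhs)                 #TRUE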

Physics Origins

At this point, in order to make things correspond to the earlier physics literature, one sometimes defines the "Energy" as
E_{j}=-log(\pi[\theta_{j}|Y]).
So that the acceptance probability becomes
\alpha=min(1,e^{-(B_{j}-B_{j+1})(E_{j+1} - E_{j})  }).
It's not necessary to define this energy; it merely sets up an equivalence between statistics and physics. In physics, particles get stuck in local minima of the energy landscape; in statistics, the MCMC gets stuck in local peaks of the posterior. The reason is that in a canonical ensemble lower-energy states are more probable (recall that nature tries to minimize potential energy and that force is the negative gradient of the potential energy), so regions of the parameter space with low potential energy, physically, correspond to regions of high probability density, statistically. To be more precise, a result from statistical physics is that the distribution of energy is exponential with scale parameter kT, where k is Boltzmann's constant and T is temperature (this holds for a canonical ensemble). An exponential distribution with this scale parameter is called the Boltzmann distribution by physicists. As the temperature increases, higher-energy states become more probable and the particle jumps out of the minima more often. If you are a statistician you don't need to worry about this, but the notation sometimes crops up in the literature. It's also the same acceptance probability as in physics when sampling energies from a Boltzmann distribution. I have decided not to adopt the physics notation for this post.


Intra-Thread Metropolis Move

Each thread, within itself, performs a vanilla Metropolis move:

//Propose Candidate Position//
			t1new=t1[rank*nmc+i-1] + normal(stream[rank]);
			t2new=t2[rank*nmc+i-1] + normal(stream[rank]);

			//Calculate log-Density at Newly-Proposed and Current Position//
			lpost_new[rank]=lLikelihood(t1new,t2new) + lprior(t1new,t2new);
			lpost[rank]=lLikelihood(t1[rank*nmc+i-1],t2[rank*nmc+i-1]) + lprior(t1[rank*nmc+i-1],t2[rank*nmc+i-1]);

			//Melt Density and Calculate log-Acceptance Probability//
			lalpha=B[rank]*(lpost_new[rank]-lpost[rank]);

			//Perform Metropolis Accept-Reject Step//
			if( log(u(stream[rank])) < lalpha ){
				//Accept
				//Proposed as Current Position
				t1[rank*nmc+i]=t1new;
				t2[rank*nmc+i]=t2new;
			}else{
				//Do not Accept
				//Propagate Current Position
				t1[rank*nmc+i]=t1[rank*nmc+i-1];
				t2[rank*nmc+i]=t2[rank*nmc+i-1];
			}

A few comments on the variables: "nmc" is the number of MCMC draws I wish to generate. I have two parameters, which I have denoted t1 and t2 because t is closest to theta, and each thread stores its nmc draws of t1 and t2 in a contiguous array of length nmc times the number of threads. "rank" identifies the thread, and "lpost" and "B" are arrays of length equal to the number of threads in which to store the log posterior density at the current position and the fractional melting power, respectively. All of these variables are defined at the top of the code.


Inter-Thread Metropolis-Hastings Move


				if(u(stream[rank]) < 0.5){
					rank_partner=rank+1;
					if(rank_partner < size){
						//Inter-Thread Metropolis-Hastings Part
						lalpha = (B[rank]-B[rank_partner])*(lpost[rank_partner]-lpost[rank]);
						if(log(u(stream[rank])) < lalpha){
							//accept swap
							swap(t1[rank*nmc+i],t1[rank_partner*nmc+i]);
							swap(t2[rank*nmc+i],t2[rank_partner*nmc+i]);
						}

					}
				}

The only additional thing to note is that each chain attempts a swap with its neighbour at each iteration with probability 1/2. There is nothing special about 1/2; you could choose whatever you like, but there are pros and cons. How this is parallelized with OpenMP is shown below.


OpenMP Parallelization

The OpenMP parallel implementation of the above algorithm is very simple!

#pragma omp parallel private(i,t1new,t2new,rank,lalpha,rank_partner) shared(B, lpost, lpost_new,t1,t2,swapt1,swapt2)
	{
		//Identify Each Thread
		rank=omp_get_thread_num();

		for (i = 1; i < nmc; ++i)
		{

                 //***Intra-Thread Metropolis Part***//
	
#pragma omp barrier      //Synchronise Threads
#pragma omp critical     //Execute Critical Code Block One Thread at a Time.
			{

                 //***Inter-Thread Parallel Tempering Part***//

			}
#pragma omp barrier   //Synchronise Threads
		}
	}

The first parallel pragma simply forks the master thread into a number of threads, whereby each thread executes the following code block independently, i.e. a number of independent parallel MCMCs. Specifying variables as private means that each thread gets a copy of that variable in its own separate location in memory; shared is the opposite, although I think variables are shared by default. The barrier pragma means that each thread halts until all threads have reached this point. The critical pragma means the following code block is executed by one thread at a time only. This prevents thread j swapping with thread j+1 whilst thread j+1 is attempting a swap with thread j+2; otherwise nasty things such as race conditions can occur. The last barrier waits for all threads to reach the end, and then the next iteration of the for loop proceeds.


Full code

The full code can be found here. It depends on OpenMP and the TRNG library in order to generate multiple independent streams of random numbers. It takes the number of MCMC draws as a command-line argument.

[michael@michael tempering]$ wget http://www.lindonslog.com/example_code/tempering.cpp
[michael@michael tempering]$ g++ tempering.cpp -fopenmp -o tempering  -ltrng4 -lm
[michael@michael tempering]$ ./tempering 10000
Thread 0 has fractional power 1
Thread 1 has fractional power 0.469117
Thread 2 has fractional power 0.220071
Thread 3 has fractional power 0.103239
Thread 4 has fractional power 0.0484313
Thread 5 has fractional power 0.0227199
Thread 6 has fractional power 0.0106583
Thread 7 has fractional power 0.005
[michael@michael tempering]$
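The printed fractional powers appear to follow a geometric ladder between 1 and 0.005. A sketch of how such a ladder could be computed in R (this is my own reconstruction; tempering.cpp may compute it differently):

#Geometric ladder of fractional powers between 1 and 0.005 for 8 threads
nthreads=8
Bmin=0.005
B=Bmin^((0:(nthreads-1))/(nthreads-1))
round(B,6)
#1.000000 0.469117 0.220071 0.103239 0.048431 0.022720 0.010658 0.005000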


Simulation Study

I chose the likelihood to be five sharply peaked normal distributions located at the corners of a sort-of unit square plus one at the origin, each with variance 0.001. The prior was a normal with variance 1000 centered at the origin. The parallel tempering algorithm was run with 8 threads. The posterior draws and mixing results are below:

Posterior draws from the parallel tempering algorithm

Mixing of the parallel tempering algorithm


On the Future use of Parallel Tempering with OpenMP

I hope the code exemplifies how easy it is to run parallel MCMC chains with OpenMP. I would argue that the Metropolis moves are the hardest part. If you can write them for a single serial chain, then it is only a few extra steps to run parallel chains and implement the parallel tempering algorithm. My laptop has four cores and my office computer has eight. Given that shared-memory devices have an ever-increasing number of cores, parallel tempering seems to me an ever-more valuable algorithm for improving the mixing times of MCMC runs. After all, had I not used the extra 3 cores on my laptop, they would have remained idle. If you have extra cores, why not use them! Moreover, with OpenMP you can spawn as many parallel MCMCs as you desire, avoiding the pitfalls of MPI.


Marginal Likelihood and Model Evidence in Bayesian Regression

The marginal likelihood, or model evidence, is the probability of observing the data given a specific model. It is used in Bayesian model selection and comparison when computing the Bayes factor between models, which is simply the ratio of the two respective marginal likelihoods. It can be used, for example, to select which covariates to include in a linear model for describing data. Consider the usual normal-inverse-gamma prior specification on the regression parameters and variance.

Normal Inverse-Gamma Priors

\pi(\beta,\sigma^{2})=\pi(\beta|\sigma^{2})\pi(\sigma^{2})\\  \pi(\beta|\sigma^{2})=N(\mu_{0},\sigma^{2} \Lambda_{0}^{-1})\\  \pi(\sigma^{2})=IG(a_{0},b_{0})\\  f(Y|\beta,\sigma^{2})=N(X\beta,\sigma^{2}I)\\

Now construct the joint, out of which the regression coefficients and variance will be integrated.

f(Y|\beta,\sigma^{2})\pi(\beta|\sigma^{2})\pi(\sigma^{2})=  \left( \frac{1}{2\pi} \right)^{\frac{n}{2}} \left( \frac{1}{\sigma^{2}}\right)^{\frac{n}{2}} e^{-\frac{1}{2\sigma^{2}}(Y-X\beta)'(Y-X\beta)}   \left( \frac{1}{2\pi}\right) ^{\frac{k}{2}}\left( \frac{1}{\sigma^{2}} \right)^{\frac{k}{2}} |\Lambda_{0}|^{\frac{1}{2}} e^{-\frac{1}{2\sigma^{2}}(\beta-\mu_{0})'\Lambda_{0}(\beta-\mu_{0})}  \frac{b_{0}^{a_{0}}}{\Gamma(a_{0})}\left(\frac{1}{\sigma^{2}}\right)^{a_{0}+1}e^{-\frac{b_{0}}{\sigma^{2}}}\\

This can be made neater by rearrangement. Consider first the two exponentials above. The exponents can be combined and rewritten in such a manner that a normal kernel in \beta is recognizable.

(Y-X\beta)'(Y-X\beta)+(\beta-\mu_{0})'\Lambda_{0}(\beta-\mu_{0})=\\  \beta'(X'X+\Lambda_{0})\beta-2\beta'(X'X\hat{\beta}+\Lambda_{0}\mu_{0})   +\mu_{0}'\Lambda_{0}\mu_{0}+\hat{\beta}'X'X\hat{\beta}+Y'(I-P)Y=\\  \beta'(X'X+\Lambda_{0})\beta-2\beta'(X'X\hat{\beta}+\Lambda_{0}\mu_{0})   +\mu_{0}'\Lambda_{0}\mu_{0}+Y'Y

where \hat{\beta}=(X'X)^{-1}X'Y and P=X(X'X)^{-1}X' is the hat matrix, so that \hat{\beta}'X'X\hat{\beta}+Y'(I-P)Y=Y'PY+Y'(I-P)Y=Y'Y.

Define:
\Lambda_{n}=X'X+\Lambda_{0}\\  \mu_{n}=(X'X+\Lambda_{0})^{-1}(X'X\hat{\beta}+\Lambda_{0}\mu_{0})=\Lambda_{n}^{-1}(X'X\hat{\beta}+\Lambda_{0}\mu_{0})
then the above can be written as:

(Y-X\beta)'(Y-X\beta)+(\beta-\mu_{0})'\Lambda_{0}(\beta-\mu_{0})=\\  \beta'\Lambda_{n}\beta-2\beta'\Lambda_{n}\mu_{n}+\mu_{0}'\Lambda_{0}\mu_{0}+Y'Y=\\  (\beta-\mu_{n})'\Lambda_{n}(\beta-\mu_{n})-\mu_{n}'\Lambda_{n}\mu_{n}+\mu_{0}'\Lambda_{0}\mu_{0}+Y'Y\\
where the last line was achieved by completing the square.

f(Y|\beta,\sigma^{2})\pi(\beta|\sigma^{2})\pi(\sigma^{2})=f(Y,\beta,\sigma^{2})=\\  \left( \frac{1}{2\pi} \right)^{\frac{n}{2}} \left( \frac{1}{\sigma^{2}}\right)^{\frac{n}{2}} e^{-\frac{1}{2\sigma^{2}}(Y'Y-\mu_{n}'\Lambda_{n}\mu_{n}+\mu_{0}'\Lambda_{0}\mu_{0})}  \left( \frac{1}{2\pi}\right) ^{\frac{k}{2}}\left( \frac{1}{\sigma^{2}} \right)^{\frac{k}{2}} |\Lambda_{0}|^{\frac{1}{2}} e^{-\frac{1}{2\sigma^{2}}(\beta-\mu_{n})'\Lambda_{n}(\beta-\mu_{n})}   \frac{b_{0}^{a_{0}}}{\Gamma(a_{0})}\left(\frac{1}{\sigma^{2}}\right)^{a_{0}+1}e^{-\frac{b_{0}}{\sigma^{2}}}\\

Recognizing the kernel of a normal in \beta, and knowing its normalizing constant, the integral over \beta is now simple. Integrating out \beta yields:

f(Y,\sigma^{2})=  \left( \frac{1}{2\pi} \right)^{\frac{n}{2}} \left( \frac{1}{\sigma^{2}}\right)^{\frac{n}{2}} e^{-\frac{1}{2\sigma^{2}}(Y'Y-\mu_{n}'\Lambda_{n}\mu_{n}+\mu_{0}'\Lambda_{0}\mu_{0})}   \frac{|\Lambda_{0}|^{\frac{1}{2}}}{|\Lambda_{n}|^{\frac{1}{2}}}   \frac{b_{0}^{a_{0}}}{\Gamma(a_{0})}\left(\frac{1}{\sigma^{2}}\right)^{a_{0}+1}e^{-\frac{b_{0}}{\sigma^{2}}}\\

Following a similar strategy to that used for \beta, this can be rearranged into something containing an inverse-gamma kernel in \sigma^{2}.

f(Y,\sigma^{2})=\left( \frac{1}{2\pi} \right)^{\frac{n}{2}} \left( \frac{1}{\sigma^{2}}\right)^{\frac{n}{2}+a_{0}+1} e^{-\frac{b_{0}+\frac{1}{2}(Y'Y-\mu_{n}'\Lambda_{n}\mu_{n}+\mu_{0}'\Lambda_{0}\mu_{0})}{\sigma^{2}}}    \frac{|\Lambda_{0}|^{\frac{1}{2}}}{|\Lambda_{n}|^{\frac{1}{2}}}   \frac{b_{0}^{a_{0}}}{\Gamma(a_{0})}

Define
a_{n}=a_{0}+\frac{n}{2}\\  b_{n}=b_{0}+\frac{1}{2}(Y'Y-\mu_{n}'\Lambda_{n}\mu_{n}+\mu_{0}'\Lambda_{0}\mu_{0})
Then the above reads

f(Y,\sigma^{2})=\left( \frac{1}{2\pi} \right)^{\frac{n}{2}}    \frac{|\Lambda_{0}|^{\frac{1}{2}}}{|\Lambda_{n}|^{\frac{1}{2}}}  \frac{b_{0}^{a_{0}}}{\Gamma(a_{0})}   \left( \frac{1}{\sigma^{2}}\right)^{a_{n}+1} e^{-\frac{b_{n}}{\sigma^{2}}}\\

Recognizing the inverse-gamma kernel makes the integration over \sigma^{2} easy, which yields the final result for the marginal likelihood below.

Marginal Likelihood/Model Evidence

m(Y)=\left( \frac{1}{2\pi} \right)^{\frac{n}{2}}    \frac{|\Lambda_{0}|^{\frac{1}{2}}}{|\Lambda_{n}|^{\frac{1}{2}}}  \frac{b_{0}^{a_{0}}}{b_{n}^{a_{n}}}  \frac{\Gamma(a_{n})}{\Gamma(a_{0})}\\

An example application of this can be found here in linear model selection.
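For concreteness, here is a minimal R sketch of the final formula, computed on the log scale for numerical stability; the function and argument names are my own, and exponentiating the result gives m(Y).

#Log marginal likelihood for the normal inverse-gamma linear model
log_marginal=function(Y,X,mu0,Lambda0,a0,b0){
  n=length(Y)
  Lambdan=t(X)%*%X+Lambda0
  mun=solve(Lambdan,t(X)%*%Y+Lambda0%*%mu0)   #Note X'X betahat = X'Y
  an=a0+n/2
  bn=b0+0.5*(sum(Y^2)+t(mu0)%*%Lambda0%*%mu0-t(mun)%*%Lambdan%*%mun)
  as.numeric(-(n/2)*log(2*pi)
             +0.5*(determinant(Lambda0)$modulus-determinant(Lambdan)$modulus)
             +a0*log(b0)-an*log(bn)+lgamma(an)-lgamma(a0))
}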

Polynomial Regression Linear Model in R

This post is just a brief example of how linear model theory can be used to perform polynomial regression. Consider getting some bivariate data that looks like this (downloadable here)
Polynomial Regression

and suppose we wish to fit a 3rd degree polynomial to this data. We can write it as a linear model
y_{i}=B_{0}+B_{1}x_{i}+B_{2}x_{i}^{2}+B_{3}x_{i}^{3} + \epsilon_{i}\\  Y=XB + \epsilon,
which is a linear model because it is linear in the regression coefficients. The ordinary least squares estimator of the regression coefficients is then
\hat{B}_{OLS}=(X^{'}X)^{-1}X'Y.

This is implemented in the R code below:

plot(x,y)
n=length(y)
int=rep(1,n)
X=cbind(x,x^2,x^3,int)
B=solve(t(X)%*%X)%*%t(X)%*%y
u=seq(-100,100,1)
v=B[4]+B[1]*u+B[2]*u^2+B[3]*u^3
lines(u,v,col="red")

The array B contains the OLS estimates of the regression coefficients. The fitted polynomial is plotted below in red:
Fitted Polynomial

The extension to higher-degree polynomials is simple: just add columns to the design matrix X.
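As a quick cross-check, the same coefficients can be obtained with R's built-in lm():

#Equivalent fit with lm(); the coefficients should match B up to ordering
fit=lm(y~x+I(x^2)+I(x^3))
coef(fit)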

Ages 10-12 Toy Exoplanet Detection

A major objection to the previous simulated light curves is that the baseline is rarely constant. Instead, from what I have learned, it is a horrible mess of discontinuities and curves due to the telescope rotating and instruments heating up. I spoke to someone who said that there is some periodicity in the curve. It is easy to adapt the previous code to allow for a periodic background flux. Consider the model:
f_{t} \sim N(b_{1} + a\text{sin}(\omega t)- b_{2}1_{ \left[ \tau_{1},\tau_{2} \right] }(t), \sigma^{2})
which has two additional parameters, the amplitude a and the angular frequency omega, upon which I put priors:
a \sim unif(0,b_{1})\\  \omega \sim unif(0,1)\\
I'm not sure whether the last prior is justified; I'd need to talk to an exoplanet expert, but hopefully the background frequency is much lower than the frequency at which measurements are being made.
The JAGS code looks like this:
The JAGS code looks like this:

model
{
for(t in 1 : N)
{
D[t] ~ dnorm(mu[t],s2)   #dnorm in JAGS is parameterized by mean and precision
mu[t] <- b[1]+ amp*sin(omega*t) - step(t - tau1)*step(tau2-t)*step(tau2-tau1) * b[2]
}
b[1] ~ dnorm(0.0,1.0E-6)T(0,)
b[2] ~ dnorm(0.0,1.0E-6)T(0,b[1])
tau1 ~ dunif(1,N)
tau2 ~ dunif(tau1,N)
s2 ~ dunif(0,100)
amp ~ dunif(0,b[1])
omega ~ dunif(0,1)
}

The code to call JAGS from R and make plots can be found here.

rm(list=ls())
library('rjags')

s2=5
bn=1000
dn=400
amp=1
omega=0.005
base=rnorm(bn,100,sqrt(s2))
drop=rnorm(dn,97,sqrt(s2))
t=seq(1,length(c(base,drop,base,base)),1)
sin=amp*sin(omega*t)
trace=c(base,drop,base,base)
trace=trace+sin

plot(trace,type="l",col="red",xlab="Time t",ylab="Flux")

datalist=list(
  D=trace,
  N=length(trace)
  )

jags <- jags.model(file.path('/home/grad/msl33/Dropbox/samsi/cp.bugs'),data = datalist, n.chains = 1, n.adapt = 500)
mcmc.samples=coda.samples(jags,c('b','tau1','tau2','amp','omega'),2000)
plot(mcmc.samples)

I generated some fake data that looks like this:
Periodic Background Flux with Exoplanet Transit Light Curve
based on the parameters:
b_{1}=100\\  b_{2}=3\\  a=1\\  \tau_{1}=1000\\  \tau_{2}=1400\\

The posterior density estimates are below:
Posterior Angular Frequency and Amplitude
Posterior Exoplanet Transit Times
Background Flux and Dip in Flux
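The density plots above come from plot(mcmc.samples); numerical posterior summaries (means and quantiles) can also be printed with coda's summary method.

#Numerical posterior summaries of the monitored parameters
summary(mcmc.samples)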

Toy Exoplanet Change Point Light Curve Transit Detection

I've been attending an exoplanet data conference this week, a gathering of astrophysicists and statisticians. One way to look for exoplanets is by the "transit" method: a dip in the flux from a star is observed as an orbiting planet passes across the line of sight between the observer and the star. There was a lot of talk about automated methods of detecting this dip in brightness. I decided to code this up for the lulz:

First I generated some fake data.
fake exoplanet transit

The dip in flux is meant to represent the exoplanet passing between the observer and the star. The interesting parameters are the size of the dip in flux and the time interval of the reduced flux, as these correspond to physical parameters of the planet-star system. I decided to model this as a simple changepoint problem:

f(t) \sim N(b_{1} - b_{2}1_{\left[ \tau_{1},\tau_{2} \right]}(t), \sigma^{2})

The parameters of interest are b_{1}, b_{2}, \tau_{1} and \tau_{2}: the baseline, the drop, the time at which the drop begins and the time at which it ends, respectively. I proceeded using the following JAGS code:

model
{
for(t in 1 : N)
{
D[t] ~ dnorm(mu[t],s2)   #dnorm in JAGS is parameterized by mean and precision
mu[t] <- b[1] - step(t - tau1)*step(tau2-t)*step(tau2-tau1) * b[2]
}
b[1] ~ dnorm(0.0,1.0E-6)T(0,)
b[2] ~ dnorm(0.0,1.0E-6)T(0,b[1])
tau1 ~ dunif(1,N)
tau2 ~ dunif(tau1,N)
s2 ~ dunif(0,100)
}

The script to generate the data and call JAGS from R can be found here. The posterior density estimates are displayed below:
exoplanet posterior summaries

The Bayesian change point model correctly identifies a drop of 40 and the times at which the transit begins and finishes. One might argue that such a large drop in flux is unphysical for detecting Earth-like planets, so here is a further example to see how robust this approach is. Consider the following light curve:
exoplanet light curve
Can you tell by eye how deep the drop is, when it occurs and for how long? Here are the posterior density estimates:

exoplanet posterior distribution
exoplanet posterior distribution
The true values with which the light curve was generated are
b_{1}=100\\  b_{2}=0.5\\  \tau_{1}=2000\\  \tau_{2}=2400\\  \sigma^{2}=5
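The exact script is linked above; the sketch below shows how a light curve with these values could be generated, in the same style as the earlier example (the length of the post-transit segment is my own choice).

#Sketch of generating a light curve with the above true values
s2=5
trace=c(rnorm(2000,100,sqrt(s2)),      #Baseline b1=100 up to tau1=2000
        rnorm(400,100-0.5,sqrt(s2)),   #Drop of b2=0.5 during the transit
        rnorm(600,100,sqrt(s2)))       #Baseline again after tau2=2400
plot(trace,type="l",col="red",xlab="Time t",ylab="Flux")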

An Introduction to Importance Sampling

Importance Sampling is a Monte Carlo integration technique for getting (very accurate) approximations to integrals. Consider the integral
\int e^{-\frac{1}{2}x^{2}}dx,
and suppose we wish to approximate this without doing any calculus. Statistically speaking we want to compute the normalizing constant for a standard normal, which we know to be
\sqrt{2 \pi} \approx 2.506628.
We can rewrite the above integral as
\int \frac{e^{-\frac{1}{2}x^{2}}}{g(x)}g(x)dx,
because this is just multiplying the integrand by 1. One can now make the observation that
\int \frac{e^{-\frac{1}{2}x^{2}}}{g(x)}g(x)dx=\mathbb{E}\left[\frac{e^{-\frac{1}{2}x^{2}}}{g(x)} \right],
and realise that
\mathbb{E}\left[\frac{e^{-\frac{1}{2}x^{2}}}{g(x)} \right] \approx \frac{1}{n}\sum_{i=1}^{n} \frac{e^{-\frac{1}{2}x_{i}^{2}}}{g(x_{i})},
where
x_{i} \sim g(x).
The R code below performs this summation, taking g to be a Student-t density with 3 degrees of freedom (which has heavier tails than the integrand).

N=10000
x=rt(N,3)                 #Draws from the proposal g: Student-t with 3 df
w=exp(-0.5*x^2)/dt(x,3)   #Importance weights: integrand over proposal density
est=cumsum(w)/(1:N)       #Running importance sampling estimate

plot(est[50:N],type="l")

Importance Sampling Normalizing Constant
At n=10000 the normalizing constant given by importance sampling is 2.516303; increasing n to 100000 gives 2.505983.

Multivariate Normal Conditional Distribution

Previously I made a post about diagonalizing a positive definite symmetric matrix. One of the applications of this within statistics is to diagonalize the variance matrix of a multivariate normal in order to derive conditional distributions. Let

Y\sim N_{p}(m,V)
Consider multiplying Y by the following matrix
\begin{pmatrix}          I & -V_{12}V_{22}^{-1}\\          0 & I\\  \end{pmatrix}Y \sim N_{p}\left( \begin{pmatrix} m_{1} - V_{12}V_{22}^{-1}m_{2} \\ m_{2}\\ \end{pmatrix},\begin{pmatrix}          V_{11}-V_{12}V_{22}^{-1}V_{21} & 0 \\          0 & V_{22}\\  \end{pmatrix}\right)
i.e.
\begin{pmatrix}          Y_{1}-V_{12}V_{22}^{-1}Y_{2} \\          Y_{2}\\  \end{pmatrix}\sim  N\left(  \begin{pmatrix} m_{1} - V_{12}V_{22}^{-1}m_{2} \\ m_{2}\\ \end{pmatrix},\begin{pmatrix}          V_{11}-V_{12}V_{22}^{-1}V_{21} & 0 \\          0 & V_{22}\end{pmatrix} \right)\\
The covariance matrix has now been diagonalized. This is useful because zero covariance implies independence for normally distributed random variables and so it follows that

Y_{1}-V_{12}V_{22}^{-1}Y_{2} \sim N( m_{1} - V_{12}V_{22}^{-1}m_{2} ,V_{11}-V_{12}V_{22}^{-1}V_{21})
Conditioning on Y_{2}, it follows that the conditional distribution is
Y_{1}|Y_{2} \sim N( m_{1} + V_{12}V_{22}^{-1}(Y_{2}-m_{2} ),V_{11}-V_{12}V_{22}^{-1}V_{21})
Notice how the variance of the conditional normal distribution is the marginal variance of Y_{1} minus something; that is to say, the variance of Y_{1} is reduced given knowledge of Y_{2}. The variance can be recognized as the Schur complement of the covariance matrix with respect to V_{22}. A similar treatment yields
Y_{2}|Y_{1} \sim N( m_{2} + V_{21}V_{11}^{-1}(Y_{1}-m_{1} ),V_{22}-V_{21}V_{11}^{-1}V_{12})
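As a quick numerical sanity check (my own illustration), one can verify by simulation that Y_{1}-V_{12}V_{22}^{-1}Y_{2} is uncorrelated with Y_{2} and has the Schur complement as its variance, using mvrnorm from the MASS package.

#Simulation check of the conditional-distribution formulas (requires MASS)
library(MASS)
set.seed(1)
m=c(1,2,3)
A=matrix(rnorm(9),3,3)
V=crossprod(A)+diag(3)         #A positive definite covariance matrix
Y=mvrnorm(1e5,m,V)             #Y1=Y[,1], Y2=Y[,2:3]
V12=V[1,2:3,drop=FALSE]
V22=V[2:3,2:3]
coefmat=V12%*%solve(V22)       #V12 V22^{-1}
Z=Y[,1]-Y[,2:3]%*%t(coefmat)   #Y1 - V12 V22^{-1} Y2
cov(Z,Y[,2:3])                 #Approximately zero
var(Z)                         #Approximately the Schur complement below
V[1,1]-coefmat%*%t(V12)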