The post Monotonicity of EM Algorithm Proof appeared first on Lindons Log.

Here the monotonicity of the EM algorithm is established. Let \(Y_{o}\) denote the observed data, \(Y_{m}\) the missing data and \(\theta\) the parameters.

$$ f_{o}(Y_{o}|\theta)=f_{o,m}(Y_{o},Y_{m}|\theta)/f_{m|o}(Y_{m}|Y_{o},\theta)$$

$$ \log L_{o}(\theta)=\log L_{o,m}(\theta)-\log f_{m|o}(Y_{m}|Y_{o},\theta) \label{eq:loglikelihood} $$

where \( L_{o}(\theta)\) is the likelihood under the observed data and \(L_{o,m}(\theta)\) is the likelihood under the complete data. Taking the expectation of the second line with respect to the conditional distribution of \(Y_{m}\) given \(Y_{o}\) and the current parameters \(\theta^{k}\) yields

$$\log L_{o}(\theta)= \mathbb{E}_{Y_{m}}\left[\log L_{o,m}(\theta)|Y_{o},\theta^{k}\right]-\mathbb{E}_{Y_{m}}\left[\log f_{m|o}(Y_{m}|Y_{o},\theta)|Y_{o},\theta^{k} \right]$$

which is used to construct the difference between the log-likelihood of a new value of \(\theta\) and the current value \(\theta^{k}\) as

\begin{equation}
\begin{split}
\log L_{o}(\theta)-&\log L_{o}(\theta^{k})=\mathbb{E}_{Y_{m}}\left[ \log L_{o,m}(\theta)|Y_{o},\theta^{k}\right]-\mathbb{E}_{Y_{m}}\left[ \log L_{o,m}(\theta^{k})|Y_{o},\theta^{k}\right] \\
+&\mathbb{E}_{Y_{m}}\left[ \log f_{m|o}(Y_{m}|Y_{o},\theta^{k})|Y_{o},\theta^{k} \right]-\mathbb{E}_{Y_{m}}\left[ \log f_{m|o}(Y_{m}|Y_{o},\theta)|Y_{o},\theta^{k} \right],
\end{split}
\end{equation}

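or, adopting the common notation \(Q(\theta;\theta^{k})=\mathbb{E}_{Y_{m}}\left[\log L_{o,m}(\theta)|Y_{o},\theta^{k}\right]\) and \(H(\theta;\theta^{k})=\mathbb{E}_{Y_{m}}\left[\log f_{m|o}(Y_{m}|Y_{o},\theta)|Y_{o},\theta^{k}\right]\), as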

\begin{equation}
\log L_{o}(\theta)-\log L_{o}(\theta^{k})=Q(\theta;\theta^{k})-Q(\theta^{k};\theta^{k})+H(\theta^{k};\theta^{k})-H(\theta;\theta^{k}).
\end{equation}

Consider the last two “\( H\)” terms, then by Jensen’s inequality

\begin{align*}
&-\mathbb{E}_{Y_{m}}\left[ \log f_{m|o}(Y_{m}|Y_{o},\theta)- \log f_{m|o}(Y_{m}|Y_{o},\theta^{k})|Y_{o},\theta^{k} \right]\\
&=-\mathbb{E}_{Y_{m}}\left[\log \frac{ f_{m|o}(Y_{m}|Y_{o},\theta)}{ f_{m|o}(Y_{m}|Y_{o},\theta^{k})}|Y_{o},\theta^{k} \right]\\
&\geq-\log \mathbb{E}_{Y_{m}}\left[ \frac{ f_{m|o}(Y_{m}|Y_{o},\theta)}{ f_{m|o}(Y_{m}|Y_{o},\theta^{k})}|Y_{o},\theta^{k} \right]\\
&=-\log \int f_{m|o}(Y_{m}|Y_{o},\theta)dY_{m}\\
&=0 \; \; \; \; \; \; \;\forall \theta\in \Theta.
\end{align*}

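It follows that \(\log L_{o}(\theta)-\log L_{o}(\theta^{k})\geq 0\) whenever \(\theta\) is chosen such that \(Q(\theta;\theta^{k})-Q(\theta^{k};\theta^{k})\geq 0\). The M-step, which sets \(\theta^{k+1}=\arg\max_{\theta}Q(\theta;\theta^{k})\), satisfies this condition, so the observed-data log-likelihood never decreases from one EM iteration to the next.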
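The result can be checked numerically. The following is a minimal sketch, not part of the original proof: it assumes a toy model (an equal-weight mixture of two unit-variance normals with unknown means) and the hypothetical function names `mixture_log_likelihood` and `em_trace`. It records \(\log L_{o}(\theta^{k})\) after every EM iteration so that the monotone increase can be inspected.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Unnormalised N(mu, 1) kernel; the 1/sqrt(2*pi) factor is common to both
// components and only shifts the log-likelihood by a constant.
double kernel(double y, double mu) { return std::exp(-0.5 * (y - mu) * (y - mu)); }

// Observed-data log-likelihood (up to a constant) of 0.5*N(mu1,1) + 0.5*N(mu2,1).
double mixture_log_likelihood(const std::vector<double>& y, double mu1, double mu2) {
    double ll = 0.0;
    for (double yi : y) ll += std::log(0.5 * kernel(yi, mu1) + 0.5 * kernel(yi, mu2));
    return ll;
}

// Run EM from (mu1, mu2) and return the log-likelihood at iterations 0..iters.
std::vector<double> em_trace(const std::vector<double>& y, double mu1, double mu2, int iters) {
    assert(!y.empty() && iters >= 0);
    std::vector<double> trace{mixture_log_likelihood(y, mu1, mu2)};
    for (int k = 0; k < iters; ++k) {
        double s1 = 0.0, w1 = 0.0, s2 = 0.0, w2 = 0.0;
        for (double yi : y) {
            // E-step: posterior probability that yi came from component 1.
            double a = kernel(yi, mu1), b = kernel(yi, mu2);
            double r = a / (a + b);
            s1 += r * yi;         w1 += r;
            s2 += (1.0 - r) * yi; w2 += 1.0 - r;
        }
        // M-step: the weighted means maximise Q(theta; theta^k) for this model.
        mu1 = s1 / w1;
        mu2 = s2 / w2;
        trace.push_back(mixture_log_likelihood(y, mu1, mu2));
    }
    return trace;
}
```

Calling `em_trace(y, -1.0, 1.0, 25)` on a small data vector and comparing consecutive entries of the returned trace shows each entry to be at least as large as the previous one, up to floating-point rounding.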

Salakhutdinov R.R., Roweis S.T. & Ghahramani Z. (2012). On the Convergence of Bound Optimization Algorithms, arXiv:1212.2490v1

Wu C.F.J. (1983). On the Convergence Properties of the EM Algorithm, The Annals of Statistics, 11 (1) 95-103. DOI: 10.1214/aos/1176346060

McLachlan G. & Peel D. DOI: 10.1002/0471721182

The post C++11 versus R Standalone Random Number Generation Performance Comparison appeared first on Lindons Log.

```
#define MATHLIB_STANDALONE
#include <iostream>
#include <vector>
#include <random>
#include <chrono>
#include "Rmath.h"

int main(int argc, char *argv[])
{
  int ndraws=100000000;
  std::vector<double> Z(ndraws);
  std::mt19937 engine;
  std::normal_distribution<double> N(0,1);

  // Time the C++11 <random> normal generator
  auto start = std::chrono::steady_clock::now();
  for(auto & z : Z ) {
    z=N(engine);
  }
  auto end = std::chrono::steady_clock::now();
  std::chrono::duration<double> elapsed=end-start;
  std::cout << elapsed.count() << " seconds - C++11" << std::endl;

  // Time rnorm from the R standalone math library
  start = std::chrono::steady_clock::now();
  for(auto & z : Z ) {
    z=rnorm(0,1);
  }
  end = std::chrono::steady_clock::now();
  elapsed=end-start;
  std::cout << elapsed.count() << " seconds - R Standalone" << std::endl;

  return 0;
}
```

Compile and run with:

```
[michael@michael coda]$ g++ normal.cpp -o normal -std=c++11 -O3 -lRmath
[michael@michael coda]$ ./normal
```

```
5.2252 seconds - C++11
6.0679 seconds - R Standalone

11.2132 seconds - C++11
12.4486 seconds - R Standalone

6.31157 seconds - C++11
6.35053 seconds - R Standalone
```

As expected the C++11 implementation is faster, but not by a huge amount: across the three runs above the R standalone library took between roughly 1% and 16% longer. As the computational cost of my code is dominated by other linear algebra procedures of O(n^3), I’d actually be willing to use the R standalone library because the syntax is more user friendly.

The post Stochastic Optimization in R by Parallel Tempering appeared first on Lindons Log.

I’ve written a few posts now about using parallel tempering to sample from complicated multi-modal target distributions, but there are also other benefits and uses to this algorithm. There is a nice post on Darren Wilkinson’s blog about using tempered posteriors for marginal likelihood calculations. There is also another area where parallel tempering finds application, namely stochastic optimization. I first encountered parallel tempering whilst doing my MPhys degree at the University of Warwick, but at that time it was employed as a stochastic optimization algorithm to find the minimum energy configuration of a Lennard-Jones cluster, as opposed to a sampling algorithm. All that is required is one observation to turn this sampling algorithm into a stochastic optimization algorithm. Let’s break this observation down into a few steps.

Consider sampling from a simple exponential distribution $$f(E)\propto e^{-\beta E}1_{(0,\infty )}(E),$$

with rate parameter \(\beta\). For now let’s fix \(\beta=5\). One could sample from this distribution using the same Rmpi parallel tempering code given in my previous post by simply changing the target distribution to the exponential above. The histograms of MCMC draws from four tempered distributions would then look something like this:

Note the scale on the x-axis. The two important observations mentioned earlier are

The second point is important because although the sampling algorithm is creating draws that are not the minimum value of E, by increasing the rate parameter one can force these draws to be arbitrarily close to E-min.

How does this relate to optimization? Consider setting $$E(\theta)=(\theta-40)^2.$$ Whereas before the Metropolis algorithm would propose a new value of E, say E’, now the proposal is made in θ, and θ’ is accepted if u < f(E(θ’)) / f(E(θ)), where u is a Uniform(0,1) draw. By construction the algorithm gives draws close to E-min, which occurs when θ=40. The traceplot of θ is shown below:

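The idea can be sketched in a few lines of C++. This is a single-chain illustration under stated assumptions, not the Rmpi code from the post: the function names `quadratic_energy` and `metropolis_minimize`, the unit-variance random-walk proposal and the seed argument are all hypothetical choices made here. The chain targets \(f(\theta)\propto e^{-\beta E(\theta)}\) and simply remembers the lowest-energy draw.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Energy function from the text: minimised at theta = 40.
double quadratic_energy(double theta) { return (theta - 40.0) * (theta - 40.0); }

// Random-walk Metropolis targeting a density proportional to exp(-beta * E(theta)).
// Returns the lowest-energy position visited by the chain.
double metropolis_minimize(double theta, double beta, int iters, unsigned seed) {
    assert(beta > 0.0 && iters > 0);
    std::mt19937 engine(seed);
    std::normal_distribution<double> step(0.0, 1.0);          // assumed proposal scale
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double best = theta;
    for (int t = 0; t < iters; ++t) {
        double prop = theta + step(engine);
        // log of f(E(prop)) / f(E(theta))
        double lalpha = -beta * (quadratic_energy(prop) - quadratic_energy(theta));
        if (std::log(unif(engine)) < lalpha) theta = prop;     // accept the proposal
        if (quadratic_energy(theta) < quadratic_energy(best)) best = theta;
    }
    return best;
}
```

For example, `metropolis_minimize(0.0, 5.0, 20000, 42u)` starts at θ=0 with β=5 and returns a value close to 40; a larger β concentrates the draws more tightly around the minimum.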

The above quadratic was an easy uni-modal example. Let’s try a harder function. Consider the minimum of $$ E(\theta)=3\sin(\theta)+(0.1\theta-3)^2,$$ which looks like this:

This function has many local minima but one global minimum around 30. Local minima make optimization challenging, and many optimization algorithms get stuck in these regions because locally it appears the minimum has been reached. This is where parallel tempering really helps. The traceplots of theta are shown for six tempered distributions below:

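A serial emulation of the tempered chains shows why the hotter chains help. The sketch below is an assumption-laden illustration rather than the post's Rmpi implementation: all chains live in one process, the names `rugged_energy` and `tempering_minimize` are hypothetical, and the inverse-temperature ladder is supplied by the caller. Chain r targets \(e^{-\beta_{r}E(\theta)}\), and neighbouring chains exchange positions with log acceptance probability \((\beta_{r}-\beta_{r+1})(E(\theta_{r})-E(\theta_{r+1}))\).

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

// Multi-modal energy from the text, with a global minimum near theta = 30.
double rugged_energy(double theta) {
    return 3.0 * std::sin(theta) + (0.1 * theta - 3.0) * (0.1 * theta - 3.0);
}

// Serial emulation of parallel tempering; beta[0] is the coldest chain.
// Returns the lowest-energy position visited by any chain.
double tempering_minimize(const std::vector<double>& beta, int iters,
                          int swap_interval, unsigned seed) {
    assert(beta.size() >= 2 && iters > 0 && swap_interval > 0);
    std::mt19937 engine(seed);
    std::normal_distribution<double> step(0.0, 1.0);          // assumed proposal scale
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> theta(beta.size(), 0.0);              // every chain starts at 0
    double best = theta[0];
    for (int t = 1; t <= iters; ++t) {
        // Within-chain Metropolis move at each temperature.
        for (std::size_t r = 0; r < beta.size(); ++r) {
            double prop = theta[r] + step(engine);
            double lalpha = -beta[r] * (rugged_energy(prop) - rugged_energy(theta[r]));
            if (std::log(unif(engine)) < lalpha) theta[r] = prop;
            if (rugged_energy(theta[r]) < rugged_energy(best)) best = theta[r];
        }
        if (t % swap_interval != 0) continue;
        // Swap moves between neighbours: pairs (0,1),(2,3),... then (1,2),(3,4),...
        for (std::size_t start = 0; start < 2; ++start) {
            for (std::size_t r = start; r + 1 < beta.size(); r += 2) {
                double lalpha = (beta[r] - beta[r + 1]) *
                                (rugged_energy(theta[r]) - rugged_energy(theta[r + 1]));
                if (std::log(unif(engine)) < lalpha) std::swap(theta[r], theta[r + 1]);
            }
        }
    }
    return best;
}
```

With a ladder such as `{5.0, 1.5, 0.5, 0.15, 0.05}`, the hottest chains cross the barriers between local minima freely, and the swaps pass good positions down to the cold chain, which on its own would stay trapped near its starting point.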

I’m currently working on another example just for fun, namely finding the lowest energy configuration of an n-particle Lennard-Jones cluster. This is a nice example because one can visualize the process using VMD, and it also provides some insight into the origins of such terminology as “tempering”, “annealing” and “temperature”, which always look somewhat out of place in the statistics literature.

Consider the function

$$ E(\theta)=10\sin(0.3\theta)\sin(1.3\theta^2) + 0.00001\theta^4 + 0.2\theta+80, $$

which is shown below.

The trace-plots for the parallel tempering optimization are shown below

Examining the MCMC draws, the minimum is obtained at \(\theta=-15.81515\).

Li Y., Protopopescu V.A., Arnold N., Zhang X. & Gorin A. (2009). Hybrid parallel tempering and simulated annealing method, Applied Mathematics and Computation, 212 (1) 216-228. DOI: 10.1016/j.amc.2009.02.023

The post Parallel Tempering in R with Rmpi appeared first on Lindons Log.

First one needs to write the density one wishes to sample from:

```
library(mvtnorm) #Provides dmvnorm

logdensity<-function(theta){
  #Distribution one wishes to sample from here.
  #It may be more convenient to pass theta as a list
  sigma2=0.001;
  Sigma=matrix(0,2,2);
  Sigma[1,1]=sigma2;
  Sigma[2,2]=sigma2;
  density=dmvnorm(theta,c(0,0),Sigma)+dmvnorm(theta,c(-2,0.8),Sigma)+dmvnorm(theta,c(-1,1),Sigma)+dmvnorm(theta,c(1,1),Sigma)+dmvnorm(theta,c(0.5,0.5),Sigma);
  return(log(density))
}
```

The density I chose was a mixture of 5 well-separated bi-variate Normals. One should note that it is probably cleanest to pass all the arguments to this function as a list theta. It wasn’t really necessary in this case but if you have a posterior distribution with a number of parameters of varying dimension then it would be much nicer as a list. In a future blog post I may change the target density to be the energy distribution of a Lennard-Jones cluster.

The parallel tempering algorithm itself is also written as a function, because Rmpi allows you to pass a function to all slaves and execute it. It was basically the easiest way of writing it for Rmpi.

```
temper<-function(niter,Bmin,swap.interval){
  rank=mpi.comm.rank();
  size=mpi.comm.size();
  swap=0;
  swaps.attempted=0;
  swaps.accepted=0;

  #Higher ranks run the higher "temperatures" (~smaller fractional powers)
  B=rep(0,size-1);
  for(r in 1:(size-1)){
    temp=(r-1)/(size-2);
    B[r]=Bmin^temp;
  }

  #Create a vector for proposal moves
  prop=rep(0,2);
  theta=matrix(0,niter,2)

  for(t in 2:niter){

    #Random-walk proposal for each coordinate
    for(c in 1:length(prop)) prop[c]=theta[t-1,c]+rnorm(1,0,0.1);

    #Calculate Log-Density at proposed and current position
    logdensity.current=logdensity(theta[t-1,])
    logdensity.prop=logdensity(prop);

    #Calculate log acceptance probability
    lalpha=B[rank]*(logdensity.prop-logdensity.current)

    if(log(runif(1))<lalpha){
      #Accept proposed move
      theta[t,]=prop;
      logdensity.current=logdensity.prop;
    }else{
      #Otherwise do not move
      theta[t,]=theta[t-1,];
    }

    if(t%%swap.interval ==0){
      for(evenodd in 0:1){
        swap=0;
        logdensity.partner=0;
        if(rank%%2 == evenodd%%2){
          rank.partner=rank + 1;
          #ranks range from 1:size-1. Cannot have a partner rank == size
          if(0<rank.partner && rank.partner<size){
            #On first iteration, evens receive from above odd
            #On second iteration, odds receive from above evens
            logdensity.partner<-mpi.recv.Robj(rank.partner,rank.partner);
            lalpha = (B[rank]-B[rank.partner])*(logdensity.partner-logdensity.current);
            swaps.attempted=swaps.attempted+1;
            if(log(runif(1))<lalpha){
              swap=1;
              swaps.accepted=swaps.accepted+1;
            }
            mpi.send.Robj(swap,dest=rank.partner,tag=rank)
          }
          if(swap==1){
            thetaswap=theta[t,];
            mpi.send.Robj(thetaswap,dest=rank.partner,tag=rank)
            theta[t,]=mpi.recv.Robj(rank.partner,rank.partner)
            #Keep the stored log-density in step with the new position
            logdensity.current=logdensity(theta[t,]);
          }
        }else{
          rank.partner=rank-1;
          #ranks range from 1:size-1. Cannot have a partner rank ==0
          if(0<rank.partner && rank.partner<size){
            #On first iteration, odds send to evens below
            #On second iteration, evens sent to odds below
            mpi.send.Robj(logdensity.current,dest=rank.partner,tag=rank);
            swap=mpi.recv.Robj(rank.partner,rank.partner);
          }
          if(swap==1){
            thetaswap=theta[t,];
            theta[t,]=mpi.recv.Robj(rank.partner,rank.partner);
            mpi.send.Robj(thetaswap,dest=rank.partner,tag=rank);
            #Keep the stored log-density in step with the new position
            logdensity.current=logdensity(theta[t,]);
          }
        }
      }
    }
  }
  return(theta)
}
```

The bulk of the above code handles the communication of each processor with its nearest neighbours. Swap moves between processors are attempted every *swap.interval* iterations, an argument one can pass to the function. When this code block is entered, even rank processors first partner with their higher ranked odd neighbours (a higher rank means a higher temperature, i.e. a smaller fractional power and a more “melted down” target density). The higher odd partners send their lower even partners the value of their log-density and the lower even partners then calculate an acceptance probability. The lower rank even processor then sends its higher rank odd partner a binary swap flag; swap=1 means the move succeeded and tells the higher rank odd processor that a send/receive of parameters will follow. In that case the lower even rank sends the higher odd rank its parameters and the higher odd rank subsequently sends the lower even rank its parameters. In this way a Metropolis move between processors is achieved.

Next, odd rank processors form partners with their higher even ranked neighbours (because we need to swap with processor rank 1, the target density). The same procedure occurs as before but swapping odd for even. More visually, first swaps are attempted between 2-3, 4-5, 6-7 etc. and then swaps are attempted between 1-2, 3-4, 5-6. This resembles an odd-even transposition sort. One can see how the parameters could be passed from 3 down to 2 and then from 2 down to 1. The main point is that each processor attempts a swap with its nearest neighbours, the one above and the one below, every *swap.interval* iterations.
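The pairing schedule described above can be written down independently of MPI. The following sketch is illustrative only: the function `swap_pairs` is a hypothetical helper, with ranks numbered from 1 to `n_chains` to mirror the slave ranks, and it simply lists which neighbours attempt a swap in each of the two phases.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Pairs (lower rank, upper rank) that attempt a swap in one phase.
// phase 0: the lower partner has even rank (2-3, 4-5, ...).
// phase 1: the lower partner has odd rank  (1-2, 3-4, ...).
std::vector<std::pair<int, int>> swap_pairs(int phase, int n_chains) {
    assert((phase == 0 || phase == 1) && n_chains >= 2);
    std::vector<std::pair<int, int>> pairs;
    int first = (phase == 0) ? 2 : 1;
    // A rank whose would-be partner exceeds n_chains sits this phase out.
    for (int lower = first; lower + 1 <= n_chains; lower += 2) {
        pairs.emplace_back(lower, lower + 1);
    }
    return pairs;
}
```

With six chains, phase 0 yields the pairs 2-3 and 4-5 (rank 6 has no partner above it), and phase 1 yields 1-2, 3-4 and 5-6, so after both phases every chain has attempted a swap with each of its neighbours.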

With these functions defined one can now proceed to set up the mpi communicator/world.

First spawn some slaves.

```
library(Rmpi)
mpi.spawn.Rslaves(nslaves=6)
```

If it worked, you should see something like this:

```
> mpi.spawn.Rslaves(nslaves=6)
	6 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 7 is running on: cabbage
slave1 (rank 1, comm 1) of size 7 is running on: cabbage
slave2 (rank 2, comm 1) of size 7 is running on: cabbage
slave3 (rank 3, comm 1) of size 7 is running on: cabbage
slave4 (rank 4, comm 1) of size 7 is running on: cabbage
slave5 (rank 5, comm 1) of size 7 is running on: cabbage
slave6 (rank 6, comm 1) of size 7 is running on: cabbage
```

(yes, my office computer was named cabbage, lettuce is the one next to me). One can then send the function definitions to the slave processors.

```
niter=3000
Bmin=0.005
swap.interval=3

#Send to slaves some required data
mpi.bcast.Robj2slave(niter)
mpi.bcast.Robj2slave(Bmin)
mpi.bcast.Robj2slave(swap.interval)

#Send to slaves the logdensity function
mpi.bcast.Robj2slave(logdensity)

#Send to slaves the temper function
mpi.bcast.Robj2slave(temper)

#Send to slaves the dmvnorm function (from the mvtnorm package)
mpi.bcast.Robj2slave(dmvnorm)
```

If you want to make sure that the slaves have the correct function definition, one can execute the command *mpi.remote.exec(temper)* and this will return the function definition. That is all, now it can be run.

mcmc=mpi.remote.exec(temper(niter,Bmin,swap.interval))

This returns a list object containing the mcmc draws for each slave.

The end product is something that looks like this

Which are the draws (in black) from the target distribution. It is also useful to build up intuition for parallel tempering to look at what is happening on the other processors. The draws for all processors are shown below:

N.B. Although my computer only has 8 cores I tried running the code with 12 slaves. At first I was concerned that the MPI communications would enter a deadlock and the code would hang, but it didn’t, so it seems you can scale up the number of slaves above the number of cores.

Notice that the temperature set used in the code has the property that \(B_{r+1}/B_{r}=c\), for c a constant. There is a paper by Kofke (2002) that justifies this temperature set, as it yields a constant acceptance ratio between cores under certain conditions. Indeed, the acceptance ratios (the fraction of Metropolis moves that succeeded between cores) are roughly constant, as shown below:

```
[1] 0.7227723
[1] 0.7926793
[1] 0.710171
[1] 0.8037804
[1] 0.7191719
[1] 0.7974797
[1] 0.729673
[1] 0.8223822
[1] 0.8184818
[1] 0.8445845
```
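The geometric temperature set is easy to reproduce. The sketch below uses a hypothetical helper, `geometric_ladder`, with 0-based indices (so index 0 corresponds to rank 1, the target density); it mirrors the rule `B[r]=Bmin^((r-1)/(size-2))` from the `temper` function and makes the constant ratio between neighbouring powers explicit.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Fractional powers B[r] = b_min^(r/(n-1)) for r = 0..n-1.
// B[0] = 1 is the target density and B[n-1] = b_min is the hottest chain.
std::vector<double> geometric_ladder(double b_min, int n) {
    assert(b_min > 0.0 && b_min <= 1.0 && n >= 2);
    std::vector<double> b(n);
    for (int r = 0; r < n; ++r) {
        b[r] = std::pow(b_min, static_cast<double>(r) / (n - 1));
    }
    return b;
}
```

For six chains and `b_min = 0.005`, every ratio `b[r+1]/b[r]` equals `0.005^(1/5)`, which is the constant c referred to above.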

Earl D.J. & Deem M.W. (2005). Parallel tempering: Theory, applications, and new perspectives, Physical Chemistry Chemical Physics, 7 (23) 3910. DOI: 10.1039/b509983h

Kofke D.A. (2002). On the acceptance probability of replica-exchange Monte Carlo trials, The Journal of Chemical Physics, 117 (15) 6911. DOI: 10.1063/1.1507776

The post Easy 3-Minute Guide to Making apply() Parallel over Distributed Grids and Clusters in R appeared first on Lindons Log.

First go ahead and download Bosco; the sign-up is only so the developers can get an idea of how many people are using it. Detailed install instructions can be found here but I will also go through the steps.

```
[lindon@laptop Downloads]$ tar xvzf ./bosco_quickstart.tar.gz
[lindon@laptop Downloads]$ ./bosco_quickstart
```

The executable will then ask some questions:

Do you want to install Bosco? Select y/n and press [ENTER]: y

Type the cluster name and press [ENTER]: dscr-login-01.oit.duke.edu

When prompted “Type your name at dscr-login-01.oit.duke.edu (default YOUR_USER) and press [ENTER]:”, enter your NetID.

When prompted “Type the queue manager for login01.osgconnect.net (pbs, condor, lsf, sge, slurm) and press [ENTER]:”, enter sge.

Then when prompted “NetID@dscr-login-01.oit.duke.edu’s password:”, enter your password.

For Duke users, the HostName of the DSCR is dscr-login-01.oit.duke.edu. You log in with your NetID and the queue submission system is the Sun Grid Engine, so type sge. If you already have SSH keys set up then I think the last question gets skipped. That takes care of the installation. You can now try submitting jobs to the remote cluster from your laptop. Download this test executable and this submission file. Start Bosco and try submitting a job.

```
[msl33@hotel ~/tutorial-bosco]$ source ~/bosco/bosco_setenv
[msl33@hotel ~/tutorial-bosco]$ bosco_start
BOSCO Started
[msl33@hotel ~/tutorial-bosco]$ condor_submit bosco01.sub
Submitting job(s).
1 job(s) submitted to cluster 70.
[msl33@hotel ~/tutorial-bosco]$ condor_q

-- Submitter: hotel.stat.duke.edu : <127.0.0.1:11000?sock=21707_cbb6_3> : hotel.stat.duke.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  70.0   msl33           8/31 12:08   0+00:00:00 I  0   0.0  short.sh

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
```

This is the result if all has worked well. Note that you need to start Bosco with the first two commands above (sourcing bosco_setenv and running bosco_start).

The current version of GridR on CRAN is an older version that doesn’t support job submission by Bosco. It will once CRAN gets the latest version of GridR, but until then you need to install GridR from source, so download it here and install it:

install.packages("~/Downloads/GridR_0.9.7.tar.gz", repos=NULL, type="source")

Consider a toy example which approximates pi by Monte Carlo.

```
montecarloPi <- function(trials, inst) {
  count = 0
  for(i in 1:trials) {
    if((runif(1,0,1)^2 + runif(1,0,1)^2)<1) {
      count = count + 1
    }
  }
  return((count*4)/trials)
}
```

One can now use grid.apply from the GridR package combined with Bosco to submit jobs on the remote cluster from within the user’s local R session.

```
# load the GridR library
library("GridR")
grid.init(service="bosco.direct", localTmpDir="tmp")

# Send 10 instances of the montecarloPi
grid.apply("pi_estimate", montecarloPi, 10000000, c(1:10), batch=c(2))
```

You can then see how your jobs are getting on by the “grid.printJobs()” command.

When it completes, “pi_estimate” will be a list object with 10 elements containing approximations to pi. Obviously, there is an overhead with submitting jobs and also a lag time while these jobs get through the queue. One must balance this overhead against the computational time required to complete a single iteration of the apply function. Bosco will create and submit a job for every iteration of the apply function. If each iteration does not take too long but there are a great many of them to perform, one could consider blocking these operations into, say, 10 jobs, so that the queue lag and submission overhead are negligible in comparison to the time taken to complete no_apply_iterations/10 computations. This also avoids creating a large number of jobs on the cluster, which might aggravate other users. One can also add clusters to Bosco using the “bosco_cluster --add” command, so that jobs are submitted to whichever cluster has the most free cores available. All in all this is a great aid for those doing computationally intensive tasks and makes parallel work-sharing very easy indeed.

The post Error in .C(” *** “) : C symbol name ” *** ” not in load table appeared first on Lindons Log.

```
> dyn.load("main.so")
> .C("linsgp")
Error in .C("linsgp") : C symbol name "linsgp" not in load table
```

My C++ code looked like this

... void linsgp(){ ...

What is missing is extern "C", so it should look like this:

... extern "C" void linsgp(){ ...

The reason is that C++ supports overloading of function names, so the compiler mangles each name with information about its arguments. C, however, does not support overloading and doesn’t mangle names. Inserting extern "C" tells the compiler not to mangle the name, so that the name used for linkage is C-compatible and matches the symbol that .C() looks up.
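As a concrete illustration, here is a minimal sketch of a C++ routine that could be exported this way. The function `scale_vector` and its arguments are hypothetical and not from the original code; the only assumptions about R are that `.C()` passes every argument as a pointer and expects a `void` return.

```cpp
#include <cassert>

// Compiled as C++ but exported under the unmangled symbol name "scale_vector".
// Every argument is a pointer because that is how .C() passes R vectors.
extern "C" void scale_vector(double* x, int* n, double* factor) {
    assert(x != nullptr && n != nullptr && factor != nullptr);
    // Multiply each element in place; .C() returns the modified copies to R.
    for (int i = 0; i < *n; ++i) {
        x[i] *= *factor;
    }
}
```

After building a shared object and loading it with dyn.load, a call along the lines of `.C("scale_vector", x=as.double(1:3), n=as.integer(3), factor=as.double(2))$x` would then find the symbol, whereas without extern "C" the lookup would fail with the error shown at the top of the post.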

The post Model Scale Parameterization for MCMC Efficiency appeared first on Lindons Log.

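For a hierarchical model with parameter \(\lambda\), missing data \(X\) and observed data \(Y\), Papaspiliopoulos et al. [2] define the centered parameterization (CP) to be one in which \(Y\) is independent of \(\lambda\) given \(X\).

They define the non-centered parameterization (NCP) to be one in which \(\tilde{X}\) and \(\lambda\) are a priori independent, where \(X\) is a function of \(\tilde{X}\) and \(\lambda\).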

Generally when the CP works well, the NCP works poorly and vice versa. Yaming Yu and Xiao-Li Meng[1] present a way of combining both strategies by interweaving the Gibbs steps of both parameterizations at each iteration. The details can be read in their paper. I decided to test all three Gibbs samplers with the following R code:

```
#Generate Data
lam2=0.5
lam=sqrt(lam2)
sig2=1
n=1000
Xt=rnorm(n,0,sqrt(lam2*sig2))
Y=rnorm(n,Xt,sqrt(sig2))
nmc=2000
X=Xt

#Centered Parameterization
cp_lam2=rep(0,nmc)
cp_X=matrix(0,nmc,n)
for(i in 1:nmc){
  inv_lam2=rgamma(1,(n)/2,rate=(t(X)%*%X)/(2*sig2))
  lam2=1/inv_lam2
  X=rnorm(n,(1/(1/sig2 + 1/(sig2*lam2)))*Y/sig2, sqrt(1/(1/sig2 + 1/(sig2*lam2))))
  cp_lam2[i]=lam2
  cp_X[i,]=X
}
mean_cp_X=apply(cp_X,2,mean)

#Non-Centered Parameterization
X=Xt/lam #Initialise Xtilde=X/lambda
ncp_lam2=rep(0,nmc)
ncp_X=matrix(0,nmc,n)
for(i in 1:nmc){
  lam=rnorm(1,t(X)%*%Y/(t(X)%*%X), sqrt(sig2/(t(X)%*%X)))
  lam2=lam*lam;
  X=rnorm(n, (1/(1/sig2 + lam2/sig2))*lam*Y/sig2, sqrt(1/(1/sig2+lam2/sig2)) )
  ncp_lam2[i]=lam2
  ncp_X[i,]=X
}
mean_ncp_X=apply(ncp_X,2,mean)

#Interweaving Strategy
int_lam2=rep(0,nmc)
int_X=matrix(0,nmc,n)
for(i in 1:nmc){
  X=rnorm(n,(1/(1/sig2 + 1/(sig2*lam2)))*Y/sig2, sqrt(1/(1/sig2 + 1/(sig2*lam2))))
  inv_lam2=rgamma(1,(n)/2,rate=(t(X)%*%X)/(2*sig2))
  half_lam2=1/inv_lam2
  X=X/sqrt(half_lam2) #Transform to Xtilde
  lam=rnorm(1,t(X)%*%Y/(t(X)%*%X), sqrt(sig2/(t(X)%*%X)))
  lam2=lam*lam;
  int_lam2[i]=lam2
  int_X[i,]=lam*X #Store X on the centered scale
}
mean_int_X=apply(int_X,2,mean)

#Remove Burnin
cp_lam2=cp_lam2[-(1:1000)]
ncp_lam2=ncp_lam2[-(1:1000)]
int_lam2=int_lam2[-(1:1000)]
m=length(cp_lam2) #Number of draws kept after burn-in

#Plot Results: ACF, trace and lag-1 scatter for each sampler
par(mfrow=c(3,3))
acf(cp_lam2)
plot(cp_lam2,type="l")
plot(cp_lam2[-m],cp_lam2[-1])
acf(ncp_lam2)
plot(ncp_lam2,type="l")
plot(ncp_lam2[-m],ncp_lam2[-1])
acf(int_lam2)
plot(int_lam2,type="l")
plot(int_lam2[-m],int_lam2[-1])
```

As lambda gets small the centered parameterization becomes ever more autocorrelated and poorly mixing. When lambda becomes large the non-centered parameterization becomes ever more autocorrelated and poorly mixing. The interweaved Gibbs sampler exhibits great mixing in all cases.

[1] Y. Yu and X. Meng, “To Center or Not to Center: That Is Not the Question–An Ancillarity-Sufficiency Interweaving Strategy (ASIS) for Boosting MCMC Efficiency,” Journal of computational and graphical statistics, vol. 20, iss. 3, pp. 531-570, 2011.


```
@article{Yu11,
author = {Yu, Yaming and Meng, Xiao-Li},
citeulike-article-id = {10408757},
citeulike-linkout-0 = {http://amstat.tandfonline.com/doi/abs/10.1198/jcgs.2011.203main},
citeulike-linkout-1 = {http://pubs.amstat.org/doi/abs/10.1198/jcgs.2011.203main},
citeulike-linkout-2 = {http://dx.doi.org/10.1198/jcgs.2011.203main},
doi = {10.1198/jcgs.2011.203main},
journal = {Journal of Computational and Graphical Statistics},
number = {3},
pages = {531--570},
posted-at = {2012-03-03 18:10:07},
priority = {2},
title = {{To Center or Not to Center: That Is Not the Question--An Ancillarity-Sufficiency Interweaving Strategy (ASIS) for Boosting MCMC Efficiency}},
url = {http://amstat.tandfonline.com/doi/abs/10.1198/jcgs.2011.203main},
volume = {20},
year = {2011}
}
```

[2] O. Papaspiliopoulos, G. O. Roberts, and M. Sköld, “A general framework for the parametrization of hierarchical models,” Statistical science, vol. 22, iss. 1, pp. 59-73, 2007.


```
@article{Papaspiliopoulos07,
abstract = {{In this paper, we describe centering and noncentering methodology as complementary techniques for use in parametrization of broad classes of hierarchical models, with a view to the construction of effective MCMC algorithms for exploring posterior distributions from these models. We give a clear qualitative understanding as to when centering and noncentering work well, and introduce theory concerning the convergence time complexity of Gibbs samplers using centered and noncentered parametrizations. We give general recipes for the construction of noncentered parametrizations, including an auxiliary variable technique called the state-space expansion technique. We also describe partially noncentered methods, and demonstrate their use in constructing robust Gibbs sampler algorithms whose convergence properties are not overly sensitive to the data.}},
author = {Papaspiliopoulos, Omiros and Roberts, Gareth O. and Sk\"{o}ld, Martin},
citeulike-article-id = {8977350},
citeulike-linkout-0 = {http://www.jstor.org/stable/27645805},
journal = {Statistical Science},
number = {1},
pages = {59--73},
posted-at = {2011-03-10 18:55:50},
priority = {2},
publisher = {Institute of Mathematical Statistics},
title = {{A general framework for the parametrization of hierarchical models}},
url = {http://www.jstor.org/stable/27645805},
volume = {22},
year = {2007}
}
```

The post Woodbury Matrix Inverse Identity appeared first on Lindons Log.

The Sherman-Woodbury-Morrison matrix inverse identity can be regarded as a transform between Schur complements. That is, given one can obtain by using the Woodbury matrix identity and vice versa. Recall the Woodbury Identity:

and

I recently stumbled across a neat application of this whilst deriving full conditionals for a multivariate normal. Recall that if the data are partitioned into two blocks, , then the variance of the conditional distribution is the Schur complement of the block of total variance matrix , that is, the variance of the conditional distribution is which is the variance of subtracted by something corresponding to the reduction in uncertainty about gained from the knowledge about . If, however, has the form of a Schur complement itself, then it may be possible to exploit the Woodbury identity above to considerably simplify the variance term. I came across this when I derived two very different-looking expressions for the conditional distribution and found them equivalent by the Woodbury identity. Consider the model

where

.

I was seeking the distribution and arrived there through two different paths. The distributions derived looked very different, but they turned out to be equivalent upon considering the Woodbury identity.

This simply manipulates properties of the multivariate normal. Marginalizing over one gets

.

Such that the distribution

It follows that the conditional distribution is

This looks a bit nasty, but notice that looks like it too could be a Schur complement of some matrix.

An alternative route to this distribution is

where

It follows that

which looks different from the distribution obtained through method 1. The expression for the variance is a lot neater. They are in fact identical by the Woodbury identity.

So mean1=mean2.

By the Woodbury Identity it follows that

Therefore

and so variance1=variance2. The trick is recognizing the form of the formulas at the top of the page, then one can write the variance as a much neater expression.
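For reference, the Woodbury identity and the conditional variance of a partitioned multivariate normal can be written in their usual textbook forms. The symbols \(A\), \(U\), \(C\), \(V\) and \(\Sigma_{ij}\) are the conventional ones and are an assumption here; they may not match the notation used in the derivation above.

$$ (A+UCV)^{-1}=A^{-1}-A^{-1}U\left(C^{-1}+VA^{-1}U\right)^{-1}VA^{-1} $$

$$ \mathrm{Var}(Y_{1}|Y_{2})=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} $$

The second expression is the Schur complement of the \(\Sigma_{22}\) block of the joint variance matrix. Reading the first identity from right to left shows how an expression of this "matrix minus correction" shape can collapse into a single inverse, which is the kind of simplification the post describes.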

The post Parallel Tempering Algorithm with OpenMP / C++ appeared first on Lindons Log.

1.2. Physics Origins

2.1. Intra-Thread Metropolis Move

2.2. Inter-Thread Parallel Tempering

2.3. OpenMP Parallelization

3. Full Code

4. Simulation Study

5. On the Future use of Parallel Tempering with OpenMP

Parallel tempering is one of my favourite sampling algorithms to improve MCMC mixing times. This algorithm seems to be used *exclusively* on distributed memory architectures using MPI and remains unexploited on shared memory architectures such as our office computers, which have up to eight cores. I’ve written parallel tempering algorithms in MPI and Rmpi but never in OpenMP. It turns out that the latter has substantial advantages. I guess when people think of parallel tempering they think of processors communicating with each other via MPI and swapping parameters directly. If you are on a shared memory device, however, you can have processor A simply write to a shared array and have processor B read therefrom, which really saves a lot of aggro fiddling around with message numbers, blocking/non-blocking calls and deadlocks etc. Moreover, with OpenMP you can spawn more threads than you have processors, which translates to more parallel MCMC chains in the present context, whereas this becomes troublesome with MPI due to the danger of deadlocks. OpenMP is also much easier to use than MPI: with one line you can fork a serial thread into a desired and hardware-independent number of parallel threads. The code looks as follows:

Each thread simulates an MCMC trajectory from the posterior raised to a fractional power, B. When B=1, the MCMC draws are from the posterior from which we wish to sample. When B=0, the MCMC trajectory is just a realization of a random walk. To see this, consider the acceptance probability of the Metropolis move. The density evaluated at the proposed parameters over the density evaluated at the current parameters, all raised to the power of zero, is unity whatever the densities are, so the moves always get accepted. Similarly, if B is close to zero then the acceptance probability is near unity and the distribution from which this MCMC is sampling is quite uniform over the parameter space, so the trajectory explores a relatively larger part of the parameter space. As B is increased toward one, the features of the distribution from which we wish to sample start to become more prominent. In the other direction, from B=1 to 0, one commonly says that the posterior is “melted down” and spreads out its mass. The terminology has remained from its origins in statistical physics, where one would simulate particles at a hotter temperature so that they would jostle around more and escape wells in the potential energy. The key to parallel tempering is to use the more diffuse, hotter or melted down MCMC chains as proposal distributions for the actual cold distribution we wish to sample from. One proceeds by performing a Metropolis-Hastings move because the proposal distributions are not symmetric. For illustration, thread j uses the hotter thread j+1 as its partner and as proposal distribution. Let \(\theta_{j+1}\) be the proposed new position for thread j, being the current position of thread j+1.

The second fraction is the Hastings addition to the Metropolis algorithm and is required to satisfy detailed balance for an unsymmetrical proposal distribution. Now realise that

i.e. they are the same distribution raised to different fractional powers. Working now on the log scale, it can be shown that

It is at this point where sometimes, in order to make things correspond to the earlier physics literature, one defines the “Energy” as

So that the acceptance probability becomes

It’s not necessary to define this energy; it only establishes a correspondence between statistics and physics. In physics, particles get stuck in the local minima of the energy landscape, and in statistics the MCMC gets stuck in the local peaks of the posterior. The reason for this is that in a canonical ensemble lower energy states are more probable (recall that nature tries to minimize the potential energy and that force is the negative gradient of the potential energy), so regions of the parameter space with low potential energy, physically, correspond to regions of high probability density, statistically. To be more precise, a result from statistical physics is that the distribution of energy is exponential with scale parameter kT, where k is Boltzmann’s constant and T is temperature (this holds only for a canonical ensemble). An exponential distribution with this scale parameter is called the Boltzmann distribution by physicists. As the temperature increases, higher energy states become more probable and the particle jumps out of the minima more often. If you are a statistician you don’t need to worry about this, but sometimes this notation crops up in the literature. It’s also now the same acceptance probability as in physics when sampling energies from a Boltzmann distribution. I have decided not to adopt the physics notation for this post.
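The correspondence can be written out in one line. Assuming the usual convention that the energy is the negative log posterior, \(E(\theta)=-\log\pi(\theta)\) (this convention and the symbol \(\pi\) for the posterior are assumptions of this sketch), the melted posterior has the same form as the Boltzmann distribution:

$$\pi(\theta)^{B}=\exp\left(-B\,E(\theta)\right)\quad\text{compared with}\quad p(\theta)\propto\exp\left(-\frac{E(\theta)}{kT}\right)$$

so the fractional power B plays the role of the inverse temperature \(1/(kT)\): B=1 is the cold chain and B near 0 corresponds to a very high temperature.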

Each thread, within itself, performs a normal vanilla metropolis move:

    //Propose Candidate Position//
    t1new=t1[rank*nmc+i-1] + normal(stream[rank]);
    t2new=t2[rank*nmc+i-1] + normal(stream[rank]);

    //Calculate log-Density at Newly-Proposed and Current Position//
    lpost_new[rank]=lLikelihood(t1new,t2new) + lprior(t1new,t2new);
    lpost[rank]=lLikelihood(t1[rank*nmc+i-1],t2[rank*nmc+i-1]) + lprior(t1[rank*nmc+i-1],t2[rank*nmc+i-1]);

    //Melt Density and Calculate log-Acceptance Probability//
    lalpha=B[rank]*(lpost_new[rank]-lpost[rank]);

    //Perform Metropolis Accept-Reject Step//
    if( log(u(stream[rank])) < lalpha ){
        //Accept
        //Proposed as Current Position
        t1[rank*nmc+i]=t1new;
        t2[rank*nmc+i]=t2new;
        //Keep lpost in step with the new position (the swap step reads it)
        lpost[rank]=lpost_new[rank];
    }else{
        //Do not Accept
        //Propagate Current Position
        t1[rank*nmc+i]=t1[rank*nmc+i-1];
        t2[rank*nmc+i]=t2[rank*nmc+i-1];
    }

A few comments about the variables. “nmc” is the number of MCMC draws I wish to generate. I have two parameters which I have denoted t1 and t2, because t is closest to theta. The draws of all threads are stored in contiguous arrays of length nmc times the number of threads, so each thread keeps its *nmc* draws of t1 and t2 in the block starting at index rank*nmc. “rank” identifies the thread, and “lpost” and “B” are arrays of length equal to the number of threads which store the log posterior density at the current position and the fractional melting power. All of these variables are defined at the top of the code.
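To make the effect of the fractional power concrete, here is a minimal serial sketch of the same accept-reject step. It is not the code from this post: it assumes a one-dimensional standard normal target, uses the standard library's `std::mt19937` in place of TRNG, and the names `log_target` and `tempered_acceptance_rate` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical target for illustration: log-density of a standard normal,
// up to an additive constant.
double log_target(double x) { return -0.5 * x * x; }

// Run n tempered Metropolis steps at fractional power B and return the
// fraction of proposals that were accepted.
double tempered_acceptance_rate(double B, int n, unsigned seed) {
    assert(n > 0);
    std::mt19937 gen(seed);                        // stand-in for a TRNG stream
    std::normal_distribution<double> normal(0.0, 1.0);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    double x = 0.0;
    int accepted = 0;
    for (int i = 0; i < n; ++i) {
        double xnew = x + normal(gen);             // random-walk proposal
        // Melted log-acceptance probability, as in the snippet above.
        double lalpha = B * (log_target(xnew) - log_target(x));
        if (std::log(u(gen)) < lalpha) {
            x = xnew;
            ++accepted;
        }
    }
    return static_cast<double>(accepted) / n;
}
```

With B=0 the log-acceptance probability is always zero, so every proposal is accepted and the chain is a pure random walk; with B=1 some proposals are rejected and the chain targets the unmelted density.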

    if(u(stream[rank]) < 0.5){
        rank_partner=rank+1;
        if(rank_partner < size){
            //Inter-Thread Metropolis-Hastings Part
            lalpha = (B[rank]-B[rank_partner])*(lpost[rank_partner]-lpost[rank]);
            if(log(u(stream[rank])) < lalpha){
                //accept swap
                swap(t1[rank*nmc+i],t1[rank_partner*nmc+i]);
                swap(t2[rank*nmc+i],t2[rank_partner*nmc+i]);
                //the log-densities travel with the positions
                swap(lpost[rank],lpost[rank_partner]);
            }
        }
    }
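As a small worked example of the swap rule, the sketch below wraps the log-acceptance formula from the snippet above in a standalone function. The names `log_swap_acceptance` and `swap_acceptance` and the numbers used afterwards are chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Log of the swap acceptance probability between a thread with fractional
// power B_j at log-posterior lpost_j and its hotter partner (B_k, lpost_k).
double log_swap_acceptance(double B_j, double B_k, double lpost_j, double lpost_k) {
    assert(B_j >= B_k);  // the partner is assumed to be the hotter chain
    return std::min(0.0, (B_j - B_k) * (lpost_k - lpost_j));
}

// Convenience wrapper returning the probability itself.
double swap_acceptance(double B_j, double B_k, double lpost_j, double lpost_k) {
    return std::exp(log_swap_acceptance(B_j, B_k, lpost_j, lpost_k));
}
```

For example, with B_j=1 and B_k=0.5, a cold chain at log-density -10 and a hot partner at -2 give (1-0.5)*(-2+10)=4, which is positive, so the swap is certain. Reversing the two log-densities gives -4, i.e. an acceptance probability of exp(-4), roughly 0.018.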

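The only additional thing to add is that each chain attempts a swap with its neighbour at each iteration with probability 1/2. There is nothing special about 1/2; you could choose what you like, but there are pros and cons: frequent swaps pass good positions down to the cold chain sooner, but every attempt happens inside a section that only one thread can execute at a time. How this is made parallel in OpenMP is shown below.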

The OpenMP parallel implementation of the above algorithm is very simple!

    #pragma omp parallel private(i,t1new,t2new,rank,lalpha,rank_partner) shared(B, lpost, lpost_new,t1,t2,swapt1,swapt2)
    {
        //Identify Each Thread
        rank=omp_get_thread_num();

        for (i = 1; i < nmc; ++i)
        {
            //***Intra-Thread Metropolis Part***//

            #pragma omp barrier  //Synchronise Threads
            #pragma omp critical //Execute Critical Code Block One Thread at a Time
            {
                //***Inter-Thread Parallel Tempering Part***//
            }
            #pragma omp barrier  //Synchronise Threads
        }
    }

The first parallel pragma simply forks the master thread into a number of threads, whereby each thread executes the following code block independently, i.e. a number of independent parallel MCMC chains. Specifying variables as private means that each thread gets a copy of that variable in its own separate location in memory. Shared is the opposite, and it is the default for variables declared outside the parallel region. The barrier pragma means that each thread halts until all threads have reached this point. The critical pragma means the following code block is executed by one thread at a time only. This prevents thread j swapping with thread j+1 whilst thread j+1 is attempting a swap with thread j+2, a situation in which nasty things such as race conditions can occur. The last barrier pragma waits for all threads to have reached the end, and then the next iteration of the for loop proceeds.
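The roles of shared variables and the critical section can be seen in isolation in the toy sketch below, which is not part of the tempering code. Each loop iteration computes a value in a variable declared inside the loop (hence private to the thread running it) and adds it to a shared total inside a critical block, so no two threads update the total at once. The function name `sum_of_squares` is invented for the example. Compiled without -fopenmp the pragmas are ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>

// Sum of squares 0^2 + 1^2 + ... + (n-1)^2, accumulated by several threads.
long sum_of_squares(int n) {
    assert(n >= 0);
    long total = 0;                      // shared: one copy seen by all threads
    #pragma omp parallel for shared(total)
    for (int i = 0; i < n; ++i) {
        long square = static_cast<long>(i) * i;  // declared inside: private
        #pragma omp critical             // one thread at a time in this block
        {
            total += square;
        }
    }
    return total;
}
```

Removing the critical pragma would let two threads read and write `total` at the same moment, which is exactly the kind of race the tempering code avoids during the swap step.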

The full code can be found here. It depends on OpenMP and the TRNG library in order to generate multiple independent streams of random numbers. It takes the number of mcmc draws as a command-line argument.

    [michael@michael tempering]$ wget http://www.lindonslog.com/example_code/tempering.cpp
    [michael@michael tempering]$ g++ tempering.cpp -fopenmp -o tempering -ltrng4 -lm
    [michael@michael tempering]$ ./tempering 10000
    Thread 0 has fractional power 1
    Thread 1 has fractional power 0.469117
    Thread 2 has fractional power 0.220071
    Thread 3 has fractional power 0.103239
    Thread 4 has fractional power 0.0484313
    Thread 5 has fractional power 0.0227199
    Thread 6 has fractional power 0.0106583
    Thread 7 has fractional power 0.005
    [michael@michael tempering]$
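The fractional powers printed above are consistent with a geometric ladder running from 1 on thread 0 down to 0.005 on thread 7, each power being the previous one multiplied by a constant ratio. The sketch below is one way to build such a ladder, not necessarily how tempering.cpp does it; `geometric_ladder` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Fractional powers B[0]=1 > B[1] > ... > B[n-1]=B_min with a constant ratio
// B[j+1]/B[j] between neighbouring threads.
std::vector<double> geometric_ladder(int n, double B_min) {
    assert(n > 1 && B_min > 0.0 && B_min <= 1.0);
    std::vector<double> B(n);
    for (int j = 0; j < n; ++j) {
        B[j] = std::pow(B_min, static_cast<double>(j) / (n - 1));
    }
    return B;
}
```

Calling `geometric_ladder(8, 0.005)` gives 1, 0.469117, 0.220071 and so on down to 0.005, matching the printed values to the precision shown.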

I chose the likelihood to be a mixture of five sharply peaked normal distributions, four located at the corners of a sort-of unit square plus one at the origin, each with variance 0.001. The prior was a normal of variance 1000 centered at the origin. The parallel tempering algorithm was run with 8 threads. The posterior draws and mixing results are below:

I hope the code exemplifies how easy it is to run parallel MCMC chains with OpenMP. I would argue that the Metropolis moves are the hardest part. If you can write them for a single serial chain, then it is only a few extra steps to run parallel chains and implement the parallel tempering algorithm. My laptop has four cores and my office computer has eight. Given that shared memory devices have an ever-increasing number of cores, it seems to me that parallel tempering is becoming an ever more valuable algorithm for improving the mixing times of MCMC runs. After all, had I not used the extra 3 cores on my laptop, they would have remained idle. If you have extra cores, why not use them! Moreover, with OpenMP you can spawn as many parallel MCMCs as you desire, avoiding the pitfalls of MPI.

Earl D.J. & Deem M.W. (2005). Parallel tempering: Theory, applications, and new perspectives, Physical Chemistry Chemical Physics, 7 (23) 3910. DOI: 10.1039/b509983h

The post Parallel Tempering Algorithm with OpenMP / C++ appeared first on Lindons Log.

]]>The post Parallel Random Number Generation using TRNG appeared first on Lindons Log.

]]>At this point I was quite discouraged, but then I came across TRNG, written by Heiko Bauke. At first glance TRNG is an excellently documented C++ PRNG library (PRNG stands for pseudo random number generator, not parallel; that would be PPRNG) built specifically with parallel architectures in mind. Not only does it provide non-uniform distributions, but it can be used easily with MPI, OpenMP, CUDA and TBB, and the many examples of the same problem coded with each of these parallelization methods are enlightening. If that weren’t enough, TRNG can be used in combination and interchangeably with the Boost random as well as the C++11 TR1 random libraries, that is, the engines/generators from TRNG can be used with the distribution functions of Boost and C++11 TR1, which was a problem I encountered with RngStream and SPRNG.

TRNG and RngStream work slightly differently. Whereas RngStream generates multiple independent streams, TRNG uses a single stream and either divides it into blocks, or interleaves it between different processors by a leap-frog type scheme, much like dealing out cards round a table. The point of all this is that the streams of different processors never overlap; otherwise one would get the same draws on processor A as on processor B. While purists might argue that L’Ecuyer’s method is more rigorous, I’m happy enough with the way Heiko has done it, especially given TRNG’s out-of-the-box ease of use and compatibility.
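The two ways of carving up a single stream can be illustrated on the positions of the draws alone. The helper names below (`block_indices`, `leapfrog_indices`) are invented for this sketch and are not TRNG functions; they only list which positions of the master sequence processor `rank` out of `p` would consume under each scheme.

```cpp
#include <cassert>
#include <vector>

// Block splitting: processor `rank` gets one contiguous chunk of the sequence.
// Assumes for simplicity that n is a multiple of p.
std::vector<int> block_indices(int n, int p, int rank) {
    assert(p > 0 && n % p == 0 && rank >= 0 && rank < p);
    std::vector<int> idx;
    int chunk = n / p;
    for (int k = rank * chunk; k < (rank + 1) * chunk; ++k) idx.push_back(k);
    return idx;
}

// Leapfrog: processor `rank` gets every p'th position, like cards dealt
// round a table.
std::vector<int> leapfrog_indices(int n, int p, int rank) {
    assert(p > 0 && rank >= 0 && rank < p);
    std::vector<int> idx;
    for (int k = rank; k < n; k += p) idx.push_back(k);
    return idx;
}
```

With 12 draws and 3 processors, processor 1 uses positions 4, 5, 6, 7 under block splitting and 1, 4, 7, 10 under leapfrog. In both cases no position is used by two processors.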

Clone the repository from GitHub.

    [michael@michael ~]$ git clone https://github.com/rabauke/trng4
    [michael@michael ~]$ cd trng4/
    [michael@michael trng4]$ ./configure --prefix=/usr
    [michael@michael trng4]$ make
    [michael@michael trng4]$ make install
    [michael@michael trng4]$ sudo bash
    [sudo] password for michael:
    [root@michael trng4]# ldconfig
    [root@michael trng4]#

The “--prefix=” argument just sets where I want the files to be installed and is not necessary. If omitted, the default is /usr/local. After make install, run ldconfig as root in order to update the dynamic linker/loader with the presence of the new library.

Basically there exists a cache, /etc/ld.so.cache, which is used by the dynamic linker/loader at run-time as a cross-reference for a library’s soname with its full file path. ldconfig is normally run during booting but can also be run anytime to update the cache with the locations of new libraries. Here is what happens if you don’t run ldconfig, which is the mistake I made the first time.

    [michael@michael ~]$ g++ hello_world.cc -L /usr/lib -ltrng4
    [michael@michael ~]$ ./a.out
    ./a.out: error while loading shared libraries: libtrng4.so.0: cannot open shared object file: No such file or directory

It compiled fine, but at run-time the loader couldn’t find the library.

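**Addendum:** Since max is only known at run-time, trng::yarn2 gen[max] in the code below is a variable-length array, which is a compiler extension rather than standard C++. I think it is better to allocate the generators dynamically (and to release them with delete[] gen once they are no longer needed):

    trng::yarn2 * gen;
    gen=new trng::yarn2[max];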

The approach will be to set up the PRNGs in C++ and call them from R using Rcpp. First let’s consider the C++ code to generate some random uniforms.

    #include <cstdlib>
    #include <iostream>
    #include <omp.h>
    #include <trng/yarn2.hpp>
    #include <trng/uniform01_dist.hpp>

    int main() {

        int max=omp_get_max_threads();
        omp_set_num_threads(max);
        int rank;
        trng::yarn2 gen[max];
        trng::uniform01_dist<> u;
        std::cout << max << " =max num of threads" << std::endl;

        for (int i = 0; i < max; i++)
        {
            gen[i].split(max,i);
        }

        #pragma omp parallel for private(rank)
        for (int i = 0; i < max; ++i)
        {
            rank=omp_get_thread_num();
            #pragma omp critical
            std::cout << u(gen[rank]) << " from thread " << rank << std::endl;
        }
        return EXIT_SUCCESS;
    }

which returns

    [michael@michael ~]$ g++ omprng.cpp -o omprng -fopenmp -ltrng4
    [michael@michael ~]$ ./omprng
    4 =max num of threads
    0.919233 from thread 0
    0.408994 from thread 1
    0.943502 from thread 2
    0.401236 from thread 3
    [michael@michael ~]$

The salient feature of this code is the leapfrog process, set up by calling split. There exists a sequence of random uniforms, and “.split(max,i)” divides it into *max* subsequences, leap-frogging each other, and grabs the *i’th* subsequence. You can think of this as *max* players sitting around a poker table and the .split() as continuously dealing out random uniforms to each of the players. The code says: let processor i be “player” i and use the sequence of random uniforms dealt to it.
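The dealing analogy can be checked with nothing but the standard library. The sketch below does not use TRNG: it emulates a leapfrog substream on top of `std::mt19937` by discarding the draws that belong to the other players, which is much less efficient than a dedicated split but produces the same kind of partition. The class name `LeapfrogStream` is hypothetical.

```cpp
#include <cassert>
#include <random>

// Emulates substream `rank` of `p` leapfrogged substreams of one master
// sequence: it returns elements rank, rank+p, rank+2p, ... of that sequence.
class LeapfrogStream {
public:
    LeapfrogStream(unsigned seed, int p, int rank) : gen_(seed), p_(p) {
        assert(p > 0 && rank >= 0 && rank < p);
        gen_.discard(rank);            // skip to this player's first card
    }
    std::mt19937::result_type next() {
        std::mt19937::result_type x = gen_();
        gen_.discard(p_ - 1);          // skip the cards dealt to the others
        return x;
    }
private:
    std::mt19937 gen_;
    int p_;
};
```

Dealing from three such streams in turn reproduces the master sequence exactly, so no draw is shared between players and none is skipped.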

Thanks to Rcpp the above C++ code can be trivially changed so that it can be used from R. Just include the Rcpp header and change the function return type.

    #include <cstdlib>
    #include <iostream>
    #include <omp.h>
    #include <trng/yarn2.hpp>
    #include <trng/uniform01_dist.hpp>
    #include <Rcpp.h>

    // [[Rcpp::export]]
    Rcpp::NumericVector prunif(int n) {

        int max=omp_get_max_threads();
        omp_set_num_threads(max);
        int rank;
        trng::yarn2 gen[max];
        trng::uniform01_dist<> u;
        Rcpp::NumericVector draws(n);

        for (int i = 0; i < max; i++)
        {
            gen[i].split(max,i);
        }

        #pragma omp parallel for private(rank)
        for (int i = 0; i < n; ++i)
        {
            rank=omp_get_thread_num();
            draws[i]=u(gen[rank]);
        }

        return draws;
    }

This code can be compiled and loaded into R on the fly, so let’s test it.

    > library(Rcpp)
    > library(rbenchmark)
    > Sys.setenv("PKG_CXXFLAGS"="-fopenmp")
    > Sys.setenv("PKG_LIBS"="-ltrng4")
    > sourceCpp("prunif.cpp")
    > benchmark(replications=rep(100,0,1),runif(1000000),prunif(1000000))
               test replications elapsed relative user.self sys.self user.child
    2 prunif(1e+06)          100   0.611     1.00     2.227    0.114          0
    1  runif(1e+06)          100   3.837     6.28     3.745    0.086          0

There are a few things to note. In the benchmark above, 100 replications of a million draws took 0.611 seconds with prunif against 3.837 seconds with runif, a factor of about 6.3. Spawning threads incurs its own overhead, so the parallel version will be slower for very few draws. As the number of draws becomes larger, the time taken to spawn new threads is dwarfed by the time taken to create the draws, and so it is worthwhile to do it in parallel. One caveat is that prunif and runif did not in this case use the same generating algorithm. R’s algorithm can be changed with RNGkind and the TRNG algorithm can be changed by using an alternative to yarn2 in “trng::yarn2”. Even if they were the same, though, I would expect the same qualitative behaviour.

Generating large samples of random numbers quickly in one hit is not the reason why I started looking for a good parallel random number generator. Rarely is it important to me to generate a large number of draws in one go, but it certainly is important to me to have independent streams. Generally I will port expensive parts of my R code, usually for loops, to C++, and inevitably somewhere within these for loops or other expensive parts of code I will need to draw some random numbers. Since these pieces of code are *expensive*, I will want to compute them in parallel in C++ if I can, and so it is very important to me to have independent streams from which to draw random numbers.


]]>