Parallel Random Number Generation using TRNG

To my surprise and disappointment, popular scientific libraries like Boost or GSL provide no native support for parallel random number generation. Recently I came across TRNG, an excellent random number generation library for C++ built specifically with parallel architectures in mind. Over the last few days I’ve been trawling internet forums and reading discussions about the best parallel random number generation libraries. Given the trend in CPU architectures whereby each contains an ever increasing number of cores, it makes sense to start a project by considering what libraries are available to best make use of this technology. The first libraries I came across were RngStream and SPRNG. It seems SPRNG was built specifically with MPI, i.e. for distributed memory architectures, in mind and there are some excellent examples and resources of how to get started with parallel MCMC on Darren Wilkinson’s blog. As a result, it seems a bit contrived to get SPRNG to work with OpenMP, i.e. for shared memory architectures. Moreover, I specifically wanted to use OpenMP because I wanted to write an extension for R for use on personal computers. RngStream is written by the man himself, Pierre L’Ecuyer, and is much more OpenMP amenable. Both of these, however, only generate uniform pseudo random numbers. This isn’t a fault, but it means you need to code up transformations and samplers to generate non-uniform pseudo random numbers. While this would be a good exercise, life is short, and I’d rather leave this sort of thing to the professionals (I don’t want to code up my own Ziggurat algorithm). Also, the generators or engines are of defined types and I found it hard to convert them into the corresponding types of other libraries like Boost or GSL so that I could use their non-uniform generation code. That probably says more about my ability rather than the actual difficulty of the problem and Darren Wilkinson provides some of his own code for getting the RNG engine of SPRNG into the correct datatype to be compatible with GSL.

TRNG

At this point I was quite discouraged but then I came across TRNG, written by Heiko Bauke. At first glance TRNG is an excellently documented C++ PRNG (which stands for pseudo random number generator, not parallel, that would be PPRNG) library built specifically with parallel architectures in mind. Not only does it provide non-uniform distributions, but it can be used easily with MPI, OpenMP, CUDA and TBB, for which many examples are supplied. The documentation is excellent and the many examples of the same problem coded with each of the aforementioned parallelization methods are enlightening. If that weren’t enough, TRNG can be used in combination and interchangeably with the Boost random as well as the C++11 TR1 random libraries, that is, the engines/generators from TRNG can be used with the distribution functions of Boost and C++11 TR1, which was a problem I encountered with RngStream and SPRNG. The way TRNG and RngStream work are slightly different. Whereas RngStream generates multiple independent streams, TRNG uses a single stream and either divides it into blocks, or interleaves it between different processors by a leap-frog type scheme, much like dealing out cards round a table. The point of all this is that the streams of different processors never overlap, otherwise one would get the same draws on processor A as processor B. While purists might argue that L’Ecuyer’s method is more rigorous, I’m happy enough with the way Heiko has done it, especially given TRNG’s out-of-box easy of use and compatibility.

Installation of TRNG

Clone the repository off Github.

[michael@michael$git clone https://github.com/rabauke/trng4
[michael@michael$cd trng4/
[michael@michael trng4]$./configure --prefix=/usr
[michael@michael trng4]$make
[michael@michael trng4]$make inst
[michael@michael trng4]$ sudo bash
[sudo] password for michael: 
[root@michael trng4]# ldconfig
[root@michael trng4]#

the “–prefix=” argument just sets where I want the files to be installed and is not necessary. If omitted the default case is /usr/local. After make install, run ldconfig as root in order to update the dynamic linker/loader with the presence of the new library.

Basically there exists a cache /etc/ld.so.cache which is used by the dynamic linker/loader at run-time as a cross-reference for a library’s soname with its full file path. ldconfig is normally run during booting but can also be run anytime to update the cache with the locations of new libraries. Here is what happens if you don’t run ldconfig, as I did the first time.

[michael@michael ~]$ g++ hello_world.cc -L /usr/lib -ltrng4
[michael@michael ~]$ ./a.out
./a.out: error while loading shared libraries: libtrng4.so.0: cannot open shared object file: No such file or directory

It compiled fine, but at run-time the loader couldn’t find the library.

Parallel Random Number Generation in C++

Nachtrag: I think instead of using trng::yarn2 gen[max] it is better to do:

trng::yarn2 * gen;
gen=new trng::yarn2[max];

The approach will be to generate the the PRNGs in C++ and call it from R using Rcpp. First lets consider the C++ code to generate some random uniforms.

#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <trng/yarn2.hpp>
#include <trng/uniform01_dist.hpp>

int main() {

	int max=omp_get_max_threads();
	omp_set_num_threads(max);
	int rank;
	trng::yarn2 gen[max];
	trng::uniform01_dist<> u;
	std::cout << max << " =max num of threads" << std::endl;


	for (int i = 0; i < max; i++)
	{
		gen[i].split(max,i);
	}


#pragma omp parallel for private(rank)
	for (int i = 0; i < max; ++i)
	{
		rank=omp_get_thread_num();
#pragma omp critical
	std::cout << u(gen[rank]) << " from thread " << rank <<  std::endl;
	}
	return EXIT_SUCCESS;
}

which returns

[michael@michael ~]$ g++ omprng.cpp -o omprng -fopenmp -ltrng4
[michael@michael ~]$ ./omprng 
4 =max num of threads
0.919233 from thread 0
0.408994 from thread 1
0.943502 from thread 2
0.401236 from thread 3
[michael@michael ~]$ 

The salient feature of this code is the leapfrog process by calling split. There exists a sequence of random uniforms and “.split(max,i)” divides it into max subsequences, leap-frogging each other, and grab the i’th subsequence. You can think of this as max players sitting around a poker table and the .split() as continuously dealing out random uniforms to each of the players. The code says let processor i be “player” i and use the sequence of random uniforms dealt to it.

Parallel Random Number Generation in R using Rcpp

Thanks to Rcpp the above C++ code can be trivially changed so that it can be used from R. Just include the Rcpp header and change the function return type.

#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <trng/yarn2.hpp>
#include <trng/uniform01_dist.hpp>
#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::NumericVector prunif(int n) {

	int max=omp_get_max_threads();
	omp_set_num_threads(max);
	int rank;
	trng::yarn2 gen[max];
	trng::uniform01_dist<> u;
	Rcpp::NumericVector draws(n);

	for (int i = 0; i < max; i++)
	{
		gen[i].split(max,i);
	}


#pragma omp parallel for private(rank)
	for (int i = 0; i < n; ++i)
	{
		rank=omp_get_thread_num();
		draws[i]=u(gen[rank]);
	}

	return draws;
}

This code can be compiled and loaded into R on the fly, so lets test it.

Speedup Performance

> library(Rcpp)
> library(rbenchmark) 
> Sys.setenv("PKG_CXXFLAGS"="-fopenmp")
> Sys.setenv("PKG_LIBS"="-ltrng4")
> sourceCpp("prunif.cpp")
> benchmark(replications=rep(100,0,1),runif(1000000),prunif(1000000))
           test replications elapsed relative user.self sys.self user.child
2 prunif(1e+06)          100   0.611     1.00     2.227    0.114          0
1  runif(1e+06)          100   3.837     6.28     3.745    0.086          0

Speedup

Parallel RNG speedup


There are a few things to note. Spawning threads incurs its own overhead, so it will obviously be slower for very few draws. As the number of draws becomes larger the time taken to spawn new threads is dwarfed by the time taken to create the draws and so it is worthwhile to do it in parallel. One caveat is that prunif and runif did not in this case use the same generating algorithm. R’s algorithm can be changed with RNG.kind and the TRNG algorithm can be changed by using an alternative to yarn in “trng::yarn2″. Even if they were the same though I would expect the same qualitative behaviour.

Discussion

Generating large samples of random numbers in one hit quickly is not the reason why I started looking for a good parallel random number generator. Rarely is it important to me to generate large amount of draws in one go but it certainly is important to me to have independent streams. Generally I will port expensive parts of my R code, usually for loops, to C++ and inevitably I will somewhere within these for loops or other expensive parts of code need to draw some random numbers. Since these expensive pieces of code are self-evidently expensive, I will want to compute them in parallel in C++ if I can and so it is very important to me to have independent streams from which to draw random numbers.

Leave a Reply