How to perform permutations using mclapply in a reproducible manner, regardless of the number of threads and the OS?
Is it possible to run some function based on permutations using mclapply in a reproducible manner regardless of the number of threads and OS?
Below is an example of toys. Hashing the resulting list of permutation vectors is just for ease of comparison. I've tried different RNGkind
("L'Ecuyer-CMRG"), different settings for mc.preschedule
and mc.set.seed
. Until you can make them the same.
library("parallel")
library("digest")
set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
mc.cores=2, mc.set.seed = F)
digest(m, 'crc32')
set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
mc.cores=4, mc.set.seed = F)
digest(m, 'crc32')
set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
mc.cores=2, mc.set.seed = F)
digest(m, 'crc32')
set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
mc.cores=1, mc.set.seed = F)
digest(m, 'crc32')
set.seed(1)
m <- lapply(1:10, function(x) sample(1:10))
digest(m, 'crc32') # this is equivalent to what I get on Windows.
sessionInfo()
just in case:
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] digest_0.6.8
loaded via a namespace (and not attached):
[1] tools_3.2.0
source to share
One solution I came up with is to create an additional vector with seeds. mclapply
or lapply
iterates over an index that points to both the argument and the corresponding seed. A kind of hack, but it works.
library("parallel")
library("digest")
input <- 1:10
# make random seed vector of length(input).
set.seed(1)
seeds <- sample.int(length(input), replace=TRUE)
f <- function(idx){
# input[i] # do whatever with the input
set.seed(seeds[idx]) # set to proper seed
sample(1:10)}
digest(mclapply(seq_along(input), f, mc.cores=2), 'crc32')
digest(mclapply(seq_along(input), f, mc.cores=4), 'crc32')
digest(mclapply(seq_along(input), f, mc.cores=2), 'crc32')
digest(mclapply(seq_along(input), f, mc.cores=1), 'crc32')
digest(lapply(seq_along(input), f), 'crc32')
The problem with this trick is that when the code is wrapped, set.seed inside the function interferes with the outer set of seed. For example,
set.seed(123)
outcome1a <- digest(mclapply(seq_along(input), f, mc.cores=4), 'crc32')
outcome1b <- digest(sample(1:10), 'crc32')
set.seed(123)
outcome2a <- digest(lapply(seq_along(input), f), 'crc32')
outcome2b <- digest(sample(1:10), 'crc32')
identical(outcome1a, outcome2a)
identical(outcome1b, outcome2b)
Although, indeed, the results of "a" are the same, the results of the stochastic calculations that follow immediately after, that is, "b", are affected and different. My guess is one hack might be to wrap the mclapply / lapply function in such a way that a random seed is generated based on user input, and then after execution, the wrapper dumps the seed to that value.
library("parallel")
library("digest")
wrapply <- function(input, cores){
recover.seed <- floor(runif(1)*1e6)
seeds <- sample.int(length(input), replace=TRUE)
f <- function(idx){
# input[i] # do whatever with the input
set.seed(seeds[idx]) # set to proper seed
sample(1:10)
}
if(is.null(cores)){
out <- digest(lapply(seq_along(input), f), 'crc32')
}else{
out <- digest(mclapply(seq_along(input), f, mc.cores=cores), 'crc32')
}
set.seed(recover.seed)
return(out)
}
input <- 1:10
set.seed(123)
outcome1a <- wrapply(input, cores=4)
outcome1b <- digest(sample(1:10), 'crc32')
set.seed(123)
outcome2a <- wrapply(input, cores=NULL)
outcome2b <- digest(sample(1:10), 'crc32')
identical(outcome1a, outcome2a)
identical(outcome1b, outcome2b)
In this case, the results "a" and "b" are indentical.
source to share
Another approach is to first generate the samples you would like to use and call mclapply on the samples:
library("parallel")
library("digest")
input<-1:10
set.seed(1)
nsamp<-20
## Generate and store all the random samples
samples<-lapply(1:nsamp, function(x){ sample(input) })
## apply the algorithm "diff" on every sample
ncore0<- lapply(samples, diff)
ncore1<-mclapply(samples, diff, mc.cores=1)
ncore2<-mclapply(samples, diff, mc.cores=2)
ncore3<-mclapply(samples, diff, mc.cores=3)
ncore4<-mclapply(samples, diff, mc.cores=4)
## all equal
all.equal(ncore0,ncore1)
all.equal(ncore0,ncore2)
all.equal(ncore0,ncore3)
all.equal(ncore0,ncore4)
This ensures reproducibility by using more memory and slightly longer run times, since the computation performed on each sample is usually the most time consuming operation.
Note. The use mc.set.seed = F
in your question will generate the same sample for each core, which is probably not what you want.
source to share