How to perform permutations using mclapply in a reproducible manner, regardless of the number of threads and the OS?

Is it possible to run some function based on permutations using mclapply in a reproducible manner regardless of the number of threads and OS?
Below is an example of toys. Hashing the resulting list of permutation vectors is just for ease of comparison. I've tried different RNGkind

("L'Ecuyer-CMRG"), different settings for mc.preschedule

and mc.set.seed

. Until you can make them the same.

library("parallel")
library("digest")

set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
              mc.cores=2, mc.set.seed = F)
digest(m, 'crc32')

set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
              mc.cores=4, mc.set.seed = F)
digest(m, 'crc32')

set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
              mc.cores=2, mc.set.seed = F)
digest(m, 'crc32')

set.seed(1)
m <- mclapply(1:10, function(x) sample(1:10),
              mc.cores=1, mc.set.seed = F)
digest(m, 'crc32')

set.seed(1)
m <- lapply(1:10, function(x) sample(1:10))
digest(m, 'crc32') # this is equivalent to what I get on Windows.

      

sessionInfo()

just in case:

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] digest_0.6.8

loaded via a namespace (and not attached):
[1] tools_3.2.0

      

+3


source to share


2 answers


One solution I came up with is to create an additional vector with seeds. mclapply

or lapply

iterates over an index that points to both the argument and the corresponding seed. A kind of hack, but it works.

library("parallel")
library("digest")

input <- 1:10

# make random seed vector of length(input).
set.seed(1)
seeds <- sample.int(length(input), replace=TRUE)

f <- function(idx){ 
    # input[i] # do whatever with the input
    set.seed(seeds[idx]) # set to proper seed
    sample(1:10)}

digest(mclapply(seq_along(input), f, mc.cores=2), 'crc32')
digest(mclapply(seq_along(input), f, mc.cores=4), 'crc32')
digest(mclapply(seq_along(input), f, mc.cores=2), 'crc32')
digest(mclapply(seq_along(input), f, mc.cores=1), 'crc32')
digest(lapply(seq_along(input), f), 'crc32')

      

The problem with this trick is that when the code is wrapped, set.seed inside the function interferes with the outer set of seed. For example,

set.seed(123)
outcome1a <- digest(mclapply(seq_along(input), f, mc.cores=4), 'crc32')
outcome1b <- digest(sample(1:10), 'crc32')

set.seed(123)
outcome2a <- digest(lapply(seq_along(input), f), 'crc32')
outcome2b <- digest(sample(1:10), 'crc32')
identical(outcome1a, outcome2a)
identical(outcome1b, outcome2b)

      



Although, indeed, the results of "a" are the same, the results of the stochastic calculations that follow immediately after, that is, "b", are affected and different. My guess is one hack might be to wrap the mclapply / lapply function in such a way that a random seed is generated based on user input, and then after execution, the wrapper dumps the seed to that value.

library("parallel")
library("digest")

wrapply <- function(input, cores){
    recover.seed <- floor(runif(1)*1e6)
    seeds <- sample.int(length(input), replace=TRUE)
    f <- function(idx){ 
        # input[i] # do whatever with the input
        set.seed(seeds[idx]) # set to proper seed
        sample(1:10)
    }
    if(is.null(cores)){
        out <- digest(lapply(seq_along(input), f), 'crc32')
    }else{
        out <- digest(mclapply(seq_along(input), f, mc.cores=cores), 'crc32')
    }
    set.seed(recover.seed)
    return(out)
}

input <- 1:10

set.seed(123)
outcome1a <- wrapply(input, cores=4)
outcome1b <- digest(sample(1:10), 'crc32')

set.seed(123)
outcome2a <- wrapply(input, cores=NULL)
outcome2b <- digest(sample(1:10), 'crc32')

identical(outcome1a, outcome2a)
identical(outcome1b, outcome2b)

      

In this case, the results "a" and "b" are indentical.

0


source


Another approach is to first generate the samples you would like to use and call mclapply on the samples:

    library("parallel")
    library("digest")

    input<-1:10
    set.seed(1)
    nsamp<-20
    ## Generate and store all the random samples
    samples<-lapply(1:nsamp, function(x){ sample(input) })

    ## apply the algorithm "diff" on every sample
    ncore0<-  lapply(samples, diff)
    ncore1<-mclapply(samples, diff, mc.cores=1)
    ncore2<-mclapply(samples, diff, mc.cores=2)
    ncore3<-mclapply(samples, diff, mc.cores=3)
    ncore4<-mclapply(samples, diff, mc.cores=4)

    ## all equal
    all.equal(ncore0,ncore1)
    all.equal(ncore0,ncore2)
    all.equal(ncore0,ncore3)
    all.equal(ncore0,ncore4)

      



This ensures reproducibility by using more memory and slightly longer run times, since the computation performed on each sample is usually the most time consuming operation.

Note. The use mc.set.seed = F

in your question will generate the same sample for each core, which is probably not what you want.

0


source







All Articles