Fast, unbiased, integer pseudo-random generator with arbitrary limits
For a Monte Carlo integration process, I need to pull a lot of random samples from a histogram with N buckets. N is arbitrary (i.e. not a power of two), but does not change at all during the computation.
By a lot, I mean something on the order of 10^10, ten billion, so pretty much any kind of lengthy precomputation is likely to pay off given the sheer number of samples.
I have at my disposal a very fast pseudo-random number generator that typically produces 64-bit unsigned integers (all ints in the discussion below are unsigned).
A naive way to pull out a sample: histogram[ prng() % histogram.size() ]
The naive way is very slow: the modulo operation relies on integer division (IDIV), which is terribly expensive, and the compiler, not knowing the value of histogram.size() at compile time, cannot pull its usual magic (i.e. http://www.azillionmonkeys.com/qed/adiv.html ).
Basically, the bulk of my computation time is spent computing this darn modulo.
A slightly less naive way: I use libdivide ( http://libdivide.com/ ), which can perform very fast "division by a constant not known at compile time".
This gives me a very nice win (25% or so), but I feel like I can do better, and here's why:
- First intuition: libdivide computes the division. What I actually need is the modulo, so I have to do an additional multiply and subtract: mod = dividend - divisor*(uint64_t)(dividend/divisor). I suspect there might be a small win in a libdivide-style method that produces the modulo directly (a small sketch of this appears after these two intuitions).
- Second intuition: I am not actually interested in the modulo itself. What I really want is to efficiently produce a uniformly distributed integer value that is guaranteed to be strictly less than N.
Modulo is a pretty standard way to get there because of two of its properties:
- A) mod(prng(), N) is guaranteed to be uniformly distributed if prng() is
- B) mod(prng(), N) is guaranteed to belong to [0, N)
But modulo does a lot more than just satisfy those two constraints; in fact, it probably does too much work.
All I need is a function, any function, that obeys constraints A) and B) and is fast.
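For illustration, here is a minimal sketch of the "fast division, then multiply and subtract" route mentioned in the first intuition, assuming libdivide's C++ interface (libdivide::divider and its overloaded operator/); the function name fast_mod is just a placeholder:
#include <cstdint>
#include "libdivide.h"

// Compute dividend % divisor using libdivide's fast division plus
// one extra multiply and subtract.
uint64_t fast_mod(uint64_t dividend, uint64_t divisor,
                  const libdivide::divider<uint64_t>& fast_divisor) {
    uint64_t quotient = dividend / fast_divisor; // fast "division by a runtime constant"
    return dividend - divisor * quotient;        // recover the remainder
}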
That was a long introduction, but here are my two questions:
- Is there something equivalent to libdivide that computes the integer modulo directly?
- Is there some function F(X, N) of integers X and N that obeys the following two constraints:
- If X is a uniformly distributed random variable, then F(X, N) is also uniformly distributed
- F(X, N) is guaranteed to be in [0, N)
(PS: I know that if N is small, I don't need to consume all 64 bits coming out of the PRNG. In fact, I already do that. But as I said, even that optimization is a minor win compared to the big fat loss of computing the modulo.)
Edit: prng() % N is indeed not exactly uniformly distributed. But for N large enough, I don't think it's a big problem (or is it?)
Edit 2: prng() % N can indeed be very badly distributed. I never realized how bad it could be. Ouch. I found a good article on this: http://ericlippert.com/2013/12/16/how-much-bias-is-introduced-by-the-remainder-technique
If you have access to the necessary instruction, you can do a 64-bit multiplication of prng() by N and return the high 64 bits of the 128-bit result. This is akin to multiplying a uniform real in [0, 1) by N and truncating, with bias on the order of the modulo approach's (i.e., effectively negligible; a 32-bit version of this answer would have a small but possibly noticeable bias).
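A minimal sketch of that multiply-high trick, assuming a GCC/Clang-style compiler with the __uint128_t extension (the name bounded_sample is just a placeholder):
#include <cstdint>

// Map a uniform 64-bit value into [0, bucket_count) by keeping the high
// 64 bits of the 128-bit product.
inline uint64_t bounded_sample(uint64_t random64, uint64_t bucket_count) {
    return (uint64_t)(((__uint128_t)random64 * bucket_count) >> 64);
}

// usage: histogram[ bounded_sample(prng(), histogram.size()) ]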
Another possibility worth exploring would be to use word parallelism on a branchless modulus algorithm operating on single bits, to generate several random numbers per batch.
Under those constraints, the simplest approach may well work best. One extremely simple approach that might work if your PRNG is fast enough would be to precompute one less than the next power of 2 greater than your N, to use as a mask. I.e., given a number that looks like 0001xxxxxxxx in binary (where x means we don't care whether it's 1 or 0), we want a mask like 000111111111.
From here, we generate numbers like this:
1. generate a number
2. AND it with the mask
3. if the result >= N, go back to step 1
The exact efficiency of this depends on how close N is to a power of 2. Each successive power of 2 is (obviously enough) double its predecessor. So, in the best case, N is exactly one less than a power of 2, and the test at step 3 essentially always passes. We have added only a mask and a comparison to the time taken by the PRNG itself.
In the worst case, N is exactly a power of 2, so the masked range is twice as large as needed. In that case, we expect to throw away about half of the numbers we generate.
On average, N will land about halfway between powers of 2, which means that on average we throw away about one in four results. The mask and comparison themselves are nearly free, so the speed loss compared to the "raw" generator is basically the fraction of outputs we discard, about 25% on average.
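A minimal sketch of the mask-and-retry loop described above, assuming a 64-bit prng(); the function-pointer parameter and the name bounded_by_rejection are just placeholders for the asker's generator:
#include <cstdint>

uint64_t bounded_by_rejection(uint64_t (*prng)(), uint64_t bucket_count) {
    // One less than the next power of two greater than bucket_count.
    uint64_t mask = bucket_count;
    mask |= mask >> 1;  mask |= mask >> 2;  mask |= mask >> 4;
    mask |= mask >> 8;  mask |= mask >> 16; mask |= mask >> 32;

    uint64_t result;
    do {
        result = prng() & mask;        // steps 1 and 2: generate and mask
    } while (result >= bucket_count);  // step 3: out of range, try again
    return result;
}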
Libdivide, or any other complex optimization of that modulo, is simply overkill. In a situation like yours, the only sane approach is to:
- make sure your table size is a power of two (add padding if necessary!)
- replace the modulo operation with a bitmask operation. Like this:
size_t tableSize = 1 << 16;
size_t tableMask = tableSize - 1;
...
histogram[prng() & tableMask]
A bitmask operation is a single cycle on any CPU worth its money; you can't beat its speed.
Note: I don't know the quality of your random number generator, but it may not be advisable to use its lowest bits. Some RNGs produce poor randomness in the low bits and better randomness in the high bits. If that is the case with your RNG, use a bit shift to take the most significant bits instead:
size_t bitCount = 16;
...
histogram[prng() >> (64 - bitCount)]
It is just as fast as the bitmask, but uses different bits.
You can expand your histogram up to a "roomy" power of two, repeating the real entries an equal number of times and filling the leftover slots with some dummy value (guaranteed to never appear in the real data). E.g. with the histogram
[10, 5, 6]
stretch it up to 16 as suggested (assuming -1 is a valid sentinel):
[10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, -1]
Sampling can then be done through a bitmask, histogram[prng() & mask], where mask = new_length - 1 (new_length being the padded power-of-two size, 16 in the example), with a retry whenever the sentinel value is hit, i.e.
int value;
do {
    value = histogram[prng() & mask];
} while (value == SENTINEL);
// use `value` here
The expansion is made larger than strictly necessary so that retries stay unlikely, ensuring that the vast majority of slots are valid (in the example above, only 1/16 of lookups will "fail", and that rate can be reduced further by expanding to 64 entries, and so on). You can even put a branch-prediction hint (e.g. __builtin_expect in GCC) on the check, so the compiler optimizes for the case value != SENTINEL, which is hopefully the common case.
It is essentially a trade-off between memory and speed.
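A hypothetical version of the retry loop with such a hint, assuming GCC or Clang (SENTINEL and mask as above):
int value;
do {
    value = histogram[prng() & mask];
} while (__builtin_expect(value == SENTINEL, 0)); // tell the compiler the retry is rare
// use `value` here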
Just a few ideas to complement the other good answers:
- What percentage of the time is actually spent in the modulo, and how do you know that percentage? I ask because sometimes people say something is terribly slow when it is in fact less than 10% of the time, and they only think it's big because they are using a silly profiler that only reports self time. (I find it hard to imagine the modulo operation taking much time compared to the random number generator itself.)
- When is the number of buckets known? If it doesn't change too often, you could write a program generator: whenever the number of buckets changes, automatically emit a new program, compile, link, and use it for the bulk run. That way the compiler knows the number of buckets at compile time (a rough sketch follows this list).
- Have you considered using a quasi-random number generator rather than a pseudo-random one? It can give you better integration accuracy for a given number of samples.
- Could the number of buckets be reduced without hurting the integration accuracy too much?
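A rough, hypothetical sketch of the program-generator idea mentioned above (file name, constant name, and build command are all made up, and the emitted driver is only hinted at in comments):
#include <cstdlib>
#include <fstream>

// Emit a source file in which the bucket count is a compile-time constant,
// then build and run it for the bulk of the computation.
void regenerate_and_run(unsigned long long bucket_count) {
    std::ofstream src("sampler_gen.cpp");
    src << "constexpr unsigned long long kBuckets = " << bucket_count << "ULL;\n"
           "// ... Monte Carlo driver using prng() % kBuckets, which the\n"
           "// compiler can now turn into multiply/shift magic ...\n";
    src.close();
    std::system("c++ -O3 sampler_gen.cpp -o sampler_gen && ./sampler_gen");
}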
The skew dbaupp warns about can be side-stepped by rejecting values of at least M * (2^64 / M) (integer division) before the modulo is taken.
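A minimal sketch of that "reject, then take the modulo" idea, assuming a 64-bit prng(); the function-pointer parameter and the name unbiased_mod are just placeholders:
#include <cstdint>

uint64_t unbiased_mod(uint64_t (*prng)(), uint64_t M) {
    // A multiple of M below which every residue appears equally often.
    uint64_t limit = UINT64_MAX - (UINT64_MAX % M);
    uint64_t x;
    do {
        x = prng();          // redraw the few values that would skew the result
    } while (x >= limit);
    return x % M;
}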
If M can be represented in no more than 32 bits, you can get more than one value less than M per 64-bit draw, by re-multiplying (see David Eisenstat's answer) or by divmod; alternatively, you can use bit operations to peel off bit patterns just long enough for M, again rejecting values not less than M.
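As an illustration of the re-multiplying variant, here is a rough sketch assuming __uint128_t support; it only makes sense while M^k stays well below 2^64, since each step consumes randomness from the same draw:
#include <cstdint>

// Draw k values in [0, M) from a single 64-bit draw: each step keeps the
// high 64 bits of the product as the sample and the low 64 bits as the
// leftover randomness for the next step.
void batch_samples(uint64_t draw, uint64_t M, uint64_t* out, int k) {
    __uint128_t state = draw;
    for (int i = 0; i < k; ++i) {
        __uint128_t prod = (state & 0xFFFFFFFFFFFFFFFFull) * M;
        out[i] = (uint64_t)(prod >> 64); // sample in [0, M)
        state = prod;                    // low 64 bits carry on as randomness
    }
}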
(I would be surprised if the modulo weren't dwarfed in time / cycles / power consumption by the random number generation itself.)
To fill the buckets, you can use std::binomial_distribution to fill each bucket directly, rather than feeding the histogram one sample at a time.
The following might help:
#include <cstddef>
#include <ctime>
#include <random>

int nrolls = 60; // total number of experiments to distribute
const std::size_t N = 6;
unsigned int bucket[N] = {};
std::mt19937 generator(time(nullptr));
for (std::size_t i = 0; i != N; ++i) {
    // Each remaining roll falls into bucket i with probability 1/(N - i).
    double proba = 1. / static_cast<double>(N - i);
    std::binomial_distribution<int> distribution(nrolls, proba);
    bucket[i] = distribution(generator);
    nrolls -= bucket[i];
}
Instead of integer division you can use fixed-point math, i.e. an integer multiplication and a bit shift. Say your prng() returns values in the range 0-65535 and you want the quantized range to be 0-99: then you compute (prng() * 100) >> 16. Just make sure the multiplication does not overflow your integer type, so you may need to shift the result of prng() to the right first. Note that this mapping treats the distribution more kindly than modulo does: the small residual bias is spread across the whole range rather than concentrated at the low values.
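For instance, a tiny sketch of this scaling with a hypothetical 16-bit generator prng16():
#include <cstdint>

// prng16() is a hypothetical generator returning uniform values in 0..65535.
uint16_t prng16();

// Scale into [0, 100) with one multiply and one shift, as described above.
uint32_t sample_0_99() {
    return ((uint32_t)prng16() * 100u) >> 16;
}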
Thanks everyone for your suggestions.
First, I am now thoroughly convinced that modulo is really evil:
it is both very slow and badly biased in most cases.
After implementing and testing several suggestions, what seems to be the best speed/quality trade-off is the solution suggested by @Gene:
- precompute a normalizer: auto normalizer = histogram.size() / (1.0 + urng.max());
- draw samples with: return histogram[ (uint32_t)floor(urng() * normalizer) ];
This is the fastest of all the methods I have tried so far, and as far as I can tell, it gives a much better distribution, even if it may not be as perfect as the rejection method.
Edit: I ended up using David Eisenstat's method, which is more or less the same as Yarkkola's proposal: index = (rng() * N) >> 32. It is distributed just as well as the floating-point normalization and is slightly faster (9% faster, actually). So that is now my preferred way.
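For completeness, a small sketch of that final form, assuming a 32-bit rng() and a bucket count that fits in 32 bits (the names rng and draw_index are placeholders):
#include <cstdint>

uint32_t rng(); // the fast 32-bit generator (placeholder declaration)

// index = (rng() * N) >> 32, computed in 64 bits so the high half survives.
uint32_t draw_index(uint32_t N) {
    return (uint32_t)(((uint64_t)rng() * N) >> 32);
}

// usage: histogram[ draw_index((uint32_t)histogram.size()) ]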