Random number generator that returns unique numbers in sorted order

I need a generator for many (up to one trillion, 10^12) unique random 64-bit numbers. The generator needs to return the numbers in sorted order (from Long.MIN_VALUE to Long.MAX_VALUE). The problem is that sorting 10^12 numbers is slow. The use case replicates the test that was run for BBHash (in 4.5 Trillion Key Indexing).

A simple solution is to keep a set in memory, using a huge bit set or similar, to ensure that no duplicates are returned. But that needs too much memory or I/O. I would like to use no more than a few MB of internal state.

The generator must use java.util.Random internally. It should be as “fair” as possible, that is, have the same statistical distribution as if the numbers were generated the normal way and then sorted. I would also like to have a version for 128-bit numbers (2 longs).

So far I have the following code, which builds the set in memory (Java code):

import java.util.Random;
import java.util.TreeSet;

public static void main(String... args) {
    for (long x : randomSet(10, 0)) {
        System.out.println(x);
    }
}

// Generates 'size' unique random longs and returns them in sorted order;
// the TreeSet keeps the entries sorted and silently rejects duplicates.
static Iterable<Long> randomSet(int size, int seed) {
    Random r = new Random(seed);
    TreeSet<Long> set = new TreeSet<Long>();
    while (set.size() < size) {
        set.add(r.nextLong());
    }
    return set;
}

-8292973307042192125
-7423979211207825555
-6688467811848818630
-4962768465676381896
-2228689144322150137
-1083761183081836303
-279624296851435688
4437113781045784766
6146794652083548235
7105486291024734541

      

The simplest (wrong) solution, which is not random, is to distribute the results evenly. I don't think an "add a random gap" approach will work either, because it is slow and the sum of such gaps after 10^12 steps won't land where it should (well, maybe it could: remember how many numbers are left and re-scale the distribution as you go...). I think the following should work, but it is tricky, and I'm not sure which formulas to use: for each bit level, recursively, calculate how many 0s / 1s are likely to occur (using the binomial distribution, or an approximation such as the normal / Gaussian distribution). Stop at some point (say, blocks of 1 million entries or less) and use the code above for speed. But maybe there is a more elegant solution. Maybe it has something to do with the Metropolis-Hastings algorithm, I'm not sure. I read "Efficient Sequential Random Sampling Algorithm", but I think it is only for small n, and I'm having a hard time extracting a simple algorithm from it.

Java code would be preferred, but C is fine too (at some point I might have to convert it to C/C++ anyway). I would rather not use too many libraries, to make porting easier.

+3




6 answers


I have a solution.

(It turns out that generating 100,000 or more entries in roughly sorted order is faster than generating them with one large set. "Roughly sorted" means: replace the TreeSet in the code below with a HashSet and use a limit of 10,000 instead of 5. This is faster because testing for duplicates in a HashSet is much cheaper.)
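For illustration, a minimal sketch of that modified base case might look like this (the helper name is mine; it assumes the same bit mask as the base case of randomSequence below):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Random;
import java.util.Set;

// Illustrative "roughly sorted" base case: collect up to ~10,000 unique
// values with a HashSet (cheap duplicate test), then sort only this batch.
static Iterator<Long> roughlySortedBatch(Random r, long size, int shift) {
    Set<Long> set = new HashSet<Long>();
    while (set.size() < size) {
        set.add(r.nextLong() & ((2L << shift) - 1));
    }
    Long[] batch = set.toArray(new Long[0]);
    Arrays.sort(batch);
    return Arrays.asList(batch).iterator();
}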

Random number of entries for each fixed (sub-)range



Create a tree: for each bit level (starting with the most significant bit), recursively decide, using the normal distribution, how many entries should have the bit at that level equal to 0; the remaining entries have that bit equal to 1. At each level of recursion this splits the range roughly in half. Stop when, say, fewer than 1 million entries are left, then switch to an in-memory pseudo-RNG for that block and sort those numbers (or use a bit field).

Here's some code (not tested yet):

import java.util.Iterator;
import java.util.Random;
import java.util.TreeSet;

public static void main(String... args) {
    Random r = new Random();
    Iterator<Long> it = randomSequence(r, 10, 32);
    while (it.hasNext()) {
        System.out.println(it.next());
    }
}

/**
 * Random sequence generator.
 *
 * @param r the random generator
 * @param size the number of entries to generate
 * @param shift the highest bit level used (the result uses bits 0 to shift)
 * @return the iterator
 */
static Iterator<Long> randomSequence(final Random r, final long size, final int shift) {
    if (size < 5) {
        // small lists are generated using a TreeSet (sorted, no duplicates)
        TreeSet<Long> set = new TreeSet<Long>();
        while (set.size() < size) {
            set.add(r.nextLong() & ((2L << shift) - 1));
        }
        return set.iterator();
    }
    // large lists are created recursively
    return new Iterator<Long>() {
        long remaining = size, zeros = randomHalf(r, size);
        Iterator<Long> lowBits0 = randomSequence(r, zeros, shift - 1);
        Iterator<Long> lowBits1;
        @Override
        public boolean hasNext() {
            return remaining > 0;
        }
        @Override
        public Long next() {
            remaining--;
            if (lowBits0.hasNext()) {
                return lowBits0.next();
            }
            if (lowBits1 == null) {
                lowBits1 = randomSequence(r, size - zeros, shift - 1);
            }
            return (1L << shift) + lowBits1.next();
        }
    };
}

/**
 * Get the number of entries that are supposed to be below the half,
 * according to the probability theory. For example, for a number of coin
 * flips, how many are heads.
 *
 * @param r the random generator
 * @param samples the total number of entries
 * @return the number of entries that should be used for one half
 */
static long randomHalf(Random r, long samples) {
    long low = 0, high = samples;
    double x = r.nextDouble();
    while (low + 1 < high) {
        long mid = (low + high) / 2;
        double p = probabilityBucketAtMost(samples, mid);
        if (x > p) {
            low = mid;
        } else {
            high = mid;
        }
    }
    return (low + high) / 2;
}

static double probabilityBucketAtMost(long flips, long heads) {
    // https://www.fourmilab.ch/rpkp/experiments/statistics.html
    long x = heads;
    long n = flips;
    // standard deviation of the number of heads: sqrt(n / 4) = sqrt(n) / 2
    double stdDev = Math.sqrt(n / 4.0);
    // mean
    long mu = n / 2;
    // https://en.wikipedia.org/wiki/Normal_distribution
    // Numerical approximations for the normal CDF
    // the probability that a standard normal random variable X is <= x
    return phi((x - mu) / stdDev);
}

static double phi(double x) {
    return 0.5 * (1 + Math.signum(x) * Math.sqrt(1 - Math.exp(-2 * x * x / Math.PI)));
}
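As a rough sanity check of randomHalf (an illustrative fragment, not part of the answer's code): for one million simulated coin flips the results should cluster around 500,000, with a standard deviation of about 500, i.e. sqrt(n) / 2.

Random r = new Random(42);
for (int i = 0; i < 5; i++) {
    // each printed value should land near 500,000
    System.out.println(randomHalf(r, 1_000_000));
}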

      

+1




For the requirements

  • generate a sequence of random numbers r_i from an interval I = [-(R + 1), R], R > 0, with the same statistical distribution as java.util.Random
  • the sequence r_i must be strictly increasing (r_i > r_j for i > j)

we can write down a simple algorithm:

A1:
 - draw a random number r_i from I via a library call
 - discard it if it is less than or equal to the last draw, and try another pick
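A minimal sketch of A1 (my own illustration; as discussed below it gives no control over how many numbers are produced in total, and it stalls as the values approach the top of the range):

import java.util.Random;

// A1: keep drawing until the value is strictly greater than the previous one.
static long nextIncreasing(Random r, long last) {
    long x;
    do {
        x = r.nextLong();
    } while (x <= last);  // discard draws that are not strictly greater
    return x;
}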

      

A possible complaint is that this algorithm will probably produce the wrong total number of generated r_i; there is a soft requirement of N = 10^12 numbers in total:

  1. "need a generator for many (up to one trillion, 10 ^ 12) unique random 64-bit numbers"

The solution for this would be

A2:
 - to generate N numbers and then 
 - sort them

      

However, there is one more requirement: there is not enough memory for that.

  1. "I would like to use no more than a few MB of internal state."

My hypothesis is that it is impossible to fulfill all of these requirements at once.

As a compromise, I suggest



A3:
 R = 2^63 ≈ 9·10^18
 N = 1 trillion = 10^12
 - divide the range I = [-R, R-1] into N intervals of length 2R/N each
 - visit those intervals one after another
 - draw one random number from each interval

      

This will give N random numbers in ascending order.
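A minimal sketch of A3 (method and variable names are mine; the small modulo bias of the offset is ignored here):

import java.util.Random;

// Sketch of A3: split the full 64-bit range into N equal strides and draw
// one random value per stride, visiting the strides in ascending order.
static void intervalSamples(Random r, long n) {
    long stride = Long.divideUnsigned(-1L, n); // ~ 2^64 / n (unsigned math)
    long start = Long.MIN_VALUE;               // lower end of the first interval
    for (long i = 0; i < n; i++) {
        long offset = (r.nextLong() & Long.MAX_VALUE) % stride; // in [0, stride)
        System.out.println(start + offset);
        start += stride;                       // move to the next interval
    }
}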

Update:

After looking at the BBHash paper and sources a couple of times, this is my understanding:

Given some whole set I and a subset S with N = |S| elements, BBHash will compute a function f that maps S to some permutation of {1, .., N} (which permutation is apparently implicitly determined by BBHash) and maps all other elements of I to a special value Imax.

Possible tests:

Given S and f, one can check whether membership in S is correctly computed for an arbitrary element of I.

One can also check that f(S) = {1, ..., N}.

I am assuming that the requested algorithm is meant to compute a sample set S for N = 10^12 on the fly under a limited memory budget, where the essential requirement is that the sequence of random numbers is unique, rather than monotonic.

To quote fooobar.com/questions/47711/...:

Probabilistic data structures cannot give you a definite answer, instead they provide you with a reasonable approximation of the answer and a way to approximate that estimate. They are extremely useful for big data and streaming applications as they allow you to significantly reduce the amount of memory you need (compared to the data structures that give you the exact answers).

In most cases, these data structures use hash functions to randomize items. Because they ignore collisions, they keep the size constant, but that's also the reason they can't give you an exact value.

In the case of BBHash, a sequence of different hash functions h_i is used; another h_i is applied whenever a collision occurs. This only works if the input is unique, and only if the implementation has sufficiently different h_i for the particular S.

+1




Let's call your universe of random values U. It is a 64-bit signed integer range, so there are 2^64 possible values. Let's call the total number of sorted random values needed N; you say it is about 10^12.

Determine a reasonable amount of memory to use. Let's say your machine can allocate and use 1 GB without issue; that is 134,217,728 64-bit values. Call this A (the array size).

N / A = 7450.58..., so round up to 7451 buckets and adjust A to ceiling(N / 7451), which is 134,210,174. Calculate R (the bucket range) = U / 7451.

Loop over 7451 buckets (B):
    Generate 134,210,174 random values in the range (0..R),
    inserting them into the array as they are produced. Binary
    insertion should be reasonable (N*log(N), just like generating
    them all then sorting, but you can use the insertion to catch
    duplicates so you don't need extra memory or time for that).

    Output the bucket of values, adding (B*R) to each.

      

You will overshoot N by a few thousand; if hitting N exactly is critical, arbitrarily select the required number of buckets and remove one value from each.
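For illustration, a sketch of the bucket loop, using a zero-based universe and parameters passed in for simplicity; unlike the insertion-based deduplication described above, this sketch simply drops duplicates within a bucket:

import java.util.Arrays;
import java.util.Random;

// Sketch of the bucket scheme: fill one bucket at a time with random values
// from that bucket's sub-range, sort it in memory, then emit it with the
// bucket's offset added, so the overall output is in ascending order.
static void bucketed(Random rnd, int buckets, int perBucket, long bucketRange) {
    long[] a = new long[perBucket];
    for (int b = 0; b < buckets; b++) {
        for (int i = 0; i < perBucket; i++) {
            a[i] = (rnd.nextLong() & Long.MAX_VALUE) % bucketRange; // in [0, R)
        }
        Arrays.sort(a);
        long base = (long) b * bucketRange;  // offset of this bucket
        long prev = -1;
        for (long v : a) {
            if (v != prev) {                 // drop duplicates within the bucket
                System.out.println(base + v);
            }
            prev = v;
        }
    }
}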

0




10^12 is about 2^40, i.e. the average step between successive values will be about 2^24 (2^64 / 2^40).

So if the goal is to generate an unpredictable but ordered sequence of hashes, this is not possible: an average step of 2^24 is far too easy to brute-force.

But if that's not the goal, then why not just concatenate a 40-bit incrementing counter in the high bits with a 24-bit random value in the low bits?
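A minimal sketch of that construction (the Long.MIN_VALUE offset is my own addition, so that the output covers the signed 64-bit range and still increases strictly with the counter):

import java.util.Random;

// 40-bit counter i (10^12 < 2^40) in the high bits, 24 random bits below;
// the result is strictly increasing in i regardless of the random low bits.
static long counterWithNoise(Random r, long i) {
    return Long.MIN_VALUE + ((i << 24) | r.nextInt(1 << 24));
}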

-1




The random generator gives you a number R from the interval [A, B] (A < R < B). You can then recursively generate numbers from [A, R) and (R, B], which ensures that all numbers are unique and can be emitted in sorted order.

Something like the quicksort partitioning scheme.
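A rough sketch of the idea (my own interpretation: the answer does not say how to split the remaining count between the two sides, so it is split evenly here, which is not statistically equivalent to sorting independent samples; ranges are assumed small enough that hi - lo does not overflow):

import java.util.Random;

// Pick a pivot, recurse into the left part, emit the pivot, recurse into the
// right part; 'count' limits how many numbers are produced in total.
static void partitioned(Random r, long lo, long hi, long count) {
    if (count <= 0 || lo > hi) {
        return;                                            // nothing left to emit
    }
    long pivot = lo + (long) (r.nextDouble() * (hi - lo)); // pivot in [lo, hi)
    long left = (count - 1) / 2;                           // arbitrary even split
    partitioned(r, lo, pivot - 1, left);
    System.out.println(pivot);
    partitioned(r, pivot + 1, hi, count - 1 - left);
}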

-1




You want a lot of pseudo-random 64-bit numbers, all unique. Given unique inputs and the same key, the outputs of a block cipher are unique - they must be, because encryption is reversible. DES is a 64-bit block cipher, so encrypting the numbers 0, 1, 2, 3, 4, ..., 10^12 with DES in ECB mode will give you a trillion unique 64-bit numbers. With the same key they are guaranteed to be unique because the inputs are unique. Another key will produce a different set of unique numbers, but some of them may duplicate numbers from the first set.

For 128-bit numbers, use AES, which has a 128-bit block size, again in ECB mode and with a fixed key.

The only internal state needed is the key you are using and one number indicating how far you are through the range [0..10^12].
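A sketch of the DES variant using the standard javax.crypto API (the helper and key handling are illustrative only; as described below, the output still has to be sorted separately):

import java.nio.ByteBuffer;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Encrypt the counters from, from+1, ... with DES/ECB under a fixed 8-byte
// key; each 8-byte ciphertext is a unique 64-bit number, because DES with a
// fixed key is a permutation of the 64-bit block space.
static long[] encryptedCounters(byte[] key8, long from, int count) throws Exception {
    Cipher des = Cipher.getInstance("DES/ECB/NoPadding");
    des.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key8, "DES"));
    long[] out = new long[count];
    ByteBuffer block = ByteBuffer.allocate(8);
    for (int i = 0; i < count; i++) {
        block.clear();
        block.putLong(from + i);  // the unique plaintext: the counter value
        out[i] = ByteBuffer.wrap(des.doFinal(block.array())).getLong();
    }
    return out;
}

The only state to carry between batches is the key and the next counter value (from).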

You will need to sort the result separately. Since it is easy to restart the process from the last saved number to create the next batch of numbers, I suspect that a merge sort will be relatively easy to implement, with each new batch merged into an already sorted main file as it is created. The batch size can be kept within memory capacity, while the main file is stored on disk.

This solution does not use java.util.Random. How important is that to you? The encryption is designed to appear random to all but the most sophisticated cryptanalysis, and it is probably "more random" than the standard Java Random PRNG.

-2








