Zipf's law in Java for generating text - too slow

Hi I am working on a text generator that needs to generate millions of different texts. to make the content of each text realistic, I used Zipf Law It works well, word distribution is correct.

But the following function next()

is very slow and since I want to generate millions of articles, it needs to be changed. (The while loop is the slow part)

Can anyone help me?

I have implemented it like this:

   public int next() {

    int rank;
    double frequency = 0;
    double dice;

    rank = rnd.nextInt(size);
    frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
    dice = rnd.nextDouble();


    while (!(dice < frequency) || (rank == 0)) {
        rank = rnd.nextInt(size);
        frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        dice = rnd.nextDouble();
    }

    return rank;
}

      

EDIT: I got the code from: http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/

+3


source to share


3 answers


The implementation you copied ... has some problems. One could probably say that this is clearly wrong, because it uses random values, and when in a calculation like

rank = rnd.nextInt(size);
friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;

      

Meaning rank

0

, then the frequency Infinity

will spoil some statistics.

I tried to fix the erros, but not analyzed the implementation and are not compared it with the definition of Zipf distribution function. So if someone copies my code, they might find out that they still "... have some problems".


The implementation of the function is next

, strictly speaking, not " total correct " in the sense that it does not necessarily end. There is nothing stopping the loop from running forever. Depending on the parameters, it may be more or less likely that it will take some time until it stops. And I think this is also one of the main reasons for the "performance" problem: for some values, the condition is (dice < frequency)

just unlikely to happen ....


Regardless, the goal you want to achieve can be formulated more generally: you have a definite probability distribution. And you want a "random" function that returns random values ​​based on that distribution.



One simple and general way to achieve this is to map the (cumulative) probability distribution to the target values ​​using NavigableMap

. This map can then be used to quickly find the target value given a random value between 0.0 and 1.0 that is supplied by the instance java.util.Random

.

In some cases, there may be better solutions, but again: this is very general and simple (and still quite effective).


I have implemented this here for Zipf distribution. Again, I haven't checked everything in detail, and there are some +1

/ -1

oddities (mentioned in the first paragraph), but it should show the idea: FastZipfGenerator

fills in the map containing the probability distribution and in a function next()

, just does a search:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

public class ZipfGeneratorTest
{
    public static void main(String[] args) {

        int size = 10;
        double skew = 2.0;

        ZipfGenerator z0 = new ZipfGenerator(size, skew);
        FastZipfGenerator z1 = new FastZipfGenerator(size, skew);

        long before = 0;
        long after = 0;

        int n = 5000000;

        before = System.nanoTime();
        Map<Integer, Integer> counts0 = computeCounts(z0, size, n);
        after = System.nanoTime();
        System.out.println(counts0+", duration "+(after-before)/1e6);

        before = System.nanoTime();
        Map<Integer, Integer> counts1 = computeCounts(z1, size, n);
        after = System.nanoTime();
        System.out.println(counts1+", duration "+(after-before)/1e6);
    }

    private static Map<Integer, Integer> computeCounts(
        ZipfGenerator z, int size, int n)
    {
        Map<Integer, Integer> counts = new LinkedHashMap<Integer, Integer>();
        for (int i=1; i<=size; i++)
        {
            counts.put(i, 0);
        }
        for (int i=1; i<=n; i++)
        {
            int k = z.next();
            counts.put(k, counts.get(k)+1);
        }
        return counts;
    }

    private static Map<Integer, Integer> computeCounts(
        FastZipfGenerator z, int size, int n)
    {
        Map<Integer, Integer> counts = new LinkedHashMap<Integer, Integer>();
        for (int i=1; i<=size; i++)
        {
            counts.put(i, 0);
        }
        for (int i=1; i<=n; i++)
        {
            int k = z.next();
            counts.put(k, counts.get(k)+1);
        }
        return counts;
    }

}

// Based on http://diveintodata.org/tag/zipf/
class ZipfGenerator {
    private Random rnd = new Random(0);
    private int size;
    private double skew;
    private double bottom = 0;

    public ZipfGenerator(int size, double skew) {
        this.size = size;
        this.skew = skew;

        for(int i=1;i <=size; i++) {
            this.bottom += (1/Math.pow(i, this.skew));
        }
    }

    // the next() method returns an random rank id.
    // The frequency of returned rank ids are follows Zipf distribution.
    public int next() {
        int rank;
        double friquency = 0;
        double dice;

        rank = rnd.nextInt(size)+1;
        friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        dice = rnd.nextDouble();

        while(!(dice < friquency)) {
            rank = rnd.nextInt(size)+1;
            friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
            dice = rnd.nextDouble();
        }

        return rank;
    }


    // This method returns a probability that the given rank occurs.
    public double getProbability(int rank) {
        return (1.0d / Math.pow(rank, this.skew)) / this.bottom;
    }
}



class FastZipfGenerator
{
    private Random random = new Random(0);
    private NavigableMap<Double, Integer> map;

    FastZipfGenerator(int size, double skew)
    {
        map = computeMap(size, skew);
    }

    private static NavigableMap<Double, Integer> computeMap(
        int size, double skew)
    {
        NavigableMap<Double, Integer> map = 
            new TreeMap<Double, Integer>();

        double div = 0;
        for (int i = 1; i <= size; i++)
        {
            div += (1 / Math.pow(i, skew));
        }

        double sum = 0;
        for(int i=1; i<=size; i++)
        {
            double p = (1.0d / Math.pow(i, skew)) / div;
            sum += p;
            map.put(sum,  i-1);
        }
        return map;
    }

    public int next()
    {
        double value = random.nextDouble();
        return map.ceilingEntry(value).getValue()+1;
    }

}

      

It prints the result of a random sample (basically a "histogram") and some synchronization results. The sync results are something like

duration 6221.835052
duration 304.761282

      

showing that it will most likely be faster (although this should not be considered a "benchmark" ...)

+4


source


The source obtained from https://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/ has some bugs.

Here are quick fixes. (1) In the ZipfGeneator (int, double) constructor, make sure you compute to size using the equal sign.

public ZipfGenerator(int size, double skew) {
  this.size = size;
  this.skew = skew;

  for(int i=1;i <= size; i++) {
  this.bottom += (1/Math.pow(i, this.skew));
  }
 }

      

(2) replace

rank = rnd.nextInt(size); 

      

from



rank = rnd.nextInt(size)+1; 

      

Here is the complete source code.

import java.util.Random;

//credit: https://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/ [Online; December 2017]

public class ZipfGenerator {
 private Random rnd = new Random(System.currentTimeMillis());
 private int size;
 private double skew;
 private double bottom = 0;

 public ZipfGenerator(int size, double skew) {
  this.size = size;
  this.skew = skew;

  for(int i=1;i <= size; i++) {
  this.bottom += (1/Math.pow(i, this.skew));
  }
 }

 // the next() method returns an random rank id.
 // The frequency of returned rank ids are follows Zipf distribution.
 public int next() {
   int rank;
   double friquency = 0;
   double dice;

   rank = rnd.nextInt(size)+1;
   friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
   dice = rnd.nextDouble();

   while(!(dice < friquency)) {
     rank = rnd.nextInt(size)+1;
     friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
     dice = rnd.nextDouble();
   }

   return rank;
 }

 // This method returns a probability that the given rank occurs.
 public double getProbability(int rank) {
   return (1.0d / Math.pow(rank, this.skew)) / this.bottom;
 }

 public static void main(String[] args) {
   if(args.length != 2) {
     System.out.println("usage: ./zipf size skew");
     System.exit(-1);
   }

   ZipfGenerator zipf = new ZipfGenerator(Integer.valueOf(args[0]),
   Double.valueOf(args[1]));
   for(int i= 1;i <= 10; i++) {
     System.out.println(i+" "+zipf.getProbability(i));
   }
   //use size = 10 and skew = 2 for testing below
   int hist [] = new int [12];
   for(int i=0;i<12;i++) {
       hist[i] = 0;
   }
   System.out.println("Testing the probability distribution:");
   int sum = 0;
    for(int i= 1;i <= 1000000; i++) {
        hist[zipf.next()]++; 
   }
   for(int i=0;i<12;i++)
     System.out.println(i+" "+hist[i]/1000000.0);
    }

}

      

Result:

Probability distribution from the formula:
1 0.6452579827864142
2 0.16131449569660355
3 0.07169533142071269
4 0.04032862392415089
5 0.02581031931145657
6 0.017923832855178172
7 0.013168530260947229
8 0.010082155981037722
9 0.007966147935634743
10 0.006452579827864143
Testing the probability distribution from sampling:
0 0.0
1 0.645813
2 0.160766
3 0.071527
4 0.040346
5 0.026039
6 0.01801
7 0.013215
8 0.009953
9 0.007863
10 0.006468
11 0.0

      

Note. 0 and 11 have a probability of 0 as expected.

+2


source


You asked about speed, so I present a small optimization. First of all, get rid of the repetitive things to find out what it's all about:

public int next() {
    while (true) {
        int rank = rnd.nextInt(size);
        if (rank == 0) return return rank;
        double frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        double dice = rnd.nextDouble();
        if (dice < frequency) return rank;
    }
}

      

For now, this should work exactly the same (unless I missed something). I moved the test rank

up, as the following computation is useless if it is zero. Now there is one line that we can speed up a bit, for example

double frequency = Math.pow(rank, -this.skew) * inverseBottom;

      

Actually, this might slightly change the result due to rounding errors, but I doubt you care. If rank

it was permanent, you can turn pow

in exp

to make it faster, but it isn't. For a little, size

you can pre-copy the table ln(rank)

and use it like

double frequency = Math.exp(ln[rank] * -this.skew) * inverseBottom;

      

A better algorithm could give you more than these low-level optimizations.

0


source







All Articles