Combining two samples from numpy.random does not end in a random sequence

Question

I have implemented the Wald-Wolfowitz test, but during testing I ran into behavior I did not expect. The steps I take are as follows:

I am taking two samples from the same distribution:

import numpy as np
list_dist_A = np.random.chisquare(2, 1000)
list_dist_B = np.random.chisquare(2, 1000)

I concatenate the two lists and sort the result, remembering which sample each number came from. The following function does this and returns a list of labels ["A", "B", "A", "A", ... "B"]:

def _get_runs_list(list1, list2):
    # Tag every value with the sample it came from
    l1 = [(x, "A") for x in list1]
    l2 = [(x, "B") for x in list2]
    # Concatenate both tagged samples
    lst = l1 + l2
    # Sort by value only
    sorted_list = sorted(lst, key=lambda x: x[0])
    # Return only the labels, in sorted order
    return [l[1] for l in sorted_list]

Now I want to calculate the number of runs (maximal sequences of consecutive identical labels), e.g.:

a, b, a, b has 4 runs
a, a, a, b, b has 2 runs
a, b, b, b, a, a has 3 runs

I use the following code for this:

def _calculate_nruns(labels):
    nruns = 0
    last_seen = None

    for label in labels:
        if label != last_seen:
            nruns += 1
        last_seen = label

    return nruns
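As a quick sanity check, the three example sequences above can be run through an equivalent counter. This is only a sketch: count_runs is a hypothetical stand-in built on itertools.groupby, not the function from the question, but it should return the same counts.

```python
from itertools import groupby


def count_runs(labels):
    """Count maximal blocks of consecutive identical labels."""
    # groupby starts a new group whenever the label changes,
    # so the number of groups equals the number of runs.
    return sum(1 for _ in groupby(labels))


# The three examples from the text, written as strings of labels.
examples = {
    "abab": 4,
    "aaabb": 2,
    "abbbaa": 3,
}
for sequence, expected in examples.items():
    assert count_runs(sequence) == expected

print({sequence: count_runs(sequence) for sequence in examples})
# {'abab': 4, 'aaabb': 2, 'abbbaa': 3}
```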

Since all the elements are drawn at random, I expected the labels to end up roughly alternating: a, b, a, b, a, b, ...

That would mean the number of runs is about 2000. However, as you can see in this snippet on repl.it, this is not the case: it is approximately 1000.

Can someone explain why this is the case?

+3

python random

warreee Apr 26 '17 at 22:10


2 answers

Oh, that reminds me of the gambler's fallacy.

I'm not a statistician, but to get 2000 runs you would need a 100% chance that an A follows a B and a B follows an A. That would mean that the PRNG has some kind of memory of previous draws. That would not be good...

OTOH, suppose you have drawn a value labeled A: then there is a 50% chance that the next one is another A and a 50% chance that it is a B. Thus the chance of getting a run of length one is actually only 50%, the chance of a run of length two is 25%, for length three it is 12.5%, for length four 6.25%, and so on.
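The halving argument above is a geometric distribution of run lengths, and it also explains the roughly 1000 runs: if the mean run length is 2, then 2000 labels split into about 1000 runs. Here is a minimal sketch of that arithmetic, assuming each label matches the previous one independently with probability 0.5 (only an approximation for two finite samples). The names run_length_probability and p_same are made up for illustration.

```python
def run_length_probability(k, p_same=0.5):
    """Probability that a run has length exactly k.

    Assumes (as an approximation) that each next label matches the
    current one independently with probability p_same.
    """
    return p_same ** (k - 1) * (1 - p_same)


n_labels = 2000  # 1000 "A" labels plus 1000 "B" labels

# Mean run length: sum over k of k * P(length = k). The tail beyond
# k = 200 is negligible for p_same = 0.5.
mean_length = sum(k * run_length_probability(k) for k in range(1, 200))
expected_runs = n_labels / mean_length

print(round(mean_length, 6))    # 2.0
print(round(expected_runs, 3))  # 1000.0

# Expected number of runs of each length among about 1000 runs.
for k in range(1, 5):
    print(k, expected_runs * run_length_probability(k))
```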

The last part can be easily verified:

import numpy as np
list_dist_A = np.random.chisquare(2, 1000)
list_dist_B = np.random.chisquare(2, 1000)

listA = [(value, 'A') for value in list_dist_A]
listB = [(value, 'B') for value in list_dist_B]
combined = sorted(listA+listB, key=lambda x: x[0])
combined = [x[1] for x in combined]

from itertools import groupby
from collections import Counter

runlengths = [len(list(it)) for _, it in groupby(combined)]  # lengths of the individual runs
print(Counter(runlengths))  # similar to a histogram
# Counter({1: 497, 2: 234, 3: 131, 4: 65, 5: 29, 6: 20, 7: 11, 8: 2, 10: 1, 14: 1})

So this is actually very close to the expectation (which would be 1: 500, 2: 250, 3: 125, 4: 62, ... as mentioned above). If your guess were correct, it would be closer to 1: 2000, 2: 0, ...

+2

MSeifert Apr 27 '17 at 0:05


Robert Kern · Accepted Answer · 2017-04-26T23:17:50+0000

About 1000 is the expected result. Following the Wikipedia article on this statistical test, you have Np = Nn = 1000 and N = Np + Nn = 2000. This means that the expected value for the number of runs, mu = 2 * Np * Nn / N + 1, is 1001.
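The formula can be checked numerically. The sketch below is not part of the original answer. It adds the standard deviation of the runs count, sqrt((mu - 1) * (mu - 2) / (N - 1)), which the answer does not mention, and compares mu with a small simulation. It assumes that for two samples from the same continuous distribution the sorted label sequence behaves like a random shuffle of the labels. All function names are made up for illustration.

```python
import math
import random


def expected_runs(n_pos, n_neg):
    """Mean number of runs under the null hypothesis."""
    n = n_pos + n_neg
    return 2 * n_pos * n_neg / n + 1


def runs_std(n_pos, n_neg):
    """Standard deviation of the number of runs under the null hypothesis."""
    n = n_pos + n_neg
    mu = expected_runs(n_pos, n_neg)
    return math.sqrt((mu - 1) * (mu - 2) / (n - 1))


def count_runs(labels):
    """Count blocks of consecutive identical labels in a list."""
    return sum(1 for i, label in enumerate(labels)
               if i == 0 or label != labels[i - 1])


mu = expected_runs(1000, 1000)
sigma = runs_std(1000, 1000)
print(mu)               # 1001.0
print(round(sigma, 2))  # 22.36

# Assumption: under the null hypothesis, sorting the pooled values
# yields a uniformly random arrangement of the labels, so shuffling
# the labels directly simulates the same thing.
rng = random.Random(0)  # fixed seed only for repeatability
labels = ["A"] * 1000 + ["B"] * 1000
simulated = []
for _ in range(200):
    rng.shuffle(labels)
    simulated.append(count_runs(labels))

print(sum(simulated) / len(simulated))  # should land close to mu
```

So a single observed count should usually fall within a few multiples of 22 of 1001, while a count near 2000 would be a sign that the labels are far from randomly arranged.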
