Random selection from Pandas groups with equal probability - unexpected behavior

I have 12 unique groups that I am trying to randomly sample, each with a different number of observations. I want to randomly select from the entire population (dataframe), with each group having the same probability of being selected. The simplest example of this would be a data frame with two groups.

    groups  probability
0       a       0.25
1       a       0.25
2       b       0.5

      

Using np.random.choice(df['groups'], p=df['probability'], size=100), each draw now has a 50% chance of choosing group a and a 50% chance of choosing group b.

To come up with the probabilities, I used the formula:

(1. / num_groups) / size_of_groups

      

or in Python:

num_groups = len(df['groups'].unique())  # 2
size_of_groups = df.groupby('groups').size()  # {a: 2, b: 1}
(1. / num_groups) / size_of_groups

      

which returns:

    groups
a    0.25
b    0.50
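For reference, here is a minimal sketch of the same formula applied per row, so that the probabilities line up with the dataframe directly. The toy dataframe and the use of `np.random.default_rng` with a fixed seed are my own assumptions for illustration, not part of the original setup.

```python
import numpy as np
import pandas as pd

# Toy dataframe matching the two-group example: a has 2 rows, b has 1 row.
df = pd.DataFrame({"groups": ["a", "a", "b"]})

num_groups = df["groups"].nunique()
# Size of the group each row belongs to, aligned with the rows of df.
group_sizes = df.groupby("groups")["groups"].transform("size")

# Each group gets total mass 1 / num_groups, split evenly across its rows.
df["probability"] = (1.0 / num_groups) / group_sizes

# Seeded generator (an assumption, used only to make the sketch repeatable).
rng = np.random.default_rng(0)
sample = rng.choice(
    df["groups"].to_numpy(), p=df["probability"].to_numpy(), size=100
)
```

The per-row probabilities sum to 1 and each group's rows sum to 1 / num_groups, which is what makes every group equally likely. The same weights could presumably also be passed to `df.sample(n=..., weights='probability', replace=True)`.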

      

This works great until I have more than 10 unique groups, at which point I start getting odd distributions. Here's a small example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1234)

group_size = 12
groups = np.arange(group_size)

# Random, unequal group frequencies for the population
probs = np.random.uniform(size=group_size)
probs = probs / probs.sum()

g = np.random.choice(groups, size=10000, p=probs)
df = pd.DataFrame({'groups': g})

# Per-row probability: (1 / number of groups) / size of the row's group
prob_map = ((1. / len(df['groups'].unique())) / df.groupby('groups').size()).to_dict()

df['probability'] = df['groups'].map(prob_map)

plt.hist(np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True))
plt.xticks(np.arange(group_size))
plt.show()

      

Histogram

I would expect a fairly even distribution with a large enough sample size, but I get these wings when the number of groups is 11+. If I change the variable group_size to 10 or lower, I get the uniform distribution I want.

I can't tell if the problem is with my formula for calculating the probabilities, or perhaps a floating point precision issue? Does anyone know a better way to accomplish this or fix it for this example?

Thanks in advance!



2 answers


You are using hist, which uses 10 bins by default.


plt.rcParams['hist.bins']

10
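To see why 10 bins produce the "wings", here is a small sketch that bins one observation per group (the integers 0 to 11) with `np.histogram`. Using a single value per group is my own simplification for illustration; the real sample has thousands of values per group, but they fall into bins the same way.

```python
import numpy as np

group_size = 12
values = np.arange(group_size)  # one observation per group: 0, 1, ..., 11

# 10 equal-width bins over [0, 11] are each 1.1 wide, so the first bin
# catches both 0 and 1, and the last bin catches both 10 and 11.
counts_default, edges_default = np.histogram(values, bins=10)

# With one bin per group, every integer lands in its own bin.
counts_fixed, _ = np.histogram(values, bins=group_size)

print(counts_default)  # [2 1 1 1 1 1 1 1 1 2]
print(counts_fixed)    # [1 1 1 1 1 1 1 1 1 1 1 1]
```

The first and last bins each hold two groups, so they appear roughly twice as tall as the rest even though the sampling itself is uniform.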

      




Pass group_size as the bins parameter:

plt.hist(
    np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True),
    bins=group_size)

      



There is no problem with your calculations. Your resulting array:

arr = np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True)

      

If you check the number of values:

pd.Series(arr).value_counts().sort_index()
Out: 
0     855
1     800
2     856
3     825
4     847
5     835
6     790
7     847
8     834
9     850
10    806
11    855
dtype: int64

      



This is pretty close to being evenly distributed. The problem is that the histogram uses 10 bins by default. Try this instead:

bins = np.linspace(-0.5, 11.5, num=13)  # 13 edges give 12 unit-wide bins centered on 0..11
pd.Series(arr).plot.hist(bins=bins)
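Since the groups are discrete labels, another option is to skip histogram binning entirely and count occurrences directly. This is only a sketch: `arr` below is a uniformly drawn stand-in (an assumption for illustration) rather than the array sampled from the question's dataframe, and the seeded generator is likewise my own choice.

```python
import numpy as np

group_size = 12

# Stand-in for the sampled array `arr`; drawn uniformly here for illustration.
rng = np.random.default_rng(1234)
arr = rng.integers(0, group_size, size=10000)

# One count per group label, with no bin edges involved.
counts = np.bincount(arr, minlength=group_size)

# Largest relative deviation from a perfectly even split.
expected = arr.size / group_size
max_rel_dev = np.abs(counts - expected).max() / expected

print(counts)
print(max_rel_dev)
```

The counts can then be drawn as a bar chart (for example with `plt.bar(np.arange(group_size), counts)`), which always shows exactly one bar per group regardless of how many groups there are.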

      
