Random selection from Pandas groups with equal probability - unexpected behavior
I have 12 unique groups that I am trying to randomly sample, each with a different number of observations. I want to randomly select from the entire population (dataframe), with each group having the same probability of being selected. The simplest example of this would be a data frame with two groups.
groups probability
0 a 0.25
1 a 0.25
2 b 0.5
using np.random.choice(df['groups'], p=df['probability'], size=100)
Each draw then has a 50% chance of choosing group a and a 50% chance of choosing group b.
To come up with the probabilities, I used the formula:
(1. / num_groups) / size_of_groups
or in Python:
num_groups = len(df['groups'].unique()) # 2
size_of_groups = df.groupby('groups').size() # {a: 2, b: 1}
(1. / num_groups) / size_of_groups
Which returns:
groups
a 0.25
b 0.50
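As a minimal sketch of the same formula applied per row (the column name probability matches the table above; using value_counts with map is just one assumed way to broadcast group sizes back to rows):

```python
import pandas as pd

# Toy frame from the example above: two rows in group a, one in group b
df = pd.DataFrame({'groups': ['a', 'a', 'b']})

num_groups = df['groups'].nunique()                           # 2
group_sizes = df['groups'].map(df['groups'].value_counts())  # 2, 2, 1 per row

# Each group gets total weight 1/num_groups, split evenly across its rows
df['probability'] = (1.0 / num_groups) / group_sizes

print(df)
```

The per-row probabilities sum to 1, which np.random.choice requires for its p argument.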
This works great until I get 10 unique groups, at which point I start getting weird distros. Here's a small example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1234)
group_size = 12
groups = np.arange(group_size)
# Draw groups with deliberately unequal probabilities
probs = np.random.uniform(size=group_size)
probs = probs / probs.sum()
g = np.random.choice(groups, size=10000, p=probs)
df = pd.DataFrame({'groups': g})
# Per-row weight so that each group is equally likely overall
prob_map = ((1. / len(df['groups'].unique())) / df.groupby('groups').size()).to_dict()
df['probability'] = df['groups'].map(prob_map)
plt.hist(np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True))
plt.xticks(np.arange(group_size))
plt.show()
I would expect a fairly even distribution with a large enough sample size, but I get these wings when the number of groups is 11 or more. If I change the variable group_size to 10 or lower, I get the uniform distribution I want.
I can't tell if the problem is with my formula for calculating the probabilities, or perhaps a floating point precision issue? Does anyone know a better way to accomplish this or fix it for this example?
Thanks in advance!
There is no problem with your calculations. Take your resulting array:
arr = np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True)
If you count how often each value occurs:
pd.Series(arr).value_counts().sort_index()
Out:
0 855
1 800
2 856
3 825
4 847
5 835
6 790
7 847
8 834
9 850
10 806
11 855
dtype: int64
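This is pretty close to being evenly distributed. The problem is that the histogram uses 10 bins by default. With 12 integer labels from 0 to 11, ten equal-width bins are each 1.1 wide, so the first bin holds both 0 and 1 and the last bin holds both 10 and 11, which produces the wings. Use one bin per group instead:
bins = np.linspace(-0.5, 11.5, num=13)
pd.Series(arr).plot.hist(bins=bins)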
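To see the binning effect without plotting, here is a small sketch using np.histogram on an idealized sample (exactly 100 of each label, which is an assumption made for illustration, not the data above):

```python
import numpy as np

# Idealized sample: labels 0..11, each appearing exactly 100 times
values = np.repeat(np.arange(12), 100)

# Ten equal-width bins over [0, 11], each 1.1 wide (the default bin count)
counts_default, edges_default = np.histogram(values, bins=10)

# One bin per label: 13 edges at -0.5, 0.5, ..., 11.5
edges_unit = np.arange(13) - 0.5
counts_unit, _ = np.histogram(values, bins=edges_unit)

print(counts_default)  # first and last bins each absorb two labels
print(counts_unit)     # every bin holds exactly one label
```

Even with perfectly uniform data, the ten-bin version shows doubled counts at both ends, so the wings come from the bin edges rather than from the sampling.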