Pythonic way to randomly assign pandas data records
Suppose we have a data frame:
In [1]: df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
In [2]: df
Out[2]:
     A   B   C   D
0   45  88  44  92
1   62  34   2  86
2   85  65  11  31
3   74  43  42  56
4   90  38  34  93
5    0  94  45  10
..  ..  ..  ..  ..
How can I randomly replace x% of all records with a value, for example None?
In [4]: something(df, percent=25)
Out[4]:
       A     B     C     D
0     45    88  None    92
1     62    34     2    86
2   None  None    11    31
3     74    43  None    56
4     90    38    34  None
5   None    94    45    10
..    ..    ..    ..    ..
I found information on sampling along individual axes, and I can imagine randomly generating integer positions within the dimensions of my data frame and setting those cells to None, but this is not very Pythonic.
- Edit: forgot the "Pythonic" in the title
You can combine DataFrame.where and np.random.uniform:
In [37]: df
Out[37]:
   A  B  C  D
0  1  0  2  2
1  2  2  0  3
2  3  0  0  3
3  0  2  3  1
In [38]: df.where(np.random.uniform(size=df.shape) > 0.3, None)
Out[38]:
      A  B     C     D
0     1  0     2  None
1     2  2     0     3
2     3  0  None  None
3  None  2     3  None
It's not the shortest, but it gets the job done.
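As a rough sketch, the same idea can be wrapped in a helper shaped like the something(df, percent=25) call from the question. The name mask_random, the seed parameter, and the choice of NaN as the fill value are assumptions made for illustration; none of them come from pandas itself. Each cell is masked independently, so the masked share is only approximately percent, not exactly.

```python
import numpy as np
import pandas as pd


def mask_random(df, percent, seed=None):
    """Return a copy of df with roughly `percent`% of its cells set to NaN.

    Hypothetical helper: every cell is masked independently, so the
    realised fraction fluctuates around percent / 100.
    """
    rng = np.random.default_rng(seed)
    # True where the original value is kept, False where it is masked
    keep = rng.uniform(size=df.shape) >= percent / 100
    # With no replacement value given, where() fills masked cells with NaN
    return df.where(keep)


df = pd.DataFrame(np.arange(400).reshape(100, 4), columns=list("ABCD"))
masked = mask_random(df, percent=25, seed=0)
# Share of masked cells: close to 0.25, but not exactly 0.25
print(masked.isna().to_numpy().mean())
```

Passing a seed makes the result reproducible; leaving it as None gives a different mask on every call.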
Note that you should ask yourself whether you really want to do this if you still need to run calculations on the data. If you put None into a column, pandas has to fall back to the slow object dtype instead of a fast one like int64 or float64.
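If an exact share of cells is needed and the columns should stay numeric, one possible variant picks cell positions without replacement and fills them with NaN instead of None. This is only a sketch under those assumptions; mask_exact and its seed parameter are hypothetical names, not pandas functions.

```python
import numpy as np
import pandas as pd


def mask_exact(df, percent, seed=None):
    """Set exactly round(percent% of all cells) to NaN (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_cells = df.size
    n_mask = int(round(n_cells * percent / 100))
    # Flat boolean mask with exactly n_mask True entries
    flat = np.zeros(n_cells, dtype=bool)
    flat[rng.choice(n_cells, size=n_mask, replace=False)] = True
    # mask() replaces cells where the condition is True (NaN by default)
    return df.mask(flat.reshape(df.shape))


df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))
out = mask_exact(df, percent=25, seed=0)
print(out.isna().to_numpy().sum())  # 10 of the 40 cells are NaN

# The dtype difference the note describes, on a small Series
print(pd.Series([1, None, 3], dtype=object).dtype)  # object
print(pd.Series([1, np.nan, 3]).dtype)              # float64
```

NaN forces integer columns to a floating-point dtype, but arithmetic and aggregations on them remain vectorized, which is not the case for object columns holding None.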