Pythonic way to randomly assign pandas data records

Suppose we have a data frame

In [1]: df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

In [2]: df
Out[2]:
     A   B   C   D
0   45  88  44  92
1   62  34   2  86
2   85  65  11  31
3   74  43  42  56
4   90  38  34  93
5    0  94  45  10
..  ..  ..  ..  ..

      

How can I randomly replace x% of all records with a value, for example None?

In [4]: something(df, percent=25)
Out[4]:
      A     B     C     D
0    45    88  None    92
1    62    34     2    86
2  None  None    11    31
3    74    43  None    56
4    90    38    34  None
5  None    94    45    10
..   ..    ..    ..    ..

      

I found information on sampling individual axes, and I can imagine generating random integers within the dimensions of my data frame and setting those positions to None, but this is not very Pythonic.

  • Edit: forgot the "way" in the title


1 answer


You can combine DataFrame.where and np.random.uniform:

In [37]: df
Out[37]: 
   A  B  C  D
0  1  0  2  2
1  2  2  0  3
2  3  0  0  3
3  0  2  3  1

In [38]: df.where(np.random.uniform(size=df.shape) > 0.3, None)
Out[38]: 
      A  B     C     D
0     1  0     2  None
1     2  2     0     3
2     3  0  None  None
3  None  2     3  None

      



It's not the shortest, but it gets the job done.
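
The one-liner above decides each cell independently, so the share of replaced cells only approximates the threshold (about 30% for 0.3). If an exact share is wanted, something like the sketch below might work. The name mask_random and its percent and seed parameters are made up for illustration, and it fills with NaN (via DataFrame.mask) rather than None:

```python
import numpy as np
import pandas as pd


def mask_random(df, percent, seed=None):
    """Return a copy of df with round(percent% of all cells) set to NaN.

    Hypothetical helper, not part of pandas.
    """
    rng = np.random.default_rng(seed)
    n_cells = df.size
    n_masked = int(round(n_cells * percent / 100))

    # Pick n_masked distinct flat positions, then reshape to the frame's shape.
    flat = np.zeros(n_cells, dtype=bool)
    flat[rng.choice(n_cells, size=n_masked, replace=False)] = True

    # DataFrame.mask replaces values where the condition is True (NaN by default).
    return df.mask(flat.reshape(df.shape))


df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
masked = mask_random(df, percent=25, seed=0)
print(masked.isna().sum().sum())  # number of replaced cells
```

Because the positions are drawn without replacement over the flattened frame, the count of replaced cells is fixed, while their distribution across columns still varies from run to run.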

Note that you should ask yourself whether you really want to do this if you still have calculations to run on the data. If you put None in a column, pandas has to use the slow object dtype instead of a fast one such as int64 or float64.
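
To illustrate the dtype point, here is a small sketch of two ways to keep numeric dtypes while still marking cells as missing. The frame and the mask are toy values chosen for the example, and the nullable "Int64" dtype is a suggestion of mine, not something the original approach uses:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

# Toy mask: True marks the cells to blank out.
mask = np.array([[True, False],
                 [False, False],
                 [False, True],
                 [False, False]])

# Default fill is NaN, so the affected integer columns become float64.
as_float = df.mask(mask)

# Casting to the nullable integer dtype first keeps integers and uses pd.NA.
as_nullable_int = df.astype("Int64").mask(mask)

print(as_float.dtypes)
print(as_nullable_int.dtypes)
```

Either way the columns stay numeric, so sums, means and similar reductions skip the missing cells without falling back to object dtype.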
