Random sampling of non-overlapping groups from a pandas DataFrame

Question

I need to randomly split a dataframe into two disjoint sets based on the attribute 'ids'. For example, consider the following data frame:

df=
Out[470]: 
          0     1     2     3       ids
0      17.0  18.0  16.0  15.0      13.0
1      18.0  16.0  15.0  15.0      13.0
2      16.0  15.0  15.0  16.0      13.0
131    12.0   8.0  21.0  19.0      14.0
132     8.0  21.0  19.0  20.0      14.0
133    21.0  19.0  20.0   9.0      14.0
248     NaN   NaN  12.0  11.0      17.0
249     NaN  12.0  11.0  10.0      17.0
250    12.0  11.0  10.0   NaN      17.0
287     3.0   3.0   1.0   8.0      20.0
288     3.0   1.0   8.0   3.0      20.0
289     1.0   8.0   3.0   3.0      20.0
413    21.0   7.0  16.0  18.0      25.0
414     7.0  16.0  18.0  19.0      25.0
415    16.0  18.0  19.0  18.0      25.0
665    10.0   8.0   8.0   7.0      27.0
666     8.0   8.0   7.0   9.0      27.0
667     8.0   7.0   9.0   8.0      27.0
790     NaN   NaN  15.0   NaN      33.0
791     NaN  15.0   NaN  10.0      33.0
792    15.0   NaN  10.0   NaN      33.0
812     NaN  16.0   NaN  17.0      34.0
813    16.0   NaN  17.0   NaN      34.0
814     NaN  17.0   NaN  13.0      34.0
944     3.0   4.0   3.0  18.0      35.0
945     4.0   3.0  18.0  18.0      35.0
946     3.0  18.0  18.0  11.0      35.0
1059    9.0  10.0   3.0   4.0      56.0
1060   10.0   3.0   4.0   3.0      56.0
1061    3.0   4.0   3.0   3.0      56.0
    ...   ...   ...   ...       ...
10125   NaN   9.0   5.0   5.0  101317.0
10126   9.0   5.0   5.0   5.0  101317.0
10127   5.0   5.0   5.0   7.0  101317.0

I need to get two data frames (randomly split, with some given sizes) that have no overlapping values of ids.

I know how to solve it in a "non-pandas" way:

get the unique values of ids
randomly split the unique values into two non-overlapping groups
select the rows whose ids values fall in each of the two groups using .isin()
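The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy frame, the column name "value", the fixed seed, and the 50/50 cut are made up for the example and are not part of the question.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the question's df (values are made up).
df = pd.DataFrame({
    "value": range(12),
    "ids": np.repeat([13.0, 14.0, 17.0, 20.0], 3),
})

rng = np.random.default_rng(0)  # seed fixed only so the run is repeatable

# Step 1: get the unique values of ids.
unique_ids = df["ids"].unique()

# Step 2: shuffle them and cut into two non-overlapping groups.
shuffled = rng.permutation(unique_ids)
n_first = len(shuffled) // 2  # hypothetical 50/50 split; adjust as needed
ids1, ids2 = shuffled[:n_first], shuffled[n_first:]

# Step 3: select the rows belonging to each group with .isin().
df1 = df[df["ids"].isin(ids1)]
df2 = df[df["ids"].isin(ids2)]
```

Because the cut is made on the unique ids rather than on the rows, every row with a given ids value ends up in the same frame.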

I am wondering if there is a simple and neat way to do this using a built-in pandas function, for example .sample()?

python pandas disjoint-sets

Arnold klein May 16 '17 at 17:01

2 answers

UPDATE:

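# Shuffle the rows, then keep those with an even ids value (the split follows the parity of ids, not a random draw).
df1 = df.sample(frac=1).loc[df.ids % 2 == 0]
# All remaining rows go to the second frame.
df2 = df.loc[df.index.difference(df1.index)]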

OLD answer (incorrect: it does not keep the ids disjoint between the two frames):

you can first shuffle your DF with sample(frac=1) and then use np.split():

import numpy as np
df1, df2 = np.split(df.sample(frac=1), 2)

MaxU May 16 '17 at 17:09

root · Accepted Answer · 2017-05-16T17:47:27+0000

You can use sklearn.model_selection.GroupShuffleSplit to perform the split:

from sklearn.model_selection import GroupShuffleSplit

# Initialize the GroupShuffleSplit.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5)

# Get the indexers for the split.
idx1, idx2 = next(gss.split(df, groups=df.ids))

# Get the split DataFrames.
df1, df2 = df.iloc[idx1], df.iloc[idx2]
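One way to convince yourself that the split respects the groups is to run it on a small frame and check that no ids value appears on both sides. The sketch below assumes a made-up toy frame and a fixed random_state (both are illustrative choices, not part of the answer above).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame standing in for the question's df (values are made up).
df = pd.DataFrame({
    "value": range(12),
    "ids": np.repeat([13.0, 14.0, 17.0, 20.0], 3),
})

# random_state is fixed here only to make the run repeatable.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
idx1, idx2 = next(gss.split(df, groups=df["ids"]))
df1, df2 = df.iloc[idx1], df.iloc[idx2]

# No ids value should appear in both frames, and no row should be lost.
shared_ids = set(df1["ids"]) & set(df2["ids"])
print("shared ids:", shared_ids)
print("rows:", len(df1), "+", len(df2), "of", len(df))
```

Note that for GroupShuffleSplit a fractional test_size refers to the proportion of groups, not of rows, so the two frames can differ in row count when the groups have unequal sizes.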
