How to use sequential random sampling in Python Pandas?

Below is code that reads a CSV file and takes a random sample of 700 rows from it. I need to do this for multiple files, but if I iterate over the files, the random sample will be different for each one, whereas I want to keep the same rows (in the same order) once the sample has been generated.

import pandas as pd

df = pd.read_csv("file.csv", delim_whitespace=True)
df_s = df.sample(n=700)

      

My idea is to record the sampled row numbers and reuse them for the next file, but that doesn't seem very elegant.

Do you know any good solutions to this problem?

CLARIFICATION

The files have different lengths, but every file has at least 750 rows.

Example of the desired result

df1 = pd.read_csv("file1.csv", delim_whitespace=True)
df_s1 = df1.sample(n=700)  # choose a random sample

df2 = pd.read_csv("file2.csv", delim_whitespace=True)
df_s2 = df2.sample(n=700)  # should pick the same rows as above

      



1 answer


I think you can use the random_state parameter of sample, but it only yields the same rows when all files have the same number of rows, so also pass the nrows parameter to read_csv:



df = pd.read_csv("file.csv", delim_whitespace=True, nrows=750)
df_s = df.sample(n=700, random_state=123)
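
To check that this really picks the same rows in every file, here is a minimal sketch. The helpers `sample_same_rows` and `make_fake_file` are hypothetical names made up for illustration, and in-memory text stands in for the real files. It also assumes `sep=r"\s+"` as a replacement for `delim_whitespace=True`, which recent pandas versions deprecate.

```python
import io

import pandas as pd


def sample_same_rows(source, n=700, nrows=750, seed=123):
    """Read the first `nrows` rows and draw `n` of them with a fixed seed."""
    # sep=r"\s+" splits on runs of whitespace, like delim_whitespace=True.
    df = pd.read_csv(source, sep=r"\s+", nrows=nrows)
    # Same frame length + same seed -> same row positions, in the same order.
    return df.sample(n=n, random_state=seed)


def make_fake_file(n_rows, offset):
    """Build an in-memory whitespace-delimited file (stand-in for a real one)."""
    lines = ["row value"]
    lines += [f"{i} {i + offset}" for i in range(n_rows)]
    return io.StringIO("\n".join(lines))


# Two "files" of different lengths, both longer than 750 rows.
s1 = sample_same_rows(make_fake_file(800, offset=0))
s2 = sample_same_rows(make_fake_file(1200, offset=1000))

# Compare which row positions were drawn from each file.
same_positions = s1.index.equals(s2.index)
print(same_positions)
```

Note the trade-off: because only the first 750 rows are read, rows beyond that point in longer files can never end up in the sample.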

      
