Slicing a Dask data frame

Question

I have the following code, where I would like to do a train / test split on a Dask DataFrame

import dask.dataframe as dd

df = dd.read_csv(csv_filename, sep=',', encoding="latin-1",
                 names=cols, header=0, dtype='str')

But when I try to slice it like this

for train, test in cv.split(X, y):
    df.fit(X[train], y[train])

it fails with the error

KeyError: '[11639 11641 11642 ..., 34997 34998 34999] not in index'

Any ideas?

python dataframe dask

Zubair ahmed, Jun 10 '17 at 16:17

1 answer

MRocklin · Answer 1 · 2017-06-10T16:36:32+0000

Dask.dataframe does not support row-wise slicing. It does support the loc operation if you have a sensible index.

However, for a train / test split you would probably be better off using the random_split method.

train, test = df.random_split([0.80, 0.20])

You can also make many splits and concatenate them in different ways, for example for cross-validation

# Five roughly equal random folds
splits = df.random_split([0.20, 0.20, 0.20, 0.20, 0.20])

for i in range(5):
    # Train on every fold except fold i, test on fold i
    trains = [splits[j] for j in range(5) if j != i]
    train = dd.concat(trains, axis=0)
    test = splits[i]
