How can I split a dataset from a CSV file for training and testing?

I am using Python and I need to split my imported .csv data into two parts, a training set and a test set, e.g. 70% training and 30% test.

I keep getting various errors, such as 'list' object is not callable.

Is there an easy way to do this?

Thanks.

EDIT:

The code is basic, I just want to split the dataset.

from csv import reader
with open('C:/Dataset.csv', 'r') as f:
    data = list(reader(f)) #Imports the CSV
    data[0:1] ( data )

      

TypeError: 'list' object is not callable



3 answers


You can use pandas:



import pandas as pd
import numpy as np

df = pd.read_csv('C:/Dataset.csv')

# Boolean mask: each row is True (training) with probability 0.7
msk = np.random.rand(len(df)) <= 0.7

train = df[msk]   # roughly 70% of the rows
test = df[~msk]   # the remaining rows

      



You can use the read_csv() function from the pandas module. It reads all of your data straight into a DataFrame, which you can then split into training and test sets. For the split itself, you can use the train_test_split() function from the scikit-learn module.
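As a minimal sketch of that approach: the inline CSV text and its column names are made up for illustration (in practice you would pass your own file path to read_csv), and random_state=42 is an arbitrary seed chosen only to make the split reproducible.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the CSV file: 10 rows, two made-up columns.
csv_text = "feature,label\n" + "\n".join(f"{i},{i % 2}" for i in range(10))
df = pd.read_csv(io.StringIO(csv_text))

# test_size=0.3 holds out 30% of the rows for testing; random_state is an
# arbitrary seed so the shuffle gives the same split on every run.
train, test = train_test_split(df, test_size=0.3, random_state=42)

print(len(train), len(test))
```

With 10 rows this gives 7 training rows and 3 test rows. The rows are shuffled before splitting by default, so no separate shuffling step should be needed.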





A simpler and arguably better approach is to use df.sample, which draws an exact 70% random sample of the rows:

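from numpy.random import RandomState
import pandas as pd

df = pd.read_csv('C:/Dataset.csv')
rng = RandomState()  # pass a seed, e.g. RandomState(0), for a reproducible split

# Randomly sample 70% of the rows for training
train = df.sample(frac=0.7, random_state=rng)
# The test set is every row whose index is not in the training set
test = df.loc[~df.index.isin(train.index)]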

      
