How to select only the complete rows of a pandas DataFrame

I have the following dataset in Python:

import pandas as pd
bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)

      

Rows such as row 24 have missing values:

1057013,8,4,5,1,2,?,7,3,1,4

      

Column 7 contains a '?' and I want to delete this row. How can I achieve this?



2 answers


For your specific example in column 7:

bcw = bcw[bcw[7] != '?']

      



However, I did load the dataset and found the same anomaly in column 6 (with header=None the columns are labelled from 0), so this code will go through all columns, look for '?', and remove the matching rows:

for col in bcw.columns:
    if bcw[col].dtype != 'int64':
        print("Removing possible '?' in column %s..." % col)
        bcw = bcw[bcw[col] != '?']

>>> Removing possible '?' in column 6...
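
As an alternative to the loop above (not what this answer does, just a sketch): pandas can be told to treat '?' as a missing value while reading, after which `dropna()` removes the incomplete rows. The inline sample below stands in for the real file so that the snippet runs without a network connection; its rows only mimic the layout of the dataset.

```python
import io
import pandas as pd

# Small stand-in for the real CSV; the second row has a '?' in column 6.
sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
)

# na_values='?' makes pandas parse '?' as NaN instead of as a string.
df = pd.read_csv(sample, header=None, na_values='?')

# dropna() removes every row that contains at least one NaN.
complete = df.dropna()
print(complete)
```

One side effect to be aware of: a column that contained NaN is read as float, so column 6 stays float after the rows are dropped unless it is cast back with `astype(int)`.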

      



You may try

import numpy as np
irow = np.all(np.array(bcw) != '?', axis=1)
bcw = bcw.loc[irow, :]  # .ix was removed in pandas 1.0; .loc accepts a boolean mask

      

np.array(bcw) != '?' results in a boolean array that marks the positions that are not '?' (I tried to compare bcw with '?' directly but got errors, so I convert it to an np.array first).

np.all(xx, axis=1) reduces a 2-dimensional boolean array to a 1-dimensional one. axis=1 means row-wise: an element of the result is True if and only if all the elements in the corresponding row are True. We now have a boolean index array marking the rows that do not contain '?'.
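
To make the two steps concrete, here is a minimal sketch on a made-up 3x3 frame (the name `toy` and its values are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only the middle row contains '?'.
toy = pd.DataFrame([[1, 2, 3], [4, '?', 6], [7, 8, 9]])

# Step 1: 2-D boolean array, False wherever a cell equals '?'.
not_q = np.array(toy) != '?'

# Step 2: collapse each row to one flag, True only if the whole row is True.
irow = np.all(not_q, axis=1)

print(not_q)
print(irow)
```

Here `irow` is True for the first and last rows and False for the middle one, so indexing with it keeps only the rows without '?'.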



Since irow is a boolean index array, you can also index bcw using the following forms:

bcw.loc[irow]
bcw[irow]

      

But if irow is an array of integer indices instead of booleans, the last form will throw an error. I am a little confused about pandas DataFrame indexing, so I would be grateful if someone could explain this.
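
As far as I understand it, the difference is that `df[...]` with a boolean array selects rows, while `df[...]` with anything else is treated as a list of column labels. A small sketch on a made-up frame (`toy`, `mask`, and `pos` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [10, 20, 30], 'b': ['x', '?', 'z']})

mask = np.array([True, False, True])  # boolean: one flag per row
pos = np.array([0, 2])                # integer: row positions

rows_by_mask = toy[mask]      # boolean array -> row selection
rows_by_loc = toy.loc[mask]   # same rows via .loc
rows_by_pos = toy.iloc[pos]   # integer positions need .iloc

# toy[pos] is interpreted as "columns labelled 0 and 2",
# which do not exist here, so pandas raises KeyError.
raised = False
try:
    toy[pos]
except KeyError as exc:
    raised = True
    print("KeyError:", exc)
```

So for integer row positions `.iloc` is the form to use, and plain brackets are best kept for boolean masks and column labels.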
