How to select only the complete rows of a pandas DataFrame
I have the following dataset in Python:
import pandas as pd
bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
Rows such as row 24 have missing values:
1057013,8,4,5,1,2,?,7,3,1,4
Column 7 contains a '?', and I want to delete this row. How can I achieve this?
For your specific example, in column 7:
bcw = bcw[bcw[7] != '?']
However, I loaded the dataset and found the same anomaly in column 6, so this code goes through all columns looking for '?' and removes the matching rows:
for col in bcw.columns:
    # integer columns cannot contain '?', so only check the others
    if bcw[col].dtype != 'int64':
        print("Removing possible '?' in column %s..." % col)
        bcw = bcw[bcw[col] != '?']
>>> Removing possible '?' in column 6...
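As an alternative to looping over the columns, here is a minimal sketch that treats '?' as a missing value when the file is read and then drops incomplete rows. It uses the `na_values` parameter of `pd.read_csv` and `DataFrame.dropna`, which is a different technique from the loop above. To keep it runnable without a download, it reads a small in-memory sample: the middle row is the one quoted in the question, the other two rows are made up for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for the real file. Only the middle row comes from the
# question; the first and last rows are hypothetical.
sample = io.StringIO(
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
    "2,5,4,4,5,7,10,3,2,1,2\n"
)

# na_values="?" makes read_csv parse every '?' as NaN (missing).
df = pd.read_csv(sample, header=None, na_values="?")

# dropna() removes each row that has at least one missing value.
complete = df.dropna()
print(complete)
```

With the real dataset, the same two calls should work with the URL from the question in place of `sample`. Note that a column that contained '?' is parsed as float rather than int, because NaN is a float value.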
You can try:
import numpy as np
irow = np.all(np.array(bcw) != '?', axis=1)  # True for rows without any '?'
bcw = bcw.loc[irow, :]
np.array(bcw) != '?' results in a boolean array that marks the positions that are not '?'. (I tried to compare bcw with '?' directly but got errors, so I convert it to an np.array first.)
np.all(xx, axis=1) reduces the 2-dimensional boolean array to a 1-dimensional one. axis=1 means the reduction runs along each row: an element of the result is True if and only if all the elements in the corresponding row are True. We now have a boolean index array marking the rows that do not contain '?'.
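To make the two steps concrete, here is a small sketch on a toy frame. The frame `toy` and its values are hypothetical and only stand in for bcw; the mask logic is the same as in the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has a '?' in column 1.
toy = pd.DataFrame({0: [1, 2, 3], 1: ["4", "?", "6"]})

# Step 1: element-wise comparison gives a 2-D boolean array,
# True wherever the value is not '?'.
mask = np.array(toy) != "?"

# Step 2: reduce along each row; True only if the whole row is True.
irow = np.all(mask, axis=1)

# Keep the rows whose mask entry is True.
kept = toy.loc[irow, :]
print(mask)
print(irow)
print(kept)
```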
Since irow is a boolean index array, you can also index bcw using the following forms:
bcw.loc[irow]
bcw[irow]
But if irow is an array of integer indices instead of booleans, the last form will not work: bcw[...] then looks the integers up as column labels rather than row positions, so it either raises a KeyError or returns columns. I am a little confused about pandas DataFrame indexing, so I would be grateful if someone could explain it.
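On the boolean versus integer question, the following sketch may help. It assumes a hypothetical frame `df` with string column labels, so that integer lookups cannot accidentally match a column name. A boolean mask selects rows with both `df[...]` and `df.loc[...]`, while integer positions select rows only through `df.iloc[...]`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with string column labels.
df = pd.DataFrame({"a": [10, 20, 30], "b": ["x", "?", "z"]})

bool_mask = np.array([True, False, True])
int_pos = np.flatnonzero(bool_mask)  # integer positions of the True entries

rows_by_mask = df[bool_mask]      # boolean mask: selects rows
rows_by_loc = df.loc[bool_mask]   # same rows via .loc
rows_by_pos = df.iloc[int_pos]    # integer positions: use .iloc

# Plain df[...] with integers is a column lookup by label, so it fails here
# because no column is named 0 or 2.
try:
    df[int_pos]
    raised = False
except KeyError:
    raised = True

print(rows_by_pos)
print("df[int_pos] raised KeyError:", raised)
```

If the column labels are themselves integers, as in bcw when read with header=None, the integer lookup may succeed but return columns instead of rows.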