Pandas: how to force read_csv to use dtype bool instead of object for a boolean column
I am reading in a large flat file that has multiple columns of temp data. The data has a boolean column whose values can be True / False or missing (which evaluates to NaN).
When reading the CSV, the bool column gets typecast as object, which prevents the data from being stored in an HDFStore due to a serialization error.
Example data:
A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4
I am using the following command to read it:
import pandas as pd
pd.read_csv('data.csv', parse_dates=True)
One solution is to specify the dtype while reading the CSV, but I was hoping for a more concise solution, like convert_objects, where I can specify parse_numeric or parse_dates.
Because your CSV has a missing value, the column ends up with dtype object: it holds mixed types, since the first three rows are boolean and the last one is NaN, which is a float.
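The mixed types can be checked directly. A minimal sketch, assuming a recent pandas and the whitespace-separated sample from the question (the variable names are only illustrative):

```python
import io

import pandas as pd

# Sample from the question; the last row has no value for D.
sample = """A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4"""

df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# D cannot be inferred as bool because of the missing value.
d_dtype = df["D"].dtype

# Collect the type name of each element in D: booleans plus one float (NaN).
element_types = [type(v).__name__ for v in df["D"]]

print(d_dtype)
print(element_types)
```

Inspecting the element types this way makes it easy to confirm that the missing value, not the true/false strings, is what stops pandas from inferring bool.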
To convert the NaN values, use fillna. It takes a dict that maps columns to the desired fill values, which produces a homogeneous type.
In [9]:
import io
import pandas as pd

t = """A B C D
a 1 NaN true
b 5 7 false
c 3 2 true
d 9 4"""

df = pd.read_csv(io.StringIO(t), sep=r'\s+')

df
Out[9]:
   A  B    C      D
0  a  1  NaN   True
1  b  5    7  False
2  c  3    2   True
3  d  9    4    NaN
In [11]:
df.fillna({'C':0, 'D':False})
Out[11]:
   A  B  C      D
0  a  1  0   True
1  b  5  7  False
2  c  3  2   True
3  d  9  4  False
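Depending on the pandas version, fillna may leave D as object even once every value is a boolean, so an explicit cast is a safe final step. A minimal sketch, assuming that missing values should be treated as False (an assumption about your data, not something pandas decides for you):

```python
import io

import pandas as pd

t = """A B C D
a 1 NaN true
b 5 7 false
c 3 2 true
d 9 4"""

df = pd.read_csv(io.StringIO(t), sep=r"\s+")

# Fill column by column, then cast D explicitly so the result does not
# depend on whether fillna downcasts object columns in the installed pandas.
df["C"] = df["C"].fillna(0)
df["D"] = df["D"].fillna(False).astype(bool)

print(df.dtypes)
```

After the cast, D is a plain bool column, which is the dtype the question is after.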
You can use dtype, which takes a dictionary mapping columns to types:
dtype : Type name or dict of column -> type
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
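import io

import pandas as pd

# using your sample
csv_file = io.StringIO('''
A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4''')
# 'boolean' is the nullable boolean dtype (pandas >= 1.0); recent pandas
# versions refuse plain bool for a column that contains missing values
df = pd.read_csv(csv_file, sep=r'\s+', dtype={'D': 'boolean'})
# then fillna to convert the missing value to False, and cast to plain bool
df['D'] = df['D'].fillna(False).astype(bool)
df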
   A  B  C      D
0  a  1  2   True
1  b  5  7  False
2  c  3  2   True
3  d  9  4  False
df.D.dtypes
dtype('bool')
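If several columns need this treatment, the fill-and-cast step can be wrapped in a small function, which gets close to the one-call convenience the question asks for. The sketch below uses a hypothetical helper, coerce_bool_columns (it is not part of pandas), and assumes missing values should become False:

```python
import io

import numpy as np
import pandas as pd


def coerce_bool_columns(df, fill_value=False):
    """Hypothetical helper: cast object columns that hold only booleans
    (plus missing values) to plain bool, replacing gaps with fill_value."""
    out = df.copy()
    for name in out.columns:
        if out[name].dtype != object:
            continue
        present = out[name].dropna()
        # Only touch columns whose non-missing values are all real booleans.
        if len(present) and all(isinstance(v, (bool, np.bool_)) for v in present):
            filled = out[name].where(out[name].notna(), fill_value)
            out[name] = filled.astype(bool)
    return out


sample = """A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4"""

df = coerce_bool_columns(pd.read_csv(io.StringIO(sample), sep=r"\s+"))
print(df.dtypes)
```

String columns such as A are left alone because their values fail the boolean check, so the helper can be applied to the whole frame before writing it to the store.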