Pandas: read_csv how to force data bool to use dtype bool instead of object

Question

I am reading in a large flat file that has multiple columns of temp data. One of them is a boolean column whose values can be True or False, or missing entirely (which is read as NaN).

When the CSV is read, the bool column is typecast to object, which prevents the data from being stored in an HDFStore because of a serialization error.

Example data:

A    B    C    D
a    1    2    true
b    5    7    false
c    3    2    true
d    9    4

I am using the following command to read

import pandas as pd
pd.read_csv('data.csv', parse_dates=True)
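
The symptom can be reproduced in memory with a minimal sketch like the one below. It assumes the sample above is whitespace-separated (the real file may use a different delimiter), and the name `sample` is only illustrative.

```python
import io
import pandas as pd

# Sample from the question, assumed to be whitespace-separated.
sample = """A    B    C    D
a    1    2    true
b    5    7    false
c    3    2    true
d    9    4
"""

df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# The last row has no value in D, so that cell is read as NaN and
# the column cannot be stored as plain bool.
print(df.dtypes)
print(df["D"].isna().sum())
```

Column D holds a mix of booleans and one NaN, which is why it comes back as object.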

One solution is to specify the dtype while reading the CSV, but I was hoping for something more concise, like convert_objects, where I can specify parse_numeric or parse_dates.

+3

python pandas

Prasanjit prakash Apr 20 '15 at 5:27

2 answers

You can use the dtype parameter; it takes a dictionary that maps columns to types:

dtype : Type name or dict of column -> type
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}

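import pandas as pd
import io

# using your sample (StringIO, since the literal is text rather than bytes)
csv_file = io.StringIO('''
A    B    C    D
a    1    2    true
b    5    7    false
c    3    2    true
d    9    4''')

# plain bool cannot hold NaN and np.bool no longer exists in current NumPy,
# so read D with the nullable 'boolean' dtype instead (needs pandas >= 1.0)
df = pd.read_csv(csv_file, sep=r'\s+', dtype={'D': 'boolean'})
# then fillna to convert the missing value to False, and cast to plain bool
df['D'] = df['D'].fillna(False).astype(bool)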

df 
   A  B  C      D
0  a  1  2   True
1  b  5  7  False
2  c  3  2   True
3  d  9  4  False

df.D.dtypes
dtype('bool')
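
If the missing entry should stay missing rather than become False, one option is pandas' nullable boolean dtype (available from pandas 1.0). The following is a sketch under that assumption, not part of the original answer; `sample` is an illustrative name for the question's data.

```python
import io
import pandas as pd

sample = """A    B    C    D
a    1    2    true
b    5    7    false
c    3    2    true
d    9    4
"""

# "boolean" (lowercase string) selects the nullable BooleanDtype,
# which stores True, False and <NA> in a single typed column.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", dtype={"D": "boolean"})

print(df["D"].dtype)
print(df["D"].isna().sum())
```

Whether a given storage backend accepts this dtype is a separate question; casting with astype(bool) after filling the gaps gives a plain NumPy bool column.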

+1

Anzel Apr 20 '15 at 6:04

EdChum · Accepted Answer · 2015-04-20T06:11:05+0000

Because your CSV has a missing value, the column is read with dtype object: it holds mixed types, since the first three rows are booleans and the last one is NaN, which is a float.

To replace the NaN values, use fillna; it takes a dict that maps columns to the desired fill values, so that each column ends up with values of a single type:

In [9]:

import io
import pandas as pd

t="""A    B    C    D
a    1    NaN    true
b    5    7    false
c    3    2    true
d    9    4"""

# raw string so that \s is passed to the regex engine unchanged
df = pd.read_csv(io.StringIO(t), sep=r'\s+')

df
Out[9]:
   A  B   C      D
0  a  1 NaN   True
1  b  5   7  False
2  c  3   2   True
3  d  9   4    NaN
In [11]:

df.fillna({'C':0, 'D':False})
Out[11]:
   A  B  C      D
0  a  1  0   True
1  b  5  7  False
2  c  3  2   True
3  d  9  4  False
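
Depending on the pandas version, fillna may leave column D as object even though every value in it is now a bool. A minimal sketch that adds an explicit astype(bool) to guarantee the dtype (the cast is an addition here, not part of the answer above):

```python
import io
import pandas as pd

t = """A    B    C    D
a    1    NaN    true
b    5    7    false
c    3    2    true
d    9    4"""

df = pd.read_csv(io.StringIO(t), sep=r"\s+")

# Fill each column with its own default value, then cast D explicitly
# so the result is bool regardless of how fillna treats object columns.
df = df.fillna({"C": 0, "D": False})
df["D"] = df["D"].astype(bool)

print(df.dtypes)
```

Casting only after the gaps are filled matters: applied directly to NaN, astype(bool) would turn the missing value into True.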
