Dask read_csv upcasts bool to object

I am working with monthly transaction files that I have to analyze for a whole year, which does not fit in memory. So I am using Dask for this.

First, I processed my files in pandas so that they would have the smallest possible memory footprint (11 columns: 3 bool and 8 int) and wrote them to CSV from pandas. If I re-read them through pandas, the types are what they should be. However, if I read them through Dask, the 3 bool columns are typed as object, which takes up a lot more memory.

I checked whether there was a mixed-types problem, but since I created the bool columns in pandas by adding new columns with an initial False value and then setting it to True based on some condition, I didn't find any missing or None values.
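For illustration, this is roughly how the bool columns were built (the column name, condition, and data here are made up):

import pandas as pd

df = pd.DataFrame({'amount': [10, -5, 30]})   # made-up data
df['is_refund'] = False                       # new column, initialized to False everywhere
df.loc[df['amount'] < 0, 'is_refund'] = True  # flipped to True based on a condition
df.dtypes  # is_refund is bool; no NaN/None involved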

I also tried adding dtypes to the read_csv call, which should be passed through to pandas' read_csv, but that doesn't change the types; they still come back as object.
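Concretely, that attempt looked something like this (the column names are placeholders standing in for my 3 real bool columns):

import dask.dataframe as dd

# Placeholder names for my 3 real bool columns.
bool_dtypes = {'flag_a': bool, 'flag_b': bool, 'flag_c': bool}
df = dd.read_csv('localpath/Transactions_*.csv.bz2', compression='bz2', dtype=bool_dtypes)
df.dtypes  # the 3 columns still show up as object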

Is there a way to get around this or am I missing something?

-Milan

Edit: @MRocklin I don't do anything more than the following:

import dask.dataframe as dd
import pandas as pd

Transaction_folder = 'localpath/Transactions_*.csv.bz2'

# check that reading in 1 file via pandas works as expected
trans_reader = pd.read_csv('localpath/Transactions_2016.csv.bz2', nrows=100000)
trans_reader.dtypes  # gives bools and ints as expected

but then:

df = dd.read_csv(Transaction_folder, compression='bz2')
df.dtypes  # gives ints and floats for the right columns, but object for everything that was bool

Could this be due to compression?
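One thing I could still try to isolate the compression variable is reading a decompressed copy of a single month through Dask (I haven't done this yet):

import dask.dataframe as dd

# Hypothetical check: assumes a decompressed copy of one monthly file exists.
df_plain = dd.read_csv('localpath/Transactions_2016.csv')
df_plain.dtypes  # if these come back bool, bz2 handling is the suspect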
