Dask read_csv upcasts bool to object
I am working with monthly transaction files that I need to analyze for a whole year, and the full year's data does not fit in memory, so I am using Dask for this.
First, I processed my files in pandas so that they take up as little memory as possible (11 columns: 3 bool and 8 int) and wrote them to CSV from pandas. If I re-read them through pandas, the types are what they should be. However, if I read them through Dask, the 3 bool columns are typed as object, which takes up a lot more memory.
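Roughly, the preprocessing looked like this (the column names and paths here are simplified placeholders, not my real ones):

import pandas as pd

df = pd.read_csv('localpath/Transactions_2016_raw.csv')
df['quantity'] = df['quantity'].astype('int32')  # downcast the int columns
df.to_csv('localpath/Transactions_2016.csv.bz2', index=False, compression='bz2')

# re-reading through pandas gives back bool and int dtypes as expected
check = pd.read_csv('localpath/Transactions_2016.csv.bz2')
check.dtypes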
I checked whether there was a mixed-types problem, but since I made the bool columns in pandas by adding new columns initialised to False and then setting them to True based on some condition, I didn't find any missing values or None.
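For example (with a made-up column name and condition), each bool column was built like this, so there are no NaNs that could force an object dtype:

df['is_large'] = False                          # every row starts as False
df.loc[df['amount'] > 1000, 'is_large'] = True  # set True where the condition holds
df['is_large'].isnull().sum()                   # 0, no missing values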
I also tried passing dtype to the read_csv call, which should be forwarded to pandas' read_csv, but that doesn't change the types; they still remain object.
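Concretely, the attempt looked roughly like this (the bool column names are placeholders):

import dask.dataframe as dd

bool_cols = {'flag_a': bool, 'flag_b': bool, 'flag_c': bool}
df = dd.read_csv('localpath/Transactions_*.csv.bz2', compression='bz2', dtype=bool_cols)
df.dtypes  # the three columns still come back as object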
Is there a way to get around this or am I missing something?
-Milan
Edit: @MRocklin, I'm not doing anything more than the following:
import dask.dataframe as dd
import pandas as pd
Transaction_folder = 'localpath/Transactions_*.csv.bz2'
trans_reader = pd.read_csv('localpath/Transactions_2016.csv.bz2', nrows=100000)  # check that reading one file via pandas works as expected
trans_reader.dtypes # gives bools and ints as expected
but then:
df = dd.read_csv(Transaction_folder, compression='bz2')
df.dtypes  # gives ints and floats for the right columns, but object for everything that was bool
Could this be due to compression?
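One way I suppose I could test that is to decompress a single month and read it through Dask without compression (untested sketch):

import bz2, shutil
import dask.dataframe as dd

with bz2.open('localpath/Transactions_2016.csv.bz2', 'rb') as src, \
        open('localpath/Transactions_2016.csv', 'wb') as dst:
    shutil.copyfileobj(src, dst)  # write out a plain, uncompressed copy

df_plain = dd.read_csv('localpath/Transactions_2016.csv')
df_plain.dtypes  # if the bools survive here, compression is the culprit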