Working with large (15+ GB) CSV datasets and Pandas / XGBoost
I'm trying to find a way to get started with very large CSV files in Pandas, eventually to do some machine learning with XGBoost.
I'm torn between using MySQL or some SQLite mapping to manage chunks of my data; my concern is the machine learning step further down the line, where I would have to load one chunk at a time to train the model.
My other thought was to use Dask, which is built on top of Pandas and also has XGBoost functionality.
I'm not sure what the best starting point is and was hoping to ask for an opinion! I'm leaning toward Dask, but I haven't used it yet.
This blogpost goes through an example using XGBoost on a large CSV dataset. However, the author did it on a distributed cluster with enough RAM to fit the entire dataset into memory at once. While many dask.dataframe operations can work in a small space, I don't think training XGBoost is likely to be one of them. XGBoost seems to work best when all data is available at all times.
I haven't tried this, but I would load your data into an HDF5 file using h5py. This library lets you store data on disk but access it like a NumPy array, so your dataset is no longer limited by available memory.
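As a rough sketch of that conversion step, the CSV could be streamed into a resizable HDF5 dataset one chunk at a time. The function name `csv_to_hdf5`, the dataset name `data`, the `float32` dtype, and the chunk size are all assumptions on my part, and the sketch only handles files where every column is numeric.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd


def csv_to_hdf5(csv_path, h5_path, dataset="data", chunksize=100_000):
    """Stream a purely numeric CSV into one resizable HDF5 dataset.

    Hypothetical helper: only `chunksize` rows are held in memory at a time.
    Returns the number of rows written.
    """
    rows_written = 0
    with h5py.File(h5_path, "w") as f:
        dset = None
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            block = chunk.to_numpy(dtype="float32")
            if dset is None:
                # maxshape=(None, n_cols) lets the row axis grow later.
                dset = f.create_dataset(
                    dataset,
                    shape=(0, block.shape[1]),
                    maxshape=(None, block.shape[1]),
                    dtype="float32",
                )
            # Grow the dataset, then write the new rows at the end.
            dset.resize(rows_written + block.shape[0], axis=0)
            dset[rows_written:rows_written + block.shape[0]] = block
            rows_written += block.shape[0]
    return rows_written


if __name__ == "__main__":
    # Small synthetic CSV standing in for the real file.
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "demo.csv")
        h5_path = os.path.join(tmp, "demo.h5")
        values = np.arange(750, dtype="float32").reshape(250, 3)
        pd.DataFrame(values, columns=["a", "b", "c"]).to_csv(csv_path, index=False)
        print(csv_to_hdf5(csv_path, h5_path, chunksize=100))
```

Once the file exists, slicing the dataset (for example `f["data"][:1000]`) reads only those rows from disk and returns a regular NumPy array.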
For the XGBoost part, I would use the sklearn API and pass the h5py object as the value of X. I recommend the sklearn API because it accepts NumPy arrays as input, which should allow h5py objects to work. Be sure to use a small value for subsample, otherwise you will likely run out of memory quickly.