Pandas: delete column and free memory
I am processing a large dataset with about 20,000,000 rows and 4 columns. Unfortunately, the memory available on my machine (~16 GB) is not enough.
Example (time is seconds since midnight):
Date Time Price Vol
0 20010102 34222 51.750 227900
1 20010102 34234 51.750 5600
2 20010102 34236 51.875 14400
Then I convert the dataset to my own time series object:
Date Time Price Vol
2001-01-02 09:30:22 20010102 34222 51.750 227900
2001-01-02 09:30:34 20010102 34234 51.750 5600
2001-01-02 09:30:36 20010102 34236 51.875 14400
2001-01-02 09:31:03 20010102 34263 51.750 2200
To free up memory, I want to drop the redundant Date and Time columns. I do this with the .drop() method, but the memory is not freed. I also tried calling gc.collect() afterwards, but that didn't help either.
This is the code I use for the steps described above. The del statement frees memory, but the drop calls do not.
# Store date and time components
m, s = divmod(data.Time.values, 60)
h, m = divmod(m, 60)
s, m, h = pd.Series(np.char.mod('%02d', s)), pd.Series(np.char.mod('%02d', m)), pd.Series(np.char.mod('%02d', h))
# Set time series index
data = data.set_index(pd.to_datetime(data.Date.reset_index(drop=True).apply(str) + h + m + s, format='%Y%m%d%H%M%S'))
# Remove redundant information
del s, m, h
data.drop('Date', axis=1, inplace=True)
data.drop('Time', axis=1, inplace=True)
How can I free memory from a pandas dataframe?
There is one thing you can always count on when it comes to Python and freeing memory: the OS reclaims a process's resources when that process exits. So, in this case, I would suggest the following:
- From your main process, start a multiprocessing.Process.
- The child process function must:
  - read the DataFrame
  - execute drop
  - write the DataFrame to a file
  - return.
- From the main process, join the child process and read the reduced DataFrame from disk.
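The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the function names (`_drop_and_save`, `drop_in_child`), the file names, and the choice of pickle as the on-disk format are all hypothetical and not part of the original suggestion. It also assumes the raw data is already on disk, so that the parent process never holds the full DataFrame itself.

```python
import multiprocessing as mp
import os
import tempfile

import pandas as pd


def _drop_and_save(src_path, dst_path, columns):
    """Runs in the child: read the DataFrame, drop columns, write it to disk."""
    data = pd.read_pickle(src_path)
    data = data.drop(columns=columns)
    data.to_pickle(dst_path)
    # When this function returns, the child exits and the OS reclaims
    # all of the memory the child was using.


def drop_in_child(src_path, dst_path, columns):
    """Run the drop in a short-lived child process, then load the result."""
    # Assumption: prefer the "fork" start method where it exists, so the
    # child does not need to re-import this module. On platforms without
    # fork, this code must live in an importable module.
    methods = mp.get_all_start_methods()
    ctx = mp.get_context("fork" if "fork" in methods else None)
    child = ctx.Process(target=_drop_and_save,
                        args=(src_path, dst_path, columns))
    child.start()
    child.join()  # wait until the child has written the file and exited
    if child.exitcode != 0:
        raise RuntimeError(f"child failed with exit code {child.exitcode}")
    return pd.read_pickle(dst_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "raw.pkl")        # hypothetical file names
        dst = os.path.join(tmp, "reduced.pkl")
        # Tiny stand-in for the real 20,000,000-row dataset
        pd.DataFrame({
            "Date": [20010102, 20010102],
            "Time": [34222, 34234],
            "Price": [51.750, 51.750],
            "Vol": [227900, 5600],
        }).to_pickle(src)
        reduced = drop_in_child(src, dst, ["Date", "Time"])
        print(list(reduced.columns))
```

The parent only ever loads the reduced file, so whatever memory pandas held on to during the drop disappears together with the child process. The price is one extra round trip through the disk.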