Pandas: delete column and free memory

I am processing a large dataset with about 20,000,000 rows and 4 columns. Unfortunately, the memory available on my machine (~16 GB) is not enough.

Example (time is seconds since midnight):

           Date   Time   Price     Vol
0      20010102  34222  51.750  227900
1      20010102  34234  51.750    5600
2      20010102  34236  51.875   14400

      

Then I convert the dataset to my own time series object:

                         Date   Time   Price     Vol
2001-01-02 09:30:22  20010102  34222  51.750  227900
2001-01-02 09:30:34  20010102  34234  51.750    5600
2001-01-02 09:30:36  20010102  34236  51.875   14400
2001-01-02 09:31:03  20010102  34263  51.750    2200

      

To free up memory, I want to drop the redundant Date and Time columns. I do this with the .drop() method, but the memory is not freed. I also tried calling gc.collect() afterwards, but that didn't help either.

This is the code I use for the steps described above. The del statement frees memory, but the drop calls do not.

# Store date and time components
m, s = divmod(data.Time.values, 60)
h, m = divmod(m, 60)
s, m, h = pd.Series(np.char.mod('%02d', s)), pd.Series(np.char.mod('%02d', m)), pd.Series(np.char.mod('%02d', h))

# Set time series index
data = data.set_index(pd.to_datetime(data.Date.reset_index(drop=True).apply(str) + h + m + s, format='%Y%m%d%H%M%S'))

# Remove redundant information
del s, m, h
data.drop('Date', axis=1, inplace=True)
data.drop('Time', axis=1, inplace=True)

      

How can I free memory from a pandas dataframe?


2 answers


del data['Date']
del data['Time']

      



This will free up memory.
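
To check the effect, a minimal sketch can compare the footprint pandas reports before and after deleting the columns. The small synthetic frame below is only a stand-in for the real dataset, and its values are made up. Note that memory_usage shows what pandas accounts for in the frame, which is not necessarily what the OS has reclaimed from the process.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (values are illustrative only).
n = 1_000
data = pd.DataFrame({
    "Date": np.full(n, 20010102, dtype=np.int64),
    "Time": np.arange(n, dtype=np.int64),
    "Price": np.full(n, 51.75),
    "Vol": np.full(n, 5600, dtype=np.int64),
})

# Bytes used by the column data, as reported by pandas (index excluded).
before = data.memory_usage(index=False).sum()

# Remove the redundant columns as suggested in this answer.
del data["Date"]
del data["Time"]

after = data.memory_usage(index=False).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

If the process-level memory matters more than the frame's own footprint, it would have to be measured with an external tool, since this sketch does not do that.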


The one thing you can always count on when it comes to Python and freeing memory is that the OS reclaims a process's resources when the process exits. So, in this case, I would suggest the following:

  • From your main process, start a multiprocessing.Process.
  • The child process function must:
    • read the DataFrame
    • execute drop
    • write the DataFrame to a file
    • return.
  • From the main process, join the child process and read the reduced DataFrame from disk.
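
The steps above could be sketched roughly as follows. The function names drop_and_save and load_reduced, the file names, and the choice of CSV input and pickle output are all assumptions made for illustration, not part of the original suggestion. The sketch also assumes the "fork" start method, which is only available on Unix.

```python
import multiprocessing as mp
import os
import tempfile

import pandas as pd


def drop_and_save(src_path, dst_path):
    """Runs in the child: read, drop the redundant columns, write the result."""
    data = pd.read_csv(src_path)
    data = data.drop(columns=["Date", "Time"])
    data.to_pickle(dst_path)
    # Once this function returns, the child exits and the OS reclaims
    # everything the child allocated.


def load_reduced(src_path, dst_path):
    """Runs in the parent: delegate the heavy work, then read the small result."""
    # Assumption: "fork" start method (Unix only).
    ctx = mp.get_context("fork")
    child = ctx.Process(target=drop_and_save, args=(src_path, dst_path))
    child.start()
    child.join()
    if child.exitcode != 0:
        raise RuntimeError(f"child process failed with exit code {child.exitcode}")
    return pd.read_pickle(dst_path)


# Tiny demonstration with a made-up file standing in for the real dataset.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "raw.csv")
dst = os.path.join(tmp_dir, "reduced.pkl")
pd.DataFrame({
    "Date": [20010102, 20010102],
    "Time": [34222, 34234],
    "Price": [51.75, 51.75],
    "Vol": [227900, 5600],
}).to_csv(src, index=False)

reduced = load_reduced(src, dst)
print(reduced.columns.tolist())
```

With the "spawn" start method (the default on Windows and macOS), the demonstration part would need to sit under an if __name__ == "__main__": guard. For the dataset in the question, building the datetime index would presumably also belong in the child function, so that the temporary string Series never exist in the main process.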
