Is there a way to parallelize Pandas' Append method?
I have 100 XLS files that I would like to combine into one CSV file. Is there a way to improve the speed of putting them all together?
The problem with using concat is that it lacks the arguments that to_csv provides:
import glob

import pandas as pd

listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
    print(a_file)
    data = pd.read_excel(a_file, sheet_name=0, skiprows=range(1, 2), header=1)
    frame = frame.append(data)

# Save to CSV.
frame.info()
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")
2 answers
Using multiprocessing, you can read them in parallel with something like:
import multiprocessing

import pandas as pd

dfs = multiprocessing.Pool().map(pd.read_excel, f_names)
and then combine them into one:
df = pd.concat(dfs)
You should check whether the parallel version is actually faster than the sequential one:

dfs = list(map(pd.read_excel, f_names))

YMMV: it depends on the files, the disk, and so on.
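Putting the pieces together, here is a minimal self-contained sketch of the parallel approach, using the same read arguments as the question; the glob pattern and output file name are placeholders, not values from the original post:

import glob
import multiprocessing

import pandas as pd

def read_one(path):
    # Same read arguments as in the question: first sheet, skip the second row,
    # header taken from row index 1.
    return pd.read_excel(path, sheet_name=0, skiprows=range(1, 2), header=1)

if __name__ == '__main__':  # guard required for multiprocessing on Windows
    f_names = glob.glob('xls_files/*.xls')  # placeholder pattern
    with multiprocessing.Pool() as pool:
        # A named module-level function is used because Pool.map cannot pickle lambdas.
        dfs = pool.map(read_one, f_names)
    merged = pd.concat(dfs)
    merged.to_csv('combined.csv', index=False, encoding='utf-8',
                  date_format='%Y-%m-%d')

Note that only the reading is parallelized here; the concat and the CSV write still happen in a single process.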
It would be better to read them into a list and then call concat once:
merged = pd.concat(df_list)
so something like:

import pandas as pd

df_list = []
for f in xl_list:  # xl_list: the list of input file paths
    df_list.append(pd.read_csv(f))  # or pd.read_excel for Excel files
merged = pd.concat(df_list)
The problem with appending to a DataFrame repeatedly is that on every append memory has to be allocated for the new, larger frame and all existing contents copied into it; reading into a list first means the allocation and copy happen only once, inside concat.
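To make the cost concrete, here is a small illustrative timing sketch on synthetic data (DataFrame.append was later removed in pandas 2.0, so the first loop only runs on older versions; the shapes and sizes here are arbitrary, chosen just to show the gap):

import time

import pandas as pd

parts = [pd.DataFrame({'x': range(10000)}) for _ in range(200)]

# Append in a loop: every iteration allocates a bigger frame and copies
# everything accumulated so far, so total copying grows quadratically.
start = time.time()
frame = pd.DataFrame()
for part in parts:
    frame = frame.append(part)
print('append loop:', time.time() - start)

# concat once: a single allocation and one pass of copying.
start = time.time()
merged = pd.concat(parts)
print('concat once:', time.time() - start)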