Is there a way to parallelize Pandas' Append method?

I have 100 XLS files that I would like to combine into one CSV file. Is there a way to improve the speed of putting them all together?

The problem with using concat is that it lacks the arguments that to_csv gives. Here is my current code:

import glob

import pandas as pd

listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
    print(a_file)
    data = pd.read_excel(a_file, sheetname=0, skiprows=range(1, 2), header=1)
    frame = frame.append(data)

# Save to CSV.
frame.info()
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")

      

2 answers


Using multiprocessing, you can read the files in parallel with something like:

import multiprocessing

import pandas as pd

dfs = multiprocessing.Pool().map(pd.read_excel, f_names)

      

and then combine them into one:

df = pd.concat(dfs)

      




You should probably check whether the parallel read is actually faster than the plain serial version:

dfs = list(map(pd.read_excel, f_names))

      

YMMV: it depends on the files, the disks, and so on.
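Putting the two fragments together with the to_csv call from the question, a minimal sketch might look like the following. The file_location pattern and output_path are placeholders, and the __main__ guard is there because multiprocessing requires it on Windows:

import glob
import multiprocessing

import pandas as pd

file_location = "xls_dir/*.xls"   # placeholder glob pattern
output_path = "combined.csv"      # placeholder output file

def read_one(path):
    # Same read options as in the question; newer pandas versions spell the
    # argument sheet_name instead of sheetname.
    return pd.read_excel(path, sheet_name=0, skiprows=range(1, 2), header=1)

if __name__ == "__main__":
    files = glob.glob(file_location)

    # One worker process per CPU core by default; each reads one workbook.
    with multiprocessing.Pool() as pool:
        dfs = pool.map(read_one, files)

    # Concatenate once, then write with whatever to_csv options you need.
    frame = pd.concat(dfs, ignore_index=True)
    frame.to_csv(output_path, index=False, encoding="utf-8",
                 date_format="%Y-%m-%d")

Parsing Excel files is mostly CPU-bound work, which is why separate processes can help where threads usually would not; as noted above, whether it beats the serial version depends on your files and disks.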



It would be more convenient to read them into a list and then call concat:

merged = pd.concat(df_list)

      

So, something like this:



df_list = []
for f in xl_list:
    df_list.append(pd.read_csv(f))  # or pd.read_excel

merged = pd.concat(df_list)

      

The problem with appending to a DataFrame repeatedly is that each append has to allocate memory for the new size and copy the accumulated contents, and you really only want to do that once.
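As a small, self-contained illustration of that cost (purely synthetic data; DataFrame.append was deprecated and later removed in pandas 2.0, so the slow loop below is written with the equivalent repeated concat):

import pandas as pd

chunks = [pd.DataFrame({"x": range(1_000)}) for _ in range(100)]

# Quadratic: every iteration reallocates and copies everything accumulated
# so far. This is the same pattern as the question's frame.append(data) loop.
slow = chunks[0]
for chunk in chunks[1:]:
    slow = pd.concat([slow, chunk], ignore_index=True)

# Linear: keep references in a list and copy once at the end.
fast = pd.concat(chunks, ignore_index=True)

assert slow.equals(fast)  # same result, very different amount of copying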
