Is there a way to parallelize Pandas' Append method?
I have 100 XLS files that I would like to combine into one CSV file. Is there a way to improve the speed of putting them all together?
The problem with using concat is that it lacks the arguments that to_csv provides:
import glob

import pandas as pd

listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
    print(a_file)
    data = pd.read_excel(a_file, sheet_name=0, skiprows=range(1, 2), header=1)
    frame = frame.append(data)

# Save to CSV.
frame.info()
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")
2 answers
Using multiprocessing, you can read them in parallel with something like:
import multiprocessing

import pandas as pd

dfs = multiprocessing.Pool().map(pd.read_excel, f_names)
and then combine them into one:
df = pd.concat(dfs)
You should check whether the parallel version is actually faster than the sequential one:

dfs = list(map(pd.read_excel, f_names))

YMMV: it depends on the files, the disk, and so on.
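Putting the pieces together, here is a minimal self-contained sketch of the parallel approach, using the same read arguments as the question; the glob pattern and output file name are placeholders, not values from the original post:

import glob
import multiprocessing

import pandas as pd

def read_one(path):
    # Same read arguments as in the question: first sheet, skip the second row,
    # header taken from row index 1.
    return pd.read_excel(path, sheet_name=0, skiprows=range(1, 2), header=1)

if __name__ == '__main__':  # guard required for multiprocessing on Windows
    f_names = glob.glob('xls_files/*.xls')  # placeholder pattern
    with multiprocessing.Pool() as pool:
        # A named module-level function is used because Pool.map cannot pickle lambdas.
        dfs = pool.map(read_one, f_names)
    merged = pd.concat(dfs)
    merged.to_csv('combined.csv', index=False, encoding='utf-8',
                  date_format='%Y-%m-%d')

Note that only the reading is parallelized here; the concat and the CSV write still happen in a single process.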
It would be better to read them into a list and then call concat once:
merged = pd.concat(df_list)
so something like:

import pandas as pd

df_list = []
for f in xl_list:  # xl_list: the list of input file paths
    df_list.append(pd.read_csv(f))  # or pd.read_excel for Excel files
merged = pd.concat(df_list)
The problem with appending to a DataFrame repeatedly is that on every append memory has to be allocated for the new, larger frame and all existing contents copied into it; reading into a list first means the allocation and copy happen only once, inside concat.
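To make the cost concrete, here is a small illustrative timing sketch on synthetic data (DataFrame.append was later removed in pandas 2.0, so the first loop only runs on older versions; the shapes and sizes here are arbitrary, chosen just to show the gap):

import time

import pandas as pd

parts = [pd.DataFrame({'x': range(10000)}) for _ in range(200)]

# Append in a loop: every iteration allocates a bigger frame and copies
# everything accumulated so far, so total copying grows quadratically.
start = time.time()
frame = pd.DataFrame()
for part in parts:
    frame = frame.append(part)
print('append loop:', time.time() - start)

# concat once: a single allocation and one pass of copying.
start = time.time()
merged = pd.concat(parts)
print('concat once:', time.time() - start)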