How to read .xls in parallel using pandas?

I would like to read a large .xls file in parallel using pandas. I am currently using this:

import multiprocessing as mp

import pandas as pd

LARGE_FILE = "LARGEFILE.xlsx"
CHUNKSIZE = 100000  # process 100,000 rows at a time

def process_frame(df):
    # process a single chunk; here it just returns the row count
    return len(df)

if __name__ == '__main__':
    reader = pd.read_excel(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 worker processes

    funclist = []
    for df in reader:
        # dispatch each chunk to the pool asynchronously
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # timeout after 10 seconds

      

This runs, but it does not seem to speed up reading the file. Is there a more efficient way to achieve this?

1 answer


Just for your information: I read a 13 MB, 29,000-row CSV file in about 4 seconds without any parallel processing (Arch Linux, AMD Phenom II X2, Python 3.4, python-pandas 0.16.2).

How big is your file, and how long does it take to read? Knowing that would help pin down the problem. Is your Excel sheet very complex? Perhaps read_excel has a hard time handling that complexity. Timing the plain read on its own, as in the sketch below, is a good first step.
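A minimal timing sketch (the file name is borrowed from the question; this measures only the raw read):

import time

import pandas as pd

start = time.perf_counter()
df = pd.read_excel("LARGEFILE.xlsx")  # file name from the question
elapsed = time.perf_counter() - start
print("Read {0} rows in {1:.1f} s".format(len(df), elapsed))  # str.format keeps this Python 3.4-friendly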



Suggestion: install Gnumeric and use its ssconvert command-line tool to convert the file to CSV, then switch your program to read_csv. Compare the time ssconvert takes against the time read_csv takes. By the way, pandas gained significant performance improvements between versions 0.13 and 0.16, so it is worth checking that you have the latest version. A rough sketch of the conversion pipeline follows.
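This sketch assumes LARGEFILE.xlsx from the question, a hypothetical intermediate file LARGEFILE.csv, and that Gnumeric's ssconvert is on your PATH (it infers the output format from the file extension):

import multiprocessing as mp
import subprocess

import pandas as pd

LARGE_FILE = "LARGEFILE.xlsx"
CSV_FILE = "LARGEFILE.csv"  # hypothetical intermediate file
CHUNKSIZE = 100000  # rows per chunk, as in the question

def process_frame(df):
    # process a single chunk; here it just returns the row count
    return len(df)

if __name__ == '__main__':
    # one-time conversion with ssconvert; time this step separately
    subprocess.check_call(["ssconvert", LARGE_FILE, CSV_FILE])

    # read_csv with chunksize yields DataFrames one chunk at a time
    pool = mp.Pool(4)
    funclist = [pool.apply_async(process_frame, [df])
                for df in pd.read_csv(CSV_FILE, chunksize=CHUNKSIZE)]
    total = sum(f.get(timeout=10) for f in funclist)
    print("processed", total, "rows")

If the conversion step dominates, parallelizing the pandas side will not help much; comparing the two timings tells you where the bottleneck is.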
