Is there a chunksize argument for read_excel in pandas?

I am trying to create a progress bar for reading Excel data into pandas using tqdm. I can do this easily with CSV using the chunksize argument like so:

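import pandas as pd
from tqdm import tqdm

# path is the CSV file to read; each chunk is a DataFrame of up to 1000 rows
data_reader = pd.read_csv(path, chunksize=1000)

df_list = []  # created once, outside the loop, so the chunks accumulate
for chunk in tqdm(data_reader, total=200):
    df_list.append(chunk)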

      

This updates the progress bar once for every 1000-row chunk, out of 200 chunks in total. pd.read_excel no longer has a chunksize argument. Is there any alternative?

Edit: I read the question about reading an Excel file in chunks ( reading part of a large xlsx file with python ), however read_excel no longer has a chunksize argument, and pd.ExcelFile.parse is equivalent. I am wondering if there is an alternative to the chunksize argument, or some other way to create an iterable to loop over the chunks while they are being read.


1 answer


If you want to add a progress indicator, you can use the .tell() method of file objects. It's not entirely accurate, of course, but it might give your users enough precision to estimate how long a coffee break they can take :-)

So here's the plan: open your Excel file with open and pass the resulting file object to pd.read_excel. According to the documentation, this should be possible, and I just tested it with a simple example for an xlsx file.

First, you determine the size of the file. For example:

import io

fp = open(path, "rb")  # path is your xlsx file; open it in binary mode
fp.seek(0, io.SEEK_END)  # move the file cursor to the end of the file
fp_len = fp.tell()       # the offset at the end is the file size in bytes
fp.seek(0, io.SEEK_SET)  # move the file cursor back to the beginning

      



With this setup, you have two options:

  1. Either you create a thread that updates the progress bar from time to time by calling fp.tell() on the file object you opened for the xlsx file, or
  2. you create your own wrapper that exposes the methods pandas needs to read the data (at least a read method) and updates the progress bar synchronously, so you don't need an extra thread. Your class just needs to forward method calls to the actual file object. In this sense, you can compare it to a proxy object.
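To make option 2 more concrete, here is a rough sketch, assuming pandas and tqdm are installed. The names ProgressFileProxy, on_progress and read_excel_with_progress are made up for this illustration and are not part of pandas. The proxy overrides only read and forwards everything else to the real file object. Since the reader may seek back and forth inside the file, the bar is only advanced when a new highest offset is reached. I have not checked this against every pandas version, so treat it as a starting point.

```python
import io


class ProgressFileProxy:
    """Hypothetical proxy: forwards to a binary file object and reports
    the current byte offset after every read."""

    def __init__(self, fp, on_progress):
        self._fp = fp
        self._on_progress = on_progress  # called with the offset from fp.tell()

    def read(self, size=-1):
        data = self._fp.read(size)
        self._on_progress(self._fp.tell())
        return data

    def __iter__(self):
        return iter(self._fp)

    def __getattr__(self, name):
        # Everything not defined above (seek, tell, seekable, ...) is
        # forwarded to the real file object.
        return getattr(self._fp, name)


def read_excel_with_progress(path):
    """Illustrative helper: read an xlsx file while showing a byte-based bar."""
    # Imported here so the proxy class itself needs only the standard library.
    import pandas as pd
    from tqdm import tqdm

    with open(path, "rb") as fp:
        fp.seek(0, io.SEEK_END)
        fp_len = fp.tell()
        fp.seek(0, io.SEEK_SET)

        with tqdm(total=fp_len, unit="B", unit_scale=True) as bar:
            def advance(offset):
                # Only move forward; offsets can jump backwards on seeks.
                if offset > bar.n:
                    bar.update(offset - bar.n)

            return pd.read_excel(ProgressFileProxy(fp, advance))
```

You would then call df = read_excel_with_progress("data.xlsx"), where the file name is just a placeholder. The callback keeps the proxy independent of tqdm, so any other progress display could be plugged in instead.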

I have to admit that 2 is a little hacky. But I am convinced that both methods will work, because I just verified that pd.read_excel can actually read from a file object ( io.BufferedReader ), including xlsx files, which are compressed (ZIP) archives. This method simply would not be very accurate, because the file pointer may not move linearly over time, depending on things like fluctuating compression ratios (some parts of the file may compress better than others).
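For option 1, a minimal sketch of the polling thread could look like this. The names poll_position, run_with_polling and report are hypothetical, and the sketch assumes that the function doing the reading does not close the file object you pass in. The reported fraction is only as reliable as the file pointer, for the reasons given above.

```python
import threading


def poll_position(fp, total, report, stop, interval=0.2):
    """Call report(fraction) every `interval` seconds until `stop` is set."""
    while not stop.wait(interval):
        if total > 0:
            report(min(fp.tell() / total, 1.0))


def run_with_polling(fp, total, work, report, interval=0.2):
    """Run work(fp) while a background thread reports progress from fp.tell().

    `total` is the file size in bytes, e.g. obtained with seek/tell.
    """
    stop = threading.Event()
    watcher = threading.Thread(
        target=poll_position,
        args=(fp, total, report, stop, interval),
        daemon=True,
    )
    watcher.start()
    try:
        return work(fp)
    finally:
        # Always stop the watcher, even if the read fails.
        stop.set()
        watcher.join()
```

With the file opened in binary mode and its size stored in fp_len, the call would be something like df = run_with_polling(fp, fp_len, pd.read_excel, print), with print replaced by whatever updates your progress bar.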
