Fastest way to read .xlsx file using Python

I am trying to read data from a .xlsx file into a MySQL database using Python.

Here's my code:

import MySQLdb
import openpyxl

wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']

conn = MySQLdb.connect()
cursor = conn.cursor()

cursor.execute("SET autocommit = 0")

for row in ws.iter_rows(min_row=2):  # skip the header row
    sql_row = ...  # the data I need from each row
    cursor.execute("INSERT ...", sql_row)  # INSERT statement elided

conn.commit() 


Unfortunately, openpyxl's ws.iter_rows() is very slow. I've tried similar approaches with the xlrd and pandas modules and they are still slow. Any thoughts?


1 answer


You really need to profile your code and provide information about the size of the worksheet and how long it takes to process it.
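
As a minimal sketch of what that measurement might look like (reusing the placeholder filename and sheet name from the question):

import time

import openpyxl

# Time the spreadsheet parsing on its own, separately from the database inserts.
start = time.perf_counter()
wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']
row_count = sum(1 for _ in ws.iter_rows())
print(f"parsed {row_count} rows in {time.perf_counter() - start:.1f}s")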

openpyxl's read-only mode is essentially a memory optimization that avoids loading the whole worksheet into memory. When it comes to parsing Excel spreadsheets, most of the work is converting the XML to Python, and there is only so much that can be done about that.

However, two optimizations do spring to mind:

  • keep the SQL statement outside the loop
  • use executemany to pass many rows at once to the driver

They can be combined into something like



INSERT_SQL = "INSERT INTO mytable (name, age…) VALUES (%s, %s, …)"
cursor.executemany(INSERT_SQL, ws.values)  # ws.values yields one tuple of cell values per row

If you only want a subset of the rows, take a look at itertools.islice.
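
For example, a sketch along those lines, where the 1,000-row limit is purely illustrative and INSERT_SQL is the statement from above:

from itertools import islice

# Skip the header row and take at most the first 1000 data rows.
subset = islice(ws.values, 1, 1001)
# list() materialises the slice; depending on the driver you may be able to pass the iterator directly.
cursor.executemany(INSERT_SQL, list(subset))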

It should be faster than your current code, but you shouldn't expect miracles.

When it comes to sheer performance, xlrd is slightly faster than openpyxl at reading worksheets, as it has a smaller memory footprint, largely because it is a read-only library. But it always loads the whole workbook into memory, which you may not need.
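
For comparison, a rough xlrd equivalent (reusing the placeholders from above; note that xlrd 2.0 and later dropped .xlsx support, so this only applies to the older 1.x releases):

import xlrd

# The whole workbook is loaded into memory by open_workbook().
book = xlrd.open_workbook("file")
sheet = book.sheet_by_name("My Worksheet")

# One list of cell values per row, skipping the header row.
rows = (sheet.row_values(i) for i in range(1, sheet.nrows))
cursor.executemany(INSERT_SQL, list(rows))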
