commit() guidelines when inserting many rows

[Situation-Aim]
We have a CSV dataset of 1 million rows. It needs to be imported into a MySQL 5.6 database, into an InnoDB table with the following columns (corresponding to the CSV file):
Column1: bigint(15)
Column2: datetime
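
For concreteness, a matching table could be created like this; the table name mytable and the column names id and adate are assumptions borrowed from the script below, so adjust them to your real schema:

import mysql.connector

cnx = mysql.connector.connect(user='user', password='secret', database='mydatabase')
cursor = cnx.cursor()

# BIGINT(15) and DATETIME match the column types listed above;
# InnoDB is the engine mentioned in the question.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS mytable (
        id    BIGINT(15) NOT NULL,
        adate DATETIME NOT NULL
    ) ENGINE=InnoDB
""")

cursor.close()
cnx.close()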

To import this data, we could use the following Python script:

#python version: 3.4.3
#mysql connector version: 2.0.4
#Note: this is just a simple example script!

import mysql.connector
import csv

#set up the database connection
cnx = mysql.connector.connect(user='user', password='secret', database='mydatabase')
cursor = cnx.cursor()

#open the CSV file
reader = csv.reader(open("C:/test.csv", "rt"))

#ignore header
next(reader)

#read the CSV rows and insert them into MySQL
for row in reader:
    id = row[0]
    adate = row[1]
    # parameterized query: let the connector handle quoting and escaping
    cursor.execute("INSERT INTO mytable VALUES (%s, %s)", (id, adate))

#commit once, after all rows have been inserted
cnx.commit()

      


[Questions]
What happens conceptually when I put cnx.commit() inside the for loop, like this:

for row in reader:
    id = row[0]
    adate = row[1]
    cursor.execute("INSERT INTO mytable VALUES (%s, %s)", (id, adate))
    cnx.commit()

      

  • If I understand correctly, this is the difference between forcing MySQL to write after each iteration, as opposed to letting MySQL buffer everything and only writing when commit() is sent.
  • What are the performance advantages and disadvantages of each approach?
  • Are there any conventions / guidelines or practices that you follow and why?

Thanks!

1 answer


You have shown the two extreme positions.

One is to commit only once, after all your inserts are done. The other is to commit after every row.

Both of these perform poorly. The first is bad because MySQL builds up one huge transaction and then has to commit it all at once at the end, which takes a lot of RAM or temporary space. The second is bad because of the overhead of many tiny transactions.



The most efficient way to do this kind of bulk insertion is to commit every thousand rows or so. That is a little more work to program than the alternatives, but in my experience it is several times faster than either extreme, so it is worth the small amount of extra complexity.
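
Here is a minimal sketch of that middle ground, adapted from the script in the question. The batch size of 1000 rows and the use of executemany() with parameterized placeholders are my additions; tune the batch size against your own data.

import csv
import mysql.connector

BATCH_SIZE = 1000  # commit every thousand rows or so

cnx = mysql.connector.connect(user='user', password='secret', database='mydatabase')
cursor = cnx.cursor()

with open("C:/test.csv", "rt", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row

    batch = []
    for row in reader:
        batch.append((row[0], row[1]))
        if len(batch) >= BATCH_SIZE:
            # one round trip for the whole batch, then a single commit
            cursor.executemany("INSERT INTO mytable VALUES (%s, %s)", batch)
            cnx.commit()
            batch = []

    if batch:  # commit whatever is left over at the end
        cursor.executemany("INSERT INTO mytable VALUES (%s, %s)", batch)
        cnx.commit()

cursor.close()
cnx.close()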

Another fast (but more fragile) way to do such a bulk load is LOAD DATA INFILE, which you can read about here: https://dev.mysql.com/doc/refman/5.6/en/load-data.html
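
For completeness, a sketch of issuing that statement through the same connector. This assumes the CSV file sits on the database server (the non-LOCAL form shown here is read by the server process) and that your account has the FILE privilege; the field and line terminators are assumptions about the CSV format:

import mysql.connector

cnx = mysql.connector.connect(user='user', password='secret', database='mydatabase')
cursor = cnx.cursor()

# The path is resolved on the MySQL server, not the client machine.
cursor.execute(
    "LOAD DATA INFILE 'C:/test.csv' "
    "INTO TABLE mytable "
    "FIELDS TERMINATED BY ',' "
    "LINES TERMINATED BY '\\n' "
    "IGNORE 1 LINES"
)
cnx.commit()

cursor.close()
cnx.close()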

There is one wrinkle in your commit-everything-at-the-end approach: MySQL connections can have autocommit mode enabled. In that case, your first example is equivalent to your second. Autocommit is disabled by default in the Python connector.
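
If you want to be sure which mode you are in, you can check and pin it explicitly; a short sketch:

import mysql.connector

cnx = mysql.connector.connect(user='user', password='secret', database='mydatabase')

# Connector/Python exposes autocommit as a property on the connection.
# It defaults to False, so nothing becomes durable until cnx.commit().
print(cnx.autocommit)    # False unless it has been changed
cnx.autocommit = False   # make the intended behaviour explicit

cnx.close()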







