How to import mysqldump into Pandas

I am wondering if there is an easy way to import mysqldump into Pandas.

I have several small (~110 MB) tables and I would like to have them as DataFrames.

I would like to avoid having to round-trip the data through a database, since that would require setting up and connecting to one. I have the .sql files and want to import the contained tables into Pandas. Is there any module that can do this?

In case the version matters: the .sql files all list "MySQL dump 10.13 Distrib 5.6.13, for Win32 (x86)" as the system the dump was generated with.

Background (added in retrospect)

I was working locally on a computer with no database connection. The normal flow for my work was to receive a .tsv, .csv, or .json file from a third party and do some analysis, which would then be returned. A new third party delivered all of their data in .sql format, which broke my workflow, since it would have required a lot of overhead to convert it into a format my programs could take as input. We ended up asking them to send the data in a different format, but for business/reputation reasons we wanted to look for a workaround first.

Edit: Below is an example mysqldump file with two tables.

/*
MySQL - 5.6.28 : Database - ztest
*********************************************************************
*/


/*!40101 SET NAMES utf8 */;

/*!40101 SET SQL_MODE=''*/;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`ztest` /*!40100 DEFAULT CHARACTER SET latin1 */;

USE `ztest`;

/*Table structure for table `food_in` */

DROP TABLE IF EXISTS `food_in`;

CREATE TABLE `food_in` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Cat` varchar(255) DEFAULT NULL,
  `Item` varchar(255) DEFAULT NULL,
  `price` decimal(10,4) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL,
  KEY `ID` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;

/*Data for the table `food_in` */

insert  into `food_in`(`ID`,`Cat`,`Item`,`price`,`quantity`) values 
(2,'Liq','Beer','2.5000','300'),
(7,'Liq','Water','3.5000','230'),
(9,'Liq','Soda','3.5000','399');

/*Table structure for table `food_min` */

DROP TABLE IF EXISTS `food_min`;

CREATE TABLE `food_min` (
  `Item` varchar(255) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

/*Data for the table `food_min` */

insert  into `food_min`(`Item`,`quantity`) values 
('Pizza','300'),
('Hotdogs','200'),
('Beer','300'),
('Water','230'),
('Soda','399'),
('Soup','100');

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;



3 answers


No.

Pandas has no native way of reading a mysqldump without it passing through a database.

There is a possible workaround, but this is a very bad idea in my opinion.

Workaround (not recommended for production use)

Of course, you could parse the data out of the mysqldump file using a preprocessor.

MySQLdump files often contain a lot of extra data that we are not interested in when loading a pandas DataFrame, so we need to preprocess it, removing the noise and reformatting the lines into something read_csv can accept.

Using StringIO, we can read the file and process the data before it is fed to the pandas.read_csv function:

from io import StringIO
import re

def read_dump(dump_filename, target_table):
    sio = StringIO()

    fast_forward = True
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            # Skip ahead until the INSERT statement for the target table
            if line.lower().startswith('insert') and target_table in line:
                fast_forward = False
            if fast_forward:
                continue
            # Grab the parenthesised value tuple on this line
            data = re.findall(r'\([^)]*\)', line)
            try:
                newline = data[0]
                newline = newline.strip(' ()')
                newline = newline.replace('`', '')
                sio.write(newline)
                sio.write("\n")
            except IndexError:
                pass
            # The trailing semicolon ends the INSERT statement
            if line.endswith(';'):
                break
    sio.seek(0)
    return sio


Now that we have a function that reads the data and formats it to look like a CSV file, we can read it with pandas.read_csv():



import pandas as pd

food_min_filedata = read_dump('mysqldumpexample', 'food_min')
food_in_filedata = read_dump('mysqldumpexample', 'food_in')

df_food_min = pd.read_csv(food_min_filedata)
df_food_in = pd.read_csv(food_in_filedata)


Results in:

        Item quantity
0    'Pizza'    '300'
1  'Hotdogs'    '200'
2     'Beer'    '300'
3    'Water'    '230'
4     'Soda'    '399'
5     'Soup'    '100'


and

   ID    Cat     Item     price quantity
0   2  'Liq'   'Beer'  '2.5000'    '300'
1   7  'Liq'  'Water'  '3.5000'    '230'
2   9  'Liq'   'Soda'  '3.5000'    '399'
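
Note that the SQL quoting is kept in the string values above. If that is unwanted, one possible refinement (a sketch, assuming the data itself contains no embedded single quotes) is to tell read_csv to treat ' as the quote character, after which the usual dtype inference kicks in:

import pandas as pd

# Re-read the stream and strip the SQL-style single quotes;
# price and quantity then come out numeric instead of as strings.
df_food_in = pd.read_csv(read_dump('mysqldumpexample', 'food_in'),
                         quotechar="'")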


A note on stream processing

This approach is called stream processing and is incredibly streamlined, taking up almost no memory at all. In general it is a good idea to use this approach to read csv files into pandas more efficiently.
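
For illustration, the same idea is what the chunksize option of pandas.read_csv exposes (a sketch; the filename and filter are hypothetical):

import pandas as pd

# Read a large csv in fixed-size row chunks instead of all at once,
# keeping only the rows of interest from each chunk.
chunks = pd.read_csv('big_table.csv', chunksize=100000)
df = pd.concat(chunk[chunk['quantity'] > 0] for chunk in chunks)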

It is the parsing of a mysqldump file that I advise against.



One option is to convert the mysqldump into sqlite (for example, by running this shell script), then read the resulting sqlite file/database.

See the SQL section of the documentation:

# read_sql_table needs an SQLAlchemy connectable or a database URI
pd.read_sql_table(table_name, 'sqlite:///' + sqlite_file)
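
For instance, once the sample dump has been converted to a sqlite file (here assumed to be named ztest.db), both tables could be read back like this (requires SQLAlchemy):

import pandas as pd
from sqlalchemy import create_engine

# 'ztest.db' is the hypothetical sqlite file produced by the conversion.
engine = create_engine('sqlite:///ztest.db')
df_food_in = pd.read_sql_table('food_in', engine)
df_food_min = pd.read_sql_table('food_min', engine)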





Another option would be to just run read_sql on the mysql database directly...
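
A minimal sketch of that direct route (the connection details are hypothetical, and the dump must first be restored into a running MySQL server):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical credentials; adjust driver/user/password/host as needed.
engine = create_engine('mysql+pymysql://user:password@localhost/ztest')
df_food_in = pd.read_sql('SELECT * FROM food_in', engine)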



I ended up in a similar situation to yours, and the answer from @firelynx was really helpful!

But since I only had limited knowledge of the tables included in the file, I extended the script by adding header generation (pandas picks the headers up automatically), as well as a search for all the tables in the dump file. As a result, I ended up with the following script, which indeed works really fast. I use io.StringIO, and save the resulting tables as table_name.csv files.

P.S. I also second the advice not to rely on this approach, and provide the code for illustration purposes only :)

So, first of all, we can augment the read_dump function like this:

from io import StringIO
import re
import shutil

def read_dump(dump_filename, target_table):
    sio = StringIO()

    read_mode = 0 # 0 - skip, 1 - header, 2 - data
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('insert') and target_table in line:
                read_mode = 2
            if line.lower().startswith('create table') and target_table in line:
                read_mode = 1
                continue

            if read_mode == 0:
                continue

            # Filling up the headers
            elif read_mode == 1:
                if line.lower().startswith(('primary', 'key', ')')):
                    # add more conditions here for different cases
                    # (e.g. constraints, unique keys, ...)
                    read_mode = 0
                    sio.seek(sio.tell() - 1) # delete last comma
                    sio.write('\n')
                    continue
                colheader = re.findall(r'`([\w_]+)`', line)
                for col in colheader:
                    sio.write(col.strip())
                    sio.write(',')

            # Filling up the data - same as in @firelynx's code
            elif read_mode == 2:
                data = re.findall(r'\([^)]*\)', line)
                try:
                    newline = data[0]
                    newline = newline.strip(' ()')
                    newline = newline.replace('`', '')
                    sio.write(newline)
                    sio.write("\n")
                except IndexError:
                    pass
                if line.endswith(';'):
                    break
    sio.seek(0)
    with open(target_table + '.csv', 'w') as fd:
        shutil.copyfileobj(sio, fd)
    return # or simply return sio itself
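
For the sample dump above, a call like this (assuming the dump is saved as dump.sql, a hypothetical filename) writes food_min.csv with an Item,quantity header row that pandas picks up automatically:

import pandas as pd

read_dump('dump.sql', 'food_min')
df_food_min = pd.read_csv('food_min.csv')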


To find a list of tables, we can use the following function:

def find_tables(dump_filename):
    table_list = []

    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('create table'):
                table_name = re.findall(r'create table `([\w_]+)`', line.lower())
                table_list.extend(table_name)

    return table_list
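
On the sample dump this returns both table names (the filename is again hypothetical):

find_tables('dump.sql') # -> ['food_in', 'food_min']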

      

Then just combine the two, for example in a .py script that you run like

python this_script.py mysqldump_name.sql [table_name]

import os.path
import sys

def main():
    try:
        if len(sys.argv) >= 2 and os.path.isfile(sys.argv[1]):
            if len(sys.argv) == 2:
                print('Table name not provided, looking for all tables...')
                table_list = find_tables(sys.argv[1])
                if len(table_list) > 0:
                    print('Found tables: ', str(table_list))
                    for table in table_list:
                        read_dump(sys.argv[1], table)
            elif len(sys.argv) == 3:
                read_dump(sys.argv[1], sys.argv[2])
    except KeyboardInterrupt:
        sys.exit(0)

if __name__ == '__main__':
    main()
