Parsing a large file into MySQL

I have a task where I need to parse an extremely large file and write the results to a MySQL database. "Extremely large" means about 1.4 GB of CSV data, amounting to roughly 10 million lines of text.

The question is not "HOW" to do it, but how to do it FAST. My first approach was to just do it in PHP without any speed optimization and then let it run for a few days until it was done. Unfortunately, it has now been running for 48 hours and has only processed 2% of the file, so this is not an option.

the file format looks like this:

A:1,2

where the number of comma-separated numbers after the ":" can range from 0 to 1000. An example dataset should go into the table like this:

| source_id | target_id |
| A         | 1         |
| A         | 2         |

So right now, I'm doing it like this:

$fh = fopen("file.txt", "r");

$line = ""; // buffer for the data
$i = 0; // line counter
$start = time(); // benchmark

while($line = fgets($fh))
{
    $i++;       
    echo "line " . $i . ": ";

    //echo $i . ": " . $line . "<br>\n";

    $line = explode(":", $line);

    if(count($line) != 2 || !is_numeric(trim($line[0])))
    {
        echo "error: source id [" .  trim($line[0]) . "]<br>\n";
        continue;
    }

    $targets = explode(",", $line[1]);

    echo "node " .  $line[0] . " has " . count($targets) . " links<br>\n";

    // insert links in link table
    foreach($targets as $target)
    {
        if(!is_numeric(trim($target)))
        {
            echo "line " . $i . " has malformed target [" . trim($target) . "]<br>\n";
            continue;
        }

        $sql = "INSERT INTO link (source_id, target_id) VALUES ('" . trim($line[0]) . "', '" . trim($target) . "')";
        mysql_query($sql) or die("insert failed for SQL: " . mysql_error());
    }
}

echo "<br>\n--<br>\n<br>\nseconds wasted: " . (time() - $start);

      

This is obviously not optimized for speed in ANY way. Any tips for a fresh start? Should I switch to another language?

+3




4 answers


The first optimization is to wrap the inserts in a transaction - commit every 100 or 1000 rows, then begin a new transaction. Obviously, you will have to use a storage engine that supports transactions (such as InnoDB).
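
A minimal sketch of that batching pattern, assuming a mysqli connection in $db (the original post does not show one):

$fh = fopen("file.txt", "r");

mysqli_autocommit($db, false);   // open an implicit transaction

$i = 0;
while($line = fgets($fh))
{
    $i++;

    // ... parse the line and run the INSERTs exactly as before ...

    if($i % 1000 == 0)
    {
        mysqli_commit($db);      // flush the batch; the next transaction starts implicitly
    }
}

mysqli_commit($db);              // commit the leftover rows of the last partial batch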

Then keep an eye on CPU usage with the top command. If you have multiple cores and the mysql process isn't doing much while the PHP process does most of the work, rewrite the script to accept an option that skips the first n lines and imports only the next 10,000. Then run multiple instances of the script, each with a different starting point, as in the sketch below.
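
One way such a partitioned runner could look; the --skip/--count option names and the worker logic are illustrative assumptions, not from the answer:

// run as: php import.php --skip=20000 --count=10000
$opts  = getopt("", array("skip:", "count:"));
$skip  = isset($opts["skip"])  ? (int)$opts["skip"]  : 0;
$count = isset($opts["count"]) ? (int)$opts["count"] : 10000;

$fh = fopen("file.txt", "r") or die("could not open file");

// fast-forward past the lines handled by the other instances
for($i = 0; $i < $skip && fgets($fh); $i++);

// process only this instance's slice of the file
for($i = 0; $i < $count && ($line = fgets($fh)); $i++)
{
    // ... parse and insert as in the original script ...
}

fclose($fh);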



A third solution would be to convert the file to a clean CSV with PHP (no INSERTs at all, just writing to a file) and import it with LOAD DATA INFILE, as m4t1t0 suggested.
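
For illustration, a sketch of that conversion; the temporary file path and the table layout are assumptions based on the question:

$in  = fopen("file.txt", "r");
$out = fopen("/tmp/links.csv", "w");   // hypothetical path; mysqld must be able to read it

while($line = fgets($in))
{
    if(strpos($line, ":") === false) continue;   // (full validation from the question omitted)
    list($source, $rest) = explode(":", $line, 2);

    foreach(explode(",", $rest) as $target)
    {
        fputcsv($out, array(trim($source), trim($target)));
    }
}
fclose($in);
fclose($out);

// a single statement then performs the whole import server-side
mysqli_query($db, "LOAD DATA INFILE '/tmp/links.csv'
                   INTO TABLE link
                   FIELDS TERMINATED BY ','
                   (source_id, target_id)");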

+2




As promised, here is the solution I went for. I benchmarked it, and it turned out to be 40 times (!) faster than the old one :) Of course, there is still a lot of room for optimization, but it's fast enough for me right now :)



$db = mysqli_connect(/*...*/) or die("could not connect to database");

$fh = fopen("data", "r");

$line = "";             // buffer for the data
$i = 0;                 // line counter
$start = time();        // benchmark timer
$node_ids = array();    // (source) node ids of the current batch

mysqli_autocommit($db, false);

while($line = fgets($fh))
{
    $i++;

    echo "line " . $i . ": ";

    $line = explode(":", $line);
    $line[0] = trim($line[0]);

    if(count($line) != 2 || !is_numeric($line[0]))
    {
        echo "error: source node id [" . $line[0] . "] - skipping...\n";
        continue;
    }
    else
    {
        $node_ids[] = $line[0];
    }

    $targets = explode(",", $line[1]);

    echo "node " . $line[0] . " has " . count($targets) . " links\n";

    // insert links in link table
    foreach($targets as $target)
    {
        $target = trim($target); // strip the trailing newline before validating

        if(!is_numeric($target))
        {
            echo "line " . $i . " has malformed target [" . $target . "]\n";
            continue;
        }

        $sql = "INSERT INTO link (source_id, target_id) VALUES ('" . $line[0] . "', '" . $target . "')";
        mysqli_query($db, $sql) or die("insert failed for SQL: " . mysqli_error($db));
    }

    if($i % 1000 == 0)
    {
        // flush the node ids collected so far (deduplicated per batch only)
        $node_ids = array_unique($node_ids);
        foreach($node_ids as $node)
        {
            $sql = "INSERT INTO node (node_id) VALUES ('" . $node . "')";
            mysqli_query($db, $sql);
        }
        $node_ids = array();

        mysqli_commit($db); // autocommit stays off, so a new transaction begins
        echo "committed to database\n\n";
    }
}

// flush the last partial batch
foreach(array_unique($node_ids) as $node)
{
    mysqli_query($db, "INSERT INTO node (node_id) VALUES ('" . $node . "')");
}
mysqli_commit($db);

echo "\n--\n\nseconds wasted: " . (time() - $start);

+1




I find your description rather confusing - and it doesn't match the code you provided.

if(count($line) != 2 || !is_numeric(trim($line[0])))

The trimming is redundant here - leading spaces don't change the behavior of is_numeric. But you said elsewhere that the beginning of the line is a letter, so this check will always reject the line.

If you want to speed it up, switch to stream-based processing of the input instead of building intermediate arrays (PHP arrays can be very slow), or use a different language, and aggregate the inserts into multi-row INSERT statements, as sketched below.
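
For illustration, a rough sketch of such aggregation in PHP; the batch size of 500 rows is an arbitrary assumption:

$values = array();   // accumulated "(source, target)" tuples

while($line = fgets($fh))
{
    if(strpos($line, ":") === false) continue;
    list($source, $rest) = explode(":", $line, 2);
    $source = trim($source);

    foreach(explode(",", $rest) as $target)
    {
        $values[] = "('" . $source . "', '" . trim($target) . "')";

        if(count($values) >= 500)
        {
            // one round-trip inserts 500 rows instead of one
            mysqli_query($db, "INSERT INTO link (source_id, target_id) VALUES " . implode(",", $values));
            $values = array();
        }
    }
}

if(count($values) > 0)   // flush the remainder
{
    mysqli_query($db, "INSERT INTO link (source_id, target_id) VALUES " . implode(",", $values));
}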

0




In the past, I just used a script to generate the SQL file, then locked the table by placing the appropriate commands at the beginning and end of the SQL file (see http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html); you can have the generating script emit those lines for you.

Then just use a command-line tool to load the SQL into the database (preferably run on the machine where the database is located).
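
A sketch of how the generating script might look; the dump.sql file name and the exact LOCK TABLES placement are assumptions based on this answer:

$fh  = fopen("file.txt", "r");
$out = fopen("dump.sql", "w");   // hypothetical output file

fwrite($out, "LOCK TABLES link WRITE;\n");

while($line = fgets($fh))
{
    if(strpos($line, ":") === false) continue;
    list($source, $rest) = explode(":", $line, 2);

    foreach(explode(",", $rest) as $target)
    {
        fwrite($out, "INSERT INTO link (source_id, target_id) VALUES ('"
            . trim($source) . "', '" . trim($target) . "');\n");
    }
}

fwrite($out, "UNLOCK TABLES;\n");
fclose($out);

The finished file can then be fed to the server in one go, for example with mysql -u user -p dbname < dump.sql.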

0








