How to handle a CSV with 100k+ lines in PHP?

I have a CSV file with over 100,000 lines; each line has 3 values separated by semicolons. The total file size is approx. 5 MB.

CSV file is in this format:

stock_id;product_id;amount
==========================
1;1234;0
1;1235;1
1;1236;0
...
2;1234;3
2;1235;2
2;1236;13
...
3;1234;0
3;1235;2
3;1236;0
...

      

There are 10 stocks, indexed 1-10 in the CSV; in the database they are saved under IDs 22-31.

The CSV is sorted by stock_id, product_id, but I think it doesn't matter.

What I have so far:

<?php

session_start();

require_once ('db.php');

echo '<meta charset="iso-8859-2">';

// convert table: `CSV stock id => DB stock id`
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

$products = array();

$sql = $mysqli->query("SELECT product_id FROM table WHERE fielddef_id = 1");

while ($row = $sql->fetch_assoc()) {
    $products[$row['product_id']] = 1;
}

$csv = file('export.csv');

// go thru CSV file and prepare SQL UPDATE query
foreach ($csv as $row) {
    $data = explode(';', $row);
    // $data[0] - stock_id
    // $data[1] - product_id
    // $data[2] - amount

    if (isset($products[$data[1]])) {
        // the CSV contains products which aren't in the database, skip those
        // the echo below should show me the queries
        echo "  UPDATE t 
                SET value = " . (int)$data[2] . " 
                WHERE   fielddef_id = " . (int)$stocks[$data[0]] . " AND 
                        product_id = '" . $data[1] . "' -- product_id isn't just numeric
                LIMIT 1<br>";
    }
}

      

The problem is that echoing 100k lines this way is too slow; it takes many minutes. I'm not sure what MySQL would do with the actual UPDATEs - whether it would be faster or take about the same time. I don't have a test machine here, so I'm worried about trying it on the production server.
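
For debugging I could probably dump the generated queries into a file in one go instead of echoing 100k lines - something like this untested sketch (the queries.sql name is just an example):

// untested sketch: collect the queries and write them out once, instead of echoing each line
$queries = array();
foreach ($csv as $row) {
    $data = explode(';', $row);
    if (isset($products[$data[1]])) {
        $queries[] = "UPDATE t SET value = " . (int)$data[2]
                   . " WHERE fielddef_id = " . (int)$stocks[$data[0]]
                   . " AND product_id = '" . $data[1] . "' LIMIT 1;";
    }
}
file_put_contents('queries.sql', implode("\n", $queries));
echo count($queries) . " queries generated<br>";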

My idea was to load the CSV file into several variables (arrays) like below, but I don't know why that would help.

$csv[0] = lines 0      - 10.000;
$csv[1] = lines 10.001 - 20.000;
$csv[2] = lines 20.001 - 30.000;
$csv[3] = lines 30.001 - 40.000;
etc. 
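
In code the splitting itself would just be something like this (untested; it only changes how the data sits in memory, the number of queries stays the same):

// untested sketch of the splitting idea: blocks of 10,000 lines each
$lines  = file('export.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$chunks = array_chunk($lines, 10000); // $chunks[0] = lines 0-9,999, $chunks[1] = 10,000-19,999, ...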

      

I found, for example, Efficiently counting the number of lines of a text file (200 MB+), but I'm not sure how that can help me.

When I replace the foreach with a plain print_r, I get the dump in under 1 second. The challenge is to speed up the foreach loop while updating the database.

Any ideas on how to update so many records in the database?
Thanks.

+3




5 answers


Based on the answers and comments here, I have a solution. The base for it comes from @Dave; I've just updated it to better fit the question.



<?php

require_once 'include.php';

// stock convert table (key is ID in CSV, value ID in database)
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

// product IDs in the CSV (value) and in the database (product_id) are different,
// so we take both from the database and build a value => product_id map of e-shop products
$products = array();

$sql = mysql_query("SELECT product_id, value FROM cms_module_products_fieldvals WHERE fielddef_id = 1") or die(mysql_error());

while ($row = mysql_fetch_assoc($sql)) {
    $products[$row['value']] = $row['product_id'];
}

$handle = fopen('import.csv', 'r');
$i = 1;

while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    $p_id = (int)$products[$data[1]];

    if ($p_id > 0) {
        // product exists in the database, continue. Without this condition it still works,
        // but we would send many useless queries (... WHERE product_id = 0 updates nothing, yet takes time)
        if ($i % 300 === 0) {
            // optional, we'll see what it does with the real traffic
            sleep(1);
        }

        $updatesql = "UPDATE table SET value = " . (int)$data[2] . " WHERE fielddef_id = " . $stocks[$data[0]] . " AND product_id = " . (int)$p_id . " LIMIT 1";
        echo "$updatesql<br>"; // for debug only, comment out on live
        // mysql_query($updatesql) or die(mysql_error()); // on live, execute the query here instead
        $i++;
    }
}

// approx. 1.5 sec to go through 100,000+ records
fclose($handle);
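
For completeness, a sketch (not part of the code above) of actually executing the generated UPDATEs instead of only echoing them, using the $mysqli connection from the question and a single transaction so all the updates are committed in one go (a transaction only helps if the table is InnoDB; `table` stays the placeholder name used above):

// sketch only - not the accepted code: run the UPDATEs through mysqli inside one transaction
$mysqli->begin_transaction();

$handle = fopen('import.csv', 'r');
while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    $p_id = isset($products[$data[1]]) ? (int)$products[$data[1]] : 0;

    if ($p_id > 0) {
        $mysqli->query(
            "UPDATE table SET value = " . (int)$data[2]
            . " WHERE fielddef_id = " . (int)$stocks[$data[0]]
            . " AND product_id = " . $p_id . " LIMIT 1"
        ) or die($mysqli->error);
    }
}
fclose($handle);

$mysqli->commit();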

      

+3




Something like this (note that this is 100% untested, straight off the top of my head, so it might need some tweaking to actually work :))

//define the mapping array (there are probably better ways of doing this)
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

$handle = fopen("file.csv", "r"); //open file
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
    //loop through csv

    $updatesql = "UPDATE t SET `value` = ".$data[2]." WHERE   fielddef_id = ".$stocks[$data[0]]." AND product_id = ".$data[1];
    echo "$updatesql<br>"; //for debug only comment out on live
}

      

There is no need to do your initial SELECT, since you only ever set your product data to 1 anyway in your code, and from your description your product ID is always correct; it's just your fielddef column that needs the mapping.

Also, just for live use, don't forget to put your actual mysqli execute command on your $updatesql;
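
For example (a sketch, assuming the $mysqli connection from the question):

// sketch: execute the compiled statement instead of (or as well as) echoing it
if (!$mysqli->query($updatesql)) {
    echo "Update failed: " . $mysqli->error . "<br>";
}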



To give you a comparison with actual production code (something I can compare against!), this is code I use for an importer of an uploaded file (it's not perfect, but it does the job):

if (isset($_POST['action']) && $_POST['action']=="beginimport") {
            echo "<h4>Starting Import</h4><br />";
            // Ignore user abort and expand time limit 
            //ignore_user_abort(true);
            set_time_limit(60);
                if (($handle = fopen($_FILES['clientimport']['tmp_name'], "r")) !== FALSE) {
                    $row = 0;
                    //defaults 
                    $sitetype = 3;
                    $sitestatus = 1;
                    $startdate = "2013-01-01 00:00:00";
                    $enddate = "2013-12-31 23:59:59";
                    $createdby = 1;
                    //loop and insert
                    while (($data = fgetcsv($handle, 10000, ",")) !== FALSE) {  // loop through each line of CSV. Returns array of that line each time so we can hard reference it if we want.
                        if ($row>0) {
                            if (strlen($data[1])>0) {
                                $clientshortcode = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[0])));
                                $sitename = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[0]))." ".trim(stripslashes($data[1])));
                                $address = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[1])).",".trim(stripslashes($data[2])).",".trim(stripslashes($data[3])));
                                $postcode = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[4])));
                                //look up client ID
                                $client = $db->queryUniqueObject("SELECT ID FROM tblclients WHERE ShortCode='$clientshortcode'",ENABLE_DEBUG);

                                if ($client->ID>0 && is_numeric($client->ID)) {
                                    //got client ID so now check if site already exists we can trust the site name here since we only care about double matching against already imported sites.
                                    $sitecount = $db->countOf("tblsites","SiteName='$sitename'");
                                    if ($sitecount>0) {
                                        //site exists
                                        echo "<strong style=\"color:orange;\">SITE $sitename ALREADY EXISTS SKIPPING</strong><br />";
                                    } else {
                                        //site doesn't exist so do import
                                        $db->execute("INSERT INTO tblsites (SiteName,SiteAddress,SitePostcode,SiteType,SiteStatus,CreatedBy,StartDate,EndDate,CompanyID) VALUES 
                                        ('$sitename','$address','$postcode',$sitetype,$sitestatus,$createdby,'$startdate','$enddate',".$client->ID.")",ENABLE_DEBUG);
                                        echo "IMPORTED - ".$data[0]." - ".$data[1]."<br />";
                                    }
                                } else {
                                    echo "<strong style=\"color:red;\">CLIENT $clientshortcode NOT FOUND PLEASE ENTER AND RE-IMPORT</strong><br />";
                                }
                                fcflush();
                                set_time_limit(60); // reset timer on loop
                            }
                        } else {
                            $row++;
                        }
                    } 
                    echo "<br />COMPLETED<br />";
                }
                fclose($handle);
                unlink($_FILES['clientimport']['tmp_name']);
            echo "All Imports finished do not reload this page";
        }

      

This imported 150k lines in 10 seconds.

+3




As I said in the comment, use SplFileObject to iterate over the CSV file. Use prepared statements to reduce the performance overhead of calling UPDATE in each iteration. Also, combine your two queries: there is no reason to pull all of the product rows first and check them against the CSV. You can use a JOIN so that only the stocks in the second table that relate to a product in the first table, and that match the current CSV row, get updated:

/* First the CSV is pulled in */
$export_csv = new SplFileObject('export.csv');
$export_csv->setFlags(SplFileObject::READ_CSV | SplFileObject::DROP_NEW_LINE | SplFileObject::READ_AHEAD);
$export_csv->setCsvControl(';');

/* Next you prepare your statement object */
/* Note: MySQL does not allow LIMIT in a multiple-table UPDATE, so it is left out here */
$stmt = $mysqli->prepare("
UPDATE stocks, products
SET value = ?
WHERE
stocks.fielddef_id = ? AND
product_id = ? AND
products.fielddef_id = 1
");

$stmt->bind_param('iis', $amount, $fielddef_id, $product_id);

/* Now you can loop through the CSV, assign each row's fields to the bound variables, and execute the update on each iteration. */

foreach ($export_csv as $csv_row) {
    list($stock_id, $product_id, $amount) = $csv_row;
    $fielddef_id = $stock_id + 21;

    if(!empty($stock_id)) {
        $stmt->execute();
    }
}

$stmt->close();

      

+2




Make the query bigger, i.e. use a loop to compile a larger query. You may need to split it into chunks (e.g. process 100 rows at a time), but definitely don't do one query per row (this applies to any kind of statement: INSERT, UPDATE, and even SELECT where possible). This should improve performance significantly.

It is generally recommended that you do not run queries inside a loop.
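
For example, a sketch of the chunking idea using the data from the question - this assumes the table has a UNIQUE key on (fielddef_id, product_id), so a multi-row INSERT ... ON DUPLICATE KEY UPDATE can stand in for 100 separate UPDATEs (the table and column names are placeholders taken from the question):

// sketch only: send one statement per 100 rows instead of 100 single-row UPDATEs
// assumes a UNIQUE key on (fielddef_id, product_id); `t` and the columns are placeholders
$batch  = array();
$handle = fopen('export.csv', 'r');

while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    if (count($data) < 3 || !isset($products[$data[1]])) {
        continue; // malformed line or product not in the database
    }

    $batch[] = '(' . (int)$stocks[$data[0]] . ", '"
             . $mysqli->real_escape_string($data[1]) . "', " . (int)$data[2] . ')';

    if (count($batch) >= 100) {
        $mysqli->query(
            "INSERT INTO t (fielddef_id, product_id, value) VALUES " . implode(',', $batch)
            . " ON DUPLICATE KEY UPDATE value = VALUES(value)"
        ) or die($mysqli->error);
        $batch = array();
    }
}

if ($batch) { // flush the remainder
    $mysqli->query(
        "INSERT INTO t (fielddef_id, product_id, value) VALUES " . implode(',', $batch)
        . " ON DUPLICATE KEY UPDATE value = VALUES(value)"
    ) or die($mysqli->error);
}

fclose($handle);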

0




Updating each record one at a time will be too expensive (mostly because of seeks, but also because of the writes).

You should TRUNCATE the table first and then insert all the records again (assuming no external foreign keys reference this table).

To make it even faster, lock the table before inserting and unlock it afterwards. This prevents the indexes from being updated on every single insert.
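
A rough sketch of that approach (assuming the $stocks map from the question, that nothing references the table via foreign keys, and that `t` and its columns are placeholder names):

// sketch only: rebuild the table in one pass instead of 100k single-row UPDATEs
$mysqli->query("TRUNCATE TABLE t") or die($mysqli->error);

// lock for the inserts so the index isn't flushed after every row (MyISAM)
$mysqli->query("LOCK TABLES t WRITE") or die($mysqli->error);

$values = array();
foreach (file('export.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $parts = explode(';', $line);
    if (count($parts) === 3 && isset($stocks[$parts[0]])) {
        $values[] = '(' . (int)$stocks[$parts[0]] . ", '"
                  . $mysqli->real_escape_string($parts[1]) . "', " . (int)$parts[2] . ')';
    }
}

// for 100k rows you would chunk this INSERT (see the batching answer above) to stay under max_allowed_packet
$mysqli->query(
    "INSERT INTO t (fielddef_id, product_id, value) VALUES " . implode(',', $values)
) or die($mysqli->error);

$mysqli->query("UNLOCK TABLES");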

0

