Detect the .csv delimiter in PHP

Note: I'll start by saying that I know I have probably missed something really obvious. I'm in one of those coding ruts where I can't see the easy solution.

Problem: I wrote a script in PHP to parse a CSV file, select the column that contains the email addresses, and put them in the database. Now I have found that users are trying to upload files that have a .csv extension but are not actually comma separated. I am trying to write a function that will correctly identify the delimiter (tab, newline, space, etc.), but I am having some trouble with it. My idea is to build an array of all the addresses for each candidate delimiter, so that the number of keys lends confidence to that delimiter.

Code:

$fileName = "../some/path/test.csv";
if (($handle = fopen($fileName, "r")) !== FALSE) {
    $delimiters = array(',', ' ', "\t", "\n");
    $delimNum = 0;
    foreach ($delimiters as $delimiter) {
        $row = 0;
        while (($data = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
            $data = (string)$data[0];
            $delimiterList[$delimNum] = explode($delimiter, $data);
            $row++;
        }
        $delimNum++;
    }
    die(print_r($delimiterList));
}

      

Result:

Array
(
[0] => Array
    (
        [0] => email
peter.parker@example.com
atticus.finch@example.com
steve.rogers@example.com
phileas.fogg@example.com
s.winston@example.com
paul.revere@example.com
fscott.fitzgerald@example.com
jules.verne@example.com
martin.luther@example.com
ulysses.grant@example.com
tony.stark@example.com
    )
)

      

As I said, I know this is probably the wrong approach to this, so I'm grateful for any information you can provide!

+3




5 answers


Solve this issue with usability instead of code. Ask the user to select a separator.

However, since they might not know what tab-delimited, CSV and so on mean, just show them a preview. They can switch between the options until the result looks correct and tabular.

Then parse the file according to the selected format.
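The preview idea could be sketched roughly as follows. The function name `previewRows` and its parameters are hypothetical, not part of the answer above; it simply parses the first few records with whichever delimiter the user picked, so the result can be rendered as a table.

```php
<?php
// Hypothetical helper: parse the first $maxRows records of a file with the
// delimiter the user selected, so they can be shown as a preview table.
function previewRows(string $path, string $delimiter, int $maxRows = 5): array
{
    $rows = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $rows; // file could not be opened: nothing to preview
    }
    while (count($rows) < $maxRows
        && ($record = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== false) {
        if ($record !== [null]) { // fgetcsv() returns [null] for a blank line
            $rows[] = $record;
        }
    }
    fclose($handle);
    return $rows;
}
```

Calling it once per option (for example ",", ";" and "\t") and rendering each result lets the user see which choice splits the data into sensible columns.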

+2




It's not a perfect solution, but it MAY help you - if you can't ask what the delimiter is.

Instead of trying to parse the CSV at all, try to just extract the valid email addresses. I don't think a space, comma, tab, or newline can be a valid part of an email address? (Who knows ;) Check out this discussion about using regular expressions to validate email addresses, so you can see some of the pitfalls of this solution.

But then I would write a regex, run it with preg_match_all(), and return a list of all strings in a valid email format.
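A minimal sketch of that idea, assuming a deliberately loose pattern rather than a full address validator. The function name `extractEmails` and the pattern itself are illustrative assumptions; the pattern will miss some unusual but valid addresses.

```php
<?php
// Hypothetical helper: pull everything that looks like an email address out
// of the raw file contents, whatever delimiter the file happens to use.
function extractEmails(string $contents): array
{
    // Loose pattern (an assumption, not an RFC 5322 validator):
    // local part, "@", domain labels, and a top-level domain of 2+ letters.
    $pattern = '/[A-Z0-9._%+\-]+@[A-Z0-9.\-]+\.[A-Z]{2,}/i';
    preg_match_all($pattern, $contents, $matches);

    // Drop duplicates and reindex so the result is a plain list.
    return array_values(array_unique($matches[0]));
}
```

With this approach the whole upload can be read with file_get_contents() and passed in as one string, since commas, tabs, spaces and line breaks all fall outside the pattern's character classes.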

Good luck!

+1




I didn't find SplFileObject::getCsvControl in the manual until too late, so I wrote a function that works well. In case it is helpful or interesting, my approach is:

The function takes the parameters $handle and $ColName, with $ColName optional.

$ColName lets you check which delimiter finds the expected header column name in the first record, if the CSV file has a header row.

If there is no header row, or you don't know the column names, it falls back to the default check: which delimiter finds the most fields in a single record (this will usually be correct). I then also check that this delimiter returns the same number of fields for each of the next few rows.

fgetcsv seems to work in blocks and to force each record to have the same number of fields as the maximum in the block, so this should work even when the number of fields per record varies.
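The answer does not include its function, so the following is only a sketch of the approach it describes, under some assumptions: the names `detectDelimiter` and `consistentFieldCount`, the candidate list, and the number of rows checked are all hypothetical, and the consistency check is folded into the field count rather than run as a separate step.

```php
<?php
// Hypothetical helper: field count shared by the first $rows non-blank
// records, or 0 when the records disagree or the stream is empty.
function consistentFieldCount($handle, string $delimiter, int $rows): int
{
    rewind($handle);
    $expected = 0;
    for ($read = 0; $read < $rows; ) {
        $record = fgetcsv($handle, 0, $delimiter, '"', '\\');
        if ($record === false) {
            break; // end of file
        }
        if ($record === [null]) {
            continue; // blank line
        }
        $read++;
        $fields = count($record);
        if ($expected === 0) {
            $expected = $fields;
        } elseif ($fields !== $expected) {
            return 0; // this delimiter splits rows inconsistently
        }
    }
    return $expected;
}

// Hypothetical sketch: returns the guessed delimiter, or null if none of the
// candidates splits the data into more than one field.
function detectDelimiter($handle, ?string $colName = null,
    array $candidates = [',', ';', "\t", '|'], int $rows = 5): ?string
{
    // Header check: which delimiter exposes the expected column name?
    if ($colName !== null) {
        foreach ($candidates as $delimiter) {
            rewind($handle);
            $header = fgetcsv($handle, 0, $delimiter, '"', '\\');
            if (is_array($header) && count($header) > 1
                && in_array($colName, array_map('trim', $header), true)) {
                return $delimiter;
            }
        }
    }
    // Default check: most fields per record, consistent over the first rows.
    $best = null;
    $bestFields = 1;
    foreach ($candidates as $delimiter) {
        $fields = consistentFieldCount($handle, $delimiter, $rows);
        if ($fields > $bestFields) {
            $bestFields = $fields;
            $best = $delimiter;
        }
    }
    return $best;
}
```

The stream must be seekable, because the sketch rewinds it before trying each candidate. A single-column file yields null, which the caller can treat as "no delimiter needed".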

+1




I'll show you an algorithm that might be a pretty good solution. Don't assume this problem is easy: it amounts to guesswork, so it won't have a perfect solution.

Instead, you should try to get close to a 99% good solution using statistics or some other heuristic. I am a computer scientist as well as a developer, and this is the kind of approach a machine-learning practitioner or data scientist would take.

Here it is:

  • Select a number of random lines from the file, say 10.
  • Count the number of occurrences of each candidate separator on each line.
  • Use these counts to calculate the mean and variance for each separator.
  • Normalize the numbers, that is, map them to the range 0 to 1 using a linear function.
  • Assign a weight to each of the two values for each delimiter and sum them; this gives a score for each delimiter that you can use to pick one.

It sounds complicated, but it is a pretty good algorithm and not a complicated one. Below is one example of the calculations:

Delimiter counts per line

|         | ; | , | \t  |
|---------|---|---|-----|
| LINE 1  | 3 | 1 |  13 |
| LINE 2  | 2 | 1 |   0 |
| LINE 3  | 3 | 1 |   0 |
| LINE 4  | 3 | 1 | 124 |
| LINE 5  | 2 | 1 |   2 |
| LINE 6  | 2 | 1 |   2 |
| LINE 7  | 3 | 1 |  12 |
| LINE 8  | 3 | 1 |   0 |
| LINE 9  | 3 | 1 |   0 |
| LINE 10 | 3 | 1 |   0 |

      

Calculations and final result

|                       |  ;   |  ,   |  \t  |  | WEIGHTS |  ;   |  ,   | \t |
|-----------------------|------|------|------|--|---------|------|------|----|
| AVERAGE               |  2.7 |    1 | 15.3 |  |         |      |      |    |
| AVERAGE (NORMALIZED)  | 0.17 | 0.06 |    1 |  | 1       | 0.17 | 0.06 |  1 |
| VARIANCE              | 0.21 |    0 | 1335 |  |         |      |      |    |
| VARIANCE (NORMALIZED) | 0.99 |    1 |    0 |  | 3       | 2.99 |    3 |  0 |
|                       |      |      |      |  | SCORE   | 3.17 | 3.06 |  1 |

      

As you can see, the separator ';' has the best score. I think it is also useful to weight the variance more heavily than the average count, because in a real file the number of delimiters most likely does not differ much from line to line.
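The scoring above could be sketched like this. The function name `scoreDelimiters` is hypothetical, and the normalization is an assumption read off the table: the mean is divided by the largest mean, the population variance is mapped to 1 minus its share of the largest variance, and the default weights are 1 and 3. Sampling the random lines is left to the caller.

```php
<?php
// Hypothetical sketch: score each candidate delimiter over sampled lines.
// Returns [delimiter => score], sorted with the best score first.
function scoreDelimiters(array $lines, array $candidates,
    float $meanWeight = 1.0, float $varianceWeight = 3.0): array
{
    $n = count($lines);
    if ($n === 0 || $candidates === []) {
        return [];
    }
    $means = [];
    $variances = [];
    foreach ($candidates as $d) {
        // Occurrences of this delimiter on every sampled line.
        $counts = array_map(fn($line) => substr_count((string) $line, $d), $lines);
        $mean = array_sum($counts) / $n;
        $squares = array_map(fn($c) => ($c - $mean) ** 2, $counts);
        $means[$d] = $mean;
        $variances[$d] = array_sum($squares) / $n; // population variance
    }
    $maxMean = max($means);
    $maxVariance = max($variances);
    $scores = [];
    foreach ($candidates as $d) {
        // Higher mean is better; lower variance is better.
        $normMean = $maxMean > 0 ? $means[$d] / $maxMean : 0.0;
        $normVariance = $maxVariance > 0 ? 1 - $variances[$d] / $maxVariance : 1.0;
        $scores[$d] = $meanWeight * $normMean + $varianceWeight * $normVariance;
    }
    arsort($scores); // best candidate first, keys preserved
    return $scores;
}
```

Fed lines with the counts from the example table, this ranks ';' first, ',' second and the tab last, matching the table. The first key of the returned array (array_key_first()) is the guessed delimiter.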

+1




This is my solution. It works if you know how many columns you expect. At the end, the separator character is in $actual_separation_character.

// Candidate separators, in order of preference when tallies are tied
$separators = array(",", ";", "\t", ":", "|");

// Running tally of matching separator characters, one entry per candidate
$separator_numbers = array_fill_keys($separators, 0);

/* YOU NEED TO CHANGE THIS VARIABLE */
// Expected number of separator characters per row (3 columns ==> 2 separator characters)
$expected_separation_character_number = 2;

$file = fopen("upload/filename.csv", "r");
while (($row = fgets($file)) !== false) { // read file rows
    foreach ($separators as $separator) {
        $count = substr_count($row, $separator);

        // Only tally rows that contain the expected number of separators (0 = accept any row)
        if ($count == $expected_separation_character_number || $expected_separation_character_number == 0) {
            $separator_numbers[$separator] += $count;
        }
    }
}
fclose($file);

/* THE FILE'S ACTUAL SEPARATOR (delimiter) CHARACTER */
// array_search() returns the first candidate that has the highest tally
$actual_separation_character = array_search(max($separator_numbers), $separator_numbers);

/*
If no candidate occurred the expected number of times on any row,
the file does not have the number of columns you expect: do something ...
*/
if ($expected_separation_character_number > 0 && max($separator_numbers) == 0) {
    /* do something! unexpected number of columns */
}

      

0

