Detecting the line-ending character of a CSV file

I need to determine which line endings are used in a CSV file:

  • \n (Unix default)
  • \r (Mac Excel)
  • \r\n (Windows)
  • or anything else

To get the delimiter, enclosure and escape characters, I used SplFileObject::getCsvControl. It would be great to have something similar for the line-ending character.



3 answers


I haven't tried this, but I thought it was an interesting problem, so here's my crack at a possible solution:

// first, have PHP auto-detect the line endings, like @AbraCadaver suggested:
ini_set("auto_detect_line_endings", true);

// now open the file and read a single line from it
$file = fopen('/path/to/file.csv', 'r');
fgets($file);

// fgets() moves the pointer, so get the current position
$position = ftell($file);

// now get a few bytes (here: 10) from around that position;
// max() keeps the offset from going negative on a very short first line
fseek($file, max(0, $position - 5));
$data = fread($file, 10);

// we no longer need the file
fclose($file);

// now find out how many of each type of EOL there are in those 10 bytes
// expected result is that two of these will be 0 and one will be 1
$eols = array(
    "\r\n" => substr_count($data, "\r\n"),
    "\r" => substr_count($data, "\r"),
    "\n" => substr_count($data, "\n"),
);

// sort the EOL count in reverse order, so that the EOL with the highest
// count (expected: 1) will be the first item
arsort($eols);

// get the first item key
$eol = key($eols);

// $eol will now be "\r\n", "\r" or "\n"

      

There are probably better ways to do this, and note that I am making some assumptions about your CSV file here:

  • the file does not start with an empty line;
  • the first line is at least 5 bytes long;
  • the second line is not empty and is at least 5 bytes long;
  • the last column of the first row and the first column of the second row do not contain line breaks;
  • you are not dealing with a file that has mixed line endings.

If you cannot count on these conditions, you will need to add some validation steps, such as checking whether the result of fgets() was indeed a multi-character string. If the lines might be shorter than 5 bytes, you also need to consider that the line endings might be \r\n, but by reading the raw bytes we end up with a string like "abcde\r\nfg\r", where the second \n has been cut off, and you get the wrong result.

But if you can be sure of the construction of the CSV file, this could be a (messy, I admit) step in the right direction.
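Several of the assumptions above can be relaxed by scanning for the first line ending directly instead of sampling 10 bytes around it. The following is only a minimal sketch, not part of the original answer: detectFirstEol() and its $chunkSize parameter are hypothetical names, it inspects only the first chunk of the file, and it still assumes that no quoted field contains a line break before the first real line ending.

```php
<?php
/**
 * Hypothetical helper: returns the first line ending found in the first
 * $chunkSize bytes of the file ("\r\n", "\r" or "\n"), or null if none is found.
 */
function detectFirstEol(string $path, int $chunkSize = 4096): ?string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }

    $data = (string) fread($handle, $chunkSize);

    // If the chunk ends in "\r", read one more byte so that a "\r\n" pair
    // sitting on the chunk boundary is not mistaken for a bare "\r".
    if ($data !== '' && substr($data, -1) === "\r") {
        $data .= (string) fread($handle, 1);
    }
    fclose($handle);

    // "\r\n" is listed first so that it wins over a bare "\r" at the same offset.
    if (preg_match('/\r\n|\r|\n/', $data, $match) === 1) {
        return $match[0];
    }

    return null;
}
```

Because the first match decides the result, a truncated "\r" at the end of the sample cannot outvote an earlier "\r\n". A file with mixed line endings would still be reported by its first line ending only.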



This is an interesting problem - and no one can give you a complete solution here. The obvious approaches are:

1) Keep reading the file until the first occurrence of \r or \n. In the case of the former, read one more character to see whether it is followed by \n.

This sounds very simple, but you need to implement quote handling to determine whether the EOL sits inside a quoted data field, and you don't know how the data is quoted. In addition to detecting the opening and closing quotes, you also need to determine whether a quote character is escaped, and there are at least two different ways of escaping quote characters.

2) Analyze the frequency of characters in the file. If you ignore spaces, alphabetic characters and digits, then the most common remaining characters should be the CSV metacharacters. But this won't work for very short files.

3) Build a representation of the data lines in the file and look for recurring patterns. For example, if you find number, space, alpha, space, number, punctuation, number, space, alpha, punctuation, alpha, space, number, punctuation, number, space, alpha, space, number, punctuation, then you can assume that the field delimiter was a space and that the records were delimited by punctuation, which could also appear as an embedded character.

But this requires very complex code.
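As a rough illustration of approach 1, here is a sketch that scans a string for the first line ending outside a quoted field. It is only valid under stated assumptions: the enclosure character is known (a double quote by default) and embedded quotes are escaped by doubling them, as in RFC 4180. Backslash-escaped quotes are not handled, and findFirstEolOutsideQuotes() is a hypothetical name.

```php
<?php
/**
 * Hypothetical helper: returns the first "\r\n", "\r" or "\n" that is not
 * inside a quoted field, or null if the data contains none.
 * Assumes quotes inside a field are escaped by doubling the enclosure.
 */
function findFirstEolOutsideQuotes(string $data, string $enclosure = '"'): ?string
{
    $inQuotes = false;
    $length = strlen($data);

    for ($i = 0; $i < $length; $i++) {
        $char = $data[$i];

        if ($char === $enclosure) {
            // A doubled enclosure ("") toggles the flag twice,
            // so the quoted state is unchanged after the pair.
            $inQuotes = !$inQuotes;
        } elseif (!$inQuotes && ($char === "\r" || $char === "\n")) {
            // For "\r", look one byte ahead to see whether it is part of "\r\n".
            if ($char === "\r" && $i + 1 < $length && $data[$i + 1] === "\n") {
                return "\r\n";
            }
            return $char;
        }
    }

    return null;
}
```

The line break inside the quoted field of a row such as `a,"x` + newline + `y",b` is skipped, and the line ending that terminates the row is returned instead.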

If it were me, I would just ask whoever provided the file for details of the file format. Or, if that information is not available, open the file in a hex editor.
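If no hex editor is at hand, the same inspection can be approximated from PHP itself. This is a small sketch with a hypothetical helper name, hexPreview(). In the output, 0d is \r and 0a is \n.

```php
<?php
/**
 * Hypothetical helper: formats raw bytes as space-separated hex pairs
 * so that line-ending bytes (0d = "\r", 0a = "\n") are easy to spot.
 */
function hexPreview(string $bytes): string
{
    return implode(' ', str_split(bin2hex($bytes), 2));
}

// Example with an in-memory string:
echo hexPreview("a,b\r\nc"), PHP_EOL; // prints "61 2c 62 0d 0a 63"
```

To look at a real file, the first bytes could be passed in with something like file_get_contents($path, false, null, 0, 64), where $path stands for your own CSV file.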



I used @rickdenhaan's solution and found an issue with arsort() that depends on the PHP version.

If the EOL is "\r\n", the $eols array will be:

array("\r\n" => 1, "\r" => 1, "\n" => 1);

(because in addition to the one "\r\n", one "\r" and one "\n" were also counted)

In PHP 7, after arsort($eols), the order of the keys is unchanged:

array("\r\n" => 1, "\r" => 1, "\n" => 1);

so after "$eol = key($eols);" $eol will be "\r\n".

But in PHP 5.6, after arsort($eols), the order of the keys is as follows:

array("\n" => 1, "\r" => 1, "\r\n" => 1);

so after "$eol = key($eols);" $eol will be "\n".

I solved it by adding this check after "$eol = key($eols);":

// all three counts are equal: every "\r" and "\n" belongs to a "\r\n" pair
if (($eols["\r\n"] == $eols["\r"]) && ($eols["\r\n"] == $eols["\n"])) {
    $eol = "\r\n";
}
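An alternative that does not depend on how arsort() orders equal values is to subtract the "\r\n" count from the "\r" and "\n" counts before comparing them. This is a sketch of that idea rather than the code from either answer, and pickEol() is a hypothetical name.

```php
<?php
/**
 * Hypothetical helper: picks the most frequent line ending in $data,
 * counting each "\r\n" pair only once. Returns null if none is present.
 */
function pickEol(string $data): ?string
{
    $crlf = substr_count($data, "\r\n");

    // Remove the "\r" and "\n" that belong to "\r\n" pairs from the bare counts.
    $counts = array(
        "\r\n" => $crlf,
        "\r"   => substr_count($data, "\r") - $crlf,
        "\n"   => substr_count($data, "\n") - $crlf,
    );

    $best = null;
    $bestCount = 0;
    foreach ($counts as $eol => $count) {
        // Strict ">" means ties go to the earlier key, so "\r\n" is preferred.
        if ($count > $bestCount) {
            $best = $eol;
            $bestCount = $count;
        }
    }

    return $best;
}
```

Ties are resolved by the fixed key order in the loop, so the result is the same on every PHP version. For a sample like "abcde\r\nfg\r", where a trailing "\r\n" has been cut in half, the tie between "\r\n" and "\r" resolves to "\r\n".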

      
