Why is this regex returning more groups than needed?

I was looking through a popular book on regex and found this regex, which is supposed to extract the values from a string of comma-separated values.

It is supposed to handle double-quoted fields, with "" treated as an escaped double quote (the sequence "" is allowed inside a pair of double quotes).

Here's a Perl script I wrote for this:

$str = "Ten Thousand,10000, 2710 ,,\"10,000\",\"It \"\"10 Grand\"\", baby\",10K";
#$regex = qr"(?:^|,)(?:\"((?:[^\"]|\"\")+)\"|([^\",]+))*";
$regex = qr!
        (?: ^|,)
        (?: 
            "
                ( (?: [^"] | "" )+ )
            "
            |
            ( [^",]+ )
        )
    !x;

@matches = ($str =~ m#$regex#g);
print "\nString : $str\n";
if (scalar(@matches) > 0 ) {
    print "\nMatches\n";
    print "\nNumber of groups: ", scalar(@matches), "\n";
    for ($i=0; $i < scalar(@matches); $i++) {
        print "\nGroup $i - |$matches[$i]|\n";
    }
}
else {
    print "\nDoesnt match\n";
}

      

This is the output I expect (this is also what the author expects, as far as I can figure it out):

String : Ten Thousand,10000, 2710 ,,"10,000","It ""10 Grand"", baby",10K
   Matches
   Number of groups: 7
   Group 0 - |Ten Thousand|
   Group 1 - |10000|
   Group 2 - | 2710 |
   Group 3 - |10,000|
   Group 4 - ||
   Group 5 - |It ""10 Grand"", baby|
   Group 6 - |10K|

      

This is the result I am getting:

String : Ten Thousand,10000, 2710 ,,"10,000","It ""10 Grand"", baby",10K
   Matches
   Number of groups: 12
   Group 0 - ||
   Group 1 - |Ten Thousand|
   Group 2 - ||
   Group 3 - |10000|
   Group 4 - ||
   Group 5 - | 2710 |
   Group 6 - |10,000|
   Group 7 - ||
   Group 8 - |It ""10 Grand"", baby|
   Group 9 - ||
   Group 10 - ||
   Group 11 - |10K|

      

Can someone explain why there are empty groups in the actual output (other than the one expected next to 10,000)? I copied the regex directly from the book, so is there anything else I am doing wrong?

TIA



3 answers


This regex has 2 capturing groups and several non-capturing groups. When you applied the regex to the string, you used the g modifier so that it keeps matching as many times as it can. In this case the pattern matched 6 times, each time returning 2 captured groups, for a total of 12 elements in the array. In each match only one of the two alternatives takes part, so the other group comes back empty (undefined).
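As a minimal sketch of that behaviour, the snippet below runs the same pattern (written without the x flag) against a shorter, made-up string. The sample string and the variable names are assumptions for illustration only. It first shows the list-context result, then one possible way to keep a single value per match by looping in scalar context.

```perl
use strict;
use warnings;

# Shortened sample input (an assumption, not the string from the question).
my $str   = 'a,"b,c",d';
my $regex = qr/(?:^|,)(?:"((?:[^"]|"")+)"|([^",]+))/;

# In list context, m//g returns every capture group of every match.
# A group that did not take part in a match comes back as undef.
my @all = ($str =~ /$regex/g);
print scalar(@all), "\n";    # 3 matches x 2 groups

# In scalar context, m//g steps through one match at a time,
# so we can pick whichever of the two groups actually matched.
my @fields;
while ($str =~ /$regex/g) {
    push @fields, defined $1 ? $1 : $2;
}
print join('|', @fields), "\n";
```

Note that this loop still skips completely empty fields, because neither alternative in the pattern can match an empty string.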

The regular expression:

(?-imsx:!
        (?: ^|,)

        (?:

            "

                ( (?: [^"] | "" )+ )

            "

            |

            ( [^",]+ )
        )
    !x)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  !                        '!\n        '
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
    ^                        the beginning of the string
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ,                        ','
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
                           '\n\n        '
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
                  "          '\n\n            "\n\n                '
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
      (?:                      group, but do not capture (1 or more
                               times (matching the most amount
                               possible)):
----------------------------------------------------------------------
                                 ' '
----------------------------------------------------------------------
        [^"]                     any character except: '"'
----------------------------------------------------------------------
                                 ' '
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
         ""                      ' "" '
----------------------------------------------------------------------
      )+                       end of grouping
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
                  "          '\n\n            "\n\n            '
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
                             '\n\n            '
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
      [^",]+                   any character except: '"', ',' (1 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
                             '\n        '
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
       !x                  '\n    !x'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

      

TLP already mentioned that you can also use the Text::CSV module. Here's an example.



#!/usr/bin/perl

use strict;
use warnings;
use Text::CSV_XS;
use Data::Dumper;

my $csv = Text::CSV_XS->new({binary => 1, eol => $/, allow_whitespace => 1});

while (my $row = $csv->getline (*DATA)) {
    print Dumper $row;
}

__DATA__
Ten Thousand,10000, 2710 ,,"10,000","It ""10 Grand"", baby",10K;

      

Outputs:

$VAR1 = [
          'Ten Thousand',
          '10000',
          '2710',
          '',
          '10,000',
          'It "10 Grand", baby',
          '10K;'
        ];

      



You may find the core module Text::ParseWords useful. It does everything you are trying to do in a few lines of code. Also note that you can use q() and qq() to emulate single and double quotes, so you don't have to escape the quotes. They can be used with almost any punctuation character as the delimiter, as can most Perl quote-like operators.

use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;

my $str = q(Ten Thousand,10000, 2710 ,,"10,000","It ""10 Grand"", baby",10K);
my @words = quotewords(',', 1, $str);
print Dumper \@words;

      

Output:



$VAR1 = [
          'Ten Thousand',
          '10000',
          ' 2710 ',
          '',
          '"10,000"',
          '"It ""10 Grand"", baby"',
          '10K'
        ];

      

(Note: Data::Dumper escapes single quotes inside its single-quoted output, so a value such as It's is printed as It\'s.)
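To illustrate the q() and qq() operators mentioned in this answer, here is a small sketch. The variable names and sample strings are made up for the example.

```perl
use strict;
use warnings;

my $name = 'Grand';

# q{} behaves like single quotes: no interpolation, no need to escape ".
my $single = q{"10 $name", baby};

# qq{} behaves like double quotes: $name is interpolated.
my $double = qq{"10 $name", baby};

# Other punctuation works as a delimiter too, which helps when the
# text itself contains both kinds of quote.
my $other = q!It's "quoted"!;

print "$single\n$double\n$other\n";
```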

If your data is proper CSV, you can use Text::CSV.



I agree with @RonBergin. Capture groups are always returned, whether or not they took part in the match.
So if you have 2 capture groups and 6 matches, this creates an array of 12 elements.

It looks like you want to trim the values and merge the two capture groups into one. One way to do that is a branch reset, which numbers the groups in each alternative in parallel, so both alternatives capture into group 1.

I didn't want to change your regex much; however, the example below uses a branch reset with a few additions for robustness.

 # (?:^|,)(?|\s*"((?:[^"]|"")*)"\s*|\s*([^",]*?)\s*)(?=,|$)

 (?: ^ | , )                     # BOL or comma
 (?|                             # Start Branch Reset
      \s* 
      "
      (                               # (1 start), Quoted content
           (?: [^"] | "" )*
      )                               # (1 end)
      "
      \s* 
   |  
      \s*                             # Whitespace trim
      ( [^",]*? )                     # (1), Optional Non-quoted content
      \s*                             # Whitespace trim
 )                               # End Branch Reset
 (?= , | $ )                     # Lookahead for comma or EOL
                                 # (needed because content is optional)
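A sketch of how this pattern could be applied to the string from the question follows. The surrounding script (variable names, the while loop) is an assumption for illustration; the pattern itself is the one above. Branch reset needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $str = q(Ten Thousand,10000, 2710 ,,"10,000","It ""10 Grand"", baby",10K);

my $regex = qr{
    (?: ^ | , )                  # start of string or comma
    (?|                          # branch reset: both alternatives use group 1
        \s* " ( (?: [^"] | "" )* ) " \s*    # quoted content
      |
        \s* ( [^",]*? ) \s*                 # optional unquoted content, trimmed
    )
    (?= , | $ )                  # must be followed by a comma or the end
}x;

# Every match sets exactly one group, so $1 is always the field value.
my @fields;
while ($str =~ /$regex/g) {
    push @fields, $1;
}

print "Number of fields: ", scalar(@fields), "\n";
print "Field $_ - |$fields[$_]|\n" for 0 .. $#fields;
```

With this pattern the empty field between the two commas is returned as an empty string, and the unquoted value ` 2710 ` comes back without its surrounding spaces because of the whitespace trim. Escaped quotes inside a quoted field are left as "" and would still need to be collapsed separately if that is wanted.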

      
