Perl regex is not greedy enough

Question

I am writing a regex in Perl to match the Perl code that starts a subroutine definition. Here's my regex:

my $regex = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';

$regex matches the code that starts the subroutine. I'm also trying to capture the subroutine name in $1, and any whitespace and comments between the subroutine name and the opening curly brace in $2. It's $2 that is giving me the problem.

Consider the following perl code:

my $x = 1;

sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
    $x = 2;
    return;
}

When I put this Perl code into a string and match it against $regex, $2 is "# This is comment 3.\n", not the three lines of comments I want. I thought the regex would greedily put all three comment lines into $2, but that doesn't seem to be the case.

I would like to understand why $regex does not work, and to find a simple replacement. As the program below shows, I have a more complex replacement ($re3) that works. But I think it is important for me to understand why $regex is not working.

use strict;
use English;

my $code_string = <<END_CODE;
my \$x = 1;

sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
    \$x = 2;
    return;
}
END_CODE

my $re1 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{';
my $re2 = '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{';
my $re3 = '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{';

print "\$code_string is '$code_string'\n";
if  ($code_string =~ /$re1/) {print "For '$re1', \$2 is '$2'\n";}
if  ($code_string =~ /$re2/) {print "For '$re2', \$2 is '$2'\n";}
if  ($code_string =~ /$re3/) {print "For '$re3', \$2 is '$2'\n";}
exit 0;

__END__

The output from the perl script above is:

$code_string is 'my $x = 1;

sub zz
# This is comment 1.
# This is comment 2.
# This is comment 3.
{
    $x = 2;
    return;
}
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)(\s*#.*\n){0,}\s*\{', $2 is '# This is comment 3.
'
For '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{', $2 is '
# This is comment 1.
# This is comment 2.
# This is comment 3.
'

+3

regex perl regex-greedy

David Levner 13 Mar 12 at 19:51

3 answers

If you add a repetition to a capture group, the group only captures its final match. This is why, with $regex, $2 contains only the final comment line.

This is how I would rewrite the regex for you:

my $regex = '\s*sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{';

This is very similar to your $re3, except for the following changes:

The part that matches whitespace and comments is now a non-capturing group.
I changed that part of the regex from ((...)+)? to ((...)*), which is equivalent.

+4

Andrew Clark 13 Mar 12 at 19:59

The problem is that, by default, \n is not matched by . so the regular expression stops at \n.

You need to use the s modifier for multi-line matches:

if  ($code_string =~ /$re1/s) {print "For '$re1', \$2 is '$2'\n";}

Notice the s after the regex.
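For reference, here is a minimal sketch of what the s modifier changes on its own. The sample string and variable names are assumptions made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical two-line input, not taken from the question.
my $text = "first\nsecond";

# Without /s, . does not match a newline, so .* stops at the end of the line.
my ($plain) = $text =~ /(.*)/;

# With /s, . also matches a newline, so .* can run across lines.
my ($dotall) = $text =~ /(.*)/s;

print "without /s: [$plain]\n";
print "with /s:    [$dotall]\n";
```

Here $plain holds only the first line, while $dotall holds the whole string including the newline.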

+1

Nathan Fellman 13 Mar 12 at 19:55

Ryan Thompson · Accepted Answer · 2012-03-13T20:02:50+0000

Look only at the part of your regex that captures $2. It is (\s*#.*\n). By itself, this can only capture a single comment line. After it you have an asterisk to match multiple comment lines, and that works just fine. It matches multiple comment lines and puts each of them into $2, one after another, overwriting the previous value of $2 each time. So the final value of $2 when the regex finishes is the last thing the capture group matched, which is the final comment line. Only. To fix this, you need to put the asterisk inside the capture group. But then you need to add another set of parentheses (non-capturing, this time) to make sure the asterisk applies to the whole thing. So instead of (\s*#.*\n)* you need ((?:\s*#.*\n)*).
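The difference can be seen in a small sketch. The input string is a shortened, hypothetical version of the question's example, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input, shortened from the question's example.
my $code = "sub zz\n# comment 1\n# comment 2\n{\n";

# Asterisk outside the capture group: the group is overwritten on each
# iteration, so it ends up holding only the last comment line.
my ($name_a, $last) = $code =~ /sub\s+([a-zA-Z_]\w*)(\s*#.*\n)*\s*\{/;

# Asterisk inside the capture group (around a non-capturing group):
# the group holds everything the repetition matched.
my ($name_b, $all) = $code =~ /sub\s+([a-zA-Z_]\w*)((?:\s*#.*\n)*)\s*\{/;

print "last iteration only: [$last]\n";
print "whole repetition:    [$all]\n";
```

With this input, $last contains only the "# comment 2" line, while $all contains both comment lines, preceded by the newline that the first \s* consumed.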

Your third regex works because you unwittingly surrounded the whole expression in parentheses so that you could put a question mark after it. This caused $2 to capture all the comments at once, while $3 captures only the final comment.

When you debug your regex, make sure you print the values of all the match variables you use: $1, $2, $3, etc. You would have seen that $1 was just the subroutine name and $2 was only the third comment. This might have got you wondering how your regex skipped over the first two comments when there is nothing between the first and second capture groups, which would eventually have led you to find out what happens when a capture group matches multiple times.
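One way to print every capture group at once is to run the match in list context, which returns all the groups in order. This is only a sketch: the shortened input and the @groups name are assumptions for illustration, and the pattern is the question's $re3.

```perl
use strict;
use warnings;

# Hypothetical sample input, shortened from the question's example.
my $code = "sub zz\n# comment 1\n# comment 2\n{\n";
my $re3  = '\s*sub\s+([a-zA-Z_]\w*)((\s*#.*\n)+)?\s*\{';

# In list context a successful match returns ($1, $2, $3, ...).
my @groups = $code =~ /$re3/;

for my $i (0 .. $#groups) {
    # A group that did not take part in the match is undef.
    my $shown = defined $groups[$i] ? "'$groups[$i]'" : 'undef';
    printf "\$%d is %s\n", $i + 1, $shown;
}
```

Printed this way, it is easy to see that the outer group holds all the comment lines while the inner, repeated group holds only the last one.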

By the way, it looks like you are also capturing any whitespace after the subroutine name into $1. Is this intentional? (Oops, I mixed up my mnemonics and thought \w was "w for whitespace".)
