How can I get the minimum match between two known tokens?

I have a selection of text that looks like this. I need to do a rudimentary edit on it, but I can't seem to figure out the regex I need. Maybe it has been a long day and I don't see what I need.

Sample data:

START ITEM = 1235
    BEGIN
        WORD
        RATE = 98
        MORE WORDS
        CODE = XX
        STUFF
    END
    BEGIN
        TEXT
        MORE WORDS
        RATE = 57
        ADDITIONAL TEXT
        CODE = YY
        OTHER THINGS
    END
STOP
START ITEM = 9983
    BEGIN
        WORD
        RATE = 01
        MORE WORDS
        CODE = AA
        STUFF
    END
    BEGIN
        TEXT
        MORE WORDS
        RATE = 99
        ADDITIONAL TEXT
        CODE = XX
        OTHER THINGS
    END
STOP

      

I am given a CODE and an ITEM number, and I need to edit the RATE in the matching BEGIN/END section. Fortunately, the sections are well delimited by START/STOP and BEGIN/END (these are keywords and do not appear anywhere else).

My toolbox for this is Perl regular expressions. *

The first solution I tried doesn't work (values are hardcoded):

    $tx =~ s/(START \s ITEM \s = \s 9983.*?
                            BEGIN
                                .*?
                               RATE \s = \s )\d+
                                    (.*?       # Goes too far
                                CODE \s = \s XX)
                        /$1$newRate$2
                        /sx;

      

Because the .*? after RATE matches too much, it finds the correct CODE further along but always edits the first RATE entry.

Suggestions?


* The actual code works by adding a regex to a stack of regexes (like post-processing filters), each of which is applied in turn to the text being edited. Heck, I could write a full parser if I had the text. But I was hoping not to rip this code apart, and to stick with the API I have.

+2




3 answers


Regular expressions are not a good fit for this kind of problem. I recommend a simple iterative solution:

while (<FILE>) {
    # push lines straight to output until we find the START that we want
    print OUT $_;
    next unless m/START ITEM = $number/;

    # save the lines until we get to the CODE that we want
    my @lines;
    while (<FILE>)
    {
        push @lines, $_;
        last if m/CODE = $code/;
    }

    # @lines now has everything from the START to the CODE. Find the last RATE
    # line in @lines and change only its numeric value.
    my ($idx) = grep { $lines[$_] =~ m/RATE/ } reverse 0 .. $#lines;
    $lines[$idx] =~ s/(RATE\s*=\s*)\d+/$1$new_value/ if defined $idx;

    # print out the lines we saved and exit the loop
    # (the remaining lines of FILE still have to be copied afterwards)
    print OUT @lines;
    last;
}

      

Edit: If you really want a regex you can use something like this (untested):



$tx =~ s/(START \s+ ITEM \s+ = \s+ 9983.*?
                            BEGIN
                                .*?
                               RATE \s+ = \s+ )\d+
                                ( (?: (?! END ) . )*
                                    CODE \s+ = \s+ XX)
                        /$1$newRate$2
                        /sx;

      

The added (?: (?! END ) . )* ensures that the match between RATE and CODE does not cross an END. But this will be significantly slower than the non-regex approach.
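To make the difference concrete, here is a minimal sketch that runs both forms against a cut-down copy of item 9983 from the question. The variable names and the replacement value 42 are made up for illustration, and the START/ITEM prefix is left out to keep the patterns short.

```perl
use strict;
use warnings;

# Cut-down copy of item 9983 from the question (illustrative only).
my $item = <<'EOT';
START ITEM = 9983
    BEGIN
        RATE = 01
        CODE = AA
    END
    BEGIN
        RATE = 99
        CODE = XX
    END
STOP
EOT

my $new_rate = 42;    # assumed replacement value

# Lazy dot: it is free to run past END, so the first RATE gets captured
# and the match simply stretches forward until it reaches CODE = XX.
(my $lazy = $item) =~ s/
    (RATE \s+ = \s+) \d+
    (.*? CODE \s+ = \s+ XX)
/$1$new_rate$2/sx;

# Tempered dot: a character is consumed only if END does not start there,
# so RATE and CODE must sit in the same BEGIN..END section.
(my $tempered = $item) =~ s/
    (RATE \s+ = \s+) \d+
    ((?: (?! END ) . )* CODE \s+ = \s+ XX)
/$1$new_rate$2/sx;

print "--- lazy ---\n$lazy--- tempered ---\n$tempered";
```

The lazy version rewrites the RATE in the first section, which is the behaviour described in the question; the tempered version rewrites the RATE that shares a section with CODE = XX.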

+6




Even though I don't like how much it backtracks, making the match between BEGIN and RATE greedy will let you skip ahead to the RATE in the section where CODE = XX. Like this:

$tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+ 
                        BEGIN
                            .*
                           RATE \s+ = \s+ )\d+
...

      

The main problem is that it will run over into another ITEM if it has to, so you need to make sure you don't gobble up a STOP. For example:

my $tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
                 BEGIN
                     (?: (?! \b STOP \b ) . )*
                    RATE \s+ = \s+ )\d+
                         (.*?       # Goes too far
                     CODE \s+ = \s+ XX)
          /msx
          ;

      

It still backtracks more than I would like.

(An hour later) I realized that the RATE field and the CODE field whose value is XX must not be separated by an END. So another solution:

my $tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
                 BEGIN
                     .*?
                    RATE \s+ = \s+ )\d+
                         ((?:(?! ^ \s+ END \s* $ ) . )*? 
                     CODE \s+ = \s+ XX)
                        /msx
                        ;

      



(I revised this to look for an END on a line by itself. If ADDITIONAL TEXT could contain a lone END, it would be hard to parse no matter what.)

I think it doesn't backtrack as much, because it starts at RATE = and then scans for CODE = before it hits END; if it doesn't find CODE = XX, it gets pruned back to where it matched RATE and looks for the next RATE. We could add a negative lookahead for STOP if we don't know that item 9983 will definitely have an 'XX' code.
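The STOP guard suggested above could be combined with the END guard roughly as follows. This is only a sketch, not the answerer's code: set_rate and its parameters are hypothetical names, the sample is a shortened copy of the question's data, and it uses word-boundary checks for END and STOP instead of the line-anchored test.

```perl
use strict;
use warnings;

# Hypothetical helper: set the RATE of the section carrying $code inside
# item $item. Returns the text unchanged if no such section exists.
sub set_rate {
    my ($text, $item, $code, $new_rate) = @_;
    $text =~ s/
        ( START \s+ ITEM \s+ = \s+ \Q$item\E \b
          (?: (?! \b STOP \b ) . )*?     # never leave this item
          RATE \s+ = \s+ )
        \d+
        ( (?: (?! \b END \b ) . )*?      # never leave this section
          CODE \s+ = \s+ \Q$code\E \b )
    /$1$new_rate$2/sx;
    return $text;
}

# Shortened copy of the sample data from the question.
my $sample = <<'EOT';
START ITEM = 1235
    BEGIN
        RATE = 98
        CODE = XX
    END
    BEGIN
        RATE = 57
        CODE = YY
    END
STOP
START ITEM = 9983
    BEGIN
        RATE = 01
        CODE = AA
    END
    BEGIN
        RATE = 99
        CODE = XX
    END
STOP
EOT

my $edited = set_rate($sample, 9983, 'XX', 42);
print $edited;
```

With the STOP guard in place, asking for a code that item 9983 does not contain leaves the text untouched instead of spilling into a later item.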


Edited to correct the \s error.

Note: it now copes with input like this, where END appears in the middle of a line:

START ITEM = 9983
    BEGIN
        WORD
        RATE = 01
        MORE WORDS
        CODE = AA
        STUFF
    END
    BEGIN
        TEXT
        MORE WORDS
        RATE = 99
        ADDITIONAL TEXT <-- DON'T END HERE!
        CODE = XX
        OTHER THINGS
    END
STOP

      

+4




Regular expressions are not always the best answer for parsing text. Your example shows that you actually have a file that can be represented by a grammar. It is much easier to use a parser to extract fields and then update the extracted information.
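As a rough illustration of that idea, the sketch below walks the text line by line, buffers each BEGIN..END section, and rewrites the RATE only when the enclosing ITEM and the section's CODE both match. It is a small state machine rather than a real grammar-based parser, and set_rate_by_parsing and its parameters are hypothetical names, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of $text with the RATE changed in
# the section of item $item whose CODE equals $code.
sub set_rate_by_parsing {
    my ($text, $item, $code, $new_rate) = @_;
    my ($current_item, $section, @out);

    for my $line (split /^/m, $text) {          # keeps the newlines
        if ($line =~ /^\s*START\s+ITEM\s*=\s*(\S+)/) {
            $current_item = $1;
        }
        elsif ($line =~ /^\s*STOP\s*$/) {
            $current_item = undef;
        }

        if (!defined $section) {
            if ($line =~ /^\s*BEGIN\s*$/) { $section = [$line] }  # open a section
            else                          { push @out, $line }
            next;
        }

        push @$section, $line;
        next unless $line =~ /^\s*END\s*$/;     # still inside the section

        # Section complete: edit it only if it is the requested one.
        my $wanted = defined $current_item
                  && $current_item eq $item
                  && grep { /^\s*CODE\s*=\s*\Q$code\E\s*$/ } @$section;
        if ($wanted) {
            s/^(\s*RATE\s*=\s*)\d+/$1$new_rate/ for @$section;
        }
        push @out, @$section;
        $section = undef;
    }
    push @out, @$section if defined $section;   # unterminated section: emit as-is
    return join '', @out;
}

print set_rate_by_parsing(
    "START ITEM = 1\nBEGIN\nRATE = 5\nCODE = XX\nEND\nSTOP\n", 1, 'XX', 7);
```

Because END is only recognised on a line by itself, a line such as ADDITIONAL TEXT that merely contains the word END does not close the section.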

0

