How can I get the minimum match between two known tokens?
I have a block of text that looks like the sample below. I need to make a rudimentary edit to it, but I can't figure out the regex I need. Maybe it has been a long day and I'm just not seeing it.
Sample data:
START ITEM = 1235
BEGIN
WORD
RATE = 98
MORE WORDS
CODE = XX
STUFF
END
BEGIN
TEXT
MORE WORDS
RATE = 57
ADDITIONAL TEXT
CODE = YY
OTHER THINGS
END
STOP
START ITEM = 9983
BEGIN
WORD
RATE = 01
MORE WORDS
CODE = AA
STUFF
END
BEGIN
TEXT
MORE WORDS
RATE = 99
ADDITIONAL TEXT
CODE = XX
OTHER THINGS
END
STOP
I am given an ITEM number and a CODE, and I need to edit the RATE in the matching BEGIN/END section. Fortunately, the sections are well delimited by START/STOP and BEGIN/END (these are keywords and do not appear anywhere else).
My tool for this is Perl regular expressions. *
The first solution I tried doesn't work (values are hardcoded):
$tx =~ s/(START \s ITEM \s = \s 9983.*?
          BEGIN
          .*?
          RATE \s = \s )\d+
         (.*?            # Goes too far
          CODE \s = \s XX)
        /$1$newRate$2/sx;
Because the lazy .*? before CODE is free to run past END, the pattern does find the correct CODE further down, but it always edits the first RATE entry.
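To make the failure concrete, here is a minimal sketch that runs the pattern above against a shortened copy of item 9983. The data is abbreviated, and the value 42 for $newRate is an arbitrary choice for illustration.

```perl
use strict;
use warnings;

# Abbreviated copy of item 9983 from the sample data.
my $tx = <<'END_DATA';
START ITEM = 9983
BEGIN
RATE = 01
CODE = AA
END
BEGIN
RATE = 99
CODE = XX
END
STOP
END_DATA

my $newRate = 42;    # arbitrary value, for illustration only

# The pattern from the question. The lazy .*? before CODE may cross END,
# so the first RATE gets paired with the CODE = XX of the next section.
$tx =~ s/(START \s ITEM \s = \s 9983.*?
          BEGIN
          .*?
          RATE \s = \s )\d+
         (.*?
          CODE \s = \s XX)
        /$1$newRate$2/sx;

print $tx;    # the RATE above CODE = AA is now 42; RATE = 99 is untouched
```

Lazy quantifiers only make the match as short as possible from a fixed starting point; they do not stop the engine from accepting the first RATE and then stretching forward to a later CODE.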
Suggestions?
* The actual code works by adding a regex to a stack of regexes (like a post-processing filter), each of which is applied in turn to the text being edited. Sure, I could write a full parser if I had to, but I was hoping not to break this code and to stick with the API that I have.
A regular expression is not a good fit for this kind of problem. I recommend a simple iterative solution:
while (<FILE>) {
    # copy lines straight to the output until we find the START that we want
    print OUT $_;
    next unless m/START ITEM = $number/;

    # save the lines until we get to the CODE that we want
    my @lines;
    while (<FILE>) {
        push @lines, $_;
        last if m/CODE = $code/;
    }

    # @lines now holds everything from the START to the CODE. Find the
    # last RATE line in @lines and change its value.
    for my $i (reverse 0 .. $#lines) {
        last if $lines[$i] =~ s/(RATE = )\d+/$1$new_value/;
    }

    # print out the lines we saved and exit the loop
    print OUT @lines;
    last;
}

# copy the rest of the input unchanged
print OUT $_ while <FILE>;
Edit: If you really want a regex you can use something like this (untested):
$tx =~ s/(START \s+ ITEM \s+ = \s+ 9983.*?
          BEGIN
          .*?
          RATE \s+ = \s+ )\d+
         ( (?: (?! END ) . )*
          CODE \s+ = \s+ XX)
        /$1$newRate$2/sx;
The added (?: (?! END ) . )* ensures that the match between RATE and CODE does not cross an END. But this will be significantly slower than the non-regex approach.
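As a runnable check of this idea, the sketch below wraps the substitution in a small helper and applies it to a shortened version of the sample data. The helper name set_rate, its parameters, and the value 42 are placeholders invented for illustration; they are not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: set the RATE of the BEGIN/END section that
# carries $code inside item $item, and return the edited text.
sub set_rate {
    my ($text, $item, $code, $new_rate) = @_;
    $text =~ s/(START \s+ ITEM \s+ = \s+ \Q$item\E \b .*?
                BEGIN
                .*?
                RATE \s+ = \s+ )\d+
               ( (?: (?! END ) . )*      # any run of characters with no END in it
                CODE \s+ = \s+ \Q$code\E \b )
              /$1$new_rate$2/sx;
    return $text;
}

# Shortened version of the sample data.
my $data = <<'END_DATA';
START ITEM = 1235
BEGIN
RATE = 98
CODE = XX
END
BEGIN
RATE = 57
CODE = YY
END
STOP
START ITEM = 9983
BEGIN
RATE = 01
CODE = AA
END
BEGIN
RATE = 99
CODE = XX
END
STOP
END_DATA

my $edited = set_rate($data, 9983, 'XX', 42);
print $edited;    # only the RATE above CODE = XX in item 9983 changes
```

One caveat with this sketch: nothing stops the lazy .*? between BEGIN and RATE from running past STOP, so if the requested item has no section with the requested code, the match can spill into a later item and edit a RATE there.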
Even though I don't like how much it backtracks, making the match between BEGIN and RATE greedy will let you skip ahead to the RATE in the section where CODE = XX. Like this:
$tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
          BEGIN
          .*
          RATE \s+ = \s+ )\d+
...
The main problem is that it will run on into another ITEM if it needs to, so you have to make sure it doesn't gobble up a STOP. For example:
my $tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
             BEGIN
             (?: (?! \b STOP \b ) . )*
             RATE \s+ = \s+ )\d+
            (.*?            # Goes too far
             CODE \s+ = \s+ XX)
           /msx;
It still backtracks more than I would like.
(An hour later) I realized that the RATE field and the CODE field whose value is XX must not be separated by an END. So here is another solution:
my $tx = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
             BEGIN
             .*?
             RATE \s+ = \s+ )\d+
            ((?:(?! ^ \s* END \s* $ ) . )*?
             CODE \s+ = \s+ XX)
           /msx;
(I revised this to look for an END on a line by itself. If ADDITIONAL TEXT could contain a lone END, the data would be hard to parse no matter what.)
I think it backtracks less because it starts at a RATE = and then scans for CODE = until it hits END. If it doesn't find CODE = XX, it gets pruned back to where it matched RATE and looks for the next RATE. We could add a negative lookahead for STOP if we can't be sure that item 9983 will definitely have an 'XX' code.
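The STOP lookahead mentioned above might look like the following sketch. It keeps the END guard between RATE and CODE and adds a similar guard on the stretch between BEGIN and RATE. Two assumptions are mine rather than the answer's: \s* is used before END and STOP so that unindented lines also match, and the test data and the value 42 are made up for illustration.

```perl
use strict;
use warnings;

# END guard between RATE and CODE, plus a STOP guard between BEGIN and RATE,
# so that a missing code cannot drag the match into the next item.
# Item number and code are hardcoded, as in the answer.
my $re = qr/(START \s+ ITEM \s+ = \s+ 9983 \s+
             BEGIN
             (?: (?! ^ \s* STOP \s* $ ) . )*?
             RATE \s+ = \s+ )\d+
            ( (?: (?! ^ \s* END \s* $ ) . )*?
             CODE \s+ = \s+ XX )
           /msx;

my $new_rate = 42;    # arbitrary value, for illustration only

# Case 1: item 9983 has an XX section, so its RATE should be edited.
my $has_xx = <<'END_DATA';
START ITEM = 9983
BEGIN
RATE = 01
CODE = AA
END
BEGIN
RATE = 99
CODE = XX
END
STOP
END_DATA

# Case 2: item 9983 has no XX section, but the following item does.
my $no_xx = <<'END_DATA';
START ITEM = 9983
BEGIN
RATE = 01
CODE = AA
END
STOP
START ITEM = 1235
BEGIN
RATE = 98
CODE = XX
END
STOP
END_DATA

my $edited = $has_xx;
$edited =~ s/$re/$1$new_rate$2/;

my $untouched = $no_xx;
my $changed = ($untouched =~ s/$re/$1$new_rate$2/) ? 1 : 0;

print $edited;
print "second text changed: $changed\n";
```

With the STOP guard in place, the second text is left alone because the only CODE = XX sits behind a STOP line that the pattern is not allowed to cross.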
Edited to correct the \s error.
Note: it now looks like this:
START ITEM = 9983
BEGIN
WORD
RATE = 01
MORE WORDS
CODE = AA
STUFF
END
BEGIN
TEXT
MORE WORDS
RATE = 99
ADDITIONAL TEXT <-- DON'T END HERE!
CODE = XX
OTHER THINGS
END
STOP