How can I make this regex more compact?

Question

Let's say I have a line of text like this

Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56

I want to grab the last six numbers and ignore "Small" and the first two groups of numbers.

For this exercise, let's ignore the fact that it would be easier to just do some sort of string split instead of a regular expression.

I have this regex that works but looks pretty awful.

^(Small).*?[0-9.]+.*?[0-9.]+.*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+)

Is there a way to condense this?

For example, is it possible to combine the check for the last 6 numbers into a single statement that stores the results as 6 separate group matches?

+1

regex

Mark Biek 15 Nov '08 at 22:14

3 answers

If you want to keep each match in a separate backreference, you have no choice but to spell it out. If you use repetition, you can either capture all six numbers as one group or capture only the last one, depending on where you put the capturing parentheses. So no, it is not possible to compact the regex and still keep all six individual matches.
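The point about repetition can be checked with a small sketch. This uses Python's `re` module, which is an assumption on my part since the question does not name a regex flavor; most flavors behave the same way here. The variable names are only illustrative.

```python
import re

line = "Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56"

# Capturing group inside the repetition: each iteration overwrites the
# previous one, so only the final number survives in group 1.
repeated = re.match(r"^Small\s+(?:[0-9.]+\s+){2}(?:([0-9.]+)\s*){6}$", line)
last_only = repeated.groups()

# Capturing group around the repetition: all six numbers come back,
# but as a single string that still has to be split afterwards.
wrapped = re.match(r"^Small\s+(?:[0-9.]+\s+){2}((?:[0-9.]+\s*){6})$", line)
as_one = wrapped.group(1)

print(last_only)       # one-element tuple holding the last number
print(as_one.split())  # the six numbers, recovered by splitting
```

Either way the pattern yields one group, not six, which is why the explicit version below repeats the capture six times.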

A slightly more efficient (though not pretty) regular expression would be:

^Small\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)

since it matches the whitespace explicitly. Your regex causes a lot of backtracking. My regex matches in 28 steps, yours takes 106.

As an aside: in Python, you can simply do

>>> pieces = "Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56".split()[-6:]
>>> print(pieces)
['1.49', '25.71', '41.05', '12.31', '0.00', '80.56']

+5

Tim Pietzcker 15 Nov '08 at 22:24

For readability, you could use string interpolation to build the regular expression from its constituent parts.

$d = "[0-9.]+"; 
$s = ".*?"; 

$re = "^(Small)$s$d$s$d$s($d)$s($d)$s($d)$s($d)$s($d)$s($d)";

At least then you can see the structure behind the pattern, and changing one part changes them all.

If you want to take it further, you can define a quick meta-syntax and make it even easier to read:

$re = "^(Small)_#D_#D_(#D)_(#D)_(#D)_(#D)_(#D)_(#D)"; 
$re = str_replace('#D','[0-9.]+',$re); 
$re = str_replace('_', '.*?' , $re );

(This also makes it trivial to change the definition of what the separator token is or what the number token is.)
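The same idea can be sketched in Python. The helper `build_pattern` and its parameter names are hypothetical, introduced here only for illustration; with its defaults it reproduces the regex from the question.

```python
import re

def build_pattern(num=r"[0-9.]+", sep=r".*?"):
    """Assemble the pattern from a number token and a separator token."""
    skipped = [num] * 2            # first two number groups, not captured
    captured = [f"({num})"] * 6    # last six numbers, one group each
    return "^(Small)" + sep + sep.join(skipped + captured)

line = "Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56"

# Default tokens, as in the question.
groups = re.match(build_pattern(), line).groups()

# Swapping the separator for explicit whitespace is a one-argument change.
strict_groups = re.match(build_pattern(sep=r"\s+"), line).groups()

print(groups)
print(strict_groups)
```

The assembled pattern is just as long as before, but the source that produces it is short and each token is defined in exactly one place.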

+1

Kent Fredric 15 Nov '08 at 22:45

PhiLho · Accepted Answer · 2008-11-15T23:01:20+0000

Here's the shortest I could get:

^Small\s+(?:[\d.]+\s+){2}([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$

It has to be long because every capture must be written out explicitly. There is no need to capture "Small", though. And it's better to be specific (\s instead of .) when you can, and to anchor at both ends.
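A minimal sketch of this pattern in use, assuming Python's `re` module (the thread does not say which engine the asker uses). It also shows what anchoring at both ends buys: a line with trailing junk is rejected rather than partially matched.

```python
import re

# The pattern above: "Small" is not captured, the first two number
# groups are skipped, and the last six are captured individually.
PATTERN = re.compile(
    r"^Small\s+(?:[\d.]+\s+){2}"
    r"([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)

line = "Small   0.0..20.0   0.00    1.49    25.71   41.05   12.31   0.00    80.56"

numbers = PATTERN.match(line).groups()

# Because of the trailing \s*$, anything after the sixth number
# other than whitespace makes the whole match fail.
rejected = PATTERN.match(line + " extra")

print(numbers)
print(rejected)
```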
