How can I make this regex more compact?
Let's say I have a line of text like this:
Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56
I want to grab the last six numbers and ignore the "Small" and the first two groups of numbers.
For this exercise, let's ignore the fact that it would be easier to just do some sort of string splitting instead of a regular expression.
I have this regex that works but looks pretty awful.
^(Small).*?[0-9.]+.*?[0-9.]+.*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+).*?([0-9.]+)
Is there a way to compact this?
For example, is it possible to combine the check for the last 6 numbers into a single statement that stores the results as 6 separate group matches?
Here's the shortest I could get:
^Small\s+(?:[\d.]+\s+){2}([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$
It has to be long because every capture must be written out explicitly. There is no need to capture "Small", though. It is also better to be specific (\s instead of .) when you can, and to anchor at both ends.
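As a quick check, here is a minimal Python sketch that runs the pattern above against the sample line from the question. The variable names (`line`, `pattern`, `captured`) are only illustrative, and Python's `re` module is assumed as the engine; other regex flavors should behave the same way for this pattern.

```python
import re

# Sample line from the question.
line = "Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56"

# Anchored pattern: skip "Small" and the first two number groups,
# then capture each of the last six numbers in its own group.
pattern = re.compile(
    r"^Small\s+(?:[\d.]+\s+){2}"
    r"([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)

match = pattern.match(line)
captured = match.groups() if match else ()
print(captured)
```

Because "Small" and the two skipped numbers sit outside any capturing group, `captured` holds only the six wanted values.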
If you want to keep each match in a separate backreference, you have no choice but to spell it out. If you use repetition, you either capture all six numbers as one group or capture only the last one, depending on where you place the capturing parentheses. So no, it is not possible to compact the regex and still keep all six individual matches.
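The two outcomes of using repetition can be seen in a small Python sketch. The two patterns below are illustrative variants written for this example (they are not from the question), and Python's `re` module is assumed; it keeps only the last iteration of a repeated capturing group.

```python
import re

line = "Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56"

# Capturing group INSIDE the repetition: each iteration overwrites
# the previous one, so only the last number survives.
inner = re.match(r"^Small\s+(?:[\d.]+\s+){2}(?:([\d.]+)\s*){6}$", line)
last_only = inner.groups()

# Capturing group AROUND the repetition: all six numbers come back
# together as a single string.
outer = re.match(r"^Small\s+(?:[\d.]+\s+){2}((?:[\d.]+\s*){6})$", line)
as_one = outer.groups()

print(last_only)
print(as_one)
```

Neither variant yields six separate groups, which is why the explicit pattern is needed when each number must land in its own backreference.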
A slightly more efficient (though not pretty) regular expression would be:
^Small\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)
since it matches the whitespace explicitly. Your regex will cause a lot of backtracking. My regex matches in 28 steps; yours takes 106.
As an aside: in Python, you can simply do
>>> pieces = "Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56".split()[-6:]
>>> print(pieces)
['1.49', '25.71', '41.05', '12.31', '0.00', '80.56']
For readability, you could use string interpolation to build the regular expression from its constituent parts.
$d = "[0-9.]+";
$s = ".*?";
$re = "^(Small)$s$d$s$d$s($d)$s($d)$s($d)$s($d)$s($d)$s($d)";
At least then you can see the structure behind the pattern, and changing one part changes them all.
If you want to take this further, you could define a small metasyntax on the fly and make it even easier to read:
$re = "^(Small)_#D_#D_(#D)_(#D)_(#D)_(#D)_(#D)_(#D)";
$re = str_replace('#D','[0-9.]+',$re);
$re = str_replace('_', '.*?' , $re );
(This also makes it trivial to change the definition of what the separator token is or what the number token is.)
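The same build-it-from-parts idea can be sketched in Python. This is only an illustration: the names `NUM`, `SEP`, `skipped`, and `captures` are made up for the example, and it uses `\s+` as the separator rather than the `.*?` used in the PHP above, so the separator token is easy to swap.

```python
import re

NUM = r"[\d.]+"  # number token (hypothetical name)
SEP = r"\s+"     # separator token (hypothetical name)

# Two leading numbers that are matched but not captured.
skipped = f"(?:{NUM}{SEP}){{2}}"
# Six captured numbers joined by the separator token.
captures = SEP.join(f"({NUM})" for _ in range(6))

pattern = f"^Small{SEP}{skipped}{captures}"

line = "Small 0.0..20.0 0.00 1.49 25.71 41.05 12.31 0.00 80.56"
numbers = re.match(pattern, line).groups()
print(pattern)
print(numbers)
```

The six groups are still written out in the final pattern; the code only generates them, so changing `NUM` or `SEP` updates every occurrence at once.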