Regex matching of optional numbers

I have a text file that is currently parsed by a regex expression, and it works well. The file format is well defined: 2 numbers separated by whitespace, followed by an optional comment.

Now we need to add an additional (but optional) 3rd number to this file, so the format becomes 2 or 3 numbers separated by whitespace, with an optional comment.

I have a regex object that at least matches all the required line formats, but I am having no luck actually capturing the third (optional) number, even when it is present.

code:

#include <iostream>
#include <regex>
#include <vector>
#include <string>
#include <cassert>
using namespace std;

bool regex_check(const std::string& in)
{
   std::regex check{
      "[[:space:]]*?"                    // eat leading spaces
      "([[:digit:]]+)"                   // capture 1st number
      "[[:space:]]*?"                    // eat second set of spaces
      "([[:digit:]]+)"                   // capture 2nd number
      "[[:space:]]*?"                    // eat more spaces
      "([[:digit:]]+|[[:space:]]*?)"     // optionally, capture 3rd number
      "!*?"                              // Anything after '!' is a comment
      ".*?"                              // eat rest of line
   };

   std::smatch match;

   bool result = std::regex_match(in, match, check);

   for(auto m : match)
   {
      std::cout << "  [" << m << "]\n";
   }

   return result;
}

int main()
{
   std::vector<std::string> to_check{
      "  12  3",
      "  1  2 ",
      "  12  3 !comment",
      "  1  2     !comment ",
      "\t1\t1",
      "\t  1\t  1\t !comment   \t",
      " 16653    2      1",
      " 16654    2      1 ",
      " 16654    2      1   !    comment",
      "\t16654\t\t2\t   1\t ! comment\t\t",
   };

   for(auto s : to_check)
   {
      assert(regex_check(s));
   }

   return 0;
}

      

This gives the following output:

  [  12  3]
  [12]
  [3]
  []
  [  1  2 ]
  [1]
  [2]
  []
  [  12  3 !comment]
  [12]
  [3]
  []
  [  1  2     !comment ]
  [1]
  [2]
  []
  [ 1   1]
  [1]
  [1]
  []
  [   1   1  !comment       ]
  [1]
  [1]
  []
  [ 16653    2      1]
  [16653]
  [2]
  []
  [ 16654    2      1 ]
  [16654]
  [2]
  []
  [ 16654    2      1   !    comment]
  [16654]
  [2]
  []
  [ 16654       2      1     ! comment      ]
  [16654]
  [2]
  []

      

As you can see, this matches all the expected input formats, but it never actually captures the 3rd number, even when it is present.

I am currently testing this with GCC 5.1.1, but the actual target compiler will be GCC 4.8.2, using boost::regex instead of std::regex.

1 answer


Let's process the following example step by step.

 16653    2      1
^

      

^ is the currently matched offset. At this point, we are here in the pattern:

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
^             

      

(I've simplified [[:space:]] to \s and [[:digit:]] to \d for brevity.)




\s*? matches, and then (\d+) matches. We end up in the following state:

 16653    2      1
      ^

      

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
         ^

      




Same thing: \s*? matches, and then (\d+) matches. The state is now:

 16653    2      1
           ^

      

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  ^ 

      




Now things get more complicated.

You have a \s*? lazy quantifier here. The engine first tries to match nothing and checks whether the rest of the pattern can match from there, so it moves on to the alternation.

The first alternative is \d+, but it fails, since there is no digit at this position.

The second alternative is \s*?, and there are no other alternatives after it. It is lazy, so the engine tries to match the empty string first.

The next token is !*?, which also matches the empty string. It is followed by .*?, which matches everything up to the end of the string (only because you are using regex_match; with regex_search it would have matched the empty string).

At this point, you have successfully reached the end of the pattern and you have a match, without ever having been forced to match \d+ against the string.

The point is that this whole part of the pattern ends up being optional:

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  \__________________/

      




So, what can you do? You can rewrite your pattern like this:

\s*?(\d+)\s+(\d+)(?:\s+(\d+))?\s*(?:!.*)?

      

Demo (with added anchors to simulate the behavior of regex_match)

This way, you force the regex engine to consider \d, and it can no longer get away with lazily matching the empty string. There is no need for lazy quantifiers, since \s and \d are disjoint.

!*?.*? was also suboptimal, since !*? is already covered by the .*? that follows it. I rewrote it as (?:!.*)? to require a ! at the start of the comment; if anything else follows the numbers, the match fails.
