Expected behavior of POSIX extended regex: (()|abc)xyz

On my OS X 10.5.8 machine, using the regcomp and regexec C functions to match the extended regex "(()|abc)xyz", I find a match for the string "abcxyz", but only from offset 3 to offset 6. My expectation was that the entire string would be matched and that I would see a submatch for the initial "abc" portion of the string.

When I try to use the same pattern and text with awk on the same machine, it shows a match for the whole line as I would expect.

I suspect my limited experience with regular expressions may be the problem. Can someone please explain what's going on? Is my regex valid? If so, why doesn't it match the entire line?

I understand that "((abc){0,1})xyz" can be used as an alternative, but the expression in question is automatically generated from a different pattern format, and eliminating instances of "()" is extra work I would like to avoid if possible.

For reference, the flags I pass to regcomp are only REG_EXTENDED. I am passing an empty set of flags (0) to regexec.
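
For anyone who wants to reproduce this, a minimal sketch of the setup described above might look like the following. The helper name match_span and its signature are made up for illustration; only regcomp, regexec, regerror, and regfree are real POSIX calls.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not from the original post): compile `pattern`
 * with REG_EXTENDED only, run it against `text` with regexec flags set
 * to 0, and store the offsets of the overall match.
 * Returns 0 on a match, nonzero on no match or compile failure. */
int match_span(const char *pattern, const char *text, int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];               /* slot 0 is the whole match */
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0) {
        char msg[128];
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp failed: %s\n", msg);
        return rc;
    }

    rc = regexec(&re, text, 1, m, 0);
    if (rc == 0) {
        *start = (int)m[0].rm_so;  /* offset of first matched byte */
        *end = (int)m[0].rm_eo;    /* offset one past the last matched byte */
    }

    regfree(&re);
    return rc;
}
```

Calling match_span("(()|abc)xyz", "abcxyz", &start, &end) corresponds to the case in the question. No expected output is given here because the result may differ between regex implementations.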

3 answers


POSIX says:

9.4.3 ERE Special Characters

An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they shall have their special meaning are as follows:

.[\(

The <period>, <left-square-bracket>, <backslash>, and <left-parenthesis> shall be special except when used in a bracket expression (see RE Bracket Expression). Outside a bracket expression, a <left-parenthesis> immediately followed by a <right-parenthesis> produces undefined results.



What you are seeing is the result of invoking undefined behavior: anything goes.

If you want reliable, portable results, you will need to remove the empty "()" groups.

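As a rough illustration of the "()"-free form, here is a small sketch that uses the alternative pattern the question already mentions, "((abc){0,1})xyz". The function name match_optional_abc is hypothetical, and the caller is assumed to supply the regmatch_t array.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (name invented for this sketch): match `text`
 * against the "()"-free pattern from the question and copy up to `n`
 * match slots into `out`. Slot 0 is the whole match, slot 1 is the
 * outer group, slot 2 is the inner (abc) group.
 * Returns 0 on a match, nonzero on no match or compile failure. */
int match_optional_abc(const char *text, regmatch_t *out, size_t n)
{
    regex_t re;
    int rc = regcomp(&re, "((abc){0,1})xyz", REG_EXTENDED);

    if (rc != 0)
        return rc;                 /* nothing to free if compilation failed */

    rc = regexec(&re, text, n, out, 0);
    regfree(&re);
    return rc;
}
```

Because this pattern contains no empty group, the result is defined by POSIX. For "abcxyz", the leftmost-longest rule should give a whole match covering offsets 0 to 6, with the outer group covering 0 to 3.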
If you iterate over all the matches and don't get both [3,6) and [0,6), then there is a bug. I'm not sure what POSIX mandates as far as the order in which the matches are returned.



Try (abc|())xyz instead; I'm sure it will produce the same result in both places. I can only assume that the C version tries to match xyz wherever it can, and if that fails, it tries to match abcxyz wherever it can (but as you can see, it doesn't fail, so it never bothers with the "abc" part), whereas awk must use its own regex engine that does what you expect.

Your regex is valid. I think the problem is that either a) POSIX isn't quite clear on how a regex is supposed to work, or b) awk doesn't use 100% POSIX-compliant regexes (probably because OS X appears to ship with a more original version of awk). Whichever it is, this is probably because it is something of an edge case and most people would not write a regex that way.
