.NET Regex: lookahead group with \Z (end of string) throws an exception

I think I found a bug in the .NET Regex engine and wondered whether anyone else has run into it, or whether this is somehow expected behavior.

It occurs when the end-of-string assertion \Z is placed inside a character class [] within a lookahead (?=), as in the example below, which throws an exception when the Regex is constructed.

Regex test = new Regex(@"(?=[\Z])");

      

Exception returned: parsing "(?=[\Z])" - Unrecognized escape sequence \Z.

However, the regex [\Z] works, as does (?=\Z).

The workaround is simple enough: use (?=[]|\Z) with any other alternative characters placed in the character class, but it's still odd.

Edit: I think there must have been a typo in my initial tests since, as nhahtdh points out, the patterns above do throw an exception.

Tested in .NET 4.5 with C#.



2 answers


I don't know why you claim that @"[\Z]" works; in my testing on ideone (which is currently running on .NET 4.0.30319.17020) it throws the same exception as @"(?=[\Z])":

System.ArgumentException: parsing '[\Z]' - Unrecognized escape sequence Z.
  at System.Text.RegularExpressions.RegexParser.ScanCharEscape () [0x00000] in <filename unknown>:0 
  at System.Text.RegularExpressions.RegexParser.ScanCharClass (Boolean caseInsensitive, Boolean scanOnly) [0x00000] in <filename unknown>:0 
  [...]

      

By the way, (?=[]|\Z) also throws an exception, because the parser tries to read a character class consisting of ], | and then encounters the invalid escape sequence \Z.

Looking at the code of RegexParser.ScanCharEscape: except in ECMAScript mode (the !UseOptionE() check), the code throws an exception if it comes across a \ followed by a word character that does not form a known escape sequence (note that in .NET a word character is not limited to A-Za-z0-9_, but also includes other Unicode word characters).



            default:
                if (!UseOptionE() && RegexCharClass.IsWordChar(ch))
                    throw MakeException(SR.Format(SR.UnrecognizedEscape, ch.ToString()));
                return ch;

      

This is probably a design decision to allow future extension of the escape syntax without breaking existing code when people move to a newer version of the .NET Framework. Java follows the same design principle in its Pattern class, but only throws an unrecognized-escape exception for A-Za-z. JavaScript/ECMAScript, on the other hand, does not have this limitation and interprets an unrecognized escape sequence as the character following the \.
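As a rough illustration of the behavior described above, the following sketch checks which patterns the .NET parser accepts. The class and helper names (EscapeDemo, Compiles) are made up for this example. The ECMAScript result is what the ScanCharEscape code above implies, not something I have verified on every runtime.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: true if the Regex constructor accepts the pattern.
    public static bool Compiles(string pattern, RegexOptions options = RegexOptions.None)
    {
        try
        {
            new Regex(pattern, options);
            return true;
        }
        catch (ArgumentException)
        {
            // Pattern parse errors surface as ArgumentException (or a subclass).
            return false;
        }
    }

    public static void Main()
    {
        // \Z inside a character class is an unrecognized escape in default mode.
        Console.WriteLine(Compiles(@"[\Z]"));      // False
        Console.WriteLine(Compiles(@"(?=[\Z])"));  // False

        // \Z outside a character class is the end-of-input assertion.
        Console.WriteLine(Compiles(@"(?=\Z)"));    // True

        // In ECMAScript mode the check is skipped, so \Z should be read as a literal Z.
        Console.WriteLine(Compiles(@"[\Z]", RegexOptions.ECMAScript));
    }
}
```

Catching ArgumentException rather than a more specific type keeps the sketch usable on both .NET Framework and newer runtimes.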

Back to the issue in the question: note that \Z is an end-of-input assertion, that is, it matches an empty string. An assertion is not a character, so it makes no sense to put it in a character class. Use alternation | if you want to combine it with a character class.
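A minimal sketch of the alternation approach, assuming the goal is a lookahead that succeeds before a delimiter or at the end of the input. The delimiters , and ; and the names AlternationDemo and TokensBeforeDelimiterOrEnd are invented for the example and do not come from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Hypothetical helper: returns each run of word characters that is
    // followed by ',' or ';' (character class) or by the end of input (\Z).
    public static string[] TokensBeforeDelimiterOrEnd(string input)
    {
        var regex = new Regex(@"\w+(?=[,;]|\Z)");
        return regex.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        // Prints: a b c
        Console.WriteLine(string.Join(" ", TokensBeforeDelimiterOrEnd("a,b;c")));
    }
}
```

The alternation sits inside the lookahead group, so it only chooses between the character class and \Z and does not split the rest of the pattern.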



You have a misunderstanding of what \Z is: it is an anchor, not an actual character, so the exception is valid when you try to use it in a character set ([ ]).

It can match just before a \n, as long as that \n is at the end of the data, but it is not itself the character \n.



To quote MSDN (Anchors in Regular Expressions):

The \Z anchor specifies that a match must occur at the end of the input string, or before \n at the end of the input string. It is identical to the $ anchor, except that \Z ignores the RegexOptions.Multiline option. Therefore, in a multiline string, it can only match the end of the last line, or the last line before \n.

Note that \Z matches \n but does not match \r\n (the CR/LF character combination). To match CR/LF, include \r?\Z in the regular expression pattern.
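The quoted behavior can be checked with a short sketch like the one below. The sample strings and the class name EndAnchorDemo are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EndAnchorDemo
{
    public static void Main()
    {
        // \Z matches at the very end, or just before a final \n.
        Console.WriteLine(Regex.IsMatch("line\n", @"line\Z"));        // True

        // A \r before the final \n blocks the match...
        Console.WriteLine(Regex.IsMatch("line\r\n", @"line\Z"));      // False

        // ...unless the pattern consumes it with an optional \r.
        Console.WriteLine(Regex.IsMatch("line\r\n", @"line\r?\Z"));   // True

        // Unlike $, \Z ignores RegexOptions.Multiline.
        Console.WriteLine(Regex.IsMatch("one\ntwo", @"one$", RegexOptions.Multiline));   // True
        Console.WriteLine(Regex.IsMatch("one\ntwo", @"one\Z", RegexOptions.Multiline));  // False
    }
}
```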
