.NET Regex: exception for \Z (end of input) inside a character class in a lookahead
I think I found a bug in the .NET Regex engine and wondered if anyone else has run into it, or if this is somehow expected behavior.
It occurs when the end-of-input assertion \Z is used within a character class [] within a lookahead (?=), as in this example, where constructing the Regex throws an exception:
Regex test = new Regex(@"(?=[\Z])");
Exception returned: parsing "(?=[\Z])" - Unrecognized escape sequence \Z.
However, a Regex like [\Z] or (?=\Z) works.
The workaround is simple enough: use (?=[]|\Z) with whatever other characters are needed in the character class, but it's still odd.
Edit: I think there must have been a typo in my initial tests, since, as nhahtdh points out, the patterns above do throw an exception.
Tested in .NET 4.5 with C#.
I don't know why you claim that @"[\Z]" works, but in my testing on ideone (which is currently running .NET 4.0.30319.17020) it throws the same exception as @"(?=[\Z])":
System.ArgumentException: parsing '[\Z]' - Unrecognized escape sequence Z.
at System.Text.RegularExpressions.RegexParser.ScanCharEscape () [0x00000] in <filename unknown>:0
at System.Text.RegularExpressions.RegexParser.ScanCharClass (Boolean caseInsensitive, Boolean scanOnly) [0x00000] in <filename unknown>:0
[...]
By the way, (?=[]|\Z) also throws an exception, because the parser treats []| as the start of a character class containing ] and |, and then encounters the invalid escape sequence \Z inside that class.
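To check these claims yourself, here is a minimal sketch that tries to construct each pattern from the question and reports whether the parser accepts it. The class and method names (EscapeCheck, Compiles) are made up for illustration. It relies only on the Regex constructor throwing ArgumentException (or a subclass of it) for an invalid pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeCheck
{
    // Hypothetical helper: true if the pattern parses, false if the
    // Regex constructor rejects it with an ArgumentException.
    public static bool Compiles(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string[] patterns = { @"(?=[\Z])", @"[\Z]", @"(?=[]|\Z)", @"(?=\Z)" };
        foreach (string p in patterns)
        {
            string verdict = Compiles(p) ? "ok" : "rejected";
            Console.WriteLine(p + " -> " + verdict);
        }
    }
}
```

Only the last pattern, where \Z sits outside any character class, should be accepted.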
Looking at the code of RegexParser.ScanCharEscape: except in ECMAScript mode (the !UseOptionE() check), the code throws an exception if it comes across a \ followed by a word character that does not form a known escape sequence (note that in .NET a word character is not limited to A-Za-z0-9_, but also includes other word characters in Unicode).
default:
    if (!UseOptionE() && RegexCharClass.IsWordChar(ch))
        throw MakeException(SR.Format(SR.UnrecognizedEscape, ch.ToString()));
    return ch;
This is probably a design decision to allow future extension of the escape syntax without breaking existing code when people move to a newer version of the .NET Framework. Java follows the same design principle in its Pattern class, but only throws an unrecognized-escape exception for A-Za-z. JavaScript/ECMAScript, on the other hand, does not have this restriction and interprets an unrecognized escape sequence as the character following the \.
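Going by the ScanCharEscape snippet quoted above, the check is skipped when RegexOptions.ECMAScript is set, so [\Z] should then parse as a character class containing the literal letter Z. A small sketch to try this out (the class and method names are made up for illustration, and I have only reasoned this through from the quoted source rather than from the documentation):

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaEscapeCheck
{
    // Hypothetical helper: builds [\Z] in ECMAScript mode, where the
    // unrecognized escape is expected to fall back to the literal 'Z'.
    public static Regex BuildEcmaClass()
    {
        return new Regex(@"[\Z]", RegexOptions.ECMAScript);
    }

    public static void Main()
    {
        Regex re = BuildEcmaClass();
        Console.WriteLine(re.IsMatch("Z"));   // literal Z is in the class
        Console.WriteLine(re.IsMatch("abc")); // no Z anywhere in the input
    }
}
```

Note that this makes \Z a plain character, not an anchor, so it is not a way to get the end-of-input assertion into a character class.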
Back to the issue in the question: note that \Z is an end-of-input assertion, that is, it matches an empty string at a particular position. An assertion is not a character, so it makes no sense to put it in a character class. Use alternation | if you want to combine it with a character class.
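As a sketch of the alternation approach: the lookahead below accepts either one of the characters in a class or the end of input. The separator set [;,] and the sample input are made up for illustration, as are the class and method names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadAlternation
{
    // Hypothetical example: words followed by ';' or ',' or the end of input.
    // The character class and \Z are separate alternatives inside the lookahead.
    private static readonly Regex Token = new Regex(@"\w+(?=[;,]|\Z)");

    public static List<string> Tokens(string input)
    {
        var result = new List<string>();
        foreach (Match m in Token.Matches(input))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // prints: alpha beta gamma
        Console.WriteLine(string.Join(" ", Tokens("alpha;beta,gamma")));
    }
}
```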
You have a misunderstanding of what \Z is: it is an anchor (a pattern construct), not an actual character, so the exception is valid when you try to use it in a character set ([ ]).
It can match just before a \n, as long as that \n is at the very end of the data, but it is not the character \n itself.
To quote MSDN (Anchors in Regular Expressions):
The \Z anchor specifies that a match must occur at the end of the input string, or before \n at the end of the input string. It is identical to the $ anchor, except that \Z ignores the RegexOptions.Multiline option. Therefore, in a multiline string, it can only match the end of the last line, or the last line before \n.
Note that \Z matches \n but does not match \r\n (the CR/LF character combination). To match CR/LF, include \r?\Z in the regular expression pattern.
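The quoted behavior can be seen in a short sketch. The sample inputs and the class and method names are made up for illustration; the calls are the standard Regex.IsMatch and Regex.Matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EndAnchorDemo
{
    // Hypothetical helper: does "line" sit at the end of the input according to \Z?
    public static bool EndsWithLine(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    // Hypothetical helper: number of matches with RegexOptions.Multiline set.
    public static int CountMultiline(string input, string pattern)
    {
        return Regex.Matches(input, pattern, RegexOptions.Multiline).Count;
    }

    public static void Main()
    {
        Console.WriteLine(EndsWithLine("line\n", @"line\Z"));       // True: \Z matches before a final \n
        Console.WriteLine(EndsWithLine("line\r\n", @"line\Z"));     // False: \r is in the way
        Console.WriteLine(EndsWithLine("line\r\n", @"line\r?\Z"));  // True: \r consumed explicitly

        // With Multiline, $ matches at every line end, \Z only at the last one.
        Console.WriteLine(CountMultiline("a\nb\n", @"\w$"));   // 2
        Console.WriteLine(CountMultiline("a\nb\n", @"\w\Z"));  // 1
    }
}
```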