Regex dictionary in unicode
This works as expected for me:
string foo = "Hola, la niña está gritando en alemán: Maüschen raus!";
Regex r = new Regex(@"\w+");
MatchCollection mc = r.Matches(foo);
foreach (Match ma in mc)
{
    Console.WriteLine(ma.Value);
}
It outputs:
Hola la niña está gritando en alemán Maüschen raus
Are you using .Match() instead of .Matches()? Match() returns only the first match.
Another possible explanation is that the text you expect to get back contains a non-word character, such as a comma, which \w never matches.
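To make the difference between the two calls concrete, here is a minimal sketch. It assumes a shortened version of the sample string and the default regex options; the variable names are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string foo = "Hola, la niña está gritando en alemán";
Regex r = new Regex(@"\w+");

// Match() stops at the first hit.
Match first = r.Match(foo);
Console.WriteLine(first.Value);

// Matches() returns every hit. The comma and the spaces are non-word
// characters, so they separate the words and never show up in a value.
string[] all = r.Matches(foo).Cast<Match>().Select(m => m.Value).ToArray();
Console.WriteLine(string.Join("|", all));
```

If only "Hola" comes back, the code is most likely calling Match() once rather than iterating over Matches().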
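You should take a look at the ECMAScript section of http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#ECMAScript. If the regex is created with RegexOptions.ECMAScript, \w only matches [a-zA-Z_0-9], so accented letters are not treated as word characters.
There's also a good cheat sheet for using regex in .NET: http://regexlib.com/CheatSheet.aspx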
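In case the ECMAScript option is what is causing the problem, the following sketch compares the default behaviour with RegexOptions.ECMAScript on a single word. The test word and variable names are my own choice, not taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

string word = "niña";

// Default options: \w is Unicode-aware, so the whole word matches.
string unicodeMatch = Regex.Match(word, @"\w+").Value;

// ECMAScript option: \w is limited to [a-zA-Z_0-9],
// so the match ends just before the "ñ".
string asciiMatch = Regex.Match(word, @"\w+", RegexOptions.ECMAScript).Value;

Console.WriteLine(unicodeMatch);
Console.WriteLine(asciiMatch);
```

If words get cut off at accented letters, checking the options passed to the Regex constructor is a cheap first step.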
The "official" Unicode category for letters is \p{L}, and for numbers it is \p{N}. So for completeness, in cases where \w does not match Unicode letters / numbers, a close equivalent of \w+ would be [\p{L}\p{N}\p{Pc}]+. Don't forget that underscores and other connector "punctuation" characters (\p{Pc}) are also contained in \w (so you can decide for yourself whether to keep them or not).
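As a rough sketch of how the explicit character class behaves, the snippet below runs it against the sample sentence from the first answer, next to a letters-only variant. The helper Words and the input "año_2024" are hypothetical, added only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string foo = "Hola, la niña está gritando en alemán: Maüschen raus!";

// Letters, numbers and connector punctuation (e.g. underscore), spelled out.
Regex explicitWord = new Regex(@"[\p{L}\p{N}\p{Pc}]+");
// Letters only: digits and underscores now split words.
Regex lettersOnly = new Regex(@"\p{L}+");

// Hypothetical helper: collects all match values into an array.
string[] Words(Regex re, string input) =>
    re.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(" ", Words(explicitWord, foo)));
Console.WriteLine(string.Join(" ", Words(lettersOnly, "año_2024")));
```

Dropping \p{Pc} (and \p{N}, if digits are unwanted) from the class is how you "decide for yourself" which of the extra characters count as part of a word.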