Regex dictionary in unicode

Question

How do I change the regex \w+ so that it matches whole words in Unicode, not just ASCII?

I am using .NET.

regex .net unicode character-properties

fm64 25 nov. '09 at 12:22

4 answers

Andomar · Answer 1 · 2009-11-25T12:27:29+0000

In .NET, \w already matches Unicode letters and digits (as well as the underscore), not just ASCII. For example, it will match ì and Æ.

To match only ASCII characters, you can use [a-zA-Z0-9].
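A minimal sketch of the difference, assuming default regex options unless stated. The class name WordDemo and the helper Words are made up for illustration; RegexOptions.ECMAScript is the documented .NET option that restricts \w to [a-zA-Z_0-9].

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class WordDemo
{
    // Returns the text of every match of the pattern in the input.
    public static string[] Words(string input, string pattern,
                                 RegexOptions options = RegexOptions.None)
    {
        return Regex.Matches(input, pattern, options)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // "niña Æsir", written with escapes so the source file encoding does not matter.
        string text = "ni\u00F1a \u00C6sir";

        // Default: \w is Unicode-aware, so both words come back whole (2 matches).
        Console.WriteLine(Words(text, @"\w+").Length);

        // ECMAScript mode: \w is ASCII-only, so the words split at ñ and Æ (3 matches).
        Console.WriteLine(Words(text, @"\w+", RegexOptions.ECMAScript).Length);

        // The explicit ASCII class behaves the same way on this input (3 matches).
        Console.WriteLine(Words(text, @"[a-zA-Z0-9]+").Length);
    }
}
```

If \w+ appears to stop at accented letters, it is worth checking whether RegexOptions.ECMAScript is being passed somewhere.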

Vinko Vrsalovic · Answer 2 · 2009-11-25T12:28:48+0000

This works as expected for me:

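    // Requires: using System; using System.Text.RegularExpressions;
    string foo = "Hola, la niña está gritando en alemán: Maüschen raus!";
    Regex r = new Regex(@"\w+");
    // Matches returns every match, not just the first one.
    MatchCollection mc = r.Matches(foo);
    foreach (Match ma in mc)
    {
        Console.WriteLine(ma.Value);
    }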

It outputs

Hola
la
niña
está
gritando
en
alemán
Maüschen
raus

Are you using .Match() instead of .Matches()?

Another possible explanation is that the text you expect to get back contains a non-word character, such as a comma.
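To illustrate the first point, here is a small sketch contrasting the two calls. The class MatchVsMatches and its two helper methods are hypothetical names; Regex.Match and Regex.Matches are the real .NET methods.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class MatchVsMatches
{
    private static readonly Regex Words = new Regex(@"\w+");

    // Match returns only the first match in the input.
    public static string FirstWord(string input)
    {
        return Words.Match(input).Value;
    }

    // Matches returns a collection with every match in the input.
    public static int WordCount(string input)
    {
        return Words.Matches(input).Count;
    }

    public static void Main()
    {
        // "la niña está gritando", written with escapes to avoid encoding issues.
        string text = "la ni\u00F1a est\u00E1 gritando";

        Console.WriteLine(FirstWord(text));  // only the first word: la
        Console.WriteLine(WordCount(text));  // all four words are found: 4
    }
}
```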

ikkebr · Answer 3 · 2009-11-25T12:27:16+0000

You should take a look at http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#ECMAScript
There's also a good cheat sheet for using regex in .NET: http://regexlib.com/CheatSheet.aspx

Tim Pietzcker · Answer 4 · 2009-11-25T12:32:18+0000

The "official" Unicode identifier for letters is \p{L}, and for numbers \p{N}. So for completeness, in cases where \w does not match Unicode letters/numbers, the equivalent of \w+ would be [\p{L}\p{N}\p{Pc}]+. Don't forget that the underscore and other "connector punctuation" characters are also contained in \w (so you can decide for yourself whether to keep them or not).
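A short sketch of the explicit character class, with and without \p{Pc}. The class UnicodeWords and its members are hypothetical names chosen for this example. Note that the .NET documentation describes \w as [\p{L}\p{Mn}\p{Nd}\p{Pc}], so the class from the answer is a close approximation rather than an exact equivalent.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class UnicodeWords
{
    // Letters, numbers and connector punctuation (includes the underscore).
    public const string WithConnectors = @"[\p{L}\p{N}\p{Pc}]+";

    // Letters and numbers only; the underscore now splits words.
    public const string WithoutConnectors = @"[\p{L}\p{N}]+";

    // Returns the text of every match of the pattern in the input.
    public static string[] Find(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // "café_au_lait Δx 42", written with escapes to avoid encoding issues.
        string text = "caf\u00E9_au_lait \u0394x 42";

        // With \p{Pc} the underscores stay inside the word: 3 matches.
        Console.WriteLine(Find(text, WithConnectors).Length);

        // Without \p{Pc} the first word splits into three parts: 5 matches.
        Console.WriteLine(Find(text, WithoutConnectors).Length);
    }
}
```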
