How to match unicode characters correctly with awk regex?

Question


I have the following command in a script to extract the domain portion of an email address from various email logs, each of which has a consistently formatted To: line:

awk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'

This matches strings like To: doc@bequerelint.net (Omer). However, it does not match strings like To: andy.vitrella@uol.com.br (André) or To: boggers@operamail.com (Pål), nor any other string with a non-ASCII character in the trailing parentheses after the email address.

By the way, od -c for the first failing example gives:

0000000   T   o   :       a   n   d   y   .   v   i   t   r   e   l   l
0000020   a   @   u   o   l   .   c   o   m   .   b   r       (   A   n
0000040   d   r 351   )  \n
0000045

I'm guessing that the awk regex . is not matching the non-ASCII character in (André). What's the correct regex to match such a string?


regex shell awk unicode

Backgammon Oct 20 14 at 14:11


1 answer

gboffi · Answer 1 · 2014-10-21T17:20:10+0000

I am posting my comment as an answer so that the code is formatted correctly:

% echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' | gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
uol.com.br
operamail.com
% echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' > fileee12
% gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}' fileee12
uol.com.br
operamail.com
% env | grep -e '\(LOC\)\|\(LAN\)'
LANG=C
XTERM_LOCALE=C
%

As you can see, your command works in the C locale both when reading from stdin and when reading from a file. So, on my computer at least, I can rule out both the locale and the difference between reading from stdin and reading from a file as the cause.

My computer runs Linux and my gawk is version 4.1.1. What is your setup?
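
A note on the od -c dump in the question: 351 is octal for the single byte 0xE9, which is é in ISO-8859-1 (Latin-1), not the two-byte UTF-8 sequence 303 251. If the asker's shell runs in a UTF-8 locale (an assumption; the question does not say), awk may treat that lone byte as an invalid character that . does not match, which would be consistent with the command working here under LANG=C. A minimal sketch for checking this forces the C locale for the awk call only. It uses sub() instead of gensub() so that it is not tied to gawk, and the sample line is rebuilt with printf from the dump rather than taken from a real log:

```shell
# Rebuild the failing line with a raw Latin-1 byte (octal 351 = é),
# as shown in the od -c dump from the question.
line=$(printf 'To: andy.vitrella@uol.com.br (Andr\351)')

# LC_ALL=C makes awk treat the input as single bytes, so "." matches \351.
# sub() is used here as a portable stand-in for the gensub() call.
domain=$(printf '%s\n' "$line" |
  LC_ALL=C awk '/^To: / { sub(/^To: .+@/, ""); sub(/ .*$/, ""); print }')

echo "$domain"
```

If the result changes when LC_ALL=C is dropped, the locale is the likely culprit. Converting the logs first with iconv -f ISO-8859-1 -t UTF-8 would be another option to try.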
