How to match unicode characters correctly with awk regex?
I use the following command in a script to extract the domain part of an email address from various mail logs, all of which have a consistently formatted To: line:
awk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
This matches strings like To: doc@bequerelint.net (Omer). However, it does not match strings such as To: andy.vitrella@uol.com.br (André) or To: boggers@operamail.com (Pål), nor any other string with a non-ASCII character in the parentheses after the email address.
By the way, od -c for the first failing example gives:
0000000   T   o   :       a   n   d   y   .   v   i   t   r   e   l   l
0000020   a   @   u   o   l   .   c   o   m   .   b   r       (   A   n
0000040   d   r 351   )  \n
0000045
My guess is that the awk regex . is not matching the non-ASCII character in (André). What's the correct regex to match such a string?
I am posting my comment as an answer so that the code is formatted correctly:
% echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' | gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
uol.com.br
operamail.com
% echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' > fileee12
% gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}' fileee12
uol.com.br
operamail.com
% env | grep -e '\(LOC\)\|\(LAN\)'
LANG=C
XTERM_LOCALE=C
%
As you can see, your command works under the C locale both when reading from stdin and when reading from a file, so on my computer I can rule out both the locale and the stdin-versus-file difference as the cause.
My computer runs Linux and my gawk is 4.1.1. What is your environment?
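One possible explanation, offered as a guess rather than a diagnosis: the od -c dump in the question shows é as the single byte 351 (octal), which is its ISO-8859-1 encoding, whereas UTF-8 text would carry two bytes for it. If the script runs in a UTF-8 locale, gawk may treat that lone byte as an invalid character that . does not match. The sketch below forces the C locale, where every byte counts as one character. The function name extract_domains is hypothetical, and it uses two sub() calls instead of gensub() so that it also runs in awks other than gawk. Unlike the original regex, it cuts at the first space after the domain.

```shell
# Hypothetical helper: print the domain of every "To:" line read from stdin.
# LC_ALL=C makes awk treat each byte as one character, so "." can match
# bytes such as \351 that are not valid UTF-8 on their own.
extract_domains() {
    LC_ALL=C awk '/^To: / {
        line = $0
        sub(/^To: .+@/, "", line)   # drop everything up to the last "@"
        sub(/ .*$/, "", line)       # drop the trailing " (Name)" part
        print line
    }'
}

# \351 is the byte shown for the accented letter in the od -c dump.
printf 'To: andy.vitrella@uol.com.br (Andr\351)\n' | extract_domains
```

If this prints uol.com.br while the original command fails on the same bytes in your usual locale, then the locale and the encoding of the log files are the likely culprits rather than the regex itself.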