Why doesn't "\w" match Unicode word characters (for example, "ğ, İ, ş, ç, ö, ü") in a Perl regular expression?

I tried to match these characters with the regex m{\w+}g. However, it does not match "ğ, İ, ş, ç, ö, ü".

How can I make this work?

use strict;
use warnings;
use v5.12;
use utf8;

open(MYINPUTFILE, "< $ARGV[0]");

my @strings;
my $delimiter;
my $extensions;
my $id;

while(<MYINPUTFILE>)
{
    my($line) = $_;
    chomp($line);
    print $line."\n";
    unshift(@strings,$line =~ /\w+/g);
    $delimiter = /[._\s]/;
    $extensions = /pdf$|doc$|docx$/;
    $id = /^200|^201/;
}

foreach(@strings){
    print $_."\n";
}

      

The input file looks like this:

Çidem_Şener
Hüsnü Tağlip
...

The output looks like this:

H 

sn 

Ta 

lip

 

idem_ 

ener

      

In the code, I am trying to read a file and split each line into an array. (The delimiter can be _, . or \s.)



3 answers


Unicode can be troublesome, and Perl has its own quirks. Essentially, Perl puts a firewall around all I/O where Unicode is concerned. You must tell Perl whether an I/O path carries encoded data. If it does, the rule is: DECODE any input and ENCODE any output.

Decoding converts the data from {encoding} to Perl's internal representation, which is probably a combination of bytes and code points.

Encoding does just the opposite.

So it is entirely possible to decode from one encoding and encode to a different one. You just need to tell Perl which is which. Encoding/decoding is usually done with a file I/O layer, but you can also use the Encode module (part of the core distribution) to convert between encodings manually.
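As a minimal sketch of converting between two different encodings with the Encode module: the input bytes below are a made-up example (the word "Şener" as it would be stored in a Windows-1254 file), and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the octets of "Şener" as stored in Windows-1254
# (0xDE is Ş in that code page).
my $cp1254_octets = "\xDE\x65\x6E\x65\x72";

# Decode: Windows-1254 octets -> Perl characters.
my $chars = decode('cp1254', $cp1254_octets);

# Encode: Perl characters -> UTF-8 octets.
my $utf8_octets = encode('UTF-8', $chars);

# Prints "characters: 5, UTF-8 octets: 6" because Ş takes two octets in UTF-8.
printf "characters: %d, UTF-8 octets: %d\n", length($chars), length($utf8_octets);
```

Between the decode and the encode, the string holds characters, so this is the point where regexes such as \w+ behave as expected.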

The perldocs on Unicode are not light reading.



Here's an example that can help visualize it (there are many other ways).

use strict;
use warnings;
use Encode;


# This is an internal (character) string made of these Unicode code points
# ----------------------------------------------
my $internal_string_1 = "\x{C7}\x{69}\x{64}\x{65}\x{6D}\x{5F}\x{15E}\x{65}\x{6E}\x{65}\x{72}\x{20}\x{48}\x{FC}\x{73}\x{6E}\x{FC}\x{20}\x{54}\x{61}\x{11F}\x{6C}\x{69}\x{70}";


# Open a temp file for writing as UTF-8.
# Output to this file will be automatically encoded from Perl internal to UTF-8 octets.
# Write the internal string.
# Check the file with a UTF-8 editor.
# ----------------------------------------------
open (my $out, '>:utf8', 'temp.txt') or die "can't open temp.txt for writing $!";
print $out $internal_string_1;
close $out;


# Open the temp file for reading as UTF-8.
# All input from this file will be automatically decoded as UTF-8 octets to Perl internal.
# Read/decode to a different internal string.
# ----------------------------------------------
open (my $in, '<:utf8', 'temp.txt') or die "can't open temp.txt for reading $!";
$/ = undef;
my $internal_string_2 = <$in>;
close $in;


# Change the binmode of STDOUT to UTF-8.
# Output to STDOUT will now be automatically encoded from Perl internal to UTF-8 octets.
# Capture STDOUT to a file then check with a UTF-8 editor.
# ----------------------------------------------
binmode STDOUT, ':utf8';
print $internal_string_2, "\n\n";


# Use encode() to convert an internal string to UTF-8 octets
# Format the UTF-8 octets to hex values
# Print to STDOUT
# ----------------------------------------------
my $octets = encode ("utf8", $internal_string_2);
print "Encoded (out) string -> UTF-8 (octets):\n";
print "   length  =  ".length($octets)."\n";
print "   octets  =  $octets\n";
print "   HEX val =  ";
for (split //, $octets) {
    printf ("0x%X ", ord($_));
}
print "\n\n";


# Use decode() to convert external UTF-8 octets to an internal string.
# Format the internal string to codepoints (hex values).
# Print to STDOUT.
# ----------------------------------------------
my $internal_string_3 = decode ("utf8", $octets);
print "Decoded (in) string <- UTF-8 (octets):\n";
print "   length      =  ".length($internal_string_3)."\n";
print "   string      =  $internal_string_3\n";
print "   code points =  ";
for (split //, $internal_string_3) {
    printf ("\\x{%X} ", ord($_));
}

      

Output

Çidem_Şener Hüsnü Tağlip

Encoded (out) string -> UTF-8 (octets):
   length  =  29
   octets  =  Ãidem_Åener Hüsnü TaÄlip
   HEX val =  0xC3 0x87 0x69 0x64 0x65 0x6D 0x5F 0xC5 0x9E 0x65 0x6E 0x65 0x72 0x20 0x48 0xC3 0xBC 0x73 0x6E 0xC3 0xBC 0x20 0x54 0x61 0xC4 0x9F 0x6C 0x69 0x70

Decoded (in) string <- UTF-8 (octets):
   length      =  24
   string      =  Çidem_Şener Hüsnü Tağlip
   code points =  \x{C7} \x{69} \x{64} \x{65} \x{6D} \x{5F} \x{15E} \x{65} \x{6E} \x{65} \x{72} \x{20} \x{48} \x{FC} \x{73} \x{6E} \x{FC} \x{20} \x{54} \x{61} \x{11F} \x{6C} \x{69} \x{70}

      



Make sure Perl treats the data as UTF-8.

For example, if the data is embedded in the script itself:



#!/usr/bin/perl

use strict;
use warnings; 
use v5.12;
use utf8;   # States that the Perl program itself is saved using utf8 encoding

say "matched" if "ğİşçöü" =~ /^\w+$/;

      

This prints "matched". If I delete the line use utf8; it does not.



\w matches any of ğ, İ, ş, ç, ö, ü just fine.

'ğİşçöü' =~ /\A \w+ \z/msx;     # true

      

You probably forgot to decode the octet input into Perl characters. I suspect your regex is matching at the byte level instead of at the character level, which is what you expect.
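The difference between byte-level and character-level matching can be sketched as follows. The octet string is a hand-written stand-in for one line of the input file read without any encoding layer, and the sketch assumes default regex semantics (no unicode_strings feature, no locale).

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for "Hüsnü Tağlip", as they arrive from a filehandle
# opened without an encoding layer (hand-written here for illustration).
my $octets = "H\xC3\xBCsn\xC3\xBC Ta\xC4\x9Flip";

# Byte level: the octets that make up ü and ğ are not word characters
# under default rules, so each word is cut apart: "H", "sn", "Ta", "lip".
my @byte_words = $octets =~ /\w+/g;

# Character level: decode first, then match.
my $chars      = decode('UTF-8', $octets);
my @char_words = $chars =~ /\w+/g;

# Prints "undecoded: 4 pieces, decoded: 2 words".
printf "undecoded: %d pieces, decoded: %d words\n",
    scalar @byte_words, scalar @char_words;
```

The four undecoded pieces are the same fragments that appear in the output shown in the question.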

Read http://p3rl.org/UNI and http://training.perl.com/scripts/perlunicook.html to learn about encoding in Perl.


Edit

Probably the problem is here (I can't tell for sure without the content of the file):

open(MYINPUTFILE, "< $ARGV[0]");

      

Find out the file's encoding, perhaps UTF-8 or Windows-1254. Rewrite it, for example:

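# Pick the layer that matches the file's actual encoding:
open my $in, '<:utf8', $ARGV[0] or die "Can't open $ARGV[0]: $!";
# or: open my $in, '<:encoding(Windows-1254)', $ARGV[0] or die "Can't open $ARGV[0]: $!";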

      

Likewise, printing characters to STDOUT (near the end of your program) is also broken because the output is never encoded. ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding shows one way to do it correctly.
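Putting both fixes together, a reworked version of the program in the question might look like the sketch below. It assumes the input file and the terminal are both UTF-8, and words_from_file is a helper name made up for this example. It sets the output layer with binmode rather than following the locale-based approach of ℞ 16.

```perl
use strict;
use warnings;

# Encode everything printed to STDOUT as UTF-8 (assumes a UTF-8 terminal).
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return every \w+ run found in the file at $path,
# decoding the file as UTF-8 while reading (assumed encoding).
sub words_from_file {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path
        or die "Can't open $path: $!";
    my @words;
    while (my $line = <$in>) {
        chomp $line;
        push @words, $line =~ /\w+/g;
    }
    close $in;
    return @words;
}

# Print one word per line for the file named on the command line.
if (@ARGV) {
    print "$_\n" for words_from_file($ARGV[0]);
}
```

Note that _ is itself a word character, so Çidem_Şener comes back as a single match. If _ should act as a delimiter, splitting each line on /[._\s]+/ instead of matching \w+ is one option.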
