Perl UTF-8 encoding in DATA and ARGV files

I have several text files with a lot of Unicode Hebrew and Greek in them that need to be wrapped in an HTML element <span class ="hebrew">...</span>. These files are for a project that has been running for several years.

About eight years ago, we successfully used this Perl script to do the job.

#!/usr/bin/perl

use utf8;

my $table = [
  {
    FROM  => "\\x{0590}",
    TO    => "\\x{05ff}",
    REGEX => "[\\x{0590}-\\x{05ff}]",
    OPEN  => "<span class =\"hebrew\">",
    CLOSE => "</span>",
  },
  {
    FROM  => "\\x{0370}",
    TO    => "\\x{03E1}",
    REGEX => "[\\x{0370}-\\x{03E1}]|[\\x{1F00}-\\x{1FFF}]",
    OPEN  => "<span class =\"greek\">",
    CLOSE => "</span>",
  },
];

binmode(STDIN,":utf8");
binmode(STDIN,"encoding(utf8)");

binmode(STDOUT,":utf8");
binmode(STDOUT,"encoding(utf8)");

while (<>) {

  my $line = $_;

  foreach my $l (@$table) {

    my $regex          = $l->{REGEX};
    my ($from, $to)    = ($l->{FROM},$l->{TO});
    my ($open, $close) = ($l->{OPEN},$l->{CLOSE});

    $line =~ s/(($regex)+(\s+($regex)+)*)/$open$1$close/g;
  }

  print $line;
}

      

It scans a text file for specific Unicode ranges and inserts the appropriate wrapper span.

I haven't used this script for some time, and now I need to process some more text files. But somehow the Unicode is not being preserved: the Unicode text is corrupted and is not wrapped in <span> tags.

I need help with a fix before I can proceed.

Here's a sample input:

Mary had a little כֶּבֶשׂ, its fleece was white as χιών. And πάντα that Mary went, the כֶּבֶשׂ was sure to go.

      

And this is what I get as output:

Mary had a little ×Ö¼Ö¶×ֶש×, its fleece was white as ÏιÏν. And ÏάνÏα that Mary went, the ×Ö¼Ö¶×Ö¶×©× was sure to go.

      

I am currently on a Linux Mint 13 LTS machine. Another OS is Ubuntu 14.04. The Perl version is listed as v5.14.2. I run the script like this:

perl uconv.pl infile.txt > outfile.txt

      

I'm not sure what's going on, and despite reading a few Stack Overflow questions and answers (including this one), I'm none the wiser. Perhaps I need to set some environment variable? Or is something in this script now out of date? Or ...?

1 answer


Your output is fine. Perl is printing the correct byte sequences for a UTF-8 encoded string.

For example, the first Hebrew word, כֶּבֶשׂ, contains these seven Unicode characters

05DB   05BC   05B6   05D1   05B6   05E9   05C2
kaf    dagesh segol  bet    segol  shin   sin dot

      

which are encoded in UTF-8 as fourteen bytes (two per character)

[D7 9B] [D6 BC] [D6 B6] [D7 91] [D6 B6] [D7 A9] [D7 82]

      

and this is the content of the displayed string.
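The byte count above can be checked with a few lines of Perl. This is only a sketch: it assumes the core Encode module, builds the word from its code points so the script's own encoding doesn't matter, and uses a helper I've named hex_dump purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build the word from its seven code points (kaf, dagesh, segol, bet, segol, shin, sin dot).
my $word = join '', map { chr(hex($_)) } qw(05DB 05BC 05B6 05D1 05B6 05E9 05C2);

# Encode the character string into UTF-8 octets.
my $bytes = encode('UTF-8', $word);

# Hypothetical helper: render a byte string as space-separated uppercase hex.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $octets;
}

printf "%d characters, %d bytes\n", length($word), length($bytes);
print hex_dump($bytes), "\n";
```

It should report 7 characters and 14 bytes, followed by the same hex pairs as listed above.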

The problem is not that the program is printing the wrong characters, but that whatever you are using to check the output is not expecting UTF-8.


Update

It looks like the problem is ARGV, not STDIN. Reading with the null filehandle <> actually reads from ARGV, so setting the PerlIO layer of STDIN to UTF-8 using binmode, as you did, has no effect. You also cannot set the mode of ARGV in the same way, because it is not open yet.

But you can fix it using

use open qw/ :std :encoding(utf8) /;

      

which sets the default layers for newly opened input (and output) handles, including ARGV. So when ARGV is opened automatically on the first execution of <>, your data should be read correctly.
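For what it's worth, here is a rough sketch of how the whole script might look with that pragma in place. The ranges and span strings come from the question's table. The sub name wrap_scripts, the array @table, the non-capturing groups, and the @ARGV guard (which makes it process only files named on the command line) are my own choices for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode everything read via <> (ARGV) and STDIN; encode STDOUT and STDERR.
use open qw/ :std :encoding(utf8) /;

# Character ranges and wrappers, taken from the question's table.
my @table = (
    {
        REGEX => qr/[\x{0590}-\x{05FF}]/,
        OPEN  => '<span class ="hebrew">',
        CLOSE => '</span>',
    },
    {
        REGEX => qr/[\x{0370}-\x{03E1}]|[\x{1F00}-\x{1FFF}]/,
        OPEN  => '<span class ="greek">',
        CLOSE => '</span>',
    },
);

# Wrap each run of matching characters (including words joined by whitespace) in its span.
sub wrap_scripts {
    my ($line) = @_;
    for my $entry (@table) {
        my ($re, $open, $close) = @{$entry}{qw(REGEX OPEN CLOSE)};
        $line =~ s/((?:$re)+(?:\s+(?:$re)+)*)/$open$1$close/g;
    }
    return $line;
}

# Assumption: only run the loop when files are named on the command line.
if (@ARGV) {
    print wrap_scripts($_) while <>;
}
```

It is run the same way as before: perl uconv.pl infile.txt > outfile.txt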




Update

It also just occurred to me why the output text was wrong.

My misconception was that even if the input was read as a sequence of octets instead of UTF-8 encoded wide characters, it should still give the correct result, because the same octets would be copied unmodified to the output.

What is obvious now is that, although the input is read as bytes, STDOUT is set to UTF-8 encoding, so the already encoded data gets encoded a second time. Let's take that Hebrew word for lamb from above

[D7 9B] [D6 BC] [D6 B6] [D7 91] [D6 B6] [D7 A9] [D7 82]

      

Since ARGV was still set to :raw, the input was interpreted as these fourteen single-byte characters instead of seven UTF-8 encoded wide characters

D7 9B D6 BC D6 B6 D7 91 D6 B6 D7 A9 D7 82

      

Now if this string is printed, it will be UTF-8 encoded, because that is how STDOUT was set up. ASCII (seven-bit) characters pass through UTF-8 encoding unchanged, but all the "characters" in this string are at code point 0x80 or higher, so each one is encoded as a multibyte sequence.

The result of encoding these fourteen characters is this series of twenty-eight octets

[C3 97] [C2 9B] [C3 96] [C2 BC] [C3 96] [C2 B6] [C3 97] [C2 91] [C3 96] [C2 B6] [C3 97] [C2 A9] [C3 97] [C2 82]

      

which, displayed as a UTF-8 encoded string, appears as fourteen meaningless "characters" that were the result of reading from ARGV without decoding.

Um, QED, I think.
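If you want to see the double encoding for yourself, the sketch below reproduces it without any files. It assumes the core Encode module and simulates the two layers by hand: the raw octets stand in for what was read from ARGV, and the encode call stands in for the UTF-8 layer on STDOUT. The variable names and the hex_dump helper are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The fourteen UTF-8 octets of the Hebrew word, as they sit in the input file.
my $raw = pack 'H*', 'D79BD6BCD6B6D791D6B6D7A9D782';

# Read without decoding: Perl sees fourteen characters in the range 0x80-0xFF.
# Printing through a UTF-8 output layer then encodes each of them again.
my $double = encode('UTF-8', $raw);

# Read with decoding: the octets become seven characters and round-trip cleanly.
my $chars = decode('UTF-8', $raw);
my $clean = encode('UTF-8', $chars);

# Hypothetical helper: render a byte string as space-separated uppercase hex.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $octets;
}

printf "undecoded input:  %d octets out\n", length($double);
print  hex_dump($double), "\n";
printf "decoded input:    %d characters, %d octets out\n", length($chars), length($clean);
print  hex_dump($clean), "\n";
```

The first dump should match the twenty-eight octets shown above, and the second should match the original fourteen.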
