Perl UTF-8 encoding in DATA and ARGV files
I have several text files with a lot of Unicode Hebrew and Greek in them, and that text needs to be wrapped in an HTML element such as <span class ="hebrew">...</span>. These files are for a project that has been running for several years.
About eight years ago, we successfully used this Perl script to do the job.
#!/usr/bin/perl
use utf8;
my $table = [
    {
        FROM => "\\x{0590}",
        TO => "\\x{05ff}",
        REGEX => "[\\x{0590}-\\x{05ff}]",
        OPEN => "<span class =\"hebrew\">",
        CLOSE => "</span>",
    },
    {
        FROM => "\\x{0370}",
        TO => "\\x{03E1}",
        REGEX => "[\\x{0370}-\\x{03E1}]|[\\x{1F00}-\\x{1FFF}]",
        OPEN => "<span class =\"greek\">",
        CLOSE => "</span>",
    },
];
binmode(STDIN,":utf8");
binmode(STDIN,"encoding(utf8)");
binmode(STDOUT,":utf8");
binmode(STDOUT,"encoding(utf8)");
while (<>) {
    my $line = $_;
    foreach my $l (@$table) {
        my $regex = $l->{REGEX},
        my ($from, $to) = ($l->{FROM},$l->{TO});
        my ($open, $close) = ($l->{OPEN},$l->{CLOSE});
        $line =~ s/(($regex)+(\s+($regex)+)*)/$open\1$close/g;
    }
    print $line;
}
It scans a text file, looks for specific Unicode ranges, and wraps each match in the appropriate span.
I haven't used this script for some time, and now I need to process some more text files. But somehow the Unicode is not being preserved: the Unicode text comes out corrupted and is not wrapped in <span> tags.
I need help with a fix before I can proceed.
Here's a sample input:
Mary had a little כֶּבֶשׂ, its fleece was white as χιών. And πάντα that Mary went, the כֶּבֶשׂ was sure to go.
And this is what I get as output:
Mary had a little ×Ö¼Ö¶×ֶש×, its fleece was white as ÏιÏν. And ÏάνÏα that Mary went, the ×Ö¼Ö¶×Ö¶×©× was sure to go.
I am currently on a Linux Mint 13 LTS machine. The other OS in use is Ubuntu 14.04. The Perl version is reported as v5.14.2. I run the script like this:
perl uconv.pl infile.txt > outfile.txt
I'm not sure what's going on, and despite reading a few Stack Overflow questions and answers (such as this one), I'm none the wiser. Perhaps I need to set some environment variable? Or is something in this script now out of date? Or ...?
Your output is fine. Perl is printing the correct byte sequences for a UTF-8 encoded string.
For example, the first Hebrew word, כֶּבֶשׂ, contains these seven Unicode characters
05DB 05BC 05B6 05D1 05B6 05E9 05C2
kaf dagesh segol bet segol shin sin dot
which are encoded in UTF-8 as fourteen bytes (two per character)
[D7 9B] [D6 BC] [D6 B6] [D7 91] [D6 B6] [D7 A9] [D7 82]
and this is the content of the displayed string.
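The byte count above can be checked with a short script. This is only a sketch: it uses the core Encode module, and the helper name hex_dump is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The seven code points of the Hebrew word listed above.
my @code_points = (0x05DB, 0x05BC, 0x05B6, 0x05D1, 0x05B6, 0x05E9, 0x05C2);
my $word = join '', map { chr($_) } @code_points;

# Turn the character string into UTF-8 octets.
my $octets = encode('UTF-8', $word);

# hex_dump is a hypothetical helper: one two-digit hex value per octet.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

printf "%d characters, %d bytes\n", length($word), length($octets);
print hex_dump($octets), "\n";
```

It should report 7 characters and 14 bytes, with the same byte pairs as listed above.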
The problem is not that the program is printing the wrong characters, but that whatever you are using to check the output is not expecting UTF-8.
Update
It looks like the problem is ARGV, not STDIN. Reading from the null filehandle <> actually reads from ARGV, so setting the UTF-8 PerlIO layer on STDIN with binmode, as you did, has no effect. Also, you cannot set the mode of ARGV in the same way, because it is not open yet.
But you can fix it using
use open qw/ :std :encoding(utf8) /;
which sets the default layers for newly opened input (and output) handles, including ARGV. So when ARGV is opened automatically the first time <> is executed, your data should be read correctly.
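Put together, a cut-down version of the script might look like the sketch below. It is not the original script: the table layout, the CLASS key and the wrap_scripts function are hypothetical simplifications, and the @ARGV guard is an addition so that the loop only runs when file names are given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Default layers for handles opened from here on (including ARGV), plus
# STDIN, STDOUT and STDERR because of :std.
use open qw/ :std :encoding(utf8) /;

# Same ranges as in the question; the layout is simplified for illustration.
my @table = (
    { REGEX => qr/[\x{0590}-\x{05ff}]/,                     CLASS => 'hebrew' },
    { REGEX => qr/[\x{0370}-\x{03E1}]|[\x{1F00}-\x{1FFF}]/, CLASS => 'greek'  },
);

# wrap_scripts is a hypothetical helper: it wraps each run of matching
# characters (including whitespace between matching words) in a span.
sub wrap_scripts {
    my ($line) = @_;
    for my $entry (@table) {
        my $re = $entry->{REGEX};
        $line =~ s{((?:$re)+(?:\s+(?:$re)+)*)}{<span class ="$entry->{CLASS}">$1</span>}g;
    }
    return $line;
}

# Process the files named on the command line, as in
#   perl uconv.pl infile.txt > outfile.txt
if (@ARGV) {
    print wrap_scripts($_) while <>;
}
```

Because the pragma is in effect before <> opens the first file, the lines arrive as decoded characters and the \x{...} ranges can match them.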
Update
It also just occurred to me why the output text was wrong.
My misconception was that even if the input were read as a sequence of octets instead of being decoded from UTF-8 into wide characters, the result would still be correct as long as the same octets were copied unmodified to the output.
What is obvious now is that while the input is read as bytes, STDOUT is set to UTF-8 encoding, so the already encoded data gets encoded a second time. Let's take that Hebrew word for lamb from above
[D7 9B] [D6 BC] [D6 B6] [D7 91] [D6 B6] [D7 A9] [D7 82]
Since ARGV was still set to :raw, the input was interpreted as these fourteen single-byte characters instead of seven UTF-8 encoded wide characters
D7 9B D6 BC D6 B6 D7 91 D6 B6 D7 A9 D7 82
Now if this string is printed, it will be UTF-8 encoded, because that is how STDOUT was set up. ASCII (seven-bit) characters pass through UTF-8 encoding unchanged, but all the "characters" in this string are at code point 0x80 or higher, so each one will be encoded as a multibyte sequence.
The result of encoding these fourteen characters is this series of twenty-eight octets
[C3 97] [C2 9B] [C3 96] [C2 BC] [C3 96] [C2 B6] [C3 97] [C2 91] [C3 96] [C2 B6] [C3 97] [C2 A9] [C3 97] [C2 82]
which, displayed as a UTF-8 encoded string, appears as fourteen meaningless "characters": the result of reading from ARGV without decoding.
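The double encoding can be reproduced without any files. The sketch below assumes that the core Encode module behaves like the :encoding layer for this purpose; the variable names and the hex_dump helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The fourteen UTF-8 octets of the Hebrew word, as they sit in the file.
my $file_bytes = pack('H*', 'D79BD6BCD6B6D791D6B6D7A9D782');

# Read without a decoding layer: each octet is treated as one "character".
my $undecoded = $file_bytes;

# Printed through a UTF-8 output layer: each of those is encoded again.
my $double = encode('UTF-8', $undecoded);

# Read with decoding: seven characters, encoded once on the way out.
my $decoded = decode('UTF-8', $file_bytes);
my $single  = encode('UTF-8', $decoded);

# hex_dump is a hypothetical helper: one two-digit hex value per octet.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

printf "undecoded input: %d characters, %d bytes out\n", length($undecoded), length($double);
print hex_dump($double), "\n";
printf "decoded input:   %d characters, %d bytes out\n", length($decoded), length($single);
print hex_dump($single), "\n";
```

The first dump should match the twenty-eight octets above, and the second should match the original fourteen.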
Um, QED, I think.