Same code, different results on different machines regarding UTF8 characters

Question

I have this code:

use strict;
use warnings;
use utf8;
use HTML::Entities;
use feature 'say';

binmode STDOUT, ':encoding(utf-8)';

my $t1 = "&#x010c;esk&aacute; Spo&#x0159;itelna - Q3 2014";
my $t2 =  "&#268;esk&aacute; Spo&#345;itelna - Q3 2014";

say decode_entities($t1);
say decode_entities($t2);

which when executed on my dev machine outputs:

Česká Spořitelna - Q3 2014
Česká Spořitelna - Q3 2014

and when executed on the UAT (User Acceptance Test) machine outputs:

ÄeskÃ¡ SpoÅitelna - Q3 2014
ÄeskÃ¡ SpoÅitelna - Q3 2014

Now, on both machines, perl -v reports:

This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi-ld

and the version of HTML::Entities is the same on both machines:

    Installed: 3.69
    CPAN:      3.69  up to date

My dev machine is running CentOS release 5.8 (Final)

and the UAT machine is running Red Hat Enterprise Linux Server release 5.8 (Tikanga)

EDIT (regarding the output of the locale command): the output is the same on both machines:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

UPDATE:

I posted a link to this question in the Perl developers group on Facebook and got some really helpful ideas there: compare the output bytes across the two systems. If they are identical, it is a display problem. And they are. There are several ways to do this:

1)

say join ':', map { ord } split //, decode_entities($t1);
say join ':', map { ord } split //, decode_entities($t2);

which displays

268:101:115:107:225:32:83:112:111:345:105:116:101:108:110:97:32:45:32:81:51:32:50:48:49:52

on both systems, so the decoded characters (code points) are the same

2) print $t1 and $t2 to a file on each system, then run hexdump -C against those files and compare the output. This method also showed that the contents of the files are the same
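The values printed in method 1 are code points rather than bytes (268 and 345 do not fit in one byte), so to compare what is actually sent to the terminal it helps to look at the encoded octets as well. A minimal sketch, assuming the output is encoded as UTF-8; utf8_hex is a hypothetical helper name, and the string is written with escapes instead of going through decode_entities so that only core modules are needed:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 octets of a character string as space-separated hex.
sub utf8_hex {
    my ($chars) = @_;
    my $octets = encode('UTF-8', $chars);
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $octets;
}

# "Česká" without the final characters: U+010C e s k U+00E1
print utf8_hex("\x{010C}esk\x{E1}"), "\n";   # c4 8c 65 73 6b c3 a1
```

If this line is identical on both machines, the program output is the same and any difference is introduced by whatever displays it.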

Conclusion

This is a display issue: the console (PuTTY) is not displaying the characters properly. We have this problem when we add these characters to the DB, and I thought I could demonstrate it with the code above. Your answers (and some of the Facebook ones) helped me find out that decode_entities() works as expected, and that our problem lies somewhere else (most likely mysql table encoding or mysql join).

perl utf-8

Tudor Constantin 27 Aug 14 at 12:40

1 answer

Borodin · Accepted Answer · 2014-08-27T12:45:07+0000

The encoding expected by the two terminals is different. If you want to print UTF-8 you have to set both terminals to expect UTF-8, for example for Romanian

LANG=ro_RO.UTF-8

and also set STDOUT to encode the output that way in your Perl, like

binmode STDOUT, ':encoding(utf-8)'
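As a rough sketch of checking the first half of that advice from inside the script, the following reads the codeset out of the locale environment variables before choosing an output layer. locale_codeset is a hypothetical helper, not part of any module, and it only inspects the environment: a terminal emulator such as PuTTY has its own character set setting that this check cannot see.

```perl
use strict;
use warnings;

# Return the codeset named by the effective locale, or undef if none is given.
# POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
sub locale_codeset {
    my ($env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$var};
        next unless defined $value && length $value;
        # Locale names look like language_TERRITORY.codeset@modifier.
        return $value =~ /\.([^@]+)/ ? $1 : undef;
    }
    return undef;
}

my $codeset = locale_codeset(\%ENV);
if (defined $codeset && $codeset =~ /^utf-?8$/i) {
    binmode STDOUT, ':encoding(UTF-8)';
}
else {
    warn "Locale codeset is ", $codeset // 'unset', ", not UTF-8\n";
}
```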

Update

I can explain what is going on, although I'm not sure why it happens.

Take the first character of the string, "\x{010C}", which is capital C with caron. Perl encodes this as the two-octet sequence "\x{C4}\x{8C}" and sends it to the terminal, which on your development machine decodes it and displays it correctly.

However, on your test machine, the terminal decodes the first octet of the encoded character, C4, as if it were ISO-8859-1: capital A with umlaut. The second octet, 8C, is ignored because it is not a printable character in that encoding.
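The misreading described here can be reproduced without any terminal, as a small sketch using the core Encode module: encode the character as UTF-8, then decode the resulting octets as ISO-8859-1. misread_as_latin1 is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Misread a character string the way an ISO-8859-1 terminal would:
# encode it to UTF-8 octets, then treat every octet as one Latin-1 character.
sub misread_as_latin1 {
    my ($chars) = @_;
    return decode('ISO-8859-1', encode('UTF-8', $chars));
}

# One character in, two characters out.
for my $char (split //, misread_as_latin1("\x{010C}")) {
    printf "U+%04X\n", ord $char;   # U+00C4, then U+008C
}
```

U+00C4 is the capital A with umlaut seen in the garbled output, and U+008C is a control character with no visible glyph.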

So you need to change the code page your terminal is using. The way to do this is by setting LANG as I described, but I can't explain why it doesn't work if your locale is already set correctly.
