What is the correct stored value for umlaut "ü" in Perl?

Question

I would like to serve UTF-8 web pages directly from Perl. I ran into several encoding issues because the original data is not consistently stored in UTF-8. While debugging these issues, I found two different representations of the German umlaut ü. Which one is the correct stored value in Perl?

\xFC, which is the Unicode code point U+00FC for ü

0xC3 0xBC, which is the UTF-8 byte sequence (in hexadecimal) for ü

If there is no difference, why does Perl store umlauts in different representations instead of consistently using either the Unicode code point or the UTF-8 byte sequence?

Link to Unicode / UTF-8 character table

+3

perl unicode utf-8 diacritics

burnersk 05 Aug 14 at 16:02


3 answers

Neither one is "the correct" value; they are different views of the same character. Generally speaking, it is better to work with Unicode internally and output UTF-8, but the major complication is knowing exactly what you have at each stage of processing. If you can reliably use UTF-8 everywhere, that might be easier in your case.

+3

tripleee 05 Aug 14 at 16:13


Both are correct. It depends on your intentions.

\xFC is the normal form of a Unicode text string that contains the ü character. This is usually the form in which you handle a string of text inside your application.

0xC3 0xBC is the correct form of a byte string that holds the ü character encoded as UTF-8. This is usually the form in which you receive UTF-8 bytes from, or send them to, something external, such as a network socket or a file handle on disk.

+2

LeoNerd 05 Aug 14 at 19:14


ikegami · Accepted Answer · 2014-08-05T16:06:31+0000

Use Encoding::FixLatin's fix_latin.

$ perl -MEncoding::FixLatin=fix_latin -MEncode=encode_utf8 \
   -E'say sprintf "%v02X", encode_utf8(fix_latin("\xFC\xC3\xBC"))'
C3.BC.C3.BC

Internally, it's best to work with Unicode: decode inputs, encode outputs. You are probably forgetting to encode the output somewhere.
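The "decode inputs, encode outputs" pattern might look roughly like the sketch below. The input bytes are hard-coded here as a stand-in for data read from a file or socket, and the variable names are hypothetical. The output is encoded by an :encoding(UTF-8) layer on STDOUT.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed input: raw UTF-8 bytes for "für", as they might arrive from outside.
my $input_bytes = "f\xC3\xBCr";

# Decode once at the input boundary: bytes become characters.
my $text = decode('UTF-8', $input_bytes);

# Work on characters inside the program.
my $upper = uc $text;    # "FÜR", three characters

# Encode at the output boundary by putting an encoding layer on the handle.
binmode(STDOUT, ':encoding(UTF-8)');
print "$upper\n";
```

If the binmode call is left out, Perl writes characters below U+0100 as single bytes, so ü would go out as the lone byte 0xFC rather than 0xC3 0xBC. That is one plausible way to end up with the mix of representations described in the question.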
