What is the correct stored value for the umlaut "ü" in Perl?
I would like to deliver UTF-8 websites directly using Perl. I ran into several encoding issues because the original data is not fully stored in UTF-8. While debugging these issues, I found two different representations for the German umlaut ü. Which one is the correct stored value in Perl?
- \xFC, which is the Unicode position U+00FC for ü
- 0xC3 0xBC, which is the UTF-8 hexadecimal representation for ü
If there is no difference, why does Perl store umlauts in different representations instead of storing them consistently as either the Unicode position or the UTF-8 hexadecimal representation?
Link to Unicode / UTF-8 character table
Use Encoding::FixLatin's fix_latin.
$ perl -MEncoding::FixLatin=fix_latin -MEncode=encode_utf8 \
-E'say sprintf "%v02X", encode_utf8(fix_latin("\xFC\xC3\xBC"))'
C3.BC.C3.BC
Internally, it's better to work with Unicode: decode inputs, encode outputs. You probably have a connection where you forgot to encode the output.
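The "decode inputs, encode outputs" pattern could look roughly like the following. This is only a minimal sketch using the core Encode module; the sample input "für" and the variable names are made up for illustration, and it assumes the incoming bytes really are UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes for "für", as they might arrive
# from a file or a socket.
my $input_bytes = "f\xC3\xBCr";

# Decode at the boundary: bytes in, characters out.
my $text = decode('UTF-8', $input_bytes);

# Inside the program the string holds three characters,
# and ü is the single code point U+00FC.
printf "%d characters, second is U+%04X\n",
    length($text), ord(substr($text, 1, 1));

# Encode at the boundary: characters in, bytes out.
my $output_bytes = encode('UTF-8', $text);
printf "%d bytes: %s\n",
    length($output_bytes),
    join(' ', map { sprintf '%02X', ord($_) } split //, $output_bytes);
```

This should print "3 characters, second is U+00FC" followed by "4 bytes: 66 C3 BC 72". Instead of calling encode by hand, the output handle can also be given an encoding layer with binmode(STDOUT, ':encoding(UTF-8)'), so that everything printed to it is encoded automatically.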
There is no single "correct" one; they are different views of the same character. Generally speaking, it is better to stick with Unicode internally and print it as UTF-8, but the major complication is knowing exactly what you have at each stage of processing; if you can use UTF-8 reliably everywhere, that might be easier in your case.
Both are correct. It depends on your intentions.
\xFC is the regular form of a Unicode text string that contains the ü character. This is usually the form in which you process a line of text in your application.
0xC3 0xBC is the correct form of a byte string that encodes the ü character in UTF-8. This is usually the form in which you receive UTF-8 bytes from, or transmit them to, something external, such as a network socket or a file on disk.
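To see the two forms side by side, something like the following can be used. It is just a sketch with the core Encode module, and the variable names are illustrative. The last step deliberately encodes a string that is already encoded, which is one common way to end up with garbled umlauts when you lose track of which form you are holding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\xFC";                     # text string: one character, U+00FC
my $bytes = encode('UTF-8', $chars);    # byte string: two bytes, C3 BC

printf "text:  length %d, %vX\n", length($chars), $chars;   # text:  length 1, FC
printf "bytes: length %d, %vX\n", length($bytes), $bytes;   # bytes: length 2, C3.BC

# Decoding the bytes gives back the original text string.
print "round trip ok\n" if decode('UTF-8', $bytes) eq $chars;

# Encoding a second time treats each byte as a character of its own,
# which produces the typical double-encoded garbage.
my $double = encode('UTF-8', $bytes);
printf "double-encoded: %vX\n", $double;                    # double-encoded: C3.83.C2.BC
```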