What is the correct stored value for umlaut "ü" in Perl?
I would like to serve UTF-8 web pages directly from Perl. I have run into several encoding issues because the original data is not consistently stored in UTF-8. While debugging these issues, I found two different representations of the German umlaut ü. Which one is the correct stored value in Perl?
- \xFC, which is the Unicode code point U+00FC for ü
- 0xC3 0xBC, which is the UTF-8 hexadecimal representation of ü
If there is no difference, why does Perl store umlauts in different representations rather than consistently as either the Unicode code point or the UTF-8 byte sequence?
Use fix_latin from Encoding::FixLatin to repair input that mixes Latin-1 and UTF-8 bytes.
$ perl -MEncoding::FixLatin=fix_latin -MEncode=encode_utf8 \
-E'say sprintf "%v02X", encode_utf8(fix_latin("\xFC\xC3\xBC"))'
C3.BC.C3.BC
Internally, it's better to work with Unicode: decode inputs, encode outputs. You probably have a code path that forgets to encode its output.
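The "decode inputs, encode outputs" rule can be sketched with the core Encode module. This is only a minimal illustration: the variable names are made up for the example, and the input is assumed to be valid UTF-8 already (which is not the case for the mixed data in the question).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a UTF-8 source (file, socket, database).
my $input_bytes = "\xC3\xBC";

# Input boundary: bytes -> Perl text string (one element per code point).
my $text = decode('UTF-8', $input_bytes);

# Inside the program, work with characters.
my $char_count = length $text;
my $code_point = sprintf 'U+%04X', ord $text;

# Output boundary: text string -> UTF-8 bytes.
my $output_bytes = encode('UTF-8', $text);

printf "%d character, %s, %d bytes out\n",
    $char_count, $code_point, length $output_bytes;
# prints: 1 character, U+00FC, 2 bytes out
```

If the final encode step is skipped, Perl may emit the single byte FC (Latin-1) instead of C3 BC, which would explain seeing both forms in the output.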
There is no single "correct" one; they represent different views of the same character. Generally speaking, it is better to work with Unicode text internally and output it as UTF-8. The major complication is knowing exactly what you have at each stage of processing; if you can reliably use UTF-8 everywhere, that might be easier in your case.
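To find out what a string holds at a given stage, it can help to dump the ordinal of each element. The helper below, dump_string, is a hypothetical name invented for this sketch and not part of any module.

```perl
use strict;
use warnings;

# Hypothetical debugging helper: list the ordinal of every element in a
# string as hex, so a text string and a byte string can be told apart.
sub dump_string {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

my $as_text  = "\xFC";        # one element: the code point U+00FC
my $as_bytes = "\xC3\xBC";    # two elements: the UTF-8 bytes for the same character

print dump_string($as_text),  "\n";   # prints: FC
print dump_string($as_bytes), "\n";   # prints: C3 BC
```

Calling such a helper right after reading input and right before writing output shows where a decode or encode step is missing.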
Both are correct. It depends on your intentions.
\xFC is the regular form of a Unicode text string that contains the ü character. This is usually the form in which you process text inside your application.
0xC3 0xBC is the correct form of a byte string that encodes the ü character in UTF-8. This is usually the form in which you receive UTF-8 bytes from, or transmit them to, some external resource such as a network socket or a file handle.
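As a rough sketch of how the two forms meet at an I/O boundary, an :encoding(UTF-8) layer on a file handle turns the text form into the byte form on output. The in-memory file handle and the variable names here are only assumptions made to keep the example self-contained; a real program would open a file or socket instead.

```perl
use strict;
use warnings;

my $text = "\xFC";    # text string: one character, U+00FC

# Write through an encoding layer into an in-memory file, so the bytes
# that would reach the outside world can be inspected afterwards.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$out} $text;
close $out or die "close failed: $!";

my $hex = sprintf '%vX', $buffer;

print "characters in text: ", length($text), "\n";   # prints: characters in text: 1
print "bytes written: $hex\n";                       # prints: bytes written: C3.BC
```

For a web page, the same layer can be applied to standard output with binmode(STDOUT, ':encoding(UTF-8)'), so every printed text string is encoded on the way out.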