What is the correct stored value for umlaut "ü" in Perl?

I would like to serve UTF-8 web pages directly from Perl. I ran into several encoding issues because the original data is not consistently stored as UTF-8. While debugging these issues, I found two different representations of the German umlaut ü. Which one is the correct stored value in Perl?

  • \xFC, which is the Unicode position U+00FC for ü

  • 0xC3 0xBC, which is the UTF-8 hexadecimal representation of ü

If there is no difference, why does Perl store umlauts in different representations rather than consistently using either the Unicode position or the UTF-8 hexadecimal representation?

Link to Unicode / UTF-8 character table



3 answers


Use fix_latin from Encoding::FixLatin.

$ perl -MEncoding::FixLatin=fix_latin -MEncode=encode_utf8 \
   -E'say sprintf "%v02X", encode_utf8(fix_latin("\xFC\xC3\xBC"))'
C3.BC.C3.BC

      



Internally, it's best to work with Unicode: decode inputs, encode outputs. You are probably forgetting to encode the output somewhere.
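The "decode inputs, encode outputs" rule can be sketched roughly as follows. This is a minimal illustration using the core Encode module, not code from the question; the variable names and the hard-coded input bytes are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the UTF-8 bytes for "ü" as they might arrive
# from a file, socket, or database.
my $input_bytes = "\xC3\xBC";

# Decode at the boundary: two bytes become one character, U+00FC.
my $text = decode('UTF-8', $input_bytes);
printf "characters: %d, code point: U+%04X\n", length($text), ord($text);

# Work with characters internally (here: uppercase to U+00DC).
my $upper = uc $text;

# Encode at the boundary, just before the data leaves the program.
my $output_bytes = encode('UTF-8', $upper);
printf "output bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $output_bytes;
```

For a script that prints a web page, an alternative to calling encode by hand is to put an encoding layer on the output handle, for example binmode(STDOUT, ':encoding(UTF-8)'), so that every character string printed to it is encoded on the way out.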



Neither is "the correct one"; they are different views of the same character. Generally speaking, it is better to work with Unicode internally and print it as UTF-8, but the major complication is knowing exactly what you have at each stage of processing. If you can reliably use UTF-8 everywhere, that might be easier in your case.





Both are correct. It depends on your intentions.

\xFC is the regular form of a Unicode text string containing the character ü. This is usually the form in which you process text within your application.

0xC3 0xBC is the correct form of a byte string that encodes the character ü as UTF-8. This is usually the form in which you receive UTF-8 bytes from, or send them to, something external, such as a network socket or a file on disk.
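To make the difference between the two forms concrete, here is a small sketch, again assuming the core Encode module and illustrative variable names. It prints the length and the code units of each string using the %vX format.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\xFC";                   # text string: one character, U+00FC
my $bytes = encode('UTF-8', $chars);  # byte string: two bytes, C3 BC

printf "text string: length %d, %vX\n", length($chars), $chars;
printf "byte string: length %d, %vX\n", length($bytes), $bytes;

# Decoding the byte string gives back the original text string.
print "round trip ok\n" if decode('UTF-8', $bytes) eq $chars;
```

The text string reports length 1 and the byte string length 2, which is why mixing the two forms in one program leads to doubled or broken umlauts.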
