Why is a Unicode character stored as UTF-8 in std::string and as UTF-16/32 in wchar_t?

I have a small piece of code:

#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string>

wchar_t widec('€');
wchar_t widecl(L'€');
std::string tc("€");

int main(int argc, char *argv[])
{
    printf("printf as hex - std::string tc(\"€\") = %x %x %x\n\r", tc.c_str()[0], tc.c_str()[1], tc.c_str()[2]);
    printf("printf as hex - wchar_t widec('€') = %x\n\r", widec);
    printf("printf as hex - wchar_t widecl(L'€') = %x\n\r", widecl);

    return 0;
}

      

This outputs:

printf as hex - std::string tc("€") = ffffffe2 ffffff82 ffffffac
printf as hex - wchar_t widec('€') = e282ac
printf as hex - wchar_t widecl(L'€') = 20ac

      

I don't understand two things.

  • Why are the elements of tc.c_str() (more precisely, indices [0], [1] and [2]) printed as UTF-8 bytes but with leading FF bytes, as if they were UTF-16/32 values?

  • Why does initializing a wchar_t variable give different output depending on whether the L prefix is used? With the L prefix it seems to hold UTF-16/32 content, and without it UTF-8 content. Why is that?



1 answer


  • A char without an explicit signed or unsigned specifier is either signed or unsigned, depending on the compiler. The standard does not dictate the default; it is the compiler vendor's choice.

    Passing a char to printf() promotes the value to int, which widens it from 8 to 32 bits on typical platforms. %x then prints the bits of that 32-bit value, omitting leading zeros by default (unless you add a zero-padded width to the specifier, such as %08x, to keep them). How an 8-bit value is widened to 32 bits depends on its actual type.

    In your case, the extra f digits you see come from the char values being sign-extended. The high bit of each of 0xE2, 0x82 and 0xAC is 1, so 1 is used to fill the upper 24 bits during the widening. This means your compiler implements char as a signed type and extends the values to signed int. You can cast the char values to unsigned char to force them to be zero-extended:

    printf("printf as hex - std::string tc(\"€\") = %x %x %x\n",
           (unsigned char) tc[0], (unsigned char) tc[1], (unsigned char) tc[2]);
    
          

    (Note that I removed the use of c_str(); it is not needed in your example.)

  • How '€' and "€" are interpreted without any prefix depends on the encoding your source file is saved in and the encoding the compiler is configured to work with.

    The only way the non-prefixed '€' and "€" literals can end up as UTF-8 is if your source file is saved as UTF-8 (to force UTF-8 literals, you can use the u8 prefix in C++11 and later). Save the file in a different encoding and you will see different results. The result of this interpretation is then assigned as-is to tc and stored as-is as a wchar_t in widec. Because '€' is three bytes in UTF-8, the compiler treats it as a multi-character literal, whose value is implementation-defined; here the three bytes are packed into 0xE282AC.

    The L prefix, on the other hand, forces the compiler to interpret L'€' as a wide literal instead of a narrow literal, so there is no question of how it should be interpreted. It knows the literal is Unicode, so it decodes the character to its Unicode code point and then stores that as a wchar_t value (wchar_t is 16-bit on Windows and typically 32-bit on other platforms) in widecl. The Unicode code point of € is U+20AC EURO SIGN.


