Why is a Unicode char stored as UTF-8 in std::string and as UTF-16/32 in wchar_t?

Question

I have a small piece of code:

#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string>

wchar_t widec('€');
wchar_t widecl(L'€');
std::string tc("€");

int main(int argc, char *argv[])
{
    printf("printf as hex - std::string tc(\"€\") = %x %x %x\n\r", tc.c_str()[0], tc.c_str()[1], tc.c_str()[2]);
    printf("printf as hex - wchar_t widec('€') = %x\n\r", widec);
    printf("printf as hex - wchar_t widecl(L'€') = %x\n\r", widecl);

    return 0;
}

This outputs:

printf as hex - std::string tc("€") = ffffffe2 ffffff82 ffffffac
printf as hex - wchar_t widec('€') = e282ac
printf as hex - wchar_t widecl(L'€') = 20ac

I don't understand two things.

Why are the bytes of tc.c_str() (more precisely, indices [0], [1] and [2]) printed as UTF-8 but with leading FF bytes, as if they were UTF-16/32 values?

Why does initializing the same kind of wchar_t variable give different output depending on whether the L prefix is used? With the prefix it seems to produce UTF-16/32 content, and without it UTF-8 content. Why is that?

c++ unicode utf-8

user5811974 Apr 18 17 at 19:17

1 answer

Remy Lebeau · Answer 1 · 2017-04-18T19:30:50+0000

A char without an explicit sign specifier is either signed or unsigned, depending on the compiler. The standard does not dictate the default; it is the compiler vendor's choice.

Passing a char to printf() promotes the value to int, which widens it from 8 to 32 bits on typical platforms. %x then prints the bits of that 32-bit value, omitting leading zeros by default (unless you add a width to the specifier, such as %08x, to keep them). How an 8-bit value is widened to 32 bits depends on its actual type.

In your case, the extra f digits you see come from the char values being sign-extended. The high bit of each of 0xEx, 0x8x and 0xAx is 1, so 1s are used to fill the upper 24 bits during the extension. This means that your compiler implements char as a signed type and extends the values to signed int. You can cast the char values to unsigned char to force them to be zero-extended:
```
printf("printf as hex - std::string tc(\"€\") = %x %x %x\n",
       (unsigned char) tc[0], (unsigned char) tc[1], (unsigned char) tc[2]);
```
(Note that I removed the use of c_str(); it is not needed in your example.)

How the unprefixed '€' and "€" literals are interpreted depends on the encoding your source file is saved in and on the encoding the compiler is configured to work with.

The unprefixed '€' and "€" literals can only come out as UTF-8 if your source file is saved as UTF-8 (to force UTF-8 literals, you can use the u8 prefix in C++11 and later). Save the file in a different encoding and you will see different results. The result of that interpretation is stored as-is in tc. For widec, the three-byte '€' is a multicharacter literal of type int with an implementation-defined value (0xe282ac with your compiler), and that value is converted as-is to wchar_t.

The L prefix, on the other hand, forces the compiler to interpret L'€' as a wide literal instead of a narrow one, so there is no ambiguity about how it should be interpreted. The compiler knows the literal is Unicode, so it resolves the Unicode code point and then stores it as a wchar_t value in widecl (wchar_t is 16-bit on Windows and 32-bit on most other platforms). The Unicode code point of € is U+20AC EURO SIGN.
