Why is a Unicode char stored as UTF-8 in std::string and as UTF-16/32 in wchar_t?
I have a small piece of code:
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string>

wchar_t widec('€');
wchar_t widecl(L'€');
std::string tc("€");

int main(int argc, char *argv[])
{
    printf("printf as hex - std::string tc(\"€\") = %x %x %x\n\r", tc.c_str()[0], tc.c_str()[1], tc.c_str()[2]);
    printf("printf as hex - wchar_t widec('€') = %x\n\r", widec);
    printf("printf as hex - wchar_t widecl(L'€') = %x\n\r", widecl);
    return 0;
}
This outputs:
printf as hex - std::string tc("€") = ffffffe2 ffffff82 ffffffac
printf as hex - wchar_t widec('€') = e282ac
printf as hex - wchar_t widecl(L'€') = 20ac
I don't understand two things:
- Why are the bytes of tc.c_str() (more precisely, indices [0], [1] and [2]) printed as UTF-8 but with leading FF bytes, so that they look like UTF-16/32 values?
- Why does initializing the same wchar_t variable give different output depending on whether the L prefix is used? With the prefix it seems to produce UTF-16/32 content, and without the prefix UTF-8 content. Why is that?
- A char without an explicit sign specifier is either signed or unsigned, depending on the compiler. The standard does not dictate the default; it is the compiler vendor's choice.

  Passing a char to printf() widens the value from 8 to 32 bits on the call stack. %x then prints the bits of that 32-bit value, omitting leading zeros by default (unless you give %x a zero-padded width, for example %08x, to keep them). How an 8-bit value is widened to 32 bits depends on its actual type.

  In your case, the extra f digits you see come from sign extension of the char values. The high bits of 0xEx, 0x8x and 0xAx are all 1, so 1 is used to fill the upper 24 bits during the widening. This means that your compiler implements char as a signed type and extends the values to signed int. You can cast the char values to unsigned char to force them to be zero-extended:

      printf("printf as hex - std::string tc(\"€\") = %x %x %x\n", (unsigned char) tc[0], (unsigned char) tc[1], (unsigned char) tc[2]);

  (Note that I removed the use of c_str(); it is not needed in your example.)

- The interpretation of '€' and "€" without any prefix depends on the encoding your source file is saved in and on the encoding the compiler is configured to work with.

  The only way the unprefixed '€' and "€" literals can end up as UTF-8 is if your source file is saved as UTF-8 (to force UTF-8 string literals, you can use the u8 prefix in C++11 and later). Save the file in a different encoding and you will see different results. The result of this interpretation is then assigned as-is to tc and stored as-is as a wchar_t in widec.

  The L prefix, on the other hand, forces the compiler to interpret L'€' as a wide literal instead of a narrow literal, so there is no question of how it should be interpreted. It knows the literal is Unicode, so it determines the character's Unicode codepoint and then encodes it as a wchar_t value (wchar_t is 16-bit on Windows and 32-bit on most other platforms) in widecl. The Unicode codepoint of € is U+20AC EURO SIGN.
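For the first point, the difference between sign extension and zero extension can be sketched in isolation. This is only an illustration, assuming a 32-bit int; the function names widen_as_signed, widen_as_unsigned and print_euro_bytes are made up for this example and do not come from the question.

```cpp
#include <cstdio>

// Widen a byte the way a *signed* 8-bit char is promoted to int:
// the top bit is copied into the upper bits (sign extension).
unsigned widen_as_signed(unsigned char byte) {
    int promoted = static_cast<signed char>(byte);  // 0xE2 becomes -30
    return static_cast<unsigned>(promoted);         // bit pattern that %x prints
}

// Widen a byte the way an unsigned char is promoted to int:
// the upper bits are filled with zeros (zero extension).
unsigned widen_as_unsigned(unsigned char byte) {
    int promoted = byte;                            // 0xE2 stays 226
    return static_cast<unsigned>(promoted);
}

// Hypothetical helper: prints both views of the three UTF-8 bytes of U+20AC.
void print_euro_bytes() {
    const unsigned char euro[] = {0xE2, 0x82, 0xAC};
    for (unsigned char b : euro) {
        std::printf("byte %02x: signed char -> %x, unsigned char -> %x\n",
                    static_cast<unsigned>(b),
                    widen_as_signed(b),
                    widen_as_unsigned(b));
    }
}
```

Calling print_euro_bytes() from main should show ffffffe2, ffffff82 and ffffffac in the signed column and e2, 82 and ac in the unsigned column, provided int is 32 bits wide. A byte below 0x80, such as 0x41, comes out the same either way because its high bit is 0.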
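For the second point, one way to take the source file encoding out of the picture is to spell the character with escapes instead of typing € directly. The sketch below is an illustration under assumptions: the helper code_units and the three wrapper functions are hypothetical names, and the wide result assumes a compiler whose wide execution character set is Unicode (as with mainstream GCC, Clang and MSVC setups).

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Return the code units of a string literal (without the terminating NUL)
// as unsigned values, so that no sign extension can creep in.
template <typename CharT, std::size_t N>
std::vector<unsigned long> code_units(const CharT (&literal)[N]) {
    using Unsigned = typename std::make_unsigned<CharT>::type;
    std::vector<unsigned long> units;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        units.push_back(static_cast<Unsigned>(literal[i]));
    }
    return units;
}

// Narrow literal with the UTF-8 bytes written out explicitly.
std::vector<unsigned long> narrow_bytes_euro() {
    return code_units("\xE2\x82\xAC");
}

// u8 literal: the compiler encodes U+20AC as UTF-8 regardless of file encoding.
std::vector<unsigned long> u8_euro() {
    return code_units(u8"\u20AC");
}

// Wide literal: the compiler stores the codepoint as a wchar_t code unit.
std::vector<unsigned long> wide_euro() {
    return code_units(L"\u20AC");
}
```

Here u8_euro() yields the three bytes e2, 82 and ac, the same as the hand-written narrow literal, while wide_euro() yields the single value 20ac. That mirrors the difference between widec and widecl in the question, except that the unprefixed '€' there is a narrow character literal holding several bytes, whose value is up to the compiler.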