How to get a code point literal in UTF-8

I just found out that the u8 character-literal prefix in C++17 does not cover all UTF-8 code points, only the ASCII part.

From cppreference:

UTF-8 character literal, e.g. u8'a'. Such a literal has type char and a value equal to the ISO 10646 code point value of c-char, provided that the code point value is representable with a single UTF-8 code unit. If c-char is not in the Basic Latin or C0 Controls Unicode block, the program is ill-formed.

auto hello = u8'嗨';     // ill-formed
auto world = u8"世";     // not a character
auto what = 0xE7958C;    // almost human-readable
auto wrong = u8"錯"[0];  // not even correct

      

How can I concisely get a code point literal in UTF-8?

EDIT: For people wondering how a UTF-8 code point could be stored: the approach I find reasonable is similar to what Golang does. The main idea is to store a single code point in a 32-bit type when only one code point is needed.

EDIT2: Following the arguments in the helpful comments, there is no reason to store UTF-8 encoded data in a 32-bit type. Either it is decoded, in which case it is UTF-32 and uses the U prefix, or it stays encoded in a string with the u8 prefix.

+3




2 answers


If you need a code point, you should use char32_t and the U prefix:

auto hello = U'嗨';

      

UTF-8 stores code points as a sequence of 8-bit code units. A char in C++ is a single code unit, and therefore cannot store every Unicode code point. A u8 character literal does not compile if you give it a code point that needs multiple code units, because a character literal yields only a single char.

If you want a single Unicode code point encoded as UTF-8, then what you want is a string literal, not a character literal:

auto hello = u8"嗨";
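The difference between the two literals can be checked at compile time. The following is a minimal sketch, not part of the original answer: the constant names are made up for illustration, and 嗨 (U+55E8) is written as the escape \u55E8 so the result does not depend on the source file encoding.

```cpp
#include <cstddef>

// U'...' yields a char32_t that holds the code point value itself.
constexpr char32_t hello_code_point = U'\u55E8';

// u8"..." yields an array of UTF-8 code units plus a terminating null.
// sizeof counts bytes, and every UTF-8 code unit is one byte.
constexpr std::size_t hello_unit_count = sizeof(u8"\u55E8") - 1;

// Indexing the string gives one code unit only, here the lead byte.
// The cast keeps this valid whether the element type is char or char8_t.
constexpr unsigned char hello_first_unit =
    static_cast<unsigned char>(u8"\u55E8"[0]);

static_assert(hello_code_point == 0x55E8, "U literal stores the code point");
static_assert(hello_unit_count == 3, "U+55E8 needs three UTF-8 code units");
static_assert(hello_first_unit == 0xE5, "first unit is only the lead byte");
```

This also shows why indexing a u8 string, as in the question's last line, does not give the character: it gives the first of several code units.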

      




"the way I find reasonable is similar to how Golang does it."

Well, you're not using Go, are you?

In C++, if you ask for a character literal, then you mean a single object of that type. A u8 character literal will always be a char. Its type will not change depending on what is in the literal. You ask for a character literal, you get a character literal.

It is clear from the site you linked to that Go actually has no concept of a UTF-8 character literal. It just has character literals, all of which are 32-bit values. Basically, all character literals in Go behave like U''.

+6




In C++, a character literal is exactly one char object. A char object in C++ terminology corresponds to a code unit in Unicode. Some code points need more than one code unit in UTF-8. Therefore, not all code points can be represented by a single char object. The ones that can are those in the Basic Latin and C0 Controls blocks.

To represent any code point in UTF-8, you need an array of code units, that is, a string. There is a similar prefix for string literals: u8"☺".
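To make the "array of code units" idea concrete, here is a minimal sketch of encoding one code point into its UTF-8 code units at compile time. The type Utf8Units and the function encode_utf8 are hypothetical names introduced for illustration, not standard library facilities, and returning a size of 0 for invalid input is an assumption of this sketch.

```cpp
#include <cstddef>

// Hypothetical helper type: up to four UTF-8 code units and how many are used.
struct Utf8Units {
    unsigned char units[4];
    std::size_t size;
};

// Encodes a single code point as UTF-8.
// Surrogates (U+D800..U+DFFF) and values above U+10FFFF yield size 0.
constexpr Utf8Units encode_utf8(char32_t cp) {
    if (cp < 0x80)  // Basic Latin and C0 Controls: one code unit
        return {{static_cast<unsigned char>(cp), 0, 0, 0}, 1};
    if (cp < 0x800)
        return {{static_cast<unsigned char>(0xC0 | (cp >> 6)),
                 static_cast<unsigned char>(0x80 | (cp & 0x3F)), 0, 0}, 2};
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return {{0, 0, 0, 0}, 0};
    if (cp < 0x10000)
        return {{static_cast<unsigned char>(0xE0 | (cp >> 12)),
                 static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),
                 static_cast<unsigned char>(0x80 | (cp & 0x3F)), 0}, 3};
    if (cp < 0x110000)
        return {{static_cast<unsigned char>(0xF0 | (cp >> 18)),
                 static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)),
                 static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),
                 static_cast<unsigned char>(0x80 | (cp & 0x3F))}, 4};
    return {{0, 0, 0, 0}, 0};
}
```

For example, encode_utf8(U'a') uses one code unit, while encode_utf8(U'\u263A') (☺) uses three, which is why only the former fits in a u8 character literal.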

+1

