Index character in wchar_t array

The thread "Size of wchar_t * for surrogate pair" shows that the amount of memory needed to store a character as wchar_t values may vary, because some characters need more space to encode (a surrogate pair). This leads to my next question: how do I navigate through an array of wchar_t values? I can no longer just increment or decrement the current address by the fixed size of a wchar_t.

CORRECTION: By "How do I navigate an array of wchar_t values" I meant how to navigate between code points, each of which may be represented by a variable number of wchar_t values.

+3




4 answers


The size of wchar_t may differ between systems, but for a given platform it is fixed at compile time. You can get its size with the sizeof operator, and you can iterate over an array of wchar_t just like an array of any other type.
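As a minimal sketch of that point, the function below walks a null-terminated wchar_t array with ordinary pointer arithmetic. The name `count_units` is hypothetical, chosen only for illustration. It counts wchar_t elements (code units), not code points or visible characters.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: number of wchar_t elements before the terminator.
   Each ++p advances the address by exactly sizeof(wchar_t) bytes. */
size_t count_units(const wchar_t *s)
{
    assert(s != NULL);
    const wchar_t *p = s;
    while (*p != L'\0')
        ++p;
    return (size_t)(p - s);
}
```

The storage used by the string is then `(count_units(s) + 1) * sizeof(wchar_t)` bytes, including the terminator. The value of `sizeof(wchar_t)` itself depends on the platform.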



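wchar_t is a locale-specific type that is meant to be large enough to store any single character of the supported locales. The mapping between string code units and text characters is then one-to-one, so you can iterate over wide string characters the same way as over other types to read the next or previous character (unlike variable-width Unicode encodings). Note that this does not hold where wchar_t is 16 bits wide and holds UTF-16, as on Windows.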

However, this is the only bright side of wchar_t strings. Using them as a general way to store arbitrary text is not easy, so you should use Unicode. Related Q&A here.

0




Do not use wchar_t for Unicode string operations. Seriously, just don't. As you've noticed, there is no one-to-one correspondence between wchar_t objects and Unicode code points. Use a library like ICU to manipulate Unicode text.



+4




There are several problems here, and using a library like ICU will help you avoid a lot of problems. The problem with surrogate characters in UTF-16 is not the only problem if you are trying to count "characters".

If you just need to traverse the wchar_t string, surrogates can be identified unambiguously: a leading value (0xD800 to 0xDBFF) is followed by a trailing value (0xDC00 to 0xDFFF). You can use this knowledge to walk forward or backward through an array of "characters". This assumes that you have a valid sequence of values.
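The surrogate ranges above can be turned into a rough sketch for stepping between code points. The function names here are hypothetical, not from any library, and the code assumes the array holds UTF-16 code units. An unpaired surrogate is simply stepped over as a single unit.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helpers using the ranges quoted in the answer. */
static int is_lead(wchar_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
static int is_trail(wchar_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Index of the code point that follows the one starting at index i. */
size_t next_code_point(const wchar_t *s, size_t len, size_t i)
{
    assert(s != NULL && i < len);
    if (is_lead(s[i]) && i + 1 < len && is_trail(s[i + 1]))
        return i + 2;               /* surrogate pair: skip both units */
    return i + 1;                   /* single unit (or unpaired surrogate) */
}

/* Index where the code point ending just before index i starts (i > 0). */
size_t prev_code_point(const wchar_t *s, size_t i)
{
    assert(s != NULL && i > 0);
    if (i >= 2 && is_trail(s[i - 1]) && is_lead(s[i - 2]))
        return i - 2;               /* step back over a whole pair */
    return i - 1;
}

/* Number of code points in s[0..len). */
size_t count_code_points(const wchar_t *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i = next_code_point(s, len, i))
        ++n;
    return n;
}
```

For example, the four units 0x0041, 0xD83D, 0xDE00, 0x0042 contain three code points, and stepping forward from index 1 lands on index 3. On platforms where wchar_t is 32 bits wide, surrogates should not normally appear and every step is one element.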

Another problem is values in the string that are not characters in themselves. For example, U+0301 is the COMBINING ACUTE ACCENT, which adds an accent to the preceding character. This can be a problem whether you use UTF-8, UTF-16, or UTF-32.
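To illustrate why code points and visible characters can differ, here is a deliberately simplified sketch. It takes already-decoded code points and skips only the Combining Diacritical Marks block (U+0300 to U+036F), which contains U+0301. The function names are hypothetical, and this is not a full segmentation algorithm. Combining marks exist in other blocks too, which is one reason a library such as ICU is the safer choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: only U+0300..U+036F counts as combining. */
static int is_combining_diacritic(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Counts code points that are not in the combining block above. */
size_t count_base_characters(const uint32_t *cps, size_t len)
{
    assert(cps != NULL || len == 0);
    size_t n = 0;
    for (size_t i = 0; i < len; ++i)
        if (!is_combining_diacritic(cps[i]))
            ++n;
    return n;
}
```

With the sequence U+0065, U+0301, U+0078 ("e" plus accent, then "x") this counts two base characters out of three code points, regardless of which UTF encoding the text was stored in.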

+3




This answer clarifies the nature of wchar_t as a type. It looks like this was misunderstood before the "CORRECTION" was added to the question.

As with any specific type, sizeof(wchar_t) is constant for a specific system, as is sizeof(wchar_t *).

In language terms, you can navigate an array of wchar_t just like you can navigate an array of any other type.

However, dealing with text characters that are encoded with varying numbers of wchar_t values is a different and more complex issue. The other answers address this to some extent.

0

