How to convert an ANSI project to Unicode?

I have an ANSI C++ project developed in Visual Age. I want to convert it to Unicode so that it can display multilingual characters correctly on an English operating system. After defining the UNICODE macro, I changed all char variables to wchar_t.

Is this the right thing to do? The source code contains some APIs that accept only char* strings (e.g., system(), fopen(), mkdir()). How can I get them to work with wchar_t strings, now that all the strings in the code have been changed to wchar_t?



2 answers


There are several ways to represent character strings in Unicode; the most common are:

  • encoded in UTF-8, stored in char strings
  • encoded in UTF-16, stored in strings of 16-bit integers
  • encoded in UTF-32, stored in strings of 32-bit integers.

For UTF-16 and UTF-32, you need to know the byte order of your system and decide whether you want to exchange your strings in big-endian or little-endian order.

There is an older encoding named UCS-2; with it you can only represent Unicode characters below 0x10000. You shouldn't use it: not all Chinese characters can be represented in it.
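To see why UCS-2 is not enough: code points at or above 0x10000 need two 16-bit units (a surrogate pair) in UTF-16, which UCS-2 cannot express. A minimal sketch of the encoding rule (the function name is illustrative, not a standard API):

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Code points below 0x10000 fit in a single unit (this is all that
// UCS-2 can represent); anything above needs a surrogate pair.
std::vector<uint16_t> to_utf16(uint32_t cp) {
    if (cp < 0x10000) {
        return { static_cast<uint16_t>(cp) };
    }
    cp -= 0x10000;                              // 20 bits remain
    uint16_t high = 0xD800 | (cp >> 10);        // high surrogate: top 10 bits
    uint16_t low  = 0xDC00 | (cp & 0x3FF);      // low surrogate: bottom 10 bits
    return { high, low };
}
```

For example, U+20000 (a character from CJK Extension B) encodes as the surrogate pair 0xD840 0xDC00, which has no UCS-2 representation.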



Another thing to keep in mind is that wchar_t is 2 bytes wide on some systems and 4 bytes on others, so depending on the system it may hold UTF-16 code units or UTF-32 code units.

One more thing to be aware of: most string length functions return the number of bytes or words counted, not the number of Unicode characters displayed.
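For example, strlen() on a UTF-8 string returns bytes; counting only the non-continuation bytes gives the number of code points instead. A sketch (assumes the input is valid UTF-8):

```cpp
#include <cstddef>
#include <cstring>

// Count Unicode code points in a UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx).
// strlen() on the same string returns the byte count instead.
std::size_t utf8_length(const char* s) {
    std::size_t count = 0;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) {  // not a continuation byte
            ++count;
        }
    }
    return count;
}
```

With the single character U+4E2D ("\xE4\xB8\xAD" in UTF-8), strlen() reports 3 while utf8_length() reports 1.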

I personally prefer to store everything, both internally and externally, in UTF-8, and convert to a 16- or 32-bit encoding only when necessary. That way you avoid endianness problems.

Chances are, if you make sure everything is encoded in UTF-8, most things will work.
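Converting from UTF-8 to a 32-bit encoding when some API needs it only takes a small decoder. A minimal sketch, assuming well-formed UTF-8 input (no error handling; the function name is illustrative):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode a well-formed UTF-8 string into UTF-32 code points.
// The leading byte tells us the sequence length; each continuation
// byte contributes 6 more bits. Malformed input is not detected.
std::vector<uint32_t> utf8_to_utf32(const std::string& s) {
    std::vector<uint32_t> out;
    for (std::size_t i = 0; i < s.size(); ) {
        unsigned char b = s[i];
        uint32_t cp;
        std::size_t len;
        if      (b < 0x80) { cp = b;        len = 1; }  // 0xxxxxxx
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 110xxxxx
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 1110xxxx
        else               { cp = b & 0x07; len = 4; }  // 11110xxx
        for (std::size_t j = 1; j < len; ++j) {
            cp = (cp << 6) | (s[i + j] & 0x3F);         // add 6 bits
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```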



It's hard to tell without knowing what you are doing with the text and where it comes from. If all you are doing is reading it from a file and displaying it, then simply changing char to wchar_t may be enough. (But in that case, you might want to consider sticking with char and using UTF-8.) Once you start doing more, the problems get more complex:

  • As you noticed, things like filenames will usually be char*. Using UTF-8 works around this problem, sort of, but which byte sequences are or are not legal in a filename is still an open question and is highly system dependent.

  • Parsing can get more complicated depending on what you are trying to do. You may have to give up the simple functions in <ctype.h>; C++ has functions in <locale> that you can use with wchar_t, but they are much less easy to use. And while isspace or finding a particular delimiter works more or less as advertised, things like toupper become extremely problematic (since there is no simple one-to-one relationship between upper and lower case).

  • When reading and writing files in UTF-16 or UTF-32, endianness becomes a problem. Regardless of the type and encoding used internally, I also stick to char and UTF-8 any time I import or export data.
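One common way to cope with the endianness problem in UTF-16 files is the byte-order mark (BOM, U+FEFF) at the start of the file. A sketch of detecting it (the names here are illustrative, not a standard API):

```cpp
#include <cstddef>

enum class ByteOrder { BigEndian, LittleEndian, Unknown };

// Inspect the first two bytes of a UTF-16 file for a byte-order mark
// (U+FEFF): 0xFE 0xFF means big-endian, 0xFF 0xFE little-endian.
ByteOrder detect_utf16_bom(const unsigned char* data, std::size_t size) {
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::BigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::LittleEndian;
    }
    return ByteOrder::Unknown;  // no BOM: caller must guess or be told
}
```

Without a BOM, the byte order has to come from out-of-band knowledge, which is one more reason to prefer UTF-8 for interchange.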



In general, I would tend to stick with char and UTF-8 as well, unless I were doing significant parsing or text manipulation, in which case I would look into the ICU library, which has full UTF-16 support. And if I were not 100% sure that I would only ever have to support one platform, I would avoid wchar_t, which has no standard size or encoding; ICU, for example, puts UTF-16 code units in unsigned short. (The same could be said for char, but machines where char is not 8 bits are extremely rare, and for internationalization about the only encoding you are likely to encounter is UTF-8.)
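Sticking with char and UTF-8 internally also answers the original question about narrow-string APIs: convert at the call site. A minimal sketch of a wide-to-UTF-8 helper (to_utf8 is an illustrative name, not a standard function); it assumes each wchar_t holds a full Unicode code point, as on platforms with a 32-bit wchar_t such as Linux. On Windows, where wchar_t is a UTF-16 code unit, surrogate pairs would first have to be combined:

```cpp
#include <string>

// Convert a wide string to UTF-8 so it can be passed to narrow-string
// APIs such as fopen(), mkdir() or system().
// Assumes each wchar_t holds a full Unicode code point.
std::string to_utf8(const std::wstring& ws) {
    std::string out;
    for (wchar_t wc : ws) {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80) {                      // 1 byte: ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {              // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Usage: FILE* f = fopen(to_utf8(wide_name).c_str(), "r");
```

Whether such a filename actually works then depends on the system, as noted above; on modern Unix-like systems, UTF-8 filenames are the norm.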







