How do I convert an ANSI project to Unicode?
I have an ANSI C++ project developed with Visual Age. I want to convert it to Unicode so that it can display multilingual characters correctly on an English operating system. After defining the UNICODE macro, I changed all char variables to wchar_t.
Is this the right approach? The source code calls some APIs that accept only char* strings (e.g., system(), fopen(), mkdir()). How can I make them work with wchar_t strings, now that all the strings in the code have been changed to wchar_t?
There are several ways to represent character strings in Unicode; the most common are:
- encoded as UTF-8, stored in char strings
- encoded as UTF-16, stored in strings of 16-bit integers
- encoded as UTF-32, stored in strings of 32-bit integers
For UTF-16 and UTF-32, you need to know the byte order of your system and decide whether to exchange your strings in big-endian or little-endian order.
There is an older encoding named UCS-2, which can only represent Unicode characters below 0x10000. You shouldn't use it, since not all Chinese characters can be represented in it.
Another thing to keep in mind is that wchar_t is 2 bytes wide on some systems and 4 bytes wide on others, so depending on the system it may hold UTF-16 or UTF-32 characters.
One more thing to be aware of: most string length functions return the number of bytes or code units, not the number of Unicode characters that will be displayed.
I personally prefer to store everything, both internally and externally, as UTF-8, and convert to a 16- or 32-bit encoding only when necessary. That way you avoid endianness problems.
Chances are, if you make sure everything is encoded in UTF-8, most things will just work.
It's hard to tell without knowing what you are doing with the text and where it comes from. If all you are doing is reading it from a file and displaying it, then simply changing char to wchar_t may be enough. (But in that case, you might want to consider sticking with char and using UTF-8.) Once you start doing more, the problems get more complex:
- As you noticed, things like filenames will usually be char*. Using UTF-8 works around this problem, sort of, but which byte sequences are or are not legal in a filename is still an open question and is highly system dependent.
- Parsing can get more complicated depending on what you are trying to do. You may have to give up the simple functions in <ctype.h>; C++ has functions in <locale> that you can use with wchar_t, but they are much less easy to use. And while isspace or finding a particular delimiter works more or less as advertised, things like toupper become extremely problematic (since there is no one-to-one relationship between upper and lower case).
- When reading and writing files in UTF-16 or UTF-32, endianness becomes a problem. Regardless of the type and encoding used internally, I also stick to char and UTF-8 whenever I import or export data.
In general, I would tend to stick with char and UTF-8 as well, unless I were doing significant parsing or text manipulation, in which case I would look into the ICU library, which has full UTF-16 support. And if I were not 100% sure that I would only ever have to support one platform, I would avoid wchar_t, which has no standard size or encoding; ICU, for example, puts UTF-16 code units in unsigned short. (The same could be said of char, but machines where char is not 8 bits are extremely rare, and for internationalization, about the only encoding you are likely to encounter is UTF-8.)