Reading UTF-8 files on Windows and Linux via C++

I have text files that are encoded in UTF-8. Is there a way to read them using the C++ stream classes (e.g. wifstream)?

I have seen some external libraries like Boost and some code snippets, but I don't want to pull in a dependency just for this.

On Linux this works with a call to imbue(std::locale("en_US")), but not on Windows. I think the problem is that Windows assumes the wifstream is a UTF-16 encoded stream. Can I specify the encoding for wifstream so that it treats the file as UTF-8 rather than UTF-16?
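For reference, this is roughly what I am doing on Linux (the file name and the exact locale name are just placeholders; on my system the locale may need the ".UTF-8" suffix):

#include <fstream>
#include <locale>
#include <string>

int main() {
    std::wifstream in("utf8.txt");          // placeholder file name
    in.imbue(std::locale("en_US.UTF-8"));   // on Linux this decodes UTF-8 bytes into wchar_t
    std::wstring line;
    while (std::getline(in, line)) {
        // use the decoded wide characters
    }
    return 0;
}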

+3




2 answers


In addition to simply reading the bytes from the file and treating them as UTF-8 (i.e. not passing them to anything that expects locale-encoded strings, only to things that expect UTF-8), Windows has another way of reading UTF-8.

You can set a UTF-8 mode on a file descriptor and then use wide-character input and output on that descriptor, and the Microsoft C runtime will handle converting the wide characters to and from UTF-8 encoded byte streams:

#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void) {
  // Put stdout into UTF-8 mode; wide-character output is converted to UTF-8 bytes.
  _setmode(_fileno(stdout), _O_U8TEXT);
  wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
  return 0;
}




If you run the above program with the output redirected to a file, you get a UTF-8 encoded file.

Setting one of these Unicode modes on a file descriptor has the additional effect, for console output, that wide-character output actually works on the console. I'm not sure why Microsoft chose "broken" as the default, but at least there is a way to enable a "non-broken" mode.
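Since the question is about reading, here is a rough sketch (untested, assuming MSVC's CRT) of applying the same mode to an input descriptor; "input.txt" is just a placeholder file name:

#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
  FILE *f = _wfopen(L"input.txt", L"rt");   // placeholder file name
  if (!f) return 1;
  // In _O_U8TEXT mode the CRT decodes the UTF-8 bytes into wide characters.
  _setmode(_fileno(f), _O_U8TEXT);
  wchar_t line[256];
  while (fgetws(line, 256, f)) {
    // 'line' now holds the decoded wide characters; process as needed
  }
  fclose(f);
  return 0;
}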

+2




You can read UTF-8 files fine on Windows; the only problem comes when you want to do something with the contents.



Almost all Windows API calls use UTF-16 or MBCS, so you will need to convert from UTF-8 whenever you pass strings to the Windows API - see Converting C Strings from Local Encoding to UTF8
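For example, a rough sketch (untested) of converting a UTF-8 string to UTF-16 with MultiByteToWideChar before handing it to a W-suffixed API; the helper name is just an example:

#include <windows.h>
#include <string>

std::wstring Utf8ToUtf16(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    // First call asks for the required length in wide characters.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;   // can now be passed to ...W() Windows API functions
}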

0



