C++ ctype facet for UTF-8 in MinGW

In my project, all internal strings are stored in UTF-8. The project has been ported to Linux and Windows, and it now needs a to_lower function.

On POSIX systems I can use std::ctype_byname<char>("ru_RU.UTF-8"). But with g++ (Debian 4.3.4-1), ctype<char>::tolower() does not recognize Russian UTF-8 characters (Latin text is lowercased correctly).

On Windows, the MinGW standard library throws a "std::runtime_error: locale::facet::_S_create_c_locale name not valid" exception when I try to construct std::ctype_byname<char> with the "ru_RU.UTF-8" argument.

How can I implement or find a std::ctype for UTF-8 on Windows? The project already depends on libiconv (the codecvt facet is based on it), but I don't see an obvious way to implement to_lower with it.



3 answers


If all you need is to_lower for Cyrillic characters, you can write the function yourself.

АБВГДЕЖ in UTF-8: D0 90 D0 91 D0 92 D0 93 D0 94 D0 95 D0 96 0A
абвгдеж in UTF-8: D0 B0 D0 B1 D0 B2 D0 B3 D0 B4 D0 B5 D0 B6 0A


But don't forget that UTF8 is a multibyte encoding.
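A hand-written version could look roughly like the sketch below. It works directly on the bytes and only handles ASCII and the basic Russian alphabet (А..Я plus Ё); the function name utf8_to_lower_ru is made up for this example, and the input is assumed to be well-formed UTF-8. Note that Р..Я change their lead byte from D0 to D1 when lowercased, so simply adding 0x20 to the second byte is not enough.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: lowercases ASCII letters and the basic Russian
// alphabet in a UTF-8 string. All other bytes are copied unchanged.
// Assumes well-formed UTF-8 input.
std::string utf8_to_lower_ru(const std::string& in)
{
    std::string out(in);
    for (std::size_t i = 0; i < out.size(); ++i) {
        unsigned char b0 = static_cast<unsigned char>(out[i]);
        if (b0 >= 'A' && b0 <= 'Z') {
            out[i] = static_cast<char>(b0 + ('a' - 'A'));
        } else if (b0 == 0xD0 && i + 1 < out.size()) {
            unsigned char b1 = static_cast<unsigned char>(out[i + 1]);
            if (b1 >= 0x90 && b1 <= 0x9F) {
                // А..П (D0 90..D0 9F) -> а..п (D0 B0..D0 BF)
                out[i + 1] = static_cast<char>(b1 + 0x20);
            } else if (b1 >= 0xA0 && b1 <= 0xAF) {
                // Р..Я (D0 A0..D0 AF) -> р..я (D1 80..D1 8F)
                out[i] = static_cast<char>(0xD1);
                out[i + 1] = static_cast<char>(b1 - 0x20);
            } else if (b1 == 0x81) {
                // Ё (D0 81) -> ё (D1 91)
                out[i] = static_cast<char>(0xD1);
                out[i + 1] = static_cast<char>(0x91);
            }
        }
    }
    return out;
}
```

Calling utf8_to_lower_ru on the first byte sequence above should give the second one. Anything outside ASCII and the Russian alphabet is left as it is.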

You can also try converting the string from UTF-8 to wchar_t (using libiconv) and then use a Windows-specific function to implement to_lower.
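The shape of that second approach is decode, lowercase each character, encode again. The sketch below is only an illustration under several assumptions: it needs C++11, the decoding is hand-rolled instead of going through libiconv, it uses char32_t rather than wchar_t, and lower_code_point is a small stand-in (ASCII and U+0400..U+042F only) for whatever platform function you would really call. All function names here are invented for the example.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into code points. Assumes well-formed input.
std::u32string decode_utf8(const std::string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // Sequence length is determined by the lead byte.
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encodes code points back into UTF-8.
std::string encode_utf8(const std::u32string& s)
{
    std::string out;
    for (char32_t cp : s) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Stand-in for a platform lowercase call (an assumption of this sketch).
char32_t lower_code_point(char32_t cp)
{
    if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
    if (cp >= 0x0410 && cp <= 0x042F) return cp + 0x20; // А..Я -> а..я
    if (cp >= 0x0400 && cp <= 0x040F) return cp + 0x50; // Ѐ..Џ -> ѐ..џ (includes Ё)
    return cp;
}

std::string utf8_to_lower(const std::string& s)
{
    std::u32string wide = decode_utf8(s);
    for (char32_t& cp : wide) cp = lower_code_point(cp);
    return encode_utf8(wide);
}
```

The advantage over patching bytes in place is that the lowercase step sees whole characters, so it can be swapped for a more complete table or a system call without touching the UTF-8 handling.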



Try using STLport



  Here is a description of how you can use STLport to read/write utf8 files.
utf8 is a way of encoding wide characters. As such, management of encoding in
the C++ Standard library is handled by the codecvt locale facet, which is part
of the ctype category. However, utf8 only describes how encoding must be
performed; it cannot be used to classify characters, so it is not enough info
to know how to generate the whole ctype category facets of a locale
instance.

In C++ it means that the following code will throw an exception to
signal that creation failed:

#include <locale>
// Will throw a std::runtime_error exception.
std::locale loc(".utf8");

For the same reason building a locale with the ctype facets based on
UTF8 is also wrong:

// Will throw a std::runtime_error exception:
std::locale loc(std::locale::classic(), ".utf8", std::locale::ctype);

The only solution to get a locale instance that will handle utf8 encoding
is to specifically signal that the codecvt facet should be based on utf8
encoding:

// Will succeed if there is necessary platform support.
std::locale loc(std::locale::classic(), new std::codecvt_byname<wchar_t, char, std::mbstate_t>(".utf8"));

  Once you have obtained a locale instance you can inject it into a file stream to
read/write utf8 files:

// A wide stream is used here: a char stream would not consult the wchar_t codecvt facet.
std::wfstream fstr("file.utf8");
fstr.imbue(loc);

You can also access the facet directly to perform utf8 encoding/decoding operations:

typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_t;
const codecvt_t& encoding = std::use_facet<codecvt_t>(loc);

Notes:

1. The dot ('.') is mandatory in front of utf8. This is a POSIX convention; locale
names have the following format:
language[_country[.encoding]]

Ex: 'fr_FR'
    'french'
    'ru_RU.koi8r'

2. utf8 encoding is only supported for the moment under Windows. The less common
utf7 encoding is also supported.


Some STL implementations (for example, STDCXX from Apache) ship with several locales of their own. In all other cases, which locales are available depends entirely on the system.

The fact that the name "ru_RU.UTF-8" works on one operating system does not mean that other systems use the same name for that locale. Debian and Windows may well name it differently, which is why you get a runtime exception.

You must install the locales you need on the system beforehand, or use an STL implementation that already ships that locale.
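Since the names differ between systems, one option is to probe a list of candidate names at startup and use the first one that can be constructed. This is only a sketch, written with C++11 syntax: first_available_locale is a name I made up, and which candidates to try (for instance "ru_RU.UTF-8" and "ru_RU.utf8") is something you have to decide for your own platforms.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the first name for which std::locale can be constructed,
// or an empty string if none of the candidates is available here.
std::string first_available_locale(const std::vector<std::string>& names)
{
    for (const std::string& name : names) {
        try {
            std::locale probe(name.c_str());
            return name;
        } catch (const std::runtime_error&) {
            // This name is not known to the platform or library; try the next one.
        }
    }
    return std::string();
}
```

If the result is empty, none of the candidates exists on that machine and you need a fallback, such as doing the case conversion yourself.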

My two cents...
