Wrong output for UTF-8 conversion with iconv

I am trying to convert an ISO-8859-1 encoded string to UTF-8 on Linux. I am using the iconv function to do this in C++. This is the code I have:

//Conversion from ISO-8859-1 to UTF-8
#include <iconv.h>
#include <cstdlib>
#include <iostream>
using std::cout;

iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");

char input[] = "\x80"; // "€" -- I expect byte value 128 to be the euro sign in ISO-8859-1
char *inputbuf = input;
size_t inputSize = 1;

size_t outputAlloc = inputSize * 4; // maximum size of a character in UTF-8 is 4 bytes
char *output = (char*)calloc(outputAlloc, 1);
char *outputbuf = output;
size_t outputSize = outputAlloc;

//Conversion function (it advances the buffer pointers and decrements the sizes)
iconv(cd, &inputbuf, &inputSize, &outputbuf, &outputSize);

//Display input bytes (ISO-8859-1); cast through unsigned char so 128 prints as 128
cout << "input bytes(ISO-8859-1): ";
for (size_t i = 0; i < 1; i++)
{
    cout << (int)(unsigned char)input[i] << ", ";
}
cout << std::endl;

//Display converted bytes (UTF-8)
cout << "output bytes(UTF-8): ";
for (size_t i = 0; i < outputAlloc; i++) //displaying all the 4 bytes allocated
{
    cout << (int)(unsigned char)output[i] << ", ";
}
cout << std::endl;

free(output);
iconv_close(cd);


This is the result that I am seeing:

input bytes(ISO-8859-1): 128
output bytes(UTF-8): 194, 128, 0, 0


As you can see, the converted UTF-8 output is 194, 128. However, the expected UTF-8 output for "€" is 226, 130, 172. I have verified that none of the iconv calls returned an error.

Can anyone help me figure out what is going wrong here?
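For reference, the kind of error check described above can be sketched like this; the helper name `convert_checked` is my own, but the return/errno convention is how iconv actually reports failure:

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <string>

// Sketch: iconv() signals failure by returning (size_t)-1 and setting errno
// (EILSEQ: invalid input sequence, E2BIG: output buffer full, EINVAL:
// truncated multibyte sequence at end of input).
bool convert_checked(iconv_t cd, char **in, size_t *inleft,
                     char **out, size_t *outleft, std::string &err) {
    if (iconv(cd, in, inleft, out, outleft) == (size_t)-1) {
        err = std::strerror(errno);
        return false;
    }
    return true;
}
```

Since the conversion in the question succeeds without error, the problem must be in the input byte itself, not in the iconv calls.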


2 answers


You can use the utfcpp library (http://utfcpp.sourceforge.net/) or Boost.Locale for this purpose.




This is not an iconv bug: 0xc2 0x80 is the valid UTF-8 sequence for code point U+0080, the <control> character that byte 0x80 denotes in ISO-8859-1.

That byte is often mistaken for the EURO SIGN glyph, code point U+20AC, which is encoded as 0xe2 0x82 0xac in UTF-8. The euro sign does not exist in ISO-8859-1 at all; byte 0x80 is the euro sign in Windows-1252, so that is the source encoding you actually want.
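To demonstrate, here is a sketch (the helper name `to_utf8` is my own) that converts a byte string from a named charset to UTF-8; passing "WINDOWS-1252" for byte 0x80 yields the bytes the asker expected, while "ISO-8859-1" yields 0xc2 0x80:

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert `in` from charset `from` to UTF-8 using iconv.
std::string to_utf8(const std::string &in, const char *from) {
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");
    std::string out(in.size() * 4, '\0'); // 4 bytes per input byte is enough
    char *inbuf = const_cast<char *>(in.data());
    size_t inleft = in.size();
    char *outbuf = &out[0];
    size_t outleft = out.size();
    if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("iconv failed");
    }
    iconv_close(cd);
    out.resize(out.size() - outleft); // trim the unused tail of the buffer
    return out;
}
```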

