Wrong output for UTF8 conversion with iconv
I am trying to convert an ISO-8859-1 encoded string to UTF-8 on Linux. I am using the iconv function to do this in C++. This is the code I have:
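#include <cstdlib>
#include <iostream>
#include <iconv.h>
using std::cout;

int main()
{
    //Conversion from ISO-8859-1 to UTF-8
    iconv_t cd = iconv_open("UTF-8","ISO-8859-1");
    char input[] = "\x80"; // meant to be "€"; the byte value is 128 in ISO-8859-1
    char *inputbuf = input;
    const size_t inputLen = 1;
    size_t inputSize = inputLen; // iconv decrements this as it consumes input
    const size_t outputLen = inputLen*4; // maximum size of a character in UTF8 is 4
    char *output = (char*)calloc(outputLen, 1); // zero-filled so unused bytes print as 0
    char *outputbuf = output;
    size_t outputSize = outputLen; // iconv decrements this as it writes output
    //Conversion Function
    iconv (cd, &inputbuf, &inputSize, &outputbuf, &outputSize);
    //Display input bytes(ISO-8859-1)
    cout << "input bytes(ISO-8859-1):";
    for (size_t i=0; i<inputLen; i++)
    {
        cout << (int)(unsigned char) *(input+i) << ", "; // unsigned so 128 does not print as -128
    }
    cout << std::endl;
    //Display Converted bytes(UTF-8)
    cout << "output bytes(UTF-8):";
    for (size_t i=0; i<outputLen; i++) //displaying all the 4 bytes allocated
    {
        cout << (int)(unsigned char) *(output+i) << ", ";
    }
    cout << std::endl;
    iconv_close(cd); // release the conversion descriptor
    free(output);
    return 0;
}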
This is the result that I am seeing:
input bytes(ISO-8859-1): 128
output bytes(UTF-8): 194, 128, 0, 0
As you can see, the converted UTF-8 output is the bytes 194, 128. However, the expected UTF-8 output is 226, 130, 172. I have verified that none of the iconv calls returned an error.
Can anyone help me figure out whether I am missing something here?
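This is not an iconv bug: 0xc2 0x80 is the valid UTF-8 sequence for code point U+0080, a <control> character, which is what byte 0x80 means in ISO-8859-1.
That byte is often mistaken for the EURO SIGN, code point U+20AC, encoded as 0xe2 0x82 0xac in UTF-8. The euro sign sits at 0x80 in Windows-1252 (CP1252), not in ISO-8859-1, so if the input is really Windows-1252 data, that is the source encoding to pass to iconv_open.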
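To make the difference concrete, here is a minimal sketch that converts the same input byte with two different source encodings. The helper name to_utf8 is hypothetical (it is not part of iconv), and the sketch assumes a glibc-style iconv on Linux that accepts "CP1252" as an encoding name and takes char** for the input buffer.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert `in` from the encoding named `from` to UTF-8.
// Assumes the Linux/glibc iconv prototype (char** input buffer).
std::string to_utf8(const char* from, const std::string& in) {
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed");
    }
    // Generous bound for single-byte source encodings: 4 output bytes per input byte.
    std::string out(in.size() * 4, '\0');
    char* src = const_cast<char*>(in.data());
    size_t srcLeft = in.size();
    char* dst = &out[0];
    size_t dstLeft = out.size();
    size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        throw std::runtime_error("iconv failed");
    }
    out.resize(out.size() - dstLeft);  // keep only the bytes actually written
    return out;
}
```

Under those assumptions, to_utf8("ISO-8859-1", "\x80") should give the bytes 0xc2 0x80 (U+0080), while to_utf8("CP1252", "\x80") should give 0xe2 0x82 0xac (U+20AC, the euro sign).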