Java String UTF-8 encodes 0xFF as 0xC3BF

Question

I have a weird problem writing certain bytes to a file with an OutputStream.

The problem appears to be caused by "encoding" the data.

If I write the byte directly to the output stream:

saveFile.write(new byte[]{(byte)0xFF});

It works correctly and I can see 0xFF in my hex editor.

But when I try to do the same thing with a String, it doesn't work. Example:

scriptData = "some script data thats all text and stuff" + ((char)0xFF) + ((char)0x3B);
saveFile.write(scriptData.getBytes(Charset.forName("UTF-8")));

In my hex editor, I see the text, then 0xC3BF, and then 0x3B. Why is 0x3B written to the file correctly, while 0xFF changes to 0xC3BF?

There was another thread I saw about this, but it involved PrintStream, which I am not using AFAIK:

Problem writing 0xFF to file

Thanks.

java string utf-8 byte

new Objekt 22 Aug 14 at 21:25

1 answer

TJ Crowder · Accepted Answer · 2014-08-22T21:28:05+0000

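You are explicitly asking for the UTF-8 encoding of the character U+00FF. In UTF-8, that character is expressed as two bytes: 0xC3 and 0xBF. The character 0x3B is in the ASCII range (below 0x80), which UTF-8 encodes as a single byte with the same value, so it appears unchanged. If you don't want the data to be UTF-8 encoded, don't call getBytes with the UTF-8 charset.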
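As a rough illustration of the difference, the sketch below encodes the same two characters with UTF-8 and with ISO-8859-1. The class and method names (EncodingDemo, utf8Bytes, latin1Bytes, hex) are made up for this example. It assumes every character in the string is in the range U+0000 to U+00FF; ISO-8859-1 cannot represent anything above that and substitutes '?' instead.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // UTF-8: U+00FF is encoded as the two bytes 0xC3 0xBF.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // ISO-8859-1: each char in U+0000..U+00FF becomes one byte of the same value.
    static byte[] latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "" + (char) 0xFF + (char) 0x3B;
        System.out.println(hex(utf8Bytes(s)));   // C3 BF 3B
        System.out.println(hex(latin1Bytes(s))); // FF 3B
    }
}
```

If the data is really binary rather than text, it is probably simpler to keep it in a byte[] (or a ByteArrayOutputStream) from the start and not route it through a String at all.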

Remember that UTF-8 is not a single-byte encoding. UTF-8 (like all Unicode transformation formats) must be able to represent every Unicode character. This means that some characters take one byte in UTF-8, others take two bytes, others three, and others four.
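To make the variable width concrete, here is a small sketch that counts the UTF-8 bytes for one sample character of each length. The class name Utf8LengthDemo and the helper utf8Length are invented for the example, and the sample characters are arbitrary picks from each range.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Returns how many bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+003B (semicolon, ASCII range): 1 byte
        System.out.println(utf8Length(";"));
        // U+00FF (y with diaeresis): 2 bytes
        System.out.println(utf8Length("\u00FF"));
        // U+20AC (euro sign): 3 bytes
        System.out.println(utf8Length("\u20AC"));
        // U+1F600 (outside the Basic Multilingual Plane): 4 bytes
        System.out.println(utf8Length(new String(Character.toChars(0x1F600))));
    }
}
```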
