Is a Java char array always a valid UTF-16 (Big Endian) encoding?
Let's say that I encode a Java character array (char[]) as bytes:
- using two bytes for each character
- using big-endian byte order (storing the most significant 8 bits in the leftmost byte and the least significant 8 bits in the rightmost byte)
Will this always produce a valid UTF-16BE encoding? If not, what code points would result in incorrect encoding?
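For concreteness, the encoding described above could be sketched as follows. The class and method names (ManualBigEndian, toBigEndianBytes) are hypothetical and only illustrate the two-bytes-per-char, high-byte-first layout; no validation is performed.

```java
public class ManualBigEndian {
    // Hypothetical helper: writes each char as two bytes, high byte first.
    // It copies the 16-bit values as-is and does not check surrogate pairing.
    static byte[] toBigEndianBytes(char[] chars) {
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) (chars[i] >>> 8); // most significant 8 bits
            out[2 * i + 1] = (byte) chars[i];     // least significant 8 bits
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = toBigEndianBytes(new char[] {'a', '\u20AC'});
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```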
This question is closely related to this question about the Java char type and this question about the internal representation of Java strings.
No. You can create char instances that contain any 16-bit value you want. Nothing restricts them to valid UTF-16 code units, and nothing restricts an array of them to a valid UTF-16 sequence. Even String doesn't require its data to be valid UTF-16:
char data[] = {'\uD800', 'b', 'c'}; // Unpaired lead surrogate
String str = new String(data);
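To see that the String really does keep the unpaired surrogate, you can inspect it and scan it for surrogates that lack a partner. The following is only a sketch; the class name and the isWellFormedUtf16 helper are made up for illustration, and the check assumes that properly paired surrogates are the only thing that can go wrong in a sequence of 16-bit code units.

```java
public class UnpairedSurrogateDemo {
    // Hypothetical helper: true if every surrogate in the array belongs to a
    // lead/trail pair, i.e. the array is a well-formed UTF-16 sequence.
    static boolean isWellFormedUtf16(char[] chars) {
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (Character.isHighSurrogate(c)) {
                // A lead surrogate must be followed directly by a trail surrogate.
                if (i + 1 >= chars.length || !Character.isLowSurrogate(chars[i + 1])) {
                    return false;
                }
                i++; // skip the trail surrogate that was just checked
            } else if (Character.isLowSurrogate(c)) {
                return false; // trail surrogate without a preceding lead
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] data = {'\uD800', 'b', 'c'}; // unpaired lead surrogate
        String str = new String(data);
        System.out.println(str.length());                 // all three chars are kept
        System.out.println(str.charAt(0) == '\uD800');    // the surrogate is stored unchanged
        System.out.println(isWellFormedUtf16(str.toCharArray()));
    }
}
```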
Requirements for valid UTF-16 data are outlined in Chapter 3 of the Unicode standard (basically, everything must encode a Unicode scalar value, and all surrogates must be properly paired). You can check whether a char array is a valid UTF-16 sequence, and turn it into a UTF-16BE (or LE) byte sequence, with CharsetEncoder:
CharsetEncoder encoder = Charset.forName("UTF-16BE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(data)); // throws MalformedInputException if data is not valid UTF-16
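A self-contained version of the snippet above might look like this. The class name and the encodeStrict helper are hypothetical; the sketch relies on a freshly created CharsetEncoder reporting malformed input by default rather than replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class Utf16Validation {
    // Hypothetical helper: returns the UTF-16BE bytes for chars, or throws a
    // CharacterCodingException if chars is not a well-formed UTF-16 sequence.
    static byte[] encodeStrict(char[] chars) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.UTF_16BE.newEncoder()
                .encode(CharBuffer.wrap(chars));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        char[] data = {'\uD800', 'b', 'c'}; // unpaired lead surrogate
        try {
            byte[] bytes = encodeStrict(data);
            System.out.println("valid UTF-16, " + bytes.length + " bytes");
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-16: " + e);
        }
    }
}
```

Note that String.getBytes does not behave this way: according to its documentation it replaces malformed input instead of throwing, so it cannot be used as a validity check.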
(And similarly with CharsetDecoder if you have bytes.)
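For the byte direction, a sketch along the same lines could be written as below. Again, the class name and the decodeStrict helper are invented for illustration, and the sketch assumes the default error action of a new CharsetDecoder, which is to report malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class Utf16ByteValidation {
    // Hypothetical helper: decodes bytes as UTF-16BE, or throws a
    // CharacterCodingException if they are not a valid UTF-16BE sequence.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_16BE.newDecoder()
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // 0xD800 (lead surrogate) followed by 'b' instead of a trail surrogate
        byte[] bad = {(byte) 0xD8, 0x00, 0x00, 0x62};
        try {
            System.out.println(decodeStrict(bad));
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-16BE: " + e);
        }
    }
}
```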