Is a Java char array always a valid UTF-16 (big-endian) encoding?

Let's say that I encode a Java character array (char[]) as bytes:

  • using two bytes for each character
  • using big-endian byte order (storing the most significant 8 bits in the first byte and the least significant 8 bits in the second byte)

Will this always produce a valid UTF-16BE encoding? If not, what code points would result in incorrect encoding?
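
For concreteness, here is a minimal sketch of the conversion I mean (the helper name charsToBytesBE is made up for illustration):

// Writes each char as two bytes, most significant byte first (big-endian).
static byte[] charsToBytesBE(char[] chars) {
    byte[] out = new byte[chars.length * 2];
    for (int i = 0; i < chars.length; i++) {
        out[2 * i]     = (byte) (chars[i] >>> 8);  // most significant 8 bits
        out[2 * i + 1] = (byte) (chars[i] & 0xFF); // least significant 8 bits
    }
    return out;
}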


This question is closely related to this question about the Java char type and this question about the internal representation of Java strings.


1 answer


No. You can create char values containing any 16-bit value you want; nothing restricts them to valid UTF-16 code units, and nothing restricts an array of them to being a valid UTF-16 sequence. String doesn't even require its data to be valid UTF-16:

char data[] = {'\uD800', 'b', 'c'};  // Unpaired lead surrogate
String str = new String(data);
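
This compiles and runs without complaint. If you want to detect the problem yourself, a rough sketch (not part of the original answer, just the standard Character surrogate helpers) would be:

// Rough manual check: every high surrogate must be immediately followed by a
// low surrogate, and low surrogates must not appear on their own.
static boolean isWellFormedUtf16(char[] data) {
    for (int i = 0; i < data.length; i++) {
        if (Character.isHighSurrogate(data[i])) {
            if (i + 1 >= data.length || !Character.isLowSurrogate(data[i + 1])) {
                return false; // unpaired lead surrogate
            }
            i++; // skip the matching low surrogate
        } else if (Character.isLowSurrogate(data[i])) {
            return false; // trail surrogate without a preceding lead
        }
    }
    return true;
}

For the data array above it returns false because of the unpaired \uD800.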

      

Requirements for valid UTF-16 data are spelled out in Chapter 3 of the Unicode standard (basically, everything must be a Unicode scalar value, and all surrogates must be properly paired). You can check whether a char array is a valid UTF-16 sequence, and turn it into a UTF-16BE (or LE) byte sequence, with a CharsetEncoder:



CharsetEncoder encoder = Charset.forName("UTF-16BE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(data)); // throws MalformedInputException

      

(And similarly with a CharsetDecoder if you have bytes.)
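
Putting this together, here is a self-contained sketch (the class and method names Utf16Check and toUtf16BE are made up for illustration) that returns the UTF-16BE bytes when the char[] is valid UTF-16 and null when it is not:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Utf16Check {

    // Returns the UTF-16BE bytes, or null if the chars are not a valid UTF-16 sequence.
    static byte[] toUtf16BE(char[] data) {
        CharsetEncoder encoder = Charset.forName("UTF-16BE").newEncoder();
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(data));
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) { // MalformedInputException for unpaired surrogates
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toUtf16BE(new char[] {'a', 'b', 'c'}) != null);      // true
        System.out.println(toUtf16BE(new char[] {'\uD800', 'b', 'c'}) != null); // false
    }
}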
