Request to read bytes from "UTF-8" world to Java "char"

Question

Given the code snippet below, taken from this link:

byte[] bytes = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
                      0x60, (byte)0xA8, 0x59, 0x7D, 0x00, 0x21};  // "Hi,您好!"

Charset charset = Charset.forName("UTF-8");
// Encode from UCS-2 to UTF-8
// Create a ByteBuffer by wrapping a byte array
ByteBuffer bb = ByteBuffer.wrap(bytes);
// Create a CharBuffer from a view of this ByteBuffer
CharBuffer cb = bb.asCharBuffer();

The documentation for the wrap() method says: "The new buffer will be backed by the given byte array." So there is no conversion from bytes to any other format here; the call just puts the byte array into a buffer.

Could you please help me understand what exactly bb.asCharBuffer() does in the above code? cb looks like an array of characters. Since char is UTF-16 in Java, does asCharBuffer() treat every 2 bytes in bb as a char? Is this the correct approach? If not, please help me with the correct one.

Edit: I tried this program, as PJMeisch recommended below:

byte[] bytes = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
                0x60, (byte)0xA8, 0x59, 0x7D, 0x00, 0x21};  // "Hi,您好!"

Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer bb = ByteBuffer.wrap(bytes);
CharBuffer cb = decoder.decode(bb);

which gives an exception

Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(Unknown Source)
    at java.nio.charset.CharsetDecoder.decode(Unknown Source)
    at TestCharSet.main(TestCharSet.java:16)

Please help me, I'm stuck here !!!

Note: I am using java 1.6

java character-encoding nio bytebuffer

overexchange Dec 29. 15 at 17:04

2 answers

VGR · Answer 1 · 2014-12-29T19:34:36+0000

You ask, "Since char is UTF-16 in Java, does asCharBuffer() treat every 2 bytes in bb as a char?"

The answer to this question is yes. Your understanding is correct.
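To make that concrete, here is a minimal sketch of what the view does with the array from the question. The class and method names (CharViewDemo, viewAsChars) are made up for illustration; only ByteBuffer.wrap() and asCharBuffer() come from the original code. No Charset is involved: each pair of bytes is simply read as one 16-bit char.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

class CharViewDemo {
    // Hypothetical helper: reinterprets each pair of bytes as one char.
    // No charset decoding happens here, only a view over the same bytes.
    static String viewAsChars(byte[] bytes) {
        CharBuffer cb = ByteBuffer.wrap(bytes).asCharBuffer();
        return cb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
                        0x60, (byte) 0xA8, 0x59, 0x7D, 0x00, 0x21};
        String s = viewAsChars(bytes);
        System.out.println(s.length());                      // 12 bytes become 6 chars
        System.out.println(s.equals("Hi,\u60a8\u597d!"));    // true
    }
}
```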

Your next question is, "Is this correct?"

If you're just trying to demonstrate how the ByteBuffer, CharBuffer, and Charset classes work, that's fine.

However, when you write real application code, you would not do it this way. First, there is no need for a byte array; you can express the characters as a String literal:

String s = "Hi,\u60a8\u597d!";

If you want to convert your string to UTF-8 bytes, you can simply do this:

byte[] encodedBytes = s.getBytes(StandardCharsets.UTF_8);

If you are still using Java 6, you should do this instead:

byte[] encodedBytes = s.getBytes("UTF-8");

Update: Your byte array represents UTF-16BE (big-endian) encoded characters. In particular, your array has exactly two bytes per character. This is not a valid UTF-8 encoded sequence of bytes, which is why you get a MalformedInputException.
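As a rough check of that claim, the sketch below decodes the original array two ways. The class and method names are invented for this example; the library calls (Charset.forName, newDecoder, decode, and the String(byte[], Charset) constructor, available since Java 6) are standard. A strict UTF-8 decoder rejects the array, while decoding as UTF-16BE yields the expected text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

class Utf16BeDemo {
    // The array from the question: two bytes per character, big-endian.
    static final byte[] BYTES = {0x00, 0x48, 0x00, 0x69, 0x00, 0x2C,
                                 0x60, (byte) 0xA8, 0x59, 0x7D, 0x00, 0x21};

    // Hypothetical helper: true if a strict UTF-8 decoder accepts the bytes.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input by throwing.
            Charset.forName("UTF-8").newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical helper: decodes the bytes with the charset they were written in.
    static String decodeUtf16Be(byte[] bytes) {
        return new String(bytes, Charset.forName("UTF-16BE"));
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(BYTES));                                 // false
        System.out.println(decodeUtf16Be(BYTES).equals("Hi,\u60a8\u597d!"));    // true
    }
}
```

The UTF-8 decoder fails at the byte 0xA8, which is a continuation byte with no lead byte in front of it.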

When characters are encoded as UTF-8 bytes, each character will be represented by 1 to 4 bytes. For your second piece of code to work, the array must be:

byte[] bytes = {
    0x48, 0x69, 0x2c,                       // ASCII chars are 1 byte each
    (byte) 0xe6, (byte) 0x82, (byte) 0xa8,  // U+60A8
    (byte) 0xe5, (byte) 0xa5, (byte) 0xbd,  // U+597D
    0x21
};

When converting from bytes to characters, my previous statement still applies: you don't need a ByteBuffer or a CharBuffer or a Charset or a CharsetDecoder. You can use these classes, but it is usually easiest to create a string:

String s = new String(bytes, "UTF-8");

If you want a CharBuffer, just wrap the String:

CharBuffer cb = CharBuffer.wrap(s);
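Putting these pieces together, a small round-trip sketch might look like the following. The class and method names are hypothetical; Charset.forName("UTF-8") is used instead of StandardCharsets so that the code also compiles on Java 6.

```java
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.Arrays;

class Utf8RoundTripDemo {
    static final Charset UTF_8 = Charset.forName("UTF-8");

    // Hypothetical helpers wrapping the String-based conversions.
    static byte[] encode(String s) {
        return s.getBytes(UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, UTF_8);
    }

    public static void main(String[] args) {
        byte[] expected = {
            0x48, 0x69, 0x2c,
            (byte) 0xe6, (byte) 0x82, (byte) 0xa8,   // U+60A8
            (byte) 0xe5, (byte) 0xa5, (byte) 0xbd,   // U+597D
            0x21
        };
        byte[] actual = encode("Hi,\u60a8\u597d!");
        System.out.println(Arrays.equals(expected, actual));   // true

        // Decode back and wrap the result if a CharBuffer is needed.
        CharBuffer cb = CharBuffer.wrap(decode(actual));
        System.out.println(cb.length());                        // 6
    }
}
```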

You might be wondering when you need to use CharsetDecoder directly. You would do this if the bytes were coming from a source that is not under your control and you have good reason to believe that it may not contain properly encoded UTF-8 bytes. Using an explicit CharsetDecoder allows you to customize how invalid bytes are handled.
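For illustration, one way to customize that handling is to tell the decoder to substitute a replacement character rather than throw. This is only a sketch: the class and method names are invented, and the sample input is an assumption chosen to contain one stray byte. The onMalformedInput and onUnmappableCharacter methods and CodingErrorAction are part of java.nio.charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class LenientDecodeDemo {
    // Hypothetical helper: decodes UTF-8, replacing invalid bytes with U+FFFD
    // instead of throwing MalformedInputException.
    static String decodeReplacing(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but decode() declares the exception.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: "Hi", one stray continuation byte, then "!".
        byte[] bytes = {0x48, 0x69, (byte) 0xA8, 0x21};
        String s = decodeReplacing(bytes);
        System.out.println(s.equals("Hi\uFFFD!"));   // true
    }
}
```

CodingErrorAction.IGNORE would drop the bad byte instead, and CodingErrorAction.REPORT (the default) keeps the exception-throwing behavior.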

PJMeisch · Answer 2 · 2014-12-29T17:30:21+0000

I just looked at the sources; it boils down to two bytes from the buffer being combined into one char. The order in which the two bytes are used depends on the buffer's endianness, which is big-endian by default.
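The effect of the byte order can be seen with a short sketch like this one. The class and method names are hypothetical; ByteBuffer.order(ByteOrder) and the absolute CharBuffer.get(int) are standard java.nio calls.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderDemo {
    // Hypothetical helper: reads the first two bytes as a char in the given order.
    static char firstChar(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).asCharBuffer().get(0);
    }

    public static void main(String[] args) {
        byte[] pair = {0x00, 0x48};
        // Big-endian (the default): high byte first, so 0x0048, which is 'H'.
        System.out.println(firstChar(pair, ByteOrder.BIG_ENDIAN) == 'H');         // true
        // Little-endian: the same bytes read as 0x4800, a different character.
        System.out.println(firstChar(pair, ByteOrder.LITTLE_ENDIAN) == 0x4800);   // true
    }
}
```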

Another approach using the nio classes, different from what I wrote in the comments, would be to use the CharsetDecoder.decode() method:

Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer bb = ByteBuffer.wrap(bytes);
CharBuffer cb = decoder.decode(bb);
