Why does a String created from bytes with UTF-8 contain more bytes?

byte[] bytes = new byte[16];
random.nextBytes(bytes);
try {
    return new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
    log.warn("Hash generation failed", e);
}

      

When I create a String with the method above and then call string.getBytes().length, I get a different value each time; the maximum I have seen was 32. Why does a 16-byte array end up producing a byte array of a different size?

However, string.length() returns 16.
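
Here is a minimal, self-contained reproducer sketch for the numbers above (it uses StandardCharsets.UTF_8 instead of the charset name "UTF-8", so no checked exception needs to be handled; the class name is arbitrary):

import java.nio.charset.StandardCharsets;
import java.util.Random;

public class RandomBytesToString {
    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        new Random().nextBytes(bytes);

        String s = new String(bytes, StandardCharsets.UTF_8);
        // usually 16: most bytes decode to exactly one char (ASCII or the replacement character)
        System.out.println("length()          = " + s.length());
        // often larger than 16: every replacement character re-encodes as 3 bytes
        System.out.println("getBytes().length = " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}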

+3




6 answers


This is because your bytes are first decoded into a Unicode string: the constructor tries to interpret them as a UTF-8 byte sequence. If a byte is not a valid ASCII character and does not combine with the following byte(s) into a legal UTF-8 sequence, it is replaced with the replacement character '�' (U+FFFD). Such a char is encoded as 3 bytes when String#getBytes() is called, thus adding 2 extra bytes to the result.
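
A quick way to confirm that size (a small sketch; StandardCharsets is used only to avoid the checked exception):

import java.nio.charset.StandardCharsets;

// U+FFFD, the replacement character, encodes to the 3 bytes EF BF BD in UTF-8
System.out.println("\uFFFD".getBytes(StandardCharsets.UTF_8).length); // prints 3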

If you are lucky enough to generate only ASCII characters, String#getBytes() will return a 16-byte array; if not, the resulting array may be longer. For example, the following piece of code:



byte[] b = new byte[16];
Arrays.fill(b, (byte) 190);                    // 0xBE is never valid on its own in UTF-8
b = new String(b, "UTF-8").getBytes("UTF-8");  // every byte decodes to U+FFFD, which re-encodes as 3 bytes

produces an array that is 48 (!) bytes long.

+4




The generated bytes can happen to form valid multi-byte characters.

Take this as an example: the string contains only one character, but two bytes are required for its byte representation.

String s = "Ω";
System.out.println("length = " + s.length());
System.out.println("bytes = " + Arrays.toString(s.getBytes("UTF-8")));

      

String.length() returns the length of the string in chars. 'Ω' is a single char, whereas it is two bytes long in UTF-8.



If you change your code like this

Random random = new Random();
byte bytes[] = new byte[16];
random.nextBytes(bytes);
System.out.println("string = " + new String(bytes, "UTF-8").length());
System.out.println("string = " + new String(bytes, "ISO-8859-1").length());

      

The same bytes are interpreted with a different encoding. And as the javadoc of String(byte[] bytes, String charsetName) puts it:

The length of the new String is a function of the charset, and hence may
not be equal to the length of the byte array.
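
A concrete illustration of that javadoc note (a small sketch; the two bytes are the UTF-8 encoding of 'Ω' from the example above):

import java.nio.charset.StandardCharsets;

byte[] two = { (byte) 0xCE, (byte) 0xA9 };  // UTF-8 encoding of 'Ω'

System.out.println(new String(two, StandardCharsets.UTF_8).length());      // 1  ("Ω")
System.out.println(new String(two, StandardCharsets.ISO_8859_1).length()); // 2  ("Î©")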

      

+3




The classic mistake of not understanding the relationship between byte and char, so here we go again.

There is no 1-to-1 mapping between byte and char; it all depends on the character encoding you are using (in Java, that is a Charset).

Worse: given a sequence of bytes, it may or may not be decodable into a sequence of chars.

Try this for example:

import java.nio.ByteBuffer;
import java.nio.charset.*;
import java.util.Random;

final byte[] buf = new byte[16];
new Random().nextBytes(buf);

final Charset utf8 = StandardCharsets.UTF_8;
final CharsetDecoder decoder = utf8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

// decode() throws CharacterCodingException (a checked exception) on malformed input
decoder.decode(ByteBuffer.wrap(buf));

      

It is very likely that this will throw a MalformedInputException.
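
If you want to handle that case explicitly instead of letting it propagate, something along these lines works (continuing from the snippet above, plus an import of java.nio.CharBuffer; MalformedInputException is a subclass of the checked CharacterCodingException):

try {
    CharBuffer chars = decoder.decode(ByteBuffer.wrap(buf));
    System.out.println("decoded: " + chars);
} catch (MalformedInputException e) {
    System.out.println("not valid UTF-8, malformed sequence of length " + e.getInputLength());
} catch (CharacterCodingException e) {
    System.out.println("decoding failed: " + e);
}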

I know this is not really an answer, but then you didn't clearly explain your problem; and the example above shows the misunderstanding between what a byte is and what a char is.

+3




If you look at the string you are producing, most of the random bytes you generate do not form valid UTF-8 sequences. Therefore, the String constructor replaces them with the Unicode REPLACEMENT CHARACTER (U+FFFD), which takes 3 bytes in UTF-8.

As an example:

public static void main(String[] args) throws UnsupportedEncodingException
{
    Random random = new Random();

    byte bytes[] = new byte[16];
    random.nextBytes(bytes);
    printBytes(bytes);

    final String s = new String(bytes, "UTF-8");
    System.out.println(s);
    printCharacters(s);
}

private static void printBytes(byte[] bytes)
{
    for (byte aByte : bytes)
    {
        System.out.print(
                Integer.toHexString(Byte.toUnsignedInt(aByte)) + " ");
    }
    System.out.println();
}

private static void printCharacters(String s)
{
    s.codePoints().forEach(i -> System.out.println(Character.getName(i)));
}

      

On this run I got this output:

30 41 9b ff 32 f5 38 ec ef 16 23 4a 54 26 cd 8c 
0A 2 8 # JT & ͌
DIGIT ZERO
LATIN CAPITAL LETTER A
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
DIGIT TWO
REPLACEMENT CHARACTER
DIGIT EIGHT
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
SYNCHRONOUS IDLE
NUMBER SIGN
LATIN CAPITAL LETTER J
LATIN CAPITAL LETTER T
AMPERSAND
COMBINING ALMOST EQUAL TO ABOVE
+1




String.getBytes().length is likely to be larger, because it counts the bytes needed to represent the string in the chosen encoding, whereas length() counts 2-byte (UTF-16) code units.

more details here
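
To make the distinction concrete, here is a small sketch with a character outside the Basic Multilingual Plane (the emoji is just an example character):

import java.nio.charset.StandardCharsets;

String s = "\uD83D\uDE00";  // U+1F600 GRINNING FACE, written as a surrogate pair

System.out.println(s.length());                                // 2  (UTF-16 code units)
System.out.println(s.codePointCount(0, s.length()));           // 1  (one character)
System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4  (UTF-8 bytes)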

0




This will try to create a string, assuming the bytes are valid UTF-8:

new String(bytes, "UTF-8");

      

This would be terribly wrong in general, since the multi-byte UTF-8 sequences formed by arbitrary bytes might not be valid, as in:

String s = new String(new byte[] { -128 }, StandardCharsets.UTF_8);
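
Here the single byte 0x80 is a stray continuation byte, so the constructor (which always replaces malformed input) substitutes U+FFFD for it; a quick check:

System.out.println(s.length());          // 1
System.out.println(s.equals("\uFFFD"));  // true: the original byte is gone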

      

Second step:

byte[] bytes = s.getBytes();

      

will use the platform default encoding (System.getProperty("file.encoding")). It is better to specify the charset explicitly:

byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

      

It should be understood that internally a String holds Unicode text as an array of 16-bit chars, i.e. in UTF-16.

You should refrain entirely from using String to carry byte[] data. It always involves a conversion, doubles the memory cost, and is error prone.
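
To see the "error prone" part concretely: for arbitrary binary data, the detour through String is lossy, as this sketch suggests (StandardCharsets is used only to avoid checked exceptions):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Random;

byte[] original = new byte[16];
new Random().nextBytes(original);

// round trip: decode the bytes as UTF-8, then encode the resulting String back to UTF-8
byte[] roundTripped = new String(original, StandardCharsets.UTF_8)
        .getBytes(StandardCharsets.UTF_8);

// malformed sequences were replaced by U+FFFD on the way in, so the arrays rarely match
System.out.println(Arrays.equals(original, roundTripped));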

0








