Why does a String created from UTF-8 bytes contain more bytes
byte bytes[] = new byte[16];
random.nextBytes(bytes);
try {
return new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
log.warn("Hash generation failed", e);
}
When I create a String with the method above and then call string.getBytes().length, it returns a different value each time; the maximum I saw was 32. Why does a 16-byte array end up producing a byte array of a different size? Yet string.length() returns 16.
This is because your bytes are first converted to a Unicode string: the constructor tries to decode them as a UTF-8 byte sequence. If a byte is not a valid ASCII character and cannot be combined with the following byte(s) to form a legal UTF-8 sequence, it is replaced with the replacement character '�' (U+FFFD). That character is encoded as 3 bytes when String#getBytes() is called, thus adding 2 extra bytes to the result.
If you are lucky enough to generate only ASCII bytes, String#getBytes() will return a 16-byte array; if not, the resulting array may be longer. For example, the following piece of code:
byte[] b = new byte[16];
Arrays.fill(b, (byte) 190);
b = new String(b, "UTF-8").getBytes();
returns an array 48 (!) bytes long.
The generated bytes can also happen to contain valid multibyte characters.
Take this as an example: the string contains only one character, but its byte representation requires three bytes.
String s = "Ω";
System.out.println("length = " + s.length());
System.out.println("bytes = " + Arrays.toString(s.getBytes("UTF-8")));
String.length() returns the length of the string in characters. Ω is a single character, whereas in UTF-8 it is encoded as 3 bytes.
If you change your code like this
Random random = new Random();
byte bytes[] = new byte[16];
random.nextBytes(bytes);
System.out.println("string = " + new String(bytes, "UTF-8").length());
System.out.println("string = " + new String(bytes, "ISO-8859-1").length());
The same bytes are interpreted with a different encoding. And as the javadoc of String(byte[] bytes, String charsetName) states:
The length of the new String is a function of the charset, and hence may
not be equal to the length of the byte array.
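The javadoc's point can be seen with a fixed byte pair rather than random input. A minimal sketch, using the two-byte UTF-8 encoding of 'Ω' (U+03A9), chosen here just for illustration:

```java
import java.nio.charset.StandardCharsets;

public class CharsetLength {
    public static void main(String[] args) {
        // 0xCE 0xA9 is the UTF-8 encoding of 'Ω' (U+03A9).
        byte[] bytes = { (byte) 0xCE, (byte) 0xA9 };
        // ISO-8859-1 maps every byte to exactly one char, so length == byte count.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).length()); // 2
        // UTF-8 combines the two bytes into a single char.
        System.out.println(new String(bytes, StandardCharsets.UTF_8).length());      // 1
    }
}
```

ISO-8859-1 can never change the length, because it is a single-byte encoding; UTF-8 may fold several bytes into one char, so the string length depends on the charset.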
The classic mistake of not understanding the relationship between bytes and chars, so here we go again.
There is no 1-to-1 mapping between byte and char; it all depends on the character encoding you use (in Java, a Charset).
Worse: given a sequence of bytes, it may or may not be decodable into a sequence of chars.
Try this for example:
import java.nio.ByteBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Random;

final byte[] buf = new byte[16];
new Random().nextBytes(buf);
final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);
decoder.decode(ByteBuffer.wrap(buf)); // throws CharacterCodingException (checked) on bad input
It is very likely that this will throw a MalformedInputException.
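To make the failure deterministic rather than merely "very likely", you can feed the decoder a byte that is malformed in any position. A sketch; the choice of 0x80, a lone continuation byte, is an assumption made for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        try {
            // 0x80 is a UTF-8 continuation byte with no lead byte: always malformed.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(new byte[] { (byte) 0x80 }));
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            System.out.println("malformed: " + e);
        }
    }
}
```

With CodingErrorAction.REPORT the decoder refuses invalid input instead of silently substituting replacement characters, which is how the String constructor behaves by default.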
I know this is not really an answer, but then you didn't clearly explain your problem; and the example above shows that there is a misunderstanding between what a byte is and what a char is.
If you look at the string you are producing, most of the random bytes you generate do not form valid UTF-8 sequences. Therefore, the String constructor replaces them with the Unicode REPLACEMENT CHARACTER '�' (U+FFFD), which takes 3 bytes in UTF-8.
As an example:
import java.io.UnsupportedEncodingException;
import java.util.Random;

public static void main(String[] args) throws UnsupportedEncodingException
{
Random random = new Random();
byte bytes[] = new byte[16];
random.nextBytes(bytes);
printBytes(bytes);
final String s = new String(bytes, "UTF-8");
System.out.println(s);
printCharacters(s);
}
private static void printBytes(byte[] bytes)
{
for (byte aByte : bytes)
{
System.out.print(
Integer.toHexString(Byte.toUnsignedInt(aByte)) + " ");
}
System.out.println();
}
private static void printCharacters(String s)
{
s.codePoints().forEach(i -> System.out.println(Character.getName(i)));
}
On this run I got this output:
30 41 9b ff 32 f5 38 ec ef 16 23 4a 54 26 cd 8c
0A��2�8��#JT&͌
DIGIT ZERO
LATIN CAPITAL LETTER A
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
DIGIT TWO
REPLACEMENT CHARACTER
DIGIT EIGHT
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
SYNCHRONOUS IDLE
NUMBER SIGN
LATIN CAPITAL LETTER J
LATIN CAPITAL LETTER T
AMPERSAND
COMBINING ALMOST EQUAL TO ABOVE
This will try to create a string assuming the bytes are in UTF-8.
new String(bytes, "UTF-8");
This would be terribly wrong in general, since the bytes might not form valid multibyte UTF-8 sequences, as in:
String s = new String(new byte[] { -128 }, StandardCharsets.UTF_8);
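With the constructor's default error action (replace), that one invalid byte is silently substituted with U+FFFD instead of failing. A small sketch to confirm what the resulting string contains:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        // -128 == 0x80, a lone continuation byte: malformed in UTF-8.
        String s = new String(new byte[] { -128 }, StandardCharsets.UTF_8);
        System.out.println(s.length());                  // 1
        System.out.printf("U+%04X%n", s.codePointAt(0)); // U+FFFD
    }
}
```

So one input byte becomes one char in the string, but getBytes(StandardCharsets.UTF_8) on that string would return 3 bytes, the UTF-8 encoding of U+FFFD.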
Second step:
byte[] bytes = s.getBytes();
will use the platform default encoding (System.getProperty("file.encoding")). Better to specify it explicitly:
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
It should be understood that internally a String holds Unicode text as an array of 16-bit chars, in UTF-16.
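One way to see that UTF-16 representation: a code point outside the Basic Multilingual Plane occupies two chars (a surrogate pair), so length() and codePointCount() disagree. A sketch using U+1D11E (MUSICAL SYMBOL G CLEF), picked only as an example:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E lies outside the BMP, so UTF-16 stores it as a surrogate pair.
        String s = "\uD834\uDD1E";
        System.out.println(s.length());                      // 2 (chars)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
    }
}
```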
You should refrain entirely from using a String to carry a byte[]. It always involves a conversion, doubles the memory cost, and is error prone.
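If what you actually need is a printable token from random bytes, a lossless text encoding such as Base64 avoids the charset decoding step entirely. A sketch under that assumption; the token format here is just an illustration:

```java
import java.security.SecureRandom;
import java.util.Base64;

public class Token {
    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        new SecureRandom().nextBytes(bytes);
        // Base64 maps arbitrary bytes to printable ASCII, with a lossless round trip.
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        System.out.println(token);       // 22 URL-safe characters
        byte[] back = Base64.getUrlDecoder().decode(token);
        System.out.println(back.length); // 16: the original bytes are recovered exactly
    }
}
```

Unlike new String(bytes, "UTF-8"), this never loses information, so the original 16 bytes can always be recovered from the text.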