Negating a string causes unexpected behavior
I have been playing around with String and its constructor and noticed some behavior that I cannot explain.
I created the following method:
public static String negate(String s) {
    byte[] b = s.getBytes();
    for (int i = 0; i < b.length; i++) {
        b[i] = (byte)(~b[i] + 1);
    }
    System.out.println(Arrays.toString(b));
    return new String(b);
}
which just takes the two's complement of each byte and returns a new String built from the result. When I call it with
System.out.println(negate("Hello"));
I get the output
[-72, -101, -108, -108, -111]
which I think is fine as there are no negative ASCII values.
But when I nest the calls like this
System.out.println(negate(negate("Hello")));
the output is
[-72, -101, -108, -108, -111]
[17, 65, 67, 17, 65, 67, 17, 65, 67, 17, 65, 67, 17, 65, 67]
ACACACACAC // 5 groups of 3 characters (1 ctrl-char and "AC")
I expected the result to match my input string "Hello" exactly, but I got this instead. Why? This also happens with any other input string. Once the calls are nested, every single character of the input simply becomes a control character followed by AC.
I went ahead and created a method that does the same thing, but only with raw byte arrays:
public static byte[] n(byte[] b) {
    for (int i = 0; i < b.length; i++) {
        b[i] = (byte)(~b[i] + 1);
    }
    System.out.println(Arrays.toString(b));
    return b;
}
Here the output is as expected. For
System.out.println(new String(n(n("Hello".getBytes()))));
I get
[-72, -101, -108, -108, -111]
[72, 101, 108, 108, 111]
Hello
So my guess is that it has something to do with how the String is created, since this only happens when I call negate with an instance that was already built from negative bytes?
I even went through the class tree to look at the inner classes, but I couldn't find where this is coming from.
Also in the String docs there is the following paragraph which might be an explanation:
The behavior of this constructor when the given bytes are not valid in the default charset is unspecified.
Can anyone tell me why this is the case and what exactly is happening here?
The problem is that you are taking negated bytes and trying to interpret them as a valid stream of bytes in the default character set (remember, characters are not bytes). As the String constructor documentation you quoted says, the result in that case is unspecified; it likely involves error correction, dropping or replacing invalid values, and so on. Naturally, this is a lossy process, and reversing it will not bring you back to the original string.
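If you want to detect invalid bytes rather than have them silently repaired, a CharsetDecoder configured with CodingErrorAction.REPORT can be used. The following is only a minimal sketch: it assumes UTF-8 as the charset (the question uses the platform default), and the class name StrictDecode and helper isValidUtf8 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Hypothetical helper: true only if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting a replacement.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] hello = "Hello".getBytes(StandardCharsets.UTF_8);
        // The negated bytes from the question (assumed to be decoded as UTF-8).
        byte[] negated = {-72, -101, -108, -108, -111};
        System.out.println(isValidUtf8(hello));   // true
        System.out.println(isValidUtf8(negated)); // false
    }
}
```

The negated bytes are all UTF-8 continuation bytes with no lead byte in front of them, which is why strict decoding rejects them.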
If you get the bytes and negate them twice without converting the intermediate bytes to a string, you will get the original result back.
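If an intermediate String really is needed, one possible workaround (my suggestion, not something from the question) is to pass an explicit single-byte charset such as ISO-8859-1, in which every byte value maps to exactly one character and back. A rough sketch, with the class name Latin1RoundTrip chosen only for illustration:

```java
import java.nio.charset.StandardCharsets;

class Latin1RoundTrip {
    // Same negation as in the question, but with an explicit charset
    // in which every possible byte value is a valid character.
    static String negate(String s) {
        byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) (~b[i] + 1);
        }
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(negate(negate("Hello"))); // Hello
    }
}
```

This only round-trips input whose characters all fit in ISO-8859-1 (up to U+00FF); anything outside that range is already lost in getBytes.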
The following example demonstrates the lossy nature of new String(/*invalid bytes*/):
String s = "Hello";
byte[] b = s.getBytes();
for (int i = 0; i < b.length; i++) {
    b[i] = (byte)(~b[i] + 1);
}
// Show the negated bytes
System.out.println(Arrays.toString(b));
String s2 = new String(b);
// Show the bytes of the string constructed from them; note they're not the same
System.out.println(Arrays.toString(s2.getBytes()));
On my system, which I think is UTF-8 by default, I get:
[-72, -101, -108, -108, -111]
[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]
Note what happened when I took the invalid stream of bytes, made a string out of it, and then got the bytes of that string: five bytes went in, and fifteen different bytes came out.
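The repeating triple -17, -65, -67 is the UTF-8 encoding (EF BF BD) of U+FFFD, the Unicode replacement character. The sketch below, assuming UTF-8 is passed explicitly rather than taken from the platform default, shows that each invalid byte is decoded to one U+FFFD; the class name ReplacementCharDemo and helper decodeAndReencode are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementCharDemo {
    // Decodes as UTF-8, then re-encodes, to expose what decoding did to the bytes.
    static byte[] decodeAndReencode(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+FFFD on its own encodes to three bytes in UTF-8.
        System.out.println(Arrays.toString("\uFFFD".getBytes(StandardCharsets.UTF_8)));
        // prints [-17, -65, -67]

        // The negated bytes of "Hello" from the question.
        byte[] negated = {-72, -101, -108, -108, -111};
        String decoded = new String(negated, StandardCharsets.UTF_8);
        System.out.println(decoded.length());                           // 5
        System.out.println(decoded.chars().allMatch(c -> c == 0xFFFD)); // true
        System.out.println(decodeAndReencode(negated).length);          // 15
    }
}
```

Negating those three bytes gives 17, 65, 67, that is, a control character followed by "AC", which matches the output seen in the question.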