Negating a string causes unexpected behavior

I have been playing around with String and its constructor and noticed some behavior that I cannot explain.

I created the following method

public static String negate(String s) {
    byte[] b = s.getBytes();
    for (int i = 0; i < b.length; i++) {
        b[i] = (byte)(~b[i] + 1);
    }
    System.out.println(Arrays.toString(b));
    return new String(b);
}

      

which just takes the two's complement of each byte and returns a new String built from the result. When I call it like this

System.out.println(negate("Hello"));

      

I get the output

[-72, -101, -108, -108, -111]
     

      

which I think is fine, since there are no negative ASCII values.
But when I nest the calls like this

System.out.println(negate(negate("Hello")));

      

my output is

[-72, -101, -108, -108, -111]
[17, 65, 67, 17, 65, 67, 17, 65, 67, 17, 65, 67, 17, 65, 67]
ACACACACAC // 5 groups of 3 characters (1 ctrl-char and "AC")

      

I expected the result to match my input string "Hello" exactly, but I got this instead. Why? This also happens with any other input string. Once the calls are nested, every single character of the input simply becomes a control character followed by AC.

I went ahead and created a method that does the same thing, but only with raw byte arrays

public static byte[] n(byte[] b) {
    for (int i = 0; i < b.length; i++) {
        b[i] = (byte)(~b[i] + 1);
    }
    System.out.println(Arrays.toString(b));
    return b;
}

      

Here the output is as expected. For

System.out.println(new String(n(n("Hello".getBytes()))));

      

I get

[-72, -101, -108, -108, -111]
[72, 101, 108, 108, 111]
Hello

      

So my guess is that it has something to do with how the new String is created, since this only happens when I call negate with an instance that was itself built from negated bytes?

I even went down the class tree to look at the internal classes, but I couldn't find where this behavior comes from.

Also in the String docs there is the following paragraph which might be an explanation:

The behavior of this constructor when the given bytes are not valid in the default charset is unspecified.

Can anyone tell me why this is the case and what exactly is happening here?

+3




2 answers


The problem is that you are taking negated bytes and trying to interpret them as a valid stream of bytes in the default character set (remember, characters are not bytes). As the String constructor documentation you quoted tells you, the result is unspecified, and it probably involves error correction, dropping invalid values, and so on. Naturally, then, this is a lossy process, and reversing it will not bring you back to the original string.
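One way to check that the negated bytes really are not valid in a given charset is to decode them with a strict decoder that reports errors instead of silently replacing them. The following is only a sketch, assuming UTF-8 is the charset in play; the class and method names (StrictDecodeDemo, isValidUtf8) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    // Hypothetical helper: true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Report bad input as an exception rather than substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] original = "Hello".getBytes(StandardCharsets.UTF_8);
        byte[] negated = {-72, -101, -108, -108, -111}; // the negated bytes from the question

        System.out.println(isValidUtf8(original)); // true
        System.out.println(isValidUtf8(negated));  // false: -72 (0xB8) cannot start a UTF-8 sequence
    }
}
```

Unlike new String(byte[]), this fails loudly on the negated array, which makes the lossy step visible.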

If you get the bytes and negate them twice without converting the intermediate bytes to a string, you will get the original result back.

This example demonstrates the lossy nature of new String(/*invalid bytes*/):

String s = "Hello";
byte[] b = s.getBytes();
for (int i = 0; i < b.length; i++) {
    b[i] = (byte)(~b[i] + 1);
}
// Show the negated bytes
System.out.println(Arrays.toString(b));
String s2 = new String(b);
// Show the bytes of the string constructed from them; note they're not the same
System.out.println(Arrays.toString(s2.getBytes()));

      



On my system, which I believe defaults to UTF-8, I get:

[-72, -101, -108, -108, -111]
[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]

Note what happened when I took an invalid stream of bytes, made a string out of it, and then got the bytes of that string.
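If the intermediate value has to be a String, one possible workaround is to name a charset in which every byte value maps to exactly one character. ISO-8859-1 has that property, so the sketch below round-trips, assuming the input contains only characters in the Latin-1 range (anything else would already be replaced by getBytes). The class name Latin1NegateDemo is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Latin1NegateDemo {
    // Same logic as negate in the question, but with an explicit one-byte-per-char charset.
    static String negate(String s) {
        byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) (~b[i] + 1); // two's complement of each byte
        }
        // Every byte value 0x00..0xFF decodes to U+0000..U+00FF, so nothing is lost here.
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(negate(negate("Hello"))); // Hello
    }
}
```

The intermediate string is still not meaningful text; it only carries the bytes through unchanged.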

+4




You "negate" the character and it becomes invalid. Then you get the placeholder character (U+FFFD) in its place. At that point everything is ruined. Then you "negate" that, and you get your AC from each of the placeholder characters.
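The arithmetic can be traced for a single placeholder. This is a small sketch assuming UTF-8; the class name ReplacementCharDemo and its negate helper are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementCharDemo {
    // Returns a negated copy so the input array is left untouched.
    static byte[] negate(byte[] in) {
        byte[] out = in.clone();
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) (~out[i] + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // U+FFFD is encoded in UTF-8 as EF BF BD.
        byte[] placeholder = "\uFFFD".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(placeholder)); // [-17, -65, -67]

        // Negating those three bytes gives DC1 (a control character), 'A' and 'C'.
        byte[] negated = negate(placeholder);
        System.out.println(Arrays.toString(negated)); // [17, 65, 67]

        String s = new String(negated, StandardCharsets.UTF_8);
        System.out.println(s.equals("\u0011AC")); // true
    }
}
```

Five invalid bytes give five placeholders, hence the five groups of three bytes in the question's output.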



+2








