Converting int to char and then back to int does not always give the same result
I am trying to get a char from an int value > 0xFFFF. But instead, I always get back the same char that, when converted back to int, gives the value 65535 (0xFFFF).
I couldn't figure out how to produce characters for Unicode code points > 0xFFFF.
int hex = 0x10FFFF;
char c = (char)hex;
System.out.println((int)c);
I expected the output to be 0x10FFFF (1114111 in decimal). Instead, the output is 65535.
source to share
This is because int is 4 bytes while char is only 2 bytes. Thus, you cannot represent all the values in char that you can in int. Using the standard unsigned integer representation, a 2-byte value can only represent the range from 0 to 2^16 - 1 == 65535, so if you convert any number outside that range to a 2-byte value and back, you lose data.
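A minimal sketch of the data loss described above: the narrowing cast to char keeps only the low 16 bits, so the round trip only survives for values that fit in 16 bits (for 0x10FFFF, the low 16 bits happen to be 0xFFFF, which is why the question's output is 65535).

```java
public class CharNarrowingDemo {
    public static void main(String[] args) {
        int inRange = 0x0041;      // 'A', fits in 16 bits
        int outOfRange = 0x10FFFF; // needs 21 bits

        // Round trip through char is lossless for a 16-bit value...
        System.out.println((int) (char) inRange);    // prints 65

        // ...but the narrowing cast discards all but the low 16 bits.
        // 0x10FFFF & 0xFFFF == 0xFFFF == 65535
        System.out.println((int) (char) outOfRange); // prints 65535
    }
}
```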
source to share
Your number was too big for a char, which is 2 bytes, but small enough to fit into an int, which is 4 bytes. 65535 is the largest value a char can hold, so that is the value you got. If char were big enough to hold your number, casting it back to int would give the decimal value of 0x10FFFF, which is 1114111.
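The sizes and limits mentioned above can be checked directly; this small sketch only uses the standard constants Character.SIZE, Character.MAX_VALUE and Integer.SIZE:

```java
public class CharLimits {
    public static void main(String[] args) {
        System.out.println(Character.SIZE);            // 16 bits per char
        System.out.println((int) Character.MAX_VALUE); // 65535, the largest char value
        System.out.println(Integer.SIZE);              // 32 bits per int
        System.out.println(0x10FFFF);                  // 1114111, too big for a char
    }
}
```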
source to share
Unfortunately, I think you expected a Java char to be the same thing as a Unicode code point. They are not.
A Java char, as other answers have already said, can only hold code points that fit in 16 bits, whereas Unicode needs 21 bits to cover all code points.
In other words, a Java char by itself only supports characters from the Basic Multilingual Plane (code points <= 0xFFFF). In Java, if you want to represent a Unicode code point in one of the supplementary planes (code points > 0xFFFF), you need a surrogate pair, i.e. a pair of char values. This is how UTF-16 works, and internally this is how Java strings work. Just for fun, run the following snippet to see how a single Unicode code point is actually represented by two chars when the code point is > 0xFFFF:
// Printing the string length for a string with
// a single Unicode code point: 0x22BED.
System.out.println("\uD84A\uDFED".length()); // prints 2, because it uses a surrogate pair.
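To make the surrogate pair visible, you can also print the two char values individually; a sketch, where the escapes "\uD84A\uDFED" encode the same code point 0x22BED as above:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD84A\uDFED"; // one code point, 0x22BED, as a surrogate pair

        System.out.println((int) s.charAt(0)); // prints 55370 (0xD84A, high surrogate)
        System.out.println((int) s.charAt(1)); // prints 57325 (0xDFED, low surrogate)

        // Two chars, but only one code point.
        System.out.println(s.length());                      // prints 2
        System.out.println(s.codePointCount(0, s.length())); // prints 1
    }
}
```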
If you want to safely convert an int value representing a Unicode code point to char (or, more precisely, to a char[]) and then convert it back to an int code point, you can use code like this:
public static void main(String[] args) {
int hex = 0x10FFFF;
System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
char[] surrogateChars = Character.toChars(hex);
int codePointConvertedBack = Character.codePointAt(surrogateChars, 0);
System.out.println(codePointConvertedBack); // prints 1114111
}
Alternatively, instead of manipulating char arrays, you can use String, for example:
public static void main(String[] args) {
int hex = 0x10FFFF;
System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
String s = new String(new int[] {hex}, 0, 1);
int codePointConvertedBack = s.codePointAt(0);
System.out.println(codePointConvertedBack); // prints 1114111
}
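Another option along the same lines, offered as a sketch: StringBuilder.appendCodePoint handles the surrogate encoding for you and avoids the int-array constructor:

```java
public class AppendCodePointDemo {
    public static void main(String[] args) {
        int hex = 0x10FFFF;

        // appendCodePoint writes a surrogate pair for supplementary code points.
        String s = new StringBuilder().appendCodePoint(hex).toString();

        System.out.println(s.length());       // prints 2 (stored as a surrogate pair)
        System.out.println(s.codePointAt(0)); // prints 1114111
    }
}
```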
For further reading: the Java Character class documentation.
source to share