How can a string of only 5 characters take 21 bytes in UTF-8?
After writing some basic code to count the number of characters in a String, I found one example where the UTF-8 encoding of a 5-character String produces 21 bytes.
Here's the output:
String ==¦ อภิชาติ ¦==
Code units 7
UTF8 Bytes 21
8859 Bytes 7
Characters 5
I understand that Java's internal representation of a char is 2 bytes and that some characters may need two Unicode code units to represent them.
Since UTF-8 uses at most 4 bytes per character, how can the byte[] be longer than 20 for a 5-character String?
Here's the source:
import java.io.UnsupportedEncodingException;

public class StringTest {

    public static void main(String[] args) {
        displayStringInfo("อภิชาติ");
    }

    public static void displayStringInfo(String s) {
        // length() counts UTF-16 code units, not user-perceived characters
        System.out.println("Code units " + s.length());
        try {
            System.out.println("UTF8 Bytes " + s.getBytes("UTF-8").length);
        } catch (UnsupportedEncodingException e) {
            // not handled
        }
        System.out.println("Characters " + characterLength(s));
    }

    // Counts code units that are neither high surrogates nor combining marks
    public static int characterLength(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!isLeadingUnit(s.charAt(i)) && !isMark(s.charAt(i))) count++;
        }
        return count;
    }

    private static boolean isMark(char ch) {
        int type = Character.getType(ch);
        return (type == Character.NON_SPACING_MARK ||
                type == Character.ENCLOSING_MARK ||
                type == Character.COMBINING_SPACING_MARK);
    }

    private static boolean isLeadingUnit(char ch) {
        return Character.isHighSurrogate(ch);
    }
}
Your "5 characters" string is actually 7 Unicode code points:
- U+0E2D THAI CHARACTER O ANG
- U+0E20 THAI CHARACTER PHO SAMPHAO
- U+0E34 THAI CHARACTER SARA I
- U+0E0A THAI CHARACTER CHO CHANG
- U+0E32 THAI CHARACTER SARA AA
- U+0E15 THAI CHARACTER TO TAO
- U+0E34 THAI CHARACTER SARA I
They are all in the range U+0800 to U+FFFF, which requires 3 bytes per code point in UTF-8, hence the total length is 7 × 3 = 21 bytes.
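To check this breakdown yourself, something like the following sketch should work. The class name CodePointBreakdown and its helper methods are made up for illustration, and the string is written with Unicode escapes so the result does not depend on the source file encoding. It walks the string code point by code point and asks the JDK how many UTF-8 bytes each one needs.

```java
import java.nio.charset.StandardCharsets;

public class CodePointBreakdown {

    // Number of bytes a single code point occupies when encoded as UTF-8.
    public static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    // Total UTF-8 size of a string, summed code point by code point.
    public static int totalUtf8Length(String s) {
        return s.codePoints().map(CodePointBreakdown::utf8Length).sum();
    }

    public static void main(String[] args) {
        // The string from the question, written with Unicode escapes.
        String s = "\u0E2D\u0E20\u0E34\u0E0A\u0E32\u0E15\u0E34";

        // One line per code point: value, Unicode name, UTF-8 byte count.
        s.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s -> %d bytes%n",
                        cp, Character.getName(cp), utf8Length(cp)));

        System.out.println("Code points: " + s.codePointCount(0, s.length()));
        System.out.println("UTF-8 bytes: " + totalUtf8Length(s));
    }
}
```

For this string the last two lines report 7 code points and 21 bytes, matching the 7 × 3 arithmetic.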
The string contains 7 code points:
'อ' (0x0e2d) encoded as {0xe0, 0xb8, 0xad}
'ภ' (0x0e20) encoded as {0xe0, 0xb8, 0xa0}
' ิ' (0x0e34) encoded as {0xe0, 0xb8, 0xb4}
'ช' (0x0e0a) encoded as {0xe0, 0xb8, 0x8a}
'า' (0x0e32) encoded as {0xe0, 0xb8, 0xb2}
'ต' (0x0e15) encoded as {0xe0, 0xb8, 0x95}
' ิ' (0x0e34) encoded as {0xe0, 0xb8, 0xb4}
Each of them is encoded with 3 bytes in UTF-8, so you have 7 * 3 == 21 bytes altogether.
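If you want to see where those byte values come from, here is a rough sketch of the 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) done by hand and compared against getBytes. The class Utf8ThreeByte and its methods are hypothetical names for this example only, and the encoder deliberately handles just the U+0800 to U+FFFF range that these Thai characters fall into.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ThreeByte {

    // Hand-encodes one code point in U+0800..U+FFFF (surrogates excluded)
    // using the 3-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx.
    public static byte[] encodeThreeByte(int cp) {
        if (cp < 0x0800 || cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a 3-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),          // top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // middle 6 bits
            (byte) (0x80 | (cp & 0x3F))          // low 6 bits
        };
    }

    // Formats bytes in the same style as the list above, e.g. {0xe0, 0xb8, 0xad}.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(", ");
            sb.append(String.format("0x%02x", bytes[i] & 0xFF));
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        // The string from the question, written with Unicode escapes.
        String s = "\u0E2D\u0E20\u0E34\u0E0A\u0E32\u0E15\u0E34";
        s.codePoints().forEach(cp -> {
            byte[] manual = encodeThreeByte(cp);
            byte[] library = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %s (matches getBytes: %b)%n",
                    cp, toHex(manual), Arrays.equals(manual, library));
        });
    }
}
```

Every code point here shares the prefix 0xe0, 0xb8 because the top 10 bits of U+0E00 to U+0E3F are identical; only the last byte changes.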