StringBuilder # appendCodePoint (int) behaves unexpectedly
The java.lang.StringBuilder appendCodePoint (...) method, for me, behaves in an unpredictable manner.
For the Unicode coded characters above Character.MAX_VALUE (encoding to UTF-8, which is my Eclipse workbench, it takes 3 or 4 bytes), this behaves weird.
I am adding one-line String Unicode code to StringBuilder, but its output looks different at the end. I suspect that calling Character.toSurrogates (codePoint, value, count) on AbstractStringBuilder # appendCodePoint (...) is causing this, but I don't know how to get around it.
My code:
// returns random string in range of unicode code points 0x2F800 to 0x2FA1F
// e.g. 槪𥥼報悔𦖨嘆汧犕尢𦔣洴真硎尢趼犀㠯弢卿𢛔芋玥峀䔫䩶莭型築𡷦𩐊
String s = getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(length);
System.out.println(s);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < getCodePointCount(s); i++) {
sb.appendCodePoint(s.codePointAt(i));
}
// prints some of the CJK characters, but between them there is a '?'
// e.g. 槪?𥥼?報?悔?𦖨?嘆?汧?犕?尢?𦔣?洴?真?硎?尢?趼?
System.out.println(sb.toString());
// returns random string in range of unicode code points 0x20000 to 0x2A6DF
// e.g. 𤸥𤈍𪉷𪉔𤑺𡹋𠋴𨸁𦧖𣯠𨚾𣥷𪂶𦄃𧊈𤧘𢙕𪚋𤧒𥩛𧆞𨕌𣸑𡚊𥽚𡛳𣐸𩆟𩣞𥑡
s = getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(length);
// prints the CJK characters correctly
System.out.println(s);
sb = new StringBuilder();
for (int i = 0; i < getCodePointCount(s); i++) {
sb.appendCodePoint(s.codePointAt(i));
}
// prints some of the CJK characters, but between them there is a '?'
// e.g. 𤸥?𤈍?𪉷?𪉔?𤑺?𡹋?𠋴?𨸁?𦧖?𣯠?𨚾?𣥷?𪂶?𦄃?𧊈?
System.out.println(sb.toString());
FROM
public static int getCodePointCount(String s) {
return s.codePointCount(0, s.length());
}
public static String getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(int length) {
return getRandomStringOfMaxLengthInRange(length, 0x20000, 0x2A6DF);
}
public static String getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(int length) {
return getRandomStringOfMaxLengthInRange(length, 0x2F800, 0x2FA1F);
}
private static String getRandomStringOfMaxLengthInRange(int length, int from, int to) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++) {
// try to find a valid character MAX_TRIES times
for (int j = 0; j < MAX_TRIES; j++) {
int unicodeInt = from + random.nextInt(to - from);
if (Character.isValidCodePoint(unicodeInt) &&
(Character.isLetter(unicodeInt) || Character.isDigit(unicodeInt) ||
Character.isWhitespace(unicodeInt))) {
sb.appendCodePoint(unicodeInt);
break;
}
}
}
return new String(sb.toString().getBytes(), "UTF-8");
}
source to share
You are repeating codes incorrectly. You should use the strategy presented by Jonathan Feinberg here
final int length = s.length();
for (int offset = 0; offset < length; ) {
final int codepoint = s.codePointAt(offset);
// do something with the codepoint
offset += Character.charCount(codepoint);
}
or since Java 8
s.codePoints().forEach(/* do something */);
Pay attention to the Javadoc String#codePointAt(int)
Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length () - 1.
You have repeated from 0 to codePointCount
. If the character is not a low-level surrogate pair, he returns alone. In this case, your index should only be incremented by 1. Otherwise, it should be incremented by 2 ( Character#charCount(int)
deals with this) as you end up with the code matching the pair.
source to share
Change your loops:
for (int i = 0; i < getCodePointCount(s); i++) {
:
for (int i = 0; i < getCodePointCount(s); i = s.offsetByCodePoints(i, 1)) {
In Java, char is one UTF-16 value. Additional code points take up two characters per line.
But you are looping over each char in your string. This means that you read each additional code twice: the first time you read both of your UTF-16 surrogate characters; the second time you read and add only the low surrogate char.
Consider a string containing only a single code 0x2f8eb
. The Java string representing this code actually contains this:
"\ud87e\udceb"
If you loop through each individual char index, then your loop will efficiently do this:
sb.appendCodePoint(0x2f8eb); // codepoint found at index 0
sb.appendCodePoint(0xdceb); // codepoint found at index 1
source to share