Why are String.endsWith and String.startsWith incompatible?
I have the following test case and only the first assertion passes. Why?
@Test
public void test() {
String i1 = "i";
String i2 = "İ".toLowerCase();
System.out.println((int)i1.charAt(0)); // 105
System.out.println((int)i2.charAt(0)); // 105
assertTrue(i2.startsWith(i1));
assertTrue(i2.endsWith(i1));
assertTrue(i1.endsWith(i2));
assertTrue(i1.startsWith(i2));
}
Update after answers
I am trying to use startsWith and endsWith in a case-insensitive way, so the expression below should return true:
"ALİ".toLowerCase().endsWith("i");
This is because lowercasing İ ("LATIN CAPITAL LETTER I WITH DOT ABOVE") in English locales produces two characters: "LATIN SMALL LETTER I" and "COMBINING DOT ABOVE".
This explains why the result starts with i but does not end with i (it ends with the combining diacritical mark).
In a Turkish locale, lowercasing İ yields just "LATIN SMALL LETTER I", following the rules of Turkish, so your code will work there.
Here's a test program to help you figure out what's going on:
class Test {
public static void main(String[] args) {
char[] foo = args[0].toLowerCase().toCharArray();
System.out.print("Lowercase " + args[0] + " has " + foo.length + " chars: ");
for(int i=0; i<foo.length; i++) System.out.print("0x" + Integer.toString((int)foo[i], 16) + " ");
System.out.println();
}
}
This is what we get when we run it on a system configured for English:
$ LC_ALL=en_US.utf8 java Test "İ"
Lowercase İ has 2 chars: 0x69 0x307
This is what we get when we run it on a system configured for Turkish:
$ LC_ALL=tr_TR.utf8 java Test "İ"
Lowercase İ has 1 chars: 0x69
This is even a specific example used by the API docs for String.toLowerCase(Locale), which is the method you can use to lowercase under a specific locale rather than the system's default locale.
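To see the locale dependence without changing LC_ALL, the locale can be passed explicitly. The following is a minimal sketch, not part of the original answer; the class name LowerCaseLocaleDemo and the helper lowerLength are made up for illustration.

```java
import java.util.Locale;

public class LowerCaseLocaleDemo {
    // Hypothetical helper: UTF-16 length of s after lowercasing under the
    // locale named by the given BCP 47 language tag (e.g. "en-US", "tr-TR").
    static int lowerLength(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag)).length();
    }

    public static void main(String[] args) {
        String dottedCapitalI = "\u0130"; // İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
        // English rules: i followed by COMBINING DOT ABOVE, so two chars.
        System.out.println(lowerLength(dottedCapitalI, "en-US")); // 2
        // Turkish rules: a single plain i.
        System.out.println(lowerLength(dottedCapitalI, "tr-TR")); // 1
    }
}
```

Because the locale is an argument here, the result no longer depends on how the machine running the code is configured.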
İ is the Unicode character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130) and is a Java string of length 1.
"İ".toLowerCase() returns a Java string of length 2:
- Unicode character 'LATIN SMALL LETTER I' (U+0069) (the normal i).
- Unicode character 'COMBINING DOT ABOVE' (U+0307).
This is because there is no single character 'LATIN SMALL LETTER I WITH DOT ABOVE'; it doesn't exist in Unicode.
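One way to check the absence of a precomposed character is to ask java.text.Normalizer to compose the two-char sequence. This is only an illustrative sketch; the class name DotAboveNormalizationDemo and the helpers nfc and nfd are invented for the example.

```java
import java.text.Normalizer;

public class DotAboveNormalizationDemo {
    // Canonical composition (NFC): combines base + mark where a precomposed form exists.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition (NFD): splits precomposed characters apart.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String lowered = "i\u0307"; // i + COMBINING DOT ABOVE
        // No precomposed lowercase form exists, so NFC leaves both chars in place.
        System.out.println(nfc(lowered).length()); // 2
        // The capital letter decomposes to capital I + COMBINING DOT ABOVE.
        System.out.println(nfd("\u0130").equals("I\u0307")); // true
    }
}
```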
After calling toLowerCase(), the string length is 2 instead of 1; the lowercase version of this character is represented by two chars:
000> "İ".length()
===> 1
000> "İ".toLowerCase().length()
===> 2
The first character is the lowercase Latin i, and the second is the combining diacritic:
000> "İ".toLowerCase().charAt(0)
===> i
000> "İ".toLowerCase().charAt(1)
===> ̇
So the string "starts with" i, but it doesn't "end with" it.
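For the case-insensitive check the question asks about, one option that avoids toLowerCase() entirely is String.regionMatches with ignoreCase set to true, which compares char by char and never changes the string's length. This is a sketch of that alternative, not something proposed in the answers here; the class and helper names are hypothetical. Note that the comparison is locale-independent, so it also treats the dotless ı as matching i.

```java
public class EndsWithIgnoreCaseDemo {
    // Hypothetical helper: case-insensitive endsWith via regionMatches.
    static boolean endsWithIgnoreCase(String text, String suffix) {
        int start = text.length() - suffix.length();
        // A suffix longer than the text can never match.
        return start >= 0 && text.regionMatches(true, start, suffix, 0, suffix.length());
    }

    // Hypothetical helper: case-insensitive startsWith via regionMatches.
    static boolean startsWithIgnoreCase(String text, String prefix) {
        return text.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    public static void main(String[] args) {
        // "ALİ" written with an escape for İ (U+0130).
        System.out.println(endsWithIgnoreCase("AL\u0130", "i"));
        System.out.println(startsWithIgnoreCase("\u0130stanbul", "i"));
    }
}
```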
Your test fails because you are using the wrong method overload.
String i2 = "İ" is the Turkish capital form of i, and if you don't supply the locale for the conversion, the result depends on the default locale and the test can fail.
Passing a locale explicitly might help :)
// requires: import java.util.Locale;
public static void main(String[] args) {
String i1 = "i";
String i2 = "İ".toLowerCase(Locale.forLanguageTag("tr-TR"));
System.out.println((int)i1.charAt(0)); // 105
System.out.println((int)i2.charAt(0)); // 105
System.out.println(i2.startsWith(i1));
System.out.println(i2.endsWith(i1));
System.out.println(i1.endsWith(i2));
System.out.println(i1.startsWith(i2));
}
The output will be:
105
105
true
true
true
true