Why are String.endsWith and String.startsWith inconsistent?

Question

I have the following test case and only the first assertion passes. Why?

@Test
public void test() {
    String i1 = "i";
    String i2 = "İ".toLowerCase();

    System.out.println((int)i1.charAt(0)); // 105
    System.out.println((int)i2.charAt(0)); // 105

    assertTrue(i2.startsWith(i1));

    assertTrue(i2.endsWith(i1));
    assertTrue(i1.endsWith(i2));
    assertTrue(i1.startsWith(i2));
}

Update after answers

I am trying to use startsWith and endsWith in a case-insensitive way, so the expression below should return true:

"ALİ".toLowerCase().endsWith("i");

I think C# and Java behave differently here.

+3

java string character-encoding locale

Mehmet Ataş 04 Aug 17 at 20:21

4 answers

İ is the Unicode character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130), and it is a Java string of length 1.

"İ".toLowerCase() returns a Java string of length 2:

Unicode character 'LATIN SMALL LETTER I' (U+0069) (a normal i).
Unicode character 'COMBINING DOT ABOVE' (U+0307).

This is because there is no such character as 'LATIN SMALL LETTER I WITH DOT ABOVE'. It doesn't exist in Unicode.
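As a rough illustration of the point above, the following sketch prints the code points before and after lowercasing. The class name LowercaseCodePoints and the helper codePoints are made up for this example, and Locale.ROOT is an assumption chosen so the result does not depend on the machine's default locale.

```java
import java.util.Locale;
import java.util.stream.Collectors;

public class LowercaseCodePoints {
    // Hypothetical helper: formats each code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String upper = "\u0130"; // İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
        // Assumption: Locale.ROOT, so Turkish-specific rules are not applied.
        String lower = upper.toLowerCase(Locale.ROOT);

        System.out.println(upper.length() + " -> " + codePoints(upper)); // 1 -> U+0130
        System.out.println(lower.length() + " -> " + codePoints(lower)); // 2 -> U+0069 U+0307
    }
}
```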

+3

Andreas 04 Aug 17 at 20:34

After calling toLowerCase(), the string length is 2 instead of 1; the lowercase version of this character is represented by two chars:

000> "İ".length()
===> 1
000> "İ".toLowerCase().length()
===> 2

The first char of the lowercase form is the lowercase Latin i, and the second is a combining diacritic:

000> "İ".toLowerCase().charAt(0)
===> i
000> "İ".toLowerCase().charAt(1)
===> ̇

So the lowercase string "starts with" i, but it doesn't end with it.
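The four checks from the question can be reproduced with a small sketch like the one below. The class name StartsEndsCheck and the helper lowerDottedI are hypothetical, and Locale.ROOT is assumed so that the behavior matches a non-Turkish default locale.

```java
import java.util.Locale;

public class StartsEndsCheck {
    // Hypothetical helper: lowercases İ (U+0130) without Turkish-specific rules.
    static String lowerDottedI() {
        return "\u0130".toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String lower = lowerDottedI(); // two chars: U+0069 followed by U+0307

        System.out.println(lower.startsWith("i"));    // true: the first char is i
        System.out.println(lower.endsWith("i"));      // false: the last char is the combining dot
        System.out.println(lower.endsWith("\u0307")); // true: the suffix is U+0307
        System.out.println("i".startsWith(lower));    // false: a 1-char string cannot start with a 2-char prefix
        System.out.println("i".endsWith(lower));      // false: same reason for the suffix
    }
}
```

This also accounts for the last two assertions in the question: i1 is shorter than i2, so i1 can neither start nor end with i2.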

+3

nbrooks 04 Aug 17 at 20:35

Your test fails because you are calling the wrong overload of toLowerCase.

String i2 = "İ" holds the Turkish capital form of i, and if you don't pass a locale to the conversion, the default locale's rules apply and the assertions fail.

Using a locale might help :)

import java.util.Locale;

public class Main {
    public static void main(String[] args) {
        String i1 = "i";
        // Lowercase with Turkish rules, where İ maps to a plain i
        String i2 = "İ".toLowerCase(Locale.forLanguageTag("tr-TR"));

        System.out.println((int)i1.charAt(0)); // 105
        System.out.println((int)i2.charAt(0)); // 105

        System.out.println(i2.startsWith(i1));
        System.out.println(i2.endsWith(i1));
        System.out.println(i1.endsWith(i2));
        System.out.println(i1.startsWith(i2));
    }
}

The output will be:

105
105
true
true
true
true

+1

ΦXocę 웃 Pepeúpa ツ 04 Aug 17 at 20:45

that other guy · Accepted Answer · 2017-08-04T20:35:40+0000

This is because lowercasing İ ("latin capital letter i with dot above") in English locales yields two characters: "latin small letter i" and "combining dot above".

This explains why the result starts with i but does not end with i (it ends with the combining diacritical mark).

In Turkish locales, lowercasing İ yields just "latin small letter i", following Turkish language rules, so your code will work.

Here's a test program to help you figure out what's going on:

class Test {
  public static void main(String[] args) {
    char[] foo = args[0].toLowerCase().toCharArray();
    System.out.print("Lowercase " + args[0] + " has " + foo.length + " chars: ");
    for(int i=0; i<foo.length; i++) System.out.print("0x" + Integer.toString((int)foo[i], 16) + " ");
    System.out.println();
  }
}

This is what we get when we run it on a system configured for English:

$ LC_ALL=en_US.utf8 java Test "İ"
Lowercase İ has 2 chars: 0x69 0x307

This is what we get when we run it on a system configured for Turkish:

$ LC_ALL=tr_TR.utf8 java Test "İ"
Lowercase İ has 1 chars: 0x69

This is even the specific example used by the API docs for String.toLowerCase(Locale), the method you can use to get the lowercase version for a specific locale rather than for the system's default locale.
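One possible way to apply this to the case-insensitive check from the question is to pass the locale explicitly. This is only a sketch: the class name LocaleAwareSuffix and the helper endsWithIgnoreCase are hypothetical and not part of the JDK.

```java
import java.util.Locale;

public class LocaleAwareSuffix {
    // Hypothetical helper: lowercases both strings with the given locale,
    // then compares suffixes with String.endsWith.
    static boolean endsWithIgnoreCase(String text, String suffix, Locale locale) {
        return text.toLowerCase(locale).endsWith(suffix.toLowerCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        String name = "AL\u0130"; // "ALİ"

        System.out.println(endsWithIgnoreCase(name, "i", turkish));        // true: İ lowercases to i
        System.out.println(endsWithIgnoreCase(name, "i", Locale.ENGLISH)); // false: İ lowercases to i + U+0307
    }
}
```

Turkish rules also lowercase a plain I to the dotless ı, so a helper like this is only appropriate when the text is known to be Turkish.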