Split string using Unicode delimiter

I need to split a string with "—" as the delimiter in Java. Example: "Single-Local Room — Enjoy Your Stay"

The same data arrives in English or German depending on the language, so I cannot use the usual string.split("—"). The Unicode code point for the character "—" is 8212 (decimal) or 0x2014 (hex). How do I split a string using a Unicode code point?

+3




4 answers


You may be mistaken about which Unicode character you are actually getting. As of Unicode v6.1, there are 27 code points with the \p{Dash} property:

U+002D ‭ -  HYPHEN-MINUS
U+058A ‭ ֊  ARMENIAN HYPHEN
U+05BE ‭ ־  HEBREW PUNCTUATION MAQAF
U+1400 ‭ ᐀  CANADIAN SYLLABICS HYPHEN
U+1806 ‭ ᠆  MONGOLIAN TODO SOFT HYPHEN
U+2010 ‭ ‐  HYPHEN
U+2011 ‭ ‑  NON-BREAKING HYPHEN
U+2012 ‭ ‒  FIGURE DASH
U+2013 ‭ –  EN DASH
U+2014 ‭ —  EM DASH
U+2015 ‭ ―  HORIZONTAL BAR
U+2053 ‭ ⁓  SWUNG DASH
U+207B ‭ ⁻  SUPERSCRIPT MINUS
U+208B ‭ ₋  SUBSCRIPT MINUS
U+2212 ‭ −  MINUS SIGN
U+2E17 ‭ ⸗  DOUBLE OBLIQUE HYPHEN
U+2E1A ‭ ⸚  HYPHEN WITH DIAERESIS
U+2E3A ‭ ⸺  TWO-EM DASH
U+2E3B ‭ ⸻  THREE-EM DASH
U+301C ‭ 〜 WAVE DASH
U+3030 ‭ 〰 WAVY DASH
U+30A0 ‭ ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN
U+FE31 ‭ ︱ PRESENTATION FORM FOR VERTICAL EM DASH
U+FE32 ‭ ︲ PRESENTATION FORM FOR VERTICAL EN DASH
U+FE58 ‭ ﹘ SMALL EM DASH
U+FE63 ‭ ﹣ SMALL HYPHEN-MINUS
U+FF0D ‭ － FULLWIDTH HYPHEN-MINUS

      

In Perl or ICU you can split directly on \p{Dash}, but because Sun's Pattern class does not support that Unicode property, you have to synthesize it with an enumerated character class in square brackets. So splitting on the pattern



string.split("[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]")

      

should do the trick for you. You may need to double those backslashes (\\u2014 rather than \u2014) if you are worried about the Java compiler translating the escapes before the regex parser sees them; the regex parser understands the \uXXXX notation on its own.
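As a minimal sketch of the approach described above, the enumerated class can be compiled once and reused. The class name DashSplit, the helper splitOnDash, the trimming step, and the German sample string are all assumptions made for illustration; they are not part of the question's data.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DashSplit {
    // Enumerated stand-in for the Dash property, built from the 27 code points listed above.
    // The backslashes are doubled so the regex engine, not the compiler, decodes the escapes.
    private static final Pattern DASH = Pattern.compile(
        "[\\u002D\\u058A\\u05BE\\u1400\\u1806\\u2010-\\u2015\\u2053\\u207B\\u208B"
        + "\\u2212\\u2E17\\u2E1A\\u2E3A\\u2E3B\\u301C\\u3030\\u30A0"
        + "\\uFE31\\uFE32\\uFE58\\uFE63\\uFF0D]");

    // Splits on any listed dash and trims the whitespace left around each piece
    // (the trimming is an addition for readability, not something the answer requires).
    public static String[] splitOnDash(String text) {
        return Arrays.stream(DASH.split(text))
                     .map(String::trim)
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // The first sample uses the em dash (U+2014) from the question.
        String english = "Single Room \u2014 Enjoy Your Stay";
        // Hypothetical German sample using a plain hyphen-minus.
        String german = "Einzelzimmer - Genie\u00DFen Sie Ihren Aufenthalt";
        System.out.println(Arrays.toString(splitOnDash(english)));
        System.out.println(Arrays.toString(splitOnDash(german)));
    }
}
```

Note that a class this broad also splits on the hyphen-minus inside words such as "Single-Local". If the em dash is the only real delimiter, splitting on "\u2014" alone is the safer choice.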

+3




// Pattern.LITERAL matches the delimiter as plain text, with no regex metacharacters.
Pattern p = Pattern.compile("\u0001", Pattern.LITERAL);
String[] items = p.split(message);

      



+2




String s = "Single Room - Enjoy your stay";
String[] splits = s.split("\u002D");
for(String s1:splits){
    System.out.println(s1);
}

      

+1




The hex value for "-" is 2D (45 decimal, 55 octal), so split on \u002d. Use the following program to look up the integer values of characters:

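public static void main(String[] args) {
    int j = 0; // entries printed on the current row

    // Print each code from 32 to 131 next to its character.
    for (int i = 32; i <= 131; i++) {
        System.out.print(i + ":\t" + (char) i + "   ");
        j++;

        // Start a new row after every 11 entries.
        if (j > 10) {
            System.out.println();
            j = 0;
        }
    }
}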
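The table above only covers codes 32 to 131, so it cannot reveal a dash such as U+2014. A rough sketch of the same idea applied to the actual input string follows; the class name DashInspector and the helper describePunctuation are hypothetical, and the filter that skips letters, digits, and whitespace is just one assumed way to narrow the output.

```java
class DashInspector {
    // Lists every code point that is neither a letter, a digit, nor whitespace,
    // one per line, as "U+XXXX NAME", so the real delimiter can be identified.
    public static String describePunctuation(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints()
            .filter(cp -> !Character.isLetterOrDigit(cp) && !Character.isWhitespace(cp))
            .forEach(cp -> out.append(String.format("U+%04X %s", cp, Character.getName(cp)))
                              .append('\n'));
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input containing the em dash from the question.
        System.out.print(describePunctuation("Single Room \u2014 Enjoy Your Stay"));
    }
}
```

Running this against a real line of the data shows whether the separator is the em dash, the hyphen-minus, or some other dash, which tells you what to pass to split.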

      

0








