Unicode character regex, capture groups

I have the regular expression \p{L}\p{M}*, which I use to split words into characters. This is especially necessary for Hindi or Thai words, where a single visible character can consist of several code points. For example, if I split मछली the regular way in Java, I get [म] [छ] [ल] [ी], whereas I want [म] [छ] [ली].

I am trying to improve this regex to include space characters, so that splitting फार्म पशु gives the following groups: [फा] [र्] [म] [ ] [प] [शु]

But I have had no luck. Can anyone help me?

I am working in Java, so an alternative approach that avoids regex would also be a welcome solution. My current Java code is:

// word is the input String; characters is assumed to be a List<String> that collects the results
Pattern pat = Pattern.compile("\\p{L}\\p{M}*");
Matcher matcher = pat.matcher(word);
while (matcher.find()) {
    characters.add(matcher.group());
}

      



1 answer


Consider using BreakIterator:

import java.text.BreakIterator;
import java.util.Locale;

String text = "मछली";
Locale hindi = new Locale("hi", "IN");
BreakIterator breaker = BreakIterator.getCharacterInstance(hindi);
breaker.setText(text);
// Walk consecutive boundaries; each [start, end) span is one user-perceived character
int start = breaker.first();
for (int end = breaker.next();
     end != BreakIterator.DONE;
     start = end, end = breaker.next()) {
    System.out.println(text.substring(start, end));
}

      



I tested a sample string using Oracle's Java 8 implementation. Also consider the ICU4J version of BreakIterator if needed.
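
If you would rather stay with the regex, one possible sketch is to add an alternation for whitespace, so that each space becomes a group of its own. The class name GraphemeSplitter and the method split are made up for illustration, and \s is only one choice (\p{Zs} would be another). This handles letters, combining marks and whitespace only; digits and punctuation are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class GraphemeSplitter {
    // Either a letter followed by any number of combining marks,
    // or a single whitespace character.
    private static final Pattern UNIT = Pattern.compile("\\p{L}\\p{M}*|\\s");

    // Returns the matched units in order of appearance.
    static List<String> split(String text) {
        List<String> units = new ArrayList<>();
        Matcher matcher = UNIT.matcher(text);
        while (matcher.find()) {
            units.add(matcher.group());
        }
        return units;
    }
}
```

Calling GraphemeSplitter.split("फार्म पशु") should give the six groups from the question, with the space as the fourth element, and split("मछली") still gives [म] [छ] [ली].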
