Unicode character regex, capture groups

Question


I have the regular expression \p{L}\p{M}*, which I use to split words into characters. This is especially necessary for Hindi or Thai words, where one visible character can consist of several code points. For example, if I split मछली the regular way in Java, I get [म] [छ] [ल] [ी], whereas I want [म] [छ] [ली].

I am trying to improve this regex to include space characters, so that splitting फार्म पशु would give the following groups: [फा] [र्] [म] [ ] [प] [शु]

But I have had no luck. Can anyone help me?

Also, if anyone has an alternative way to do this in Java, I would welcome that as well. My current Java code is:

// Each match is one base letter (\p{L}) followed by any combining marks (\p{M}*).
Pattern pat = Pattern.compile("\\p{L}\\p{M}*");
Matcher matcher = pat.matcher(word);
while (matcher.find()) {
    characters.add(matcher.group());
}


java regex unicode

DianeH 20 Aug 14 at 5:15 am


1 answer

McDowell · Accepted Answer · 2014-08-20T07:24:49+0000

Consider using BreakIterator:

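import java.text.BreakIterator;
import java.util.Locale;

String text = "मछली";
Locale hindi = new Locale("hi", "IN");
// A character instance iterates over user-perceived characters, not single chars.
BreakIterator breaker = BreakIterator.getCharacterInstance(hindi);
breaker.setText(text);
int start = breaker.first();
// Walk boundary by boundary; each [start, end) range is one character.
for (int end = breaker.next();
     end != BreakIterator.DONE;
     start = end, end = breaker.next()) {
    System.out.println(text.substring(start, end));
}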

I tested a sample string using Oracle's Java 8 implementation. Also consider the ICU4J version of BreakIterator if needed.
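
If you would rather stay with the regex from the question, one possible way to get the space as its own group is to add a whitespace alternative to the pattern. The following is only a sketch under that assumption: the class name GraphemeSplitter and the method split are made up for illustration, and anything that is neither a letter, a combining mark, nor whitespace (digits, punctuation) is silently skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
public class GraphemeSplitter {
    // Either a base letter followed by its combining marks,
    // or a single whitespace character as a unit of its own.
    private static final Pattern UNIT = Pattern.compile("\\p{L}\\p{M}*|\\s");

    static List<String> split(String text) {
        List<String> units = new ArrayList<>();
        Matcher matcher = UNIT.matcher(text);
        // find() skips over anything the pattern does not cover.
        while (matcher.find()) {
            units.add(matcher.group());
        }
        return units;
    }

    public static void main(String[] args) {
        for (String unit : split("फार्म पशु")) {
            System.out.println("[" + unit + "]");
        }
    }
}
```

For फार्म पशु this yields the six groups asked for in the question, with the space as the fourth one. Note that this regex and BreakIterator do not necessarily agree on where the boundaries fall for every script, so it is worth checking both against your own data.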
