Tokenize a string with special characters using regex

I am trying to find tokens in a string that has words, numbers and special characters. I tried the following code:

String Pattern = "(\\s)+";
String Example = "This `99 is my small \"yy\"  xx`example ";
String[] splitString = (Example.split(Pattern));
System.out.println(splitString.length);
for (String string : splitString) {
    System.out.println(string);
}

      

And got the following output:

This:`99:is:my:small:"yy":xx`example:

      

But what I really want is the following, i.e. I want special characters to be separate tokens as well:

This:`:99:is:my:small:":yy:":xx:`:example:

      

I tried adding the special characters to the pattern, but now they are dropped completely:

String Pattern = "(\"|`|\\.|\\s+)";

This::99:is:my:small::yy::xx:example:

      

What pattern would give me the desired result? Or should I try a different approach than regex?

1 answer


You can use a matching approach instead of splitting: match runs of letters (with or without combining marks), runs of digits, or single characters that are neither word characters nor whitespace. Note that with this pattern _ is matched together with letters rather than as a separate special char.

Use

"(?U)(?>[^\\W\\d]\\p{M}*+)+|\\d+|[^\\w\\s]"

      




Details:

  • (?U) - inline version of the Pattern.UNICODE_CHARACTER_CLASS modifier
  • (?>[^\\W\\d]\\p{M}*+)+ - 1 or more letters or _, each optionally followed by combining marks
  • | - or
  • \\d+ - 1 or more digits
  • | - or
  • [^\\w\\s] - a single char that is neither a word char nor whitespace.
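To make the role of (?U) and \p{M} more concrete, here is a small sketch that runs the pattern from this answer with and without the (?U) flag. The class name, field names and sample inputs are made up for illustration; only the pattern itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names and sample inputs are illustrative only.
final class AlternativesSketch {
    // The pattern from the answer, with Unicode-aware \w, \W, \d and \s.
    static final Pattern UNICODE =
            Pattern.compile("(?U)(?>[^\\W\\d]\\p{M}*+)+|\\d+|[^\\w\\s]");
    // Same pattern without (?U): \w then only covers ASCII [a-zA-Z_0-9].
    static final Pattern ASCII_ONLY =
            Pattern.compile("(?>[^\\W\\d]\\p{M}*+)+|\\d+|[^\\w\\s]");

    // Collects every match of the given pattern, in order.
    static List<String> tokens(Pattern pattern, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 (combining acute accent) stays inside the word token.
        System.out.println(tokens(UNICODE, "cafe\u0301 42!"));
        // Letters and digits end up in separate tokens; '_' goes with the letters.
        System.out.println(tokens(UNICODE, "abc123_x"));   // [abc, 123, _x]
        // Precomposed U+00E9 is a letter with (?U) ...
        System.out.println(tokens(UNICODE, "caf\u00e9"));
        // ... but is split off as a "special char" without it.
        System.out.println(tokens(ASCII_ONLY, "caf\u00e9"));
    }
}
```

If the input is known to be plain ASCII, dropping (?U) should not change the result; the flag only matters once non-ASCII letters or digits show up.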

Java demo:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "This `99 is my small \"yy\"  xx`example_and_more ";
Pattern ptrn = Pattern.compile("(?U)(?>[^\\W\\d]\\p{M}*+)+|\\d+|[^\\w\\s]");
List<String> res = new ArrayList<>();
Matcher matcher = ptrn.matcher(str);
// Collect every match; whitespace is never matched, so it is skipped
while (matcher.find()) {
    res.add(matcher.group());
}
System.out.println(res);
// => [This, `, 99, is, my, small, ", yy, ", xx, `, example_and_more]
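If you are on Java 9 or later, the same loop can probably be written more compactly with Matcher.results(), which returns a stream of match results. The sketch below wraps it in a hypothetical tokenize helper (the class and method names are mine, not part of any library) and runs it on the string from the question.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical wrapper; only the pattern is taken from the answer.
final class StreamTokenizerSketch {
    // Compile once and reuse; Pattern instances are immutable and thread-safe.
    private static final Pattern TOKEN =
            Pattern.compile("(?U)(?>[^\\W\\d]\\p{M}*+)+|\\d+|[^\\w\\s]");

    // Returns all tokens in order of appearance (requires Java 9+ for results()).
    static List<String> tokenize(String input) {
        return TOKEN.matcher(input)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This `99 is my small \"yy\"  xx`example "));
        // => [This, `, 99, is, my, small, ", yy, ", xx, `, example]
    }
}
```

The behaviour should be identical to the while loop above; it is only a matter of style.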

      
