Regular expression for Latin words with accented characters

I need to split a text and keep only words, numbers and hyphenated words. The words can contain accented Latin letters, so I used \p{L}, which matches é, ú, ü, ã, etc. Example:

String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! @ # $ % ^& * ( ) + - _ #$% \"  ' : ; > < / \\  | ,  here some is wrong… * + () e -";

Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String words[] = pattern.split( myText );

      

What's wrong with this regex? Why does it let characters such as "(", "+", "-", "*" and "|" through?

Some results:

dresse     // OK
sud-est    // OK
occident)  // WRONG
987        // OK
()         // WRONG
(a         // WRONG
*          // WRONG
-          // WRONG
+          // WRONG
(          // WRONG
|          // WRONG

      

Explanation of regex:

[^\p{L}+(\-\p{L}+)*\d]+

 * Word separator will be:
 *     [^  ...  ]  No sequence in:
 *     \p{L}+        Any latin letter
 *     (\-\p{L}+)*   Optionally hyphenated
 *     \d            or numbers
 *     [ ... ]+      once or more.

      

+3




3 answers


If my understanding of your requirement is correct, this regex will match what you want:

"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"

      

It will match:

  • A contiguous sequence of Unicode Latin-script characters. I am limiting it to the Latin script because \p{L} will match a letter in any script. Change \\p{IsLatin} to \\pL if your Java version doesn't support the syntax.
  • Or several such sequences, separated by hyphens.
  • Or a contiguous sequence of decimal digits (0-9).

Pass the regex above to Pattern.compile, call matcher(String input) on the result to get a Matcher object, and use a loop to find the matches.



Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);

while (matcher.find()) {
    System.out.println(matcher.group());
}
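
For a self-contained version of the loop above, here is a minimal sketch. The class name LatinWordExtractor, the extract helper and the sample text are made up for illustration; the accented letters are written as Unicode escapes only so that the file compiles regardless of the source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatinWordExtractor {
    // Latin-script words (optionally hyphenated) or runs of decimal digits.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");

    // Hypothetical helper: collects every match instead of printing it.
    static List<String> extract(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00ee and \u00e9 are the accented i and e of the French sample.
        String sample = "987 (sud-est de l'\u00eele) + cath\u00e9drale * Notre-Dame | 1330";
        System.out.println(extract(sample));
    }
}
```

With this sample the list should contain 987, sud-est, de, l, île, cathédrale, Notre-Dame and 1330. Note that l'île comes out as two tokens, because the apostrophe is not part of the pattern.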

      

If you want to allow words containing an apostrophe ('):

"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"

      

I am also escaping the - in the character class ['\\-] just in case you want to add more characters. The - doesn't really need escaping if it is first or last in a character class, but I escape it anyway just to be safe.
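
To see what the apostrophe variant changes, here is a small comparison sketch. The class name ApostropheTokens, the findAll helper and the sample sentence are assumptions made for this example, not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApostropheTokens {
    // Hypothetical helper: returns every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00e9 is the accented e, escaped so the source stays ASCII-only.
        String sample = "l'une des cath\u00e9drales d'occident - 'quoted'";

        // Hyphen only: the apostrophe splits the elided words in two.
        System.out.println(findAll("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+", sample));

        // Hyphen or apostrophe allowed between two letter runs.
        System.out.println(findAll("\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+", sample));
    }
}
```

The first call should yield l, une, des, cathédrales, d, occident, quoted, and the second l'une, des, cathédrales, d'occident, quoted. A leading or trailing apostrophe is never included, because the pattern requires letters on both sides of it. Only the ASCII apostrophe is covered; a typographic apostrophe (U+2019) would have to be added to the class separately.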

+2




If the opening bracket of a character class is followed by ^, the characters listed inside the class are excluded. Thus, your regular expression matches anything other than Unicode letters, +, (, -, ), * and digits, occurring one or more times.

Please note that characters such as +, (, ), * etc. do not have any special meaning within a character class.
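
A quick way to check this is to test single characters against the class from the question. The sketch below assumes a made-up class name SeparatorCheck and helper isSeparator; only the regex itself comes from the question.

```java
import java.util.regex.Pattern;

public class SeparatorCheck {
    // The separator regex from the question, unchanged.
    private static final Pattern SEPARATOR =
            Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");

    // Hypothetical helper: true if the whole string is one separator run.
    static boolean isSeparator(String s) {
        return SEPARATOR.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Listed literally inside the negated class, so not separators.
        for (String s : new String[] {"(", ")", "+", "*", "-"}) {
            System.out.println(s + " -> " + isSeparator(s));   // false each time
        }
        // Not listed in the class, so these do act as separators.
        System.out.println("\", \" -> " + isSeparator(", "));  // true
    }
}
```

Because (, ), +, * and - are members of the negated class, split never treats them as separators, which is why they survive in the output.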

What pattern.split does is split the string around the substrings that match the regex. Your regex matches whitespace, so the string is split at every run of one or more whitespace characters, while characters such as ( or +, which the negated class excludes, stay inside the pieces. That is why you get the results you see.

For example, consider this



Pattern pattern = Pattern.compile("a");
for (String s : pattern.split("sda  a  f  g")) {
    System.out.println("==>" + s);
}

      

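The output will be the following (the second piece consists only of the two spaces between the two a characters, and the third keeps its inner spaces):

==>sd
==>  
==>  f  g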
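
The same mechanism can be reproduced with the regex from the question on a short input. This is only an illustrative sketch: the class name QuestionSplitDemo, the helper splitWithQuestionRegex and the sample string are invented here.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class QuestionSplitDemo {
    // Splits with the separator regex from the question, unchanged.
    static String[] splitWithQuestionRegex(String text) {
        Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
        return pattern.split(text);
    }

    public static void main(String[] args) {
        String[] words = splitWithQuestionRegex("sud-est (occident) * 987 , fin");
        // Prints: [sud-est, (occident), *, 987, fin]
        System.out.println(Arrays.asList(words));
    }
}
```

Spaces and the comma are consumed as separators, but the parentheses and the asterisk are not, so they end up inside or between the words just as in the question's results.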

+2




A character class written with [] can only contain literal characters, classes (\p{...}), ranges (for example a-z), and the complement character (^). You have to put the other metacharacters you use (+*()) outside the [ ] block.

0

