Regular expression for words with Latin characters
I need to split a text and keep only words, numbers and hyphenated words. The words may contain accented Latin letters, so I used \p{L},
which matches é, ú, ü, ã, etc. Example:
String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! @ # $ % ^& * ( ) + - _ #$% \" ' : ; > < / \\ | , here some is wrong… * + () e -";
Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String words[] = pattern.split( myText );
What's wrong with this regex? Why does it keep characters such as "(", "+", "-", "*" and "|"?
Some results:
dresse // OK
sud-est // OK
occident) // WRONG
987 // OK
() // WRONG
(a // WRONG
* // WRONG
- // WRONG
+ // WRONG
( // WRONG
| // WRONG
Explanation of regex:
[^\p{L}+(\-\p{L}+)*\d]+
* Word separator will be:
* [^ ... ] No sequence in:
* \p{L}+ Any latin letter
* (\-\p{L}+)* Optionally hyphenated
* \d or numbers
* [ ... ]+ once or more.
If my understanding of your requirement is correct, this regex will match what you want:
"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"
It will match:
- A contiguous sequence of Unicode Latin script characters. I am limiting it to the Latin script because \p{L} will match a letter in any script. Change \\p{IsLatin} to \\pL if your Java version doesn't support the syntax.
- Or several such sequences, separated by hyphens.
- Or a contiguous sequence of decimal digits (0-9).
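The difference between \p{L} and \p{IsLatin} can be checked directly. This is only a minimal sketch, assuming Java 7 or later (where the \p{IsLatin} script syntax is available). The class name ScriptCheck, its helper methods and the two sample letters are made up for illustration:

```java
import java.util.regex.Pattern;

class ScriptCheck {
    // \p{L} accepts a letter from any script.
    static final Pattern ANY_LETTER = Pattern.compile("\\p{L}");
    // \p{IsLatin} accepts only characters of the Latin script.
    static final Pattern LATIN = Pattern.compile("\\p{IsLatin}");

    static boolean isAnyLetter(String s) {
        return ANY_LETTER.matcher(s).matches();
    }

    static boolean isLatin(String s) {
        return LATIN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";     // e with acute accent, Latin script
        String cyrillicDe = "\u0434"; // Cyrillic small letter de
        System.out.println(isAnyLetter(eAcute) + " " + isLatin(eAcute));         // true true
        System.out.println(isAnyLetter(cyrillicDe) + " " + isLatin(cyrillicDe)); // true false
    }
}
```

So switching to \\pL keeps the accented letters but also lets letters from other scripts through.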
Pass the regex to Pattern.compile, call matcher(String input) to get a Matcher object, and loop over find() to retrieve the matches.
Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);
while (matcher.find()) {
System.out.println(matcher.group());
}
If you want to allow words with apostrophes ('):
"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"
I am also escaping - in the character class ['\\-] just in case you want to add more characters. The - doesn't really need escaping if it is first or last in a character class, but I escape it anyway just to be safe.
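To see what the apostrophe changes, the two regexes can be run side by side. This is a sketch only: the class TokenDemo, the helper tokens and the short sample string are not from the question, and \p{IsLatin} again assumes Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenDemo {
    // Word parts joined by hyphens only, or a run of digits.
    static final Pattern HYPHEN_ONLY =
        Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
    // Same, but an apostrophe may also join two word parts.
    static final Pattern WITH_APOSTROPHE =
        Pattern.compile("\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+");

    // Collects every match of the pattern, in order of appearance.
    static List<String> tokens(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // The escape in the literal is i with circumflex; it is written
        // as an escape so the file compiles under any source encoding.
        String sample = "(sud-est de l'\u00eele) + 1330 -";
        System.out.println(tokens(HYPHEN_ONLY, sample));     // [sud-est, de, l, île, 1330]
        System.out.println(tokens(WITH_APOSTROPHE, sample)); // [sud-est, de, l'île, 1330]
    }
}
```

In both cases the stray parentheses, the + and the lone trailing hyphen are simply never matched, so they do not show up in the result.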
If the opening bracket of a character class is followed by ^, the characters listed inside the class are the ones that are not allowed. Thus, your regular expression matches anything other than Unicode letters, +, (, -, ), * and digits, occurring one or more times.
Please note that characters such as +, (, ), * etc. do not have any special meaning within a character class.
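A short check makes this visible. The sketch below compiles the regex from the question unchanged and splits a small made-up fragment. The class name NegatedClassDemo and the helper split are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // The regex from the question: a single negated character class, so
    // +, (, ), * and - are plain members of the class, not operators.
    static final Pattern ORIGINAL =
        Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");

    // Splits the input wherever a run of characters outside the class occurs.
    static List<String> split(String input) {
        return Arrays.asList(ORIGINAL.split(input));
    }

    public static void main(String[] args) {
        // Space, ':' and '!' are outside the class, so they act as separators;
        // ')', '(' and '*' are inside it, so they stay in the pieces.
        System.out.println(split("occident) : ! (a * b")); // [occident), (a, *, b]
    }
}
```

This reproduces results like "occident)" and "(a" from the question: the parentheses and the * are treated as part of a word rather than as separators.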
What pattern.split does is split the string at every substring that matches the regex. Your regex matches whitespace (and any other character outside the listed set), and hence the split occurs at each run of one or more such characters. So the result will be like this.
For example, consider this
Pattern pattern = Pattern.compile("a");
for (String s : pattern.split("sda a f g")) {
System.out.println("==>"+s);
}
The output will be (the second piece is a single space)
==>sd
==> 
==> f g