Regular expression to count words in a sentence
public static int getWordCount(String sentence) {
    return sentence.split("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)", -1).length
        + sentence.replaceAll("([[a-z][A-Z][0-9][\\W][-][_]]*)", "").length() - 1;
}
My intention is to count the number of words in a sentence. The input to this function is a long sentence, which can contain up to 255 words.
- A word may contain hyphens or underscores between its characters.
- The function should only count valid words, so runs of special characters should not be counted; for example, &&&& or #### should not be counted as a word.
The above regex works fine, but when a hyphen or underscore appears inside a word, as in co-operation, the count returned is 2 when it should be 1. Can anyone help?
Instead of using .split and .replaceAll, which are quite expensive operations, use a constant-memory approach.
As per your requirements, you seem to be looking for the following regex:
[\w-]+
You can then use this approach to count the number of matches:
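import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static int getWordCount(String sentence) {
    // Match runs of word characters (letters, digits, underscore) and hyphens.
    Pattern pattern = Pattern.compile("[\\w-]+");
    Matcher matcher = pattern.matcher(sentence);
    int count = 0;
    // Each successful find() is one word; no intermediate array is built.
    while (matcher.find()) {
        count++;
    }
    return count;
}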
This approach uses (closer to) constant memory: when splitting, the program builds an array, which is essentially useless since you never inspect its contents.
If you don't want words to start or end with a hyphen, you can use the following regular expression:
\w+([-]\w+)*
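To see how the two patterns differ, here is a small sketch that counts matches with each one. The class name WordCountDemo, the constants LOOSE and STRICT, and the sample sentence are made up for illustration; only the two regular expressions come from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCountDemo {
    // Loose pattern: any run of word characters and/or hyphens.
    public static final Pattern LOOSE = Pattern.compile("[\\w-]+");
    // Stricter pattern: hyphens are only allowed between runs of word characters.
    public static final Pattern STRICT = Pattern.compile("\\w+([-]\\w+)*");

    // Counts non-overlapping matches of the pattern in the sentence.
    public static int count(Pattern pattern, String sentence) {
        Matcher matcher = pattern.matcher(sentence);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample input containing a stray "--" and special characters.
        String sentence = "co-operation is a well-known word -- && ####";
        System.out.println(count(LOOSE, sentence));  // the stray "--" counts as a word here
        System.out.println(count(STRICT, sentence)); // the stray "--" is skipped here
    }
}
```

With this input, the loose pattern counts the standalone "--" as a word, while the stricter one does not; neither counts "&&" or "####".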
This part, ([-][_])*, is wrong. The notation [xyz] means "any one of the characters within the brackets" (see http://www.regular-expressions.info/charclass.html ). So [-][_] matches exactly the character - followed by exactly the character _, in that order.
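The difference between two adjacent character classes and a single combined class can be checked directly. This is a minimal sketch; the class name CharClassDemo, the helper matchesWhole, and the surrounding a...b test strings are invented for the example.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Returns true when the whole input matches the regex.
    public static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // [-][_] is a hyphen followed by an underscore, so a lone hyphen fails.
        System.out.println(matchesWhole("a([-][_])*b", "a-_b")); // true
        System.out.println(matchesWhole("a([-][_])*b", "a-b"));  // false

        // [-_] is either a hyphen or an underscore.
        System.out.println(matchesWhole("a([-_])*b", "a-b"));    // true
        System.out.println(matchesWhole("a([-_])*b", "a_b"));    // true
    }
}
```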
Fixing your group does the job:
[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*
It can be simplified using \w to
\w+(-\w+)*
because \w matches 0..9, A..Z, a..z and _ ( http://www.regular-expressions.info/shorthand.html ), so you only need to add -.
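One caveat worth checking: the shorthand version is not strictly equivalent to the explicit one, because \w also lets an underscore stand on its own or at the edge of a word. The sketch below compares the two; the class name SimplifiedPatternDemo, the constant names, and the sample inputs are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimplifiedPatternDemo {
    // Explicit form: hyphen or underscore only between alphanumeric runs.
    public static final Pattern EXPLICIT =
            Pattern.compile("[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*");
    // Shorthand form: underscore is already part of \w, so only the hyphen is added.
    public static final Pattern SHORTHAND = Pattern.compile("\\w+(-\\w+)*");

    // Counts non-overlapping matches of the pattern in the sentence.
    public static int count(Pattern pattern, String sentence) {
        Matcher matcher = pattern.matcher(sentence);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Both forms treat these as three words.
        String typical = "snake_case and co-operation";
        System.out.println(count(EXPLICIT, typical));
        System.out.println(count(SHORTHAND, typical));

        // A standalone underscore is a match for \w+ but not for the explicit form.
        String edgeCase = "a _ b";
        System.out.println(count(EXPLICIT, edgeCase));
        System.out.println(count(SHORTHAND, edgeCase));
    }
}
```

If a lone underscore should not count as a word, the explicit form is the safer choice; otherwise the shorthand is easier to read.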