Regular expression to count words in a sentence

public static int getWordCount(String sentence) {
    return sentence.split("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)", -1).length
         + sentence.replaceAll("([[a-z][A-Z][0-9][\\W][-][_]]*)", "").length() - 1;
}

      

My intention is to count the number of words in a sentence. The input to this function is a long sentence, which can contain up to 255 words.

  • A word may contain hyphens or underscores between its characters.
  • The function should count only valid words, so special characters should not be counted; for example, && & & or #### should not be counted as words.

The above regex works fine, but when a hyphen or underscore appears inside a word, such as co-operation, the count is returned as 2 when it should be 1. Can anyone help?

+3




3 answers


Instead of using .split and .replaceAll, which are quite expensive operations, use an approach with constant memory usage.

As per your requirements, you seem to be looking for the following regex:

[\w-]+

      

You can then use this approach to count the number of matches:

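import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static int getWordCount(String sentence) {
    // A word is a run of word characters (letters, digits, underscore) or hyphens.
    Pattern pattern = Pattern.compile("[\\w-]+");
    Matcher matcher = pattern.matcher(sentence);
    int count = 0;
    // Each successful find() advances to the next word; no array is built.
    while (matcher.find()) {
        count++;
    }
    return count;
}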

      




This approach uses (more nearly) constant memory: when splitting, the program builds an array, which is wasted work since you never inspect the contents of the array.
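A minimal usage sketch of the matcher-based counter follows. The wrapper class name WordCountDemo and the sample sentences are assumptions made for illustration; they do not come from the question. The pattern is compiled once into a constant, which is an optional tweak rather than something the answer requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, used only for this illustration.
class WordCountDemo {
    // Compiled once and reused across calls; Pattern objects are immutable.
    private static final Pattern WORD = Pattern.compile("[\\w-]+");

    static int getWordCount(String sentence) {
        Matcher matcher = WORD.matcher(sentence);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The hyphenated word is counted once.
        System.out.println(getWordCount("co-operation is a word")); // 4
        // Runs of special characters are skipped.
        System.out.println(getWordCount("this && that #### done")); // 3
    }
}
```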

If you don't want words to start or end with a hyphen, you can use the following regular expression:

\w+([-]\w+)*
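To see how the two patterns differ, here is a small sketch that counts matches of each on a sentence containing stray hyphens. The class name HyphenEdgeDemo, the helper count and the sample sentence are hypothetical, chosen only for this comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class comparing the permissive and the stricter pattern.
class HyphenEdgeDemo {
    // Counts non-overlapping matches of regex in sentence.
    static int count(String regex, String sentence) {
        Matcher m = Pattern.compile(regex).matcher(sentence);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "well-known - and -- stray";
        // Permissive: a lone "-" or "--" also counts as a word.
        System.out.println(count("[\\w-]+", s));        // 5
        // Stricter: hyphens count only between word characters.
        System.out.println(count("\\w+([-]\\w+)*", s)); // 3
    }
}
```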

      

+4




This part, ([-][_])*, is wrong. The notation [xyz] means "any one of the characters within the brackets" (see http://www.regular-expressions.info/charclass.html ). So your group requires exactly the character - followed by exactly the character _, in that order.
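The difference between two adjacent single-character classes and one combined class can be checked directly with String.matches, which tests the whole input. The class name CharClassDemo and the helper fullMatch are assumptions made for this sketch.

```java
// Hypothetical class illustrating character-class semantics.
class CharClassDemo {
    // True if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Two classes in sequence: "-" must be followed by "_".
        System.out.println(fullMatch("([-][_])*", "-_")); // true
        System.out.println(fullMatch("([-][_])*", "-"));  // false
        // One class with both characters: either one matches on its own.
        System.out.println(fullMatch("[-_]", "-"));       // true
        System.out.println(fullMatch("[-_]", "_"));       // true
        System.out.println(fullMatch("[-_]", "-_"));      // false, that is two characters
    }
}
```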

Fixing your group does the job:

[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*

      



and it can be simplified using \w to

\w+(-\w+)*

      

because \w matches 0..9, A..Z, a..z and _ ( http://www.regular-expressions.info/shorthand.html ), so you only need to add -.
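As a quick check of that claim, the sketch below compares \w with the explicit class [a-zA-Z0-9_] over the ASCII range. It assumes Java's default regex mode (without the UNICODE_CHARACTER_CLASS flag); the class name WordCharDemo and the method asciiMismatches are hypothetical.

```java
// Hypothetical class checking what \w covers in Java's default mode.
class WordCharDemo {
    // Counts ASCII characters on which \w and [a-zA-Z0-9_] disagree.
    static int asciiMismatches() {
        int mismatches = 0;
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (s.matches("\\w") != s.matches("[a-zA-Z0-9_]")) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println(asciiMismatches());  // 0
        System.out.println("_".matches("\\w")); // true: underscore is a word character
        System.out.println("-".matches("\\w")); // false: the hyphen has to be added explicitly
    }
}
```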

+3




If you can use Java 8:

import java.util.Arrays;

long wordCount = Arrays.stream(sentence.split(" ")) // split the sentence on spaces
        .filter(s -> s.matches("[\\w-]+"))          // keep only tokens that are entirely a valid word
        .count();
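One behavior worth checking before adopting the stream version: matches tests each space-separated token as a whole, so a word with punctuation attached is dropped rather than counted. The sketch below compares it with a find-based count; the class name StreamCountDemo, both method names and the sample sentences are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class comparing whole-token matching with find-based counting.
class StreamCountDemo {
    // Split on spaces, then keep tokens that consist only of word characters or hyphens.
    static long streamCount(String sentence) {
        return Arrays.stream(sentence.split(" "))
                .filter(s -> s.matches("[\\w-]+"))
                .count();
    }

    // Count every run of word characters or hyphens, wherever it appears.
    static long matcherCount(String sentence) {
        Matcher m = Pattern.compile("[\\w-]+").matcher(sentence);
        long n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(streamCount("plain words only"));  // 3
        System.out.println(matcherCount("plain words only")); // 3
        // "Hello," contains a comma, so the whole token fails matches().
        System.out.println(streamCount("Hello, world"));      // 1
        System.out.println(matcherCount("Hello, world"));     // 2
    }
}
```

Which result is preferable depends on whether tokens with adjacent punctuation should count as words.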

      

+2

