Java (Regex) - get all words in a sentence

I need to split a Java string into an array of words. Let's say the line is:

"Hi!! I need to split this string, into a serie's of words?!"

      

At the moment I am using String[] strs = str.split("(?!\\w)"), however it stores symbols like ! in the array, and it also stores strings like "Hello!" in the array. The string I split will always be lowercase. I would like to create an array that looks like this: {"hi", "i", "need", "to", "split", "this", "string", "into", "a", "serie's", "of", "words"}

- Note that the apostrophe is preserved.

How can I change my regex to not include characters in the array?

To clarify: I would define a word as a sequence of alphanumeric characters, including an apostrophe only when it appears in the context above, such as "serie's", and not when it is used as a quotation mark around a word, such as 'this'. Also, in this context, "hello," or "hello-human" are not words, but "hello" and "human" are. Hope this clears up the question.

+3




7 replies


You can remove the punctuation characters (!, ? and ,) and then split into words:

str = str.replaceAll("[!?,]", "");
String[] words = str.split("\\s+");

      



Result:

Hi, I, need, to, split, this, string, into, a, serie's, of, words

+9




This should work for what you want:

// needs: import java.util.Arrays;
String line = "Hi!! I need to split this string, into a serie's of words?! but not '' or ''' word";
String regex = "([^a-zA-Z']+)'*\\1*";
String[] split = line.split(regex);
System.out.println(Arrays.asList(split));

      



gives

[Hi, I, need, to, split, this, string, into, a, serie's, of, words, but, not, or, word]
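To see how this delimiter regex behaves, here is a small runnable sketch. The class name DelimiterSplitDemo and the second test string are my own, added only for illustration. The pattern consumes a run of characters that are neither letters nor apostrophes, then any apostrophes, then an optional repeat of the same run through the \1 backreference, which is why a standalone '' between two spaces disappears. An apostrophe that touches a letter is never part of a delimiter, so a quoted word keeps its quotes.

```java
import java.util.Arrays;

public class DelimiterSplitDemo {
    // Same delimiter regex as in the answer above.
    private static final String REGEX = "([^a-zA-Z']+)'*\\1*";

    // Hypothetical helper, named for this sketch only.
    static String[] split(String line) {
        return line.split(REGEX);
    }

    public static void main(String[] args) {
        // Standalone apostrophe runs surrounded by the same delimiter are swallowed.
        System.out.println(Arrays.toString(
                split("Hi!! I need to split this string, into a serie's of words?! but not '' or ''' word")));

        // Apostrophes attached to letters are kept, so quotes around a word survive.
        System.out.println(Arrays.toString(split("'quoted' word")));
        // prints ['quoted', word]
    }
}
```

So this covers the sample sentence, but it does not by itself strip quotation marks that wrap a word.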

      

+3




If you define a word as a sequence of non-whitespace characters (whitespace as defined by \s), then you can split on whitespace:

str.split("\\s+")

      

Note that ";.';.@#$>?>@4", "very,bad,punctuation" and "'goodbye'" are all words under the definition above.

Another approach is to define a word as a sequence of characters drawn from a set of valid characters. If you want to allow a-z, A-Z and ' as part of a word, you can split on everything else:

str.split("[^a-zA-Z']+")

      

This will still allow "''''''" to count as a word.
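If the goal is the stricter definition from the question (an apostrophe counts only when it sits between letters), one alternative is to match the words instead of splitting on delimiters. This is a different technique from the split calls above, sketched under the assumption that words consist of ASCII letters only. The class name WordMatcher and the pattern are illustrative, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordMatcher {
    // One or more letters, optionally followed by groups of
    // "apostrophe + letters" (so serie's matches, '''''' does not).
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+(?:'[a-zA-Z]+)*");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Lowercasing is optional; the question expects lowercase output.
            result.add(m.group().toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hi!! I need to split this string, into a serie's of words?!"));
        // prints [hi, i, need, to, split, this, string, into, a, serie's, of, words]

        System.out.println(words("'''''' 'goodbye' hello-human"));
        // prints [goodbye, hello, human]
    }
}
```

Because only matches are collected, there are no empty strings to filter out, and surrounding quotes or hyphens never end up inside a word.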

+2




You can split on the characters that you consider "non-word" characters:

String[] strs = str.split("[,!? ]+");

      

0




I would use str.split("[\\s,?!]+"). You can add any character you want to split on inside the brackets [].
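One thing to watch with any delimiter-based split: when the string starts with a delimiter, String.split puts an empty string at the front of the array. A minimal sketch that filters those out is below. The class name SplitOnDelimiters and the second sample string are made up for the example.

```java
import java.util.Arrays;

public class SplitOnDelimiters {
    static String[] words(String str) {
        return Arrays.stream(str.split("[\\s,?!]+"))
                // Drop the empty leading element produced when str begins with a delimiter.
                .filter(s -> !s.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                words("Hi!! I need to split this string, into a serie's of words?!")));
        // prints [Hi, I, need, to, split, this, string, into, a, serie's, of, words]

        System.out.println(Arrays.toString(words("?! leading punctuation")));
        // prints [leading, punctuation]
    }
}
```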

0




So you want to split on anything that is not a letter [a-zA-Z] and is not an apostrophe. This regex will do that: "[^a-zA-Z']\s". There will be a problem if the string contains a quotation enclosed in ' characters.

I usually use this page to test my regexes: http://www.regexplanet.com/advanced/java/index.html

0




myString.replaceAll("[^a-zA-Z'\\s]","").toLowerCase().split("\\s+");

      

The replaceAll("[^a-zA-Z'\\s]","") call replaces every character that is not a-z, A-Z, ' or whitespace with nothing (""). The toLowerCase method then converts the string returned by replaceAll to lowercase. Finally, split breaks the string on whitespace. A more readable version:

myString = myString.replaceAll("[^a-zA-Z'\\s]","");
myString = myString.toLowerCase();
String[] strArr = myString.split("\\s+");
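The same three steps as a runnable sketch. The class and method names (CleanLowerSplit, toWords) are mine, and I added a trim() and an empty-string check that are not in the snippet above, because removing leading punctuation can leave leading whitespace, which would otherwise produce an empty first element.

```java
import java.util.Arrays;
import java.util.Locale;

public class CleanLowerSplit {
    static String[] toWords(String myString) {
        // 1. Drop everything except letters, apostrophes and whitespace.
        String cleaned = myString.replaceAll("[^a-zA-Z'\\s]", "");
        // 2. Lowercase; trim() is an addition so the split has no empty first element.
        cleaned = cleaned.toLowerCase(Locale.ROOT).trim();
        // 3. Split on runs of whitespace (an empty input yields an empty array).
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                toWords("Hi!! I need to split this string, into a serie's of words?!")));
        // prints [hi, i, need, to, split, this, string, into, a, serie's, of, words]

        // The hyphen is deleted rather than treated as a separator.
        System.out.println(Arrays.toString(toWords("hello-human")));
        // prints [hellohuman]
    }
}
```

Note the second case: since the hyphen is removed instead of being replaced by a space, "hello-human" becomes one word, which is not what the question asks for. Replacing the unwanted characters with a space instead of "" would avoid that.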

      

0








