Java (Regex) - get all words in a sentence
I need to split a Java string into an array of words. Let's say the line is:
"Hi!! I need to split this string, into a serie's of words?!"
At the moment I am using String[] strs = str.split("(?!\\w)"), but it stores symbols like ! in the array, and it also stores strings like "Hello!". The string I split will always be lowercase. I would like to create an array that looks like this:
{"hi", "i", "need", "to", "split", "this", "string", "into", "a", "serie's", "of", "words"}
- Note that the apostrophe is preserved.
How can I change my regex so that these symbols are not included in the array?
Sorry, to clarify: I would define a word as a sequence of alphanumeric characters, with the ' character included only when it appears in the context above, such as "serie's", not when it is used as a quotation mark, as in "his'". Also, in this context, "hello," and "hello-human" are not words, but "hello" and "human" are. I hope this clears up the question.
This should work for what you want:
String line = "Hi!! I need to split this string, into a serie's of words?! but not '' or ''' word";
String regex = "([^a-zA-Z']+)'*\\1*";
String[] split = line.split(regex);
System.out.println(Arrays.asList(split));
gives
[Hi, I, need, to, split, this, string, into, a, serie's, of, words, but, not, or, word]
If you define a word as a sequence of non-whitespace characters (with whitespace as defined by \s), then you can split on whitespace:
str.split("\\s+")
Note that ";.';.@#$>?>@4", "very,bad,punctuation" and "'goodbye'" are all words under the definition above.
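To see what that definition means in practice, here is a minimal sketch that splits on \s+ and leaves punctuation attached to the tokens. The class name WhitespaceSplitDemo and the sample input are made up for illustration:

```java
import java.util.Arrays;

// Illustrative only: class name and sample input are not from the answer.
final class WhitespaceSplitDemo {

    // Splits on runs of whitespace; punctuation stays inside the tokens.
    static String[] words(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String[] tokens = words("very,bad,punctuation 'goodbye' hi!!");
        System.out.println(Arrays.toString(tokens));
        // prints [very,bad,punctuation, 'goodbye', hi!!]
    }
}
```

If the input can start with whitespace, the first array element will be an empty string, so calling trim() first may be worth considering.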
Another approach is to define a word as a sequence of characters from a set of valid characters. If you want to allow a-z, A-Z and ' as part of a word, you can split on everything else:
str.split("[^a-zA-Z']+")
This will still allow "''''''" to be counted as a word.
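One possible way around that, sketched here under the assumption that apostrophes at the start or end of a token are quotation marks and can be dropped: split on [^a-zA-Z']+ as above, strip the outer apostrophes from each token, and discard whatever ends up empty. The class WordSplitSketch and its words method are hypothetical names, not part of any answer here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; assumes outer apostrophes are quotes, inner ones belong to the word.
final class WordSplitSketch {

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        // Everything except letters and apostrophes acts as a separator.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-zA-Z']+")) {
            // Remove apostrophes at either end; "serie's" keeps its inner one.
            String word = token.replaceAll("^'+|'+$", "");
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("hi!! i need a serie's of words?! but not '' or 'his'"));
        // prints [hi, i, need, a, serie's, of, words, but, not, or, his]
    }
}
```

The empty-string check also takes care of the empty first element that split returns when the input begins with a separator.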
So you want to split on anything that is not a word character [a-zA-Z] and is not a '? This regex will do that: "[^a-zA-Z']\s". There will be a problem if the string contains a quotation enclosed in '.
I usually use this page to test my regexes: http://www.regexplanet.com/advanced/java/index.html
myString.replaceAll("[^a-zA-Z'\\s]","").toLowerCase().split("\\s+");
The replaceAll("[^a-zA-Z'\\s]","") call replaces every character that is not a-z, A-Z, ' or whitespace with nothing (""), and then toLowerCase converts the string returned by replaceAll to lowercase. Finally, we split the string on whitespace. A more readable version:
myString = myString.replaceAll("[^a-zA-Z'\\s]","");
myString = myString.toLowerCase();
String[] strArr = myString.split("\\s+");
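For anyone who wants to try this, here is a small runnable sketch of the same three steps. The class name ReplaceThenSplitDemo and the sample input are invented for illustration. One thing it makes visible: because punctuation is deleted rather than treated as a separator, a token like "hello-human" comes out as a single word instead of two.

```java
import java.util.Arrays;

// Illustrative wrapper around the three steps above; names and input are made up.
final class ReplaceThenSplitDemo {

    static String[] words(String text) {
        // 1. Delete everything that is not a letter, an apostrophe or whitespace.
        String cleaned = text.replaceAll("[^a-zA-Z'\\s]", "");
        // 2. Lowercase the remaining text.
        cleaned = cleaned.toLowerCase();
        // 3. Split on runs of whitespace.
        return cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("Hello-human, it's me!")));
        // prints [hellohuman, it's, me]
    }
}
```

If hyphenated or comma-joined text should produce separate words, splitting on the unwanted characters, as in the other answers, may be a better fit than removing them.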