How to handle contractions with regex word boundaries in JavaScript
I have a Node.js script that reads in a file and counts word frequencies. I am currently passing each line to a function:
function getWords(line) {
  return line.match(/\b\w+\b/g);
}
This matches almost everything, except that it splits contractions:
getWords("I'm") -> {"I", "m"}
However, I can't just add the apostrophe to the character class, because I would like other apostrophes to still be treated as word boundaries:
getWords("hey'there'") -> {"hey", "there"}
Is there a way to capture contractions while treating other apostrophes as word boundaries?
The closest I believe you can get with a regex would be line.match(/(?!'.*')\b[\w']+\b/g), but keep in mind that if there is no space between a word and a ', it will be treated as a contraction.
As Aaron Dufour already mentioned, there is no way for a regex to know that I'm is a contraction but hey'there is not.
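To make that limitation concrete, here is a minimal sketch that runs the regex from this answer on the inputs from the question. The second half is an assumption added for illustration, not something from the question or the answer: it only accepts an apostrophe when it is followed by one of a few common English contraction endings. The names getWordsAnswer, getWordsWithSuffixList and CONTRACTION_SUFFIXES are hypothetical, and the suffix list is far from complete.

```javascript
// Regex taken from the answer: words may contain apostrophes, and a match
// may not start on an apostrophe that is followed by another apostrophe.
const answerRegex = /(?!'.*')\b[\w']+\b/g;

function getWordsAnswer(line) {
  // match() returns null when nothing matches, so fall back to an empty array
  return line.match(answerRegex) || [];
}

console.log(getWordsAnswer("I'm"));         // [ "I'm" ]
console.log(getWordsAnswer("hey 'there'")); // [ 'hey', 'there' ]
console.log(getWordsAnswer("hey'there'"));  // [ "hey'there" ]  (treated as a contraction)

// Hypothetical heuristic (an assumption, not part of the answer): keep an
// apostrophe only when a known contraction ending follows it.
const CONTRACTION_SUFFIXES = ["m", "s", "t", "d", "re", "ve", "ll"];
const suffixRegex = new RegExp(
  "\\w+(?:'(?:" + CONTRACTION_SUFFIXES.join("|") + ")\\b)?",
  "gi"
);

function getWordsWithSuffixList(line) {
  return line.match(suffixRegex) || [];
}

console.log(getWordsWithSuffixList("I'm"));        // [ "I'm" ]
console.log(getWordsWithSuffixList("hey'there'")); // [ 'hey', 'there' ]
```

The heuristic still guesses: an input such as hey's would be kept as one word, so it narrows the problem rather than solving it.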