How to handle contractions with regex word boundaries in JavaScript

I have a Node.js script that reads in a file and counts word frequencies. I am currently passing each line to a function:

function getWords(line) {
    return line.match(/\b\w+\b/g);
}

      

This matches almost everything, except that it splits contractions:

getWords("I'm") -> {"I", "m"}

      

However, I can't just add the apostrophe to the character class, as I would like other apostrophes to be treated as word boundaries:

getWords("hey'there'") -> {"hey", "there"}

      

Is there a way to capture contractions while treating other apostrophes as word boundaries?

+3




2 answers


The closest I believe you can get with a regex would be line.match(/(?!'.*')\b[\w']+\b/g), but keep in mind that if there is no space between a word and a ', it will be treated as a contraction.

As Aaron Dufour already mentioned, there is no way for a regex to know that I'm is a contraction but hey'there is not.
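To make that caveat concrete, here is a minimal sketch that runs the pattern from this answer on a few inputs. The helper name getWordsLookahead and the sample strings are assumptions made for illustration; only the regex itself comes from the answer.

```javascript
// Pattern taken from the answer: skip a position that starts a quoted
// run ('...'), then match word characters and apostrophes between
// word boundaries.
const lookaheadPattern = /(?!'.*')\b[\w']+\b/g;

// Hypothetical helper; returns [] instead of null when nothing matches.
function getWordsLookahead(line) {
    return line.match(lookaheadPattern) || [];
}

console.log(getWordsLookahead("I'm"));         // [ "I'm" ]
console.log(getWordsLookahead("'hey' there")); // [ 'hey', 'there' ]
console.log(getWordsLookahead("hey'there'"));  // [ "hey'there" ]
```

The last call shows the limitation: with no space around the inner apostrophe, hey'there is kept as one token, exactly like a contraction.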




+2




You can match a run of letters, optionally followed by an apostrophe and more letters.



line.match(/[A-Za-z]+('[A-Za-z]+)?/g)
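As a rough sketch of how this pattern could slot into the word-frequency script from the question, the snippet below wraps it in a helper and counts the tokens. The names getWordsLetters and countWords and the sample sentence are assumptions, not part of the original answer.

```javascript
// Hypothetical helper built around the pattern from this answer.
// Returns [] instead of null when a line contains no words.
function getWordsLetters(line) {
    return line.match(/[A-Za-z]+('[A-Za-z]+)?/g) || [];
}

// Hypothetical frequency counter: maps each lowercased token to its count.
function countWords(lines) {
    const counts = new Map();
    for (const line of lines) {
        for (const word of getWordsLetters(line)) {
            const key = word.toLowerCase();
            counts.set(key, (counts.get(key) || 0) + 1);
        }
    }
    return counts;
}

console.log(getWordsLetters("I'm"));                     // [ "I'm" ]
console.log(getWordsLetters("don't stop, it's 'fine'")); // [ "don't", 'stop', "it's", 'fine' ]
console.log(getWordsLetters("hey'there'"));              // [ "hey'there" ]
console.log(countWords(["I'm here", "i'm there"]).get("i'm")); // 2
```

Leading and trailing apostrophes are dropped, but an apostrophe between two letters always joins them, so hey'there still comes back as a single token. Unlike \w, this character class also excludes digits and underscores.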

      

+1

