How do I create a RegEx pattern that returns the first N words using a custom word boundary?

I need a RegEx pattern that will return the first N words using a custom word boundary, which is the usual RegEx whitespace (\s) plus punctuation marks like .,;:!?-*_

EDIT # 1: Thanks for your comments.

To be clear:

  • I would like to specify which characters are word delimiters
  • Let's call it the "Delimiter Set", or strDelimiters
  • strDelimiters = ".,;:!?-*_"

  • nNumWordsToFind = 5

  • A word is defined as any contiguous text that does NOT contain a character in strDelimiters
  • A RegEx word boundary is any contiguous run of one or more characters in strDelimiters
  • I would like to create a RegEx pattern that returns the first nNumWordsToFind words using strDelimiters.

EDIT # 2 Sat 08 Aug 2015 12:49 pm US CT

@maraca definitely answered my question as originally stated. But what I really need is to return a number of words ≤ nNumWordsToFind. So if the source text has only 3 words but my RegEx asks for 4, I need it to return 3 words. The answer provided by maraca fails if nNumWordsToFind > the number of actual words in the source text.

For example:

one,two;three-four_five.six:seven eight    nine! ten

      

This should be seen as 10 words. If I want the first 5 words, it should return:

one,two;three-four_five.

      

I have this pattern, which uses ordinary whitespace and works, but it is NOT exactly what I need:

([\w]+\s+){<NumWordsOut>}

      

where <NumWordsOut> is the number of words returned.

I also found this word boundary pattern, but I don't know how to use it. It is described as a "real word boundary", which defines the boundary between an ASCII letter and a non-letter:

(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])

      

However, I would like my words to include digits as well.

IAC, I was unable to use the above word boundary pattern to return the first N words of my text.

By the way, I will be using this in a Keyboard Maestro macro.

Can anyone please help? TIA.

2 answers


All you have to do is adapt your pattern ([\w]+\s+){<NumWordsOut>} to the following, which also handles some special cases:

^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
1.             2.              3.             4.  5.

      



  1. Match any number of delimiters before the first word.
  2. Match a word (at least one non-delimiter character).
  3. The word must be followed by at least one delimiter,
  4. or it can be at the end of the line (in case there is no delimiter at the end).
  5. Repeat steps 2 to 4 <NumWordsOut> times.
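As a minimal sketch of how the pattern behaves, here it is in Python (the question targets Keyboard Maestro, so Python is only used for illustration). The names DELIMS and first_n_words_pattern are hypothetical, and the at_most option, which swaps {N} for {1,N}, is an assumption about one way to cover the "fewer than N words" case from EDIT #2, not part of the answer above.

```python
import re

# Delimiter class body: whitespace plus the punctuation from the question.
# The hyphen is placed last so that it is a literal inside [...].
DELIMS = r"\s.,;:!?*_-"

def first_n_words_pattern(n: int, at_most: bool = False) -> re.Pattern:
    """Build the answer's pattern for n words.

    With at_most=True the quantifier becomes {1,n} (an assumption, see above),
    so texts with fewer than n words still match.
    """
    quantifier = f"{{1,{n}}}" if at_most else f"{{{n}}}"
    return re.compile(
        rf"^[{DELIMS}]*(?:[^{DELIMS}]+(?:[{DELIMS}]+|$)){quantifier}"
    )

text = "one,two;three-four_five.six:seven eight    nine! ten"

# Exactly 5 words, including the delimiter run after the fifth word.
print(first_n_words_pattern(5).match(text).group(0))

# Only 3 words available: {5} finds no match, {1,5} returns what is there.
print(first_n_words_pattern(5).match("one two three"))
print(first_n_words_pattern(5, at_most=True).match("one two three").group(0))
```

Note that the match includes the delimiters that follow the last word, just as in the expected output shown in the question.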

Notice that I changed the order of the delimiters: the - must be at the beginning or at the end of the character class, otherwise it must be escaped: \-.
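To illustrate why the position of the hyphen matters, here is a small Python sketch (Python's re module is an assumption here, other engines may report the problem differently). It uses the delimiter string from the question in its original order, where ?-* is read as a character range, and then builds the class with re.escape so the order no longer matters. The variable names are hypothetical.

```python
import re

str_delimiters = ".,;:!?-*_"  # delimiter set from the question, hyphen in the middle

# Unescaped, "?-*" inside [...] is parsed as a range from '?' to '*',
# which is reversed and therefore rejected by Python's re module.
try:
    re.compile(rf"[\s{str_delimiters}]")
    range_error = False
except re.error:
    range_error = True
print(range_error)

# re.escape backslash-escapes the hyphen (and other special characters),
# so the characters can appear in any order.
char_class = r"\s" + re.escape(str_delimiters)
word = re.compile(rf"[^{char_class}]+")
print(word.findall("three-four_five"))
```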



Thanks to @maraca for providing a complete answer to my question.

I just wanted to post the Keyboard Maestro macro that I created using @maraca's RegEx pattern, for anyone interested in a complete solution.



See KM forum macro: Get maximum N words in strings using RegEx
