An efficient regex to check for a specific sentence pattern while still accepting HTML, etc.

(As often happens while writing up a question, I think I have fixed this expression and it now works for my purposes, so efficiency is currently my main concern. I would still like to know whether the expression can be improved or whether it will fail more often than it needs to, so I have left the whole explanation in.)

I am trying to write a regex that checks whether text provided by a user meets a length requirement. Users must write 7 or more complete sentences of 4 or more words each. We define this as follows:

- 4 words means 3 or more sections of '1 or more non-space characters followed by 1 or more spaces', then 1 instance of '1 or more non-space characters optionally followed by a space' (because some people like to put spaces before their punctuation marks I guess)  
- A sentence is ended with a punctuation mark (.?!)  
- Zero or more spaces are allowed after each sentence  
- (Repeat 7 times)  

      

This definition can be changed to anything reasonable; it is just what I have come up with so far. It gives me the following regex:

((\S+\s+){3,}\S+[.?!]\s*){7,}  
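
For readers who want to try the expression outside a debugger, here is a minimal Python sketch. The helper name meets_length_requirement and the sample strings are made up for illustration; only the pattern itself comes from the question.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"((\S+\s+){3,}\S+[.?!]\s*){7,}")

def meets_length_requirement(text: str) -> bool:
    """Return True if the pattern finds 7+ qualifying sentences in a row."""
    # search() looks for a match starting at any position in the text.
    return ORIGINAL.search(text) is not None

# Hypothetical sample input: seven five-word sentences, then only six.
good = "This is a complete sentence. " * 7
short = "This is a complete sentence. " * 6

print(meets_length_requirement(good))
print(meets_length_requirement(short))
```

With these inputs the first call should report a match and the second should not, since every repetition of the outer group has to consume one closing punctuation mark.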

      

It seems to work, but I have obviously not thought of everything and wonder if anyone has a better idea. (It has to allow HTML and many other quirks of user writing. I'm not too concerned about people gaming the system; there are still manual checks, and this is just a first-pass check to ease the load.)

My main concern is efficiency. I'm new to regex and don't know what a "normal" computation time is, but the debugger I use struggles from time to time when I paste in a block of text to check, and I don't know whether that is down to my regex or to the debugger. It often times out on longer sections of text where there is no match. Is there a more efficient way to do what I want?


1 answer


First, when doing a full-text match, always surround the regex with ^...$. ^ anchors the beginning of the regular expression to the beginning of the test string, and $ anchors the end of the regular expression to the end of the string. Otherwise, if it doesn't match, it will retry the validation starting at every single character (which is at least (4 words * 3 spaces) * 7 sentences = excessive work).
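
As a rough illustration of the anchoring advice, the sketch below wraps the question's pattern in ^...$ and compares it with the unanchored form. The sample strings are invented. Note that anchoring also changes what is accepted: the whole text must now consist of qualifying sentences, not just contain seven of them somewhere.

```python
import re

# The question's pattern, kept as a string so it can be wrapped.
SENTENCES = r"((\S+\s+){3,}\S+[.?!]\s*){7,}"

unanchored = re.compile(SENTENCES)
anchored = re.compile("^" + SENTENCES + "$")

# Hypothetical inputs: seven sentences, then the same with an unfinished tail.
valid = "This is a complete sentence. " * 7
with_tail = valid + "plus a trailing fragment"

# Unanchored: a match anywhere in the text is enough.
print(bool(unanchored.search(with_tail)))
# Anchored: only one starting position is tried, and the tail must fit too.
print(bool(anchored.search(with_tail)))
print(bool(anchored.search(valid)))
```

In Python, $ also matches just before a single trailing newline; re.fullmatch is the stricter alternative if that matters.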

Second, always use mutually exclusive groups where you can. \S (anything that is not whitespace) also matches the characters .?!, so when the punctuation is missing the engine must backtrack and retry every \S it matched. (Namely, because the first pass will have consumed the punctuation as part of a word instead of as punctuation.) So I would recommend replacing \S with the more mutually exclusive "anything but whitespace or punctuation" class [^\s.?!]. Note that the [] contains a lowercase s instead of an uppercase one. [^...] means "match anything NOT in this group."

These two things will take you from catastrophic backtracking down to a reasonable ~1-3K steps, depending on the length of the paragraph.
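
The answer does not spell out the rewritten expression, so the following is only one possible reading of the two suggestions combined: the question's structure with \S swapped for [^\s.?!] and ^...$ added. The name EXCLUSIVE and the sample strings are assumptions for illustration.

```python
import re

# Assumed combination of both tips: anchors plus a word class that
# cannot overlap with the sentence-ending punctuation.
EXCLUSIVE = re.compile(r"^(([^\s.?!]+\s+){3,}[^\s.?!]+[.?!]\s*){7,}$")

def is_valid(text: str) -> bool:
    """Return True if the whole text is 7+ sentences of 4+ words each."""
    return EXCLUSIVE.match(text) is not None

# Hypothetical inputs.
good = "This is a complete sentence. " * 7
no_punctuation = "word " * 200            # long text that can never match
with_filename = "Open the index.html file now. " * 7

print(is_valid(good))
print(is_valid(no_punctuation))
print(is_valid(with_filename))
```

One trade-off to be aware of: with this stricter class a word may no longer contain . ? or !, so tokens such as index.html or e.g. break a sentence, which may matter given the requirement to tolerate HTML.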

UPDATE:
If you allow a slight change in the validation logic, so that multiple short sentences can be counted together as one sentence, then the following regex should work.



^(\s*(\S+\s+){3}([.?!]\s*)?([^\s.?!]+\s+)*\S+\s*[.?!]){7,}$

      

This hybrid version lets short sentences through without causing catastrophic backtracking. Without that small rule change, you would need to nest a variable-length pattern inside another variable-length pattern, which is disastrous when the patterns are not completely mutually exclusive. (updated version)
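
A small sketch of how the hybrid expression behaves, using invented sample text. One detail of the pattern as written: it ends with [.?!] directly before $, so trailing whitespace would make it fail. This sketch strips the input first, which is an assumption and not part of the answer.

```python
import re

# The hybrid pattern from the answer, unchanged.
HYBRID = re.compile(
    r"^(\s*(\S+\s+){3}([.?!]\s*)?([^\s.?!]+\s+)*\S+\s*[.?!]){7,}$"
)

def is_valid(text: str) -> bool:
    """Check stripped text against the hybrid pattern."""
    # Assumption: surrounding whitespace is not meaningful, so strip it.
    return HYBRID.match(text.strip()) is not None

# Hypothetical inputs.
seven = "This is a complete sentence. " * 7
six = "This is a complete sentence. " * 6
# A one-word sentence followed by a four-word one is folded into a single unit.
merged = "No. This is fine too. " * 7

print(is_valid(seven))
print(is_valid(six))
print(is_valid(merged))
```

The third input shows the relaxed rule: "No." on its own is too short, but it is absorbed into the sentence that follows instead of causing a failure.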

Also, technically you can replace {7,}$ with just {7}: once 7 sentences have been found, you don't care what comes after them. (This lets the regex stop as soon as the minimum is met, which makes it more robust in some extreme edge cases.)
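
To make the difference concrete, the sketch below compares the two endings on text that has seven good sentences followed by an unfinished fragment. The variable names and sample text are made up for illustration.

```python
import re

BODY = r"(\s*(\S+\s+){3}([.?!]\s*)?([^\s.?!]+\s+)*\S+\s*[.?!])"

# Whole text must consist of qualifying sentences.
strict = re.compile("^" + BODY + "{7,}$")
# Stops after the seventh sentence and ignores whatever follows.
minimum_only = re.compile("^" + BODY + "{7}")

# Hypothetical input: seven sentences plus a fragment with no end mark.
text = ("This is a complete sentence. " * 7) + "and then it trails off"

print(bool(strict.match(text)))
print(bool(minimum_only.match(text)))
```

Here the {7} form accepts the text while the {7,}$ form rejects it, so which one to use depends on whether leftover text after the seventh sentence should count against the user.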

(You can play with it here at regex101.com)
