Many Regex.Replace calls or one OR'ed pattern?

A regular expression question, once again.

Which is more efficient: cascading many Regex.Replace calls, each with its own specific pattern, or a single Regex.Replace with an OR'ed pattern (pattern1|pattern2|...)?

Thanks in advance, Fabian



4 answers


It depends on how big your text is and how many matches you expect. If at all possible, put a text literal or an anchor (like ^) at the beginning of the Regex. The .NET Regex engine optimizes a leading literal by searching for it with the fast Boyer-Moore algorithm (which can skip characters) rather than a standard IndexOf, which looks at every character. If you have multiple patterns that each start with literal text, there is an optimization that builds a set of possible start characters; positions that start with any other character are quickly skipped.

In general, you may want to read Mastering Regular Expressions, which covers general optimizations (especially in Chapter 6), to get an idea of how to improve performance.

I would say that you can get faster performance if you put everything in one Regex, with the most likely alternative first, then the second most likely, and so on. The only thing to watch out for is backtracking. If you do something like

    ".*"

to match a quoted string, understand that once it finds the first ", the .* will by default always run to the end of the line and then back up until it finds another ".

Mastering Regular Expressions goes into a lot of detail about how to avoid this.
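To make the backtracking point concrete, here is a minimal sketch. It uses Python's re module only as a runnable stand-in (an assumption on my part; the question is about .NET, but the greedy behavior of .* is the same in both engines). The sample line is made up for illustration.

```python
import re

# Made-up sample line containing two quoted strings.
line = 'say "hello" and then "goodbye" to everyone'

# Greedy: .* runs to the end of the line, then backs up to the LAST quote,
# so both quoted strings are swallowed into one match.
greedy = re.findall(r'".*"', line)

# Negated class: [^"]* can never run past a closing quote, so there is
# nothing to give back.
negated = re.findall(r'"[^"]*"', line)

# Lazy quantifier: also stops at the first closing quote, but gets there
# by extending the match one character at a time.
lazy = re.findall(r'".*?"', line)

print(greedy)   # ['"hello" and then "goodbye"']
print(negated)  # ['"hello"', '"goodbye"']
print(lazy)     # ['"hello"', '"goodbye"']
```

The negated character class is usually the form to reach for, since it both fixes the match and removes the need to back up.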



My answer sucks, but: it depends. How many do you have? Will the few milliseconds you save really matter? What's the most readable, easiest to maintain, and most scalable solution?



Try both methods for your specific requirements and you will see. You might be surprised.
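A rough sketch of what "try both" could look like. It is written in Python rather than .NET purely so it can be run as-is; the replacement table, function names, and sample text are all hypothetical, and the timings it prints say nothing about how the .NET engine would behave. In .NET, the callback role would be played by a MatchEvaluator passed to Regex.Replace.

```python
import re
import timeit

# Hypothetical replacement table: literal text -> replacement.
REPLACEMENTS = {"cat": "dog", "red": "blue", "small": "large"}

# Approach 1: one precompiled regex per pattern, applied one after another.
CASCADE = [(re.compile(re.escape(old)), new) for old, new in REPLACEMENTS.items()]

def cascaded(text):
    for pattern, new in CASCADE:
        text = pattern.sub(new, text)
    return text

# Approach 2: a single OR'ed pattern; the callback looks up the replacement.
COMBINED = re.compile("|".join(re.escape(old) for old in REPLACEMENTS))

def combined(text):
    return COMBINED.sub(lambda m: REPLACEMENTS[m.group(0)], text)

# Made-up sample text, repeated to give the timer something to measure.
sample = "the small red cat sat on the mat " * 1000

# Both approaches must agree before timing them is meaningful.
assert cascaded(sample) == combined(sample)

for fn in (cascaded, combined):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(f"{fn.__name__}: {seconds:.4f}s for 20 runs")
```

Measure with text and patterns that resemble your real input; a toy sample like this one can easily point the wrong way.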



It depends entirely on the patterns and the logic of the implementation - if they're simple (and I think most real-world cases would be), a single regex will be faster; if they're complex, multiple operations may be. But benchmarking is the answer if it's a situation where it actually matters to the business.

Otherwise the two will be relatively close and you shouldn't care: premature optimization and all.



I'm surprised that your benchmarking showed that using multiple separate expressions would be faster, and I'd be curious to see an example of the regex you are using. Basic regular expressions (i.e., without additional features that require backtracking, such as backreferences) can be compiled into "finite state machines" that run in O(n) time relative to the length of the string being examined, independent of the length of the pattern. Thus, running 10 separate regexes should, on average, take about 10 times as long as a single regex that combines these patterns with "|".
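For patterns that are not plain literals, combining them with "|" raises the question of which alternative matched. One way to handle that, sketched here in Python as a stand-in for .NET, is to wrap each alternative in a named group. The rules and names below are hypothetical. Note that Python's re, like .NET's Regex, is a backtracking engine, so this shows the single-pass structure rather than the O(n) guarantee of a true finite state machine.

```python
import re

# Hypothetical rules: (name, pattern, replacement).
# Order matters: the first alternative that matches at a position wins,
# so the date rule must come before the plain number rule.
RULES = [
    ("date", r"\d{4}-\d{2}-\d{2}", "<date>"),
    ("number", r"\d+", "<num>"),
    ("spaces", r"[ \t]{2,}", " "),
]

# One OR'ed pattern, with each alternative wrapped in a named group.
COMBINED = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat, _ in RULES))
REPLACEMENT = {name: repl for name, _, repl in RULES}

def replace_single_pass(text):
    # m.lastgroup is the name of the alternative that matched.
    return COMBINED.sub(lambda m: REPLACEMENT[m.lastgroup], text)

def replace_cascaded(text):
    # One full pass over the text per rule.
    for _, pat, repl in RULES:
        text = re.sub(pat, repl, text)
    return text

text = "on 2024-03-01  we shipped 12   boxes"
print(replace_single_pass(text))  # on <date> we shipped <num> boxes
print(replace_cascaded(text))     # on <date> we shipped <num> boxes
```

The two versions agree here, but they are not equivalent in general: a cascade re-scans the output of earlier replacements, while the single pass never looks at replaced text again.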

(I know this is an old question from March, but I couldn't resist adding my two cents :)
