Which has the best performance when matching a string: testing with regex / regex1 | regex2 | regex3 / or checking one pattern after another?

Suppose I have several million (possibly long) lines, and you need to know if each contains any of these patterns:

  • regex1

  • regex2

  • regex3

  • ...

Efficiency would be better:

  • Check each line for a "complete" regex /regex1|regex2|regex3|.../

    or

  • Check each line for regex1

    if not, then test with regex2

    etc ....?

I was wondering about this, and since my knowledge of regex implementations is very limited, I have no idea if they will lead to similar behavior or not.


Edit: I just did some quick benchmarking. Didn't think too much, just blurted out some code. Please point to anything that might bias the output.

This is JavaScript and I checked with Node.js.

Note. I tried working with 5 million lines and 500 regex, but ran out of memory in the process, so I omitted the numbers

"use strict";

var strMinSize      = 50;
var strMaxSize      = 500;
var howManyStrings  = 100000;  // hundred thousand
var howManyRegex    = 50;      // fifty

var possible = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

function makestr() {
    var text = "";
    var strSize = Math.floor(Math.random() * strMaxSize) + strMinSize;
    for (var i=0; i < strSize; i++) {
      text += possible.charAt(Math.floor(Math.random() * possible.length));
    }
    return text;
}

function makeregex() {
    var regexstr = "";
    var regexSize = Math.floor(Math.random() * 50) + 5;
    for (var i=0; i < regexSize; ++i) {
      regexstr += possible.charAt(Math.floor(Math.random() * possible.length));
    }
    return regexstr;
}

var stringList = [];
for (var i=0; i < howManyStrings; ++i) {
    stringList.push(makestr());
}
var regexList = [];
var fullRegex = ""; // aux to build the disjunction
for (var i=0; i < howManyRegex; ++i) {
    var localRegex = makeregex();
    regexList.push(new RegExp(localRegex));
    fullRegex += '|' + localRegex;
}
fullRegex = new RegExp(fullRegex.substr(1));

// let do this...

for (var kase=1; kase < 10; ++kase) {
    // Test 1: one disjunction with every regex
    var time1 = 0;
    var time2 = 0;

    var start = new Date().getTime();
    stringList.forEach( function(str) {
      fullRegex.test(str);
    });
    var end = new Date().getTime();
    time1 = end - start;

    // Test 2: one regex at a time
    start = new Date().getTime();
    stringList.forEach( function(str) {
      regexList.every( function(rx) {
        if (rx.test(str)) {
          return false;
        } else {
          return true;
        }
      });
    });
    end = new Date().getTime();
    time2 = end - start;

    console.log(time1 + ";" + time2);
}

      

Working hours:

+--------+---------+
| Test 1 | Test 2  |
+--------+---------+
|   813  |  1817   |
|   558  |  1750   |
|   566  |  1756   |
|   558  |  1783   |
|   560  |  1755   |
|   559  |  1736   |
|   551  |  1749   |
|   552  |  1743   |
|   558  |  1746   |
+--------+---------+

      

So, as I suspected, the second alternative is worse ... But why so many?

+3


source to share


1 answer


A single regex will always be faster, because each regex test requires traversing the input, and while the combined regex is (slightly) more complex than the individual expressions, it still evaluates to constant time.

Expressing the problem using the big O notation:

  • evaluating one regex at a given input location = O (1)
  • combined regex evaluation at a given input location = effectively O (1)
  • regex in string = O (n) (where n = string length)


Of these, we can say that the individual passes for each term = O (n * k), where k is the number of regexes / members, but one regex is O (n).

This is born out of your tests showing about 3x slower for individual regexes.

It all depends on the premise that the concatenated regex is "about as fast" as the simple one. This is because the regex state mechanism is extremely efficient, which reduces the execution time to nearly the same for simple interleaving as a simple pattern. It's a little slower, but not where it is almost slow enough to guarantee separate passes for individual regexes, no matter how long the term list has become.

+2


source







All Articles