Fastest regular expression that doesn't match any lines

What is the fastest regular expression that doesn't match any string? This may sound useless, but consider a program that accepts a required regular expression as a filter (for example, this is my scenario). I tried several and found the b(?<!b)

best performer given that it b

is rare in the input.

Here is the python code I wrote to test different templates for their speed:

#!/usr/bin/env python

import re
import time

tests = [
  r'a\A',
  r'b\A',
  r'a^',
  r'b^',
  r'[^\s\S]',
  r'^(?<=a)',
  r'^(?<=b)',
  r'a(?<!a)',
  r'b(?<!b)',
  r'\Za',
  r'\Zb',
  r'$a',
  r'$b'
]
timing = []
text = 'a' * 50000000

for t in tests:
  pat = re.compile(t)
  start = time.time()
  pat.search(text)
  dur = time.time() - start
  timing.append((t, dur))

timing.sort(key=lambda x: x[1])
print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
  print('%-30s %0.3f' % (t, dur))

      

On my machine, I get the following points:

Pattern                        Time
b(?<!b)                        0.043
b\A                            0.043
b^                             0.043
$a                             0.382
$b                             0.382
^(?<=a)                        0.395
\Za                            0.395
\Zb                            0.395
^(?<=b)                        0.414
a\A                            0.437
a^                             0.440
a(?<!a)                        0.796
[^\s\S]                        1.469

      

update: Added benchmark for some suggested regular expressions.

+3


source to share


3 answers


One character is a valid regular expression. The only character that is not "magic" matches itself. If you can identify one character that will never appear in the text, you can make a pattern out of that.

How about ASCII NUL, character 0?

I am stuck on one line in a test program, line: '\0'

It was about as fast as your best template: b(?<!b)

Ok, you already have a character after the end of the line. How about a character before the start of a line? It's impossible:'x^'

Aha! This is faster than checking for a character after the end of the line. But it's about as fast as your best sample.

I suggest replacing b

NUL with ASCII and calling it good. When I tried this pattern:\0(?<!\0)

He wins with a tiny fraction. But in fact, on my computer, all those discussed above are so close to each other that there is not much to distinguish between them.

Results:



Pattern                        Time
\0(?<!\0)                      0.098
\0                             0.099
x^                             0.099
b(?<!b)                        0.099
^(?<=x)                        1.416
$b                             1.446
$a                             1.447
\Za                            1.462
\Zb                            1.465
[^\s\S]                        2.280
a(?<!a)                        2.843

      

It was fun. Thanks for posting the question.

EDIT: Oh yes! I rewrote the program for testing with real inputs and got a different result.

I downloaded "The Complete Works of William Shakespeare" from Project Gutenberg as a text file. (Strange, this gave an error on wget

, but let my browser do this ... some measure to protect against automatic copying?) URL: http://www.gutenberg.org/cache/epub/100/pg100.txt

Here are the results followed by the modified program when I ran it.

Pattern                        Time
\0(?<!\0)                      0.110
\0                             0.118
x^                             0.119
b(?<!b)                        0.143
a(?<!a)                        0.275
^(?<=x)                        1.577
$b                             1.605
$a                             1.611
\Za                            1.634
\Zb                            1.634
[^\s\S]                        2.441

      

So, I'm definitely going with this first.

#!/usr/bin/env python

import re
import time

tests = [
  r'x^',
  r'\0',
  r'[^\s\S]',
  r'^(?<=x)',
  r'a(?<!a)',
  r'b(?<!b)',
  r'\0(?<!\0)',
  r'\Za',
  r'\Zb',
  r'$a',
  r'$b'
]
timing = []
#text = 'a' * 50000000
text = open("/tmp/pg100.txt").read()
text = text * 10

for t in tests:
  pat = re.compile(t)
  start = time.time()
  pat.search(text)
  dur = time.time() - start
  timing.append((t, dur))

timing.sort(key=lambda x: x[1])
print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
  print('%-30s %0.3f' % (t, dur))

      

+2


source


Probably it's not quite the answer that you are looking for, but I played a little bit with this for a few minutes, and the best thing I could do is something like: 'a{%d}' % (len(input)+1)

. Or, in other words, evaluate your pattern as any character with a maximum quantifier longer than the known string length. This obviously only works if you have access to the input string. He looks very handsome too. I split the texts into texts = [os.urandom(20000) for x in range(20000)]

, in which case it a{20001}

still syncs to 0.013

.

I have not, however, calculated the cost of the operations to get the line length and create a pattern that you will obviously have to deal with in the wild.

Updated script:



import os
import re
import time

tests = [
  r'\0(?<!\0)',
  r'\0',
  r'x^',
  r'a{2001}',
  r'a(?<!a)',
  r'b(?<!b)',
  r'\Za',
  r'\Zb',
  r'$a',
  r'$b'
]
texts = [os.urandom(2000) for x in range(20000)]
flag = 0

for t in tests:
    pat = re.compile(t)
    start = time.time()
    for text in texts:
        if pat.search(text):
            print('%-30s %10s' % (t, "FAILED"))
            flag = 1
            break
    if flag == 0:
        dur = time.time() - start
        print('%-30s %0.3f' % (t, dur))
    else:
        flag = 0

      

Landmarks:

\0(?<!\0)                      0.058
\0                                 FAILED
x^                             0.055
a{2001}                        0.022
a(?<!a)                        0.060
b(?<!b)                        0.073
\Za                            0.798
\Zb                            0.797
$a                             0.757
$b                             0.767

      

0


source


We usually use a short and convenient (?!)

one that doesn't work directly.

If you want to be more explicit, Perl / PCRE has a verb (*FAIL)

or (*F)

, which you can use.

0


source







All Articles