Fastest regular expression that doesn't match any lines

Question

Fastest regular expression that doesn't match any lines

What is the fastest regular expression that doesn't match any string? This may sound useless, but consider a program that accepts a required regular expression as a filter (for example, this is my scenario). I tried several and found the b(?<!b)

best performer given that it b

is rare in the input.

Here is the python code I wrote to test different templates for their speed:

#!/usr/bin/env python

import re
import time

tests = [
  r'a\A',
  r'b\A',
  r'a^',
  r'b^',
  r'[^\s\S]',
  r'^(?<=a)',
  r'^(?<=b)',
  r'a(?<!a)',
  r'b(?<!b)',
  r'\Za',
  r'\Zb',
  r'$a',
  r'$b'
]
timing = []
text = 'a' * 50000000

for t in tests:
  pat = re.compile(t)
  start = time.time()
  pat.search(text)
  dur = time.time() - start
  timing.append((t, dur))

timing.sort(key=lambda x: x[1])
print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
  print('%-30s %0.3f' % (t, dur))

On my machine, I get the following points:

Pattern                        Time
b(?<!b)                        0.043
b\A                            0.043
b^                             0.043
$a                             0.382
$b                             0.382
^(?<=a)                        0.395
\Za                            0.395
\Zb                            0.395
^(?<=b)                        0.414
a\A                            0.437
a^                             0.440
a(?<!a)                        0.796
[^\s\S]                        1.469

update: Added benchmark for some suggested regular expressions.

+3

python regex

Mansour Jan 26. At 8:25 am

source to share

3 answers

Probably it's not quite the answer that you are looking for, but I played a little bit with this for a few minutes, and the best thing I could do is something like: 'a{%d}' % (len(input)+1)

. Or, in other words, evaluate your pattern as any character with a maximum quantifier longer than the known string length. This obviously only works if you have access to the input string. He looks very handsome too. I split the texts into texts = [os.urandom(20000) for x in range(20000)]

, in which case it a{20001}

still syncs to 0.013

.

I have not, however, calculated the cost of the operations to get the line length and create a pattern that you will obviously have to deal with in the wild.

Updated script:

import os
import re
import time

tests = [
  r'\0(?<!\0)',
  r'\0',
  r'x^',
  r'a{2001}',
  r'a(?<!a)',
  r'b(?<!b)',
  r'\Za',
  r'\Zb',
  r'$a',
  r'$b'
]
texts = [os.urandom(2000) for x in range(20000)]
flag = 0

for t in tests:
    pat = re.compile(t)
    start = time.time()
    for text in texts:
        if pat.search(text):
            print('%-30s %10s' % (t, "FAILED"))
            flag = 1
            break
    if flag == 0:
        dur = time.time() - start
        print('%-30s %0.3f' % (t, dur))
    else:
        flag = 0

Landmarks:

\0(?<!\0)                      0.058
\0                                 FAILED
x^                             0.055
a{2001}                        0.022
a(?<!a)                        0.060
b(?<!b)                        0.073
\Za                            0.798
\Zb                            0.797
$a                             0.757
$b                             0.767

0

abathur Jan 26. '13 at 9:29

source to share

We usually use a short and convenient (?!)

one that doesn't work directly.

If you want to be more explicit, Perl / PCRE has a verb (*FAIL)

or (*F)

, which you can use.

0

Qtax Jan 26. At 9:57 am

source to share

steveha · Accepted Answer · 2013-01-26T09:24:36+0000

One character is a valid regular expression. The only character that is not "magic" matches itself. If you can identify one character that will never appear in the text, you can make a pattern out of that.

How about ASCII NUL, character 0?

I am stuck on one line in a test program, line: '\0'

It was about as fast as your best template: b(?<!b)

Ok, you already have a character after the end of the line. How about a character before the start of a line? It's impossible:'x^'

Aha! This is faster than checking for a character after the end of the line. But it's about as fast as your best sample.

I suggest replacing b

NUL with ASCII and calling it good. When I tried this pattern:\0(?<!\0)

He wins with a tiny fraction. But in fact, on my computer, all those discussed above are so close to each other that there is not much to distinguish between them.

Results:

Pattern                        Time
\0(?<!\0)                      0.098
\0                             0.099
x^                             0.099
b(?<!b)                        0.099
^(?<=x)                        1.416
$b                             1.446
$a                             1.447
\Za                            1.462
\Zb                            1.465
[^\s\S]                        2.280
a(?<!a)                        2.843

It was fun. Thanks for posting the question.

EDIT: Oh yes! I rewrote the program for testing with real inputs and got a different result.

I downloaded "The Complete Works of William Shakespeare" from Project Gutenberg as a text file. (Strange, this gave an error on wget

, but let my browser do this ... some measure to protect against automatic copying?) URL: http://www.gutenberg.org/cache/epub/100/pg100.txt

Here are the results followed by the modified program when I ran it.

Pattern                        Time
\0(?<!\0)                      0.110
\0                             0.118
x^                             0.119
b(?<!b)                        0.143
a(?<!a)                        0.275
^(?<=x)                        1.577
$b                             1.605
$a                             1.611
\Za                            1.634
\Zb                            1.634
[^\s\S]                        2.441

So, I'm definitely going with this first.

#!/usr/bin/env python

import re
import time

tests = [
  r'x^',
  r'\0',
  r'[^\s\S]',
  r'^(?<=x)',
  r'a(?<!a)',
  r'b(?<!b)',
  r'\0(?<!\0)',
  r'\Za',
  r'\Zb',
  r'$a',
  r'$b'
]
timing = []
#text = 'a' * 50000000
text = open("/tmp/pg100.txt").read()
text = text * 10

for t in tests:
  pat = re.compile(t)
  start = time.time()
  pat.search(text)
  dur = time.time() - start
  timing.append((t, dur))

timing.sort(key=lambda x: x[1])
print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
  print('%-30s %0.3f' % (t, dur))

Fastest regular expression that doesn't match any lines

More articles: