Fastest regular expression that doesn't match any lines
What is the fastest regular expression that doesn't match any string? This may sound useless, but consider a program that accepts a required regular expression as a filter (for example, this is my scenario). I tried several and found the b(?<!b)
best performer given that it b
is rare in the input.
Here is the python code I wrote to test different templates for their speed:
#!/usr/bin/env python
import re
import time
tests = [
r'a\A',
r'b\A',
r'a^',
r'b^',
r'[^\s\S]',
r'^(?<=a)',
r'^(?<=b)',
r'a(?<!a)',
r'b(?<!b)',
r'\Za',
r'\Zb',
r'$a',
r'$b'
]
timing = []
text = 'a' * 50000000
for t in tests:
pat = re.compile(t)
start = time.time()
pat.search(text)
dur = time.time() - start
timing.append((t, dur))
timing.sort(key=lambda x: x[1])
print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
print('%-30s %0.3f' % (t, dur))
On my machine, I get the following points:
Pattern Time
b(?<!b) 0.043
b\A 0.043
b^ 0.043
$a 0.382
$b 0.382
^(?<=a) 0.395
\Za 0.395
\Zb 0.395
^(?<=b) 0.414
a\A 0.437
a^ 0.440
a(?<!a) 0.796
[^\s\S] 1.469
update: Added benchmark for some suggested regular expressions.
source to share
One character is a valid regular expression. The only character that is not "magic" matches itself. If you can identify one character that will never appear in the text, you can make a pattern out of that.
How about ASCII NUL, character 0?
I am stuck on one line in a test program, line: '\0'
It was about as fast as your best template: b(?<!b)
Ok, you already have a character after the end of the line. How about a character before the start of a line? It's impossible:'x^'
Aha! This is faster than checking for a character after the end of the line. But it's about as fast as your best sample.
I suggest replacing b
NUL with ASCII and calling it good. When I tried this pattern:\0(?<!\0)
He wins with a tiny fraction. But in fact, on my computer, all those discussed above are so close to each other that there is not much to distinguish between them.
Results:
Pattern Time
\0(?<!\0) 0.098
\0 0.099
x^ 0.099
b(?<!b) 0.099
^(?<=x) 1.416
$b 1.446
$a 1.447
\Za 1.462
\Zb 1.465
[^\s\S] 2.280
a(?<!a) 2.843
It was fun. Thanks for posting the question.
EDIT: Oh yes! I rewrote the program for testing with real inputs and got a different result.
I downloaded "The Complete Works of William Shakespeare" from Project Gutenberg as a text file. (Strange, this gave an error on wget
, but let my browser do this ... some measure to protect against automatic copying?) URL: http://www.gutenberg.org/cache/epub/100/pg100.txt
Here are the results followed by the modified program when I ran it.
Pattern Time
\0(?<!\0) 0.110
\0 0.118
x^ 0.119
b(?<!b) 0.143
a(?<!a) 0.275
^(?<=x) 1.577
$b 1.605
$a 1.611
\Za 1.634
\Zb 1.634
[^\s\S] 2.441
So, I'm definitely going with this first.
#!/usr/bin/env python
import re
import time
tests = [
r'x^',
r'\0',
r'[^\s\S]',
r'^(?<=x)',
r'a(?<!a)',
r'b(?<!b)',
r'\0(?<!\0)',
r'\Za',
r'\Zb',
r'$a',
r'$b'
]
timing = []
#text = 'a' * 50000000
text = open("/tmp/pg100.txt").read()
text = text * 10
for t in tests:
pat = re.compile(t)
start = time.time()
pat.search(text)
dur = time.time() - start
timing.append((t, dur))
timing.sort(key=lambda x: x[1])
print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
print('%-30s %0.3f' % (t, dur))
source to share
Probably it's not quite the answer that you are looking for, but I played a little bit with this for a few minutes, and the best thing I could do is something like: 'a{%d}' % (len(input)+1)
. Or, in other words, evaluate your pattern as any character with a maximum quantifier longer than the known string length. This obviously only works if you have access to the input string. He looks very handsome too. I split the texts into texts = [os.urandom(20000) for x in range(20000)]
, in which case it a{20001}
still syncs to 0.013
.
I have not, however, calculated the cost of the operations to get the line length and create a pattern that you will obviously have to deal with in the wild.
Updated script:
import os
import re
import time
tests = [
r'\0(?<!\0)',
r'\0',
r'x^',
r'a{2001}',
r'a(?<!a)',
r'b(?<!b)',
r'\Za',
r'\Zb',
r'$a',
r'$b'
]
texts = [os.urandom(2000) for x in range(20000)]
flag = 0
for t in tests:
pat = re.compile(t)
start = time.time()
for text in texts:
if pat.search(text):
print('%-30s %10s' % (t, "FAILED"))
flag = 1
break
if flag == 0:
dur = time.time() - start
print('%-30s %0.3f' % (t, dur))
else:
flag = 0
Landmarks:
\0(?<!\0) 0.058
\0 FAILED
x^ 0.055
a{2001} 0.022
a(?<!a) 0.060
b(?<!b) 0.073
\Za 0.798
\Zb 0.797
$a 0.757
$b 0.767
source to share