Is there a way to use memoryview with regular expressions in Python 2?

In Python 3, the re module can be used with memoryview:

~$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = b"abc"
>>> import re
>>> re.search(b"b", memoryview(x))
<_sre.SRE_Match object at 0x7f14b5fb8988>

      

However, this doesn't appear to work in Python 2:

~$ python
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "abc"
>>> import re
>>> re.search(b"b", memoryview(x))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

      

I can pass the string to buffer instead, but the buffer documentation doesn't explain exactly how buffer works compared to memoryview.

An empirical comparison shows that using a buffer object in Python 2 does not offer the performance benefits that memoryview provides in Python 3:

playground$ cat speed-test.py
import timeit
import sys

print(timeit.timeit("regex.search(mv[10:])", setup='''
import re
import sys  # setup does not see the script's top-level imports
regex = re.compile(b"ABC")
PYTHON_3 = sys.version_info >= (3, )
if PYTHON_3:
    mv = memoryview(b"Can you count to three or sing 'ABC?'" * 1024)
else:
    mv = buffer(b"Can you count to three or sing 'ABC?'" * 1024)
'''))
playground$ python2.7 speed-test.py
2.33041596413
playground$ python2.7 speed-test.py
2.3322429657
playground$ python3.2 speed-test.py
0.381270170211792
playground$ python3.2 speed-test.py
0.3775448799133301
playground$

      

If the argument to regex.search is changed from mv[10:] to mv, Python 2's performance is about the same as Python 3's, but the code I'm writing does a lot of repeated slicing.

Is there a way to get around this issue in Python 2 while still having the zero-copy performance benefits of memoryview?


1 answer


As I understand the buffer object in Python 2, you are meant to use it without slicing:

>>> s = b"Can you count to three or sing 'ABC?'"
>>> str(buffer(s, 10))
"unt to three or sing 'ABC?'"

      

Thus, instead of slicing the resulting buffer, you pass the offset to the buffer function directly, which gives quick access to the substring of interest:

import timeit
import sys
import re

r = re.compile(b'ABC')
s = b"Can you count to three or sing 'ABC?'" * 1024

PYTHON_3 = sys.version_info >= (3, )
if len(sys.argv) > 1: # standard slicing
    print(timeit.timeit("r.search(s[10:])", setup='from __main__ import r, s'))
elif PYTHON_3: # memoryview in Python 3
    print(timeit.timeit("r.search(s[10:])", setup='from __main__ import r, s; s = memoryview(s)'))
else: # buffer in Python 2
    print(timeit.timeit("r.search(buffer(s, 10))", setup='from __main__ import r, s'))

      

I got very similar results in Python 2 and 3, which suggests that using buffer like this with the re module has a similar effect to the newer memoryview (which then seems to be a lazily evaluated buffer):

$ python2 .\speed-test.py
0.681979371561
$ python3 .\speed-test.py
0.5693422508853488

      

And as a comparison with standard string slicing:



$ python2 .\speed-test.py standard-slicing
7.92006735956
$ python3 .\speed-test.py standard-slicing
7.817641705304309
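
The gap between these timings is most likely explained by copying: a standard slice builds a new object holding the remaining bytes on every call, whereas a memoryview slice is only a new view onto the same memory. Here is a minimal Python 3 sketch of that difference. It uses a bytearray so that the sharing is observable, and the variable names are illustrative and not part of the benchmark above:

```python
# Python 3 sketch: a standard slice is an independent copy, a memoryview slice is not.
data = bytearray(b"Can you count to three or sing 'ABC?'")

copied = data[10:]               # slicing a bytearray (like bytes) copies the bytes
view = memoryview(data)[10:]     # slicing a memoryview shares the underlying memory

data[10] = ord("U")              # mutate one byte of the original in place

print(bytes(copied[:3]))         # the copy still holds the old bytes
print(bytes(view[:3]))           # the view reflects the change
```

Because the view never copies, the cost of view[10:] does not grow with the size of the data, which matches the flat timings for the buffer and memoryview runs.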

      


If you want to maintain slice access (so the same syntax can be used everywhere), you can easily create a type that dynamically creates a new buffer when you slice it:

class slicingbuffer:
    def __init__ (self, source):
        self.source = source
    def __getitem__ (self, index):
        if not isinstance(index, slice):
            return buffer(self.source, index, 1)
        # Assumes non-negative bounds and no step; an omitted start means 0.
        start = index.start or 0
        if index.stop is None:
            return buffer(self.source, start)
        # buffer takes (object, offset, size), so turn the stop into a length.
        size = max(index.stop - start, 0)
        return buffer(self.source, start, size)

      

If you only use it with the re module, it can probably work as a drop-in replacement for memoryview. However, my tests show that this already adds a lot of overhead. Thus, you may want to do the opposite and wrap Python 3's memoryview object in a function that gives you the same interface as buffer:

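def memoryviewbuffer (source, start, size = None):
    # Mirrors buffer(object, offset[, size]); a default end of -1 would drop the last byte.
    end = None if size is None else start + size
    return source[start:end]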

PYTHON_3 = sys.version_info >= (3, )
if PYTHON_3:
    b = memoryviewbuffer
    s = memoryview(s)
else:
    b = buffer

print(timeit.timeit("r.search(b(s, 10))", setup='from __main__ import r, s, b'))
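
One thing to keep in mind with either approach is that the match offsets are relative to the start of the view, not to the original string. The following is a sketch of how that could be handled on both versions. The helper names view_from and search_from are hypothetical and not part of the code above, and the buffer branch only runs on Python 2:

```python
import re
import sys

PYTHON_3 = sys.version_info >= (3,)

def view_from(source, start):
    """Hypothetical helper: zero-copy view of source beginning at offset start."""
    if PYTHON_3:
        return memoryview(source)[start:]
    return buffer(source, start)  # Python 2 only; never reached on Python 3

def search_from(regex, source, start):
    """Search source from start without copying; return absolute (begin, end) or None."""
    match = regex.search(view_from(source, start))
    if match is None:
        return None
    # The match reports offsets relative to the view, so shift them back.
    return (match.start() + start, match.end() + start)

data = b"Can you count to three or sing 'ABC?'" * 1024
pattern = re.compile(b"ABC")
print(search_from(pattern, data, 10))
```

Keeping the version check inside one small helper means the calling code can stay identical on Python 2 and 3.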

      
