Regex.h efficiency
I'm trying to understand the counterintuitive performance differences I'm seeing between Python, Cython and pure C when doing regex matching.
I have a small example program that takes a source text file (17 KB) and a dictionary of 2000 words, builds a regex from those words (word1|word2|...), and finds all occurrences of the dictionary words in the source file.
First, I wrote a pure Python implementation that looks like this:
import re

def scanFile(filename, patterns):
    pattern_regex = re.compile('|'.join(patterns))
    pageContent = open(filename).read()
    matchingPatterns = set()
    for matchObj in pattern_regex.finditer(pageContent):
        matchingPatterns.add(matchObj.group(0))
    return matchingPatterns
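A minimal way to time this function could look like the sketch below (the file names are placeholders standing in for the 17 KB source text and the 2000-word dictionary, not the actual files):

import timeit

SOURCE_FILE = 'page.txt'    # placeholder: the ~17 KB source text
DICT_FILE = 'words.txt'     # placeholder: ~2000 words, one per line

patterns = [line.strip() for line in open(DICT_FILE) if line.strip()]
elapsed = timeit.timeit(lambda: scanFile(SOURCE_FILE, patterns), number=10)
print('scanFile: %.3f s per run' % (elapsed / 10))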
Then I tried to optimize it by reimplementing the matching in Cython on top of regex.h, rather than the Python re module.
cdef extern from "regex.h" nogil:
    ctypedef struct regmatch_t:
        int rm_so
        int rm_eo
    ctypedef struct regex_t:
        pass
    int REG_EXTENDED
    int regcomp(regex_t* preg, const char* regex, int cflags)
    int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
    void regfree(regex_t* preg)
def matchPatterns(bytes pageContent, bytes regex):
    cdef set matchingPatterns = set()
    cdef regex_t regex_obj
    cdef regmatch_t regmatch_obj[1]
    cdef int regex_res = 0
    cdef int current_str_pos = 0

    regcomp(&regex_obj, regex, REG_EXTENDED)
    regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)
    while regex_res == 0:
        matchingPatterns.add(pageContent[current_str_pos + regmatch_obj[0].rm_so: current_str_pos + regmatch_obj[0].rm_eo])
        current_str_pos += regmatch_obj[0].rm_eo
        regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)
    regfree(&regex_obj)
    return matchingPatterns
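A minimal sketch of driving the compiled extension from Python might look like this (the module name matcher and the file names are assumptions for illustration, not part of the actual project):

from matcher import matchPatterns   # hypothetical name of the compiled Cython module

# Placeholder inputs, matching the Python benchmark above.
with open('page.txt', 'rb') as f:
    pageContent = f.read()
with open('words.txt', 'rb') as f:
    words = [line.strip() for line in f if line.strip()]

regex = b'|'.join(words)             # same word1|word2|... alternation
print(matchPatterns(pageContent, regex))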
The performance, however, turned out to be the opposite of what I expected: Cython + regex.h takes about 2.34 seconds, while Python takes 0.92 seconds.
After a little profiling and instrumenting the code, I confirmed the suspicion that the culprit is regexec, which takes tens of milliseconds on each call.
Just to make sure it wasn't Cython's fault, I prepared a stand-alone C unit test that uses the same inputs and regex.h, and it also performed worse than Python (about 1.60 s, i.e. roughly 60% slower than Python).
So with all this, I would be grateful for any insight into why regexec has such poor performance.
I am running this on Python 2.7.10, gcc 4.9.2, Cython 0.22, and the platform is Cygwin/Windows. I saw a similar discrepancy when running on Ubuntu.
Based on what is in this question, I can guess at several problems:

- You are running POSIX code on Windows through Cygwin, which adds overhead; Windows is not a POSIX system.
- You are comparing two different engines: pcre (let me assume pcre2) on the Python side versus regex.h.
- Statically compiled code behaves differently from functions exported from a library (the compiler cannot assume anything across that boundary).
- The stand-alone C program has a large fixed footprint, which suggests that image loading or something else is happening under the hood; compilation options and potential aliasing always make such comparisons difficult.
Beyond the stand-alone program's source, going through translators/transpilers can always introduce delays. These days, optimization is mostly a matter of giving your compiler a clear picture of what you are doing and letting it do its job.
And sorry if this part is unrelated to the question per se, but it looks like you don't actually need a regex here: a basic string matching algorithm, or a neat structure such as a prefix tree (trie) with a simple loop, would accomplish your task.
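For instance, if the dictionary entries are plain literal words, a simple set-lookup sketch (not a full trie, and note it only matches whole tokens, unlike the alternation, which also matches words embedded inside longer tokens) could look like this:

import re

def scanFileWithSet(filename, words):
    # Assumes the dictionary entries are literal words, not regex fragments.
    wordSet = set(words)
    pageContent = open(filename).read()
    tokens = re.findall(r'\w+', pageContent)   # one linear pass over the text
    return wordSet.intersection(tokens)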