Regex.h efficiency

I'm trying to understand a surprising performance difference between Python, Cython and pure C when doing regex matching.

I have a small example program that takes a source text file (17 KB) and a dictionary of 2000 words, builds a regex alternation from those words (word1|word2|...), and finds all occurrences of the dictionary words in the source file.

First, I made a clean Python implementation that looks like this:

import re

def scanFile(filename, patterns):
    pattern_regex = re.compile('|'.join(patterns))
    pageContent = open(filename).read()
    matchingPatterns = set()
    for matchObj in pattern_regex.finditer(pageContent):
        matchingPatterns.add(matchObj.group(0))
    return matchingPatterns

      

Then I tried to optimize this by reimplementing it in Cython on top of regex.h, rather than the Python re module:

cdef extern from "regex.h" nogil:
   ctypedef struct regmatch_t:
      int rm_so
      int rm_eo
   ctypedef struct regex_t:
      pass
   int REG_EXTENDED
   int regcomp(regex_t* preg, const char* regex, int cflags)
   int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
   void regfree(regex_t* preg) 

def matchPatterns(bytes pageContent, bytes regex):
   cdef set matchingPatterns = set()
   cdef regex_t regex_obj
   cdef regmatch_t regmatch_obj[1]
   cdef int regex_res = 0
   cdef int current_str_pos = 0

   regcomp(&regex_obj, regex, REG_EXTENDED)
   regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)
   while regex_res == 0:
      matchingPatterns.add(pageContent[current_str_pos + regmatch_obj[0].rm_so: current_str_pos + regmatch_obj[0].rm_eo])
      current_str_pos += regmatch_obj[0].rm_eo
      regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)

   regfree(&regex_obj)
   return matchingPatterns

      

Performance, however, turned out to be quite the opposite: the Cython + regex.h version takes about 2.34 seconds, while the Python version takes 0.92 seconds.

After a little profiling and instrumenting the code, I confirmed the suspicion that the culprit is regexec, which takes tens of milliseconds on each call.

Just to make sure it wasn't Cython's fault, I prepared a stand-alone C unit test that uses the same inputs and regex.h; it also performed worse than Python (about 1.60 s, i.e. roughly 60% slower).

So, with all this, I would be grateful for any insight into why regexec has such poor performance.

I am running Python 2.7.10, gcc 4.9.2 and Cython 0.22, on Cygwin/Windows. I saw a similar discrepancy on Ubuntu.


1 answer


Based on what is in this question, I can guess at several problems:

- You are using a POSIX regex implementation on Windows through Cygwin, which adds overhead; Windows is not a POSIX system.
- You are comparing two different engines: Python's matcher (PCRE-like) against POSIX regex.h.
- Statically compiled code differs from calls into exported library functions (the compiler cannot optimize across that boundary).
- The stand-alone C program's footprint suggests that something extra is going on under the hood; compilation options and potential aliasing always make such comparisons difficult.

Beyond the stand-alone program, using translators/transpilers can always introduce delays. Optimization these days is largely the job of giving your compiler a clear picture of what you are doing and letting it work.



And sorry for this part, unrelated to the question per se, but it looks like you don't really need a regex here: a basic string matching algorithm, or a neat structure like a prefix tree (trie) walked with a simple loop, would accomplish your task.
