Can threading or multiprocessing improve performance when parsing a single string with multiple regular expressions?

If I want to parse a string using dozens of regular expressions,
can threading or multiprocessing improve performance?
In other words, would parsing the string across multiple threads or processes be faster than:

match = re.search(regex1, string)
if match:
    afunction(match)
else:
    match = re.search(regex2, string)
    if match:
        bfunction(match)
    else:
        match = re.search(regex3, string)
        if match:
            cfunction(match)
...


At most one regex will ever match, so overlapping results are not a concern.
If the answer is multiprocessing, which mechanism would you recommend looking at (queues, pipes)?



3 answers


Python threads will not improve performance because of the GIL, which prevents multiple threads from executing Python bytecode concurrently. If you have a multi-core machine, multiple processes may speed things up, but only if the cost of spawning the subprocesses and transferring the data is less than the cost of running your RE searches.



If you do this a lot, you could look at process pools (e.g. multiprocessing.Pool).
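A minimal sketch of the multiple-processes idea using multiprocessing.Pool (the patterns, helper names, and sample text are invented for illustration; whether this beats sequential matching depends on the process-spawning and data-transfer costs mentioned above):

```python
import re
from multiprocessing import Pool

# Hypothetical patterns; in practice these would be your dozens of REs.
PATTERNS = [r"\d+", r"[a-z]+", r"[A-Z]+"]

def search_one(args):
    """Search `text` with one pattern; return (index, matched text) or None."""
    index, pattern, text = args
    match = re.search(pattern, text)
    return (index, match.group(0)) if match else None

def parallel_search(patterns, text):
    """Fan the searches out over a pool of worker processes."""
    with Pool() as pool:
        results = pool.map(search_one,
                           [(i, p, text) for i, p in enumerate(patterns)])
    return [r for r in results if r is not None]

if __name__ == "__main__":
    # map() preserves input order, so results line up with PATTERNS.
    print(parallel_search(PATTERNS, "hello WORLD 42"))
```

Pool.map handles the queue/pipe plumbing the question asks about internally, which is usually simpler than managing Queue or Pipe objects by hand.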



Regexes themselves are a powerful answer here. You can combine all of your regular expressions into one big regex, as in the following example. (In the example, substitute your real patterns for a, b, c and d.)

(?P<A>a)|(?P<B>b)|(?P<C>c)|(?P<D>d)

Use lastindex on the MatchObject to find out the index of the group that matched. Use groupindex on the RegexObject (which maps group names to indices) to translate that index back into the group's name, i.e. the label in the angle brackets (I used uppercase names for them in the example above).
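The lastindex/groupindex technique can be sketched as follows (the group names and patterns are illustrative placeholders):

```python
import re

# Combined regex: one named group per alternative. Replace the group
# bodies with your real patterns; names A-C are arbitrary labels.
combined = re.compile(r"(?P<A>\d+)|(?P<B>[a-z]+)|(?P<C>[A-Z]+)")

# groupindex maps name -> group number; invert it to go number -> name.
index_to_name = {num: name for name, num in combined.groupindex.items()}

def which_matched(text):
    """Return (group name, matched text), or None if nothing matched."""
    m = combined.search(text)
    if m is None:
        return None
    return index_to_name[m.lastindex], m.group(m.lastindex)

print(which_matched("hello"))  # ('B', 'hello')
print(which_matched("42"))     # ('A', '42')
```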

Edit: (performance analysis)

In cases where the regular expressions involved are simple enough to describe regular languages, and can therefore be matched quickly by a finite state machine, this approach actually yields performance similar to parallel evaluation, which is surprising given that only one CPU is used.



The reason is that the alternation operator |, along with ?, *, [] and repetition (but unlike most uses of backreferences), is among the operators allowed in regular expressions that define regular languages. (Note the closure properties in the link.) Therefore the combined regex can also be matched by a finite state machine, without backtracking.

Finite state automata perform only a constant number of operations on each character of the input string. They maintain a state (essentially a memory pointer) representing the "potential matching progress" at the current input position. For the combined regex, the FSA is larger and takes longer to compile (and the state has more possible values). But this compilation (creating the Regex object) can be done once at application startup, after which each subsequent input can be matched quickly.

Compare this to running the separate regexes in parallel threads. Each regex makes similar progress on the input, except for regexes that fail early: final rejection of a non-matching regex is usually much faster than a successful match. There is one minor advantage on the threading side: the fastest match can terminate all other computation, whereas the combined regex must finish evaluating all alternatives (all groups). In practice, thread startup and pool overhead will more than cancel this out, and with many regexes the advantage becomes negligible.

The performance benefit of the combined-regex approach is especially noticeable with a large number of regexes, and it comes at the cost of increased memory consumption.

So while this answers the question by effectively doing parallel matching under the hood, for smaller cases such as a handful of regexes it may not be worth the extra complexity of combining them.
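To connect this back to the question's afunction/bfunction chain: a sketch of dispatching on the matched group name (the handler functions and patterns below are invented stand-ins for the question's real ones):

```python
import re

# Stand-ins for the question's afunction / bfunction handlers.
def handle_number(m): return ("number", m.group(0))
def handle_word(m):   return ("word", m.group(0))

handlers = {"NUM": handle_number, "WORD": handle_word}
combined = re.compile(r"(?P<NUM>\d+)|(?P<WORD>[a-z]+)")
index_to_name = {v: k for k, v in combined.groupindex.items()}

def dispatch(text):
    """One search over the combined regex, then call the right handler."""
    m = combined.search(text)
    if m is None:
        return None
    return handlers[index_to_name[m.lastindex]](m)

print(dispatch("abc"))  # ('word', 'abc')
print(dispatch("99"))   # ('number', '99')
```

This replaces the question's nested if/else chain with a single search plus a dictionary lookup.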



Is performance really your concern? If not, just put all the REs in a list and loop over it!

for myRE in myListOfRE:
    result = myRE.search(string)
    if result is not None:
        # do something with the result
        break

If performance really is a concern, I think multithreading should help: the RE only needs read-only access to the search string, so it can be shared. I'm not a Python expert though, so I can't tell you exactly how.
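A sketch of the thread-based idea using concurrent.futures (the patterns are invented; note the caveat from the first answer that the GIL prevents true CPU parallelism for Python threads, so this mainly demonstrates sharing the read-only string):

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Invented patterns; the search string is shared read-only by all threads.
PATTERNS = [re.compile(r"\d+"), re.compile(r"[a-z]+")]

def threaded_search(patterns, text):
    """Run each compiled regex against the same string in its own thread."""
    with ThreadPoolExecutor(max_workers=len(patterns)) as pool:
        matches = pool.map(lambda p: p.search(text), patterns)
    return [m.group(0) for m in matches if m]

print(threaded_search(PATTERNS, "abc 42"))  # ['42', 'abc']
```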


