How much faster are regular expressions in C / Java than in Python?

Question

How much faster are regular expressions in C / Java than in Python?

I'm looking for tests that compare regex speeds between python and statically typed languages like C, Java, or C ++. I would also like to hear about Cython's performance for regular expressions.

+3

performance c python regex cython

Jeff mandell May 08 '15 @ 2:19 am

source to share

1 answer

Jerry coffin · Accepted Answer · 2015-05-08T02:39:06+0000

This is probably more implementation specific than language specific.

For example, some patterns are O (N ² ) with some implementations, but ~ O (N) with others. In particular, most RE implementations are based on NFAs (Non-Intermediate Finite State Machines). In short, this means that they can and will return in some cases with some patterns. This gives a rough O (N ² ) complexity. Deterministic finite state machines (DFAs) matching the same pattern never return - it always has linear complexity. At the same time, the compilation phase for DFA is usually more complicated than for NFA (and DFAs do not have all the capabilities of NFA).

Thus, with a lot of simple patterns that are not backtrace related, the RE RE engine can be faster than DFA based. But, when the NFA-based RE engine tries to match a pattern rather than backtracking, it can (and will) slow down dramatically. In the latter case, the DFA engine can be many times faster.

Most RE libraries usually start with a regular expression, represented as a string. When you do a RE based search / match, most of them are compiled into a data structure for their NFA / DFA. This compilation step takes some time (not a huge amount, but can become significant, especially if you are working with many different REs). Several RE engines (such as Boost XPressive) can collect regular expressions statically, that is, the RE is compiled at the same time as the program's source code. This can eliminate the time to compile the RE from the runtime of the program, so if your code spends a significant amount of time compiling the RE, it might improve it significantly (but it doesn't depend on static typing - at least as far as I know,you can't get the same thing in Java or C, or example). Several other languages (like D) provide sufficient capabilities that you could almost certainly do with them, but I am not aware of a real implementation for them that you can plan for now.

How much faster are regular expressions in C / Java than in Python?

More articles: