Optimizing regex filled with '?'
The shorthand keyboard has keys STKPWHRAO*EUFRPBLGTSDZ
. The user presses multiple keys, after which the keys are automatically registered when picked up. It's like playing chords on a piano. Examples of strokes: KAT
, TPHOEUGT
.
I have a regex that checks for correct steno chords. It can be any number of these keys, but they must be in that order. My solution qr/S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?/
, but since this regex is called hundreds of times, variable length can be a speed bottleneck. Each step forward in regex is a bigger and bigger set of possibilities because of all?
Is there a faster approach to this regex? I need the regex to fail if the keys are out of order.
source to share
To check if a string is a valid chord, you really need
/^(?=.)S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s
A simple optimization would be to make sure a match is possible.
/^(?=[STKPWHRAO*EUFBLGDZ])S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s
The next step is to eliminate backtracking. That time is wasted.
/ ^ (?=[STKPWHRAO*EUFBLGDZ]) S?+ T?+ K?+ P?+ W?+ H?+ R?+ A?+ O?+ \*?+ E?+ U?+ F?+ R?+ P?+ B?+ L?+ G?+ T?+ S?+ D?+ Z?+ \z /x
Fortunately, despite the fact that S
, T
, P
and R
appear twice, back trace can be completely removed without any problems. It should actually match the timing of practically nothing.
Even if that's not fast enough, the next step is to create a custom C function. Running the regular expression matching engine is expensive and can be avoided entirely with a simple function.
Note that the above optimizations only help if the pattern doesn't match. They should be neutral if the pattern matches. On the other hand, a C function would help even when the pattern matches.
Landmarks:
use strict;
use warnings;
use feature qw( say );
use Benchmark qw( cmpthese );
my %tests = (
orig => q{ $s =~ /^(?=.)S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s},
new => q{ $s =~
/
^
(?=[STKPWHRAO*EUFBLGDZ])
S?+ T?+ K?+ P?+ W?+ H?+ R?+ A?+ O?+ \*?+ E?+
U?+ F?+ R?+ P?+ B?+ L?+ G?+ T?+ S?+ D?+ Z?+
\z
/x
},
);
$_ = 'use strict; use warnings; our $s; ' . $_
for values %tests;
{ say "Matching:"; local our $s = "STAODZ"; cmpthese(-3, \%tests); }
{ say "Not matching:"; local our $s = "STPRSTPR"; cmpthese(-3, \%tests); }
Output:
Matching:
Rate new orig
new 509020/s -- -29%
orig 712274/s 40% --
Not matching:
Rate orig new
orig 158758/s -- -73%
new 579851/s 265% --
Which means the match slowed down from 1.40 μs to 1.96 μs (in this case) and the
mismatch speed from 6.30 μs to 1.72 μs (in this case).
To check if a string is a valid chord sequence, you just need
/^[STKPWHRAO*EUFBLGDZ]+\z/
If you want to extract all the chords in a string, I would start by highlighting the sequences that match the following, then finding the chords in the extracted sequences:
/([STKPWHRAO*EUFBLGDZ]+)/
source to share
variable length can be a speed bottleneck
You don't have to work this way
-
First write and debug your program
-
then if it's not fast enough for that purpose, profile your program to find where the bottlenecks are
-
then optimize bottlenecks
For heaven's sake, don't waste time trying to guess where the bottlenecks are and optimize them before your code is complete, as you will most likely find yourself wrong and wasted a lot of time
In any case, the regex engine is written in C and is pretty fast. I highly doubt if the short diagram you wrote will spend a significant amount of time testing
Each step forward in regex is a bigger and bigger set of possibilities because of all
?
This is also not the case. Only one character is tested at each point in the regular expression. The next character on the line is either the same or not. Or all is well, and the regex engine just goes to the next step in the pattern. The matching process will be nearly constant, no matter which row is to be matched.
source to share