Optimizing regex filled with '?'

The shorthand keyboard has keys STKPWHRAO*EUFRPBLGTSDZ

. The user presses multiple keys, after which the keys are automatically registered when picked up. It's like playing chords on a piano. Examples of strokes: KAT

, TPHOEUGT

.

I have a regex that checks for correct steno chords. It can be any number of these keys, but they must be in that order. My solution qr/S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?/

, but since this regex is called hundreds of times, variable length can be a speed bottleneck. Each step forward in regex is a bigger and bigger set of possibilities because of all?

Is there a faster approach to this regex? I need the regex to fail if the keys are out of order.

+3


source to share


2 answers


To check if a string is a valid chord, you really need

/^(?=.)S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s

      

A simple optimization would be to make sure a match is possible.

/^(?=[STKPWHRAO*EUFBLGDZ])S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s

      

The next step is to eliminate backtracking. That time is wasted.

/
   ^
   (?=[STKPWHRAO*EUFBLGDZ])
   S?+ T?+ K?+ P?+ W?+ H?+ R?+ A?+ O?+ \*?+ E?+
   U?+ F?+ R?+ P?+ B?+ L?+ G?+ T?+ S?+  D?+ Z?+
   \z
/x

      

Fortunately, despite the fact that S

, T

, P

and R

appear twice, back trace can be completely removed without any problems. It should actually match the timing of practically nothing.

Even if that's not fast enough, the next step is to create a custom C function. Running the regular expression matching engine is expensive and can be avoided entirely with a simple function.

Note that the above optimizations only help if the pattern doesn't match. They should be neutral if the pattern matches. On the other hand, a C function would help even when the pattern matches.

Landmarks:



use strict;
use warnings;
use feature qw( say );

use Benchmark qw( cmpthese );

my %tests = (
   orig => q{ $s =~ /^(?=.)S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s},
   new  => q{ $s =~
      /
         ^
         (?=[STKPWHRAO*EUFBLGDZ])
         S?+ T?+ K?+ P?+ W?+ H?+ R?+ A?+ O?+ \*?+ E?+
         U?+ F?+ R?+ P?+ B?+ L?+ G?+ T?+ S?+  D?+ Z?+
         \z
      /x
   },
);

$_ = 'use strict; use warnings; our $s; ' . $_
   for values %tests;

{ say "Matching:";     local our $s = "STAODZ";   cmpthese(-3, \%tests); }
{ say "Not matching:"; local our $s = "STPRSTPR"; cmpthese(-3, \%tests); }

      

Output:

Matching:
         Rate  new orig
new  509020/s   -- -29%
orig 712274/s  40%   --
Not matching:
         Rate orig  new
orig 158758/s   -- -73%
new  579851/s 265%   --

      

Which means the match slowed down from 1.40 μs to 1.96 μs (in this case) and the
mismatch speed from 6.30 μs to 1.72 μs (in this case).


To check if a string is a valid chord sequence, you just need

/^[STKPWHRAO*EUFBLGDZ]+\z/

      


If you want to extract all the chords in a string, I would start by highlighting the sequences that match the following, then finding the chords in the extracted sequences:

/([STKPWHRAO*EUFBLGDZ]+)/

      

+5


source


variable length can be a speed bottleneck

You don't have to work this way

  • First write and debug your program

  • then if it's not fast enough for that purpose, profile your program to find where the bottlenecks are

  • then optimize bottlenecks

For heaven's sake, don't waste time trying to guess where the bottlenecks are and optimize them before your code is complete, as you will most likely find yourself wrong and wasted a lot of time



In any case, the regex engine is written in C and is pretty fast. I highly doubt if the short diagram you wrote will spend a significant amount of time testing

Each step forward in regex is a bigger and bigger set of possibilities because of all ?

This is also not the case. Only one character is tested at each point in the regular expression. The next character on the line is either the same or not. Or all is well, and the regex engine just goes to the next step in the pattern. The matching process will be nearly constant, no matter which row is to be matched.

0


source







All Articles