Finding a substring in list members with perl

Trying to figure out the fastest way to filter a long array by searching for a substring in elements with the following methods:

  • $str =~ /\.xml/

    - find ".xml" somewhere in the line
  • $str =~ /\.xml$/

    - find ".xml" at the end of the line
  • substr($str,-4) eq ".xml"

    are the last 4 characters of ".xml"?
  • rindex($str, ".xml")

    - any occurrence of ".xml"
  • (length($str) - rindex($str,".xml")) == 4

    - last 4 characters ".xml"?

I tried all of the above with while/if/push

and inside grep

, with the following code ( UPDATED with ideas from comments )

use 5.016;
use warnings;
use Benchmark qw(:all);

my $nmax = 5_000_000;
my @list = map { sprintf "a%s.%s", int(rand(100000000)), (int(rand(2))%2?"txt":"xml") } 1..$nmax;

cmpthese(10, {
        'whl_match'    => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( $x =~ /\.xml/  )}; },
        'whl_matchend' => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( $x =~ /\.xml$/ )}; },
        'whl_matchendz'=> sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( $x =~ /\.xml\z/ )}; },
        'whl_substr'   => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( substr($x,-4) eq ".xml" )}; },
        'whl_rindex'   => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( rindex($x,".xml") >= 0 )}; },
        'whl_lenrindex'=> sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if((length($x)-rindex($x,".xml"))==4)};},

        'for_match'    => sub { my @xml; for my $x (@list) { push(@xml, $x) if( $x =~ /\.xml/  )}; },
        'for_matchend' => sub { my @xml; for my $x (@list) { push(@xml, $x) if( $x =~ /\.xml$/ )}; },
        'for_matchendz'=> sub { my @xml; for my $x (@list) { push(@xml, $x) if( $x =~ /\.xml\z/ )}; },
        'for_substr'   => sub { my @xml; for my $x (@list) { push(@xml, $x) if( substr($x,-4) eq ".xml" )}; },
        'for_rindex'   => sub { my @xml; for my $x (@list) { push(@xml, $x) if( rindex($x,".xml") >= 0 )}; },
        'for_lenrindex'=> sub { my @xml; for my $x (@list) { push(@xml, $x) if((length($x)-rindex($x,".xml"))==4)};},

        'grp_match'    => sub { my @xml = grep { /\.xml/ }  @list; },
        'grp_matchend' => sub { my @xml = grep { /\.xml$/ } @list; },
        'grp_matchendz'=> sub { my @xml = grep { /\.xml\z/ } @list; },
        'grp_substr'   => sub { my @xml = grep { substr($_,-4) eq ".xml" } @list; },
        'grp_rindex'   => sub { my @xml = grep { rindex($_,".xml") >= 0 } @list; },
        'grp_lenrindex'=> sub { my @xml = grep { (length($_) - rindex($_,".xml")) == 4 } @list; },
});

      

Result on my carp laptop.

              s/iter whl_matchend whl_matchendz grp_matchendz grp_matchend whl_lenrindex whl_match whl_substr grp_match whl_rindex for_matchend for_matchendz for_lenrindex for_match grp_lenrindex for_substr for_rindex grp_substr grp_rindex
whl_matchend    4.48           --           -0%          -10%         -12%          -17%      -21%       -24%      -25%       -32%         -47%          -47%          -67%      -70%          -70%       -73%       -73%       -77%       -78%
whl_matchendz   4.46           0%            --           -9%         -11%          -17%      -21%       -23%      -25%       -32%         -46%          -46%          -67%      -69%          -70%       -73%       -73%       -76%       -78%
grp_matchendz   4.05          11%           10%            --          -2%           -9%      -13%       -15%      -17%       -25%         -41%          -41%          -63%      -66%          -67%       -70%       -70%       -74%       -76%
grp_matchend    3.96          13%           13%            2%           --           -6%      -11%       -14%      -15%       -24%         -40%          -40%          -62%      -66%          -66%       -70%       -70%       -73%       -75%
whl_lenrindex   3.70          21%           21%            9%           7%            --       -5%        -8%       -9%       -18%         -35%          -35%          -60%      -63%          -64%       -67%       -67%       -72%       -73%
whl_match       3.53          27%           27%           15%          12%            5%        --        -3%       -5%       -14%         -32%          -32%          -58%      -61%          -62%       -66%       -66%       -70%       -72%
whl_substr      3.42          31%           30%           18%          16%            8%        3%         --       -2%       -12%         -30%          -30%          -57%      -60%          -61%       -65%       -65%       -69%       -71%
grp_match       3.36          33%           33%           20%          18%           10%        5%         2%        --       -10%         -29%          -29%          -56%      -59%          -60%       -64%       -64%       -69%       -71%
whl_rindex      3.02          48%           48%           34%          31%           22%       17%        13%       11%         --         -21%          -21%          -51%      -55%          -56%       -60%       -60%       -65%       -67%
for_matchend    2.40          87%           86%           69%          65%           55%       47%        43%       40%        26%           --           -0%          -38%      -43%          -44%       -50%       -50%       -56%       -59%
for_matchendz   2.39          87%           87%           69%          65%           55%       47%        43%       40%        26%           0%            --          -38%      -43%          -44%       -50%       -50%       -56%       -59%
for_lenrindex   1.49         201%          200%          172%         166%          149%      137%       130%      126%       103%          61%           61%            --       -8%          -10%       -19%       -19%       -29%       -33%
for_match       1.36         229%          227%          197%         191%          172%      159%       151%      146%       122%          76%           76%            9%        --           -2%       -11%       -12%       -23%       -27%
grp_lenrindex   1.33         237%          236%          204%         198%          178%      165%       157%      153%       127%          80%           80%           12%        2%            --        -9%        -9%       -21%       -26%
for_substr      1.21         271%          270%          235%         228%          207%      192%       184%      178%       150%          98%           98%           23%       13%           10%         --        -0%       -13%       -18%
for_rindex      1.20         272%          271%          236%         229%          208%      193%       184%      179%       151%          99%           99%           23%       13%           10%         0%         --       -13%       -18%
grp_substr      1.05         326%          325%          285%         277%          252%      235%       226%      220%       188%         128%          128%           41%       30%           27%        15%        15%         --        -6%
grp_rindex     0.990         352%          351%          309%         300%          274%      256%       246%      239%       205%         142%          142%           50%       38%           34%        22%        22%         6%         --

      

I repeated the test many times and always got the above order.


Question 1.

As I expected, grep

faster than similar while/if/push

, but surprised me with the following:

Comparison:

              s/iter
whl_matchend    4.54
grp_matchend    3.98

      

grep

just slightly faster as similar while/if/push

.

Why, for example, in the following:

whl_substr      3.23
grp_substr      1.05

      

grep

3 times faster as while/if/push

. So why grep

is it almost 3x faster while/if/push

when running substr

and not so fast (only 14%) when running /regex-match/

? Similarly, you can see any "line print".

In other words, it grep {/regex/}

has only a small increase in speed above while/if/push

, but grep {substr}

has a huge increase . Why


Question 2

Another surprise (at least to me) is this: Why $str =~ /\.xml/

faster than $str =~ /\.xml$/

? I am expecting the hint to $

speed up the rexex because it doesn’t need to search the entire string, but this is an incorrect assumption, as tested with the following:

use 5.016;
use warnings;
use Benchmark qw(:all);

my $str = "a38877283.xml";
cmpthese(10, {
    'match'     => sub { $str =~ /\.xml/  for (1..5_000_000) },
    'matchend'  => sub { $str =~ /\.xml$/ for (1..5_000_000) },
    'matchendz' => sub { $str =~ /\.xml\z/ for (1..5_000_000) },  #updated the \z
});

      

for perl 5, version 20, subversion 0 (v5.20.0) built for darwin-2level

(perlbrew)

          s/iter  matchend matchendz     match
matchend    2.32        --       -1%      -64%
matchendz   2.30        1%        --      -63%
match      0.844      175%      173%        --

      

With: perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level

(default OS X)

             Rate matchendz     match  matchend
matchendz 0.405/s        --      -69%      -70%
match      1.29/s      218%        --       -5%
matchend   1.36/s      235%        5%        --

      

old perl is faster.;)

OS:

Darwin jabko.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64

      


Last question

  • still untested the impact of precompiled regexes qr

    - any other idea what could be the fastest filter?
+3


source to share


1 answer


As for the last question, try \z

instead $

as it \z

matches the end of a line and $

also looks for an optional trailing newline ( perldoc perlre

).

use Benchmark qw(:all);
my $str = "a38877283.xml";
cmpthese(10, {
    'match'    => sub { $str =~ /\.xml/  for (1..5_000_000) },
    'matchend' => sub { $str =~ /\.xml$/ for (1..5_000_000) },
    'matchend2' => sub { $str =~ /\.xml\z/ for (1..5_000_000) },
});

      



Output

             Rate matchend2  matchend     match
matchend2 0.473/s        --      -58%      -59%
matchend   1.14/s      140%        --       -1%
match      1.15/s      143%        1%        --

      

+2


source







All Articles