Finding a substring in list members with perl
Trying to figure out the fastest way to filter a long array by searching for a substring in elements with the following methods:
-
$str =~ /\.xml/
- find ".xml" somewhere in the line -
$str =~ /\.xml$/
- find ".xml" at the end of the line -
substr($str,-4) eq ".xml"
are the last 4 characters of ".xml"? -
rindex($str, ".xml")
- any occurrence of ".xml" -
(length($str) - rindex($str,".xml")) == 4
- last 4 characters ".xml"?
I tried all of the above with while/if/push
and inside grep
, with the following code ( UPDATED with ideas from comments )
use 5.016;
use warnings;
use Benchmark qw(:all);
my $nmax = 5_000_000;
my @list = map { sprintf "a%s.%s", int(rand(100000000)), (int(rand(2))%2?"txt":"xml") } 1..$nmax;
cmpthese(10, {
'whl_match' => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( $x =~ /\.xml/ )}; },
'whl_matchend' => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( $x =~ /\.xml$/ )}; },
'whl_matchendz'=> sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( $x =~ /\.xml\z/ )}; },
'whl_substr' => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( substr($x,-4) eq ".xml" )}; },
'whl_rindex' => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( rindex($x,".xml") >= 0 )}; },
'whl_lenrindex'=> sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if((length($x)-rindex($x,".xml"))==4)};},
'for_match' => sub { my @xml; for my $x (@list) { push(@xml, $x) if( $x =~ /\.xml/ )}; },
'for_matchend' => sub { my @xml; for my $x (@list) { push(@xml, $x) if( $x =~ /\.xml$/ )}; },
'for_matchendz'=> sub { my @xml; for my $x (@list) { push(@xml, $x) if( $x =~ /\.xml\z/ )}; },
'for_substr' => sub { my @xml; for my $x (@list) { push(@xml, $x) if( substr($x,-4) eq ".xml" )}; },
'for_rindex' => sub { my @xml; for my $x (@list) { push(@xml, $x) if( rindex($x,".xml") >= 0 )}; },
'for_lenrindex'=> sub { my @xml; for my $x (@list) { push(@xml, $x) if((length($x)-rindex($x,".xml"))==4)};},
'grp_match' => sub { my @xml = grep { /\.xml/ } @list; },
'grp_matchend' => sub { my @xml = grep { /\.xml$/ } @list; },
'grp_matchendz'=> sub { my @xml = grep { /\.xml\z/ } @list; },
'grp_substr' => sub { my @xml = grep { substr($_,-4) eq ".xml" } @list; },
'grp_rindex' => sub { my @xml = grep { rindex($_,".xml") >= 0 } @list; },
'grp_lenrindex'=> sub { my @xml = grep { (length($_) - rindex($_,".xml")) == 4 } @list; },
});
Result on my carp laptop.
s/iter whl_matchend whl_matchendz grp_matchendz grp_matchend whl_lenrindex whl_match whl_substr grp_match whl_rindex for_matchend for_matchendz for_lenrindex for_match grp_lenrindex for_substr for_rindex grp_substr grp_rindex
whl_matchend 4.48 -- -0% -10% -12% -17% -21% -24% -25% -32% -47% -47% -67% -70% -70% -73% -73% -77% -78%
whl_matchendz 4.46 0% -- -9% -11% -17% -21% -23% -25% -32% -46% -46% -67% -69% -70% -73% -73% -76% -78%
grp_matchendz 4.05 11% 10% -- -2% -9% -13% -15% -17% -25% -41% -41% -63% -66% -67% -70% -70% -74% -76%
grp_matchend 3.96 13% 13% 2% -- -6% -11% -14% -15% -24% -40% -40% -62% -66% -66% -70% -70% -73% -75%
whl_lenrindex 3.70 21% 21% 9% 7% -- -5% -8% -9% -18% -35% -35% -60% -63% -64% -67% -67% -72% -73%
whl_match 3.53 27% 27% 15% 12% 5% -- -3% -5% -14% -32% -32% -58% -61% -62% -66% -66% -70% -72%
whl_substr 3.42 31% 30% 18% 16% 8% 3% -- -2% -12% -30% -30% -57% -60% -61% -65% -65% -69% -71%
grp_match 3.36 33% 33% 20% 18% 10% 5% 2% -- -10% -29% -29% -56% -59% -60% -64% -64% -69% -71%
whl_rindex 3.02 48% 48% 34% 31% 22% 17% 13% 11% -- -21% -21% -51% -55% -56% -60% -60% -65% -67%
for_matchend 2.40 87% 86% 69% 65% 55% 47% 43% 40% 26% -- -0% -38% -43% -44% -50% -50% -56% -59%
for_matchendz 2.39 87% 87% 69% 65% 55% 47% 43% 40% 26% 0% -- -38% -43% -44% -50% -50% -56% -59%
for_lenrindex 1.49 201% 200% 172% 166% 149% 137% 130% 126% 103% 61% 61% -- -8% -10% -19% -19% -29% -33%
for_match 1.36 229% 227% 197% 191% 172% 159% 151% 146% 122% 76% 76% 9% -- -2% -11% -12% -23% -27%
grp_lenrindex 1.33 237% 236% 204% 198% 178% 165% 157% 153% 127% 80% 80% 12% 2% -- -9% -9% -21% -26%
for_substr 1.21 271% 270% 235% 228% 207% 192% 184% 178% 150% 98% 98% 23% 13% 10% -- -0% -13% -18%
for_rindex 1.20 272% 271% 236% 229% 208% 193% 184% 179% 151% 99% 99% 23% 13% 10% 0% -- -13% -18%
grp_substr 1.05 326% 325% 285% 277% 252% 235% 226% 220% 188% 128% 128% 41% 30% 27% 15% 15% -- -6%
grp_rindex 0.990 352% 351% 309% 300% 274% 256% 246% 239% 205% 142% 142% 50% 38% 34% 22% 22% 6% --
I repeated the test many times and always got the above order.
Question 1.
As I expected, grep
faster than similar while/if/push
, but surprised me with the following:
Comparison:
s/iter
whl_matchend 4.54
grp_matchend 3.98
grep
just slightly faster as similar while/if/push
.
Why, for example, in the following:
whl_substr 3.23
grp_substr 1.05
grep
3 times faster as while/if/push
. So why grep
is it almost 3x faster while/if/push
when running substr
and not so fast (only 14%) when running /regex-match/
? Similarly, you can see any "line print".
In other words, it grep {/regex/}
has only a small increase in speed above while/if/push
, but grep {substr}
has a huge increase . Why
Question 2
Another surprise (at least to me) is this: Why $str =~ /\.xml/
faster than $str =~ /\.xml$/
? I am expecting the hint to $
speed up the rexex because it doesn’t need to search the entire string, but this is an incorrect assumption, as tested with the following:
use 5.016;
use warnings;
use Benchmark qw(:all);
my $str = "a38877283.xml";
cmpthese(10, {
'match' => sub { $str =~ /\.xml/ for (1..5_000_000) },
'matchend' => sub { $str =~ /\.xml$/ for (1..5_000_000) },
'matchendz' => sub { $str =~ /\.xml\z/ for (1..5_000_000) }, #updated the \z
});
for perl 5, version 20, subversion 0 (v5.20.0) built for darwin-2level
(perlbrew)
s/iter matchend matchendz match
matchend 2.32 -- -1% -64%
matchendz 2.30 1% -- -63%
match 0.844 175% 173% --
With: perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level
(default OS X)
Rate matchendz match matchend
matchendz 0.405/s -- -69% -70%
match 1.29/s 218% -- -5%
matchend 1.36/s 235% 5% --
old perl is faster.;)
OS:
Darwin jabko.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun 3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64
Last question
- still untested the impact of precompiled regexes
qr
- any other idea what could be the fastest filter?
source to share
As for the last question, try \z
instead $
as it \z
matches the end of a line and $
also looks for an optional trailing newline ( perldoc perlre
).
use Benchmark qw(:all);
my $str = "a38877283.xml";
cmpthese(10, {
'match' => sub { $str =~ /\.xml/ for (1..5_000_000) },
'matchend' => sub { $str =~ /\.xml$/ for (1..5_000_000) },
'matchend2' => sub { $str =~ /\.xml\z/ for (1..5_000_000) },
});
Output
Rate matchend2 matchend match
matchend2 0.473/s -- -58% -59%
matchend 1.14/s 140% -- -1%
match 1.15/s 143% 1% --
source to share