Is string comparison or hash faster in Perl?

I have code that will execute a lot on the one hand and am wondering which way is a more efficient implementation. I'll use a for loop to mimic the part you are doing:

Option A:

my %sections = (
    'somestring1' => 1,
    'somestring2' => 1,
    'somestring3' => 1,
    'somestring4' => 1
);

for (0..10000)
{
    # $element is chosen at random
    $namespace = $element if $sections{$element};
}

      

Option B:

for (0..10000)
{
    # $element is chosen at random
    $namespace = $element if ($element eq'somestring1' || 
                            $element eq'somestring2' ||
                            $element eq'somestring3' ||
                            $element eq'somestring4');
}

      

Can anyone check this out or know the answer as I am not familiar with comparison tools.

This code probably doesn't make sense in this context, but this is actually what I need to use.

+2


source to share


5 answers


Use a function cmpthese

from the Benchmark module

use strict;
use warnings;
use Benchmark qw'cmpthese';

my %sections = (
    somestring1 => 1,
    somestring2 => 1,
    somestring3 => 1,
    somestring4 => 1
);

my @elements = map { 'somestring' . int(1 + rand(10)) } 1 .. 100;

my $namespace;

cmpthese(100000, {
    hash_value => sub {
        foreach my $element (@elements) {
            $namespace = $element if $sections{$element};
        }
    },
    hash_exists => sub {
        foreach my $element (@elements) {
            $namespace = $element if exists $sections{$element};
        }
    },
    string_cmp => sub {
        foreach my $element (@elements) {
            $namespace = $element if (
                $element eq'somestring1' ||
                $element eq'somestring2' ||
                $element eq'somestring3' ||
                $element eq'somestring4');
        }
    },
});

      

My results (running Perl 5.10 on WinXP):



               Rate  string_cmp  hash_value hash_exists
string_cmp  18932/s          --        -44%        -50%
hash_value  33512/s         77%          --        -12%
hash_exists 38095/s        101%         14%          --

      

Thus, searching for hashes is 77% faster than cascading string comparisons and checking for the existence of a hash instead of a value (as suggested by Adam Bellair) is 14% faster.

+17


source


I assume the first version with exists

will be faster, not to mention more readable and maintainable.

for (0..10000)
{
    # $element is chosen at random
    $namespace = $element if exists $sections{$element};
}

      



Just checking if a hash key exists is faster than getting its value for comparison, so use exists

.

+5


source


The hash lookup mechanism is much faster.

+4


source


It's probably time to get familiar with benchmarking tools like the CPAN Benchmark module .

+3


source


Michael Karman's metric is good, but the results are pretty outdated so people not running it on their machine might be wrong. So, the same test (10x more cases in total to give more consistent results) on a Mac Pro with Mac OS X and Perl 5.24.1:

               Rate  string_cmp hash_exists  hash_value
string_cmp  54142/s          --        -28%        -32%
hash_exists 74850/s         38%          --         -7%
hash_value  80192/s         48%          7%          --

      

However, on AWS with CentOS 7 / Perl 5.24.0 we get:

string_cmp  70373/s          --        -24%        -25%
hash_value  92851/s         32%          --         -1%
hash_exists 93545/s         33%          1%          --

      

So, I would say test your own machine, but it doesn't seem to provide an advantage yet (on my Mac with the latest Perl, it's even noticeably slower in this particular test - and even about other tests).

One thing I don't like about the benchmark is that it chooses quite arbitrarily to compare 4 equalities with a hash check. Don't be confused if we only have one element in the hash, so we are comparing to one equality we get (on my Mac Pro / Perl 5.24.1):

hash_value  119474/s          --         -1%        -14%        -34%
hash_exists 121065/s          1%          --        -12%        -33%
grep        138122/s         16%         14%          --        -23%
string_cmp  180180/s         51%         49%         30%          --

      

I chose one grep there, which avoids the foreach loop for comparison. So one equality is obviously faster than a hash check, but not twice as fast, so if you can only replace two equalities with a hash check, you get the advantage:

string_cmp  104167/s          --        -15%        -17%
hash_value  121951/s         17%          --         -2%
hash_exists 125000/s         20%          2%          --

      

However, this is because the hash was inlined outside of the loop as in the original example. What if we generate a hash on each reference cycle? That is, you have a couple of values ​​that you want to check for existence in an array, is the overhead of creating a hash with them worth it? I won't bore you with a lot of results, as you may find that they are different on your machine, but the answer is that for the two values ​​"it depends" (so maybe you shouldn't worry), but for 3 or more you must do it.

0


source







All Articles