Perl: trying to speed up parsing a delimited file

I have a large flat text file with lines that contain name / value pairs ("varname = value"). These pairs are separated by a multi-character separator. Thus, one line in this file might look like this:

var1=value1|^|var2=value2|^|var3=value3|^|var4=value4

      

Each line contains about 50 name / value pairs.

I need to iterate over the lines of this file (there are about 100,000 lines) and store the name / value pairs in a hash so that

$field{'var1'} = value1
$field{'var2'} = value2
etc...

      

I did this:

# $line holds a single line from the file

my @fields = split( /\Q|^|\E/, $line );
foreach my $field (@fields) {
  my ($name, $value) = split( /=/, $field );
  $hash{$name} = $value;
}

      

On my PC it takes about 2 seconds in total to do this for every line in the file. That doesn't seem like a long time, but I really want to speed it up a bit.

In those 2 seconds, the first split takes about 0.6 seconds and the foreach loop takes about 1.4 seconds. So I thought I would get rid of the foreach loop and put it all in one split:

%hash = split( /\Q|^|\E|=/, $line );

      

Much to my surprise, parsing the entire file this way took a second and a half longer! My question is not so much why it takes longer (although it would be nice to know), but whether there are other (faster) ways to do this.

Thanks in advance.

------ Edit below this line ------

I just found out that changing this:

%hash = split( /\Q|^|\E|=/, $line );

      

into this:

$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );

      

makes it three times faster! Parsing the entire file this way now takes just over a second ...

------ Snippet below this line ------

use strict;
use Time::HiRes qw( time );

my $line = "a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";

ResetTimer();
my %hash;
for( my $i = 1; $i <= 100000; $i++ ) {
  my @fields = split( /\Q|^|\E/, $line );
  foreach my $field (@fields) {
    my ($name, $value) = split( /=/, $field );
    $hash{$name} = $value;
  }
}
print Elapsed() . "\n";

ResetTimer();
%hash = ();
for( my $i = 1; $i <= 100000; $i++ ) {
  %hash = split( /\Q|^|\E|=/, $line );
}
print Elapsed() . "\n";

ResetTimer();
%hash = ();
for( my $i = 1; $i<=100000; $i++ ) {
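  # Substitute on a copy: changing $line itself would leave no |^| separators after the first iteration
  ( my $copy = $line ) =~ s/\Q|^|\E/=/g;
  %hash = split( /=/, $copy );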
}
print Elapsed() . "\n";

################################################################################################################################
BEGIN {
  my $startTime;
  sub ResetTimer {
    $startTime = time();
    return $startTime;
  }
  sub Elapsed {
    return time() - $startTime;
  }
}

      



1 answer


I can't answer your performance question without a test case, but I would guess it has something to do with how the regex is handled.

You can see what the regex engine is doing with use re 'debug'; which prints the steps the regex takes.
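A minimal sketch of how that pragma could be applied to the alternation split from the question. The sample line is a shortened, made-up one; the pragma is lexically scoped, so only the pattern inside the braces is traced, and the trace is written to STDERR:

```perl
use strict;
use warnings;

# Shortened sample line in the same format as the question
my $line = 'var1=value1|^|var2=value2';

my @parts;
{
    # Only regexes compiled inside this block produce debug output
    use re 'debug';
    @parts = split /\Q|^|\E|=/, $line;
}

print scalar(@parts), " fields: @parts\n";
```

Running it with 2>/dev/null hides the trace and leaves only the final print.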

But on the broader question, I would most likely handle it with a single global regex match (assuming your data is as simple as the example):

#!/usr/bin/env perl
use strict;
use warnings; 
use Data::Dumper;

while ( <DATA> ) { 
   my %row = m/(\w+)=(\w+)/g;
   print Dumper \%row;
}

__DATA__
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4

      

You can use lookahead / lookbehind to match the delimiters if you have more complex data, but since this is one regex per line, you invoke the regex engine less often, and that will probably be faster. (But I can't tell you for sure without a test case.)
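As a rough sketch of the lookahead idea (not taken from the original post): the pattern below consumes the separator in front of each name and uses a lookahead to stop each value at the next separator or the end of the string. The sample line and the $sep variable are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line: values contain a space and a single pipe
my $line = "var1=value 1|^|var2=a|b|^|var3=x\n";
chomp $line;

# Escape the separator once so it can be interpolated into the pattern
my $sep = quotemeta '|^|';

# Each pair starts at the beginning of the string or after a separator.
# The value is matched lazily and ends where the lookahead sees the
# next separator or the end of the string.
my %row = $line =~ /(?:^|$sep)([^=]+)=(.*?)(?=$sep|\z)/g;

print "$_ => $row{$_}\n" for sort keys %row;
```

Values may then contain anything except the full |^| sequence, at the cost of a more involved pattern than (\w+)=(\w+).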

If your data is more complex, then perhaps:

my %row = s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;

      

This "forces" the input to be split onto separate lines and then matches "anything" = "anything" on each line. But this is probably overkill unless your values include spaces / pipes / metacharacters.
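To make the difference between the two patterns concrete, here is a small sketch on a made-up line whose values contain a space, slashes and a pipe. The /r flag requires Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical line: one value has a space, one has slashes and a pipe
my $line = 'name=John Smith|^|path=/tmp/a|b|^|lang=perl';

# \w+ on both sides: a value ends at its first non-word character,
# and a pair whose value starts with a non-word character is skipped
my %simple = $line =~ m/(\w+)=(\w+)/g;

# Put one pair per line first, then match "anything = anything"
my %compound = $line =~ s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;

printf "%-8s => %s\n", $_, $simple{$_}   for sort keys %simple;
print "--\n";
printf "%-8s => %s\n", $_, $compound{$_} for sort keys %compound;
```

Here %simple ends up with only name => John and lang => perl, while %compound keeps all three pairs intact. Because the first (.*) is greedy, a value that itself contains = would be split at the last = on its line; using ([^=]*) for the name avoids that.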

With the test case from your edit, using Benchmark:

#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw ( cmpthese );

my $line =
  "a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";

sub double_split {
   my %hash;
   my @fields = split( /\Q|^|\E/, $line );
   foreach my $field (@fields) {
      my ( $name, $value ) = split( /=/, $field );
      $hash{$name} = $value;
   }
}

sub single_split {
   my %hash = split( /\Q|^|\E|=/, $line );
}

sub re_replace_then_split {
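   # Note: this changes the shared $line in place, so every later call
   # (in any of these subs) no longer sees the |^| separators
   $line =~ s/\Q|^|\E/=/g;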
   my %hash = split( /=/, $line );
}

sub single_regex {
   my %hash = $line =~ m/(\w+)=(\w+)/g;
}

sub compound {
   my %hash = $line =~ s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
}

cmpthese(
   1_000_000,
   {  "Double Split"                 => \&double_split,
      "single split with regex"      => \&single_split,
      "Replace then split"           => \&re_replace_then_split,
      "Single Regex"                 => \&single_regex,
      "regex to linefeed them match" => \&compound
   }
);

      



The results look like this:

                                 Rate Double Split single split with regex Single Regex Replace then split regex to linefeed them match
Double Split                  18325/s           --                     -4%         -34%               -56%                         -97%
single split with regex       19050/s           4%                      --         -31%               -54%                         -97%
Single Regex                  27607/s          51%                     45%           --               -34%                         -96%
Replace then split            41733/s         128%                    119%          51%                 --                         -93%
regex to linefeed them match 641026/s        3398%                   3265%        2222%              1436%                           --

      

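... I am a little suspicious of that last one because it is absurdly faster. Some caching of results may be happening there. A more likely cause is that re_replace_then_split modifies $line in place: once it has run, there are no |^| separators left, so the compound version finds nothing to replace and matches the whole line just once.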
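Since re_replace_then_split substitutes into $line itself, later calls no longer see the original line. Below is a minimal sketch of a variant that leaves $line untouched, assuming Perl 5.14+ for the /r flag. The generated sample line, the sub names and the sanity checks are illustrative and not part of the original benchmark, and no timings are claimed for it.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Generated sample line with 50 pairs: k1=1|^|k2=2|^|...|^|k50=50
my $line = join '|^|', map { "k$_=$_" } 1 .. 50;

sub replace_then_split {
    # /r returns the modified copy and leaves $line as it was
    my %hash = split /=/, $line =~ s/\Q|^|\E/=/gr;
    return \%hash;
}

sub compound {
    my %hash = $line =~ s/\Q|^|\E/\n/gr =~ m/(.*)=(.*)/g;
    return \%hash;
}

# Sanity check before timing: both variants must see all 50 pairs
for my $code ( \&replace_then_split, \&compound ) {
    my $pairs = scalar keys %{ $code->() };
    die "expected 50 pairs, got $pairs" unless $pairs == 50;
}

# A negative count means "run each sub for at least that many CPU seconds"
cmpthese(
    -1,
    {  "Replace then split" => \&replace_then_split,
       "Compound"           => \&compound,
    }
);
```

Checking the number of parsed pairs before timing is a cheap way to notice when a benchmarked sub is no longer doing the work it is supposed to measure.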

But looking at this, what slows you down is the alternation in the regex:

sub single_split_with_alt {
   my %hash = split( /\Q|^|\E|=/, $line );
}

sub single_split {
   my %hash = split( /[\|\^\=]+/, $line );
}

      

(I know the latter may not be exactly what you want, but it is for illustrative purposes)

gives:

                Rate  alternation single split
alternation  19135/s           --         -37%
single split 30239/s          58%           --
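The character-class version is faster, but it is not equivalent to the alternation. A small sketch on a made-up line whose second value contains a caret shows the difference:

```perl
use strict;
use warnings;

# Hypothetical line whose second value contains a caret
my $line = 'a=1|^|b=x^y';

my @by_alternation = split /\Q|^|\E|=/, $line;   # a, 1, b, x^y
my @by_char_class  = split /[|^=]+/,    $line;   # a, 1, b, x, y

print scalar(@by_alternation), " fields: @by_alternation\n";
print scalar(@by_char_class),  " fields: @by_char_class\n";
```

The character class treats any run of |, ^ or = as a separator, so it is only safe when names and values never contain those characters.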

      

But there is going to come a point where this is moot, because your limiting factor is disk IO, not CPU.
