Grep lines matching pattern and lines before and after matching

Question

Grep lines matching pattern and lines before and after matching

I would like to perform pattern matching on a file (about 200 megabytes) and then insert the corresponding lines into the array, as well as an arbitrary number of lines before and after each matching line.

sub1 using perl grep takes 11 seconds

sub2 which uses unix egrep, 1 second

sub6 (ack) 50 seconds (faster if you don't use bindings \ b, \ s, etc.)

ack from the command line takes 15 seconds

I'm interested in suggestions for speeding up sub1 or finding a quick perl solution that doesn't rely on external tools

It seems that perl grep is much slower than unix.

"index" is really faster than regular expressions (but I need \ b, \ s, etc.)

http://www.perlmonks.org/?node_id=885174

http://www.perlmonks.org/?node_id=957554

thank

use 5.014;
use strict;
use warnings;
use Time::HiRes qw(usleep ualarm gettimeofday tv_interval);
use List::MoreUtils qw(uniq);

open FILE, '<textMatchInAfile.txt' or die;
my $p = '\bsala|che|relazione|di|questo|coso|^qui\$';
my $mR = 1;        #print more rows before - after the matching
my @n  = <FILE>;

&sub1( $p, $mR, @n );    #suggest: insert references
&sub3( $p, $mR );

sub sub1 {               #questa sub usa perl grep
my $p    = $_[0];             #pattern
my $mR   = $_[1];             #more rows
my @n    = @_[ 2 .. $#_ ];    #input File
my $time = [gettimeofday];
my @new = grep { $n[$_] =~ /$p/ } 0 .. $#n;
my @unique =
  map { @n[ $_ - $mR .. $_ + $mR ] } @new[ 0 + $mR .. $#new - $mR];
say "\n" . 'time sub1 perl grep: ' . tv_interval($time);
@unique = uniq(@unique);
say "sub 1 $#unique";
}

sub sub3 {    #unix grep with color and line numbers
my $p   = $_[0];
my $mR  = $_[1];
my $cmd = "grep -n -C $mR";    #with line numbers
$p =~ s/\|/ /g;
$p =~ s/\h+/" -e "/g;
$p = ' -e "' . $p . '" ';
say "cmd ===$cmd=== ss ===$p===";
my @values;
$values[0] = $p;
$values[1] = ( ' ' . 'textMatchInAfile.txt' );    
my $time = [gettimeofday];
my @valori = `$cmd @values` or die "system @values` failed: $?";
say 'sub3 egrep shell: ' . $#valori;
say 'time sub3 tempo trovati con egrep shell ' . tv_interval($time);
my @uniq_list = uniq(@valori);
}

sub sub6 {             #perl ack
my $p  = $_[0];    #pattern
my $mR = $_[1];    #more rows
my @values;
my $time   = [gettimeofday];
my @valori = qx (ack -C $mR "$p" textMatchInAfile.txt)
  or die "system @values` failed: $?";
say 'number of values found with ack' . $#valori;
say 'time sub6 ack' . tv_interval($time);
}

#

#this one takes 11 seconds

 use 5.014;
 use warnings;
 use Time::HiRes qw(usleep ualarm gettimeofday tv_interval);

 my @array;
 my $pattern = '\bsala|che|relazione|di|questo|coso|^qui\$';
 open( my $filehandle, "<textMatchInAfile.txt" );
 my $time = [gettimeofday];
 while (<$filehandle>) {
     if ( $_ =~ /$pattern/ ) {
    push @array;
     }
 }
 say 'time while' . tv_interval($time);

Ok, unix grep is an order of magnitude faster than perl grep, I'll live with that.

+3

grep perl

user1252114 06 Mar 12:08 pm

source to share

2 answers

user1126070 · Answer 1 · 2012-03-06T13:18:01+0000

Why don't you use grep -B 1 -A 1?

This will give you the exact result you want.

grep -B 1 -A 1 -E patter file

Hello,

Daniel Böhmer · Answer 2 · 2012-03-14T13:15:42+0000

I did some basic comparison between Unix 'command egrep

and Perl grep

, the latter with two different implementations.

use Benchmark qw(cmpthese);

my $count = $ARGV[0] || 100;

my $re = "L[aeiou]n*.?[xyz]\\b";

cmpthese($count, {
    unix => sub {
        my $result = `dmesg|egrep '$re'`;

        #print "===unix===\n";
        #print $result;
    },
    perl => sub {
        my @result = grep {$_ =~ m/$re/} split m/\n/, `dmesg`;

        #print "===perl===\n";
        #map {print "$_\n"} @result;
    },
    perl2 => sub {
        open(DMESG, "dmesg|" ) or die "cannot open dmesg pipe!";

        my @result;

        while(<DMESG>) {
            push @result, $_ if m/$re/;
        }

        #print "===perl2===\n";
        #map {print} @result;

        close DMESG;
    },
});

Result:

$ perl grep.pl 1000
        Rate  unix  perl perl2
unix  24.6/s    --  -40%  -44%
perl  41.0/s   67%    --   -6%
perl2 43.6/s   77%    6%    --

So, explain why Perl grep is naturally slower than Unix grep

.

PS I adapted the script to work on a file with 25k lines of random data and another RE. This scenario is a bit like yours.

$ perl tmp/grep.pl 1000
        Rate  unix  perl perl2
unix  3.71/s    --  -32%  -44%
perl  5.50/s   48%    --  -17%
perl2 6.64/s   79%   21%    --

Grep lines matching pattern and lines before and after matching

More articles: