How to speed up pattern recognition in Perl

This program, as it stands right now, takes a .fasta file (a file containing genetic sequence data), builds a hash table counting five-character substrings, and prints it out. However, it is quite slow: for every position it extracts a substring and compares it against every other position in the file.

use strict;
use warnings;
use Data::Dumper;

my $total = $#ARGV + 1;
my $row;
my $compare;
my %hash;
my $unique = 0;
open( my $f1, '<:encoding(UTF-8)', $ARGV[0] ) or die "Could not open file '$ARGV[0]' $!\n";

my $discard = <$f1>;    # skip the FASTA header line
while ( $row = <$f1> ) {
    chomp $row;
    $compare .= $row;    # concatenate the sequence lines into one string
}
my $size = length($compare);
close $f1;
for ( my $i = 0; $i < $size - 6; $i++ ) {
    my $vs = substr( $compare, $i, 5 );
    # compare against the five-character substring at every other position
    for ( my $j = 0; $j < $size - 6; $j++ ) {
        my $value = substr( $compare, $j, 5 );
        if ( $value eq $vs ) {
            if ( exists $hash{$value} ) {
                $hash{$value} += 1;
            } else {
                $hash{$value} = 1;
            }
        }
    }
}
foreach my $val ( values %hash ) {
    if ( $val == 1 ) {
        $unique++;
    }
}

my $OUTFILE;
open $OUTFILE, ">output.txt" or die "Error opening output.txt: $!\n";
print {$OUTFILE} "Number of unique keys: " . $unique . "\n";
print {$OUTFILE} Dumper( \%hash );
close $OUTFILE;


Thanks in advance for your help!



3 answers


It's not clear from the description what this script is required to do, but if you're counting matching sets of 5 characters, you don't actually need to do any string comparisons: you can just run over the whole sequence once and store the number of times each 5-letter subsequence occurs.



use strict;
use warnings;
use Data::Dumper;

my $str; # store the sequence here
my %hash;
# slurp in the whole file
open(IN, '<:encoding(UTF-8)', $ARGV[0]) or die "Could not open file '$ARGV[0]' $!\n";
while (<IN>) {
    chomp;
    $str .= $_;
}
close(IN);

# not sure if you were deliberately omitting the last two letters of the
# sequence; this version looks at all of it
my $l_size = length($str) - 4;
for (my $i = 0; $i < $l_size; $i++) {
    $hash{ substr($str, $i, 5) }++;
}

# grep in scalar context returns the number of matching values,
# i.e. how many 5-letter sequences were seen exactly once
my $unique = grep { $_ == 1 } values %hash;

open OUT, ">output.txt" or die "Error opening output.txt: $!\n";
print OUT "Number of unique keys: ". $unique."\n";
print OUT Dumper(\%hash);
close OUT;
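As an aside (my addition, not part of the original answer): once the counts are in %hash, it is cheap to pull out, say, the ten most frequent subsequences, assuming there are at least ten distinct keys:

# sort the keys by descending count and take the first ten
my @top = ( sort { $hash{$b} <=> $hash{$a} } keys %hash )[ 0 .. 9 ];
printf "%s occurs %d times\n", $_, $hash{$_} for @top;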



This can help by removing the search for information you already have.

I can't see that $j depends on $i anywhere, so you are also matching each substring against itself. Your counts are wrong too: it works for 1 because 1 is the square of 1, but in general, if a five-character string occurs k times, each of those k outer positions matches all k inner positions, so you end up with the square of the actual number of occurrences.
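A minimal sketch of my own (not from the original answer) demonstrating the squaring:

use strict;
use warnings;

# "ABCDE" occurs 3 times here (positions 0, 5 and 10)
my $seq = "ABCDEABCDEABCDE";
my %count;
for my $i ( 0 .. length($seq) - 5 ) {
    my $vs = substr( $seq, $i, 5 );
    for my $j ( 0 .. length($seq) - 5 ) {
        $count{$vs}++ if substr( $seq, $j, 5 ) eq $vs;
    }
}
print "$count{ABCDE}\n";    # prints 9 (3 squared), not 3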

You would get better results if you did it like this:

# compute it once.
my $lim = length( $compare ) - 6;

for ( my $i = 0; $i < $lim; $i++ ){
    my $vs = substr( $compare, $i, 5 );

    # count each unique identity *once*
    # if it's already in the table, we've counted it.
    next if $hash{ $vs };

    $hash{ $vs }++; # we've found it, record it.

    for ( my $j = $i + 1; $j < $lim; $j++ ) {
        my $value = substr( $compare, $j, 5 );
        $hash{ $value }++ if $value eq $vs;
    }
}


However, it might be an improvement to use index for your second loop and let C-level Perl do the matching for you:



    # replaces the inner $j loop above; $vs and $compare are as before
    my $pos = $i;
    while ( $pos > -1 ) {
        $pos = index( $compare, $vs, ++$pos );
        $hash{ $vs }++ if $pos > -1;
    }
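Pieced together (my sketch combining the two snippets above, so treat it as illustrative rather than tested), the whole loop might look like this:

my $lim = length( $compare ) - 6;    # same bound as the original

for ( my $i = 0; $i < $lim; $i++ ) {
    my $vs = substr( $compare, $i, 5 );
    next if $hash{ $vs };    # this 5-mer has already been counted in full
    $hash{ $vs }++;          # count the occurrence at $i
    my $pos = $i;
    # note: index searches to the end of $compare; see the trimming advice below
    while ( ( $pos = index( $compare, $vs, $pos + 1 ) ) > -1 ) {
        $hash{ $vs }++;      # count every later occurrence
    }
}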


Also, if you use index and want to omit the last two characters - as your code does - it might make sense to remove them from the string being searched:

substr( $compare, -2 ) = '';    # lvalue substr: chop off the last two characters
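Four-argument substr does the same thing and may read more clearly (my note, not the answerer's):

substr( $compare, -2, 2, '' );    # replace the final two characters with nothing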


But you could do it all in one pass when you loop over the file. I believe the code below is almost equivalent.

my $last_4   = '';
my $last_row = '';
my $discard  = <$f1>;

# each row in the file after the first...
while ( $row = <$f1> ) { 
    chomp $row;
    $last_row = $row;
    $row      = $last_4 . $row;
    my $lim = length( $row ) - 4;    # -4, not -5, so the final 5-mer of this buffer is counted too
    for ( my $i = 0; $i < $lim; $i++ ) { 
        $hash{ substr( $row, $i, 5 ) }++;
    }
    # four is the maximum we can copy over to the new row and not 
    # double count a strand of characters at the end.
    $last_4 = substr( $row, -4 );
}

# I'm not sure what you gain by omitting the last two 5-mers of the
# sequence, but this would replicate it
foreach my $bad_key ( map { substr( $last_row, $_, 5 ) } ( -5, -6 ) ) {
    --$hash{ $bad_key };
    delete $hash{ $bad_key } if $hash{ $bad_key } < 1;
}
}

# grep in scalar context returns the number of matching values
$unique = grep { $_ == 1 } values %hash;



You may be interested in this more concise version of your code, which uses a global regular expression to find all five-character subsequences; the zero-width lookahead is what lets the matches overlap, as the engine advances one character at a time instead of skipping past each match. It also reads the entire input file in one go and then deletes the newlines.

The input file path is expected as a parameter on the command line, and the output is sent to STDOUT, so it can be redirected to a file on the command line, for example

perl subseq5.pl input.txt > output.txt


I also used Data::Dump instead of Data::Dumper, because I believe it is much superior. However, it is not a core module, so you will probably need to install it.

use strict;
use warnings;
use open qw/ :std :encoding(utf-8) /;

use Data::Dump;

my $str = do { local $/; <>; };
$str =~ tr/\n//d;    # delete the newlines; tr/// does not interpolate $/

my %dups;
++$dups{$1} while $str =~ /(?=(.{5}))/g;

my $unique = grep $_ == 1, values %dups;

print "Number of unique keys: $unique\n";

dd \%dups;
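To see why the lookahead matters (a small sketch of mine, not part of the original answer) - without it the matches would not overlap:

my $s = "ABCDEF";
my @plain   = $s =~ /(.{5})/g;        # ("ABCDE") - the engine skips past each match
my @overlap = $s =~ /(?=(.{5}))/g;    # ("ABCDE", "BCDEF") - zero-width, so matches overlap
print "@plain\n@overlap\n";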
