Why is my program so slow when using Tie::File?
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;
use Data::Dumper;
use Benchmark;
my $t0 = Benchmark->new;
# all files in the current folder with $ext will be input.
# Default $ext is "pileup"
# if entered, second user entered input will be set to $ext
my $ext = "pileup";
if(exists $ARGV[1]) {
    $ext = $ARGV[1];
}
# open current directory & store filenames with $ext into @pileupfiles
opendir (DIR, ".");
my @pileupfiles = grep {-f && /\.$ext$/} readdir DIR;
my $dnasegment;
my $pos;
my $total;
my $g_total;
my @index; #hold current index for each tied file
my @totalfiles; #hold total files in each sub-index
# $filenum is iterator to cycle through all pileup files whose names are stored in pileupfiles
my $filenum = 0;
# @tied is an array holding all arrays of tied files
my @tied;
# array of the current line number for each @file,
my @linenum;
# tie each file to an array that is an element of the @tied array
while($filenum < scalar @pileupfiles) {
    my @file;
    tie @file, 'Tie::File', $pileupfiles[$filenum], recsep => "\n" or die;
    push(@tied, [@file]);
    # set each line value of $linenum to 0
    push(@linenum, 0);
    $filenum++;
}
# open user list of dnasegments
open(LIST, $ARGV[0]);
# open file for output
open(OUT, ">>tempfile.tab");
while(<LIST>) {
    $dnasegment = $_;
    chomp $dnasegment;
    my $exit = 0;
    $pos = 1;
    my %flag;
    while(scalar(keys %flag) < scalar @tied) {
        $total = 0;
        $filenum = 0;
        while($filenum < scalar @tied) {
            if(exists $tied[$filenum][$linenum[$filenum]]) {
                my @line = split(/\t/, $tied[$filenum][$linenum[$filenum]]);
                #print $line[0], "\t", $line[1], "\t", $line[3], "\n\n";
                if($line[0] eq $dnasegment) {
                    if($line[1] == $pos) {
                        $total += $line[3];
                        $linenum[$filenum]++;
                        $g_total += $line[3];
                        print OUT "$dnasegment\t$filenum\t$pos\t$line[3]\n";
                    }
                } else {
                    $flag{$filenum} = 1;
                }
            } else {
                #print $flag, "\n";
                $flag{$filenum} = 1;
            }
            $filenum++;
        }
        if($total > 0) {
            print OUT "$dnasegment\t$total\n";
        }
        $pos++;
    }
}
close (LIST);
close(OUT);
my $t1 = Benchmark->new;
my $td = timediff($t1, $t0);
print timestr($td), "\n";
The code above takes every file in the current directory that has the default extension (or a user-specified one) and sums the occurrence counts (column 4 of the input files) for each position (column 2) across all files, for the records whose segment name (column 1) matches a name listed in the file given on the command line. The input files are laid out as follows:
file 1:
Gm02 11896804 G 2 ., \'
Gm02 11896805 G 7 ......, U`
Gm02 11896806 G 3 .,. Sa
Gm02 11896807 T 2 ., U\
Gm02 11896808 T 2 ., ZZ
Gm02 11896809 T 2 ., ZZ
Gm02 11896810 T 2 ., B\
Gm02 11896811 G 3 .,^!, B]E
Gm02 11896812 A 3 T,, BaR
Gm02 11896822 G 3 .,, B`D
file 2:
Gm02 11896804 G 3 .,, \'
Gm02 11896805 G 7 ......, U`
Gm02 11896806 G 3 .,. Sa
Gm02 11896807 T 2 ., U\
Gm02 11896808 T 2 ., ZZ
Gm02 11896809 T 2 ., ZZ
Gm02 11896810 T 2 ., B\
Gm02 11896811 G 3 .,^!, B]E
Gm02 11896812 A 3 T,, BaR
Gm02 11896813 G 3 .,, B`D
file 3:
Gm02 11896804 G 3 .,, \'
Gm02 11896805 G 7 ......, U`
Gm02 11896806 G 3 .,. Sa
Gm02 11896807 T 2 ., U\
Gm02 11896808 T 2 ., ZZ
Gm02 11896809 T 2 ., ZZ
Gm02 11896810 T 2 ., B\
Gm02 11896811 G 3 .,^!, B]E
Gm02 11896812 A 3 T,, BaR
Gm02 11896833 G 3 .,, B`D
In this case, the only command-line argument passed to the program is a text file whose content is "Gm02".
The %flag hash keeps track of which files have already been processed for the current segment. In the example files above, all three files are checked at every position from 1 to 11896803 before the first values are found at position 11896804. This is necessary so that every position is checked and summed across all files before the position is incremented.
My question is about performance. I chose Tie::File because I thought it would improve performance, since the files would not all be read into memory. The real data the program will process is many hundreds of thousands of lines long, multiplied by tens of files. At the moment, the time taken to run on file 1 alone and on all three sample files is 42 wallclock secs (41.96 usr + 0.00 sys = 41.96 CPU) and 110 wallclock secs (109.76 usr + 0.00 sys = 109.76 CPU) respectively. Any insight into why this program is so slow, or recommendations for speeding it up, would be much appreciated.
Edit, 10:17 PM EST: the output of the program looks like this:
Gm02 0 11896804 2
Gm02 1 11896804 3
Gm02 2 11896804 3
Gm02 8
Gm02 0 11896805 7
Gm02 1 11896805 7
Gm02 2 11896805 7
Gm02 21
Gm02 0 11896806 3
Gm02 1 11896806 3
Gm02 2 11896806 3
Gm02 9
Gm02 0 11896807 2
Gm02 1 11896807 2
Gm02 2 11896807 2
Gm02 6
Gm02 0 11896808 2
Gm02 1 11896808 2
Gm02 2 11896808 2
Gm02 6
Gm02 0 11896809 2
Gm02 1 11896809 2
Gm02 2 11896809 2
Gm02 6
Gm02 0 11896810 2
Gm02 1 11896810 2
Gm02 2 11896810 2
Gm02 6
Gm02 0 11896811 3
Gm02 1 11896811 3
Gm02 2 11896811 3
Gm02 9
Gm02 0 11896812 3
Gm02 1 11896812 3
Gm02 2 11896812 3
Gm02 9
Gm02 1 11896813 3
Gm02 3
Gm02 0 11896822 3
Gm02 3
Gm02 2 11896833 3
Gm02 3
Gm02 0 11896804 2
Gm02 1 11896804 3
Gm02 5
Gm02 0 11896805 7
Gm02 1 11896805 7
Gm02 14
Gm02 0 11896806 3
Gm02 1 11896806 3
Gm02 6
Gm02 0 11896807 2
Gm02 1 11896807 2
Gm02 4
Gm02 0 11896808 2
Gm02 1 11896808 2
Gm02 4
Gm02 0 11896809 2
Gm02 1 11896809 2
Gm02 4
Gm02 0 11896810 2
Gm02 1 11896810 2
Gm02 4
Gm02 0 11896811 3
Gm02 1 11896811 3
Gm02 6
Gm02 0 11896812 3
Gm02 1 11896812 3
Gm02 6
Gm02 1 11896813 3
Gm02 3
Gm02 0 11896822 3
Gm02 3
Gm02 0 11896804 2
Gm02 2
Gm02 0 11896805 7
Gm02 7
Gm02 0 11896806 3
Gm02 3
Gm02 0 11896807 2
Gm02 2
Gm02 0 11896808 2
Gm02 2
Gm02 0 11896809 2
Gm02 2
Gm02 0 11896810 2
Gm02 2
Gm02 0 11896811 3
Gm02 3
Gm02 0 11896812 3
Gm02 3
Gm02 0 11896822 3
Gm02 3
I would say "because you are using Tie::File", except that you aren't really using it, apart from the following lines of code:
my @file;
tie @file, 'Tie::File', $pileupfiles[$filenum], recsep => "\n" or die;
push(@tied, [@file]);
Because [@file] copies every record into a new, untied array, you might as well have written that as
open(my $fh, '<', $pileupfiles[$filenum]) or die $!;
push(@tied, [ <$fh> ]);
Perhaps you meant
tie my @file, 'Tie::File', $pileupfiles[$filenum], recsep => "\n" or die;
push(@tied, \@file);
In that case we're back to my original answer. Tie::File can shorten development time in some cases, but it is never going to be the fastest solution, and it will probably use a lot more memory.
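The difference between the two push forms above can be checked directly. This is only a minimal sketch: the temporary file and its three records are made up for the demonstration (created with File::Temp), and tied() is used to see which array is still backed by Tie::File.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical three-record input file, created only for this demonstration.
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} "Gm02\t1\tG\t2\n", "Gm02\t2\tG\t7\n", "Gm02\t3\tT\t3\n";
close $fh or die "close failed: $!";

tie my @file, 'Tie::File', $filename, recsep => "\n"
    or die "cannot tie $filename: $!";

my $copy = [@file];   # reads every record and copies it into a plain array
my $ref  = \@file;    # keeps a reference to the tied array itself

# tied() returns the underlying object for a tied variable, undef otherwise.
my $copy_is_tied = defined tied(@$copy) ? 1 : 0;
my $ref_is_tied  = defined tied(@$ref)  ? 1 : 0;
my $copy_records = scalar @$copy;

print "copy is tied: $copy_is_tied\n";
print "ref is tied:  $ref_is_tied\n";
print "records held in the copy: $copy_records\n";

undef $ref;
untie @file;
```

Only $ref should report as tied. The copy holds every record in ordinary memory, which is why the question's program gets none of the lazy reading it was hoping for from Tie::File.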
By the way, exists doesn't make sense on an array element.
if (exists $tied[$filenum][$linenum[$filenum]])
is a poor way of writing
if (defined $tied[$filenum][$linenum[$filenum]])
or
if ($linenum[$filenum] < @{ $tied[$filenum] })
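To see why the choice of test matters, here is a small sketch on plain (untied) arrays. The helper element_checks and the sample arrays are hypothetical names introduced only for illustration; the point is that exists asks whether a slot was ever assigned, which is a different question from "is there a value here" or "is this index inside the array".

```perl
use strict;
use warnings;

# Run the three tests for one index and report each result as 0 or 1.
sub element_checks {
    my ($array, $index) = @_;
    return {
        exists_ok  => (exists  $array->[$index]) ? 1 : 0,
        defined_ok => (defined $array->[$index]) ? 1 : 0,
        in_range   => ($index < @$array)         ? 1 : 0,
    };
}

my @sparse;
$sparse[2] = "third";                        # indexes 0 and 1 were never assigned

my @with_undef = ("first", undef, "third");  # index 1 holds an explicit undef

my $gap  = element_checks(\@sparse, 0);      # inside the array, never assigned
my $hole = element_checks(\@with_undef, 1);  # inside the array, value is undef
my $past = element_checks(\@with_undef, 3);  # one past the last element

for my $case ([gap => $gap], [hole => $hole], [past => $past]) {
    my ($name, $r) = @$case;
    print "$name: exists=$r->{exists_ok} defined=$r->{defined_ok} in_range=$r->{in_range}\n";
}
```

For the never-assigned slot, the bounds check says the index is valid while exists says it is not. For a loop that walks line numbers, the bounds check (or defined, if an undef value should also stop the loop) states the intent more directly.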
I wonder what your output looks like. Would it be something like this (given your sample files above)?
$VAR1 = {
          'Gm02;11896804' => 8,
          'Gm02;11896805' => 21,
          'Gm02;11896806' => 9,
          'Gm02;11896807' => 6,
          'Gm02;11896808' => 6,
          'Gm02;11896809' => 6,
          'Gm02;11896810' => 6,
          'Gm02;11896811' => 9,
          'Gm02;11896812' => 9,
          'Gm02;11896813' => 3,
          'Gm02;11896822' => 3,
          'Gm02;11896833' => 3
        };
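If totals like these are all that is needed, one possible approach (a sketch, not the asker's method) is to read each file once, line by line, and accumulate column 4 in a hash keyed by "segment;position". The function name sum_depths, the helper write_temp and the tiny tab-separated sample files below are assumptions made up for the example. It also assumes that the totals for all positions fit comfortably in memory.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sum column 4 per "segment;position" key across several files.
# $wanted is a hash ref whose keys are the segment names to keep.
sub sum_depths {
    my ($wanted, @filenames) = @_;
    my %total;
    for my $filename (@filenames) {
        open my $fh, '<', $filename or die "cannot open $filename: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my ($segment, $pos, undef, $depth) = split /\t/, $line;
            next unless defined $depth && $wanted->{$segment};
            $total{"$segment;$pos"} += $depth;
        }
        close $fh;
    }
    return \%total;
}

# Write rows (array refs of fields) to a temporary tab-separated file.
sub write_temp {
    my (@rows) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} join("\t", @$_), "\n" for @rows;
    close $fh or die "close failed: $!";
    return $name;
}

# Hypothetical sample data in the same column layout as the question.
my $file1 = write_temp([qw(Gm02 11896804 G 2)], [qw(Gm02 11896822 G 3)]);
my $file2 = write_temp([qw(Gm02 11896804 G 3)], [qw(Gm03 5 A 9)]);

my $totals = sum_depths({ Gm02 => 1 }, $file1, $file2);
print "$_\t$totals->{$_}\n" for sort keys %$totals;
```

Each input line is split exactly once, and no work is done for positions that appear in no file, so the run time grows with the number of lines rather than with the largest position number.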