Perl: use a link for a different hash
Firstly, I'm not even sure what to look for as I know how to do it in Excel, but I can't find an easy way (with my limited knowledge) to do it in perl. I need to renumber a pedigree (over 1.4 million records) and unfortunately there won't be enough shortage in excel, both due to the power of the PC and the powerful spreadsheet.
The file needs to be renumbered so that the person does not have fewer numbers than the parents, so my test file looks like this:
Ani | Sire | Dam
----------------
15 | 1 | 2
12 | 1 | 2
30 | 15 | 12
18 | 15 | 2
26 | 15 | 30
48 | 18 | 30
32 | 26 | 48
50 | 26 | 30
1 and 2 point to an unknown parent (I'll leave them 1/2), and renumber starts at 10 so that the "new IDs" are:
Old_ID | New_ID
---------------
15 | 10
12 | 11
30 | 12
18 | 13
26 | 14
48 | 15
32 | 16
50 | 17
So the result I would like to see would then be
new_ani | new_sire | new_dam
----------------------------
10 | 1 | 2
11 | 1 | 2
12 | 10 | 11
13 | 10 | 2
14 | 10 | 12
15 | 13 | 30
16 | 14 | 15
17 | 14 | 12
Using two hashes, I tried (unsuccessfully) to first bind the first column to the new ids (which I could do) and then the sire and dam column (which I cannot do).
To reduce the code a bit, I left the block calculating the new identifiers, since this will be a copy of the sire. i My code so far looks like this:
use strict;
use warnings;
my $input_file = .../pedigree.csv;
open (INPUT, "<", $input_file) or die "Cant open $input_file: $!";
my new_id = 0;
my %old_ped = ();
my %new_id = ();
while (<INPUT>){
my $line = $_;
$line =~ s/\s*$//g;
my ($ani,$sire,$dam) = split('\,',$line);
next if $ani eq 'db_animal' or !$ani or $ani eq 'ani';
$old_ped{$ani}[0] = $ani;
$old_ped{$ani}[1] = $sire;
$old_ped{$ani}[2] = $dam;
$new_id++;
$new_id{$ani}[0] = $ani;
$new_id{$ani}[1] = $new_id;
}
close INPUT;
foreach my $tt (sort keys %old_ped){
#animal
if ($old_ped{$tt}[0] == $new_id{$tt}[0]){
print "$new_id{$tt}[1],";
#sires
if ($old_ped{$tt}[1] == 1){
print " 1,";
}
else{
foreach my $tt (sort keys %new_id) {
if ($old_ped{$tt}[1] == $nuwe_id{$tt}[0]){
print "$new_id{$tt}[1],";
}
}
}
}
# AND REPEAT SIRE BLOCK FOR DAM
print "\n";
}
However ... I am obviously wrong as the links do not connect, so there is no match for producers (or dykes).
I tried instead to generate 2 additional hashes, one for sire and dam, using the sire and dam id as reference:
$sire{$sire}[0] = $sire;
$sire{$sire}[1] = $dierid;
$dam{$dam}[0] = $dam;
$dam{$dam}[1] = $dierid;
and using them in foreach like this:
foreach my $tt (sort keys %old_ped){
#animal
if ($old_ped{$tt}[0] == $new_id{$tt}[0]){
print "$new_id{$tt}[1],";
#sires
if ($old_ped{$tt}[1] == 1){
print " 1,";
}
else{
foreach my $tt (sort keys %sire) {
if ($sire{$tt}[0] == $nuwe_id{$tt}[0]){
print "$new_id{$tt}[1],";
}
}
}
}
# AND REPEAT SIRE BLOCK FOR DAM
print "\n";
}
I am guessing that I am using my hashes incorrectly, or perhaps I need to use a different loop? However, my knowledge of perl is still very basic and lacking.
Any help would be hugely appreciated!
source to share
Your approach is complex. I will focus on a different approach first, which I will explain.
You need to make two passes over the data. In the first pass, you create a map of old IDs and new IDs. The algorithm for creating a new identifier should start at 10 and increase, so simple. We can use a regular hash with the old id as the key and the new id as the value.
In my approach, we also store the string data in this first pass into an array of array references. This way I can use it in the second pass. If you have a lot of entries, this may not be very smart as it requires a lot of memory. In this case, you have to re-read the data and print
instead change the values ββlike I did.
In the second pass, we iterate over the lines and simply replace them from the lookup hash.
- The meaning for ani is easy. Take the current value and see it.
- The value for sire should only be replaced if it is not
1
or2
. In Perl, which can translate tounless
, it is less than3
. Watch it in this case, otherwise don't do anything. - The dam value works the same.
use strict;
use warnings;
use Data::Printer;
my $new_id = 10;
my %new_ids;
my @rows;
while (my $line = <DATA>) {
$line =~ s/\s*$//g;
my ( $ani, $sire, $dam ) = split( '\,', $line );
# map old -> new
$new_ids{$ani} = $new_id;
# save row
push @rows, [$ani, $sire, $dam];
++$new_id;
}
# iterate all rows and replace the ids
foreach my $row (@rows) {
$row->[0] = $new_ids{$row->[0]};
$row->[1] = $new_ids{$row->[1]} unless $row->[1] < 3;
$row->[2] = $new_ids{$row->[2]} unless $row->[2] < 3;
}
p @rows;
__DATA__
15,1,2
12,1,2
30,15,12
18,15,2
26,15,30
48,18,30
32,26,48
50,26,30
My program prints the result using Data :: Printer .
[
[0] [
[0] 10,
[1] 1,
[2] 2
],
[1] [
[0] 11,
[1] 1,
[2] 2
],
[2] [
[0] 12,
[1] 10,
[2] 11
],
[3] [
[0] 13,
[1] 10,
[2] 2
],
[4] [
[0] 14,
[1] 10,
[2] 12
],
[5] [
[0] 15,
[1] 13,
[2] 12
],
[6] [
[0] 16,
[1] 14,
[2] 15
],
[7] [
[0] 17,
[1] 14,
[2] 12
]
]
In terms of runtime, I created a file with 1.5M entries in random order with this program.
$ perl -E 'say join ",", int rand 10000, int rand 10000, int rand 10000 for 1 .. 1_500_000' > animals.csv
Running this through my code (changed to open
file) took about 8 seconds on my Core i7 Quadcore laptop and Perl 5.20.1.
$ time perl scratch.pl
real 0m7.863s
user 0m7.260s
sys 0m0.436s
source to share