Inserting multiple "new" items into a database using DBIC

I am working on a bioinformatics project that requires me to read genomic data (nothing too fancy, just think of it as strings) from various organisms and insert it into a database. Each reading refers to one organism and can contain from 5,000 to 50,000 genes that I need to process and analyze before storing.

The script is currently written in Perl and, after all the calculations, stores the results in a hash like this:

$new{$id}{gene_name}            = $id;
$new{$id}{gene_database_source} = $gene_database_source;
$new{$id}{product}              = $product;
$new{$id}{sequence}             = $sequence;
$new{$id}{seqlength}            = $seqlength;
$new{$id}{digest}               = $digest;
$new{$id}{mw}                   = $mw;
$new{$id}{iep}                  = $iep;
$new{$id}{tms}                  = $tms;

After all the genes have been read, the insertions are performed by iterating over the hash inside an eval {} block:

eval {
    foreach my $id (keys %new) {

        my $rs = $schema->resultset('Genes')->create(
            {
                gene_name              => $new{$id}{gene_name},
                gene_product           => $new{$id}{product},
                sequence               => $new{$id}{sequence},
                gene_protein_length    => $new{$id}{seqlength},
                digest                 => $new{$id}{digest},
                gene_isoelectric_point => $new{$id}{iep},
                gene_molecular_weight  => $new{$id}{mw},
                gene_tmd_count         => $new{$id}{tms},
                gene_species           => $species,
                species_code           => $spc,
                user_id                => $tdruserid,
                gene_database_source   => $new{$id}{gene_database_source}
            }
        );
    }
};

While this "works", it has at least two problems that I would like to solve:

  • The eval statement is intended to make the inserts "fail-safe": if one of the inserts fails, the eval dies and no inserts should be performed. This is clearly not how eval works: I'm pretty sure all inserts performed up to the point of failure are kept, and there is no rollback at all.

  • The script has to loop over very large datasets twice (once when reading the data and building the hashes, and again when reading the hashes and performing the inserts). This makes the process rather inefficient.

Instead of generating hashes, I was thinking of using DBIx::Class's new directive, $schema->new({..stuff..}), and then doing one massive insert transaction. This would avoid the double iteration, and the eval would either succeed or fail as a single transaction, which is the expected behavior (either all inserts happen or none do). Is there a way to do this?
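Roughly, something like this sketch is what I have in mind (purely illustrative; new_result is my guess at the right call for building unsaved row objects, and the columns are the ones from my loop above):

my @pending;
foreach my $id (keys %new) {
    # build row objects without touching the database yet
    push @pending, $schema->resultset('Genes')->new_result(
        {
            gene_name            => $new{$id}{gene_name},
            gene_product         => $new{$id}{product},
            # ... the other columns as in the loop above ...
            gene_database_source => $new{$id}{gene_database_source},
        }
    );
}

# ... and then insert all of @pending inside one transaction somehow ...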



1 answer


You can create your massive transaction using a TxnScopeGuard in DBIC. In its most basic form, that would look like this.

eval { # or try from Try::Tiny
    my $guard = $schema->txn_scope_guard;

    foreach my $id ( keys %new ) {
        my $rs = $schema->resultset('Genes')->create(
            {
                gene_name              => $new{$id}{gene_name},
                gene_product           => $new{$id}{product},
                sequence               => $new{$id}{sequence},
                gene_protein_length    => $new{$id}{seqlength},
                digest                 => $new{$id}{digest},
                gene_isoelectric_point => $new{$id}{iep},
                gene_molecular_weight  => $new{$id}{mw},
                gene_tmd_count         => $new{$id}{tms},
                gene_species           => $species,
                species_code           => $spc,
                user_id                => $tdruserid,
                gene_database_source   => $new{$id}{gene_database_source}

            }
        );
    }
    $guard->commit;
}

You create a scope guard object, and when you are done setting up the transaction, you commit it. If the guard object goes out of scope instead, for example because something died, it automatically rolls the transaction back. The eval can catch the die so your program won't crash. You had that part right, but you are also correct that your code does not undo earlier inserts. Note that Try::Tiny's try provides nicer syntax, but it is not needed here.
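For illustration, here is a minimal sketch of the same guard pattern using Try::Tiny instead of eval (assuming the same $schema and the same create loop as above):

use Try::Tiny;

try {
    my $guard = $schema->txn_scope_guard;

    foreach my $id ( keys %new ) {
        $schema->resultset('Genes')->create(
            {
                gene_name            => $new{$id}{gene_name},
                gene_product         => $new{$id}{product},
                # ... the remaining columns exactly as in the eval version above ...
                gene_database_source => $new{$id}{gene_database_source},
            }
        );
    }

    $guard->commit;
}
catch {
    # by the time we get here the guard has gone out of scope
    # and the transaction has already been rolled back
    warn "Inserts failed and were rolled back: $_";
};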

A transaction in this case means that all the queries are collected and run at the same time.

Note that this will still insert only one row per INSERT statement!

If you instead want to build larger INSERT statements, like the following, you need populate, not new.

INSERT INTO foo (bar, baz) VALUES
(1, 1),
(2, 2),
(3, 3),
...

The populate method lets you pass an array reference containing multiple rows in one go. This should be faster than inserting them one at a time.

$schema->resultset("Artist")->populate([
  [ qw( artistid name ) ],
  [ 100, 'A Formally Unknown Singer' ],
  [ 101, 'A singer that jumped the shark two albums ago' ],
  [ 102, 'An actually cool singer' ],
]);

Translated into your loop, it would look like the following. Note that the documentation says populate is faster when run in void context.

eval {
    $schema->resultset('Genes')->populate(
        [
            [
                qw(
                    gene_name             gene_product   sequence
                    gene_protein_length   digest         gene_isoelectric_point
                    gene_molecular_weight gene_tmd_count gene_species
                    species_code          user_id        gene_database_source
                )
            ],
            map {
                [
                    $new{$_}{gene_name}, $new{$_}{product},
                    $new{$_}{sequence},  $new{$_}{seqlength},
                    $new{$_}{digest},    $new{$_}{iep},
                    $new{$_}{mw},        $new{$_}{tms},
                    $species,            $spc,
                    $tdruserid,          $new{$_}{gene_database_source},
                ]
            } keys %new
        ],
    );
}

Likewise, no scope guard is needed here. However, I would advise against doing more than 1000 rows per statement; processing the data in chunks might be a good idea for performance reasons. In that case, you would iterate over the keys 1000 at a time. List::MoreUtils has a nice natatime function for this.

use List::MoreUtils 'natatime';

eval {
    my $guard = $schema->txn_scope_guard;

    my $it = natatime 1_000, keys %new;

    while ( my @keys = $it->() ) {
        $schema->resultset('Genes')->populate(
            [
                [
                    qw(
                        gene_name             gene_product   sequence
                        gene_protein_length   digest         gene_isoelectric_point
                        gene_molecular_weight gene_tmd_count gene_species
                        species_code          user_id        gene_database_source
                        )
                ],
                map {
                    [
                        $new{$_}{gene_name}, $new{$_}{product},
                        $new{$_}{sequence},  $new{$_}{seqlength},
                        $new{$_}{digest},    $new{$_}{iep},
                        $new{$_}{mw},        $new{$_}{tms},
                        $species,            $spc,
                        $tdruserid,          $new{$_}{gene_database_source},
                    ]
                } @keys
            ],
        );
    }

    $guard->commit;
}

Now it will insert 1000 rows per INSERT statement and run all of those statements in one big transaction. If any of them fails, none of them are committed.
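If you want to see exactly what gets sent to the database, one option (an aside, not required for any of this to work) is to turn on DBIC's query tracing:

# print every generated statement and its bind values to STDERR;
# setting the DBIC_TRACE environment variable has the same effect
$schema->storage->debug(1);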

You also wrote:

"The script has to loop over very large datasets twice (once when reading the data and building the hashes, and again when reading the hashes and performing the inserts). This makes the process rather inefficient."

You don't show how that data is created, beyond this assignment:

$new{$id}{gene_name}            = $id;
$new{$id}{gene_database_source} = $gene_database_source;
$new{$id}{product}              = $product;

If that's all there is to it, nothing stops you from using the approach shown above directly, during the first pass over the data, instead of building the hash at all. The following code is incomplete because you don't tell us where the data comes from, but you should get the gist.

eval {
    my $guard = $schema->txn_scope_guard;

    # we use this to collect rows to process
    my @rows;

    # insert whatever has been collected in @rows so far, then reset it
    my $flush_rows = sub {
        return unless @rows;
        $schema->resultset('Genes')->populate(
            [
                [
                    qw(
                        gene_name             gene_product   sequence
                        gene_protein_length   digest         gene_isoelectric_point
                        gene_molecular_weight gene_tmd_count gene_species
                        species_code          user_id        gene_database_source
                    )
                ],
                @rows,
            ],
        );

        # empty the list of values
        @rows = ();
    };

    # this is where your data comes in
    while ( my $foo = <DATA> ) {

        # here you process the data and come up with your variables
        my ( $id, $gene_database_source, $product, $sequence, $seqlength,
             $digest, $mw, $iep, $tms );

        # collect the row so we can insert it later,
        # in the same order as the column list above
        push(
            @rows,
            [
                $id,      $product, $sequence,  $seqlength,
                $digest,  $iep,     $mw,        $tms,
                $species, $spc,     $tdruserid, $gene_database_source,
            ]
        );

        # only insert if we reached the limit
        $flush_rows->() if scalar @rows == 1000;
    }

    # insert any rows left over from the last, partial chunk
    $flush_rows->();

    $guard->commit;
}

Basically, we collect up to 1000 rows as array references while we process the data, and whenever we hit the limit we pass them to the database and reset our list of rows. The final flush after the loop takes care of any rows left over from the last, partial chunk. Again, this is all wrapped in a transaction, so nothing is committed unless all of the inserts succeed.


There is more information on transactions in DBIC in the Cookbook.
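For completeness, the Cookbook also covers txn_do, which runs a code reference inside a transaction and rolls back automatically if it dies. A minimal sketch of that approach, reusing the same $schema and %new as above (columns abbreviated), might look like this:

$schema->txn_do(sub {
    foreach my $id (keys %new) {
        $schema->resultset('Genes')->create(
            {
                gene_name            => $new{$id}{gene_name},
                gene_product         => $new{$id}{product},
                # ... remaining columns as in the loops above ...
                gene_database_source => $new{$id}{gene_database_source},
            }
        );
    }
});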

Please note that I have not tested this code.
