Fast algorithm for checking whether a pair of numbers falls within a large list of (x, y) coordinate ranges in Perl

I have a list of sorted coordinates (let's call it xycord.txt) that looks like this:

chr1    10003486        10043713
chr1    10003507        10043106
chr2    10003486        10043713
chr2    10003507        10043162
chr2    10003532        10042759


This file is actually very large, with about 10^7 lines.

What I want to do is, given another pair of points, check whether they fall within any of the coordinate ranges in xycord.txt.

The approach I have now is very slow, because there are many of these two-point queries to check against the large xycord.txt file.

Is there a quick way to do this?

#!/usr/bin/perl -w

my $point_to_check_x = $ARGV[0] || '10003488';
my $point_to_check_y = $ARGV[1] || '10003489';
my $chrid = $ARGV[2] || "chr1";

my %allxycordwithchr;   
# skip file opening construct
while (<XYCORD_FILE>) {
  my ($chr,$tx,$ty) = split(/\s+/,$_);
  push @{$allxycordwithchr{$chr}},$tx."-".$ty;
}


 my @chosenchr_cord = @{$allxycordwithchr{$chrid}};

for my $chro_cord (@chosenchr_cord) {
  my ($repox, $repoy) = split("-", $chro_cord);
  my $stat = is_in_xycoordsfile($repox, $repoy, $point_to_check_x, $point_to_check_y);
  if ($stat eq "IN") {
    print "IN\n";
  }
}

sub is_in_xycoordsfile {
  my ($x, $y, $xp, $yp) = @_;
  if ($xp >= $x && $yp <= $y) {
    return "IN";
  }
  else {
    return "OUT";
  }
}


Update: sorry for the correction. In my original post, I simplified the problem.

There is actually another query field (for example, the chromosome name). Does that make the DB / R-tree / SQL approaches infeasible for this question?

+2




7 replies


A few suggestions:

  • You can store your data in a database like MySQL or SQLite. Then you can use a simple query like:

    "SELECT * FROM coordinates WHERE x<"+xp+" AND y>"+yp
    
          

    If you have indices on x and y, this should be very fast.

  • You can also check out R-trees. A few years ago, I used R-trees to store tens of thousands of city coordinates, and I could find the nearest city to a given point in a split second. In your example you are storing 1D ranges rather than 2D points, but I'm sure R-trees will work just fine there too. You can find R-tree implementations for Perl here. Or you can use RectanglesContainingDot, which seems to do what you want.

  • You can cache the coordinates in memory: each number looks like it needs 4 bytes to store, so this would amount to roughly 80 MB of memory for 10^7 pairs of numbers. That's about what Firefox is using on my machine! Of course, if you do this, you will want some kind of daemon so you don't reload the entire file every time you need to check a pair of coordinates.

  • You can mix solutions 2 and 3.
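For solution 1, here is a minimal sketch using DBI with SQLite. The table and column names (coordinates, chr, x, y) are my own illustration, and the chr column is there only because of the update about the extra query field; treat this as a starting point under those assumptions, not a definitive implementation:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;    # needs DBD::SQLite installed

my $dbh = DBI->connect("dbi:SQLite:dbname=coords.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

# One-time setup (run once, then comment out):
# $dbh->do("CREATE TABLE coordinates (chr TEXT, x INTEGER, y INTEGER)");
# $dbh->do("CREATE INDEX idx_chr_x_y ON coordinates (chr, x, y)");

my ($chrid, $xp, $yp) = @ARGV;

# Placeholders avoid string concatenation and let the index do the work.
my $sth = $dbh->prepare(
    "SELECT chr, x, y FROM coordinates WHERE chr = ? AND x <= ? AND y >= ?"
);
$sth->execute($chrid, $xp, $yp);

while (my ($chr, $x, $y) = $sth->fetchrow_array) {
    print "IN: ($xp,$yp) falls inside $chr $x $y\n";
}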



My preference goes to solution 1: it has a good efficiency-to-complexity ratio.

+9




In addition to Udi Pasmon's good advice, you can also convert your big file to a DBM and then tie the DBM file to a hash for easy lookups.

Convert file:

#!/usr/bin/perl

use strict;
use warnings;

use DB_File;

my $dbfile = "coords.db";

tie my %coords, "DB_File", $dbfile
    or die "could not open $dbfile: $!";

while (<>) {
    my ($x, $y) = split;
    $coords{"$x-$y"} = 1;
}




Check whether the arguments are members of the file:

#!/usr/bin/perl

use strict;
use warnings;

use DB_File;

my ($x, $y) = @ARGV;

tie my %coords, "DB_File", "coords.db"
    or die "could not open coords.db: $!";

print "IN\n" if $coords{"$x-$y"};


+7




Try a binary search instead of a sequential search. There are two ways to do this:

  • Split the file into smaller files (xycord001.txt, xycord002.txt, and so on). Now you can easily determine which file to search, and each search is faster. The big drawback is that if you need to add data later, it can get messy.

  • Do a binary search over the file itself: start in the middle, dividing the file into two logical parts. Decide which part your coordinate falls in and look at the middle of that part. You will quickly (exponentially) narrow the region of the file you are looking at until you are down to a single line. Read more about searching within files; there is a Perl example of binary searching a file here.

EDIT: Generally, using a database or a DBM file is preferred; however, binary searching the file is a quick and dirty way to get going, especially if the script needs to run on different files on different computers (thanks @MiniQuark, @Chas. Owens). A rough sketch of the idea follows.
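This sketch assumes the file is sorted numerically by start coordinate and holds a single chromosome (or has already been split per chromosome as in the first option). The helper name read_line_at is my own, not from the linked example, and because ranges can overlap this only locates the neighbourhood of the query quickly; you would still scan nearby lines to test containment:

#!/usr/bin/perl
use strict;
use warnings;

# Seek to a byte offset and return the first complete line at or after it.
sub read_line_at {
    my ($fh, $pos) = @_;
    seek $fh, $pos, 0 or die "seek failed: $!";
    <$fh> if $pos > 0;                # throw away a (probably partial) line
    return scalar <$fh>;
}

my ($file, $qx) = @ARGV;              # file name and query start coordinate
open my $fh, '<', $file or die "could not open $file: $!";

my ($lo, $hi) = (0, -s $file);        # bisect over byte offsets
my $candidate;
while ($lo < $hi) {
    my $mid  = int( ($lo + $hi) / 2 );
    my $line = read_line_at($fh, $mid);
    last unless defined $line;        # ran off the end of the file
    my (undef, $start) = split /\s+/, $line;
    if ($start <= $qx) {              # this range starts at or before the query
        $candidate = $line;
        $lo = $mid + 1;
    }
    else {
        $hi = $mid;
    }
}
print defined $candidate ? "nearest line: $candidate" : "no line found\n";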

+3




If both inputs are sorted, or at least the large one is, you can try a merge-style pass between them.

If the lookup file (the smaller file) is not too large, the easiest way is to read it into a hash keyed by name, with a sorted array of start/end pairs as each value. Then go through each line of the big file and find the array of lookup values that might match it by name. Loop over each pair in that lookup array: if a lookup's start is less than the start of the input pair, discard that value, since it can no longer match anything. If a lookup's start has gone past the end of the input, break out of the loop, since no further lookup values can match. If a lookup's end is before the end of the input, you have a match, and you can add the input and the lookup to the list of matches.

My Perl is rusty, so there is no Perl example code, but I have thrown together a quick and dirty Python implementation. On my randomly generated dataset of 10M rows, matching 10k search entries for 14k matches took 22 s, matching 100k searches for 145k matches took 24 s, and matching 1M searches for roughly 1.47M matches took 35 s.

If the smaller file is too large to fit into memory all at once, it can be loaded in batches of keys, as the keys appear in the input file.
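As a rough illustration in Perl anyway, here is a simplified sketch of the merge idea above. It is my own code, not the answerer's Python: it uses the early exit that sorting buys you but skips the discard step, and it assumes containment semantics as in the question:

#!/usr/bin/perl
use strict;
use warnings;

my ($query_file, $big_file) = @ARGV;

# Queries: one "chr start end" per line; group them by name and sort by start.
my %queries;
open my $qfh, '<', $query_file or die "open $query_file: $!";
while (<$qfh>) {
    my ($chr, $qx, $qy) = split;
    push @{ $queries{$chr} }, [ $qx, $qy ];
}
$_ = [ sort { $a->[0] <=> $b->[0] } @$_ ] for values %queries;

# Stream the big file once and test each of its ranges against the queries.
open my $bfh, '<', $big_file or die "open $big_file: $!";
while (<$bfh>) {
    my ($chr, $tx, $ty) = split;
    my $list = $queries{$chr} or next;      # ignore other labels entirely
    for my $q (@$list) {
        my ($qx, $qy) = @$q;
        last if $qx > $ty;                  # later queries start even further right
        next if $qx < $tx or $qy > $ty;     # not contained in this range
        print "IN: query ($qx,$qy) falls inside $chr $tx $ty\n";
    }
}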

+2




To rephrase your question: you want to print all ranges in the file that contain the (x, y) pair and also have the same ID? In that case, you do not need to parse the whole file into memory first.

while (<DATA>) {
    my ($chr, $tx, $ty) = split /\s+/;
    print "IN : $chr, $tx, $ty\n"
        if $chr eq $chrid
            && $point_to_check_x >= $tx
            && $point_to_check_y <= $ty;
}


+1




OK, so let me restate the problem based on my understanding of your code. You have a file with a very large number of entries. Each entry includes a label ("chr1", "chr2", and so on) and two numbers, the first of which is less than the second. Then you have a query that contains a label and a number, and you want to know whether there is an entry in the large file that has the same label as the query and whose two values bracket the query number, one being smaller and the other larger. Essentially, you want to know whether the number in the query lies in the range specified by the two numbers in the entry.

Assuming my understanding is correct, the first thing to notice is that any entry that does not have the same label as the query has nothing to do with the problem, so you can ignore those. Your program reads them all, puts them in a hash, and then never looks at most of that data. Of course, if you have multiple queries, you will need to keep the data for each of the labels you are interested in, but you can drop the rest as early as possible. This will lower your memory requirement.

I would be tempted to take this a little further. Is it possible to split the huge file into smaller files? It would be nice to split it into files that only hold specific labels. For example, you might have one file per label, or one file for all data whose labels start with "a", and so on. That way you only open the files you know you are interested in, saving a lot of time and effort.

These changes alone may make enough of a difference for you. If you still need more performance, I would start thinking about how you store the records that you do need. Storing them ordered on the lower (or higher) of the two numbers should cut down the time you spend searching, especially if you keep them in a binary search tree or a trie.
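A small sketch of the first part of that idea, assuming the query is a single point as in this answer's restatement (the helper name find_covering_ranges is my own illustration; sorting by the lower number is what allows the early exit):

#!/usr/bin/perl
use strict;
use warnings;

# $ranges is an array ref of [start, end] pairs for ONE label,
# already sorted by start (the lower number).
sub find_covering_ranges {
    my ($ranges, $point) = @_;
    my @hits;
    for my $r (@$ranges) {
        last if $r->[0] > $point;          # sorted by start: nothing later can cover the point
        push @hits, $r if $r->[1] >= $point;
    }
    return @hits;
}

# Example: ranges kept for "chr1" only; everything else was dropped while reading.
my @chr1 = ( [100, 250], [120, 180], [300, 400] );
print "@$_\n" for find_covering_ranges(\@chr1, 150);   # prints the first two ranges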

Hopefully that gives you enough to get the job done.

+1




PDL for processing genomic data

We have processed many files in the same format as in your question and found that PDL (documentation) is a very good tool for this. It takes a while to learn, but it is definitely worth the effort if you are doing data processing in genomics: PDL can process huge files several thousand times faster than MySQL.

Here are some tips for where to go:

First of all, PDL is a Matlab-like language, but fully integrated with Perl. Read the documentation, work through some examples, and consult a mathematician for advice on which functions to use for which purposes.

PDL stores data in plain C arrays. Learn about Inline::C and access that data directly from C if PDL doesn't do the job for you. To me, PDL and Inline::C seem like the perfect combination: PDL for the high-level operations, Inline::C for anything that is missing. Even so, PDL is usually as fast as your best C, because it is implemented in C.

Use PDL::IO::FastRaw to store and access data in files on disk. I often write the files "by hand" (see below) and read them back as memory-mapped files (using PDL::IO::FastRaw::mapfraw, often with the ReadOnly => 1 flag). This is the most efficient way to read data from disk on Linux.

The format of the data files is trivial: just a sequence of C numbers. You can easily write such files in Perl with print FILEHANDLE pack("i*", @data); see "perldoc -f pack".

In my experience, simply reading the input files line by line and writing them out in binary is the slowest part of the processing: once you have a file that PDL can mmap, the processing itself is much faster.
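To give a flavour of this, here is a minimal sketch of the vectorised test itself. It is my own illustration, assuming PDL is installed; it reads the text file directly rather than a pre-packed binary file, so it skips the FastRaw/mmap step described above:

#!/usr/bin/perl
use strict;
use warnings;
use PDL;    # provides pdl() and which()

my ($file, $chrid, $qx, $qy) = @ARGV;

# Keep only the label we care about while reading.
my (@starts, @ends);
open my $fh, '<', $file or die "open $file: $!";
while (<$fh>) {
    my ($chr, $tx, $ty) = split;
    next unless $chr eq $chrid;
    push @starts, $tx;
    push @ends,   $ty;
}

# Turn the columns into piddles and do the containment test as one vector op.
my $start = pdl(\@starts);
my $end   = pdl(\@ends);
my $hits  = which( ($start <= $qx) & ($end >= $qy) );

print $hits->nelem ? "IN\n" : "OUT\n";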

I hope this helps - even if not much code.

+1








