Apply the numpy.searchsorted method to the array loaded from a text file using numpy.loadtxt

Question

Apply the numpy.searchsorted method to the array loaded from a text file using numpy.loadtxt

I am currently working on a bioinformatics project and I need to solve the following problem.

I have a text file "chr1.txt" containing two columns: positions on the chromosome and boolean variables True or False.

0 false
10000 true
10001 true
10005 false
10007 true
10011 false
10013 true
10017 false
10019 false
10023 false
10025 true
10029 true
10031 false
10035 true
10037 false
...
This data means that areas from 0 to 10000 are repeated or (= unmappable → false), 10000 to 10005 are unique (= displayable -> true), 10005 to 10007 are repeated again, and so on. The file ends at position 248'946'406 and has 15'948'271 lines. To find a general solution to the problem, I would like to limit the file to the lines you can see above.

I want to load this text file into a two column numpy array. I used numpy.loadtxt for this:

import numpy as np    
with open('chr1.txt','r') as f:
        chr1 = np.loadtxt(f, dtype={'names':('start','mappable'),
        'formats':('i4','S1')})

Here's the result:

In [39]: chr1
Out[39]: 
array([(0, b'f'), (10000, b't'), (10001, b't'), (10005, b'f'),
       (10007, b't'), (10011, b'f'), (10013, b't'), (10017, b'f'),
       (10019, b'f'), (10023, b'f'), (10025, b't'), (10029, b't'),
       (10031, b'f'), (10035, b't'), (10037, b'f')], 
      dtype=[('position start', '<i4'), ('mappable', 'S1')])

This doesn't look perfect to me as I want the second column to be recognized as a boolean type, but I haven't found a way to do this.

As the next one, I want to cast a random number between positions 10000 and 10037.

In [49]: np.random.randint(10000,10037)
Out[49]: 10012

Now I want to apply the numpy.searchsorted method to the first column of my array to find out if my genome is uniquely mapped at that position. So what I want as the output in this case is 5 (element index (10011, b'f ') in my array). If I try to retrieve an array consisting of only the first column positions, I get an error:

In [21]: chr1[:,0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-21-a63d052f1c5d> in <module>()
----> 1 chr1[:,0]

IndexError: too many indices for array

I'm guessing this is because my array doesn't really have two columns

In [40]: chr1.shape
Out[40]: (15,)

So how can I only extract the positions and apply the searchsorted method to them using my existing array? Should I load my text file into the array differently, so that there are actually two columns, first one consisting of an integer type and the second one being a boolean?

extracted_array=[0,10000,10001,10005,10007,10011,10013,10017,10019,10023,10025,10029,10031,10035,10037]
np.searchsorted(extracted_array,10012)-1
Out[58]: 5

Then I would look with the index found to see if the second argument is true or false and be able to infer if the position is in the display area.

Would be grateful for your help!

+3

python arrays numpy large-data

ElMing 05 May '17 at 10:55

source to share

1 answer

Divakar · Accepted Answer · 2017-05-05T11:16:11+0000

We can retrieve data matching position start

using chr1['position start']

and in a similar manner for the second field. We will get a boolean array of valid values matched against 't'

.

Thus, we will have one approach, for example:

indx = chr1['position start']
mask = chr1['mappable']=='t'
rand_num = np.random.randint(10000,10037)
matched_indx = np.searchsorted(indx, rand_num)-1

if mask[matched_indx]:
    print "It is mappable!"
else:
    print "It is NOT mappable!"

1) Getting data and mask / boolean array -

In [283]: chr1   # Input array
Out[283]: 
array([(    0, 'f'), (10000, 't'), (10001, 't'), (10005, 'f'),
       (10007, 't'), (10011, 'f'), (10013, 't'), (10017, 'f'),
       (10019, 'f'), (10023, 'f'), (10025, 't'), (10029, 't'),
       (10031, 'f'), (10035, 't'), (10037, 'f')], 
      dtype=[('position start', '<i4'), ('mappable', 'S1')])

In [284]: indx = chr1['position start']
     ...: mask = chr1['mappable']=='t'
     ...: 

In [285]: indx
Out[285]: 
array([    0, 10000, 10001, 10005, 10007, 10011, 10013, 10017, 10019,
       10023, 10025, 10029, 10031, 10035, 10037], dtype=int32)

In [286]: mask
Out[286]: 
array([False,  True,  True, False,  True, False,  True, False, False,
       False,  True,  True, False,  True, False], dtype=bool)

2) Get a random number and use searchsorted

and use the IF-ELSE part -

In [297]: rand_num = 10012 # np.random.randint(10000,10037)

In [298]: matched_indx = np.searchsorted(indx, rand_num)-1

In [299]: matched_indx
Out[299]: 5

In [300]: if mask[matched_indx]:
     ...:     print "It is mappable!"
     ...: else:
     ...:     print "It is NOT mappable!"
     ...:     
It is NOT mappable!

Apply the numpy.searchsorted method to the array loaded from a text file using numpy.loadtxt

More articles: