How can one concatenate two very large strings and return indices for matches and mismatches?

Question

How can one concatenate two very large strings and return indices for matches and mismatches?

I have a set of text files that contain two very large character sets of the same length. Symbol sets are DNA sequences, so I will call them seq_1

and seq_2

, and together they are called alignment

. The files look like this:

>HEADER1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>HEADER2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Possible characters in sequence 1 below >HEADER1

are ACGTN-

, and possible characters in sequence 2 below >HEADER2

are equal ACGTN-*

.

I want to parse sequences and return two lists of indices, which I will call valid

and mismatch

.

valid

contains all (1-based) indices where the positions in both sequences ("alignment") are in the set ACGT

; mismatch

contains all indices (1 based) where the positions in the alignment are in the set ACGT

but not the same. Thus, mismatch

is a subset of valid

.

The end-point is that I DO NOT increment the index counter at the positions where sequence 1 is "-"

, because they are considered "spaces" in the coordinate system being used.

This example shows the alignment of my expected result:

seq_1      = 'CT-A-CG'  # total length = 5 because of the two gaps
seq_2      = 'CNCTA*G'
valid      = [1,3,5]    # all indices where both sequences are in 'ACGT' without counting gaps
mismatch   = [3]        # subset of valid where alignment does not match

I want to improve on my current code (see below) which involves fetching sequences regularly, encrypting and enumerating non-market sites into a generator object, and then the main step of the time it takes to loop through that generator and populate the two lists. I feel like there must be an array-based solution or solution to this problem itertools

that would be much more efficient than a for loop through a loop and their index zipped together and I'm looking for suggestions.

code:

def seq_divergence(infile):

    re_1 = re.compile('>HEADER1\n([^>]+)', re.MULTILINE)
    re_2 = re.compile('>HEADER2\n([^>]+)', re.MULTILINE)    

    valid     = []
    mismatch  = []

    mycmp     = cmp  # create local cmp for faster calling in the loop

    with open(infile, 'rb') as f:

        alignment = f.read()

    # get sequences and remove header and newlines
    seq_1       = iter(re_1.search(alignment).group(1).replace('\n',''))
    seq_2       = iter(re_2.search(alignment).group(1).replace('\n',''))

    # GENERATOR BLOCK:
    rm_gaps     = ((i, j) for (i, j) in it.izip(seq_1, seq_2) if i != '-')  # remove gaps
    all_sites   = enumerate(rm_gaps, 1)  # enumerate (1-based)
    valid_sites = ((pos, i, j) for (pos, (i, j)) in all_sites if set(i+j).issubset('ACGT'))  # all sites where both sequences are valid

    for (pos, i, j) in valid_sites:

        valid += [pos]
        if mycmp(i,j):
            mismatch += [pos]

    return valid, mismatch

EDIT: per popular request, here is a link to one of the files for people who would like to test their code: https://www.dropbox.com/s/i904fil7cvv1vco/chr1_homo_maca_100Multiz.fa?dl=0

+3

python iterator arrays regex enumerate

isosceleswheel Jul 29. At 1:11

source to share

2 answers

Patrick maupin · Answer 1 · 2015-07-29T05:19:37+0000

By reading your code, I can tell that you are a smart person, so I'm going to give you something completely untested and let you know how to make it work, and whether or not it goes faster than what you have has already: -)

(Hey, it doesn't look like you gave a realistic dataset in your question anyway ...)

EDIT - Use hexadecimal numbers to calculate inconsistencies.

#! /usr/bin/env python2.7

# Text is much faster in 2.7 than 3...

def rm_gaps(seq1, seq2):
    ''' Given a first sequence with gaps removed,
        do the same operation to a second sequence.
    '''
    startpos = 0
    for substring in seq1:
        length = len(substring)
        yield seq2[startpos:length]
        startpos += length + 1

def seq_divergence(infile):

    # Build a character translation map with ones
    # in the positions of valid bases, and
    # another one with hex numbers for each base.

    transmap_v = ['0'] * 256
    transmap_m = ['0'] * 256
    for ch, num in zip('ACGT', '1248'):
        transmap_v[ord(ch)] = '1'
        transmap_m[ord(ch)] = num
    transmap_v = ''.join(transmap_v)
    transmap_m = ''.join(transmap_m)


    # No need to do everything inside open -- you are
    # done with the file once you have read it in.
    with open(infile, 'rb') as f:
        alignment = f.read()

    # For this case, using regular string stuff might be faster than re

    h1 = '>HEADER1\n'
    h2 = h1.replace('1', '2')

    h1loc = alignment.find(h1)
    h2loc = alignment.find(h2)

    # This assumes header 1 comes before header 2.  If that is
    # not invariant, you will need more code here.

    seq1 = alignment[h1loc + len(h1):h2loc].replace('\n','')
    seq2 = alignment[h2loc + len(h2):].replace('\n','')

    # Remove the gaps
    seq1 = seq1.split('-')
    seq2 = rm_gaps(seq1, seq2)
    seq1 = ''.join(seq1)
    seq2 = ''.join(seq2)

    assert len(seq1) == len(seq2)

    # Let use Python long integer capability to
    # find locations where both sequences are valid.
    # Convert each sequence into a binary number,
    # and AND them together.
    num1 = int(seq1.translate(transmap_v), 2)
    num2 = int(seq2.translate(transmap_v), 2)
    valid = ('{0:0%db}' % len(seq1)).format(num1 & num2)
    assert len(valid) == len(seq1)


    # Now for the mismatch -- use hexadecimal instead
    # of binary here.  The 4 bits per character position
    # nicely matches our 4 possible bases.
    num1 = int(seq1.translate(transmap_m), 16)
    num2 = int(seq2.translate(transmap_m), 16)
    mismatch = ('{0:0%dx}' % len(seq1)).format(num1 & num2)
    assert len(match) == len(seq1)

    # This could possibly use some work.  For example, if
    # you expect very few invalid and/or mismatches, you
    # could split on '0' in both these cases, and then
    # use the length of the strings between the zeros
    # and concatenate ranges for valid, or use them as
    # skip distances for the mismatches.

    valid = [x for x, y in enumerate(valid,1) if y == '1']
    mismatch = [x for x, y in enumerate(mismatch, 1) if y == '0']

    return valid, mismatch

WillemM · Answer 2 · 2015-07-29T05:44:49+0000

Here's my version (but you can start throwing rocks at me to obfuscate). Please post some longer test data, I would like to test it ...

seq1='CT-A-CG'
seq2='CNCTA*G'

import numpy as np
def is_not_gap(a): return 0 if (a=='-') else 1
def is_valid(a): return 1 if (a=='A' or a=='C' or a=='G' or a=='T' ) else 0
def is_mm(t): return 0 if t[0]==t[1] else 1

# find the indexes to be retained
retainx=np.where( np.multiply( map(is_not_gap, seq1), map(is_not_gap, seq2) ) )[0].tolist()

# find the valid ones
valid0=np.where(np.multiply( map( is_valid, seq1),map( is_valid, seq2))[retainx])[0].tolist()

# find the mismatches
mm=np.array(map( is_mm, zip( seq1,seq2)))
mismatch0=set(valid0) & set(np.where(mm[retainx])[0])

Results (zero-based indexing):

 valid0
 [0, 2, 4]

 mismatch0
 {2}

(I can post a longer, more detailed version if you want)

How can one concatenate two very large strings and return indices for matches and mismatches?

More articles: