How to speed up itertools.combinations?

As part of my research on genotype networks, I have a block of code that builds a one-difference network from a list of strings. The procedure is as follows:

  • Iterate over all pairwise combinations of strings.
  • If two strings differ at exactly one position, draw an edge between them in the graph.
  • Otherwise, skip the pair.

The block of code I have now is this:

from itertools import combinations
import Levenshtein as lev  # wraps a C implementation of Levenshtein distance
import networkx as nx
strings = [...list of strings...]

G = nx.Graph()

for (s1, s2) in combinations(strings, 2):
    if s1 not in G.nodes():
        G.add_node(s1)
    if s2 not in G.nodes():
        G.add_node(s2)

    if lev.distance(s1, s2) == 1:
        G.add_edge(s1, s2)

      

Obviously, there is no way to improve the computational complexity of the all-pairs procedure: it will always be O(n**2). At least, with my limited knowledge, this is what I think; maybe I am wrong?

However, given the typical number of sequences I need to compare (roughly 2000-5000), a few orders of magnitude of overall speedup would make the wall-clock time acceptable; with the current Python implementation it takes days to build a graph. I tried the Cython version below (with the appropriate imports, not listed), but couldn't figure out how to make it faster:

cpdef cython_genotype_network(sequences):

    G = nx.Graph()
    cdef:
        unicode s1
        unicode s2

    for (s1, s2) in combinations(sequences, 2):
        if lev.distance(s1, s2) == 1:
            G.add_edge(s1, s2)

    return G

      

In particular, Cython expects bytes, not str, for s1 and s2, so this block of code raises an error.

So ... I come to my two questions:

  • Q1: Will a Cython implementation help? And how can I fix the bytes vs. str issue?
  • Q2: Is it possible to solve this problem with numpy? Converting a numpy matrix to a NetworkX graph is easy; however, I cannot figure out how to apply the Levenshtein distance function over an empty n-by-n matrix where each row and column corresponds to a sequence.
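
Regarding Q2, here is one way a numpy version might look. This is a sketch under the assumption that all sequences have equal length (true for the sample data below, so Hamming distance 1 equals Levenshtein distance 1); `adjacency_matrix` is a name I made up, not from the question:

```python
import numpy as np

def adjacency_matrix(sequences):
    """Boolean adjacency matrix for equal-length sequences:
    entry (i, j) is True iff sequences i and j differ at exactly
    one position."""
    # Shape (n, m): one row of byte codes per sequence.
    arr = np.array([list(s.encode("ascii")) for s in sequences],
                   dtype=np.uint8)
    # Pairwise mismatch counts via broadcasting, shape (n, n).
    # The intermediate array is O(n**2 * m) memory, so chunk the
    # outer axis for large inputs.
    dist = (arr[:, None, :] != arr[None, :, :]).sum(axis=-1)
    return dist == 1

adj = adjacency_matrix(["AAT", "AAA", "GGG"])
```

The resulting matrix can then be turned into a graph, e.g. with `nx.from_numpy_matrix(adj.astype(int))` (or `from_numpy_array` in newer NetworkX), relabeling the integer nodes back to sequences afterwards.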

UPDATE 1: How to generate sample data

To generate the sequences:

from random import choice

def create_random_nucleotides_python(num_nucleotides):
    """
    Creates random nucleotides of length num_nucleotides.
    """

    sequence = ''

    letters = ['A', 'T', 'G', 'C']

    for i in range(num_nucleotides):
        sequence = sequence + choice(letters)

    return sequence


def mutate_random_position(string):
    """
    Mutates one position in the nucleotide sequence at random.
    """

    positions = [i for i in range(len(string))]
    pos_to_mut = choice(positions)

    letters = ['A', 'T', 'G', 'C']

    new_string = ''
    for i, letter in enumerate(string):
        if i == pos_to_mut:
            new_string = new_string + choice(letters)
        else:
            new_string = new_string + letter

    return new_string


# Create 100 strings by starting from one random base sequence, then
# repeatedly picking a random sequence and mutating a single position.
base_sequence = create_random_nucleotides_python(1000)

sequences = [base_sequence]

for i in range(99):
    sequence = choice(sequences)
    mutseq = mutate_random_position(sequence)
    sequences.append(mutseq)

      


2 answers


As for the complexity:

You are looking at every pair of strings. You don't need to. Instead, for each string, you can generate every string that is one edit away and check whether it exists:

# I would start by populating the whole graph:
for s1 in strings:
    G.add_node(s1)   # add_node is idempotent; no membership check needed
# Then compute the leven-1 neighbours:
for s1 in strings:
    for s2 in variations(s1):
        if s2 in G:  # O(1); `in G.nodes()` would scan a list
            G.add_edge(s1, s2)

      

Now you need variations(string) to be cheap, yielding O(m) candidates per string:

It returns all variants at distance 1 (a single substitution, deletion, or insertion):



def variations(string):
    for i in range(len(string)):
        # delete
        yield string[:i] + string[i+1:]
        # substitute (skip the original letter, otherwise the string
        # itself is yielded and the loop above adds a self-loop)
        for l in 'ATGC':
            if l != string[i]:
                yield string[:i] + l + string[i+1:]
        # insert
        for l in 'ATGC':
            yield string[:i] + l + string[i:]

    # insert at the end
    for l in 'ATGC':
        yield string + l

      

Now the tricky part is that variations(string) is O(m**2) (because of the string concatenation), where m is the length of the longest sequence. If m is known and bounded, that is a constant, and all of it is O(1).

If the sequences are all the same length, you only need to generate the substitutions.

Otherwise, you can sort your sequences from largest to smallest and generate only substitutions and deletions.

Or you can sort the strings by length and, instead of comparing all strings to all others, compare only those whose lengths differ by at most 1.
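
Putting the pieces above together, a minimal self-contained sketch; `one_diff_edges` and the `alphabet` argument are my own names, and the distance check is done purely by set membership:

```python
def variations(string, alphabet="ATGC"):
    """Yield every string at edit distance exactly 1 from `string`."""
    for i in range(len(string)):
        yield string[:i] + string[i+1:]                  # deletion
        for l in alphabet:
            if l != string[i]:
                yield string[:i] + l + string[i+1:]      # substitution
            yield string[:i] + l + string[i:]            # insertion
    for l in alphabet:
        yield string + l                                 # insertion at the end

def one_diff_edges(strings):
    """Collect all distance-1 pairs; the result can be handed to
    networkx with G.add_edges_from(edges)."""
    present = set(strings)                  # O(1) membership tests
    edges = set()
    for s1 in strings:
        for s2 in variations(s1):
            if s2 in present:
                edges.add(frozenset((s1, s2)))  # (a, b) == (b, a)
    return edges
```

For n sequences of length m this does O(n * m**2) work in total, replacing the O(n**2) pairwise scan with work linear in the number of sequences.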



Some thoughts on how the cython code could be made faster:

for (s1, s2) in combinations(sequences, 2):
    if lev.distance(s1, s2) == 1:
        G.add_edge(s1, s2)

      

I think itertools.combinations is compiled, so it is fast in itself. And you call it once, so the for ... part is probably as fast as it gets. However, cython would not easily speed up the iteration over sequences.

lev.distance appears to use compiled code. Could you import and call that code directly? Look at the lev source.

I think lev.distance ends up calling a C function with the signature:

distance_py(PyObject *self, PyObject *args)

or, more usefully:

lev_edit_distance(size_t len1, const lev_byte *string1,
              size_t len2, const lev_byte *string2,
              int xcost)

      

https://github.com/ztane/python-Levenshtein/blob/master/Levenshtein/_levenshtein.c


What does G.add_edge do? Is there a simpler data structure in which you could collect these "edges", maybe a list of tuples?

Does networkx provide a way to add nodes and edges in bulk? Do you always need to use add_node and add_edge, or can you give it a list of nodes and a list of node pairs (tuples)?


It looks like a graph is stored as a Python dictionary, or a dictionary of dictionaries. networkx lets you add a batch of edges via a list of tuples:

>>> G.add_edges_from([(1,2),(1,3)])

      

http://networkx.github.io/documentation/latest/tutorial/tutorial.html#edges

In the Python code, this part is probably not needed:

if s1 not in G.nodes():
    G.add_node(s1)
if s2 not in G.nodes():
    G.add_node(s2)

      

Just do:

G.add_nodes_from(strings)

      

http://networkx.github.io/documentation/latest/tutorial/tutorial.html#nodes
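
A minimal sketch of the bulk API in use; `hamming1` is a stand-in for `lev.distance(s1, s2) == 1` on equal-length strings, and the four test strings are made up:

```python
import networkx as nx
from itertools import combinations

def hamming1(s1, s2):
    """True iff equal-length strings differ at exactly one position."""
    return len(s1) == len(s2) and sum(a != b for a, b in zip(s1, s2)) == 1

strings = ["AAA", "AAT", "ATT", "GGG"]

G = nx.Graph()
G.add_nodes_from(strings)  # bulk insert: no per-node membership checks
G.add_edges_from((s1, s2)
                 for s1, s2 in combinations(strings, 2)
                 if hamming1(s1, s2))
```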
