Optimize this function with numpy (or other vectorization techniques)

Question

Optimize this function with numpy (or other vectorization techniques)

I am computing a classic population genetics calculation with Python. I am well aware that there are many algorithms that do this job, but I wanted to build my own for some reason.

Below is a snapshot because MathJax is not supported on StackOverflow

enter image description here

I would like to have an efficient algorithm for calculating those Fst

. At the moment I only manage to do loops and the calculations will not be vectorized. How can I accomplish this calculation using numpy (or other vectorization techniques)?

Here's the code I think should make the job work:

def Fst(W, p):
    I = len(p[0])
    K = len(p)
    H_T = 0
    H_S = 0
    for i in xrange(I):
        bar_p_i = 0
        for k in xrange(K):
            bar_p_i += W[k] * p[k][i]
            H_S += W[k] * p[k][i] * p[k][i]
        H_T += bar_p_i*bar_p_i
    H_T = 1 - H_T
    H_S = 1 - H_S
    return (H_T - H_S) / H_T

def main():
    W = [0.2, 0.1, 0.2, 0.5]
    p = [[0.1,0.3,0.6],[0,0,1],[0.4,0.5,0.1],[0,0.1,0.9]]
    F = Fst(W,p)
    print("Fst = " + str(F))
    return

main()

+3

performance python vectorization numpy

Remi.b 01 jul. 15 at 17:39

source to share

1 answer

Patrick mineault · Accepted Answer · 2015-07-01T18:17:37+0000

There is no reason to use loops here. And you really shouldn't be using Numba or Cython for this stuff - linear algebra expressions like what you have are the whole reason for vectorized operations in Numpy.

Since this type of problem will pop up over and over again if you keep using Numpy, I would recommend getting a basic linear algebra descriptor in Numpy. You can find this book in the chapter:

https://www.safaribooksonline.com/library/view/python-for-data/9781449323592/ch04.html

As for your specific situation: start by creating numpy arrays from your variables:

import numpy as np
W = np.array(W)
p = np.array(p)

Now your \ bar p_i ^ 2 is defined by the dot product. It's easy:

bar_p_i = p.T.dot(W)

Note the T for transposition, because the dot product takes the sum of the elements indexed by the last index of the first matrix and the first index of the second matrix. The transport inverts the indices, so the first index becomes the last.

You H_t is determined by the amount. It's also easy:

H_T = 1 - bar_p_i.sum()

Likewise for your H_S:

H_S = 1 - ((bar_p_i**2).T.dot(W)).sum()

Optimize this function with numpy (or other vectorization techniques)

More articles: