Pearson's similarity score, how can I optimize this further?

I have an implementation of the Pearson similarity score for comparing two dictionaries of values. More time is spent in this method than anywhere else in my program (potentially many millions of calls), so it is the critical place to optimize.

Even the smallest optimization could make a big difference, so I want to explore every possible improvement.

Here's what I have so far:

from math import sqrt

def simple_pearson(v1, v2):

    si = [val for val in v1 if val in v2]

    n = len(si)

    if n==0: return 0.0

    sum1 = 0.0
    sum2 = 0.0
    sum1_sq = 0.0
    sum2_sq = 0.0
    p_sum = 0.0

    for v in si:
        val_1 = v1[v]
        val_2 = v2[v]
        sum1+=val_1
        sum2+=val_2
        sum1_sq+=pow(val_1,2)
        sum2_sq+=pow(val_2,2)
        p_sum+=val_1*val_2

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = (sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n)
    if temp < 0.0:
        temp = -temp
    den = sqrt(temp)
    if den==0: return 1.0

    r = num/den

    return r

      

+2




8 answers


Scipy is the fastest!

I ran some timings with the code above, as well as a version found on my own machine; results and code are below:



pearson 14.7597990757
sim_pearson 15.6806837987
scipy: pearsonr 0.451986019188

try:
    import psyco
    psyco.full()
except ImportError:
    pass

from math import sqrt

def sim_pearson(set1, set2):
    si = {}
    for item in set1:
        if item in set2:
            si[item] = 1

    # number of elements
    n = len(si)

    # if none in common, return 0 similarity
    if n == 0: return 0

    # add up all the preferences
    sum1 = sum([set1[item] for item in si])
    sum2 = sum([set2[item] for item in si])

    # sum up the squares
    sum_sq1 = sum([pow(set1[item], 2) for item in si])
    sum_sq2 = sum([pow(set2[item], 2) for item in si])

    # sum up the products
    sum_p = sum([set1[item] * set2[item] for item in si])

    nom = sum_p - ((sum1 * sum2) / n)
    den = sqrt((sum_sq1 - (sum1) ** 2 / n) * (sum_sq2 - (sum2) ** 2 / n))

    if den == 0: return 0
    return nom / den


# from http://stackoverflow.com/questions/1307016/pearson-similarity-score-how-can-i-optimise-this-further
def pearson(v1, v2):
    vs = [(v1[val], v2[val]) for val in v1 if val in v2]

    n = len(vs)

    if n == 0: return 0.0

    sum1, sum2, sum1_sq, sum2_sq, p_sum = 0.0, 0.0, 0.0, 0.0, 0.0

    for v1, v2 in vs:
        sum1 += v1
        sum2 += v2
        sum1_sq += v1 * v1
        sum2_sq += v2 * v2
        p_sum += v1 * v2

    # Calculate Pearson score
    num = p_sum - (sum1 * sum2 / n)
    temp = max((sum1_sq - pow(sum1, 2) / n) * (sum2_sq - pow(sum2, 2) / n), 0)
    if temp:
        return num / sqrt(temp)
    return 1.0


if __name__ == "__main__":
    import timeit

    tsetup = """
from random import randrange
from __main__ import pearson, sim_pearson
from scipy.stats import pearsonr
v1 = [randrange(0, 1000) for x in range(1000)]
v2 = [randrange(0, 1000) for x in range(1000)]
# gc.enable()
"""
    t1 = timeit.Timer(stmt="pearson(v1, v2)", setup=tsetup)
    t2 = timeit.Timer(stmt="sim_pearson(v1, v2)", setup=tsetup)
    t3 = timeit.Timer(stmt="pearsonr(v1, v2)", setup=tsetup)

    tt = 1000

    print 'pearson', t1.timeit(tt)
    print 'sim_pearson', t2.timeit(tt)
    print 'scipy: pearsonr', t3.timeit(tt)

+2




The real speed boost will come from switching to numpy or scipy. Beyond that, there are micro-optimizations: for example, x*x is faster than pow(x, 2), and you can retrieve the values at the same time as the keys, rather than:

si = [val for val in v1 if val in v2]

      

something like

vs = [ (v1[val],v2[val]) for val in v1 if val in v2]

      



and then

sum1 = sum(x for x, y in vs)

      

etc.; whether each of these actually has a timing advantage needs micro-benchmarking. Also, depending on how you use these coefficients, you may be able to save the sqrt by working with the squared coefficient instead (the same idea as comparing squared distances between points in geometry rather than the distances themselves, and for the same reason: it saves you a sqrt, which makes sense because the coefficient IS a distance, kind of... ;-).
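As a concrete sketch of the two ideas above (pulling both values in one pass, and deferring the sqrt until it is actually needed), assuming dict inputs as in the question; pearson_sq is a hypothetical helper name:

```python
from math import sqrt

def pearson_sq(v1, v2):
    """Return (num, den_sq): the Pearson numerator and the *squared*
    denominator, so callers can rank pairs without calling sqrt per pair."""
    vs = [(v1[k], v2[k]) for k in v1 if k in v2]
    n = len(vs)
    if n == 0:
        return 0.0, 0.0
    sum1 = sum(x for x, y in vs)
    sum2 = sum(y for x, y in vs)
    sum1_sq = sum(x * x for x, y in vs)
    sum2_sq = sum(y * y for x, y in vs)
    p_sum = sum(x * y for x, y in vs)
    num = p_sum - sum1 * sum2 / n
    den_sq = (sum1_sq - sum1 * sum1 / n) * (sum2_sq - sum2 * sum2 / n)
    return num, den_sq

# Ranking pairs by the signed squared score num*abs(num)/den_sq orders
# them the same way as num/sqrt(den_sq), with no sqrt in the loop.
a = {1: 1.0, 2: 2.0, 3: 3.0}
b = {1: 2.0, 2: 4.0, 3: 6.0}
num, den_sq = pearson_sq(a, b)
r = num / sqrt(den_sq) if den_sq else 1.0
```

Whether the generator-expression sums beat one explicit loop is exactly the kind of thing that needs micro-benchmarking; the deferred sqrt only helps if you rank or threshold many scores without needing r itself.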

+4




If you can use scipy, you can use the pearsonr function: http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#pearsonr

Or you can copy/paste the code (it has a liberal license) from http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py (search for def pearsonr()). In that code, np is just numpy (the module does import numpy as np).
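Since the question's inputs are dicts rather than aligned arrays, some glue is needed before calling a vectorized routine. A sketch using plain numpy (np.corrcoef, which scipy's pearsonr builds on) rather than scipy itself; pearson_np is a hypothetical name:

```python
import numpy as np

def pearson_np(v1, v2):
    # Align the two dicts on their shared keys, then let numpy do the math.
    common = [k for k in v1 if k in v2]
    if not common:
        return 0.0
    a = np.array([v1[k] for k in common], dtype=float)
    b = np.array([v2[k] for k in common], dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r.
    # Note: a constant vector yields nan here, unlike the hand-rolled
    # versions above, which special-case a zero denominator.
    return float(np.corrcoef(a, b)[0, 1])
```

The per-call overhead of building the aligned arrays is real, so this only wins once the overlap is reasonably large; it is worth timing against the pure-Python loop for your data sizes.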

+2




I suggest changing:

[val for val in v1 if val in v2]

      

to

set(v1) & set(v2)

      

and using

if not n: return 0.0    # and similarly for den

instead of

if n == 0: return 0.0

and finally it's worth replacing the last six lines with:

try:
    return num / sqrt(abs(temp))
except ZeroDivisionError:
    return 1.0
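
Putting the three suggestions together, the whole function might look like this (a sketch, assuming the same dict inputs as the question; pearson_set is a hypothetical name):

```python
from math import sqrt

def pearson_set(v1, v2):
    common = set(v1) & set(v2)  # shared keys via set intersection
    n = len(common)
    if not n:
        return 0.0
    sum1 = sum2 = sum1_sq = sum2_sq = p_sum = 0.0
    for k in common:
        x, y = v1[k], v2[k]
        sum1 += x
        sum2 += y
        sum1_sq += x * x
        sum2_sq += y * y
        p_sum += x * y
    num = p_sum - sum1 * sum2 / n
    temp = (sum1_sq - sum1 * sum1 / n) * (sum2_sq - sum2 * sum2 / n)
    try:
        # EAFP: let the rare zero denominator raise instead of testing it
        # on every call.
        return num / sqrt(abs(temp))
    except ZeroDivisionError:
        return 1.0
```

Whether set(v1) & set(v2) beats the list comprehension depends on overlap size and dict sizes, so time it on representative data.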

      

+1




Since it looks like you are doing quite a lot of numerical computation, you should give Psyco a shot. It is a JIT compiler that analyzes running code and optimizes certain operations. Install it, then at the top of your file put:

try:
    import psyco
    psyco.full()
except ImportError:
    pass

      

This will let Psyco JIT-compile your code and speed it up a bit, essentially for free (though it does use some extra memory).

+1




If the inputs to any of your math functions are fairly constrained, you can use a lookup table instead of the math function. This can buy you some speed at the cost of the extra memory to store the table.
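A minimal sketch of the idea, under the assumption (not stated in the question) that the values are small integers, so the sqrt argument is an integer in a bounded range; MAX_TEMP and fast_sqrt are hypothetical names:

```python
from math import sqrt

# Precompute sqrt once for every integer in the expected range, then
# replace the function call with an indexed list lookup.
MAX_TEMP = 10000
SQRT_TABLE = [sqrt(i) for i in range(MAX_TEMP + 1)]

def fast_sqrt(temp):
    # Fall back to math.sqrt for non-integers or out-of-range values.
    if isinstance(temp, int) and 0 <= temp <= MAX_TEMP:
        return SQRT_TABLE[temp]
    return sqrt(temp)
```

In CPython, math.sqrt is already a single C call, so a Python-level lookup may well be a wash; this technique pays off more for expensive functions, and only a micro-benchmark on your data will tell.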

0




I'm not sure whether this applies in Python, but calculating sqrt is CPU-intensive.

You could use a fast Newton-iteration approximation instead.
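For what it's worth, in CPython math.sqrt is a single C call and will almost certainly beat a Python-level loop, so the sketch below is only to illustrate the Newton iteration itself (newton_sqrt is a hypothetical name):

```python
def newton_sqrt(x, iterations=8):
    # Newton's iteration for sqrt(x): repeatedly average the guess with
    # x / guess. Convergence is quadratic, so a handful of iterations
    # from a reasonable seed gives high accuracy.
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess
```

In a lower-level language, or with a cruder seed and fewer iterations, this trade-off between accuracy and speed can be worthwhile; in pure Python it generally is not.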

0




I'll post what I have so far as an answer, to differentiate it from the question. It is a combination of some of the techniques described above, which seem to have yielded the best improvement.

from math import sqrt

def pearson(v1,v2):
    vs = [(v1[val],v2[val]) for val in v1 if val in v2]

    n = len(vs)

    if n==0: return 0.0

    sum1,sum2,sum1_sq,sum2_sq,p_sum = 0.0, 0.0, 0.0, 0.0, 0.0

    for v1,v2 in vs:
        sum1+=v1
        sum2+=v2
        sum1_sq+=v1*v1
        sum2_sq+=v2*v2
        p_sum+=v1*v2

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = max((sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n),0)
    if temp:
        return num / sqrt(temp)
    return 1.0

      

Edit: It looks like psyco gives about a 15% improvement for this version, which is not massive but enough to justify using it.

0








