Fast distance calculation in scipy and numpy

Let A,B

be be ((day,observation,dim))

arrays. Each array contains the same number of observations for a given day, with the observation being a point with dim dimensions (that is, floats). For each day, I want to calculate the spatial distances between all observations on A

and B

on that day.

For example:

import numpy as np
from scipy.spatial.distance import cdist

A, B = np.random.rand(50,1000,10), np.random.rand(50,1000,10)

output = []
for day in range(50):
    output.append(cdist(A[day],B[day]))

      

where i am using scipy.spatial.distance.cdist

.

Is there a faster way to do this? Ideally, I would like to get an array output

a ((day,observation,observation))

that contains, for each day, the paired distances between observations on A

and B

on that day, although somehow avoids the loop for several days.

+3


source to share


2 answers


One way to do this (although it will take a huge amount of memory) is to make clever use of broadcasting arrays:

output = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )

      


Edit

But after some testing it seems like scikit-learn euclidean_distances

is probably the best option for large arrays. (Note that I rewrote your loop in a list comprehension.)



That's 100 data points per day:

# your own code using cdist
from scipy.spatial.distance import cdist
%timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])

100 loops, best of 3: 8.81 ms per loop

# pure numpy with broadcasting
%timeit dists2 = np.sqrt( np.sum( (A[:,:,np.newaxis,:] - B[:,np.newaxis,:,:])**2, axis=-1) )

10 loops, best of 3: 46.9 ms per loop

# scikit-learn algorithm
from sklearn.metrics.pairwise import euclidean_distances
%timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])
100 loops, best of 3: 12.6 ms per loop

      

and that's 2000 data points per day:

In [5]: %timeit dists1 = np.asarray([cdist(x,y) for x, y in zip(A, B)])
1 loops, best of 3: 3.07 s per loop

In [7]: %timeit dists3 = np.asarray([euclidean_distances(x,y) for x, y in zip(A, B)])

1 loops, best of 3: 2.94 s per loop

      

+3


source


Edit: I'm an idiot and forgot python map

is lazy evaluating. My "quick" code didn't actually do any work! Forced scoring has removed the performance gain.



I think your time will depend on the time spent inside the scipy function. I'd use map

a loop instead as I think it's a bit neat, but I don't think it's a magical way to get a huge performance boost here. Perhaps compiling the code with cython or with numba will help a little.

+1


source







All Articles