Is there a way to make this kNN Python function more efficient?

After problems with MATLAB, I decided to try Python:

I wrote a function that calculates kNN when the samples are instances of my own class, using my own distance function:

def closestK(sample, otherSamples, distFunc, k):
    "Returns the closest k samples to sample based on distFunc"
    n = len(otherSamples)
    d = [distFunc(sample, otherSamples[i]) for i in range(0, n)]
    idx = sorted(range(0, len(d)), key=lambda j: d[j])
    # idx[0] is the sample itself (distance 0), so skip it
    return idx[1:(k + 1)]

def kNN(samples, distFunc, k):
    return [[closestK(samples[i], samples, distFunc, k)] for i in range(len(samples))]


and this is the distance function:

@staticmethod    
def distanceRepr(c1, c2):
    r1 = c1.repr
    r2 = c2.repr
    # because cdist needs 2D array
    if r1.ndim == 1:
        r1 = np.vstack([r1,r1])
    if r2.ndim == 1:
        r2 = np.vstack([r2,r2])

    return scipy.spatial.distance.cdist(r1, r2, 'euclidean').min()


But it is still surprisingly slower than the "normal" kNN function, even when using the "brute" algorithm. Am I doing something wrong?

UPDATE

I am adding the constructor of the class below. The repr attribute holds a set of representative vectors (anywhere from one to many), and the distance between two clusters is calculated as the minimum Euclidean distance between their two sets of representatives.
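
For example, a toy sketch (made-up vectors, not part of my code) of what that definition means:

import numpy as np
from scipy.spatial.distance import cdist

# toy representatives, only to illustrate the definition of the distance
r1 = np.array([[0.0, 0.0], [1.0, 1.0]])   # cluster 1 has two representatives
r2 = np.array([[3.0, 0.0]])               # cluster 2 has one representative

# minimum over all pairs ||x - y||, written both as loops and via cdist
d_loops = min(np.linalg.norm(x - y) for x in r1 for y in r2)
d_cdist = cdist(r1, r2, 'euclidean').min()
assert np.isclose(d_loops, d_cdist)        # both give sqrt(5), about 2.236

The constructor: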

class myCluster:
    def __init__(self, index=-1, P=np.array([])):
        if index == -1:
            # empty cluster
            self.repr = np.array([])
            self.IDs = np.array([])
            self.n = 0
            self.center = np.array([])
        else:
            # singleton cluster holding the single pattern P with id index
            self.repr = np.array(P)
            self.IDs = np.array(index)
            self.n = 1
            self.center = np.array(P)


and the rest of the relevant code (X is a matrix whose rows are patterns and columns are variables):

level = [myCluster(i, X[i,:]) for i in range(0,n)]
kNN(level, myCluster.distanceRepr, 3)


UPDATE 2

I took some measurements, and the line that takes most of the time is

d = [distFunc(sample, otherSamples[i]) for i in range(0,n)]


So there is something with distFunc. When I change it to simply return

np.linalg.norm(c1.repr - c2.repr)

i.e. a "normal" vector calculation with no sorting, the running time remains unchanged. So the problem is in calling this function. Does it make sense that using classes changes the running time by a factor of 60?



2 answers


You're just running into the slowness of Python (or rather of the CPython interpreter, I should say). From Wikipedia:

NumPy targets the CPython reference implementation of Python, which is a non-optimizing bytecode compiler/interpreter. Mathematical algorithms written for this version of Python often run much slower than compiled equivalents. NumPy addresses this problem by providing multidimensional arrays and functions and operators that operate efficiently on arrays. Thus any algorithm that can be expressed primarily as operations on arrays and matrices can run almost as fast as the equivalent C code.

And from the Scipy FAQ:

Python's lists are efficient general-purpose containers. They support (fairly) efficient insertion, deletion, appending, and concatenation, and Python's list comprehensions make them easy to construct and manipulate. However, they have certain limitations: they don't support "vectorized" operations such as element-wise addition and multiplication, and the fact that they can contain objects of differing types means that Python must store type information for every element, and must execute type dispatching code when operating on each element. This also means that very few list operations can be carried out by efficient C loops - each iteration would require type checks and other Python API bookkeeping.



Note that this is not specific to Python; for more details see for example this and this question on SO.

Due to the overhead of the dynamic type system and the interpreter, Python would be much less useful for high-performance number crunching if it could not make use of all sorts of compiled C and Fortran libraries (like Numpy). Also, there are JIT compilers such as Numba and PyPy that try to make Python code run closer to the speed of statically typed, compiled code.

Bottom line: you are doing too much work in plain Python relative to the work you offload to fast C code. I suppose you should adopt an "array-oriented" coding style rather than an object-oriented one to achieve good performance with Numpy (MATLAB is a very similar story in that respect). On the other hand, if you use a more efficient algorithm (see Ara's answer), then the slowness of Python may not be such an issue.
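
For illustration, here is a rough sketch of what an array-oriented version could look like for the simple case where every sample is still a single vector, before any clusters are merged; it is not a drop-in replacement for your class-based code:

import numpy as np
from scipy.spatial.distance import cdist

def knn_vectorized(X, k):
    # all pairwise distances in a single C-level call
    D = cdist(X, X, 'euclidean')
    # a point should not be reported as its own neighbour
    np.fill_diagonal(D, np.inf)
    # indices of the k smallest distances in every row
    return np.argsort(D, axis=1)[:, :k]

X = np.random.rand(100, 5)        # made-up data: 100 patterns, 5 variables
print(knn_vectorized(X, 3))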



Here's what I can think of:

  • you compute the distance between one sample and every other sample each time you call closestK, so each pairwise distance is computed twice (once as distance(a, b), then as distance(b, a)); this can be avoided by computing it once and for all
  • you recompute r (which may include an expensive vstack) 2 * (n - 1) times, where n is len(samples); you can also compute it once and for all (and store it as a myCluster attribute?)
  • you sort the full list of distances, whereas you only need the top k (there is no need to sort past the kth element)
  • to compute the minimum distance between the points of your two sets, you build a matrix containing every pairwise distance and then take its min: you can surely do better

My suggestion would be to implement a top-k class with an insert method that only inserts an element when it is better than the current kth element (and drops that one), and to change myCluster to include r. Then your code might look like:

kNN = {i: TopK(k) for i in xrange(len(samples))}   # TopK needs k, see the class below
for i, sample1 in enumerate(samples):
    for j, sample2 in enumerate(samples[:i]):      # only pairs with j < i
        dist = distanceRepr(sample1, sample2)
        kNN[i].insert(j, -dist)   # negated distance: larger score means closer
        kNN[j].insert(i, -dist)
return kNN
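
Note that with this loop each pairwise distance is computed only once and fed to both samples' heaps, which already halves the number of calls to distanceRepr compared to the original closestK loop.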


Here is a possible implementation of the TopK class:



import heapq

class TopK:
    def __init__(self, k):
        self.k = k
        self.content = []   # min-heap of (score, key) pairs, at most k of them

    def insert(self, key, score):
        if len(self.content) < self.k:
            heapq.heappush(self.content, (score, key))
        else:
            # push the new element and drop the current worst (lowest score)
            heapq.heappushpop(self.content, (score, key))

    def get_keys(self):
        return [elem[1] for elem in self.content]
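
A quick usage sketch (keys are sample indices and scores are negated distances, as in the loop above):

top = TopK(3)
top.insert(0, -2.5)
top.insert(1, -1.0)
top.insert(2, -4.2)
top.insert(3, -0.3)      # evicts key 2, the farthest element so far
print(top.get_keys())    # the 3 closest keys, in heap order: [0, 1, 3]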


For distanceRepr, you can use something like:

import numpy as np
import scipy.spatial

def distanceRepr(set0, set1):
    # query the smaller set against a kd-tree built on the larger one
    if len(set0) < len(set1):
        min_set, max_set = set0, set1
    else:
        min_set, max_set = set1, set0
    if len(min_set) == 0:
        raise Exception("Empty set")

    min_dist = np.inf
    tree = scipy.spatial.cKDTree(max_set)

    for point in min_set:
        # distance_upper_bound=min_dist lets the tree prune points
        # that cannot improve on the current minimum
        distance, _ = tree.query(point, 1, 0., 2, min_dist)
        if distance < min_dist:
            min_dist = distance

    return min_dist


It will be faster than your current method for medium to large inputs (say, sets with more than 5k points each), and it will also significantly reduce memory usage, allowing you to work with large samples (where cdist simply runs out of memory).
