Numpy array is much slower than list

Given two matrices X1 (N, 3136) and X2 (M, 3136), where every element of each row is a binary number, I am trying to calculate the Hamming distance so that every row of X1 is compared against every row of X2, yielding an (N, M) result matrix.

I wrote two functions for it (the first with numpy and the other without numpy):

def hamming_distance(X, X_train):
    array = np.array([np.sum(np.logical_xor(x, X_train), axis=1) for x in X])

    return array



def hamming_distance2(X, X_train):
    a = len(X[:,0])
    b = len(X_train[:,0])

    hamming_distance = np.zeros(shape=(a, b))

    for i in range(0, a):
        for j in range(0, b):
            hamming_distance[i,j] = np.count_nonzero(X[i,:] != X_train[j,:])

    return hamming_distance

My problem is that the top function is much slower than the bottom one, where I use two for loops. Can the first function be improved so that I use only one loop (or none)?

PS. Sorry for my English, this is not my first language, although I tried my best!



2 answers


Numpy only makes your code faster if you use it to vectorize your work. In your case, you can use array broadcasting to vectorize your problem: compare your two arrays to build an auxiliary boolean array of shape (N, M, K), then sum over its third dimension:

hamming_distance = (X[:,None,:] != X_train).sum(axis=-1)




We insert a singleton dimension into the first array to give it shape (N, 1, K); the second array is implicitly compatible with shape (1, M, K), so the broadcast comparison can be performed.
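To make the broadcasting concrete, here is a small self-contained check (the array sizes are arbitrary, chosen only for illustration) showing that the one-liner matches an explicit double loop:

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randint(0, 2, size=(4, 6))        # shape (N, K) = (4, 6)
X_train = rng.randint(0, 2, size=(5, 6))  # shape (M, K) = (5, 6)

# X[:, None, :] has shape (4, 1, 6); comparing it against (5, 6) broadcasts
# to (4, 5, 6), and summing over the last axis gives the (4, 5) distances.
dist = (X[:, None, :] != X_train).sum(axis=-1)

# Reference: explicit double loop over row pairs.
expected = np.array([[np.count_nonzero(X[i] != X_train[j])
                      for j in range(X_train.shape[0])]
                     for i in range(X.shape[0])])

print(np.array_equal(dist, expected))  # True
print(dist.shape)                      # (4, 5)
```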

In the comments, @ayhan noted that this creates a huge auxiliary array for large M and N, which is fair enough. That's the price of vectorization: you buy CPU time at the expense of memory. If you have enough memory for the above to work, it will be very fast. If you don't, you need to reduce the scope of your vectorization and loop over M or N (or both, which is your current approach). Vectorization itself is not the problem; it is about striking a balance between available resources and performance.
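As a sketch of that middle ground (the function name and default chunk size are my own choices, not from the answer): loop over X in chunks, so the temporary boolean array never exceeds shape (chunk, M, K):

```python
import numpy as np

def hamming_distance_chunked(X, X_train, chunk=256):
    """Vectorize over X_train, but loop over X in chunks to bound memory.

    The temporary comparison array has shape (<=chunk, M, K) instead of
    the full (N, M, K).
    """
    out = np.empty((X.shape[0], X_train.shape[0]), dtype=np.intp)
    for start in range(0, X.shape[0], chunk):
        stop = start + chunk  # slicing clamps past the end automatically
        out[start:stop] = (X[start:stop, None, :] != X_train).sum(axis=-1)
    return out
```

Tuning `chunk` trades memory for speed: larger chunks mean fewer Python-level loop iterations but a bigger temporary array.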



What you are doing is very similar to a dot product. Consider these two binary arrays:

1 0 1 0 1 1 0 0
0 0 1 1 0 1 0 1


We are trying to count the differing pairs. If you use the dot product directly, it gives you the number of (1, 1) pairs. However, if you negate one of the arrays, matching positions become mismatches: a1.dot(1-a2) counts the (1, 0) pairs. Since we also need the number of (0, 1) pairs, we add a2.dot(1-a1) to it. The good thing about the dot product is that it is pretty fast. However, you will need to convert your arrays to floats first, as Divakar pointed out.
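Applied to the two example rows above (a small sanity check, not part of the original answer):

```python
import numpy as np

# The two example binary rows from above.
a1 = np.array([1, 0, 1, 0, 1, 1, 0, 0], dtype='float32')
a2 = np.array([0, 0, 1, 1, 0, 1, 0, 1], dtype='float32')

pairs_10 = a1.dot(1 - a2)  # positions where a1 is 1 and a2 is 0 -> 2.0
pairs_01 = a2.dot(1 - a1)  # positions where a2 is 1 and a1 is 0 -> 2.0
hamming = pairs_10 + pairs_01

print(hamming)                     # 4.0
print(np.count_nonzero(a1 != a2))  # 4, the same count via direct comparison
```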

Here's a demo:



prng = np.random.RandomState(0)
arr1 = prng.binomial(1, 0.3, (1000, 3136))
arr2 = prng.binomial(1, 0.3, (2000, 3136))
res1 = hamming_distance2(arr1, arr2)
arr1 = arr1.astype('float32'); arr2 = arr2.astype('float32')
res2 = (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)

np.allclose(res1, res2)
Out: True


And timings:

%timeit hamming_distance(arr1, arr2)
1 loop, best of 3: 13.9 s per loop

%timeit hamming_distance2(arr1, arr2)
1 loop, best of 3: 5.01 s per loop

%timeit (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
10 loops, best of 3: 93.1 ms per loop








