Numpy map `in1d` over two dimensional array

Question

Numpy map `in1d` over two dimensional array

I have two 2D arrays numpy

,

import numpy as np    
a = np.array([[  1,  15,  16, 200,  10],
              [ -1,  10,  17,  11,  -1],
              [ -1,  -1,  20,  -1,  -1]])

g = np.array([[  1,  12,  15, 100,  11],
              [  2,  13,  16, 200,  12],
              [  3,  14,  17, 300,  13],
              [  4,  17,  18, 400,  14],
              [  5,  20,  19, 500,  16]])

What I want to do, for each column g

, is to check if it contains any element from the corresponding column a

. For the first column, I want to check if any of the values appears [1,2,3,4,5]

in [1,-1,-1]

and returns True

. For the second one, I want to return False

because there is no item [15,10,-1]

in the [12,13,14,17,20]

element. At the moment I am doing this using a Python list comprehension. Running

 result = [np.any(np.in1d(g[:,i], a[:, i])) for i in range(5)]

calculates the correct result, but becomes slow when it a

has many columns. Is there a cleaner numpy

way to do the same? I feel like there should be a keyword axis

that can be added to the function numpy.in1d

, but doesn't exist ...

+3

python arrays numpy

5xum 13 nov. 14 at 14:41

source to share

2 answers

Instead of processing input as a column, you can process it line by line. For example, you will find out if any element is present in the first row a

in the columns g

so that you can stop processing the columns in which the element is found.

idx = arange(a.shape[1])
result = empty((idx.size,), dtype=bool)
result.fill(False)

for j in range(a.shape[0]):
    #delete this print in production
    print "%d line, I look only at columns " % (j + 1), idx
    line_pruned = take(a[j], idx)
    g_pruned = take(g, idx, axis=1)
    positive_idx = where((g_pruned - line_pruned) == 0)[1]
    #delete this print in production
    print "positive hit on the ", positive_idx, " -th columns"
    put(result, positive_idx, True)
    idx = setdiff1d(idx, positive_idx)
    if not idx.size:
        break

To understand how this works, we can look at another input:

a = np.array([[  0,  15,  16, 200,  10],
              [ -1,  10,  17,  11,  -1],
              [  1,  -1,  20,  -1,  -1]])

g = np.array([[  1,  12,  15, 100,  11],
              [  2,  13,  16, 200,  12],
              [  3,  14,  17, 300,  13],
              [  4,  17,  18, 400,  14],
              [  5,  20,  19, 500,  16]])

The script output is:

1 line, I look only at columns  [0 1 2 3 4]
positive hit on the  [2 3]  -th columns
2 line, I look only at columns  [0 1 4]
positive hit on the  []  -th columns
3 line, I look only at columns  [0 1 4]
positive hit on the  [0]  -th columns

Basically you can see how in the second and third round of the loop you are not processing the 2nd and 4th columns.

The performance of this solution really depends on many factors, but it will be faster if you are likely to hit many values True

and the problem has many lines. This of course also depends on the input, not just the form.

+1

gg349 13 nov. 14 at 16:52

source to share

Oliver W. · Accepted Answer · 2014-11-13T16:59:57+0000

I would use broadcasting tricks, but it depends a lot on the size of your arrays and the amount of memory available to you:

M = g.reshape(g.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0]))
np.any(np.any(M == 0, axis=0), axis=1)
# returns:
# array([ True, False,  True,  True, False], dtype=bool)

Easier to explain with a piece of paper and a pen (and smaller test arrays) (see below), but basically you make copies of each column in g

(one copy for each row in a

) and subtract from those copies the individual items taken from the corresponding column in a

. Similar to the original algorithm, just vectorized.

Caveat: if any of the arrays are g

or a

is 1D, you need to force it to become 2D so that its shape is at least (1,n)

.

Speed gain:

is based only on your arrays: coefficient ~ 20
- python for loops: 301us per loop
- vectorized: 15.4us per cycle

larger arrays: coefficient ~ 80

In [2]: a = np.random.random_integers(-2, 3, size=(4, 50))

In [3]: b = np.random.random_integers(-20, 30, size=(35, 50))

In [4]: %timeit np.any(np.any(b.reshape(b.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0])) == 0, axis=0), axis=1)
10000 loops, best of 3: 39.5 us per loop

In [5]: %timeit [np.any(np.in1d(b[:,i], a[:, i])) for i in range(a.shape[1])]
100 loops, best of 3: 3.13 ms per loop

See attached image for broadcast explanation: broadcasting explained

Numpy map `in1d` over two dimensional array

More articles: