Numpy map `in1d` over two dimensional array

I have two 2D arrays numpy

,

import numpy as np    
a = np.array([[  1,  15,  16, 200,  10],
              [ -1,  10,  17,  11,  -1],
              [ -1,  -1,  20,  -1,  -1]])

g = np.array([[  1,  12,  15, 100,  11],
              [  2,  13,  16, 200,  12],
              [  3,  14,  17, 300,  13],
              [  4,  17,  18, 400,  14],
              [  5,  20,  19, 500,  16]])

      

What I want to do, for each column g

, is to check if it contains any element from the corresponding column a

. For the first column, I want to check if any of the values ​​appears [1,2,3,4,5]

in [1,-1,-1]

and returns True

. For the second one, I want to return False

because there is no item [15,10,-1]

in the [12,13,14,17,20]

element. At the moment I am doing this using a Python list comprehension. Running

 result = [np.any(np.in1d(g[:,i], a[:, i])) for i in range(5)]

      

calculates the correct result, but becomes slow when it a

has many columns. Is there a cleaner numpy

way to do the same? I feel like there should be a keyword axis

that can be added to the function numpy.in1d

, but doesn't exist ...

+3


source to share


2 answers


I would use broadcasting tricks, but it depends a lot on the size of your arrays and the amount of memory available to you:

M = g.reshape(g.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0]))
np.any(np.any(M == 0, axis=0), axis=1)
# returns:
# array([ True, False,  True,  True, False], dtype=bool)

      

Easier to explain with a piece of paper and a pen (and smaller test arrays) (see below), but basically you make copies of each column in g

(one copy for each row in a

) and subtract from those copies the individual items taken from the corresponding column in a

. Similar to the original algorithm, just vectorized.

Caveat: if any of the arrays are g

or a

is 1D, you need to force it to become 2D so that its shape is at least (1,n)

.



Speed ​​gain:

  • is based only on your arrays: coefficient ~ 20

    • python for loops: 301us per loop
    • vectorized: 15.4us per cycle
  • larger arrays: coefficient ~ 80

    In [2]: a = np.random.random_integers(-2, 3, size=(4, 50))
    
    In [3]: b = np.random.random_integers(-20, 30, size=(35, 50))
    
    In [4]: %timeit np.any(np.any(b.reshape(b.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0])) == 0, axis=0), axis=1)
    10000 loops, best of 3: 39.5 us per loop
    
    In [5]: %timeit [np.any(np.in1d(b[:,i], a[:, i])) for i in range(a.shape[1])]
    100 loops, best of 3: 3.13 ms per loop
    
          

See attached image for broadcast explanation: broadcasting explained

+1


source


Instead of processing input as a column, you can process it line by line. For example, you will find out if any element is present in the first row a

in the columns g

so that you can stop processing the columns in which the element is found.

idx = arange(a.shape[1])
result = empty((idx.size,), dtype=bool)
result.fill(False)

for j in range(a.shape[0]):
    #delete this print in production
    print "%d line, I look only at columns " % (j + 1), idx
    line_pruned = take(a[j], idx)
    g_pruned = take(g, idx, axis=1)
    positive_idx = where((g_pruned - line_pruned) == 0)[1]
    #delete this print in production
    print "positive hit on the ", positive_idx, " -th columns"
    put(result, positive_idx, True)
    idx = setdiff1d(idx, positive_idx)
    if not idx.size:
        break

      

To understand how this works, we can look at another input:

a = np.array([[  0,  15,  16, 200,  10],
              [ -1,  10,  17,  11,  -1],
              [  1,  -1,  20,  -1,  -1]])

g = np.array([[  1,  12,  15, 100,  11],
              [  2,  13,  16, 200,  12],
              [  3,  14,  17, 300,  13],
              [  4,  17,  18, 400,  14],
              [  5,  20,  19, 500,  16]])

      



The script output is:

1 line, I look only at columns  [0 1 2 3 4]
positive hit on the  [2 3]  -th columns
2 line, I look only at columns  [0 1 4]
positive hit on the  []  -th columns
3 line, I look only at columns  [0 1 4]
positive hit on the  [0]  -th columns

      

Basically you can see how in the second and third round of the loop you are not processing the 2nd and 4th columns.

The performance of this solution really depends on many factors, but it will be faster if you are likely to hit many values True

and the problem has many lines. This of course also depends on the input, not just the form.

+1


source







All Articles