Numpy: map `in1d` over a two-dimensional array
I have two 2D numpy arrays:
import numpy as np
a = np.array([[ 1, 15, 16, 200, 10],
[ -1, 10, 17, 11, -1],
[ -1, -1, 20, -1, -1]])
g = np.array([[ 1, 12, 15, 100, 11],
[ 2, 13, 16, 200, 12],
[ 3, 14, 17, 300, 13],
[ 4, 17, 18, 400, 14],
[ 5, 20, 19, 500, 16]])
What I want to do, for each column of g, is to check whether it contains any element from the corresponding column of a. For the first column, I want to check whether any of the values [1, 2, 3, 4, 5] appears in [1, -1, -1] and return True. For the second one, I want to return False, because no element of [15, 10, -1] appears in [12, 13, 14, 17, 20]. At the moment I am doing this with a Python list comprehension. Running
result = [np.any(np.in1d(g[:,i], a[:, i])) for i in range(5)]
calculates the correct result, but becomes slow when a has many columns. Is there a cleaner numpy way to do the same? I feel like there should be an axis keyword that could be passed to numpy.in1d, but it doesn't exist ...
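For reference, a self-contained version of the baseline above, together with the result it produces on these arrays:

```python
import numpy as np

a = np.array([[ 1, 15, 16, 200, 10],
              [-1, 10, 17,  11, -1],
              [-1, -1, 20,  -1, -1]])
g = np.array([[ 1, 12, 15, 100, 11],
              [ 2, 13, 16, 200, 12],
              [ 3, 14, 17, 300, 13],
              [ 4, 17, 18, 400, 14],
              [ 5, 20, 19, 500, 16]])

# per-column membership test via in1d, one column at a time
result = [np.any(np.in1d(g[:, i], a[:, i])) for i in range(5)]
print(result)  # [True, False, True, True, False]
```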
I would use broadcasting tricks, but it depends a lot on the size of your arrays and the amount of memory available to you:
M = g.reshape(g.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0]))
np.any(np.any(M == 0, axis=0), axis=1)
# returns:
# array([ True, False, True, True, False], dtype=bool)
Easier to explain with a piece of paper and a pen (and smaller test arrays, see below), but basically you make copies of each column of g (one copy for each row of a) and subtract from those copies the individual elements taken from the corresponding column of a. Similar to the original algorithm, just vectorized.
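A sketch of the "paper and pen" explanation with tiny arrays (the values here are made up for illustration):

```python
import numpy as np

# toy arrays: a has 2 rows, g has 3 rows, both have 2 columns
a = np.array([[1, 5],
              [2, 6]])
g = np.array([[9, 4],
              [2, 7],
              [3, 8]])

# M[i, j, k] = g[i, j] - a[k, j]:
# each column of g is copied once per row of a, and the
# corresponding entry of a is subtracted from each copy
M = g.reshape(g.shape + (1,)) - a.T.reshape((1, a.shape[1], a.shape[0]))
print(M.shape)  # (3, 2, 2)

# a zero in M[:, j, :] means column j of g and column j of a share a value
print(np.any(np.any(M == 0, axis=0), axis=1))
# column 0 shares the value 2, column 1 shares nothing:
# [ True False]
```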
Caveat: if either of the arrays g or a is 1D, you need to force it to become 2D, so that its shape is at least (1, n).
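For example, a 1D a (conceptually a single row) can be promoted with np.atleast_2d before applying the same trick (the values below are made up for illustration):

```python
import numpy as np

g = np.array([[12, 13, 14],
              [15, 16, 17]])
a1d = np.array([15, 99, 17])   # a single-row a, shape (3,)

a2d = np.atleast_2d(a1d)       # shape (1, 3): one row, three columns
M = g.reshape(g.shape + (1,)) - a2d.T.reshape((1, a2d.shape[1], a2d.shape[0]))
print(np.any(np.any(M == 0, axis=0), axis=1))
# columns 0 and 2 contain 15 and 17, column 1 does not contain 99:
# [ True False  True]
```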
Speed gains:
- on your arrays: ~20x speedup (Python loop: 301 us per loop; vectorized: 15.4 us per loop)
- on larger arrays: ~80x speedup
In [2]: a = np.random.random_integers(-2, 3, size=(4, 50))

In [3]: b = np.random.random_integers(-20, 30, size=(35, 50))

In [4]: %timeit np.any(np.any(b.reshape(b.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0])) == 0, axis=0), axis=1)
10000 loops, best of 3: 39.5 us per loop

In [5]: %timeit [np.any(np.in1d(b[:,i], a[:, i])) for i in range(a.shape[1])]
100 loops, best of 3: 3.13 ms per loop
See the attached image for an illustration of the broadcasting:
Instead of processing the input column by column, you can process it row by row. For example, you check whether any element of the first row of a is present in the corresponding columns of g; that lets you stop processing the columns in which a hit has already been found.
import numpy as np

# a and g as defined in the question
idx = np.arange(a.shape[1])            # columns still to be checked
result = np.zeros(idx.size, dtype=bool)
for j in range(a.shape[0]):
    # delete this print in production
    print("%d row, I look only at columns" % (j + 1), idx)
    line_pruned = np.take(a[j], idx)
    g_pruned = np.take(g, idx, axis=1)
    # positions (within the pruned set) where a value of this row appears
    hits = np.where((g_pruned - line_pruned) == 0)[1]
    positive_idx = idx[hits]           # map back to original column indices
    # delete this print in production
    print("positive hit on the", positive_idx, "-th columns")
    np.put(result, positive_idx, True)
    idx = np.setdiff1d(idx, positive_idx)
    if not idx.size:
        break
To understand how this works, we can look at another input:
a = np.array([[ 0, 15, 16, 200, 10],
[ -1, 10, 17, 11, -1],
[ 1, -1, 20, -1, -1]])
g = np.array([[ 1, 12, 15, 100, 11],
[ 2, 13, 16, 200, 12],
[ 3, 14, 17, 300, 13],
[ 4, 17, 18, 400, 14],
[ 5, 20, 19, 500, 16]])
The script output is:
1 row, I look only at columns [0 1 2 3 4]
positive hit on the [2 3] -th columns
2 row, I look only at columns [0 1 4]
positive hit on the [] -th columns
3 row, I look only at columns [0 1 4]
positive hit on the [0] -th columns
Basically, you can see how in the second and third rounds of the loop the 3rd and 4th columns are no longer processed.
The performance of this solution really depends on many factors, but it will be faster if you are likely to hit many True values and the problem has many rows. This of course also depends on the input, not just its shape.