NumPy: search an array for multiple values and return their indices
How can I find a small set of values in a NumPy array (which is not sorted and shouldn't be modified)? It should return the indices of these values.
For example:
a = np.array(['d', 'v', 'h', 'r', 'm', 'a']) # in general it will be large
query = np.array(['a', 'v', 'd'])
# Required:
indx = someNumpyFunction(a, query)
print(indx) # should be [5, 1, 0]
I'm new to NumPy and couldn't find the correct way to do this for multiple values at the same time (I know np.where(a == 'd') can do this for a single-value lookup).
The classic way to test one array against another is to adjust the shape and use "==" (arr below holds the question's array a):
In [250]: arr==query[:,None]
Out[250]:
array([[False, False, False, False, False,  True],
       [False,  True, False, False, False, False],
       [ True, False, False, False, False, False]], dtype=bool)
In [251]: np.where(arr==query[:,None])
Out[251]: (array([0, 1, 2]), array([5, 1, 0]))
If an item of query is not found in a, its row will be missing from the result, e.g. [0, 2] instead of [0, 1, 2]:
In [261]: np.where(arr==np.array(['a','x','v'],dtype='S')[:,None])
Out[261]: (array([0, 2]), array([5, 1]))
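The broadcast comparison above can also report which query items are missing, rather than leaving the reader to spot the gap in the row indices. This is a minimal sketch using the question's example data with an absent item 'x'; the variable names matches and found are illustrative only.

```python
import numpy as np

a = np.array(['d', 'v', 'h', 'r', 'm', 'a'])
query = np.array(['a', 'x', 'v'])  # 'x' is deliberately absent from a

# Compare every query item against every element of a.
# The result has shape (len(query), len(a)).
matches = a == query[:, None]

# rows are positions within query, cols are the matching indices into a.
rows, cols = np.nonzero(matches)

# A row without any True entry belongs to a query item that is not in a.
found = matches.any(axis=1)

print(rows)           # positions in query that matched
print(cols)           # corresponding indices into a
print(query[~found])  # query items with no match
```

Note that the boolean mask holds len(query) * len(a) entries, so this approach suits a small query against a large array rather than two large arrays.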
For this small example, this is significantly faster than the list comprehension equivalent:
np.hstack([(arr==i).nonzero()[0] for i in query])
This is a little slower than the searchsorted solution. (In that solution, i can go out of bounds if an item of query is not found.)
Stefano suggested fromiter. This saves time compared to the hstack of a list:
In [313]: timeit np.hstack([(arr==i).nonzero()[0] for i in query])
10000 loops, best of 3: 49.5 us per loop
In [314]: timeit np.fromiter(((arr==i).nonzero()[0] for i in query), dtype=int, count=len(query))
10000 loops, best of 3: 35.3 us per loop
But it raises an error if an item is missing or occurs more than once. hstack can handle entries of variable length; fromiter cannot.
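To illustrate the variable-length case, here is a small sketch of the hstack approach on data where one query item occurs twice and another not at all. The modified array and the names per_item and lengths are assumptions made for the example, not part of the original answer.

```python
import numpy as np

a = np.array(['d', 'v', 'h', 'v', 'm', 'a'])  # 'v' occurs twice
query = np.array(['a', 'x', 'v'])             # 'x' does not occur

# One index array per query item; lengths may be 0, 1, or more.
per_item = [np.nonzero(a == q)[0] for q in query]
lengths = [len(p) for p in per_item]

# hstack concatenates the pieces regardless of their individual lengths.
flat = np.hstack(per_item)

print(lengths)  # number of hits per query item
print(flat)     # all matching indices, in query order
```

Because the flattened result no longer lines up one-to-one with query, keeping lengths (or per_item itself) is what lets you tell which indices belong to which item.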
np.flatnonzero(arr==i) is slower than (arr==i).nonzero()[0], but I haven't looked into why.
You can use np.searchsorted on a sorted copy of the array and then map the returned indices back to the original array. For this you can use np.argsort, as in:
>>> indx = a.argsort() # indices that would sort the array
>>> i = np.searchsorted(a[indx], query) # indices in the sorted array
>>> indx[i] # indices with respect to the original array
array([5, 1, 0])
If a has size n and query has size k, this takes O(n log n + k log n) time, which is faster than the O(n k) linear search roughly when log n < k.
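The searchsorted approach assumes every query item is present in a; otherwise it returns the index of a neighbouring element or an out-of-range position. A hedged sketch of one way to guard against that follows. The function name find_indices and the -1 sentinel for missing items are choices made for this example, and a is assumed to be non-empty.

```python
import numpy as np

def find_indices(a, query):
    """Return an index into a for each query item, or -1 where it is absent.

    Hypothetical helper; assumes a is a non-empty 1-D array.
    """
    order = np.argsort(a)  # indices that would sort a (a itself is untouched)
    # Insertion positions in the sorted order, without building a sorted copy.
    pos = np.searchsorted(a, query, sorter=order)
    # Items larger than every element get position len(a); clip to stay in range.
    pos = np.minimum(pos, len(a) - 1)
    candidates = order[pos]
    # Keep a candidate only if it really holds the requested value.
    return np.where(a[candidates] == query, candidates, -1)

a = np.array(['d', 'v', 'h', 'r', 'm', 'a'])
print(find_indices(a, np.array(['a', 'v', 'd'])))
print(find_indices(a, np.array(['a', 'x', 'v'])))
```

If a contains duplicates, this returns the index of only one occurrence per query item.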