How do I compare two numpy arrays of strings with the "in" operator to get a boolean array using broadcasting?
Python offers a simple check for whether one string is contained in another:
'ab' in 'abcd'
which evaluates to True.
Now take a numpy array of strings, and you can do this:
import numpy as np
A0 = np.array(['z', 'u', 'w'],dtype=object)
A0[:,None] != A0
The result is a boolean array:
array([[False, True, True],
[ True, False, True],
[ True, True, False]], dtype=bool)
Now let's take another array:
A1 = np.array(['u_w', 'u_z', 'w_z'],dtype=object)
I want to check where a string in A0 is not contained in a string in A1, essentially creating unique combinations. However, the following does not yield a boolean array, only a single boolean value, no matter how I write the indices:
A0[:,None] not in A1
I also tried numpy.in1d and np.ndarray.__contains__, but those methods don't do the trick either.
Performance is an issue, so I want to take full advantage of numpy's optimizations.
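As far as I understand it, the reason for the single value is that the "in" operator on an ndarray performs an equality comparison and then reduces it with any(), rather than running a per-element substring test. A minimal sketch of that behaviour, using the arrays from above:

```python
import numpy as np

A0 = np.array(['z', 'u', 'w'], dtype=object)
A1 = np.array(['u_w', 'u_z', 'w_z'], dtype=object)

# Broadcasting itself works fine: this is an elementwise equality grid.
equality_grid = (A0[:, None] == A1)

# `x in arr` behaves like (arr == x).any(), so `not in` collapses the
# whole grid to one Python bool instead of testing substrings.
collapsed = A0[:, None] not in A1

print(equality_grid.shape)
print(type(collapsed))
```

So the operator never looks inside the individual strings; it only asks whether any element is equal.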
How can I achieve this?
EDIT:
I found it can be done like this:
fv = np.vectorize(lambda x,y: x not in y)
fv(A0[:,None],A1)
But as stated in the numpy docs:
The vectorization feature is provided primarily for convenience, not performance. The implementation is essentially a for loop.
So this is the same as just looping through the array, and it would be nice to solve this without an explicit or implicit for loop.
We can convert to a string dtype and then use one of NumPy's string functions.
Using np.char.count, one solution would be -
np.char.count(A1.astype(str),A0.astype(str)[:,None])==0
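To make the broadcasting visible, here is a small sketch with the question's three-element arrays. The intermediate names (haystacks, needles, counts) are only for illustration. The row vector of shape (3,) is broadcast against the column of shape (3, 1), giving one occurrence count per (A0 element, A1 element) pair:

```python
import numpy as np

A0 = np.array(['z', 'u', 'w'], dtype=object)
A1 = np.array(['u_w', 'u_z', 'w_z'], dtype=object)

haystacks = A1.astype(str)             # shape (3,)
needles = A0.astype(str)[:, None]      # shape (3, 1)

# counts[i, j] = number of times A0[i] occurs in A1[j]
counts = np.char.count(haystacks, needles)
print(counts)

# A count of zero means "not contained".
mask = counts == 0
print(mask)
```

Comparing the counts with 0 then gives the wanted boolean array without a Python-level loop in user code.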
An alternative is np.char.find -
np.char.find(A1.astype(str),A0.astype(str)[:,None])==-1
Another option is np.char.rfind -
np.char.rfind(A1.astype(str),A0.astype(str)[:,None])==-1
If we convert one array to a str dtype, we can skip the conversion for the other array, since it will be done internally anyway. So the last method can be simplified to -
np.char.rfind(A1.astype(str),A0[:,None])==-1
Example run -
In [97]: A0
Out[97]: array(['z', 'u', 'w'], dtype=object)
In [98]: A1
Out[98]: array(['u_w', 'u_z', 'w_z', 'zz'], dtype=object)
In [99]: np.char.rfind(A1.astype(str),A0[:,None])==-1
Out[99]:
array([[ True, False, False, False],
[False, False, True, True],
[False, True, False, True]], dtype=bool)
# Loopy solution using np.vectorize for verification
In [100]: fv = np.vectorize(lambda x,y: x not in y)
In [102]: fv(A0[:,None],A1)
Out[102]:
array([[ True, False, False, False],
[False, False, True, True],
[False, True, False, True]], dtype=bool)