How do I compare two numpy arrays of strings with the "in" operator to get a boolean array using a broadcast array?

Question

Python allows a simple check of whether a string is contained in another string:

'ab' in 'abcd'

which evaluates to True.

Now take a numpy array of strings, and you can do this:

import numpy as np
A0 = np.array(['z', 'u', 'w'],dtype=object)

A0[:,None] != A0

The result is a boolean array:

array([[False,  True,  True],
       [ True, False,  True],
       [ True,  True, False]], dtype=bool)

Now let's take another array:

A1 = np.array(['u_w', 'u_z', 'w_z'],dtype=object)

I want to check where a string in A0 is not contained in a string in A1, essentially creating unique combinations, but the following does not yield a boolean array, only a single boolean value, no matter how I write the indices:

A0[:,None] not in A1

I also tried using numpy.in1d and np.ndarray.__contains__, but those methods don't do the trick either.
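For reference, a minimal sketch of why only one boolean comes back. It relies on the fact that Python's in operator calls __contains__ and coerces the result to a single bool, and that for a numpy array this behaves like an equality test reduced with any(), not a substring test. The variable names single and spelled_out are only for illustration.

```python
import numpy as np

A0 = np.array(['z', 'u', 'w'], dtype=object)
A1 = np.array(['u_w', 'u_z', 'w_z'], dtype=object)

# `x in arr` calls arr.__contains__(x), which behaves like (arr == x).any().
# Python then coerces the result to one bool, so broadcasting cannot
# produce an array here, and `not in` simply negates that bool.
single = A0[:, None] not in A1

# The same thing spelled out: elementwise equality (not substring
# containment), collapsed to a single value with any().
spelled_out = not (A1 == A0[:, None]).any()

print(type(single), single == spelled_out)
```

So the operator compares whole elements for equality and always collapses to one value, which is why reshaping the indices makes no difference.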

Performance is an issue, so I want to take full advantage of numpy's optimizations.

How can I achieve this?

EDIT:

I found it can be done like this:

fv = np.vectorize(lambda x,y: x not in y)
fv(A0[:,None],A1)

But as stated in the numpy docs:

The vectorization feature is provided primarily for convenience, not performance. The implementation is essentially a for loop.

So this is the same as just looping through the array, and it would be nice to solve this without an explicit or implicit for loop.
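To make that equivalence concrete, here is a small sketch comparing the np.vectorize call with a plain nested loop over the same arrays. The names vectorized and looped are illustrative only.

```python
import numpy as np

A0 = np.array(['z', 'u', 'w'], dtype=object)
A1 = np.array(['u_w', 'u_z', 'w_z'], dtype=object)

# np.vectorize handles the broadcasting, but still calls the Python
# lambda once per pair of elements.
fv = np.vectorize(lambda x, y: x not in y)
vectorized = fv(A0[:, None], A1)

# The explicit version: one row per string in A0, one column per string in A1.
looped = np.array([[a not in b for b in A1] for a in A0])

print(np.array_equal(vectorized, looped))
```

Both variants run the substring test in Python for every pair, so neither benefits from numpy's compiled loops.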

python string arrays numpy numpy-broadcasting

Khris May 18 '17 at 6:18

1 answer

Divakar · Accepted Answer · 2017-05-18T07:15:04+0000

We can convert to a string dtype and then use one of the NumPy-based string functions.

So, using np.char.count, one solution would be -

np.char.count(A1.astype(str),A0.astype(str)[:,None])==0

An alternative option is np.char.find -

np.char.find(A1.astype(str),A0.astype(str)[:,None])==-1

Another option is to use np.char.rfind -

np.char.rfind(A1.astype(str),A0.astype(str)[:,None])==-1

If we convert one array to a str dtype, we can skip the conversion for the other array, since it will be done internally anyway. So the last method can be simplified to -

np.char.rfind(A1.astype(str),A0[:,None])==-1

Example run -

In [97]: A0
Out[97]: array(['z', 'u', 'w'], dtype=object)

In [98]: A1
Out[98]: array(['u_w', 'u_z', 'w_z', 'zz'], dtype=object)

In [99]: np.char.rfind(A1.astype(str),A0[:,None])==-1
Out[99]: 
array([[ True, False, False, False],
       [False, False,  True,  True],
       [False,  True, False,  True]], dtype=bool)

# Loopy solution using np.vectorize for verification
In [100]: fv = np.vectorize(lambda x,y: x not in y)

In [102]: fv(A0[:,None],A1)
Out[102]: 
array([[ True, False, False, False],
       [False, False,  True,  True],
       [False,  True, False,  True]], dtype=bool)
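For convenience, the approach above can be wrapped in a small helper. This is only a sketch: the function name not_contained is hypothetical, and it converts both inputs to a str dtype explicitly rather than relying on any implicit conversion, which may behave differently across NumPy versions.

```python
import numpy as np

def not_contained(needles, haystacks):
    """Return a boolean array of shape (len(needles), len(haystacks)).

    Entry [i, j] is True where needles[i] is NOT a substring of haystacks[j].
    Hypothetical helper; both inputs are converted to a str dtype first.
    """
    needles = np.asarray(needles).astype(str)
    haystacks = np.asarray(haystacks).astype(str)
    # np.char.find returns -1 where the substring is absent; adding an axis
    # to needles broadcasts every needle against every haystack.
    return np.char.find(haystacks, needles[:, None]) == -1

A0 = np.array(['z', 'u', 'w'], dtype=object)
A1 = np.array(['u_w', 'u_z', 'w_z', 'zz'], dtype=object)

result = not_contained(A0, A1)
print(result)
```

Note that astype(str) calls str() on each element, so non-string objects in the input would be compared by their string representation.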
