Fastest way to check if two arrays have equivalent rows
I am trying to find the best way to check whether two 2D arrays contain the same rows. Take the following as a quick example:
>>> a
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> b
array([[6, 7, 8],
[3, 4, 5],
[0, 1, 2]])
In this case b = a[::-1]. To check whether the two arrays have equal rows:
>>> a = a[np.lexsort((a[:, 0], a[:, 1], a[:, 2]))]
>>> b = b[np.lexsort((b[:, 0], b[:, 1], b[:, 2]))]
>>> np.all(a-b==0)
True
This works well and is pretty fast. However, a problem occurs when two rows are "close":
array([[-1.57839867 2.355354 -1.4225235 ],
[-0.94728367 0. -1.4225235 ],
[-1.57839867 -2.355354 -1.4225215 ]]) <---note ends in 215 not 235
array([[-1.57839867 -2.355354 -1.4225225 ],
[-1.57839867 2.355354 -1.4225225 ],
[-0.94728367 0. -1.4225225 ]])
Within a 1E-5 tolerance the two arrays have equal rows, but lexsort will tell you otherwise. This could be solved by a different sort order, but I would like a solution for the general case.
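To make the failure concrete, here is a minimal sketch with made-up values (they are hypothetical, not the data shown above). Tiny differences in the primary sort column change the sorted order, so rows that match within tolerance end up paired with the wrong partner. The helper name `lexsorted` is an assumption introduced only for this illustration.

```python
import numpy as np

# Hypothetical values chosen for illustration; they are not the data above.
a = np.array([[1.0, 2.0, 1e-6],
              [1.0, 3.0, 0.0]])
b = np.array([[1.0, 3.0, 1e-6],
              [1.0, 2.0, 0.0]])

def lexsorted(x):
    # Same key order as in the question: the last key (column 2) is the primary one.
    return x[np.lexsort((x[:, 0], x[:, 1], x[:, 2]))]

# After sorting, rows are ordered by their tiny third-column values,
# so [1, 3, 0] ends up compared with [1, 2, 0] and the check fails.
sorted_equal = np.allclose(lexsorted(a), lexsorted(b), rtol=0.0, atol=1e-5)

# Pairing the rows by hand shows that they do agree within the tolerance.
paired_equal = np.allclose(a, b[::-1], rtol=0.0, atol=1e-5)

print(sorted_equal, paired_equal)
```

Any approach that sorts on values that are only approximately equal can run into this kind of mismatch, which is why a pairwise comparison is attractive here.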
I was playing with the idea:
>>> a = a.reshape(-1, 1, 3)
>>> a-b
array([[[-6, -6, -6],
[-3, -3, -3],
[ 0, 0, 0]],
[[-3, -3, -3],
[ 0, 0, 0],
[ 3, 3, 3]],
[[ 0, 0, 0],
[ 3, 3, 3],
[ 6, 6, 6]]])
>>> np.all(np.around(a-b,5)==0,axis=2)
array([[False, False, True],
[False, True, False],
[ True, False, False]], dtype=bool)
>>> np.all(np.any(np.all(np.around(a-b,5)==0,axis=2),axis=1))
True
This does not tell me that the arrays have equal rows, only that every row in one array is close to some row in the other. The number of rows can be several hundred, and I need to do this check many times. Any ideas?
Your last code doesn't do what you think it does. What it tells you is whether each row in a is close to some row in b. If you change the axis used in the outer calls to np.any and np.all, you can check the other direction, whether each row in b is close to some row in a. If both hold, the two sets of rows are equal. Building every pairwise difference is not very efficient computationally, but it is probably fast enough in numpy for moderately sized arrays:
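import numpy as np

def same_rows(a, b, tol=5):
    # rows_close[i, j] is True when b[i] and a[j] agree to `tol` decimal places
    rows_close = np.all(np.round(a - b[:, None], tol) == 0, axis=-1)
    # every row of b matches some row of a, and every row of a matches some row of b
    return (np.all(np.any(rows_close, axis=-1), axis=-1) and
            np.all(np.any(rows_close, axis=0), axis=0))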
>>> rows, cols = 5, 3
>>> a = np.arange(rows * cols).reshape(rows, cols)
>>> b = np.arange(rows)
>>> np.random.shuffle(b)
>>> b = a[b]
>>> a
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
>>> b
array([[ 9, 10, 11],
[ 3, 4, 5],
[ 0, 1, 2],
[ 6, 7, 8],
[12, 13, 14]])
>>> same_rows(a, b)
True
>>> b[0] = b[1]
>>> b
array([[ 3, 4, 5],
[ 3, 4, 5],
[ 0, 1, 2],
[ 6, 7, 8],
[12, 13, 14]])
>>> same_rows(a, b) # not all rows in a are close to a row in b
False
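The function above treats two values as equal when their difference rounds to zero at tol decimal places. If an explicit absolute tolerance is preferred, a variant along the following lines might work. It is only a sketch, not part of the original answer: the name `same_rows_isclose` and the default `atol=1e-5` are assumptions made for illustration, and `np.isclose` is swapped in for the rounding test.

```python
import numpy as np

def same_rows_isclose(a, b, atol=1e-5):
    """Hypothetical variant: two-way row matching with an absolute tolerance."""
    # close[i, j] is True when every element of b[i] is within atol of a[j].
    close = np.isclose(b[:, None, :], a[None, :, :], rtol=0.0, atol=atol).all(axis=-1)
    every_b_in_a = close.any(axis=1).all()  # each row of b has a partner in a
    every_a_in_b = close.any(axis=0).all()  # each row of a has a partner in b
    return bool(every_b_in_a and every_a_in_b)

# The float arrays from the question, which differ only around the sixth decimal.
a = np.array([[-1.57839867,  2.355354, -1.4225235],
              [-0.94728367,  0.0,      -1.4225235],
              [-1.57839867, -2.355354, -1.4225215]])
b = np.array([[-1.57839867, -2.355354, -1.4225225],
              [-1.57839867,  2.355354, -1.4225225],
              [-0.94728367,  0.0,      -1.4225225]])

print(same_rows_isclose(a, b))

# Overwriting one row of b with a copy of another leaves a row of a unmatched,
# so the two-way check fails even though every row of b_dup still appears in a.
b_dup = b.copy()
b_dup[0] = b_dup[1]
print(same_rows_isclose(a, b_dup))
```

Checking both directions is what separates this from the one-way test in the question: with `b_dup`, every row of `b_dup` still has a partner in `a`, yet the function returns False.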
For arrays that are not too large the performance is reasonable, although it needs to build an intermediate array of shape (rows, rows, cols):
In [2]: rows, cols = 1000, 10
In [3]: a = np.arange(rows * cols).reshape(rows, cols)
In [4]: b = np.arange(rows)
In [5]: np.random.shuffle(b)
In [6]: b = a[b]
In [7]: %timeit same_rows(a, b)
10 loops, best of 3: 103 ms per loop
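If the (rows, rows, cols) intermediate becomes too large for memory, one possible workaround is to process b in blocks so that only a (chunk, rows, cols) array exists at any time. The following is a rough sketch under that assumption. The function name `same_rows_chunked`, the `chunk` parameter, and the absolute-tolerance comparison are all hypothetical choices, not something taken from the answer above.

```python
import numpy as np

def same_rows_chunked(a, b, atol=1e-5, chunk=256):
    """Hypothetical sketch: two-way row matching, processing b in blocks."""
    a_matched = np.zeros(len(a), dtype=bool)  # which rows of a have found a partner
    for start in range(0, len(b), chunk):
        block = b[start:start + chunk]
        # close[i, j] is True when block[i] is within atol of a[j] in every column.
        close = (np.abs(block[:, None, :] - a[None, :, :]) <= atol).all(axis=-1)
        if not close.any(axis=1).all():
            return False  # some row of b has no partner in a
        a_matched |= close.any(axis=0)
    return bool(a_matched.all())

rng = np.random.default_rng(0)
a = np.arange(50 * 4, dtype=float).reshape(50, 4)
b = a[rng.permutation(50)] + 1e-7   # shuffled rows with a tiny perturbation
print(same_rows_chunked(a, b, chunk=7))
```

The total work is still quadratic in the number of rows. Only the peak memory use changes, so whether this helps depends on the array sizes involved.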