A quick(er) way of identifying all dominated elements in a list
I have a list of n arrays with 4 elements each, that is (n = 2):
l = [[1, 2, 3, 4], [5, 6, 7, 8]]
and I am trying to find all the elements of the list that are non-dominated - that is, not dominated by any other element in the list. An array dominates another array if each of its elements is less than or equal to the corresponding element in the other array. So
dominates([1, 2, 3, 4], [5, 6, 7, 8]) == True
since 1 <= 5 and 2 <= 6 and 3 <= 7 and 4 <= 8. But
dominates([1, 2, 3, 9], [5, 6, 7, 8]) == False
since 9 > 8. This function is relatively easy to write, for example:
def dominates(a, b):
    return all(i <= j for i, j in zip(a, b))
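A quick check of the two examples above (the function is restated so the snippet runs on its own):

```python
def dominates(a, b):
    # True when every element of a is <= the corresponding element of b
    return all(i <= j for i, j in zip(a, b))

print(dominates([1, 2, 3, 4], [5, 6, 7, 8]))  # True
print(dominates([1, 2, 3, 9], [5, 6, 7, 8]))  # False: 9 > 8
```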
More succinctly, given l = [a1, a2, a3, ..., an], where each ai is an array of length 4, I am looking to find all ai that are not dominated by any other aj in l.
I have the following solution:
def get_non_dominated(l):
    to_remove = set()
    for ind, item_1 in enumerate(l):
        if tuple(item_1) in to_remove:
            continue
        for item_2 in l[ind + 1:]:
            if dominates(item_2, item_1):
                # item_1 is dominated; no need to compare it further
                to_remove.add(tuple(item_1))
                break
            elif dominates(item_1, item_2):
                to_remove.add(tuple(item_2))
    return [i for i in l if tuple(i) not in to_remove]
So, get_non_dominated([[1, 2, 3, 4], [5, 6, 7, 8]])
should return [[1, 2, 3, 4]]
Similarly, get_non_dominated([[1, 2, 3, 9], [5, 6, 7, 8]])
should return the list unchanged (neither array dominates the other).
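As a sanity check, here is a self-contained version of the approach (rows are converted to tuples before going into the set, since Python lists are not hashable):

```python
def dominates(a, b):
    return all(i <= j for i, j in zip(a, b))

def get_non_dominated(l):
    to_remove = set()
    for ind, item_1 in enumerate(l):
        if tuple(item_1) in to_remove:
            continue
        for item_2 in l[ind + 1:]:
            if dominates(item_2, item_1):
                to_remove.add(tuple(item_1))
                break
            elif dominates(item_1, item_2):
                to_remove.add(tuple(item_2))
    return [i for i in l if tuple(i) not in to_remove]

print(get_non_dominated([[1, 2, 3, 4], [5, 6, 7, 8]]))  # [[1, 2, 3, 4]]
print(get_non_dominated([[1, 2, 3, 9], [5, 6, 7, 8]]))  # both rows survive
```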
But this check happens a lot, and l can be quite large. I was wondering if anyone has any ideas for speeding this up? My first thought was to try to vectorize this code with numpy, but I have relatively little experience with it and am struggling a bit. You can assume that l contains only unique arrays. Any ideas are appreciated!
Another version of @Nyps' answer:
import numpy as np

def dominates(a, b):
    return (np.asarray(a) <= b).all()
This is a vectorized version of your code using numpy.
It can still be slow if you need to go through all the rows you have. If you have an array of rows and want to compare them pairwise, you can use scipy to create an N x N matrix (where N is the number of rows).
import numpy as np
a = np.random.randint(0, 10, size=(1000, 10))
a
This gives a 1000 x 10 array simulating 1000 rows of 10 elements each. Then:
from scipy.spatial.distance import cdist
X = cdist(a, a, metric=dominates).astype(bool)
X
is now a 1000 x 1000 matrix containing the pairwise comparison of all rows. That is, X[i, j]
is True
if row i
dominates row j
, and False
otherwise.
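On a tiny input you can see the whole matrix at once (a sketch; the NumPy version of dominates is restated so the snippet runs standalone):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dominates(a, b):
    return (np.asarray(a) <= b).all()

a = np.array([[1, 2], [5, 6], [3, 1]])
X = cdist(a, a, metric=dominates).astype(bool)
print(X)
# X[0, 1] is True because [1, 2] dominates [5, 6];
# X[1, 0] is False because 5 > 1; the diagonal is all True
# since every row dominates itself.
```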
Now you can extract useful results from X
, for example, the row that dominates all the others:
>>> a[50] = 0 # set a row to all 0s to fake a dominant row
>>> X = cdist(a, a, metric=dominates).astype(bool)
>>> non_dominated = np.where(X.all(axis=1))[0]
>>> non_dominated
array([50])
The sample at position 50
is the ruler of your population, so keep a close eye on it.
Now, if you want to keep only the dominant rows, you can do:
if non_dominated.size > 0:
    return [a[i] for i in non_dominated]
else:  # no row dominates every other
    return a
As an example:
import numpy as np
from scipy.spatial.distance import cdist

def get_ruler(a):
    X = cdist(a, a, metric=dominates).astype(bool)
    rulers = np.where(X.all(axis=1))[0]
    if rulers.size > 0:
        return [a[i] for i in rulers]
    else:  # no row dominates every other
        return a
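For example, planting an all-zeros row as in the demonstration above (helpers restated so the snippet is self-contained):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dominates(a, b):
    return (np.asarray(a) <= b).all()

def get_ruler(a):
    # pairwise domination matrix; row i of X is all True
    # exactly when row i of a dominates every row
    X = cdist(a, a, metric=dominates).astype(bool)
    rulers = np.where(X.all(axis=1))[0]
    if rulers.size > 0:
        return [a[i] for i in rulers]
    else:  # no row dominates every other
        return a

a = np.random.randint(1, 10, size=(100, 4))
a[7] = 0  # an all-zeros row dominates every row of positive entries
result = get_ruler(a)
print(result)  # a one-element list containing the all-zeros row
```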