How do I store lists by criteria in Python?
I am building lists of data iteratively. Each list has the same number of values, and my goal is to keep the N worst lists, where "worst" is determined by a specific column. I have tried several things, but none of them satisfies me, and I would like to know whether I missed something.
As an example, let's say each of my rows contains 5 elements and that I want to keep the 10 worst rows. I want to compare the last value knowing that it is always positive.
array = [[0] * 5] * 10
while (...):
    # processing
    # I now have a "my_row" that looks like [5, 102.24, -3.12, 2, 7.37] for instance
    indexes = [array.index(row) for row in array if row[-1] == min(r[-1] for r in array)]  # can return several indexes
    if array[indexes[0]][-1] < my_row[-1]:
        array[indexes[0]] = my_row
However, this solution is far from elegant or optimal. Anyone have an idea how best to code it?
Thank you for your help!
The sortedcontainers library has a SortedListWithKey container that will do what you want:
rows = [[5, 102.24, -3.12, 2, 9.36], [2, 102.24, -3.12, 2, 388], [2, 102.24, -3.12, 1, 1.54],
        [5, 102.24, -3.12, 2, 1.11], [5, 102.24, -3.12, 2, 7.35], [5, 102.24, -3.12, 2, 54],
        [5, 102.24, -3.12, 2, 1.53]]
from operator import itemgetter
from sortedcontainers import SortedListWithKey
array = SortedListWithKey(key=itemgetter(-1))  # kept sorted by the last element
n = 3
for row in rows:
    array.add(row)
    if len(array) > n:
        array.pop(0)  # drop the row with the smallest last element
print(list(array))
Output:
[[5, 102.24, -3.12, 2, 9.36], [5, 102.24, -3.12, 2, 54], [2, 102.24, -3.12, 2, 388]]
All you have to do is pop the smallest element each time the list grows past n.
Or negate the key and pop from the end:
from sortedcontainers import SortedListWithKey
array = SortedListWithKey(key=lambda x: -x[-1])  # largest last element first
n = 3
for row in rows:
    array.add(row)
    if len(array) > n:
        array.pop()  # the smallest last element is now at the end
print(list(array))
Output:
[[2, 102.24, -3.12, 2, 388], [5, 102.24, -3.12, 2, 54], [5, 102.24, -3.12, 2, 9.36]]
The list never grows beyond n + 1 elements, and you don't need to sort, copy, or slice.
You can also change the bisect_right function slightly if you only care about the last value:
def bisect_right(a, x, lo=0, hi=None):
    # a is kept in descending order of each row's last element
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x > a[mid][-1]:
            hi = mid
        else:
            lo = mid + 1
    return lo
array = []
n = 3
for row in rows:
    b = bisect_right(array, row[-1])
    array.insert(b, row)
    if len(array) > n:
        array.pop()  # drop the row with the smallest last element
print(array)
Output:
[[2, 102.24, -3.12, 2, 388], [5, 102.24, -3.12, 2, 54], [5, 102.24, -3.12, 2, 9.36]]
With rows where several share the same maximum value:
rows = [ [5, 102.24, -3.12, 2, 100], [2, 102.24, -3.12, 2, 2], [2, 102.24, -3.12, 97],
[5, 102.24, -3.12, 2, 1.11], [5, 102.24, -3.12, 2, 23], [5, 102.24, -3.12, 2, 54],
[5, 102.24, -3.12, 2, 1.53], [5, 102.24, -3.12, 2, 100], [5, 102.24, -3.12, 2, 100]]
Output:
[[5, 102.24, -3.12, 2, 100], [5, 102.24, -3.12, 2, 100], [5, 102.24, -3.12, 2, 100]]
You can also give SortedListWithKey a compound key if you need to order by more than one value:
from sortedcontainers import SortedListWithKey
array = SortedListWithKey(key=lambda x: (-x[-1], -x[-2]))
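To see what such a tuple key does without installing anything, here is a small sketch using the built-in sorted, which accepts the same kind of key function. The sample rows are made up for illustration:

```python
# Hypothetical rows: the first two share the same last value (100).
rows = [[1, 0.5, 3, 100], [2, 0.5, 7, 100], [3, 0.5, 9, 54]]

# Order by last value, largest first; break ties on the
# second-to-last value, also largest first.
ordered = sorted(rows, key=lambda x: (-x[-1], -x[-2]))
print(ordered)
```

The two rows ending in 100 come first, and the one whose second-to-last value is 7 precedes the one with 3.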
You can also speed up the bisect_right function by adding some simple type declarations and compiling it with Cython:
def bisect_right(a, double x, int lo=0, int hi=-1):
    # x is typed as double because the compared values can be floats
    cdef int mid
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi == -1:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x > a[mid][-1]:
            hi = mid
        else:
            lo = mid + 1
    return lo
First of all, [[0] * 5] * 10 does not create ten independent rows: all ten elements of the outer list refer to the same inner list object. You can create independent rows with a list comprehension:
array = [[0 for _ in range(5)] for _ in range(10)]
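The difference is easy to see by assigning to a single cell. A minimal sketch (the names shared and independent are only for illustration):

```python
# Multiplying the outer list repeats a reference to the same inner list.
shared = [[0] * 5] * 10
shared[0][0] = 99
print(shared[1][0])            # 99: the change shows up in every row
print(shared[0] is shared[1])  # True

# A comprehension builds a fresh inner list on each iteration.
independent = [[0] * 5 for _ in range(10)]
independent[0][0] = 99
print(independent[1][0])                 # 0: other rows are untouched
print(independent[0] is independent[1])  # False
```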
For your code, you can use zip to get the columns, and enumerate instead of list.index:
while (...):
    # processing
    # I now have a "my_row" that looks like [5, 102.24, -3.12, 2, 7.37] for instance
    col = list(zip(*array))  # columns; list() so this also works on Python 3
    smallest = min(col[-1])  # computed once instead of once per row
    indexes = [i for i, row in enumerate(array) if row[-1] == smallest]  # can return several indexes
    if array[indexes[0]][-1] < my_row[-1]:
        array[indexes[0]] = my_row
Depending on what you want to do, there may be more efficient approaches. For example, if you are working with a large list on Python 2, you can use itertools.izip instead of zip (on Python 3, zip is already lazy). You can also simplify the condition: if you only need one row whose last element is the minimum, you can simply use min with a key function:
from operator import itemgetter
index = min(enumerate(array),key=lambda x:itemgetter(-1)(x[1]))[0]
As another example, consider the following nested list:
>>> array = [[i for i in range(5)] for _ in range(10)]
>>> array
[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]
>>> import random
>>> for i in array:
...     random.shuffle(i)
...
>>> array
[[1, 2, 4, 0, 3], [2, 1, 0, 4, 3], [2, 0, 3, 4, 1], [4, 3, 2, 0, 1], [4, 3, 2, 1, 0], [0, 2, 4, 1, 3], [1, 4, 0, 3, 2], [2, 3, 1, 0, 4], [2, 3, 4, 0, 1], [3, 4, 2, 0, 1]]
And now I want to find the row with the minimum last element, which is the one to replace:
>>> index = min(enumerate(array), key=lambda x: itemgetter(-1)(x[1]))[0]
>>> index
4
You can also use heapq.nlargest to get the N largest items. For example, the following code gets the top 5 rows based on the last item:
>>> import heapq
>>> heapq.nlargest(5, array, key=itemgetter(-1))
[[2, 3, 1, 0, 4], [1, 2, 4, 0, 3], [2, 1, 0, 4, 3], [0, 2, 4, 1, 3], [1, 4, 0, 3, 2]]
And if you need the indices of those rows, you can combine zip and enumerate:
>>> list(zip(*heapq.nlargest(5, enumerate(array), key=lambda x: x[1][-1])))[0]
(7, 0, 1, 5, 6)
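Since the rows in the question arrive one at a time, a bounded heap may fit better than calling heapq.nlargest on a complete list. The following is only a sketch under that assumption; the function name keep_n_largest and the counter used as a tie-breaker are illustrative choices, not part of the question's code, and the sample rows are borrowed from the first answer:

```python
import heapq
from itertools import count

def keep_n_largest(rows, n):
    """Return the n rows with the largest last element, largest first."""
    heap = []        # min-heap of (last value, arrival order, row)
    order = count()  # tie-breaker so the rows themselves are never compared
    for row in rows:
        entry = (row[-1], next(order), row)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        else:
            # Push the new entry, then drop the smallest of the n + 1.
            heapq.heappushpop(heap, entry)
    return [row for _, _, row in sorted(heap, reverse=True)]

rows = [[5, 102.24, -3.12, 2, 9.36], [2, 102.24, -3.12, 2, 388],
        [2, 102.24, -3.12, 1, 1.54], [5, 102.24, -3.12, 2, 1.11],
        [5, 102.24, -3.12, 2, 7.35], [5, 102.24, -3.12, 2, 54],
        [5, 102.24, -3.12, 2, 1.53]]
print(keep_n_largest(rows, 3))
```

The heap never holds more than n entries, so each new row costs O(log n) and no full sort of the data is needed.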
I'm not entirely sure I'm following your code sample (using "row" to mean two different things in the same expression doesn't help), so I made a simpler example where the data is strings. You can use lists, tuples, or arbitrary objects instead, since you can define a custom "criteria" function for sorting:
data = ["abc", "bup", "zok", "foo", "gek", "ick"]
criteria = lambda item: item[1]  # use any rule you want instead
N = 3
toplist = sorted(data[:N], key=criteria)
for item in data[N:]:
    if criteria(item) < criteria(toplist[-1]):
        toplist.append(item)
        toplist = sorted(toplist, key=criteria)
        toplist = toplist[:N]  # only keep the top N items
print(toplist)
At the end, "toplist" holds your top N elements according to your criteria.
Performance Notes:
Sorting won't be terribly expensive as you sort at most N + 1 elements each time and only do so if there is actually an element to be added (this should be a minority if you have no pathological data).
You can improve the situation slightly by taking advantage of the fact that the list is already sorted and using the insert_in_sorted function, which is left as an exercise for the reader.
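One possible shape for that exercise, sketched with the standard bisect module. The name insert_in_sorted comes from the note above, but its signature and the parallel keys list are assumptions made here so that the code does not depend on the key argument that bisect only gained in Python 3.10:

```python
import bisect

def insert_in_sorted(toplist, keys, item, key):
    """Insert item into toplist, kept sorted by key; keys mirrors toplist."""
    k = key(item)
    pos = bisect.bisect_right(keys, k)  # binary search on the precomputed keys
    keys.insert(pos, k)
    toplist.insert(pos, item)

data = ["abc", "bup", "zok", "foo", "gek", "ick"]
criteria = lambda item: item[1]
N = 3

toplist = sorted(data[:N], key=criteria)
keys = [criteria(item) for item in toplist]
for item in data[N:]:
    if criteria(item) < keys[-1]:
        insert_in_sorted(toplist, keys, item, criteria)
        # Drop the element that no longer belongs to the N best.
        toplist.pop()
        keys.pop()
print(toplist)  # ['abc', 'ick', 'gek']
```

This replaces the repeated call to sorted with one binary search and one list insertion per accepted item, and it gives the same result as the loop above.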