How do I store lists by criteria in Python?

Question

I am building lists of data iteratively. Each list has the same number of values, and my goal is to keep the N worst lists, where "worst" is determined by a specific column. I have tried several things, but none of them satisfies me, and I would like to know if I missed something.

As an example, let's say each of my rows contains 5 elements and that I want to keep the 10 worst rows. I want to compare the last value knowing that it is always positive.

array = [[0] * 5] * 10
while (...)
    # processing
    # I now have a "my_row" that looks like [5, 102.24, -3.12, 2, 7.37] for instance
    indexes = [array.index(row) for row in array if row[-1] == min(r[-1] for r in array)] # can return several indexes
    if array[indexes[0]][-1] < my_row[-1]:
        array[indexes[0]] = my_row

However, this solution is far from elegant or optimal. Anyone have an idea how best to code it?

Thank you for your help!

+3

python

ena 04 june '15 at 9:24

3 answers

First of all, [[0] * 5] * 10 does not create a list of independent rows: every element of the outer list refers to the same inner list object (one address), so changing one row changes them all. Create the rows with a list comprehension instead:

array = [[0 for _ in range(5)] for _ in range(10)]
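
The aliasing is easy to see in a short experiment. The sketch below (the variable names are illustrative, not taken from the question) writes to one cell of each structure and compares the results:

```python
# Multiplying the outer list copies references to the same inner list.
aliased = [[0] * 5] * 10
aliased[0][0] = 99  # the write shows up in every "row"

# A comprehension builds a fresh inner list on each iteration.
independent = [[0] * 5 for _ in range(10)]
independent[0][0] = 99  # only the first row changes

print(aliased[1][0])      # 99
print(independent[1][0])  # 0
print(all(row is aliased[0] for row in aliased))  # True
```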

In your code, you can use zip to get the columns, and enumerate instead of list.index:

while (...):
    # processing
    # I now have a "my_row" that looks like [5, 102.24, -3.12, 2, 7.37] for instance
    last_col = list(zip(*array))[-1]  # last column; list() keeps this working on Python 3
    smallest = min(last_col)  # compute the minimum once, not once per row
    indexes = [i for i, row in enumerate(array) if row[-1] == smallest]  # can return several indexes
    if array[indexes[0]][-1] < my_row[-1]:
        array[indexes[0]] = my_row

Depending on what you want to do, there may be more efficient options. For example, if you are working with a large list on Python 2, you can use itertools.izip from the itertools module instead of zip (on Python 3, zip is already lazy). You can also simplify the condition: if you only need the index of one row with the minimum last element, you can simply use min with a key function:

from operator import itemgetter
index = min(enumerate(array), key=lambda x: itemgetter(-1)(x[1]))[0]

As another example, consider the following nested list:

>>> array = [[i for i in range(5)] for _ in  range(10)]
>>> array
[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]

>>> import random
>>> for i in array:
...   random.shuffle(i)
... 
>>> array
[[1, 2, 4, 0, 3], [2, 1, 0, 4, 3], [2, 0, 3, 4, 1], [4, 3, 2, 0, 1], [4, 3, 2, 1, 0], [0, 2, 4, 1, 3], [1, 4, 0, 3, 2], [2, 3, 1, 0, 4], [2, 3, 4, 0, 1], [3, 4, 2, 0, 1]]

Now I want to find the index of the row with the minimum last element:

>>> index = min(enumerate(array), key=lambda x: itemgetter(-1)(x[1]))[0]
>>> index
4

You can also use heapq.nlargest to get the N largest items. For example, the following code gets the top 5 rows based on the last item:

>>> import heapq
>>> heapq.nlargest(5, array, key=itemgetter(-1))
[[2, 3, 1, 0, 4], [1, 2, 4, 0, 3], [2, 1, 0, 4, 3], [0, 2, 4, 1, 3], [1, 4, 0, 3, 2]]

If you also need the indices of those rows, combine zip and enumerate (list() is needed on Python 3, where zip returns an iterator):

>>> list(zip(*heapq.nlargest(5, enumerate(array), key=lambda x: itemgetter(-1)(x[1]))))[0]
(7, 0, 1, 5, 6)
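
Because the rows in the question arrive one at a time, a streaming variant that never holds more than N rows may be worth sketching. The example below is an assumption about how that could look, not part of the answer above: keep_largest is a hypothetical helper that keeps a min-heap of (last value, sequence number, row) tuples and uses heapq.heappushpop to evict the current smallest entry when a larger row arrives.

```python
import heapq
from itertools import count


def keep_largest(rows, n):
    """Return the n rows with the largest last element, largest first.

    Hypothetical helper for illustration. The heap holds at most n entries,
    so each incoming row costs O(log n).
    """
    if n <= 0:
        return []
    heap = []      # min-heap: heap[0] is the smallest row currently kept
    tie = count()  # sequence number, so equal keys never compare the rows
    for row in rows:
        entry = (row[-1], next(tie), row)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            heapq.heappushpop(heap, entry)  # push the new row, drop the smallest
    return [row for _, _, row in sorted(heap, reverse=True)]


sample = [[5, 9.36], [2, 388], [2, 1.54], [5, 1.11], [5, 7.35], [5, 54], [5, 1.53]]
print(keep_largest(sample, 3))  # [[2, 388], [5, 54], [5, 9.36]]
```

Compared with calling heapq.nlargest on the full list, this trades a little code for not having to store every row first.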

0

Kasramvd 04 june '15 at 9:33

I'm not entirely sure I follow your code sample (using "row" to mean two different things in the same expression doesn't help), so I made a simpler example where the data items are strings. You can use lists, tuples, or arbitrary objects instead, since you can define a custom "criteria" function for sorting:

data = ["abc", "bup", "zok", "foo", "gek", "ick"]
criteria = lambda item: item[1] # use any rule you want instead
N = 3

toplist = sorted(data[:N], key=criteria)
for item in data[N:]:
    if criteria(item) < criteria(toplist[-1]):
        toplist.append(item)
        toplist = sorted(toplist, key=criteria)
        toplist = toplist[:N] # only keep the top N items

print(toplist)  # parentheses work on both Python 2 and Python 3

At the end, "toplist" holds the top N elements according to your criteria (here, the N items with the smallest key).

Performance Notes:

Sorting won't be terribly expensive, as you sort at most N + 1 elements each time and only when an element actually needs to be added (this should be the minority of cases unless your data is pathological).

You can improve on this slightly by taking advantage of the fact that the list is already sorted and inserting each new item directly at its sorted position with an insert_in_sorted function, which is left as an exercise for the reader.
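
One possible shape for that exercise, as a sketch only: the insert_in_sorted name comes from the note above, but its signature and body here are assumptions. It uses the standard bisect module on a parallel list of keys, so it also works on Python versions where bisect does not accept a key argument.

```python
import bisect


def insert_in_sorted(items, keys, item, key):
    """Insert item into items (kept sorted by key); keys mirrors items."""
    k = key(item)
    pos = bisect.bisect_right(keys, k)  # after equal keys, like a stable sort
    items.insert(pos, item)
    keys.insert(pos, k)


data = ["abc", "bup", "zok", "foo", "gek", "ick"]
criteria = lambda item: item[1]  # same rule as in the example above
N = 3

toplist, topkeys = [], []
for item in data:
    # Insert while the list is still filling up, or when the item beats the worst kept one.
    if len(toplist) < N or criteria(item) < topkeys[-1]:
        insert_in_sorted(toplist, topkeys, item, criteria)
        del toplist[N:]  # drop the overflow element, if any
        del topkeys[N:]

print(toplist)  # ['abc', 'ick', 'gek']
```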

0

Emile 04 june '15 at 9:44

Padraic Cunningham · Accepted Answer · 2015-06-04T11:07:42+0000

The sortedcontainers library has a SortedListWithKey container that will do what you want:

rows = [[5, 102.24, -3.12, 2, 9.36], [2, 102.24, -3.12, 2, 388], [2, 102.24, -3.12, 1, 1.54],
        [5, 102.24, -3.12, 2, 1.11], [5, 102.24, -3.12, 2, 7.35], [5, 102.24, -3.12, 2, 54],
        [5, 102.24, -3.12, 2, 1.53]]

from operator import itemgetter
from sortedcontainers import SortedListWithKey

# Rows are kept sorted by their last value, smallest first.
array = SortedListWithKey(key=itemgetter(-1))

n = 3
for row in rows:
    array.add(row)
    if len(array) > n:
        array.pop(0)  # drop the row with the smallest last value
print(list(array))

Output:

[[5, 102.24, -3.12, 2, 9.36], [5, 102.24, -3.12, 2, 54], [2, 102.24, -3.12, 2, 388]]

All you have to do is pop the smallest element each time the list grows past n.

Or negate the key and pop from the end:

from sortedcontainers import SortedListWithKey

# Negated key: rows are kept sorted by their last value, largest first.
array = SortedListWithKey(key=lambda x: -x[-1])
n = 3
for row in rows:
    array.add(row)
    if len(array) > n:
        array.pop()  # drop the row with the smallest last value
print(list(array))

Output:

[[2, 102.24, -3.12, 2, 388], [5, 102.24, -3.12, 2, 54], [5, 102.24, -3.12, 2, 9.36]]

The list never grows beyond n + 1 elements, and you don't need to sort, copy, or slice.

You can also adapt the bisect_right function slightly if you only care about the last value (this version keeps the list in descending order of that value):

def bisect_right(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x > a[mid][-1]:
            hi = mid
        else:
            lo = mid + 1
    return lo

array = []
n = 3
for row in rows:
    b = bisect_right(array, row[-1])
    array.insert(b, row)
    if len(array) > n:
        array.pop()
print(array)

Output:

[[2, 102.24, -3.12, 2, 388], [5, 102.24, -3.12, 2, 54], [5, 102.24, -3.12, 2, 9.36]]

With rows where several share the same maximum value:

rows = [ [5, 102.24, -3.12, 2, 100], [2, 102.24, -3.12, 2, 2], [2, 102.24, -3.12, 97],
        [5, 102.24, -3.12, 2, 1.11], [5, 102.24, -3.12, 2, 23], [5, 102.24, -3.12, 2, 54],
        [5, 102.24, -3.12, 2, 1.53], [5, 102.24, -3.12, 2, 100], [5, 102.24, -3.12, 2, 100]]

Output:

[[5, 102.24, -3.12, 2, 100], [5, 102.24, -3.12, 2, 100], [5, 102.24, -3.12, 2, 100]]

You can also give SortedListWithKey a compound key if you need to compare on more than one value:

from sortedcontainers import SortedListWithKey
array = SortedListWithKey(key=lambda x: (-x[-1], -x[-2]))
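
The same compound-key idea can be tried without a third-party package. The snippet below is an illustration with made-up rows, using heapq.nlargest from the standard library with a tuple key so that ties on the last value are broken by the second-to-last value:

```python
import heapq

# Made-up rows for illustration: three of them tie on the last value.
rows = [[1, 7, 100], [2, 9, 100], [3, 8, 100], [4, 5, 54], [5, 6, 1.5]]

# Rank by last value, then by second-to-last value, both descending.
top = heapq.nlargest(2, rows, key=lambda r: (r[-1], r[-2]))
print(top)  # [[2, 9, 100], [3, 8, 100]]
```

No negation is needed here because nlargest already ranks from largest to smallest, whereas the sorted container above is ascending and pops from the end.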

You can also speed up the bisect_right function by adding some simple type declarations and compiling it with Cython:

def bisect_right(a, double x, int lo=0, int hi=-1):  # double, since the last values can be floats
    cdef int mid
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi == -1:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x > a[mid][-1]:
            hi = mid
        else:
            lo = mid + 1
    return lo
