Pythonic way to keep top 10 results

Question

I am working on a Python project that runs for hours before it finishes all its calculations. I would like to keep track of the top 10 calculation results as it progresses.

There's an obvious way:

if calc > highest_calc:
    second_calc = highest_calc
    highest_calc = calc
if calc < highest_calc and calc > second_calc:
    third_calc =  second_calc
    second_calc = calc
if calc < second_calc and calc > third_calc:
    fourth_calc = third_calc
    third_calc = calc
etc.

But is there a better, more dynamic and pythonic way?

Bonus

Each calc has three corresponding names in my project: name_a, name_b and name_c. What I don't want is for more than one of the top 10 values to have the same three names. But if a later calc has the same names, I want to keep the higher of the two. What's the best way to do this?

For example, let's say 2.3 is the value of calc, using MCD, SBUX and CAT to calculate calc. But what if I already made a calc using MCD, SBUX and CAT, and it made it into the top 10 too? How do I find the value of that calc so I can see whether it is less than or greater than the new calc? If the new calc is greater, remove the old calc with the same names and add the new calc. If it is smaller, pass on the new calc. Hopefully that makes sense:

if name_a in top10 and name_b in top10 and name_c in top10:
   if calc > old_calc_with_same_names:
       add_calc = calc, name_a, name_b, name_c
       top10.insert(bisect.bisect(calc, top10[0]), add_calc)
else:
   add to top10

Finished code

csc = []
top_ports = []
add_sharpe = [sharpe, name_a, weight_a, exchange_a, name_b, weight_b, exchange_b, name_c, weight_c, exchange_c]
    if init__calc == 0:
            csc.append(add_sharpe)
    if init__calc > 1:
        if name_a == prev_name_a and name_b == prev_name_b and name_c == prev_name_c:
            csc.append(add_sharpe)
        if name_a != prev_name_a or name_b != prev_name_b or name_c != prev_name_c:
            if csc:
                hs = max(csc, key=lambda x: x[0])
                if top_ports:
                    ls = min(top_ports, key=lambda x: x[0])
                    if hs[0] > ls[0]:
                        hsi = csc.index(hs)
                        top_ports.append(csc[hsi])
                else:
                    hsi = csc.index(hs)
                    top_ports.append(csc[hsi])
            csc = []
            csc.append(add_sharpe)

Later in the script ...

top_ports = sorted(top_ports, key=itemgetter(0), reverse=True)
print "The highest sharpe is: {0}".format(top_ports[0])
print " ==============================================="
print " ==============================================="
print datetime.now() - startTime
print "Second: {0}".format(top_ports[1])
print "Third: {0}".format(top_ports[2])
print "Fourth: {0}".format(top_ports[3])
print "Fifth: {0}".format(top_ports[4])

etc.

+3

python

thefoxrocks May 17 '15 at 8:04


3 answers

The easiest way is to keep all your scores in a list, sort it in reverse order (highest first), and take the first 10.

import random
# sample random scores
scores = [int(1000*random.random()) for x in xrange(100)]

# uncomment if scores must be unique
#scores = set(scores)
topten = sorted(scores, reverse=True)[:10]

print topten

If you need to prevent duplicate scores in the list, use a set.

This is a 'vanilla' method for getting the top 10, but it does not offer any room for optimization, which will make a difference for large datasets.

Namely, the entire list does not need to be sorted every time the top 10 is requested if the top-ten list is maintained as scores are added. For this you could keep two lists: the full list and the top 10. For the latter, the heapq method suggested by @thijs van Dien is superior.

+11

Paul rooney May 17 '15 at 8:23


Thanks to a comment, here's my improved solution using the idea of building a topten list. Using heapq as stated in the other answer is obviously much better. This solution will have a worst-case execution time of N * 10, and using the heap will reduce that to N * log2 (10). This can be noticeable if you want not ten, but, for example, ten thousand values. But more importantly, the use of heapq is more readable, understandable, and correct.

data = [18, 85, 73, 36, 57, 2, 45, 55, 1, 28, 73, 95, 38, 89, 15, 7, 61]

# start off the topten list
# with a sentinel value to simplify the add loop.
sentinel = 12345   # the sentinel could be any value.
topten = [sentinel]

def add(newvalue):
    length = len(topten)

    # temporarily overwrite the sentinel with the new value
    topten[-1] = newvalue

    # find the right place in the topten for the new value
    # iterate over topten in reverse order, skipping the sentinel position
    for i in xrange(-2, -length-1, -1): # -2, -3, ..., -length
        if newvalue > topten[i]:
            topten[i+1] = topten[i]
            topten[i] = newvalue
        else:
            break

    # fix up the topten list.
    # if we haven't yet gathered all top ten, grow the list
    # else discard the last element of the list.
    if length < 11:
        topten.append(sentinel)
    else: # length >= 11 i.e. == 11
        topten[-1] = sentinel

for v in data: add(v)
print topten[:-1] # drop the sentinel

Adding uniqueness based on names ... should be possible while maintaining a set.

For reference, my initial solution is below. It has two problems: a seed value has to be chosen, and the result contains spurious entries if the total number of calculations is less than 10.

data = [18, 85, 73, 36, 57, 2, 45, 55, 1, 28, 73, 95, 38, 89, 15, 7, 61]

import sys
floor = -sys.maxint - 1  # won't work in Python 3, as there is no sys.maxint
                         # for float, use float('-inf')
topten = [floor] * 10

def add(newvalue):
    # iterate over topten in reverse order
    for i in xrange(-1, -11, -1): # -1, -2, ..., -10. 
        if newvalue > topten[i]:
            if i < -1:
                topten[i+1] = topten[i]
            topten[i] = newvalue
        else:
            break

for v in data: add(v)
print topten

+1

Rishi May 17 '15 at 8:55


Thijs van dien · Accepted Answer · 2015-05-17T09:11:26+0000

Use the heapq module. Instead of needlessly storing all results, at every step it adds the new result and then efficiently removes the lowest one (which may be the one just added), effectively keeping the top 10. Storing all results is not necessarily bad, though; it can be valuable for collecting statistics, and it makes it easier to decide afterwards what to keep.

from heapq import heappush, heappushpop

heap = []
for x in [18, 85, 36, 57, 2, 45, 55, 1, 28, 73, 95, 38, 89, 15, 7, 61]:
    calculation_result = x + 1 # Dummy calculation
    if len(heap) < 10:
        heappush(heap, calculation_result)
    else:
        heappushpop(heap, calculation_result)

top10 = sorted(heap, reverse=True) # [96, 90, 86, 74, 62, 58, 56, 46, 39, 37]

Note that this module has more useful functions, for example for requesting only the highest or lowest value. These may help you add the name-related behavior.
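As a rough illustration of that remark, here is a minimal sketch (written for Python 3, unlike the Python 2 snippets in this thread) that stores each result as a (value, names) tuple and peeks at heap[0], which in heapq is always the smallest item. The function name keep_top and the sample tickers other than MCD, SBUX and CAT are made up for the example; it does not yet handle the uniqueness requirement.

```python
from heapq import heappush, heapreplace

def keep_top(results, n=10):
    """Return the n largest (value, names) pairs, highest first."""
    heap = []  # min-heap: heap[0] is always the lowest result kept so far
    for value, names in results:
        if len(heap) < n:
            heappush(heap, (value, names))
        elif value > heap[0][0]:
            # Only touch the heap when the newcomer beats the current lowest.
            heapreplace(heap, (value, names))
    return sorted(heap, reverse=True)

# Hypothetical sample data: (calc value, sorted tuple of names)
sample = [
    (2.3, ("CAT", "MCD", "SBUX")),
    (1.1, ("AAPL", "IBM", "MSFT")),
    (3.7, ("GE", "KO", "PEP")),
]
top2 = keep_top(sample, n=2)
```

Comparing against heap[0] first avoids any heap work for results that cannot make it into the top n, which is the common case in a long run.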

In fact, this pattern is so common that it is available as heapq.nlargest. However, to avoid storing all your results after all, you would have to model the calculator as a generator, which is a little more advanced.

from heapq import nlargest

def calculate_gen():
    for x in [18, 85, 36, 57, 2, 45, 55, 1, 28, 73, 95, 38, 89, 15, 7, 61]:
        yield x + 1 # Dummy calculation

top10 = nlargest(10, calculate_gen()) # [96, 90, 86, 74, 62, 58, 56, 46, 39, 37]

Bonus

Here's an idea for making the results unique for each combination of associated names.

A heap alone is not going to cut it anymore, because a heap is not good at locating any item other than the absolute minimum/maximum, and what we are interested in here is a kind of local minimum, given the criterion of the name combination.

Instead, you can use a dict to store the highest value for each name combination. First you need to encode the name combination as an immutable value so that it can serve as a key, and because the order of the names shouldn't matter, decide on some order and stick to it. I'm going with alphabetically sorted strings to keep it simple.

In the code below, each result is put in the dict at a place that is unique for its name combination (hence the normalization may be needed), as long as there isn't a better result there yet. Later, the top n is compiled from the highest results for each combination.

from heapq import nlargest

calculations = [('ABC', 18), ('CDE', 85), ('BAC', 36), ('CDE', 57),
                ('ECD',  2), ('BAD', 45), ('EFG', 55), ('DCE',  1)]

highest_per_name_combi = dict()

for name_combi, value in calculations:
    normal_name_combi = ''.join(sorted(name_combi)) # Slow solution
    current = highest_per_name_combi.get(normal_name_combi, float('-inf'))
    highest_per_name_combi[normal_name_combi] = max(value, current)

top3 = nlargest(3, highest_per_name_combi.iteritems(), key=lambda x: x[1])

The only problem with this approach might be the amount of memory used. Since with 150 names there can be 551,300 combinations (150 choose 3), you may have to clean up the dict every now and then, which is simple. In the loop, check the size of the dict, and if it exceeds some (still large) number, compose the current top n and create a new, minimal dict from it. In addition, some micro-optimizations could be applied by reducing the number of lookups/calls, e.g. by not using get and/or max.
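The clean-up idea could look roughly like the sketch below (Python 3). The function name top_n_unique and the max_entries threshold are assumptions made for illustration; max_entries must be at least n. Dropping everything outside the current top n is safe because a dropped combination's best value so far was already too low to make the final top n.

```python
from heapq import nlargest

def top_n_unique(calculations, n=10, max_entries=100000):
    """Top n (names, value) pairs, keeping one best value per name combination."""
    best = {}
    for names, value in calculations:
        key = tuple(sorted(names))  # normalize: order of names must not matter
        # Single membership test instead of get() plus max()
        if key not in best or value > best[key]:
            best[key] = value
        if len(best) > max_entries:
            # Shrink the dict to just the current top n
            best = dict(nlargest(n, best.items(), key=lambda item: item[1]))
    return nlargest(n, best.items(), key=lambda item: item[1])

calculations = [('ABC', 18), ('CDE', 85), ('BAC', 36), ('CDE', 57),
                ('ECD',  2), ('BAD', 45), ('EFG', 55), ('DCE',  1)]
top3 = top_n_unique(calculations, n=3, max_entries=3)
```

Here max_entries is set absurdly low only to exercise the pruning branch on the small sample; in a real run it would be a large number chosen to fit in memory.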

All of this would be much easier if you had control over the order in which the calculations were performed. If you know that the next 1000 calculations are for the same combination of names, you can simply find the best one before adding them to the overall results.
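If the calculations really do arrive grouped like that, a sketch along these lines (Python 3) might be enough. It assumes every name combination shows up as one contiguous run and that the names are already in a consistent order; best_per_run is a made-up helper name.

```python
from heapq import nlargest
from itertools import groupby

def best_per_run(calculations):
    """Yield (names, best value) for each run of consecutive identical names."""
    for names, run in groupby(calculations, key=lambda item: item[0]):
        yield names, max(value for _, value in run)

# Hypothetical input, already ordered so that equal names are adjacent
ordered = [("ABC", 18), ("ABC", 36), ("CDE", 85), ("CDE", 57), ("EFG", 55)]
top2 = nlargest(2, best_per_run(ordered), key=lambda item: item[1])
```

Because best_per_run is a generator, nlargest never holds more than n results at once, so no dict of all combinations is needed.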

Also, with a really huge number of results, the simplest way may actually be the best. Just write them to a file in a convenient format, sort them there (first by name combination, then in reverse by value), take only the first occurrence for each name combination (easy when they are grouped), and sort the result again, this time just by value.
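The same sort, take-first, sort-again sequence can be sketched in memory (Python 3) to show the logic; for results that do not fit in memory, the first sort would be done on the file by an external tool instead. The function name top_n_by_sorting is hypothetical, and the values are assumed to be numeric so they can be negated for the descending sort.

```python
from itertools import groupby

def top_n_by_sorting(records, n=10):
    """records: iterable of (normalized_names, value) pairs."""
    # 1. Sort by name combination, then by value from high to low
    ordered = sorted(records, key=lambda r: (r[0], -r[1]))
    # 2. The first record of each group is that combination's best
    best = [next(group) for _, group in groupby(ordered, key=lambda r: r[0])]
    # 3. Sort again, by value only, and keep the first n
    return sorted(best, key=lambda r: r[1], reverse=True)[:n]

records = [("ABC", 18), ("CDE", 85), ("ABC", 36), ("CDE", 57), ("EFG", 55)]
top2 = top_n_by_sorting(records, n=2)
```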
