Quickly find the first record with a value less than or equal to a threshold
Let's say in Python I have a list of files with their respective sizes, represented as a dict (I don't need a structure, you can suggest another one):
from random import randint

def gen_rand_fileslist(nbfiles=100, maxvalue=100):
    fileslist = {}
    for i in range(nbfiles):
        fileslist["file_" + str(i)] = randint(1, maxvalue)
    return fileslist

fileslist = gen_rand_fileslist(10)
Example fileslist:
{'file_0': 2,
'file_1': 21,
'file_2': 20,
'file_3': 16,
'file_4': 12,
'file_5': 67,
'file_6': 95,
'file_7': 16,
'file_8': 2,
'file_9': 5}
Now I want to quickly find the maximum value at or below a given threshold. For example:
get_value_below(fileslist, threshold=25) # result should be 'file_1' with value 21
The get_value_below() function will be called in a tight loop, so it should be as fast as possible, and any threshold can be specified (so pre-sorting doesn't help directly).
Is there a way to be faster than just walking through the entire list (linear time)?
It all depends on how often you query fileslist. If you make more than Θ(log n) queries, it is better to sort once and then do a binary search for each query. If you only want to execute a single query, then a linear search is better, since the desired item can be almost anywhere and you will have to visit every element in the list anyway.
If you go with sorting plus binary search, use bisect_right: for an input x it returns the insertion point in the sorted list, so the element at index bisect_right(a, x) - 1 is the largest one less than or equal to x.
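A minimal sketch of that approach, using the question's example fileslist and its get_value_below name; the (size, name) pair layout is one possible choice, not prescribed by the question:

```python
from bisect import bisect_right

fileslist = {'file_0': 2, 'file_1': 21, 'file_2': 20, 'file_3': 16,
             'file_4': 12, 'file_5': 67, 'file_6': 95, 'file_7': 16,
             'file_8': 2, 'file_9': 5}

# Sort once up front: (size, name) pairs ordered by size.
pairs = sorted((size, name) for name, size in fileslist.items())
sizes = [size for size, _ in pairs]

def get_value_below(threshold):
    """Return (name, size) of the largest size <= threshold, or None."""
    i = bisect_right(sizes, threshold)  # number of sizes <= threshold
    if i == 0:
        return None  # every size exceeds the threshold
    size, name = pairs[i - 1]
    return name, size

print(get_value_below(25))  # -> ('file_1', 21)
```

Each call after the one-time sort costs O(log n), which is what makes this pay off in a tight loop.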
I suggest a pandas DataFrame, which solves your search problem quickly.
Consider the following example:
from pandas import DataFrame as df

th = 25
# Filenames as the data, sizes as the index.
d = df(['file_0', 'file_1', 'file_2',
        'file_3', 'file_4', 'file_5',
        'file_6', 'file_7', 'file_8',
        'file_9'],
       [2, 21, 20, 16, 12, 67, 95, 16, 2, 5])
x = d.loc[d.index < th]
x = x.loc[x.index == max(x.index)]
print(x)
OUT:
         0
21  file_1
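If pandas is already a dependency, a Series keyed by filename may read more naturally than indexing a DataFrame by size. A sketch under that assumption, using the question's example fileslist:

```python
import pandas as pd

fileslist = {'file_0': 2, 'file_1': 21, 'file_2': 20, 'file_3': 16,
             'file_4': 12, 'file_5': 67, 'file_6': 95, 'file_7': 16,
             'file_8': 2, 'file_9': 5}

s = pd.Series(fileslist)   # index = filenames, values = sizes
below = s[s <= 25]         # keep only sizes within the threshold
print(below.idxmax(), below.max())  # file_1 21
```

idxmax returns the index label (here, the filename) of the maximum remaining value, which is exactly what the question asks for.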
Don't search linearly through a sorted array, use binary search (duh). On the one hand, this is CS 101 stuff. On the other hand, I was not aware of the bisect library and had code cluttered with np.nonzero(sorted <= x)[0]. Once I switched to bisect_left / bisect_right, I saw a huge performance improvement. Edit: in the comments, Peter pointed out that NumPy implements a fast binary search called searchsorted; you (and I) should probably use that instead.
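A short sketch of the searchsorted variant, assuming arrays already sorted by size (the names/sizes arrays below are built by hand from the question's example fileslist):

```python
import numpy as np

# Names and sizes from the question's fileslist, sorted by size.
names = np.array(['file_0', 'file_8', 'file_9', 'file_4', 'file_3',
                  'file_7', 'file_2', 'file_1', 'file_5', 'file_6'])
sizes = np.array([2, 2, 5, 12, 16, 16, 20, 21, 67, 95])

# side='right' is the NumPy analogue of bisect_right:
# it returns the number of elements <= the query value.
i = np.searchsorted(sizes, 25, side='right')
if i > 0:
    print(names[i - 1], sizes[i - 1])  # file_1 21
```

searchsorted also accepts an array of query values, so many thresholds can be resolved in a single vectorized call.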
Some helpful quotes from http://blog.explainmydata.com/2012/07/expensive-lessons-in-python-performance.html :
Wes McKinney is a genius. If you are implementing anything Wes McKinney has already put into his pandas library, just stop. His code is faster, more robust, and more likely to be correct than anything you are about to write. Want sliding window aggregators? Use pandas. Need to handle missing data? Use pandas. Are you writing some incredibly ugly hack that tries to implement joins and group-bys on NumPy arrays, but actually manages to spend 3 hours computing a subtly wrong result? (I did this.) Jesus Christ, just stop and use pandas.