Finding the closest match in a sorted collection of datetime strings

I have a sorted list of datetimes in text format. The format of each record is "2009-09-10T12:00:00".

I want to find the entry closest to a target. There are many more entries than the number of searches I will have to do.

I could convert each entry to a number and then search numerically (like these), but that seems like overkill.

Is there a better way than this:

def mid(res, target):
    # res is a list of entries, sorted by dt (dateTtime)
    # each entry is a dict with a dt and some other info
    n = len(res)
    low = 0
    high = n-1

    # find the first entry whose dt is >= target
    while low < high:
        mid = (low + high) // 2
        t = res[mid]['dt']
        if t < target:
            low = mid + 1
        else:
            high = mid

    # check whether the prior entry is closer
    i = max(0, low-1)
    a = dttosecs(res[i]['dt'])
    b = dttosecs(res[low]['dt'])
    t = dttosecs(target)
    if abs(a-t) < abs(b-t):
        return i
    else:
        return low

import time

def dttosecs(dt):
    # string to seconds since the epoch
    date,tim = dt.split('T')
    y,m,d = date.split('-')
    h,mn,s = tim.split(':')
    y = int(y)
    m = int(m)
    d = int(d)
    h = int(h)
    mn = int(mn)
    s = min(59,int(float(s)+0.5)) # round to nearest second
    secs = time.mktime((y,m,d,h,mn,s,0,0,-1))
    return secs

      

+2

4 answers


"Copy and paste coding" (getting bisect

sourced into your code) is not recommended as it incurs all kinds of costs down the road (lots of additional source code to test and maintain, difficulty updating in the code you copied, etc.) ; the best way to reuse standard library modules is to just import them and use them.

However, to do one pass, transforming dictionaries into meaningfully comparable entries, it is O (N), which (although each step of the pass is simple) end up soaking the O (log N) time of the search itself. Since it bisect

cannot support key=

a key extractor, for example sort

, what is the solution to this dilemma - how can you reuse bisect

by importing and calling without a prior O (N) ..

As stated here , the solution is found in the famous quote by David Wheeler: "All problems in computer science can be solved by another level of indirection." Consider for example ....:

import bisect

listofdicts = [
  {'dt': '2009-%2.2d-%2.2dT12:00:00' % (m,d) }
  for m in range(4,9) for d in range(1,30)
  ]

class Indexer(object):
  def __init__(self, lod, key):
    self.lod = lod
    self.key = key
  def __len__(self):
    return len(self.lod)
  def __getitem__(self, idx):
    return self.lod[idx][self.key]


lookfor = listofdicts[len(listofdicts)//2]['dt']

def mid(res=listofdicts, target=lookfor):
    keys = [r['dt'] for r in res]
    return res[bisect.bisect_left(keys, target)]

def midi(res=listofdicts, target=lookfor):
    wrap = Indexer(res, 'dt')
    return res[bisect.bisect_left(wrap, target)]

if __name__ == '__main__':
  print '%d dicts on the list' % len(listofdicts)
  print 'Looking for', lookfor
  print mid(), midi()
  assert mid() == midi()

      

Output (first just running this indexer.py as a sanity check, then timing both ways with timeit):



$ python indexer.py 
145 dicts on the list
Looking for 2009-06-15T12:00:00
{'dt': '2009-06-15T12:00:00'} {'dt': '2009-06-15T12:00:00'}
$ python -mtimeit -s'import indexer' 'indexer.mid()'
10000 loops, best of 3: 27.2 usec per loop
$ python -mtimeit -s'import indexer' 'indexer.midi()'
100000 loops, best of 3: 9.43 usec per loop

      

As you can see, even on a modest problem with 145 entries in the list, the indirection approach can perform three times better than the key-extraction approach. Since we are comparing O(N) against O(log N), the advantage of the indirection approach grows without bound as N increases. (For very small N, the higher multiplicative constants of the indirection make the key-extraction approach faster, but that is soon outweighed by the big-O difference.) Admittedly, the Indexer class is extra code; however, it can be reused for every task of binary-searching a list of dicts sorted by one key present in each dict, so having it in your "container utilities" is a worthwhile investment.
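
To make the reuse claim concrete, here is a minimal sketch (my addition, with made-up data and a hypothetical 'name' key): the same Indexer defined above works unchanged on any list of dicts sorted by a single key.

import bisect

people = [{'name': 'Ann'}, {'name': 'Bob'}, {'name': 'Eve'}]  # sorted by 'name'
wrap = Indexer(people, 'name')
print people[bisect.bisect_left(wrap, 'Carl')]   # -> {'name': 'Eve'}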

So much for the main search loop. For the secondary task of converting two entries (the ones just below and just above the target) and the target itself to a number of seconds, consider, again, a higher-reuse approach, namely:

import time

adt = '2009-09-10T12:00:00'

def dttosecs(dt=adt):
    # string to seconds since the epoch
    date,tim = dt.split('T')
    y,m,d = date.split('-')
    h,mn,s = tim.split(':')
    y = int(y)
    m = int(m)
    d = int(d)
    h = int(h)
    mn = int(mn)
    s = min(59,int(float(s)+0.5)) # round to nearest second
    secs = time.mktime((y,m,d,h,mn,s,0,0,-1))
    return secs

def simpler(dt=adt):
  return time.mktime(time.strptime(dt, '%Y-%m-%dT%H:%M:%S'))

if __name__ == '__main__':
  print adt, dttosecs(), simpler()
  assert dttosecs() == simpler()

      

There is no performance advantage to reuse here (quite the opposite: dttosecs is faster), but then you only need to perform three conversions per search, no matter how many entries are in your list of dicts, so it is not clear whether that performance difference even matters. Meanwhile, with simpler you only have to write, test and maintain one simple line of code, while dttosecs takes a dozen; given that ratio, in most situations (i.e., excluding absolute bottlenecks) I would prefer simpler. The important thing is to be aware of both approaches, and of the trade-offs between them, in order to choose wisely.

+4




You want the bisect module from the standard library. It will do a binary search and tell you the correct insertion point for a new value in an already-sorted list. Here's an example that will print the place in the list where target would be inserted:

from bisect import bisect
dates = ['2009-09-10T12:00:00', '2009-09-11T12:32:00', '2009-09-11T12:43:00']
target = '2009-09-11T12:40:00'
print bisect(dates, target)

      



From there, you can simply compare the entries just before and just after the insertion point, which in this case will be dates[i-1] and dates[i], to see which one is closer to your target.
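
For instance, here is a minimal sketch of that comparison (my addition; the secs helper is mine, and the snippet assumes the dates and target defined above):

import time
from bisect import bisect

def secs(s):
    # ISO-like timestamp string -> seconds since the epoch
    return time.mktime(time.strptime(s, '%Y-%m-%dT%H:%M:%S'))

i = bisect(dates, target)
i = min(i, len(dates) - 1)   # clamp in case target sorts after the last entry
if i > 0 and abs(secs(dates[i-1]) - secs(target)) < abs(secs(dates[i]) - secs(target)):
    i -= 1
print dates[i]               # -> '2009-09-11T12:43:00'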

+4




import bisect

def mid(res, target):
    keys = [r['dt'] for r in res]
    return res[bisect.bisect_left(keys, target)]
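
This returns the first entry whose dt is at or after the target (and it raises IndexError if the target sorts after every key in the list). A hypothetical usage example (my addition, with made-up data):

res = [{'dt': '2009-09-10T12:00:00', 'val': 1},
       {'dt': '2009-09-11T12:32:00', 'val': 2},
       {'dt': '2009-09-11T12:43:00', 'val': 3}]
print mid(res, '2009-09-11T12:40:00')   # prints the entry for 2009-09-11T12:43:00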

      

+2




Change this first.

from datetime import datetime

def parse_dt(dt):
    return datetime.strptime( dt, "%Y-%m-%dT%H:%M:%S" )

      

This removes most of the "effort".

Consider something like this for the search.

from bisect import bisect

def mid( res, target ):
    """res is a list of entries, sorted by dt (dateTtime)
       each entry is a dict with a dt and some other info
    """
    times = [ parse_dt(r['dt']) for r in res ]
    index = bisect( times, parse_dt(target) )
    return times[index]

      

It doesn't seem like much "effort". And it doesn't depend on the details of how your timestamps are formatted: you can switch to any timestamp format (by changing the format string) and be sure it will still work.
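
One convenient consequence (my observation): since parse_dt returns datetime objects, the "which neighbor is closer" check needs no conversion to seconds; subtracting two datetimes yields a timedelta that can be compared directly. A minimal sketch with made-up values:

a = parse_dt('2009-09-11T12:32:00')
b = parse_dt('2009-09-11T12:43:00')
t = parse_dt('2009-09-11T12:40:00')
print min(a, b, key=lambda x: abs(x - t))   # -> 2009-09-11 12:43:00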

+1








