Find "closest" strings in a Python list (alphabetical order)
I have a list of strings in Python for example. initialized like this:
l = ['aardvark', 'cat', 'dog', 'fish', 'tiger', 'zebra']
I would like to test an input string in this list and find "closest line below it" and "closest line above it", alphabetically and case insensitively (ie phonetics, simple a<b
, etc.). If an entry exists in the list, both "below" and "above" must return an entry.
A few examples:
Input | Below | Above
-------------------------------
bat | aardvark | cat
aaa | None | aardvark
ferret | dog | fish
dog | dog | dog
What's the easiest way to achieve this in Python? (I am currently iterating over a sorted list using a for loop)
Further clarification: I'm interested in a simple alphabetical phrase, not something like Levenshtein or phonetics.
thank
source to share
This is exactly what the bisection module is for. This will be much faster than just looping through large lists.
import bisect
def closest(haystack, needle):
if len(haystack) == 0: return None, None
index = bisect.bisect_left(haystack, needle)
if index == 0:
return None, haystack[0]
if index == len(haystack):
return haystack[index], None
if haystack[index] == needle:
return haystack[index], haystack[index]
return haystack[index-1], haystack[index]
The above code assumes that you sanitized the input and list as either upper or lower case. Also, I wrote this on my iPhone, so please check for typos.
source to share
You can rephrase the problem:
Given a sorted list of strings l
and an input string s
, find the index at l
where it s
should be inserted so that it l
remains sorted after insertion.
The l
at index-1
and elements index+1
(if they exist) are the ones you are looking for. To find the index, you can use binary search .
source to share
A very naive implementation, only good for short lists: you can easily iterate over the list and compare your choice with each, and then interrupt the first time your choice is "greater" than the item being compared.
for i, item in enumerate(l):
if lower(item) > lower(input):
break
print 'below: %s, above, %s' % (l[i-1], item)
source to share
Are these relatively short lists and do the content change, or are they pretty static?
If you have a large number of rows and they are relatively fixed, you might want to examine your data in a Trie structure. Once you've built it, it's quick and easy to search and find the nearest neighbors as you see fit.
source to share