Find "closest" strings in a Python list (alphabetical order)

I have a list of strings in Python for example. initialized like this:

l = ['aardvark', 'cat', 'dog', 'fish', 'tiger', 'zebra']

      

I would like to test an input string in this list and find "closest line below it" and "closest line above it", alphabetically and case insensitively (ie phonetics, simple a<b

, etc.). If an entry exists in the list, both "below" and "above" must return an entry.

A few examples:

Input  | Below    |  Above   
-------------------------------
bat    | aardvark | cat      
aaa    | None     | aardvark 
ferret | dog      | fish     
dog    | dog      | dog

      

What's the easiest way to achieve this in Python? (I am currently iterating over a sorted list using a for loop)

Further clarification: I'm interested in a simple alphabetical phrase, not something like Levenshtein or phonetics.

thank

+2


source to share


4 answers


This is exactly what the bisection module is for. This will be much faster than just looping through large lists.

import bisect

def closest(haystack, needle):
    if len(haystack) == 0: return None, None

    index = bisect.bisect_left(haystack, needle)
    if index == 0:
        return None, haystack[0]
    if index == len(haystack):
        return haystack[index], None
    if haystack[index] == needle:
        return haystack[index], haystack[index]        
    return haystack[index-1], haystack[index]

      



The above code assumes that you sanitized the input and list as either upper or lower case. Also, I wrote this on my iPhone, so please check for typos.

+16


source


You can rephrase the problem:

Given a sorted list of strings l

and an input string s

, find the index at l

where it s

should be inserted so that it l

remains sorted after insertion.



The l

at index-1

and elements index+1

(if they exist) are the ones you are looking for. To find the index, you can use binary search .

+2


source


A very naive implementation, only good for short lists: you can easily iterate over the list and compare your choice with each, and then interrupt the first time your choice is "greater" than the item being compared.

for i, item in enumerate(l):
    if lower(item) > lower(input):
        break

print 'below: %s, above, %s' % (l[i-1], item)

      

+1


source


Are these relatively short lists and do the content change, or are they pretty static?

If you have a large number of rows and they are relatively fixed, you might want to examine your data in a Trie structure. Once you've built it, it's quick and easy to search and find the nearest neighbors as you see fit.

0


source







All Articles