Get close matches for multiple words in a dictionary
I have a dictionary with the following structure:
{
1: {"names": ["name1_A", "name1_B", ...]},
2: {"names": ["name2_A", "name2_B", ...]},
...
}
where name1_A and name1_B are synonyms, aliases, or different ways of writing the same name, whose identifier is 1. Likewise, name2_A and name2_B are aliases of the name whose identifier is 2, and so on.
I need to write a function that takes user input and returns the id of the name whose alias most closely resembles that input.
I know that may not be very clear, so here's an example. Let's say this is my dictionary:
{
1: {"names": ["James", "Jamie"]},
2: {"names": ["Karen", "Karyn"]}
}
The user enters the word Jimmy. Since the closest match to Jimmy in the dictionary is Jamie, the function should return an identifier of 1.
If the user types in the word Karena, the closest match is Karen, so the function should return an ID of 2.
I think the best way to get the closest match is to use difflib's get_close_matches(). However, this function takes a flat list of possibilities as an argument, and I cannot think of how to use it correctly with my nested dictionary. Any help would be appreciated.
If you are open to third-party modules, there is a nice little module that I like to use for this kind of thing called fuzzywuzzy, for fuzzy string matching in Python. This module uses the Levenshtein distance metric to calculate the difference between two strings. Here's an example of how you use it:
>>> from fuzzywuzzy import fuzz
>>> from functools import partial
>>> data_dict = {
... 1: {"names": ["James", "Jamie"]},
... 2: {"names": ["Karen", "Karyn"]}
... }
>>> input_str = 'Karena'
>>> f = partial(fuzz.partial_ratio, input_str)
>>> matches = { k : max(data_dict[k]['names'], key=f) for k in data_dict}
>>> matches
{1: 'James', 2: 'Karen'}
>>> { i : (matches[i], f(matches[i])) for i in matches }
{1: ('James', 40), 2: ('Karen', 100)}
Now you can pick Karen, since it has the highest score, and return its key, 2.
I had to call the scoring function twice for this demo, but you should be able to call it only once per alias, depending on how you extend this example.
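As a rough sketch of the single-pass idea, the function below scores every alias once and returns the id together with the winning alias and its score. The names closest_id and similarity are hypothetical, not part of any library. The default scorer is a stand-in built on the standard library's difflib.SequenceMatcher so the sketch runs without fuzzywuzzy installed; if you have fuzzywuzzy, passing scorer=fuzz.ratio or scorer=fuzz.partial_ratio should work the same way, since both take two strings and return a number.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer (an assumption, not fuzzywuzzy): 0-100, case-insensitive."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def closest_id(data, user_input, scorer=similarity):
    """Return (id, alias, score) for the alias with the highest score.

    Each alias is scored exactly once. On ties, the first alias seen wins.
    """
    return max(
        (
            (key, alias, scorer(user_input, alias))
            for key, entry in data.items()
            for alias in entry["names"]
        ),
        key=lambda item: item[2],  # compare candidates by score only
    )


data_dict = {
    1: {"names": ["James", "Jamie"]},
    2: {"names": ["Karen", "Karyn"]},
}

print(closest_id(data_dict, "Karena"))
print(closest_id(data_dict, "Jimmy"))
```

With the sample dictionary, 'Karena' resolves to id 2 and 'Jimmy' to id 1. Note that this always returns something, even for input that resembles nothing in the dictionary, so you may want to reject results whose score falls below a threshold of your choosing.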
Another note: fuzz.partial_ratio is more lenient with its matches. For a stricter matching scheme, consider using fuzz.ratio.
You can browse some more examples of fuzzy string matching here.
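If you would rather stay with the standard library, as in the question, one possible approach is to flatten the dictionary into an alias-to-id mapping and hand the aliases to difflib.get_close_matches as its list of possibilities. This is only a sketch: the function name closest_id_difflib is hypothetical, and it assumes no alias is shared between two ids.

```python
from difflib import get_close_matches


def closest_id_difflib(data, user_input, cutoff=0.6):
    """Return the id of the best-matching alias, or None if none clears cutoff."""
    # Flatten the nested dictionary into alias -> id so the aliases can be
    # passed as a flat list. Assumption: aliases are unique across ids;
    # if one is repeated, the last id seen wins.
    alias_to_id = {
        alias: key
        for key, entry in data.items()
        for alias in entry["names"]
    }
    # n=1 asks for the single best match; cutoff is the minimum
    # similarity (0.0 to 1.0) a candidate must reach to be returned.
    matches = get_close_matches(user_input, list(alias_to_id), n=1, cutoff=cutoff)
    return alias_to_id[matches[0]] if matches else None


data_dict = {
    1: {"names": ["James", "Jamie"]},
    2: {"names": ["Karen", "Karyn"]},
}

print(closest_id_difflib(data_dict, "Karena"))
print(closest_id_difflib(data_dict, "Jimmy"))
print(closest_id_difflib(data_dict, "Jimmy", cutoff=0.3))
```

Keep in mind that get_close_matches is case-sensitive and that its default cutoff of 0.6 is fairly strict: 'Karena' matches Karen and gives id 2, but 'Jimmy' finds no match at the default cutoff and only resolves to id 1 once the cutoff is lowered.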