Efficient fuzzy hash search
I have a lot of data that needs to be queried very quickly (which is of course relative, but ideally within a few seconds for ~100 million keys, and faster if possible). It is in the form of key/value pairs, where each key is a unique string and each value is an array of strings. That is how the data currently exists, but I can reorganize it into any structure that is faster to search.
A lookup for a key must return not only the values for that exact key, but also the values for every key within a given Levenshtein distance threshold (for example, 5).
For example, a search for `hello` must return not only the values indexed under the key `hello`, but also those for `hello!`, `yello`, `helo`, `hellooo`, etc.
The naive solution, of course, iterates over every key, computes its Levenshtein distance to the query, and includes its values if the distance is within the threshold. However, this does not scale: each lookup is O(n) over the keys, and each Levenshtein computation is itself O(m²) in the length m of the strings being compared, giving O(n · m²) per search, which is of course unacceptable.
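The naive approach above can be sketched as follows (a minimal Python sketch; the `fuzzy_lookup` name and the shape of the index are illustrative, not from the question):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def fuzzy_lookup(index: dict, query: str, threshold: int = 5) -> list:
    """Naive O(n) scan: compare the query against every key in the index."""
    results = []
    for key, values in index.items():
        if levenshtein(query, key) <= threshold:
            results.extend(values)
    return results
```

For example, with `index = {"hello": ["a"], "yello": ["b"], "zzzzzz": ["c"]}`, the call `fuzzy_lookup(index, "hello", 2)` collects the values for both `hello` and `yello` but skips `zzzzzz`.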
How can I structure this data to optimize search-time complexity? Space complexity and the cost of insertion, deletion, and in-place updates are all irrelevant (although I would prefer to keep inserts under a second each to avoid bottlenecks).
Some information about the data:
- Unique keys are added at a fairly constant rate of about 10/second
- Strings are added to the value arrays at a fairly constant rate of about 10/second
- The value array for each key usually has 1-5 elements, but some outliers have hundreds
- Each string in the value arrays is usually 20-40 characters long
I'm not sure a custom data structure is what you need here; the simplest option might be to use ElasticSearch with a fuzzy query or one of its siblings. The nice thing is that it uses an inverted index with good optimizations for fuzzy queries.
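As a rough illustration, a fuzzy query body could be built like this (a sketch; the field name `key` is hypothetical, and note that Elasticsearch caps the `fuzziness` parameter at an edit distance of 2, so the distance-5 threshold from the question is not directly expressible):

```python
import json

def build_fuzzy_query(term: str, fuzziness: int = 2) -> str:
    """Build the JSON body for an Elasticsearch 'fuzzy' query.

    The field name 'key' is a placeholder for whatever field the
    keys are indexed under. Elasticsearch limits fuzziness to a
    maximum edit distance of 2.
    """
    body = {
        "query": {
            "fuzzy": {
                "key": {
                    "value": term,
                    "fuzziness": fuzziness,
                }
            }
        }
    }
    return json.dumps(body)
```

The returned JSON string would then be POSTed to the index's `_search` endpoint.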