Efficient fuzzy hash search

I have a lot of data that needs to be queried very quickly ("quickly" is of course relative: within a few seconds for ~100 million keys, ideally faster). It is in the form of key/value pairs, where each key is a unique string and each value is an array of strings. That is the current data structure, but I can reorganize it into whatever structure is faster to search.

Looking up a key must return not only the values for that exact key, but also the values for all keys whose Levenshtein distance to it is within a given threshold (for example, 5).

For example, a search for "hello" must return not only the values indexed under the key "hello", but also those indexed under "hello!", "yello", "helo", "hellooo", etc.

The naive solution, of course, iterates over every key, computes its Levenshtein distance to the query, and includes its values if the distance is within the threshold. However, this does not scale: each query costs O(n) distance computations over n keys, and each computation is itself O(a * b) in the lengths a and b of the two strings being compared, which at ~100 million keys is of course unacceptable.
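To make the baseline concrete, here is a minimal sketch of the naive scan described above. The function names `levenshtein` and `naive_fuzzy_lookup` and the small sample index are made up for illustration; the only addition beyond the description is a cheap length check, which is safe because the difference in length is a lower bound on the edit distance.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def naive_fuzzy_lookup(index: Dict[str, List[str]], query: str, threshold: int) -> List[str]:
    """Scan every key and collect the values of keys within `threshold` edits."""
    results: List[str] = []
    for key, values in index.items():
        # The length difference is a lower bound on the edit distance,
        # so keys that differ too much in length can be skipped cheaply.
        if abs(len(key) - len(query)) > threshold:
            continue
        if levenshtein(key, query) <= threshold:
            results.extend(values)
    return results


# Hypothetical sample data, only for illustration.
index = {
    "hello": ["a"],
    "hello!": ["b"],
    "yello": ["c"],
    "helo": ["d"],
    "hellooo": ["e"],
    "completely different": ["f"],
}
print(naive_fuzzy_lookup(index, "hello", threshold=2))
```

Every query still touches every key, which is exactly the cost the question is trying to avoid.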

How can I structure this data to optimize search time? Space complexity and the cost of insertions, deletions and updates are all irrelevant (although I would rather keep inserts under a second each to avoid bottlenecks).

Some information about the data:

  • Unique keys are added at a fairly constant rate of 10 per second
  • Strings are added to the value arrays at a fairly constant rate of 10 per second
  • Each key's value array usually has 1-5 elements, but some outliers have hundreds
  • Each string in the value arrays is usually 20-40 characters long

1 answer


I'm not sure which data structure fits best here, but the simplest option might be to use Elasticsearch with its fuzzy query or one of its siblings. The good thing is that it uses an inverted index with good optimizations for fuzzy queries.
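As a rough sketch of what such a request could look like, the snippet below builds the JSON body of an Elasticsearch `fuzzy` query as a plain Python dict. The field name `key` and the helper `build_fuzzy_query` are assumptions for illustration, not something from the question. Note that Elasticsearch only accepts a fuzziness of 0, 1 or 2 (or `AUTO`), so a threshold of 5 as in the question cannot be expressed directly with this query.

```python
import json


def build_fuzzy_query(field: str, term: str, fuzziness: int = 2) -> dict:
    """Build the request body for an Elasticsearch `fuzzy` query on one field."""
    if not 0 <= fuzziness <= 2:
        # Elasticsearch only accepts edit distances 0, 1, 2 (or "AUTO").
        raise ValueError("fuzziness must be 0, 1 or 2")
    return {
        "query": {
            "fuzzy": {
                field: {
                    "value": term,
                    "fuzziness": fuzziness,
                }
            }
        }
    }


# "key" is a hypothetical field holding the lookup keys.
body = build_fuzzy_query("key", "hello")
print(json.dumps(body, indent=2))
```

The resulting body would be sent to the `_search` endpoint of whatever index holds the keys; each matching document would then carry its array of values.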


