Efficient fuzzy hash search
I have a lot of data that needs to be queried very quickly (which is of course relative, but ideally within a few seconds for ~100 million keys, and faster if possible). It is in the form of key/value pairs, where each key is a unique string and each value is an array of strings. That is how the data currently exists, but I can reorganize it into any structure that is faster to search.
A lookup for a key must return not only the values for that exact key, but also the values for every key within a given Levenshtein distance threshold (for example, 5).
For example, a search for `hello` must return not only the values indexed under the key `hello`, but also those for `hello!`, `yello`, `helo`, `hellooo`, etc.
The naive solution, of course, iterates over every key, computes its Levenshtein distance to the query, and includes its values if the distance is within the threshold. However, this does not scale: each lookup is O(n) over the keys, and each Levenshtein computation is itself O(m²) in the length m of the strings being compared, giving O(n · m²) per search, which is of course unacceptable.
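The naive approach above can be sketched as follows (a minimal Python sketch; the `fuzzy_lookup` name and the shape of the index are illustrative, not from the question):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def fuzzy_lookup(index: dict, query: str, threshold: int = 5) -> list:
    """Naive O(n) scan: compare the query against every key in the index."""
    results = []
    for key, values in index.items():
        if levenshtein(query, key) <= threshold:
            results.extend(values)
    return results
```

For example, with `index = {"hello": ["a"], "yello": ["b"], "zzzzzz": ["c"]}`, the call `fuzzy_lookup(index, "hello", 2)` collects the values for both `hello` and `yello` but skips `zzzzzz`.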
How can I structure this data to optimize search-time complexity? Space complexity and the cost of insertion, deletion, and in-place updates are all irrelevant (although I would prefer to keep inserts under a second each to avoid bottlenecks).
Some information about the data:
- Unique keys are added at a fairly constant rate of about 10/second
- Strings are added to the value arrays at a fairly constant rate of about 10/second
- The value array for each key usually has 1-5 elements, but some outliers have hundreds
- Each string in the value arrays is usually 20-40 characters long
I'm not sure a custom data structure is what you need here; the simplest option might be to use ElasticSearch with a fuzzy query or one of its siblings. The nice thing is that it uses an inverted index with good optimizations for fuzzy queries.
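As a rough illustration, a fuzzy query body could be built like this (a sketch; the field name `key` is hypothetical, and note that Elasticsearch caps the `fuzziness` parameter at an edit distance of 2, so the distance-5 threshold from the question is not directly expressible):

```python
import json

def build_fuzzy_query(term: str, fuzziness: int = 2) -> str:
    """Build the JSON body for an Elasticsearch 'fuzzy' query.

    The field name 'key' is a placeholder for whatever field the
    keys are indexed under. Elasticsearch limits fuzziness to a
    maximum edit distance of 2.
    """
    body = {
        "query": {
            "fuzzy": {
                "key": {
                    "value": term,
                    "fuzziness": fuzziness,
                }
            }
        }
    }
    return json.dumps(body)
```

The returned JSON string would then be POSTed to the index's `_search` endpoint.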