Infix Quick Search

I am creating an autocomplete that will query 10 million words / phrases quickly and I am running into some problems. My first idea was to go through some kind of trie / trernary tree structure, but this is a strictly prefix match, which is not enough for my application (I want a full infix match). Then I moved on to some larger solutions, SqlServer FullText Indexing, Lucene, Solr, Sphinx, but Lucent and SqlServer FullText Indexing are not actually full text, but prefixed with great features (soundex, proximity, etc.). I've tried to think about how Levenshtein distance editing might help, but can't find a way to be either at least sane or support words with long editing distances (i.e. google and ogl. Edit distance 3.but 3 way to high threshold is a common case).

My question is, how do law enforcement agencies like google / bing, etc. do? Are they just rudely forcing it after a beat? I would guess not, but I cannot find any support.

Any help would be appreciated!


source to share

3 answers

If you've included queryParser.setAllowLeadingWildcard(true);

in Lucene, you can use start and end wildcards, for example:



It would match all one-word words containing "talli", including "Metallica".

This may not be fast enough for you, but in some cases (if you only prefer wildcard searches) if you can preprocess the query string you could go through with the old "undo term and index, which is also" a trick:





Lucene / Solr can do this very easily. The unit of search in Lucene / Solr is Term , which is usually a word, but can be almost anything depending on how the text is parsed .

With Solr, there are many ways to do this (ngrams / shingles, facet prefix, TermsComponent, ...). Recent versions of Solr have a special component for autocomplete based on spell checker .



When I needed infix search in 2013, I did some research. The only way I've found is with the Sphinx engine . You need to configure it to support infix search

index tra


After that, he deals with the problem in the blink of an eye. I think it was about 200K records to search. I used a local engine to simulate an in-memory search library.



All Articles