Infix Quick Search

Question

Infix Quick Search

I am creating an autocomplete that will query 10 million words / phrases quickly and I am running into some problems. My first idea was to go through some kind of trie / trernary tree structure, but this is a strictly prefix match, which is not enough for my application (I want a full infix match). Then I moved on to some larger solutions, SqlServer FullText Indexing, Lucene, Solr, Sphinx, but Lucent and SqlServer FullText Indexing are not actually full text, but prefixed with great features (soundex, proximity, etc.). I've tried to think about how Levenshtein distance editing might help, but can't find a way to be either at least sane or support words with long editing distances (i.e. google and ogl. Edit distance 3.but 3 way to high threshold is a common case).

My question is, how do law enforcement agencies like google / bing, etc. do? Are they just rudely forcing it after a beat? I would guess not, but I cannot find any support.

Any help would be appreciated!

+3

.net autocomplete full-text-search solr sphinx

hermitt 21 Mar At 1:15 am

source to share

3 answers

Lucene / Solr can do this very easily. The unit of search in Lucene / Solr is Term , which is usually a word, but can be almost anything depending on how the text is parsed .

With Solr, there are many ways to do this (ngrams / shingles, facet prefix, TermsComponent, ...). Recent versions of Solr have a special component for autocomplete based on spell checker .

0

Mauricio scheffer 21 Mar 12 at 22:34

source to share

When I needed infix search in 2013, I did some research. The only way I've found is with the Sphinx engine . You need to configure it to support infix search

index tra
{
  [...]
  enable_star=1
  min_infix_len=2
}

After that, he deals with the problem in the blink of an eye. I think it was about 200K records to search. I used a local engine to simulate an in-memory search library.

0

Jarekczek 12 Sep '17 at 8:30

source to share

Mark leighton fisher · Accepted Answer · 2013-02-08T19:18:12+0000

If you've included queryParser.setAllowLeadingWildcard(true);

in Lucene, you can use start and end wildcards, for example:

*talli*

It would match all one-word words containing "talli", including "Metallica".

This may not be fast enough for you, but in some cases (if you only prefer wildcard searches) if you can preprocess the query string you could go through with the old "undo term and index, which is also" a trick:

acillateM

Infix Quick Search

More articles: