Accessing MongoDB's raw text-search index (tokenized terms) for term autocompletion

My users are asking for a "Google-like" query (autocomplete), which is useful both for spelling and for general usability. MongoDB text indexes, however, only match complete, correctly spelled terms.

I need access to the text index itself, i.e. to its tokenized words. I have read about a crude solution to this and was looking for something less fragile than double indexing and per-word bookkeeping.

All I want to do is get up to N index tokens that start with a given prefix. Don't tell me to use a regex search, because that defeats the faster text index. And I don't want to use Elasticsearch, Lucene or any other external index: that is a maintenance nightmare. Text search belongs in the database, and within some limitations MongoDB excels at it.

2 answers


Since you said no to regex, and also said that you prefer MongoDB's built-in text search, I suggest a method that I have sometimes used. It can handle partial-word searches, multiple-word searches, limited spelling errors, and singular/plural and tense variations (noun vs. verb forms). But keep in mind that this will not be efficient (and may return incorrect matches) if each of your fields contains thousands of words.

MongoDB text search only matches full words, so the string must be formatted accordingly. The key is to create an alternate text field, and apply the text index to that field instead of the original one.
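For instance (a mongo-shell sketch with hypothetical names: an `articles` collection and an `alt_text` field for the pre-generated tokens, neither of which appears in the original answer):

```javascript
// Index the pre-tokenized alternate field, not the original text.
// default_language: "none" disables stemming and stop-word removal,
// since the tokens below are already expanded by hand.
db.articles.createIndex(
  { alt_text: "text" },
  { default_language: "none" }
);
```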

You also have to build the matching array of words on the client side.

I will give an overview of what I have done. Suppose this string is in the collection:

"Implementing an Autocomplete Function Using MongoDB"



You would generate the following text string from it and save it as another field (the field carrying the text index):

"im imp impl implement implementation implement in implementation au aut auto co com comp compl complete complete complete complete fea feat featu featur feature mo mon mong mongo mongod mongodb"

The process before inserting the document is explained below

  • Clean the string - convert to lowercase, remove special characters like -, (), etc.

  • Remove insignificant (stop) words such as a, an, the, of, etc.

  • Push the remaining words into an array (input_array).

  • For each word in input_array, take the substrings of length 2, 3, 4 and 5 and push them into output_array. These will be matched for autocompletion and provide protection against some spelling errors. For example, "implementing" will generate "im", "imp", "impl", "imple".

  • For each word of length n in input_array, take the substrings of length n-3, n-2, n-1 and n and push them into output_array. The advantage is that this covers some grammar variations: if the user types "implemented", text containing "implementing" will still return a positive match, because both words generate the shared token "implement". For example, "implementing" will generate "implement", "implementi", "implementin", "implementing".

  • Concatenate the array into a space-separated text string and store it in the document.

  • Now the user's search input must also be formatted into an array. Steps 1 through 5 are applied here as well, producing search_input_array.

  • The advantage of applying step 4 to the user's search string is that it provides some protection against spelling errors. For example, if the user types "impdement", the formatted array will be ("im", "imp", "impd", "impde", "impdem", "impdeme", "impdemen", "impdement"). You can see that two valid tokens ("im", "imp") still match "implementing"; the remaining malformed tokens will match very few entries.

  • The advantage of applying step 5 to the user's search terms is that it provides some protection against grammar changes such as present/past tense, singular/plural, noun/verb, etc. Whether the user types "implement", "implements", "implementing" or "implemented", the formatted search array will always contain the token "implement", giving a valid match for our entry in the collection.

  • Matching should be done with a query like

    query["$text"] = { $search: search_input_array.join(" ") };

  • If you want to display suggestion tokens, you must post-process the result set: fetch the source field of the best n matches, then clean it and split it into words, do a direct substring match against the terms in search_input_array, and return the matches as tokens. But if you have short sentences of fewer than ten words, you can return the full text, just like Google (it looks better when the user enters multiple words).
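The token-generation steps above can be sketched in plain JavaScript (no MongoDB required to run it; the stop-word list and length thresholds are illustrative assumptions, not part of the original answer):

```javascript
// Illustrative stop-word list (step 2); extend to suit your needs.
const STOP_WORDS = new Set(["a", "an", "the", "of", "and", "in", "using"]);

function tokenize(text) {
  const words = text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")          // step 1: lowercase, strip special chars
    .split(/\s+/)
    .filter(w => w && !STOP_WORDS.has(w)); // step 2: drop insignificant words

  const tokens = new Set();                // step 3: collect into an array (deduped)
  for (const w of words) {
    // step 4: prefixes of length 2..5, for autocompletion and typo tolerance
    for (let len = 2; len <= Math.min(5, w.length); len++) tokens.add(w.slice(0, len));
    // step 5: prefixes of length n-3..n, for tense/plural variations
    for (let len = Math.max(2, w.length - 3); len <= w.length; len++) tokens.add(w.slice(0, len));
  }
  return [...tokens];                      // step 6: join(" ") before storing
}

// The same function formats both the stored field and the user's query:
const altText = tokenize("Implementing an Autocomplete Feature Using MongoDB").join(" ");
const search = tokenize("impdement").join(" ");
// query = { $text: { $search: search } }
```

Running the same `tokenize` over both the stored text and the user input is what makes the prefix tokens line up on both sides of the `$text` match.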

You will get better results if your strings are short. And, of course, the criteria for building the text string should be adjusted to suit your needs. You should also consider storing the formatted alternate text in a separate collection, linked by an ObjectId reference, if it is large.


Fast search response depends, more or less, on the number of items to scan in storage/files/database, the write frequency at the source, the amount of throttling, and network or hardware overhead. Let me break these down and develop a strategy for improvement in each of these areas.



Full article here
