Solr autosuggest with diacritics

I am using Solr 4 with the TermsComponent for autosuggest (as described here). We do a regex "startsWith" search that ignores upper/lower case. The whole search query looks like this:

<solr>/terms
?terms.fl=name
&terms=true
&terms.limit=5
&terms.regex=<term>.*
&terms.regex.flag=case_insensitive
&qt=%2Fterms

      

Here are some examples of what it returns:

test -> Test Listing; test lowercase
Test -> Test Listing; test lowercase

      

Unfortunately, this solution cannot handle diacritics (umlauts, accents). So the following does not work:

têst -> Test Listing; test lowercase; Têst áccènt
Test -> Test Listing; test lowercase; Têst áccènt

      

The field is of type string. I also tried the tokenized text_en type, but with no success:

<field name="name" type="string" indexed="true" stored="true" required="true" />

      

What's the best way to enable accent-insensitive matching in both directions for this autosuggest?


Edit: changes for AnalyzingSuggester:

  <searchComponent class="solr.SpellCheckComponent" name="autosuggest">
    <lst name="spellchecker">
      <str name="name">autosuggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.AnalyzingLookupFactory</str>
      <str name="storeDir">autosuggest</str>
      <str name="buildOnCommit">true</str>
      <str name="field">asug</str>
      <str name="suggestAnalyzerFieldType">text_asug</str>

      <!-- Suggester properties -->
      <bool name="exactMatchFirst">true</bool>
    </lst>
  </searchComponent>
  <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/autosuggest">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">autosuggest</str>
      <str name="spellcheck.onlyMorePopular">true</str>
      <str name="spellcheck.count">5</str>
      <str name="spellcheck.collate">true</str>
    </lst>
    <arr name="components">
      <str>autosuggest</str>
    </arr>
  </requestHandler>

      

...

<fieldType name="text_asug" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

      



2 answers


The problem is that the TermsComponent works on indexed tokens, both for matching and for the values it returns. So if you apply Unicode folding at index time (which is what you would be doing), only the folded version of the text is kept. You will match regardless of accents, but you will also get the terms back without accents.

I can imagine two options:

1) Store the folded and unfolded terms together in the same field, so that something like "Têst áccènt" is indexed as "Test accent | Têst áccènt". You match on the "Test .." prefix and then extract the second part on the client. Getting this indexed that way can be tricky.



2) Use the Suggester instead. It is based on the spellchecker and, if I am reading the documentation correctly, lets you specify an alternative field type whose analyzers are used when building and querying the suggester's index (via the barely documented queryAnalyzerFieldType parameter in solrconfig.xml). That way your original text is copied into the suggester in folded form. I would assume that once the suggester matches something, it returns the original form. However, I am not sure, mainly because this is advertised as a feature of the brand-new AnalyzingSuggester in Lucene/Solr 4.1. In fact, the article covers your use case specifically:

With an analyzer that folds or normalizes case, accents, etc. (for example, using ICUFoldingFilter), suggestions will match regardless of case and accents. For example, "ame..." suggests Amélie.

The catch is that you have to put together a complete example yourself at this point, because there is very little guidance. But this (AnalyzingSuggester) is probably the best choice.
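As a rough sketch of how the suggester's source field might be wired up in schema.xml: this assumes the "asug" field named in the question's suggester configuration and the "name" field as the source. It also assumes that the suggester builds its dictionary from the indexed terms of that field, so the field is left unanalyzed to keep the accented original, while the folding happens only in the text_asug analyzer referenced by suggestAnalyzerFieldType. None of this is confirmed in the thread, so treat it as a starting point to test.

```xml
<!-- schema.xml (sketch, not a verified configuration) -->
<!-- Assumption: the dictionary is read from the indexed terms of "asug", -->
<!-- so an unanalyzed string type keeps the original form, e.g. "Têst áccènt". -->
<field name="asug" type="string" indexed="true" stored="false"/>

<!-- Feed the suggester field from the existing "name" field. -->
<copyField source="name" dest="asug"/>
```

If suggestions still come back without accents, the first thing to check would be whether "asug" is being indexed with a folding field type instead of string.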



You will need to create a custom fieldType, possibly similar to the text_en field type, but also using an ASCIIFoldingFilterFactory to handle diacritic conversion at both index and query time.
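A minimal sketch of such a field type could look like the following. The name text_folded is hypothetical, and the filter chain is deliberately shorter than the stock text_en (no stop words or stemming), which seems more suitable for suggestions. A single analyzer element without a type attribute applies to both index and query time.

```xml
<!-- schema.xml (sketch); "text_folded" is a made-up name for illustration -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first, then map accented characters to ASCII (ê -> e, á -> a) -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that with the TermsComponent the returned terms are the indexed tokens, so a field of this type would match "têst" but return the folded, lowercased form rather than the original text.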









