Solr autosuggest with diacritics
I am using Solr 4 with the TermsComponent for autosuggest. We run a regex "starts with" search that ignores upper/lower case. The whole search query looks like this:
<solr>/terms
?terms.fl=name
&terms=true
&terms.limit=5
&terms.regex=<term>.*
&terms.regex.flag=case_insensitive
&qt=%2Fterms
Here are some examples of what it returns:
test -> Test Listing; test lowercase
Test -> Test Listing; test lowercase
Unfortunately, this solution cannot handle diacritics (umlauts, accents), so the following does not work:
têst -> Test Listing; test lowercase; Têst áccènt
Test -> Test Listing; test lowercase; Têst áccènt
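To illustrate the limitation, here is a minimal sketch that uses Python's `re` module as a stand-in for the regex matching TermsComponent performs (an assumption made for illustration; Solr uses Java/Lucene regexes, and the function name is hypothetical). A case-insensitive flag equates "T" with "t", but it does not equate "ê" with "e":

```python
import re

# Sample terms taken from the examples above.
TERMS = ["Test Listing", "test lowercase", "Têst áccènt"]

def case_insensitive_prefix_hits(prefix, terms):
    """Return terms matching '<prefix>.*' while ignoring case only."""
    pattern = re.compile(re.escape(prefix) + ".*", re.IGNORECASE)
    return [term for term in terms if pattern.match(term)]

# Case is ignored, but the accented term is still missed.
print(case_insensitive_prefix_hits("test", TERMS))
```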
The field is of type string. I also tried the tokenized text_en type, but without success:
<field name="name" type="string" indexed="true" stored="true" required="true" />
What's the best way to enable accent-insensitive matching in both directions for this autosuggest?
Edit: changes for the AnalyzingSuggester:
<searchComponent class="solr.SpellCheckComponent" name="autosuggest">
  <lst name="spellchecker">
    <str name="name">autosuggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.AnalyzingLookupFactory</str>
    <str name="storeDir">autosuggest</str>
    <str name="buildOnCommit">true</str>
    <str name="field">asug</str>
    <str name="suggestAnalyzerFieldType">text_asug</str>
    <!-- Suggester properties -->
    <bool name="exactMatchFirst">true</bool>
  </lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/autosuggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">autosuggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>autosuggest</str>
  </arr>
</requestHandler>
...
<fieldType name="text_asug" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
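A handler like /autosuggest above would typically be queried by passing the user's input in `spellcheck.q`. The sketch below only builds the request URL; the host and core name (`localhost:8983`, `collection1`) and the function name are hypothetical, and the response format is not shown because it depends on the Solr version and the `wt` setting.

```python
from urllib.parse import urlencode

def build_autosuggest_url(base_url, user_input, count=5):
    """Build a request URL for the /autosuggest handler defined above."""
    params = {
        "spellcheck.q": user_input,  # raw user input, accents included
        "spellcheck.count": count,   # overrides the handler default
        "wt": "json",
    }
    # urlencode percent-encodes non-ASCII input as UTF-8.
    return base_url.rstrip("/") + "/autosuggest?" + urlencode(params)

# Hypothetical host and core name.
url = build_autosuggest_url("http://localhost:8983/solr/collection1", "têst")
print(url)
```

Folding of the accented input then happens in the query analyzer of text_asug, so the client does not need to strip accents itself.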
The problem is that the TermsComponent works on the indexed tokens, both for matching and for the values it returns. So if you fold Unicode characters at index time (which is what you are doing), only the folded version of the text ends up in the index. You will match without accents, but you will also get the suggestions back without accents.
I can imagine two options:
1) Store the folded and the unfolded term in the same field, so that something like "Têst áccènt" becomes "Test accent | Têst áccènt". You match on the "Test ..." prefix and then extract the second part on the client. Setting this up can be tricky.
2) Use the Suggester instead. It is based on the spell-check component and, if I'm reading the documentation correctly, lets you specify an alternate field type whose analyzers are used when building and querying the suggester's index (via the barely documented queryAnalyzerFieldType parameter in solrconfig.xml). That way, your original text is copied into the suggester in folded form, and presumably, once the suggester matches something, it returns the original form. I am not sure, though, mainly because this behavior is advertised as a feature of the AnalyzingSuggester, which is brand new in Lucene/Solr 4.1. In fact, the article specifically covers your use case:
With an analyzer that folds or normalizes case, accents, etc. (for example, using ICUFoldingFilter), the suggestions will match regardless of case and accents. For example, the query "ame..." would suggest Amélie.
The catch is that, at this point, you have to piece together a complete example yourself; there is very little documentation. Still, the AnalyzingSuggester is probably the best choice.
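Option 1 above can be sketched roughly as follows. This is a minimal illustration, not a Solr implementation: the `|` separator, the function names, and the in-memory list standing in for the index are all assumptions, the folded half is also lowercased here, and `unicodedata` only approximates what ASCIIFoldingFilterFactory does.

```python
import unicodedata

SEPARATOR = "|"  # hypothetical delimiter; must not occur in real names

def fold(text):
    """Lowercase and strip combining accents (rough stand-in for ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def to_indexed_term(name):
    """Build the value to index: folded form, separator, original form."""
    return fold(name) + SEPARATOR + name

def to_display(indexed_term):
    """Recover the original form on the client."""
    return indexed_term.split(SEPARATOR, 1)[1]

def suggest(user_input, indexed_terms, limit=5):
    """Prefix-match the folded input against the folded half of each term."""
    prefix = fold(user_input)
    hits = [term for term in indexed_terms if term.startswith(prefix)]
    return [to_display(term) for term in hits[:limit]]

names = ["Test Listing", "test lowercase", "Têst áccènt"]
index = [to_indexed_term(name) for name in names]

print(suggest("têst", index))
print(suggest("Test", index))
```

With a real TermsComponent, the folded prefix would be sent as `terms.prefix` (or inside `terms.regex`), and only the `to_display` step would run on the client.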
You will need to create a custom fieldType, possibly similar to the text_en field type, but also including an ASCIIFoldingFilterFactory to handle diacritic conversion at both index and query time.