Why isn't my leading wildcard search working in Solr?

I have a text field that I populate from various source fields using copyField. This single field is the one I search against in the Solr index.

This text field uses the custom fieldType "text_en_splitting_reversed". I created this field type by copying the "text_en_splitting" example and adding the ReversedWildcardFilterFactory to the index analyzer.

<!-- Just like text_en_splitting, but with the addition of reversed tokens for leading wildcard matches -->
<fieldType name="text_en_splitting_reversed" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" types="word-delim-types.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
       maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"  types="word-delim-types.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
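
As a rough illustration of the setup described above, a catch-all field of this type might be declared and populated as follows. This is only a sketch: the field names `text_all`, `title`, and `description` are hypothetical placeholders, since the actual schema fields are not shown in this post.

```xml
<!-- Hypothetical catch-all field using the custom type defined above.
     multiValued="true" because several source fields are copied into it. -->
<field name="text_all" type="text_en_splitting_reversed" indexed="true" stored="false" multiValued="true"/>

<!-- Hypothetical source fields copied into the catch-all field at index time -->
<copyField source="title" dest="text_all"/>
<copyField source="description" dest="text_all"/>
```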

      

My main problem: I am getting unexpected results when searching with a leading wildcard. For example, I know that a search for "*car" should return one match (the document contains the word "racecar"), but it does not. To debug this, I tried the analysis tool in Solr Admin. Here is a screenshot of my test:

[Screenshot: Leading wildcard issue]

I'm new to this analysis tool, but shouldn't the right side keep the leading asterisk all the way through the chain? And why doesn't it end up at the end of the token? Shouldn't the user-entered keywords be reversed?

Now, in my query configuration, I am set up to use edismax. However, in the admin GUI analysis tool, I don't see a way to control whether it uses the standard parser or edismax. (Maybe it doesn't matter?)

In case it helps provide more context, here are my goals for indexing this particular field:
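
For reference, edismax is usually selected per request handler in solrconfig.xml, along these lines. This is a minimal sketch under assumptions: the handler name `/select` and the `text_all` field in `qf` are placeholders, not taken from the actual configuration.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the Extended DisMax query parser unless the request overrides it -->
    <str name="defType">edismax</str>
    <!-- Hypothetical catch-all field to search against -->
    <str name="qf">text_all</str>
  </lst>
</requestHandler>
```

As far as I can tell, the Analysis screen only runs the field's analyzer chain on the text you type and does not invoke a query parser, so this setting would not be reflected there.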

  • I would like *car to match racecar. This does not work.
  • I would like $30 to match documents containing $30 but not 30 (no dollar sign). So I added the types="word-delim-types.txt" attribute, where I define $ as DIGIT. This one works.
  • I would like 30 to match documents containing $30. This does not work.


2 answers


Ultimately, the main issue with wildcards was a bug in our search front end. We have code that wraps every keyword or phrase in quotation marks before the request is sent to Solr. That way, if a phrase was entered, it would be surrounded by quotes and work fine, and it doesn't affect regular keyword searches.

But apparently, if it is a wildcard search, putting quotes around it makes the search fail. When I removed the quotes, *car matched the documents containing racecar, as hoped.



As for my secondary problem (why "30" doesn't match documents containing "$ 30"), I also solved this problem in a separate StackOverflow thread: How do I find documents containing numbers and dollar signs in Solr?

As an aside, I think there is a bug in the Solr admin GUI analysis tool. When testing wildcard searches, I can never get any highlighting indicating that a match would have been made, which added to my confusion while trying to debug the problem.



You can see from your screenshot that the WordDelimiterFilterFactory has removed your leading *. Try adding preserveOriginal="1" to the WordDelimiterFilterFactory in your query analyzer:



<filter class="solr.WordDelimiterFilterFactory" 
    preserveOriginal="1" 
    generateWordParts="1" 
    generateNumberParts="1" 
    catenateWords="0" 
    catenateNumbers="0" 
    catenateAll="0" 
    splitOnCaseChange="1" 
    types="word-delim-types.txt" />

      
