DoubleMetaphoneFilterFactory in Solr

Question


My goal is to integrate Solr so that the results from my application are accurate and fast. I search on the name field using Double Metaphone so that similar-sounding names are also captured, and then use fuzzy search (which uses the Levenshtein distance algorithm) to retrieve results above a certain similarity percentage. The problem is that once I apply Double Metaphone to a field type, I can no longer fuzzy search that field.

An example config from my schema.xml looks like this:

<field name="sdn_names" type="doublemetaphonetic" indexed="true" stored="true"     termVectors="true"/>
<!-- Definition of doublemetaphonetic. -->
<fieldtype name="doublemetaphonetic" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldtype>

In the Solr UI, when I search for sdn_names:abdul~0.50 it returns 0 results, but if I change the query to sdn_names:abdul I get 180 records in the result set. I searched for a solution and found that when Double Metaphone is used at index time, the phonetic value differs from the original value, so the Levenshtein distance calculated between the two strings is very large and nothing matches. Please point me to any links, recommended solutions, or reading for this problem, since I'm new to Solr. Thanks in advance.


java lucene solr fuzzy-search

Alok chaudhary 11 Aug 14 at 12:21


1 answer

femtoRgon · Accepted Answer · 2014-08-11T16:10:32+0000

Metaphone and wildcards are just not compatible.

First, Lucene does not analyze the terms in wildcard, fuzzy, regex, and similar queries. So you are trying to match plain text against metaphone codes. That gives you:

In the index: APTL
In the query: abdul~0.5

Which I think makes it more obvious why you are not getting any matches. Even ignoring case, that is a Levenshtein distance of 3 (b→p, d→t, and the u dropped), which is significant for a five-letter term.

Mixing metaphones with wildcards doesn't make much sense. A metaphone match has to be exact. The metaphone algorithm reduces the term to a code representing its first four sounds (simplifying somewhat).

These are two distinct and separate approaches to finding looser matches. They have to be indexed separately, so if you want to be able to search with both fuzzy matching and metaphones, the best idea is to index the metaphones and the full text in two different fields and then search on both. Something like:

<field name="sdn_names_phonetic" type="doublemetaphonetic" indexed="true" stored="false" termVectors="true"/>
<field name="sdn_names" type="text_standard" indexed="true" stored="true" termVectors="true"/>

<fieldType name="text_standard" class="solr.TextField"> 
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/> 
</fieldType> 
<fieldtype name="doublemetaphonetic" stored="false" indexed="true" class="solr.TextField" >
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldtype>

(Note: I changed your metaphone field to stored="false". Since both fields would store the same data, there is no need to store both.)

You can then search like this:

sdn_names:abdul~0.5 sdn_names_phonetic:abdul

See the Solr documentation section "Indexing the same data in multiple fields" for a little more about this kind of pattern.
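
With two fields, the same name has to reach both of them at index time. One way to do that is a copyField rule in schema.xml. This is only a sketch: it reuses the field names from the schema above and assumes your indexing client populates only sdn_names.

```xml
<!-- Assumption: documents are sent with only sdn_names populated.
     Solr copies the raw input value into sdn_names_phonetic before analysis,
     so each field runs its own analyzer chain on the same text. -->
<copyField source="sdn_names" dest="sdn_names_phonetic"/>
```

If your client already sends both fields explicitly, this rule is unnecessary.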
