Lucene Proximity Search for a phrase with more than two words

Lucene's guide explains what a proximity search means for a two-word phrase, using the example "jakarta apache"~10

in http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Proximity Searches

However, I am wondering what exactly a search such as "jakarta apache lucene"~10

does. Must each pair of adjacent words be no more than 10 words apart, or does the limit apply to all word pairs?

Thanks!

1 answer


Slop (proximity) works like an edit distance (see PhraseQuery.setSlop). Thus, terms can be reordered or have extra terms inserted between them. This means that the proximity value is the maximum number of extra terms allowed across the entire query, not per pair of adjacent words. For example:

"jakarta apache lucene"~3

      

Will match:

  • "jakarta lucene apache" (distance: 2)
  • "jakarta additional words here apache lucene" (distance: 3)
  • "jakarta some words apache separated lucene" (distance: 3)

But not:

  • "lucene jakarta apache" (distance: 4)
  • "jakarta too many unnecessary words here apache lucene" (distance: 5)
  • "jakarta some words apache further separated lucene" (distance: 4)
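
The in-order cases above can be checked by hand with a small counting sketch. It does not call Lucene. It assumes that each phrase term occurs once, in query order, and that text is split on whitespace. The function names `inserted_terms` and `matches` are made up for illustration, and reordered matches are deliberately not modeled.

```python
def inserted_terms(phrase, text):
    """Count the extra words sitting between in-order phrase terms.

    Returns None when the terms do not all appear in query order.
    """
    terms = phrase.lower().split()
    words = text.lower().split()
    positions = []
    start = 0
    for term in terms:
        try:
            pos = words.index(term, start)  # next occurrence at or after start
        except ValueError:
            return None  # term missing, or it only appears out of order
        positions.append(pos)
        start = pos + 1
    # Words spanned by the match, minus the phrase's own length.
    return (positions[-1] - positions[0] + 1) - len(terms)


def matches(phrase, text, slop):
    """True when the in-order match needs no more than `slop` extra words."""
    extra = inserted_terms(phrase, text)
    return extra is not None and extra <= slop
```

With a slop of 3, `matches("jakarta apache lucene", "jakarta additional words here apache lucene", 3)` is true because three extra words were inserted in total, while the five-word gap in "jakarta too many unnecessary words here apache lucene" is rejected.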



Some people were confused by this case:

"lucene jakarta apache" (distance: 4)

The simple explanation is that swapping two terms takes two edits, so:

  • jakarta apache lucene (distance: 0)
  • jakarta lucene apache (first swap, distance: 2)
  • lucene jakarta apache (second swap, distance: 4)

A longer but more accurate explanation is that each edit moves one term by one position. The first move of a swap places the two terms on top of each other, at the same position. Keeping this in mind explains why any set of three terms can be rearranged into any order with a distance of no more than 4.

  • jakarta apache lucene (distance: 0)
  • jakarta [apache, lucene] (distance: 1)
  • [jakarta, apache, lucene] (all three moved onto the same position, distance: 2)
  • lucene [jakarta, apache] (distance: 3)
  • lucene jakarta apache (distance: 4)
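
The move counting in the walkthrough above can be reproduced for every ordering of the three terms. This is only a sketch of that counting argument: it keeps the three positions fixed and adds up how far each term has to travel, one position per move. It does not call Lucene, and it is an assumption, not something checked here, that Lucene's scorer arrives at the same numbers. The name `moves_between` is hypothetical.

```python
from itertools import permutations


def moves_between(query_terms, doc_terms):
    """Total single-position moves needed to turn one ordering into the other."""
    target = {term: i for i, term in enumerate(doc_terms)}
    return sum(abs(target[term] - i) for i, term in enumerate(query_terms))


query = ("jakarta", "apache", "lucene")

# Cost of reaching each possible ordering from the query ordering.
costs = {" ".join(p): moves_between(query, p) for p in permutations(query)}

for ordering, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{ordering}: {cost}")
```

Under this counting, no ordering of three terms costs more than 4 moves, which is the bound stated above.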