Lucene.NET - search phrase containing "and"

I'm looking for advice on handling ampersands and the word "and" in Lucene queries. My test queries (including the quotes):

  • "Oil and gas field" (full phrase)
  • "research and development" (full phrase)
  • "r & d" (full phrase)

Ideally I would like to use QueryParser, as the input comes from the user.

While testing and reading the docs, I found that using StandardAnalyzer doesn't work for what I want. For the first two queries, QueryParser.Parse converts them to:

contents:"oil gas field"
contents:"research development"

      

This is not what I want. If I use a PhraseQuery instead, I get no results (presumably because "and" is not indexed).

If I use SimpleAnalyzer, then I can find the phrases, but QueryParser.Parse converts the last one to:

contents:"r d"

      

Which, again, is not quite what I'm looking for.

Any advice?



2 answers


If you want to find "and", you need to index it. Write your own analyzer, or remove "and" from the stop word list. The same goes for "r & d": write your own analyzer that creates three tokens from the text: "r", "d", and "r & d".
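To make the idea concrete, here is a minimal sketch of such an analysis step. It is plain Python rather than Lucene.NET code, and the names `TOKEN_RE` and `analyze` are hypothetical. It keeps every word (so "and" survives) and, for an "x & y" pattern, emits the two parts plus one combined token that shares the position of the first part.

```python
import re

# Hypothetical token pattern: runs of letters/digits, or a lone ampersand.
TOKEN_RE = re.compile(r"[a-z0-9]+|&")


def analyze(text):
    """Return (position, token) pairs. No stop words are removed."""
    raw = TOKEN_RE.findall(text.lower())
    tokens = []
    pos = 0
    i = 0
    while i < len(raw):
        is_pair = (
            i + 2 < len(raw)
            and raw[i] != "&"
            and raw[i + 1] == "&"
            and raw[i + 2] != "&"
        )
        if is_pair:
            # "x & y": emit both parts plus one combined token.
            left, right = raw[i], raw[i + 2]
            tokens.append((pos, left))
            tokens.append((pos, f"{left} & {right}"))  # same position as `left`
            tokens.append((pos + 1, right))
            pos += 2
            i += 3
        elif raw[i] == "&":
            i += 1  # a stray ampersand produces no token of its own
        else:
            tokens.append((pos, raw[i]))
            pos += 1
            i += 1
    return tokens


print(analyze("R & D"))
print(analyze("Oil and gas field"))
```

In a real index the same analysis would have to run at both index time and query time, otherwise the combined token would never match.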





Step one of working with Lucene is to accept that almost all of the work is done during indexing. If you want to search for something, you index it. If you want to ignore something, then you don't index it. This is what allows Lucene to provide such high-speed search.

As a result, for the index to work effectively, you must anticipate what your analyzer needs to do. In this case, I would write my own analyzer that does not remove any stop words and also converts "&" to "and" (and optionally "@" to "at", etc.). For "r & d" to match "research and development", you will almost certainly have to implement some kind of domain-specific logic.
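A rough sketch of that normalization, again in plain Python and not using any Lucene.NET API: the mapping `SYMBOL_WORDS` and the helpers `normalize` and `contains_phrase` are assumptions made for illustration. The key point is that the same function is applied to the indexed text and to the query, and no stop words are dropped.

```python
import re

# Assumed symbol-to-word mapping; extend it to suit your own data.
SYMBOL_WORDS = {"&": "and", "@": "at"}


def normalize(text):
    """Lowercase, spell out symbols, and split into tokens. Keeps every word."""
    tokens = re.findall(r"[a-z0-9]+|[&@]", text.lower())
    return [SYMBOL_WORDS.get(tok, tok) for tok in tokens]


def contains_phrase(doc_text, phrase):
    """True if the normalized phrase occurs as consecutive tokens in the document."""
    doc, query = normalize(doc_text), normalize(phrase)
    n = len(query)
    return n > 0 and any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


print(contains_phrase("Our R and D budget", '"r & d"'))
print(contains_phrase("Oil gas field", '"oil and gas field"'))
print(contains_phrase("Research and development", '"r & d"'))
```

The last call shows the limit of pure normalization: matching "r & d" against "research and development" would need an extra, domain-specific expansion step that this sketch does not attempt.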

There are other ways to tackle this. If you can distinguish between phrase searches and regular keyword searches, there is no reason why you cannot maintain two or more indexes to handle the different types of search. This gives very fast searches, but requires additional maintenance.



Another option is to use Lucene's fast search to narrow your initial results down to a more manageable set, using an analyzer that doesn't produce false negatives. Then you can run more detailed filtering over the full text of the documents it finds, to match the exact phrases.
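The two-stage approach can be sketched as follows. This is a toy in-memory stand-in for the Lucene index, written in plain Python. The stop word list and the names `build_index` and `phrase_search` are assumptions. Stage one uses a coarse index (stop words dropped) to collect candidates, and stage two checks the exact word sequence against the stored text.

```python
import re
from collections import defaultdict

STOP_WORDS = {"and", "or", "the", "a", "of"}  # assumed list, for illustration only


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Coarse inverted index: term -> set of doc ids, with stop words dropped."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in words(text):
            if w not in STOP_WORDS:
                index[w].add(doc_id)
    return index


def phrase_search(docs, index, phrase):
    query = words(phrase)
    if not query:
        return []
    # Stage 1: fast, over-inclusive candidate lookup on the non-stop terms.
    terms = [w for w in query if w not in STOP_WORDS]
    if terms:
        candidates = set.intersection(*(index.get(t, set()) for t in terms))
    else:
        candidates = set(docs)  # nothing to narrow with, so check every document
    # Stage 2: exact check of the full word sequence, stop words included.
    n = len(query)
    hits = []
    for doc_id in sorted(candidates):
        doc = words(docs[doc_id])
        if any(doc[i:i + n] == query for i in range(len(doc) - n + 1)):
            hits.append(doc_id)
    return hits


docs = {
    1: "Oil and gas field services",
    2: "Gas field near the oil terminal",
    3: "Oil, gas field",
}
index = build_index(docs)
print(phrase_search(docs, index, '"oil and gas field"'))
```

Here all three documents pass stage one, and only the first survives the exact check. The cost is that stage two has to read the stored text of every candidate, so it pays off only when stage one narrows the set substantially.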

Ultimately, I think you'll find that Lucene sacrifices precision in more advanced searches to provide speed, which is generally good enough for most people. You are probably in uncharted waters trying to tweak your analyzer so much.
