Why doesn't Lucene return results for an exact string in Java?
I am working with Lucene in Java. We have 100K stop names in a MySQL database. The stop names look like this:
NEW YORK PENN STATION,
NEWARK PENN STATION,
NEWARK BROAD ST,
NEW PROVIDENCE
etc.
When the user enters a search term such as NEW YORK, we get NEW YORK PENN STATION back, but when the user enters the exact string NEW YORK PENN STATION, the search returns zero results.
My code is:
public ArrayList<String> getSimilarString(ArrayList<String> source, String querystr)
{
    ArrayList<String> arResult = new ArrayList<String>();
    try
    {
        // 0. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

        // 1. create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        for(int i = 0; i < source.size(); i++)
        {
            addDoc(w, source.get(i), "1933988" + (i + 1) + "z");
        }
        w.close();

        // 2. query
        //    the "title" arg specifies the default field to use
        //    when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");

        // 3. search
        int hitsPerPage = 20;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. Get results
        for(int i = 0; i < hits.length; ++i)
        {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            arResult.add(d.get("title"));
        }

        // reader can only be closed when there
        // is no need to access the documents any more.
        reader.close();
    }
    catch(Exception e)
    {
        System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
    }
    return arResult;
}

private static void addDoc(IndexWriter w, String title, String isbn) throws IOException
{
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));
    // use a string field for isbn because we don't want it tokenized
    doc.add(new StringField("isbn", isbn, Field.Store.YES));
    w.addDocument(doc);
}
This method takes the list of stop names and the user-supplied search string as input.
Does Lucene work on long strings?
Why doesn't Lucene match the exact string?
Instead of
1) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");
Ex: with the input "New York station", this is parsed to "title:new title:york title:station*". This query returns all documents containing any of the above terms.
try
2) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse("+(" + querystr + ")");
Ex1: "new york" will be parsed into "+(title:new title:york)".
The "+" symbol marks what follows it as required (it "must" occur) in every matching document. Here it applies to the whole parenthesized group, and the terms inside the group are still joined with OR. The query will match documents containing "New York" and "New York Station".
Ex2: "new york station" will be parsed into "+(title:new title:york title:station)". To match only documents that contain every term, such as "New York station" but not plain "New York" (which has no "station"), each term has to be required on its own: "+new +york +station".
Please make sure 'title' is the field you actually want to search.
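The difference between "any term may match" and "every term must match" can be sketched without Lucene at all. The class below is a toy model, not Lucene code: BooleanMatchSketch, terms, matchesAny and matchesAll are hypothetical names, and the lowercase-and-split-on-whitespace step is only a rough stand-in for what StandardAnalyzer does with simple names like these.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: no Lucene classes are used here.
class BooleanMatchSketch {

    // Assumption: lowercasing and splitting on whitespace approximates
    // the analyzer's output for plain ASCII stop names.
    static Set<String> terms(String text) {
        String[] parts = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return new HashSet<>(Arrays.asList(parts));
    }

    // Models an OR group such as (new york station):
    // a single shared term is enough for a match.
    static boolean matchesAny(String doc, String query) {
        Set<String> docTerms = terms(doc);
        for (String term : terms(query)) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Models +new +york +station: every query term must be present.
    static boolean matchesAll(String doc, String query) {
        return terms(doc).containsAll(terms(query));
    }

    public static void main(String[] args) {
        List<String> stops = Arrays.asList(
                "NEW YORK PENN STATION",
                "NEWARK PENN STATION",
                "NEWARK BROAD ST",
                "NEW PROVIDENCE");
        String query = "new york station";
        for (String stop : stops) {
            System.out.println(stop
                    + " | any=" + matchesAny(stop, query)
                    + " | all=" + matchesAll(stop, query));
        }
    }
}
```

In this model NEW PROVIDENCE satisfies the "any" check because it shares the term "new", while only NEW YORK PENN STATION satisfies the "all" check.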
As for your questions:
Does Lucene work on long strings?
You have to define what a long string is. Are you actually looking for a phrase search? In general, yes, Lucene works with long strings.
Why doesn't Lucene match the exact string?
Because parsing (querystr + "*") generates individual term queries joined with the OR operator. Example: "new york*" is processed as: title:new OR title:york*
If you expect to match "New York station", the wildcard above is not what you want. This is because the StandardAnalyzer you used while indexing tokenizes "New York station" (splits it into terms), producing three terms.
Thus, the query "york*" finds "york station" only because it contains the term "york", not because the wildcard extends into "station": "york" and "station" are different terms, i.e. different entries in the index.
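To make the "different entries in the index" point concrete, here is a small Lucene-free sketch. PrefixPerTermSketch, tokenize and anyTermStartsWith are hypothetical names, and the tokenizer is a simplification (lowercase, split on whitespace) rather than the real StandardAnalyzer. It illustrates that a prefix is compared against each indexed term separately, so it can never reach across the gap between two terms.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model only: no Lucene classes are used here.
class PrefixPerTermSketch {

    // Assumption: lowercasing and splitting on whitespace approximates
    // how the stop names end up as separate terms in the index.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // A prefix such as "york" (from "york*") is tested against each
    // term on its own, never against the original full string.
    static boolean anyTermStartsWith(String doc, String prefix) {
        String needle = prefix.toLowerCase(Locale.ROOT);
        for (String term : tokenize(doc)) {
            if (term.startsWith(needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String stop = "NEW YORK PENN STATION";
        System.out.println(tokenize(stop));
        System.out.println(anyTermStartsWith(stop, "york"));
        System.out.println(anyTermStartsWith(stop, "york sta"));
    }
}
```

In this model the prefix "york" matches, but "york sta" does not, because no single term starts with a string that contains a space.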
What you really need is a PhraseQuery to find the exact string; for that, the query string should be "new york" WITH the quotes.
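As a rough illustration of what a phrase match means, the sketch below checks that the query terms occur in the document consecutively and in the same order, and shows one way to wrap user input in quotes before handing it to the query parser. This is a toy model under the same simplified tokenization assumption, not Lucene's implementation; PhraseSketch, phraseMatches and asQuotedPhrase are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model only: no Lucene classes are used here.
class PhraseSketch {

    // Assumption: lowercase and split on whitespace, as a stand-in
    // for the analyzer used at index and query time.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // A phrase matches when its terms appear next to each other,
    // in order, somewhere in the document's term sequence.
    static boolean phraseMatches(String doc, String phrase) {
        return Collections.indexOfSubList(tokenize(doc), tokenize(phrase)) >= 0;
    }

    // Builds the quoted query string, escaping backslashes and quotes
    // so the user's input cannot end the phrase early.
    static String asQuotedPhrase(String userInput) {
        String escaped = userInput.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(phraseMatches("NEW YORK PENN STATION", "new york penn station"));
        System.out.println(phraseMatches("NEWARK PENN STATION", "new york"));
        System.out.println(asQuotedPhrase("NEW YORK PENN STATION"));
    }
}
```

With the question's code, the string returned by asQuotedPhrase(querystr) would be what gets passed to parse(...) in place of querystr + "*".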