Lucene - exact string matching
I am trying to create a Lucene 4.10 index. I just want to store in the index the exact strings that I put into the document, without tokenizing them.
I am using StandardAnalyzer.
Directory dir = FSDirectory.open(new File("myDire"));
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);
iwc.setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
StringField field1 = new StringField("1", content1, Store.YES);
StringField field2 = new StringField("2", content2, Store.YES);
StringField field3 = new StringField("3", content3, Store.YES);
doc.add(field1);
doc.add(field2);
doc.add(field3);
writer.addDocument(doc, analyzer);
writer.close();
If I print out the contents of the index, I can see that my data is stored. For example, my document has this field "3":
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<3:"Fuel Tank Capacity"@en>
I am trying to query the index to get it back:
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("3", analyzer);
String queryString = "\"\"Fuel Tank Capacity\"@en\"";
Query query = parser.createPhraseQuery("3", QueryParser.escape(queryString));
TopDocs docs = searcher.search(query, null, 20);
I'm trying to search for the term "Fuel Tank Capacity"@en (including the quotes), so I tried to escape them, and I added another pair of quotes around the term so that Lucene treats the whole text as what I'm looking for.
If I print out the query, I get 3:"fuel tank capacity en", but I don't want the text to be split at the @ symbol.
I believe my first problem is StandardAnalyzer, because it seems to tokenize the text, if I'm not mistaken. However, I can't figure out how to query the index to get back exactly "Fuel Tank Capacity"@en (including the quotes).
To escape a quote (or any other special character) in a Lucene query, prefix it with a backslash (\). Don't forget that the backslash itself must also be escaped inside a Java string literal.
The following works for me:
Query q = new QueryParser(
        Version.LUCENE_4_10_0,
        "",
        new StandardAnalyzer(Version.LUCENE_4_10_0)
).parse("3:\"\\\"Fuel Tank Capacity\\\"@en\"");
How did I arrive at this?
- Took the original string:
"Fuel Tank Capacity"@en
- Added the escaping that Lucene requires (escaped each " with a backslash):
\"Fuel Tank Capacity\"@en
- Wrapped the whole string in quotes, at the beginning and the end, so that Lucene treats it as a single phrase:
"\"Fuel Tank Capacity\"@en"
- Added the escaping that a Java String literal requires (each backslash is doubled, and each double quote is escaped with a backslash):
\"\\\"Fuel Tank Capacity\\\"@en\"
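The escaping steps above can also be done in code rather than by hand. The following is only a rough sketch using plain Java string operations. The helpers `escapeQuotes` and `buildQuery` are hypothetical names made up for this illustration, not part of Lucene, and they handle only backslashes and double quotes (Lucene's own `QueryParser.escape` covers the full set of special characters).

```java
class EscapeSteps {
    // Hypothetical helper (not a Lucene API): escapes backslashes and
    // double quotes the way Lucene's classic query syntax expects.
    static String escapeQuotes(String s) {
        // Escape backslashes first so the backslashes added for quotes
        // are not doubled again.
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Hypothetical helper: wraps the escaped text in quotes so the
    // parser sees one phrase, and prefixes it with the field name.
    static String buildQuery(String field, String raw) {
        return field + ":\"" + escapeQuotes(raw) + "\"";
    }

    public static void main(String[] args) {
        // Runtime value of raw is: "Fuel Tank Capacity"@en
        String raw = "\"Fuel Tank Capacity\"@en";
        String query = buildQuery("3", raw);

        // Prints: 3:"\"Fuel Tank Capacity\"@en"
        System.out.println(query);

        // The hand-written Java literal from the answer denotes the same string.
        // Prints: true
        System.out.println(query.equals("3:\"\\\"Fuel Tank Capacity\\\"@en\""));
    }
}
```

The resulting string is what would be passed to `parse(...)`. Building it this way means only the Java-level escaping of the raw value has to be written by hand.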