Lucene 3.5 to 4.10 upgrade - how to handle Java API changes

I am currently upgrading my search engine application from Lucene 3.5.0 to 4.10.3. Version 4 introduces some significant API changes that break backward compatibility. I have managed to fix most of them, but there are a few remaining issues I could use some help with:

  • "cannot override the final method from the parser

The source code extended the Analyzer class and overrode tokenStream(...):

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    CharStream charStream = CharReader.get(reader);        
    return
        new LowerCaseFilter(version,
            new SeparationFilter(version,
                new WhitespaceTokenizer(version,
                    new HTMLStripFilter(charStream))));
}

      

But this method is now final, and I'm not sure how to interpret the following note from the changelog:

ReusableAnalyzerBase has been renamed to Analyzer. All Analyzer implementations should now use Analyzer.TokenStreamComponents and not override .tokenStream() and .reusableTokenStream() (which are now final).

There is another problem with the above method:

  1. "Method get (Reader) undefined for type CharReader"

Apparently there have been some significant changes here.

  1. "TermPositionVector could not be resolved for type"

This class is gone in Lucene 4. Are there any easy fixes for this? From the changelog:

The term vectors APIs (TermFreqVector, TermPositionVector, TermVectorMapper) have been removed in favor of the above flexible indexing APIs, presenting a single-document inverted index of the document from the term vectors.

Probably related to this:

  1. "The getTermFreqVector (int, String) method is undefined for type IndexReader."

Both problems arise here, for example:

TermPositionVector termVector = (TermPositionVector) reader.getTermFreqVector(...);

      

("reader" is of type IndexReader)

I would be grateful for any help in solving these problems.





1 answer


I found an answer to your question from lead developer Uwe Schindler on the Lucene mailing list. It took me a while to wrap my head around the new API, so I need to write some things down before I forget.

These notes are for Lucene 4.10.3.

Analyzer Implementation (1-2)

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(new HTMLStripCharFilter(reader));
        TokenStream sink = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, sink);
    }
};

      

  • The TokenStreamComponents constructor takes a source and a sink. The sink is the final result of the token stream returned by Analyzer.tokenStream(), so make it the end of your filter chain. The source is the tokenizer, before any filters are applied.
  • HTMLStripCharFilter, despite its name, is actually a subclass of java.io.Reader that removes HTML constructs, so you no longer need CharReader.
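
To sanity-check the chain, you can pull tokens out of the analyzer by hand. The following is only a sketch: it assumes the analyzer above is assigned to a variable named analyzer, the field name and sample input are arbitrary, and it has to run inside a method that declares throws IOException.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch only: "analyzer" is the Analyzer built above; the field name is arbitrary.
try (TokenStream ts = analyzer.tokenStream("text", new StringReader("<b>Hello</b> World"))) {
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();                                 // must be called before the first incrementToken()
    while (ts.incrementToken()) {
        System.out.println(termAtt.toString()); // prints "hello" and "world"
    }
    ts.end();                                   // must be called after the last incrementToken()
}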

Term vectors (3-4)



In Lucene 4 term vectors work differently, so there are no simple method substitutions. The specific answer depends on your requirements.

If you want position information, you need to index your fields with positions enabled in the first place:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;

Document doc = new Document();
FieldType f = new FieldType();
f.setIndexed(true);
f.setStoreTermVectors(true);
f.setStoreTermVectorPositions(true);
doc.add(new Field("text", "hello", f));
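
For context, here is a hedged sketch of how such a document might be written to an index and opened for reading afterwards. The RAMDirectory, the StandardAnalyzer and the variable names are placeholder choices, not something prescribed by the original answer, and the code again assumes a method that declares throws IOException.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Sketch only: any Directory/Analyzer combination will do; RAMDirectory keeps it self-contained.
Directory dir = new RAMDirectory();
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_4_10_3, new StandardAnalyzer());
try (IndexWriter writer = new IndexWriter(dir, cfg)) {
    writer.addDocument(doc);                    // "doc" is the Document built above
}
IndexReader ir = DirectoryReader.open(dir);     // "ir" is used by the snippet below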

      

Finally, to get frequency and position information for a field of a document, you walk the new API like this (adapted from this answer):

import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

// IndexReader ir;
// int docID = 0;
Terms terms = ir.getTermVector(docID, "text");  // null if no term vector was stored for this field
terms.hasPositions(); // should be true if you set the field to store positions
TermsEnum termsEnum = terms.iterator(null);
BytesRef term = null;
// Explore the terms for this field
while ((term = termsEnum.next()) != null) {
    // Enumerate through documents, in this case only one
    DocsAndPositionsEnum docsEnum = termsEnum.docsAndPositions(null, null);
    int docIdEnum;
    while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        for (int i = 0; i < docsEnum.freq(); i++) {
            System.out.println(term.utf8ToString() + " " + docIdEnum + " "
                    + docsEnum.nextPosition());
        }
    }
}
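
If you only need per-term frequencies for the document (the old TermFreqVector use case) and not positions, a plain DocsEnum is enough. Again just a sketch, reusing the terms variable from the snippet above:

import org.apache.lucene.index.DocsEnum;

// Sketch only: frequencies without positions, iterating over the same "terms" as above.
TermsEnum te = terms.iterator(null);
BytesRef t;
while ((t = te.next()) != null) {
    DocsEnum de = te.docs(null, null);                    // no positions requested
    if (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {  // a term vector contains exactly one document
        System.out.println(t.utf8ToString() + " occurs " + de.freq() + " time(s)");
    }
}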

      

It would be nice if Terms.iterator() returned an actual Iterable.









