How do I properly implement a delegating Tokenizer in Lucene 4.x?

The naive approach suggested by the documentation in the "Creating Delegates" section does not work as expected: it ends in a TokenStream contract violation coming from the delegated Tokenizer. This wrapper:

private static class TokenizerWrapper extends Tokenizer {
  public TokenizerWrapper(Reader _input) {
    super(_input);
    // 'input' is the protected Reader field that super(_input) just initialized
    delegate = new WhitespaceTokenizer(input);
  }

  @Override
  public void reset() throws IOException {
    logger.info("TokenizerWrapper.reset()");
    super.reset();
    delegate.setReader(input);
    delegate.reset();
  }

  @Override
  public final boolean incrementToken() throws IOException {
    logger.info("TokenizerWrapper.incrementToken()");
    return delegate.incrementToken();
  }

  // note: close() is not overridden, so the delegate is never closed
  private final WhitespaceTokenizer delegate;
}


gives me the following log:

14:30:12.885 [main] INFO  test.GapTest - TokenizerWrapper.reset()
14:30:12.886 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.889 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.889 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.897 [main] INFO  test.GapTest - TokenizerWrapper.reset()
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
    at test.GapTest$TestTokenizer.reset(GapTest.java:152)
    at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
    at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:599)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1231)
    at test.GapTest.main(GapTest.java:67)


Overriding the close() method as follows:

  @Override
  public void close() throws IOException {
    logger.info("TokenizerWrapper.close()");
    super.close();
    logger.info("TokenizerWrapper.delegate.close()");
    delegate.close();
    // delegate.setReader(input);
  }


does not help either; it just fails with a different error:

15:36:49.561 [main] INFO  test.GapTest - setting field "text" to "some text"
15:36:49.569 [main] INFO  test.GapTest - Adding created document to the index
15:36:49.605 [main] INFO  test.GapTest - createComponents()
15:36:49.633 [main] INFO  test.GapTest - TokenizerWrapper(_input)
15:36:49.638 [main] INFO  test.GapTest - TokenizerWrapper.reset()
15:36:49.639 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.640 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.640 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.641 [main] INFO  test.GapTest - TokenizerWrapper.close()
15:36:49.641 [main] INFO  test.GapTest - TokenizerWrapper.delegate.close()
15:36:49.648 [main] INFO  test.GapTest - setting field "text" to "some text 1"
15:36:49.648 [main] INFO  test.GapTest - Adding created document to the index
15:36:49.648 [main] INFO  test.GapTest - TokenizerWrapper.reset()
15:36:49.648 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.649 [main] INFO  test.GapTest - TokenizerWrapper.close()
15:36:49.649 [main] INFO  test.GapTest - TokenizerWrapper.delegate.close()
Exception in thread "main" java.lang.IllegalArgumentException: first position increment must be > 0 (got 0) for field 'address'
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:617)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1231)
    at test.GapTest.main(GapTest.java:72)

In other words:

  • it successfully processed the first document (with "some text" in the "text" field),
  • then started processing the second document ("some text 1"),
  • [seemingly] successfully processed the first token (the word "some", I checked it in the debugger),
  • and then broke on an inconsistent internal state: invertState.posIncrAttribute.getPositionIncrement() inside DefaultIndexingChain.PerField.invert(IndexableField field, boolean first) returned 0, while the "normal" behavior would be to return 1.

Of course, I could handle this particular error with further wrapping and workarounds, but most likely I am going about this seemingly simple task in the wrong way. Please advise.

2 answers


In my project, I created an abstract class that solves exactly this problem. The critical places are, of course, the incrementToken(), reset(), close(), and end() methods. Feel free to use some of these bits, or all of them.



import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import com.google.common.collect.Iterators;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

import static vyre.util.search.LuceneVersion.VERSION_IN_USE;

/**
 * Allows easy manipulation of {@link ClassicTokenizer} by delegating calls to it while hiding all implementation details.
 *
 * @author Mindaugas Žakšauskas
 */
public abstract class ClassicTokenizerDelegate extends Tokenizer {

    private final ClassicTokenizer classicTokenizer;

    private final CharTermAttribute termAtt;

    private final TypeAttribute typeAtt;

    /**
     * Internal buffer of tokens if any of standard tokens was split into many.
     */
    private Iterator<String> pendingTokens = Iterators.emptyIterator();

    protected ClassicTokenizerDelegate(Reader input) {
        super(input);
        this.classicTokenizer = new ClassicTokenizer(VERSION_IN_USE, input);
        termAtt = addAttribute(CharTermAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
    }

    /**
     * Is called during tokenization for each token produced by {@link ClassicTokenizer}. Subclasses can call {@link #setTerm(String)} to override
     * current token or {@link #setTerms(Iterator)} if current token needs to be split into more than one token.
     *
     * @return true if a next token exists, false otherwise.
     * @see #getTerm()
     * @see #getType()
     * @see #setTerm(String)
     * @see #setTerms(Iterator)
     */
    protected abstract boolean onNextToken();

    /**
     * Subclasses can call this method during execution of {@link #onNextToken()} to retrieve current term.
     *
     * @return current term.
     * @see #getType()
     * @see #setTerm(String)
     * @see #setTerms(Iterator)
     * @see #onNextToken()
     */
    protected String getTerm() {
        return new String(termAtt.buffer(), 0, termAtt.length());
    }

    /**
     * Subclasses can call this method during execution of {@link #onNextToken()} to retrieve type of current term.
     *
     * @return type of current term.
     * @see #getTerm()
     * @see #setTerm(String)
     * @see #setTerms(Iterator)
     * @see #onNextToken()
     */
    protected String getType() {
        return typeAtt.type();
    }

    /**
     * Subclasses can call this method during execution of {@link #onNextToken()} to override current term.
     *
     * @param term the term to override with.
     * @see #getTerm()
     * @see #getType()
     * @see #setTerms(Iterator) setTerms(Iterator) - if you want to override current term with more than one term
     * @see #onNextToken()
     */
    protected void setTerm(String term) {
        termAtt.copyBuffer(term.toCharArray(), 0, term.length());
    }

    /**
     * Subclasses can call this method during execution of {@link #onNextToken()} to override current term with more than one term.
     *
     * @param terms the terms to override with.
     * @see #getTerm()
     * @see #getType()
     * @see #setTerm(String)
     * @see #onNextToken()
     */
    protected void setTerms(Iterator<String> terms) {
        setTerm(terms.next());
        pendingTokens = terms;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (pendingTokens.hasNext()) {
            setTerm(pendingTokens.next());
            return true;
        }

        clearAttributes();
        if (!classicTokenizer.incrementToken()) {
            return false;
        }

        typeAtt.setType(classicTokenizer.getAttribute(TypeAttribute.class).type());        // copy type attribute from classic tokenizer attribute

        CharTermAttribute stTermAtt = classicTokenizer.getAttribute(CharTermAttribute.class);
        setTerm(new String(stTermAtt.buffer(), 0, stTermAtt.length()));

        return onNextToken();
    }

    @Override
    public void close() throws IOException {
        super.close();
        if (input != null) {
            input.close();
        }
        classicTokenizer.close();
    }

    @Override
    public void end() throws IOException {
        super.end();
        classicTokenizer.end();
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        this.classicTokenizer.setReader(input);        // important! input has to be carried over to delegate because of poor design of Lucene
        classicTokenizer.reset();
    }
}
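
To illustrate how the hooks fit together, here is a hypothetical subclass (the class name and the hyphen-splitting rule are made up for this illustration, they are not part of the original answer). Everything else, the reset/close/end handling and the pending-token buffer, is inherited from ClassicTokenizerDelegate.

import java.io.Reader;
import java.util.Arrays;

/**
 * Hypothetical example: splits hyphenated tokens ("foo-bar" -> "foo", "bar")
 * and passes every other token produced by ClassicTokenizer through unchanged.
 */
public class HyphenSplittingTokenizer extends ClassicTokenizerDelegate {

    public HyphenSplittingTokenizer(Reader input) {
        super(input);
    }

    @Override
    protected boolean onNextToken() {
        String term = getTerm();
        if (term.indexOf('-') >= 0) {
            // Replace the current token with its hyphen-separated parts;
            // the remaining parts are buffered and emitted on subsequent calls.
            setTerms(Arrays.asList(term.split("-")).iterator());
        }
        return true; // keep the (possibly rewritten) token
    }
}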



I think it is worth stating this explicitly: TokenizerWrapper and delegate do not share a set of attributes. So even though indexing the first document seems to work, it does not; nothing actually goes into the index. To make the delegation meaningful, you need to mirror (wholly or partially) the delegate's attributes in TokenizerWrapper, as @mindas does, for example, in setTerm().

Or maybe I'm wrong and there is some "magic mechanism" that would let TokenizerWrapper reuse delegate's attributes as its own?
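
For what it's worth, a minimal sketch of such mirroring inside the wrapper from the question could look like this. It assumes the delegate field from the question and the usual attribute classes from org.apache.lucene.analysis.tokenattributes are imported, and it copies only the term, offset and (when the delegate exposes it) position-increment attributes:

// Sketch only: declare the wrapper's own attributes...
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

@Override
public final boolean incrementToken() throws IOException {
    clearAttributes(); // also resets the position increment to its default of 1
    if (!delegate.incrementToken()) {
        return false;
    }
    // ...and copy the delegate's values into them for every token.
    CharTermAttribute dTerm = delegate.getAttribute(CharTermAttribute.class);
    termAtt.copyBuffer(dTerm.buffer(), 0, dTerm.length());

    OffsetAttribute dOffset = delegate.getAttribute(OffsetAttribute.class);
    offsetAtt.setOffset(dOffset.startOffset(), dOffset.endOffset());

    if (delegate.hasAttribute(PositionIncrementAttribute.class)) {
        posIncrAtt.setPositionIncrement(
                delegate.getAttribute(PositionIncrementAttribute.class).getPositionIncrement());
    }
    return true;
}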
