Custom Solr TokenFilter lemmatizer

I'm trying to write a simple Solr lemmatizer to use on a field type, but I can't seem to find any information on how to write the TokenFilter, so I'm lost. Here is the code I have.

import java.io.IOException;
import java.util.List;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class FooFilter extends TokenFilter {

    private static final Logger log = LoggerFactory.getLogger(FooFilter.class);
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);

    public FooFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }

        // buffer() can hold stale characters past length(), so read the term via toString().
        List<String> allForms = Lemmatize.getAllForms(termAtt.toString());
        if (allForms.size() > 0) {
            for (String word : allForms) {
                // Now what?
            }
        }

        return true;
    }
}

      



2 answers


You then want to either replace the current token with word, or append word as an additional token, in both cases through termAtt.

Example of replacement semantics:

termAtt.setEmpty();
termAtt.copyBuffer(word.toCharArray(), 0, word.length());

      



Example of semantics for adding new tokens:

For every token you want to add, the CharTermAttribute must be set and incrementToken must return true.

private List<String> extraTokens = ...
@Override
public boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    // ...
    return true;
  } else if (!extraTokens.isEmpty()) {
    // Upstream is exhausted: emit one queued token per call.
    termAtt.setEmpty().append(extraTokens.remove(0));
    return true;
  }
  return false;
}
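For a lemmatizer you will probably want each extra form to follow its original token directly and share its position, rather than trail at the end of the stream. The control flow for that can be sketched without Lucene so it runs standalone. Everything here is a hypothetical stand-in: the `term` and `positionIncrement` fields play the role of CharTermAttribute and PositionIncrementAttribute, the `Iterator` plays the wrapped TokenStream, and the `Function` plays `Lemmatize.getAllForms`. In a real filter you would most likely also save the original token's attributes with captureState() and bring them back with restoreState() before writing each queued form.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Lucene-free model of a filter that injects extra forms (all names are hypothetical). */
final class LemmaInjectionModel {
    // Stand-ins for CharTermAttribute and PositionIncrementAttribute.
    String term;
    int positionIncrement;

    private final Iterator<String> input;                    // stand-in for the wrapped TokenStream
    private final Function<String, List<String>> lemmatizer; // stand-in for Lemmatize.getAllForms
    private final Deque<String> pending = new ArrayDeque<>();

    LemmaInjectionModel(Iterator<String> input, Function<String, List<String>> lemmatizer) {
        this.input = input;
        this.lemmatizer = lemmatizer;
    }

    /** Mirrors the incrementToken() contract: exactly one token per call. */
    boolean incrementToken() {
        if (!pending.isEmpty()) {
            // Emit a queued form at the same position as the original token.
            term = pending.removeFirst();
            positionIncrement = 0;
            return true;
        }
        if (!input.hasNext()) {
            return false; // upstream exhausted and nothing queued
        }
        term = input.next();
        positionIncrement = 1;
        for (String form : lemmatizer.apply(term)) {
            if (!form.equals(term)) {
                pending.addLast(form); // queue the other forms for later calls
            }
        }
        return true;
    }

    /** Drains the model into "term/increment" strings for inspection. */
    static List<String> run(List<String> tokens, Function<String, List<String>> lemmatizer) {
        LemmaInjectionModel filter = new LemmaInjectionModel(tokens.iterator(), lemmatizer);
        List<String> out = new ArrayList<>();
        while (filter.incrementToken()) {
            out.add(filter.term + "/" + filter.positionIncrement);
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy lemmatizer: only "ran" has an extra form.
        Function<String, List<String>> toy = t -> t.equals("ran")
            ? Arrays.<String>asList("ran", "run")
            : Collections.<String>emptyList();
        System.out.println(run(Arrays.asList("she", "ran", "home"), toy));
        // prints [she/1, ran/1, run/0, home/1]
    }
}
```

The queue is checked before the input is advanced, so "run" comes out right after "ran" with a position increment of 0. That is the same shape synonym-style filters use, and it keeps phrase queries working against either form.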

      



It is open source, so if in doubt, read the code. A significant number of classes extend TokenFilter; read a couple and it will be much clearer what happens.


