Custom Solr TokenFilter lemmatizer
I'm trying to write a simple Solr lemmatizer to use on a field type, but I can't seem to find any information on how to write the TokenFilter, so I'm lost. Here is the code I have.
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class FooFilter extends TokenFilter {
    private static final Logger log = LoggerFactory.getLogger(FooFilter.class);

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);

    public FooFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // termAtt.buffer() can be longer than the term itself, so read the
        // term through toString(), which returns only the valid characters.
        List<String> allForms = Lemmatize.getAllForms(termAtt.toString());
        if (allForms.size() > 0) {
            for (String word : allForms) {
                // Now what?
            }
        }
        return true;
    }
}
Then you want to either replace the current token (termAtt) with word, or add word as an extra token.
An example with replacement semantics:
termAtt.setEmpty();
termAtt.copyBuffer(word.toCharArray(), 0, word.length());
An example with semantics for adding new tokens:
For every token you want to add, you must set the CharTermAttribute, and incrementToken must return true.
private List<String> extraTokens = ...

@Override
public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
        // ...
        return true;
    } else if (!extraTokens.isEmpty()) {
        // Input is exhausted: emit one queued token per call and return true.
        clearAttributes();
        termAtt.setEmpty().append(extraTokens.remove(0));
        return true;
    }
    return false;
}
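If the extra forms should sit next to the token they came from rather than at the end of the stream, one common variant is to drain the queue before pulling the next input token. The sketch below only models that control flow in plain Java so it can run without Lucene on the classpath: `StackedFormsSketch`, its `next()` and `run()` methods, and the `forms` map (standing in for `Lemmatize.getAllForms`) are all made up for illustration. In a real filter the same shape would presumably use `captureState()`/`restoreState()` to carry over the original token's attributes and `posAtt.setPositionIncrement(0)` for each extra form.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Library-free model of the buffering logic; it does not use any Lucene classes.
class StackedFormsSketch {
    private final Iterator<String> input;           // stands in for the wrapped TokenStream
    private final Map<String, List<String>> forms;  // stands in for Lemmatize.getAllForms
    private final Deque<String> pending = new ArrayDeque<>();

    StackedFormsSketch(Iterator<String> input, Map<String, List<String>> forms) {
        this.input = input;
        this.forms = forms;
    }

    // Returns "term(+increment)" for the next token, or null at end of input.
    String next() {
        if (!pending.isEmpty()) {
            // Extra form: increment 0 stacks it on the position of the original token.
            return pending.poll() + "(+0)";
        }
        if (!input.hasNext()) {
            return null;
        }
        String term = input.next();
        List<String> all = forms.getOrDefault(term, Collections.<String>emptyList());
        for (String form : all) {
            if (!form.equals(term)) {
                pending.add(form); // queue every form other than the token itself
            }
        }
        // The original token advances the position by one, as usual.
        return term + "(+1)";
    }

    // Runs the model over a token list and collects everything it emits.
    static List<String> run(List<String> tokens, Map<String, List<String>> forms) {
        StackedFormsSketch filter = new StackedFormsSketch(tokens.iterator(), forms);
        List<String> out = new ArrayList<>();
        for (String t = filter.next(); t != null; t = filter.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> forms = new HashMap<>();
        forms.put("ran", Arrays.asList("run", "ran"));
        System.out.println(run(Arrays.asList("she", "ran", "home"), forms));
        // prints [she(+1), ran(+1), run(+0), home(+1)]
    }
}
```

Checking the queue first means each call still emits exactly one token, which is the contract incrementToken has to keep, while the zero increment lets a query for either form match at the same position.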
It is open source. If in doubt, read the code. Many classes extend TokenFilter; read a couple of them and it will be much clearer what is going on.