Lucene stop-phrase filter

I'm trying to write a filter for Lucene similar to StopWordsFilter (that is, extending TokenFilter), but I need to remove phrases (sequences of tokens) instead of individual words.

Stop phrases are represented as sequences of tokens; punctuation is not considered.
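
For illustration (the phrase values below are invented, not from the original question), such a set might simply hold each stop phrase as one space-joined string of its tokens, together with the length of the longest phrase:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical example values; phrases are stored as
    // space-joined, already-tokenized strings.
    Set<String> stopPhrases = new HashSet<>(Arrays.asList(
            "as a matter of fact",
            "in other words"));
    int maxStopPhraseTokens = 5; // longest phrase, counted in tokens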

I think I need to do some buffering of the tokens in the token stream and when the full phrase is matched, I discard all the tokens in the buffer.

What would be the best approach to implementing such a stop-phrase filter given a stream of words such as Lucene's TokenStream?

+2


2 answers


In this thread I was given a solution: use Lucene's CachingTokenFilter as a starting point.

That solution turned out to be the right approach.

EDIT: the original link is dead, so here is a transcript of the thread.

MY QUESTION:

I am trying to implement a "stop phrase filter" with the new TokenStream API.

I would like to be able to look ahead N tokens, check whether the current token plus the N subsequent tokens match a "stop phrase" (the set of stop phrases is stored in a HashSet), and then either discard all of those tokens if they match a stop phrase, or keep them all if they don't.



To do this, I would like to use captureState() and then restoreState() to get back to the original point in the stream.

I have tried many combinations of these APIs. My latest attempt is the code below, which doesn't work.

    static private HashSet<String> m_stop_phrases = new HashSet<String>();
    static private int m_max_stop_phrase_length = 0;
...
    public final boolean incrementToken() throws IOException {
        if (!input.incrementToken())
            return false;
        Stack<State> stateStack = new Stack<State>();
        StringBuilder match_string_builder = new StringBuilder();
        int skippedPositions = 0;
        boolean is_next_token = true;
        while (is_next_token && match_string_builder.length() < m_max_stop_phrase_length) {
            if (match_string_builder.length() > 0)
                match_string_builder.append(" ");
            match_string_builder.append(termAtt.term());
            skippedPositions += posIncrAtt.getPositionIncrement();
            // Capture each consumed token's state so we can
            // (supposedly) restore it later if no stop phrase matches.
            stateStack.push(captureState());
            is_next_token = input.incrementToken();
            if (m_stop_phrases.contains(match_string_builder.toString())) {
                // Stop phrase found: skip the matched tokens
                // without restoring their states.
                posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
                return is_next_token;
            }
        }
        // No stop phrase found: try to restore the stream
        while (!stateStack.empty())
            restoreState(stateStack.pop());
        return true;
    }

Which direction should I look in to implement my stop-phrase filter?

ACCEPTED ANSWER:

restoreState() restores the contents of the token attributes, not the full stream, so you cannot roll back the token stream (that was also not possible with the old API). This is why the while loop at the end of your code doesn't work the way you expect. You can use a CachingTokenFilter, which can be reset and consumed again, as the source for further work.
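
A minimal sketch of the buffering approach this answer points at (the class name StopPhraseFilter, the space-joined phrase representation, and the maxPhraseTokens parameter are my own assumptions, not code from the thread): rather than trying to rewind the stream, the filter keeps its own lookahead window of captured states together with their term texts, and replays the states itself on later calls to incrementToken().

    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Set;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class StopPhraseFilter extends TokenFilter {

        // A captured token state plus its term text, so phrases can be
        // checked without restoring each state first.
        private static final class Buffered {
            final State state;
            final String term;
            Buffered(State state, String term) { this.state = state; this.term = term; }
        }

        private final Set<String> stopPhrases;  // space-joined token strings
        private final int maxPhraseTokens;      // longest stop phrase, in tokens
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final Deque<Buffered> lookahead = new ArrayDeque<>();
        private boolean inputExhausted = false;

        public StopPhraseFilter(TokenStream in, Set<String> stopPhrases, int maxPhraseTokens) {
            super(in);
            this.stopPhrases = stopPhrases;
            this.maxPhraseTokens = maxPhraseTokens;
        }

        @Override
        public boolean incrementToken() throws IOException {
            while (true) {
                // Top up the lookahead window to the longest phrase length.
                while (!inputExhausted && lookahead.size() < maxPhraseTokens) {
                    if (input.incrementToken()) {
                        lookahead.addLast(new Buffered(captureState(), termAtt.toString()));
                    } else {
                        inputExhausted = true;
                    }
                }
                if (lookahead.isEmpty())
                    return false;

                // Find the longest window prefix that is a stop phrase.
                StringBuilder phrase = new StringBuilder();
                int length = 0, matchLength = 0;
                for (Buffered b : lookahead) {
                    if (length++ > 0) phrase.append(' ');
                    phrase.append(b.term);
                    if (stopPhrases.contains(phrase.toString()))
                        matchLength = length;
                }
                if (matchLength > 0) {
                    // Drop the matched tokens and rescan from the next position.
                    // (For brevity this sketch does not accumulate the skipped
                    // position increments into the following token.)
                    for (int i = 0; i < matchLength; i++)
                        lookahead.pollFirst();
                    continue;
                }

                // No stop phrase starts here: emit the first buffered token.
                restoreState(lookahead.pollFirst().state);
                return true;
            }
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            lookahead.clear();
            inputExhausted = false;
        }
    }

Alternatively, a CachingTokenFilter can be wrapped around the input so the whole stream is consumed once to locate phrase matches, then reset and replayed for a second, filtering pass.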

+1


You will really need to write your own parser, I would think, because whether or not a given sequence of words is a "phrase" depends on cues such as punctuation that are no longer available after tokenization. For example, the tokens "new york city" can span a sentence boundary ("... in New York. City officials ..."), in which case they should not be treated as a phrase.



0

