Posted to java-user@lucene.apache.org by Enrico Detoma <en...@gmail.com> on 2009/10/08 15:42:22 UTC

Lucene 2.9.0 [PROBLEM] : TokenStream API (incrementToken / captureState / restoreState), cannot implement a "stop phrases filter"

Hi all,

I'm trying to implement a "stop phrases filter" with the new TokenStream
API.

I would like to be able to peek N tokens ahead and see whether the current
token plus the N subsequent tokens match a "stop phrase" (the set of stop
phrases is stored in a HashSet), then discard all of these tokens when they
match a stop phrase, or keep them all if they don't.

For this purpose I would like to use captureState() and then restoreState()
to get back to the starting point of the stream.

I tried many combinations of these APIs. My last attempt is in the code
below, which doesn't work.



    static private HashSet<String> m_stop_phrases = new HashSet<String>();
    static private int m_max_stop_phrase_length = 0;
...
    public final boolean incrementToken() throws IOException {
        if (!input.incrementToken())
            return false;
        Stack<State> stateStack = new Stack<State>();
        StringBuilder match_string_builder = new StringBuilder();
        int skippedPositions = 0;
        boolean is_next_token = true;
        while (is_next_token
                && match_string_builder.length() < m_max_stop_phrase_length) {
            if (match_string_builder.length() > 0)
                match_string_builder.append(" ");
            match_string_builder.append(termAtt.term());
            skippedPositions += posIncrAtt.getPositionIncrement();
            stateStack.push(captureState());
            is_next_token = input.incrementToken();
            if (m_stop_phrases.contains(match_string_builder.toString())) {
              // Stop phrase is found: skip the number of tokens
              // without restoring the state

              posIncrAtt.setPositionIncrement(
                  posIncrAtt.getPositionIncrement() + skippedPositions);
              return is_next_token;
            }
        }
        // No stop phrase found: restore the stream
        while (!stateStack.empty())
            restoreState(stateStack.pop());
        return true;
    }


What is the right direction to take to implement my "stop phrases"
filter?

Thank you
Regards
Enrico

Re: Lucene 2.9.0 [PROBLEM] : TokenStream API (incrementToken / captureState / restoreState), cannot implement a "stop phrases filter"

Posted by Enrico Detoma <en...@gmail.com>.
Thank you.
Starting from CachingTokenFilter was indeed the correct way to proceed.

Regards
Enrico



RE: Lucene 2.9.0 [PROBLEM] : TokenStream API (incrementToken / captureState / restoreState), cannot implement a "stop phrases filter"

Posted by Uwe Schindler <uw...@thetaphi.de>.
restoreState only restores the token contents, not the complete stream. So
you cannot roll back the token stream (and this was also not possible with
the old API). The while loop at the end of your code does not work as you
expect because of this. You may use CachingTokenFilter, which can be reset
and consumed again, as a source for further work.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
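
To illustrate the point above: since the stream itself cannot be rewound,
a lookahead filter has to keep the tokens it has already pulled in its own
buffer and emit them from there. The following is only a minimal sketch of
that buffering idea in plain Java. It does not use the Lucene API at all:
tokens are modelled as plain strings, the class name StopPhraseSketch and
its methods are hypothetical, and position increments are ignored. In a
real TokenFilter the buffer would presumably hold State objects obtained
from captureState(), with restoreState() called on each one as it is
emitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical, Lucene-free model of a "stop phrases" filter.
public class StopPhraseSketch {

    /** Returns the input tokens with every stop phrase removed. */
    static List<String> filter(List<String> input, Set<String> stopPhrases) {
        // The longest stop phrase (counted in tokens) bounds the lookahead.
        int lookahead = 1;
        for (String phrase : stopPhrases) {
            lookahead = Math.max(lookahead, phrase.split(" ").length);
        }

        Iterator<String> source = input.iterator(); // forward-only, like a TokenStream
        Deque<String> buffer = new ArrayDeque<>();  // tokens pulled but not yet emitted
        List<String> output = new ArrayList<>();

        while (true) {
            // Pull tokens until the buffer is full or the source is exhausted.
            while (buffer.size() < lookahead && source.hasNext()) {
                buffer.addLast(source.next());
            }
            if (buffer.isEmpty()) {
                break;
            }
            int matched = longestMatch(buffer, stopPhrases);
            if (matched > 0) {
                // A stop phrase starts at the head of the buffer: drop it.
                for (int i = 0; i < matched; i++) {
                    buffer.removeFirst();
                }
            } else {
                // No stop phrase starts here: emit one token and slide on.
                output.add(buffer.removeFirst());
            }
        }
        return output;
    }

    /** Token count of the longest stop phrase starting at the buffer head, or 0. */
    static int longestMatch(Deque<String> buffer, Set<String> stopPhrases) {
        StringBuilder candidate = new StringBuilder();
        int best = 0;
        int count = 0;
        for (String token : buffer) {
            if (count > 0) {
                candidate.append(' ');
            }
            candidate.append(token);
            count++;
            if (stopPhrases.contains(candidate.toString())) {
                best = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> stopPhrases = new HashSet<>(Arrays.asList("as a matter of fact"));
        List<String> tokens = Arrays.asList(
                "the", "quick", "as", "a", "matter", "of", "fact", "fox");
        System.out.println(filter(tokens, stopPhrases));
    }
}
```

Because the unmatched tokens stay in the buffer, nothing ever needs to be
"rolled back": a token is only removed from the buffer once it is either
emitted or known to be part of a stop phrase.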



