Posted to java-user@lucene.apache.org by Daniel Shane <sh...@LEXUM.UMontreal.CA> on 2009/09/02 00:01:43 UTC

Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Hi all!

I'm trying to port my Lucene code to the new TokenStream API and I have 
a filter that I cannot seem to port using the current new API.

The filter is called LookaheadTokenFilter. It behaves exactly like a 
normal token filter, except, you can call peek() and get information on 
the next token in the stream.

Since Lucene does not support stream "rewinding", we did this by 
buffering tokens when peek() was called and returning them when 
next() was called; once no "peeked" tokens remain, we call 
super.next().

Now I'm looking at the new API and I'm really stuck on how to port 
this using incrementToken...

Am I missing something? Is there an object I can get from the 
TokenStream that I can save and later read all the attributes from?

Here is the code I'm trying to port :

public class LookaheadTokenFilter extends TokenFilter {
    /** List of tokens that were peeked but not returned with next. */
    LinkedList<Token> peekedTokens = new LinkedList<Token>();

    /** The index in peekedTokens of the next token that peek() will return. */
    int peekPosition = 0;

    public LookaheadTokenFilter(TokenStream input) {
        super(input);
    }
   
    public Token peek() throws IOException {
        if (this.peekPosition >= this.peekedTokens.size()) {
            Token token = new Token();
            token = this.input.next(token);
            if (token != null) {
                this.peekedTokens.add(token);
                this.peekPosition = this.peekedTokens.size();
            }
            return token;
        }

        return this.peekedTokens.get(this.peekPosition++);
    }
 
    public void reset() { this.peekPosition = 0; }

    public Token next(Token token) throws IOException {
        reset();

        if (this.peekedTokens.size() > 0) {
            return this.peekedTokens.removeFirst();
        }
           
        return this.input.next(token);       
    }
}

Let me know if anyone has an idea,
Daniel Shane

RE: TokenStream API, Quick Question.

Posted by Uwe Schindler <uw...@thetaphi.de>.
The indexer calls getAttribute/addAttribute only once, after initialization
(see the docs), and never again later. If you cache tokens, you therefore
always have to restore their state into the TokenStream's attributes.
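
A minimal sketch of why this matters, in plain Java rather than against
Lucene. TermHolder, SharedAttributeStream and SharedAttributeDemo are
hypothetical stand-ins, not Lucene classes. The consumer fetches the holder
once and afterwards only reads it, so a cached value becomes visible only
when it is copied back into that same instance.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a term attribute: one mutable holder shared by producer and consumer. */
class TermHolder {
    String term = "";
}

/** Hypothetical stand-in for a token stream that exposes a single attribute instance. */
class SharedAttributeStream {
    private final TermHolder shared = new TermHolder();
    private final String[] words;
    private int next = 0;

    SharedAttributeStream(String... words) {
        this.words = words;
    }

    /** Meant to be called once by the consumer, up front. */
    TermHolder getTermHolder() {
        return shared;
    }

    /** Advances by overwriting the shared holder in place; returns false at end of stream. */
    boolean incrementToken() {
        if (next >= words.length) return false;
        shared.term = words[next++];
        return true;
    }

    /** Replays a cached value by copying it into the shared holder, never by swapping the holder. */
    void restore(String cachedTerm) {
        shared.term = cachedTerm;
    }
}

class SharedAttributeDemo {
    /** Consumes the stream as described above: fetch the holder once, then only read it. */
    static List<String> consume(SharedAttributeStream stream) {
        TermHolder holder = stream.getTermHolder(); // fetched exactly once
        List<String> seen = new ArrayList<>();
        while (stream.incrementToken()) {
            seen.add(holder.term);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(consume(new SharedAttributeStream("S", "C", "R")));
    }
}
```

If restore() handed out a fresh TermHolder instead of writing into the shared
one, the consumer above would keep reading the old instance and never see the
cached token.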

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Daniel Shane [mailto:shaned@LEXUM.UMontreal.CA]
> Sent: Thursday, September 03, 2009 8:55 PM
> To: java-user@lucene.apache.org
> Subject: TokenStream API, Quick Question.
> 
> Does a TokenStream have to return always the same number of attributes
> with the same underlying classes for all the tokens it generates?
> 
> I mean, during the tokenization phase, can the first "token" have a Term
> and Offset Attribute and the second "token" only a Type Attribute or
> does this mean that the first token has to have an empty Type attribute
> as well?
> 
> I'm just not sure,
> Daniel Shane
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org





TokenStream API, Quick Question.

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Does a TokenStream always have to return the same number of attributes, 
with the same underlying classes, for all the tokens it generates?

I mean, during the tokenization phase, can the first "token" have a Term 
and an Offset attribute and the second "token" only a Type attribute, or 
does the first token have to have an (empty) Type attribute 
as well?

I'm just not sure,
Daniel Shane



Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
OK, I got it: judging from other filters, I should call 
input.incrementToken() instead of super.incrementToken().

Don't you feel this kind of breaks the object model? 
(super.incrementToken() should also work.)

Maybe once the old API is gone, we can stop checking whether someone has 
overridden next() or incrementToken()?

Daniel S.





Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Jamie <ja...@stimulussoft.com>.
Hi there,

In the absence of documentation, I am trying to convert an EmailFilter 
class to Lucene 3.0. It's not working! Obviously, my understanding of the 
new token filter mechanism is misguided.
Can someone in the know help me out for a sec and let me know where I am 
going wrong? Thanks.

import org.apache.commons.logging.*;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Stack;

/* Many thanks to Michael J. Prichard <mi...@mac.com> for his
  * original email filter code, which is rewritten here. */

public class EmailFilter extends TokenFilter  implements Serializable {

     public EmailFilter(TokenStream in) {
         super(in);
     }

     public final boolean incrementToken() throws java.io.IOException {

         if (!input.incrementToken()) {
             return false;
         }


         TermAttribute termAtt = (TermAttribute) 
input.getAttribute(TermAttribute.class);

         char[] buffer = termAtt.termBuffer();
         final int bufferLength = termAtt.termLength();
         String emailAddress = new String(buffer, 0,bufferLength);
         emailAddress = emailAddress.replaceAll("<", "");
         emailAddress = emailAddress.replaceAll(">", "");
         emailAddress = emailAddress.replaceAll("\"", "");

         String [] parts = extractEmailParts(emailAddress);
         clearAttributes();
         for (int i = 0; i < parts.length; i++) {
             if (parts[i]!=null) {
                 TermAttribute newTermAttribute = 
addAttribute(TermAttribute.class);
                 newTermAttribute.setTermBuffer(parts[i]);
                 newTermAttribute.setTermLength(parts[i].length());
             }
         }
         return true;
     }

     private String[] extractWhitespaceParts(String email) {
         String[] whitespaceParts = email.split(" ");
         ArrayList<String> partsList = new ArrayList<String>();
         for (int i=0; i < whitespaceParts.length; i++) {
             partsList.add(whitespaceParts[i]);
         }
         return whitespaceParts;
     }

     private String[] extractEmailParts(String email) {

         if (email.indexOf('@')==-1)
             return extractWhitespaceParts(email);

         ArrayList<String> partsList = new ArrayList<String>();

         String[] whitespaceParts = extractWhitespaceParts(email);

          for (int w=0;w<whitespaceParts.length;w++) {

              if (whitespaceParts[w].indexOf('@')==-1)
                  partsList.add(whitespaceParts[w]);
              else {
                  partsList.add(whitespaceParts[w]);
                  String[] splitOnAmpersand = whitespaceParts[w].split("@");
                  try {
                      partsList.add(splitOnAmpersand[0]);
                      partsList.add(splitOnAmpersand[1]);
                  } catch (ArrayIndexOutOfBoundsException ae) {}

                 if (splitOnAmpersand.length > 0) {
                     String[] splitOnDot = splitOnAmpersand[0].split("\\.");
                      for (int i=0; i < splitOnDot.length; i++) {
                          partsList.add(splitOnDot[i]);
                      }
                 }
                 if (splitOnAmpersand.length > 1) {
                     String[] splitOnDot = splitOnAmpersand[1].split("\\.");
                     for (int i=0; i < splitOnDot.length; i++) {
                         partsList.add(splitOnDot[i]);
                     }

                     if (splitOnDot.length > 2) {
                         String domain = splitOnDot[splitOnDot.length-2] 
+ "." + splitOnDot[splitOnDot.length-1];
                         partsList.add(domain);
                     }
                 }
              }
          }
         return partsList.toArray(new String[0]);
     }

}
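
One possible reading of the problem, offered as an assumption since no reply
to this message is included here: each call to incrementToken() describes
exactly one token, so setting the same TermAttribute once per part inside the
loop leaves only the last part visible. The sketch below is plain Java, not
Lucene code. PartQueueModel and its split() helper are hypothetical stand-ins
that only illustrate queueing the parts and emitting one per call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical model of a filter that expands one input token into several output tokens. */
class PartQueueModel {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Iterator<String> input;

    /** Stands in for the term attribute: the value of the current token. */
    String term = "";

    PartQueueModel(List<String> tokens) {
        this.input = tokens.iterator();
    }

    /** Hypothetical splitter standing in for extractEmailParts(); always returns at least the token itself. */
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        parts.add(token);
        int at = token.indexOf('@');
        if (at > 0 && at < token.length() - 1) {
            parts.add(token.substring(0, at));
            parts.add(token.substring(at + 1));
        }
        return parts;
    }

    /** Emits exactly one part per call; refills the queue from the input when it runs dry. */
    boolean incrementToken() {
        if (pending.isEmpty()) {
            if (!input.hasNext()) return false;
            pending.addAll(split(input.next()));
        }
        term = pending.removeFirst();
        return true;
    }
}
```

A real filter built this way would presumably also look up its attributes
once (for example in the constructor) and decide on a position increment for
the extra parts; both points are left out of the model.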




Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Hmm... I looked at captureState() and restoreState() and it doesn't seem 
like they would work in my scenario.

I'd like the LookaheadTokenFilter to be able to peek() several tokens 
forward, and those tokens can have different attributes, so I don't think 
I should assume I can restoreState() safely.

Here is an application for the filter: let's say I want to recognize 
abbreviations (like S.C.R.) at the token level. I'd need to be able to 
peek() a few tokens forward to make sure S.C.R. is an abbreviation and 
not simply the end of a sentence.

So the user should be able to peek() a number of tokens forward before 
returning to the usual behavior.

Here is the implementation I had in mind (untested so far because of a 
StackOverflow):

public class LookaheadTokenFilter extends TokenFilter {
    /** List of tokens that were peeked but not returned with next. */
    LinkedList<AttributeSource> peekedTokens = new 
LinkedList<AttributeSource>();

    /** The index in peekedTokens of the next token that peek() will return. */
    int peekPosition = 0;

    public LookaheadTokenFilter(TokenStream input) {
        super(input);
    }
 
    public boolean peekIncrementToken() throws IOException {
        if (this.peekPosition >= this.peekedTokens.size()) {
            if (this.input.incrementToken() == false) {
                return false;
            }
           
            this.peekedTokens.add(cloneAttributes());           
            this.peekPosition = this.peekedTokens.size();
            return true;
        }
        
        this.peekPosition++;       
        return true;
    }
   
    @Override
    public boolean incrementToken() throws IOException {
        reset();
       
        if (this.peekedTokens.isEmpty() == false) {
            this.peekedTokens.removeFirst();
        }
       
        if (this.peekedTokens.isEmpty() == false) {
            return true;
        }
       
        return super.incrementToken();
    }
       
    @Override
    public void reset() {
        this.peekPosition = 0;
    }   
   

    //Overloaded methods...
   
    public Attribute getAttribute(Class attClass) {
        if (this.peekedTokens.size() > 0) {
            return 
this.peekedTokens.get(this.peekPosition).getAttribute(attClass);
        }       
        return super.getAttribute(attClass);
    }
   
    //Overload all these just like getAttribute() ...
    public Iterator<?> getAttributeClassesIterator() ...
    public AttributeFactory getAttributeFactory() ...
    public Iterator getAttributeImplsIterator() ...
    public Attribute addAttribute(Class attClass) ...
    public void addAttributeImpl(AttributeImpl att) ...
    public State captureState() ...
    public void clearAttributes() ...
    public AttributeSource cloneAttributes() ...
    public boolean hasAttribute(Class attClass) ...
    public boolean hasAttributes() ...
    public void restoreState(State state) ...                     
}


Now the problem I have is that this code triggers an evil 
StackOverflowError, because I'm overriding incrementToken() and calling 
super.incrementToken(), which loops back because of this:

public boolean incrementToken() throws IOException {
    assert tokenWrapper != null;
   
    final Token token;
    if (supportedMethods.hasReusableNext) {
      token = next(tokenWrapper.delegate);
    } else {
      assert supportedMethods.hasNext;
      token = next(); // <----- Lucene calls next()
    }
    if (token == null) return false;
    tokenWrapper.delegate = token;
    return true;
  }

which then calls :

public Token next() throws IOException {
    if (tokenWrapper == null)
      throw new UnsupportedOperationException("This TokenStream only supports the new Attributes API.");

    if (supportedMethods.hasIncrementToken) {
      // <--- incrementToken() gets called
      return incrementToken() ? ((Token) tokenWrapper.delegate.clone()) : null;
    } else {
      assert supportedMethods.hasReusableNext;
      final Token token = next(tokenWrapper.delegate);
      if (token == null) return null;
      tokenWrapper.delegate = token;
      return (Token) token.clone();
    }
  }

and hasIncrementToken is true because I overrode incrementToken():

 MethodSupport(Class clazz) {
    hasIncrementToken = isMethodOverridden(clazz, "incrementToken", 
METHOD_NO_PARAMS);
    hasReusableNext = isMethodOverridden(clazz, "next", METHOD_TOKEN_PARAM);
    hasNext = isMethodOverridden(clazz, "next", METHOD_NO_PARAMS);
}

Seems like a catch-22. From what I understand, if I override 
incrementToken() I should not call super.incrementToken()?
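
The loop described above can be reproduced in a small plain-Java model. None
of the classes below are Lucene classes: BridgedStream is a hypothetical base
class whose two default methods fall back to each other, which simplifies the
reflection-based check quoted above, and a call counter stands in for the real
StackOverflowError.

```java
/** Hypothetical model of a base class that bridges an old and a new stream API. */
class BridgedStream {
    private int bridgeCalls = 0;

    /** New-style API; the default implementation falls back to the old-style method. */
    boolean incrementToken() {
        guard();
        return next() != null;
    }

    /** Old-style API; the default implementation falls back to the new-style method. */
    String next() {
        guard();
        return incrementToken() ? "token" : null;
    }

    /** Stops the demo early instead of letting it run into a real StackOverflowError. */
    private void guard() {
        if (++bridgeCalls > 100) {
            throw new IllegalStateException("bridge methods keep calling each other");
        }
    }
}

/** A source that implements the new API itself, so the bridge is never entered. */
class TwoTokenSource extends BridgedStream {
    private int remaining = 2;

    @Override
    boolean incrementToken() {
        return remaining-- > 0;
    }
}

/** Overrides incrementToken() but hands control back to the bridging default. */
class SuperCallingFilter extends BridgedStream {
    final BridgedStream input;

    SuperCallingFilter(BridgedStream input) {
        this.input = input;
    }

    @Override
    boolean incrementToken() {
        return super.incrementToken(); // -> next() -> this.incrementToken() -> ...
    }
}

/** Overrides incrementToken() and pulls tokens from the wrapped stream instead. */
class DelegatingFilter extends BridgedStream {
    final BridgedStream input;

    DelegatingFilter(BridgedStream input) {
        this.input = input;
    }

    @Override
    boolean incrementToken() {
        return input.incrementToken();
    }
}

class BridgeDemo {
    /** Counts tokens until the stream reports exhaustion. */
    static int count(BridgedStream stream) {
        int n = 0;
        while (stream.incrementToken()) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(new DelegatingFilter(new TwoTokenSource())));
        try {
            count(new SuperCallingFilter(new TwoTokenSource()));
        } catch (IllegalStateException e) {
            System.out.println("loop detected: " + e.getMessage());
        }
    }
}
```

In this model the filter that delegates to its input terminates normally,
while the one that calls super re-enters the bridge on every call.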

Daniel S.

RE: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Uwe Schindler <uw...@thetaphi.de>.
There may be a problem: you may not want to restore the peeked token into
the TokenFilter's own attributes. It looks like you want peek to return a
Token instance without resetting the current stream to that Token (you only
want to "look" at the next Token and then possibly do something special with
the current Token). To achieve this, there is a method cloneAttributes() in
TokenStream that creates a new AttributeSource with the same attribute
types, independent of the one it was cloned from. You can then use
clone.getAttribute(TermAttribute.class).term() or similar to look into
the next token. But creating this new clone is costly, so you may also create
it once and reuse it. In the peek method, you simply copy the state of this
to the cloned AttributeSource.

It's a bit complicated but should work somehow. Tell me if you need more
help. Maybe you should provide us with some code showing what you want to do
with the TokenFilter.
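
The "create it once and reuse" idea can be sketched in plain Java. This is a
model under stated assumptions, not Lucene code: PeekAttrs, ListSource and
SidePeekFilter are hypothetical names, the filter keeps its own current values
instead of sharing them with its input, and only one token of lookahead is
modelled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a set of attribute values. */
class PeekAttrs {
    String term = "";

    void copyFrom(PeekAttrs other) {
        term = other.term;
    }

    PeekAttrs copy() {
        PeekAttrs c = new PeekAttrs();
        c.copyFrom(this);
        return c;
    }
}

/** Hypothetical source that writes each word into its own PeekAttrs instance. */
class ListSource {
    final PeekAttrs attrs = new PeekAttrs();
    private final String[] words;
    private int next = 0;

    ListSource(String... words) {
        this.words = words;
    }

    boolean incrementToken() {
        if (next >= words.length) return false;
        attrs.term = words[next++];
        return true;
    }
}

/** Lets the caller look at the next token without disturbing the current one. */
class SidePeekFilter {
    private final ListSource input;
    final PeekAttrs current = new PeekAttrs();           // what the consumer reads
    private final PeekAttrs peekView = new PeekAttrs();  // created once, reused by every peek()
    private final Deque<PeekAttrs> buffered = new ArrayDeque<>();

    SidePeekFilter(ListSource input) {
        this.input = input;
    }

    /** Returns the reusable view filled with the next token, or null at end of stream. */
    PeekAttrs peek() {
        if (buffered.isEmpty()) {
            if (!input.incrementToken()) return null;
            buffered.addLast(input.attrs.copy());
        }
        peekView.copyFrom(buffered.peekFirst());
        return peekView;
    }

    /** Replays a buffered token first, otherwise pulls a fresh one from the input. */
    boolean incrementToken() {
        if (!buffered.isEmpty()) {
            current.copyFrom(buffered.removeFirst());
            return true;
        }
        if (!input.incrementToken()) return false;
        current.copyFrom(input.attrs);
        return true;
    }
}
```

A caller would advance with incrementToken(), read current, and call peek()
only when it needs to see what follows; the object returned by peek() is
overwritten by the next peek().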

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Michael Busch [mailto:buschmic@gmail.com]
> Sent: Wednesday, September 02, 2009 1:53 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken
> / AttributeSource), cannot implement a LookaheadTokenFilter.





Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Michael Busch <bu...@gmail.com>.
This is what I had in mind (completely untested!):

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

public class LookaheadTokenFilter extends TokenFilter {
   /** States of tokens that were peeked but not yet returned by incrementToken(). */
   LinkedList<AttributeSource.State> peekedTokens = new LinkedList<AttributeSource.State>();

   /** The index in peekedTokens of the next token that peek() will return. */
   int peekPosition = 0;

   public LookaheadTokenFilter(TokenStream input) {
       super(input);
   }

   /** Advances the lookahead by one token and loads it into this filter's attributes. */
   public boolean peek() throws IOException {
       if (this.peekPosition >= this.peekedTokens.size()) {
           boolean hasNext = input.incrementToken();
           if (hasNext) {
               this.peekedTokens.add(captureState());
               this.peekPosition = this.peekedTokens.size();
           }
           return hasNext;
       }

       restoreState(this.peekedTokens.get(this.peekPosition++));
       return true;
   }

   public void reset() { this.peekPosition = 0; }

   public boolean incrementToken() throws IOException {
     reset();

     // Replay buffered states first, then fall through to the wrapped stream.
     if (this.peekedTokens.size() > 0) {
       restoreState(this.peekedTokens.removeFirst());
       return true;
     }
     return this.input.incrementToken();
   }
}


On 9/1/09 4:44 PM, Michael Busch wrote:
> Daniel,
>
> take a look at the captureState() and restoreState() APIs in 
> AttributeSource and TokenStream. captureState() returns a State object 
> containing all attributes with its' current values. 
> restoreState(State) takes a given State and copies its values back 
> into the TokenStream. You should be able to achieve the same thing by 
> storing State objects in your List, instead of Token objects. peek() 
> would change to return true/false instead of Token and the caller of 
> peek consumes the values using the new attribute API. The change on 
> your side should be pretty simple, let us know if you run into problems!
>
>  Michael
>




Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Michael Busch <bu...@gmail.com>.
Daniel,

Take a look at the captureState() and restoreState() APIs in
AttributeSource and TokenStream. captureState() returns a State object
containing all attributes with their current values. restoreState(State)
takes a given State and copies its values back into the TokenStream. You
should be able to achieve the same thing by storing State objects in
your List instead of Token objects. peek() would change to return
true/false instead of a Token, and the caller of peek() would consume the
values using the new attribute API. The change on your side should be
pretty simple; let us know if you run into problems!

  Michael
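
The buffering pattern suggested above could look roughly like the sketch below. It is self-contained and does not use Lucene: SimpleStream, ListStream and LookaheadStream are hypothetical stand-ins, a plain String plays the role of the State object, and one shared StringBuilder plays the role of the attributes. In a real port the list would presumably hold the State objects returned by captureState(), and the filter would extend TokenFilter.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical stand-in for a TokenStream with one term attribute (not a Lucene class). */
abstract class SimpleStream {
    /** The single "attribute": current term text, overwritten in place for each token. */
    protected final StringBuilder term;

    /** Source streams own their attribute. */
    SimpleStream() { this.term = new StringBuilder(); }

    /** Filters share the attribute of their input, as Lucene's TokenFilter does. */
    SimpleStream(SimpleStream input) { this.term = input.term; }

    /** Advances to the next token; returns false at end of stream. */
    abstract boolean incrementToken();

    /** Stand-in for captureState(): snapshot of the current attribute values. */
    String captureState() { return term.toString(); }

    /** Stand-in for restoreState(State): copies a snapshot back into the attribute. */
    void restoreState(String state) { term.setLength(0); term.append(state); }

    String term() { return term.toString(); }
}

/** Hypothetical source stream that emits the given words one by one. */
final class ListStream extends SimpleStream {
    private final Iterator<String> words;

    ListStream(List<String> words) { this.words = words.iterator(); }

    @Override
    boolean incrementToken() {
        if (!words.hasNext()) return false;
        term.setLength(0);
        term.append(words.next());
        return true;
    }
}

/** Lookahead filter: buffers captured states instead of Token objects. */
final class LookaheadStream extends SimpleStream {
    private final SimpleStream input;
    /** States that were peeked but not yet returned by incrementToken(). */
    private final LinkedList<String> peeked = new LinkedList<>();
    /** Index in 'peeked' of the state the next peek() will expose. */
    private int peekPosition = 0;

    LookaheadStream(SimpleStream input) { super(input); this.input = input; }

    /** Exposes the next upcoming token in the shared attribute; false at end of stream. */
    boolean peek() {
        if (peekPosition >= peeked.size()) {
            if (!input.incrementToken()) return false;
            peeked.add(captureState());      // remember it for incrementToken()
            peekPosition = peeked.size();
            return true;
        }
        restoreState(peeked.get(peekPosition++));
        return true;
    }

    @Override
    boolean incrementToken() {
        peekPosition = 0;                    // same as reset() in the old filter
        if (!peeked.isEmpty()) {
            restoreState(peeked.removeFirst());
            return true;
        }
        return input.incrementToken();
    }
}

class LookaheadDemo {
    public static void main(String[] args) {
        LookaheadStream s = new LookaheadStream(new ListStream(Arrays.asList("a", "b", "c")));
        while (s.incrementToken()) {
            String current = s.term();       // copy before peek() overwrites the attribute
            String next = s.peek() ? s.term() : "<end>";
            System.out.println(current + " -> " + next);
        }
    }
}
```

One consequence of this design, at least in the sketch: peek() overwrites the shared attribute values with those of the peeked token, so a caller that still needs the current token's values has to copy or capture them before peeking.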

On 9/1/09 3:12 PM, Daniel Shane wrote:
> After thinking about it, the only conclusion I reached was to save an
> iterator of Attributes instead of the token and use that. It may work.
>
> Daniel Shane
>
> Daniel Shane wrote:
>> Hi all!
>>
>> I'm trying to port my Lucene code to the new TokenStream API and I 
>> have a filter that I cannot seem to port using the current new API.
>>
>> The filter is called LookaheadTokenFilter. It behaves exactly like a 
>> normal token filter, except, you can call peek() and get information 
>> on the next token in the stream.
>>
>> Since Lucene does not support stream "rewinding", we did this by 
>> buffering tokens when peek() was called and giving those back when 
>> next() was called and when no more "peeked" tokens exist, we then 
>> call super.next();
>>
>> Now, I'm looking at this new API and really I'm stuck at how to port 
>> this using incrementToken...
>>
>> Am I missing something, is there an object I can get from the 
>> TokenStream that I can save and get all the attributes from?
>>
>> Here is the code I'm trying to port :
>>
>> public class LookaheadTokenFilter extends TokenFilter {
>>    /** List of tokens that were peeked but not returned with next. */
>>    LinkedList<Token> peekedTokens = new LinkedList<Token>();
>>
>>    /** The position of the next character that peek() will return in 
>> peekedTokens */
>>    int peekPosition = 0;
>>
>>    public LookaheadTokenFilter(TokenStream input) {
>>        super(input);
>>    }
>>
>>    public Token peek() throws IOException {
>>        if (this.peekPosition >= this.peekedTokens.size()) {
>>            Token token = new Token();
>>            token = this.input.next(token);
>>            if (token != null) {
>>                this.peekedTokens.add(token);
>>                this.peekPosition = this.peekedTokens.size();
>>            }
>>            return token;
>>        }
>>
>>        return this.peekedTokens.get(this.peekPosition++);
>>    }
>>
>>    public void reset() { this.peekPosition = 0; }
>>
>>    public Token next(Token token) throws IOException {
>>        reset();
>>
>>        if (this.peekedTokens.size() > 0) {
>>            return this.peekedTokens.removeFirst();
>>        }
>>
>>        return this.input.next(token);
>>    }
>> }
>>
>> Let me know if anyone has an idea,
>> Daniel Shane
>>


Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
After thinking about it, the only conclusion I reached was to save an
iterator of Attributes instead of the token and use that. It may work.

Daniel Shane

Daniel Shane wrote:
> Hi all!
>
> I'm trying to port my Lucene code to the new TokenStream API and I 
> have a filter that I cannot seem to port using the current new API.
>
> The filter is called LookaheadTokenFilter. It behaves exactly like a 
> normal token filter, except, you can call peek() and get information 
> on the next token in the stream.
>
> Since Lucene does not support stream "rewinding", we did this by 
> buffering tokens when peek() was called and giving those back when 
> next() was called and when no more "peeked" tokens exist, we then call 
> super.next();
>
> Now, I'm looking at this new API and really I'm stuck at how to port 
> this using incrementToken...
>
> Am I missing something, is there an object I can get from the 
> TokenStream that I can save and get all the attributes from?
>
> Here is the code I'm trying to port :
>
> public class LookaheadTokenFilter extends TokenFilter {
>    /** List of tokens that were peeked but not returned with next. */
>    LinkedList<Token> peekedTokens = new LinkedList<Token>();
>
>    /** The position of the next character that peek() will return in 
> peekedTokens */
>    int peekPosition = 0;
>
>    public LookaheadTokenFilter(TokenStream input) {
>        super(input);
>    }
>
>    public Token peek() throws IOException {
>        if (this.peekPosition >= this.peekedTokens.size()) {
>            Token token = new Token();
>            token = this.input.next(token);
>            if (token != null) {
>                this.peekedTokens.add(token);
>                this.peekPosition = this.peekedTokens.size();
>            }
>            return token;
>        }
>
>        return this.peekedTokens.get(this.peekPosition++);
>    }
>
>    public void reset() { this.peekPosition = 0; }
>
>    public Token next(Token token) throws IOException {
>        reset();
>
>        if (this.peekedTokens.size() > 0) {
>            return this.peekedTokens.removeFirst();
>        }
>
>        return this.input.next(token);
>    }
> }
>
> Let me know if anyone has an idea,
> Daniel Shane
>

