Posted to java-user@lucene.apache.org by Daniel Shane <sh...@LEXUM.UMontreal.CA> on 2009/09/02 00:01:43 UTC
Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource),
cannot implement a LookaheadTokenFilter.
Hi all!
I'm trying to port my Lucene code to the new TokenStream API and I have
a filter that I cannot seem to port using the current new API.
The filter is called LookaheadTokenFilter. It behaves exactly like a
normal token filter, except, you can call peek() and get information on
the next token in the stream.
Since Lucene does not support stream "rewinding", we did this by
buffering tokens when peek() was called and giving them back when
next() was called. When no more "peeked" tokens exist, we then call
super.next().
Now, I'm looking at this new API and really I'm stuck at how to port
this using incrementToken...
Am I missing something? Is there an object I can get from the
TokenStream that I can save and get all the attributes from?
Here is the code I'm trying to port :
import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class LookaheadTokenFilter extends TokenFilter {
    /** List of tokens that were peeked but not yet returned by next(). */
    LinkedList<Token> peekedTokens = new LinkedList<Token>();

    /** The index in peekedTokens of the next token that peek() will return. */
    int peekPosition = 0;

    public LookaheadTokenFilter(TokenStream input) {
        super(input);
    }

    public Token peek() throws IOException {
        if (this.peekPosition >= this.peekedTokens.size()) {
            // Nothing buffered at this position: pull a fresh token from the input.
            Token token = new Token();
            token = this.input.next(token);
            if (token != null) {
                this.peekedTokens.add(token);
                this.peekPosition = this.peekedTokens.size();
            }
            return token;
        }
        return this.peekedTokens.get(this.peekPosition++);
    }

    public void reset() { this.peekPosition = 0; }

    public Token next(Token token) throws IOException {
        reset();
        // Hand back buffered tokens first, then fall through to the input.
        if (this.peekedTokens.size() > 0) {
            return this.peekedTokens.removeFirst();
        }
        return this.input.next(token);
    }
}
Let me know if anyone has an idea,
Daniel Shane
RE: TokenStream API, Quick Question.
Posted by Uwe Schindler <uw...@thetaphi.de>.
The indexer only calls getAttribute/addAttribute once, after initializing
(see docs). It will never call them again later. If you cache tokens, you always
have to restore the state into the TokenStream's attributes.
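To illustrate the point, here is a minimal sketch in plain Java. It does not use any Lucene classes: TermAttr, BufferingStream and ConsumerSketch are made-up stand-ins, and the single "term" field is an assumption chosen to keep the example short. It only shows why cached tokens have to be copied back into the one attribute instance that the consumer fetched up front.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a term attribute: one mutable instance that is
// shared between the stream and its consumer.
final class TermAttr {
    String term = "";
}

// Hypothetical buffering stream: tokens are cached as plain values and
// copied back into the shared attribute on every incrementToken() call.
final class BufferingStream {
    private final TermAttr termAttr = new TermAttr();
    private final Deque<String> cached = new ArrayDeque<>();

    BufferingStream(String... terms) {
        for (String t : terms) {
            cached.add(t);
        }
    }

    // The consumer calls this once, up front, and keeps the reference.
    TermAttr getTermAttr() {
        return termAttr;
    }

    boolean incrementToken() {
        String next = cached.poll();
        if (next == null) {
            return false;
        }
        // "Restore" the cached value into the shared instance; the consumer
        // never asks for the attribute again, so this is the only way it
        // can see the new token.
        termAttr.term = next;
        return true;
    }
}

final class ConsumerSketch {
    // Mimics a consumer that looks the attribute up once and never again.
    static String consume(BufferingStream stream) {
        TermAttr attr = stream.getTermAttr();
        StringBuilder seen = new StringBuilder();
        while (stream.incrementToken()) {
            if (seen.length() > 0) {
                seen.append(' ');
            }
            seen.append(attr.term);
        }
        return seen.toString();
    }
}
```

If incrementToken() handed out a fresh attribute object per token instead of updating the shared one, this consumer would keep reading the stale instance it obtained before the loop.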
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
TokenStream API, Quick Question.
Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Does a TokenStream always have to return the same number of attributes
with the same underlying classes for all the tokens it generates?
I mean, during the tokenization phase, can the first "token" have a Term
and Offset Attribute and the second "token" only a Type Attribute, or
does the first token have to have an empty Type attribute
as well?
I'm just not sure,
Daniel Shane
Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken
/ AttributeSource), cannot implement a LookaheadTokenFilter.
Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
OK, I got it: from checking other filters, I see that I should call
input.incrementToken() instead of super.incrementToken().
Do you feel this kind of breaks the object model (super.incrementToken()
should also work)?
Maybe when the old API is gone, we can stop checking whether someone has
overridden next() or incrementToken()?
Daniel S.
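The loop and the fix can be reproduced with a small model in plain Java. This is only a sketch under stated assumptions: ModelStream, ListStream, ModelFilter, BrokenFilter and WorkingFilter are hypothetical names, the bridge between incrementToken() and next() is heavily simplified compared to the real backwards-compatibility layer, and a call counter stands in for the actual StackOverflowError.

```java
import java.util.Arrays;
import java.util.Iterator;

// Simplified model of the back-compat bridge: the base incrementToken()
// forwards to next(), and next() forwards back to incrementToken().
class ModelStream {
    protected int bridgeCalls = 0;

    public boolean incrementToken() {
        return next() != null;
    }

    public String next() {
        // Guard so the model fails fast instead of overflowing the stack.
        if (++bridgeCalls > 100) {
            throw new IllegalStateException(
                "incrementToken() and next() keep calling each other");
        }
        return incrementToken() ? "token" : null;
    }
}

// A real source of tokens: overrides incrementToken() and never calls super.
class ListStream extends ModelStream {
    private final Iterator<String> it;
    String current;

    ListStream(String... terms) {
        this.it = Arrays.asList(terms).iterator();
    }

    @Override
    public boolean incrementToken() {
        if (!it.hasNext()) {
            return false;
        }
        current = it.next();
        return true;
    }
}

class ModelFilter extends ModelStream {
    protected final ModelStream input;

    ModelFilter(ModelStream input) {
        this.input = input;
    }
}

// Calls super.incrementToken(): base -> next() -> this.incrementToken() -> ...
class BrokenFilter extends ModelFilter {
    BrokenFilter(ModelStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() {
        return super.incrementToken();
    }
}

// Delegates to the wrapped stream instead, so the bridge is never entered.
class WorkingFilter extends ModelFilter {
    WorkingFilter(ModelStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() {
        return input.incrementToken();
    }
}
```

In this model, BrokenFilter never reaches its input at all, which matches the observation that the fix is to ask input for the next token rather than the superclass.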
Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken
/ AttributeSource), cannot implement a LookaheadTokenFilter.
Posted by Jamie <ja...@stimulussoft.com>.
Hi there,
In the absence of documentation, I am trying to convert an EmailFilter
class to Lucene 3.0. It's not working! Obviously, my understanding of the
new token filter mechanism is misguided.
Can someone in the know help me out for a sec and let me know where I am
going wrong? Thanks.
import org.apache.commons.logging.*;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Stack;
/* Many thanks to Michael J. Prichard <mi...@mac.com> for his
 * original email filter code. It has been rewritten. */
public class EmailFilter extends TokenFilter implements Serializable {

    public EmailFilter(TokenStream in) {
        super(in);
    }

    public final boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) {
            return false;
        }
        TermAttribute termAtt = (TermAttribute)
                input.getAttribute(TermAttribute.class);
        char[] buffer = termAtt.termBuffer();
        final int bufferLength = termAtt.termLength();
        String emailAddress = new String(buffer, 0, bufferLength);
        emailAddress = emailAddress.replaceAll("<", "");
        emailAddress = emailAddress.replaceAll(">", "");
        emailAddress = emailAddress.replaceAll("\"", "");
        String[] parts = extractEmailParts(emailAddress);
        clearAttributes();
        for (int i = 0; i < parts.length; i++) {
            if (parts[i] != null) {
                TermAttribute newTermAttribute =
                        addAttribute(TermAttribute.class);
                newTermAttribute.setTermBuffer(parts[i]);
                newTermAttribute.setTermLength(parts[i].length());
            }
        }
        return true;
    }

    private String[] extractWhitespaceParts(String email) {
        String[] whitespaceParts = email.split(" ");
        ArrayList<String> partsList = new ArrayList<String>();
        for (int i = 0; i < whitespaceParts.length; i++) {
            partsList.add(whitespaceParts[i]);
        }
        return whitespaceParts;
    }

    private String[] extractEmailParts(String email) {
        if (email.indexOf('@') == -1)
            return extractWhitespaceParts(email);
        ArrayList<String> partsList = new ArrayList<String>();
        String[] whitespaceParts = extractWhitespaceParts(email);
        for (int w = 0; w < whitespaceParts.length; w++) {
            if (whitespaceParts[w].indexOf('@') == -1)
                partsList.add(whitespaceParts[w]);
            else {
                partsList.add(whitespaceParts[w]);
                String[] splitOnAmpersand = whitespaceParts[w].split("@");
                try {
                    partsList.add(splitOnAmpersand[0]);
                    partsList.add(splitOnAmpersand[1]);
                } catch (ArrayIndexOutOfBoundsException ae) {}
                if (splitOnAmpersand.length > 0) {
                    String[] splitOnDot = splitOnAmpersand[0].split("\\.");
                    for (int i = 0; i < splitOnDot.length; i++) {
                        partsList.add(splitOnDot[i]);
                    }
                }
                if (splitOnAmpersand.length > 1) {
                    String[] splitOnDot = splitOnAmpersand[1].split("\\.");
                    for (int i = 0; i < splitOnDot.length; i++) {
                        partsList.add(splitOnDot[i]);
                    }
                    if (splitOnDot.length > 2) {
                        String domain = splitOnDot[splitOnDot.length - 2]
                                + "." + splitOnDot[splitOnDot.length - 1];
                        partsList.add(domain);
                    }
                }
            }
        }
        return partsList.toArray(new String[0]);
    }
}
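One likely reason the filter above yields only one term per input token: addAttribute(TermAttribute.class) hands back the attribute instance that is already registered for that class, so each pass through the loop overwrites the previous part and only the last one survives. With the incrementToken() style, a filter that expands one token into several usually has to queue the extra parts and emit one per call. The sketch below shows that queueing pattern in plain Java, without Lucene classes; SplittingStreamSketch, its term field and the simple split on '@' are assumptions made for illustration, not the EmailFilter logic itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of a filter that turns one input token into several
// output tokens, emitting exactly one token per incrementToken() call.
final class SplittingStreamSketch {
    private final Iterator<String> input;
    // Parts of the current input token that have not been emitted yet.
    private final Deque<String> pending = new ArrayDeque<>();
    // Stand-in for the shared term attribute read by the consumer.
    String term;

    SplittingStreamSketch(List<String> tokens) {
        this.input = tokens.iterator();
    }

    boolean incrementToken() {
        if (pending.isEmpty()) {
            if (!input.hasNext()) {
                return false;
            }
            String token = input.next();
            // Always emit the original token, then its parts (if any).
            pending.add(token);
            if (token.indexOf('@') != -1) {
                pending.addAll(Arrays.asList(token.split("@")));
            }
        }
        // One part per call; the rest stays queued for the next calls.
        term = pending.poll();
        return true;
    }

    // Convenience helper: collect everything the stream emits.
    static List<String> drain(SplittingStreamSketch stream) {
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(stream.term);
        }
        return out;
    }
}
```

In a real filter the queued parts would be written into the existing term attribute on each call, and position increments would probably need attention too; both are left out of this sketch.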
Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken
/ AttributeSource), cannot implement a LookaheadTokenFilter.
Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Uwe Schindler wrote:
> There may be a problem in that you may not want to restore the peeked token into
> the TokenFilter's attributes themselves. It looks like you want to have a Token
> instance returned from peek, but the current stream should not reset to this
> Token (you only want to "look" into the next Token and then possibly do
> something special with the current Token). To achieve this, there is a method
> cloneAttributes() in TokenStream that creates a new AttributeSource with the
> same attribute types, which is independent of the cloned one. You can then
> use clone.getAttribute(TermAttribute.class).term() or similar to look into
> the next token. But creating this new clone is costly, so you may also create
> it once and reuse it. In the peek method, you simply copy the state of this to
> the cloned AttributeSource.
>
> It's a bit complicated but should work somehow. Tell me if you need more
> help. Maybe you should provide us with some code showing what you want to do with
> the TokenFilter.
Hmm... I looked at captureState() and restoreState() and it doesn't seem
like it would work in my scenario.
I'd like the LookAheadFilter to be able to peek() several tokens forward,
and they can have different attributes, so I don't think I should assume
I can restoreState() safely.
Here is an application for the filter: let's say I want to recognize
abbreviations (like S.C.R.) at the token level. I'd need to be able to
peek() a few tokens forward to make sure S.C.R. is an abbreviation and
not simply the end of a sentence.
So the user should be able to peek() a number of tokens forward before
returning to the usual behavior.
Here is the implementation I had in mind (untested yet because of a
StackOverflow):
public class LookaheadTokenFilter extends TokenFilter {
    /** List of tokens that were peeked but not yet returned by incrementToken(). */
    LinkedList<AttributeSource> peekedTokens = new
            LinkedList<AttributeSource>();

    /** The index in peekedTokens of the next token that peeking will return. */
    int peekPosition = 0;

    public LookaheadTokenFilter(TokenStream input) {
        super(input);
    }

    public boolean peekIncrementToken() throws IOException {
        if (this.peekPosition >= this.peekedTokens.size()) {
            if (this.input.incrementToken() == false) {
                return false;
            }
            this.peekedTokens.add(cloneAttributes());
            this.peekPosition = this.peekedTokens.size();
            return true;
        }
        this.peekPosition++;
        return true;
    }

    @Override
    public boolean incrementToken() throws IOException {
        reset();
        if (this.peekedTokens.isEmpty() == false) {
            this.peekedTokens.removeFirst();
        }
        if (this.peekedTokens.isEmpty() == false) {
            return true;
        }
        // This is the call that ends in the StackOverflow.
        return super.incrementToken();
    }

    @Override
    public void reset() {
        this.peekPosition = 0;
    }

    // Overridden methods...
    public Attribute getAttribute(Class attClass) {
        if (this.peekedTokens.size() > 0) {
            return
                this.peekedTokens.get(this.peekPosition).getAttribute(attClass);
        }
        return super.getAttribute(attClass);
    }

    // Override all these just like getAttribute() ...
    public Iterator<?> getAttributeClassesIterator() ...
    public AttributeFactory getAttributeFactory() ...
    public Iterator getAttributeImplsIterator() ...
    public Attribute addAttribute(Class attClass) ...
    public void addAttributeImpl(AttributeImpl att) ...
    public State captureState() ...
    public void clearAttributes() ...
    public AttributeSource cloneAttributes() ...
    public boolean hasAttribute(Class attClass) ...
    public boolean hasAttributes() ...
    public void restoreState(State state) ...
}
Now the problem I have is that this code triggers an evil
StackOverflow because I'm overriding incrementToken() and calling
super.incrementToken(), which will loop back because of this:

public boolean incrementToken() throws IOException {
    assert tokenWrapper != null;
    final Token token;
    if (supportedMethods.hasReusableNext) {
        token = next(tokenWrapper.delegate);
    } else {
        assert supportedMethods.hasNext;
        token = next(); // <----- Lucene calls next()
    }
    if (token == null) return false;
    tokenWrapper.delegate = token;
    return true;
}

which then calls:

public Token next() throws IOException {
    if (tokenWrapper == null)
        throw new UnsupportedOperationException(
            "This TokenStream only supports the new Attributes API.");
    if (supportedMethods.hasIncrementToken) {
        // <--- incrementToken() gets called
        return incrementToken() ? ((Token) tokenWrapper.delegate.clone())
                                : null;
    } else {
        assert supportedMethods.hasReusableNext;
        final Token token = next(tokenWrapper.delegate);
        if (token == null) return null;
        tokenWrapper.delegate = token;
        return (Token) token.clone();
    }
}

and hasIncrementToken is true because I overrode incrementToken():

MethodSupport(Class clazz) {
    hasIncrementToken = isMethodOverridden(clazz, "incrementToken",
            METHOD_NO_PARAMS);
    hasReusableNext = isMethodOverridden(clazz, "next", METHOD_TOKEN_PARAM);
    hasNext = isMethodOverridden(clazz, "next", METHOD_NO_PARAMS);
}

Seems like a "catch-22". From what I understand, if I override
incrementToken() I should not call super.incrementToken()?
Daniel S.
RE: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.
Posted by Uwe Schindler <uw...@thetaphi.de>.
There may be a problem in that you may not want to restore the peeked token into
the TokenFilter's attributes themselves. It looks like you want to have a Token
instance returned from peek, but the current stream should not reset to this
Token (you only want to "look" into the next Token and then possibly do
something special with the current Token). To achieve this, there is a method
cloneAttributes() in TokenStream that creates a new AttributeSource with the
same attribute types, which is independent of the cloned one. You can then
use clone.getAttribute(TermAttribute.class).term() or similar to look into
the next token. But creating this new clone is costly, so you may also create
it once and reuse it. In the peek method, you simply copy the state of this to
the cloned AttributeSource.
It's a bit complicated but should work somehow. Tell me if you need more
help. Maybe you should provide us with some code showing what you want to do with
the TokenFilter.
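A rough model of that idea in plain Java, with one token of lookahead, could look like the following. It is only a sketch: PeekAttrs and PeekingStreamSketch are hypothetical names, no Lucene classes are used, and a single String field stands in for the full set of attributes. The point is just that peeking fills a separate, reused view while the current token stays untouched.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for a stream's attributes (only a term here).
final class PeekAttrs {
    String term = "";
}

final class PeekingStreamSketch {
    private final Iterator<String> input;
    /** Attributes of the current token, as seen by the consumer. */
    final PeekAttrs current = new PeekAttrs();
    /** Separate view used only for lookahead; created once and reused. */
    final PeekAttrs peekView = new PeekAttrs();
    private String buffered;      // the token that was read ahead, if any
    private boolean hasBuffered;

    PeekingStreamSketch(String... terms) {
        this.input = Arrays.asList(terms).iterator();
    }

    /** Copies the next token into peekView without touching current. */
    boolean peek() {
        if (!hasBuffered) {
            if (!input.hasNext()) {
                return false;
            }
            buffered = input.next();
            hasBuffered = true;
        }
        peekView.term = buffered;
        return true;
    }

    /** Advances current, replaying the read-ahead token first. */
    boolean incrementToken() {
        if (hasBuffered) {
            current.term = buffered;
            hasBuffered = false;
            return true;
        }
        if (!input.hasNext()) {
            return false;
        }
        current.term = input.next();
        return true;
    }
}
```

Looking several tokens ahead would need a queue of saved states instead of the single buffered value used here.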
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken
/ AttributeSource), cannot implement a LookaheadTokenFilter.
Posted by Michael Busch <bu...@gmail.com>.
This is what I had in mind (completely untested!):
public class LookaheadTokenFilter extends TokenFilter {
    /** List of token states that were peeked but not yet returned by incrementToken(). */
    LinkedList<AttributeSource.State> peekedTokens = new
            LinkedList<AttributeSource.State>();

    /** The index in peekedTokens of the next state that peek() will return. */
    int peekPosition = 0;

    public LookaheadTokenFilter(TokenStream input) {
        super(input);
    }

    public boolean peek() throws IOException {
        if (this.peekPosition >= this.peekedTokens.size()) {
            // Nothing buffered at this position: advance the input and save its state.
            boolean hasNext = input.incrementToken();
            if (hasNext) {
                this.peekedTokens.add(captureState());
                this.peekPosition = this.peekedTokens.size();
            }
            return hasNext;
        }
        // Replay an already buffered state into this filter's attributes.
        restoreState(this.peekedTokens.get(this.peekPosition++));
        return true;
    }

    public void reset() { this.peekPosition = 0; }

    public boolean incrementToken() throws IOException {
        reset();
        if (this.peekedTokens.size() > 0) {
            restoreState(this.peekedTokens.removeFirst());
            return true;
        }
        return this.input.incrementToken();
    }
}
Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken
/ AttributeSource), cannot implement a LookaheadTokenFilter.
Posted by Michael Busch <bu...@gmail.com>.
Daniel,
Take a look at the captureState() and restoreState() APIs in
AttributeSource and TokenStream. captureState() returns a State object
containing all attributes with their current values. restoreState(State)
takes a given State and copies its values back into the TokenStream. You
should be able to achieve the same thing by storing State objects in
your List instead of Token objects. peek() would change to return
true/false instead of a Token, and the caller of peek() consumes the values
using the new attribute API. The change on your side should be pretty
simple. Let us know if you run into problems!
Michael
Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken
/ AttributeSource), cannot implement a LookaheadTokenFilter.
Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
After thinking about it, the only idea I came up with is, instead of saving
the token, to save an iterator of Attributes and use that. It
may work.
Daniel Shane