Posted to java-user@lucene.apache.org by Benson Margulies <be...@basistech.com> on 2012/08/29 21:29:03 UTC

reset versus setReader on TokenStream

I've read the javadoc through a few times, but I confess that I'm still
feeling dense.

Are all tokenizers responsible for implementing some way of retaining the
contents of their reader, so that a call to reset without a call to
setReader rewinds? I note that CharTokenizer doesn't implement #reset,
which leads me to suspect that I'm not responsible for the rewind behavior.

Re: reset versus setReader on TokenStream

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Aug 29, 2012 at 3:54 PM, Benson Margulies <be...@basistech.com> wrote:
>  Some interlinear commentary on the doc.
>
> * Resets this stream to the beginning.
>
> To me this implies a rewind.  As previously noted, I don't see how this
> works for the existing implementations.

It's not a rewind. The javadocs here are not good; we need to fix them
to be clear :)

>
>    * As all TokenStreams must be reusable,
>    * any implementations which have state that needs to be reset between
> usages
>    * of the TokenStream, must implement this method. Note that if your
> TokenStream
>    * caches tokens and feeds them back again after a reset,
>
> What's the alternative? What happens with all the existing Tokenizers that
> have no special implementation of #reset()?

Perhaps these Tokenizers have no state to reset()? Lots of TokenStream
classes are stateless.
If you are stateless, then you don't need to implement this method. You
get the default implementation: e.g. TokenFilter's just passes it down
the chain (input.reset()), and I think Tokenizer's/TokenStream's are
no-ops.
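
For reference, a rough sketch of what that TokenFilter default boils down to
(this is just the idea described above, not code copied from the source):

  @Override
  public void reset() throws IOException {
    input.reset();   // no state of our own to clear; just forward the call down the chain
  }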


-- 
lucidworks.com



Re: reset versus setReader on TokenStream

Posted by Benson Margulies <be...@basistech.com>.
 Some interlinear commentary on the doc.

* Resets this stream to the beginning.

To me this implies a rewind.  As previously noted, I don't see how this
works for the existing implementations.

   * As all TokenStreams must be reusable,
   * any implementations which have state that needs to be reset between
usages
   * of the TokenStream, must implement this method. Note that if your
TokenStream
   * caches tokens and feeds them back again after a reset,

What's the alternative? What happens with all the existing Tokenizers that
have no special implementation of #reset()?

   * it is imperative
   * that you clone the tokens when you store them away (on the first pass)
as
   * well as when you return them (on future passes after {@link #reset()}).

Re: reset versus setReader on TokenStream

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Aug 29, 2012 at 4:18 PM, Benson Margulies <be...@basistech.com> wrote:
> If I'm following, you've created a division of labor between setReader and
> reset.

That's not true. setReader shouldn't be doing any labor. It's really only
a setter!

One possibility here is to make it final (though it's not obvious to me
that it would clear up the situation; I think javadocs are more
important here).

>
> We have a tokenizer that has a good deal of state, since it has to split
> the input into chunks. If I'm following here, you'd recommend that we do
> nothing special in setReader, but have #reset fix up all the state on the
> assumption that we are starting from the beginning of something, and
> we'd reinitialize our chunker over what was sitting in the protected
> 'input'. If someone called #setReader and neglected to call #reset, awful
> things would happen, but you've warned them.

If someone called setReader and neglected to call reset, awful things
will happen to them in general. They would be violating the contract
of the API and the workflow described in the javadocs.

That's why we test as much consumer code as possible against
MockTokenizer (from the test-framework package). It has a state machine
that will fail if you do this.
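
(A rough sketch of what testing against it can look like; the MockTokenizer
constructor and constants below are assumptions about the lucene test-framework
API, so treat the exact signatures as illustrative:)

  // 'reader' holds the test input; 'myConsumer' stands in for the consumer code under test.
  Tokenizer tok = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
  myConsumer.consume(tok);   // MockTokenizer's state machine throws if the consumer calls
                             // things out of order, e.g. incrementToken() before reset(),
                             // or forgets end()/close().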

>
> To me, it seemed natural to overload #setReader so that our tokenizer was
> in a consistent state once it was called. It occurs to me to wonder about
> order: if #reset is called before #setReader, I'm up the creek unless I copy my
> reset implementation into a local override of #setReader.

This would also be a violation on the consumer's part (also detected
by MockTokenizer, in case you have consumers such as query parsers or
whatever else you want to test).

-- 
lucidworks.com



RE: reset versus setReader on TokenStream

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,
 
> To me, it seemed natural to overload #setReader so that our tokenizer was in a
> consistent state once it was called. It occurs to me to wonder about
> order: if #reset is called before #setReader, I'm up the creek unless I copy my reset
> implementation into a local override of #setReader.

The order is defined in the TokenStream and Tokenizer JavaDocs. First call setReader on the Tokenizer; after that the *consumer* has to call reset() on the chain of filters. When a user uses your Tokenizer, they will set a new Reader and then pass the chain to the indexer. The indexer (the consumer) will then call reset() before incrementToken() is called for the first time. In Lucene's BaseTokenStreamTestCase, this is asserted to be correct.
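
A minimal sketch of that consumer-side order (the method and variable names are
made up and imports are omitted; only the call sequence matters):

  // 'tokenizer' is the Tokenizer at the bottom of the chain; 'chain' is the
  // outermost filter wrapping it (or the tokenizer itself if there are no filters).
  void consume(Tokenizer tokenizer, TokenStream chain, Reader newInput) throws IOException {
    tokenizer.setReader(newInput);                 // 1. hand the Tokenizer its new Reader
    CharTermAttribute term = chain.addAttribute(CharTermAttribute.class);
    chain.reset();                                 // 2. the consumer resets the whole chain
    while (chain.incrementToken()) {               // 3. pull tokens
      System.out.println(term.toString());
    }
    chain.end();                                   // 4. finish this input (final offset state etc.)
    chain.close();                                 // 5. release resources
  }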

Uwe




Re: reset versus setReader on TokenStream

Posted by Benson Margulies <be...@basistech.com>.
If I'm following, you've created a division of labor between setReader and
reset.

We have a tokenizer that has a good deal of state, since it has to split
the input into chunks. If I'm following here, you'd recommend that we do
nothing special in setReader, but have #reset fix up all the state on the
assumption that we are starting from the beginning of something, and
we'd reinitialize our chunker over what was sitting in the protected
'input'. If someone called #setReader and neglected to call #reset, awful
things would happen, but you've warned them.

To me, it seemed natural to overload #setReader so that our tokenizer was
in a consistent state once it was called. It occurs to me to wonder about
order: if #reset is called before #setReader, I'm up the creek unless I copy my
reset implementation into a local override of #setReader.
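
(For what it's worth, a hypothetical sketch of that arrangement: every name here
is made up, the fixed-size "chunking" is deliberately trivial, and it assumes the
Lucene 4.0-style Tokenizer(Reader) constructor. setReader is left alone, and all
per-input state is rebuilt in reset() from whatever is sitting in 'input'.)

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class ChunkingTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private char[] buffer = new char[0];   // the whole input, pulled from 'input' in reset()
  private int pos;                       // start of the next chunk

  public ChunkingTokenizer(Reader input) {
    super(input);
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    // Reinitialize the chunker over whatever Reader is currently sitting in 'input'.
    StringBuilder sb = new StringBuilder();
    char[] tmp = new char[1024];
    int n;
    while ((n = input.read(tmp)) != -1) {
      sb.append(tmp, 0, n);
    }
    buffer = new char[sb.length()];
    sb.getChars(0, sb.length(), buffer, 0);
    pos = 0;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pos >= buffer.length) {
      return false;
    }
    clearAttributes();
    int end = Math.min(pos + 16, buffer.length);   // fixed 16-char chunks, purely for the sketch
    termAtt.copyBuffer(buffer, pos, end - pos);
    offsetAtt.setOffset(correctOffset(pos), correctOffset(end));
    pos = end;
    return true;
  }
}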

RE: reset versus setReader on TokenStream

Posted by Uwe Schindler <uw...@thetaphi.de>.
Yeah, make setReader() final on Tokenizer. CharTokenizer must do this resetting in reset(). This is a problem similar to the one in good old StandardTokenizer, which called reset() inside setReader(Reader) [at that time reset(Reader)]. But the latter was fixed already.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Wednesday, August 29, 2012 10:08 PM
> To: java-user@lucene.apache.org
> Subject: Re: reset versus setReader on TokenStream
> 
> On Wed, Aug 29, 2012 at 3:58 PM, Benson Margulies <be...@basistech.com>
> wrote:
> > I think I'm beginning to get the idea. Is the following plausible?
> >
> > At the bottom of the stack, there's an actual source of data -- like a
> > tokenizer. For one of those, reset() is a bit silly, and something
> > like setReader is the brains of the operation.
> 
> Actually I think setReader() is silly in most cases for Tokenizers.
> Most tokenizers should never override this (in fact technically we could make it
> final or something, to make it super-clear, but that might be a bit over the top).
> 
> The default implementation in Tokenizer.java should almost always suffice, as
> it does what you expect a setter would do in Java:
> 
>   public void setReader(Reader input) throws IOException {
>     assert input != null: "input must not be null";
>     this.input = input;
>   }
> 
> So let's take your CharTokenizer example:
> 
>   @Override
>   public void setReader(Reader input) throws IOException {
>     super.setReader(input);
>     bufferIndex = 0;
>     offset = 0;
>     dataLen = 0;
>     finalOffset = 0;
>     ioBuffer.reset(); // make sure to reset the IO buffer!!
>   }
> 
> Really this is bogus; I think it should not override this method at all, and instead
> should do:
> 
>   @Override
>   public void reset() throws IOException {
>     // reset our internal state
>     bufferIndex = 0;
>     offset = 0;
>     dataLen = 0;
>     finalOffset = 0;
>     ioBuffer.reset(); // make sure to reset the IO buffer!!
>   }
> 
> Does that make sense?
> 
> --
> lucidworks.com
> 




Re: reset versus setReader on TokenStream

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Aug 29, 2012 at 3:58 PM, Benson Margulies <be...@basistech.com> wrote:
> I think I'm beginning to get the idea. Is the following plausible?
>
> At the bottom of the stack, there's an actual source of data -- like a
> tokenizer. For one of those, reset() is a bit silly, and something like
> setReader is the brains of the operation.

Actually I think setReader() is silly in most cases for Tokenizers.
Most tokenizers should never override this (in fact technically we
could make it final or something, to make it super-clear, but that
might be a bit over the top).

The default implementation in Tokenizer.java should almost always
suffice, as it does what you expect a setter would do in Java:

  public void setReader(Reader input) throws IOException {
    assert input != null: "input must not be null";
    this.input = input;
  }

So let's take your CharTokenizer example:

  @Override
  public void setReader(Reader input) throws IOException {
    super.setReader(input);
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
    finalOffset = 0;
    ioBuffer.reset(); // make sure to reset the IO buffer!!
  }

Really this is bogus; I think it should not override this method at
all, and instead should do:

  @Override
  public void reset() throws IOException {
    // reset our internal state
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
    finalOffset = 0;
    ioBuffer.reset(); // make sure to reset the IO buffer!!
  }

Does that make sense?

-- 
lucidworks.com



Re: reset versus setReader on TokenStream

Posted by Benson Margulies <be...@basistech.com>.
I think I'm beginning to get the idea. Is the following plausible?

At the bottom of the stack, there's an actual source of data -- like a
tokenizer. For one of those, reset() is a bit silly, and something like
setReader is the brains of the operation.

Some number of other components may be stacked up on top of the source of
data, and these may have local state. Calling #reset prepares them for new
data to emerge from the actual source of data.

Re: reset versus setReader on TokenStream

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Aug 29, 2012 at 3:45 PM, Benson Margulies <be...@basistech.com> wrote:
> On Wed, Aug 29, 2012 at 3:37 PM, Robert Muir <rc...@gmail.com> wrote:
>
>> OK, let's help improve it: I think these have likely always been confusing.
>>
>> Before, they were both reset: reset() and reset(Reader), even though
>> they are unrelated. I thought the rename would help this :)
>>
>> Does the TokenStream workflow here help?
>>
>> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html
>> Basically reset() is a mandatory thing the consumer must call. It just
>> means 'reset any mutable state so you can be reused for processing
>> again'.
>>
>
I really did read this. setReader I get; I don't understand what reset
accomplishes. What does it mean to reuse a TokenStream without calling
setReader to supply a new input?

TokenStream is more generic; it doesn't have to take a Reader. It can
take anything you want: e.g. a String or a byte array of your Word
document or whatever.

Tokenizer is a subclass that takes a Reader. It's the only thing that has
setReader.

reset() doesn't mean rewind. It just means clearing any accumulated
internal state so it's ready for processing again.

So if I made a StringTokenizer class that extends Tokenizer, I would
probably add setString(String s) to it so I could set new string
objects on it, but consumers must always call reset() on the entire chain
(the outer stop filters, synonym filters, all this stuff that might be
keeping state). This reset() call chains down all the tokenstreams.
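
(A rough sketch of something along those lines; it is written here as a direct
TokenStream subclass, since the whole point is that it takes a String rather
than a Reader, and every name in it is made up.)

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class StringTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private String[] parts;
  private int index;

  // The "setter": hand the stream a new String to process.
  public void setString(String s) {
    parts = s.split("\\s+");
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    index = 0;   // clear per-use state; the consumer calls this before consuming
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (parts == null) {
      return false;
    }
    while (index < parts.length) {
      String p = parts[index++];
      if (!p.isEmpty()) {          // split() can produce empty strings at the edges
        clearAttributes();
        termAtt.setEmpty().append(p);
        return true;
      }
    }
    return false;
  }
}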

-- 
lucidworks.com



Re: reset versus setReader on TokenStream

Posted by Benson Margulies <be...@basistech.com>.
On Wed, Aug 29, 2012 at 3:37 PM, Robert Muir <rc...@gmail.com> wrote:

> OK, let's help improve it: I think these have likely always been confusing.
>
> Before, they were both reset: reset() and reset(Reader), even though
> they are unrelated. I thought the rename would help this :)
>
> Does the TokenStream workflow here help?
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html
> Basically reset() is a mandatory thing the consumer must call. It just
> means 'reset any mutable state so you can be reused for processing
> again'.
>

I really did read this. setReader I get; I don't understand what reset
accomplishes. What does it mean to reuse a TokenStream without calling
setReader to supply a new input? If it means reusing the old input, who does
the rewinding?





> This is something on any TokenStream: Tokenizers, TokenFilters, or
> even some direct descendant you make that parses byte arrays, or
> whatever.
>
> This means: if you are keeping some state across tokens (like
> StopFilter's #skippedTokens), here is where you would set that = 0
> again.
>
> setReader(Reader) is only on Tokenizer; it means replacing the Reader
> with a different one to be processed.
> The fact that CharTokenizer is doing 'reset()-like-stuff' in here is
> bogus IMO, but I don't think it will cause any bugs. Don't emulate it
> :)
>
> On Wed, Aug 29, 2012 at 3:29 PM, Benson Margulies <be...@basistech.com>
> wrote:
> > I've read the javadoc through a few times, but I confess that I'm still
> > feeling dense.
> >
> > Are all tokenizers responsible for implementing some way of retaining the
> > contents of their reader, so that a call to reset without a call to
> > setReader rewinds? I note that CharTokenizer doesn't implement #reset,
> > which leads me to suspect that I'm not responsible for the rewind
> behavior.
>
>
>
> --
> lucidworks.com
>
>
>

Re: reset versus setReader on TokenStream

Posted by Robert Muir <rc...@gmail.com>.
OK, let's help improve it: I think these have likely always been confusing.

Before, they were both reset: reset() and reset(Reader), even though
they are unrelated. I thought the rename would help this :)

Does the TokenStream workflow here help?
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/analysis/TokenStream.html
Basically reset() is a mandatory thing the consumer must call. It just
means 'reset any mutable state so you can be reused for processing
again'.
This is something on any TokenStream: Tokenizers, TokenFilters, or
even some direct descendant you make that parses byte arrays, or
whatever.

This means: if you are keeping some state across tokens (like
StopFilter's #skippedTokens), here is where you would set that = 0
again.
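
(A hypothetical sketch in that spirit: a filter that accumulates a counter
across tokens and clears it in reset(). The class and its length-based skipping
rule are made up; only the reset() pattern is the point.)

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class SkipShortTokensFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private int skippedTokens;   // state that accumulates across tokens

  public SkipShortTokensFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (termAtt.length() >= 3) {
        return true;           // keep tokens of length >= 3
      }
      skippedTokens++;         // accumulate state while consuming
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();             // chains the reset() call down to the wrapped stream
    skippedTokens = 0;         // clear our own accumulated state
  }
}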

setReader(Reader) is only on Tokenizer; it means replacing the Reader
with a different one to be processed.
The fact that CharTokenizer is doing 'reset()-like-stuff' in here is
bogus IMO, but I don't think it will cause any bugs. Don't emulate it
:)

On Wed, Aug 29, 2012 at 3:29 PM, Benson Margulies <be...@basistech.com> wrote:
> I've read the javadoc through a few times, but I confess that I'm still
> feeling dense.
>
> Are all tokenizers responsible for implementing some way of retaining the
> contents of their reader, so that a call to reset without a call to
> setReader rewinds? I note that CharTokenizer doesn't implement #reset,
> which leads me to suspect that I'm not responsible for the rewind behavior.



-- 
lucidworks.com
