Posted to java-user@lucene.apache.org by Spyros Kapnissis <sk...@yahoo.com.INVALID> on 2015/03/23 12:26:34 UTC

CachingTokenFilter tests fail when using MockTokenizer

Hello, 
We have a couple of custom token filters that use CachingTokenFilter internally. However, when we try to test them with MockTokenizer so that we can have these nice TokenStream API checks that it provides, the tests fail with: "java.lang.AssertionError: end() called before incrementToken() returned false!"

Here is a link with a unit test to reproduce the issue: https://gist.github.com/spyk/c783c72689410070811b
Do we misuse CachingTokenFilter? Or is it an issue of MockTokenizer when used with CachingTokenFilter?
Thanks! Spyros
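
For reference, a minimal sketch of the kind of test meant here, assuming Lucene 5.x and a hypothetical MyCachingFilter under test; MockTokenizer and assertTokenStreamContents() come from the lucene-test-framework module and enforce the reset()/incrementToken()/end()/close() workflow:

import java.io.StringReader;

import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class MyCachingFilterTest extends BaseTokenStreamTestCase {

  public void testBasics() throws Exception {
    // MockTokenizer enforces the TokenStream API checks mentioned above.
    MockTokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
    tokenizer.setReader(new StringReader("foo bar baz"));

    // Hypothetical filter under test.
    TokenStream stream = new MyCachingFilter(tokenizer);

    // Fails with "end() called before incrementToken() returned false!"
    // when the filter breaks the consumption workflow.
    assertTokenStreamContents(stream, new String[] { "foo", "bar", "baz" });
  }
}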



Re: CachingTokenFilter tests fail when using MockTokenizer

Posted by Spyros Kapnissis <sk...@yahoo.com.INVALID>.
Hello Uwe, thanks a lot for your answer. Makes perfect sense - I knew something was wrong with CachingTokenFilter! I will try to modify and adapt the filter to avoid the error as per your instructions. By the way, is there a better way/pattern for consuming the token stream two (or more) times, maybe with an example from an existing filter?
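
One pattern that comes up for this (not from the thread itself; a sketch under the assumption that the filter must see all tokens before emitting anything, with illustrative names) is to buffer the input's attribute states on the first incrementToken() call and then replay them - essentially what CachingTokenFilter does internally with captureState()/restoreState():

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

public final class TwoPassFilterSketch extends TokenFilter {

  private List<AttributeSource.State> cache;         // buffered token states
  private Iterator<AttributeSource.State> iterator;  // replay cursor
  private AttributeSource.State finalState;          // state captured after input.end()

  public TwoPassFilterSketch(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (cache == null) {
      // First pass: consume the (already reset) input and buffer every token.
      cache = new ArrayList<>();
      while (input.incrementToken()) {
        cache.add(captureState());
      }
      input.end();
      finalState = captureState();
      // ... compute whatever the filter needs from the buffered tokens here ...
      iterator = cache.iterator();
    }
    if (!iterator.hasNext()) {
      return false;                  // buffer exhausted
    }
    restoreState(iterator.next());   // second pass: replay (and possibly modify) tokens
    return true;
  }

  @Override
  public void end() throws IOException {
    if (finalState != null) {
      restoreState(finalState);      // propagate the final offset captured after input.end()
    }
  }

  @Override
  public void reset() throws IOException {
    super.reset();                   // resets the underlying input
    cache = null;                    // drop the buffer so the filter is reusable
    iterator = null;
    finalState = null;
  }
}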
 


     On Tuesday, March 24, 2015 3:12 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
   

 Hi,

One of the problems is that CachingTokenFilter is not 100% conformant to the TokenStream/TokenFilter specs. It is mainly used internally in Lucene for things like the highlighter, which needs to consume the same TokenStream multiple times; in those cases the calling code knows how to handle that. One problem is that reset() is wrongly defined: the rewind case should instead be named rewind(), so it behaves correctly and cannot be confused with reset() [which is called automatically before consumption and has side effects]. To me, CachingTokenFilter is a bug by itself™. This filter is excluded from our random tests because of those problems (it never gets tested by TestRandomChains).

The problem in your code is that you wrap the underlying TokenStream with CachingTokenFilter inside incrementToken() and consume it there, and this confuses the whole TokenStream state machine. You should wrap the input with CachingTokenFilter in the constructor, not as late as incrementToken() [at that point reset() has already been called on the underlying stream, so CachingTokenFilter will call it a second time]. This leads to the broken state, which may later cause the end() problem.

In addition, the TokenFilter does not implement reset() correctly, so the whole thing cannot be reused in analyzers.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Spyros Kapnissis [mailto:skapni@yahoo.com.INVALID]
> Sent: Monday, March 23, 2015 11:02 PM
> To: java-user@lucene.apache.org; Ahmet Arslan
> Subject: Re: CachingTokenFilter tests fail when using MockTokenizer
> 
> Hello Ahmet,
> Unfortunately the test still fails with the same error: "end() called before
> incrementToken() returned false!". I am not sure if I am misusing
> CachingTokenFilter, or if it cannot be used with MockTokenizer, since it
> "always calls end() before incrementToken() returns false".
> Spyros
> 
> 
> 
> 
>      On Monday, March 23, 2015 9:12 PM, Ahmet Arslan
> <io...@yahoo.com.INVALID> wrote:
> 
> 
>  Hi Spyros,
> 
> Not 100% sure but I think you should override reset method.
> 
> @Override
> public void reset() throws IOException {
>     super.reset();
>     cachedInput = null;
> }
> 
> Ahmet
> 
> 
> On Monday, March 23, 2015 1:29 PM, Spyros Kapnissis
> <sk...@yahoo.com.INVALID> wrote:
> Hello,
> We have a couple of custom token filters that use CachingTokenFilter
> internally. However, when we try to test them with MockTokenizer so that
> we can have these nice TokenStream API checks that it provides, the tests
> fail with: "java.lang.AssertionError: end() called before incrementToken()
> returned false!"
> 
> Here is a link with a unit test to reproduce the issue:
> https://gist.github.com/spyk/c783c72689410070811b
> Do we misuse CachingTokenFilter? Or is it an issue of MockTokenizer when
> used with CachingTokenFilter?
> Thanks! Spyros
> 
> 
> 
> 
> 




  

RE: CachingTokenFilter tests fail when using MockTokenizer

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

One of the problems is that CachingTokenFilter is not 100% conformant to the TokenStream/TokenFilter specs. It is mainly used internally in Lucene for things like the highlighter, which needs to consume the same TokenStream multiple times; in those cases the calling code knows how to handle that. One problem is that reset() is wrongly defined: the rewind case should instead be named rewind(), so it behaves correctly and cannot be confused with reset() [which is called automatically before consumption and has side effects]. To me, CachingTokenFilter is a bug by itself™. This filter is excluded from our random tests because of those problems (it never gets tested by TestRandomChains).

The problem in your code is that you wrap the underlying TokenStream with CachingTokenFilter inside incrementToken() and consume it there, and this confuses the whole TokenStream state machine. You should wrap the input with CachingTokenFilter in the constructor, not as late as incrementToken() [at that point reset() has already been called on the underlying stream, so CachingTokenFilter will call it a second time]. This leads to the broken state, which may later cause the end() problem.

In addition, the TokenFilter does not implement reset() correctly, so the whole thing cannot be reused in analyzers.
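
As an illustration of the above (a sketch only, with hypothetical class names, not the filter from the gist): the wrapping happens once in the constructor, so the consumer's single reset() call reaches the CachingTokenFilter through the normal workflow, and incrementToken() only consumes:

import java.io.IOException;

import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class MyCachingFilter extends TokenFilter {

  public MyCachingFilter(TokenStream input) {
    // Wrap here, not inside incrementToken(): by the time incrementToken()
    // runs, reset() has already been called on the chain, and wrapping there
    // would trigger a second reset() on the underlying stream.
    super(new CachingTokenFilter(input));
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Only consume this.input (the CachingTokenFilter) here;
    // the filter's real logic would go around this call.
    return input.incrementToken();
  }
}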

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Spyros Kapnissis [mailto:skapni@yahoo.com.INVALID]
> Sent: Monday, March 23, 2015 11:02 PM
> To: java-user@lucene.apache.org; Ahmet Arslan
> Subject: Re: CachingTokenFilter tests fail when using MockTokenizer
> 
> Hello Ahmet,
> Unfortunately the test still fails with the same error: "end() called before
> incrementToken() returned false!". I am not sure if I am misusing
> CachingTokenFilter, or if it cannot be used with MockTokenizer, since it
> "always calls end() before incrementToken() returns false".
> Spyros
> 
> 
> 
> 
>      On Monday, March 23, 2015 9:12 PM, Ahmet Arslan
> <io...@yahoo.com.INVALID> wrote:
> 
> 
>  Hi Spyros,
> 
> Not 100% sure but I think you should override reset method.
> 
> @Override
> public void reset() throws IOException {
>     super.reset();
>     cachedInput = null;
> }
> 
> Ahmet
> 
> 
> On Monday, March 23, 2015 1:29 PM, Spyros Kapnissis
> <sk...@yahoo.com.INVALID> wrote:
> Hello,
> We have a couple of custom token filters that use CachingTokenFilter
> internally. However, when we try to test them with MockTokenizer so that
> we can have these nice TokenStream API checks that it provides, the tests
> fail with: "java.lang.AssertionError: end() called before incrementToken()
> returned false!"
> 
> Here is a link with a unit test to reproduce the issue:
> https://gist.github.com/spyk/c783c72689410070811b
> Do we misuse CachingTokenFilter? Or is it an issue of MockTokenizer when
> used with CachingTokenFilter?
> Thanks! Spyros
> 
> 
> 
> 
> 




Re: CachingTokenFilter tests fail when using MockTokenizer

Posted by Spyros Kapnissis <sk...@yahoo.com.INVALID>.
Hello Ahmet, 
Unfortunately the test still fails with the same error: "end() called before incrementToken() returned false!". I am not sure if I am misusing CachingTokenFilter, or if it cannot be used with MockTokenizer, since it "always calls end() before incrementToken() returns false".
Spyros




     On Monday, March 23, 2015 9:12 PM, Ahmet Arslan <io...@yahoo.com.INVALID> wrote:
   

 Hi Spyros,

Not 100% sure but I think you should override reset method.

@Override
public void reset() throws IOException {
    super.reset();
    cachedInput = null;
}

Ahmet


On Monday, March 23, 2015 1:29 PM, Spyros Kapnissis <sk...@yahoo.com.INVALID> wrote:
Hello, 
We have a couple of custom token filters that use CachingTokenFilter internally. However, when we try to test them with MockTokenizer so that we can have these nice TokenStream API checks that it provides, the tests fail with: "java.lang.AssertionError: end() called before incrementToken() returned false!"

Here is a link with a unit test to reproduce the issue: https://gist.github.com/spyk/c783c72689410070811b
Do we misuse CachingTokenFilter? Or is it an issue of MockTokenizer when used with CachingTokenFilter?
Thanks! Spyros




  

Re: CachingTokenFilter tests fail when using MockTokenizer

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Spyros,

Not 100% sure but I think you should override reset method.

@Override
public void reset() throws IOException {
    super.reset();
    cachedInput = null;
}

Ahmet


On Monday, March 23, 2015 1:29 PM, Spyros Kapnissis <sk...@yahoo.com.INVALID> wrote:
Hello, 
We have a couple of custom token filters that use CachingTokenFilter internally. However, when we try to test them with MockTokenizer so that we can have these nice TokenStream API checks that it provides, the tests fail with: "java.lang.AssertionError: end() called before incrementToken() returned false!"

Here is a link with a unit test to reproduce the issue: https://gist.github.com/spyk/c783c72689410070811b
Do we misuse CachingTokenFilter? Or is it an issue of MockTokenizer when used with CachingTokenFilter?
Thanks! Spyros
