You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2010/10/19 17:10:44 UTC

Analyzer forcing tokenStream and reusableTokenStream to be final

Hi

I understand the assertions in Analyzer to enforce 'final' on those methods,
but I wanted to ask why do we care if a user's code does not declare them
final? Why fail the tests on it? Can we change the assertion to fail if the
code from o.a.l?

The thing is, we run tests w/ -ea to catch all sorts of errors (real errors)
Lucene may detect. Having some custom analyzers non-final is something we
may choose to live with, yet our tests keep failing. I don't want to argue
if we should or shouldn't make them final - just ask why does Lucene code
cares if the application code may not follow 'best practices'.

Is there real danger in having my analyzer not declaring these methods final
- something that can affect Lucene code for example? Or am I only risking my
code?

Running w/o -ea is not an option, so please avoid suggesting it :).

Shai

Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Simon Willnauer <si...@googlemail.com>.
Any reason why you don't use ReusableAnalyzerBase? If you don't want
it to be reusable just return false from its reset method, your code
will be cleaner and both methods are final.

simon

On Tue, Oct 19, 2010 at 7:05 PM, Shai Erera <se...@gmail.com> wrote:
> Thanks Uwe - that's what I was aiming for. We let the external analyzers
> make sure they are safe by themselves, while ensuring Lucene/Solr ones are
> good. +1 from me to commit this :).
>
> Shai
>
> On Tue, Oct 19, 2010 at 7:03 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>>
>> About the whole assertion (as it also affects TokenStreams). We want to
>> make sure that all Lucene/Solr TokenStreams and Analyzers are final or have
>> final implementation (even when we remove the reuseable method).
>>
>>
>>
>> The idea is to simply only hit this assert for classes from the
>> org.apache.lucene package prefix! So we can test Lucene code, but for all
>> other subclasses we simply ignore. The method assertFinal can do this for
>> us:
>>
>>
>>
>> Index: Analyzer.java
>>
>> ===================================================================
>>
>> --- Analyzer.java   (revision 1023877)
>>
>> +++ Analyzer.java   (working copy)
>>
>> @@ -48,6 +48,8 @@
>>
>>    private boolean assertFinal() {
>>
>>      try {
>>
>>        final Class<?> clazz = getClass();
>>
>> +      if (!clazz.getName().startsWith("org.apache.lucene.")
>>
>> +        return true;
>>
>>        assert clazz.isAnonymousClass() ||
>>
>>          (clazz.getModifiers() & (Modifier.FINAL | Modifier.PRIVATE)) != 0
>> ||
>>
>>          (
>>
>>
>>
>> Same for TokenStream. This is no performance problem, as assertFinal is
>> only called when asserts are enabled (trick is “assert assertFinal();” in
>> ctor).
>>
>>
>>
>> Uwe
>>
>>
>>
>> -----
>>
>> Uwe Schindler
>>
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>
>> http://www.thetaphi.de
>>
>> eMail: uwe@thetaphi.de
>>
>>
>>
>> From: Uwe Schindler [mailto:uwe@thetaphi.de]
>> Sent: Tuesday, October 19, 2010 6:18 PM
>>
>> To: dev@lucene.apache.org
>> Subject: RE: Analyzer forcing tokenStream and reusableTokenStream to be
>> final
>>
>>
>>
>> By the way, the same tests are done for TokenStream subclasses (whose
>> impls must be final in all cases – its defined as decorator pattern, so we
>> enforce it). And: You don’t need to make the class itself final, its enough
>> to make both methods final.
>>
>>
>>
>> -----
>>
>> Uwe Schindler
>>
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>
>> http://www.thetaphi.de
>>
>> eMail: uwe@thetaphi.de
>>
>>
>>
>> From: Shai Erera [mailto:serera@gmail.com]
>> Sent: Tuesday, October 19, 2010 6:06 PM
>> To: dev@lucene.apache.org
>> Subject: Re: Analyzer forcing tokenStream and reusableTokenStream to be
>> final
>>
>>
>>
>> I guess you didn't read my email all the way through - I cannot disable
>> assertions for Lucene stuff because I use Lucene's assertions to assert that
>> my indexing code works :).
>>
>> Shai
>>
>> On Tue, Oct 19, 2010 at 5:59 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>>
>> We simply added that to *test* the bundled analyzers for conformance. If
>> you don’t like that, you can simply disable assertions for the
>> org.apache.lucene package.
>>
>>
>>
>> -----
>>
>> Uwe Schindler
>>
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>
>> http://www.thetaphi.de
>>
>> eMail: uwe@thetaphi.de
>>
>>
>>
>> From: Shai Erera [mailto:serera@gmail.com]
>> Sent: Tuesday, October 19, 2010 5:53 PM
>> To: dev@lucene.apache.org
>> Subject: Re: Analyzer forcing tokenStream and reusableTokenStream to be
>> final
>>
>>
>>
>> I still don't understand how not declaring my tokenStream and
>> reusableTokenStream final can break anything. The methods are there (in my
>> analyzers), and if I risk overriding them somewhere else it's my problem.
>>
>> What am I missing?
>>
>> To add to your email - I too didn't encounter an analyzer that cannot be
>> reused, yet.
>>
>> Shai
>>
>> On Tue, Oct 19, 2010 at 5:45 PM, Robert Muir <rc...@gmail.com> wrote:
>>
>> On Tue, Oct 19, 2010 at 11:21 AM, Robert Muir <rc...@gmail.com> wrote:
>> > If someone doesn't override both (e.g. they just override
>> > tokenStream), then it wouldnt actually use their subclasses code. So
>> > then the reflection hack from LUCENE-1678 would force the analyzer to
>> > never re-use, but instead call tokenStream: but this is very bad for
>> > indexing performance!
>> >
>>
>> Here's a jira issue with an example of how the
>> tokenstream/reusableTokenStream confusion makes this a real problem in
>> practice:
>>
>> https://issues.apache.org/jira/browse/LUCENE-2279
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Shai Erera <se...@gmail.com>.
Thanks Uwe - that's what I was aiming for. We let the external analyzers
make sure they are safe by themselves, while ensuring Lucene/Solr ones are
good. +1 from me to commit this :).

Shai

On Tue, Oct 19, 2010 at 7:03 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> About the whole assertion (as it also affects TokenStreams). We want to
> make sure that all Lucene/Solr TokenStreams and Analyzers are final or have
> final implementation (even when we remove the reuseable method).
>
>
>
> The idea is to simply only hit this assert for classes from the
> org.apache.lucene package prefix! So we can test Lucene code, but for all
> other subclasses we simply ignore. The method assertFinal can do this for
> us:
>
>
>
> Index: Analyzer.java
>
> ===================================================================
>
> --- Analyzer.java   (revision 1023877)
>
> +++ Analyzer.java   (working copy)
>
> @@ -48,6 +48,8 @@
>
>    private boolean assertFinal() {
>
>      try {
>
>        final Class<?> clazz = getClass();
>
> +      if (!clazz.getName().startsWith("org.apache.lucene.")
>
> +        return true;
>
>        assert clazz.isAnonymousClass() ||
>
>          (clazz.getModifiers() & (Modifier.FINAL | Modifier.PRIVATE)) != 0
> ||
>
>          (
>
>
>
> Same for TokenStream. This is no performance problem, as assertFinal is
> only called when asserts are enabled (trick is “assert assertFinal();” in
> ctor).
>
>
>
> Uwe
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> *From:* Uwe Schindler [mailto:uwe@thetaphi.de]
> *Sent:* Tuesday, October 19, 2010 6:18 PM
>
> *To:* dev@lucene.apache.org
> *Subject:* RE: Analyzer forcing tokenStream and reusableTokenStream to be
> final
>
>
>
> By the way, the same tests are done for TokenStream subclasses (whose impls
> must be final in all cases – its defined as decorator pattern, so we enforce
> it). And: You don’t need to make the class itself final, its enough to make
> both methods final.
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> *From:* Shai Erera [mailto:serera@gmail.com]
> *Sent:* Tuesday, October 19, 2010 6:06 PM
> *To:* dev@lucene.apache.org
> *Subject:* Re: Analyzer forcing tokenStream and reusableTokenStream to be
> final
>
>
>
> I guess you didn't read my email all the way through - I cannot disable
> assertions for Lucene stuff because I use Lucene's assertions to assert that
> my indexing code works :).
>
> Shai
>
> On Tue, Oct 19, 2010 at 5:59 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>
> We simply added that to **test** the bundled analyzers for conformance. If
> you don’t like that, you can simply disable assertions for the
> org.apache.lucene package.
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> *From:* Shai Erera [mailto:serera@gmail.com]
> *Sent:* Tuesday, October 19, 2010 5:53 PM
> *To:* dev@lucene.apache.org
> *Subject:* Re: Analyzer forcing tokenStream and reusableTokenStream to be
> final
>
>
>
> I still don't understand how not declaring my tokenStream and
> reusableTokenStream final can break anything. The methods are there (in my
> analyzers), and if I risk overriding them somewhere else it's my problem.
>
> What am I missing?
>
> To add to your email - I too didn't encounter an analyzer that cannot be
> reused, yet.
>
> Shai
>
> On Tue, Oct 19, 2010 at 5:45 PM, Robert Muir <rc...@gmail.com> wrote:
>
> On Tue, Oct 19, 2010 at 11:21 AM, Robert Muir <rc...@gmail.com> wrote:
> > If someone doesn't override both (e.g. they just override
> > tokenStream), then it wouldnt actually use their subclasses code. So
> > then the reflection hack from LUCENE-1678 would force the analyzer to
> > never re-use, but instead call tokenStream: but this is very bad for
> > indexing performance!
> >
>
> Here's a jira issue with an example of how the
> tokenstream/reusableTokenStream confusion makes this a real problem in
> practice:
>
> https://issues.apache.org/jira/browse/LUCENE-2279
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
>
>

RE: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Uwe Schindler <uw...@thetaphi.de>.
About the whole assertion (as it also affects TokenStreams). We want to make
sure that all Lucene/Solr TokenStreams and Analyzers are final or have final
implementation (even when we remove the reuseable method).

 

The idea is to simply only hit this assert for classes from the
org.apache.lucene package prefix! So we can test Lucene code, but for all
other subclasses we simply ignore. The method assertFinal can do this for
us:

 

Index: Analyzer.java

===================================================================

--- Analyzer.java   (revision 1023877)

+++ Analyzer.java   (working copy)

@@ -48,6 +48,8 @@

   private boolean assertFinal() {

     try {

       final Class<?> clazz = getClass();

+      if (!clazz.getName().startsWith("org.apache.lucene.")

+        return true;

       assert clazz.isAnonymousClass() ||

         (clazz.getModifiers() & (Modifier.FINAL | Modifier.PRIVATE)) != 0
||

         (

 

Same for TokenStream. This is no performance problem, as assertFinal is only
called when asserts are enabled (trick is "assert assertFinal();" in ctor).

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 <http://www.thetaphi.de/> http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Uwe Schindler [mailto:uwe@thetaphi.de] 
Sent: Tuesday, October 19, 2010 6:18 PM
To: dev@lucene.apache.org
Subject: RE: Analyzer forcing tokenStream and reusableTokenStream to be
final

 

By the way, the same tests are done for TokenStream subclasses (whose impls
must be final in all cases - its defined as decorator pattern, so we enforce
it). And: You don't need to make the class itself final, its enough to make
both methods final.

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de <http://www.thetaphi.de/> 

eMail: uwe@thetaphi.de

 

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Tuesday, October 19, 2010 6:06 PM
To: dev@lucene.apache.org
Subject: Re: Analyzer forcing tokenStream and reusableTokenStream to be
final

 

I guess you didn't read my email all the way through - I cannot disable
assertions for Lucene stuff because I use Lucene's assertions to assert that
my indexing code works :).

Shai

On Tue, Oct 19, 2010 at 5:59 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

We simply added that to *test* the bundled analyzers for conformance. If you
don't like that, you can simply disable assertions for the org.apache.lucene
package.

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de <http://www.thetaphi.de/> 

eMail: uwe@thetaphi.de

 

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Tuesday, October 19, 2010 5:53 PM
To: dev@lucene.apache.org
Subject: Re: Analyzer forcing tokenStream and reusableTokenStream to be
final

 

I still don't understand how not declaring my tokenStream and
reusableTokenStream final can break anything. The methods are there (in my
analyzers), and if I risk overriding them somewhere else it's my problem.

What am I missing?

To add to your email - I too didn't encounter an analyzer that cannot be
reused, yet.

Shai

On Tue, Oct 19, 2010 at 5:45 PM, Robert Muir <rc...@gmail.com> wrote:

On Tue, Oct 19, 2010 at 11:21 AM, Robert Muir <rc...@gmail.com> wrote:
> If someone doesn't override both (e.g. they just override
> tokenStream), then it wouldnt actually use their subclasses code. So
> then the reflection hack from LUCENE-1678 would force the analyzer to
> never re-use, but instead call tokenStream: but this is very bad for
> indexing performance!
>

Here's a jira issue with an example of how the
tokenstream/reusableTokenStream confusion makes this a real problem in
practice:

https://issues.apache.org/jira/browse/LUCENE-2279


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

 

 


RE: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Uwe Schindler <uw...@thetaphi.de>.
By the way, the same tests are done for TokenStream subclasses (whose impls
must be final in all cases - its defined as decorator pattern, so we enforce
it). And: You don't need to make the class itself final, its enough to make
both methods final.

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 <http://www.thetaphi.de/> http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Tuesday, October 19, 2010 6:06 PM
To: dev@lucene.apache.org
Subject: Re: Analyzer forcing tokenStream and reusableTokenStream to be
final

 

I guess you didn't read my email all the way through - I cannot disable
assertions for Lucene stuff because I use Lucene's assertions to assert that
my indexing code works :).

Shai

On Tue, Oct 19, 2010 at 5:59 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

We simply added that to *test* the bundled analyzers for conformance. If you
don't like that, you can simply disable assertions for the org.apache.lucene
package.

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de <http://www.thetaphi.de/> 

eMail: uwe@thetaphi.de

 

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Tuesday, October 19, 2010 5:53 PM
To: dev@lucene.apache.org
Subject: Re: Analyzer forcing tokenStream and reusableTokenStream to be
final

 

I still don't understand how not declaring my tokenStream and
reusableTokenStream final can break anything. The methods are there (in my
analyzers), and if I risk overriding them somewhere else it's my problem.

What am I missing?

To add to your email - I too didn't encounter an analyzer that cannot be
reused, yet.

Shai

On Tue, Oct 19, 2010 at 5:45 PM, Robert Muir <rc...@gmail.com> wrote:

On Tue, Oct 19, 2010 at 11:21 AM, Robert Muir <rc...@gmail.com> wrote:
> If someone doesn't override both (e.g. they just override
> tokenStream), then it wouldnt actually use their subclasses code. So
> then the reflection hack from LUCENE-1678 would force the analyzer to
> never re-use, but instead call tokenStream: but this is very bad for
> indexing performance!
>

Here's a jira issue with an example of how the
tokenstream/reusableTokenStream confusion makes this a real problem in
practice:

https://issues.apache.org/jira/browse/LUCENE-2279


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

 

 


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Shai Erera <se...@gmail.com>.
I guess you didn't read my email all the way through - I cannot disable
assertions for Lucene stuff because I use Lucene's assertions to assert that
my indexing code works :).

Shai

On Tue, Oct 19, 2010 at 5:59 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> We simply added that to **test** the bundled analyzers for conformance. If
> you don’t like that, you can simply disable assertions for the
> org.apache.lucene package.
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> *From:* Shai Erera [mailto:serera@gmail.com]
> *Sent:* Tuesday, October 19, 2010 5:53 PM
> *To:* dev@lucene.apache.org
> *Subject:* Re: Analyzer forcing tokenStream and reusableTokenStream to be
> final
>
>
>
> I still don't understand how not declaring my tokenStream and
> reusableTokenStream final can break anything. The methods are there (in my
> analyzers), and if I risk overriding them somewhere else it's my problem.
>
> What am I missing?
>
> To add to your email - I too didn't encounter an analyzer that cannot be
> reused, yet.
>
> Shai
>
> On Tue, Oct 19, 2010 at 5:45 PM, Robert Muir <rc...@gmail.com> wrote:
>
> On Tue, Oct 19, 2010 at 11:21 AM, Robert Muir <rc...@gmail.com> wrote:
> > If someone doesn't override both (e.g. they just override
> > tokenStream), then it wouldnt actually use their subclasses code. So
> > then the reflection hack from LUCENE-1678 would force the analyzer to
> > never re-use, but instead call tokenStream: but this is very bad for
> > indexing performance!
> >
>
> Here's a jira issue with an example of how the
> tokenstream/reusableTokenStream confusion makes this a real problem in
> practice:
>
> https://issues.apache.org/jira/browse/LUCENE-2279
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>

RE: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Uwe Schindler <uw...@thetaphi.de>.
We simply added that to *test* the bundled analyzers for conformance. If you
don't like that, you can simply disable assertions for the org.apache.lucene
package.

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 <http://www.thetaphi.de/> http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Shai Erera [mailto:serera@gmail.com] 
Sent: Tuesday, October 19, 2010 5:53 PM
To: dev@lucene.apache.org
Subject: Re: Analyzer forcing tokenStream and reusableTokenStream to be
final

 

I still don't understand how not declaring my tokenStream and
reusableTokenStream final can break anything. The methods are there (in my
analyzers), and if I risk overriding them somewhere else it's my problem.

What am I missing?

To add to your email - I too didn't encounter an analyzer that cannot be
reused, yet.

Shai

On Tue, Oct 19, 2010 at 5:45 PM, Robert Muir <rc...@gmail.com> wrote:

On Tue, Oct 19, 2010 at 11:21 AM, Robert Muir <rc...@gmail.com> wrote:
> If someone doesn't override both (e.g. they just override
> tokenStream), then it wouldnt actually use their subclasses code. So
> then the reflection hack from LUCENE-1678 would force the analyzer to
> never re-use, but instead call tokenStream: but this is very bad for
> indexing performance!
>

Here's a jira issue with an example of how the
tokenstream/reusableTokenStream confusion makes this a real problem in
practice:

https://issues.apache.org/jira/browse/LUCENE-2279


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

 


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Shai Erera <se...@gmail.com>.
My only problem is that w/o disabling asserts, I cannot bypass these checks.
Hence why I hoped we can limit the check itself to o.a.l code. For someone
who knows what he's doing, we don't allow him to inherit analyzers and
override these methods, yet we protect those who don't know what they're
doing.

It's frustrating :).

I'm all for removing one of them and declare the other one reusable and be
done with it, but I have a feeling this is a matter for a larger discussion.
What I'm asking here is for something much simpler - we don't jeopardize
Lucene code, and we document the risks of not overriding
reusableTokenStream.

Can't we change the assertion to not fail if the class declares
reusableTokenStream, yet nothing is final? Wouldn't that avoid the issues
you've mentioned?

Shai

On Tue, Oct 19, 2010 at 5:59 PM, Robert Muir <rc...@gmail.com> wrote:

> On Tue, Oct 19, 2010 at 11:52 AM, Shai Erera <se...@gmail.com> wrote:
> > I still don't understand how not declaring my tokenStream and
> > reusableTokenStream final can break anything. The methods are there (in
> my
> > analyzers), and if I risk overriding them somewhere else it's my problem.
> >
>
> Well it is your problem, but we created it with our confusing APIs :)
>
> So if you subclass your analyzer but only implement tokenStream and
> not also reusableTokenStream, you get very terrible performance like
> https://issues.apache.org/jira/browse/LUCENE-2279
>
> By enforcing these to be final we prevent the trap where you subclass
> and don't implement reusableTokenStream and get bad performance, but
> its still not completely solved.
> There is still the trap (especially with the attributes-based API,
> even more overhead), that you just implement an Analyzer with only
> tokenStream and get bad performance.
>
> If we only had one of these methods, lets say called "tokenStream",
> and it was reusable, we could remove these final checks completely.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Oct 19, 2010 at 11:52 AM, Shai Erera <se...@gmail.com> wrote:
> I still don't understand how not declaring my tokenStream and
> reusableTokenStream final can break anything. The methods are there (in my
> analyzers), and if I risk overriding them somewhere else it's my problem.
>

Well it is your problem, but we created it with our confusing APIs :)

So if you subclass your analyzer but only implement tokenStream and
not also reusableTokenStream, you get very terrible performance like
https://issues.apache.org/jira/browse/LUCENE-2279

By enforcing these to be final we prevent the trap where you subclass
and don't implement reusableTokenStream and get bad performance, but
its still not completely solved.
There is still the trap (especially with the attributes-based API,
even more overhead), that you just implement an Analyzer with only
tokenStream and get bad performance.

If we only had one of these methods, lets say called "tokenStream",
and it was reusable, we could remove these final checks completely.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Shai Erera <se...@gmail.com>.
I still don't understand how not declaring my tokenStream and
reusableTokenStream final can break anything. The methods are there (in my
analyzers), and if I risk overriding them somewhere else it's my problem.

What am I missing?

To add to your email - I too didn't encounter an analyzer that cannot be
reused, yet.

Shai

On Tue, Oct 19, 2010 at 5:45 PM, Robert Muir <rc...@gmail.com> wrote:

> On Tue, Oct 19, 2010 at 11:21 AM, Robert Muir <rc...@gmail.com> wrote:
> > If someone doesn't override both (e.g. they just override
> > tokenStream), then it wouldnt actually use their subclasses code. So
> > then the reflection hack from LUCENE-1678 would force the analyzer to
> > never re-use, but instead call tokenStream: but this is very bad for
> > indexing performance!
> >
>
> Here's a jira issue with an example of how the
> tokenstream/reusableTokenStream confusion makes this a real problem in
> practice:
>
> https://issues.apache.org/jira/browse/LUCENE-2279
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Oct 19, 2010 at 11:21 AM, Robert Muir <rc...@gmail.com> wrote:
> If someone doesn't override both (e.g. they just override
> tokenStream), then it wouldnt actually use their subclasses code. So
> then the reflection hack from LUCENE-1678 would force the analyzer to
> never re-use, but instead call tokenStream: but this is very bad for
> indexing performance!
>

Here's a jira issue with an example of how the
tokenstream/reusableTokenStream confusion makes this a real problem in
practice:

https://issues.apache.org/jira/browse/LUCENE-2279

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by DM Smith <dm...@gmail.com>.
On Oct 19, 2010, at 12:20 PM, Robert Muir wrote:

> On Tue, Oct 19, 2010 at 12:17 PM, DM Smith <dm...@gmail.com> wrote:
> 
>> I'd be surprised if there are use cases for non-reuse.
>> 
>> IIRC: When we started down the reuse path, the goal was reuse only, not just reuse by default. But in order to bridge the past to the future, there was the possibility of continued non-reuse. In a sense non-reuse was deprecated, but I'm not sure that @deprecated as a mechanism was able to clearly indicate that.
>> 
> 
> Exactly: i don't think theres a clear way to detect that your
> tokenStream() method is "reuse-safe" and deprecate it: e.g. you have
> to implement reset() correctly in your tokenstreams.
> 
> But lets think about this: for non-experts, making Analyzer "reusable
> by default" by removing reusableTokenStream() and reusing
> tokenStream() would probably be the single largest indexing
> performance improvement we could make... the API is so confusing that
> I think many people probably have analyzers that aren't reusing today.
> 
> I think its worth considering a backwards break, especially since as
> Mike mentioned, for the very special (possibly even only theoretical!)
> non-reuse case, there are ways they could still index: but the "fast
> way" should be the "easy/default way".

To me, the backwards break is merely a code break. I can't see how it would break an index.

-- DM


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Shai Erera <se...@gmail.com>.
Which would mean all the implementation needs to do is delete its
tokenStream and rename reusableTokenStream to tokenStream, and be done with
it? Works for me !

Also, if we delete reusaleTokenStream, and if the application used the
@override tag, it'll get a compilation error, which will make the change
even more visible.

Shai

On Tue, Oct 19, 2010 at 6:38 PM, Robert Muir <rc...@gmail.com> wrote:

> On Tue, Oct 19, 2010 at 12:26 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
> > Trunk has new API why not clean all that stuff up? We already have no
> more really old analyzers, as all needed to be upgraded before 3.0. If you
> did not add reset() it’s a bug as its clearly documented to be a must?
> >
>
> This is an idea: i haven't thought it all through but it should be
> possible in trunk to:
> 1. document the change in runtime that tokenStream is assumed to be
> reusable.
> 2. deprecate or remove reusableTokenStream (it is gone and ignored)
>
> then existing analyzers work with no or little modifications.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Oct 19, 2010 at 12:26 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
> Trunk has new API why not clean all that stuff up? We already have no more really old analyzers, as all needed to be upgraded before 3.0. If you did not add reset() it’s a bug as its clearly documented to be a must?
>

This is an idea: i haven't thought it all through but it should be
possible in trunk to:
1. document the change in runtime that tokenStream is assumed to be reusable.
2. deprecate or remove reusableTokenStream (it is gone and ignored)

then existing analyzers work with no or little modifications.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Uwe Schindler <uw...@thetaphi.de>.
Trunk has new API why not clean all that stuff up? We already have no more really old analyzers, as all needed to be upgraded before 3.0. If you did not add reset() it’s a bug as its clearly documented to be a must?

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Tuesday, October 19, 2010 6:21 PM
> To: dev@lucene.apache.org
> Subject: Re: Analyzer forcing tokenStream and reusableTokenStream to be final
> 
> On Tue, Oct 19, 2010 at 12:17 PM, DM Smith <dm...@gmail.com>
> wrote:
> 
> > I'd be surprised if there are use cases for non-reuse.
> >
> > IIRC: When we started down the reuse path, the goal was reuse only, not just
> reuse by default. But in order to bridge the past to the future, there was the
> possibility of continued non-reuse. In a sense non-reuse was deprecated, but
> I'm not sure that @deprecated as a mechanism was able to clearly indicate
> that.
> >
> 
> Exactly: i don't think theres a clear way to detect that your
> tokenStream() method is "reuse-safe" and deprecate it: e.g. you have to
> implement reset() correctly in your tokenstreams.
> 
> But lets think about this: for non-experts, making Analyzer "reusable by default"
> by removing reusableTokenStream() and reusing
> tokenStream() would probably be the single largest indexing performance
> improvement we could make... the API is so confusing that I think many people
> probably have analyzers that aren't reusing today.
> 
> I think its worth considering a backwards break, especially since as Mike
> mentioned, for the very special (possibly even only theoretical!) non-reuse
> case, there are ways they could still index: but the "fast way" should be the
> "easy/default way".
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional
> commands, e-mail: dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by DM Smith <dm...@gmail.com>.
On Oct 19, 2010, at 12:20 PM, Robert Muir wrote:

> On Tue, Oct 19, 2010 at 12:17 PM, DM Smith <dm...@gmail.com> wrote:
> 
>> I'd be surprised if there are use cases for non-reuse.
>> 
>> IIRC: When we started down the reuse path, the goal was reuse only, not just reuse by default. But in order to bridge the past to the future, there was the possibility of continued non-reuse. In a sense non-reuse was deprecated, but I'm not sure that @deprecated as a mechanism was able to clearly indicate that.
>> 
> 
> Exactly: i don't think theres a clear way to detect that your
> tokenStream() method is "reuse-safe" and deprecate it: e.g. you have
> to implement reset() correctly in your tokenstreams.
> 
> But lets think about this: for non-experts, making Analyzer "reusable
> by default" by removing reusableTokenStream() and reusing
> tokenStream() would probably be the single largest indexing
> performance improvement we could make... the API is so confusing that
> I think many people probably have analyzers that aren't reusing today.
> 
> I think its worth considering a backwards break, especially since as
> Mike mentioned, for the very special (possibly even only theoretical!)
> non-reuse case, there are ways they could still index: but the "fast
> way" should be the "easy/default way".

To me, the backwards break is merely a code break. I can't see how it would break an index.

-- DM


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Oct 19, 2010 at 12:17 PM, DM Smith <dm...@gmail.com> wrote:

> I'd be surprised if there are use cases for non-reuse.
>
> IIRC: When we started down the reuse path, the goal was reuse only, not just reuse by default. But in order to bridge the past to the future, there was the possibility of continued non-reuse. In a sense non-reuse was deprecated, but I'm not sure that @deprecated as a mechanism was able to clearly indicate that.
>

Exactly: i don't think theres a clear way to detect that your
tokenStream() method is "reuse-safe" and deprecate it: e.g. you have
to implement reset() correctly in your tokenstreams.

But lets think about this: for non-experts, making Analyzer "reusable
by default" by removing reusableTokenStream() and reusing
tokenStream() would probably be the single largest indexing
performance improvement we could make... the API is so confusing that
I think many people probably have analyzers that aren't reusing today.

I think its worth considering a backwards break, especially since as
Mike mentioned, for the very special (possibly even only theoretical!)
non-reuse case, there are ways they could still index: but the "fast
way" should be the "easy/default way".

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by DM Smith <dm...@gmail.com>.
On Oct 19, 2010, at 11:21 AM, Robert Muir wrote:

> On Tue, Oct 19, 2010 at 11:10 AM, Shai Erera <se...@gmail.com> wrote:
>> Is there real danger in having my analyzer not declaring these methods final
>> - something that can affect Lucene code for example? Or am I only risking my
>> code?
>> 
> 
> There is a real danger: bugs like
> https://issues.apache.org/jira/browse/LUCENE-1678
> 
> I would love for us to re-think the whole
> tokenStream/reusableTokenStream issue...
> 
> If someone doesn't override both (e.g. they just override
> tokenStream), then it wouldnt actually use their subclasses code. So
> then the reflection hack from LUCENE-1678 would force the analyzer to
> never re-use, but instead call tokenStream: but this is very bad for
> indexing performance!
> 
> Are there still real use cases where an analyzer cannot actually
> reuse? For example, all Solr tokenstreams are reused. With an
> application as big and widely used as that having no need for
> non-reusable tokenStream(), I think we should seriously consider
> simplifying the analysis api to be "reusable by default".

I'd be surprised if there are use cases for non-reuse.

IIRC: When we started down the reuse path, the goal was reuse only, not just reuse by default. But in order to bridge the past to the future, there was the possibility of continued non-reuse. In a sense non-reuse was deprecated, but I'm not sure that @deprecated as a mechanism was able to clearly indicate that.

DM


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Oct 19, 2010 at 11:10 AM, Shai Erera <se...@gmail.com> wrote:
> Is there real danger in having my analyzer not declaring these methods final
> - something that can affect Lucene code for example? Or am I only risking my
> code?
>

There is a real danger: bugs like
https://issues.apache.org/jira/browse/LUCENE-1678

I would love for us to re-think the whole
tokenStream/reusableTokenStream issue...

If someone doesn't override both (e.g. they just override
tokenStream), then it wouldnt actually use their subclasses code. So
then the reflection hack from LUCENE-1678 would force the analyzer to
never re-use, but instead call tokenStream: but this is very bad for
indexing performance!

Are there still real use cases where an analyzer cannot actually
reuse? For example, all Solr tokenstreams are reused. With an
application as big and widely used as that having no need for
non-reusable tokenStream(), I think we should seriously consider
simplifying the analysis api to be "reusable by default".

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Analyzer forcing tokenStream and reusableTokenStream to be final

Posted by Michael McCandless <lu...@mikemccandless.com>.
Can't we have only one API?

Call it .tokenStream and it's always reusable?

What breaks?  Remember the re-use is only w/in a thread.  Each thread
gets its own instance.

Apps that for some reason cannot reuse even w/in a thread can always
make a Field w/ the tokenStream pre-set?

Mike

On Tue, Oct 19, 2010 at 11:10 AM, Shai Erera <se...@gmail.com> wrote:
> Hi
>
> I understand the assertions in Analyzer to enforce 'final' on those methods,
> but I wanted to ask why do we care if a user's code does not declare them
> final? Why fail the tests on it? Can we change the assertion to fail if the
> code from o.a.l?
>
> The thing is, we run tests w/ -ea to catch all sorts of errors (real errors)
> Lucene may detect. Having some custom analyzers non-final is something we
> may choose to live with, yet our tests keep failing. I don't want to argue
> if we should or shouldn't make them final - just ask why does Lucene code
> cares if the application code may not follow 'best practices'.
>
> Is there real danger in having my analyzer not declaring these methods final
> - something that can affect Lucene code for example? Or am I only risking my
> code?
>
> Running w/o -ea is not an option, so please avoid suggesting it :).
>
> Shai
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org