You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Ian Soboroff (JIRA)" <ji...@apache.org> on 2011/04/29 22:07:03 UTC

[jira] [Created] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers

LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
----------------------------------------------------------------------

                 Key: LUCENE-3055
                 URL: https://issues.apache.org/jira/browse/LUCENE-3055
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 3.1
            Reporter: Ian Soboroff


LUCENE-2372 and LUCENE-2389 marked all analyzers as final.  This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification e.g. to tokenStream().  These issues don't indicate a new method of doing this.  The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028189#comment-13028189 ] 

Robert Muir commented on LUCENE-3055:
-------------------------------------

{quote}
Also, if reusableTokenStream is the only method left standing, isn't it wise to hide actual reuse somewhere in Lucene internals and turn Analyzer into plain and dumb factory interface?
{quote}

Hi Earwin: I completely agree that somehow Analyzer should be a "plain and dumb" interface, but are you suggesting we should move the responsibility of reuse onto the consumer? I think this could be challenging, alternatively there might be a way to present a "plain and dumb" API with the reuse guts buried inside Analyzer itself (like ReusableAnalyzerBase), and reuse enforced (e.g. the tokenStream() is final and you cannot "disable" reuse). The trick would be handling the special cases such as AnalyzerWrappers but I feel like we could still do this.

Either way, I really think we should try to do this for 4.0. Though I think to get there it would be safest if we addressed a few issues first:

* LUCENE-2788: make charfilters reusable, otherwise we will make the same mistake again!
* LUCENE-3064: ensure consumers are properly using the API e.g. calling reset()
* LUCENE-3040: cut all consumers over to reusable API, so its really the "one left standing"



> LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-3055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Ian Soboroff
>
> LUCENE-2372 and LUCENE-2389 marked all analyzers as final.  This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification e.g. to tokenStream().  These issues don't indicate a new method of doing this.  The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027361#comment-13027361 ] 

Earwin Burrfoot commented on LUCENE-3055:
-----------------------------------------

Could anyone remind me, why the hell do we still have Analyzer.tokenStream AND reusableTokenStream rampaging around and confusing minds? We always recommend to use the latter, Robert just fixed some of the core classes to use the latter.

Also, if reusableTokenStream is the only method left standing, isn't it wise to hide actual reuse somewhere in Lucene internals and turn Analyzer into plain and dumb factory interface?

> LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-3055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Ian Soboroff
>
> LUCENE-2372 and LUCENE-2389 marked all analyzers as final.  This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification e.g. to tokenStream().  These issues don't indicate a new method of doing this.  The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027179#comment-13027179 ] 

Uwe Schindler commented on LUCENE-3055:
---------------------------------------

{quote}
>From my perspective the most important reason is to avoid a huge performance trap: previously if you subclassed one of these analyzers, override tokenStream(), and added SpecialFilter for example, most of the time users would actually slow down indexing, because now reusableTokenStream() cannot be used by the indexer.
{quote}

Additionally, exactly this special case (overwriting one of the methods) was the biggest problem, leading to ugly reflection based checks in Lucene 3.0: In 3.0 StandardAnalyzer correctly implemented both tokenStream() and reuseableTokenStream(). As soon as one subclass only overrided tokenStream(), but the indexer still calling reuseableTokenStream() the changes were not even used, leading to lots of bug reports. Because of this, a reflection based backwards hack was done in 3.0 (see o.a.l.util.VirtualMethod class to make this easier), that prevented the indexer from calling reuseableTokenStream if a subclass suddenly overwrote only one of the methods. With moving forward in 3.1, these backwards hacks even got heavier (e.g. changes in TokenStreams, new base class ReuseableAnalyzerBase,...), so the only solution was to enforce the decorator pattern.

The above example by Robert is the correct way to implement you "factory" of TokenStreams. Everything else like subclassing StandardAnalyzer is ugly as it hides what you are really doing. The above pattern does exactly what also Solr's Schemadoes: You have to explicitely list all your components, making it clear what your TokenStreams are doing.

Trust me, the above example is shorter than subclassing previous StandardAnalyzer completely (both tokenStream and reuseableTokenStream) and is showing like solrschema.xml what your Analyzer looks like (no hidden stuff in superfactories,...)

> LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-3055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Ian Soboroff
>
> LUCENE-2372 and LUCENE-2389 marked all analyzers as final.  This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification e.g. to tokenStream().  These issues don't indicate a new method of doing this.  The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027171#comment-13027171 ] 

Robert Muir commented on LUCENE-3055:
-------------------------------------

Hi Ian, you are right the justifications don't totally explain the reasoning behind this change.

>From my perspective the most important reason is to avoid a huge performance trap: previously if you subclassed one of these analyzers, override tokenStream(), and added SpecialFilter for example, most of the time users would actually slow down indexing, because now reusableTokenStream() cannot be used by the indexer.

This created worst-case situations like LUCENE-2279.

Instead, the recommended approach is to just let analyzers be tokenstream factories (which is all they are). They aren't really "extendable" only "overridable" since they are just factories for tokenstreams, and by doing so it creates the worst-case performance trap where new objects are created for every document. I would instead recommend writing your analyzer by extending ReusableAnalyzerBase instead, which is easy and safe:
{noformat}
Analyzer analyzer = new ReusableAnalyzerBase() {
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new WhitespaceTokenizer(...);
    TokenStream filteredStream = new FooTokenFilter(tokenizer, ...);
    filteredStream = new BarTokenFilter(filteredStream, ...);
    return new TokenStreamComponents(tokenizer, filteredStream);
  }
};
{noformat}


> LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-3055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Ian Soboroff
>
> LUCENE-2372 and LUCENE-2389 marked all analyzers as final.  This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification e.g. to tokenStream().  These issues don't indicate a new method of doing this.  The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027179#comment-13027179 ] 

Uwe Schindler edited comment on LUCENE-3055 at 4/29/11 8:51 PM:
----------------------------------------------------------------

{quote}
>From my perspective the most important reason is to avoid a huge performance trap: previously if you subclassed one of these analyzers, override tokenStream(), and added SpecialFilter for example, most of the time users would actually slow down indexing, because now reusableTokenStream() cannot be used by the indexer.
{quote}

Additionally, exactly this special case (overwriting one of the methods) was the biggest problem, leading to ugly reflection based checks in Lucene 3.0: In 3.0 StandardAnalyzer correctly implemented both tokenStream() and reuseableTokenStream(). As soon as one subclass only overrided tokenStream(), but the indexer still calling reuseableTokenStream() the changes were not even used, leading to lots of bug reports. Because of this, a reflection based backwards hack was done in 3.0 (see o.a.l.util.VirtualMethod class to make this easier), that prevented the indexer from calling reuseableTokenStream if a subclass suddenly overwrote only one of the methods. With moving forward in 3.1, these backwards hacks even got heavier (e.g. changes in TokenStreams, new base class ReuseableAnalyzerBase,...), so the only solution was to enforce the decorator pattern.

The above example by Robert is the correct way to implement your "factory" of TokenStreams. Everything else like subclassing StandardAnalyzer is ugly as it hides what you are really doing. The above pattern does exactly what also Solr's Schema does: You have to explicitely list all your components, making it clear what your TokenStreams are doing.

Trust me, the above example is shorter than subclassing previous StandardAnalyzer completely (both tokenStream and reuseableTokenStream) and is showing like solrschema.xml what your Analyzer looks like (no hidden stuff in superfactories,...)

      was (Author: thetaphi):
    {quote}
>From my perspective the most important reason is to avoid a huge performance trap: previously if you subclassed one of these analyzers, override tokenStream(), and added SpecialFilter for example, most of the time users would actually slow down indexing, because now reusableTokenStream() cannot be used by the indexer.
{quote}

Additionally, exactly this special case (overwriting one of the methods) was the biggest problem, leading to ugly reflection based checks in Lucene 3.0: In 3.0 StandardAnalyzer correctly implemented both tokenStream() and reuseableTokenStream(). As soon as one subclass only overrided tokenStream(), but the indexer still calling reuseableTokenStream() the changes were not even used, leading to lots of bug reports. Because of this, a reflection based backwards hack was done in 3.0 (see o.a.l.util.VirtualMethod class to make this easier), that prevented the indexer from calling reuseableTokenStream if a subclass suddenly overwrote only one of the methods. With moving forward in 3.1, these backwards hacks even got heavier (e.g. changes in TokenStreams, new base class ReuseableAnalyzerBase,...), so the only solution was to enforce the decorator pattern.

The above example by Robert is the correct way to implement you "factory" of TokenStreams. Everything else like subclassing StandardAnalyzer is ugly as it hides what you are really doing. The above pattern does exactly what also Solr's Schemadoes: You have to explicitely list all your components, making it clear what your TokenStreams are doing.

Trust me, the above example is shorter than subclassing previous StandardAnalyzer completely (both tokenStream and reuseableTokenStream) and is showing like solrschema.xml what your Analyzer looks like (no hidden stuff in superfactories,...)
  
> LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-3055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Ian Soboroff
>
> LUCENE-2372 and LUCENE-2389 marked all analyzers as final.  This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification e.g. to tokenStream().  These issues don't indicate a new method of doing this.  The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114148#comment-13114148 ] 

Robert Muir commented on LUCENE-3055:
-------------------------------------

Chris has made a ton of progress here, I think we are very close, though it would be good revisit LUCENE-2788 in the future and ensure that for 4.0 charfilters have a reusable API as well (this is currently not the case).


> LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-3055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 3.1
>            Reporter: Ian Soboroff
>
> LUCENE-2372 and LUCENE-2389 marked all analyzers as final.  This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification e.g. to tokenStream().  These issues don't indicate a new method of doing this.  The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org