Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2011/05/28 19:48:47 UTC

[jira] [Created] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core

Make all of Analysis completely independent from Lucene Core
------------------------------------------------------------

                 Key: LUCENE-3151
                 URL: https://issues.apache.org/jira/browse/LUCENE-3151
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 4.0
            Reporter: Grant Ingersoll
             Fix For: 4.0


Lucene's analysis package, including the definitions of Attribute, TokenStream, etc., is quite useful outside of Lucene (for instance, Mahout uses it) for text processing.  I'd like to move the definitions, or at least their packaging, to a separate JAR file so that one can consume them w/o needing Lucene core.  My draft idea is to have a definition area that Lucene core depends on; the rest of the analysis package can then depend on the definition area as well.  (I'm open to other ideas as well)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040641#comment-13040641 ] 

Robert Muir commented on LUCENE-3151:
-------------------------------------

I agree, and wanted to mention we shouldn't limit ourselves based on packaging.

For example, we can have analyzers-def and analyzers-impl, but actually shove analyzers-def into the lucene-core jar for simplicity/packaging purposes if we want.

But this way you could still use the analyzers without lucene-core if you wanted.



[jira] [Updated] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-3151:
------------------------------------

    Attachment: LUCENE-3151.patch

The patch doesn't fully compile yet (though core does) because of our recursive build system, but it at least fleshes out the proposed directory layout.  I may, however, change src/declarations to src/common, in which case we would have lucene-common.jar.  I was surprised by how much I needed to move out of core (e.g. BytesRef).


[jira] [Commented] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067432#comment-13067432 ] 

Lance Norskog commented on LUCENE-3151:
---------------------------------------

_*Architects remove dependencies*_

For external use, this locksteps the external user (Mahout for example) to changes in these data structures. It's a direct coupling. This is how you get conflicting dependencies, what the Linux people call "RPM Hell". 

If you can make a minimal class for export, then have Lucene use a larger class, that might work. Here is a _semi-coupled_ design:
h5. public class ITerm
* A really minimal API that will never be changed, only added onto.
* Code that uses this API will always work; that is the contract.
** clone() is banned (via UnsupportedOperationException).
** If a class implements clone(), all subclasses must also implement it.
* I would also ban equals & hashCode; if you want these, make your own subclass that delegates to a real Term subclass.

h5. public class Term extends ITerm
* This is what Lucene uses.
* It can be versioned. 
* If you code to this, you lock your binaries to Lucene release jars.
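The semi-coupled design described above could look roughly like the following Java sketch. Only the names ITerm and Term and the bans on clone(), equals and hashCode come from the description; the field/text accessors, the constructors and the toString override are assumptions made for illustration, and none of this is Lucene's actual Term class.

```java
import java.util.Objects;

// Hypothetical sketch of the semi-coupled design; not Lucene's real Term API.

/** Minimal exported type: methods may be added, existing ones never change. */
class ITerm {
    private final String field; // assumed payload, for illustration only
    private final String text;  // assumed payload, for illustration only

    ITerm(String field, String text) {
        this.field = Objects.requireNonNull(field, "field");
        this.text = Objects.requireNonNull(text, "text");
    }

    public final String field() { return field; }

    public final String text() { return text; }

    /** clone() is banned on the exported type. */
    @Override
    public Object clone() {
        throw new UnsupportedOperationException("clone() is not part of the ITerm contract");
    }

    /** equals/hashCode are banned as well, under one reading of the proposal. */
    @Override
    public boolean equals(Object other) {
        throw new UnsupportedOperationException("equals() is not part of the ITerm contract");
    }

    @Override
    public int hashCode() {
        throw new UnsupportedOperationException("hashCode() is not part of the ITerm contract");
    }

    /** Overridden because Object.toString() would call the banned hashCode(). */
    @Override
    public String toString() { return field + ":" + text; }
}

/** Versioned internal type: coding against it ties binaries to a release. */
class Term extends ITerm {
    Term(String field, String text) { super(field, text); }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Term)) return false;
        Term that = (Term) other;
        return field().equals(that.field()) && text().equals(that.text());
    }

    @Override
    public int hashCode() {
        return 31 * field().hashCode() + text().hashCode();
    }
}
```

An external user holding only an ITerm can read field() and text() but cannot put it in a HashMap key position without writing a wrapper, which is the coupling trade-off the proposal accepts.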
    
Here is a _fully-decoupled_ design:
* Separate suite of main Lucene objects, with minimal features as above.
* Separate Lucene library that translates/wraps/etc. between this parallel suite and the Lucene versions. Lucene exports this jar and works very hard to avoid version changes.
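A minimal Java sketch of the fully-decoupled variant, assuming a single term-like object. CoreTerm, ExportedTerm and TermBridge are hypothetical names invented here; CoreTerm merely stands in for whatever Lucene uses internally.

```java
import java.util.Objects;

// Hypothetical sketch of the fully-decoupled design; all names are illustrative.

/** Stand-in for a Lucene-internal class that is free to change between releases. */
final class CoreTerm {
    final String field;
    final String text;

    CoreTerm(String field, String text) {
        this.field = Objects.requireNonNull(field, "field");
        this.text = Objects.requireNonNull(text, "text");
    }
}

/** Member of the parallel, minimal suite that external users code against. */
final class ExportedTerm {
    private final String field;
    private final String text;

    ExportedTerm(String field, String text) {
        this.field = Objects.requireNonNull(field, "field");
        this.text = Objects.requireNonNull(text, "text");
    }

    String field() { return field; }

    String text() { return text; }
}

/** The separate bridge library: the only code that sees both type families. */
final class TermBridge {
    private TermBridge() {}

    /** Copies an internal term into the stable exported form. */
    static ExportedTerm toExported(CoreTerm core) {
        return new ExportedTerm(core.field, core.text);
    }

    /** Copies an exported term back into the internal form. */
    static CoreTerm toCore(ExportedTerm exported) {
        return new CoreTerm(exported.field(), exported.text());
    }
}
```

Here the two type families share no superclass, so a change to CoreTerm only forces a change in TermBridge, at the cost of copying on every crossing.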

It's a hard problem all around, and different solutions have failed in their own ways. Error-handling is a particularly big problem. Using these objects in parallel brings its own funkiness.



[jira] [Commented] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040668#comment-13040668 ] 

Grant Ingersoll commented on LUCENE-3151:
-----------------------------------------

I would propose adding lucene/src/analysis-defs, which would contain all of the analysis declarations (including attributes); the main build would depend on it being built first.  I thought about moving it to modules/analysis, but that makes for some clunky Ant, IMO (although I'm not sure this is any less clunky).

bq. but actually shove the analyzers-def into the lucene-core jar for simplicity/packaging purposes if we want.
I'm not sure about shoving them into lucene-core, just b/c I wonder whether people might then think they need both jars, since they won't know the definitions are already in core.  Not sure on that one, so I'm not ruling it out.


[jira] [Commented] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044282#comment-13044282 ] 

Grant Ingersoll commented on LUCENE-3151:
-----------------------------------------




It's not too bad, except for the build system's recursive nature.  Not sure how to get around that yet.


I did it for Token.  I think the others are useful at the definition layer if someone wants just this piece of analysis, but not all of Lucene's implementations.  But, could be persuaded otherwise.


QueryParserBase has a dependency here, so if we could fix that, then we might be able to do this.  That being said, they are useful constructs for someone who wants them w/o all of Lucene's implementations.


I've got a new patch that helps here w/ some, but some of those utils are pretty useful in the context of a common area, I guess.





[jira] [Commented] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040718#comment-13040718 ] 

Robert Muir commented on LUCENE-3151:
-------------------------------------

Looks like it makes sense that we would have to pull out these classes to do it now... but here are a few thoughts, maybe for discussion. This stuff certainly should not block this issue (it's hard refactoring and a lot of work); these are just ideas for the future.

As far as analyzers:
* Does the lucene-core/common jar need to have all the tokenAttributes? Maybe it should only have the ones that the indexer etc. actually consume, and things like TypeAttribute, FlagsAttribute, KeywordAttribute, Token, etc. should simply be moved to the analysis module.
* Does the lucene-core/common jar need to have Tokenizer/TokenFilter/CharFilter/CharReader/etc.? It seems like it really only needs TokenStream, and those could also be moved to the analysis module.
* Currently I think it's bad that the analyzers depend upon so much of Lucene's util package (some of it internal)... Long term we want to get rid of the cumbersome backwards compatibility methods like Version and ideally have a very minimal interface between core and analysis, so that you could safely just use your old analyzers jar file, etc. Maybe we should see how hard it is to remove some of these util dependencies?

So in a way, this issue is related to LUCENE-2309...
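
To make the "very minimal interface between core and analysis" idea concrete, here is a rough Java sketch. MinimalTokenStream, WhitespaceTokens and CoreConsumer are hypothetical names, and the interface is deliberately much smaller than Lucene's real TokenStream; the only point is that the core side compiles against the interface alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these are not Lucene's real TokenStream classes.

/** The only analysis type the "core" side would need to see. */
interface MinimalTokenStream {
    /** Advances to the next token; returns false when the stream is exhausted. */
    boolean incrementToken();

    /** Text of the current token; valid only after incrementToken() returned true. */
    String term();
}

/** Would live in the analysis module: a concrete whitespace tokenizer. */
final class WhitespaceTokens implements MinimalTokenStream {
    private final String[] parts;
    private int next = 0;
    private String current;

    WhitespaceTokens(String input) {
        String trimmed = input.trim();
        // Avoid the single empty token that split() yields for an empty string.
        this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    public boolean incrementToken() {
        if (next >= parts.length) return false;
        current = parts[next++];
        return true;
    }

    @Override
    public String term() { return current; }
}

/** Would live in core: consumes tokens through the interface alone. */
final class CoreConsumer {
    private CoreConsumer() {}

    static List<String> collect(MinimalTokenStream stream) {
        List<String> terms = new ArrayList<>();
        while (stream.incrementToken()) {
            terms.add(stream.term());
        }
        return terms;
    }
}
```

With a split like this, an older analyzers jar keeps working as long as MinimalTokenStream itself does not change.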




[jira] [Commented] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040644#comment-13040644 ] 

Grant Ingersoll commented on LUCENE-3151:
-----------------------------------------

Analysis could even be released independently.  I've got a start to a patch that I hope to put up today as a POC.


[jira] [Commented] (LUCENE-3151) Make all of Analysis completely independent from Lucene Core

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040640#comment-13040640 ] 

Michael McCandless commented on LUCENE-3151:
--------------------------------------------

+1
