Posted to dev@lucene.apache.org by Michael Busch <bu...@gmail.com> on 2009/03/21 12:34:30 UTC

Re: Modularization

On 3/21/09 11:26 AM, Michael McCandless wrote:
> I think we are mixing up source code modularity with
> bundling/packaging.
>
> Honestly, I would not mind much where the source code lives in svn, so
> long as a developer, upon downloading Lucene 2.9, can go to *one*
> place (javadocs) for Lucene's "queries & filters" and see
> {Int,Long}NumberRangeFilter in there.
> We are not there today: a developer must first realize there's a whole
> separate place to look for "other" queries (contrib/queries).  Then
> the developer browses that and likely becomes confused/misled by what
> TrieRangeQuery means (is it a letter trie?).
>
> My goal here is Lucene's consumability -- when someone new says "hey I
> heard about this great search library called Lucene; let me go try it
> out" I want that first impression to be as solid as possible.  I think
> this is very important for growing Lucene's community.  This is why
> "out of the box" defaults are so crucial (eg changing IW from flushing
> every 10 docs to every 16 MB gained sizable throughput).
>
So this guy landing on http://lucene.apache.org/java/docs/index.html 
sees the "Overview" section first. That one only gives a very short 
introduction to what Lucene is. He might then look at "Features", which 
is also not very specific. I think the next thing would then be to look 
for the documentation of the newest release, so he would click on 
"Lucene 2.4.1 Documentation". The landing page doesn't say much, except 
that it tells you to go look for the javadocs and other docs in the menu. 
So the "Getting Started" link might be the first one to go to, but it's 
also pretty far down the list. So probably he would click on the 
javadocs first. Now he encounters "All, Core, Demo, Contrib". Until now, 
he hasn't read the word "Contrib" anywhere. I think we basically have no 
documentation anywhere that introduces the concept of contribs or says 
where to find them. Even the "Contributions" section talks about 
something else. So that guy probably then looks through the demo and 
examples and ends up using only core features until becoming more 
familiar with Lucene as a whole. Maybe he actually ends up buying LIA(2) :)

> How many times have we seen a review, article, blog post, etc.,
> comparing Lucene to other search libraries only to incorrectly
> complain because "Lucene can't do XYZ" or "Lucene's indexing
> performance is poor", etc, because they didn't dig in to learn all the
> tunings/options/tricks we all know you are supposed to do?  (It
> frustrates me to no end when this happens).  This then hurts Lucene's
> adoption because others read such articles and conclude Lucene is a
> non-starter.
>
> We all ought to be concerned with Lucene's adoption & growth with time
> (I am), and first-impression consumability / out of the box defaults
> are big drivers of that.
>
> point?) we change how Lucene is bundled, such that core queries and
> contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
> lucene-analyzers-3.0.jar would include contrib/analyzers/* and
> org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.
>

So yeah I like this and 3.0 is a good opportunity to do this. I think a 
big part of this work should be good documentation. As you mentioned, 
Mike, it should be very simple to get an overview of what the different 
modules are. So there should be the list of the different modules, 
together with a short description for each of them and information about where 
to find them (which jar). Then by clicking on e.g. queries, the user 
would see the list of all queries we support.

But I think we should still have "main modules", such as core, queries, 
analyzers, ... and separately e.g. "sandbox modules?", for the things 
currently in contrib that are experimental or, as Mark called them, 
"graveyard contribs" :) ... even though we might then as well ask the 
question whether we cannot really bury the latter ones...

> Mike
>
> Michael Busch wrote:
>
>> On 3/21/09 12:27 AM, Michael Busch wrote:
>>> +1. I'd love to see Lucene going into such a direction.
>>>
>>> However, I'm a little worried about contrib's reputation. I think it 
>>> contains components with differing levels of activity, maturity and 
>>> support.
>>> So maybe instead of moving things from core into contrib to achieve 
>>> the goal you mentioned, we could create a new folder named e.g. 
>>> 'components', which will contain stuff that we claim is as stable, 
>>> mature and supported as the core, just packaged into separate jars. 
>>> Those jars should then only have dependencies on the core, but not 
>>> on each other. They would also follow the same 
>>> backwards-compatibility and other requirements as the core. Thoughts?
>>
>> I guess something very similar has been proposed and discussed here: 
>> http://www.nabble.com/Moving-SweetSpotSimilarity-out-of-contrib-to19267437.html#a19320894 
>>
>> (same link that Hoss sent while having his deja vu)...
>>
>> -Michael
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>

Re: Modularization

Posted by Grant Ingersoll <gs...@apache.org>.

I'm really ambivalent about Maven.  Having just converted Mahout to
it, while using it for some other projects and having used it quite a
bit in the past, I am still on the fence (although I am mostly happy
w/ it for Mahout).  I keep being lured in by the promise of it (dep.
management, convention over configuration, POM, and most importantly,
being able to point IntelliJ at it and have it set up my project
structures), but
then left hanging by the execution/bugginess of components.  For a  
simpler project structure like Mahout, it has worked pretty well  
except for the release stuff (which was a major pain and still isn't  
perfect), but for Lucene, I'm not so sure.  Multimodule support in  
Maven is OK at best and we have a lot of modules in Lucene.

Having been on the Maven list a number of times in the past, my sense  
was that it was overwhelmed by the sheer number of requests for help  
and the community itself was not able to keep up, so getting help may  
be more difficult.  Maybe that has changed since I was last on (about  
1.5 years ago)

Customization work in Maven is also a pain, and I have yet to see a  
project of any significance that didn't require some customization, no  
matter how much you follow the conventions.   For instance, Lucene's  
automated regression tests come to mind.  And, I am willing to bet  
Lucene's release process would need to be customized.

Finally, we have a pretty large installed base.  This is not something  
we should do lightly (not that anyone was suggesting otherwise).  We  
have a working build system and we have pretty broad Ant knowledge in  
the project (including the guy who wrote the book on Ant).

To sum up, I'm -0.9.  You might be able to convince me of using Maven,  
but the execution would really have to overcome a whole lot in order  
to do so.

-Grant

On Apr 9, 2009, at 6:48 PM, Earwin Burrfoot wrote:

> On Fri, Apr 10, 2009 at 02:25, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>> Or just make it trivial to get all jars that fit a given profile w/o
>> actually merging those jars into an uber-jar ... does maven's
>> dependency management have anything like "bundles" or "virtual packages" so
>> we could publish a "lucene-all-analyzers" POM that didn't have an actual
>> lucene-all-analyzers.jar but listed dependencies on all of the individual
>> jars?
>
> Maven can do this. Not sure transitive dependencies were meant to be
> used that way, but they definitely work like you want.
>
>> I think ideally the existing contrib/analysis would be broken up by
>> language -- even if that means only 2 or 3 classes per jar -- but i don't
>> deal with multilingual stuff much so i don't have much of an opinion ...
>> perhaps the majority of our users that deal with non-english tend to deal
>> with *lots* of languages so having a single "multilingual-analysis" module
>> would be suitable.
>
> I bet lots of users dealing with non-english language deal only with
> it, because they're providing local services. Like we're working with
> a mix of russian/english/ukrainian.
>
> But my point really is that I don't see any adequate reason to have
> dozens of well-defined micromodules.
> People that care big time about dead weight in their distributions
> should use tools like jar jar links anyway. (If I remember right, one
> of its abilities is to build an uberjar from a bunch of jars, dropping
> unused classes in the process)
>
>
> -- 
> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>

Re: Modularization

Posted by Earwin Burrfoot <ea...@gmail.com>.

On Fri, Apr 10, 2009 at 02:25, Chris Hostetter <ho...@fucit.org> wrote:
> Or just make it trivial to get all jars that fit a given profile w/o
> actually merging those jars into an uber-jar ... does maven's
> dependency management have anything like "bundles" or "virtual packages" so
> we could publish a "lucene-all-analyzers" POM that didn't have an actual
> lucene-all-analyzers.jar but listed dependencies on all of the individual
> jars?

Maven can do this. Not sure transitive dependencies were meant to be
used that way, but they definitely work like you want.
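
To make the "POM without a jar" idea concrete, here is a minimal sketch of
how such a bundle could resolve to a set of jars through transitive
dependencies. It is a toy model written for illustration and does not use
Maven's own API; the class BundleResolver and the artifact names other than
"lucene-all-analyzers" are assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of transitive dependency resolution. This is NOT Maven's API,
// and every artifact name except "lucene-all-analyzers" is hypothetical.
final class BundleResolver {
    // artifact id -> ids of its directly declared dependencies
    private final Map<String, List<String>> declared = new LinkedHashMap<>();
    // ids of artifacts that ship a jar; a bundle-style POM is absent from this set
    private final Set<String> shipsJar = new LinkedHashSet<>();

    void artifact(String id, boolean hasJar, String... deps) {
        declared.put(id, Arrays.asList(deps));
        if (hasJar) {
            shipsJar.add(id);
        }
    }

    // Returns every jar reachable from root through declared dependencies.
    Set<String> resolveJars(String root) {
        Set<String> visited = new LinkedHashSet<>();
        Set<String> jars = new LinkedHashSet<>();
        collect(root, visited, jars);
        return jars;
    }

    private void collect(String id, Set<String> visited, Set<String> jars) {
        if (!visited.add(id)) {
            return; // already handled; this also stops on dependency cycles
        }
        if (shipsJar.contains(id)) {
            jars.add(id);
        }
        for (String dep : declared.getOrDefault(id, Collections.<String>emptyList())) {
            collect(dep, visited, jars);
        }
    }

    // Builds a small hypothetical artifact graph around one bundle.
    static BundleResolver example() {
        BundleResolver r = new BundleResolver();
        r.artifact("lucene-core", true);
        r.artifact("lucene-analyzers-fr", true, "lucene-core");
        r.artifact("lucene-analyzers-th", true, "lucene-core");
        // The bundle has no jar of its own; it only lists dependencies.
        r.artifact("lucene-all-analyzers", false,
                "lucene-analyzers-fr", "lucene-analyzers-th");
        return r;
    }
}
```

Depending on the bundle alone pulls in the per-language jars and, through
them, the core jar, while the bundle itself contributes nothing to the
classpath.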

> I think ideally the existing contrib/analysis would be broken up by
> language -- even if that means only 2 or 3 classes per jar -- but i don't
> deal with multilingual stuff much so i don't have much of an opinion ...
> perhaps the majority of our users that deal with non-english tend to deal
> with *lots* of languages so having a single "multilingual-analysis" module
> would be suitable.

I bet lots of users dealing with non-english language deal only with
it, because they're providing local services. Like we're working with
a mix of russian/english/ukrainian.

But my point really is that I don't see any adequate reason to have
dozens of well-defined micromodules.
People that care big time about dead weight in their distributions
should use tools like jar jar links anyway. (If I remember right, one
of its abilities is to build an uberjar from a bunch of jars, dropping
unused classes in the process)

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

Re: Modularization

Posted by Earwin Burrfoot <ea...@gmail.com>.

On Mon, Mar 23, 2009 at 22:13, Mark Miller <ma...@gmail.com> wrote:
> Earwin Burrfoot wrote:
>>>
>>> - contrib has always had a lower bar and stuff was committed under
>>> that lower bar - there should be no blanket promotion.
>>> - contrib items may have different dependencies... putting it all
>>> under the same source root can make a developer's job harder
>>> - many contrib items are less related to lucene-java core indexing and
>>> searching... if there is no contrib, then they don't belong in the
>>> lucene-java project at all.
>>> - right now it's clear - core can't have dependencies on non-core
>>> classes.  If everything is stuck in the same source tree, that goes
>>> away.
>>>
>>
>> Adding to this, afaik contribs have no java 1.4 restriction. If you
>> merge them into the core, you must either enforce it for contribs, or
>> lift it from the core. I think both variants may be a reason for
>> several heart attacks :)
>> One could argue that five years after 1.5 was released Lucene is going
>> to use it, so the point is no longer relevant. Sorry, 1.7 is just
>> behind the door.
>>
>>
>
> I think we are considering this for Lucene 3.0 (should be the release after
> next) which will allow Java 1.5.

So where are you going to put 1.6 and 1.7 contribs?

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

Re: Modularization

Posted by Mark Miller <ma...@gmail.com>.

Earwin Burrfoot wrote:
>> - contrib has always had a lower bar and stuff was committed under
>> that lower bar - there should be no blanket promotion.
>> - contrib items may have different dependencies... putting it all
>> under the same source root can make a developer's job harder
>> - many contrib items are less related to lucene-java core indexing and
>> searching... if there is no contrib, then they don't belong in the
>> lucene-java project at all.
>> - right now it's clear - core can't have dependencies on non-core
>> classes.  If everything is stuck in the same source tree, that goes
>> away.
>>     
> Adding to this, afaik contribs have no java 1.4 restriction. If you
> merge them into the core, you must either enforce it for contribs, or
> lift it from the core. I think both variants may be a reason for
> several heart attacks :)
> One could argue that five years after 1.5 was released Lucene is going
> to use it, so the point is no longer relevant. Sorry, 1.7 is just
> behind the door.
>
>   
I think we are considering this for Lucene 3.0 (should be the release 
after next) which will allow Java 1.5.

- Mark

Re: Modularization

Posted by Earwin Burrfoot <ea...@gmail.com>.

> - contrib has always had a lower bar and stuff was committed under
> that lower bar - there should be no blanket promotion.
> - contrib items may have different dependencies... putting it all
> under the same source root can make a developer's job harder
> - many contrib items are less related to lucene-java core indexing and
> searching... if there is no contrib, then they don't belong in the
> lucene-java project at all.
> - right now it's clear - core can't have dependencies on non-core
> classes.  If everything is stuck in the same source tree, that goes
> away.
Adding to this, afaik contribs have no java 1.4 restriction. If you
merge them into the core, you must either enforce it for contribs, or
lift it from the core. I think both variants may be a reason for
several heart attacks :)
One could argue that five years after 1.5 was released Lucene is going
to use it, so the point is no longer relevant. Sorry, 1.7 is just
behind the door.

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

Re: Modularization

Posted by Chris Hostetter <ho...@fucit.org>.

: Then during build we can package up certain combinations.  I think
: there should be sub-kitchen-sink jars by area, eg a jar that contains
: all analyzers/tokenstreams/filters, all queries/filters, etc.

Or just make it trivial to get all jars that fit a given profile w/o 
actually merging those jars into an uber-jar ... does maven's 
dependency management have anything like "bundles" or "virtual packages" so 
we could publish a "lucene-all-analyzers" POM that didn't have an actual 
lucene-all-analyzers.jar but listed dependencies on all of the individual 
jars?

(FYI: Perl's CPAN has the concept of a "Bundle" that's just an empty 
distribution that depends on other distributions so you have a single 
reference point for installing them)

: So, how would you refactor the various sources of
: analyzers/tokenstream/tokenfilters we have today
: (src/java/org/apache/lucene/analysis/*, contrib/snowball/*,
: contrib/collation/* and contrib/analyzers/*)?  (Even contrib/memory
: has a neat PatternAnalyzer, that operates on a string using a regexp
: to get tokens out, that I am only now discovering).

I think ideally the existing contrib/analysis would be broken up by 
language -- even if that means only 2 or 3 classes per jar -- but i don't 
deal with multilingual stuff much so i don't have much of an opinion ... 
perhaps the majority of our users that deal with non-english tend to deal 
with *lots* of languages so having a single "multilingual-analysis" module 
would be suitable.

: We also need to think about how this impacts our back-compat policy.
: EG when are we allowed to split up modules into sub-modules, or merge
: them.

splitting a module should always be fair game as long as the new module(s) 
maintain the same back compat policy ... it's not a burden to ask people 
to start using 2 jars instead of 1 jar (especially if we're already going 
to have an easy way to bundle jars up into uber-jars)

in theory merging modules should require that the new module adopt the 
most restrictive back-compat policy of the previous modules.
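
The split/merge rule above can be written down in a few lines. This is only
a sketch: the three policy levels in BackCompatPolicy are invented for
illustration, and nothing in this thread defines an actual scale.

```java
import java.util.Collection;
import java.util.Collections;

// Hypothetical policy levels, declared from least to most restrictive.
// Enum constants compare by declaration order, which the merge rule relies on.
enum BackCompatPolicy {
    EXPERIMENTAL,   // may change in any release
    MINOR_STABLE,   // stable within a minor release line
    CORE_STRICT;    // same guarantees as today's core

    // A merged module adopts the most restrictive policy among its parts.
    static BackCompatPolicy merged(Collection<BackCompatPolicy> parts) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("nothing to merge");
        }
        return Collections.max(parts);
    }

    // Splitting keeps the policy: each new module inherits the original's.
    static BackCompatPolicy afterSplit(BackCompatPolicy original) {
        return original;
    }
}
```

Under these assumed levels, merging an EXPERIMENTAL module with a
CORE_STRICT one yields a CORE_STRICT module, so merging can only raise the
bar for the less strict part.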

: Assuming there's general consensus on this "break core into modules"
: approach, I think the next step is to take an inventory of all of
: Lucene's classes and roughly divide them into proposed modules, and
: iterate on that?  Hoss do you want to take a first stab at that?

Heh.  i'm not sure i could even answer the "want" question in the 
affirmative.  This is essentially a question of refactoring, and I think 
approaching this incrementally would be the best strategy ... either by 
first finding some low hanging fruit in core that could be extracted into 
a contrib easily (spans, query parser) or by restructuring the build 
system to put contribs and the demo on equal footing with core as 
"modules" and reassess as progress is made.

on a personal note: even if i wanted to lead this charge, i really can't 
right now ... folks may have noticed my involvement with lucene has been 
markedly lower in the last few months, i expect it to get even lower over 
the next 2 months before it will (hopefully) get higher. 



-Hoss


Re: Modularization

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Mon, Mar 30, 2009 at 7:31 PM, Chris Hostetter
<ho...@fucit.org> wrote:

> code isolation (by directory hierarchy) is the best way i've seen to
> ensure modularization, and protect against inadvertent dependency
> bleeding.

OK I agree this (divorced top-level directories) is a great way to
enforce modularity and we should use that.

It seems the toplevel directory structure could still have subdirs,
eg:

  analyzers
    languages
      th
      es
      fr
      snowball?
      ...
    standard
    collation

and:

  search
    searcher
    queries
      span
      function

And in those "leaf" subdirs above would be the package subdir
structure (src/{java,test}/org/apache/lucene/...).

Though "svn checkout" and "svn update" and "svn diff" are going to
take quite a bit longer with this switch...

> One underlying assumption that seems to have permeated the existing
> discussion (without ever being explicitly stated) is the idea that
> most of what currently lives in src/java is the "core" and would be a
> single "module" ... personally i'd like to challenge that assumption.  I'd
> like to suggest that besides obvious things that could be refactored
> out into other "modules" (span queries, queryparser) there are lots
> of additional ways that src/java could be sliced...

+1: I very much agree what is now called "core" should be refactored
as a number of modules.

So the general new proposal here seems to be: let's break up src/java/*
into separate modules (each under its own toplevel directory), just
like contrib/* is today.

And move Lucene to an "a la carte" model for what we now call core.
(what we now call contrib is already "a la carte" today).

We would then do away with the top level "core" vs "contrib", and
everything would simply be "modules", where each module has
metadata/javadocs stating:

  * JRE version required

  * What external dependencies (including dependencies to other Lucene
    modules) are needed

  * Some measure of "maturity"

  * Back-compat policy

  * CHANGES
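
As a rough sketch of what such per-module metadata could look like in code,
the classes below record most of the items listed above (CHANGES is left
out) and run one simple consistency check. The names ModuleInfo and
ModuleRegistry, the field layout, and the integer JRE encoding are all
assumptions made for illustration, not an existing Lucene format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: names and fields are assumptions, not a Lucene format.
final class ModuleInfo {
    final String name;
    final int jre;              // required JRE: 4 means Java 1.4, 5 means Java 1.5
    final String maturity;      // free-form label such as "stable"
    final String backCompat;    // free-form description of the policy
    final List<String> dependsOn;

    ModuleInfo(String name, int jre, String maturity, String backCompat,
               String... dependsOn) {
        this.name = name;
        this.jre = jre;
        this.maturity = maturity;
        this.backCompat = backCompat;
        this.dependsOn = Arrays.asList(dependsOn);
    }
}

final class ModuleRegistry {
    private final Map<String, ModuleInfo> modules = new LinkedHashMap<>();

    void add(ModuleInfo m) {
        modules.put(m.name, m);
    }

    // Returns readable problems; an empty list means the metadata is consistent.
    List<String> validate() {
        List<String> problems = new ArrayList<>();
        for (ModuleInfo m : modules.values()) {
            for (String depName : m.dependsOn) {
                ModuleInfo dep = modules.get(depName);
                if (dep == null) {
                    problems.add(m.name + " depends on unknown module " + depName);
                } else if (dep.jre > m.jre) {
                    // A module cannot promise an older JRE than its dependencies need.
                    problems.add(m.name + " (Java 1." + m.jre + ") depends on "
                            + dep.name + " (Java 1." + dep.jre + ")");
                }
            }
        }
        return problems;
    }
}
```

The JRE check reflects one consequence of declaring this metadata: a module
that claims to run on Java 1.4 cannot depend on a module that requires 1.5.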

Then during build we can package up certain combinations.  I think
there should be sub-kitchen-sink jars by area, eg a jar that contains
all analyzers/tokenstreams/filters, all queries/filters, etc.

This does make the future decision process far easier.  Rather than
have a capricious and ill-defined "does it go into core vs contrib"
question, we now simply decide if it goes into an existing module or
makes a new one.

> Even without making radical changes to the way our source code is
> organized, a lot of improvements could be made by having better
> documentation .

Agreed. I think this is actually somewhat orthogonal, though should
follow more naturally once Lucene is simply a collection of modules.
I would think we present "all" and a "per-module" sets of javadocs,
plus javadocs aggregated based on how the JARs aggregate?  (Ie I could
browse the "kitchen-sink" javadocs, the "all analyzers" javadocs, or
the "thai analyzers only" javadocs).

> (ie: a new ThaiStemmerFilter could be added to an existing
> thai-analysis module)

So, how would you refactor the various sources of
analyzers/tokenstream/tokenfilters we have today
(src/java/org/apache/lucene/analysis/*, contrib/snowball/*,
contrib/collation/* and contrib/analyzers/*)?  (Even contrib/memory
has a neat PatternAnalyzer, that operates on a string using a regexp
to get tokens out, that I am only now discovering).

We also need to think about how this impacts our back-compat policy.
EG when are we allowed to split up modules into sub-modules, or merge
them.

Assuming there's general consensus on this "break core into modules"
approach, I think the next step is to take an inventory of all of
Lucene's classes and roughly divide them into proposed modules, and
iterate on that?  Hoss do you want to take a first stab at that?

Mike

Re: Modularization

Posted by Douglas Campos <do...@theros.info>.

I haven't paid attention, as I looked first for the build.xml on trunk....

as we already are using maven, Ryan's approach is the way to go, IMHO

On Wed, Apr 1, 2009 at 7:00 PM, Earwin Burrfoot <ea...@gmail.com> wrote:

> Lucene is in fact already available through maven. poms do exist, all
> that is left is to find out who manages and releases them.
>
> On Thu, Apr 2, 2009 at 01:40, Douglas Campos <do...@theros.info> wrote:
> > +1 on maven, and I volunteer to aid in the creation of the maven project
> > files (pom's)
> >
> > On Wed, Apr 1, 2009 at 11:02 AM, Ryan McKinley <ry...@gmail.com>
> wrote:
> >>>
> >>> we can have fine grained modularity w/o having second class citizens,
> and
> >>> we can achieve it without needing to make radical changes -- but
> putting
> >>> more stuff into "core" isn't going to help us get there.
> >>>
> >>
> >> I totally agree.
> >>
> >> However, just to stir the pot (and assuming you are well rested), I'll
> >> drop your "radical changes" constraint and suggest that maven (while it
> can
> >> be a PIA) makes this kind of modularity trivial.
> >>
> >> With maven we could easily have:
> >>  /core
> >>  /modules/xxx
> >>
> >> Each module could easily declare:
> >>  * its dependencies on other modules
> >>  * the required JRE
> >>  * document its level of maturity
> >>
> >> And there are good off the shelf tools to report the dependency graphs,
> >> etc, etc.
> >>
> >> If there are any serious moves to reorganize things, we should at least
> >> consider the benefits of maven.
> >>
> >> ryan
> >>
> >
> >
> >
> > --
> > Douglas Campos
> > Theros Consulting
> > +55 11 9267 4540
> > +55 11 3020 8168
> >
>
>
>
> --
> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>


-- 
Douglas Campos
Theros Consulting
+55 11 9267 4540
+55 11 3020 8168

Re: Modularization

Posted by Earwin Burrfoot <ea...@gmail.com>.

Lucene is in fact already available through maven. poms do exist, all
that is left is to find out who manages and releases them.

On Thu, Apr 2, 2009 at 01:40, Douglas Campos <do...@theros.info> wrote:
> +1 on maven, and I volunteer to aid in the creation of the maven project
> files (pom's)
>
> On Wed, Apr 1, 2009 at 11:02 AM, Ryan McKinley <ry...@gmail.com> wrote:
>>>
>>> we can have fine grained modularity w/o having second class citizens, and
>>> we can achieve it without needing to make radical changes -- but putting
>>> more stuff into "core" isn't going to help us get there.
>>>
>>
>> I totally agree.
>>
>> However, just to stir the pot (and assuming you are well rested), I'll
>> drop your "radical changes" constraint and suggest that maven (while it can
>> be a PIA) makes this kind of modularity trivial.
>>
>> With maven we could easily have:
>>  /core
>>  /modules/xxx
>>
>> Each module could easily declare:
>>  * its dependencies on other modules
>>  * the required JRE
>>  * document its level of maturity
>>
>> And there are good off the shelf tools to report the dependency graphs,
>> etc, etc.
>>
>> If there are any serious moves to reorganize things, we should at least
>> consider the benefits of maven.
>>
>> ryan
>>
>
>
>
> --
> Douglas Campos
> Theros Consulting
> +55 11 9267 4540
> +55 11 3020 8168
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

Re: Modularization

Posted by Douglas Campos <do...@theros.info>.

+1 on maven, and I volunteer to aid in the creation of the maven project
files (pom's)

On Wed, Apr 1, 2009 at 11:02 AM, Ryan McKinley <ry...@gmail.com> wrote:

>
>> we can have fine grained modularity w/o having second class citizens, and
>> we can achieve it without needing to make radical changes -- but putting
>> more stuff into "core" isn't going to help us get there.
>>
>>
> I totally agree.
>
> However, just to stir the pot (and assuming you are well rested), I'll drop
> your "radical changes" constraint and suggest that maven (while it can be a
> PIA) makes this kind of modularity trivial.
>
> With maven we could easily have:
>  /core
>  /modules/xxx
>
> Each module could easily declare:
>  * its dependencies on other modules
>  * the required JRE
>  * document its level of maturity
>
> And there are good off the shelf tools to report the dependency graphs,
> etc, etc.
>
> If there are any serious moves to reorganize things, we should at least
> consider the benefits of maven.
>
> ryan
>
>


-- 
Douglas Campos
Theros Consulting
+55 11 9267 4540
+55 11 3020 8168

Re: Modularization

Posted by Chris Hostetter <ho...@fucit.org>.


: If there are any serious moves to reorganize things, we should at least
: consider the benefits of maven.

+1

we can certainly do a lot to improve things just by refactoring stuff from 
core into contrib, and improving the visibility of contribs and 
documentation about contribs -- but if we're going to make massive changes 
to how things are built or how the source code is organized, then 
utilizing maven as the build system seems like an obvious choice to me.

(and i don't even like maven that much)



-Hoss


Re: Modularization

Posted by Ryan McKinley <ry...@gmail.com>.

>
> we can have fine grained modularity w/o having second class  
> citizens, and
> we can achieve it without needing to make radical changes -- but  
> putting
> more stuff into "core" isn't going to help us get there.
>

I totally agree.

However, just to stir the pot (and assuming you are well rested), I'll  
drop your "radical changes" constraint and suggest that maven (while  
it can be a PIA) makes this kind of modularity trivial.

With maven we could easily have:
  /core
  /modules/xxx

Each module could easily declare:
  * its dependencies on other modules
  * the required JRE
  * document its level of maturity

And there are good off the shelf tools to report the dependency  
graphs, etc, etc.

If there are any serious moves to reorganize things, we should at  
least consider the benefits of maven.

ryan

Re: Modularization

Posted by Chris Hostetter <ho...@fucit.org>.

: We've been doing this using just one source tree (like in Lucene), and
: instead ensuring the separation using the build system. We did not, like you

I think you are misunderstanding my previous comment ... Lucene-Java does 
not currently have one source tree in the sense that someone else 
suggested (i forget who) and i was commenting on ... at the moment Lucene 
has several source trees (src/java, src/demo, and each dir matching 
contrib/*/src).  

Based on your examples, i believe we are suggesting the same thing: 
building separate "modules" from separate base directories (in your case 
foo/A and foo/B) with well defined dependencies.






-Hoss


Re: Modularization

Posted by Nadav Har'El <ny...@math.technion.ac.il>.

On Mon, Mar 30, 2009, Chris Hostetter wrote about "Re: Modularization":
> code isolation (by directory hierarchy) is the best way i've seen to
> ensure modularization, and protect against inadvertent dependency
> bleeding.
>...
> it's certainly possible to have "all" source code in a single directory
> hierarchy, and then rely on the build system to ensure you don't have
> unwarranted dependencies, but that requires you to express rules in the
> build system about what exactly the acceptable dependencies are, and it
> relies on everyone using the build system correctly (misguided users of
> hand-holding IDEs could get very frustrated when the patches they submit
> violate rules of an overly complicated set of ant build files)

In a project I've been involved in, we are building a library with concerns
similar to those Lucene now faces - on one hand you want to be a "kitchen sink"
providing features for everyone, but on the other hand you want to create
small jars and allow people who only need a small number of features to pick
only some of the jars, instead of one huge jar.

We've been doing this using just one source tree (like in Lucene), and
instead ensuring the separation using the build system. We did not, as you
suggest, find this too complicated to set up or maintain. The only snag, of
course, is that people who don't know how to write build.xml properly do
not touch it, but it's exactly like people who don't know how to properly
code in Java do not touch our source code :-) Having a "hand-holding IDE"
is no replacement for knowing how to code, whether the code is Java source
code or Ant configuration.

The idea of the Ant-based approach is to have the Ant build script compile
each module's source separately, allowing it to refer only to pre-defined
dependencies. This is instead of the more usual approach of compiling all the
source code together (and thus allowing unwanted dependencies) and only
collecting the jars from the compiled classes at the very end.

For example, let's say that we want to build three JARs of three packages,
foo.A, foo.B, and foo.C. Let's say that foo.A is stand-alone (doesn't need
the other source code to compile), and foo.B depends on stuff from foo.A
(and must not depend on stuff from foo.C).

In that case, I would first create an Ant rule to build a jar from the sources
of foo.A, and them alone (which ensures that foo.A doesn't accidentally
depend on foo.B or foo.C). Note the "includes" argument to javac, and the
separate destdir:

        <target name="A.compile">
                <sequential>
                <mkdir dir="${build.classes}/A"/>
                <javac srcdir="${src}" destdir="${build.classes}/A"
                        includes="foo/A/**/*.java"
                        sourcepath="" listfiles="no">
                </javac>
                </sequential>
        </target>

        <target name="A.jar" depends="A.compile">
                <sequential>
                <mkdir dir="${build.jars}"/>
                <jar destfile="${build.jars}/A.jar" basedir="${build.classes}/A">
                </jar>
                </sequential>
        </target>

Now, we do a similar thing for B.jar - when compiling it, we allow the
compiler to look at only the source code of foo.B, and at the previously
built A.jar. It cannot, for example, accidentally use stuff from foo.C:

        <target name="B.compile" depends="A.jar">
                <sequential>
                <mkdir dir="${build.classes}/B"/>
                <javac srcdir="${src}" destdir="${build.classes}/B"
                        includes="foo/B/**/*.java" sourcepath="" listfiles="no">
                        <classpath>
                        <pathelement location="${build.jars}/A.jar" />
                        </classpath>
                </javac>
                </sequential>
        </target>

        <target name="B.jar" depends="B.compile">
                <sequential>
                <mkdir dir="${build.jars}"/>
                <jar destfile="${build.jars}/B.jar" basedir="${build.classes}/B">
                </jar>
                </sequential>
        </target>
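
Along the same lines, a combined "kitchen sink" jar could be assembled from
the per-module jars at the very end. This is only a sketch, assuming the A.jar
and B.jar targets above; the target name and the all.jar file name are
placeholders I made up:

```xml
<target name="all.jar" depends="A.jar,B.jar">
    <!-- merge the contents of the already-built module jars into one jar;
         each module jar's own META-INF is skipped so that the combined jar
         gets a single fresh manifest -->
    <jar destfile="${build.jars}/all.jar">
        <zipfileset src="${build.jars}/A.jar" excludes="META-INF/**"/>
        <zipfileset src="${build.jars}/B.jar" excludes="META-INF/**"/>
    </jar>
</target>
```

Because the modules were compiled in isolation first, bundling them afterwards
does not reintroduce any hidden dependencies between them.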

Putting my money (or rather, time) where my mouth is: is there any interest
in my writing a build script for Lucene to demonstrate these ideas
in action?
 
> FWIW: having lots/more of very small, isolated, hierarchies also wouldn't
> hinder any attempts at having kitchen-sink or "essential" jars --
> combining the classes from lots of little isolated code trees is a lot
> easier than extracting a few classes from one big code tree.

But I think you've swept one issue under the rug: what happens when the
hierarchies aren't completely isolated? For example, an analyzer package
obviously depends on some Lucene core package. Or the query parser package
depends on the wildcard query package (for example). You need to specify these
dependencies somehow, and allow only them. How do you do that? Via an
Eclipse ".project" file in each of the small hierarchies? How is this any
better than having an Ant build file? How would anyone not using Eclipse
use this sort of setup?

Another problem with your separate-source-hierarchies proposal is that it
requires some drastic changes to the source code tree. With my Ant-based
proposal, you don't need *any* change to the source code tree we have now
(heck, you can even keep the "contrib/" directory as is), you just need to
change one file - build.xml. Of course, if you discover unwanted dependencies
in the existing code (e.g., the indexing code accidentally depends on the
whitespace analyzer) you'll need to fix them.
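
To keep build.xml from growing a near-identical pair of targets for every
module, the compile-and-jar steps could be wrapped in a macro, so that each
module's allowed dependencies are stated in one place. This is a sketch only,
assuming Ant 1.6 or later (for macrodef) and the same ${src}, ${build.classes}
and ${build.jars} properties as in the example above; the macro name
build-module and its "deps" element are names I made up:

```xml
<macrodef name="build-module">
    <attribute name="name"/>
    <!-- classpath entries this module is allowed to compile against -->
    <element name="deps" optional="true"/>
    <sequential>
        <mkdir dir="${build.classes}/@{name}"/>
        <mkdir dir="${build.jars}"/>
        <!-- sourcepath="" keeps javac from picking up other modules' sources -->
        <javac srcdir="${src}" destdir="${build.classes}/@{name}"
               includes="foo/@{name}/**/*.java"
               sourcepath="" listfiles="no">
            <classpath>
                <deps/>
            </classpath>
        </javac>
        <jar destfile="${build.jars}/@{name}.jar"
             basedir="${build.classes}/@{name}"/>
    </sequential>
</macrodef>

<!-- would replace the hand-written B.compile/B.jar targets shown earlier -->
<target name="B.jar" depends="A.jar">
    <build-module name="B">
        <deps>
            <pathelement location="${build.jars}/A.jar"/>
        </deps>
    </build-module>
</target>
```

A stand-alone module such as A would simply call the macro with no "deps"
element at all.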

> One underlying assumption that seems to have permeated the existing
> discussion (without ever being explicitly stated) is the idea that most
> of what currently lives in src/java is the "core" and would be a single
> "module" ... personally i'd like to challenge that assumption.

I wholeheartedly agree.

> I'd like to suggest 
> that besides obvious things that could be refactored out into other 
> "modules" (span queries, queryparser) there are lots of additional ways 
> that src/java could be sliced...
>  - interfaces and abstract classes and concrete classes for reading an
> index in one index-api.jar (ie: Directory but no FSDirectory; IndexReader 
> but not MultiReader)

Interesting ideas. Thus one might use a RAMDirectory in one's application,
and not incur the code size of FSDirectory.

However, at some point you have to wonder how fine-grained we want the division
to be. For example, if the FS-specific stuff only amounts to 20k of code
(and I'm just making this number up), how important is it to have a separate
jar for it? What do we lose (if anything) by having too many tiny jars?

> The crux of my point being that what we think of today as the lucene 
> "core" is actually kind of big and bloated, and already has *a* kitchen 
> sink thrown in -- it's just not necessarily the kitchen sink many people
> want.  

I agree.

> a big percentage of our users may want highlighting by default, and may 
> never care about function or span queries -- making it easier to get a 
> monolithic jar of *everything* only addresses one of those three 
> disconnects (easy access to the highlighting code) but splitting the 
> current "core" up into lots of little pieces (aka: "modules") that have 
> equal visibility to the existing contribs (now also "modules") would 
> address all three disconnects: people wouldn't overlook modules they might 
> want (like highlighting) because they are just as easy to find as the "core"
> and people wouldn't wind up with bloated jars containing a lot of code
> they don't need. (beating a dead horse for a moment: this wouldn't
> preclude us from offering a bloated jar containing everything under the
> sun)

Again, I wholeheartedly agree.

-- 
Nadav Har'El                        |     Wednesday, Apr  1 2009, 7 Nisan 5769
IBM Haifa Research Lab              |-----------------------------------------
                                    |"A witty saying proves nothing." --
http://nadav.harel.org.il           |Voltaire


Re: Modularization

Posted by Babak Farhang <fa...@gmail.com>.

> maturity, and their back compat commitments.  The demo and getting
> started guides could also be expanded to reference the contrib jars that
> contain code many people may want to reuse...

Here's an idea. Each contrib is really a project unto itself. And any
project, I suggest, ought to have its own demo program, together maybe
with a small write-up describing the idea behind the contrib and what
the demo does. So to get the ball rolling, how about adopting some
such documentation policy for *future* contribs as a
pseudo-requirement for making it into the official release?

Cheers,
-Babak

PS: this is not a swipe at any upcoming contrib (TrieUtils: the
documentation there is really good :)


On Mon, Mar 30, 2009 at 5:31 PM, Chris Hostetter
<ho...@fucit.org> wrote:
> [...]


Re: Modularization

Posted by Michael Busch <bu...@gmail.com>.

On 3/31/09 1:31 AM, Chris Hostetter wrote:
> code isolation (by directory hierarchy) is hte best way i've seen to
> ensure modularization, and protect against inadvertent dependency
> bleeding.
+1. That's actually what I meant with "one-to-one mapping between the 
packaging and the source code" (I didn't say that as elaborately as you :) )
Making jars based on packages rather than directories would, I strongly
believe, be the wrong decision, for the reasons you mentioned nicely
here.

-Michael


Re: Modularization

Posted by Chris Hostetter <ho...@fucit.org>.

After stirring things up, and then being off-list for ~10 days, I'm in an
interesting position coming back to this thread and seeing the discussion
*after* it essentially ended, with a lot of semi-consensus but no clear
sense of hard and fast resolution or plan of action.

FWIW, here are the notes i made based on reading the thread about the
various sentiments i noticed expressed (whether i agree with them or
not) in order to try and get a handle on what had been discussed.
some of these were the opinion of a single person and i've paraphrased,
others are my generalization of similar comments made by various
people...

- contrib has a bad rap
- widely varying degrees of quality/stability in contrib code, hard to get
people to rely on the "good" ones because of the "less good" ones
- many people want a good, out of the box, kitchen sink experience (ie:
one monolithic jar containing all the "essentials")
- need easy discoverability of all things of a given type (ie: all
queries, all filters, all analyzers, etc...) .. ie: combined javadocs.
- need easy installation of all things of a given type (ie: a jar
containing all types of queries, a jar containing all types of analyzers,
etc...)
- still need to deal with contribs that have external dependencies
- still need to deal with contribs that require future versions of the
language (Java 1.7 when core is still 1.5 compat)
- users need better guidance about "why" something is a contrib
(additional functionality, alternate functionality, example of use, tool,
etc...)
- while we should maintain/increase modularization, documentation should
make features of contribs more prominent without stressing the isolation
resulting from code modularization.
- we should merge all contrib & core code into a unified src/ tree, and
make the packaging independent of the physical location in svn (ie: jars
based on java package, not directory)

While I'm mostly in favor of all of these sentiments, and think it's
really just a question of how to go about it, the last one is actually
something i'm pretty strongly opposed to -- I think the best way forward
is to have lots of small, well isolated source trees.

code isolation (by directory hierarchy) is the best way i've seen to
ensure modularization, and protect against inadvertent dependency
bleeding.  If we want to be able to produce small jars targeted at
specific goals, and we want o.l.a.foo.FooClass to be in foo.jar and
o.l.a.bar.BarClass to be in bar.jar then we shouldn't have
src/java/o/l/a/foo/FooClass.java and src/java/o/l/a/bar/BarClass.java --
doing so makes it way too easy for inadvertent dependencies to crop up
that make FooClass depend on BarClass, and thus make it impossible to use
foo.jar without also using bar.jar at runtime.
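
A sketch of that layout in Ant terms, using made-up module directories
modules/foo and modules/bar, each with its own src/java tree, and assuming
purely for illustration that bar is allowed to depend on foo. Because each
javac call is pointed at a single tree, FooClass stops compiling the moment it
starts referring to BarClass:

```xml
<target name="foo.jar">
    <mkdir dir="build/foo/classes"/>
    <!-- only modules/foo/src/java is visible to the compiler here -->
    <javac srcdir="modules/foo/src/java" destdir="build/foo/classes"/>
    <jar destfile="build/foo.jar" basedir="build/foo/classes"/>
</target>

<target name="bar.jar" depends="foo.jar">
    <mkdir dir="build/bar/classes"/>
    <!-- bar states its one allowed dependency explicitly -->
    <javac srcdir="modules/bar/src/java" destdir="build/bar/classes">
        <classpath>
            <pathelement location="build/foo.jar"/>
        </classpath>
    </javac>
    <jar destfile="build/bar.jar" basedir="build/bar/classes"/>
</target>
```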

it's certainly possible to have "all" source code in a single directory
hierarchy, and then rely on the build system to ensure you don't have
unwarranted dependencies, but that requires you to express rules in the
build system about what exactly the acceptable dependencies are, and it
relies on everyone using the build system correctly (misguided users of
hand-holding IDEs could get very frustrated when the patches they submit
violate rules of an overly complicated set of ant build files)

FWIW: having lots/more of very small, isolated, hierarchies also wouldn't
hinder any attempts at having kitchen-sink or "essential" jars --
combining the classes from lots of little isolated code trees is a lot
easier than extracting a few classes from one big code tree.

One underlying assumption that seems to have permeated the existing
discussion (without ever being explicitly stated) is the idea that most
of what currently lives in src/java is the "core" and would be a single
"module" ... personally i'd like to challenge that assumption.  I'd like
to suggest that besides obvious things that could be refactored out into
other "modules" (span queries, queryparser) there are lots of additional
ways that src/java could be sliced...

 - interfaces and abstract classes and concrete classes for reading an
index in one index-api.jar (ie: Directory but no FSDirectory; IndexReader
but not MultiReader)
 - ditto for creating/updating an index in one index-update.jar (ie:
IndexWriter, TokenStream, Tokenizer, TokenFilter, Analyzer  but
not any impls of the last 3)
 - ditto for searching in index-search.jar (ie: Searcher, Searchable,
HitCollector, Query ... but not any concrete subclasses)
 - simple-analysis.jar (SimpleAnalyzer, WhitespaceAnalyzer,
LetterTokenizer, LowerCaseFilter, etc...)
 - english-analysis.jar (StandardAnalyzer, etc...)
 - primitive-queries.jar (TermQuery, BooleanQuery, MatchAllDocsQuery,
MultiTermQuery, etc...)
 - range-queries.jar (RangeQuery, RangeFilter, ConstantScoreRangeQuery)

   ...etc...


The crux of my point being that what we think of today as the lucene
"core" is actually kind of big and bloated, and already has *a* kitchen
sink thrown in -- it's just not necessarily the kitchen sink many people
want.

a big percentage of our users may want highlighting by default, and may
never care about function or span queries -- making it easier to get a
monolithic jar of *everything* only addresses one of those three
disconnects (easy access to the highlighting code) but splitting the
current "core" up into lots of little pieces (aka: "modules") that have
equal visibility to the existing contribs (now also "modules") would
address all three disconnects: people wouldn't overlook modules they might
want (like highlighting) because they are just as easy to find as the "core"
and people wouldn't wind up with bloated jars containing a lot of code
they don't need. (beating a dead horse for a moment: this wouldn't
preclude us from offering a bloated jar containing everything under the
sun)

Even without making radical changes to the way our source code is
organized, a lot of improvements could be made by having better
documentation ... http://lucene.apache.org/java/2_4_1/ could certainly
have more info about what is included in a release, what types of things
can be found in a contrib, etc...  Individual contrib README files should
certainly get beefed up to describe their purpose, their level of
maturity, and their back compat commitments.  The demo and getting
started guides could also be expanded to reference the contrib jars that
contain code many people may want to reuse...


   ...and those are all small improvements that could be made without
radically changing anything about our source organization or packaging.
splitting the core up into smaller modules would only help the situation,
moving more things into the core seems like it would just make the problem
worse.

: I agree, but at least we need some clear criteria so the future
: decision process is more straightforward.  Towards that... it seems
: like there are good reasons why something should be put into contrib:

I would argue that is approaching the problem from the wrong direction.

assume for the moment that we define the list of lucene "modules" as:
   ls -d contrib/* src/java src/gcj src/demo src/jsp
...but in the future we want to split up some of the bigger "modules" and
move each module so they have equal visibility.

i would suggest that the operating assumption be that any new code
contribution that adds functionality (ie: not a bug fix, or an
enhancement to an existing Impl) belongs in a new "module" unless:
 1) compilation constraints require that it be put in an existing module 
(ie: needs to introduce a bi-directional dependency with an existing 
class which can't be refactored out into the new module)
 2) it is a natural conceptual fit with *all* of the existing classes in 
that module (ie: a new ThaiStemmerFilter could be added to an existing 
thai-analysis module)

(but equally important to the question of "when to add to an existing
'module' vs creating a new module?" should be the question of "when to
split an existing module?" ... something we've never really talked about
for core or contribs.)

: But I don't think "it doesn't have to be in core" (the "software
: modularity" goal) is the right reason to put something in contrib.

Would it sound like a better reason if we stopped calling it "core"? ... i
look at it from the point of view of: "Are classes A,B&C (which are tightly
coupled) directly related to classes X,Y&Z (also tightly coupled)?"
... if the answer is "no" then A,B&C do not belong in the same module as
X,Y&Z ... it doesn't matter which module we're talking about (src/java,
contrib/highlighter etc...)

i don't think it makes any sense for the TrieRangeQueries to be in the
same "module" as IndexWriter, or IndexReader ... but i also don't think it
makes sense for the trie to be in the same module as BoostingQuery or
DuplicateFilter -- or for IndexWriter to be in the same module as the
existing query parser (or for the existing query parser to be in the same
module as the new one the IBM folks have been working on)


we can have fine grained modularity w/o having second class citizens, and 
we can achieve it without needing to make radical changes -- but putting 
more stuff into "core" isn't going to help us get there.

-Hoss



Re: Modularization

Posted by Mike Klaas <mi...@gmail.com>.

On 23-Mar-09, at 2:41 PM, Michael McCandless wrote:
>
> I agree, but at least we need some clear criteria so the future
> decision process is more straightforward.  Towards that... it seems
> like there are good reasons why something should be put into contrib:
>
>  * It uses a version of JDK higher than what core can allow
>
>  * It has external dependencies
>
>  * Its quality is debatable (or at least not proven)
>
>  * It's of somewhat narrow usage/interest (eg: contrib/bdb)
>
> But I don't think "it doesn't have to be in core" (the "software
> modularity" goal) is the right reason to put something in contrib.

Agreed.  I don't think that building on the existing 'contrib' is the
way to go.  Frequently-used, high-quality components should be more
properly part of "Lucene", whether that means that they move to core,
or into a new blessed modules section.

> Getting back to the original topic: Trie(Numeric)RangeFilter runs on
> JDK 1.4, has no external dependencies, looks to be high quality, and
> likely will have wide appeal.  Doesn't it belong in core?

+1.  It is important that Lucene come blessed with very good quality  
defaults.  Fast range queries are a common requirement.  Similarly, I  
wouldn't be happy to have a new, wicked QueryParser be relegated to  
contrib where it is unlikely to be found by non-savvy users.  At the  
very least, I agree with Michael that it should be findable in the  
same "place".

It does make sense to separate the machinery/building blocks (base
Query, Weight, Scorer, Filter classes, Similarity interface, etc.)
from the Query/Filter implementations that use them.  But whether this
is done by putting them in separate directories or via a global
core/modules distinction seems unimportant.

-Mike


Re: Modularization

Posted by Michael McCandless <lu...@mikemccandless.com>.

>> I think we are considering this for Lucene 3.0 (should be the
>> release after next) which will allow Java 1.5.
>
> So where are you going to put 1.6 and 1.7 contribs?

This is a good point: core Lucene must remain on "old" JREs, but we
should not force all contrib packages to do so.

> - contrib has always had a lower bar and stuff was committed under
> that lower bar - there should be no blanket promotion.

OK so that was the past, and I agree.

I assume by this you're also advocating that going forward this is an
ongoing reason to put something into contrib?  I agree with that. Ie,
if a contribution is made, but it's not clear the quality is up to
core's standards, I would much rather have some place to commit it
(contrib) than to reject it, because once it has a home here, it has a
chance to gain interest, grow, improve, etc.

But: do you think, for this reason, the web site should continue to
present the dichotomy?

> - contrib items may have different dependencies... putting it all
> under the same source root can make a developer's job harder

That's a good point & criterion for leaving something in contrib.

> - many contrib items are less related to lucene-java core indexing
> and searching... if there is no contrib, then they don't belong in
> the lucene-java project at all.

But most contrib packages are very related to Lucene.

Though I agree some contrib packages likely have very narrow
appeal/usage (eg, contrib/db, for using BDB as the raw store for an
index).

And I agree (as above): I would like to have somewhere for
contributions to go, rather than reject them.

> - right now it's clear - core can't have dependencies on non-core
> classes.  If everything is stuck in the same source tree, that goes
> away.

Well... this gets to Hoss's motivation, which I appreciate, to keep
the core tiny.

But that's just good software design and you don't need a divorced
directory structure to achieve that.
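
A minimal sketch of that idea, as an illustration only and not anything
taken from Lucene's build: the rule "core can't have dependencies on
non-core classes" could be enforced by checking import statements
rather than by directory layout. The class name CoreImportCheck, the
allowed-prefix list, and the org.example.contrib package are all
hypothetical, and the sketch assumes a modern Java compiler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical build-time check: core sources may import only allowed packages. */
final class CoreImportCheck {

    // Captures the imported name of each "import ..." or "import static ..." line.
    private static final Pattern IMPORT =
        Pattern.compile("^\\s*import\\s+(?:static\\s+)?([\\w.]+)", Pattern.MULTILINE);

    /** Returns every imported name that starts with none of the allowed prefixes. */
    static List<String> violations(String source, List<String> allowedPrefixes) {
        List<String> bad = new ArrayList<>();
        Matcher m = IMPORT.matcher(source);
        while (m.find()) {
            String imported = m.group(1);
            boolean allowed = false;
            for (String prefix : allowedPrefixes) {
                if (imported.startsWith(prefix)) {
                    allowed = true;
                    break;
                }
            }
            if (!allowed) {
                bad.add(imported);
            }
        }
        return bad;
    }
}
```

A build step could run such a check over the core source files and fail
when the returned list is non-empty, wherever those files happen to
live in the tree.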

> I think there are a lot of benefits to continue considering very
> carefully if something is "core" or not.

I agree, but at least we need some clear criteria so the future
decision process is more straightforward.  Towards that... it seems
like there are good reasons why something should be put into contrib:

  * It uses a version of JDK higher than what core can allow

  * It has external dependencies

  * Its quality is debatable (or at least not proven)

  * It's of somewhat narrow usage/interest (eg: contrib/bdb)

But I don't think "it doesn't have to be in core" (the "software
modularity" goal) is the right reason to put something in contrib.

Getting back to the original topic: Trie(Numeric)RangeFilter runs on
JDK 1.4, has no external dependencies, looks to be high quality, and
likely will have wide appeal.  Doesn't it belong in core?

Mike


Re: Modularization

Posted by Mark Miller <ma...@gmail.com>.

Are you arguing for no change Yonik? I agree with all of your points in 
any case.

What appeals to me most so far is:

Take the best of contrib and up its status to something like "modules". 
Equal to core, different requirements, dependencies, etc. Perhaps take 
queryparser out of core, but frankly I wouldn't mind just leaving core 
as it is.

Reintroduce the sandbox (I believe contrib was the sandbox, part of the lower 
bar history) and put lesser contrib there and new stuff that's unproven. 
Contrib doesn't appeal to me as a name anyway.

That would give core, modules, and the sandbox (perhaps sandbox is a 
module?). Things could move from sandbox to core or the modules. Modules 
get new requirements similar to core - back compat guarantees and 
changes.txt per module.

Yonik Seeley wrote:
> On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>   
>>   4. Move contrib/* under src/java/*, updating the javadocs to state
>>       back compatibility promises per class/package.
>>     
>
> - contrib has always had a lower bar and stuff was committed under
> that lower bar - there should be no blanket promotion.
> - contrib items may have different dependencies... putting it all
> under the same source root can make a developer's job harder
> - many contrib items are less related to lucene-java core indexing and
> searching... if there is no contrib, then they don't belong in the
> lucene-java project at all.
> - right now it's clear - core can't have dependencies on non-core
> classes.  If everything is stuck in the same source tree, that goes
> away.
>
> I think there are a lot of benefits to continue considering very
> carefully if something is "core" or not.
>
> -Yonik

-- 
- Mark

http://www.lucidimagination.com


Re: Modularization

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
>   4. Move contrib/* under src/java/*, updating the javadocs to state
>       back compatibility promises per class/package.

- contrib has always had a lower bar and stuff was committed under
that lower bar - there should be no blanket promotion.
- contrib items may have different dependencies... putting it all
under the same source root can make a developer's job harder
- many contrib items are less related to lucene-java core indexing and
searching... if there is no contrib, then they don't belong in the
lucene-java project at all.
- right now it's clear - core can't have dependencies on non-core
classes.  If everything is stuck in the same source tree, that goes
away.

I think there are a lot of benefits to continue considering very
carefully if something is "core" or not.

-Yonik


Re: Modularization

Posted by Michael McCandless <lu...@mikemccandless.com>.

Michael Busch <bu...@gmail.com> wrote:

>> And I don't think the sudden separation of "core" vs "contrib"
>> should be so prominent (or even visible); it's really a detail of
>> how we manage source control.
>
>> When looking at the website I'd like to read that Lucene can do hit
>> highlighting, powerful query parsing, spell checking, analyze
>> different languages, etc.  I could care less that some of these
>> happen to live under a "contrib" subdirectory somewhere in the
>> source control system.
>
> OK, so I think we all agree about the packaging. But I believe it is
> also important how the source code is organized. Maybe Lucene
> consumers don't care too much, however, Lucene is an open source
> project. So we also want to attract possible contributors with a
> nicely organized code base. If there is a clear separation between
> the different components on a source code level, becoming familiar
> with Lucene as a contributor might not be so overwhelming.

+1

We want the source code to be well organized: consumability by Lucene
developers (not just Lucene users) is also important for Lucene's
future growth.

> Besides that, I think a one-to-one mapping between the packaging and
> the source code has no disadvantages. (and it would certainly make
> the build scripts easier!)

Right.

So, towards that... why even break out contrib vs core, in source
control?  Can't we simply migrate contrib/* into core, in the right
places?

>> Could we, instead, adopt some standard way (in the package
>> javadocs) of stating the maturity/activity/back compat policies/etc
>> of a given package?
>
> This makes sense; e.g. we could release new modules as beta versions
> (= use at own risk, no backwards-compatibility).

In fact we already have a 2.9 Jira issue opened to better document the
back-compat/JDK version requirements of all packages.

I think, like we've done with core lately when a new feature is added,
we could have the default assumption be full back compatibility, but
then those classes/methods/packages that are very new and may change
simply say so clearly in their javadocs.
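
One way this could look in code, offered as a sketch under stated
assumptions rather than as anything Lucene defines: instead of (or next
to) a javadoc note, a marker annotation records the stability, and
anything unmarked defaults to full back compatibility. Annotations need
Java 1.5, so this would only apply once that is allowed. The names
Stability, ApiStability, StabilityLookup and the two example classes
are hypothetical.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical stability levels; not part of Lucene. */
enum Stability { STABLE, EXPERIMENTAL }

/** Hypothetical marker; @Documented makes it show up in generated javadocs. */
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD, ElementType.PACKAGE})
@interface ApiStability {
    Stability value();
}

/** A very new class that says clearly that it may still change. */
@ApiStability(Stability.EXPERIMENTAL)
final class NewRangeFilterSketch { }

/** An established class: no marker, so full back compatibility is assumed. */
final class EstablishedQuerySketch { }

final class StabilityLookup {
    /** Returns the declared stability, defaulting to STABLE when unmarked. */
    static Stability of(Class<?> type) {
        ApiStability marker = type.getAnnotation(ApiStability.class);
        return marker == null ? Stability.STABLE : marker.value();
    }
}
```

The default-to-STABLE lookup mirrors the "default assumption" above:
only new or changing classes carry a marker.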

> And if we start a new module (e.g. a GSoC project) we could exclude
> it from a release easily if it's truly experimental and not in a
> release-able state.

Right.

>> So I think the beginnings of a rough proposal are taking shape, for
>> 3.0:
>>
>>   1. Fix web site to give a better intro to Lucene's features,
>>       without exposing the false (to the Lucene consumer) core vs.
>>       contrib distinction
>>
>>   2. When releasing, we make a single JAR holding core & contrib
>>       classes for a given area.  The final JAR files don't contain a
>>       "core" vs "contrib" distinction.
>>
>>   3. We create a "bundled" JAR that has the common packages
>>       "typically" needed (index/search core, analyzers, queries,
>>       highlighter, spellchecker)
>
> +1 to all three points.

OK.

So I guess I'm proposing adding:

   4. Move contrib/* under src/java/*, updating the javadocs to state
       back compatibility promises per class/package.

I think net/net this'd be a great simplification?

Mike


Re: Modularization

Posted by Michael Busch <bu...@gmail.com>.

On 3/21/09 1:36 PM, Michael McCandless wrote:
> And I don't think the sudden separation of "core" vs "contrib" should
> be so prominent (or even visible); it's really a detail of how we
> manage source control.
>
> When looking at the website I'd like to read that Lucene can do hit
> highlighting, powerful query parsing, spell checking, analyze
> different languages, etc.  I could care less that some of these happen
> to live under a "contrib" subdirectory somewhere in the source control
> system.
>
>    
OK, so I think we all agree about the packaging. But I believe it is
also important how the source code is organized. Maybe Lucene consumers
don't care too much, however, Lucene is an open source project. So we
also want to attract possible contributors with a nicely organized code
base. If there is a clear separation between the different components
on a source code level, becoming familiar with Lucene as a contributor
might not be so overwhelming.

Besides that, I think a one-to-one mapping between the packaging and
the source code has no disadvantages. (and it would certainly make the
build scripts easier!)

>> But I think we should still have "main modules", such as core,
>> queries, analyzers, ... and separately e.g. "sandbox modules?", for
>> the things currently in contrib that are experimental or, as Mark
>> called them, "graveyard contribs" :) ... even though we might then
>> as well ask the questions if we can not really bury the latter
>> ones...
>>      
>
> Could we, instead, adopt some standard way (in the package javadocs)
> of stating the maturity/activity/back compat policies/etc of a given
> package?
>    

This makes sense; e.g. we could release new modules as beta versions
(= use at own risk, no backwards-compatibility).

And if we start a new module (e.g. a GSoC project) we could exclude it
from a release easily if it's truly experimental and not in a
release-able state.

> So I think the beginnings of a rough proposal are taking shape, for 3.0:
>
>    1. Fix web site to give a better intro to Lucene's features, without
>       exposing the false (to the Lucene consumer) core vs. contrib
>       distinction
>
>    2. When releasing, we make a single JAR holding core & contrib
>       classes for a given area.  The final JAR files don't contain a
>       "core" vs "contrib" distinction.
>
>    3. We create a "bundled" JAR that has the common packages
>       "typically" needed (index/search core, analyzers, queries,
>       highlighter, spellchecker)
>
>    
+1 to all three points.

> Mike

Re: Modularization

Posted by Michael McCandless <lu...@mikemccandless.com>.

> Maybe he actually ends up buying LIA(2) :)

LIA/2 suffers the same false dichotomy, and it drives me crazy there
too: we put all "contrib" packages in a different chapter, even though
it'd make much more sense to cover all analyzers in one chapter, all
queries in one chapter, etc.

I find myself cross-referencing over to TrieRangeQuery in Chapter 8,
from LIA's search chapter (Chapter 3), and it's awkward.

> So yeah I like this and 3.0 is a good opportunity to do this. I
> think a big part of this work should be good documentation. As you
> mentioned, Mike, it should be very simple to get an overview of what
> the different modules are.  So there should be the list of the
> different modules, together with a short description for each of
> them and infos about where to find them (which jar).  Then by
> clicking on e.g. queries, the user would see the list of all queries
> we support.

I agree: revamping the web-site for a better top-down introduction of
Lucene's features should be part of 3.0.

And I don't think the sudden separation of "core" vs "contrib" should
be so prominent (or even visible); it's really a detail of how we
manage source control.

When looking at the website I'd like to read that Lucene can do hit
highlighting, powerful query parsing, spell checking, analyze
different languages, etc.  I could care less that some of these happen
to live under a "contrib" subdirectory somewhere in the source control
system.

> But I think we should still have "main modules", such as core,
> queries, analyzers, ... and separately e.g. "sandbox modules?", for
> the things currently in contrib that are experimental or, as Mark
> called them, "graveyard contribs" :) ... even though we might then
> as well ask the questions if we can not really bury the latter
> ones...

Could we, instead, adopt some standard way (in the package javadocs)
of stating the maturity/activity/back compat policies/etc of a given
package?

> Since we are just talking about packaging, why can't we have
> both/all of the above?  Individual jars, as well as one "big" jar,
> that contains everything (or, everything that has only dependencies
> we can ship, or "everything" that we deem important for an OOTB
> experience).  I, for one, find it annoying to have to go get
> snowball, analyzers, spellchecking and highlighting separate in most
> cases b/c I almost always use all of them and don't particularly
> care if there are extra classes in a JAR, but can appreciate the
> need to do that in specific instances where leaner versions are
> needed.  After all, the Ant magic to do all of this is pretty
> trivial given we just need to combine the various jars into a single
> jar (while keeping the indiv. ones)

+1

So I think the beginnings of a rough proposal are taking shape, for 3.0:

  1. Fix web site to give a better intro to Lucene's features, without
     exposing the false (to the Lucene consumer) core vs. contrib
     distinction

  2. When releasing, we make a single JAR holding core & contrib
     classes for a given area.  The final JAR files don't contain a
     "core" vs "contrib" distinction.

  3. We create a "bundled" JAR that has the common packages
     "typically" needed (index/search core, analyzers, queries,
     highlighter, spellchecker)
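
A minimal sketch of how the "bundled" JAR in item 3 could be produced
by combining the individual JARs while keeping them. The quoted message
above mentions Ant for this; plain Java with java.util.jar is used here
only to keep the example self-contained. The class name JarBundler and
the "first entry wins" rule for duplicate names are assumptions, and
the sketch assumes a modern Java compiler.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

/** Hypothetical helper: merges several module JARs into one bundled JAR. */
final class JarBundler {

    /**
     * Copies every entry of each input JAR into the output JAR.
     * The first occurrence of an entry name wins; later duplicates
     * (for example META-INF/MANIFEST.MF) are skipped.
     * Returns the number of entries written.
     */
    static int merge(List<Path> inputs, Path output) throws IOException {
        Set<String> seen = new HashSet<>();
        byte[] buffer = new byte[8192];
        try (OutputStream fileOut = Files.newOutputStream(output);
             JarOutputStream jarOut = new JarOutputStream(fileOut)) {
            for (Path input : inputs) {
                try (JarFile jar = new JarFile(input.toFile())) {
                    Enumeration<JarEntry> entries = jar.entries();
                    while (entries.hasMoreElements()) {
                        JarEntry entry = entries.nextElement();
                        if (!seen.add(entry.getName())) {
                            continue; // duplicate name: keep the first copy
                        }
                        // A fresh entry avoids copying stale size/CRC metadata.
                        jarOut.putNextEntry(new JarEntry(entry.getName()));
                        try (InputStream in = jar.getInputStream(entry)) {
                            int read;
                            while ((read = in.read(buffer)) != -1) {
                                jarOut.write(buffer, 0, read);
                            }
                        }
                        jarOut.closeEntry();
                    }
                }
            }
        }
        return seen.size();
    }
}
```

The input JARs are only read, so the individual JARs stay available for
cases where a leaner classpath is needed.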

Mike
