
Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/01/29 23:26:52 UTC

Thinking about Mahout layout, builds, etc.

I am thinking a structure like the following would be useful for  
getting started:
mahout/trunk/
   docs
   common/
      src/
         main/
         test/
      docs/
      lib/
   algorithmA/
      Similar to common, but for this algorithm
   algB
      ...
   ...

Where algorithmA, B, etc. are the various libraries we intend to  
implement.  We can hold off on creating them until we have some code,
but I was thinking it would be good to have the general layout in mind.

Of course, this is expandable and changeable.  What do others think?

On a related note, one of the things we discussed pre-Apache was the
general sense that we shouldn't feel the need to create an
all-encompassing framework.  The basic gist of this being that any given
library could be completely independent of the others (with maybe the  
exception that they share a common library).  My gut says this is the  
way to get started, but that it may evolve over time once we have some  
running time together and can start to recognize synergies, such that  
maybe by the time we get to 1.0 of Mahout there may be more common  
code than we originally thought.  The "common" area above can serve as  
the area for utilities, classes, common Hadoop extensions, etc. that  
are shared between the various algorithms, but I would also say let's  
not try to prematurely optimize across the algorithms just yet.

Anyone else have any preference on this?

-Grant

Re: Thinking about Mahout layout, builds, etc.

Posted by Ken Montanez <ke...@gmail.com>.

agree. i like the fact that it is a widely accepted/used project and I am
sure there is a large community of knowledge that we can leverage...all in
the spirit of open source. =)

On Jan 29, 2008 11:15 PM, Karl Wettin <ka...@gmail.com> wrote:

>
> 29 jan 2008 kl. 23.26 skrev Grant Ingersoll:
>
> > I am thinking a structure like the following would be useful for
> > getting started:
> > mahout/trunk/
> >  docs
> >  common/
> >       src/
> >           main/
> >           test/
> >        docs/
> >        lib/
> >  algorithmA/
> >       Similar to common, but for this algorithm
> >  algB
> >       ...
> >   ...
> >
> > Where algorithmA, B, etc. are the various libraries we intend to
> > implement.  We can hold off on creating them until we have some
> > code, but was thinking it would be good to have the general layout
> > in mind.
> >
> > Of course, this is expandable and changeable.  What do others think?
>
> -1
>
> I still think we should use Maven compatible file system structure,
> even if we end up using Ant.
>
>
>   karl
>



-- 
Ken Montanez | 510.681.5576

Re: Thinking about Mahout layout, builds, etc.

Posted by Karl Wettin <ka...@gmail.com>.

30 jan 2008 kl. 08.15 skrev Karl Wettin:

> I still think we should use Maven compatible file system structure,  
> even if we end up using Ant.

Oh, I now see the structure was Mavenized. It was my mail client that  
messed it up. Let me clarify Grant's original proposal that I now +1.

mahout/trunk/
mahout/docs

mahout/common/
mahout/common/src/
mahout/common/src/main/
mahout/common/src/test/
mahout/common/docs/
mahout/common/lib/

mahout/algorithmA/
       Similar to common, but for this algorithm

mahout/algB
       ...



   karl

Re: Thinking about Mahout layout, builds, etc.

Posted by Karl Wettin <ka...@gmail.com>.

30 jan 2008 kl. 09.03 skrev Ted Dunning:
> On 1/29/08 11:15 PM, "Karl Wettin" <ka...@gmail.com> wrote:
>
>> I still think we should use Maven compatible file system structure,
>> even if we end up using Ant.
>
> What would that be?

For each module, and using any directory that makes sense:

module/src/main/
module/src/main/java/
module/src/main/webapp/
module/src/main/resources/

module/src/test/
module/src/test/java/
module/src/test/resources/
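
The listing above can be sketched as a short script. This is only an illustration, not part of the proposal: Python is used for brevity, and the helper name `create_module_layout` and the module names are made up here.

```python
from pathlib import Path

# Sub-directories taken from the listing above; drop the ones that do not
# make sense for a given module (webapp only matters for web applications).
MAVEN_DIRS = [
    "src/main/java",
    "src/main/webapp",
    "src/main/resources",
    "src/test/java",
    "src/test/resources",
]


def create_module_layout(root, module, subdirs=MAVEN_DIRS):
    """Create the Maven-style tree for one module and return the new paths."""
    created = []
    for sub in subdirs:
        path = Path(root) / module / sub
        # parents=True also creates src/, src/main/ and src/test/ on the way.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    import tempfile

    # "common" and "algorithmA" are the placeholder names used in this thread.
    with tempfile.TemporaryDirectory() as tmp:
        for module in ("common", "algorithmA"):
            for path in create_module_layout(tmp, module):
                print(path.relative_to(tmp))
```

The same tree works whether the build ends up being driven by Maven or by Ant, since Ant only needs to be pointed at these directories.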




   karl

Re: Thinking about Mahout layout, builds, etc.

Posted by Ted Dunning <td...@veoh.com>.

What would that be?


On 1/29/08 11:15 PM, "Karl Wettin" <ka...@gmail.com> wrote:

> I still think we should use Maven compatible file system structure,
> even if we end up using Ant.

Re: Thinking about Mahout layout, builds, etc.

Posted by Karl Wettin <ka...@gmail.com>.

29 jan 2008 kl. 23.26 skrev Grant Ingersoll:

> I am thinking a structure like the following would be useful for  
> getting started:
> mahout/trunk/
>  docs
>  common/
> 	src/
>           main/
>           test/
>        docs/
>        lib/
>  algorithmA/
>       Similar to common, but for this algorithm
>  algB
>       ...
>   ...
>
> Where algorithmA, B, etc. are the various libraries we intend to  
> implement.  We can hold off on creating them until we have some  
> code, but was thinking it would be good to have the general layout  
> in mind.
>
> Of course, this is expandable and changeable.  What do others think?

-1

I still think we should use Maven compatible file system structure,  
even if we end up using Ant.


   karl

Re: Thinking about Mahout layout, builds, etc.

Posted by Sami Siren <ss...@gmail.com>.

Vadim Zaliva wrote:
> 
> On Jan 30, 2008, at 0:06, Ian Holsman wrote:
> 
>> What I like about it is you specify what jars the projects needs (and 
>> versions) and it goes and gets them and your code builds nicely, and 
>> it can do things like deploy your code onto a tomcat server.
> 
> So, how does this automatic jar fetching work with Eclipse? Will I need to
> modify Eclipse project files manually after Maven fetches some new libraries?

No need to _edit_ things manually: either run "mvn eclipse:eclipse" or
install the maven2 Eclipse plugin to do a similar thing through the GUI.

-- 
  Sami Siren

Re: Thinking about Mahout layout, builds, etc.

Posted by Vadim Zaliva <kr...@gmail.com>.

On Jan 30, 2008, at 0:06, Ian Holsman wrote:

> What I like about it is you specify what jars the projects needs  
> (and versions) and it goes and gets them and your code builds  
> nicely, and it can do things like deploy your code onto a tomcat  
> server.

So, how does this automatic jar fetching work with Eclipse? Will I need to
modify Eclipse project files manually after Maven fetches some new libraries?

Vadim

Re: Thinking about Mahout layout, builds, etc.

Posted by Lukas Vlcek <lu...@gmail.com>.

Hi,

I have hands-on experience with Maven (1.x) and have used it for many internal
projects. Dependency declaration is really cool and very helpful, as is
executable XML. There are tons of ready-to-use goals too, and hacking Maven is
fun. However, I think Ant can do all of this as well, except for the
dependencies (I haven't tried Ivy). So although I really love Maven, I think
that for this project Ant should be sufficient (and more appropriate?).

1) How many dependencies can we think of? About 5 libraries right now, which
does not seem to be a high number.
2) It can be a pain to code a complex compilation and build process in Ant,
but isn't that a good sign that you have to change something? It can force
us to make the compilation and build process more transparent.
3) With Ant we can still provide targets (goals in Maven terminology) for
building a specific algorithm only.

I think Maven would make sense if we want a separate project for each
algorithm, because each algorithm can have different jar dependencies
and/or jar versions. This sounds like quite a complex scenario, but it
should be achievable with Maven. If we don't need this then I would
vote for Ant (despite the fact that I am a Maven fan!).
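
To make the "different jar versions" case concrete: the per-algorithm split only starts to matter once two algorithms need the same library at different versions. Below is a rough Python sketch of that check. All module names, library names and version numbers in it are made up for illustration, and `find_version_conflicts` is a hypothetical helper, not something Maven or Ant provides.

```python
from collections import defaultdict


def find_version_conflicts(module_deps):
    """Return {library: {version: [modules]}} for libraries used at >1 version."""
    versions = defaultdict(lambda: defaultdict(list))
    for module, deps in module_deps.items():
        for library, version in deps.items():
            versions[library][version].append(module)
    # Keep only the libraries that appear with more than one distinct version.
    return {
        library: {v: sorted(mods) for v, mods in by_version.items()}
        for library, by_version in versions.items()
        if len(by_version) > 1
    }


# Hypothetical modules, libraries and versions, for illustration only.
deps = {
    "common": {"libio": "1.0"},
    "algorithmA": {"libio": "1.0", "libmath": "2.1"},
    "algorithmB": {"libio": "1.0", "libmath": "3.0"},
}
print(find_version_conflicts(deps))
```

If a check like this comes back empty, a single shared lib directory and one Ant build are probably enough.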

Lukas

On Jan 30, 2008 9:06 AM, Ian Holsman <li...@holsman.net> wrote:

> Ken Montanez wrote:
> > Ian, it looks like you have some experience with Maven. I have heard a lot
> > of good things about it from many people on many different projects, but no
> > personal experience to speak of - can you vouch for it? What do you like
> > about it? Is there anything that you don't like/want to get away from?
> >
> > Ken
> >
> my experience is more as a 'user' of maven, I haven't ever written a pom
> file or gone deeper than that.
>
> What I like about it is you specify what jars the projects needs (and
> versions) and it goes and gets them and your code builds nicely, and it
> can do things like deploy your code onto a tomcat server.
>
> regards
> Ian
>
>
>

-- 
http://blog.lukas-vlcek.com/

Re: Thinking about Mahout layout, builds, etc.

Posted by Ian Holsman <li...@holsman.net>.

Ken Montanez wrote:
> Ian, it looks like you have some experience with Maven. I have heard a lot
> of good things about it from many people on many different projects, but no
> personal experience to speak of - can you vouch for it? What do you like
> about it? Is there anything that you don't like/want to get away from?
>
> Ken
>   
my experience is more as a 'user' of maven, I haven't ever written a pom 
file or gone deeper than that.

What I like about it is that you specify which jars (and versions) the
project needs and it goes and gets them, your code builds nicely, and it
can do things like deploy your code onto a Tomcat server.

regards
Ian

Re: Thinking about Mahout layout, builds, etc.

Posted by Ken Montanez <ke...@gmail.com>.

Ian, it looks like you have some experience with Maven. I have heard a lot
of good things about it from many people on many different projects, but no
personal experience to speak of - can you vouch for it? What do you like
about it? Is there anything that you don't like/want to get away from?

Ken

On Jan 29, 2008 10:41 PM, Ian Holsman <li...@holsman.net> wrote:

> Hi Guys,
>
> my 2c's
>
> Grant Ingersoll wrote:
> > But would people prefer getting the jars separately?  I do think there
> > is some common housekeeping code, but I also don't want to
> > overemphasize it when it comes to developing an individual algorithm.
> > In other words, if a bayes classifier and an SVM implementation could
> > share a common framework, but it would end up being really confusing,
> > versus them each being more or less cleanly separated and logical, I
> > think I would favor the separated.   By the same token, if they can
> > work beautifully together, then, that would argue for more common code.
> >
> are we planning on making separate releases for the code?
> does having these bundled together somehow impact the performance or
> functionality of other algorithms?
> would the combined size of the jar be less than 2-3m?
>
> if the answer to all of these is no, we should just have a single jar.
> Size should not be an issue here, development/operational speed should be.
> It is much easier to manage a single jar operationally IMHO.
>
>
> > As for Hadoop and HBase, that is just two potential libraries.  We are
> > potentially talking 10+.  Would you want to download a huge jar that
> > contains everything when all you want is a single algorithm?  Granted,
> > that can be done from one source tree, but I wonder if that makes it
> > harder.
> yep. but in these days of maven and the like I have no idea how many
> jars I'm actually downloading, it just does it.
>
> >
> > But, I do take away that we probably should just start simple and not
> > worry about a complex build just yet.  I think it is safe to say that
> > up through our first official release we can feel free to change
> > things around if we have to.
> >
>
> yep..
> >
> > -Grant
> >
> >
> > On Jan 29, 2008, at 9:04 PM, Mason Tang wrote:
> >
> >> +1
> >>
> >> Not going to repeat the same arguments, but one other thing is that
> >> almost all of the algorithms are going to (or at least should) share
> >> some common housekeeping code, the main chunk of which will probably
> >> be IO.  Functionally, I don't think an individual algorithm is
> >> significant enough to warrant its own project, and many of them might
> >> wind up sharing common interfaces.
> >>
> >> ~ Mason
> >>
> >> Jeff Eastman wrote:
> >>> +1
> >>> A single project facilitates refactoring and promotes consistency of
> >>> design. If there's not enough code in Hadoop+Hbase to justify multiple
> >>> projects it would be premature abstraction to organize Mahout that
> way.
> >>> Let's keep it simple...
> >>> Jeff
> >>> -----Original Message-----
> >>> From: Ted Dunning [mailto:tdunning@veoh.com] Sent: Tuesday, January
> >>> 29, 2008 4:21 PM
> >>> To: mahout-dev@lucene.apache.org
> >>> Subject: Re: Thinking about Mahout layout, builds, etc.
> >>> Initially, developers will be hitting bugs or bad design all over the
> >>> place
> >>> so they would favor one project.  Also, with good package design, you
> >>> get
> >>> most of the benefits of multiple projects.
> >>> So why not start simple and migrate to complicated later?
> >>> On 1/29/08 3:15 PM, "Jeff Eastman" <je...@collab.net> wrote:
> >>>> Thinking about these alternatives from an Eclipse user's point of
> >>> view,
> >>>> the original proposal would seem to encourage multiple projects (one
> >>> per
> >>>> algorithm + a common project) while the second would encourage a
> >>> single
> >>>> project containing multiple packages. Depending upon the amount of
> >>> code
> >>>> that would reside in each algorithm, one or the other might be
> >>>> preferable.
> >>>>
> >>>> Would a given developer typically be working on the entire library
> >>>> (single project favoring) or just on one or two algorithms (multiple
> >>>> project favoring)?
> >>>>
> >>>> Jeff
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ted Dunning [mailto:tdunning@veoh.com]
> >>>> Sent: Tuesday, January 29, 2008 2:43 PM
> >>>> To: mahout-dev@lucene.apache.org
> >>>> Subject: Re: Thinking about Mahout layout, builds, etc.
> >>>>
> >>>>
> >>>>
> >>>> I think that having multiple source roots is a pain.  That is what
> >>>> packages
> >>>> are for.
> >>>>
> >>>> I would recommend instead:
> >>>>
> >>>> - at the top level, there should be trunk, tags, releases as is
> >>> typical
> >>>> in
> >>>> an SVN based project.
> >>>>
> >>>> - below trunk and any tag or release there should be:
> >>>>
> >>>>   docs
> >>>>   lib
> >>>>   src/org/apache/mahout
> >>>>
> >>>> Below the source directory, there should be packages common,
> >>> algorithmA,
> >>>> algorithmB and all tests should be located near the associated
> >>> source.
> >>>> If it is really desirable to separate tests from normal source (I
> have
> >>>> done
> >>>> it both ways and find having the tests nearby beneficial), then there
> >>>> can be
> >>>> a parallel tree next to src called "test".
> >>>>
> >>>> The target of compilation should be a single jar file.
> >>>>
> >>>>
> >>>> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> >>>>
> >>>>> I am thinking a structure like the following would be useful for
> >>>>> getting started:
> >>>>> mahout/trunk/
> >>>>>   docs
> >>>>>   common/
> >>>>>     src/
> >>>>>       main/
> >>>>>       test/
> >>>>>     docs/
> >>>>>     lib/
> >>>>>   algorithmA/
> >>>>>     Similar to common, but for this algorithm
> >>>>>   algB
> >>>>>     ...
> >>>>>   ...
> >>>>>
> >>>>> Where algorithmA, B, etc. are the various libraries we intend to
> >>>>> implement.  We can hold off on creating them until we have some
> code,
> >>>>> but was thinking it would be good to have the general layout in
> mind.
> >>>>>
> >>>>> Of course, this is expandable and changeable.  What do others think?
> >>>>>
> >>>>> On a related note, one of the things we discussed pre-Apache, was
> the
> >>>>> general sense that we shouldn't feel the need to create an all
> >>>>> encompassing framework.  The basic gist of this being that any given
> >>>>> library could be completely independent of the others (with maybe
> the
> >>>>> exception that they share a common library).  My gut says this is
> the
> >>>>> way to get started, but that it may evolve over time once we have
> >>> some
> >>>>> running time together and can start to recognize synergies, such
> that
> >>>>> maybe by the time we get to 1.0 of Mahout there may be more common
> >>>>> code than we originally thought.  The "common" area above can serve
> >>> as
> >>>>> the area for utilities, classes, common Hadoop extensions, etc. that
> >>>>> are shared between the various algorithms, but I would also say
> let's
> >>>>> not try to prematurely optimize across the algorithms just yet.
> >>>>>
> >>>>> Anyone else have any preference on this?
> >>>>>
> >>>>> -Grant
> >>>>>
> >>
> >> --
> >> Mason Tang '10, Course 6-3
> >> Address: Burton-Conner 224A        Email: masont@mit.edu
> >>         410 Memorial Dr.          Phone: 508-414-5811
> >>         Cambridge, MA 02139         WWW: www.geekbyday.com
> >
> >
> >
>
>


-- 
Ken Montanez | 510.681.5576

Re: Thinking about Mahout layout, builds, etc.

Posted by Ian Holsman <li...@holsman.net>.

Hi Guys,

my 2c's

Grant Ingersoll wrote:
> But would people prefer getting the jars separately?  I do think there 
> is some common housekeeping code, but I also don't want to 
> overemphasize it when it comes to developing an individual algorithm.  
> In other words, if a bayes classifier and an SVM implementation could 
> share a common framework, but it would end up being really confusing, 
> versus them each being more or less cleanly separated and logical, I 
> think I would favor the separated.   By the same token, if they can 
> work beautifully together, then, that would argue for more common code.
>
are we planning on making separate releases for the code?
does having these bundled together somehow impact the performance or 
functionality of other algorithms?
would the combined size of the jar be less than 2-3m?

if the answer to all of these is no, we should just have a single jar. 
Size should not be an issue here, development/operational speed should be.
It is much easier to manage a single jar operationally IMHO.


> As for Hadoop and HBase, that is just two potential libraries.  We are 
> potentially talking 10+.  Would you want to download a huge jar that 
> contains everything when all you want is a single algorithm?  Granted, 
> that can be done from one source tree, but I wonder if that makes it 
> harder.
yep. but in these days of maven and the like I have no idea how many 
jars I'm actually downloading, it just does it.

>
> But, I do take away that we probably should just start simple and not 
> worry about a complex build just yet.  I think it is safe to say that 
> up through our first official release we can feel free to change 
> things around if we have to.
>

yep..
>
> -Grant
>
>
> On Jan 29, 2008, at 9:04 PM, Mason Tang wrote:
>
>> +1
>>
>> Not going to repeat the same arguments, but one other thing is that 
>> almost all of the algorithms are going to (or at least should) share 
>> some common housekeeping code, the main chunk of which will probably 
>> be IO.  Functionally, I don't think an individual algorithm is 
>> significant enough to warrant its own project, and many of them might 
>> wind up sharing common interfaces.
>>
>> ~ Mason
>>
>> Jeff Eastman wrote:
>>> +1
>>> A single project facilitates refactoring and promotes consistency of
>>> design. If there's not enough code in Hadoop+Hbase to justify multiple
>>> projects it would be premature abstraction to organize Mahout that way.
>>> Let's keep it simple...
>>> Jeff
>>> -----Original Message-----
>>> From: Ted Dunning [mailto:tdunning@veoh.com] Sent: Tuesday, January 
>>> 29, 2008 4:21 PM
>>> To: mahout-dev@lucene.apache.org
>>> Subject: Re: Thinking about Mahout layout, builds, etc.
>>> Initially, developers will be hitting bugs or bad design all over the
>>> place
>>> so they would favor one project.  Also, with good package design, you
>>> get
>>> most of the benefits of multiple projects.
>>> So why not start simple and migrate to complicated later?
>>> On 1/29/08 3:15 PM, "Jeff Eastman" <je...@collab.net> wrote:
>>>> Thinking about these alternatives from an Eclipse user's point of
>>> view,
>>>> the original proposal would seem to encourage multiple projects (one
>>> per
>>>> algorithm + a common project) while the second would encourage a
>>> single
>>>> project containing multiple packages. Depending upon the amount of
>>> code
>>>> that would reside in each algorithm, one or the other might be
>>>> preferable.
>>>>
>>>> Would a given developer typically be working on the entire library
>>>> (single project favoring) or just on one or two algorithms (multiple
>>>> project favoring)?
>>>>
>>>> Jeff
>>>>
>>>> -----Original Message-----
>>>> From: Ted Dunning [mailto:tdunning@veoh.com]
>>>> Sent: Tuesday, January 29, 2008 2:43 PM
>>>> To: mahout-dev@lucene.apache.org
>>>> Subject: Re: Thinking about Mahout layout, builds, etc.
>>>>
>>>>
>>>>
>>>> I think that having multiple source roots is a pain.  That is what
>>>> packages
>>>> are for.
>>>>
>>>> I would recommend instead:
>>>>
>>>> - at the top level, there should be trunk, tags, releases as is
>>> typical
>>>> in
>>>> an SVN based project.
>>>>
>>>> - below trunk and any tag or release there should be:
>>>>
>>>>   docs
>>>>   lib
>>>>   src/org/apache/mahout
>>>>
>>>> Below the source directory, there should be packages common,
>>> algorithmA,
>>>> algorithmB and all tests should be located near the associated
>>> source.
>>>> If it is really desirable to separate tests from normal source (I have
>>>> done
>>>> it both ways and find having the tests nearby beneficial), then there
>>>> can be
>>>> a parallel tree next to src called "test".
>>>>
>>>> The target of compilation should be a single jar file.
>>>>
>>>>
>>>> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
>>>>
>>>>> I am thinking a structure like the following would be useful for
>>>>> getting started:
>>>>> mahout/trunk/
>>>>>   docs
>>>>>   common/
>>>>>     src/
>>>>>       main/
>>>>>       test/
>>>>>     docs/
>>>>>     lib/
>>>>>   algorithmA/
>>>>>     Similar to common, but for this algorithm
>>>>>   algB
>>>>>     ...
>>>>>   ...
>>>>>
>>>>> Where algorithmA, B, etc. are the various libraries we intend to
>>>>> implement.  We can hold off on creating them until we have some code,
>>>>> but was thinking it would be good to have the general layout in mind.
>>>>>
>>>>> Of course, this is expandable and changeable.  What do others think?
>>>>>
>>>>> On a related note, one of the things we discussed pre-Apache, was the
>>>>> general sense that we shouldn't feel the need to create an all
>>>>> encompassing framework.  The basic gist of this being that any given
>>>>> library could be completely independent of the others (with maybe the
>>>>> exception that they share a common library).  My gut says this is the
>>>>> way to get started, but that it may evolve over time once we have
>>> some
>>>>> running time together and can start to recognize synergies, such that
>>>>> maybe by the time we get to 1.0 of Mahout there may be more common
>>>>> code than we originally thought.  The "common" area above can serve
>>> as
>>>>> the area for utilities, classes, common Hadoop extensions, etc. that
>>>>> are shared between the various algorithms, but I would also say let's
>>>>> not try to prematurely optimize across the algorithms just yet.
>>>>>
>>>>> Anyone else have any preference on this?
>>>>>
>>>>> -Grant
>>>>>
>>
>> -- 
>> Mason Tang '10, Course 6-3
>> Address: Burton-Conner 224A        Email: masont@mit.edu
>>         410 Memorial Dr.          Phone: 508-414-5811
>>         Cambridge, MA 02139         WWW: www.geekbyday.com
>
>
>

Re: Thinking about Mahout layout, builds, etc.

Posted by Ted Dunning <td...@veoh.com>.

I think we are talking about a single jar for most distribution purposes.
There would occasionally be reasons for separated jars, but that would be
rare.

And, yes, I would do exactly as you say, and, indeed, I do prefer a single
jar for Colt which is quite a large library in terms of range of
functionality.  On the other extreme is Spring which often requires 4-7 jars
and it is never clear to me just which ones are needed.  I would vastly
prefer one jar with stuff I don't need in it.

So, I say again, let's start small and simple and look for demand rather
than make life hard.

On 1/29/08 7:54 PM, "Grant Ingersoll" <gs...@apache.org> wrote:

> As for Hadoop and HBase, that is just two potential libraries.  We are
> potentially talking 10+.  Would you want to download a huge jar that
> contains everything when all you want is a single algorithm?  Granted,
> that can be done from one source tree, but I wonder if that makes it
> harder.

Re: Thinking about Mahout layout, builds, etc.

Posted by Grant Ingersoll <gs...@apache.org>.

But would people prefer getting the jars separately?  I do think there  
is some common housekeeping code, but I also don't want to  
overemphasize it when it comes to developing an individual algorithm.   
In other words, if a bayes classifier and an SVM implementation could  
share a common framework, but it would end up being really confusing,  
versus them each being more or less cleanly separated and logical, I  
think I would favor the separated.   By the same token, if they can  
work beautifully together, then, that would argue for more common code.

As for Hadoop and HBase, that is just two potential libraries.  We are  
potentially talking 10+.  Would you want to download a huge jar that  
contains everything when all you want is a single algorithm?  Granted,  
that can be done from one source tree, but I wonder if that makes it  
harder.

But, I do take away that we probably should just start simple and not  
worry about a complex build just yet.  I think it is safe to say that  
up through our first official release we can feel free to change  
things around if we have to.


-Grant


On Jan 29, 2008, at 9:04 PM, Mason Tang wrote:

> +1
>
> Not going to repeat the same arguments, but one other thing is that  
> almost all of the algorithms are going to (or at least should) share  
> some common housekeeping code, the main chunk of which will probably  
> be IO.  Functionally, I don't think an individual algorithm is  
> significant enough to warrant its own project, and many of them  
> might wind up sharing common interfaces.
>
> ~ Mason
>
> Jeff Eastman wrote:
>> +1
>> A single project facilitates refactoring and promotes consistency of
>> design. If there's not enough code in Hadoop+Hbase to justify  
>> multiple
>> projects it would be premature abstraction to organize Mahout that  
>> way.
>> Let's keep it simple...
>> Jeff
>> -----Original Message-----
>> From: Ted Dunning [mailto:tdunning@veoh.com] Sent: Tuesday, January  
>> 29, 2008 4:21 PM
>> To: mahout-dev@lucene.apache.org
>> Subject: Re: Thinking about Mahout layout, builds, etc.
>> Initially, developers will be hitting bugs or bad design all over the
>> place
>> so they would favor one project.  Also, with good package design, you
>> get
>> most of the benefits of multiple projects.
>> So why not start simple and migrate to complicated later?
>> On 1/29/08 3:15 PM, "Jeff Eastman" <je...@collab.net> wrote:
>>> Thinking about these alternatives from an Eclipse user's point of
>> view,
>>> the original proposal would seem to encourage multiple projects (one
>> per
>>> algorithm + a common project) while the second would encourage a
>> single
>>> project containing multiple packages. Depending upon the amount of
>> code
>>> that would reside in each algorithm, one or the other might be
>>> preferable.
>>>
>>> Would a given developer typically be working on the entire library
>>> (single project favoring) or just on one or two algorithms (multiple
>>> project favoring)?
>>>
>>> Jeff
>>>
>>> -----Original Message-----
>>> From: Ted Dunning [mailto:tdunning@veoh.com]
>>> Sent: Tuesday, January 29, 2008 2:43 PM
>>> To: mahout-dev@lucene.apache.org
>>> Subject: Re: Thinking about Mahout layout, builds, etc.
>>>
>>>
>>>
>>> I think that having multiple source roots is a pain.  That is what
>>> packages
>>> are for.
>>>
>>> I would recommend instead:
>>>
>>> - at the top level, there should be trunk, tags, releases as is
>> typical
>>> in
>>> an SVN based project.
>>>
>>> - below trunk and any tag or release there should be:
>>>
>>>   docs
>>>   lib
>>>   src/org/apache/mahout
>>>
>>> Below the source directory, there should be packages common,
>> algorithmA,
>>> algorithmB and all tests should be located near the associated
>> source.
>>> If it is really desirable to separate tests from normal source (I  
>>> have
>>> done
>>> it both ways and find having the tests nearby beneficial), then  
>>> there
>>> can be
>>> a parallel tree next to src called "test".
>>>
>>> The target of compilation should be a single jar file.
>>>
>>>
>>> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
>>>
>>>> I am thinking a structure like the following would be useful for
>>>> getting started:
>>>> mahout/trunk/
>>>>   docs
>>>>   common/
>>>>     src/
>>>>       main/
>>>>       test/
>>>>     docs/
>>>>     lib/
>>>>   algorithmA/
>>>>     Similar to common, but for this algorithm
>>>>   algB
>>>>     ...
>>>>   ...
>>>>
>>>> Where algorithmA, B, etc. are the various libraries we intend to
>>>> implement.  We can hold off on creating them until we have some  
>>>> code,
>>>> but was thinking it would be good to have the general layout in  
>>>> mind.
>>>>
>>>> Of course, this is expandable and changeable.  What do others  
>>>> think?
>>>>
>>>> On a related note, one of the things we discussed pre-Apache, was  
>>>> the
>>>> general sense that we shouldn't feel the need to create an all
>>>> encompassing framework.  The basic gist of this being that any  
>>>> given
>>>> library could be completely independent of the others (with maybe  
>>>> the
>>>> exception that they share a common library).  My gut says this is  
>>>> the
>>>> way to get started, but that it may evolve over time once we have
>> some
>>>> running time together and can start to recognize synergies, such  
>>>> that
>>>> maybe by the time we get to 1.0 of Mahout there may be more common
>>>> code than we originally thought.  The "common" area above can serve
>> as
>>>> the area for utilities, classes, common Hadoop extensions, etc.  
>>>> that
>>>> are shared between the various algorithms, but I would also say  
>>>> let's
>>>> not try to prematurely optimize across the algorithms just yet.
>>>>
>>>> Anyone else have any preference on this?
>>>>
>>>> -Grant
>>>>
>
> -- 
> Mason Tang '10, Course 6-3
> Address: Burton-Conner 224A        Email: masont@mit.edu
>         410 Memorial Dr.          Phone: 508-414-5811
>         Cambridge, MA 02139         WWW: www.geekbyday.com

Re: Thinking about Mahout layout, builds, etc.

Posted by Mason Tang <ma...@MIT.EDU>.

+1

Not going to repeat the same arguments, but one other thing is that 
almost all of the algorithms are going to (or at least should) share 
some common housekeeping code, the main chunk of which will probably be 
IO.  Functionally, I don't think an individual algorithm is significant 
enough to warrant its own project, and many of them might wind up 
sharing common interfaces.

~ Mason

Jeff Eastman wrote:
> +1
> 
> A single project facilitates refactoring and promotes consistency of
> design. If there's not enough code in Hadoop+Hbase to justify multiple
> projects it would be premature abstraction to organize Mahout that way.
> Let's keep it simple...
> 
> Jeff
> 
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com] 
> Sent: Tuesday, January 29, 2008 4:21 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Thinking about Mahout layout, builds, etc.
> 
> 
> 
> Initially, developers will be hitting bugs or bad design all over the
> place
> so they would favor one project.  Also, with good package design, you
> get
> most of the benefits of multiple projects.
> 
> So why not start simple and migrate to complicated later?
> 
> 
> On 1/29/08 3:15 PM, "Jeff Eastman" <je...@collab.net> wrote:
> 
>> Thinking about these alternatives from an Eclipse user's point of
> view,
>> the original proposal would seem to encourage multiple projects (one
> per
>> algorithm + a common project) while the second would encourage a
> single
>> project containing multiple packages. Depending upon the amount of
> code
>> that would reside in each algorithm, one or the other might be
>> preferable.
>>
>> Would a given developer typically be working on the entire library
>> (single project favoring) or just on one or two algorithms (multiple
>> project favoring)?
>>
>> Jeff
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:tdunning@veoh.com]
>> Sent: Tuesday, January 29, 2008 2:43 PM
>> To: mahout-dev@lucene.apache.org
>> Subject: Re: Thinking about Mahout layout, builds, etc.
>>
>>
>>
>> I think that having multiple source roots is a pain.  That is what
>> packages
>> are for.
>>
>> I would recommend instead:
>>
>> - at the top level, there should be trunk, tags, releases as is
> typical
>> in
>> an SVN based project.
>>
>> - below trunk and any tag or release there should be:
>>
>>    docs
>>    lib
>>    src/org/apache/mahout
>>
>> Below the source directory, there should be packages common,
> algorithmA,
>> algorithmB and all tests should be locaated near the associated
> source.
>> If it is really desirable to separate tests from normal source (I have
>> done
>> it both ways and find having the tests nearby beneficial), then there
>> can be
>> a parallel tree next to src called "test".
>>
>> The target of compilation should be a single jar file.
>>
>>
>> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
>>
>>> I am thinking a structure like the following would be useful for
>>> getting started:
>>> mahout/trunk/
>>>    docs
>>>    common/
>>> src/
>>>             main/
>>>             test/
>>>          docs/
>>>          lib/
>>>    algorithmA/
>>>         Similar to common, but for this algorithm
>>>    algB
>>>         ...
>>>     ...
>>>
>>> Where algorithmA, B, etc. are the various libraries we intend to
>>> implement.  We can hold off on creating them until we have some code,
>>> but was thinking it would be good to have the general layout in mind.
>>>
>>> Of course, this is expandable and changeable.  What do others think?
>>>
>>> On a related note, one of the things we discussed pre-Apache, was the
>>> general sense that we shouldn't feel the need to create an all
>>> encompassing framework.  The basic gist of this being that any given
>>> library could be completely independent of the others (with maybe the
>>> exception that they share a common library).  My gut says this is the
>>> way to get started, but that it may evolve over time once we have
> some
>>> running time together and can start to recognize synergies, such that
>>> maybe by the time we get to 1.0 of Mahout there may be more common
>>> code than we originally thought.  The "common" area above can serve
> as
>>> the area for utilities, classes, common Hadoop extensions, etc. that
>>> are shared between the various algorithms, but I would also say let's
>>> not try to prematurely optimize across the algorithms just yet.
>>>
>>> Anyone else have any preference on this?
>>>
>>> -Grant
>>>
> 

-- 
Mason Tang '10, Course 6-3
Address: Burton-Conner 224A        Email: masont@mit.edu
          410 Memorial Dr.          Phone: 508-414-5811
          Cambridge, MA 02139         WWW: www.geekbyday.com

RE: Thinking about Mahout layout, builds, etc.

Posted by Jeff Eastman <je...@collab.net>.

+1

A single project facilitates refactoring and promotes consistency of
design. If there's not enough code in Hadoop+HBase to justify multiple
projects it would be premature abstraction to organize Mahout that way.
Let's keep it simple...

Jeff

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Tuesday, January 29, 2008 4:21 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Thinking about Mahout layout, builds, etc.



Initially, developers will be hitting bugs or bad design all over the
place
so they would favor one project.  Also, with good package design, you
get
most of the benefits of multiple projects.

So why not start simple and migrate to complicated later?


On 1/29/08 3:15 PM, "Jeff Eastman" <je...@collab.net> wrote:

> Thinking about these alternatives from an Eclipse user's point of
view,
> the original proposal would seem to encourage multiple projects (one
per
> algorithm + a common project) while the second would encourage a
single
> project containing multiple packages. Depending upon the amount of
code
> that would reside in each algorithm, one or the other might be
> preferable.
> 
> Would a given developer typically be working on the entire library
> (single project favoring) or just on one or two algorithms (multiple
> project favoring)?
> 
> Jeff
> 
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Tuesday, January 29, 2008 2:43 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Thinking about Mahout layout, builds, etc.
> 
> 
> 
> I think that having multiple source roots is a pain.  That is what
> packages
> are for.
> 
> I would recommend instead:
> 
> - at the top level, there should be trunk, tags, releases as is
typical
> in
> an SVN based project.
> 
> - below trunk and any tag or release there should be:
> 
>    docs
>    lib
>    src/org/apache/mahout
> 
> Below the source directory, there should be packages common,
algorithmA,
> algorithmB and all tests should be locaated near the associated
source.
> 
> If it is really desirable to separate tests from normal source (I have
> done
> it both ways and find having the tests nearby beneficial), then there
> can be
> a parallel tree next to src called "test".
> 
> The target of compilation should be a single jar file.
> 
> 
> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> 
>> I am thinking a structure like the following would be useful for
>> getting started:
>> mahout/trunk/
>>    docs
>>    common/
>> src/
>>             main/
>>             test/
>>          docs/
>>          lib/
>>    algorithmA/
>>         Similar to common, but for this algorithm
>>    algB
>>         ...
>>     ...
>> 
>> Where algorithmA, B, etc. are the various libraries we intend to
>> implement.  We can hold off on creating them until we have some code,
>> but was thinking it would be good to have the general layout in mind.
>> 
>> Of course, this is expandable and changeable.  What do others think?
>> 
>> On a related note, one of the things we discussed pre-Apache, was the
>> general sense that we shouldn't feel the need to create an all
>> encompassing framework.  The basic gist of this being that any given
>> library could be completely independent of the others (with maybe the
>> exception that they share a common library).  My gut says this is the
>> way to get started, but that it may evolve over time once we have
some
>> running time together and can start to recognize synergies, such that
>> maybe by the time we get to 1.0 of Mahout there may be more common
>> code than we originally thought.  The "common" area above can serve
as
>> the area for utilities, classes, common Hadoop extensions, etc. that
>> are shared between the various algorithms, but I would also say let's
>> not try to prematurely optimize across the algorithms just yet.
>> 
>> Anyone else have any preference on this?
>> 
>> -Grant
>> 
>

Re: Thinking about Mahout layout, builds, etc.

Posted by Ted Dunning <td...@veoh.com>.

I can testify to the value of Hudson in the Hadoop project.  I have never
seen much value in Maven, especially for pure Java projects.  Can't comment
on Ivy.


On 1/29/08 5:24 PM, "Ken Montanez" <ke...@gmail.com> wrote:

> Just curious if this technology stack has been looked at: Maven/Hudson/Ivy.
> Many groups have used these projects with great success and might address
> some of the initial questions that might come up during this initial phase
> of the project.
> 
> Ken
> 
> On Jan 29, 2008 4:38 PM, Ken Montanez <ke...@gmail.com> wrote:
> 
>> I agree. Also this is a good starting point. If we find that our initial
>> approach is not sufficient it will be easier to split from one source tree
>> to many than it will be to splice many to one; I am also trying to hint at
>> the fact that having different source tree's will tempt some to follow
>> different conventions than if everything is in one source tree (more context
>> to your work).
>> 
>> Thanks,
>> Ken
>> 
>> 
>> On Jan 29, 2008 4:22 PM, Vadim Zaliva <kr...@gmail.com> wrote:
>> 
>>> On Jan 29, 2008, at 16:13, Yousef Ourabi wrote:
>>> 
>>> I am am with Yoasef. I would prefer single-rooted source tree
>>> but would leave an option of building multiple jars. Actually
>>> we can build one jar per algorithm, plus special jumbo jar containing
>>> everything.
>>> 
>>> Sincerely,
>>> Vadim
>>> 
>>>> I'm with Ted on this one.
>>>> 
>>>> +1 for tags,trunk, branches and diff. packages.
>>>> 
>>>> Where I differ Is with the output. I can see some scenarios where it
>>>> makes
>>>> sense for ant dist-alg1, ant dist-alg2 -- this would reduce the
>>>> footprint in
>>>> applications that only need one vs the other.
>>>> 
>>>> Having multiple projects is just unnecessary over head.
>>>> 
>>>> -Yousef
>>>> 
>>>> On 1/29/08, Steve Rowe <sa...@odyssey.net> wrote:
>>>>> 
>>>>> On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
>>>>>> I would prefer to have an option not to work with whole library but
>>>>>> select only specific algorithms and optionally their particular
>>>>>> modifications.
>>>>> 
>>>>> +1
>>>>> 
>>>>>>> Thinking about these alternatives from an Eclipse user's point of
>>>>> view,
>>>>>>> the original proposal would seem to encourage multiple projects
>>>>>>> (one
>>>>>>> per algorithm + a common project) while the second would
>>>>>>> encourage a
>>>>>>> single project containing multiple packages. Depending upon the
>>>>>>> amount
>>>>>>> of code that would reside in each algorithm, one or the other
>>>>>>> might be
>>>>>>> preferable.
>>>>>>> 
>>>>>>> Would a given developer typically be working on the entire library
>>>>>>> (single project favoring) or just on one or two algorithms
>>>>>>> (multiple
>>>>>>> project favoring)?
>>>>>>> 
>>>>>>> Jeff
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Ted Dunning [mailto:tdunning@veoh.com]
>>>>>>> Sent: Tuesday, January 29, 2008 2:43 PM
>>>>>>> To: mahout-dev@lucene.apache.org
>>>>>>> Subject: Re: Thinking about Mahout layout, builds, etc.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I think that having multiple source roots is a pain.  That is what
>>>>>>> packages
>>>>>>> are for.
>>>>>>> 
>>>>>>> I would recommend instead:
>>>>>>> 
>>>>>>> - at the top level, there should be trunk, tags, releases as is
>>>>> typical
>>>>>>> in an SVN based project.
>>>>>>> 
>>>>>>> - below trunk and any tag or release there should be:
>>>>>>> 
>>>>>>>  docs
>>>>>>>  lib
>>>>>>>  src/org/apache/mahout
>>>>>>> 
>>>>>>> Below the source directory, there should be packages common,
>>>>>>> algorithmA, algorithmB and all tests should be locaated near the
>>>>>>> associated source.
>>>>>>> 
>>>>>>> If it is really desirable to separate tests from normal source (I
>>>>>>> have
>>>>>>> done it both ways and find having the tests nearby beneficial),
>>>>>>> then
>>>>>>> there can be a parallel tree next to src called "test".
>>>>>>> 
>>>>>>> The target of compilation should be a single jar file.
>>>>>>> 
>>>>>>> 
>>>>>>> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
>>>>>>> 
>>>>>>>> I am thinking a structure like the following would be useful for
>>>>>>>> getting started:
>>>>>>>> mahout/trunk/
>>>>>>>>   docs
>>>>>>>>   common/
>>>>>>>> src/
>>>>>>>>            main/
>>>>>>>>            test/
>>>>>>>>         docs/
>>>>>>>>         lib/
>>>>>>>>   algorithmA/
>>>>>>>>        Similar to common, but for this algorithm algB ...
>>>>>>>>    ...
>>>>>>>> 
>>>>>>>> Where algorithmA, B, etc. are the various libraries we intend to
>>>>>>>> implement.  We can hold off on creating them until we have some
>>>>> code,
>>>>>>>> but was thinking it would be good to have the general layout in
>>>>> mind.
>>>>>>>> 
>>>>>>>> Of course, this is expandable and changeable.  What do others
>>>>>>>> think?
>>>>>>>> 
>>>>>>>> On a related note, one of the things we discussed pre-Apache, was
>>>>> the
>>>>>>>> general sense that we shouldn't feel the need to create an all
>>>>>>>> encompassing framework.  The basic gist of this being that any
>>>>>>>> given
>>>>>>>> library could be completely independent of the others (with maybe
>>>>> the
>>>>>>>> exception that they share a common library).  My gut says this is
>>>>> the
>>>>>>>> way to get started, but that it may evolve over time once we have
>>>>> some
>>>>>>>> running time together and can start to recognize synergies, such
>>>>> that
>>>>>>>> maybe by the time we get to 1.0 of Mahout there may be more common
>>>>>>>> code than we originally thought.  The "common" area above can
>>>>>>>> serve
>>>>> as
>>>>>>>> the area for utilities, classes, common Hadoop extensions, etc.
>>>>>>>> that
>>>>>>>> are shared between the various algorithms, but I would also say
>>>>> let's
>>>>>>>> not try to prematurely optimize across the algorithms just yet.
>>>>>>>> 
>>>>>>>> Anyone else have any preference on this?
>>>>>>>> 
>>>>>>>> -Grant
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
>

Re: Thinking about Mahout layout, builds, etc.

Posted by Ken Montanez <ke...@gmail.com>.

Just curious if this technology stack has been looked at: Maven/Hudson/Ivy.
Many groups have used these projects with great success, and they might
address some of the questions that will come up during this initial phase
of the project.

Ken

On Jan 29, 2008 4:38 PM, Ken Montanez <ke...@gmail.com> wrote:

> I agree. Also this is a good starting point. If we find that our initial
> approach is not sufficient it will be easier to split from one source tree
> to many than it will be to splice many to one; I am also trying to hint at
> the fact that having different source tree's will tempt some to follow
> different conventions than if everything is in one source tree (more context
> to your work).
>
> Thanks,
> Ken
>
>
> On Jan 29, 2008 4:22 PM, Vadim Zaliva <kr...@gmail.com> wrote:
>
> > On Jan 29, 2008, at 16:13, Yousef Ourabi wrote:
> >
> > I am am with Yoasef. I would prefer single-rooted source tree
> > but would leave an option of building multiple jars. Actually
> > we can build one jar per algorithm, plus special jumbo jar containing
> > everything.
> >
> > Sincerely,
> > Vadim
> >
> > > I'm with Ted on this one.
> > >
> > > +1 for tags,trunk, branches and diff. packages.
> > >
> > > Where I differ Is with the output. I can see some scenarios where it
> > > makes
> > > sense for ant dist-alg1, ant dist-alg2 -- this would reduce the
> > > footprint in
> > > applications that only need one vs the other.
> > >
> > > Having multiple projects is just unnecessary over head.
> > >
> > > -Yousef
> > >
> > > On 1/29/08, Steve Rowe <sa...@odyssey.net> wrote:
> > >>
> > >> On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
> > >>> I would prefer to have an option not to work with whole library but
> > >>> select only specific algorithms and optionally their particular
> > >>> modifications.
> > >>
> > >> +1
> > >>
> > >>>> Thinking about these alternatives from an Eclipse user's point of
> > >> view,
> > >>>> the original proposal would seem to encourage multiple projects
> > >>>> (one
> > >>>> per algorithm + a common project) while the second would
> > >>>> encourage a
> > >>>> single project containing multiple packages. Depending upon the
> > >>>> amount
> > >>>> of code that would reside in each algorithm, one or the other
> > >>>> might be
> > >>>> preferable.
> > >>>>
> > >>>> Would a given developer typically be working on the entire library
> > >>>> (single project favoring) or just on one or two algorithms
> > >>>> (multiple
> > >>>> project favoring)?
> > >>>>
> > >>>> Jeff
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: Ted Dunning [mailto:tdunning@veoh.com]
> > >>>> Sent: Tuesday, January 29, 2008 2:43 PM
> > >>>> To: mahout-dev@lucene.apache.org
> > >>>> Subject: Re: Thinking about Mahout layout, builds, etc.
> > >>>>
> > >>>>
> > >>>>
> > >>>> I think that having multiple source roots is a pain.  That is what
> > >>>> packages
> > >>>> are for.
> > >>>>
> > >>>> I would recommend instead:
> > >>>>
> > >>>> - at the top level, there should be trunk, tags, releases as is
> > >> typical
> > >>>> in an SVN based project.
> > >>>>
> > >>>> - below trunk and any tag or release there should be:
> > >>>>
> > >>>>  docs
> > >>>>  lib
> > >>>>  src/org/apache/mahout
> > >>>>
> > >>>> Below the source directory, there should be packages common,
> > >>>> algorithmA, algorithmB and all tests should be locaated near the
> > >>>> associated source.
> > >>>>
> > >>>> If it is really desirable to separate tests from normal source (I
> > >>>> have
> > >>>> done it both ways and find having the tests nearby beneficial),
> > >>>> then
> > >>>> there can be a parallel tree next to src called "test".
> > >>>>
> > >>>> The target of compilation should be a single jar file.
> > >>>>
> > >>>>
> > >>>> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> > >>>>
> > >>>>> I am thinking a structure like the following would be useful for
> > >>>>> getting started:
> > >>>>> mahout/trunk/
> > >>>>>   docs
> > >>>>>   common/
> > >>>>> src/
> > >>>>>            main/
> > >>>>>            test/
> > >>>>>         docs/
> > >>>>>         lib/
> > >>>>>   algorithmA/
> > >>>>>        Similar to common, but for this algorithm algB ...
> > >>>>>    ...
> > >>>>>
> > >>>>> Where algorithmA, B, etc. are the various libraries we intend to
> > >>>>> implement.  We can hold off on creating them until we have some
> > >> code,
> > >>>>> but was thinking it would be good to have the general layout in
> > >> mind.
> > >>>>>
> > >>>>> Of course, this is expandable and changeable.  What do others
> > >>>>> think?
> > >>>>>
> > >>>>> On a related note, one of the things we discussed pre-Apache, was
> > >> the
> > >>>>> general sense that we shouldn't feel the need to create an all
> > >>>>> encompassing framework.  The basic gist of this being that any
> > >>>>> given
> > >>>>> library could be completely independent of the others (with maybe
> > >> the
> > >>>>> exception that they share a common library).  My gut says this is
> > >> the
> > >>>>> way to get started, but that it may evolve over time once we have
> > >> some
> > >>>>> running time together and can start to recognize synergies, such
> > >> that
> > >>>>> maybe by the time we get to 1.0 of Mahout there may be more common
> > >>>>> code than we originally thought.  The "common" area above can
> > >>>>> serve
> > >> as
> > >>>>> the area for utilities, classes, common Hadoop extensions, etc.
> > >>>>> that
> > >>>>> are shared between the various algorithms, but I would also say
> > >> let's
> > >>>>> not try to prematurely optimize across the algorithms just yet.
> > >>>>>
> > >>>>> Anyone else have any preference on this?
> > >>>>>
> > >>>>> -Grant
> > >>>
> > >>
> > >>
> >
> >
>


-- 
Ken Montanez | 510.681.5576

Re: Thinking about Mahout layout, builds, etc.

Posted by Ken Montanez <ke...@gmail.com>.

I agree. Also this is a good starting point. If we find that our initial
approach is not sufficient it will be easier to split from one source tree
to many than it will be to splice many to one; I am also trying to hint at
the fact that having different source trees will tempt some to follow
different conventions than if everything is in one source tree (more context
to your work).

Thanks,
Ken

On Jan 29, 2008 4:22 PM, Vadim Zaliva <kr...@gmail.com> wrote:

> On Jan 29, 2008, at 16:13, Yousef Ourabi wrote:
>
> I am am with Yoasef. I would prefer single-rooted source tree
> but would leave an option of building multiple jars. Actually
> we can build one jar per algorithm, plus special jumbo jar containing
> everything.
>
> Sincerely,
> Vadim
>
> > I'm with Ted on this one.
> >
> > +1 for tags,trunk, branches and diff. packages.
> >
> > Where I differ Is with the output. I can see some scenarios where it
> > makes
> > sense for ant dist-alg1, ant dist-alg2 -- this would reduce the
> > footprint in
> > applications that only need one vs the other.
> >
> > Having multiple projects is just unnecessary over head.
> >
> > -Yousef
> >
> > On 1/29/08, Steve Rowe <sa...@odyssey.net> wrote:
> >>
> >> On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
> >>> I would prefer to have an option not to work with whole library but
> >>> select only specific algorithms and optionally their particular
> >>> modifications.
> >>
> >> +1
> >>
> >>>> Thinking about these alternatives from an Eclipse user's point of
> >> view,
> >>>> the original proposal would seem to encourage multiple projects
> >>>> (one
> >>>> per algorithm + a common project) while the second would
> >>>> encourage a
> >>>> single project containing multiple packages. Depending upon the
> >>>> amount
> >>>> of code that would reside in each algorithm, one or the other
> >>>> might be
> >>>> preferable.
> >>>>
> >>>> Would a given developer typically be working on the entire library
> >>>> (single project favoring) or just on one or two algorithms
> >>>> (multiple
> >>>> project favoring)?
> >>>>
> >>>> Jeff
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ted Dunning [mailto:tdunning@veoh.com]
> >>>> Sent: Tuesday, January 29, 2008 2:43 PM
> >>>> To: mahout-dev@lucene.apache.org
> >>>> Subject: Re: Thinking about Mahout layout, builds, etc.
> >>>>
> >>>>
> >>>>
> >>>> I think that having multiple source roots is a pain.  That is what
> >>>> packages
> >>>> are for.
> >>>>
> >>>> I would recommend instead:
> >>>>
> >>>> - at the top level, there should be trunk, tags, releases as is
> >> typical
> >>>> in an SVN based project.
> >>>>
> >>>> - below trunk and any tag or release there should be:
> >>>>
> >>>>  docs
> >>>>  lib
> >>>>  src/org/apache/mahout
> >>>>
> >>>> Below the source directory, there should be packages common,
> >>>> algorithmA, algorithmB and all tests should be locaated near the
> >>>> associated source.
> >>>>
> >>>> If it is really desirable to separate tests from normal source (I
> >>>> have
> >>>> done it both ways and find having the tests nearby beneficial),
> >>>> then
> >>>> there can be a parallel tree next to src called "test".
> >>>>
> >>>> The target of compilation should be a single jar file.
> >>>>
> >>>>
> >>>> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> >>>>
> >>>>> I am thinking a structure like the following would be useful for
> >>>>> getting started:
> >>>>> mahout/trunk/
> >>>>>   docs
> >>>>>   common/
> >>>>> src/
> >>>>>            main/
> >>>>>            test/
> >>>>>         docs/
> >>>>>         lib/
> >>>>>   algorithmA/
> >>>>>        Similar to common, but for this algorithm algB ...
> >>>>>    ...
> >>>>>
> >>>>> Where algorithmA, B, etc. are the various libraries we intend to
> >>>>> implement.  We can hold off on creating them until we have some
> >> code,
> >>>>> but was thinking it would be good to have the general layout in
> >> mind.
> >>>>>
> >>>>> Of course, this is expandable and changeable.  What do others
> >>>>> think?
> >>>>>
> >>>>> On a related note, one of the things we discussed pre-Apache, was
> >> the
> >>>>> general sense that we shouldn't feel the need to create an all
> >>>>> encompassing framework.  The basic gist of this being that any
> >>>>> given
> >>>>> library could be completely independent of the others (with maybe
> >> the
> >>>>> exception that they share a common library).  My gut says this is
> >> the
> >>>>> way to get started, but that it may evolve over time once we have
> >> some
> >>>>> running time together and can start to recognize synergies, such
> >> that
> >>>>> maybe by the time we get to 1.0 of Mahout there may be more common
> >>>>> code than we originally thought.  The "common" area above can
> >>>>> serve
> >> as
> >>>>> the area for utilities, classes, common Hadoop extensions, etc.
> >>>>> that
> >>>>> are shared between the various algorithms, but I would also say
> >> let's
> >>>>> not try to prematurely optimize across the algorithms just yet.
> >>>>>
> >>>>> Anyone else have any preference on this?
> >>>>>
> >>>>> -Grant
> >>>
> >>
> >>
>
>

Re: Thinking about Mahout layout, builds, etc.

Posted by Vadim Zaliva <kr...@gmail.com>.

On Jan 29, 2008, at 16:13, Yousef Ourabi wrote:

I am with Yousef. I would prefer a single-rooted source tree
but would leave an option of building multiple jars. Actually
we can build one jar per algorithm, plus a special jumbo jar containing
everything.
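
A rough sketch of how a single-rooted tree could still yield one jar per algorithm plus a jumbo jar, assuming the src/org/apache/mahout package root proposed earlier in the thread. The JarPlanner class and the jar names (mahout-<package>.jar, mahout-all.jar) are hypothetical and only illustrate the grouping. A real build would do this in the build tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: decides which jars a compiled class belongs to.
final class JarPlanner {
    // Package root assumed from the layout proposed in this thread.
    static final String ROOT = "org/apache/mahout/";
    // Assumed name for the jar that contains everything.
    static final String JUMBO = "mahout-all.jar";

    private JarPlanner() {}

    // Maps org/apache/mahout/<pkg>/... to mahout-<pkg>.jar.
    static String jarFor(String classPath) {
        if (!classPath.startsWith(ROOT)) {
            throw new IllegalArgumentException("outside package root: " + classPath);
        }
        String rest = classPath.substring(ROOT.length());
        int slash = rest.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("no sub-package: " + classPath);
        }
        return "mahout-" + rest.substring(0, slash) + ".jar";
    }

    // Every class goes into its package jar and into the jumbo jar.
    static Map<String, List<String>> plan(List<String> classPaths) {
        Map<String, List<String>> jars = new TreeMap<>();
        for (String path : classPaths) {
            jars.computeIfAbsent(jarFor(path), k -> new ArrayList<>()).add(path);
            jars.computeIfAbsent(JUMBO, k -> new ArrayList<>()).add(path);
        }
        return jars;
    }
}

class JarPlanSketch {
    public static void main(String[] args) {
        // Class names below are made up for the example.
        Map<String, List<String>> jars = JarPlanner.plan(Arrays.asList(
            "org/apache/mahout/common/IOUtil.class",
            "org/apache/mahout/algorithmA/Trainer.class",
            "org/apache/mahout/algorithmA/Model.class"));
        for (Map.Entry<String, List<String>> jar : jars.entrySet()) {
            System.out.println(jar.getKey() + " <- " + jar.getValue());
        }
    }
}
```

Under this scheme an application that needs only one algorithm would still have to put the common jar on its classpath next to the algorithm jar.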

Sincerely,
Vadim

> I'm with Ted on this one.
>
> +1 for tags,trunk, branches and diff. packages.
>
> Where I differ Is with the output. I can see some scenarios where it  
> makes
> sense for ant dist-alg1, ant dist-alg2 -- this would reduce the  
> footprint in
> applications that only need one vs the other.
>
> Having multiple projects is just unnecessary over head.
>
> -Yousef
>
> On 1/29/08, Steve Rowe <sa...@odyssey.net> wrote:
>>
>> On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
>>> I would prefer to have an option not to work with whole library but
>>> select only specific algorithms and optionally their particular
>>> modifications.
>>
>> +1
>>
>>>> Thinking about these alternatives from an Eclipse user's point of
>> view,
>>>> the original proposal would seem to encourage multiple projects  
>>>> (one
>>>> per algorithm + a common project) while the second would  
>>>> encourage a
>>>> single project containing multiple packages. Depending upon the  
>>>> amount
>>>> of code that would reside in each algorithm, one or the other  
>>>> might be
>>>> preferable.
>>>>
>>>> Would a given developer typically be working on the entire library
>>>> (single project favoring) or just on one or two algorithms  
>>>> (multiple
>>>> project favoring)?
>>>>
>>>> Jeff
>>>>
>>>> -----Original Message-----
>>>> From: Ted Dunning [mailto:tdunning@veoh.com]
>>>> Sent: Tuesday, January 29, 2008 2:43 PM
>>>> To: mahout-dev@lucene.apache.org
>>>> Subject: Re: Thinking about Mahout layout, builds, etc.
>>>>
>>>>
>>>>
>>>> I think that having multiple source roots is a pain.  That is what
>>>> packages
>>>> are for.
>>>>
>>>> I would recommend instead:
>>>>
>>>> - at the top level, there should be trunk, tags, releases as is
>> typical
>>>> in an SVN based project.
>>>>
>>>> - below trunk and any tag or release there should be:
>>>>
>>>>  docs
>>>>  lib
>>>>  src/org/apache/mahout
>>>>
>>>> Below the source directory, there should be packages common,
>>>> algorithmA, algorithmB and all tests should be locaated near the
>>>> associated source.
>>>>
>>>> If it is really desirable to separate tests from normal source (I  
>>>> have
>>>> done it both ways and find having the tests nearby beneficial),  
>>>> then
>>>> there can be a parallel tree next to src called "test".
>>>>
>>>> The target of compilation should be a single jar file.
>>>>
>>>>
>>>> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
>>>>
>>>>> I am thinking a structure like the following would be useful for
>>>>> getting started:
>>>>> mahout/trunk/
>>>>>   docs
>>>>>   common/
>>>>> src/
>>>>>            main/
>>>>>            test/
>>>>>         docs/
>>>>>         lib/
>>>>>   algorithmA/
>>>>>        Similar to common, but for this algorithm algB ...
>>>>>    ...
>>>>>
>>>>> Where algorithmA, B, etc. are the various libraries we intend to
>>>>> implement.  We can hold off on creating them until we have some
>> code,
>>>>> but was thinking it would be good to have the general layout in
>> mind.
>>>>>
>>>>> Of course, this is expandable and changeable.  What do others  
>>>>> think?
>>>>>
>>>>> On a related note, one of the things we discussed pre-Apache, was
>> the
>>>>> general sense that we shouldn't feel the need to create an all
>>>>> encompassing framework.  The basic gist of this being that any  
>>>>> given
>>>>> library could be completely independent of the others (with maybe
>> the
>>>>> exception that they share a common library).  My gut says this is
>> the
>>>>> way to get started, but that it may evolve over time once we have
>> some
>>>>> running time together and can start to recognize synergies, such
>> that
>>>>> maybe by the time we get to 1.0 of Mahout there may be more common
>>>>> code than we originally thought.  The "common" area above can  
>>>>> serve
>> as
>>>>> the area for utilities, classes, common Hadoop extensions, etc.  
>>>>> that
>>>>> are shared between the various algorithms, but I would also say
>> let's
>>>>> not try to prematurely optimize across the algorithms just yet.
>>>>>
>>>>> Anyone else have any preference on this?
>>>>>
>>>>> -Grant
>>>
>>
>>

Re: Thinking about Mahout layout, builds, etc.

Posted by Yousef Ourabi <yo...@zero-analog.com>.

I'm with Ted on this one.

+1 for tags, trunk, branches and different packages.

Where I differ is with the output. I can see some scenarios where it makes
sense for ant dist-alg1, ant dist-alg2 -- this would reduce the footprint in
applications that only need one vs the other.

Having multiple projects is just unnecessary overhead.

-Yousef

On 1/29/08, Steve Rowe <sa...@odyssey.net> wrote:
>
> On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
> > I would prefer to have an option not to work with whole library but
> > select only specific algorithms and optionally their particular
> > modifications.
>
> +1
>
> > > Thinking about these alternatives from an Eclipse user's point of
> view,
> > > the original proposal would seem to encourage multiple projects (one
> > > per algorithm + a common project) while the second would encourage a
> > > single project containing multiple packages. Depending upon the amount
> > > of code that would reside in each algorithm, one or the other might be
> > > preferable.
> > >
> > > Would a given developer typically be working on the entire library
> > > (single project favoring) or just on one or two algorithms (multiple
> > > project favoring)?
> > >
> > > Jeff
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:tdunning@veoh.com]
> > > Sent: Tuesday, January 29, 2008 2:43 PM
> > > To: mahout-dev@lucene.apache.org
> > > Subject: Re: Thinking about Mahout layout, builds, etc.
> > >
> > >
> > >
> > > I think that having multiple source roots is a pain.  That is what
> > > packages
> > > are for.
> > >
> > > I would recommend instead:
> > >
> > > - at the top level, there should be trunk, tags, releases as is
> typical
> > > in an SVN based project.
> > >
> > > - below trunk and any tag or release there should be:
> > >
> > >   docs
> > >   lib
> > >   src/org/apache/mahout
> > >
> > > Below the source directory, there should be packages common,
> > > algorithmA, algorithmB and all tests should be locaated near the
> > > associated source.
> > >
> > > If it is really desirable to separate tests from normal source (I have
> > > done it both ways and find having the tests nearby beneficial), then
> > > there can be a parallel tree next to src called "test".
> > >
> > > The target of compilation should be a single jar file.
> > >
> > >
> > > On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> > >
> > > > I am thinking a structure like the following would be useful for
> > > > getting started:
> > > > mahout/trunk/
> > > >    docs
> > > >    common/
> > > > src/
> > > >             main/
> > > >             test/
> > > >          docs/
> > > >          lib/
> > > >    algorithmA/
> > > >         Similar to common, but for this algorithm algB ...
> > > >     ...
> > > >
> > > > > Where algorithmA, B, etc. are the various libraries we intend to
> > > > > implement.  We can hold off on creating them until we have some code,
> > > > > but was thinking it would be good to have the general layout in mind.
> > > > >
> > > > > Of course, this is expandable and changeable.  What do others think?
> > > > >
> > > > > On a related note, one of the things we discussed pre-Apache, was the
> > > > > general sense that we shouldn't feel the need to create an all
> > > > > encompassing framework.  The basic gist of this being that any given
> > > > > library could be completely independent of the others (with maybe the
> > > > > exception that they share a common library).  My gut says this is the
> > > > > way to get started, but that it may evolve over time once we have some
> > > > > running time together and can start to recognize synergies, such that
> > > > > maybe by the time we get to 1.0 of Mahout there may be more common
> > > > > code than we originally thought.  The "common" area above can serve as
> > > > > the area for utilities, classes, common Hadoop extensions, etc. that
> > > > > are shared between the various algorithms, but I would also say let's
> > > > > not try to prematurely optimize across the algorithms just yet.
> > > >
> > > > Anyone else have any preference on this?
> > > >
> > > > -Grant
> >
>
>

Re: Thinking about Mahout layout, builds, etc.

Posted by Karl Wettin <ka...@gmail.com>.

30 jan 2008 kl. 08.45 skrev Isabel Drost:

> On Wednesday 30 January 2008, Steve Rowe wrote:
>> On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
>>> I would prefer to have an option not to work with whole library but
>>> select only specific algorithms and optionally their particular
>>> modifications.
>>
>> +1
>
> +1 I would at least like to have one downloadable jar for each algorithm
> family

With Grant's current proposal combined with an adoption of the Lucene
Ant build, each algorithm package would end up as its own jar file.


   karl

Re: Thinking about Mahout layout, builds, etc.

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Wednesday 30 January 2008, Toby DiPasquale wrote:
> My votes are as follows.

Mine are mostly equal to DiPasquale's:

Build system:
      +0 Maven 2
      +0 Ant

Project structure:
     +0 Per-algorithm source tree
     +1 Single source tree

Release artifact(s):
     +1 Per-algorithm jar
     +1 Monolithic jar

-- 
Clarke's Conclusion:	Never let your sense of morals interfere with doing the 
right thing.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

RE: Thinking about Mahout layout, builds, etc.

Posted by Jeff Eastman <je...@collab.net>.

+1

Jeff

-----Original Message-----
From: Toby DiPasquale [mailto:toby@cbcg.net] 
Sent: Wednesday, January 30, 2008 9:50 AM
To: mahout-dev@lucene.apache.org
Subject: Re: Thinking about Mahout layout, builds, etc.

On Jan 30, 2008 12:29 PM, Ted Dunning <td...@veoh.com> wrote:
> I have said too much already, but here are my votes:

My votes are as follows:

Build system:
     +0 Maven 2
     +1 Ant

Project structure:
     +0 Per-algorithm source tree
     +1 Single source tree

Release artifact(s):
     +1 Per-algorithm jar
     +1 Monolithic jar

-- 
Toby DiPasquale

Re: Thinking about Mahout layout, builds, etc.

Posted by Toby DiPasquale <to...@cbcg.net>.

On Jan 30, 2008 12:29 PM, Ted Dunning <td...@veoh.com> wrote:
> I have said too much already, but here are my votes:

My votes are as follows:

Build system:
     +0 Maven 2
     +1 Ant

Project structure:
     +0 Per-algorithm source tree
     +1 Single source tree

Release artifact(s):
     +1 Per-algorithm jar
     +1 Monolithic jar

-- 
Toby DiPasquale

Re: Thinking about Mahout layout, builds, etc.

Posted by Yousef Ourabi <yo...@zero-analog.com>.

Everyone seems to be talking about monolithic jars vs modular jars as if
they are mutually exclusive.

I don't see why we can't have both and make everyone happy. If someone
really wants only one jar as well as mahout-common then the burden is on
them to do a little more research and figure out deps. If they want
everything they just use the all singing all dancing jar.

To get this functionality it's just a simple <copy> task with an exclude, or
include depending on how you look at it (there are also many ways to skin
this cat).

About dependencies for Mahout: simply stick them in a lib directory. Done,
end of story. See Apache Solr. There is no magic here.

ant dist (all) -> mahout.jar
ant dist-foo -> mahout-foo.jar
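
A minimal sketch of how the two targets named above might look in an Ant
build file. The target names follow the dist / dist-foo naming in this
message; the property names, the build/classes directory, and the
org/apache/mahout/foo package are assumptions made for illustration, not an
agreed layout. It uses the <jar> task's includes attribute rather than a
<copy> step, which is just one of the "many ways to skin this cat".

```xml
<project name="mahout" default="dist" basedir=".">
  <!-- Hypothetical locations; adjust once a real layout is agreed. -->
  <property name="src.dir" value="src"/>
  <property name="lib.dir" value="lib"/>
  <property name="classes.dir" value="build/classes"/>
  <property name="dist.dir" value="dist"/>

  <!-- Third-party jars are simply picked up from the lib directory. -->
  <path id="compile.classpath">
    <fileset dir="${lib.dir}" includes="*.jar"/>
  </path>

  <target name="compile">
    <mkdir dir="${classes.dir}"/>
    <javac srcdir="${src.dir}" destdir="${classes.dir}"
           classpathref="compile.classpath"/>
  </target>

  <!-- ant dist: every compiled class in a single jar -->
  <target name="dist" depends="compile">
    <mkdir dir="${dist.dir}"/>
    <jar destfile="${dist.dir}/mahout.jar" basedir="${classes.dir}"/>
  </target>

  <!-- ant dist-foo: only the (hypothetical) foo algorithm package -->
  <target name="dist-foo" depends="compile">
    <mkdir dir="${dist.dir}"/>
    <jar destfile="${dist.dir}/mahout-foo.jar" basedir="${classes.dir}"
         includes="org/apache/mahout/foo/**"/>
  </target>
</project>
```

Anyone using only mahout-foo.jar from a build like this would still need the
common classes on the classpath, which is the extra dependency research
mentioned above.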

To re-summarize

Build:
+0 Maven
+1 Ant

Structure:
+0 Per project layout
+1 Single Source Tree

Build Out:
+1 Modular
+1 Single

(Yes, I favor both).

On 1/30/08, Dawid Weiss <da...@cs.put.poznan.pl> wrote:
>
>
> I share some folks' opinion. While Maven seems nicer in some aspects, I like
> build scripts that are predictable and consistent. Even if ANT can be a pain, I
> would go with it instead of Maven.
>
> My votes:
>
> Build system:
>      +0 Maven 2
>      +1 Ant
>
> Project structure:
>      +0 Per-algorithm source tree
>      +1 Single source tree
>
> Release artifact(s):
>      +0 Per-algorithm jar
>      +1 Monolithic jar
>
> I don't think we will have that many classes to be concerned with JAR size. The
> dependencies will be much larger and splitting these (or making a clear
> documentation about which algorithm requires which JARs) will be more important.
>
> Dawid
>

Re: Thinking about Mahout layout, builds, etc.

Posted by Ted Dunning <td...@veoh.com>.

I have said too much already, but here are my votes:


Build system:
     +0 Maven 2
     +1 Ant

Project structure:
     +0 Per-algorithm source tree
     +1 Single source tree

Release artifact(s):
     +0 Per-algorithm jar
     +1 Monolithic jar

Re: Thinking about Mahout layout, builds, etc.

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

I share some folks' opinion. While Maven seems nicer in some aspects, I like 
build scripts that are predictable and consistent. Even if ANT can be a pain, I 
would go with it instead of Maven.

My votes:

Build system:
     +0 Maven 2
     +1 Ant

Project structure:
     +0 Per-algorithm source tree
     +1 Single source tree

Release artifact(s):
     +0 Per-algorithm jar
     +1 Monolithic jar

I don't think we will have that many classes to be concerned with JAR size. The 
dependencies will be much larger and splitting these (or making a clear 
documentation about which algorithm requires which JARs) will be more important.

Dawid

Re: Thinking about Mahout layout, builds, etc.

Posted by Grant Ingersoll <gs...@apache.org>.

On Jan 30, 2008, at 8:56 AM, Steve Rowe wrote:

> I have heard good things about Ivy for dependency management, though  
> I've never used it.  I think it leverages Maven remote repositories.  
> And Lucene uses Forrest to build its site.  Both of these things can  
> be bolted on later if we start with an Ant build.

Our site is in Forrest too.  It was the easiest for me to set up since
I pretty much just copied it from Lucene and removed stuff we don't  
need.  In the early stages, I think we will want to rely more heavily  
on the wiki for anything beyond the basic project metadata that is  
already there.  Once we get real code, then we can publish per release  
docs.

Re: Thinking about Mahout layout, builds, etc.

Posted by Steve Rowe <sa...@odyssey.net>.

I have experience with Ant, Maven 1.X, and Maven 2.

Maven 1.X builds are a nightmare to maintain - we should not go there. 
Fortunately for Maven 2, it's a completely different animal.

When your use case is directly supported by Maven 2, it's a beautiful 
thing.  As Grant says, it's magic.  Like Grant, I've written M2 plugins 
and set up some complex builds.

But unless you pin down the versions of plugins you use (currently 
possible using <pluginManagement>) and the exact version of M2 you 
use (I don't think this is possible yet), people can get different 
results.  Maven has never been terribly stable, because it's in a 
constant state of change.  Maven 2.1 is the current focus of 
development, so modifications to 2.0.X tend to take a long time to be 
released.  This has long been the Maven way: focusing on future 
(backwards-incompatible) versions to the detriment of the existing versions.
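
As an illustration of pinning plugin versions, a POM fragment might look
like the following. This is only a sketch: the plugin and the version number
are examples, not a recommendation, and it uses <pluginManagement>, the POM
section Maven 2 provides for managing plugin versions (<dependencyManagement>
plays the same role for library dependencies).

```xml
<project>
  <modelVersion>4.0.0</modelVersion>
  <!-- The project's own groupId/artifactId/version are omitted in this fragment. -->
  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <!-- Example version only; pin whichever version the project has tested. -->
          <version>2.0.2</version>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>
```

Every plugin the build uses would need an entry like this for all developers
to build on the same base.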

If we don't use Maven, then we need to have alternative dependency 
management and site building facilities, since, unlike Maven, Ant does 
not provide support for these.  Maybe at the start we can ignore these 
two - without code, there isn't much of a site required, and the 
dependencies will be fairly static (per algorithm).

I have heard good things about Ivy for dependency management, though 
I've never used it.  I think it leverages Maven remote repositories. 
And Lucene uses Forrest to build its site.  Both of these things can be 
bolted on later if we start with an Ant build.

I've changed my mind about the project structure: I think it's okay to 
start out with a single source tree.  If it makes sense to do so later, 
splitting algorithms out shouldn't be too hard.

Similarly, I think shipping a monolithic jar is okay to begin with. 
Size is definitely not an issue, in the short- and medium-term, anyway.

Summarizing my votes:

Build system:
	+0 Maven 2
	+1 Ant

Project structure:
	+0 Per-algorithm source tree
	+1 Single source tree

Release artifact(s):
	+0 Per-algorithm jar
	+1 Monolithic jar

Steve

Grant Ingersoll wrote:
> A couple of comments on various things that have come up (btw, I love 
> the participation, already!)
> 
> 1.  The structure fits well with Maven or ANT.  Personally, I have come 
> full circle from ANT - Maven - ANT.  I have done a lot of ANT building 
> and a lot of Maven building, including writing plugins/tasks, etc.  ANT 
> is less magic at the cost of a little more upfront work (but it is easy 
> to set up common build functionality, etc.).  Magic in your builds is not 
> good.  Maven updates itself automatically, gets jars automatically, 
> etc.  I know this sounds like a good thing, but it isn't, IMO.  
> Especially when it comes to the plugins.  You have no idea whether 
> everyone is building on the same base.  Maven does not do much to 
> guarantee back-compatibility, either.  On the other hand, the Maven 
> repository is really nice.  And I really like that Maven has convinced 
> people that using common file structures and conventions is a good thing 
> in project management.  But neither of these things requires Maven 
> itself.  I tend to want to minimize our 3rd party dependencies, anyway, 
> as much as possible.  The simpler we can keep this, the better off we 
> will be.
> 
> 2. One other good thing from an infrastructure point of view for the 
> sub-project structure is we can, in theory, give permission to a 
> committer on a single algorithm, much like the contrib modules in 
> Lucene.  This isn't a big deal, but it could be useful, if someone is 
> really knowledgeable in one particular area and is only contributing in 
> that area.  Generally, however, I would favor making someone a full 
> committer.
> 
> 3. I do like the idea of both separate jars and a single uber-jar.  This 
> is trivial to do in both ANT and Maven.
> 
> -Grant
> 
> On Jan 30, 2008, at 3:21 AM, Ted Dunning wrote:
> 
>>
>> And all of Colt is < 1M.
>>
>> I would say that it isn't all that likely that the library will get to more
>> than a few megs (if that).  At that size, it really doesn't matter that
>> there is a bit of dross along for the ride.
>>
>> How many here would rather pick and choose pieces out of rapid miner or
>> weka?  Or would you rather just download the comprehensive jar and be ready
>> to roll?
>>
>> I also think that the example of text translation vs spam categorization is
>> a bit of a straw man.  It is much more likely that these would be entirely
>> independent applications that would themselves like to download the (single)
>> Mahout jar.
>>
>> On 1/29/08 11:45 PM, "Isabel Drost" <ap...@isabel-drost.de> wrote:
>>
>>> On Wednesday 30 January 2008, Steve Rowe wrote:
>>>> On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
>>>>> I would prefer to have an option not to work with whole library but
>>>>> select only specific algorithms and optionally their particular
>>>>> modifications.
>>>>
>>>> +1
>>>
>>> +1 I would at least like to have one downloadable jar for each algorithm
>>> family (why would I as a user want to download the functionality for
>>> translating texts, if all I want to do is build a better spam classification
>>> plugin for spam assassin?) plus one library for the common code like input-/
>>> output-filters.
>>>
>>> Maybe we should look at other machine learning frameworks that followed
>>> the "all in one jar" path to get a feeling on how large a project can easily
>>> get. Please be careful with these numbers, as both projects are trying to
>>> provide whole machine learning frameworks with GUIs for experimentation,
>>> algorithms for evaluation and the like.
>>>
>>> Weka                         Compiled: 4.4M
>>> Rapid Miner   Sources: 12M   Compiled: 4.5M (21M including all dependencies)
>>>
>>> Isabel

Re: Thinking about Mahout layout, builds, etc.

Posted by Grant Ingersoll <gs...@apache.org>.

A couple of comments on various things that have come up (btw, I love  
the participation, already!)

1.  The structure fits well with Maven or ANT.  Personally, I have  
come full circle from ANT - Maven - ANT.  I have done a lot of ANT  
building and a lot of Maven building, including writing plugins/tasks,  
etc.  ANT is less magic at the cost of a little more upfront work (but  
it is easy to set up common build functionality, etc.).  Magic in your
builds is not good.  Maven updates itself automatically, gets jars  
automatically, etc.  I know this sounds like a good thing, but it  
isn't, IMO.  Especially when it comes to the plugins.  You have no  
idea whether everyone is building on the same base.  Maven does not do  
much to guarantee back-compatibility, either.  On the other hand, the  
Maven repository is really nice.  And I really like that Maven has  
convinced people that using common file structures and conventions is  
a good thing in project management.  But neither of these things  
requires Maven itself.  I tend to want to minimize our 3rd party  
dependencies, anyway, as much as possible.  The simpler we can keep  
this, the better off we will be.

2. One other good thing from an infrastructure point of view for the
sub-project structure is we can, in theory, give permission to a  
committer on a single algorithm, much like the contrib modules in  
Lucene.  This isn't a big deal, but it could be useful, if someone is  
really knowledgeable in one particular area and is only contributing  
in that area.  Generally, however, I would favor making someone a full  
committer.

3. I do like the idea of both separate jars and a single uber-jar.   
This is trivial to do in both ANT and Maven.

-Grant

On Jan 30, 2008, at 3:21 AM, Ted Dunning wrote:

>
> And all of Colt is < 1M.
>
> I would say that it isn't all that likely that the library will get to more
> than a few megs (if that).  At that size, it really doesn't matter that
> there is a bit of dross along for the ride.
>
> How many here would rather pick and choose pieces out of rapid miner or
> weka?  Or would you rather just download the comprehensive jar and be ready
> to roll?
>
> I also think that the example of text translation vs spam categorization is
> a bit of a straw man.  It is much more likely that these would be entirely
> independent applications that would themselves like to download the (single)
> Mahout jar.
>
> On 1/29/08 11:45 PM, "Isabel Drost" <ap...@isabel-drost.de> wrote:
>
>> On Wednesday 30 January 2008, Steve Rowe wrote:
>>> On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
>>>> I would prefer to have an option not to work with whole library but
>>>> select only specific algorithms and optionally their particular
>>>> modifications.
>>>
>>> +1
>>
>> +1 I would at least like to have one downloadable jar for each algorithm
>> family (why would I as a user want to download the functionality for
>> translating texts, if all I want to do is build a better spam classification
>> plugin for spam assassin?) plus one library for the common code like input-/
>> output-filters.
>>
>> Maybe we should look at other machine learning frameworks that followed
>> the "all in one jar" path to get a feeling on how large a project can easily
>> get. Please be careful with these numbers, as both projects are trying to
>> provide whole machine learning frameworks with GUIs for experimentation,
>> algorithms for evaluation and the like.
>>
>> Weka                         Compiled: 4.4M
>> Rapid Miner   Sources: 12M   Compiled: 4.5M (21M including all dependencies)
>>
>> Isabel
>

Re: Thinking about Mahout layout, builds, etc.

Posted by Ted Dunning <td...@veoh.com>.

And all of Colt is < 1M.

I would say that it isn't all that likely that the library will get to more
than a few megs (if that).  At that size, it really doesn't matter that
there is a bit of dross along for the ride.

How many here would rather pick and choose pieces out of rapid miner or
weka?  Or would you rather just download the comprehensive jar and be ready
to roll?

I also think that the example of text translation vs spam categorization is
a bit of a straw man.  It is much more likely that these would be entirely
independent applications that would themselves like to download the (single)
Mahout jar.

On 1/29/08 11:45 PM, "Isabel Drost" <ap...@isabel-drost.de> wrote:

> On Wednesday 30 January 2008, Steve Rowe wrote:
>> On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
>>> I would prefer to have an option not to work with whole library but
>>> select only specific algorithms and optionally their particular
>>> modifications.
>> 
>> +1
> 
> +1 I would at least like to have one downloadable jar for each algorithm
> family (why would I as a user want to download the functionality for
> translating texts, if all I want to do is build a better spam classification
> plugin for spam assassin?) plus one library for the common code like input-/
> output-filters.
> 
> Maybe we should look at other machine learning frameworks that followed
> the "all in one jar" path to get a feeling on how large a project can easily
> get. Please be careful with these numbers, as both projects are trying to
> provide whole machine learning frameworks with GUIs for experimentation,
> algorithms for evaluation and the like.
> 
> Weka                         Compiled: 4.4M
> Rapid Miner   Sources: 12M   Compiled: 4.5M (21M including all dependencies)
> 
> Isabel

Re: Thinking about Mahout layout, builds, etc.

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Wednesday 30 January 2008, Steve Rowe wrote:
> On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
> > I would prefer to have an option not to work with whole library but
> > select only specific algorithms and optionally their particular
> > modifications.
>
> +1

+1 I would at least like to have one downloadable jar for each algorithm 
family (why would I as a user want to download the functionality for 
translating texts, if all I want to do is build a better spam classification 
plugin for spam assassin?) plus one library for the common code like input-/ 
output-filters.

Maybe we should look at other machine learning frameworks that followed 
the "all in one jar" path to get a feeling on how large a project can easily 
get. Please be careful with these numbers, as both projects are trying to 
provide whole machine learning frameworks with GUIs for experimentation, 
algorithms for evaluation and the like.

Weka                         Compiled: 4.4M
Rapid Miner   Sources: 12M   Compiled: 4.5M (21M including all dependencies)

Isabel

-- 
"We learn from history that we learn nothing from history." -- George Bernard 
Shaw
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

RE: Thinking about Mahout layout, builds, etc.

Posted by Steve Rowe <sa...@odyssey.net>.

On 01/29/2008 at 6:44 PM, Lukas Vlcek wrote:
> I would prefer to have an option not to work with whole library but
> select only specific algorithms and optionally their particular
> modifications.

+1

> > Thinking about these alternatives from an Eclipse user's point of view,
> > the original proposal would seem to encourage multiple projects (one
> > per algorithm + a common project) while the second would encourage a
> > single project containing multiple packages. Depending upon the amount
> > of code that would reside in each algorithm, one or the other might be
> > preferable.
> > 
> > Would a given developer typically be working on the entire library
> > (single project favoring) or just on one or two algorithms (multiple
> > project favoring)?
> > 
> > Jeff
> > 
> > -----Original Message-----
> > From: Ted Dunning [mailto:tdunning@veoh.com]
> > Sent: Tuesday, January 29, 2008 2:43 PM
> > To: mahout-dev@lucene.apache.org
> > Subject: Re: Thinking about Mahout layout, builds, etc.
> > 
> > 
> > 
> > I think that having multiple source roots is a pain.  That is what
> > packages
> > are for.
> > 
> > I would recommend instead:
> > 
> > - at the top level, there should be trunk, tags, releases as is typical
> > in an SVN based project.
> > 
> > - below trunk and any tag or release there should be:
> > 
> >   docs
> >   lib
> >   src/org/apache/mahout
> > 
> > Below the source directory, there should be packages common,
> > algorithmA, algorithmB and all tests should be located near the
> > associated source.
> > 
> > If it is really desirable to separate tests from normal source (I have
> > done it both ways and find having the tests nearby beneficial), then
> > there can be a parallel tree next to src called "test".
> > 
> > The target of compilation should be a single jar file.
> > 
> > 
> > On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> > 
> > > I am thinking a structure like the following would be useful for
> > > getting started:
> > > mahout/trunk/
> > >    docs
> > >    common/
> > >          src/
> > >             main/
> > >             test/
> > >          docs/
> > >          lib/
> > >    algorithmA/
> > >         Similar to common, but for this algorithm
> > >    algB
> > >         ...
> > >     ...
> > > 
> > > Where algorithmA, B, etc. are the various libraries we intend to
> > > implement.  We can hold off on creating them until we have some code,
> > > but was thinking it would be good to have the general layout in mind.
> > > 
> > > Of course, this is expandable and changeable.  What do others think?
> > > 
> > > On a related note, one of the things we discussed pre-Apache, was the
> > > general sense that we shouldn't feel the need to create an all
> > > encompassing framework.  The basic gist of this being that any given
> > > library could be completely independent of the others (with maybe the
> > > exception that they share a common library).  My gut says this is the
> > > way to get started, but that it may evolve over time once we have some
> > > running time together and can start to recognize synergies, such that
> > > maybe by the time we get to 1.0 of Mahout there may be more common
> > > code than we originally thought.  The "common" area above can serve as
> > > the area for utilities, classes, common Hadoop extensions, etc. that
> > > are shared between the various algorithms, but I would also say let's
> > > not try to prematurely optimize across the algorithms just yet.
> > > 
> > > Anyone else have any preference on this?
> > > 
> > > -Grant
>

Re: Thinking about Mahout layout, builds, etc.

Posted by Ted Dunning <td...@veoh.com>.


Most algorithms will fall naturally into families and will be quite small.

What you say really has merit on a large project, but even all of hadoop is
barely that large.


On 1/29/08 3:44 PM, "Lukas Vlcek" <lu...@gmail.com> wrote:

> Hi,
> 
> Not only each algorithm can be seen as a separated project but also there
> can be many ways how one algorithm can be implemented as well. I would not
> be surprised by the fact that one algorithm can have many implementations
> each suitable for different type of input data (dense vs sparse) or having
> various accuracy to speed ratio for example.
> 
> I would prefer to have an option not to work with whole library but select
> only specific algorithms and optionally their particular modifications.
> 
> Just my 2 cents.
> 
> Lukas
> 
> On Jan 30, 2008 12:15 AM, Jeff Eastman <je...@collab.net> wrote:
> 
>> Thinking about these alternatives from an Eclipse user's point of view,
>> the original proposal would seem to encourage multiple projects (one per
>> algorithm + a common project) while the second would encourage a single
>> project containing multiple packages. Depending upon the amount of code
>> that would reside in each algorithm, one or the other might be
>> preferable.
>> 
>> Would a given developer typically be working on the entire library
>> (single project favoring) or just on one or two algorithms (multiple
>> project favoring)?
>> 
>> Jeff
>> 
>> -----Original Message-----
>> From: Ted Dunning [mailto:tdunning@veoh.com]
>> Sent: Tuesday, January 29, 2008 2:43 PM
>> To: mahout-dev@lucene.apache.org
>> Subject: Re: Thinking about Mahout layout, builds, etc.
>> 
>> 
>> 
>> I think that having multiple source roots is a pain.  That is what
>> packages
>> are for.
>> 
>> I would recommend instead:
>> 
>> - at the top level, there should be trunk, tags, releases as is typical
>> in
>> an SVN based project.
>> 
>> - below trunk and any tag or release there should be:
>> 
>>   docs
>>   lib
>>   src/org/apache/mahout
>> 
>> Below the source directory, there should be packages common, algorithmA,
>> algorithmB and all tests should be located near the associated source.
>> 
>> If it is really desirable to separate tests from normal source (I have
>> done
>> it both ways and find having the tests nearby beneficial), then there
>> can be
>> a parallel tree next to src called "test".
>> 
>> The target of compilation should be a single jar file.
>> 
>> 
>> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
>> 
>>> I am thinking a structure like the following would be useful for
>>> getting started:
>>> mahout/trunk/
>>>    docs
>>>    common/
>>>          src/
>>>             main/
>>>             test/
>>>          docs/
>>>          lib/
>>>    algorithmA/
>>>         Similar to common, but for this algorithm
>>>    algB
>>>         ...
>>>     ...
>>> 
>>> Where algorithmA, B, etc. are the various libraries we intend to
>>> implement.  We can hold off on creating them until we have some code,
>>> but was thinking it would be good to have the general layout in mind.
>>> 
>>> Of course, this is expandable and changeable.  What do others think?
>>> 
>>> On a related note, one of the things we discussed pre-Apache, was the
>>> general sense that we shouldn't feel the need to create an all
>>> encompassing framework.  The basic gist of this being that any given
>>> library could be completely independent of the others (with maybe the
>>> exception that they share a common library).  My gut says this is the
>>> way to get started, but that it may evolve over time once we have some
>>> running time together and can start to recognize synergies, such that
>>> maybe by the time we get to 1.0 of Mahout there may be more common
>>> code than we originally thought.  The "common" area above can serve as
>>> the area for utilities, classes, common Hadoop extensions, etc. that
>>> are shared between the various algorithms, but I would also say let's
>>> not try to prematurely optimize across the algorithms just yet.
>>> 
>>> Anyone else have any preference on this?
>>> 
>>> -Grant
>>> 
>> 
>> 
>

Re: Thinking about Mahout layout, builds, etc.

Posted by Lukas Vlcek <lu...@gmail.com>.

Hi,

Not only can each algorithm be seen as a separate project, but there can
also be many ways to implement a single algorithm. I would not be surprised
if one algorithm had many implementations, each suited to a different type
of input data (dense vs. sparse) or offering a different accuracy-to-speed
ratio, for example.

I would prefer to have the option not to work with the whole library but to
select only specific algorithms and, optionally, their particular modifications.

Just my 2 cents.

Lukas

On Jan 30, 2008 12:15 AM, Jeff Eastman <je...@collab.net> wrote:

> Thinking about these alternatives from an Eclipse user's point of view,
> the original proposal would seem to encourage multiple projects (one per
> algorithm + a common project) while the second would encourage a single
> project containing multiple packages. Depending upon the amount of code
> that would reside in each algorithm, one or the other might be
> preferable.
>
> Would a given developer typically be working on the entire library
> (single project favoring) or just on one or two algorithms (multiple
> project favoring)?
>
> Jeff
>
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Tuesday, January 29, 2008 2:43 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Thinking about Mahout layout, builds, etc.
>
>
>
> I think that having multiple source roots is a pain.  That is what
> packages
> are for.
>
> I would recommend instead:
>
> - at the top level, there should be trunk, tags, releases as is typical
> in
> an SVN based project.
>
> - below trunk and any tag or release there should be:
>
>   docs
>   lib
>   src/org/apache/mahout
>
> Below the source directory, there should be packages common, algorithmA,
> algorithmB and all tests should be located near the associated source.
>
> If it is really desirable to separate tests from normal source (I have
> done
> it both ways and find having the tests nearby beneficial), then there
> can be
> a parallel tree next to src called "test".
>
> The target of compilation should be a single jar file.
>
>
> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
>
> > I am thinking a structure like the following would be useful for
> > getting started:
> > mahout/trunk/
> >    docs
> >    common/
> >          src/
> >             main/
> >             test/
> >          docs/
> >          lib/
> >    algorithmA/
> >         Similar to common, but for this algorithm
> >    algB
> >         ...
> >     ...
> >
> > Where algorithmA, B, etc. are the various libraries we intend to
> > implement.  We can hold off on creating them until we have some code,
> > but was thinking it would be good to have the general layout in mind.
> >
> > Of course, this is expandable and changeable.  What do others think?
> >
> > On a related note, one of the things we discussed pre-Apache, was the
> > general sense that we shouldn't feel the need to create an all
> > encompassing framework.  The basic gist of this being that any given
> > library could be completely independent of the others (with maybe the
> > exception that they share a common library).  My gut says this is the
> > way to get started, but that it may evolve over time once we have some
> > running time together and can start to recognize synergies, such that
> > maybe by the time we get to 1.0 of Mahout there may be more common
> > code than we originally thought.  The "common" area above can serve as
> > the area for utilities, classes, common Hadoop extensions, etc. that
> > are shared between the various algorithms, but I would also say let's
> > not try to prematurely optimize across the algorithms just yet.
> >
> > Anyone else have any preference on this?
> >
> > -Grant
> >
>
>


-- 
http://blog.lukas-vlcek.com/

Re: Thinking about Mahout layout, builds, etc.

Posted by Ted Dunning <td...@veoh.com>.


Initially, developers will be hitting bugs or bad design all over the place
so they would favor one project.  Also, with good package design, you get
most of the benefits of multiple projects.

So why not start simple and migrate to complicated later?


On 1/29/08 3:15 PM, "Jeff Eastman" <je...@collab.net> wrote:

> Thinking about these alternatives from an Eclipse user's point of view,
> the original proposal would seem to encourage multiple projects (one per
> algorithm + a common project) while the second would encourage a single
> project containing multiple packages. Depending upon the amount of code
> that would reside in each algorithm, one or the other might be
> preferable.
> 
> Would a given developer typically be working on the entire library
> (single project favoring) or just on one or two algorithms (multiple
> project favoring)?
> 
> Jeff
> 
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Tuesday, January 29, 2008 2:43 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Thinking about Mahout layout, builds, etc.
> 
> 
> 
> I think that having multiple source roots is a pain.  That is what packages
> are for.
> 
> I would recommend instead:
> 
> - at the top level, there should be trunk, tags, releases as is typical in
> an SVN based project.
> 
> - below trunk and any tag or release there should be:
> 
>    docs
>    lib
>    src/org/apache/mahout
> 
> Below the source directory, there should be packages common, algorithmA,
> algorithmB and all tests should be located near the associated source.
> 
> If it is really desirable to separate tests from normal source (I have done
> it both ways and find having the tests nearby beneficial), then there can be
> a parallel tree next to src called "test".
> 
> The target of compilation should be a single jar file.
> 
> 
> On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:
> 
>> I am thinking a structure like the following would be useful for
>> getting started:
>> mahout/trunk/
>>    docs
>>    common/
>>          src/
>>             main/
>>             test/
>>          docs/
>>          lib/
>>    algorithmA/
>>         Similar to common, but for this algorithm
>>    algB
>>         ...
>>     ...
>> 
>> Where algorithmA, B, etc. are the various libraries we intend to
>> implement.  We can hold off on creating them until we have some code,
>> but was thinking it would be good to have the general layout in mind.
>> 
>> Of course, this is expandable and changeable.  What do others think?
>> 
>> On a related note, one of the things we discussed pre-Apache, was the
>> general sense that we shouldn't feel the need to create an all
>> encompassing framework.  The basic gist of this being that any given
>> library could be completely independent of the others (with maybe the
>> exception that they share a common library).  My gut says this is the
>> way to get started, but that it may evolve over time once we have some
>> running time together and can start to recognize synergies, such that
>> maybe by the time we get to 1.0 of Mahout there may be more common
>> code than we originally thought.  The "common" area above can serve as
>> the area for utilities, classes, common Hadoop extensions, etc. that
>> are shared between the various algorithms, but I would also say let's
>> not try to prematurely optimize across the algorithms just yet.
>> 
>> Anyone else have any preference on this?
>> 
>> -Grant
>> 
>

RE: Thinking about Mahout layout, builds, etc.

Posted by Jeff Eastman <je...@collab.net>.

Thinking about these alternatives from an Eclipse user's point of view,
the original proposal would seem to encourage multiple projects (one per
algorithm + a common project) while the second would encourage a single
project containing multiple packages. Depending upon the amount of code
that would reside in each algorithm, one or the other might be
preferable.

Would a given developer typically be working on the entire library
(single project favoring) or just on one or two algorithms (multiple
project favoring)?

Jeff

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Tuesday, January 29, 2008 2:43 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Thinking about Mahout layout, builds, etc.

I think that having multiple source roots is a pain.  That is what packages
are for.

I would recommend instead:

- at the top level, there should be trunk, tags, releases as is typical in
an SVN based project.

- below trunk and any tag or release there should be:

   docs
   lib
   src/org/apache/mahout

Below the source directory, there should be packages common, algorithmA,
algorithmB and all tests should be located near the associated source.

If it is really desirable to separate tests from normal source (I have done
it both ways and find having the tests nearby beneficial), then there can be
a parallel tree next to src called "test".

The target of compilation should be a single jar file.

On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:

> I am thinking a structure like the following would be useful for
> getting started:
> mahout/trunk/
>    docs
>    common/
>          src/
>             main/
>             test/
>          docs/
>          lib/
>    algorithmA/
>         Similar to common, but for this algorithm
>    algB
>         ...
>     ...
> 
> Where algorithmA, B, etc. are the various libraries we intend to
> implement.  We can hold off on creating them until we have some code,
> but was thinking it would be good to have the general layout in mind.
> 
> Of course, this is expandable and changeable.  What do others think?
> 
> On a related note, one of the things we discussed pre-Apache, was the
> general sense that we shouldn't feel the need to create an all
> encompassing framework.  The basic gist of this being that any given
> library could be completely independent of the others (with maybe the
> exception that they share a common library).  My gut says this is the
> way to get started, but that it may evolve over time once we have some
> running time together and can start to recognize synergies, such that
> maybe by the time we get to 1.0 of Mahout there may be more common
> code than we originally thought.  The "common" area above can serve as
> the area for utilities, classes, common Hadoop extensions, etc. that
> are shared between the various algorithms, but I would also say let's
> not try to prematurely optimize across the algorithms just yet.
> 
> Anyone else have any preference on this?
> 
> -Grant
>

Re: Thinking about Mahout layout, builds, etc.

Posted by Ted Dunning <td...@veoh.com>.


I think that having multiple source roots is a pain.  That is what packages
are for.

I would recommend instead:

- at the top level, there should be trunk, tags, releases as is typical in
an SVN based project.

- below trunk and any tag or release there should be:

   docs
   lib
   src/org/apache/mahout

Below the source directory, there should be packages common, algorithmA,
algorithmB and all tests should be located near the associated source.

If it is really desirable to separate tests from normal source (I have done
it both ways and find having the tests nearby beneficial), then there can be
a parallel tree next to src called "test".

The target of compilation should be a single jar file.


On 1/29/08 2:26 PM, "Grant Ingersoll" <gs...@apache.org> wrote:

> I am thinking a structure like the following would be useful for
> getting started:
> mahout/trunk/
>    docs
>    common/
>          src/
>             main/
>             test/
>          docs/
>          lib/
>    algorithmA/
>         Similar to common, but for this algorithm
>    algB
>         ...
>     ...
> 
> Where algorithmA, B, etc. are the various libraries we intend to
> implement.  We can hold off on creating them until we have some code,
> but was thinking it would be good to have the general layout in mind.
> 
> Of course, this is expandable and changeable.  What do others think?
> 
> On a related note, one of the things we discussed pre-Apache, was the
> general sense that we shouldn't feel the need to create an all
> encompassing framework.  The basic gist of this being that any given
> library could be completely independent of the others (with maybe the
> exception that they share a common library).  My gut says this is the
> way to get started, but that it may evolve over time once we have some
> running time together and can start to recognize synergies, such that
> maybe by the time we get to 1.0 of Mahout there may be more common
> code than we originally thought.  The "common" area above can serve as
> the area for utilities, classes, common Hadoop extensions, etc. that
> are shared between the various algorithms, but I would also say let's
> not try to prematurely optimize across the algorithms just yet.
> 
> Anyone else have any preference on this?
> 
> -Grant
>