Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/08/20 14:59:23 UTC

0.1 Planning

Hi Mahouters,

I'd like to suggest we start gearing up for a 0.1 release. Since this
is our first one, we're going to have a bit of extra work to get
things in the right shape, so any extra time you have would be most
appreciated.

First and foremost would be testing the current trunk (assuming SVN
is up, which it doesn't appear to be right now) and providing feedback
on what's good and bad. This applies especially to people who have
access to clusters (which many of us committers will soon have, thanks
to a kind donation by Amazon).

Second, we should go through JIRA and mark each issue as either in or
out of 0.1, or close it. See
https://issues.apache.org/jira/browse/MAHOUT/fixforversion/12312976
Of these, MAHOUT-9, 56 and 60 are all pretty much done; they just
need a bit more documentation. MAHOUT-54 looks like it could be closed,
right Jeff, as the reporter hasn't responded to questions? So,
if you have something you think should be in 0.1, please go mark it as
such in JIRA.

Next, we need to address
https://issues.apache.org/jira/browse/MAHOUT-69, at a minimum. One of
us should look at other ASF projects (Lucene/Solr), grab their "How To
Make a Release" documentation (on the wiki) and put it up on our wiki.
Volunteers?

After that, I'd suggest we are ready for a release. Typically, we
call a "freeze" date, and then we release a series of release
candidates. For Mahout, since we are so young and this is such an
early release, I don't think we need to obsess too much over this.
Our APIs are likely to change in the future, so we should just keep
things light: release early, release often. I volunteer to be the
release manager.

With the release ready to go, we can go out and make some noise to
help attract more people. We can work w/ the ASF PRC (public
relations committee) on this a bit, I think. Additionally, those of
us who blog should do so. It would also be great if anyone with
Wikipedia savviness could put us on the map there. Currently, the
Wikipedia Mahout page is http://en.wikipedia.org/wiki/Mahout but I
think we could make it a "disambiguation" page, or at least add an
Apache Mahout page. Just food for thought...

Our community is actually pretty big for a new project, or at least
the number of lurkers is pretty big. I think a number of people are in
"wait and see" mode, so we (i.e. committers and active contributors)
need to get over the hump a bit so that others will feel more
comfortable joining in. An official release should help with that, but
do let us know if you have other ideas as well.

Timewise, I'd love it if we could have the release out within the
month, but of course, I know we are all busy. That being said, we've
got a lot of goodness in our repo now, what w/ Taste, Clustering, the
GA stuff and the Naive Bayes stuff (kudos to our two active GSOC
students, Deneche and Robin!).

Cheers,
Grant

Re: 0.1 Planning

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Wednesday 20 August 2008, Karl Wettin wrote:
> We could post a wishlist/planning for 0.2 in the release of 0.1. This
> is probably just a link to a currently non-existent wiki page where we
> list what people are working on that may or may not become something.

I think we could move the list of algorithms on the front page there and
have three sections: 1) documentation for algorithms that are implemented
(together with a link to the closed JIRA ticket for information on
decisions taken during implementation), 2) documentation of algorithms
that are in progress (again, possibly with a link to the JIRA ticket),
and 3) a wishlist.

I think for some users it is easier to read the wiki than collect all relevant 
information from JIRA.

> Also, one way to potentially get lots of users at release is to
> introduce a simple band-aid between a Lucene index and Mahout.

+1 - there were a few people here at FrOSCon explicitly asking for bindings to 
either Lucene or Solr. One specific idea I remember was to either use the 
data in Lucene as input or to use the models learnt with Mahout for e.g. 
annotating or selecting incoming web pages.


Re: 0.1 Planning

Posted by Grant Ingersoll <gs...@apache.org>.

On Aug 20, 2008, at 10:10 AM, Karl Wettin wrote:

> I think it would be nice to get it out ASAP, perhaps even by next  
> weekend? I'll get started on the HowToRelease wiki page right now.

Anything is possible, I suppose.  I'll do what I can, but I am also  
planning a Solr release for next week, so...

>
>
>
> I also have a bunch of post-0.1 thoughts:
>
> We could post a wishlist/planning for 0.2 in the release of 0.1.
> This is probably just a link to a currently non-existent wiki page
> where we list what people are working on that may or may not become
> something. This could turn out to be a catalyst, and if nothing
> else it could be used to help consolidate work taking place outside
> of the fora to avoid duplicate work. Or is it better if we filled
> JIRA with that sort of stuff? It would be nice if we did not end
> up with a thousand old and open issues without patches. Or?

I'm open to anything, but I've always found coordinating open source
projects to be like the proverbial cat-herding problem. I'd love to get
Hadoop's patch checker system in place for Mahout on Hudson; I think
this can help w/ the bad patch problem. Of course, the flip side to
the thousand old issues is the stale wiki. I don't know a good
solution, as they all rely on people to be involved and take up the
work to maintain. Or perhaps we can come up w/ a cool Mahout
application that we train on JIRA to classify issues into good,
maybe, and bad, and we automatically close/mark any issue that is
labeled as bad. :-) Might make for a cool, real-world application
that would benefit a whole ton of projects in the ASF alone. Argh,
where's that cloning machine when you need it? Just not enough hours
in the day.

>
>
>
> Also, one way to potentially get lots of users at release is to
> introduce a simple band-aid between a Lucene index and Mahout. No
> need to make it as complex as MAHOUT-7; something that converts the
> term vector of a document to a SparseVector using term identity as
> column would be enough. Those who don't want the term vectors in
> their index could use some layer that pre-analyzes a Document at
> index time (and replaces the fields with the stream) and passes down
> the vectors in some format that makes sense for Mahout.

I think the Bayes stuff has some of this groundwork: namely, the
examples use Lucene to analyze the articles and put them in the Bayes
format.

>
>
>
> I for one am working on MAHOUT-19, using -61 (mbox/nntp->matrix) for
> examples and trying to come up with a new take on -65 (meta data)
> (as -61 can make use of that). I'm also looking closer at cross-fold
> validation to power various feature selection schemes, but this is a
> bit secondary.

Cool.  Once we get the release out, I plan on building an Amazon AMI  
for it and putting up docs on it, as well as start doing some tests,  
using the new NB/CNB Wikipedia stuff, and maybe also setting up an  
example using DMOZ or something like that as a POC.

I would also love to get in an SVM implementation for 0.2.

Re: 0.1 Planning

Posted by Karl Wettin <ka...@gmail.com>.

I think it would be nice to get it out ASAP, perhaps even by next  
weekend? I'll get started on the HowToRelease wiki page right now.


I also have a bunch of post-0.1 thoughts:

We could post a wishlist/planning for 0.2 in the release of 0.1. This
is probably just a link to a currently non-existent wiki page where we
list what people are working on that may or may not become something.
This could turn out to be a catalyst, and if nothing else it could
be used to help consolidate work taking place outside of the fora to
avoid duplicate work. Or is it better if we filled JIRA with that
sort of stuff? It would be nice if we did not end up with a thousand
old and open issues without patches. Or?


Also, one way to potentially get lots of users at release is to
introduce a simple band-aid between a Lucene index and Mahout. No need
to make it as complex as MAHOUT-7; something that converts the term
vector of a document to a SparseVector using term identity as column
would be enough. Those who don't want the term vectors in their index
could use some layer that pre-analyzes a Document at index time (and
replaces the fields with the stream) and passes down the vectors in
some format that makes sense for Mahout.
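
To make the term-vector idea concrete, here is a minimal Python sketch
of mapping a document's terms to a sparse vector with term identity as
the column. It does not use the Lucene or Mahout APIs: build_dictionary
and term_vector_to_sparse are hypothetical helpers, the sparse vector is
just a dict, and the documents are assumed to be already analyzed into
lists of terms.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign each distinct term a column index, in first-seen order."""
    dictionary = {}
    for terms in documents:
        for term in terms:
            if term not in dictionary:
                dictionary[term] = len(dictionary)
    return dictionary


def term_vector_to_sparse(terms, dictionary):
    """Map a document's terms to {column index: term frequency}.

    Terms missing from the dictionary are skipped. Only non-zero
    entries are stored, which is what keeps the vector sparse.
    """
    counts = Counter(terms)
    return {dictionary[t]: float(c) for t, c in counts.items() if t in dictionary}


# Hypothetical, already-analyzed documents (stand-ins for term vectors).
docs = [
    ["mahout", "lucene", "index"],
    ["lucene", "term", "vector", "lucene"],
]
dictionary = build_dictionary(docs)
vectors = [term_vector_to_sparse(d, dictionary) for d in docs]
```

With these two sample documents, "lucene" is assigned column 1, so the
second vector stores a frequency of 2 at index 1 and holds no entries
for terms the document does not contain.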


I for one am working on MAHOUT-19, using -61 (mbox/nntp->matrix) for
examples and trying to come up with a new take on -65 (meta data) (as
-61 can make use of that). I'm also looking closer at cross-fold
validation to power various feature selection schemes, but this is a
bit secondary.




Re: 0.1 Planning

Posted by Grant Ingersoll <gs...@apache.org>.

Also, we need to set up some Javadoc targets so that we can publish
the release Javadocs on the website, and also start building nightlies
on Hudson. I'm in the process of setting things up to run the tests
nightly.

-Grant
