Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2013/04/09 00:18:59 UTC

Re: 0.8?

How are we doing on the Streaming K-Means and other 0.8 issues?  I'm still willing to be RM for 0.8, but I will need to plan for it a bit.


On Feb 5, 2013, at 2:13 PM, Dmitriy Lyubimov wrote:

> I guess i have nothing to report for this release. Hence 0.
>
> On Tue, Feb 5, 2013 at 9:38 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>
>> +1
>>
>> On 2/4/13 8:45 PM, Grant Ingersoll wrote:
>>
>>> +1.
>>>
>>> On Feb 3, 2013, at 7:48 PM, Ted Dunning wrote:
>>>
>>>> Fine by me.
>>>>
>>>> Others?
>>>>
>>>> On Sun, Feb 3, 2013 at 1:09 PM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
>>>>
>>>>> I can get back to work starting on February 11. That's 3 weeks to get
>>>>> the existing code in shape.
>>>>> It might be a bit of a stretch.
>>>>>
>>>>> How about aiming for something like March 8 for the RC?
>>>>>
>>>>> On Sun, Feb 3, 2013 at 9:45 PM, Ted Dunning <te...@gmail.com> wrote:
>>>>>
>>>>>> I think that getting it into the existing API would be very nice to have,
>>>>>> but not absolutely critical.
>>>>>>
>>>>>> If extending the release by, say, 2-3 weeks would solve the problem I would
>>>>>> recommend extending.  Otherwise, we might want to have yet another API
>>>>>> enter the mix and do an 0.8.1 with the new API.
>>>>>>
>>>>>> On Sun, Feb 3, 2013 at 9:09 AM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
>>>>>>
>>>>>>> I'm working on the new clustering and I'm concerned the end of
>>>>>>> February is too early for a good quality RC.
>>>>>>> I say this because I haven't integrated it into the existing framework yet.
>>>>>>> What do you think Ted?
>>>>>>>
>>>>>>> On Sun, Feb 3, 2013 at 3:26 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1 on new clustering.  I'd like to get  Frank's Lucene storage stuff in
>>>>>>>> to, now that we are on 4.1.  I'm willing to be the release manager.  How
>>>>>>>> about we put in a date of cutting an RC by end of Feb?
>>>>>>>>
>>>>>>>> On Feb 2, 2013, at 5:33 AM, Sebastian Schelter wrote:
>>>>>>>>
>>>>>>>>> I also think that 0.8 should include the new clustering stuff, I recall
>>>>>>>>> we wanted even release numbers to contain new features. I plan a hack
>>>>>>>>> evening in Berlin with Isabel and Zeno (who ported some of his code from
>>>>>>>>> http://mymedialite.net/) in the next 2 weeks. We'll have another pass
>>>>>>>>> over the new recommenders to finalize them for 0.8.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>> On 02.02.2013 10:01, Ted Dunning wrote:
>>>>>>>>>
>>>>>>>>>> Sounds good to me.  Dan should have the new clustering stuff inserted soon.
>>>>>>>>>> That was all I was after.
>>>>>>>>>>
>>>>>>>>>> We should probably noodle a bit about how to update the MiA examples since
>>>>>>>>>> that keeps coming up on the list.  My first thought (from Ellen) is that
>>>>>>>>>> asking Alex Ott to repeat his fabulous tech review work, possibly with some
>>>>>>>>>> monetary incentive might be a good route to getting a bug list.
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 1, 2013 at 7:07 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Seems we have accumulated a good # of fixes these days.  Should we start
>>>>>>>>>>> thinking about cutting a 0.8 soon?
>>>>>>>>>>>
>>>>>>>>>>> -Grant
>>>>>>>>
>>>>>>>> --------------------------------------------
>>>>>>>> Grant Ingersoll
>>>>>>>> http://www.lucidworks.com
>>>
>>> --------------------------------------------
>>> Grant Ingersoll | @gsingers
>>> http://www.lucidworks.com

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com






Re: 0.8?

Posted by Dan Filimon <da...@gmail.com>.
Sorry for taking so long to get back to you. Here's where I am now:

The code is mostly ready and on ReviewBoard [1]; there's feedback from
Sebastian that I want to incorporate and I made some changes, but it'll be
fairly close to what's there.

The thing I'm doing now is looking at the quality of the clustering: how
does it compare to Mahout KMeans? (I know there are lots of clustering
algorithms available, but KMeans is the one we're using as a reference
since it's the most similar to what we're doing.)
So, now I'm:
- running it on more data sets;
- seeing what quirks it has on said data sets and figuring out if it's
because of the data (20 newsgroups still gives odd clusterings);
- seeing how StreamingKMeans in particular evolves at runtime (notably,
when you start with very large sparse vectors, the centers tend to densify
as the algorithm progresses and more points are added to them, which makes
it slower);
- adding more quality metrics.
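
To illustrate the densification effect mentioned in the list above, here is a
minimal sketch in Python. It is not the Mahout StreamingKMeans code: the
dict-based sparse vectors, the running-mean update, and the names add_point,
random_sparse_point and nonzero_counts are all assumptions made for this
example. It folds random sparse points into a single center and records how
many nonzero entries the center has after each update.

```python
import random


def add_point(centroid, weight, point):
    """Fold a sparse point (dict: index -> value) into a running-mean centroid.

    Hypothetical helper for illustration; returns the new centroid and weight.
    """
    new_weight = weight + 1
    # Rescale the existing mean so it accounts for one more point.
    merged = {i: v * weight / new_weight for i, v in centroid.items()}
    # Add the new point's contribution; unseen indices become new nonzeros.
    for i, v in point.items():
        merged[i] = merged.get(i, 0.0) + v / new_weight
    return merged, new_weight


def random_sparse_point(rng, dim, nnz):
    """Return a sparse vector with `nnz` ones at random positions out of `dim`."""
    return {i: 1.0 for i in rng.sample(range(dim), nnz)}


def nonzero_counts(num_points=200, dim=10_000, nnz=20, seed=0):
    """Track the centroid's nonzero count as points are added one by one."""
    rng = random.Random(seed)
    centroid, weight = {}, 0
    counts = []
    for _ in range(num_points):
        point = random_sparse_point(rng, dim, nnz)
        centroid, weight = add_point(centroid, weight, point)
        counts.append(len(centroid))
    return counts


if __name__ == "__main__":
    counts = nonzero_counts()
    print("nonzeros after first point:", counts[0])
    print("nonzeros after last point: ", counts[-1])
```

Each input point has only 20 nonzero entries, but the center keeps every index
it has ever seen, so its nonzero count can only grow. Any distance computation
against that center then touches more entries, which is one plausible reading
of why the algorithm slows down as it runs.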

The schedule really depends on how we want to handle this. I could focus on
fixing the obvious issues now, and we could have an experimental 0.8 release
with this code.
Or I could evaluate it more and polish it to get a more stable 0.8 release.

In the first case, we might need to have a 0.8.1 release that's stabilized
with the "final" version.

What's certain is that I want this done by late May (maybe run some
larger-scale experiments after that, but just to evaluate, not change the
code).
Then, there's the question of documentation. The code itself is well
documented but there's no Wiki page or instructions on using it.
Since this is also my thesis, I'm going to go over the details in the paper
(which I'll provide of course) as well as in a series of blog posts, but I'm
not sure how much documentation a successful release really needs.

What do you think?

[1] https://reviews.apache.org/groups/mahout/


On Tue, Apr 9, 2013 at 2:05 AM, Ted Dunning <te...@gmail.com> wrote:

> Streaming k-means is coming along.  Dan is doing a killer job of evaluation
> on the algorithms to determine how we can make the system work as well as
> possible out of the box.
>
> I will let him speak to schedule, but he graduates before long.

Re: 0.8?

Posted by Ted Dunning <te...@gmail.com>.
Streaming k-means is coming along.  Dan is doing a killer job of evaluation
on the algorithms to determine how we can make the system work as well as
possible out of the box.

I will let him speak to schedule, but he graduates before long.


On Mon, Apr 8, 2013 at 3:18 PM, Grant Ingersoll <gs...@apache.org> wrote:

> How are we doing on the Streaming K-Means and other 0.8 issues?  I'm still
> willing to be RM for 0.8, but I will need to plan for it a bit.