Posted to dev@mahout.apache.org by Sebastian Schelter <ss...@apache.org> on 2014/04/13 18:45:21 UTC

Tackling the "legacy dilemma"

Hi,

I took some days to let the latest discussion about the state and future 
of Mahout go through my head. I think the most important thing to 
address right now is the MapReduce "legacy" codebase. A lot of the MR 
algorithms are currently unmaintained, the documentation is outdated, and 
the original authors have abandoned Mahout. For some algorithms (e.g., 
Random Forest) it is hard to even get questions answered on the mailing 
list. I agree with Sean's comments that letting the code linger around is 
not an option and will continue to harm Mahout.

In the previous discussion, I suggested making a radical move and aiming 
to delete this codebase, but there were serious objections from 
committers and users that convinced me that there is still usage of and 
interest in that codebase.

That puts us into a "legacy dilemma". We cannot delete the code without 
harming our userbase. On the other hand, I don't see anyone willing to 
rework the codebase. Further, the code cannot keep lingering around as it 
does now, especially when we fail to answer questions or don't provide 
documentation.

*We have to make a move*!

I suggest the following actions with regard to the MR codebase. I hope 
they find consensus. If there are objections, please give alternatives; 
*keeping everything as-is is not an option*:

  * reject any future MR algorithm contributions, and prominently state 
this on the website and in talks
  * make all existing algorithm code compatible with Hadoop 2; if no one 
is willing to make an existing algorithm compatible, remove the algorithm
  * deprecate the existing MR algorithms, yet still take bug fix 
contributions (a minimal sketch of what that could look like follows 
this list)
  * remove Random Forest, as we cannot even answer questions about the 
implementation on the mailing list
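
As an illustration of the deprecation point, here is a minimal sketch of 
what a deprecated MR driver could look like. The class below is 
hypothetical, not actual Mahout code; only the standard Hadoop 
Tool/ToolRunner API is assumed. The code stays fully functional for bug 
fixes while warning users at compile time:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /**
     * @deprecated The MapReduce implementations are legacy code; only
     * bug fixes are accepted, no new MR algorithm contributions.
     */
    @Deprecated
    public final class LegacyMRDriver extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        // The existing job logic would remain here, untouched; the
        // annotation and javadoc only signal deprecation to users.
        System.err.println("Note: this MR implementation is deprecated.");
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new LegacyMRDriver(), args));
      }
    }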

There are two more actions that I would like to see, but I'd be willing 
to give them up if there are objections:

  * move the MR algorithms into a separate maven module
  * remove Frequent Pattern Mining again (we already aimed for that in 
0.9, but one user objected and then never got back to us)

Let me know what you think.

--sebastian

Re: Tackling the "legacy dilemma"

Posted by Ted Dunning <te...@gmail.com>.
Deneb (Giorgio),

The code involved is really quite heinous and we haven't been able to find
volunteers to maintain this code in the past.

It might be possible to maintain a few selected algorithms, but we really
have to move forward.




On Sun, Apr 13, 2014 at 10:09 AM, Giorgio Zoppi <gi...@gmail.com> wrote:

> The best thing would be to make a plan and see how much effort you need
> for this. Then find volunteers to accomplish the task. I am quite sure
> that there are a lot of people out there who are willing to help out.
>
> BR,
> deneb.

Re: Tackling the "legacy dilemma"

Posted by Frank Scholten <fr...@frankscholten.nl>.
Thanks for this, Sebastian. I think the new direction is exciting, but 
indeed we should first focus on what we all agree on.

+1 on renaming core to mr-legacy, moving stuff out, and deprecating 
some of the algorithms.

I would like to help with this restructuring.

Cheers,

Frank

On Tue, Apr 15, 2014 at 6:57 AM, Sebastian Schelter <ss...@apache.org> wrote:

> Hi,
>
> From reading the thread, I have the impression that we agree on the
> following actions:
>
>
>  * reject any future MR algorithm contributions, prominently state this
> on the website and in talks
>  * make all existing algorithm code compatible with Hadoop 2, if there is
> no one willing to make an existing algorithm compatible, remove the
> algorithm
>  * deprecate Canopy clustering
>  * email the original FPM and random forest authors to ask for maintenance
> of the algorithms
>  * rename core to "mr-legacy" (and gradually pull items we really need
> out of that later)
>
> I will create jira tickets for those action points. I think the biggest
> challenge here is the Hadoop 2 compatibility; is someone volunteering to
> drive that? Would be awesome.
>
> Best,
> Sebastian

Re: Tackling the "legacy dilemma"

Posted by Manoj Awasthi <aw...@gmail.com>.
Ok - that makes sense. Thanks.


On Wed, Apr 16, 2014 at 8:29 AM, Suneel Marthi <sm...@apache.org> wrote:

> The plan is to replace the existing Random Forests impl with a Spark-based
> Streaming Random Forests implementation.
> As ssc already mentioned, the plan is not to entertain any new MR impls
> but to accept bug fixes for existing ones.
>
> The consensus is to do away with the existing MapReduce RF once the
> Spark-based Streaming Random Forests is in place.

Re: Tackling the "legacy dilemma"

Posted by Suneel Marthi <sm...@apache.org>.
The plan is to replace the existing Random Forests impl with a Spark-based
Streaming Random Forests implementation.
As ssc already mentioned, the plan is not to entertain any new MR impls
but to accept bug fixes for existing ones.

The consensus is to do away with the existing MapReduce RF once the
Spark-based Streaming Random Forests is in place.


On Tue, Apr 15, 2014 at 10:51 PM, Manoj Awasthi <aw...@gmail.com> wrote:

>
> >  * remove Random Forest, as we cannot even answer questions about the
> > implementation on the mailing list
> >
> -1 to removing the present Random Forests. I think it is being used - we
> (at Adobe) are playing around with it a bit. If the reason for removal is
> that there is no active maintainer, that can be resolved by the people
> using it getting more active on this - a community action. FWIW, I vote
> against throwing away this code.

Re: Tackling the "legacy dilemma"

Posted by Ted Dunning <te...@gmail.com>.
Manoj,

Sounds like a fair trade there.

Hopefully, you would consider upgrading if we get Andy's code ported to the
DSL or if we incorporate the h2o random forest implementation.




On Tue, Apr 15, 2014 at 7:51 PM, Manoj Awasthi <aw...@gmail.com> wrote:

> >  * remove Random Forest, as we cannot even answer questions about the
> > implementation on the mailing list
> >
> -1 to removing the present Random Forests. I think it is being used - we
> (at Adobe) are playing around with it a bit. If the reason for removal is
> that there is no active maintainer, that can be resolved by the people
> using it getting more active on this - a community action. FWIW, I vote
> against throwing away this code.

Re: Tackling the "legacy dilemma"

Posted by Manoj Awasthi <aw...@gmail.com>.
>  * remove Random Forest, as we cannot even answer questions about the
> implementation on the mailing list
>
-1 to removing the present Random Forests. I think it is being used - we
(at Adobe) are playing around with it a bit. If the reason for removal is
that there is no active maintainer, that can be resolved by the people
using it getting more active on this - a community action. FWIW, I vote
against throwing away this code.

Re: Tackling the "legacy dilemma"

Posted by Sebastian Schelter <ss...@apache.org>.
On 04/15/2014 11:07 AM, Suneel Marthi wrote:
> With things settling down at work for me, I have time now to dedicate back
> to Mahout. I can drive this effort.

That is great news!



Re: Tackling the "legacy dilemma"

Posted by Suneel Marthi <sm...@apache.org>.
On Tue, Apr 15, 2014 at 12:57 AM, Sebastian Schelter <ss...@apache.org> wrote:

> I will create jira tickets for those action points. I think the biggest
> challenge here is the Hadoop 2 compatibility; is someone volunteering to
> drive that? Would be awesome.
>

With things settling down at work for me, I have time now to dedicate back
to Mahout. I can drive this effort.


Re: Tackling the "legacy dilemma"

Posted by Sebastian Schelter <ss...@apache.org>.
Hi,

From reading the thread, I have the impression that we agree on the 
following actions:

  * reject any future MR algorithm contributions, and prominently state this
on the website and in talks
  * make all existing algorithm code compatible with Hadoop 2; if there 
is no one willing to make an existing algorithm compatible, remove the 
algorithm
  * deprecate Canopy clustering
  * email the original FPM and Random Forest authors to ask for 
maintenance of the algorithms
  * rename core to "mr-legacy" (and gradually pull items we really need 
out of that later)

I will create jira tickets for those action points. I think the biggest 
challenge here is the Hadoop 2 compatibility; is someone volunteering to 
drive that? Would be awesome.
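
For readers wondering what the Hadoop 2 work typically means at the 
source level, here is a small sketch of one recurring change: moving from 
the Job constructors (deprecated in Hadoop 2) to the static factory 
method. The class and job name below are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public final class Hadoop2JobSetup {

      private Hadoop2JobSetup() {}

      public static Job newJob(Configuration conf, Path in, Path out)
          throws Exception {
        // Hadoop 2 deprecates "new Job(conf)"; Job.getInstance(...) is
        // the supported way to create a job.
        Job job = Job.getInstance(conf, "example-job");
        job.setJarByClass(Hadoop2JobSetup.class);
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job;
      }
    }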

Best,
Sebastian


On 04/13/2014 07:19 PM, Andrew Musselman wrote:
> This is a good summary of how I feel too.


Re: Tackling the "legacy dilemma"

Posted by Andrew Musselman <an...@gmail.com>.
This is a good summary of how I feel too.

> On Apr 13, 2014, at 10:15 AM, Sebastian Schelter <ss...@apache.org> wrote:
> 
> Unfortunately, it's not that easy to get enough voluntary work. I issued the third call for working on the documentation today, as there are still lots of open issues. That's why I'm trying to suggest a move that involves as little work as possible.
> 
> We should get the MR codebase into a state that we all can live with and then focus on new stuff like the scala DSL.
> 
> --sebastian

Re: Tackling the "legacy dilemma"

Posted by Sebastian Schelter <ss...@apache.org>.
Unfortunately, it's not that easy to get enough voluntary work. I issued 
the third call for working on the documentation today, as there are still 
lots of open issues. That's why I'm trying to suggest a move that 
involves as little work as possible.

We should get the MR codebase into a state that we all can live with and 
then focus on new stuff like the scala DSL.

--sebastian




On 04/13/2014 07:09 PM, Giorgio Zoppi wrote:
> The best thing would be to make a plan and see how much effort you need
> for this. Then find volunteers to accomplish the task. I am quite sure
> that there are a lot of people out there who are willing to help out.
>
> BR,
> deneb.


Re: Tackling the "legacy dilemma"

Posted by Giorgio Zoppi <gi...@gmail.com>.
The best thing would be to make a plan and see how much effort you need
for this. Then find volunteers to accomplish the task. I am quite sure
that there are a lot of people out there who are willing to help out.

BR,
deneb.


-- 
I want to be the ray of sun that wakes you up every day
to make you breathe and live in me.
"Favola" - Modà.

Re: Tackling the "legacy dilemma"

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Apr 13, 2014 10:21 AM, "Ted Dunning" <te...@gmail.com> wrote:
>
> On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>wrote:
>
> > >  * move the MR algorithms into a separate maven module
> > You mean, move them out of mahout-core? So the core is for single
> > machine stuff only? Plus utils? We probably need to refactor core so
> > there's no core at all, it seems. Our core, realistically, is utils,
> > mahout-math & math-scala (aka scalabindings), and the engine-agnostic
> > logical layer of mahout-spark. But for obvious reasons we probably
> > don't want to put all that in a single module. Maybe at some point
> > later, when these things become more mainstream.
>
>
> This might be viewed as renaming core to be "mr-legacy" and then pulling
> those items we really need out of that.  Math is already separate as are
> scala bindings and similar.

Yes, that's what I meant. It looks like it means a full dissolution of
mahout-core rather than just moving out the MR stuff specifically. I am OK
with that, I guess.

Re: Tackling the "legacy dilemma"

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> >  * move the MR algorithms into a separate maven module
> You mean, move them out of mahout-core? So the core is for single machine
> stuff only? Plus utils? We probably need to refactor core so there's no
> core at all, it seems. Our core, realistically, is utils, mahout-math &
> math-scala (aka scalabindings), and the engine-agnostic logical layer of
> mahout-spark. But for obvious reasons we probably don't want to put all
> that in a single module. Maybe at some point later, when these things
> become more mainstream.


This might be viewed as renaming core to be "mr-legacy" and then pulling
those items we really need out of that.  Math is already separate as are
scala bindings and similar.
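
A hedged sketch of the module layout this could imply (the names below are
illustrative only, not a decided structure):

  mahout/
    math/          <- mahout-math, already a separate module
    math-scala/    <- the scala DSL (scalabindings)
    spark/         <- spark bindings, the engine-specific layer
    mr-legacy/     <- former mahout-core: the deprecated MR algorithms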

Re: Tackling the "legacy dilemma"

Posted by Andrew Musselman <an...@gmail.com>.
I am okay with that, just suggesting a method for the future.


On Sun, Apr 13, 2014 at 10:40 AM, Sebastian Schelter <
ssc.open@googlemail.com> wrote:

> I'd vote against a contrib area at the moment, because it would stand in
> the way of unifying, shrinking and stabilizing the codebase.
>
> --sebastian
> On 13.04.2014 19:36, "Andrew Musselman" <andrew.musselman@gmail.com>
> wrote:
>
> >
> > > On Apr 13, 2014, at 10:30 AM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> > >
> > >> On Apr 13, 2014 10:22 AM, "Ted Dunning" <te...@gmail.com>
> wrote:
> > >>
> > >> On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> > >> wrote:
> > >>
> > >>> +1, but more importantly, reject any new author who doesn't agree to
> > >>> explicitly pledge multi-year support.
> > >>
> > >> I am a little bit negative about this requirement.  My feeling is
> > >> that it will wind up accepting naive optimists (the ones we don't
> > >> want) and rejecting realists, because they know that a true
> > >> multi-year commitment is subject to buffeting by real life.
> > > It's true. I guess I mean more along the lines of criteria, not how
> > > we make the inference. I meant, if we really had a way to make a
> > > reliable inference here. It may well be the case that there's no such
> > > way. Usually the first good sign is that contributors stick with
> > > their issue in the first place for some time.
> >
> > This is where a contrib or piggybank-style sandbox could help, so people
> > could submit things "on probation" until they're proven out.
>

Re: Tackling the "legacy dilemma"

Posted by Sebastian Schelter <ss...@googlemail.com>.
I'd vote against a contrib area at the moment, because it would stand in
the way of unifying, shrinking and stabilizing the codebase.

--sebastian
On 13.04.2014 19:36, "Andrew Musselman" <an...@gmail.com> wrote:

>
> > On Apr 13, 2014, at 10:30 AM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> >
> >> On Apr 13, 2014 10:22 AM, "Ted Dunning" <te...@gmail.com> wrote:
> >>
> >> On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >> wrote:
> >>
> >>> +1, but more importantly, reject any new author who doesn't agree to
> >>> explicitly pledge multi-year support.
> >>
> >> I am a little bit negative about this requirement.  My feeling is that
> >> it will wind up accepting naive optimists (the ones we don't want) and
> >> rejecting realists, because they know that a true multi-year commitment
> >> is subject to buffeting by real life.
> > It's true. I guess I mean more along the lines of criteria, not how we
> > make the inference. I meant, if we really had a way to make a reliable
> > inference here. It may well be the case that there's no such way.
> > Usually the first good sign is that contributors stick with their issue
> > in the first place for some time.
>
> This is where a contrib or piggybank-style sandbox could help, so people
> could submit things "on probation" until they're proven out.

Re: Tackling the "legacy dilemma"

Posted by Andrew Musselman <an...@gmail.com>.
> On Apr 13, 2014, at 10:30 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
>> On Apr 13, 2014 10:22 AM, "Ted Dunning" <te...@gmail.com> wrote:
>> 
>> On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> wrote:
>> 
>>> +1, but more importantly, reject any new author who doesn't agree to
>>> explicitly pledge multi-year support.
>> 
>> I am a little bit negative about this requirement.  My feeling is that it
>> will wind up accepting naive optimists (the ones we don't want) and
>> rejecting realists, because they know that a true multi-year commitment
>> is subject to buffeting by real life.
> It's true. I guess I mean more along the lines of criteria, not how we
> make the inference. I meant, if we really had a way to make a reliable
> inference here. It may well be the case that there's no such way. Usually
> the first good sign is that contributors stick with their issue in the
> first place for some time.

This is where a contrib or piggybank-style sandbox could help, so people could submit things "on probation" until they're proven out.

Re: Tackling the "legacy dilemma"

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Apr 13, 2014 10:22 AM, "Ted Dunning" <te...@gmail.com> wrote:
>
> On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>wrote:
>
> > +1, but more importantly, reject any new author who doesn't agree to
> > explicitly pledge multi-year support.
> >
>
> I am a little bit negative about this requirement.  My feeling is that it
> will wind up accepting naive optimists (the ones we don't want) and
> rejecting realists, because they know that a true multi-year commitment
> is subject to buffeting by real life.
It's true. I guess I mean more along the lines of criteria, not how we make
the inference. I meant, if we really had a way to make a reliable inference
here. It may well be the case that there's no such way. Usually the first
good sign is that contributors stick with their issue in the first place
for some time.

Re: Tackling the "legacy dilemma"

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Apr 13, 2014 at 10:16 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> +1, but more importantly, reject any new author who doesn't agree to
> explicitly pledge multi-year support.
>

I am a little bit negative about this requirement.  My feeling is that it
will wind up accepting naive optimists (the ones we don't want) and
rejecting realists, because they know that a true multi-year commitment is
subject to buffeting by real life.

Re: Tackling the "legacy dilemma"

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Apr 13, 2014 9:45 AM, "Sebastian Schelter" <ss...@apache.org> wrote:
>
> Hi,
>
> I took some days to let the latest discussion about the state and future
of Mahout go through my head. I think the most important thing to address
right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
are currently unmaintained, documentation is outdated and the original
authors have abandoned Mahout. For some algorithms it is hard to get even
questions answered on the mailinglist (e.g. RandomForest). I agree with
Sean's comments that letting the code linger around is no option and will
continue to harm Mahout.
>
> In the previous discussion, I suggested to make a radical move and aim to
delete this codebase, but there were serious objections from committers and
users that convinced me that there is still usage of and interested in that
codebase.
>
> That puts us into a "legacy dilemma". We cannot delete the code without
harming our userbase. On the other hand, I don't see anyone willing to
rework the codebase. Further, the code cannot linger around anymore as it
is doing now, especially when we fail to answer questions or don't provide
documentation.
>
> *We have to make a move*!
>
> I suggest the following actions with regard to the MR codebase. I hope
that they find consent. If there are objections, please give alternatives,
*keeping everything as-is is not an option*:
>
>  * reject any future MR algorithm contributions, prominently state this
on the website and in talks
+1, but more importantly, reject any new author who doesn't agree to
explicitly pledge multi-year support.
>  * make all existing algorithm code compatible with Hadoop 2, if there is
no one willing to make an existing algorithm compatible, remove the
algorithm
OK, although my gut feeling is that this would take some time.

>  * deprecate the existing MR algorithms, yet still take bug fix
contributions
I foresee a somewhat smoother MR transition. Deprecation means we lose them
in a release, that is, by the fall release. It would seem to me it would
take longer for us to provide a full replacement and convince ourselves of
its production worthiness.
Also, deprecation implies we can point a user to something else with "use
this instead". So I wouldn't deprecate methods just now for which we cannot
add this phrase. As somebody mentioned, a long tail for deprecation is a
good policy here, IMO.
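
To make the "use this instead" pointer concrete, a minimal sketch in Scala
(the class name, message and version strings are hypothetical, not actual
Mahout API):

  // Scala's @deprecated takes a message naming the replacement and the
  // release in which the deprecation starts.
  @deprecated("Superseded by the math-scala DSL equivalent; use that instead",
    "0.10.0")
  class LegacyMRDriver {
    // legacy MapReduce driver code stays here until removal
  }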

>  * remove Random Forest as we cannot even answer questions to the
implementation on the mailinglist

Do we know direct emails for the FPM and random forest authors? I'd
suggest pinging them one last time; they just may not be tuned in to the
list. Both algorithms are in a bread-and-butter category, and it would be
a huge hit in coverage to just lose them without any resuscitation attempt
whatsoever.

>
> There are two more actions that I would like to see, but'd be willing to
give up if there are objections:
>
>  * move the MR algorithms into a separate maven module
You mean, move them out of mahout-core? So the core is for single machine
stuff only? Plus utils? We probably need to refactor core so there's no
core at all, it seems. Our core, realistically, is utils, mahout-math &
math-scala (aka scalabindings), and the engine-agnostic logical layer of
mahout-spark. But for obvious reasons we probably don't want to put all
that in a single module. Maybe at some point later, when these things
become more mainstream.

>  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
but had one user who shouted but never returned to us)
>
> Let me know what you think.
>
> --sebastian

Re: Tackling the "legacy dilemma"

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I am ready to order a t-shirt with "Go, Andy! +100" across it if it makes
any pragmatic sense.
On Apr 13, 2014 11:11 PM, "Sebastian Schelter" <ss...@apache.org> wrote:

> On 04/14/2014 08:00 AM, Dmitriy Lyubimov wrote:
>
>> Not all things, unfortunately, map gracefully into algebra. But
>> hopefully some of the whole still can be.
>>
>
> Yes, that's why I was asking Andy if there are enough constructs. If not,
> we might have to add more.
>
>
>> I am even a little bit worried that we may develop almost too much ML
>> (is there such a thing?) before we have a chance to crystallize the data
>> frames and perhaps the dictionary discussions. These are more tools to
>> keep abstracted.
>>
>
> I think it's a very good thing to have early ML implementations on the
> DSL, because it allows us to validate whether we are on the right path. We
> should start by providing the things that are most popular in Mahout,
> like the item-based recommender from MAHOUT-1464. Having a few
> implementations on the DSL also helps with designing new abstractions,
> because for every proposed feature we can look at the existing code and see
> how helpful the new feature would be.
>
>
>> I just don't want Mahout to be yet another MLlib. I shudder every time
>> somebody says "we want to create a Spark version of (an|the) algorithm".
>> I know it will create the wrong talking points for somebody anxious to
>> draw parallels.
>>
>
> Totally agree here. Looks like history repeats itself, from "I want to
> create a Hadoop implementation" to "I want to create a Spark
> implementation" :)
>
>
>>
>> On Sun, Apr 13, 2014 at 10:51 PM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>
>>  Andy, that would be awesome. Have you had a look at our new scala DSL
>>> [1]?
>>> Does it offer enough constructs for you to rewrite your implementation
>>> with
>>> it?
>>>
>>> --sebastian
>>>
>>>
>>> [1] https://mahout.apache.org/users/sparkbindings/home.html
>>>
>>>
>>> On 04/14/2014 07:47 AM, Andy Twigg wrote:
>>>
>>>>>        +1 to removing the present Random Forests. Andy Twigg had
>>>>> provided a Spark-based Streaming Random Forests impl sometime last
>>>>> year. It's time to restart that conversation and integrate it into
>>>>> the codebase, if the contributor is still willing.
>>>>>
>>>>>
>>>> I'm happy to contribute this, but as it stands it's written against
>>>> spark, even forgetting the 'streaming' aspect. Do you have any advice
>>>> on how to proceed?
>>>>
>>>>
>>>>
>>>
>>
>

Re: Tackling the "legacy dilemma"

Posted by Sebastian Schelter <ss...@apache.org>.
On 04/14/2014 08:00 AM, Dmitriy Lyubimov wrote:
> Not all things, unfortunately, map gracefully into algebra. But hopefully
> some of the whole still can be.

Yes, that's why I was asking Andy if there are enough constructs. If 
not, we might have to add more.

>
> I am even a little bit worried that we may develop almost too much ML (is
> there such a thing?) before we have a chance to crystallize the data
> frames and perhaps the dictionary discussions. These are more tools to
> keep abstracted.

I think it's a very good thing to have early ML implementations on the
DSL, because it allows us to validate whether we are on the right path.
We should start by providing the things that are most popular in
Mahout, like the item-based recommender from MAHOUT-1464. Having a few
implementations on the DSL also helps with designing new abstractions,
because for every proposed feature we can look at the existing code and
see how helpful the new feature would be.
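
A hedged sketch of the cooccurrence core such an item-based recommender
could have on the DSL (not the MAHOUT-1464 code itself; the package layout
follows the later math-scala docs and is an assumption here, and the
downsampling and LLR filtering of MAHOUT-1464 are omitted):

  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._

  // drmA is a user-item interaction matrix as a distributed row matrix;
  // item-item cooccurrence is then simply A'A in the R-like DSL.
  def cooccurrences(drmA: DrmLike[Int]): DrmLike[Int] = drmA.t %*% drmA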

>
> I just don't want Mahout to be yet another MLlib. I shudder every time
> somebody says "we want to create a Spark version of (an|the) algorithm".
> I know it will create the wrong talking points for somebody anxious to
> draw parallels.

Totally agree here. Looks like history repeats itself, from "I want to
create a Hadoop implementation" to "I want to create a Spark
implementation" :)

>
>
> On Sun, Apr 13, 2014 at 10:51 PM, Sebastian Schelter <ss...@apache.org> wrote:
>
>> Andy, that would be awesome. Have you had a look at our new scala DSL [1]?
>> Does it offer enough constructs for you to rewrite your implementation with
>> it?
>>
>> --sebastian
>>
>>
>> [1] https://mahout.apache.org/users/sparkbindings/home.html
>>
>>
>> On 04/14/2014 07:47 AM, Andy Twigg wrote:
>>
>>>>       +1 to removing the present Random Forests. Andy Twigg had
>>>> provided a Spark-based Streaming Random Forests impl sometime last
>>>> year. It's time to restart that conversation and integrate it into the
>>>> codebase, if the contributor is still willing.
>>>>
>>>
>>> I'm happy to contribute this, but as it stands it's written against
>>> spark, even forgetting the 'streaming' aspect. Do you have any advice
>>> on how to proceed?
>>>
>>>
>>
>


Re: Tackling the "legacy dilemma"

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Not all things, unfortunately, map gracefully into algebra. But hopefully
some of the whole still can be.

I am even a little bit worried that we may develop almost too much ML (is
there such a thing?) before we have a chance to crystallize the data frames
and perhaps the dictionary discussions. These are more tools to keep
abstracted.

I just don't want Mahout to be yet another MLlib. I shudder every time
somebody says "we want to create a Spark version of (an|the) algorithm".
I know it will create the wrong talking points for somebody anxious to
draw parallels.


On Sun, Apr 13, 2014 at 10:51 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Andy, that would be awesome. Have you had a look at our new scala DSL [1]?
> Does it offer enough constructs for you to rewrite your implementation with
> it?
>
> --sebastian
>
>
> [1] https://mahout.apache.org/users/sparkbindings/home.html
>
>
> On 04/14/2014 07:47 AM, Andy Twigg wrote:
>
>>>       +1 to removing the present Random Forests. Andy Twigg had provided
>>> a Spark-based Streaming Random Forests impl sometime last year. It's time
>>> to restart that conversation and integrate it into the codebase, if the
>>> contributor is still willing.
>>>
>>
>> I'm happy to contribute this, but as it stands it's written against
>> spark, even forgetting the 'streaming' aspect. Do you have any advice
>> on how to proceed?
>>
>>
>

Re: Tackling the "legacy dilemma"

Posted by Sebastian Schelter <ss...@apache.org>.
Andy, that would be awesome. Have you had a look at our new scala DSL 
[1]? Does it offer enough constructs for you to rewrite your 
implementation with it?

--sebastian


[1] https://mahout.apache.org/users/sparkbindings/home.html

On 04/14/2014 07:47 AM, Andy Twigg wrote:
>>       +1 to removing the present Random Forests. Andy Twigg had provided a
>> Spark-based Streaming Random Forests impl sometime last year. It's time to
>> restart that conversation and integrate it into the codebase, if the
>> contributor is still willing.
>
> I'm happy to contribute this, but as it stands it's written against
> spark, even forgetting the 'streaming' aspect. Do you have any advice
> on how to proceed?
>
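
For reference, a minimal sketch of the DSL from [1] in use (this assumes
the Mahout spark shell, where a distributed context is implicit; the
package names follow the sparkbindings docs and are assumptions here):

  import org.apache.mahout.math.scalabindings._
  import org.apache.mahout.math.scalabindings.RLikeOps._
  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._

  // an in-core matrix, parallelized into a distributed row matrix (DRM)
  val inCoreA = dense((1, 2, 3), (3, 4, 5))
  val drmA = drmParallelize(inCoreA, numPartitions = 2)

  // distributed A'A, materialized back in-core
  val ata = (drmA.t %*% drmA).collect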


Re: Tackling the "legacy dilemma"

Posted by Andy Twigg <an...@gmail.com>.
>      +1 to removing the present Random Forests. Andy Twigg had provided a
> Spark-based Streaming Random Forests impl sometime last year. It's time to
> restart that conversation and integrate it into the codebase, if the
> contributor is still willing.

I'm happy to contribute this, but as it stands it's written against
spark, even forgetting the 'streaming' aspect. Do you have any advice
on how to proceed?

Re: Tackling the "legacy dilemma"

Posted by Suneel Marthi <sm...@apache.org>.
I meant to deprecate first (and eventually remove) Canopy clustering. This
is in line with the conversation I had with Ted and Frank at AMS about
weaning users away from the old-style Canopy->KMeans clustering and toward
Streaming KMeans. There is no point in keeping Canopy once users switch to
Streaming KMeans.


On Sun, Apr 13, 2014 at 1:12 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Do you mean deprecating or removing Canopy clustering? I suggest
> deprecating all MR code anyway.
>
> --sebastian
>
>
>
> On 04/13/2014 07:11 PM, Suneel Marthi wrote:
>
>  If I may add deprecating Canopy clustering to the list once we get
>> Streaming KMeans working right.
>>
>> On Sun, Apr 13, 2014 at 12:45 PM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>
>>  Hi,
>>>
>>> I took some days to let the latest discussion about the state and future
>>> of Mahout go through my head. I think the most important thing to address
>>> right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
>>> are currently unmaintained, documentation is outdated and the original
>>> authors have abandoned Mahout. For some algorithms it is hard to get even
>>> questions answered on the mailinglist (e.g. RandomForest). I agree with
>>> Sean's comments that letting the code linger around is no option and will
>>> continue to harm Mahout.
>>>
>>> In the previous discussion, I suggested to make a radical move and aim to
>>> delete this codebase, but there were serious objections from committers
>>> and
>>> users that convinced me that there is still usage of and interested in
>>> that
>>> codebase.
>>>
>>> That puts us into a "legacy dilemma". We cannot delete the code without
>>> harming our userbase. On the other hand, I don't see anyone willing to
>>> rework the codebase. Further, the code cannot linger around anymore as it
>>> is doing now, especially when we fail to answer questions or don't
>>> provide
>>> documentation.
>>>
>>> *We have to make a move*!
>>>
>>> I suggest the following actions with regard to the MR codebase. I hope
>>> that they find consent. If there are objections, please give
>>> alternatives,
>>> *keeping everything as-is is not an option*:
>>>
>>>   * reject any future MR algorithm contributions, prominently state this
>>> on
>>> the website and in talks
>>>
>>       +1; this includes the new Frequent Pattern Mining impl, which is
>> MR-based and was provided as a patch a few months ago
>>
>>    * make all existing algorithm code compatible with Hadoop 2, if there
>>> is
>>> no one willing to make an existing algorithm compatible, remove the
>>> algorithm
>>>
>>        +1. One of the questions I got asked when 0.9 was released was
>> 'when is Mahout gonna be compatible with YARN and Hadoop 2?' We should
>> target that for the next major/interim release.
>>
>>    * deprecate the existing MR algorithms, yet still take bug fix
>>> contributions
>>>
>>        I guess we'll be removing these in some future release; until
>> then, we keep absorbing bug fixes?
>>
>>
>>    * remove Random Forest as we cannot even answer questions to the
>>> implementation on the mailinglist
>>>
>>        +1 to removing the present Random Forests. Andy Twigg had provided
>> a Spark-based Streaming Random Forests impl sometime last year. It's time
>> to restart that conversation and integrate it into the codebase, if the
>> contributor is still willing.
>>
>>
>>> There are two more actions that I would like to see, but'd be willing to
>>> give up if there are objections:
>>>
>>>   * move the MR algorithms into a separate maven module
>>>
>>>         +1
>>
>>    * remove Frequent Pattern Mining again (we already aimed for that in
>>> 0.9
>>> but had one user who shouted but never returned to us)
>>>
>>        This thing annoys me the most. We had removed this from 0.9, yet
>> restored it only because some user wanted it and promised to support it.
>> We have not heard from that user again.
>>        It's got old MR code that we don't support anymore, and it should
>> be purged ASAP.
>>
>>
>>
>>  Let me know what you think.
>>>
>>> --sebastian
>>>
>>>
>>
>

Re: Tackling the "legacy dilemma"

Posted by Sebastian Schelter <ss...@apache.org>.
Do you mean deprecating or removing Canopy clustering? I suggest
deprecating all MR code anyway.

--sebastian


On 04/13/2014 07:11 PM, Suneel Marthi wrote:

> If I may add deprecating Canopy clustering to the list once we get
> Streaming KMeans working right.
>
> On Sun, Apr 13, 2014 at 12:45 PM, Sebastian Schelter <ss...@apache.org> wrote:
>
>> Hi,
>>
>> I took some days to let the latest discussion about the state and future
>> of Mahout go through my head. I think the most important thing to address
>> right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
>> are currently unmaintained, documentation is outdated and the original
>> authors have abandoned Mahout. For some algorithms it is hard to get even
>> questions answered on the mailinglist (e.g. RandomForest). I agree with
>> Sean's comments that letting the code linger around is no option and will
>> continue to harm Mahout.
>>
>> In the previous discussion, I suggested to make a radical move and aim to
>> delete this codebase, but there were serious objections from committers and
>> users that convinced me that there is still usage of and interested in that
>> codebase.
>>
>> That puts us into a "legacy dilemma". We cannot delete the code without
>> harming our userbase. On the other hand, I don't see anyone willing to
>> rework the codebase. Further, the code cannot linger around anymore as it
>> is doing now, especially when we fail to answer questions or don't provide
>> documentation.
>>
>> *We have to make a move*!
>>
>> I suggest the following actions with regard to the MR codebase. I hope
>> that they find consent. If there are objections, please give alternatives,
>> *keeping everything as-is is not an option*:
>>
>>   * reject any future MR algorithm contributions, prominently state this on
>> the website and in talks
>>
>      +1; this includes the new Frequent Pattern Mining impl, which is
> MR-based and was provided as a patch a few months ago
>
>>   * make all existing algorithm code compatible with Hadoop 2, if there is
>> no one willing to make an existing algorithm compatible, remove the
>> algorithm
>>
>       +1. One of the questions I got asked when 0.9 was released was
> 'when is Mahout gonna be compatible with YARN and Hadoop 2?' We should
> target that for the next major/interim release.
>
>>   * deprecate the existing MR algorithms, yet still take bug fix
>> contributions
>>
>       I guess we'll be removing these in some future release; until then,
> we keep absorbing bug fixes?
>
>
>>   * remove Random Forest as we cannot even answer questions to the
>> implementation on the mailinglist
>>
>       +1 to removing the present Random Forests. Andy Twigg had provided a
> Spark-based Streaming Random Forests impl sometime last year. It's time to
> restart that conversation and integrate it into the codebase, if the
> contributor is still willing.
>
>>
>> There are two more actions that I would like to see, but'd be willing to
>> give up if there are objections:
>>
>>   * move the MR algorithms into a separate maven module
>>
>        +1
>
>>   * remove Frequent Pattern Mining again (we already aimed for that in 0.9
>> but had one user who shouted but never returned to us)
>>
>       This thing annoys me the most. We had removed this from 0.9, yet
> restored it only because some user wanted it and promised to support it.
> We have not heard from that user again.
>        It's got old MR code that we don't support anymore, and it should
> be purged ASAP.
>
>
>
>> Let me know what you think.
>>
>> --sebastian
>>
>


Re: Tackling the "legacy dilemma"

Posted by Suneel Marthi <sm...@apache.org>.
If I may add deprecating Canopy clustering to the list once we get
Streaming KMeans working right.

On Sun, Apr 13, 2014 at 12:45 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Hi,
>
> I took some days to let the latest discussion about the state and future
> of Mahout go through my head. I think the most important thing to address
> right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
> are currently unmaintained, documentation is outdated and the original
> authors have abandoned Mahout. For some algorithms it is hard to get even
> questions answered on the mailinglist (e.g. RandomForest). I agree with
> Sean's comments that letting the code linger around is no option and will
> continue to harm Mahout.
>
> In the previous discussion, I suggested to make a radical move and aim to
> delete this codebase, but there were serious objections from committers and
> users that convinced me that there is still usage of and interested in that
> codebase.
>
> That puts us into a "legacy dilemma". We cannot delete the code without
> harming our userbase. On the other hand, I don't see anyone willing to
> rework the codebase. Further, the code cannot linger around anymore as it
> is doing now, especially when we fail to answer questions or don't provide
> documentation.
>
> *We have to make a move*!
>
> I suggest the following actions with regard to the MR codebase. I hope
> that they find consent. If there are objections, please give alternatives,
> *keeping everything as-is is not an option*:
>
>  * reject any future MR algorithm contributions, prominently state this on
> the website and in talks
>
    +1; this includes the new Frequent Pattern Mining impl, which is
MR-based and was provided as a patch a few months ago

>  * make all existing algorithm code compatible with Hadoop 2, if there is
> no one willing to make an existing algorithm compatible, remove the
> algorithm
>
     +1. One of the questions I got asked when 0.9 was released was 'when
is Mahout gonna be compatible with YARN and Hadoop 2?' We should target
that for the next major/interim release.

>  * deprecate the existing MR algorithms, yet still take bug fix
> contributions
>
     I guess we'll be removing these in some future release; until then,
we keep absorbing bug fixes?


>  * remove Random Forest as we cannot even answer questions to the
> implementation on the mailinglist
>
     +1 to removing the present Random Forests. Andy Twigg had provided a
Spark-based Streaming Random Forests impl sometime last year. It's time to
restart that conversation and integrate it into the codebase, if the
contributor is still willing.

>
> There are two more actions that I would like to see, but'd be willing to
> give up if there are objections:
>
>  * move the MR algorithms into a separate maven module
>
      +1

>  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
> but had one user who shouted but never returned to us)
>
     This thing annoys me the most. We had removed this from 0.9, yet
restored it only because some user wanted it and promised to support it.
We have not heard from that user again.
      It's got old MR code that we don't support anymore, and it should be
purged ASAP.



> Let me know what you think.
>
> --sebastian
>