Posted to dev@spark.apache.org by Aliaksei Litouka <al...@gmail.com> on 2014/04/21 17:39:35 UTC

Any plans for new clustering algorithms?

Hi, Spark developers.
Are there any plans for implementing new clustering algorithms in MLLib? As
far as I understand, the current version of Spark ships with only one
clustering algorithm - K-Means. I want to contribute to Spark and I'm
thinking of adding more clustering algorithms - maybe
DBSCAN <http://en.wikipedia.org/wiki/DBSCAN>.
I can start working on it. Does anyone want to join me?
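[Editor's note: for readers unfamiliar with the proposal, DBSCAN clusters points by density: a point with at least min_pts neighbors within radius eps is a "core" point, clusters grow by expanding outward from core points, and points reachable from no core point are noise. The following is a minimal single-machine sketch in plain Python — illustrative only, not MLlib code; the function names and the brute-force neighbor search are this sketch's own, and a Spark version would need a distributed neighbor lookup.]

```python
from collections import deque

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...), or -1 for noise."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already claimed by an earlier cluster
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                # noise (may still become a border point later)
            continue
        labels[i] = cluster               # i is a core point: start a new cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster       # former noise, reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors) # j is also core: keep expanding
        cluster += 1
    return labels
```

Noise is labeled -1; border points keep the cluster of the first core point that reaches them, so results can depend on scan order — one of the details a contributed implementation would need to pin down.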

Re: Any plans for new clustering algorithms?

Posted by Sandy Ryza <sa...@cloudera.com>.
Thanks Matei.  I added a section to the "How to contribute" page.


On Mon, Apr 21, 2014 at 7:25 PM, Matei Zaharia <ma...@gmail.com> wrote:

> The wiki is actually maintained separately in
> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We
> restricted editing of the wiki because bots would automatically add stuff.
> I've given you permissions now.
>
> Matei
>
> On Apr 21, 2014, at 6:22 PM, Nan Zhu <zh...@gmail.com> wrote:
>
> > I thought those are files of spark.apache.org?
> >
> > --
> > Nan Zhu
> >
> >
> > On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:
> >
> >> The markdown files are under spark/docs. You can submit a PR for
> >> changes. -Xiangrui
> >>
> >> On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
> >>> How do I get permissions to edit the wiki?
> >>>
> >>>
> >>> On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
> >>>
> >>>> Cannot agree more with your words. Could you add one section about
> >>>> "how and what to contribute" to MLlib's guide? -Xiangrui
> >>>>
> >>>> On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
> >>>> <nick.pentreath@gmail.com> wrote:
> >>>>> I'd say a section in the "how to contribute" page would be a good place
> >>>>> to put this.
> >>>>>
> >>>>> In general I'd say that the criteria for inclusion of an algorithm are that it
> >>>>> should be high quality; widely known, used and accepted (citations and
> >>>>> concrete use cases as examples of this); scalable and parallelizable; well
> >>>>> documented; and come with a reasonable expectation of dev support
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On 21 Apr 2014, at 19:59, Sandy Ryza <sandy.ryza@cloudera.com> wrote:
> >>>>>>
> >>>>>> If it's not done already, would it make sense to codify this philosophy
> >>>>>> somewhere? I imagine this won't be the first time this discussion comes
> >>>>>> up, and it would be nice to have a doc to point to. I'd be happy to take a
> >>>>>> stab at this.
> >>>>>>
> >>>>>>
> >>>>>>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <mengxr@gmail.com> wrote:
> >>>>>>>
> >>>>>>> +1 on Sean's comment. MLlib covers the basic algorithms, but we
> >>>>>>> definitely need to spend more time on how to make the design scalable.
> >>>>>>> For example, think about the current "ProblemWithAlgorithm" naming scheme.
> >>>>>>> That being said, new algorithms are welcome. I wish them to be
> >>>>>>> well-established and well-understood by users. They shouldn't be
> >>>>>>> research algorithms tuned to work well on a particular dataset but
> >>>>>>> not tested widely. See the change log from Mahout:
> >>>>>>>
> >>>>>>> ===
> >>>>>>> The following algorithms that were marked deprecated in 0.8 have been
> >>>>>>> removed in 0.9:
> >>>>>>>
> >>>>>>> From Clustering:
> >>>>>>> Switched LDA implementation from using Gibbs Sampling to Collapsed
> >>>>>>> Variational Bayes (CVB)
> >>>>>>> Meanshift
> >>>>>>> MinHash - removed due to poor performance, lack of support and lack of
> >>>>>>> usage
> >>>>>>>
> >>>>>>> From Classification (both are sequential implementations)
> >>>>>>> Winnow - lack of actual usage and support
> >>>>>>> Perceptron - lack of actual usage and support
> >>>>>>>
> >>>>>>> Collaborative Filtering
> >>>>>>> SlopeOne implementations in
> >>>>>>> org.apache.mahout.cf.taste.hadoop.slopeone and
> >>>>>>> org.apache.mahout.cf.taste.impl.recommender.slopeone
> >>>>>>> Distributed pseudo recommender in
> >>>>>>> org.apache.mahout.cf.taste.hadoop.pseudo
> >>>>>>> TreeClusteringRecommender in
> >>>>>>> org.apache.mahout.cf.taste.impl.recommender
> >>>>>>>
> >>>>>>> Mahout Math
> >>>>>>> Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
> >>>>>>> ===
> >>>>>>>
> >>>>>>> In MLlib, we should include the algorithms users know how to use and
> >>>>>>> that we can support, rather than letting algorithms come and go.
> >>>>>>>
> >>>>>>> My $0.02,
> >>>>>>> Xiangrui
> >>>>>>>
> >>>>>>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <sowen@cloudera.com> wrote:
> >>>>>>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <prb@mult.ifario.us> wrote:
> >>>>>>>>> - MLlib as Mahout.next would be unfortunate. There are some gems in
> >>>>>>>>> Mahout, but there are also lots of rocks. Setting a minimal bar of
> >>>>>>>>> working, correctly implemented, and documented requires a surprising
> >>>>>>>>> amount of work.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> As someone with first-hand knowledge, this is correct. To Sang's
> >>>>>>>> question, I can't see value in 'porting' Mahout, since it is based on a
> >>>>>>>> quite different paradigm. About the only part that translates is the
> >>>>>>>> algorithm concept itself.
> >>>>>>>>
> >>>>>>>> This is also the cautionary tale. The contents of the project have
> >>>>>>>> ended up being a number of "drive-by" contributions of implementations
> >>>>>>>> that, while individually perhaps brilliant, didn't necessarily match
> >>>>>>>> any other implementation in structure, input/output, or libraries used.
> >>>>>>>> The implementations were often a touch academic. The result was hard
> >>>>>>>> to document, maintain, evolve or use.
> >>>>>>>>
> >>>>>>>> Far more of the structure of the MLlib implementations is consistent
> >>>>>>>> by virtue of being built around Spark core already. That's great.
> >>>>>>>>
> >>>>>>>> One can't wait to completely build the foundation before building any
> >>>>>>>> implementations. To me, the existing implementations are almost
> >>>>>>>> exactly the basics I would choose. They cover the bases and will
> >>>>>>>> exercise the abstractions and structure. So that's also great IMHO.

Re: Any plans for new clustering algorithms?

Posted by Matei Zaharia <ma...@gmail.com>.
The wiki is actually maintained separately in https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted editing of the wiki because bots would automatically add stuff. I’ve given you permissions now.

Matei



Re: Any plans for new clustering algorithms?

Posted by Nan Zhu <zh...@gmail.com>.
I thought those are files of spark.apache.org? 

-- 
Nan Zhu




Re: Any plans for new clustering algorithms?

Posted by Sandy Ryza <sa...@cloudera.com>.
I thought this might be a good thing to add to the wiki's "How to
contribute" page <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>,
as it's not tied to a release.



Re: Any plans for new clustering algorithms?

Posted by Xiangrui Meng <me...@gmail.com>.
The markdown files are under spark/docs. You can submit a PR for
changes. -Xiangrui


Re: Any plans for new clustering algorithms?

Posted by Sandy Ryza <sa...@cloudera.com>.
How do I get permissions to edit the wiki?



Re: Any plans for new clustering algorithms?

Posted by Xiangrui Meng <me...@gmail.com>.
I cannot agree more. Could you add a section on
"how and what to contribute" to MLlib's guide? -Xiangrui

On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
<ni...@gmail.com> wrote:

Re: Any plans for new clustering algorithms?

Posted by Nick Pentreath <ni...@gmail.com>.
I'd say a section in the "how to contribute" page would be a good place to put this.

In general I'd say that the criteria for inclusion of an algorithm are that it should be high quality; widely known, used, and accepted (with citations and concrete use cases as evidence of this); scalable and parallelizable; well documented; and with a reasonable expectation of dev support.


> On 21 Apr 2014, at 19:59, Sandy Ryza <sa...@cloudera.com> wrote:
> 
> If it's not done already, would it make sense to codify this philosophy
> somewhere?  I imagine this won't be the first time this discussion comes
> up, and it would be nice to have a doc to point to.  I'd be happy to take a
> stab at this.
> 
> 
>> On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <me...@gmail.com> wrote:
>> 
>> +1 on Sean's comment. MLlib covers the basic algorithms but we
>> definitely need to spend more time on how to make the design scalable.
>> For example, think about current "ProblemWithAlgorithm" naming scheme.
>> That being said, new algorithms are welcomed. I wish they are
>> well-established and well-understood by users. They shouldn't be
>> research algorithms tuned to work well with a particular dataset but
>> not tested widely. You see the change log from Mahout:
>> 
>> ===
>> The following algorithms that were marked deprecated in 0.8 have been
>> removed in 0.9:
>> 
>> From Clustering:
>>  Switched LDA implementation from using Gibbs Sampling to Collapsed
>> Variational Bayes (CVB)
>> Meanshift
>> MinHash - removed due to poor performance, lack of support and lack of
>> usage
>> 
>> From Classification (both are sequential implementations)
>> Winnow - lack of actual usage and support
>> Perceptron - lack of actual usage and support
>> 
>> Collaborative Filtering
>>    SlopeOne implementations in
>> org.apache.mahout.cf.taste.hadoop.slopeone and
>> org.apache.mahout.cf.taste.impl.recommender.slopeone
>>    Distributed pseudo recommender in
>> org.apache.mahout.cf.taste.hadoop.pseudo
>>    TreeClusteringRecommender in
>> org.apache.mahout.cf.taste.impl.recommender
>> 
>> Mahout Math
>>    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
>> ===
>> 
>> In MLlib, we should include the algorithms users know how to use and
>> we can provide support rather than letting algorithms come and go.
>> 
>> My $0.02,
>> Xiangrui
>> 
>>> On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <pr...@mult.ifario.us> wrote:
>>>> - MLlib as Mahout.next would be a unfortunate.  There are some gems in
>>>> Mahout, but there are also lots of rocks.  Setting a minimal bar of
>>>> working, correctly implemented, and documented requires a surprising
>> amount
>>>> of work.
>>> 
>>> As someone with first-hand knowledge, this is correct. To Sang's
>>> question, I can't see value in 'porting' Mahout since it is based on a
>>> quite different paradigm. About the only part that translates is the
>>> algorithm concept itself.
>>> 
>>> This is also the cautionary tale. The contents of the project have
>>> ended up being a number of "drive-by" contributions of implementations
>>> that, while individually perhaps brilliant (perhaps), didn't
>>> necessarily match any other implementation in structure, input/output,
>>> libraries used. The implementations were often a touch academic. The
>>> result was hard to document, maintain, evolve or use.
>>> 
>>> Far more of the structure of the MLlib implementations are consistent
>>> by virtue of being built around Spark core already. That's great.
>>> 
>>> One can't wait to completely build the foundation before building any
>>> implementations. To me, the existing implementations are almost
>>> exactly the basics I would choose. They cover the bases and will
>>> exercise the abstractions and structure. So that's also great IMHO.
>> 

Re: Any plans for new clustering algorithms?

Posted by Sandy Ryza <sa...@cloudera.com>.
If it's not done already, would it make sense to codify this philosophy
somewhere?  I imagine this won't be the first time this discussion comes
up, and it would be nice to have a doc to point to.  I'd be happy to take a
stab at this.


On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng <me...@gmail.com> wrote:


Re: Any plans for new clustering algorithms?

Posted by Xiangrui Meng <me...@gmail.com>.
+1 on Sean's comment. MLlib covers the basic algorithms but we
definitely need to spend more time on how to make the design scalable.
For example, think about the current "ProblemWithAlgorithm" naming scheme.
That being said, new algorithms are welcome. I would like them to be
well-established and well-understood by users. They shouldn't be
research algorithms tuned to work well on a particular dataset but
not tested widely. Consider the change log from Mahout:

===
The following algorithms that were marked deprecated in 0.8 have been
removed in 0.9:

From Clustering:
  Switched LDA implementation from using Gibbs Sampling to Collapsed
Variational Bayes (CVB)
Meanshift
MinHash - removed due to poor performance, lack of support and lack of usage

From Classification (both are sequential implementations)
Winnow - lack of actual usage and support
Perceptron - lack of actual usage and support

Collaborative Filtering
    SlopeOne implementations in
org.apache.mahout.cf.taste.hadoop.slopeone and
org.apache.mahout.cf.taste.impl.recommender.slopeone
    Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
    TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender

Mahout Math
    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
===

In MLlib, we should include the algorithms users know how to use and
we can provide support rather than letting algorithms come and go.

My $0.02,
Xiangrui

On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:

Re: Any plans for new clustering algorithms?

Posted by Sean Owen <so...@cloudera.com>.
On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown <pr...@mult.ifario.us> wrote:
> - MLlib as Mahout.next would be unfortunate.  There are some gems in
> Mahout, but there are also lots of rocks.  Setting a minimal bar of
> working, correctly implemented, and documented requires a surprising amount
> of work.

As someone with first-hand knowledge, this is correct. To Sang's
question, I can't see value in 'porting' Mahout since it is based on a
quite different paradigm. About the only part that translates is the
algorithm concept itself.

This is also the cautionary tale. The contents of the project have
ended up being a number of "drive-by" contributions of implementations
that, while individually perhaps brilliant, didn't
necessarily match any other implementation in structure, input/output,
libraries used. The implementations were often a touch academic. The
result was hard to document, maintain, evolve or use.

Far more of the structure of the MLlib implementations is consistent
by virtue of being built around Spark core already. That's great.

One can't wait for the foundation to be completely built before writing
any implementations. To me, the existing implementations are almost
exactly the basics I would choose. They cover the bases and will
exercise the abstractions and structure. So that's also great IMHO.

Re: Any plans for new clustering algorithms?

Posted by Paul Brown <pr...@mult.ifario.us>.
I agree that it will be good to see more algorithms added to the MLlib
universe, although this does bring to mind a couple of comments:

- MLlib as Mahout.next would be unfortunate.  There are some gems in
Mahout, but there are also lots of rocks.  Setting a minimal bar of
working, correctly implemented, and documented requires a surprising amount
of work.

- Not getting any signal out of your data with an algorithm like K-means
implies one of the following: (1) there is no signal in your data, (2) you
should try tuning the algorithm differently, (3) you're using K-means
wrong, (4) you should try preparing the data differently, (5) all of the
above, or (6) none of the above.
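Point (4) is worth dwelling on: with unscaled features, one large-magnitude dimension can dominate the Euclidean distances K-means relies on. A deliberately naive pure-Python sketch illustrates this (Lloyd's iterations with deterministic initialization to the first k points so the example is reproducible; this is not any particular library's API):

```python
def kmeans(points, k, iters=20):
    """Naive Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean).
        groups = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[dists.index(min(dists))].append(p)
        # Recompute each centroid as the mean of its group.
        for i, g in enumerate(groups):
            if g:  # leave an empty cluster's centroid where it was
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    # Final assignment: one cluster label per point.
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

# Two groups in the second feature (y near 0 vs. y near 5); the first
# feature is noise on a scale of ~1000 and swamps the distance metric.
raw = [(1000, 0.0), (1100, 0.1), (1000, 5.0), (1100, 5.1)]
scaled = [(x / 1000, y) for x, y in raw]

print(kmeans(raw, 2))     # grouping driven by the noisy first feature
print(kmeans(scaled, 2))  # recovers the intended grouping by the second
```

Same algorithm, same data; only the preparation differs, and the clustering flips from the nuisance feature to the meaningful one.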

My $0.02.
-- Paul


—
prb@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen <so...@cloudera.com> wrote:


Re: Any plans for new clustering algorithms?

Posted by Sang Venkatraman <sa...@gmail.com>.
Hi,

On a related note, I have not looked at the MLlib library in detail, but
are there plans to reuse or port over parts of Apache Mahout?

Thanks,
Sang


On Mon, Apr 21, 2014 at 12:07 PM, Evan R. Sparks <ev...@gmail.com> wrote:


Re: Any plans for new clustering algorithms?

Posted by Aliaksei Litouka <al...@gmail.com>.
Thank you very much for the detailed answers.
I can't help but agree that a good MLlib core is a higher priority than
algorithms built on top of it. I'll check if I can contribute anything to
the core. I will also follow Nick Pentreath's recommendation to start a new
GitHub project. Actually, here is a link to repository:
https://github.com/alitouka/spark_dbscan . Currently it is empty - I've
just created it :)
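For anyone curious what is involved, the core of DBSCAN is compact. A minimal single-machine sketch in Python (illustrative only; the function and parameter names and the brute-force neighborhood query are my own, and a Spark implementation would need to distribute exactly that query):

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (brute force)."""
    return [j for j, q in enumerate(points)
            if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)           # None = unvisited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                  # noise (may be claimed later)
            continue
        labels[i] = cluster                 # i is a core point: new cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster         # border point, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)   # j is also a core point; expand
        cluster += 1
    return labels
```

For example, `dbscan([(0,0), (0,0.5), (0.5,0), (10,10), (10,10.5), (50,50)], eps=1.0, min_pts=2)` puts the first three points in one cluster, the next two in another, and marks the outlier as noise. Unlike K-Means, no cluster count is specified up front.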


2014-04-21 11:40 GMT-05:00 Nick Pentreath <ni...@gmail.com>:


Re: Any plans for new clustering algorithms?

Posted by Nick Pentreath <ni...@gmail.com>.
I am very much +1 on Sean's comment.

I think the correct abstractions and API for Vectors, Matrices and
distributed matrices (distributed row matrix etc) will, once bedded down
and battle tested in the wild, allow a whole lot of flexibility for
developers of algorithms on top of MLlib core.

This is true whether the algorithm finds itself in MLlib, MLBase, or
resides in a separate contrib project. Just as Spark core sometimes risks
trying to please everybody by accumulating a kitchen sink of Hadoop
integration aspects and RDD operations (so a spark-contrib project may make
a lot of sense), an ml-contrib project could hold a lot of
algorithms that are not core but still of wide interest. This can include,
for example, models that are still cutting edge and perhaps not as widely
used in production yet, or specialist models that are of interest to a more
niche group.

scikit-learn is very tough about this, requiring a very high bar for
including a new algorithm (many citations, dev support, proof of strong
performance and wide demand). And this leads to a very high quality code
base in general.

I'd say we should (if it hasn't been done already; I may have missed such a
discussion) decide precisely what constitutes MLlib's "1.0.0" goals
for algorithms. I'd say what we have in terms of clustering (K-Means||),
linear models, decision trees and collaborative filtering is pretty much a
good goal. Potentially the Random Forest implementation on top of the DT,
and perhaps another form of recommendation model (such as the co-occurrence
models cf. Mahout's) could be potential candidates for inclusion. I'd also
say any other optimization methods/procedures in addition to SGD and LBFGS
that are very strong and widely used for a variety of (distributed) ML
problems, could be candidates. And finally things like useful utils,
cross-validation and evaluation methods, etc.

So I'd say by all means, please work on a new model such as DBSCAN. Put it
in a new GitHub project, post some detailed performance comparisons vs
MLlib K-Means, and then future inclusion in MLlib core becomes pretty
easy.


On Mon, Apr 21, 2014 at 6:07 PM, Evan R. Sparks <ev...@gmail.com> wrote:


Re: Any plans for new clustering algorithms?

Posted by "Evan R. Sparks" <ev...@gmail.com>.
While DBSCAN and others would be welcome contributions, I couldn't agree
more with Sean.




On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen <so...@cloudera.com> wrote:


Re: Any plans for new clustering algorithms?

Posted by Sean Owen <so...@cloudera.com>.
Nobody asked me, and this is a comment on a broader question, not this
one, but:

In light of a number of recent items about adding more algorithms,
I'll say that I personally think an explosion of algorithms should
come after the MLlib "core" is more fully baked. I'm thinking of
finishing out the changes to vectors and matrices, for example. Things
are going to change significantly in the short term as people use the
algorithms and see how well the abstractions do or don't work. I've
seen another similar project suffer mightily from too many algorithms
too early, so maybe I'm just paranoid.

Anyway, long-term, I think lots of good algorithms is a right and
proper goal for MLlib, myself. Consistent approaches, representations
and APIs will make or break MLlib much more than having or not having
a particular algorithm. With the plumbing in place, writing the algo
is the fun easy part.
--
Sean Owen | Director, Data Science | London


On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka
<al...@gmail.com> wrote:
> Hi, Spark developers.
> Are there any plans for implementing new clustering algorithms in MLLib? As
> far as I understand, current version of Spark ships with only one
> clustering algorithm - K-Means. I want to contribute to Spark and I'm
> thinking of adding more clustering algorithms - maybe
> > DBSCAN (http://en.wikipedia.org/wiki/DBSCAN).
> I can start working on it. Does anyone want to join me?