
Posted to user@mahout.apache.org by alex kamil <al...@gmail.com> on 2011/03/10 10:57:08 UTC

candidate projects

I posted this question on Quora; comments are welcome.
If people would like to contribute to the project, what areas should they
focus on first? What are the most frequent user requests and features that
require improvement? What are some candidate projects for the GSoC program?
http://www.quora.com/Apache-Mahout/What-are-some-important-machine-learning-and-numerical-algorithms-not-yet-covered-in-Mahout

Thanks
Alex

Re: candidate projects

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 10, 2011, at 1:39 PM, Dan Brickley wrote:

> On 10 March 2011 18:32, Ted Dunning <te...@gmail.com> wrote:
> 
>> Out of all of this, you might get the impression that I don't see much to
>> add to Mahout and you would be right insofar as this list is concerned. That
>> doesn't mean that other people won't have other itches. If they do, they
>> should drop in on the Mahout mailing lists and see if we can't work together
>> to scratch that itch.
> 
> Is there any interest in reviving the RBM work?
> https://issues.apache.org/jira/browse/MAHOUT-214

Yes.

Re: candidate projects

Posted by je...@lewi.us.

There's also Hama, which has a BSP (bulk synchronous parallel) engine,
which is related to what Google's Pregel does.

Would Hama's BSP engine be appropriate for building iterative
algorithms, so as to avoid the latency of reading from file on each
iteration?

There's also Spark (http://www.cs.berkeley.edu/~matei/spark/), which
is built on top of Mesos.

J

Re: candidate projects

Posted by "Bae, Jae Hyeon" <me...@gmail.com>.

About density-based clustering (DBSCAN): since it requires random access
to the distance matrix, it has a scalability problem. To overcome this
limitation, I've chosen an approach using union-find. As you know,
union-find is scalable and fits the MapReduce model. To apply union-find,
we first have to find the initial clusters, which are the groups of points
within the MinPts and Eps boundary, and then check pairwise whether two
clusters share some elements. Armed with these paired clusters, we can
apply union-find to them.

This method results in slightly different output compared to DBSCAN,
because expanding clusters through union-find does not distinguish between
boundary points and core points. Anyway, it is scalable.
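
To make the idea concrete, here is a minimal single-machine sketch of the
approach described above. It is not the Mahout/MapReduce implementation; the
names UnionFind and density_clusters, and the brute-force neighbourhood
search, are assumptions made only for illustration.

```python
from itertools import combinations


class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Walk to the root, pointing each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def density_clusters(points, eps, min_pts, dist):
    """Hypothetical helper: merge dense Eps-neighbourhoods with union-find."""
    n = len(points)
    # Step 1: initial clusters are Eps-neighbourhoods holding at least
    # min_pts points (the centre point counts). This is the first O(N^2) pass.
    initial = []
    for i in range(n):
        hood = {j for j in range(n) if dist(points[i], points[j]) <= eps}
        if len(hood) >= min_pts:
            initial.append(hood)
    # Step 2: pairwise check whether two initial clusters share an element,
    # and union them if so. This is the second O(N^2) pass.
    uf = UnionFind(len(initial))
    for a, b in combinations(range(len(initial)), 2):
        if initial[a] & initial[b]:
            uf.union(a, b)
    # Step 3: collect the members of every merged group.
    merged = {}
    for idx, hood in enumerate(initial):
        merged.setdefault(uf.find(idx), set()).update(hood)
    return sorted(sorted(group) for group in merged.values())
```

Note that two neighbourhoods are merged whenever they share any point, even
one that would only be a boundary point in DBSCAN, which is where the
difference in output comes from.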

I've already implemented this on top of Mahout. To vectorize documents, I
used Mahout's seq2sparse module.

The disadvantage of the above method is that we have to run an O(N^2) step
twice. The first is calculating the pairwise similarity distance; the second
is checking pairwise whether two clusters can be connected. Although we are
borrowing distributed computing power from Hadoop, O(N^2) is not tractable
if we have more than one million documents. The worst thing is that checking
whether two clusters share some elements can cost another linear factor. To
speed this up, I've used a kind of bitmap, but that's slow too :(
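
As a rough illustration of the bitmap idea (the actual encoding used is not
described here, so this is only an assumed one), each cluster can be stored
as an integer whose set bits are member ids, and the overlap test becomes a
single bitwise AND. The helper names to_bitmap and share_element are
hypothetical.

```python
def to_bitmap(members):
    """Encode a collection of non-negative member ids as one integer bitmap."""
    bits = 0
    for m in members:
        bits |= 1 << m
    return bits


def share_element(bitmap_a, bitmap_b):
    """Two clusters overlap iff their bitmaps have at least one common bit."""
    return (bitmap_a & bitmap_b) != 0
```

The AND still touches every machine word of the bitmaps, so with millions of
document ids the per-pair cost stays proportional to the bitmap length,
which fits the observation that this is slow too.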

I'd like to contribute to Mahout regarding density-based clustering.
Unfortunately, I don't have much free time :(

Best, Jay


Re: candidate projects

Posted by Ted Dunning <te...@gmail.com>.

I don't know if iterative solutions are actually required.  Such iterations
are relatively natural, but the same can be said for power methods and there
are dramatic improvements available with those.
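
For readers unfamiliar with the term, a power method repeatedly multiplies a
vector by a matrix and renormalizes until the vector settles on the dominant
eigenvector. On a cluster, each multiply is roughly one pass over the data,
which is what makes the naive form expensive. The sketch below is a plain
single-machine version, not Mahout code; the name power_iteration and the
dense list-of-lists matrix are assumptions for illustration only.

```python
import math


def mat_vec(matrix, v):
    """Dense matrix-vector product; one call corresponds to one data pass."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def power_iteration(matrix, iterations=100):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # v lies in the null space; nothing more to learn
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v (v has unit length) estimates the eigenvalue.
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v
```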

Alternative frameworks are an excellent idea.  There is a LOT of stuff about
to happen to make alternative frameworks possible in a cluster.  One
alternative platform is Mesos, which was accepted into the Apache Incubator
in December.  Another is the MapReduce 2.0 stuff that Arun Murthy is working
on at Yahoo but which is not yet released.  Both allow map-reduce to
ultimately be demoted from cluster-godhood to the status of a library.  Both
allow alternative computational paradigms to be built on an experimental
level.  Neither is ready for prime-time.

RBMs would be much easier to code on GraphLab, but right now GraphLab
isn't very deployable into production.  Getting experience with RBMs is
reasonable even without GraphLab, however.


Re: candidate projects

Posted by je...@lewi.us.

Ted, in a different thread (I think it was in response to Bickson's
profiling of Hadoop), I think you pointed out that Hadoop isn't well
suited for iterative algorithms. Don't RBMs require iterative
algorithms? Or is it just when you stack them into a deep belief net
that you need iterative algorithms?

Given the graphical/iterative nature of RBMs, would it make more sense
to try to build a better framework than map-reduce, e.g. scale
Bickson's GraphLab to the cloud or build an open-source equivalent of
Google's Pregel?

There is already some pretty good code out there for RBMs in addition
to GraphLab; there's Theano, which leverages the GPU. Would there be some
way to leverage that code?

J


Re: candidate projects

Posted by Ted Dunning <te...@gmail.com>.

RBMs would be very cool.  They have some potential for being at least as
scalable as LDA.

How would you go about it?


Re: candidate projects

Posted by Dan Brickley <da...@danbri.org>.

On 10 March 2011 18:32, Ted Dunning <te...@gmail.com> wrote:

> Out of all of this, you might get the impression that I don't see much to
> add to Mahout and you would be right insofar as this list is concerned. That
> doesn't mean that other people won't have other itches. If they do, they
> should drop in on the Mahout mailing lists and see if we can't work together
> to scratch that itch.

Is there any interest in reviving the RBM work?
https://issues.apache.org/jira/browse/MAHOUT-214

http://en.wikipedia.org/wiki/Boltzmann_machine#Restricted_Boltzmann_Machine
http://www.scholarpedia.org/article/Boltzmann_machine#Restricted_Boltzmann_machines

Hinton in http://www.youtube.com/watch?v=AyzOUbkUf3M is both
entertaining and persuasive, but I'm out of my maths depth.
Also http://www.cs.toronto.edu/~hinton/ ->
http://www.cs.toronto.edu/~hinton/science.pdf

cheers,

Dan

Re: candidate projects

Posted by Ted Dunning <te...@gmail.com>.

Here is my answer to the best comment on that Quora thread:

Mariana, nice to hear from you. It would be great to hear from you on the
Mahout mailing lists as well.

Here is what I know of your suggestions vis a vis Mahout:

> CoWeb and DBscan clustering methods.

I don't find any documentation on CoWeb.  DBSCAN looks like it is O(n log n)
and requires random access, which is not scalable.  Is there an alternative
formulation that is scalable?

> Decision Trees such as J48 and ID3.

We have an early implementation of random forests.  The standard algorithms
for J48 and ID3 are not scalable.  There is some interest in implementing
some faster versions of these.

> Decision Tables.

This is a bit ambiguous.

> Apriori

Apriori has serious scalability problems.  Instead, we have an alternative
frequent itemset algorithm.

> SMO

Fast sequential algorithms for SVM training (of which SMO is just one
alternative) exist and implementations of these are available.  See
liblinear, svmLight and so on.  What doesn't exist is a version that scales
to really large data.  On the other hand, for very large sparse problems,
stochastic gradient descent (SGD) for logistic regression seems to work as
well as or better than SVM anyway.  Mahout has a state-of-the-art SGD
implementation.
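
For anyone who hasn't seen it, the core of SGD for logistic regression on
sparse data fits in a few lines. The sketch below is only an illustration of
the technique and is not Mahout's implementation; the names sgd_logistic and
predict, the dict-based sparse vectors, and the fixed learning rate are all
assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Probability of the positive class for a sparse vector {index: value}."""
    return sigmoid(sum(weights[j] * v for j, v in x.items()))


def sgd_logistic(examples, n_features, epochs=50, rate=0.1, seed=0):
    """Train on (sparse_vector, label) pairs, one example per update."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = examples[i]
            error = y - predict(weights, x)
            # Only the non-zero features of this example are touched, so the
            # cost of an update is proportional to the example's sparsity.
            for j, v in x.items():
                weights[j] += rate * error * v
    return weights
```

Each update reads a single example and touches only its non-zero features,
which is why the method suits very large sparse problems.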

> Genetic Algorithms

Mahout has a classical GA implementation. It also has a very practical
implementation of recorded-step evolutionary optimization.

Out of all of this, you might get the impression that I don't see much to
add to Mahout and you would be right insofar as this list is concerned. That
doesn't mean that other people won't have other itches. If they do, they
should drop in on the Mahout mailing lists and see if we can't work together
to scratch that itch.


Re: candidate projects

Posted by Ted Dunning <te...@gmail.com>.

It's nice of you to talk about Mahout elsewhere, but if you really want to
talk to the current developers, this mailing list is likely to be a much
more effective place to go.
