Posted to user@mahout.apache.org by Ying Liao <yl...@gmail.com> on 2014/02/18 22:08:56 UTC

Mahout on Spark?

Just wondering what the future of Mahout is. We are seeing new stuff from
0xdata and Skytree, and Spark is also designed for in-memory iterative
analysis. What about Mahout? Will Mahout run on top of Spark in the future?

Thanks,
Ying Liao

Re: Mahout on Spark?

Posted by Sean Owen <sr...@gmail.com>.

Agree that 'merging' is so infeasible as to not make sense. Mahout has
been ML on M/R and that's its thing, which seems fine. IMHO this
project has been hurt by an active unwillingness to define scope, and
pretending it's helpful to have little bits of lots of ideas and
technologies.

I also don't see a point in trying to duplicate mllib. Just add to
mllib. It's Apache, etc. I also agree that being a bag of algorithms
is a bad idea and we have told the mllib folks as much FWIW.

The Spark / Databricks guys are the few qualified to manage
contributions to mllib, and are doing a heroic job of handling the
flood of PRs. (Does Matei sleep anymore?) But they're getting overrun,
and focused on getting the machinery of Spark really production-ready,
esp. on Hadoop. My concern about mllib in the short term is there
aren't enough expert brain cells to spare to manage the load of
production-izing work that mllib could use, because it's secondary to
core Spark. All the more reason I can't see, in practice, any spare
cycles available to do some kind of Mahout-integration anything.

(FWIW I have high hopes for mllib and, assuming we can get some basic
stuff fixed, we're going to replace M/R-based implementations with
Spark in the stuff I work on. It still needs a decent RDF implementation.
But then again, so does Mahout :( )


On Wed, Feb 19, 2014 at 8:27 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Unfortunately methinks the prospects of something like Mahout/MLLib merge
> seem very unlikely due to vastly diverged approach to the basics of linear
> algebra (and other things). Just like one cannot grow single tree out of
> two trunks -- not easily, anyway.
>
> It is fairly easy to port (and subsequently beat) MLib at this point from
> collection of algorithms point of view. But IMO goal should be more
> MLI-like first, and port second. And be very careful with concepts.
> Something that i so far don't see happening with MLib. MLib seems to be
> old-style Mahout-like rush to become a collection of basic algorithms
> rather than coherent foundation. Admittedly, i havent looked very closely.
>
>
> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ss...@apache.org> wrote:
>
>> I'm also convinced that Spark is a superior platform for executing
>> distributed ML algorithms. We've had a discussion about a change from
>> Hadoop to another platform some time ago, but at that point in time it was
>> not clear which of the upcoming dataflow processing systems (Spark,
>> Hyracks, Stratosphere) would establish itself amongst the users. To me it
>> seems pretty obvious that Spark made the race.
>>
>> I concur with Ted, it would be great to have the communities work
>> together. I know that at least 4 mahout committers (including me) are
>> already following Spark's mailinglist and actively participating in the
>> discussions.
>>
>> What are the ideas how a fruitful cooperation look like?
>>
>> Best,
>> Sebastian
>>
>> PS:
>>
>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>> to Spark some time ago, but I haven't had time to test my code on a large
>> dataset yet. I'd be happy to see someone help with that.
>>
>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>
>>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>>> doing certain things, but we'd welcome as many Mahout devs as possible to
>>> work together.
>>>
>>>
>>> It may be too late, but perhaps a GSoC project to look at a port of some
>>> stuff like co occurrence recommender and streaming k-means?
>>>
>>>
>>>
>>>
>>> N
>>> --
>>> Sent from Mailbox for iPhone
>>>
>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>
>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>>>
>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>> overall for ML. If the two communities can work together to leverage the
>>>>> strengths of Spark, and the large amount of good stuff in Mahout (as well as
>>>>> the fantastic depth of experience of Mahout devs) I think a lot can be
>>>>> achieved!
>>>>
>>>> It makes a lot of sense that Spark would be better than Hadoop for ML
>>>> purposes given that Hadoop was intended to do web-crawl kinds of things
>>>> and Spark was intentionally built to support machine learning.
>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
>>>> could work more closely together.  There is a lot of good to be had on
>>>> both sides.
>>>>
>>>
>>
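
For readers unfamiliar with the LLR score behind the cooccurrence analysis mentioned in Sebastian's quoted PS, here is a minimal Python sketch of the standard G^2 log-likelihood ratio on a 2x2 contingency table. It is not the code from the Spark port, which is not shown in this thread. The function name `llr` and the count names are assumptions chosen for illustration: `k11` is the number of users who interacted with both items, `k12` and `k21` the number who interacted with only one of them, and `k22` the number who interacted with neither.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 table of cooccurrence counts.

    Large values mean the two items cooccur more (or less) often than
    independence would predict; 0 means the counts look independent.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))

    total = 0.0
    for count, i, j in cells:
        if count > 0:  # empty cells contribute nothing to the sum
            total += count * math.log(count * n / (rows[i] * cols[j]))

    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * total)
```

An item-based recommender built this way would score every item pair with such a function and keep only the highest-scoring pairs as "similar items"; the distributed part of the job is counting the four cells at scale.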

Re: Mahout on Spark?

Posted by Suneel Marthi <su...@yahoo.com>.

On Wednesday, February 19, 2014 7:22 PM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Feb 19, 2014 at 1:55 PM, peng <pc...@uowmail.edu.au> wrote:
>
>> But maybe mahout can include contribs that M/R is not fit for, like
>> downpour SGD or graph-based algorithms?
>
> Yes.  Absolutely.

Downpour SGD is #1 on my list of features for 1.0. I will start working on it once the MultiLayer Perceptron is functional and integrated into the Mahout processing pipeline (should be by next week).

Re: Mahout on Spark?

Posted by Ted Dunning <te...@gmail.com>.

On Wed, Feb 19, 2014 at 1:55 PM, peng <pc...@uowmail.edu.au> wrote:

> But maybe mahout can include contribs that M/R is not fit for, like
> downpour SGD or graph-based algorithms?
>

Yes.  Absolutely.

Re: Mahout on Spark?

Posted by Nick Pentreath <ni...@gmail.com>.

MLlib may be less production-tested than Mahout, that is true, but I would
say Spark is heavily production-tested and getting close to a true 1.0
release. Why do you favour Hadoop for "sturdiness"? Spark uses HDFS (or any
Hadoop InputFormat) as an input source, so it benefits from the same fault
tolerance with respect to input sources. Spark's fault tolerance model for
tasks / jobs is, if anything, superior to that of Hadoop M/R.

For a Downpour SGD-like implementation on Spark see:
https://github.com/apache/incubator-spark/pull/407. Assuming the framework
for Spark SGD / gradients etc is flexible enough, one should be able to
implement neural net / perceptron on top of this. Would be interested to
hear if it can be done easily with the current code framework.
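
To make the Downpour SGD idea concrete, here is a minimal single-process Python sketch. It is not the code from the pull request above, and it is not Spark code. It assumes a linear model with squared-error loss, and a round-robin loop stands in for truly asynchronous workers. The names `ParameterServer`, `downpour_sgd`, `make_shards` and `n_fetch` are hypothetical, chosen for illustration. Each replica trains on its own data shard, refreshes its copy of the weights only every `n_fetch` steps, and pushes every gradient to the shared server, so the server applies gradients that were computed on slightly stale weights.

```python
import random


class ParameterServer:
    """Holds the shared weights; replicas pull copies and push gradients."""

    def __init__(self, dim, lr):
        self.w = [0.0] * dim
        self.lr = lr

    def pull(self):
        return list(self.w)

    def push(self, grad):
        # Apply a (possibly stale) gradient as soon as it arrives.
        self.w = [wi - self.lr * gi for wi, gi in zip(self.w, grad)]


def gradient(w, batch):
    """Squared-error gradient of a linear model, averaged over one mini-batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / len(batch)
    return grad


def downpour_sgd(shards, dim, lr=0.05, steps=300, n_fetch=5, batch_size=8, seed=0):
    rng = random.Random(seed)
    server = ParameterServer(dim, lr)
    replicas = [server.pull() for _ in shards]
    for step in range(steps):
        # Round-robin over replicas simulates workers running side by side.
        for k, shard in enumerate(shards):
            if step % n_fetch == 0:
                replicas[k] = server.pull()  # refresh the local copy
            grad = gradient(replicas[k], rng.sample(shard, batch_size))
            # Keep training locally between fetches, and share the gradient.
            replicas[k] = [wi - lr * gi for wi, gi in zip(replicas[k], grad)]
            server.push(grad)
    return server.w


def make_shards(true_w, n_shards=4, per_shard=50, seed=1):
    """Synthetic noiseless regression data, split into one shard per replica."""
    rng = random.Random(seed)
    shards = []
    for _ in range(n_shards):
        shard = []
        for _ in range(per_shard):
            x = [rng.uniform(-1, 1) for _ in true_w]
            shard.append((x, sum(wi * xi for wi, xi in zip(true_w, x))))
        shards.append(shard)
    return shards
```

Calling `downpour_sgd(make_shards([2.0, -3.0]), dim=2)` should recover weights close to the ones used to generate the data. A neural net or perceptron would replace `gradient` with backpropagation; the pull/push structure stays the same.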


On Wed, Feb 19, 2014 at 11:55 PM, peng <pc...@uowmail.edu.au> wrote:

> I was suggested to switch to MLlib for its performance, but I doubt if
> that is production ready, even if it is I would still favour hadoop's
> sturdiness and self-healing.
> But maybe mahout can include contribs that M/R is not fit for, like
> downpour SGD or graph-based algorithms?
>
>
> On Wed 19 Feb 2014 07:52:22 AM EST, Sean Owen wrote:
>
>> To set expectations appropriately, I think it's important to point out
>> this is completely infeasible short of a total rewrite, and I can't
>> imagine that will happen. It may not be obvious if you haven't looked
>> at the code how completely dependent on M/R it is.
>>
>> You can swap out M/R and Spark if you write in terms of something like
>> Crunch, but that is not at all the case here.
>>
>> On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <ja...@gmail.com> wrote:
>>
>>> +100 for this, different execution engines, like the direction  pig and
>>> crunch take
>>>
>>> Sent from my iPhone
>>>
>>>  On Feb 19, 2014, at 5:19 AM, Gokhan Capan <gk...@gmail.com> wrote:
>>>>
>>>> I imagine in Mahout offering an option to the users to select from
>>>> different execution engines (just like we currently do by giving M/R or
>>>> sequential options), and starting from Spark. I am not sure what changes
>>>> needed in the codebase, though. Maybe following MLI (or alike) and
>>>> implementing some more stuff, such as common interfaces for iterating
>>>> over
>>>> data (the M/R way and the Spark way).
>>>>
>>>> IMO, another effort might be porting pre-online machine learning (such
>>>> transforming text into vector based on the dictionary generated by
>>>> seq2sparse before), machine learning based on mini-batches, and
>>>> streaming
>>>> summarization stuff in Mahout to Spark-Streaming.
>>>>
>>>> Best,
>>>> Gokhan
>>>>
>>>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>
>>>>> PS I am moving along cost optimizer for spark-backed DRMs on some
>>>>> multiplicative pipelines that is capable of figuring different cost-based
>>>>> rewrites and R-Like DSL that mixes in-core and distributed matrix
>>>>> representations and blocks but it is painfully slow, i really only doing it
>>>>> like couple nights in a month. It does not look like i will be doing it on
>>>>> company time any time soon (and even if i did, the company doesn't seem to
>>>>> be inclined to contribute anything I do anything new on their time). It is
>>>>> all painfully slow, there's no direct funding for it anywhere with no
>>>>> string attached. That probably will be primary reason why Mahout would not
>>>>> be able to get much traction compared to university-based contributions.
>>>>>
>>>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>>
>>>>>> Unfortunately methinks the prospects of something like Mahout/MLLib merge
>>>>>> seem very unlikely due to vastly diverged approach to the basics of linear
>>>>>> algebra (and other things). Just like one cannot grow single tree out of
>>>>>> two trunks -- not easily, anyway.
>>>>>>
>>>>>> It is fairly easy to port (and subsequently beat) MLib at this point from
>>>>>> collection of algorithms point of view. But IMO goal should be more
>>>>>> MLI-like first, and port second. And be very careful with concepts.
>>>>>> Something that i so far don't see happening with MLib. MLib seems to be
>>>>>> old-style Mahout-like rush to become a collection of basic algorithms
>>>>>> rather than coherent foundation. Admittedly, i havent looked very closely.
>>>>>>
>>>>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ssc@apache.org> wrote:
>>>>>>
>>>>>>> I'm also convinced that Spark is a superior platform for executing
>>>>>>> distributed ML algorithms. We've had a discussion about a change from
>>>>>>> Hadoop to another platform some time ago, but at that point in time it was
>>>>>>> not clear which of the upcoming dataflow processing systems (Spark,
>>>>>>> Hyracks, Stratosphere) would establish itself amongst the users. To me it
>>>>>>> seems pretty obvious that Spark made the race.
>>>>>>>
>>>>>>> I concur with Ted, it would be great to have the communities work
>>>>>>> together. I know that at least 4 mahout committers (including me) are
>>>>>>> already following Spark's mailinglist and actively participating in the
>>>>>>> discussions.
>>>>>>>
>>>>>>> What are the ideas how a fruitful cooperation look like?
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>> PS:
>>>>>>>
>>>>>>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>>>>>>> to Spark some time ago, but I haven't had time to test my code on a large
>>>>>>> dataset yet. I'd be happy to see someone help with that.
>>>>>>>
>>>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>>>>
>>>>>>>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>>>>>>>> doing certain things, but we'd welcome as many Mahout devs as possible to
>>>>>>>> work together.
>>>>>>>>
>>>>>>>> It may be too late, but perhaps a GSoC project to look at a port of some
>>>>>>>> stuff like co occurrence recommender and streaming k-means?
>>>>>>>>
>>>>>>>> N
>>>>>>>> --
>>>>>>>> Sent from Mailbox for iPhone
>>>>>>>>
>>>>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>>>>>>> overall for ML. If the two communities can work together to leverage the
>>>>>>>>>> strengths of Spark, and the large amount of good stuff in Mahout (as well as
>>>>>>>>>> the fantastic depth of experience of Mahout devs) I think a lot can be
>>>>>>>>>> achieved!
>>>>>>>>>
>>>>>>>>> It makes a lot of sense that Spark would be better than Hadoop for ML
>>>>>>>>> purposes given that Hadoop was intended to do web-crawl kinds of things
>>>>>>>>> and Spark was intentionally built to support machine learning.
>>>>>>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>>>>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>>>>>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
>>>>>>>>> could work more closely together.  There is a lot of good to be had on
>>>>>>>>> both sides.

Re: Mahout on Spark?

Posted by peng <pc...@uowmail.edu.au>.

It was suggested that I switch to MLlib for its performance, but I doubt
that it is production-ready; even if it is, I would still favour Hadoop's
sturdiness and self-healing.
But maybe Mahout can include contribs that M/R is not fit for, like
Downpour SGD or graph-based algorithms?

On Wed 19 Feb 2014 07:52:22 AM EST, Sean Owen wrote:
> To set expectations appropriately, I think it's important to point out
> this is completely infeasible short of a total rewrite, and I can't
> imagine that will happen. It may not be obvious if you haven't looked
> at the code how completely dependent on M/R it is.
>
> You can swap out M/R and Spark if you write in terms of something like
> Crunch, but that is not at all the case here.
>
> On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <ja...@gmail.com> wrote:
>> +100 for this, different execution engines, like the direction  pig and crunch take
>>
>> Sent from my iPhone
>>
>>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <gk...@gmail.com> wrote:
>>>
>>> I imagine in Mahout offering an option to the users to select from
>>> different execution engines (just like we currently do by giving M/R or
>>> sequential options), and starting from Spark. I am not sure what changes
>>> needed in the codebase, though. Maybe following MLI (or alike) and
>>> implementing some more stuff, such as common interfaces for iterating over
>>> data (the M/R way and the Spark way).
>>>
>>> IMO, another effort might be porting pre-online machine learning (such
>>> transforming text into vector based on the dictionary generated by
>>> seq2sparse before), machine learning based on mini-batches, and streaming
>>> summarization stuff in Mahout to Spark-Streaming.
>>>
>>> Best,
>>> Gokhan
>>>
>>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>>>
>>>> PS I am moving along cost optimizer for spark-backed DRMs on some
>>>> multiplicative pipelines that is capable of figuring different cost-based
>>>> rewrites and R-Like DSL that mixes in-core and distributed matrix
>>>> representations and blocks but it is painfully slow, i really only doing it
>>>> like couple nights in a month. It does not look like i will be doing it on
>>>> company time any time soon (and even if i did, the company doesn't seem to
>>>> be inclined to contribute anything I do anything new on their time). It is
>>>> all painfully slow, there's no direct funding for it anywhere with no
>>>> string attached. That probably will be primary reason why Mahout would not
>>>> be able to get much traction compared to university-based contributions.
>>>>
>>>>
>>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>>> wrote:
>>>>
>>>>> Unfortunately methinks the prospects of something like Mahout/MLLib merge
>>>>> seem very unlikely due to vastly diverged approach to the basics of
>>>> linear
>>>>> algebra (and other things). Just like one cannot grow single tree out of
>>>>> two trunks -- not easily, anyway.
>>>>>
>>>>> It is fairly easy to port (and subsequently beat) MLib at this point from
>>>>> collection of algorithms point of view. But IMO goal should be more
>>>>> MLI-like first, and port second. And be very careful with concepts.
>>>>> Something that i so far don't see happening with MLib. MLib seems to be
>>>>> old-style Mahout-like rush to become a collection of basic algorithms
>>>>> rather than coherent foundation. Admittedly, i havent looked very
>>>> closely.
>>>>>
>>>>>
>>>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ssc@apache.org
>>>>> wrote:
>>>>>
>>>>>> I'm also convinced that Spark is a superior platform for executing
>>>>>> distributed ML algorithms. We've had a discussion about a change from
>>>>>> Hadoop to another platform some time ago, but at that point in time it
>>>> was
>>>>>> not clear which of the upcoming dataflow processing systems (Spark,
>>>>>> Hyracks, Stratosphere) would establish itself amongst the users. To me
>>>> it
>>>>>> seems pretty obvious that Spark made the race.
>>>>>>
>>>>>> I concur with Ted, it would be great to have the communities work
>>>>>> together. I know that at least 4 mahout committers (including me) are
>>>>>> already following Spark's mailinglist and actively participating in the
>>>>>> discussions.
>>>>>>
>>>>>> What are the ideas how a fruitful cooperation look like?
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>> PS:
>>>>>>
>>>>>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>>>>>> to Spark some time ago, but I haven't had time to test my code on a
>>>> large
>>>>>> dataset yet. I'd be happy to see someone help with that.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>>>>
>>>>>>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>>>>>>> doing certain things, but we'd welcome as many Mahout devs as possible
>>>> to
>>>>>>> work together.
>>>>>>>
>>>>>>>
>>>>>>> It may be too late, but perhaps a GSoC project to look at a port of
>>>> some
>>>>>>> stuff like co occurrence recommender and streaming k-means?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> N
>>>>>>> --
>>>>>>> Sent from Mailbox for iPhone
>>>>>>>
>>>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
>>>>>>>> nick.pentreath@gmail.com>wrote:
>>>>>>>>
>>>>>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>>>>>> overall
>>>>>>>>> for ML. If the two communities can work together to leverage the
>>>>>>>>> strengths
>>>>>>>>> of Spark, and the large amount of good stuff in Mahout (as well as
>>>> the
>>>>>>>>> fantastic depth of experience of Mahout devs) I think a lot can be
>>>>>>>>> achieved!
>>>>>>>>>
>>>>>>>>> It makes a lot of sense that Spark would be better than Hadoop for
>>>> ML
>>>>>>>> purposes given that Hadoop was intended to do web-crawl kinds of
>>>> things
>>>>>>>> and
>>>>>>>> Spark was intentionally built to support machine learning.
>>>>>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>>>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>>>>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
>>>>>>>> could
>>>>>>>> work more closely together.  There is a lot of good to be had on both
>>>>>>>> sides.
>>>>

Re: Mahout on Spark?

Posted by Sebastian Schelter <ss...@apache.org>.

Completely agree with Sean's statement.

On 02/19/2014 01:52 PM, Sean Owen wrote:
> To set expectations appropriately, I think it's important to point out
> this is completely infeasible short of a total rewrite, and I can't
> imagine that will happen. It may not be obvious if you haven't looked
> at the code how completely dependent on M/R it is.
>
> You can swap out M/R and Spark if you write in terms of something like
> Crunch, but that is not at all the case here.
>
> On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <ja...@gmail.com> wrote:
>> +100 for this, different execution engines, like the direction  pig and crunch take
>>
>> Sent from my iPhone
>>
>>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <gk...@gmail.com> wrote:
>>>
>>> I imagine in Mahout offering an option to the users to select from
>>> different execution engines (just like we currently do by giving M/R or
>>> sequential options), and starting from Spark. I am not sure what changes
>>> needed in the codebase, though. Maybe following MLI (or alike) and
>>> implementing some more stuff, such as common interfaces for iterating over
>>> data (the M/R way and the Spark way).
>>>
>>> IMO, another effort might be porting pre-online machine learning (such
>>> transforming text into vector based on the dictionary generated by
>>> seq2sparse before), machine learning based on mini-batches, and streaming
>>> summarization stuff in Mahout to Spark-Streaming.
>>>
>>> Best,
>>> Gokhan
>>>
>>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>>>
>>>> PS I am moving along cost optimizer for spark-backed DRMs on some
>>>> multiplicative pipelines that is capable of figuring different cost-based
>>>> rewrites and R-Like DSL that mixes in-core and distributed matrix
>>>> representations and blocks but it is painfully slow, i really only doing it
>>>> like couple nights in a month. It does not look like i will be doing it on
>>>> company time any time soon (and even if i did, the company doesn't seem to
>>>> be inclined to contribute anything I do anything new on their time). It is
>>>> all painfully slow, there's no direct funding for it anywhere with no
>>>> string attached. That probably will be primary reason why Mahout would not
>>>> be able to get much traction compared to university-based contributions.
>>>>
>>>>
>>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>>> wrote:
>>>>
>>>>> Unfortunately methinks the prospects of something like Mahout/MLLib merge
>>>>> seem very unlikely due to vastly diverged approach to the basics of
>>>> linear
>>>>> algebra (and other things). Just like one cannot grow single tree out of
>>>>> two trunks -- not easily, anyway.
>>>>>
>>>>> It is fairly easy to port (and subsequently beat) MLib at this point from
>>>>> collection of algorithms point of view. But IMO goal should be more
>>>>> MLI-like first, and port second. And be very careful with concepts.
>>>>> Something that i so far don't see happening with MLib. MLib seems to be
>>>>> old-style Mahout-like rush to become a collection of basic algorithms
>>>>> rather than coherent foundation. Admittedly, i havent looked very
>>>> closely.
>>>>>
>>>>>
>>>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ssc@apache.org
>>>>> wrote:
>>>>>
>>>>>> I'm also convinced that Spark is a superior platform for executing
>>>>>> distributed ML algorithms. We've had a discussion about a change from
>>>>>> Hadoop to another platform some time ago, but at that point in time it
>>>> was
>>>>>> not clear which of the upcoming dataflow processing systems (Spark,
>>>>>> Hyracks, Stratosphere) would establish itself amongst the users. To me
>>>> it
>>>>>> seems pretty obvious that Spark made the race.
>>>>>>
>>>>>> I concur with Ted, it would be great to have the communities work
>>>>>> together. I know that at least 4 mahout committers (including me) are
>>>>>> already following Spark's mailinglist and actively participating in the
>>>>>> discussions.
>>>>>>
>>>>>> What are the ideas how a fruitful cooperation look like?
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>> PS:
>>>>>>
>>>>>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>>>>>> to Spark some time ago, but I haven't had time to test my code on a
>>>> large
>>>>>> dataset yet. I'd be happy to see someone help with that.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>>>>
>>>>>>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>>>>>>> doing certain things, but we'd welcome as many Mahout devs as possible
>>>> to
>>>>>>> work together.
>>>>>>>
>>>>>>>
>>>>>>> It may be too late, but perhaps a GSoC project to look at a port of
>>>> some
>>>>>>> stuff like co occurrence recommender and streaming k-means?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> N
>>>>>>> --
>>>>>>> Sent from Mailbox for iPhone
>>>>>>>
>>>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
>>>>>>>> nick.pentreath@gmail.com>wrote:
>>>>>>>>
>>>>>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>>>>>> overall
>>>>>>>>> for ML. If the two communities can work together to leverage the
>>>>>>>>> strengths
>>>>>>>>> of Spark, and the large amount of good stuff in Mahout (as well as
>>>> the
>>>>>>>>> fantastic depth of experience of Mahout devs) I think a lot can be
>>>>>>>>> achieved!
>>>>>>>>>
>>>>>>>>> It makes a lot of sense that Spark would be better than Hadoop for
>>>> ML
>>>>>>>> purposes given that Hadoop was intended to do web-crawl kinds of
>>>> things
>>>>>>>> and
>>>>>>>> Spark was intentionally built to support machine learning.
>>>>>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>>>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>>>>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
>>>>>>>> could
>>>>>>>> work more closely together.  There is a lot of good to be had on both
>>>>>>>> sides.
>>>>

Re: Mahout on Spark?

Posted by Sean Owen <sr...@gmail.com>.

To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.

You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.
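
To illustrate what "writing in terms of something like Crunch" buys you, here is a minimal Python sketch. It does not use the Crunch, Hadoop or Spark APIs; every name in it (`Engine`, `SequentialEngine`, `PartitionedEngine`, `item_counts`) is hypothetical. The algorithm talks only to an abstract engine interface, so the backend can be swapped without touching the algorithm. Code written directly against M/R mapper and reducer classes has no such seam, which is why a port amounts to a rewrite.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from functools import reduce


class Engine(ABC):
    """The only surface an algorithm is allowed to touch."""

    @abstractmethod
    def map_reduce_by_key(self, records, mapper, reducer):
        """Emit (key, value) pairs via mapper, then fold values per key."""


class SequentialEngine(Engine):
    def map_reduce_by_key(self, records, mapper, reducer):
        groups = defaultdict(list)
        for record in records:
            for key, value in mapper(record):
                groups[key].append(value)
        return {key: reduce(reducer, values) for key, values in groups.items()}


class PartitionedEngine(Engine):
    """Stand-in for a distributed backend: reduce each partition, then merge.

    Requires the reducer to be associative and commutative.
    """

    def __init__(self, partitions=3):
        self.partitions = partitions

    def map_reduce_by_key(self, records, mapper, reducer):
        records = list(records)
        local = SequentialEngine()
        merged = {}
        for p in range(self.partitions):
            part = records[p::self.partitions]
            partial = local.map_reduce_by_key(part, mapper, reducer)
            for key, value in partial.items():
                merged[key] = reducer(merged[key], value) if key in merged else value
        return merged


def item_counts(engine, baskets):
    """An 'algorithm' that never mentions which backend runs it."""
    return engine.map_reduce_by_key(
        baskets,
        mapper=lambda basket: [(item, 1) for item in basket],
        reducer=lambda a, b: a + b,
    )
```

Here `item_counts(SequentialEngine(), baskets)` and `item_counts(PartitionedEngine(3), baskets)` return the same dictionary. The price of the seam is that every algorithm must be expressible through the engine's operations.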

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <ja...@gmail.com> wrote:
> +100 for this, different execution engines, like the direction  pig and crunch take
>
> Sent from my iPhone
>
>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <gk...@gmail.com> wrote:
>>
>> I imagine in Mahout offering an option to the users to select from
>> different execution engines (just like we currently do by giving M/R or
>> sequential options), and starting from Spark. I am not sure what changes
>> needed in the codebase, though. Maybe following MLI (or alike) and
>> implementing some more stuff, such as common interfaces for iterating over
>> data (the M/R way and the Spark way).
>>
>> IMO, another effort might be porting pre-online machine learning (such
>> transforming text into vector based on the dictionary generated by
>> seq2sparse before), machine learning based on mini-batches, and streaming
>> summarization stuff in Mahout to Spark-Streaming.
>>
>> Best,
>> Gokhan
>>
>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>>
>>> PS I am moving along cost optimizer for spark-backed DRMs on some
>>> multiplicative pipelines that is capable of figuring different cost-based
>>> rewrites and R-Like DSL that mixes in-core and distributed matrix
>>> representations and blocks but it is painfully slow, i really only doing it
>>> like couple nights in a month. It does not look like i will be doing it on
>>> company time any time soon (and even if i did, the company doesn't seem to
>>> be inclined to contribute anything I do anything new on their time). It is
>>> all painfully slow, there's no direct funding for it anywhere with no
>>> string attached. That probably will be primary reason why Mahout would not
>>> be able to get much traction compared to university-based contributions.
>>>
>>>
>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>> wrote:
>>>
>>>> Unfortunately methinks the prospects of something like Mahout/MLLib merge
>>>> seem very unlikely due to vastly diverged approach to the basics of
>>> linear
>>>> algebra (and other things). Just like one cannot grow single tree out of
>>>> two trunks -- not easily, anyway.
>>>>
>>>> It is fairly easy to port (and subsequently beat) MLib at this point from
>>>> collection of algorithms point of view. But IMO goal should be more
>>>> MLI-like first, and port second. And be very careful with concepts.
>>>> Something that i so far don't see happening with MLib. MLib seems to be
>>>> old-style Mahout-like rush to become a collection of basic algorithms
>>>> rather than coherent foundation. Admittedly, i havent looked very
>>> closely.
>>>>
>>>>
>>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ssc@apache.org
>>>> wrote:
>>>>
>>>>> I'm also convinced that Spark is a superior platform for executing
>>>>> distributed ML algorithms. We've had a discussion about a change from
>>>>> Hadoop to another platform some time ago, but at that point in time it
>>> was
>>>>> not clear which of the upcoming dataflow processing systems (Spark,
>>>>> Hyracks, Stratosphere) would establish itself amongst the users. To me
>>> it
>>>>> seems pretty obvious that Spark made the race.
>>>>>
>>>>> I concur with Ted, it would be great to have the communities work
>>>>> together. I know that at least 4 mahout committers (including me) are
>>>>> already following Spark's mailinglist and actively participating in the
>>>>> discussions.
>>>>>
>>>>> What are the ideas how a fruitful cooperation look like?
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> PS:
>>>>>
>>>>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>>>>> to Spark some time ago, but I haven't had time to test my code on a
>>> large
>>>>> dataset yet. I'd be happy to see someone help with that.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>>>
>>>>>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>>>>>> doing certain things, but we'd welcome as many Mahout devs as possible
>>> to
>>>>>> work together.
>>>>>>
>>>>>>
>>>>>> It may be too late, but perhaps a GSoC project to look at a port of
>>> some
>>>>>> stuff like co occurrence recommender and streaming k-means?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> N
>>>>>> --
>>>>>> Sent from Mailbox for iPhone
>>>>>>
>>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
>>>>>>> nick.pentreath@gmail.com>wrote:
>>>>>>>
>>>>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>>>>> overall
>>>>>>>> for ML. If the two communities can work together to leverage the
>>>>>>>> strengths
>>>>>>>> of Spark, and the large amount of good stuff in Mahout (as well as
>>> the
>>>>>>>> fantastic depth of experience of Mahout devs) I think a lot can be
>>>>>>>> achieved!
>>>>>>>>
>>>>>>>> It makes a lot of sense that Spark would be better than Hadoop for
>>> ML
>>>>>>> purposes given that Hadoop was intended to do web-crawl kinds of
>>> things
>>>>>>> and
>>>>>>> Spark was intentionally built to support machine learning.
>>>>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>>>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
>>>>>>> could
>>>>>>> work more closely together.  There is a lot of good to be had on both
>>>>>>> sides.
>>>

Re: Mahout on Spark?

Posted by Sean Owen <sr...@gmail.com>.

To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. If you haven't looked at the code, it may not be
obvious how completely dependent on M/R it is.

You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <ja...@gmail.com> wrote:
> +100 for this, different execution engines, like the direction  pig and crunch take
>
> Sent from my iPhone
>
>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <gk...@gmail.com> wrote:
>>
>> I imagine in Mahout offering an option to the users to select from
>> different execution engines (just like we currently do by giving M/R or
>> sequential options), and starting from Spark. I am not sure what changes
>> needed in the codebase, though. Maybe following MLI (or alike) and
>> implementing some more stuff, such as common interfaces for iterating over
>> data (the M/R way and the Spark way).
>>
>> IMO, another effort might be porting pre-online machine learning (such
>> transforming text into vector based on the dictionary generated by
>> seq2sparse before), machine learning based on mini-batches, and streaming
>> summarization stuff in Mahout to Spark-Streaming.
>>
>> Best,
>> Gokhan
>>
>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>>
>>> PS I am moving along cost optimizer for spark-backed DRMs on some
>>> multiplicative pipelines that is capable of figuring different cost-based
>>> rewrites and R-Like DSL that mixes in-core and distributed matrix
>>> representations and blocks but it is painfully slow, i really only doing it
>>> like couple nights in a month. It does not look like i will be doing it on
>>> company time any time soon (and even if i did, the company doesn't seem to
>>> be inclined to contribute anything I do anything new on their time). It is
>>> all painfully slow, there's no direct funding for it anywhere with no
>>> string attached. That probably will be primary reason why Mahout would not
>>> be able to get much traction compared to university-based contributions.
>>>
>>>
>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>> wrote:
>>>
>>>> Unfortunately methinks the prospects of something like Mahout/MLLib merge
>>>> seem very unlikely due to vastly diverged approach to the basics of
>>> linear
>>>> algebra (and other things). Just like one cannot grow single tree out of
>>>> two trunks -- not easily, anyway.
>>>>
>>>> It is fairly easy to port (and subsequently beat) MLib at this point from
>>>> collection of algorithms point of view. But IMO goal should be more
>>>> MLI-like first, and port second. And be very careful with concepts.
>>>> Something that i so far don't see happening with MLib. MLib seems to be
>>>> old-style Mahout-like rush to become a collection of basic algorithms
>>>> rather than coherent foundation. Admittedly, i havent looked very
>>> closely.
>>>>
>>>>
>>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ssc@apache.org
>>>> wrote:
>>>>
>>>>> I'm also convinced that Spark is a superior platform for executing
>>>>> distributed ML algorithms. We've had a discussion about a change from
>>>>> Hadoop to another platform some time ago, but at that point in time it
>>> was
>>>>> not clear which of the upcoming dataflow processing systems (Spark,
>>>>> Hyracks, Stratosphere) would establish itself amongst the users. To me
>>> it
>>>>> seems pretty obvious that Spark made the race.
>>>>>
>>>>> I concur with Ted, it would be great to have the communities work
>>>>> together. I know that at least 4 mahout committers (including me) are
>>>>> already following Spark's mailinglist and actively participating in the
>>>>> discussions.
>>>>>
>>>>> What are the ideas how a fruitful cooperation look like?
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> PS:
>>>>>
>>>>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>>>>> to Spark some time ago, but I haven't had time to test my code on a
>>> large
>>>>> dataset yet. I'd be happy to see someone help with that.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>>>
>>>>>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>>>>>> doing certain things, but we'd welcome as many Mahout devs as possible
>>> to
>>>>>> work together.
>>>>>>
>>>>>>
>>>>>> It may be too late, but perhaps a GSoC project to look at a port of
>>> some
>>>>>> stuff like co occurrence recommender and streaming k-means?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> N
>>>>>> --
>>>>>> Sent from Mailbox for iPhone
>>>>>>
>>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
>>>>>>> nick.pentreath@gmail.com>wrote:
>>>>>>>
>>>>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>>>>> overall
>>>>>>>> for ML. If the two communities can work together to leverage the
>>>>>>>> strengths
>>>>>>>> of Spark, and the large amount of good stuff in Mahout (as well as
>>> the
>>>>>>>> fantastic depth of experience of Mahout devs) I think a lot can be
>>>>>>>> achieved!
>>>>>>>>
>>>>>>>> It makes a lot of sense that Spark would be better than Hadoop for
>>> ML
>>>>>>> purposes given that Hadoop was intended to do web-crawl kinds of
>>> things
>>>>>>> and
>>>>>>> Spark was intentionally built to support machine learning.
>>>>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>>>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
>>>>>>> could
>>>>>>> work more closely together.  There is a lot of good to be had on both
>>>>>>> sides.
>>>

Re: Mahout on Spark?

Posted by Jay Vyas <ja...@gmail.com>.

+100 for this: different execution engines, like the direction Pig and Crunch take.

> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <gk...@gmail.com> wrote:
> 
> I imagine in Mahout offering an option to the users to select from
> different execution engines (just like we currently do by giving M/R or
> sequential options), and starting from Spark. I am not sure what changes
> needed in the codebase, though. Maybe following MLI (or alike) and
> implementing some more stuff, such as common interfaces for iterating over
> data (the M/R way and the Spark way).
> 
> IMO, another effort might be porting pre-online machine learning (such
> transforming text into vector based on the dictionary generated by
> seq2sparse before), machine learning based on mini-batches, and streaming
> summarization stuff in Mahout to Spark-Streaming.
> 
> Best,
> Gokhan
> 
> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
> 
>> PS I am moving along cost optimizer for spark-backed DRMs on some
>> multiplicative pipelines that is capable of figuring different cost-based
>> rewrites and R-Like DSL that mixes in-core and distributed matrix
>> representations and blocks but it is painfully slow, i really only doing it
>> like couple nights in a month. It does not look like i will be doing it on
>> company time any time soon (and even if i did, the company doesn't seem to
>> be inclined to contribute anything I do anything new on their time). It is
>> all painfully slow, there's no direct funding for it anywhere with no
>> string attached. That probably will be primary reason why Mahout would not
>> be able to get much traction compared to university-based contributions.
>> 
>> 
>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>> wrote:
>> 
>>> Unfortunately methinks the prospects of something like Mahout/MLLib merge
>>> seem very unlikely due to vastly diverged approach to the basics of
>> linear
>>> algebra (and other things). Just like one cannot grow single tree out of
>>> two trunks -- not easily, anyway.
>>> 
>>> It is fairly easy to port (and subsequently beat) MLib at this point from
>>> collection of algorithms point of view. But IMO goal should be more
>>> MLI-like first, and port second. And be very careful with concepts.
>>> Something that i so far don't see happening with MLib. MLib seems to be
>>> old-style Mahout-like rush to become a collection of basic algorithms
>>> rather than coherent foundation. Admittedly, i havent looked very
>> closely.
>>> 
>>> 
>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ssc@apache.org
>>> wrote:
>>> 
>>>> I'm also convinced that Spark is a superior platform for executing
>>>> distributed ML algorithms. We've had a discussion about a change from
>>>> Hadoop to another platform some time ago, but at that point in time it
>> was
>>>> not clear which of the upcoming dataflow processing systems (Spark,
>>>> Hyracks, Stratosphere) would establish itself amongst the users. To me
>> it
>>>> seems pretty obvious that Spark made the race.
>>>> 
>>>> I concur with Ted, it would be great to have the communities work
>>>> together. I know that at least 4 mahout committers (including me) are
>>>> already following Spark's mailinglist and actively participating in the
>>>> discussions.
>>>> 
>>>> What are the ideas how a fruitful cooperation look like?
>>>> 
>>>> Best,
>>>> Sebastian
>>>> 
>>>> PS:
>>>> 
>>>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>>>> to Spark some time ago, but I haven't had time to test my code on a
>> large
>>>> dataset yet. I'd be happy to see someone help with that.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>> 
>>>>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>>>>> doing certain things, but we'd welcome as many Mahout devs as possible
>> to
>>>>> work together.
>>>>> 
>>>>> 
>>>>> It may be too late, but perhaps a GSoC project to look at a port of
>> some
>>>>> stuff like co occurrence recommender and streaming k-means?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> N
>>>>> --
>>>>> Sent from Mailbox for iPhone
>>>>> 
>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
>>>>>> nick.pentreath@gmail.com>wrote:
>>>>>> 
>>>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>>>> overall
>>>>>>> for ML. If the two communities can work together to leverage the
>>>>>>> strengths
>>>>>>> of Spark, and the large amount of good stuff in Mahout (as well as
>> the
>>>>>>> fantastic depth of experience of Mahout devs) I think a lot can be
>>>>>>> achieved!
>>>>>>> 
>>>>>>> It makes a lot of sense that Spark would be better than Hadoop for
>> ML
>>>>>> purposes given that Hadoop was intended to do web-crawl kinds of
>> things
>>>>>> and
>>>>>> Spark was intentionally built to support machine learning.
>>>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
>>>>>> could
>>>>>> work more closely together.  There is a lot of good to be had on both
>>>>>> sides.
>>

Re: Mahout on Spark?

Posted by Gokhan Capan <gk...@gmail.com>.

I imagine Mahout offering users an option to select from different
execution engines (just as we currently do by offering M/R or sequential
options), starting with Spark. I am not sure what changes would be
needed in the codebase, though. Maybe following MLI (or the like) and
implementing some more stuff, such as common interfaces for iterating over
data (the M/R way and the Spark way).
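
As a rough illustration of the "common interfaces" idea, here is a minimal
sketch in plain Python. Every name in it (ExecutionEngine, SequentialEngine,
PartitionedEngine, item_counts) is hypothetical and is not a Mahout, Spark, or
MLI API; the partitioned engine only simulates a distributed backend in one
process. The point is that the algorithm is written once against the interface
and the engine is chosen by the caller.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple

KeyValue = Tuple[Hashable, int]


class ExecutionEngine(ABC):
    """Hypothetical engine interface; not a Mahout, Spark, or MLI API."""

    @abstractmethod
    def map_reduce_by_key(
        self,
        records: Iterable,
        mapper: Callable[[object], Iterable[KeyValue]],
        reducer: Callable[[int, int], int],
    ) -> Dict[Hashable, int]:
        ...


def _fold(pairs, reducer, into: Optional[dict] = None) -> dict:
    # Combine (key, value) pairs into a dict using an associative reducer.
    out = {} if into is None else into
    for key, value in pairs:
        out[key] = reducer(out[key], value) if key in out else value
    return out


class SequentialEngine(ExecutionEngine):
    """Single pass in one process, like a 'sequential' option."""

    def map_reduce_by_key(self, records, mapper, reducer):
        return _fold((kv for r in records for kv in mapper(r)), reducer)


class PartitionedEngine(ExecutionEngine):
    """Stand-in for a distributed backend: reduce per partition, then merge."""

    def __init__(self, num_partitions: int = 2):
        self.num_partitions = num_partitions

    def map_reduce_by_key(self, records, mapper, reducer):
        records = list(records)
        parts = [records[i::self.num_partitions] for i in range(self.num_partitions)]
        # Each partition is reduced independently, as separate workers would do.
        partials = [
            _fold((kv for r in part for kv in mapper(r)), reducer) for part in parts
        ]
        merged: dict = {}
        for partial in partials:
            _fold(partial.items(), reducer, into=merged)
        return merged


def item_counts(engine: ExecutionEngine, baskets):
    """An 'algorithm' written once against the interface."""
    return engine.map_reduce_by_key(
        baskets,
        mapper=lambda basket: ((item, 1) for item in basket),
        reducer=lambda a, b: a + b,
    )


baskets = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
print(item_counts(SequentialEngine(), baskets))
print(item_counts(PartitionedEngine(num_partitions=2), baskets))
```

Both calls should print the same counts. The sketch only works because the
reducer is associative and commutative, which is the kind of constraint a real
common interface would have to make explicit.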

IMO, another effort might be porting pre-online machine learning (such as
transforming text into vectors based on the dictionary previously generated
by seq2sparse), machine learning based on mini-batches, and the streaming
summarization stuff in Mahout to Spark Streaming.

Best,
Gokhan

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> PS I am moving along cost optimizer for spark-backed DRMs on some
> multiplicative pipelines that is capable of figuring different cost-based
> rewrites and R-Like DSL that mixes in-core and distributed matrix
> representations and blocks but it is painfully slow, i really only doing it
> like couple nights in a month. It does not look like i will be doing it on
> company time any time soon (and even if i did, the company doesn't seem to
> be inclined to contribute anything I do anything new on their time). It is
> all painfully slow, there's no direct funding for it anywhere with no
> string attached. That probably will be primary reason why Mahout would not
> be able to get much traction compared to university-based contributions.
>
>
> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >wrote:
>
> > Unfortunately methinks the prospects of something like Mahout/MLLib merge
> > seem very unlikely due to vastly diverged approach to the basics of
> linear
> > algebra (and other things). Just like one cannot grow single tree out of
> > two trunks -- not easily, anyway.
> >
> > It is fairly easy to port (and subsequently beat) MLib at this point from
> > collection of algorithms point of view. But IMO goal should be more
> > MLI-like first, and port second. And be very careful with concepts.
> > Something that i so far don't see happening with MLib. MLib seems to be
> > old-style Mahout-like rush to become a collection of basic algorithms
> > rather than coherent foundation. Admittedly, i havent looked very
> closely.
> >
> >
> > On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ssc@apache.org
> >wrote:
> >
> >> I'm also convinced that Spark is a superior platform for executing
> >> distributed ML algorithms. We've had a discussion about a change from
> >> Hadoop to another platform some time ago, but at that point in time it
> was
> >> not clear which of the upcoming dataflow processing systems (Spark,
> >> Hyracks, Stratosphere) would establish itself amongst the users. To me
> it
> >> seems pretty obvious that Spark made the race.
> >>
> >> I concur with Ted, it would be great to have the communities work
> >> together. I know that at least 4 mahout committers (including me) are
> >> already following Spark's mailinglist and actively participating in the
> >> discussions.
> >>
> >> What are the ideas how a fruitful cooperation look like?
> >>
> >> Best,
> >> Sebastian
> >>
> >> PS:
> >>
> >> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
> >> to Spark some time ago, but I haven't had time to test my code on a
> large
> >> dataset yet. I'd be happy to see someone help with that.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
> >>
> >>> I know the Spark/Mllib devs can occasionally be quite set in ways of
> >>> doing certain things, but we'd welcome as many Mahout devs as possible
> to
> >>> work together.
> >>>
> >>>
> >>> It may be too late, but perhaps a GSoC project to look at a port of
> some
> >>> stuff like co occurrence recommender and streaming k-means?
> >>>
> >>>
> >>>
> >>>
> >>> N
> >>> --
> >>> Sent from Mailbox for iPhone
> >>>
> >>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
> >>> wrote:
> >>>
> >>>  On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
> >>>> nick.pentreath@gmail.com>wrote:
> >>>>
> >>>>> My (admittedly heavily biased) view is Spark is a superior platform
> >>>>> overall
> >>>>> for ML. If the two communities can work together to leverage the
> >>>>> strengths
> >>>>> of Spark, and the large amount of good stuff in Mahout (as well as
> the
> >>>>> fantastic depth of experience of Mahout devs) I think a lot can be
> >>>>> achieved!
> >>>>>
> >>>>>  It makes a lot of sense that Spark would be better than Hadoop for
> ML
> >>>> purposes given that Hadoop was intended to do web-crawl kinds of
> things
> >>>> and
> >>>> Spark was intentionally built to support machine learning.
> >>>> Given that Spark has been announced by a majority of the Hadoop-based
> >>>> distribution vendors, it makes sense that maybe Mahout should jump in.
> >>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
> >>>> could
> >>>> work more closely together.  There is a lot of good to be had on both
> >>>> sides.
> >>>>
> >>>
> >>
> >
>

Re: Mahout on Spark?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

PS I am moving along on a cost optimizer for Spark-backed DRMs on some
multiplicative pipelines, one that is capable of figuring out different
cost-based rewrites, and on an R-like DSL that mixes in-core and distributed
matrix representations and blocks. But it is painfully slow; I am really only
doing it a couple of nights a month. It does not look like I will be doing it
on company time any time soon (and even if I did, the company doesn't seem
inclined to contribute anything new I do on their time). It is all painfully
slow; there's no direct funding for it anywhere with no strings attached.
That will probably be the primary reason why Mahout would not be able to get
much traction compared to university-based contributions.


On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> Unfortunately methinks the prospects of something like Mahout/MLLib merge
> seem very unlikely due to vastly diverged approach to the basics of linear
> algebra (and other things). Just like one cannot grow single tree out of
> two trunks -- not easily, anyway.
>
> It is fairly easy to port (and subsequently beat) MLib at this point from
> collection of algorithms point of view. But IMO goal should be more
> MLI-like first, and port second. And be very careful with concepts.
> Something that i so far don't see happening with MLib. MLib seems to be
> old-style Mahout-like rush to become a collection of basic algorithms
> rather than coherent foundation. Admittedly, i havent looked very closely.
>
>
> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ss...@apache.org>wrote:
>
>> I'm also convinced that Spark is a superior platform for executing
>> distributed ML algorithms. We've had a discussion about a change from
>> Hadoop to another platform some time ago, but at that point in time it was
>> not clear which of the upcoming dataflow processing systems (Spark,
>> Hyracks, Stratosphere) would establish itself amongst the users. To me it
>> seems pretty obvious that Spark made the race.
>>
>> I concur with Ted, it would be great to have the communities work
>> together. I know that at least 4 mahout committers (including me) are
>> already following Spark's mailinglist and actively participating in the
>> discussions.
>>
>> What are the ideas how a fruitful cooperation look like?
>>
>> Best,
>> Sebastian
>>
>> PS:
>>
>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>> to Spark some time ago, but I haven't had time to test my code on a large
>> dataset yet. I'd be happy to see someone help with that.
>>
>>
>>
>>
>>
>>
>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>
>>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>>> doing certain things, but we'd welcome as many Mahout devs as possible to
>>> work together.
>>>
>>>
>>> It may be too late, but perhaps a GSoC project to look at a port of some
>>> stuff like co occurrence recommender and streaming k-means?
>>>
>>>
>>>
>>>
>>> N
>>> --
>>> Sent from Mailbox for iPhone
>>>
>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>
>>>  On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
>>>> nick.pentreath@gmail.com>wrote:
>>>>
>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>> overall
>>>>> for ML. If the two communities can work together to leverage the
>>>>> strengths
>>>>> of Spark, and the large amount of good stuff in Mahout (as well as the
>>>>> fantastic depth of experience of Mahout devs) I think a lot can be
>>>>> achieved!
>>>>>
>>>>>  It makes a lot of sense that Spark would be better than Hadoop for ML
>>>> purposes given that Hadoop was intended to do web-crawl kinds of things
>>>> and
>>>> Spark was intentionally built to support machine learning.
>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
>>>> could
>>>> work more closely together.  There is a lot of good to be had on both
>>>> sides.
>>>>
>>>
>>
>

Re: Mahout on Spark?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Unfortunately, methinks the prospects of something like a Mahout/MLlib merge
seem very unlikely due to vastly diverged approaches to the basics of linear
algebra (and other things). Just like one cannot grow a single tree out of
two trunks -- not easily, anyway.

It is fairly easy to port (and subsequently beat) MLlib at this point from
a collection-of-algorithms point of view. But IMO the goal should be to be
more MLI-like first, and port second. And be very careful with concepts,
something that I so far don't see happening with MLlib. MLlib seems to be an
old-style Mahout-like rush to become a collection of basic algorithms
rather than a coherent foundation. Admittedly, I haven't looked very closely.


On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ss...@apache.org> wrote:

> I'm also convinced that Spark is a superior platform for executing
> distributed ML algorithms. We've had a discussion about a change from
> Hadoop to another platform some time ago, but at that point in time it was
> not clear which of the upcoming dataflow processing systems (Spark,
> Hyracks, Stratosphere) would establish itself amongst the users. To me it
> seems pretty obvious that Spark made the race.
>
> I concur with Ted, it would be great to have the communities work
> together. I know that at least 4 mahout committers (including me) are
> already following Spark's mailinglist and actively participating in the
> discussions.
>
> What are the ideas how a fruitful cooperation look like?
>
> Best,
> Sebastian
>
> PS:
>
> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
> to Spark some time ago, but I haven't had time to test my code on a large
> dataset yet. I'd be happy to see someone help with that.
>
>
>
>
>
>
> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>
>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>> doing certain things, but we'd welcome as many Mahout devs as possible to
>> work together.
>>
>>
>> It may be too late, but perhaps a GSoC project to look at a port of some
>> stuff like co occurrence recommender and streaming k-means?
>>
>>
>>
>>
>> N
>> --
>> Sent from Mailbox for iPhone
>>
>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>>  On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
>>> nick.pentreath@gmail.com>wrote:
>>>
>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>> overall
>>>> for ML. If the two communities can work together to leverage the
>>>> strengths
>>>> of Spark, and the large amount of good stuff in Mahout (as well as the
>>>> fantastic depth of experience of Mahout devs) I think a lot can be
>>>> achieved!
>>>>
>>>>  It makes a lot of sense that Spark would be better than Hadoop for ML
>>> purposes given that Hadoop was intended to do web-crawl kinds of things
>>> and
>>> Spark was intentionally built to support machine learning.
>>> Given that Spark has been announced by a majority of the Hadoop-based
>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
>>> could
>>> work more closely together.  There is a lot of good to be had on both
>>> sides.
>>>
>>
>

Re: Mahout on Spark?

Posted by Sebastian Schelter <ss...@apache.org>.

I'm also convinced that Spark is a superior platform for executing
distributed ML algorithms. We had a discussion about a change from
Hadoop to another platform some time ago, but at that point in time it
was not clear which of the upcoming dataflow processing systems (Spark,
Hyracks, Stratosphere) would establish itself among users. To me
it seems pretty obvious that Spark has won the race.

I concur with Ted, it would be great to have the communities work
together. I know that at least 4 Mahout committers (including me) are
already following Spark's mailing list and actively participating in the
discussions.

What are the ideas for what a fruitful cooperation could look like?

Best,
Sebastian

PS:

I ported LLR-based cooccurrence analysis (aka item-based recommendation) 
to Spark some time ago, but I haven't had time to test my code on a 
large dataset yet. I'd be happy to see someone help with that.
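
For readers unfamiliar with LLR-based cooccurrence scoring, here is a minimal
single-machine sketch in plain Python. It assumes the standard log-likelihood
ratio (G^2) statistic on a 2x2 contingency table of "item A present / absent"
versus "item B present / absent"; whether the ported Spark code is organized
this way is an assumption, and the function names are made up for
illustration.

```python
import math
from collections import Counter
from itertools import combinations


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table.

    k11: both items, k12: A without B, k21: B without A, k22: neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def cooccurrence_llr(baskets):
    """Score every item pair that cooccurs in at least one basket."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    scores = {}
    for (a, b), k11 in pair_counts.items():
        k12 = item_counts[a] - k11   # baskets with a but not b
        k21 = item_counts[b] - k11   # baskets with b but not a
        k22 = n - k11 - k12 - k21    # baskets with neither
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


baskets = [["a", "b"], ["a", "b"], ["c"], ["c"], ["a", "c"]]
for pair, score in sorted(cooccurrence_llr(baskets).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A high score means the two items appear together more (or less) often than
independence would predict. A distributed version would have to compute the
item and pair counts in parallel, which is where testing on a large dataset
matters.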

On 02/19/2014 08:04 AM, Nick Pentreath wrote:
> I know the Spark/Mllib devs can occasionally be quite set in ways of doing certain things, but we'd welcome as many Mahout devs as possible to work together.
>
>
> It may be too late, but perhaps a GSoC project to look at a port of some stuff like co occurrence recommender and streaming k-means?
>
>
>
>
> N
> —
> Sent from Mailbox for iPhone
>
> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <ni...@gmail.com>wrote:
>>> My (admittedly heavily biased) view is Spark is a superior platform overall
>>> for ML. If the two communities can work together to leverage the strengths
>>> of Spark, and the large amount of good stuff in Mahout (as well as the
>>> fantastic depth of experience of Mahout devs) I think a lot can be
>>> achieved!
>>>
>> It makes a lot of sense that Spark would be better than Hadoop for ML
>> purposes given that Hadoop was intended to do web-crawl kinds of things and
>> Spark was intentionally built to support machine learning.
>> Given that Spark has been announced by a majority of the Hadoop-based
>> distribution vendors, it makes sense that maybe Mahout should jump in.
>> I really would prefer it if the two communities (MLib/MLI and Mahout) could
>> work more closely together.  There is a lot of good to be had on both sides.

Re: Mahout on Spark?

Posted by Nick Pentreath <ni...@gmail.com>.

I know the Spark/MLlib devs can occasionally be quite set in their ways of doing certain things, but we'd welcome as many Mahout devs as possible to work together.


It may be too late, but perhaps a GSoC project could look at porting things like the co-occurrence recommender and streaming k-means?




N

On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com>
wrote:

> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <ni...@gmail.com> wrote:
>> My (admittedly heavily biased) view is Spark is a superior platform overall
>> for ML. If the two communities can work together to leverage the strengths
>> of Spark, and the large amount of good stuff in Mahout (as well as the
>> fantastic depth of experience of Mahout devs) I think a lot can be
>> achieved!
>>
> It makes a lot of sense that Spark would be better than Hadoop for ML
> purposes given that Hadoop was intended to do web-crawl kinds of things and
> Spark was intentionally built to support machine learning.
> Given that Spark has been announced by a majority of the Hadoop-based
> distribution vendors, it makes sense that maybe Mahout should jump in.
> I really would prefer it if the two communities (MLlib/MLI and Mahout) could
> work more closely together.  There is a lot of good to be had on both sides.

Re: Mahout on Spark?

Posted by Ted Dunning <te...@gmail.com>.

On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <ni...@gmail.com> wrote:

> My (admittedly heavily biased) view is Spark is a superior platform overall
> for ML. If the two communities can work together to leverage the strengths
> of Spark, and the large amount of good stuff in Mahout (as well as the
> fantastic depth of experience of Mahout devs) I think a lot can be
> achieved!
>

It makes a lot of sense that Spark would be better than Hadoop for ML
purposes given that Hadoop was intended to do web-crawl kinds of things and
Spark was intentionally built to support machine learning.

Given that Spark has been announced by a majority of the Hadoop-based
distribution vendors, it makes sense that maybe Mahout should jump in.

I really would prefer it if the two communities (MLlib/MLI and Mahout) could
work more closely together.  There is a lot of good to be had on both sides.

Re: Mahout on Spark?

Posted by Nick Pentreath <ni...@gmail.com>.

Spark provides a "lower-level" ML library called MLlib. MLI / MLBase is
built on top of this and includes some high-level abstractions similar in
nature to distributed matrices / dataframes. But it's still pretty new and
rough at this point (https://github.com/amplab/MLI).
MLlib already provides (
https://github.com/apache/incubator-spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib
):
- regression / classification (Log loss, SVM, squared loss with L1 / L2
regularization) via SGD
- soon will have decision trees / random forests
- clustering (K-Means)
- recommendations (Alternating Least Squares)
- SVD
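
To make the "log loss with L2 regularization via SGD" item concrete, here is a minimal single-machine sketch in plain Python. It is not MLlib code and does not use the MLlib API; the names `sgd_logistic` and `predict`, the step size, the regularization constant, and the toy data are all assumptions chosen for illustration.

```python
import math
import random


def sgd_logistic(points, labels, step=0.5, reg=0.01, epochs=200, seed=0):
    """Plain SGD on log loss with an L2 penalty; labels are 0 or 1."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            x, y = points[i], labels[i]
            margin = sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-margin))  # predicted P(y = 1)
            # Gradient of the log loss is (p - y) * x; the L2 term adds reg * w.
            w = [wj - step * ((p - y) * xj + reg * wj) for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Classify as 1 when the linear score is positive."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0


if __name__ == "__main__":
    # First feature is a constant bias term; the second separates the classes.
    points = [[1.0, -3.0], [1.0, -2.0], [1.0, 2.0], [1.0, 3.0]]
    labels = [0, 0, 1, 1]
    w = sgd_logistic(points, labels)
    print([predict(w, x) for x in points])
```

Swapping the `(p - y) * xj` term for the gradient of hinge or squared loss gives the other two variants in the list; the distributed version differs mainly in how gradients are aggregated across partitions.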

In terms of implementations, IMO:
- MLlib's ALS is superior to Mahout's because it is block-distributed and so
more efficient. Dmitriy has some things he has been working on that indicate
that a GraphX implementation may be even more efficient.
- A big downside currently is the lack of sparse support (hopefully coming soon
in https://github.com/apache/incubator-spark/pull/575)
- K-means is probably a wash in terms of algorithm, though Spark will
probably be faster due to caching and maybe slightly better due to init
algorithm
- Mahout has the Streaming K-Means which is neat, and would be a cool
addition to MLlib
- Mahout has co-occurrence-based recommender stuff which, though I've not
used it, seems very good in practice and again should not be too crazy to port
in principle
- I think Mahout has a few more linear algebra implementations, though
MLlib includes SVD
- Mahout has the various integration layers for recommender stuff in
particular
- Mahout has more in terms of featurizing (text, analyzers, hashing etc) -
though MLI provides some of this
- Mahout has more in terms of analysis of model performance (like various
evaluation metrics)
- Mahout has more in terms of things like analysis/summarizers and Ted's
new t-digest (though with some monoid-ification this can be applied in
Spark fairly trivially)
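
As a rough illustration of what "monoid-ification" buys you, here is a sketch in plain Python of a summary with an empty value and an associative merge, so that per-partition summaries can be combined in any grouping. It is not t-digest and not Mahout or Spark code: it tracks only count, mean and sum of squared deviations, and the names `EMPTY`, `of`, `merge` and `summarize` are hypothetical.

```python
from functools import reduce

# A summary is (count, mean, sum of squared deviations from the mean).
EMPTY = (0, 0.0, 0.0)


def of(x):
    """Summary of a single value."""
    return (1, float(x), 0.0)


def merge(a, b):
    """Combine two summaries; associative, with EMPTY as the identity."""
    na, ma, sa = a
    nb, mb, sb = b
    n = na + nb
    if n == 0:
        return EMPTY
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = sa + sb + delta * delta * na * nb / n
    return (n, mean, m2)


def summarize(values):
    """Fold a sequence of values into one summary."""
    return reduce(merge, (of(v) for v in values), EMPTY)


if __name__ == "__main__":
    # Two "partitions" summarized separately, then merged.
    left = summarize([1, 2, 3])
    right = summarize([4, 5])
    print(merge(left, right))
    print(summarize([1, 2, 3, 4, 5]))
```

Any structure with this shape fits a per-partition fold followed by a reduce, which is the sense in which a mergeable sketch could be applied on Spark.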

It would be really cool to see if a Spark backend for Mahout could be
developed (I know Dmitriy has looked at this in respect of
DistributedMatrix stuff), or at least parts ported over to Spark. A very
big potential pain point is if Spark doesn't adopt mahout-math (which seems
to be the case at the moment, though it is undecided). Still, notwithstanding this I
feel a lot of stuff from Mahout can be adapted to Spark without necessarily
needing a total overhaul.

My (admittedly heavily biased) view is Spark is a superior platform overall
for ML. If the two communities can work together to leverage the strengths
of Spark, and the large amount of good stuff in Mahout (as well as the
fantastic depth of experience of Mahout devs) I think a lot can be achieved!

N


On Tue, Feb 18, 2014 at 11:17 PM, Mohit Singh <mo...@gmail.com> wrote:

> In general, if you are interested in machine learning, I think there is
> already a machine-learning-specific initiative on Spark called MLbase (
> http://www.mlbase.org/)
> and GraphX (http://amplab.github.io/graphx/) for GraphLab-style ML.
>
>
>
>
>
> On Tue, Feb 18, 2014 at 1:14 PM, Harshit Bapna <hr...@gmail.com> wrote:
>
> > I am very eager to know the same from the community.
> > Thanks for bringing it up.
> >
> > --Harshit
> >
> >
> > On Tue, Feb 18, 2014 at 1:08 PM, Ying Liao <yl...@gmail.com> wrote:
> >
> > > Just wonder what is the future of Mahout. We are seeing new stuff from
> > > 0xdata and skytree. And spark is also design for in-memory iterative
> > > analysis. What about mahout? Will mahout run on top of spark in future?
> > >
> > > Thanks,
> > > Ying Liao
> > >
> >
> >
> >
> > --
> > --Harshit
> >
>
>
>
> --
> Mohit
>
> "When you want success as badly as you want the air, then you will get it.
> There is no other secret of success."
> -Socrates
>

Re: Mahout on Spark?

Posted by Mohit Singh <mo...@gmail.com>.

In general, if you are interested in machine learning, I think there is
already a machine-learning-specific initiative on Spark called MLbase (
http://www.mlbase.org/)
and GraphX (http://amplab.github.io/graphx/) for GraphLab-style ML.

On Tue, Feb 18, 2014 at 1:14 PM, Harshit Bapna <hr...@gmail.com> wrote:

> I am very eager to know the same from the community.
> Thanks for bringing it up.
>
> --Harshit
>
>
> On Tue, Feb 18, 2014 at 1:08 PM, Ying Liao <yl...@gmail.com> wrote:
>
> > Just wonder what is the future of Mahout. We are seeing new stuff from
> > 0xdata and skytree. And spark is also design for in-memory iterative
> > analysis. What about mahout? Will mahout run on top of spark in future?
> >
> > Thanks,
> > Ying Liao
> >
>
>
>
> --
> --Harshit
>

-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates

Re: Mahout on Spark?

Posted by Harshit Bapna <hr...@gmail.com>.

I am very eager to know the same from the community.
Thanks for bringing it up.

--Harshit

On Tue, Feb 18, 2014 at 1:08 PM, Ying Liao <yl...@gmail.com> wrote:

> Just wonder what is the future of Mahout. We are seeing new stuff from
> 0xdata and skytree. And spark is also design for in-memory iterative
> analysis. What about mahout? Will mahout run on top of spark in future?
>
> Thanks,
> Ying Liao
>

-- 
--Harshit

Re: Mahout on Spark?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Yes, this is a popular initiative.


On Tue, Feb 18, 2014 at 1:08 PM, Ying Liao <yl...@gmail.com> wrote:

> Just wonder what is the future of Mahout. We are seeing new stuff from
> 0xdata and skytree. And spark is also design for in-memory iterative
> analysis. What about mahout? Will mahout run on top of spark in future?
>
> Thanks,
> Ying Liao
>