
Posted to dev@mahout.apache.org by Gokhan Capan <gk...@gmail.com> on 2014/02/19 11:19:42 UTC

Re: Mahout on Spark?

I imagine Mahout offering users the option to select from
different execution engines (just as we currently do with the M/R and
sequential options), starting with Spark. I am not sure what changes
would be needed in the codebase, though. Maybe we could follow MLI (or
the like) and implement some more pieces, such as common interfaces for
iterating over data (the M/R way and the Spark way).
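
As a rough illustration of the "common interface" idea, here is a minimal
Python sketch. It is not Mahout or MLI code: `ExecutionEngine`,
`SequentialEngine`, `ThreadedEngine` and `column_sums` are hypothetical
names, and the thread pool merely stands in for a distributed backend such
as M/R or Spark. The point is only that the algorithm is written once
against the interface and the engine is chosen by the caller.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ExecutionEngine(ABC):
    """Hypothetical engine interface: map every record, then fold the results."""

    @abstractmethod
    def map_reduce(self, records, map_fn, reduce_fn):
        ...


class SequentialEngine(ExecutionEngine):
    """Runs everything in the calling thread."""

    def map_reduce(self, records, map_fn, reduce_fn):
        return reduce(reduce_fn, map(map_fn, records))


class ThreadedEngine(ExecutionEngine):
    """Stand-in for a distributed backend (assumption: not a real cluster)."""

    def __init__(self, workers=4):
        self.workers = workers

    def map_reduce(self, records, map_fn, reduce_fn):
        # The map phase runs on a thread pool; the fold happens locally.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            mapped = list(pool.map(map_fn, records))
        return reduce(reduce_fn, mapped)


def column_sums(engine, rows):
    """Algorithm written once against the interface, whatever the engine."""
    return engine.map_reduce(
        rows,
        lambda row: list(row),
        lambda a, b: [x + y for x, y in zip(a, b)],
    )
```

Calling `column_sums(SequentialEngine(), rows)` and
`column_sums(ThreadedEngine(), rows)` should give the same answer, which is
the property a pluggable-engine design would need to preserve.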

IMO, another effort might be porting pre-online machine learning (such as
transforming text into vectors based on the dictionary previously
generated by seq2sparse), machine learning based on mini-batches, and the
streaming summarization stuff in Mahout to Spark Streaming.
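
To make the dictionary-based step concrete, here is a simplified sketch of
turning incoming text into sparse vectors with a dictionary that was built
earlier. It is an assumption-laden toy, not what seq2sparse does internally:
tokenization is plain whitespace splitting, weights are raw term counts, and
`build_dictionary` and `vectorize` are hypothetical helper names.

```python
def build_dictionary(documents):
    """Assign each distinct token an integer index, in first-seen order."""
    dictionary = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in dictionary:
                dictionary[token] = len(dictionary)
    return dictionary


def vectorize(text, dictionary):
    """Map text to a sparse term-count vector {index: count}.

    Tokens that are missing from the dictionary are dropped, which is what
    lets a fixed, pre-built dictionary be applied to a stream of new text.
    """
    vector = {}
    for token in text.lower().split():
        index = dictionary.get(token)
        if index is not None:
            vector[index] = vector.get(index, 0) + 1
    return vector
```

In a streaming setting the dictionary would be built once in a batch job and
then shared with the workers, which apply `vectorize` to each arriving
record.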

Best,
Gokhan

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> PS I am moving along with a cost optimizer for Spark-backed DRMs on some
> multiplicative pipelines, one capable of figuring out different cost-based
> rewrites, and with an R-like DSL that mixes in-core and distributed matrix
> representations and blocks. But it is painfully slow; I am really only doing
> it a couple of nights a month. It does not look like I will be doing it on
> company time any time soon (and even if I did, the company doesn't seem
> inclined to contribute anything new I do on their time). It is all
> painfully slow; there is no direct funding for it anywhere with no
> strings attached. That will probably be the primary reason why Mahout would
> not be able to get much traction compared to university-based contributions.
>
>
> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> > Unfortunately, methinks the prospects of something like a Mahout/MLlib merge
> > seem very unlikely due to vastly diverged approaches to the basics of linear
> > algebra (and other things). Just like one cannot grow a single tree out of
> > two trunks -- not easily, anyway.
> >
> > It is fairly easy to port (and subsequently beat) MLlib at this point from
> > a collection-of-algorithms point of view. But IMO the goal should be to be
> > more MLI-like first, and to port second. And to be very careful with concepts,
> > something that I so far don't see happening with MLlib. MLlib seems to be an
> > old-style Mahout-like rush to become a collection of basic algorithms
> > rather than a coherent foundation. Admittedly, I haven't looked very closely.
> >
> >
> > On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <ssc@apache.org> wrote:
> >
> >> I'm also convinced that Spark is a superior platform for executing
> >> distributed ML algorithms. We've had a discussion about a change from
> >> Hadoop to another platform some time ago, but at that point in time it was
> >> not clear which of the upcoming dataflow processing systems (Spark,
> >> Hyracks, Stratosphere) would establish itself amongst the users. To me it
> >> seems pretty obvious that Spark won the race.
> >>
> >> I concur with Ted, it would be great to have the communities work
> >> together. I know that at least 4 Mahout committers (including me) are
> >> already following Spark's mailing list and actively participating in the
> >> discussions.
> >>
> >> What ideas are there for what a fruitful cooperation could look like?
> >>
> >> Best,
> >> Sebastian
> >>
> >> PS:
> >>
> >> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
> >> to Spark some time ago, but I haven't had time to test my code on a large
> >> dataset yet. I'd be happy to see someone help with that.
> >>
> >> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
> >>
> >>> I know the Spark/MLlib devs can occasionally be quite set in their ways of
> >>> doing certain things, but we'd welcome as many Mahout devs as possible to
> >>> work together.
> >>>
> >>> It may be too late, but perhaps a GSoC project could look at a port of some
> >>> stuff like the co-occurrence recommender and streaming k-means?
> >>>
> >>> N
> >>> --
> >>> Sent from Mailbox for iPhone
> >>>
> >>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <te...@gmail.com> wrote:
> >>>
> >>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
> >>>>
> >>>>> My (admittedly heavily biased) view is Spark is a superior platform overall
> >>>>> for ML. If the two communities can work together to leverage the strengths
> >>>>> of Spark, and the large amount of good stuff in Mahout (as well as the
> >>>>> fantastic depth of experience of Mahout devs) I think a lot can be
> >>>>> achieved!
> >>>>
> >>>> It makes a lot of sense that Spark would be better than Hadoop for ML
> >>>> purposes given that Hadoop was intended to do web-crawl kinds of things and
> >>>> Spark was intentionally built to support machine learning.
> >>>> Given that Spark has been announced by a majority of the Hadoop-based
> >>>> distribution vendors, it makes sense that maybe Mahout should jump in.
> >>>> I really would prefer it if the two communities (MLlib/MLI and Mahout) could
> >>>> work more closely together.  There is a lot of good to be had on both
> >>>> sides.
> >>>>
> >>>
> >>
> >
>
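
For readers unfamiliar with the LLR-based cooccurrence analysis mentioned in
Sebastian's quoted PS, the score at its core is the log-likelihood ratio
(G^2) of a 2x2 contingency table. The sketch below follows the commonly
described entropy formulation; it is not Sebastian's Spark port, and the
function names are my own. Here `k11` is assumed to count users who
interacted with both items, `k12` and `k21` those who interacted with only
one of the two, and `k22` those who interacted with neither.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a collection of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic of the 2x2 table [[k11, k12], [k21, k22]]."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))
```

A table whose rows and columns are independent scores close to zero, while a
table in which the two items almost always occur together scores high; an
item-based recommender would keep only the highest-scoring pairs.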

Re: Mahout on Spark?

Posted by Suneel Marthi <su...@yahoo.com>.

On Wednesday, February 19, 2014 7:22 PM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Feb 19, 2014 at 1:55 PM, peng <pc...@uowmail.edu.au> wrote:
>
> > But maybe mahout can include contribs that M/R is not fit for, like
> > downpour SGD or graph-based algorithms?
>
> Yes.  Absolutely.

Downpour SGD is #1 on my list of features for 1.0. I will start working on it once the MultiLayer Perceptron is functional and integrated into the Mahout processing pipeline (which should be by next week).

Re: Mahout on Spark?

Posted by Ted Dunning <te...@gmail.com>.

On Wed, Feb 19, 2014 at 1:55 PM, peng <pc...@uowmail.edu.au> wrote:

> But maybe mahout can include contribs that M/R is not fit for, like
> downpour SGD or graph-based algorithms?
>

Yes.  Absolutely.

Re: Mahout on Spark?

Posted by Nick Pentreath <ni...@gmail.com>.

MLlib may be less production-tested than Mahout, that is true, but I would
say Spark is heavily production-tested and getting close to a true 1.0
release. Why do you favour Hadoop for "sturdiness"? Spark uses HDFS as an
input source (or any Hadoop InputFormat), so it benefits from the same fault
tolerance with respect to input sources. Spark's fault tolerance model for
tasks / jobs is, if anything, superior to that of Hadoop M/R.

For a Downpour SGD-like implementation on Spark see:
https://github.com/apache/incubator-spark/pull/407. Assuming the framework
for Spark SGD / gradients etc. is flexible enough, one should be able to
implement a neural net / perceptron on top of this. I would be interested to
hear if it can be done easily with the current code framework.
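
For context, here is a rough single-process sketch of the Downpour-style
idea. It is not the code in the pull request above. Workers hold possibly
stale copies of the parameters, compute gradients on their own data shard,
and push them to a shared parameter server. All names (`ParameterServer`,
`downpour_like_sgd`, `n_fetch`) are hypothetical, asynchrony is only
simulated by visiting workers in shuffled order, and the model is a toy
one-dimensional linear regression.

```python
import random


class ParameterServer:
    """Holds the shared parameters and applies gradients as they arrive."""

    def __init__(self, n_params, learning_rate):
        self.params = [0.0] * n_params
        self.learning_rate = learning_rate

    def fetch(self):
        return list(self.params)

    def push(self, gradient):
        for i, g in enumerate(gradient):
            self.params[i] -= self.learning_rate * g


def gradient(params, batch):
    """Mean squared-error gradient for the model y = w * x + b."""
    w, b = params
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = (w * x + b) - y
        grad_w += 2.0 * error * x
        grad_b += 2.0 * error
    return [grad_w / len(batch), grad_b / len(batch)]


def downpour_like_sgd(shards, steps, learning_rate=0.02, n_fetch=2, seed=0):
    """Simulate workers that refresh their parameters only every n_fetch steps."""
    rng = random.Random(seed)
    server = ParameterServer(2, learning_rate)
    local = [server.fetch() for _ in shards]
    for step in range(steps):
        # Shuffle the visiting order to mimic unsynchronized arrival.
        order = list(range(len(shards)))
        rng.shuffle(order)
        for k in order:
            if step % n_fetch == 0:
                local[k] = server.fetch()
            # Between fetches the gradient is computed on stale parameters.
            server.push(gradient(local[k], shards[k]))
    return server.params
```

On noise-free data generated from y = 2x + 1 and split across four shards,
this loop should recover w close to 2 and b close to 1 despite the stale
reads. A real implementation would run workers concurrently and would have
to handle failures and communication costs.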


On Wed, Feb 19, 2014 at 11:55 PM, peng <pc...@uowmail.edu.au> wrote:

> It was suggested that I switch to MLlib for its performance, but I doubt
> that it is production-ready; even if it is, I would still favour Hadoop's
> sturdiness and self-healing.
> But maybe Mahout can include contribs that M/R is not fit for, like
> Downpour SGD or graph-based algorithms?

Re: Mahout on Spark?

Posted by peng <pc...@uowmail.edu.au>.

It was suggested that I switch to MLlib for its performance, but I doubt
that it is production-ready; even if it is, I would still favour Hadoop's
sturdiness and self-healing.
But maybe Mahout can include contribs that M/R is not fit for, like
Downpour SGD or graph-based algorithms?

On Wed 19 Feb 2014 07:52:22 AM EST, Sean Owen wrote:
> To set expectations appropriately, I think it's important to point out
> this is completely infeasible short of a total rewrite, and I can't
> imagine that will happen. It may not be obvious if you haven't looked
> at the code how completely dependent on M/R it is.
>
> You can swap out M/R and Spark if you write in terms of something like
> Crunch, but that is not at all the case here.

Re: Mahout on Spark?

Posted by Sebastian Schelter <ss...@apache.org>.

Completely agree with Sean's statement.

On 02/19/2014 01:52 PM, Sean Owen wrote:
> To set expectations appropriately, I think it's important to point out
> this is completely infeasible short of a total rewrite, and I can't
> imagine that will happen. It may not be obvious if you haven't looked
> at the code how completely dependent on M/R it is.
>
> You can swap out M/R and Spark if you write in terms of something like
> Crunch, but that is not at all the case here.

Re: Mahout on Spark?

Posted by Sean Owen <sr...@gmail.com>.

To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.

You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <ja...@gmail.com> wrote:
> +100 for this, different execution engines, like the direction  pig and crunch take
>
> Sent from my iPhone

Re: Mahout on Spark?

Posted by Sean Owen <sr...@gmail.com>.

To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. If you haven't looked at the code, it may not
be obvious how completely dependent on M/R it is.

You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.
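
To illustrate the point about Crunch-style abstractions, here is a minimal
sketch, assuming hypothetical names (Engine, InMemoryEngine, ShuffleEngine,
item_counts) that come from neither Mahout nor Crunch. The algorithm is
written only against a small engine interface, so the backend can be swapped
without touching it; code written directly against M/R job classes has no
such seam.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from functools import reduce


class Engine(ABC):
    """Hypothetical execution-engine interface (not a Mahout or Crunch API)."""

    @abstractmethod
    def map(self, records, fn):
        """Apply fn to every record and return the results."""

    @abstractmethod
    def reduce_by_key(self, pairs, fn):
        """Combine the values of (key, value) pairs per key with fn."""


class InMemoryEngine(Engine):
    """Sequential backend: folds values into a dict as they arrive."""

    def map(self, records, fn):
        return [fn(r) for r in records]

    def reduce_by_key(self, pairs, fn):
        acc = {}
        for key, value in pairs:
            acc[key] = fn(acc[key], value) if key in acc else value
        return acc


class ShuffleEngine(Engine):
    """Stand-in for an M/R-style backend: group by key first, then reduce."""

    def map(self, records, fn):
        return [fn(r) for r in records]

    def reduce_by_key(self, pairs, fn):
        groups = defaultdict(list)
        for key, value in pairs:  # the "shuffle" phase
            groups[key].append(value)
        return {key: reduce(fn, values) for key, values in groups.items()}


def item_counts(engine, interactions):
    """Count interactions per item, written only against the Engine interface."""
    pairs = engine.map(interactions, lambda rec: (rec[1], 1))
    return engine.reduce_by_key(pairs, lambda a, b: a + b)
```

Calling item_counts with either engine on the same (user, item) pairs gives
the same per-item counts, which is the property an engine-agnostic codebase
would rely on.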

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <ja...@gmail.com> wrote:
> +100 for this: different execution engines, like the direction Pig and Crunch take
>
> Sent from my iPhone
>
>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <gk...@gmail.com> wrote:
>>
>> I imagine Mahout offering users an option to select from
>> different execution engines (just like we currently do by giving M/R or
>> sequential options), starting with Spark. I am not sure what changes are
>> needed in the codebase, though. Maybe following MLI (or the like) and
>> implementing some more stuff, such as common interfaces for iterating over
>> data (the M/R way and the Spark way).
>>
>> IMO, another effort might be porting pre-online machine learning (such as
>> transforming text into vectors based on the dictionary generated earlier
>> by seq2sparse), machine learning based on mini-batches, and the streaming
>> summarization stuff in Mahout to Spark Streaming.
>>
>> Best,
>> Gokhan
>>

Re: Mahout on Spark?

Posted by Jay Vyas <ja...@gmail.com>.

+100 for this: different execution engines, like the direction Pig and Crunch take.


> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <gk...@gmail.com> wrote:
> 
> I imagine Mahout offering users an option to select from
> different execution engines (just like we currently do by giving M/R or
> sequential options), starting with Spark. I am not sure what changes are
> needed in the codebase, though. Maybe following MLI (or the like) and
> implementing some more stuff, such as common interfaces for iterating over
> data (the M/R way and the Spark way).
> 
> IMO, another effort might be porting pre-online machine learning (such as
> transforming text into vectors based on the dictionary generated earlier
> by seq2sparse), machine learning based on mini-batches, and the streaming
> summarization stuff in Mahout to Spark Streaming.
> 
> Best,
> Gokhan
> 