Posted to user@mahout.apache.org by WangRamon <ra...@hotmail.com> on 2011/10/20 04:13:51 UTC

Has anyone tried Spark with Mahout?




Hi all, I was told today that Spark is a much better platform for cluster computing than Hadoop, at least for recommendation computation. I'm still very new to this area, so if anyone has done some investigation of Spark, could you please share your thoughts here? Thank you very much. Thanks, Ramon

Re: Has anyone tried Spark with Mahout?

Posted by Dan Brickley <da...@danbri.org>.
On 20 October 2011 20:04, Josh Patterson <jo...@cloudera.com> wrote:
> Absolutely, I'd agree on that.
>
> From what I can tell it's the best "Pregel"-style clone going; it's
> heading towards MRv2 and seems to have some decent momentum behind it.

Yup, I dug into a few of them and it seems the most energetic, and
relatively easy to get up and running too.

Dan

Re: Has anyone tried Spark with Mahout?

Posted by Josh Patterson <jo...@cloudera.com>.
Absolutely, I'd agree on that.

From what I can tell it's the best "Pregel"-style clone going; it's
heading towards MRv2 and seems to have some decent momentum behind it.

On Thu, Oct 20, 2011 at 1:48 PM, Sebastian Schelter <ss...@apache.org> wrote:
> On 20.10.2011 19:45, Ted Dunning wrote:
>> I think that giraph has a lot to offer here as well.
>
> +1 on that.
>
>>
>> Sent from my iPhone
>>
>> On Oct 20, 2011, at 8:30, Josh Patterson <jo...@cloudera.com> wrote:
>>
>>> I've run some tests with Spark in general; it's a pretty interesting setup.
>>>
>>> I think the most interesting aspect (relevant to what you are asking
>>> about) is that Matei already has Spark running on top of MRv2:
>>>
>>> https://github.com/mesos/spark-yarn
>>>
>>> (you don't have to run Mesos, but the YARN code needs to be able to see
>>> the jar in order to do its scheduling work)
>>>
>>> I've been playing around with writing a genetic algorithm in
>>> Scala/Spark to run on MRv2, and in the process got introduced to the
>>> book:
>>>
>>> "Parallel Iterative Algorithms, From Sequential to Grid Computing"
>>>
>>> which talks about strategies for parallelizing highly iterative
>>> algorithms and the inherent issues involved (sync/async iterations,
>>> sync/async communications, etc.). Since you can use Spark as a
>>> "BSP-style" framework (ignoring the RDDs if you like) and just shoot
>>> out slices of an array of items to be processed (relatively fast
>>> compared to MR), it has some interesting properties/tradeoffs to take a
>>> look at.
>>>
>>> Toward the end of my ATL HUG talk I mentioned the possibility that
>>> MRv2 could be used with other frameworks, like Spark, to be better
>>> suited to other algorithms (in this case, highly iterative ones):
>>>
>>> http://www.slideshare.net/jpatanooga/machine-learning-and-hadoop
>>>
>>> I think it would be interesting to have Mahout sitting on top of MRv2,
>>> as Ted is suggesting, and then have each algorithm matched to a
>>> framework on YARN, with a workflow that mixed and matched these
>>> combinations.
>>>
>>> Lots of possibilities here.
>>>
>>> JP
>>>
>>>
>>> On Wed, Oct 19, 2011 at 10:42 PM, Ted Dunning <te...@gmail.com> wrote:
>>>> Spark is very cool but very incompatible with Hadoop code.  Many Mahout
>>>> algorithms would run much faster on Spark, but you will have to do the
>>>> porting yourself.
>>>>
>>>> Let us know how it turns out!
>>>>
>>>> 2011/10/19 WangRamon <ra...@hotmail.com>
>>>>
>>>>> Hi all, I was told today that Spark is a much better platform for cluster
>>>>> computing than Hadoop, at least for recommendation computation. I'm
>>>>> still very new to this area, so if anyone has done some investigation of
>>>>> Spark, could you please share your thoughts here? Thank you very much. Thanks, Ramon
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Twitter: @jpatanooga
>>> Solution Architect @ Cloudera
>>> hadoop: http://www.cloudera.com
>
>



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
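Josh's "BSP-style" description above — shoot out slices of an array to workers, process them, synchronize, repeat — can be made concrete with a small sketch. This is plain Java threads plus a barrier, not Spark's actual API; the class and method names are invented for illustration only:

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Illustrative only: a BSP-style superstep loop in plain Java threads.
// Spark's real API looks nothing like this; the point is the
// compute-then-synchronize structure.
class BspSketch {

    // Run `supersteps` rounds over `data`, doubling each element per round.
    // Workers own disjoint strided slices, so no two threads touch the
    // same index.
    static int[] runSupersteps(int[] data, int supersteps) {
        if (data.length == 0) return data;
        int workers = Math.min(4, data.length);
        CyclicBarrier barrier = new CyclicBarrier(workers);
        Thread[] threads = new Thread[workers];

        for (int w = 0; w < workers; w++) {
            final int worker = w;
            threads[w] = new Thread(() -> {
                try {
                    for (int step = 0; step < supersteps; step++) {
                        // "Compute" phase on this worker's slice.
                        for (int i = worker; i < data.length; i += workers) {
                            data[i] *= 2;
                        }
                        // Global synchronization: no worker starts the
                        // next superstep until all have finished this one.
                        barrier.await();
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return data;
    }

    public static void main(String[] args) {
        // Two supersteps of doubling multiply every element by 4.
        System.out.println(Arrays.toString(runSupersteps(new int[]{1, 2, 3, 4}, 2)));
        // prints [4, 8, 12, 16]
    }
}
```

Each superstep ends at `barrier.await()`; that global synchronization point is what distinguishes BSP from free-running MapReduce-style tasks.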

Re: Has anyone tried Spark with Mahout?

Posted by Sebastian Schelter <ss...@apache.org>.
On 20.10.2011 19:45, Ted Dunning wrote:
> I think that giraph has a lot to offer here as well. 

+1 on that.


Re: Has anyone tried Spark with Mahout?

Posted by Ted Dunning <te...@gmail.com>.
I think that giraph has a lot to offer here as well. 

Sent from my iPhone


Re: Has anyone tried Spark with Mahout?

Posted by Prashant Sharma <pr...@gmail.com>.
This is nice! The only problem is that one would have to learn a new paradigm,
and people have a habit of sticking to what they are familiar with.
-P

On Mon, Oct 31, 2011 at 4:39 PM, Nick Pentreath <ni...@gmail.com>wrote:

> I have this crazy idea to combine Scalala (which aims to be a library
> for linear algebra in Scala, based on netlib-java, that provides
> Matlab / numpy like syntax and plotting), scalanlp (same developer as
> Scalala, focused on NLP/ML algorithms), Spark and Mahout in some way,
> to create a Matlab-like environment (or better an IPython-like
> super-shell, that could also be integrated into a GUI) that allows you
> to write code that seamlessly operates locally and across a Hadoop
> cluster using Spark's framework.
>
> Ideally it would wrap / port Mahout's distributed matrix operations
> (multiplication, SVD, other decompositions etc), as well as SGD and
> some others etc, and integrate scalanlp's algorithms. It would be
> seamless in the sense that calling, say, A * B, or SVD on a matrix in
> local mode or cluster mode is exactly the same, save for setting
> Spark's context to be local vs cluster (and specifying the HDFS
> location of the data for cluster mode etc) - this is based on
> Scalala's idea of optimised code paths depending on the matrix type.
> This would allow rapid prototyping on a local machine / test cluster,
> and deploying the exact same code across huge clusters...
>
> I don't have enough experience yet with Mahout, let alone Scala and
> Scalala, to think about tackling this, but I wonder if this is
> something people would like to see?!
>
> n
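Nick's key requirement — that `A * B` or an SVD call read identically in local and cluster mode, with only the context changing — is essentially dispatch on the matrix type. Here is a minimal, hypothetical Java sketch of that shape; none of these classes exist in Mahout or Scalala, and only the local backend is actually implemented:

```java
// Hypothetical sketch of "same call site, different backend". Only the
// local implementation is real; a cluster-mode implementation would
// delegate to something like Spark/Mahout and is out of scope here.
interface Matrix {
    int rows();
    int cols();
    double get(int i, int j);
    Matrix multiply(Matrix other);   // user writes a.multiply(b) in either mode
}

class LocalMatrix implements Matrix {
    private final double[][] a;
    LocalMatrix(double[][] a) { this.a = a; }
    public int rows() { return a.length; }
    public int cols() { return a[0].length; }
    public double get(int i, int j) { return a[i][j]; }

    public Matrix multiply(Matrix other) {
        double[][] c = new double[rows()][other.cols()];
        for (int i = 0; i < rows(); i++)
            for (int k = 0; k < cols(); k++)          // i-k-j order for cache locality
                for (int j = 0; j < other.cols(); j++)
                    c[i][j] += a[i][k] * other.get(k, j);
        return new LocalMatrix(c);
    }
}

class MatrixDemo {
    // The single mode switch: user code below never changes.
    static Matrix matrix(double[][] data, boolean clusterMode) {
        // clusterMode would select a distributed backend in a real system.
        return new LocalMatrix(data);
    }

    public static void main(String[] args) {
        Matrix a = matrix(new double[][]{{1, 2}, {3, 4}}, false);
        Matrix b = matrix(new double[][]{{5, 6}, {7, 8}}, false);
        Matrix c = a.multiply(b);
        System.out.println(c.get(0, 0) + " " + c.get(0, 1) + " "
                + c.get(1, 0) + " " + c.get(1, 1));
        // prints 19.0 22.0 43.0 50.0
    }
}
```

Scalala's "optimised code paths depending on the matrix type" is the same idea one level deeper: the multiply implementation itself, not just the storage, is chosen from the runtime types of both operands.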

Re: Has anyone tried Spark with Mahout?

Posted by Chris K Wensel <ch...@wensel.net>.
I've made a few comments on the Crunch/Cascading differences here:

http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading/answer/Chris-K-Wensel

chris


--
Chris K Wensel
chris@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support for Cascading


Re: Has anyone tried Spark with Mahout?

Posted by Ted Dunning <te...@gmail.com>.
+Chris Wensel

The biggest difference between Cascading and Plume/Crunch/FlumeJava is that
the latter all do more lazy evaluation and more program restructuring, and
much less large-scale scheduling.  Certainly the PCFJ group do much more to
make the results look like a Java collection and are better at talking to
conventional Java types.

I think that Cascading could do the more extensive job-graph rewrites.  It
would be hard for Cascading to generalize its data structures, though,
without major backward-compatibility issues.

In sum, I think that the difference between Cascading and PCFJ is largely a
matter of taste, not inherent system design.


On Mon, Oct 31, 2011 at 2:36 PM, Charles Earl <ch...@me.com> wrote:

> Thanks. This is an insightful discussion. Having just glanced now at both
> Plume and Crunch, these seem similar to Cascading in the sense of being
> dataflow languages. I wonder whether you are able to comment on any
> important distinctions.
>

Re: Has anyone tried Spark with Mahout?

Posted by Charles Earl <ch...@me.com>.
Thanks. This is an insightful discussion. Having just glanced now at both Plume and Crunch, these seem similar to Cascading in the sense of being dataflow languages. I wonder whether you are able to comment on any important distinctions.
C


Re: Has anyone tried Spark with Mahout?

Posted by Ted Dunning <te...@gmail.com>.
Yeah...

But that doesn't help when I want to write a Pig library for you.  It also
doesn't help when I want to write a pig script that calls your library
stuff in the middle and then passes the result to something that Jake
wrote.  Pig's optimizer can't build a complete data flow across that
composite program.

It does help a bit with the problem of, say, iterating over files in a
directory.

My preference is for languages like FlumeJava, which start with Java and use a
builder-style API to inject the data flow specification.

On Mon, Oct 31, 2011 at 12:54 PM, Dan Brickley <da...@danbri.org> wrote:

> Spoiler alert (wiki page has a lot more detail).  Plan seems to be
> combination of macros (which are now in the language) and "second part
> of the proposal is to embed Pig Latin scripts in the host scripting
> language via a JDBC like compile, bind, run model. "
>
> I'm not sure how far along that part is...
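Ted's preferred FlumeJava shape — start with ordinary Java and use a builder-style API to construct a data flow that is only executed later — can be sketched in a few lines. This `Flow` class is invented for illustration (it is not FlumeJava's actual API): operators only extend a deferred plan, and nothing executes until `run()`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Invented for illustration: a deferred, builder-style data flow.
final class Flow<T> {
    private final Supplier<List<T>> plan;   // a computation, not data

    private Flow(Supplier<List<T>> plan) { this.plan = plan; }

    static <T> Flow<T> of(List<T> input) {
        return new Flow<>(() -> input);
    }

    // Operators extend the plan; no element is processed here.
    <R> Flow<R> map(Function<T, R> f) {
        return new Flow<>(() -> {
            List<R> out = new ArrayList<>();
            for (T t : plan.get()) out.add(f.apply(t));
            return out;
        });
    }

    Flow<T> filter(Predicate<T> p) {
        return new Flow<>(() -> {
            List<T> out = new ArrayList<>();
            for (T t : plan.get()) if (p.test(t)) out.add(t);
            return out;
        });
    }

    // The one place a real system would inspect and rewrite the whole
    // graph (operator fusion, scheduling) before executing it.
    List<T> run() { return plan.get(); }

    public static void main(String[] args) {
        List<Integer> result = Flow.of(List.of(1, 2, 3, 4, 5))
                .filter(n -> n % 2 == 1)   // keep odd numbers
                .map(n -> n * n)           // square them
                .run();                    // nothing ran before this call
        System.out.println(result);        // prints [1, 9, 25]
    }
}
```

Because the whole pipeline exists as a plan object before anything runs, this style gives an optimizer the complete data flow — exactly what Ted notes Pig's optimizer cannot get across a composite program.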

Re: Has anyone tried Spark with Mahout?

Posted by Dan Brickley <da...@danbri.org>.
On 31 October 2011 20:22, Ted Dunning <te...@gmail.com> wrote:
> Yes.  That was the point.  Calling out is different from being able to
> control the process from the outside in.

I've just found http://wiki.apache.org/pig/TuringCompletePig which has
copious notes on ways to address this. Excerpting a little:

"""Pig Latin is a data flow language. As such it does not offer users
control flow and modularity features that are present in general
purpose programming languages, including functions, modules, loops,
and branches. Given that it is a data flow language adding these
constructs is neither straightforward nor reasonable. However, users
do want to be able to integrate standard programming techniques of
separation and code sharing offered by functions and modules as well
as integration of control flow offered by functions, loops, and
branches. This document proposes a way to accomplish these goals while
preserving Pig Latin's data flow orientation."""

Spoiler alert (wiki page has a lot more detail).  Plan seems to be
combination of macros (which are now in the language) and "second part
of the proposal is to embed Pig Latin scripts in the host scripting
language via a JDBC like compile, bind, run model. "

I'm not sure how far along that part is...

Dan

ps. the following 3 links have everything I attempted before with
Pig/Mahout integration; not a lot, but it left me intrigued and
frustrated in equal measure.

http://www.mail-archive.com/user@pig.apache.org/msg02848.html
https://gist.github.com/1192831
http://search-lucene.com/m/IOfRIc6wGq1&subj=+Unknown+program+chosen+Valid+program+names+are+truncated+list+from+Hadoop+program+driver
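The "JDBC-like compile, bind, run" model quoted above can be illustrated with a deliberately fake script type. None of the classes below are real Pig APIs, and a one-line string template stands in for a Pig Latin script; the point is only the shape of the three phases and where host-language control flow sits:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of a compile/bind/run embedding. Nothing here is a
// real Pig API; the "script" is just a string template with $parameters.
class CompileBindRun {

    // "Compile" once: in a real system this is where the script would be
    // parsed and its plan built.
    static class CompiledScript {
        final String template;
        CompiledScript(String template) { this.template = template; }

        // "Bind": substitute parameters, yielding a runnable instance.
        BoundScript bind(Map<String, String> params) {
            String s = template;
            for (Map.Entry<String, String> e : params.entrySet())
                s = s.replace("$" + e.getKey(), e.getValue());
            return new BoundScript(s);
        }
    }

    // "Run": execute. Loops and branches live in the host language,
    // around these calls, not inside the data flow language.
    static class BoundScript {
        final String script;
        BoundScript(String script) { this.script = script; }
        String run() { return "ran: " + script; }
    }

    public static void main(String[] args) {
        CompiledScript compiled =
                new CompiledScript("A = LOAD '$input'; DUMP A;");
        List<String> results = new ArrayList<>();
        // The host language supplies the loop Pig Latin lacks:
        // compile once, bind and run per input.
        for (String input : List.of("day1", "day2")) {
            Map<String, String> params = new HashMap<>();
            params.put("input", input);
            results.add(compiled.bind(params).run());
        }
        System.out.println(results.size() + " runs, first: " + results.get(0));
        // prints 2 runs, first: ran: A = LOAD 'day1'; DUMP A;
    }
}
```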

Re: Has anyone tried Spark with Mahout?

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Oct 31, 2011 at 12:00 PM, Dan Brickley <da...@danbri.org> wrote:

>
> ...although you can call out to functions defined in Java, Python etc.
> This doesn't make the top level language into a programming language,
> though. Was that your point, Ted?
>
>
Yes.  That was the point.  Calling out is different from being able to
control the process from the outside in.

Re: Has anyone tried Spark with Mahout?

Posted by Dan Brickley <da...@danbri.org>.
On 31 October 2011 17:27, Ted Dunning <te...@gmail.com> wrote:
> I think this would be very interesting to see.  Whether it should be part
> of Mahout or a separate project is an open question.
>
> Pig is, unfortunately, not a real language in the sense of Turing
> completeness or extensibility. It is good at what it does, but not at being
> extended to do more.

...although you can call out to functions defined in Java, Python etc.
This doesn't make the top level language into a programming language,
though. Was that your point, Ted?

Dan

Re: Has anyone tried Spark with Mahout?

Posted by Ted Dunning <te...@gmail.com>.
I think this would be very interesting to see.  Whether it should be part
of Mahout or a separate project is an open question.

Pig is, unfortunately, not a real language in the sense of Turing
completeness or extensibility. It is good at what it does, but not at being
extended to do more.

On Mon, Oct 31, 2011 at 4:58 AM, Charles Earl <ch...@me.com> wrote:

> Sounds interesting. I suspect that Spark might provide some performance
> improvement based upon their papers. Testing that hypothesis is on my todo
> list for November.
>  I have been wondering also whether PIG might be a starting point for
> providing interactive Matlab environment.
> Charles
>

Re: Has anyone tried Spark with Mahout?

Posted by Charles Earl <ch...@me.com>.
Sounds interesting. I suspect that Spark might provide some performance improvement, based on their papers. Testing that hypothesis is on my todo list for November.
I have also been wondering whether Pig might be a starting point for providing an interactive Matlab-like environment.
Charles

On Oct 31, 2011, at 7:09 AM, Nick Pentreath <ni...@gmail.com> wrote:


Re: Has anyone tried Spark with Mahout?

Posted by Nick Pentreath <ni...@gmail.com>.
I have this crazy idea to combine Scalala (which aims to be a library
for linear algebra in Scala, based on netlib-java, that provides
Matlab/numpy-like syntax and plotting), scalanlp (same developer as
Scalala, focused on NLP/ML algorithms), Spark and Mahout in some way,
to create a Matlab-like environment (or better an IPython-like
super-shell, that could also be integrated into a GUI) that allows you
to write code that seamlessly operates locally and across a Hadoop
cluster using Spark's framework.

Ideally it would wrap / port Mahout's distributed matrix operations
(multiplication, SVD, other decompositions etc), as well as SGD and
some others etc, and integrate scalanlp's algorithms. It would be
seamless in the sense that calling, say, A * B, or SVD on a matrix in
local mode or cluster mode is exactly the same, save for setting
Spark's context to be local vs cluster (and specifying the HDFS
location of the data for cluster mode etc) - this is based on
Scalala's idea of optimised code paths depending on the matrix type.
This would allow rapid prototyping on a local machine / test cluster,
and deploying the exact same code across huge clusters...
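
Nick's "identical call in local or cluster mode" idea can be sketched in miniature. Below is a hypothetical Python/numpy version (the class and method names are mine for illustration, not Scalala's or Mahout's API); the point is that the calling code stays the same while only the backing matrix class would change between a local and a distributed implementation:

```python
import numpy as np

class LocalMatrix:
    """Hypothetical local backend: operations run in-process via numpy.
    A cluster-backed sibling class could expose the same methods but
    dispatch to distributed jobs, so user code would not change."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)

    def __mul__(self, other):
        # A * B dispatches to a local matrix product here; a distributed
        # backend would run a cluster multiplication behind the same call.
        return LocalMatrix(self.data @ other.data)

    def svd(self):
        # Local SVD via LAPACK; a distributed backend would run a
        # cluster decomposition behind the same method name.
        return np.linalg.svd(self.data, full_matrices=False)

# The calling code is identical no matter which backend is active:
A = LocalMatrix([[1.0, 2.0], [3.0, 4.0]])
B = LocalMatrix([[5.0, 6.0], [7.0, 8.0]])
C = A * B
U, s, Vt = C.svd()
```

This is just Scalala's "optimised code path per matrix type" idea transplanted: selection of the execution path lives in the matrix class, not in user code.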

I don't have enough experience yet with Mahout, let alone Scala and
Scalala, to think about tackling this, but I wonder if this is
something people would like to see?!

n

On 20 Oct 2011, at 16:30, Josh Patterson <jo...@cloudera.com> wrote:


Re: Has anyone tried Spark with Mahout?

Posted by Josh Patterson <jo...@cloudera.com>.
I've run some tests with Spark in general; it's a pretty interesting setup.

I think the most interesting aspect (relevant to what you are asking
about) is that Matei already has Spark running on top of MRv2:

https://github.com/mesos/spark-yarn

(you don't have to run Mesos, but the YARN code needs to be able to see
the jar in order to do its scheduling)

I've been playing around with writing a genetic algorithm in
Scala/Spark to run on MRv2, and in the process got introduced to the
book:

"Parallel Iterative Algorithms: From Sequential to Grid Computing"

which talks about strategies for parallelizing highly iterative
algorithms and the inherent issues involved (sync/async iterations,
sync/async communications, etc.). Since you can use Spark as a
"BSP-style" framework (ignoring the RDDs if you like) and just shoot
out slices of an array of items to be processed (relatively fast
compared to MR), it has some interesting properties/tradeoffs to take a
look at.
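
The "shoot out slices of an array, process, synchronize, repeat" pattern Josh describes can be sketched in plain Python. This is a deliberately toy synchronous-superstep genetic algorithm (fitness = number of 1-bits in a bit vector), not Josh's actual code; the sequential map over slices stands in for Spark workers, and the barrier-and-combine step for the end of each superstep:

```python
import random

GENOME_LEN = 20
POP_SIZE = 40
N_SLICES = 4          # stand-ins for Spark partitions / worker slices

def evaluate_slice(genomes):
    """One worker's task in a superstep: score its slice of the population."""
    return [(sum(g), g) for g in genomes]   # fitness = count of 1-bits

def superstep(population):
    """Scatter slices, process them, then barrier-and-combine.
    Here the map over slices is sequential; on Spark each slice would
    be processed on a different node before the synchronous combine."""
    slices = [population[i::N_SLICES] for i in range(N_SLICES)]
    scored = [pair for s in slices for pair in evaluate_slice(s)]
    scored.sort(key=lambda p: p[0], reverse=True)
    # Selection: keep the top half, refill by mutating the survivors.
    survivors = [g for _, g in scored[:POP_SIZE // 2]]
    children = []
    for g in survivors:
        child = list(g)
        child[random.randrange(GENOME_LEN)] ^= 1   # flip one random bit
        children.append(child)
    return survivors + children

random.seed(42)
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]
for _ in range(30):                                # 30 synchronous supersteps
    population = superstep(population)
best = max(sum(g) for g in population)
```

Because selection keeps the survivors unchanged, the best fitness never decreases between supersteps; this is the sync-iteration/sync-communication corner of the design space the book discusses, and the async variants relax exactly that barrier.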

Toward the end of my ATL HUG talk I mentioned the possibility of using
MRv2 with other frameworks, like Spark, that are better suited to other
classes of algorithms (in this case, highly iterative ones):

http://www.slideshare.net/jpatanooga/machine-learning-and-hadoop

I think it would be interesting to have Mahout sitting on top of MRv2,
like Ted is referring to, and then have each algorithm matched to a
framework on YARN, with a workflow that mixes and matches these
combinations.

Lots of possibilities here.

JP


On Wed, Oct 19, 2011 at 10:42 PM, Ted Dunning <te...@gmail.com> wrote:



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com

Re: Has anyone tried Spark with Mahout?

Posted by Ted Dunning <te...@gmail.com>.
Spark is very cool but very incompatible with Hadoop code.  Many Mahout
algorithms would run much faster on Spark, but you will have to do the
porting yourself.

Let us know how it turns out!

2011/10/19 WangRamon <ra...@hotmail.com>
