Posted to user@spark.apache.org by nileshc <ni...@nileshc.com> on 2014/02/02 00:57:08 UTC

Hadoop MapReduce on Spark

This might seem like a silly question, so please bear with me. I'm not sure
about it myself; I'd just like to know whether you think it's utterly
infeasible, and whether it's worth doing at all.

Does anyone feel it would be a good idea to build some sort of library
that lets us write code for Spark using the usual bloated Hadoop API? This
is for people who want to run their existing MapReduce code, with no or
minimal adjustments, on Spark to take advantage of its speed and its better
support for iterative workflows.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hadoop-MapReduce-on-Spark-tp1110.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Hadoop MapReduce on Spark

Posted by Ashish Rangole <ar...@gmail.com>.
I can see how that would be a valid use case. A lot of folks have code
written using the Hadoop MR APIs, or other layers built on them. It would
help those dev teams migrate those apps to Spark if such a translation
layer were available.

Re: Hadoop MapReduce on Spark

Posted by Ankur Chauhan <ac...@brightcove.com>.
I think the whole idea of the Spark API is to simplify building iterative
workflows/algorithms compared to Hadoop's bloated API.

I'm not saying it's completely wrong or anything, although it would be
clearer if you had a particular use case in mind that you wish to tackle.


Re: Hadoop MapReduce on Spark

Posted by Matei Zaharia <ma...@gmail.com>.
It’s fairly easy to take your existing Mapper and Reducer objects and call them within Spark. First, you can use SparkContext.hadoopRDD to read a file with any Hadoop InputFormat (you can even pass it the JobConf you would’ve created in Hadoop). Then use mapPartitions to iterate through each partition and pass it to your mapper, and reduceByKey or groupByKey to go to the reducer.
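
A minimal sketch of that recipe: MyMapper and MyReducer below are hypothetical stand-ins for existing old-API (org.apache.hadoop.mapred) classes with LongWritable/Text input and Text/Text output, the HDFS paths are made up, and circa-2014 Spark/Hadoop APIs are assumed.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, OutputCollector, Reporter, TextInputFormat}
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import scala.collection.JavaConverters._
    import scala.collection.mutable.ArrayBuffer

    val sc = new SparkContext("local[2]", "mr-on-spark")

    // Reuse the JobConf you would have built for Hadoop, including its InputFormat.
    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, "hdfs:///path/to/input")   // hypothetical path
    val input = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
                             classOf[LongWritable], classOf[Text])

    // Map side: one Mapper instance per partition, driven record by record.
    // Emitted pairs are copied out as plain Strings, because Hadoop reuses
    // Writable instances and Text is not java.io.Serializable, so it cannot
    // go through Spark's shuffle as-is.
    val mapped = input.mapPartitions { records =>
      val mapper = new MyMapper()                    // your existing Mapper
      val out = new ArrayBuffer[(String, String)]()
      val collector = new OutputCollector[Text, Text] {
        def collect(k: Text, v: Text): Unit = out += ((k.toString, v.toString))
      }
      records.foreach { case (k, v) => mapper.map(k, v, collector, Reporter.NULL) }
      out.iterator
    }

    // Reduce side: group values by key, then drive your existing Reducer.
    val reduced = mapped.groupByKey().mapPartitions { groups =>
      val reducer = new MyReducer()                  // your existing Reducer
      val out = new ArrayBuffer[(String, String)]()
      val collector = new OutputCollector[Text, Text] {
        def collect(k: Text, v: Text): Unit = out += ((k.toString, v.toString))
      }
      groups.foreach { case (key, values) =>
        reducer.reduce(new Text(key), values.map(new Text(_)).iterator.asJava,
                       collector, Reporter.NULL)
      }
      out.iterator
    }

    reduced.saveAsTextFile("hdfs:///path/to/output")  // hypothetical path

Calls to configure()/close() on the Mapper and Reducer are omitted above; add them around each partition if your classes rely on them.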

We’ve investigated offering the MapReduce API directly, and while it’s possible, one problem is that a lot of MapReduce code isn’t thread-safe. Hadoop runs each task in a separate JVM, while Spark can run multiple tasks concurrently in the same JVM, so some existing code in the jobs we tried porting this way broke. But if your code is thread-safe, the approach mentioned above should work pretty well.
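
To make that concrete, here is a contrived illustration (not from any real job) of the kind of pattern that is safe under Hadoop's one-task-per-JVM model but races once Spark schedules several tasks into the same JVM:

    // Per-task bookkeeping kept in a JVM-wide singleton.
    object RecordStats {
      var recordsSeen = 0L      // one task per JVM: effectively task-local
    }                           // many tasks per JVM: shared, unsynchronized state

    class StatsMapper {         // stand-in for an old-API Mapper
      def map(line: String): Unit = {
        // Unsynchronized read-modify-write: two Spark tasks running in the
        // same executor JVM can interleave here and lose updates, so the
        // count silently comes out low.
        RecordStats.recordsSeen += 1
      }
    }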

Matei



Re: Hadoop MapReduce on Spark

Posted by Debasish Das <de...@gmail.com>.
Most of the use cases fall into two categories...

1. Pre-processing over TB/PB-scale data, where the data size is larger than
the total RAM available on the cluster. Thanks to the maturity of MapReduce,
a DAG-based job scheduling framework running on top of MapReduce
(Scalding/Cascading and Scrunch/Crunch) gives you the power to write code
at a higher level of abstraction, as Sean mentioned. Since you are shuffling
results to disk anyway, I don't see much difference here between MapReduce
and Spark pipelines.

2. Running iterative algorithms over features: here the data has already
been cleaned in step 1 and you are running algorithmic analysis, perhaps to
convergence of some sort. The MapReduce paradigm was not meant for such
tasks, and the same holds for distributed graphs and streaming data. This is
where Spark starts to shine, since you can mark parts of the DAG, or the
whole DAG, to be cached in memory (see the sketch after this list).
Scalding/Scrunch could also add an API for in-memory caching of parts of the
DAG, but that is not available yet.
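
To illustrate point 2, a toy iterative loop (hypothetical input path, made-up
gradient step, circa-2014 RDD API) showing what marking part of the DAG as
cached buys you:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("local[2]", "iterative-demo")

    // Hypothetical output of step 1: cleaned "x,y" feature lines on HDFS.
    val points = sc.textFile("hdfs:///path/to/cleaned/features")
      .map { line =>
        val Array(x, y) = line.split(",").map(_.toDouble)
        (x, y)
      }
      .cache()   // mark this part of the DAG to be kept in cluster memory

    // Toy gradient descent for a slope-only linear fit. After the first
    // pass, `points` is served from memory instead of being re-read from
    // disk, which is where each job in a chained MapReduce pipeline would
    // start over.
    var w = 0.0
    for (_ <- 1 to 20) {
      val gradient = points.map { case (x, y) => (w * x - y) * x }.mean()
      w -= 0.1 * gradient
    }
    println("fitted slope: " + w)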

Basically, to sum up, I think we will need both tools for different
use cases until they are merged (?) by a higher abstraction layer (hopefully
Scalding/Scrunch!).


Re: Hadoop MapReduce on Spark

Posted by Sean Owen <so...@cloudera.com>.
An M/R job is a one-shot job, in itself. Making it iterative is what a
higher-level controller does, by running it several times and pointing
it at the right input. That bit isn't part of M/R. So I don't think
you would accomplish this goal by implementing something *under* the
M/R API.
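
For instance, the controller for an iterative old-API job is typically just
a driver loop like this sketch (job-specific mapper/reducer/format setup
omitted, paths hypothetical), feeding each round's output back in as the
next round's input:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapred.{FileInputFormat, FileOutputFormat, JobClient, JobConf}

    var input = "hdfs:///data/iter-0"            // hypothetical starting data
    for (i <- 1 to 10) {
      val conf = new JobConf()
      // ... set your Mapper, Reducer, and input/output formats here ...
      val output = s"hdfs:///data/iter-$i"
      FileInputFormat.setInputPaths(conf, input)
      FileOutputFormat.setOutputPath(conf, new Path(output))
      JobClient.runJob(conf)                     // blocking, one-shot M/R job
      input = output                             // next round reads this round's output
    }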

M/Rs still get written, but I think most people serious about it are
already using higher-level APIs like Apache Crunch or Cascading.

For those who haven't seen it, Crunch's abstraction bears a lot of
resemblance to the Spark model -- handles on remote collections. So,
*the reverse* of this suggestion (i.e. Spark-ish API on M/R) is
basically Crunch, or Scrunch if you like Scala.

I know Josh Wills has put work into getting Crunch to operate *on top
of Spark* even. That might be of interest to the original idea of
getting a possibly more familiar API, for some current Hadoop devs,
running on top of Spark. (Josh tells me it also enables a few tricks
that are hard in Spark.)

--
Sean Owen | Director, Data Science | London

