Posted to user@spark.apache.org by Aureliano Buendia <bu...@gmail.com> on 2014/06/26 03:23:41 UTC

Spark vs Google cloud dataflow

Hi,

Today Google announced Cloud Dataflow, which looks very similar to Spark in
that it handles both batch processing and stream processing.

How does Spark compare to Google Cloud Dataflow? Are they trying to solve
the same problem?

Re: Spark vs Google cloud dataflow

Posted by Martin Goodson <ma...@skimlinks.com>.

My experience is that acquiring 20 spot instances accounts for only a tiny
fraction of the total time it takes to provision a cluster with spark-ec2.
This is not (solely) an AWS issue.


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Thu, Jun 26, 2014 at 10:14 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> Hmm, I remember a discussion on here about how the way in which spark-ec2
> rsyncs stuff to the cluster for setup could be improved, and I’m assuming
> there are other such improvements to be made. Perhaps those improvements
> don’t matter much when compared to EC2 instance launch times, but I’m not
> sure.
> 
>
>
> On Thu, Jun 26, 2014 at 4:48 PM, Aureliano Buendia <bu...@gmail.com>
> wrote:
>
>>
>>
>>
>> On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>>
>>> That’s technically true, but I’d be surprised if there wasn’t a lot of
>>> room for improvement in spark-ec2 regarding cluster launch+config
>>> times.
>>>
>> Unfortunately, this is a spark support issue, but an AWS one. Starting a
>> few months ago, Amazon AWS services have been having bigger and bigger
>> lags. Indeed, the default timeout hard coded  in spark-ec2 is no longer
>> able to launch the cluster successfully, and many people here reported that
>> they had to increase it.
>>
>>
>> 
>>>
>>
>>
>

Re: Spark vs Google cloud dataflow

Posted by Nicholas Chammas <ni...@gmail.com>.

Hmm, I remember a discussion on here about how the way in which spark-ec2
rsyncs stuff to the cluster for setup could be improved, and I’m assuming
there are other such improvements to be made. Perhaps those improvements
don’t matter much when compared to EC2 instance launch times, but I’m not
sure.

On Thu, Jun 26, 2014 at 4:48 PM, Aureliano Buendia <bu...@gmail.com>
wrote:

>
>
>
> On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>>
>> That’s technically true, but I’d be surprised if there wasn’t a lot of
>> room for improvement in spark-ec2 regarding cluster launch+config times.
>>
> Unfortunately, this is a spark support issue, but an AWS one. Starting a
> few months ago, Amazon AWS services have been having bigger and bigger
> lags. Indeed, the default timeout hard coded  in spark-ec2 is no longer
> able to launch the cluster successfully, and many people here reported that
> they had to increase it.
>
>
> 
>>
>
>

Re: Spark vs Google cloud dataflow

Posted by Aureliano Buendia <bu...@gmail.com>.

On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

>
> That’s technically true, but I’d be surprised if there wasn’t a lot of
> room for improvement in spark-ec2 regarding cluster launch+config times.
>
Unfortunately, this is not a Spark support issue, but an AWS one. Starting a
few months ago, AWS services have been lagging more and more. Indeed, the
default timeout hard-coded in spark-ec2 is no longer long enough to launch
the cluster successfully, and many people here have reported that they had
to increase it.
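
To make the timeout point concrete, here is a minimal sketch of the general
pattern: poll until the nodes respond, with the timeout passed in as a
parameter instead of being hard-coded. This is not the actual spark-ec2 code;
`wait_until_ready`, `is_ready`, and the default values are hypothetical names
and numbers chosen only for illustration.

```python
import time


def wait_until_ready(is_ready, timeout_s=600.0, poll_interval_s=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    Returns True if the check succeeded, False if the deadline passed.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Stand-in readiness check (hypothetical): succeeds on the third attempt.
attempts = []


def fake_is_ready():
    attempts.append(1)
    return len(attempts) >= 3


# sleep is replaced with a no-op so the demo finishes immediately.
ok = wait_until_ready(fake_is_ready, timeout_s=30.0, poll_interval_s=1.0,
                      sleep=lambda _seconds: None)
print(ok, len(attempts))
```

With the timeout exposed as a parameter, a slow provider means passing a
larger value rather than editing the script.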



>

Re: Spark vs Google cloud dataflow

Posted by Nicholas Chammas <ni...@gmail.com>.

On Thu, Jun 26, 2014 at 2:26 PM, Michael Bach Bui <fr...@adatao.com>
wrote:

> The overhead of bringing up AWS Spark spot instances is NOT an inherent
> problem of Spark.

That’s technically true, but I’d be surprised if there wasn’t a lot of
room for improvement in spark-ec2 regarding cluster launch+config times.
The tooling around Spark is part of the “Spark experience”, so to speak. So
a problem with the tooling affects how the core product is used.

Nick

Re: Spark vs Google cloud dataflow

Posted by Michael Bach Bui <fr...@adatao.com>.

"The current problem with Spark is the big overhead and cost of bringing up
a cluster. On a good day, it takes AWS spot instances 15 - 20 minutes to
bring up a 30 node cluster. This makes it non-efficient for computations
which may take only 10 - 15 minutes."

Hmm, this is a misleading message. The overhead of bringing up AWS Spark
spot instances is NOT an inherent problem of Spark.
If you have a cluster that is already running, a Spark job can be started
within ~100ms.

Best,

On Thu, Jun 26, 2014 at 7:15 AM, Aureliano Buendia <bu...@gmail.com>
wrote:

>
>
>
> On Thu, Jun 26, 2014 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> My first reaction was that Dataflow mapped more to Summingbird, as part
>>
>
> Summingbird is for map/reduce. Dataflow is the third generation of
> google's map/reduce, and it generalizes map/reduce the way Spark does. See
> more about this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
>
> It seems Dataflow is based on this paper:
> http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
>
> The paper mentions a few times in-memory computation. But I'm not sure how
> much Google's implementation resembles to Spark when it comes to in-memory
> computation.
>
> The current problem with Spark is the big overhead and cost of bringing up
> a cluster. On a good day, it takes AWS spot instances 15 - 20 minutes to
> bring up a 30 node cluster. This makes it non-efficient for computations
> which may take only 10 - 15 minutes.
>
>
>> of it is a higher-level system for doing a specific thing in
>> batch/streaming -- aggregations.
>>
>> On Wed, Jun 25, 2014 at 8:23 PM, Aureliano Buendia <bu...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Today Google announced their cloud dataflow, which is very similar to
>> spark
>> > in performing batch processing and stream processing.
>> >
>> > How does spark compare to Google cloud dataflow? Are they solutions
>> trying
>> > to aim the same problem?
>> >
>> >
>>
>
>

-- 

Michael B. Bui, PhD,
Senior Software Architect, ADATAO Inc.
www.adatao.com

Re: Spark vs Google cloud dataflow

Posted by Khanderao Kand <kh...@gmail.com>.

Dataflow is based on two papers: MillWheel for stream processing, and
FlumeJava for programming abstraction and optimization.

MillWheel: http://research.google.com/pubs/pub41378.html
FlumeJava: http://dl.acm.org/citation.cfm?id=1806638

Here is my blog entry on this:
http://texploration.wordpress.com/2014/06/26/google-dataflow-service-to-fight-against-amazon-kinesis/




On Fri, Jun 27, 2014 at 5:16 AM, Sean Owen <so...@cloudera.com> wrote:

> On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia <bu...@gmail.com>
> wrote:
> > Summingbird is for map/reduce. Dataflow is the third generation of
> google's
> > map/reduce, and it generalizes map/reduce the way Spark does. See more
> about
> > this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
>
> Yes, my point was that Summingbird is similar in that it is a
> higher-level service for batch/streaming computation, not that it is
> similar for being MapReduce-based.
>
> > It seems Dataflow is based on this paper:
> > http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
>
> FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflows is
> more than that but yeah that seems to be some of the 'language'. It is
> similar in that it is a distributed collection abstraction.
>

Re: Spark vs Google cloud dataflow

Posted by Marco Shaw <ma...@gmail.com>.

Sorry. Never mind...  I guess that's what "Summingbird" is all about. Never heard of it. 

> On Jun 27, 2014, at 7:10 PM, Marco Shaw <ma...@gmail.com> wrote:
> 
> Dean: Some interesting information... Do you know where I can read more about these coming changes to Scalding/Cascading?
> 
>> On Jun 27, 2014, at 9:40 AM, Dean Wampler <de...@gmail.com> wrote:
>> 
>> ... and to be clear on the point, Summingbird is not limited to MapReduce. It abstracts over Scalding (which abstracts over Cascading, which is being moved from MR to Spark) and over Storm for event processing.
>> 
>> 
>>> On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen <so...@cloudera.com> wrote:
>>> On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia <bu...@gmail.com> wrote:
>>> > Summingbird is for map/reduce. Dataflow is the third generation of google's
>>> > map/reduce, and it generalizes map/reduce the way Spark does. See more about
>>> > this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
>>> 
>>> Yes, my point was that Summingbird is similar in that it is a
>>> higher-level service for batch/streaming computation, not that it is
>>> similar for being MapReduce-based.
>>> 
>>> > It seems Dataflow is based on this paper:
>>> > http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
>>> 
>>> FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflows is
>>> more than that but yeah that seems to be some of the 'language'. It is
>>> similar in that it is a distributed collection abstraction.
>> 
>> 
>> 
>> -- 
>> Dean Wampler, Ph.D.
>> Typesafe
>> @deanwampler
>> http://typesafe.com
>> http://polyglotprogramming.com

Re: Spark vs Google cloud dataflow

Posted by Marco Shaw <ma...@gmail.com>.

Dean: Some interesting information... Do you know where I can read more about these coming changes to Scalding/Cascading?

> On Jun 27, 2014, at 9:40 AM, Dean Wampler <de...@gmail.com> wrote:
> 
> ... and to be clear on the point, Summingbird is not limited to MapReduce. It abstracts over Scalding (which abstracts over Cascading, which is being moved from MR to Spark) and over Storm for event processing.
> 
> 
>> On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen <so...@cloudera.com> wrote:
>> On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia <bu...@gmail.com> wrote:
>> > Summingbird is for map/reduce. Dataflow is the third generation of google's
>> > map/reduce, and it generalizes map/reduce the way Spark does. See more about
>> > this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
>> 
>> Yes, my point was that Summingbird is similar in that it is a
>> higher-level service for batch/streaming computation, not that it is
>> similar for being MapReduce-based.
>> 
>> > It seems Dataflow is based on this paper:
>> > http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
>> 
>> FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflows is
>> more than that but yeah that seems to be some of the 'language'. It is
>> similar in that it is a distributed collection abstraction.
> 
> 
> 
> -- 
> Dean Wampler, Ph.D.
> Typesafe
> @deanwampler
> http://typesafe.com
> http://polyglotprogramming.com

Re: Spark vs Google cloud dataflow

Posted by Dean Wampler <de...@gmail.com>.

... and to be clear on the point, Summingbird is not limited to MapReduce.
It abstracts over Scalding (which abstracts over Cascading, which is being
moved from MR to Spark) and over Storm for event processing.


On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen <so...@cloudera.com> wrote:

> On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia <bu...@gmail.com>
> wrote:
> > Summingbird is for map/reduce. Dataflow is the third generation of
> google's
> > map/reduce, and it generalizes map/reduce the way Spark does. See more
> about
> > this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
>
> Yes, my point was that Summingbird is similar in that it is a
> higher-level service for batch/streaming computation, not that it is
> similar for being MapReduce-based.
>
> > It seems Dataflow is based on this paper:
> > http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
>
> FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflows is
> more than that but yeah that seems to be some of the 'language'. It is
> similar in that it is a distributed collection abstraction.
>



-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com

Re: Spark vs Google cloud dataflow

Posted by Sean Owen <so...@cloudera.com>.

On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia <bu...@gmail.com> wrote:
> Summingbird is for map/reduce. Dataflow is the third generation of google's
> map/reduce, and it generalizes map/reduce the way Spark does. See more about
> this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s

Yes, my point was that Summingbird is similar in that it is a
higher-level service for batch/streaming computation, not that it is
similar in being MapReduce-based.

> It seems Dataflow is based on this paper:
> http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf

FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflow is
more than that, but yes, that seems to be some of the 'language'. It is
similar in that it is a distributed collection abstraction.
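
For anyone unfamiliar with the idea, here is a toy sketch of what a
collection abstraction with deferred operations looks like. It is only an
illustration: `LazyCollection` is a hypothetical class, it runs in a single
process, and it is not the API of FlumeJava, Crunch, Dataflow, or Spark. The
point is just that `map` and `filter` record work to do, and nothing executes
until `run()` is called, which is what lets a real system plan the whole
pipeline before running it.

```python
class LazyCollection:
    """Toy collection that records operations and applies them on run()."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Record the step; do not touch the data yet.
        return LazyCollection(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyCollection(self._source, self._ops + (("filter", pred),))

    def run(self):
        # Apply all recorded steps in a single pass over the source.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


words = LazyCollection(["spark", "dataflow", "crunch", "flumejava"])
lengths = words.filter(lambda w: len(w) > 5).map(len)
print(lengths.run())  # [8, 6, 9]
```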

Re: Spark vs Google cloud dataflow

Posted by Nicholas Chammas <ni...@gmail.com>.

On Thu, Jun 26, 2014 at 10:15 AM, Aureliano Buendia <bu...@gmail.com>
wrote:

> On a good day, it takes AWS spot instances 15 - 20 minutes to bring up a
> 30 node cluster. This makes it non-efficient for computations which may
> take only 10 - 15 minutes.

I feel like there should be an issue or something to track bringing this
time down. It would be a major improvement to be able to spin up clusters
of any size in less than 5 minutes. Perhaps through the clever use of AMIs
and parallelized SSH it can be done.
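
A rough sketch of the parallel part, assuming the per-node setup step can be
wrapped in a function. `run_on_all` and `fake_setup` are hypothetical names; a
real version would invoke SSH or rsync inside the per-host function, which
this sketch deliberately leaves out.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_on_all(hosts, setup_one, max_workers=32):
    """Run setup_one(host) for every host concurrently.

    Returns a dict mapping each host to the result of its setup call.
    """
    hosts = list(hosts)
    if not hosts:
        return {}
    workers = min(max_workers, len(hosts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input hosts.
        results = list(pool.map(setup_one, hosts))
    return dict(zip(hosts, results))


def fake_setup(host):
    # Stand-in for a slow remote step such as copying files to the node.
    time.sleep(0.05)
    return f"{host}: configured"


cluster = [f"node-{i}" for i in range(30)]
status = run_on_all(cluster, fake_setup)
print(len(status), status["node-0"])
```

Since the per-node work is mostly waiting on the network, threads should be
enough here; the total time then tracks the slowest node rather than the sum
over all nodes.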

Nick

Re: Spark vs Google cloud dataflow

Posted by Aureliano Buendia <bu...@gmail.com>.

On Thu, Jun 26, 2014 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:

> My first reaction was that Dataflow mapped more to Summingbird, as part
>

Summingbird is for MapReduce. Dataflow is the third generation of Google's
MapReduce, and it generalizes MapReduce the way Spark does. See more
about this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s

It seems Dataflow is based on this paper:
http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf

The paper mentions in-memory computation a few times, but I'm not sure how
closely Google's implementation resembles Spark when it comes to in-memory
computation.

The current problem with Spark is the big overhead and cost of bringing up
a cluster. On a good day, it takes AWS spot instances 15 - 20 minutes to
bring up a 30 node cluster. This makes it inefficient for computations
that may take only 10 - 15 minutes.
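
As a back-of-the-envelope check using only the numbers above, this small
sketch computes what share of the total wall-clock time goes to bringing the
cluster up. `provisioning_share` is a hypothetical helper written for this
illustration.

```python
def provisioning_share(startup_min, job_min):
    """Fraction of total wall-clock time spent bringing the cluster up."""
    if startup_min < 0 or job_min < 0:
        raise ValueError("durations must be non-negative")
    total = startup_min + job_min
    if total == 0:
        raise ValueError("total duration must be positive")
    return startup_min / total


# Best case from the figures above: 15 min startup, 15 min job.
best = provisioning_share(15, 15)
# Worst case from the figures above: 20 min startup, 10 min job.
worst = provisioning_share(20, 10)
print(f"{best:.0%} to {worst:.0%}")  # 50% to 67%
```

In other words, with these figures at least half of the time paid for is
spent waiting for the cluster rather than computing.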

> of it is a higher-level system for doing a specific thing in
> batch/streaming -- aggregations.
>
> On Wed, Jun 25, 2014 at 8:23 PM, Aureliano Buendia <bu...@gmail.com>
> wrote:
> > Hi,
> >
> > Today Google announced their cloud dataflow, which is very similar to
> spark
> > in performing batch processing and stream processing.
> >
> > How does spark compare to Google cloud dataflow? Are they solutions
> trying
> > to aim the same problem?
> >
> >
>

Re: Spark vs Google cloud dataflow

Posted by Sean Owen <so...@cloudera.com>.

Dataflow is a hosted service and tries to abstract an entire pipeline;
Spark maps to some components in that pipeline and is software. My
first reaction was that Dataflow mapped more to Summingbird, as part
of it is a higher-level system for doing a specific thing in
batch/streaming -- aggregations.

On Wed, Jun 25, 2014 at 8:23 PM, Aureliano Buendia <bu...@gmail.com> wrote:
> Hi,
>
> Today Google announced their cloud dataflow, which is very similar to spark
> in performing batch processing and stream processing.
>
> How does spark compare to Google cloud dataflow? Are they solutions trying
> to aim the same problem?
>
>