Posted to user@beam.apache.org by dev wearebold <we...@gmail.com> on 2019/09/28 06:42:48 UTC

What's the difference between Cloud Dataflow and Spark in terms of the execution?

Hello folks!

I’m trying to get a deeper understanding of how Cloud Dataflow runs our Beam programs.

I worked with Spark for a few months and understood that it has a cluster topology with a driver program that creates the SparkContext, some worker nodes, and a cluster manager. I also know that Spark is very fast thanks to its in-memory computing.

Is it the same case for Cloud Dataflow? What are the big differences between them?

Re: What's the difference between Cloud Dataflow and Spark in terms of the execution?

Posted by Eugene Kirpichov <ki...@google.com>.
Hi,

Cloud Dataflow also has worker nodes and a master - though the master is
part of the Cloud Dataflow service and runs on Google's internal servers. I
believe all distributed data processing tools use a similar
architecture.

Some major differences I can point out quickly:
- As you said, Spark uses in-memory caching of datasets. Dataflow doesn't
do that, because its programming model is different (see below).
- Dataflow separates the pipeline construction stage from execution - you
construct the whole pipeline and give it to Dataflow, and Dataflow optimizes
the whole thing and runs it. Because of this, PCollections are merely
logical nodes in the execution plan. In Spark this is a lot more blurred -
collections (RDDs) can be used directly. This allows the user program to
interactively request their contents, making interactive and iterative
computing possible (Dataflow currently has very young support for the former
and no support for the latter) - however, it comes at the expense of
making whole-program optimization / monitoring / analysis difficult.
In-memory caching plays a critical role in enabling this aspect of Spark,
but would not be so useful in Dataflow.
- Dataflow's sharding model is different. Both Spark and Dataflow split
datasets into shards for parallel execution, but in the case of Spark the
set of shards is predetermined at the beginning of an operation and its
execution model critically relies on this fact, whereas in Dataflow the
execution model only relies on the fact that "once all shards of a stage
complete, the stage is done", so Dataflow can do liquid sharding (dynamic
splitting of running shards). Liquid sharding, in turn, makes autoscaling
possible, e.g. Dataflow can start running a stage with only a few shards
and gradually subsplit them into thousands of shards running on hundreds of
workers as it realizes that the stage is very large.
- Dataflow's streaming engine is very different from Spark's, though I
believe Spark has gotten closer. I'm not familiar enough with either to
comment more, maybe someone else can.
- Dataflow's shuffle engine is also very different from Spark's, and is
encapsulated as a service (the Shuffle Service), which further improves
the ability to autoscale (much easier to do when no data is stored on the
workers) and is faster because it can use fancy internal-only hardware and
software not available on cloud workers.
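
To make the liquid sharding point above a bit more concrete, here is a toy
Python sketch. It is not Dataflow's actual implementation or API: the
function run_stage, its parameters, and the "one shard per worker per round"
scheduling are all hypothetical and exist only for illustration. The only
idea it tries to capture is that a stage is done once all of its shards
complete, so a running shard can be split whenever a worker would otherwise
sit idle.

```python
def run_stage(total_items, initial_shards=2, max_workers=8, items_per_round=10):
    """Toy model of dynamic shard splitting (hypothetical, not Dataflow code).

    A shard is a half-open range [start, end) of work items. Returns a tuple
    (items_processed, peak_shard_count, rounds_taken).
    """
    if total_items <= 0:
        return 0, 0, 0

    # Start with only a few large shards.
    step = -(-total_items // initial_shards)  # ceiling division
    shards = [[s, min(s + step, total_items)] for s in range(0, total_items, step)]

    processed = 0
    rounds = 0
    peak_shards = len(shards)

    while shards:
        # "Liquid" part: while some worker would be idle, split the largest
        # remaining shard in half, even if it is already being processed.
        while len(shards) < max_workers:
            largest = max(shards, key=lambda r: r[1] - r[0])
            start, end = largest
            if end - start < 2:
                break  # nothing left that is worth splitting
            mid = (start + end) // 2
            largest[1] = mid           # the running shard keeps the first half
            shards.append([mid, end])  # the second half becomes a new shard
        peak_shards = max(peak_shards, len(shards))

        # Each worker advances one shard by a fixed amount of work.
        for shard in shards[:max_workers]:
            n = min(items_per_round, shard[1] - shard[0])
            shard[0] += n
            processed += n

        # The stage only cares that every shard eventually completes.
        shards = [s for s in shards if s[0] < s[1]]
        rounds += 1

    return processed, peak_shards, rounds


if __name__ == "__main__":
    # Same stage, with and without room to split onto more workers.
    print("fixed shards: ", run_stage(1000, initial_shards=2, max_workers=2))
    print("liquid shards:", run_stage(1000, initial_shards=2, max_workers=8))
```

In this toy model, setting max_workers equal to initial_shards stands in for
a fixed set of shards, while a larger max_workers lets the stage grow from a
few shards to many as work remains. Every item is still processed exactly
once in both cases.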

Disclaimer: I've worked on Dataflow but I didn't work on Spark, so the
above is biased in favor of Dataflow. People with more knowledge of Spark
should be able to balance this out with more of Spark's major capabilities.
