Posted to user@beam.apache.org by dev wearebold <we...@gmail.com> on 2019/09/30 17:55:53 UTC
Re: What's the difference between Cloud Dataflow and Spark in terms of the execution?
Hello, thanks for this explanation =)
On 2019/09/30 16:59:54, Eugene Kirpichov <k....@google.com> wrote:
> Hi,
>
> Cloud Dataflow also has worker nodes and a master - though the master is
> part of the Cloud Dataflow service and runs on Google's internal servers. I
> believe all distributed data processing tools use a similar architecture.
>
> Some major differences I can point out quickly:
> - As you said, Spark uses in-memory caching of datasets. Dataflow doesn't
> do that, because its programming model is different (see below).
> - Dataflow separates the pipeline construction stage from execution - you
> construct the whole pipeline and give it to Dataflow, and Dataflow optimizes
> the whole thing and runs it. Because of this, PCollections are merely
> logical nodes in the execution plan (see the first sketch after this list).
> In Spark this is a lot more blurred - collections (RDDs) can be used
> directly. This allows the user program to interactively request their
> contents, making interactive and iterative computing possible (Dataflow
> currently has very young support for the former and none for the latter) -
> however, it comes at the expense of difficulty doing whole-program
> optimization / monitoring / analysis. In-memory caching plays a critical
> role in enabling this aspect of Spark, but would not be so useful in
> Dataflow.
> - Dataflow's sharding model is different. Both Spark and Dataflow split
> datasets into shards for parallel execution, but in Spark's case the set
> of shards is predetermined at the beginning of an operation and its
> execution model critically relies on this fact, whereas Dataflow's
> execution model only relies on the fact that "once all shards of a stage
> complete, the stage is done", so Dataflow can do liquid sharding (dynamic
> splitting of running shards; the second sketch after this list shows the
> SDK-side hook for this). Liquid sharding, in turn, makes autoscaling
> possible: e.g. Dataflow can start running a stage with only a few shards
> and gradually subsplit them into thousands of shards running on hundreds
> of workers as it realizes that the stage is very large.
> - Dataflow's streaming engine is very different from Spark's, though I
> believe Spark has gotten closer. I'm not familiar enough with either to
> comment more, maybe someone else can.
> - Dataflow's shuffle engine is also very different from Spark's, and is
> encapsulated as a service (the Shuffle Service), which further improves the
> ability to do autoscaling (much easier when no data is stored on the
> workers) and is faster because it can use fancy internal-only hardware and
> software not available on cloud workers (the third sketch after this list
> shows how the service-side shuffle is enabled).
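>
> To make the construction-vs-execution split concrete, here is a minimal
> sketch in Python (the element values are made up for illustration). In
> Beam, each transform only adds a node to the plan, and nothing runs until
> the whole pipeline is handed to the runner:
>
>     import apache_beam as beam
>
>     with beam.Pipeline() as p:
>         words = (
>             p
>             | beam.Create(['a b', 'b c'])   # a logical node, no data yet
>             | beam.FlatMap(lambda line: line.split()))
>         counts = words | beam.combiners.Count.PerElement()
>         # 'counts' cannot be printed here - it is not an in-memory dataset.
>     # Leaving the 'with' block submits the fully built pipeline for execution.
>
> In Spark, by contrast, the user program can cache a collection and pull its
> contents straight back into the driver, which is what makes interactive use
> natural:
>
>     from pyspark import SparkContext
>
>     sc = SparkContext('local', 'demo')
>     words = sc.parallelize(['a b', 'b c']).flatMap(lambda l: l.split())
>     words.cache()                   # keep the dataset in memory for reuse
>     print(words.countByValue())    # contents come back to the driver now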
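>
> Liquid sharding itself happens inside the Dataflow service, but the hook
> Beam exposes for it on the SDK side is the splittable DoFn / restriction
> tracker API. A rough Python sketch, assuming each input element is just an
> integer upper bound (purely illustrative names and element type):
>
>     import apache_beam as beam
>     from apache_beam.io.restriction_trackers import (
>         OffsetRange, OffsetRestrictionTracker)
>     from apache_beam.transforms.core import RestrictionProvider
>
>     class CountProvider(RestrictionProvider):
>         def initial_restriction(self, element):
>             return OffsetRange(0, element)
>         def create_tracker(self, restriction):
>             return OffsetRestrictionTracker(restriction)
>         def restriction_size(self, element, restriction):
>             return restriction.size()
>
>     class EmitRange(beam.DoFn):
>         def process(self, element,
>                     tracker=beam.DoFn.RestrictionParam(CountProvider())):
>             position = tracker.current_restriction().start
>             # The runner may split off the unclaimed tail of the
>             # restriction at any time - that is the dynamic split that
>             # liquid sharding builds on.
>             while tracker.try_claim(position):
>                 yield position
>                 position += 1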
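>
> As a usage note, the service-side shuffle is opted into through a pipeline
> option; 'shuffle_mode=service' is the experiment name I believe is used for
> batch jobs, though the exact flag may vary by SDK version, and the project
> and bucket names below are placeholders:
>
>     from apache_beam.options.pipeline_options import PipelineOptions
>
>     options = PipelineOptions(
>         runner='DataflowRunner',
>         project='my-project',                 # placeholder project id
>         region='us-central1',
>         temp_location='gs://my-bucket/tmp',   # placeholder bucket
>         experiments=['shuffle_mode=service'])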
>
> Disclaimer: I've worked on Dataflow but I didn't work on Spark, so the
> above is biased in favor of Dataflow. People with more knowledge of Spark
> should be able to balance this out with more of Spark's major capabilities.
>
> On Fri, Sep 27, 2019 at 11:42 PM dev wearebold <we...@gmail.com>
> wrote:
>
> > Hello folks!
> >
> > I’m trying to get a deeper understanding of how Cloud Dataflow runs our
> > Beam programs.
> >
> > I worked with Spark for a few months and I understood that you have some
> > kind of cluster topology with a driver program which creates the
> > SparkContext, some worker nodes and a cluster manager. Also, I know that
> > Spark is very fast thanks to its in-memory computing.
> >
> > Is it the same case for Cloud Dataflow? What are the big differences
> > between them?
>