Posted to user@beam.apache.org by dev wearebold <we...@gmail.com> on 2019/09/30 17:55:53 UTC

Re: What's the difference between Cloud Dataflow and Spark in terms of the execution?

Hello, thanks for this explanation =)


On 2019/09/30 16:59:54, Eugene Kirpichov <k....@google.com> wrote: 
> Hi,
> 
> Cloud Dataflow also has worker nodes and a master - though the master is
> part of the Cloud Dataflow service and runs on Google's internal servers. I
> believe all distributed data processing tools use a similar architecture.
> 
> Some major differences I can point out quickly:
> - As you said, Spark uses in-memory caching of datasets. Dataflow doesn't
> do that, because its programming model is different (see below).
> - Dataflow separates the pipeline construction stage from execution - you
> construct the whole pipeline and give it to Dataflow, which optimizes the
> whole thing and runs it. Because of this, PCollections are merely logical
> nodes in the execution plan (see the first sketch after this list). In
> Spark this is a lot more blurred - collections (RDDs) can be used
> directly. This allows the user program to interactively request their
> contents, making interactive and iterative computing possible (Dataflow
> currently has very young support for the former and none for the latter) -
> however, it comes at the expense of difficulty doing whole-program
> optimization / monitoring / analysis. In-memory caching plays a critical
> role in enabling this aspect of Spark, but would not be so useful in
> Dataflow.
> - Dataflow's sharding model is different. Both Spark and Dataflow split
> datasets into shards for parallel execution, but in Spark's case the set
> of shards is predetermined at the beginning of an operation and its
> execution model critically relies on this fact, whereas Dataflow's
> execution model only relies on the fact that "once all shards of a stage
> complete, the stage is done", so Dataflow can do liquid sharding (dynamic
> splitting of running shards). Liquid sharding, in turn, makes autoscaling
> possible: e.g. Dataflow can start running a stage with only a few shards
> and gradually subsplit them into thousands of shards running on hundreds
> of workers as it realizes that the stage is very large (see the second
> sketch after this list).
> - Dataflow's streaming engine is very different from Spark's, though I
> believe Spark has gotten closer. I'm not familiar enough with either to
> comment more; maybe someone else can.
> - Dataflow's shuffle engine is also very different from Spark's, and is
> encapsulated as a service (the Shuffle Service), which further improves
> the ability to do autoscaling (much easier when no data is stored on the
> workers) and is faster because it can use fancy internal-only hardware
> and software not available on cloud workers (see the third sketch after
> this list).
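> 
> For example, here is a minimal sketch of the construction/execution split
> using the Beam Python SDK (the pipeline itself is illustrative):
> 
>     import apache_beam as beam
> 
>     # Build the execution plan; no data is processed yet.
>     with beam.Pipeline() as p:
>         lines = p | 'Create' >> beam.Create(['a', 'b', 'c'])
>         upper = lines | 'Upper' >> beam.Map(str.upper)
>         upper | 'Print' >> beam.Map(print)
>     # Execution only happens once the whole pipeline is submitted (here,
>     # on exiting the `with` block). `lines` and `upper` are logical nodes
>     # in the plan - unlike a Spark RDD, you can't iterate over their
>     # contents directly in the driver program.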
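> 
> As an illustration of autoscaling, a sketch of the Dataflow runner
> options involved (the project and region values are placeholders):
> 
>     from apache_beam.options.pipeline_options import PipelineOptions
> 
>     # Dataflow can start a stage small and scale out as liquid sharding
>     # reveals how much work the stage actually contains.
>     options = PipelineOptions(
>         runner='DataflowRunner',
>         project='my-project',    # placeholder project id
>         region='us-central1',    # placeholder region
>         autoscaling_algorithm='THROUGHPUT_BASED',
>         max_num_workers=100,
>     )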
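> 
> And the service-based shuffle is opted into with an experiment flag -
> treat the exact flag name as an assumption and check the Dataflow docs:
> 
>     from apache_beam.options.pipeline_options import PipelineOptions
> 
>     # Assumed experiment flag: routes batch shuffle through the Shuffle
>     # Service instead of keeping shuffle data on worker disks.
>     options = PipelineOptions(
>         runner='DataflowRunner',
>         experiments=['shuffle_mode=service'],
>     )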
> 
> Disclaimer: I've worked on Dataflow but I haven't worked on Spark, so the
> above is biased in favor of Dataflow. People with more knowledge of Spark
> should be able to balance this out with more of Spark's major capabilities.
> 
> On Fri, Sep 27, 2019 at 11:42 PM dev wearebold <we...@gmail.com>
> wrote:
> 
> > Hello folks!
> >
> > I'm trying to get a deeper understanding of how Cloud Dataflow runs our
> > Beam programs.
> >
> > I worked with Spark for a few months and I understood that you have some
> > kind of cluster topology with a driver program which creates the
> > SparkContext, some worker nodes, and a cluster manager. Also, I know
> > that Spark is very fast thanks to its in-memory computing.
> >
> > Is it the same case for Cloud Dataflow? What are the big differences
> > between them?
>