Posted to user@spark.apache.org by didata <su...@didata.us> on 2014/09/06 20:44:58 UTC

Q: About scenarios where driver execution flow may block...

Hello friends:


I have a theory question about call blocking in a Spark driver.


Consider this (admittedly contrived =:)) snippet to illustrate this question...


>>> x = rdd01.reduceByKey(lambda a, b: a + b)  # or some other shuffle-requiring transformation

>>> b = sc.broadcast(x.take(20))  # or any statement that requires the previous statement to complete, cluster-wide

>>> y = rdd02.someAction(f(b))


Would the first or second statement above block, given that the second (or 
third) statement needs to wait for the previous one to complete, cluster-wide?


Maybe this isn't the best example (typed on a phone), but generally I'm 
trying to understand the scenario(s) where an RDD call in the driver may 
block because the graph indicates that the next statement depends on 
the completion of the current one, cluster-wide (not just lazily evaluated).
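
If it helps, here's my mental model written out as code (a toy sketch using 
rdd01 and sc from the snippet above; the timing prints just mark where I'd 
expect the driver to wait):

>>> import time
>>> t0 = time.time()
>>> x = rdd01.reduceByKey(lambda a, b: a + b)    # expect: returns immediately (lazy)
>>> print("after reduceByKey: %.2fs" % (time.time() - t0))
>>> top = x.take(20)                             # expect: driver blocks here while
>>> print("after take: %.2fs" % (time.time() - t0))  # the shuffle runs cluster-wide
>>> b = sc.broadcast(top)                        # plain local call once take() returns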

Thank you. :)


Sincerely yours,
Team Dimension Data

Re: Q: About scenarios where driver execution flow may block...

Posted by Mayur Rustagi <ma...@gmail.com>.
Statements are executed only when you try to cause some effect on the
server (produce data, or collect data on the driver). At execution time, Spark
does all the dependency resolution, truncates paths that don't go anywhere,
and optimizes the execution pipelines. So you really don't have to worry
about these.
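
Something like this toy example (local mode, made-up data) shows the behavior:

from pyspark import SparkContext

sc = SparkContext("local[2]", "laziness-demo")
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

summed = rdd.reduceByKey(lambda a, b: a + b)  # nothing runs yet: only the DAG grows
doubled = summed.mapValues(lambda v: v * 2)   # still nothing runs

# Only this action makes Spark resolve the dependency graph, schedule the
# shuffle, and block the driver until the result comes back.
print(doubled.collect())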

The important thing is that if your functions perform actions that are
implicitly (rather than explicitly) dependent on others, you may start seeing
errors. For example, you may write a file to HDFS during one map operation and
expect to read it in another map operation. As far as Spark is concerned, a
map operation is not expected to alter anything apart from the RDD it is
created upon, so Spark may not realize this dependency and may try to run the
two operations in parallel, causing errors. Bottom line: as long as you make
all your dependencies explicit through RDDs, Spark will take care of the magic.
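
For example, you could make such a dependency explicit like this (a sketch
only; the HDFS path and the data are made up):

from pyspark import SparkContext

sc = SparkContext("local[2]", "explicit-deps")
rdd01 = sc.parallelize([("a", 1), ("b", 2)])

out = "hdfs:///tmp/stage1"  # hypothetical path

# saveAsTextFile is an action, so the driver blocks on this line until every
# partition has been written. Doing the write inside a map() instead would
# hide this dependency from Spark.
rdd01.map(lambda kv: "%s\t%d" % kv).saveAsTextFile(out)

# By the time this line runs, the files above are guaranteed to exist, and
# the new lineage starts cleanly from the written output.
rdd02 = sc.textFile(out)
print(rdd02.collect())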

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

