Posted to user@spark.apache.org by Daniel Stojanov <ma...@danielstojanov.com> on 2020/08/06 02:50:36 UTC

Understanding Spark execution plans

Hi,

When an execution plan is printed, it lists the tree of operations that will
be performed when the job runs. The steps have somewhat cryptic names of
the sort: BroadcastHashJoin, Project, Filter, etc. These do not appear to
map directly to the functions that are called on an RDD.

1) Is there a place in which each of these steps is documented?
2) Is there documentation, outside of Spark's source code, describing the
mapping between operations on Spark dataframes or RDDs and the resulting
physical execution plan? At least in a way that would allow for more
accurately understanding physical execution steps and predicting the steps
that would result from particular actions.

Regards,