Posted to user@spark.apache.org by "Lunagariya, Dhaval " <dh...@citi.com.INVALID> on 2019/02/25 06:46:30 UTC

Don't find Skipped Stages in Spark Dataset

I am trying to understand Spark execution in the case of Datasets.

For RDDs I found the following in the Spark docs:

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don't need to be re-created if the lineage is re-computed.
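For reference, here is a minimal sketch of the RDD side of my experiment (not the exact code from the link below; the SparkSession setup, names and data sizes are only illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-shuffle-reuse")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// reduceByKey introduces a shuffle; its output files stay on disk while this RDD is referenced
val reduced = sc.parallelize(1 to 1000000)
  .map(i => (i % 100, 1))
  .reduceByKey(_ + _)

reduced.count()   // job 1: runs the shuffle-map stage
reduced.collect() // job 2: reuses the shuffle output, and the stage shows as "skipped" in the UI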

I tried running a similar thing with an RDD and a Dataset, but I don't find skipped stages in the Dataset execution. Is there any hint I need to add in the code to preserve the shuffle? I mean, I want the Dataset to share shuffle files between jobs.
Code sample available here:
https://stackoverflow.com/questions/54848119/dont-find-skipped-stages-in-spark-dataset
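
The Dataset side looks roughly like this (again only a sketch, not the linked code; column names are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("dataset-shuffle")
  .master("local[*]")
  .getOrCreate()

// groupBy puts an Exchange (shuffle) into the physical plan
val aggregated = spark.range(0, 1000000)
  .withColumn("key", col("id") % 100)
  .groupBy("key")
  .sum("id")

aggregated.count()   // job 1: runs the exchange
aggregated.collect() // job 2: here I expected a skipped stage, but the exchange runs again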

Regards,
Dhaval