Posted to dev@spark.apache.org by Tarek Elgamal <ta...@gmail.com> on 2016/02/27 09:58:48 UTC

Spark Checkpointing behavior

Hi,

I am trying to understand the behavior of rdd.checkpoint() in Spark. I am
running the JavaPageRank
<https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java>
example on a 1 GB graph and I am checkpointing the *ranks* RDD inside each
iteration (between lines 125 and 126 in the linked file). Spark execution
starts when it hits the *collect()* action. I expected the intermediate
ranks to be materialized and written to the checkpoint directory after each
iteration, but the RDD seems to be written only once, at the end of the
program, even though I invoke ranks.checkpoint() inside the for loop. Is
that the default behavior?
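
For illustration, the behavior described above can be modeled with a small
plain-Java sketch. This is a toy model, not Spark code: the Dataset class and
the runAction and simulate methods are hypothetical names, and the rule that
checkpoint() only marks a dataset while the write happens when an action runs
on that dataset is a simplifying assumption chosen to match the observation,
not a statement about Spark's internals.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT Spark code) of "mark now, write when an action runs". */
public class LazyCheckpointSketch {

    /** Hypothetical stand-in for an RDD: lineage link plus checkpoint flags. */
    static final class Dataset {
        final String name;
        final Dataset parent;         // null for the input dataset
        boolean markedForCheckpoint;  // set by checkpoint(); nothing is written yet
        boolean written;              // set once the pretend write has happened

        Dataset(String name, Dataset parent) {
            this.name = name;
            this.parent = parent;
        }

        /** Lazy transformation: only records lineage, computes nothing. */
        Dataset transform(String newName) {
            return new Dataset(newName, this);
        }

        /** Assumption: this only marks the dataset; no data is written here. */
        void checkpoint() {
            markedForCheckpoint = true;
        }
    }

    /** Pretend action: writes the target dataset if it was marked and not yet written. */
    static void runAction(Dataset target, List<String> checkpointDir) {
        if (target.markedForCheckpoint && !target.written) {
            checkpointDir.add(target.name);
            target.written = true;
        }
    }

    /** Runs the iteration loop and returns the names written to the pretend checkpoint dir. */
    public static List<String> simulate(int iterations, boolean actionPerIteration) {
        List<String> checkpointDir = new ArrayList<>();
        Dataset ranks = new Dataset("ranks-0", null);
        for (int i = 1; i <= iterations; i++) {
            ranks = ranks.transform("ranks-" + i);
            ranks.checkpoint();                  // mark inside the loop, as in the post
            if (actionPerIteration) {
                runAction(ranks, checkpointDir); // force a job in every iteration
            }
        }
        runAction(ranks, checkpointDir);         // the single action at the end
        return checkpointDir;
    }

    public static void main(String[] args) {
        System.out.println(simulate(3, false)); // prints [ranks-3]
        System.out.println(simulate(3, true));  // prints [ranks-1, ranks-2, ranks-3]
    }
}
```

In this model, marking inside the loop with a single action at the end writes
only the last dataset, while running an action in every iteration writes one
dataset per iteration.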

Note that I am caching the RDD before checkpointing in order to avoid
recomputing it.

Best Regards,
Tarek