Posted to user@spark.apache.org by Ian Wood <ib...@gmail.com> on 2015/08/31 22:39:54 UTC

Checkpointing in Spark without Streaming

I've been trying to track down a problem in Spark code relating to a task
with many iterations. When trying to recreate an error with simpler code, I
ran into a StackOverflowError caused by a long lineage. The solution is to add
checkpoints, but the behavior of checkpoints is not well defined in the
Spark documentation. I have some questions regarding checkpoints when not
using Spark Streaming.

1. How can I clean up old checkpoints once a new checkpoint has been
created? I have heard this is done automatically in Spark Streaming; can
the same behavior be set up for programs that otherwise need nothing from
Spark Streaming?
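For context, the manual cleanup I have been considering looks like the
sketch below. It relies on RDD.getCheckpointFile to find the directory Spark
wrote and deletes it through the Hadoop FileSystem API; whether this is safe
while other RDDs still reference that data is exactly what I am unsure about
(the sketch is untested and assumes a checkpoint has already been written):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Delete the on-disk checkpoint files of an RDD by hand, once a newer
// checkpoint has replaced it. getCheckpointFile returns the directory
// Spark wrote for that RDD, if it was ever checkpointed.
def removeCheckpoint(rdd: RDD[_]): Unit =
  rdd.getCheckpointFile.foreach { dir =>
    val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
    fs.delete(new Path(dir), true) // recursive delete of the checkpoint dir
  }
```

Obviously this breaks the RDD it is called on if that RDD is still needed,
so it only seems usable after a later checkpoint has fully superseded it.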

2. In the code below (run in spark-shell with master=yarn-client), is the
map(x => x) really necessary to successfully create a checkpoint?

sc.setCheckpointDir("checkpoints")
var hello = sc.parallelize(Array(1.0, 2.0, 3.0))
hello.checkpoint()
println(hello.isCheckpointed) // prints false
hello.count()
println(hello.isCheckpointed) // prints true

val mappedRDD = hello.map(x => -x)
val result = mappedRDD.reduce(_ + _)

hello = mappedRDD
hello.checkpoint()
println(hello.isCheckpointed) // prints false
hello.count()
println(hello.isCheckpointed) // prints false

hello = mappedRDD.map(x => x)
hello.checkpoint()
println(hello.isCheckpointed) // prints false
hello.count()
println(hello.isCheckpointed) // prints true
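One reading of the middle block, if I understand the RDD.checkpoint()
scaladoc correctly: checkpoint() must be called before any job has been
executed on the RDD, and mappedRDD has already been used by the reduce, so
marking it afterwards has no effect, while map(x => x) yields a fresh RDD
with no prior job. If that reading is right, the identity map can be avoided
by reordering (untested sketch):

```scala
val mappedRDD = hello.map(x => -x)
mappedRDD.checkpoint()     // mark before any job runs on this RDD
mappedRDD.persist()        // recommended, so checkpointing avoids recomputation
val result = mappedRDD.reduce(_ + _) // first job also writes the checkpoint
println(mappedRDD.isCheckpointed)    // should now print true
```

Can anyone confirm whether the call-before-first-job rule is actually what
is happening here?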

3. Is it possible that a long lineage could cause ActorNotFound errors in
the yarn logs without StackOverflow errors in those same logs (in
particular, with GraphX code)?
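For background on what I am actually running: an iterative loop in which I
now checkpoint every few iterations to truncate the lineage. A minimal
sketch of the pattern is below; the interval of 10 and the persist-then-count
sequence are my own guesses at a workable recipe, not something I found in
the documentation:

```scala
var rdd = sc.parallelize(1 to 1000).map(_.toDouble)
val checkpointInterval = 10  // my own guess at a reasonable interval

for (i <- 1 to 100) {
  rdd = rdd.map(_ * 1.001)   // stand-in for the real per-iteration transform
  if (i % checkpointInterval == 0) {
    rdd.persist()            // keep it cached so the checkpoint write is cheap
    rdd.checkpoint()         // mark before running any job on this RDD
    rdd.count()              // force a job so the checkpoint is actually written
  }
}
println(rdd.reduce(_ + _))
```

Even with this in place I still see the ActorNotFound errors in the yarn
logs when the interval is too large, which is what prompted the question.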

Any insight into these problems would be much appreciated.

Thanks,
Ian