Posted to dev@spark.apache.org by Adrian Tanase <at...@adobe.com> on 2015/10/29 12:04:04 UTC

Spark Streaming - failed recovery from checkpoint

Hi guys,

I’ve encountered some problems with a crashed Spark Streaming job when restoring from checkpoint.
I’m running Spark 1.5.1 on YARN (Hadoop 2.6) in cluster mode, reading from Kafka with the direct consumer and a few updateStateByKey stateful transformations.

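For context, the job follows the standard StreamingContext.getOrCreate checkpoint-recovery pattern. A simplified sketch of the setup (the broker, topic, checkpoint path and state function below are placeholders, not the real code):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///checkpoints/my-streaming-app"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(20))          // 20-second batches
  ssc.checkpoint(checkpointDir)

  // Direct Kafka consumer (no receivers); broker and topic are placeholders.
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  // One of several updateStateByKey transformations; the state this builds is
  // exactly what checkpoint recovery has to restore.
  val updateFn: (Seq[Long], Option[Long]) => Option[Long] =
    (values, state) => Some(state.getOrElse(0L) + values.sum)
  val counts = messages.map { case (_, v) => (v, 1L) }.updateStateByKey(updateFn)
  counts.print()

  ssc
}

// On restart the driver rebuilds the context from the checkpoint folder if one exists.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
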
After investigating, I think the following happened:

  *   The active ResourceManager crashed (the AWS machine died)
  *   10 minutes later - default YARN settings :( - the standby RM took over and redeployed the job, sending a SIGTERM to the running driver
  *   Recovery from checkpoint failed because of a missing RDD in the checkpoint folder

One complication - UNCONFIRMED because of missing logs - I believe the new driver was started ~5 minutes before the old one stopped.

With your help, I’m trying to zero in on the root cause, which could be one or a combination of:

  *   Bad YARN/Spark configuration (10 minutes to react to a missing node; already fixed through more aggressive liveness settings)
  *   A YARN fact of life - why is a running job redeployed when the standby RM takes over?
  *   A bug/race condition in Spark checkpoint cleanup/recovery (why is the RDD cleaned up by the old app, and why does recovery then fail when it looks for it?)
  *   Bugs in the YARN-Spark integration (missing heartbeats? Why is the new app attempt started 5 minutes before the old one dies?)
  *   Application code - should we add graceful shutdown? Should I add a ZooKeeper lock that prevents 2 instances of the driver from starting at the same time? (rough sketch after this list)

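Regarding the last bullet - what I have in mind is something like Curator’s LeaderLatch guarding driver startup. A rough sketch, with the ZK connect string and lock path made up (for the graceful-shutdown part, my understanding is that spark.streaming.stopGracefullyOnShutdown=true should cover the SIGTERM case):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.retry.ExponentialBackoffRetry

// Hypothetical: block until this driver attempt holds the latch, so two attempts
// can never recover from (and clean up) the same checkpoint folder concurrently.
val zk = CuratorFrameworkFactory.newClient("zk1:2181", new ExponentialBackoffRetry(1000, 3))
zk.start()

val latch = new LeaderLatch(zk, "/locks/my-streaming-app-driver")
latch.start()
latch.await()   // only one driver attempt gets past this point at a time

// ...create or recover the StreamingContext and start it here...
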
Sorry if the questions are a little all over the place; getting to the root cause of this was a pain, and I can’t even log an issue in JIRA without your help.

Attaching some logs that showcase the checkpoint recovery failure (I’ve grepped for “checkpoint” to highlight the core issue):

  *   Driver logs prior to shutdown: http://pastebin.com/eKqw27nT
  *   Driver logs, failed recovery: http://pastebin.com/pqACKK7W
  *   Other info:
      *   spark.streaming.unpersist = true
      *   spark.cleaner.ttl = 259200 (3 days)

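In case it matters, those two settings are simply set on the SparkConf, along these lines (sketch; they could equally be passed via spark-submit --conf):

// Sketch of how the two settings above are applied.
val conf = new SparkConf()
  .set("spark.streaming.unpersist", "true")   // auto-unpersist generated RDDs once no longer needed
  .set("spark.cleaner.ttl", "259200")         // periodically forget metadata / persisted RDDs older than 3 days
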
Last question - in the checkpoint recovery process I notice that it goes back ~6 minutes on the persisted RDDs and ~10 minutes to replay from Kafka.
I’m running with 20-second batches and a 100-second checkpoint interval (small issue - one of the RDDs was using the default interval of 20 seconds). Shouldn’t the lineage be a lot smaller?
Based on the documentation, I would have expected recovery to go back at most 100 seconds, as I’m not doing any windowed operations…
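
For reference, continuing the sketch from the setup above, this is roughly how the checkpoint intervals end up - the second state stream is the one I forgot to set explicitly (names are placeholders):

// Simplified: two state streams off the same input; only the first gets an explicit interval.
val perKey = messages.map { case (_, v) => (v, 1L) }

val stateA = perKey.updateStateByKey(updateFn)
stateA.checkpoint(Seconds(100))   // explicit 100-second checkpoint interval

val stateB = perKey.updateStateByKey(updateFn)
// no explicit checkpoint(...) here, so this stream stayed on the default interval,
// which in this job ends up equal to the 20-second batch interval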

Thanks in advance!
-adrian