Posted to dev@spark.apache.org by Krot Viacheslav <kr...@gmail.com> on 2016/06/07 14:11:06 UTC

streaming JobScheduler and error handling confusing behavior

Hi,
I don't know if it is a bug or a feature, but one thing in streaming error
handling seems confusing to me. I create a streaming context, start it, and
call #awaitTermination like this:

try {
  ssc.awaitTermination();
} catch (Exception e) {
  LoggerFactory.getLogger(getClass()).error("Job failed. Stopping JVM", e);
  System.exit(-1);
}

I expect the JVM to be terminated as soon as any job fails, and that no
further jobs will be started. But in practice this is not the case: before
the exception is caught, another job starts.
This is caused by the design of the JobScheduler event loop:

private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable =>
      reportError("Error in job scheduler", e)
  }
}

When an error happens, processEvent calls handleError, which wakes up the
lock in ContextWaiter and notifies my main thread. But meanwhile the event
loop starts the next job, and sometimes that is enough time for it to
complete! I have several jobs in each batch and want each of them to run if
and only if the previous one completed successfully.
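For reference, here is a simplified sketch of the handoff I mean (it is only
modeled on what ContextWaiter does, not the actual Spark source): the
event-loop thread merely records the error and signals a condition, while the
driver thread blocked in awaitTermination() wakes up asynchronously, so the
event loop is free to handle the next JobStarted event first.

import java.util.concurrent.locks.ReentrantLock

class SimplifiedWaiter {
  private val lock = new ReentrantLock()
  private val condition = lock.newCondition()
  private var error: Throwable = null
  private var stopped = false

  // Called from the JobScheduler event-loop thread (handleError):
  // record the error, signal, and return immediately to process more events.
  def notifyError(e: Throwable): Unit = {
    lock.lock()
    try { error = e; condition.signalAll() } finally { lock.unlock() }
  }

  // Called from the driver thread inside awaitTermination():
  // block until stop or error, then rethrow the error to the caller.
  def waitForStopOrError(): Unit = {
    lock.lock()
    try {
      while (!stopped && error == null) condition.await()
      if (error != null) throw error
    } finally { lock.unlock() }
  }
}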

From an API user's point of view this behavior is confusing, and you cannot
guess how it works without looking into the source code.

What do you think about adding another Spark configuration parameter,
'stopOnError', that stops the streaming context when an error happens and
does not allow the next job to run?
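To make the proposal concrete, here is a small self-contained sketch of the
semantics I have in mind (the flag name and this toy single-threaded event
loop are only illustrations, not Spark code): once a job fails, the scheduler
suppresses the jobs already queued instead of starting the next one while the
driver thread is still waking up.

import java.util.concurrent.LinkedBlockingQueue

object StopOnErrorSketch {
  sealed trait Event
  case class RunJob(body: () => Unit) extends Event
  case object Shutdown extends Event

  // Toy job scheduler: one event-loop thread that runs submitted jobs in order.
  class Scheduler(stopOnError: Boolean) {
    private val events = new LinkedBlockingQueue[Event]()
    @volatile private var failure: Throwable = null

    private val loop = new Thread(new Runnable {
      def run(): Unit = {
        var running = true
        while (running) events.take() match {
          case RunJob(body) =>
            // With stopOnError, a previous failure suppresses all queued jobs.
            if (!(stopOnError && failure != null)) {
              try body() catch { case e: Throwable => failure = e }
            }
          case Shutdown => running = false
        }
      }
    })
    loop.start()

    def submit(body: () => Unit): Unit = events.put(RunJob(body))
    def stop(): Unit = { events.put(Shutdown); loop.join() }
    def error: Option[Throwable] = Option(failure)
  }

  def main(args: Array[String]): Unit = {
    val scheduler = new Scheduler(stopOnError = true)
    scheduler.submit(() => throw new RuntimeException("job 1 failed"))
    scheduler.submit(() => println("job 2 ran")) // never printed with stopOnError
    scheduler.stop()
    println(s"error seen: ${scheduler.error}")
  }
}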