Posted to issues@spark.apache.org by "Brett Randall (JIRA)" <ji...@apache.org> on 2017/11/12 03:52:00 UTC

[jira] [Commented] (SPARK-15685) StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode

    [ https://issues.apache.org/jira/browse/SPARK-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248776#comment-16248776 ] 

Brett Randall commented on SPARK-15685:
---------------------------------------

[~srowen] this seems to be fixed in 2.0.2, but I don't know which commit is responsible.

> StackOverflowError (VirtualMachineError) or NoClassDefFoundError (LinkageError) should not System.exit() in local mode
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15685
>                 URL: https://issues.apache.org/jira/browse/SPARK-15685
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Brett Randall
>
> Spark, when running in local mode, can encounter certain types of {{Error}} thrown by developer code or third-party libraries and respond by calling {{System.exit()}}, potentially killing a long-running JVM/service.  The caller should decide on the exception-handling and whether the error should be deemed fatal.
> *Consider this scenario:*
> * Spark is being used in local master mode within a long-running JVM microservice, e.g. a Jetty instance.
> * A task is run.  The task errors with particular types of unchecked throwables:
> ** a) there is some bad code and/or bad data that exposes a bug involving unterminated recursion, leading to a {{StackOverflowError}} (a minimal sketch of this scenario follows the list), or
> ** b) a rarely-used function is called and, because of a packaging error in the service or a third-party library with missing dependencies, a {{NoClassDefFoundError}} is thrown.
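> A minimal sketch of scenario a), assuming a {{local[2]}} master (the supplied test-case linked at the end of this description is the real reproduction):
> {code}
> // Hedged sketch, not the actual test-case: a task in a local-mode job hits
> // unterminated recursion and throws a StackOverflowError.  Rather than the
> // error propagating to the calling thread, the enclosing JVM exits.
> import org.apache.spark.{SparkConf, SparkContext}
>
> val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("SPARK-15685-repro"))
> sc.parallelize(1 to 1).map { i =>
>   def recurse(n: Int): Int = 1 + recurse(n + 1)  // never terminates
>   recurse(i)
> }.count()  // JVM exits via System.exit() instead of surfacing the error
> {code}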
> *Expected behaviour:* Since we are running in local mode, we might expect an unchecked exception to be thrown and optionally handled by the Spark caller.  In the case of Jetty, a request thread or some other background worker thread might handle the exception or not; the thread might exit or simply log the error.  The caller should decide how the error is handled.
> *Actual behaviour:* {{System.exit()}} is called, the JVM exits and the Jetty microservice is down and must be restarted.
> *Consequence:* Any local code or third-party library might cause Spark to exit the long-running JVM/microservice, so Spark can be a problem in this architecture.  I have seen this now on three separate occasions, leading to service-down bug reports.
> *Analysis:*
> The line of code that seems to be the problem is: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L405
> {code}
> // Don't forcibly exit unless the exception was inherently fatal, to avoid
> // stopping other tasks unnecessarily.
> if (Utils.isFatalError(t)) {
>     SparkUncaughtExceptionHandler.uncaughtException(t)
> }
> {code}
> [Utils.isFatalError()|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1818] first applies the Scala [NonFatal|https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/control/NonFatal.scala#L31] extractor, which treats everything as non-fatal except {{VirtualMachineError}}, {{ThreadDeath}}, {{InterruptedException}}, {{LinkageError}} and {{ControlThrowable}}.  {{Utils.isFatalError()}} additionally treats {{InterruptedException}}, {{NotImplementedError}} and {{ControlThrowable}} as non-fatal.
> That leaves {{Error}}s such as {{StackOverflowError}} (a {{VirtualMachineError}}) and {{NoClassDefFoundError}} (a {{LinkageError}}), which occur in the scenarios above.  For these, {{SparkUncaughtExceptionHandler.uncaughtException()}} proceeds to call {{System.exit()}}.
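> To make the classification concrete, here is a hedged, stand-alone approximation of the logic just described (it re-states the behaviour rather than calling Spark internals):
> {code}
> // Sketch only: mirrors the described exclusions, not Spark's actual source.
> import scala.util.control.{ControlThrowable, NonFatal}
>
> def looksFatalToSpark(t: Throwable): Boolean = t match {
>   case NonFatal(_) | _: InterruptedException | _: NotImplementedError | _: ControlThrowable => false
>   case _ => true
> }
>
> looksFatalToSpark(new RuntimeException("boom"))                     // false - surfaces to the caller
> looksFatalToSpark(new StackOverflowError())                         // true  - VirtualMachineError
> looksFatalToSpark(new NoClassDefFoundError("com.example.Missing"))  // true  - LinkageError
> {code}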
> [Further up|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L77] in {{Executor}} we can see that {{SparkUncaughtExceptionHandler}} is only registered as the default uncaught-exception handler when *not* in local mode:
> {code}
>   if (!isLocal) {
>     // Setup an uncaught exception handler for non-local mode.
>     // Make any thread terminations due to uncaught exceptions kill the entire
>     // executor process to avoid surprising stalls.
>     Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler)
>   }
> {code}
> The same exclusion should be applied to the "fatal" error handling in local mode - Spark cannot afford to shut down the enclosing JVM (e.g. Jetty); the caller should decide how to handle the error.  A sketch of one possible shape of the change follows.
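> This is a hedged sketch only, guarding the fatal-error path with the same {{isLocal}} flag used above; the actual fix may well look different:
> {code}
> // Sketch: don't hand the error to SparkUncaughtExceptionHandler (and hence
> // System.exit()) when running in local mode; let it propagate to the caller.
> if (!isLocal && Utils.isFatalError(t)) {
>   SparkUncaughtExceptionHandler.uncaughtException(t)
> }
> {code}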
> A minimal test-case is supplied.  It installs a logging {{SecurityManager}} to confirm that {{System.exit()}} was called from {{SparkUncaughtExceptionHandler.uncaughtException}} via {{Executor}}.  It also hints at the workaround - install your own {{SecurityManager}} and inspect the current stack in {{checkExit()}} to prevent Spark from exiting the JVM.
> Test-case: https://github.com/javabrett/SPARK-15685 .
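> A hedged sketch of that workaround (names and structure are illustrative; see the test-case repository for the real code):
> {code}
> // Sketch only: veto System.exit() when it originates from Spark's
> // SparkUncaughtExceptionHandler, so the enclosing JVM stays up.
> System.setSecurityManager(new SecurityManager {
>   override def checkExit(status: Int): Unit = {
>     val fromSparkHandler = Thread.currentThread().getStackTrace
>       .exists(_.getClassName.contains("SparkUncaughtExceptionHandler"))
>     if (fromSparkHandler) {
>       throw new SecurityException(s"Blocked System.exit($status) requested by Spark")
>     }
>   }
>   // Allow everything else; a production version would delegate to any
>   // previously-installed SecurityManager instead.
>   override def checkPermission(perm: java.security.Permission): Unit = ()
> })
> {code}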



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org