You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by aarondav <gi...@git.apache.org> on 2014/05/09 22:52:47 UTC

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

GitHub user aarondav opened a pull request:

    https://github.com/apache/spark/pull/715

    [RFC] SPARK-1772 Stop catching Throwable, let Executors die

    The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.
    
    This patch also continues the fight in the neverending war against `case t: Throwable =>`, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs.
    
    It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be.
    [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark throwable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/715.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #715
    
----
commit 1867867a00241ff1bd20d2ac3ac610ed126a9280
Author: Aaron Davidson <aa...@databricks.com>
Date:   2014-05-09T20:28:26Z

    [RFC] SPARK-1772 Stop catching Throwable, let Executors die
    
    The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772),
    in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes
    Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.
    
    This patch also continues the fight in the neverending war against `case t: Throwable =>`,
    by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables
    to make sure any uncaught exceptions are at least printed to the logs.
    
    It also turns out that it is unlikely that the IndestructibleActorSystem actually works,
    given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The
    uncaughtExceptionHandler is not called from the places that we expected it would be.
    [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this
    issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly
    is a much bigger change, outside the scope of this PR.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42716337
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14854/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/715#discussion_r12515148
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
    @@ -259,19 +238,30 @@ private[spark] class Executor(
             }
     
             case t: Throwable => {
    -          val serviceTime = System.currentTimeMillis() - taskStart
    -          val metrics = attemptedTask.flatMap(t => t.metrics)
    -          for (m <- metrics) {
    -            m.executorRunTime = serviceTime
    -            m.jvmGCTime = gcTime - startGCTime
    -          }
    -          val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
    -          execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
    +          // Attempt to exit cleanly by informing the driver of our failure.
    +          // If anything goes wrong (or this was a fatal exception), we will delegate to
    +          // the default uncaught exception handler, which will terminate the Executor.
    +          try {
    +            logError("Exception in task ID " + taskId, t)
    +
    +            val serviceTime = System.currentTimeMillis() - taskStart
    +            val metrics = attemptedTask.flatMap(t => t.metrics)
    +            for (m <- metrics) {
    +              m.executorRunTime = serviceTime
    +              m.jvmGCTime = gcTime - startGCTime
    +            }
    +            val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
    +            execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
     
    -          // TODO: Should we exit the whole executor here? On the one hand, the failed task may
    -          // have left some weird state around depending on when the exception was thrown, but on
    -          // the other hand, maybe we could detect that when future tasks fail and exit then.
    -          logError("Exception in task ID " + taskId, t)
    +            // Don't forcibly exit unless the exception was inherently fatal, to avoid
    +            // stopping other tasks unnecessarily.
    +            if (Utils.isFatalError(t)) {
    +              ExecutorUncaughtExceptionHandler.uncaughtException(t)
    +            }
    +          } catch {
    +            case t2: Throwable =>
    +              ExecutorUncaughtExceptionHandler.uncaughtException(t2)
    --- End diff --
    
    Hmm, good point. I kind of like being explicit over relying on the globally set uncaught exception handler. I could be happy with getting rid of this and replacing it with a comment, though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42797519
  
    > It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing (here).  
    I works but in case of OOMs, the behavior can be very sporadic. The only reason it was needed was in akka 2.0.x days netty was tolerating OOMs and it was thus never the *chance* that Akka got to deal with them. Since netty almost always had them and mostly manage to eat them. Very weird things. Just saying for posterity. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on a diff in the pull request:

    https://github.com/apache/spark/pull/715#discussion_r12514419
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -71,7 +71,7 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
               stopDaemon()
               startDaemon()
               new Socket(daemonHost, daemonPort)
    -        case e: Throwable => throw e
    +        case e: Exception => throw e
    --- End diff --
    
    Do we really need this line ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42797560
  
    Looks pretty good to me, just made one small comment. I think it's good to eliminate these now. I haven't seen many cases where they're super useful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/715#discussion_r12515152
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
    @@ -259,19 +238,30 @@ private[spark] class Executor(
             }
     
             case t: Throwable => {
    -          val serviceTime = System.currentTimeMillis() - taskStart
    -          val metrics = attemptedTask.flatMap(t => t.metrics)
    -          for (m <- metrics) {
    -            m.executorRunTime = serviceTime
    -            m.jvmGCTime = gcTime - startGCTime
    -          }
    -          val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
    -          execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
    +          // Attempt to exit cleanly by informing the driver of our failure.
    +          // If anything goes wrong (or this was a fatal exception), we will delegate to
    +          // the default uncaught exception handler, which will terminate the Executor.
    +          try {
    +            logError("Exception in task ID " + taskId, t)
    +
    +            val serviceTime = System.currentTimeMillis() - taskStart
    +            val metrics = attemptedTask.flatMap(t => t.metrics)
    +            for (m <- metrics) {
    +              m.executorRunTime = serviceTime
    +              m.jvmGCTime = gcTime - startGCTime
    +            }
    +            val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
    +            execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
     
    -          // TODO: Should we exit the whole executor here? On the one hand, the failed task may
    -          // have left some weird state around depending on when the exception was thrown, but on
    -          // the other hand, maybe we could detect that when future tasks fail and exit then.
    -          logError("Exception in task ID " + taskId, t)
    +            // Don't forcibly exit unless the exception was inherently fatal, to avoid
    +            // stopping other tasks unnecessarily.
    +            if (Utils.isFatalError(t)) {
    +              ExecutorUncaughtExceptionHandler.uncaughtException(t)
    +            }
    +          } catch {
    +            case t2: Throwable =>
    +              ExecutorUncaughtExceptionHandler.uncaughtException(t2)
    --- End diff --
    
    Actually just realized we basically already have that comment, just interpreted in a different way :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1772 Stop catching Throwable, let Execut...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42858250
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1772 Stop catching Throwable, let Execut...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/715


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1772 Stop catching Throwable, let Execut...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42858235
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1772 Stop catching Throwable, let Execut...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42858109
  
    Addressed all comments and took "RFC" out of the PR title.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42716336
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1772 Stop catching Throwable, let Execut...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42866864
  
    This passed all tests except for the (bogus) MIMA issue, so I'll merge it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1772 Stop catching Throwable, let Execut...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42863004
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14897/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42712542
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1772 Stop catching Throwable, let Execut...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42863001
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by ScrapCodes <gi...@git.apache.org>.
Github user ScrapCodes commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42799372
  
    Looks good to me too !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42763346
  
    /cc @scrapcodes who worked on the IndestructableActorSystem


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/715#issuecomment-42712525
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [RFC] SPARK-1772 Stop catching Throwable, let ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/715#discussion_r12514626
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
    @@ -259,19 +238,30 @@ private[spark] class Executor(
             }
     
             case t: Throwable => {
    -          val serviceTime = System.currentTimeMillis() - taskStart
    -          val metrics = attemptedTask.flatMap(t => t.metrics)
    -          for (m <- metrics) {
    -            m.executorRunTime = serviceTime
    -            m.jvmGCTime = gcTime - startGCTime
    -          }
    -          val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
    -          execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
    +          // Attempt to exit cleanly by informing the driver of our failure.
    +          // If anything goes wrong (or this was a fatal exception), we will delegate to
    +          // the default uncaught exception handler, which will terminate the Executor.
    +          try {
    +            logError("Exception in task ID " + taskId, t)
    +
    +            val serviceTime = System.currentTimeMillis() - taskStart
    +            val metrics = attemptedTask.flatMap(t => t.metrics)
    +            for (m <- metrics) {
    +              m.executorRunTime = serviceTime
    +              m.jvmGCTime = gcTime - startGCTime
    +            }
    +            val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
    +            execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
     
    -          // TODO: Should we exit the whole executor here? On the one hand, the failed task may
    -          // have left some weird state around depending on when the exception was thrown, but on
    -          // the other hand, maybe we could detect that when future tasks fail and exit then.
    -          logError("Exception in task ID " + taskId, t)
    +            // Don't forcibly exit unless the exception was inherently fatal, to avoid
    +            // stopping other tasks unnecessarily.
    +            if (Utils.isFatalError(t)) {
    +              ExecutorUncaughtExceptionHandler.uncaughtException(t)
    +            }
    +          } catch {
    +            case t2: Throwable =>
    +              ExecutorUncaughtExceptionHandler.uncaughtException(t2)
    --- End diff --
    
    Can't the uncaught exception handler for this thread be set to deal with this, instead of another catch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---