Posted to issues@spark.apache.org by "Kay Ousterhout (JIRA)" <ji...@apache.org> on 2015/10/26 00:45:27 UTC

[jira] [Created] (SPARK-11306) Executor JVM loss can lead to a hang in Standalone mode

Kay Ousterhout created SPARK-11306:
--------------------------------------

             Summary: Executor JVM loss can lead to a hang in Standalone mode
                 Key: SPARK-11306
                 URL: https://issues.apache.org/jira/browse/SPARK-11306
             Project: Spark
          Issue Type: Bug
            Reporter: Kay Ousterhout
            Assignee: Kay Ousterhout


This commit (https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0) introduced a bug in Standalone mode: when a task fails by crashing the executor JVM, the executor loss is classified as a "normal failure" (i.e., one considered unrelated to the task), so it isn't counted against the task's maximum number of failures: https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138. As a result, a task that crashes the JVM is re-launched indefinitely, and the job hangs.
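
To make the failure-counting logic concrete, here is a minimal sketch (hypothetical names such as FailureCounting and handleExecutorLost, loosely modeled on the ExecutorExited / exitCausedByApp pattern; this is not the actual Spark source) of how a loss reason marked as unrelated to the app skips the per-task failure counter:

{code}
// Simplified sketch with hypothetical names; not Spark source.
sealed trait ExecutorLossReason
case class ExecutorExited(exitCode: Int, exitCausedByApp: Boolean) extends ExecutorLossReason

object FailureCounting {
  val maxTaskFailures = 4   // analogous to spark.task.maxFailures
  var numFailures = 0

  def handleExecutorLost(reason: ExecutorLossReason): Unit = reason match {
    case ExecutorExited(_, exitCausedByApp) =>
      if (exitCausedByApp) {
        // The failure is charged to the task; enough of these aborts the stage.
        numFailures += 1
        if (numFailures >= maxTaskFailures) println("Aborting: task failed too many times")
        else println(s"Re-launching task (failure $numFailures of $maxTaskFailures)")
      } else {
        // The buggy branch: a JVM crash classified as a "normal" (app-unrelated)
        // exit never increments numFailures, so the task is re-launched forever.
        println("Re-launching task without counting the failure")
      }
  }
}

object Demo extends App {
  // A task that keeps crashing the JVM (exit code 134 here, as an example)
  // but is misclassified as exitCausedByApp = false never hits the cap:
  (1 to 6).foreach(_ => FailureCounting.handleExecutorLost(ExecutorExited(134, exitCausedByApp = false)))
}
{code}

In the real scheduler the equivalent of the second branch is taken on every crash, so the loop never terminates.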

Unfortunately, this issue is difficult to reproduce because of a race condition: multiple code paths handle executor losses, and in the setup I'm using, Akka's notification that the executor was lost always reaches the TaskSchedulerImpl first, so the task eventually gets killed anyway (see my recent email to the dev list).
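
To illustrate why the race masks the bug, here is a toy model (hypothetical names like LossRace and report; not Spark code): two notification paths report the same executor loss, and whichever arrives first determines whether the failure is counted. In my environment the Akka path always wins, so the buggy branch is never exercised:

{code}
import java.util.concurrent.atomic.AtomicBoolean

// Toy model with hypothetical names; not Spark source.
object LossRace extends App {
  val handled = new AtomicBoolean(false)

  // Only the first notification to arrive is acted on.
  def report(path: String, countsAgainstTask: Boolean): Unit = {
    if (handled.compareAndSet(false, true)) {
      println(s"$path won the race; failure counted = $countsAgainstTask")
    }
  }

  // Path 1: Akka's notification that the executor was lost, which ends up
  // killing the task (the behavior I always see in my setup).
  val akkaPath = new Thread(() => report("akka-notification", countsAgainstTask = true))
  // Path 2: the Standalone backend's handling, which hits the buggy branch.
  val backendPath = new Thread(() => report("standalone-backend", countsAgainstTask = false))

  akkaPath.start(); backendPath.start()
  akkaPath.join(); backendPath.join()
  // Whenever the backend path wins, the failure is never counted and the
  // task keeps getting re-launched: the hang described above.
}
{code}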



