Posted to issues@spark.apache.org by "Sam Tran (JIRA)" <ji...@apache.org> on 2019/04/02 15:32:02 UTC

[jira] [Created] (SPARK-27347) [MESOS] Fix supervised driver retry logic when agent crashes/restarts

Sam Tran created SPARK-27347:
--------------------------------

             Summary: [MESOS] Fix supervised driver retry logic when agent crashes/restarts
                 Key: SPARK-27347
                 URL: https://issues.apache.org/jira/browse/SPARK-27347
             Project: Spark
          Issue Type: Bug
          Components: Mesos
    Affects Versions: 2.4.0, 2.3.2, 2.2.1
            Reporter: Sam Tran


Ran into scenarios where {{--supervise}} Spark jobs were retried multiple times after an agent crashed, came back, and re-registered, even when those jobs had already been relaunched on a different agent.

That is:
 * supervised driver is running on agent1
 * agent1 crashes
 * driver is relaunched on another agent as {{<task-id>-retry-1}}
 * agent1 comes back online and re-registers with the scheduler
 * Spark relaunches the same job as {{<task-id>-retry-2}}
 * now two copies of the job are running simultaneously, and the first retry task is effectively orphaned within ZooKeeper

This happens because when an agent comes back and re-registers, it sends a {{TASK_FAILED}} status update for its old driver task. The previous logic would indiscriminately remove the {{submissionId}} from ZooKeeper's {{launchedDrivers}} node and add it to the {{retryList}} node.

Then, when a new offer came in, the scheduler would launch yet another {{-retry-}} task even though a retry of that driver was already running.
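A minimal sketch of the kind of guard that would avoid the duplicate relaunch: before scheduling a retry, compare the task ID carried by the status update against the task currently tracked for that submission, and ignore stale updates for old task incarnations. The names below ({{DriverInfo}}, {{onTaskFailed}}, the in-memory stand-ins for the ZooKeeper nodes) are hypothetical and simplified for illustration; they are not Spark's actual {{MesosClusterScheduler}} internals.

{code:scala}
// Simplified, self-contained sketch of the guarded retry logic.
// The mutable collections stand in for the ZooKeeper-backed
// launchedDrivers and retryList nodes named in this report.
object SupervisedRetrySketch {

  case class DriverInfo(submissionId: String, taskId: String, retryCount: Int)

  val launchedDrivers = scala.collection.mutable.Map[String, DriverInfo]()
  val retryList = scala.collection.mutable.ListBuffer[DriverInfo]()

  /** Handle a TASK_FAILED status update for a supervised driver task. */
  def onTaskFailed(submissionId: String, failedTaskId: String): Unit = {
    launchedDrivers.get(submissionId) match {
      case Some(info) if info.taskId == failedTaskId =>
        // The failure is for the task we currently consider live:
        // move it to the retry list as before.
        launchedDrivers.remove(submissionId)
        retryList += info.copy(retryCount = info.retryCount + 1)
      case Some(info) =>
        // Stale update from a re-registered agent about an old task
        // incarnation. The driver was already relaunched elsewhere,
        // so ignore it instead of scheduling another retry.
        println(s"Ignoring stale TASK_FAILED for $failedTaskId; " +
          s"current task is ${info.taskId}")
      case None =>
        // Unknown submission; nothing to do.
    }
  }

  def main(args: Array[String]): Unit = {
    // Driver already relaunched as retry-1 after agent1 crashed.
    launchedDrivers("driver-001") = DriverInfo("driver-001", "task-abc-retry-1", 1)
    // agent1 comes back and reports its *old* task as failed.
    onTaskFailed("driver-001", "task-abc")
    assert(launchedDrivers.contains("driver-001")) // retry-1 still tracked
    assert(retryList.isEmpty)                      // no spurious retry-2
  }
}
{code}

With a check like this, the {{TASK_FAILED}} that agent1 sends for its pre-crash task no longer evicts the live {{-retry-1}} task, so no {{-retry-2}} is launched.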


