Posted to issues@spark.apache.org by "Sam Tran (JIRA)" <ji...@apache.org> on 2019/04/02 15:46:02 UTC

[jira] [Updated] (SPARK-27347) [MESOS] Fix supervised driver retry logic when agent crashes/restarts

     [ https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sam Tran updated SPARK-27347:
-----------------------------
    Description: 
Ran into scenarios where {{--supervised}} Spark jobs were retried multiple times when an agent would crash, come back, and re-register, even when those jobs had already been relaunched on a different agent.

That is:
 * supervised driver is running on agent1
 * agent1 crashes
 * driver is relaunched on another agent as {{<task-id>-retry-1}}
 * agent1 comes back online and re-registers with the scheduler
 * Spark relaunches the same job as {{<task-id>-retry-2}}
 * now there are two jobs running simultaneously, and the first retry job is effectively orphaned in ZooKeeper

This is because when an agent comes back and re-registers, it sends a {{TASK_FAILED}} status update for its old driver task. The previous logic would indiscriminately remove the {{submissionId}} from ZooKeeper's {{launchedDrivers}} node and add it to the {{retryList}} node.

Then, when a new offer came in, it would relaunch another {{-retry-}} task even though one was already running.
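
To make the missing check concrete, below is a minimal, self-contained Scala sketch of the kind of guard that could prevent the duplicate retry: a terminal status update only moves a submission to the retry list if its taskId matches the attempt currently tracked for that submission, so a stale update from a re-registering agent is ignored. This is illustrative only, not the actual {{MesosClusterScheduler}} code; names like {{RetryGuardSketch}}, {{launch}}, and {{onTerminalUpdate}} are made up for the sketch.
{code:scala}
// Minimal, self-contained sketch of the guard described above (illustrative only,
// not the actual MesosClusterScheduler implementation).
import scala.collection.mutable

object RetryGuardSketch {
  // submissionId -> taskId of the attempt we currently consider live
  private val launchedDrivers = mutable.Map.empty[String, String]
  private val retryList = mutable.ListBuffer.empty[String]

  // "driver-...-0001-retry-1" -> "driver-...-0001"
  private def submissionIdOf(taskId: String): String = taskId.split("-retry-").head

  def launch(taskId: String): Unit = {
    val submissionId = submissionIdOf(taskId)
    retryList -= submissionId
    launchedDrivers(submissionId) = taskId
  }

  // Called for terminal updates such as TASK_FAILED / TASK_LOST.
  def onTerminalUpdate(taskId: String): Unit = {
    val submissionId = submissionIdOf(taskId)
    launchedDrivers.get(submissionId) match {
      case Some(current) if current == taskId =>
        // The attempt we are tracking really died: queue a retry.
        launchedDrivers.remove(submissionId)
        retryList += submissionId
      case _ =>
        // Stale update from an old attempt (e.g. agent1 re-registering after the
        // driver was already relaunched elsewhere): do not queue another retry.
        println(s"Ignoring stale terminal update for $taskId")
    }
  }

  def main(args: Array[String]): Unit = {
    launch("driver-20190115192138-0001")            // original attempt on agent1
    onTerminalUpdate("driver-20190115192138-0001")  // agent1 lost -> queued for retry
    launch("driver-20190115192138-0001-retry-1")    // retry launched on another agent
    onTerminalUpdate("driver-20190115192138-0001")  // agent1 re-registers -> ignored
    println(retryList)                              // empty: no duplicate "-retry-2"
  }
}
{code}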

A sample log looks something like this:
{code:java}
19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
... [offers] ...
19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
...
19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
...
19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed out' reason=REASON_SLAVE_REMOVED
...
19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-1"
...
19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
...
19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent re-reregistered'
...
19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with driver-20190115192138-0001 in status update
...
19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-2"
...
19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''
{code}

  was:
Ran into scenarios where {{--supervised}} Spark jobs were retried multiple times when an agent would crash, come back, and re-register, even when those jobs had already been relaunched on a different agent.

That is:
 * supervised driver is running on agent1
 * agent1 crashes
 * driver is relaunched on another agent as {{<task-id>-retry-1}}
 * agent1 comes back online and re-registers with the scheduler
 * Spark relaunches the same job as {{<task-id>-retry-2}}
 * now there are two jobs running simultaneously, and the first retry job is effectively orphaned in ZooKeeper

This is because when an agent comes back and re-registers, it sends a {{TASK_FAILED}} status update for its old driver task. The previous logic would indiscriminately remove the {{submissionId}} from ZooKeeper's {{launchedDrivers}} node and add it to the {{retryList}} node.

Then, when a new offer came in, it would relaunch another {{-retry-}} task even though one was already running.


> [MESOS] Fix supervised driver retry logic when agent crashes/restarts
> ---------------------------------------------------------------------
>
>                 Key: SPARK-27347
>                 URL: https://issues.apache.org/jira/browse/SPARK-27347
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 2.2.1, 2.3.2, 2.4.0
>            Reporter: Sam Tran
>            Priority: Major
>              Labels: mesos
>
> Ran into scenarios where {{--supervised}} Spark jobs were retried multiple times when an agent would crash, come back, and re-register, even when those jobs had already been relaunched on a different agent.
> That is:
>  * supervised driver is running on agent1
>  * agent1 crashes
>  * driver is relaunched on another agent as {{<task-id>-retry-1}}
>  * agent1 comes back online and re-registers with the scheduler
>  * Spark relaunches the same job as {{<task-id>-retry-2}}
>  * now there are two jobs running simultaneously, and the first retry job is effectively orphaned in ZooKeeper
> This is because when an agent comes back and re-registers, it sends a {{TASK_FAILED}} status update for its old driver task. The previous logic would indiscriminately remove the {{submissionId}} from ZooKeeper's {{launchedDrivers}} node and add it to the {{retryList}} node.
> Then, when a new offer came in, it would relaunch another {{-retry-}} task even though one was already running.
> A sample log looks something like this:
> {code:java}
> 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
> ... [offers] ...
> 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
> ...
> 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
> 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed out' reason=REASON_SLAVE_REMOVED
> ...
> 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-1"
> ...
> 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
> 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent re-reregistered'
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
> 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with driver-20190115192138-0001 in status update
> ...
> 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-2"
> ...
> 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
> 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org