Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/04/02 15:39:42 UTC

[GitHub] [spark] samvantran opened a new pull request #24276: [SPARK-27347][MESOS] Fix supervised driver retry logic for outdated tasks

URL: https://github.com/apache/spark/pull/24276
 
 
   ## What changes were proposed in this pull request?
   
    This patch fixes a bug where `--supervised` Spark jobs would retry multiple times whenever an agent crashed, came back, and re-registered, even when those jobs had already been relaunched on a different agent.
   
   That is: 
   ```
   - supervised driver is running on agent1
   - agent1 crashes
   - driver is relaunched on another agent as `<task-id>-retry-1`
   - agent1 comes back online and re-registers with scheduler
    - Spark relaunches the same job as `<task-id>-retry-2`
   - now there are two jobs running simultaneously
   ```
   
    This is because when an agent came back and re-registered, it would send a `TASK_FAILED` status update for its old driver task. The previous logic would indiscriminately remove the `submissionId` from ZooKeeper's `launchedDrivers` node and add it to the `retryList` node. Then, when a new offer came in, the scheduler would launch another `-retry-` task even though one was already running.
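    One way to guard against this is to check, before scheduling a retry, that the terminal status update actually refers to the attempt currently tracked for that submission. A minimal sketch of that idea (the `RetryGuard` object and its method names are illustrative assumptions, not the actual patch):
    
    ```scala
    // Hedged sketch: Spark appends "-retry-N" to relaunched task IDs, so an
    // update that refers to an older attempt can be detected by comparing
    // retry suffixes. RetryGuard and its methods are hypothetical names.
    object RetryGuard {
      private val RetryMarker = "-retry-"
    
      // Parse the retry attempt out of a task ID; a bare ID is attempt 0.
      def retryAttempt(taskId: String): Int = {
        val idx = taskId.lastIndexOf(RetryMarker)
        if (idx == -1) 0 else taskId.substring(idx + RetryMarker.length).toInt
      }
    
      // An incoming update is outdated if it refers to an attempt older than
      // the one the scheduler currently tracks for the same submission.
      def isOutdated(updateTaskId: String, trackedTaskId: String): Boolean =
        retryAttempt(updateTaskId) < retryAttempt(trackedTaskId)
    }
    ```
    
    With such a guard, the late `TASK_FAILED` for `driver-20190115192138-0001` in the logs below would be ignored, since `driver-20190115192138-0001-retry-1` is the attempt being tracked.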
   
    Example log snippets are included at the bottom of this description.
   
   ## How was this patch tested?
   
    - Added a unit test to simulate the behavior described above (a hedged sketch of such a test follows the steps below)
    - Tested manually on a DC/OS cluster by:
     ```
     - launching a --supervised spark job
     - dcos node ssh <to the agent with the running spark-driver>
     - systemctl stop dcos-mesos-slave
     - docker kill <driver-container-id>
     - [ wait until spark job is relaunched ]
     - systemctl start dcos-mesos-slave
     - [ observe spark driver is not relaunched as `-retry-2` ]
     ```
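    A hedged sketch of what the unit test for this scenario could look like (ScalaTest style, reusing the hypothetical `RetryGuard` helper from the sketch above; not the PR's actual test code):
    
    ```scala
    import org.scalatest.funsuite.AnyFunSuite
    
    class OutdatedTaskUpdateSuite extends AnyFunSuite {
      test("terminal update for an old attempt does not trigger another retry") {
        val original = "driver-20190115192138-0001"
        val retry1   = s"$original-retry-1"
    
        // While the original attempt is the one tracked, its updates are current.
        assert(!RetryGuard.isOutdated(original, original))
    
        // agent1 re-registers and reports TASK_FAILED for the old task ID while
        // retry-1 is already running: the update should be treated as outdated.
        assert(RetryGuard.isOutdated(original, retry1))
    
        // Updates for the currently tracked attempt are never outdated.
        assert(!RetryGuard.isOutdated(retry1, retry1))
      }
    }
    ```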
   
    Log snippets are included below. Notice that the `-retry-1` task is already running when the status update for the old task arrives:
   ```
   19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
   ... [offers] ...
   19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
   ...
   19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
   19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
   ...
   19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed out' reason=REASON_SLAVE_REMOVED
   ...
   19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-1"
   ...
   19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
   19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
   ...
   19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent re-reregistered'
   ...
   19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
   19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with driver-20190115192138-0001 in status update
   ...
   19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-2"
   ...
   19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
   19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''
   ```
   
