Posted to issues@yunikorn.apache.org by "Peter Bacsko (Jira)" <ji...@apache.org> on 2022/06/02 15:49:00 UTC
[jira] [Comment Edited] (YUNIKORN-1217) Ensure that Spark driver pod is processed before executor pods during recovery
[ https://issues.apache.org/jira/browse/YUNIKORN-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545525#comment-17545525 ]
Peter Bacsko edited comment on YUNIKORN-1217 at 6/2/22 3:48 PM:
----------------------------------------------------------------
During today's sync-up, we agreed with [~wilfreds] that the simplest approach is to sort the retrieved pods based on {{CreationTime}}. Since the driver is created before its executors, processing the pods in ascending creation order handles the driver first.
We just have to sort the pod slice in {{AppManagementService.recoverApps()}}:
{noformat}
pods, err := m.ListPods()
if err != nil {
	log.Logger().Error("failed to list apps", zap.Error(err))
	return recoveringApps, err
}
// put new sort code here
for _, pod := range pods {
	app := svc.podEventHandler.HandleEvent(general.AddPod, general.Recovery, pod)
	recoveringApps[app.GetApplicationID()] = app
}
{noformat}
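For illustration, the sort step could look roughly like the sketch below. It is a standalone example, not the shim code: {{podInfo}}, {{sortByCreationTime}} and {{samplePods}} are hypothetical names, and the struct only stands in for the real pod object, where the comparison would presumably be done on the pod's creation timestamp instead.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// podInfo is a hypothetical stand-in for the real pod type; only the
// fields needed for ordering are modelled here.
type podInfo struct {
	Name         string
	CreationTime time.Time
}

// sortByCreationTime orders pods oldest first. The sort is stable, so pods
// with identical timestamps keep their original relative order.
func sortByCreationTime(pods []podInfo) {
	sort.SliceStable(pods, func(i, j int) bool {
		return pods[i].CreationTime.Before(pods[j].CreationTime)
	})
}

// samplePods returns executors listed ahead of their driver, as an
// unordered list call might return them.
func samplePods() []podInfo {
	base := time.Date(2022, time.June, 2, 15, 0, 0, 0, time.UTC)
	return []podInfo{
		{Name: "spark-exec-2", CreationTime: base.Add(40 * time.Second)},
		{Name: "spark-exec-1", CreationTime: base.Add(30 * time.Second)},
		{Name: "spark-driver", CreationTime: base},
	}
}

func main() {
	pods := samplePods()
	sortByCreationTime(pods)
	for _, p := range pods {
		fmt.Println(p.Name) // driver first, then executors by age
	}
}
```

A stable sort is used so that pods sharing the same timestamp are still handled in the order they were listed.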
> Ensure that Spark driver pod is processed before executor pods during recovery
> ------------------------------------------------------------------------------
>
> Key: YUNIKORN-1217
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1217
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
>
> When running a Spark workload with gang scheduling, the driver and executor pods have different annotations.
> It is critical that we process the driver first, because it has the task group definitions. Based on [https://yunikorn.apache.org/docs/next/user_guide/gang_scheduling/], the executor only needs {{yunikorn.apache.org/taskGroupName}}.
> So when we add the pods in the recovery code path, we have to start with the driver.