Posted to issues@yunikorn.apache.org by "Peter Bacsko (Jira)" <ji...@apache.org> on 2022/06/02 15:49:00 UTC

[jira] [Comment Edited] (YUNIKORN-1217) Ensure that Spark driver pod is processed before executor pods during recovery

    [ https://issues.apache.org/jira/browse/YUNIKORN-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545525#comment-17545525 ] 

Peter Bacsko edited comment on YUNIKORN-1217 at 6/2/22 3:48 PM:
----------------------------------------------------------------

During today's sync-up, we agreed with [~wilfreds] that the simplest approach is to sort the retrieved pods based on {{CreationTime}}. Since drivers are created earlier than executors, this will always work.

We just have to sort the pod slice in {{AppManagementService.recoverApps()}}:
{noformat}
pods, err := m.ListPods()
if err != nil {
	log.Logger().Error("failed to list apps", zap.Error(err))
	return recoveringApps, err
}

// put new sort code here

for _, pod := range pods {
	app := svc.podEventHandler.HandleEvent(general.AddPod, general.Recovery, pod)
	recoveringApps[app.GetApplicationID()] = app
}
{noformat}
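
For illustration, the sort step could look roughly like the sketch below. It is not the actual patch: the {{pod}} struct is a hypothetical stand-in for the real Kubernetes pod type, and {{sortPodsByCreationTime}} and {{examplePods}} are made-up names. A stable sort is used so that pods with equal timestamps keep the order in which they were listed.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// pod is a hypothetical stand-in for the real Kubernetes pod type;
// only the fields needed for the ordering are modelled here.
type pod struct {
	Name              string
	CreationTimestamp time.Time
}

// sortPodsByCreationTime orders the slice in place, oldest pod first.
// The stable sort keeps the listing order for pods with equal timestamps.
func sortPodsByCreationTime(pods []*pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		return pods[i].CreationTimestamp.Before(pods[j].CreationTimestamp)
	})
}

// examplePods returns a made-up listing in which the executors
// are returned before the driver that created them.
func examplePods() []*pod {
	base := time.Date(2022, time.June, 2, 15, 0, 0, 0, time.UTC)
	return []*pod{
		{Name: "spark-exec-2", CreationTimestamp: base.Add(45 * time.Second)},
		{Name: "spark-exec-1", CreationTimestamp: base.Add(30 * time.Second)},
		{Name: "spark-driver", CreationTimestamp: base},
	}
}

func main() {
	pods := examplePods()
	sortPodsByCreationTime(pods)
	for _, p := range pods {
		fmt.Println(p.Name) // driver first, then the executors
	}
}
```

In {{recoverApps()}} the equivalent call would sit at the "put new sort code here" marker, before the loop that hands each pod to the event handler.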


 



> Ensure that Spark driver pod is processed before executor pods during recovery
> ------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-1217
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1217
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>
> When running a Spark workload with gang scheduling, the driver and executor pods have different annotations.
> It is critical that we process the driver first, because it has the task group definitions. Based on [https://yunikorn.apache.org/docs/next/user_guide/gang_scheduling/], the executor only needs {{yunikorn.apache.org/taskGroupName}}.
> So when we add the pods in the recovery code path, we have to start with the driver.


