Posted to issues@yunikorn.apache.org by "Wilfred Spiegelenburg (Jira)" <ji...@apache.org> on 2023/05/02 07:11:00 UTC

[jira] [Comment Edited] (YUNIKORN-1712) Placeholder allocations are being removed twice

    [ https://issues.apache.org/jira/browse/YUNIKORN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718458#comment-17718458 ] 

Wilfred Spiegelenburg edited comment on YUNIKORN-1712 at 5/2/23 7:10 AM:
-------------------------------------------------------------------------

First, to make sure the IDs are all documented, this is the driver placeholder. The allocationKey is the same as the pod UID; the uuid is the allocation ID on the core side:
{code:java}
"appID": "spark-000000031tn2lgv2gar",
"allocationKey": "98e2100c-aebb-4a2a-82ca-1206dde51322",
"uuid": "20a4cf77-7095-4635-b9e9-43a7564385c4",
"allocatedResource": "map[memory:2147483648 vcore:1200]",
"placeholder": true,
"targetNode": "ip-10-0-112-148.eu-central-1.compute.internal" 
{code}
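As a rough illustration of how those three identifiers relate, here is a minimal sketch with made-up types; it is not the actual YuniKorn shim or core structure:
{code:go}
// Illustrative only: the three identifiers involved in a gang placeholder,
// modelled as a simple struct. Not the real YuniKorn data structures.
package main

import "fmt"

type placeholderAlloc struct {
	appID         string // application the placeholder belongs to
	allocationKey string // equals the placeholder pod UID on the shim side
	uuid          string // allocation ID assigned by the core scheduler
}

func main() {
	ph := placeholderAlloc{
		appID:         "spark-000000031tn2lgv2gar",
		allocationKey: "98e2100c-aebb-4a2a-82ca-1206dde51322", // pod UID
		uuid:          "20a4cf77-7095-4635-b9e9-43a7564385c4", // core allocation ID
	}
	fmt.Printf("placeholder %s backs pod %s for app %s\n", ph.uuid, ph.allocationKey, ph.appID)
}
{code}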
This is the replacement of the placeholder pod with the real pod on the core side:
{code:java}
2023-04-25T06:25:40.716Z INFO scheduler/partition.go:827 scheduler replace placeholder processed {"appID": "spark-000000031tn2lgv2gar", "allocationKey": "19d0ed5c-8cbe-4115-9971-a4adef5ed38c", "uuid": "2b68c8e8-d578-41b9-9ed1-32daa99773d6", "placeholder released uuid": "20a4cf77-7095-4635-b9e9-43a7564385c4"}{code}
The real pod has UID {{19d0ed5c-8cbe-4115-9971-a4adef5ed38c}}.

That change triggers the release of the pod on the shim side, and 6 seconds later the shim reports that the pod status has changed to Succeeded:
{code:java}
2023-04-25T06:25:40.716Z INFO cache/application.go:619 try to release pod from application {"appID": "spark-000000031tn2lgv2gar", "allocationUUID": "20a4cf77-7095-4635-b9e9-43a7564385c4", "terminationType": "PLACEHOLDER_REPLACED"}
2023-04-25T06:25:46.279Z INFO general/general.go:179 task completes {"appType": "general", "namespace": "sandbox", "podName": "tg-spark-driver-app-spark-000000031tn2lgv2gar-0", "podUID": "98e2100c-aebb-4a2a-82ca-1206dde51322", "podStatus": "Succeeded"}
2023-04-25T06:25:46.279Z INFO cache/task.go:543 releasing allocations {"numOfAsksToRelease": 0, "numOfAllocationsToRelease": 1}
2023-04-25T06:25:46.279Z INFO cache/task_state.go:372 object transition {"object": {}, "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
{code}
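Sketched in simplified form (the names task, onPodSucceeded and releaseAllocation are made up for clarity; this is not the real yunikorn-k8shim API), the shim-side sequence is: the pod status update arrives, the task completes, and one allocation release is sent to the core:
{code:go}
// Illustrative sketch of the shim-side flow above; names are invented.
package main

import "fmt"

type task struct {
	podUID  string
	allocID string
	state   string // "Bound", "Completed", ...
}

// releaseAllocation stands in for the update sent to the core that releases
// the task's allocation.
func (t *task) releaseAllocation() {
	fmt.Printf("releasing allocation %s for pod %s\n", t.allocID, t.podUID)
}

// onPodSucceeded mirrors the "task completes" step: release the allocation
// and move the task from Bound to Completed.
func (t *task) onPodSucceeded() {
	t.releaseAllocation()
	t.state = "Completed"
	fmt.Printf("task for pod %s transitioned Bound -> Completed\n", t.podUID)
}

func main() {
	t := &task{
		podUID:  "98e2100c-aebb-4a2a-82ca-1206dde51322",
		allocID: "20a4cf77-7095-4635-b9e9-43a7564385c4",
		state:   "Bound",
	}
	t.onPodSucceeded()
}
{code}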
So far everything is working as expected. The pod and task should now be in the completed state, and nothing should change anymore for this task. The status change triggers an update to the core, which can now proceed with finalising the replacement on the core side:
{code:java}
2023-04-25T06:25:46.279Z INFO scheduler/partition.go:1233 replacing placeholder allocation {"appID": "spark-000000031tn2lgv2gar", "allocationId": "20a4cf77-7095-4635-b9e9-43a7564385c4"}
2023-04-25T06:25:46.279Z INFO objects/application_state.go:132 Application state transition {"appID": "spark-000000031tn2lgv2gar", "source": "Starting", "destination": "Completing", "event": "completeApplication"}
2023-04-25T06:25:46.279Z WARN scheduler/partition.go:1272 replacing placeholder: placeholder is larger than real allocation {"allocationID": "2b68c8e8-d578-41b9-9ed1-32daa99773d6", "requested resource": "map[memory:1712324608 vcore:1200]", "placeholderID": "20a4cf77-7095-4635-b9e9-43a7564385c4", "placeholder resource": "map[memory:2147483648 vcore:1200]"}
2023-04-25T06:25:46.279Z INFO scheduler/partition.go:1286 replacing placeholder allocation on node {"nodeID": "ip-10-0-112-148.eu-central-1.compute.internal", "allocationId": "20a4cf77-7095-4635-b9e9-43a7564385c4", "allocation nodeID": "ip-10-0-112-148.eu-central-1.compute.internal"}
{code}
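In simplified terms, the finalisation amounts to swapping the placeholder's resources for the real allocation's on the node. A minimal sketch with stand-in types (not the real core Node/Allocation objects or resources.Resource maps):
{code:go}
// Illustrative sketch of the node-side swap described above; the types and
// field names are simplified stand-ins, not the yunikorn-core code.
package main

import "fmt"

type allocation struct {
	uuid string
	mem  int64
}

type node struct {
	nodeID       string
	allocations  map[string]*allocation
	allocatedMem int64
}

// replacePlaceholder removes the placeholder allocation and adds the real
// one, adjusting the node's allocated resources exactly once.
func (n *node) replacePlaceholder(placeholderID string, real *allocation) {
	ph, ok := n.allocations[placeholderID]
	if !ok {
		return // placeholder already gone: do not touch the accounting again
	}
	delete(n.allocations, placeholderID)
	n.allocatedMem -= ph.mem
	n.allocations[real.uuid] = real
	n.allocatedMem += real.mem
}

func main() {
	n := &node{
		nodeID: "ip-10-0-112-148.eu-central-1.compute.internal",
		allocations: map[string]*allocation{
			"20a4cf77-7095-4635-b9e9-43a7564385c4": {uuid: "20a4cf77-7095-4635-b9e9-43a7564385c4", mem: 2147483648},
		},
		allocatedMem: 2147483648,
	}
	real := &allocation{uuid: "2b68c8e8-d578-41b9-9ed1-32daa99773d6", mem: 1712324608}
	n.replacePlaceholder("20a4cf77-7095-4635-b9e9-43a7564385c4", real)
	fmt.Println("allocated memory on node:", n.allocatedMem) // 1712324608
}
{code}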
Again, all good. The real allocation, {{19d0ed5c-8cbe-4115-9971-a4adef5ed38c}}, is added to the node and that information is passed back to the shim, which triggers the bind:
{code:java}
2023-04-25T06:25:46.279Z INFO cache/task_state.go:372 object transition {"object": {}, "source": "Scheduling", "destination": "Allocated", "event": "TaskAllocated"}
2023-04-25T06:25:46.279Z INFO cache/context.go:468 Binding Pod Volumes skipped: all volumes already bound {"podName": "spark-000000031tn2lgv2gar-driver"}
2023-04-25T06:25:46.279Z INFO client/kubeclient.go:112 bind pod to node {"podName": "spark-000000031tn2lgv2gar-driver", "podUID": "19d0ed5c-8cbe-4115-9971-a4adef5ed38c", "nodeID": "ip-10-0-112-148.eu-central-1.compute.internal"}
2023-04-25T06:25:46.291Z INFO cache/task.go:404 successfully bound pod {"podName": "spark-000000031tn2lgv2gar-driver"}
2023-04-25T06:25:46.291Z INFO cache/task_state.go:372 object transition {"object": {}, "source": "Allocated", "destination": "Bound", "event": "TaskBound"}
{code}
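For context, a scheduler binds a pod to a node through the Kubernetes binding subresource; a minimal client-go sketch of that step (not the actual yunikorn-k8shim kubeclient code) looks roughly like this:
{code:go}
// Minimal client-go sketch of binding a pod to a node, illustrating the
// "bind pod to node" step above. Not the yunikorn-k8shim implementation.
package main

import (
	"context"
	"log"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func bindPod(client kubernetes.Interface, namespace, podName, podUID, nodeID string) error {
	binding := &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: namespace,
			Name:      podName,
			UID:       types.UID(podUID),
		},
		Target: v1.ObjectReference{Kind: "Node", Name: nodeID},
	}
	// Bind is a subresource call: POST /api/v1/namespaces/{ns}/pods/{name}/binding
	return client.CoreV1().Pods(namespace).Bind(context.TODO(), binding, metav1.CreateOptions{})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := bindPod(client, "sandbox", "spark-000000031tn2lgv2gar-driver",
		"19d0ed5c-8cbe-4115-9971-a4adef5ed38c",
		"ip-10-0-112-148.eu-central-1.compute.internal"); err != nil {
		log.Fatal(err)
	}
}
{code}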
 

This is still as expected. We are now in the correct state: the core has the real allocation on the node and no longer knows about the placeholder.

Then the second release is triggered, based on a delete of the pod rather than a status update:
{code:java}
2023-04-25T06:25:46.299Z INFO general/general.go:213 delete pod {"appType": "general", "namespace": "sandbox", "podName": "tg-spark-driver-app-spark-000000031tn2lgv2gar-0", "podUID": "98e2100c-aebb-4a2a-82ca-1206dde51322"}
2023-04-25T06:25:46.299Z INFO cache/task.go:543 releasing allocations {"numOfAsksToRelease": 0, "numOfAllocationsToRelease": 1}
2023-04-25T06:25:46.299Z INFO scheduler/partition.go:1233 replacing placeholder allocation {"appID": "spark-000000031tn2lgv2gar", "allocationId": "20a4cf77-7095-4635-b9e9-43a7564385c4"}
{code}
It gets to the scheduler core and is logged. This again is CORRECTLY handled: there is no follow-up on this request. If the second delete request had triggered a further update on the app, node or queue, we should have seen the same message(s) as before. The real allocation is still smaller than the placeholder, so the line "{{replacing placeholder: placeholder is larger than real allocation}}" should have been repeated. Since that did not happen, the only conclusion is that this [check|https://github.com/apache/yunikorn-core/blob/branch-1.2/pkg/scheduler/objects/application.go#L1592-L1593] that is built into {{application.ReplaceAllocation()}} has worked. Debug logging is the only way to confirm that.
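For illustration, the kind of guard that check provides can be sketched like this (simplified names, not the exact yunikorn-core code): the placeholder has already been removed from the application's allocation map by the first release, so the second release finds nothing and returns before any node or queue accounting is touched:
{code:go}
// Illustrative sketch of an idempotent placeholder replacement; the real
// guard lives in application.ReplaceAllocation() in yunikorn-core, these
// names and types are simplified.
package main

import "fmt"

type allocation struct {
	uuid        string
	placeholder bool
}

type application struct {
	appID       string
	allocations map[string]*allocation
}

// replaceAllocation removes the placeholder from the application's map.
// If the placeholder is no longer tracked (already replaced), it returns
// nil and the caller skips all further node and queue updates.
func (a *application) replaceAllocation(uuid string) *allocation {
	ph, ok := a.allocations[uuid]
	if !ok {
		return nil // second release for the same placeholder: no-op
	}
	delete(a.allocations, uuid)
	return ph
}

func main() {
	app := &application{
		appID: "spark-000000031tn2lgv2gar",
		allocations: map[string]*allocation{
			"20a4cf77-7095-4635-b9e9-43a7564385c4": {uuid: "20a4cf77-7095-4635-b9e9-43a7564385c4", placeholder: true},
		},
	}
	fmt.Println(app.replaceAllocation("20a4cf77-7095-4635-b9e9-43a7564385c4") != nil) // true: first release does the work
	fmt.Println(app.replaceAllocation("20a4cf77-7095-4635-b9e9-43a7564385c4") != nil) // false: second release is ignored
}
{code}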

Something is broken, I agree with that, but I do not think the double release is the cause. If this is easily reproducible, a debug log might show the detail we are looking for.



> Placeholder allocations are being removed twice
> -----------------------------------------------
>
>                 Key: YUNIKORN-1712
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1712
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Craig Condit
>            Assignee: Qi Zhu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: logs.txt
>
>
> Under some circumstances, it seems that placeholder allocations are being removed multiple times:
> {code}
> 2023-04-25T06:25:46.279Z	INFO	scheduler/partition.go:1233	replacing placeholder allocation {"appID": "spark-000000031tn2lgv2gar", "allocationId": "20a4cf77-7095-4635-b9e9-43a7564385c4"}
> ...
> 2023-04-25T06:25:46.299Z	INFO	scheduler/partition.go:1233	replacing placeholder allocation {"appID": "spark-000000031tn2lgv2gar", "allocationId": "20a4cf77-7095-4635-b9e9-43a7564385c4"}
> {code}
> This message only appears once in the codebase, in {{PartitionContext.removeAllocation()}}. Furthermore, it is guarded by a test for {{release.TerminationType == si.TerminationType_PLACEHOLDER_REPLACED}}. This would seem to indicate that {{removeAllocation()}} is somehow being called *twice*. I believe this would cause the used resources on the node to be subtracted twice for the same allocation. This quickly results in health checks failing:
> {code}
> 2023-04-25T06:26:10.632Z        WARN    scheduler/health_checker.go:176 Scheduler is not healthy        {"health check values": [..., {"Name":"Consistency of data","Succeeded":false,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: [\"ip-10-0-112-148.eu-central-1.compute.internal\"]"}, ...]}
> {code}
> This was originally thought to be YUNIKORN-1615, but that seems related to occupied (rather than used) resources.


