Posted to dev@yunikorn.apache.org by "Wilfred Spiegelenburg (Jira)" <ji...@apache.org> on 2022/06/30 06:01:00 UTC
[jira] [Resolved] (YUNIKORN-229) shim sends the same remove request twice for a remove allocation
[ https://issues.apache.org/jira/browse/YUNIKORN-229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wilfred Spiegelenburg resolved YUNIKORN-229.
--------------------------------------------
Resolution: Cannot Reproduce
The communication between the shim and the core has been simplified: we now only trigger updates between the k8shim and the core if the object actually exists in the k8shim cache.
We have also completely rewritten the k8shim cache over the last few months to fix leaked objects and inconsistent updates. We have not seen the issue since those two changes went in.
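The existence check is the key part of that change: a release is only forwarded once, after which the object is gone from the cache and any duplicate event short-circuits. A minimal sketch of that guard, assuming a hypothetical shimCache type (the names below are illustrative, not the actual k8shim API):
{code}
package main

import "fmt"

// shimCache is a hypothetical stand-in for the k8shim allocation cache.
type shimCache struct {
	allocations map[string]bool // keyed by allocation UUID
}

// release forwards a remove request to the core only when the allocation
// is still tracked in the cache, so a duplicate event becomes a no-op.
func (c *shimCache) release(uuid string) {
	if !c.allocations[uuid] {
		fmt.Printf("allocation %s not in cache, skipping release\n", uuid)
		return
	}
	delete(c.allocations, uuid)
	fmt.Printf("release for %s forwarded to core\n", uuid)
}

func main() {
	c := &shimCache{allocations: map[string]bool{
		"3bf0a159-89ee-4bdc-ada1-c577ac2097d1": true,
	}}
	c.release("3bf0a159-89ee-4bdc-ada1-c577ac2097d1") // first event: forwarded
	c.release("3bf0a159-89ee-4bdc-ada1-c577ac2097d1") // duplicate: skipped
}
{code}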
Closing as no longer reproducible.
> shim sends the same remove request twice for a remove allocation
> ----------------------------------------------------------------
>
> Key: YUNIKORN-229
> URL: https://issues.apache.org/jira/browse/YUNIKORN-229
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Weiwei Yang
> Priority: Critical
>
> In the logs it looks like the shim asks to remove the same allocation using the same UUID:
> First release request from shim:
> {code}
> 2020-06-10T05:54:24.564Z DEBUG cache/cluster_info.go:136 enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event": {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG scheduler/scheduler.go:191 enqueued event {"eventType": "*schedulerevent.SchedulerAllocationUpdatesEvent", "event": {"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG cache/cluster_info.go:136 enqueued event {"eventType": "*cacheevent.ReleaseAllocationsEvent", "event": {"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task completed","ReleaseType":0}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG cache/partition_info.go:429 removing allocations {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c", "allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> 2020-06-10T05:54:24.566Z INFO cache/partition_info.go:477 allocation removed {"numOfAllocationReleased": 1, "partitionName": "[mycluster]default"}
> 2020-06-10T05:54:24.566Z DEBUG rmproxy/rmproxy.go:65 enqueue event {"event": {"RmID":"mycluster","ReleasedAllocations":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.566Z DEBUG callback/scheduler_callback.go:44 callback received {"updateResponse": "releasedAllocations:<UUID:\"3bf0a159-89ee-4bdc-ada1-c577ac2097d1\" message:\"task completed\" > "}
> 2020-06-10T05:54:24.566Z DEBUG callback/scheduler_callback.go:119 callback: response to released allocations {"UUID": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> {code}
> Second release request from the shim, about 16 seconds after the first request:
> {code}
> 2020-06-10T05:54:40.423Z DEBUG cache/cluster_info.go:136 enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event": {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG scheduler/scheduler.go:191 enqueued event {"eventType": "*schedulerevent.SchedulerAllocationUpdatesEvent", "event": {"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG cache/cluster_info.go:136 enqueued event {"eventType": "*cacheevent.ReleaseAllocationsEvent", "event": {"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task completed","ReleaseType":0}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:429 removing allocations {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c", "allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> 2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:442 no active allocations found to release {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c"}
> {code}
> The core scheduler handles this correctly and simply ignores the duplicate request (see the sketch below), but as the number of tasks in the shim grows the extra requests could have a significant performance impact, so we need to find out why the shim removes the same allocation twice.
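> A minimal sketch of that idempotent handling on the core side, matching the "no active allocations found to release" log line above; the partition type and its fields are hypothetical, not the actual core API:
> {code}
> package main
>
> import "fmt"
>
> // partition is a hypothetical stand-in for the core's partition cache.
> type partition struct {
> 	allocations map[string]string // UUID -> applicationID
> }
>
> // removeAllocation mirrors the behaviour seen in the logs: a release for
> // an unknown UUID is logged and ignored rather than treated as an error.
> func (p *partition) removeAllocation(uuid string) {
> 	appID, ok := p.allocations[uuid]
> 	if !ok {
> 		fmt.Println("no active allocations found to release")
> 		return
> 	}
> 	delete(p.allocations, uuid)
> 	fmt.Printf("allocation removed {appID: %s, UUID: %s}\n", appID, uuid)
> }
>
> func main() {
> 	p := &partition{allocations: map[string]string{
> 		"3bf0a159-89ee-4bdc-ada1-c577ac2097d1": "spark-3a34f5a12bc54c24b7d5f02957cff30c",
> 	}}
> 	p.removeAllocation("3bf0a159-89ee-4bdc-ada1-c577ac2097d1") // first request: removed
> 	p.removeAllocation("3bf0a159-89ee-4bdc-ada1-c577ac2097d1") // duplicate: ignored
> }
> {code}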