Posted to issues@yunikorn.apache.org by "Craig Condit (Jira)" <ji...@apache.org> on 2022/06/21 16:58:00 UTC

[jira] [Commented] (YUNIKORN-1244) Performance regression in scheduling after YUNIKORN-1227

    [ https://issues.apache.org/jira/browse/YUNIKORN-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557010#comment-17557010 ] 

Craig Condit commented on YUNIKORN-1244:
----------------------------------------

The regression appears to be caused by the new snapshot code. While snapshotting does fix the original issue, taking a snapshot of the entire scheduler cache whenever it changes doesn't scale well on extremely large clusters. Based on some synthetic test cases, the overhead of a new snapshot grows primarily with the number of pods allocated on the cluster.

With only 1 node and 1 pod allocated, a new snapshot costs 27us on a test machine. With 10k nodes and 10k pods this grows to 41ms, and with 1k nodes and 100k pods the snapshot takes 332ms.
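
To illustrate why a full-cache snapshot scales this way, here is a minimal sketch in Python. It is not the YuniKorn code: the names SchedulerCache, Node, Pod, snapshot and build_cache are hypothetical stand-ins, and the model assumes a snapshot copies every node and every pod allocated on it. Under that assumption, the work per snapshot is proportional to nodes plus pods, so the pod count dominates once pods far outnumber nodes.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Pod:
    # Hypothetical pod record; real scheduler objects carry far more state.
    name: str
    cpu_millis: int


@dataclass
class Node:
    # Hypothetical node record holding the pods allocated to it.
    name: str
    pods: dict = field(default_factory=dict)


class SchedulerCache:
    """Illustrative stand-in for a scheduler cache (an assumption, not YuniKorn's type)."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, name):
        self.nodes[name] = Node(name)

    def allocate(self, node_name, pod):
        self.nodes[node_name].pods[pod.name] = pod

    def snapshot(self):
        """Copy every node and every allocated pod.

        Returns the copy and the number of objects copied, which is
        len(nodes) + total pods: the cost of one full snapshot in this model.
        """
        copied = 0
        snap = {}
        for node in self.nodes.values():
            node_copy = Node(node.name)
            copied += 1
            for pod in node.pods.values():
                node_copy.pods[pod.name] = copy.copy(pod)
                copied += 1
            snap[node.name] = node_copy
        return snap, copied


def build_cache(node_count, pod_count):
    """Build a cache with pods spread round-robin across the nodes."""
    cache = SchedulerCache()
    for n in range(node_count):
        cache.add_node(f"node-{n}")
    for p in range(pod_count):
        cache.allocate(f"node-{p % node_count}", Pod(f"pod-{p}", 100))
    return cache


if __name__ == "__main__":
    cache = build_cache(node_count=10, pod_count=1000)
    _, copied = cache.snapshot()
    print(copied)  # 1010: 10 nodes + 1000 pods copied for a single snapshot
```

As a rough cross-check against the figures above, and attributing the whole cost to pods, 41ms / 10k pods is about 4.1us per pod and 332ms / 100k pods is about 3.3us per pod, which fits a cost that is roughly linear in the pod count.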

We will need to investigate other ways of solving the original race condition, as snapshot performance degrades quite badly at the extremes.

> Performance regression in scheduling after YUNIKORN-1227
> --------------------------------------------------------
>
>                 Key: YUNIKORN-1244
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1244
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Craig Condit
>            Assignee: Craig Condit
>            Priority: Major
>
> After YUNIKORN-1227, performance regressions have been reported on very large clusters.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
