Posted to dev@yunikorn.apache.org by "Craig Condit (Jira)" <ji...@apache.org> on 2023/03/28 22:28:00 UTC

[jira] [Resolved] (YUNIKORN-1632) Yunikorn fails to account for the max number of pods on a node

     [ https://issues.apache.org/jira/browse/YUNIKORN-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit resolved YUNIKORN-1632.
------------------------------------
     Fix Version/s: 1.3.0
    Target Version: 1.3.0
        Resolution: Fixed

Merged to master.

> Yunikorn fails to account for the max number of pods on a node
> --------------------------------------------------------------
>
>                 Key: YUNIKORN-1632
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1632
>             Project: Apache YuniKorn
>          Issue Type: Bug
>            Reporter: Eli Schiff
>            Assignee: Eli Schiff
>            Priority: Major
>             Fix For: 1.3.0
>
>
> In my cluster, I occasionally see events like this on some of my pods, which causes them to get stuck.
> {code:java}
> 54m     Warning  OutOfpods           pod/tg-spark-executor-640b4349263cc74570ae3a1e-0     Node didn't have enough resource: pods, requested: 1, used: 12, capacity: 12{code}
>  
> It mostly happens when many small pods are created at once and all are assigned to the same node. That node could fit the pods based on CPU and memory alone; the limiting factor is the node's pod limit.
> Here is an example of a node from a state dump. Note that this is not the node that ran out of capacity, because I could not capture the state dump in time; it is just a random example node.
> Note how the capacity section lists vcore, memory, and pods, among others. The available section correctly subtracts the used vcore and memory, but pods is still at 12, the same as the capacity, even though there are pods on this node.
> {code:java}
> {
>   "nodeID": "node-1",
>   "hostName": "",
>   "rackName": "",
>   "capacity": {
>     "attachable-volumes-gce-pd": 127,
>     "ephemeral-storage": 1426128608967,
>     "hugepages-1Gi": 0,
>     "hugepages-2Mi": 0,
>     "memory": 257603937977,
>     "pods": 12,
>     "vcore": 31800
>   },
>   "allocated": {
>     "memory": 211619414016,
>     "vcore": 28400
>   },
>   "occupied": {
>     "memory": 648019968,
>     "vcore": 220
>   },
>   "available": {
>     "attachable-volumes-gce-pd": 127,
>     "ephemeral-storage": 1426128608967,
>     "hugepages-1Gi": 0,
>     "hugepages-2Mi": 0,
>     "memory": 45336503993,
>     "pods": 12,
>     "vcore": 3180
>   },
>   "utilized": {
>     "memory": 82,
>     "vcore": 89
>   },
>   "allocations": [
>     {
>       "allocationKey": "44883a88-342f-47ff-ad89-013f420be4a2",
>       "allocationTags": {REDACTED},
>       "requestTime": 1678499763840497296,
>       "allocationTime": 1678499790893957853,
>       "allocationDelay": 27053460557,
>       "uuid": "7097888a-ba67-4eeb-abf4-aba4c8960abc",
>       "resource": {
>         "memory": 52904853504,
>         "vcore": 7100
>       },
>       "priority": "0",
>       "nodeId": "node-1",
>       "applicationId": "640bdfa8dcd4e8dd542d1767",
>       "partition": "default",
>       "placeholder": false,
>       "placeholderUsed": true,
>       "taskGroupName": "",
>       "preempted": false
>     },
>     {
>       "allocationKey": "b30b2d21-8442-43ec-a49c-7f69d4cabe62",
>       "allocationTags": {REDACTED},
>       "requestTime": 1678499763846638434,
>       "allocationTime": 1678499789888691963,
>       "allocationDelay": 26042053529,
>       "uuid": "a6077f71-db66-4cb3-b98b-5f86e3323085",
>       "resource": {
>         "memory": 52904853504,
>         "vcore": 7100
>       },
>       "priority": "0",
>       "nodeId": "node-1",
>       "applicationId": "640bdfa8dcd4e8dd542d1767",
>       "partition": "default",
>       "placeholder": false,
>       "placeholderUsed": true,
>       "taskGroupName": "",
>       "preempted": false
>     },
>     {
>       "allocationKey": "fb5a3169-7d68-4bfd-9009-4b7373fd5daf",
>       "allocationTags": {REDACTED},
>       "requestTime": 1678499763852070372,
>       "allocationTime": 1678499790930270840,
>       "allocationDelay": 27078200468,
>       "uuid": "a593c7a7-4434-44a0-b144-f8e29232162c",
>       "resource": {
>         "memory": 52904853504,
>         "vcore": 7100
>       },
>       "priority": "0",
>       "nodeId": "node-1",
>       "applicationId": "640bdfa8dcd4e8dd542d1767",
>       "partition": "default",
>       "placeholder": false,
>       "placeholderUsed": true,
>       "taskGroupName": "",
>       "preempted": false
>     },
>     {
>       "allocationKey": "c1066dbf-94c9-414b-8bd8-d686cfafa554",
>       "allocationTags": {REDACTED},
>       "requestTime": 1678499893886008831,
>       "allocationTime": 1678499893889194430,
>       "allocationDelay": 3185599,
>       "uuid": "51e5fd09-6224-4c63-ba58-d0be8c6a9dee",
>       "resource": {
>         "memory": 52904853504,
>         "vcore": 7100
>       },
>       "priority": "0",
>       "nodeId": "node-1",
>       "applicationId": "640bdfa7ea72a327e004222b",
>       "partition": "default",
>       "placeholder": false,
>       "placeholderUsed": false,
>       "taskGroupName": "",
>       "preempted": false
>     }
>   ],
>   "schedulable": true,
>   "isReserved": false,
>   "reservations": []
> },{code}
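
The arithmetic in the state dump above can be checked with a short sketch. It is written in Python for illustration only and is not YuniKorn code; the helper names `available` and `fits` are hypothetical, and counting one "pods" unit per allocation is an assumption about how the accounting could work, not a description of the merged fix. The numbers are copied from the dump. Non-YuniKorn ("occupied") pods do not appear in the dump, so the corrected pod count is an upper bound.

```python
# Values copied from the state dump above (only the resources of interest).
capacity = {"memory": 257603937977, "pods": 12, "vcore": 31800}
allocated = {"memory": 211619414016, "vcore": 28400}
occupied = {"memory": 648019968, "vcore": 220}
num_allocations = 4  # entries in the "allocations" list of the dump


def available(capacity, *usages):
    """Subtract each usage map from capacity; a missing key counts as 0."""
    result = dict(capacity)
    for usage in usages:
        for name, amount in usage.items():
            result[name] = result.get(name, 0) - amount
    return result


def fits(request, avail):
    """A request fits only if every requested resource is available."""
    return all(avail.get(name, 0) >= amount for name, amount in request.items())


# Reproduces the "available" section: pods is never subtracted because
# neither "allocated" nor "occupied" carries a "pods" entry.
reported = available(capacity, allocated, occupied)

# Assumption: charge one "pods" unit per allocation placed on the node.
allocated_with_pods = dict(allocated, pods=num_allocations)
corrected = available(capacity, allocated_with_pods, occupied)

# Hypothetical full node: CPU and memory left over, but no pod slots.
full_node = {"memory": 45336503993, "vcore": 3180, "pods": 0}
small_pod = {"memory": 1 << 30, "vcore": 100}        # request without "pods"
small_pod_counted = dict(small_pod, pods=1)          # request with "pods"

print(reported)
print(corrected)
print(fits(small_pod, full_node), fits(small_pod_counted, full_node))
```

With a request that carries no "pods" entry, the fit check passes on a node that has no pod slots left, which matches the OutOfpods event in the report. Once each request also asks for one "pods" unit, the same node is rejected.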
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: dev-help@yunikorn.apache.org