Posted to issues@yunikorn.apache.org by "Ayub Pathan (Jira)" <ji...@apache.org> on 2021/03/16 04:42:00 UTC

[jira] [Created] (YUNIKORN-575) Regression: Post restart, Yunikorn tries to recover completed apps and schedules placeholder pods.

Ayub Pathan created YUNIKORN-575:
------------------------------------

             Summary: Regression: Post restart, Yunikorn tries to recover completed apps and schedules placeholder pods.
                 Key: YUNIKORN-575
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-575
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
            Reporter: Ayub Pathan
         Attachments: Screen Shot 2021-03-15 at 9.27.10 PM.png, yk_recover.log

* After a restart, YK tries to recover the completed apps and schedules placeholder pods for them, even though the real pods are already in the Completed state. These placeholders may not be needed, and they hold resources that the completed apps no longer use.
{noformat}
gang-app-timeout-1006-5jqqk               0/1     Completed   0          69m
gang-app-timeout-1007-tw44t               0/1     Completed   0          66m
gang-app-timeout-1008-dmzc4               0/1     Completed   0          64m
gang-app-timeout-1008-dwxgq               0/1     Completed   0          64m
gang-app-timeout-1008-sl2x9               0/1     Completed   0          64m
tg-timeout-1006-gang-app-timeout-1006-0   1/1     Running     0          60s
tg-timeout-1006-gang-app-timeout-1006-1   1/1     Running     0          60s
tg-timeout-1006-gang-app-timeout-1006-2   1/1     Running     0          60s
tg-timeout-1007-gang-app-timeout-1007-0   1/1     Running     0          60s
tg-timeout-1007-gang-app-timeout-1007-1   1/1     Running     0          60s
tg-timeout-1007-gang-app-timeout-1007-2   0/1     Pending     0          60s
tg-timeout-1008-gang-app-timeout-1008-0   1/1     Running     0          60s
tg-timeout-1008-gang-app-timeout-1008-1   1/1     Running     0          60s
tg-timeout-1008-gang-app-timeout-1008-2   1/1     Running     0          60s
{noformat}
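
The symptom in the listing above can be checked mechanically. The following is a minimal sketch, not part of YuniKorn: the function names are hypothetical, and it assumes that placeholder pod names start with "tg-" and end with "-<application ID>-<index>", and that real pod names are "<application ID>-<random suffix>", as the names in this report suggest.

```python
from collections import defaultdict

# Pod rows copied from the listing above (NAME, READY, STATUS, RESTARTS, AGE).
POD_LISTING = """\
gang-app-timeout-1006-5jqqk               0/1     Completed   0          69m
gang-app-timeout-1007-tw44t               0/1     Completed   0          66m
gang-app-timeout-1008-dmzc4               0/1     Completed   0          64m
gang-app-timeout-1008-dwxgq               0/1     Completed   0          64m
gang-app-timeout-1008-sl2x9               0/1     Completed   0          64m
tg-timeout-1006-gang-app-timeout-1006-0   1/1     Running     0          60s
tg-timeout-1006-gang-app-timeout-1006-1   1/1     Running     0          60s
tg-timeout-1006-gang-app-timeout-1006-2   1/1     Running     0          60s
tg-timeout-1007-gang-app-timeout-1007-0   1/1     Running     0          60s
tg-timeout-1007-gang-app-timeout-1007-1   1/1     Running     0          60s
tg-timeout-1007-gang-app-timeout-1007-2   0/1     Pending     0          60s
tg-timeout-1008-gang-app-timeout-1008-0   1/1     Running     0          60s
tg-timeout-1008-gang-app-timeout-1008-1   1/1     Running     0          60s
tg-timeout-1008-gang-app-timeout-1008-2   1/1     Running     0          60s
"""


def parse_pods(listing):
    """Return (name, status) pairs from `kubectl get pods`-style rows."""
    pods = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            pods.append((fields[0], fields[2]))
    return pods


def stale_placeholders(pods):
    """Map app ID -> active placeholder pods of apps whose real pods all completed."""
    # Assumption: a real pod is named <application ID>-<random suffix>.
    real_status = defaultdict(list)
    for name, status in pods:
        if not name.startswith("tg-"):
            real_status[name.rsplit("-", 1)[0]].append(status)
    completed_apps = {
        app for app, statuses in real_status.items()
        if all(s == "Completed" for s in statuses)
    }

    # Assumption: a placeholder pod is named tg-<...>-<application ID>-<index>.
    stale = defaultdict(list)
    for name, status in pods:
        if name.startswith("tg-") and status in ("Running", "Pending"):
            owner = name.rsplit("-", 1)[0]  # drop the trailing index
            for app in completed_apps:
                if owner.endswith("-" + app):
                    stale[app].append(name)
    return dict(stale)


if __name__ == "__main__":
    for app, names in sorted(stale_placeholders(parse_pods(POD_LISTING)).items()):
        print(app, len(names), "placeholder pod(s) still active")
```

On the listing from this report, the sketch flags three active placeholder pods for each of gang-app-timeout-1006, gang-app-timeout-1007 and gang-app-timeout-1008.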

* *After a restart, all the completed apps are marked as failed and their allocations are not released. This could be a resource leak.*
{noformat}
[
    {
        "allocations": null,
        "applicationID": "gang-app-timeout-1009",
        "applicationState": "Accepted",
        "partition": "[mycluster]default",
        "queueName": "root.fifo",
        "submissionTime": 1615868052062417676,
        "usedResource": "[]"
    },
    {
        "allocations": null,
        "applicationID": "gang-app-timeout-1011",
        "applicationState": "Accepted",
        "partition": "[mycluster]default",
        "queueName": "root.fifo",
        "submissionTime": 1615868052062788287,
        "usedResource": "[]"
    },
    {
        "allocations": null,
        "applicationID": "gang-app-timeout-1010",
        "applicationState": "Accepted",
        "partition": "[mycluster]default",
        "queueName": "root.fifo",
        "submissionTime": 1615868052057156621,
        "usedResource": "[]"
    },
    {
        "allocations": null,
        "applicationID": "gang-app-timeout-1003",
        "applicationState": "Accepted",
        "partition": "[mycluster]default",
        "queueName": "root.fifo",
        "submissionTime": 1615868052062023562,
        "usedResource": "[]"
    },
    {
        "allocations": [
            {
                "allocationKey": "0a761a05-4b00-4e34-a54d-22411007553a",
                "allocationTags": {
                    "kubernetes.io/label/applicationId": "gang-app-timeout-1008",
                    "kubernetes.io/label/queue": "fifo",
                    "kubernetes.io/meta/namespace": "fifo",
                    "kubernetes.io/meta/podName": "tg-timeout-1008-gang-app-timeout-1008-0"
                },
                "applicationId": "gang-app-timeout-1008",
                "nodeId": "ip-10-192-131-213.ca-central-1.compute.internal",
                "partition": "default",
                "priority": "0",
                "queueName": "root.fifo",
                "resource": "[memory:300 vcore:300]",
                "uuid": "9704811c-422d-4efa-bb42-ab565fb5f16b"
            },
            {
                "allocationKey": "2505258b-3358-4143-b2a2-9084ffa0977b",
                "allocationTags": {
                    "kubernetes.io/label/applicationId": "gang-app-timeout-1008",
                    "kubernetes.io/label/queue": "fifo",
                    "kubernetes.io/meta/namespace": "fifo",
                    "kubernetes.io/meta/podName": "tg-timeout-1008-gang-app-timeout-1008-1"
                },
                "applicationId": "gang-app-timeout-1008",
                "nodeId": "ip-10-192-131-213.ca-central-1.compute.internal",
                "partition": "default",
                "priority": "0",
                "queueName": "root.fifo",
                "resource": "[memory:300 vcore:300]",
                "uuid": "e0ff467d-ec18-4d5b-b981-861835f1604a"
            },
            {
                "allocationKey": "29dbfaec-7632-4bff-b4ea-e313521497f1",
                "allocationTags": {
                    "kubernetes.io/label/applicationId": "gang-app-timeout-1008",
                    "kubernetes.io/label/queue": "fifo",
                    "kubernetes.io/meta/namespace": "fifo",
                    "kubernetes.io/meta/podName": "tg-timeout-1008-gang-app-timeout-1008-2"
                },
                "applicationId": "gang-app-timeout-1008",
                "nodeId": "ip-10-192-142-84.ca-central-1.compute.internal",
                "partition": "default",
                "priority": "0",
                "queueName": "root.fifo",
                "resource": "[memory:300 vcore:300]",
                "uuid": "6723d3ac-c7c8-4935-bb23-3b443909a252"
            }
        ],
        "applicationID": "gang-app-timeout-1008",
        "applicationState": "Failed",
        "partition": "[mycluster]default",
        "queueName": "root.fifo",
        "submissionTime": 1615868050004448061,
        "usedResource": "[]"
    },
    {
        "allocations": [
            {
                "allocationKey": "05d87d17-a6dc-4bc0-b495-c76f1cd0a3cb",
                "allocationTags": {
                    "kubernetes.io/label/applicationId": "gang-app-timeout-1007",
                    "kubernetes.io/label/queue": "fifo",
                    "kubernetes.io/meta/namespace": "fifo",
                    "kubernetes.io/meta/podName": "tg-timeout-1007-gang-app-timeout-1007-0"
                },
                "applicationId": "gang-app-timeout-1007",
                "nodeId": "ip-10-192-131-213.ca-central-1.compute.internal",
                "partition": "default",
                "priority": "0",
                "queueName": "root.fifo",
                "resource": "[memory:300 vcore:300]",
                "uuid": "67401008-61b0-4957-8361-6d0e8917c21f"
            },
            {
                "allocationKey": "1af95692-0186-44fe-b712-30edb51b85c2",
                "allocationTags": {
                    "kubernetes.io/label/applicationId": "gang-app-timeout-1007",
                    "kubernetes.io/label/queue": "fifo",
                    "kubernetes.io/meta/namespace": "fifo",
                    "kubernetes.io/meta/podName": "tg-timeout-1007-gang-app-timeout-1007-1"
                },
                "applicationId": "gang-app-timeout-1007",
                "nodeId": "ip-10-192-142-84.ca-central-1.compute.internal",
                "partition": "default",
                "priority": "0",
                "queueName": "root.fifo",
                "resource": "[memory:300 vcore:300]",
                "uuid": "5d1f129e-3e40-4103-b2e6-53daf408465f"
            }
        ],
        "applicationID": "gang-app-timeout-1007",
        "applicationState": "Failed",
        "partition": "[mycluster]default",
        "queueName": "root.fifo",
        "submissionTime": 1615868050004840460,
        "usedResource": "[]"
    },
    {
        "allocations": [
            {
                "allocationKey": "8524d2ab-a591-4fca-8a5f-3847e8d173ab",
                "allocationTags": {
                    "kubernetes.io/label/applicationId": "gang-app-timeout-1006",
                    "kubernetes.io/label/queue": "fifo",
                    "kubernetes.io/meta/namespace": "fifo",
                    "kubernetes.io/meta/podName": "tg-timeout-1006-gang-app-timeout-1006-1"
                },
                "applicationId": "gang-app-timeout-1006",
                "nodeId": "ip-10-192-142-84.ca-central-1.compute.internal",
                "partition": "default",
                "priority": "0",
                "queueName": "root.fifo",
                "resource": "[memory:300 vcore:300]",
                "uuid": "909735f0-607b-4799-bf4c-8b45f59c174b"
            },
            {
                "allocationKey": "b33078a1-aac6-4217-afd5-3c80248782dd",
                "allocationTags": {
                    "kubernetes.io/label/applicationId": "gang-app-timeout-1006",
                    "kubernetes.io/label/queue": "fifo",
                    "kubernetes.io/meta/namespace": "fifo",
                    "kubernetes.io/meta/podName": "tg-timeout-1006-gang-app-timeout-1006-2"
                },
                "applicationId": "gang-app-timeout-1006",
                "nodeId": "ip-10-192-131-213.ca-central-1.compute.internal",
                "partition": "default",
                "priority": "0",
                "queueName": "root.fifo",
                "resource": "[memory:300 vcore:300]",
                "uuid": "80f04647-ada2-4851-9361-d6bcb5c18c65"
            },
            {
                "allocationKey": "e7aa1b09-fac8-43bf-aae9-48215086ae36",
                "allocationTags": {
                    "kubernetes.io/label/applicationId": "gang-app-timeout-1006",
                    "kubernetes.io/label/queue": "fifo",
                    "kubernetes.io/meta/namespace": "fifo",
                    "kubernetes.io/meta/podName": "tg-timeout-1006-gang-app-timeout-1006-0"
                },
                "applicationId": "gang-app-timeout-1006",
                "nodeId": "ip-10-192-142-84.ca-central-1.compute.internal",
                "partition": "default",
                "priority": "0",
                "queueName": "root.fifo",
                "resource": "[memory:300 vcore:300]",
                "uuid": "f6172318-7e4a-4252-8bf5-8346de4a4d48"
            }
        ],
        "applicationID": "gang-app-timeout-1006",
        "applicationState": "Failed",
        "partition": "[mycluster]default",
        "queueName": "root.fifo",
        "submissionTime": 1615868050003595376,
        "usedResource": "[]"
    }
]
{noformat}
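
The suspected leak can be spotted directly in output of this shape. Below is a minimal sketch, assuming the JSON layout shown above; the sample is trimmed to two applications and to the fields the check reads, and the helper names are hypothetical, not part of YuniKorn. It reports every application that is in the Failed state but still lists allocations, and sums the resources those allocations hold.

```python
import json
import re
from collections import Counter

# Trimmed sample in the shape of the output above: one Accepted app with no
# allocations and one Failed app that still holds two placeholder allocations.
SAMPLE = """
[
  {"applicationID": "gang-app-timeout-1009", "applicationState": "Accepted",
   "allocations": null, "usedResource": "[]"},
  {"applicationID": "gang-app-timeout-1007", "applicationState": "Failed",
   "usedResource": "[]",
   "allocations": [
     {"allocationTags": {"kubernetes.io/meta/podName": "tg-timeout-1007-gang-app-timeout-1007-0"},
      "resource": "[memory:300 vcore:300]"},
     {"allocationTags": {"kubernetes.io/meta/podName": "tg-timeout-1007-gang-app-timeout-1007-1"},
      "resource": "[memory:300 vcore:300]"}
   ]}
]
"""

RESOURCE_RE = re.compile(r"(\w+):(\d+)")


def parse_resource(text):
    """Turn a string such as "[memory:300 vcore:300]" into a Counter."""
    return Counter({name: int(value) for name, value in RESOURCE_RE.findall(text)})


def unreleased_allocations(apps):
    """Map app ID -> summed resources for Failed apps that still list allocations."""
    report = {}
    for app in apps:
        allocations = app.get("allocations") or []  # "allocations" may be null
        if app.get("applicationState") == "Failed" and allocations:
            total = Counter()
            for alloc in allocations:
                total += parse_resource(alloc.get("resource", ""))
            report[app["applicationID"]] = dict(total)
    return report


if __name__ == "__main__":
    print(unreleased_allocations(json.loads(SAMPLE)))
```

In the full output above, the same check would flag gang-app-timeout-1006, gang-app-timeout-1007 and gang-app-timeout-1008. Each of them also reports "usedResource" as "[]" while still listing allocations, which is consistent with the allocations not being released.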

YK UI snapshot showing apps marked as failed.
 !image-2021-03-15-21-37-56-129.png|thumbnail! 

Attached log. [^yk_recover.log] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
