Posted to issues@yunikorn.apache.org by "Ayub Pathan (Jira)" <ji...@apache.org> on 2021/07/08 15:34:00 UTC

[jira] [Updated] (YUNIKORN-518) Placeholder manager failed to init during scheduler recovery

     [ https://issues.apache.org/jira/browse/YUNIKORN-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayub Pathan updated YUNIKORN-518:
---------------------------------
    Description: 
{noformat}
Name:         yunikorn-scheduler-6577f789d8-vc5cc
Namespace:    yunikorn
Priority:     0
Node:         ip-10-192-153-109.ca-central-1.compute.internal/10.192.153.109
Start Time:   Tue, 26 Jan 2021 19:17:12 -0800
Labels:       app=yunikorn
              component=yunikorn-scheduler
              pod-template-hash=6577f789d8
              release=yunikorn
Annotations:  cni.projectcalico.org/podIP: 100.100.166.78/32
              cni.projectcalico.org/podIPs: 100.100.166.78/32
              kubernetes.io/psp: eks.privileged
Status:       Running
IP:           100.100.166.78
IPs:
  IP:           100.100.166.78
Controlled By:  ReplicaSet/yunikorn-scheduler-6577f789d8
Containers:
  yunikorn-scheduler-k8s:
    Container ID:   docker://759f2b2f14ba37f46a42cdc59a5c51ed19d442ed717b81ee98d30177b7a184e6
    Image:          <>/cloudera/yunikorn-scheduler:0.10.0-b9
    Image ID:       docker-pullable://<>/cloudera/yunikorn-scheduler@sha256:878300a91cfd3b9d6dc515948afbfab23572a475b0df7006f06480ee06d1aceb
    Port:           9080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 26 Jan 2021 19:18:01 -0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 26 Jan 2021 19:17:33 -0800
      Finished:     Tue, 26 Jan 2021 19:17:33 -0800
    Ready:          True
    Restart Count:  3
    Limits:
      cpu:     4
      memory:  2Gi
    Requests:
      cpu:     200m
      memory:  1Gi
    Environment:
      NAMESPACE:                                yunikorn (v1:metadata.namespace)
      ADMISSION_CONTROLLER_IMAGE_REGISTRY:      <>/cloudera/yunikorn-admission
      ADMISSION_CONTROLLER_IMAGE_TAG:           0.10.0-b9
      ADMISSION_CONTROLLER_IMAGE_PULL_POLICY:   Always
      ADMISSION_CONTROLLER_IMAGE_PULL_SECRETS:  [dockercreds]
    Mounts:
      /etc/yunikorn/ from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from yunikorn-admin-token-dnq4h (ro)
  yunikorn-scheduler-web:
    Container ID:   docker://0b8205bb8292f193765bbc563ea10010106fd316257e523c3446c5685ee0d5bf
    Image:          <>/cloudera/yunikorn-web:0.10.0-b9
    Image ID:       docker-pullable://<>/cloudera/yunikorn-web@sha256:a64b986df2dc737958701838f41f9fae7f2e4a353a497949ba6b9e75b4b44b66
    Port:           9889/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 26 Jan 2021 19:17:17 -0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  500Mi
    Requests:
      cpu:        100m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from yunikorn-admin-token-dnq4h (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      yunikorn-configs
    Optional:  false
  yunikorn-admin-token-dnq4h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  yunikorn-admin-token-dnq4h
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  role.node.kubernetes.io/liftie-infra=true
Tolerations:     CriticalAddonsOnly op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 role.node.kubernetes.io/liftie-infra=true:NoSchedule
Events:
  Type     Reason               Age                From               Message
  ----     ------               ----               ----               -------
  Normal   Scheduled            61s                default-scheduler  Successfully assigned yunikorn/yunikorn-scheduler-6577f789d8-vc5cc to ip-10-192-153-109.ca-central-1.compute.internal
  Normal   Pulling              57s                kubelet            Pulling image "<>/cloudera/yunikorn-web:0.10.0-b9"
  Normal   Started              56s                kubelet            Started container yunikorn-scheduler-web
  Normal   Created              56s                kubelet            Created container yunikorn-scheduler-web
  Normal   Pulled               56s                kubelet            Successfully pulled image "<>/cloudera/yunikorn-web:0.10.0-b9"
  Warning  FailedPreStopHook    55s (x2 over 58s)  kubelet            Exec lifecycle hook ([/bin/sh /admission_util.sh delete]) for Container "yunikorn-scheduler-k8s" in Pod "yunikorn-scheduler-6577f789d8-vc5cc_yunikorn(082e1cc7-8765-4aa3-baac-48e3b048cfc6)" failed - error: command '/bin/sh /admission_util.sh delete' exited with 126: , message: "cannot exec in a stopped state: unknown\r\n"
  Normal   Killing              55s (x2 over 58s)  kubelet            FailedPostStartHook
  Warning  BackOff              53s (x2 over 54s)  kubelet            Back-off restarting failed container
  Normal   Pulling              41s (x3 over 60s)  kubelet            Pulling image "<>/cloudera/yunikorn-scheduler:0.10.0-b9"
  Warning  FailedPostStartHook  40s (x3 over 58s)  kubelet            Exec lifecycle hook ([/bin/sh /admission_util.sh create]) for Container "yunikorn-scheduler-k8s" in Pod "yunikorn-scheduler-6577f789d8-vc5cc_yunikorn(082e1cc7-8765-4aa3-baac-48e3b048cfc6)" failed - error: command '/bin/sh /admission_util.sh create' exited with 137: , message: ""
  Normal   Started              40s (x3 over 58s)  kubelet            Started container yunikorn-scheduler-k8s
  Normal   Created              40s (x3 over 58s)  kubelet            Created container yunikorn-scheduler-k8s
  Normal   Pulled               40s (x3 over 58s)  kubelet            Successfully pulled image "<>/cloudera/yunikorn-scheduler:0.10.0-b9" {noformat}

This is not a blocker, but the scheduler was restarted multiple (3) times, hence this report. This could be due to an issue in the admission controller start script.
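
For reading the exit codes in the events above, here is a minimal sketch that maps them onto the usual shell convention (126 = command could not be executed, 127 = command not found, 128+N = killed by signal N). The helper name describe_exit_code and the small signal table are illustrative assumptions, not part of YuniKorn or kubectl. Under that convention, 137 on the postStart hook would mean the hook process received SIGKILL, which may simply reflect the container itself exiting (Exit Code 1, started and finished at 19:17:33) while the hook was still running. That is an interpretation of the events, not a confirmed root cause.

```python
# Hypothetical helper for interpreting container/hook exit codes.
# Only the two signals relevant here are named; others fall back to "unknown signal".
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Return a short description of an exit code using the common shell convention."""
    if code == 0:
        return "success"
    if code == 126:
        return "command could not be executed"
    if code == 127:
        return "command not found"
    if 128 < code < 256:
        # Codes above 128 conventionally encode 128 + signal number.
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, "unknown signal")
        return f"killed by signal {signum} ({name})"
    return "general error reported by the process"


if __name__ == "__main__":
    # Exit codes taken from the pod description: container (1),
    # preStop hook (126), postStart hook (137).
    for code in (1, 126, 137):
        print(code, "->", describe_exit_code(code))
```

The 126 on the preStop hook is consistent with the accompanying message "cannot exec in a stopped state": the container had already stopped when the hook was attempted.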

  was:
{noformat}
Name:         yunikorn-scheduler-6577f789d8-vc5cc
Namespace:    yunikorn
Priority:     0
Node:         ip-10-192-153-109.ca-central-1.compute.internal/10.192.153.109
Start Time:   Tue, 26 Jan 2021 19:17:12 -0800
Labels:       app=yunikorn
              component=yunikorn-scheduler
              pod-template-hash=6577f789d8
              release=yunikorn
Annotations:  cni.projectcalico.org/podIP: 100.100.166.78/32
              cni.projectcalico.org/podIPs: 100.100.166.78/32
              kubernetes.io/psp: eks.privileged
Status:       Running
IP:           100.100.166.78
IPs:
  IP:           100.100.166.78
Controlled By:  ReplicaSet/yunikorn-scheduler-6577f789d8
Containers:
  yunikorn-scheduler-k8s:
    Container ID:   docker://759f2b2f14ba37f46a42cdc59a5c51ed19d442ed717b81ee98d30177b7a184e6
    Image:          container-dev.repo.cloudera.com/cloudera/yunikorn-scheduler:0.10.0-b9
    Image ID:       docker-pullable://container-dev.repo.cloudera.com/cloudera/yunikorn-scheduler@sha256:878300a91cfd3b9d6dc515948afbfab23572a475b0df7006f06480ee06d1aceb
    Port:           9080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 26 Jan 2021 19:18:01 -0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 26 Jan 2021 19:17:33 -0800
      Finished:     Tue, 26 Jan 2021 19:17:33 -0800
    Ready:          True
    Restart Count:  3
    Limits:
      cpu:     4
      memory:  2Gi
    Requests:
      cpu:     200m
      memory:  1Gi
    Environment:
      NAMESPACE:                                yunikorn (v1:metadata.namespace)
      ADMISSION_CONTROLLER_IMAGE_REGISTRY:      container-dev.repo.cloudera.com/cloudera/yunikorn-admission
      ADMISSION_CONTROLLER_IMAGE_TAG:           0.10.0-b9
      ADMISSION_CONTROLLER_IMAGE_PULL_POLICY:   Always
      ADMISSION_CONTROLLER_IMAGE_PULL_SECRETS:  [dockercreds]
    Mounts:
      /etc/yunikorn/ from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from yunikorn-admin-token-dnq4h (ro)
  yunikorn-scheduler-web:
    Container ID:   docker://0b8205bb8292f193765bbc563ea10010106fd316257e523c3446c5685ee0d5bf
    Image:          container-dev.repo.cloudera.com/cloudera/yunikorn-web:0.10.0-b9
    Image ID:       docker-pullable://container-dev.repo.cloudera.com/cloudera/yunikorn-web@sha256:a64b986df2dc737958701838f41f9fae7f2e4a353a497949ba6b9e75b4b44b66
    Port:           9889/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 26 Jan 2021 19:17:17 -0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  500Mi
    Requests:
      cpu:        100m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from yunikorn-admin-token-dnq4h (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      yunikorn-configs
    Optional:  false
  yunikorn-admin-token-dnq4h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  yunikorn-admin-token-dnq4h
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  role.node.kubernetes.io/liftie-infra=true
Tolerations:     CriticalAddonsOnly op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 role.node.kubernetes.io/liftie-infra=true:NoSchedule
Events:
  Type     Reason               Age                From               Message
  ----     ------               ----               ----               -------
  Normal   Scheduled            61s                default-scheduler  Successfully assigned yunikorn/yunikorn-scheduler-6577f789d8-vc5cc to ip-10-192-153-109.ca-central-1.compute.internal
  Normal   Pulling              57s                kubelet            Pulling image "container-dev.repo.cloudera.com/cloudera/yunikorn-web:0.10.0-b9"
  Normal   Started              56s                kubelet            Started container yunikorn-scheduler-web
  Normal   Created              56s                kubelet            Created container yunikorn-scheduler-web
  Normal   Pulled               56s                kubelet            Successfully pulled image "container-dev.repo.cloudera.com/cloudera/yunikorn-web:0.10.0-b9"
  Warning  FailedPreStopHook    55s (x2 over 58s)  kubelet            Exec lifecycle hook ([/bin/sh /admission_util.sh delete]) for Container "yunikorn-scheduler-k8s" in Pod "yunikorn-scheduler-6577f789d8-vc5cc_yunikorn(082e1cc7-8765-4aa3-baac-48e3b048cfc6)" failed - error: command '/bin/sh /admission_util.sh delete' exited with 126: , message: "cannot exec in a stopped state: unknown\r\n"
  Normal   Killing              55s (x2 over 58s)  kubelet            FailedPostStartHook
  Warning  BackOff              53s (x2 over 54s)  kubelet            Back-off restarting failed container
  Normal   Pulling              41s (x3 over 60s)  kubelet            Pulling image "container-dev.repo.cloudera.com/cloudera/yunikorn-scheduler:0.10.0-b9"
  Warning  FailedPostStartHook  40s (x3 over 58s)  kubelet            Exec lifecycle hook ([/bin/sh /admission_util.sh create]) for Container "yunikorn-scheduler-k8s" in Pod "yunikorn-scheduler-6577f789d8-vc5cc_yunikorn(082e1cc7-8765-4aa3-baac-48e3b048cfc6)" failed - error: command '/bin/sh /admission_util.sh create' exited with 137: , message: ""
  Normal   Started              40s (x3 over 58s)  kubelet            Started container yunikorn-scheduler-k8s
  Normal   Created              40s (x3 over 58s)  kubelet            Created container yunikorn-scheduler-k8s
  Normal   Pulled               40s (x3 over 58s)  kubelet            Successfully pulled image "container-dev.repo.cloudera.com/cloudera/yunikorn-scheduler:0.10.0-b9" {noformat}

This is not a blocker, but the scheduler was restarted multiple (3) times, hence this report. This could be due to an issue in the admission controller start script.


> Placeholder manager failed to init during scheduler recovery
> ------------------------------------------------------------
>
>                 Key: YUNIKORN-518
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-518
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>    Affects Versions: 0.10
>            Reporter: Ayub Pathan
>            Assignee: Weiwei Yang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.10
>
>         Attachments: yk-sc.log
>


