Posted to issues@flink.apache.org by "xingbe (Jira)" <ji...@apache.org> on 2023/03/29 06:58:00 UTC

[jira] [Created] (FLINK-31652) Flink should handle the delete event if the pod was deleted while pending

xingbe created FLINK-31652:
------------------------------

             Summary: Flink should handle the delete event if the pod was deleted while pending
                 Key: FLINK-31652
                 URL: https://issues.apache.org/jira/browse/FLINK-31652
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes
    Affects Versions: 1.16.1, 1.17.0
            Reporter: xingbe


I found that in a Kubernetes deployment, if a TaskManager pod is deleted while it is in the 'Pending' phase, the Flink job gets stuck and keeps waiting for the pod to be scheduled. The issue can be reproduced by running 'kubectl delete pod' against the pod while it is still in the Pending phase.
 
The root cause is that the pod status is not updated in time in this case, so KubernetesResourceManagerDriver does not detect that the pod has been terminated. I verified this by logging the pod status in KubernetesPod#isTerminated(), as follows.
{code:java}
public boolean isTerminated() {
    log.info("pod status: " + getInternalResource().getStatus());
    if (getInternalResource().getStatus() != null) {
        final boolean podFailed =
                PodPhase.Failed.name().equals(getInternalResource().getStatus().getPhase());
        final boolean containersFailed =
                getInternalResource().getStatus().getContainerStatuses().stream()
                        .anyMatch(
                                e ->
                                        e.getState() != null
                                                && e.getState().getTerminated() != null);
        return containersFailed || podFailed;
    }
    return false;
}
{code}
In this case the function returns false, because `containersFailed` and `podFailed` are both false. The logged pod status was:
{code:java}
PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2023-03-28T12:35:10Z, reason=Unschedulable, status=False, type=PodScheduled, additionalProperties={})], containerStatuses=[], ephemeralContainerStatuses=[], hostIP=null, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Pending, podIP=null, podIPs=[], qosClass=Guaranteed, reason=null, startTime=null, additionalProperties={}){code}
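
To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use the real fabric8 or Flink classes: `PodStatusModel`, `WatchAction` and `shouldTreatAsTerminated` are hypothetical names invented for illustration. It reproduces the status-only check for a status like the one logged above (phase=Pending, containerStatuses=[]), and shows one possible direction, which is an assumption and not a proposed patch: treat a delete event as terminal regardless of the last observed status.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-ins; NOT the real fabric8 / Flink classes.
final class PendingPodDeletionSketch {

    /** Assumed watch event kinds, named for this sketch only. */
    enum WatchAction { ADDED, MODIFIED, DELETED }

    /** Minimal model of a pod status: a phase plus one "terminated" flag per container. */
    static final class PodStatusModel {
        final String phase;
        final List<Boolean> containerTerminated;

        PodStatusModel(String phase, List<Boolean> containerTerminated) {
            this.phase = phase;
            this.containerTerminated = containerTerminated;
        }
    }

    /** Mirrors the status-only check quoted in this issue. */
    static boolean isTerminated(PodStatusModel status) {
        if (status == null) {
            return false;
        }
        boolean podFailed = "Failed".equals(status.phase);
        boolean containersFailed =
                status.containerTerminated.stream().anyMatch(Boolean::booleanValue);
        return containersFailed || podFailed;
    }

    /** Assumption: a DELETED event means the pod is gone, whatever its last status said. */
    static boolean shouldTreatAsTerminated(WatchAction action, PodStatusModel status) {
        return action == WatchAction.DELETED || isTerminated(status);
    }

    public static void main(String[] args) {
        // Same shape as the logged status: phase=Pending, no container statuses.
        PodStatusModel pending =
                new PodStatusModel("Pending", Collections.<Boolean>emptyList());

        System.out.println(isTerminated(pending)); // prints "false"
        System.out.println(shouldTreatAsTerminated(WatchAction.DELETED, pending)); // prints "true"
    }
}
```

With only the status check, the deleted pending pod is never reported as terminated; once the event type is taken into account, it is. Whether the real driver should handle this in the delete callback or inside isTerminated() is left open here.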

--
This message was sent by Atlassian Jira
(v8.20.10#820010)