Posted to issues@flink.apache.org by "Xintong Song (Jira)" <ji...@apache.org> on 2023/04/03 08:06:00 UTC

[jira] [Closed] (FLINK-31652) Flink should handle the delete event if the pod was deleted while pending

     [ https://issues.apache.org/jira/browse/FLINK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xintong Song closed FLINK-31652.
--------------------------------
    Fix Version/s: 1.16.2
                   1.18.0
                   1.17.1
       Resolution: Fixed

- master (1.18): 9e83858c7dc309f272a03c62b1e295d192acaf89
- release-1.17: 98d8e2712feb5077c4b35698cea5ecbdd72e4c06
- release-1.16: 99c025b438a99eb8ddcf8214aba5f285972106ca

> Flink should handle the delete event if the pod was deleted while pending
> -------------------------------------------------------------------------
>
>                 Key: FLINK-31652
>                 URL: https://issues.apache.org/jira/browse/FLINK-31652
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.0, 1.16.1
>            Reporter: xingbe
>            Assignee: xingbe
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.2, 1.18.0, 1.17.1
>
>
> In a Kubernetes deployment, if a TaskManager pod is deleted while it is in the 'Pending' phase, the Flink job gets stuck and keeps waiting for the pod to be scheduled. The issue can be reproduced by running 'kubectl delete pod' against the pod while it is still in the Pending phase.
>  
> The root cause is that the pod status is not updated in time in this case, so the KubernetesResourceManagerDriver does not detect that the pod has terminated. I verified this by logging the pod status in KubernetesPod#isTerminated(), as follows.
> {code:java}
> public boolean isTerminated() {
>     log.info("pod status: " + getInternalResource().getStatus());
>     if (getInternalResource().getStatus() != null) {
>         final boolean podFailed =
>                 PodPhase.Failed.name().equals(getInternalResource().getStatus().getPhase());
>         final boolean containersFailed =
>                 getInternalResource().getStatus().getContainerStatuses().stream()
>                         .anyMatch(
>                                 e ->
>                                         e.getState() != null
>                                                 && e.getState().getTerminated() != null);
>         return containersFailed || podFailed;
>     }
>     return false;
> } {code}
> In this case, the function returns false because `containersFailed` and `podFailed` are both false: the phase is still Pending and containerStatuses is empty, as the logged status shows.
> {code:java}
> PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2023-03-28T12:35:10Z, reason=Unschedulable, status=False, type=PodScheduled, additionalProperties={})], containerStatuses=[], ephemeralContainerStatuses=[], hostIP=null, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Pending, podIP=null, podIPs=[], qosClass=Guaranteed, reason=null, startTime=null, additionalProperties={}){code}
>  
>  
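The behaviour described above can be illustrated with a small standalone sketch. It does not use Flink or the fabric8 client: `PodStatusView`, `PodEventType`, and `shouldTreatAsTerminated` are hypothetical names introduced only for illustration, and the DELETED-event check is one possible way to handle the case, not necessarily what the commits listed above implement.

```java
import java.util.Collections;
import java.util.List;

public class PendingPodDeletionSketch {

    /** Hypothetical stand-in for the two pod status fields the check reads. */
    public static final class PodStatusView {
        final String phase;
        final List<Boolean> containerTerminated; // one flag per container status

        public PodStatusView(String phase, List<Boolean> containerTerminated) {
            this.phase = phase;
            this.containerTerminated = containerTerminated;
        }
    }

    /** Hypothetical watch event types; not Flink's or fabric8's actual enum. */
    public enum PodEventType { ADDED, MODIFIED, DELETED }

    /** Mirrors the status-only check quoted in the issue. */
    public static boolean isTerminated(PodStatusView status) {
        if (status == null) {
            return false;
        }
        boolean podFailed = "Failed".equals(status.phase);
        boolean containersFailed =
                status.containerTerminated.stream().anyMatch(Boolean::booleanValue);
        return containersFailed || podFailed;
    }

    /** Assumption: a DELETED event is terminal whatever the last observed status was. */
    public static boolean shouldTreatAsTerminated(PodEventType type, PodStatusView status) {
        return type == PodEventType.DELETED || isTerminated(status);
    }

    public static void main(String[] args) {
        // Same shape as the logged status: phase=Pending, containerStatuses=[]
        PodStatusView pending = new PodStatusView("Pending", Collections.emptyList());

        System.out.println(isTerminated(pending)); // false
        System.out.println(shouldTreatAsTerminated(PodEventType.DELETED, pending)); // true
        System.out.println(shouldTreatAsTerminated(PodEventType.MODIFIED, pending)); // false
    }
}
```

With a status-only check, a pod deleted while Pending never looks terminated, which matches the stuck job in the report. Taking the event type into account lets the caller release the pending worker and request a replacement.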



--
This message was sent by Atlassian Jira
(v8.20.10#820010)