
Posted to commits@airflow.apache.org by "Daniel Cooper (Jira)" <ji...@apache.org> on 2020/06/19 07:03:00 UTC

[jira] [Commented] (AIRFLOW-5589) KubernetesPodOperator: Duplicate pods created on worker restart

    [ https://issues.apache.org/jira/browse/AIRFLOW-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140280#comment-17140280 ] 

Daniel Cooper commented on AIRFLOW-5589:
----------------------------------------

Hey [~dimberman], thanks for getting the PR for this in. I saw you tagged the PR for 1.10.11, so I assigned this issue to you and set the fix version so it isn't missed in the release notes.

> KubernetesPodOperator: Duplicate pods created on worker restart
> ---------------------------------------------------------------
>
>                 Key: AIRFLOW-5589
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5589
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: worker
>    Affects Versions: 1.10.4, 1.10.5
>            Reporter: Daniel Cooper
>            Assignee: Daniel Imberman
>            Priority: Major
>             Fix For: 1.10.11
>
>
> KubernetesPodOperator holds state within the execute function, which monitors the running pod. If a worker restarts for any reason (pod death, pod shuffle, upgrade, etc.), this state is lost.
> At this point the scheduler notices (after waiting the maximum heartbeat interval) that the task is now a 'zombie' (not monitored) and reschedules it.
> The new worker has no knowledge of the existing running pod and so creates a duplicate pod. In extreme cases this can lead to many duplicate pods for the same task running at once.
> I believe this is the problem Nicholas Brenwald (King) described running into with the k8s pod operator on Google Composer (at the September meetup at King).
> My fix is to add enough labels to uniquely identify a running pod as belonging to a given task instance (dag_id, task_id, run_id). We then do a namespaced list of pods from k8s with a label selector and monitor the existing pod if there is one; otherwise we create a new one as normal.
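
The label-and-lookup approach described above could be sketched roughly as follows. This is an illustration only, not the code from the actual PR: the helper names (make_safe_label_value, task_labels, label_selector, find_or_create_pod) are hypothetical, and list_pods / create_pod are stand-in callables for whatever Kubernetes client calls the operator uses (for example, a wrapper around CoreV1Api.list_namespaced_pod with its label_selector argument). The sanitising step is simplified and does not guard against collisions caused by truncation.

```python
import re
from typing import Any, Callable, Dict, List, Tuple


def make_safe_label_value(value: str) -> str:
    """Coerce a string into a valid Kubernetes label value (simplified).

    Label values may be at most 63 characters, must start and end with an
    alphanumeric character, and may contain '-', '_' and '.' in between.
    """
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", value)[:63]
    return safe.strip("-_.")


def task_labels(dag_id: str, task_id: str, run_id: str) -> Dict[str, str]:
    """Labels that together identify one task instance (hypothetical helper)."""
    return {
        "dag_id": make_safe_label_value(dag_id),
        "task_id": make_safe_label_value(task_id),
        "run_id": make_safe_label_value(run_id),
    }


def label_selector(labels: Dict[str, str]) -> str:
    """Render labels as an equality-based selector, e.g. 'a=1,b=2'."""
    return ",".join(f"{key}={value}" for key, value in sorted(labels.items()))


def find_or_create_pod(
    list_pods: Callable[[str, str], List[Any]],
    create_pod: Callable[[str, Dict[str, str]], Any],
    namespace: str,
    labels: Dict[str, str],
) -> Tuple[Any, bool]:
    """Return (pod, created).

    list_pods(namespace, selector) and create_pod(namespace, labels) are
    assumed stand-ins for real Kubernetes client calls.
    """
    existing = list_pods(namespace, label_selector(labels))
    if len(existing) > 1:
        # More than one match means the labels are not unique enough.
        raise RuntimeError(f"Found {len(existing)} pods for labels {labels}")
    if existing:
        # A previous worker already launched this pod: monitor it instead.
        return existing[0], False
    return create_pod(namespace, labels), True
```

With something like this in place, a restarted worker that calls find_or_create_pod with the same (dag_id, task_id, run_id) would get back the pod that is already running rather than launching a second one.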



--
This message was sent by Atlassian Jira
(v8.3.4#803005)