Posted to commits@airflow.apache.org by "Daniel Cooper (Jira)" <ji...@apache.org> on 2020/06/19 07:03:00 UTC
[jira] [Commented] (AIRFLOW-5589) KubernetesPodOperator: Duplicate pods created on worker restart
[ https://issues.apache.org/jira/browse/AIRFLOW-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140280#comment-17140280 ]
Daniel Cooper commented on AIRFLOW-5589:
----------------------------------------
Hey [~dimberman], thanks for getting the PR for this in. I saw you tagged the PR for 1.10.11, so I assigned this issue to you and set the fix version so it isn't missed in the release notes.
> KubernetesPodOperator: Duplicate pods created on worker restart
> ---------------------------------------------------------------
>
> Key: AIRFLOW-5589
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5589
> Project: Apache Airflow
> Issue Type: Bug
> Components: worker
> Affects Versions: 1.10.4, 1.10.5
> Reporter: Daniel Cooper
> Assignee: Daniel Imberman
> Priority: Major
> Fix For: 1.10.11
>
>
> KubernetesPodOperator holds the state it uses to monitor the running pod inside its execute function. If a worker restarts for any reason (pod death, pod shuffle, upgrade, etc.), this state is lost.
> At this point the scheduler notices (after waiting for the maximum heartbeat interval) that the task is now a 'zombie' (not monitored) and reschedules it.
> The new worker has no knowledge of the pod that is already running, so it creates a duplicate. In extreme cases this can leave many duplicate pods for the same task running at once.
> I believe this is the problem Nicholas Brenwald (King) described as having when running k8s pod operator on Google Composer (at the September meetup at King).
> My fix is to add enough labels to uniquely identify a running pod as belonging to a given task instance (dag_id, task_id, run_id). We then do a namespaced list of pods from k8s with a label selector. If a matching pod exists we monitor it; otherwise we create a new one as normal.
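
For illustration only, the find-or-create approach described above could be sketched roughly as follows. This is not the code from the merged PR. The helper names (make_safe_label_value, task_labels, find_or_create_pod) are hypothetical, and `api` is assumed to be an object exposing list_namespaced_pod and create_namespaced_pod in the style of the Kubernetes Python client's CoreV1Api. The pod body is assumed to be a plain dict manifest.

```python
import re


def make_safe_label_value(value):
    """Coerce a string into a valid Kubernetes label value (hypothetical helper)."""
    # Label values may only contain [A-Za-z0-9_.-], are limited to 63
    # characters, and must start and end with an alphanumeric character.
    # Note: this mapping is lossy, so two distinct inputs could collide.
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "_", value)[:63]
    return cleaned.strip("_.-")


def task_labels(dag_id, task_id, run_id):
    """Labels that identify the pod belonging to one task instance."""
    return {
        "dag_id": make_safe_label_value(dag_id),
        "task_id": make_safe_label_value(task_id),
        "run_id": make_safe_label_value(run_id),
    }


def find_or_create_pod(api, namespace, pod_body, labels):
    """Return (pod, created): reuse a matching pod or create a new one."""
    selector = ",".join(f"{key}={value}" for key, value in sorted(labels.items()))
    existing = api.list_namespaced_pod(namespace, label_selector=selector).items
    if len(existing) > 1:
        raise RuntimeError(
            f"{len(existing)} pods match {selector!r}; expected at most one"
        )
    if existing:
        # A previous worker already launched this pod: monitor it instead.
        return existing[0], False
    # No pod yet: stamp the identifying labels on the manifest and create it.
    metadata = pod_body.setdefault("metadata", {})
    metadata.setdefault("labels", {}).update(labels)
    return api.create_namespaced_pod(namespace, pod_body), True
```

With this shape, a restarted worker that calls find_or_create_pod with the same dag_id, task_id and run_id gets back the pod that is already running rather than launching a second one. Raising when more than one pod matches is a design choice of this sketch, not something stated in the issue.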
--
This message was sent by Atlassian Jira
(v8.3.4#803005)