
Posted to commits@airflow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/01/04 15:02:00 UTC

[jira] [Commented] (AIRFLOW-5589) KubernetesPodOperator: Duplicate pods created on worker restart

    [ https://issues.apache.org/jira/browse/AIRFLOW-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008059#comment-17008059 ] 

ASF GitHub Bot commented on AIRFLOW-5589:
-----------------------------------------

stale[bot] commented on pull request #6377: [AIRFLOW-5589] monitor pods by labels instead of names
URL: https://github.com/apache/airflow/pull/6377
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> KubernetesPodOperator: Duplicate pods created on worker restart
> ---------------------------------------------------------------
>
>                 Key: AIRFLOW-5589
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5589
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: worker
>    Affects Versions: 1.10.4, 1.10.5
>            Reporter: Daniel Cooper
>            Assignee: Daniel Cooper
>            Priority: Major
>
> KubernetesPodOperator holds the state used to monitor the running pod inside its execute function. If a worker restarts for any reason (pod death, pod shuffle, upgrade, etc.), this state is lost.
> At this point the scheduler notices (after waiting the maximum heartbeat interval) that the task is now a 'zombie' (not monitored) and reschedules it.
> The new worker has no knowledge of the existing running pod, so it creates a duplicate. In extreme cases this can leave many duplicate pods for the same task running at once.
> I believe this is the problem Nicholas Brenwald (King) described when running the Kubernetes pod operator on Google Composer (at the September meetup at King).
> My fix is to add enough labels to uniquely identify a running pod as belonging to a given task instance (dag_id, task_id, run_id). We then do a namespaced list of pods from k8s with a label selector and monitor the existing pod if there is one; otherwise we create a new one as normal.
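
The label-based lookup described above can be sketched roughly as follows. This is a minimal illustration and not the code from pull request #6377: the helper names (safe_label_value, task_labels, label_selector, find_or_create_pod) are hypothetical, and the client argument is assumed to expose list_namespaced_pod and create_namespaced_pod in the style of CoreV1Api from the official Kubernetes Python client. Label values are sanitized because Kubernetes restricts them to 63 characters drawn from alphanumerics, '-', '_' and '.', which a typical run_id containing ':' or '+' would violate.

```python
import re

MAX_LABEL_LEN = 63  # Kubernetes limit on the length of a label value


def safe_label_value(value):
    """Hypothetical helper: coerce a string into a valid label value."""
    # Replace anything outside [A-Za-z0-9_.-], then truncate to the limit.
    # Note: truncation could make two long ids collide; a real fix may
    # need a hash suffix. That is out of scope for this sketch.
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "_", value)[:MAX_LABEL_LEN]
    # Label values must begin and end with an alphanumeric character.
    return cleaned.strip("_.-")


def task_labels(dag_id, task_id, run_id):
    """Labels that identify the pod belonging to one task instance."""
    return {
        "dag_id": safe_label_value(dag_id),
        "task_id": safe_label_value(task_id),
        "run_id": safe_label_value(run_id),
    }


def label_selector(labels):
    """Build an equality-based selector string such as 'a=1,b=2'."""
    return ",".join(f"{key}={value}" for key, value in sorted(labels.items()))


def find_or_create_pod(client, namespace, pod_body, labels):
    """Return (pod, created): reuse a matching pod or create a new one.

    `client` is assumed to behave like kubernetes.client.CoreV1Api.
    """
    selector = label_selector(labels)
    existing = client.list_namespaced_pod(namespace, label_selector=selector).items
    if len(existing) > 1:
        # Assumption: more than one match is treated as an error here.
        raise RuntimeError(f"{len(existing)} pods match selector {selector!r}")
    if existing:
        return existing[0], False  # monitor the pod that is already running
    return client.create_namespaced_pod(namespace, pod_body), True
```

With this shape, a rescheduled task that calls find_or_create_pod with the same dag_id, task_id and run_id gets the already-running pod back instead of starting a second one, provided the labels were attached to the pod when it was first created.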



--
This message was sent by Atlassian Jira
(v8.3.4#803005)