Posted to commits@airflow.apache.org by "Lindsay Portelli (Jira)" <ji...@apache.org> on 2020/08/12 21:34:00 UTC

[jira] [Comment Edited] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

    [ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176404#comment-17176404 ] 

Lindsay Portelli edited comment on AIRFLOW-6014 at 8/12/20, 9:33 PM:
---------------------------------------------------------------------

I am facing a similar issue. When autoscaling removes a node, the pod is deleted and Kubernetes logs that the task should be up for rescheduling, but the task instance in the database stays in a queued state.


was (Author: lindsable):
I am facing the same issue as [~kiruthiga24]. I was able to mitigate the failed pods by increasing the number of retries, but cannot figure out how best to clear the tasks stuck in the queued state.
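
A minimal sketch of one way to clear these stuck tasks, assuming the Airflow 1.10.x metadata database models; the one-hour cutoff is illustrative and not part of this issue:

    # Sketch: reset task instances stuck in QUEUED back to no state so the
    # scheduler can pick them up and queue them again.
    from datetime import timedelta

    from airflow.models import TaskInstance
    from airflow.utils import timezone
    from airflow.utils.db import create_session
    from airflow.utils.state import State

    cutoff = timezone.utcnow() - timedelta(hours=1)

    with create_session() as session:
        stuck = (
            session.query(TaskInstance)
            .filter(TaskInstance.state == State.QUEUED)
            .filter(TaskInstance.queued_dttm < cutoff)
            .all()
        )
        for ti in stuck:
            ti.state = State.NONE  # scheduler will re-schedule and re-queue it
        # create_session() commits on exit

Running something like this on a schedule (or clearing the tasks from the UI) works around the symptom, but does not address the unhandled deletion event described below.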

> Kubernetes executor - handle preempted deleted pods - queued tasks
> ------------------------------------------------------------------
>
>                 Key: AIRFLOW-6014
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6014
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: executor-kubernetes
>    Affects Versions: 1.10.6
>            Reporter: afusr
>            Assignee: Daniel Imberman
>            Priority: Minor
>             Fix For: 1.10.10
>
>         Attachments: image-2020-07-14-11-27-21-277.png, image-2020-07-14-11-29-14-334.png
>
>
> We have encountered an issue where, when using the Kubernetes executor with autoscaling, Airflow pods are preempted and Airflow never attempts to rerun them.
> This is partly as a result of having the following set on the pod spec:
> restartPolicy: Never
> This makes sense: if a pod fails while running a task, we don't want Kubernetes to retry it, as retries should be controlled by Airflow.
> What we believe happens is that when autoscaling adds a new node, Kubernetes schedules a number of Airflow pods onto it, along with any pods required by kube-system/DaemonSets. As those are higher priority, the Airflow pods are preempted and deleted. You see messages such as:
>  
> Preempted by kube-system/ip-masq-agent-xz77q on node gke-some--airflow-00000000-node-1ltl
>  
> Within the Kubernetes executor, these pods end up with a status of Pending, and a deleted event is received but not handled.
> The end result is that the tasks remain in a queued state forever.
>  
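
As a rough illustration of the missing handling (not the executor's actual implementation), a watcher built on the kubernetes Python client could treat a DELETED event for a pod that is still Pending as a signal that the task never ran and should be requeued. The namespace, label selector, and requeue_task() helper below are placeholders:

    # Illustrative sketch only: watch worker pods and requeue tasks whose pods
    # were deleted (e.g. preempted) before they ever started running.
    from kubernetes import client, config, watch

    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()

    def requeue_task(pod_name):
        # Hypothetical hook: mark the corresponding task instance for rescheduling.
        print("would requeue task for pod %s" % pod_name)

    w = watch.Watch()
    for event in w.stream(v1.list_namespaced_pod, namespace="airflow",
                          label_selector="airflow-worker"):
        pod = event["object"]
        if event["type"] == "DELETED" and pod.status.phase == "Pending":
            # Pod was removed before it ran the task, so Airflow must retry it.
            requeue_task(pod.metadata.name)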



--
This message was sent by Atlassian Jira
(v8.3.4#803005)