You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/03/01 22:26:00 UTC

[jira] [Commented] (AIRFLOW-6014) Kubernetes executor - handle preempted deleted pods - queued tasks

    [ https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048694#comment-17048694 ] 

ASF GitHub Bot commented on AIRFLOW-6014:
-----------------------------------------

atrbgithub commented on pull request #6606: [AIRFLOW-6014] - handle pods which are preempted and deleted by kuber…
URL: https://github.com/apache/airflow/pull/6606
 
 
   …netes but not restarted
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses the following [Airflow Jira](https://issues.apache.org/jira/browse/AIRFLOW-6014) issues and references them in the PR title. 
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI changes:
   This PR addresses the issue of when a pod is Preempted during the creation phase and due to pods having the following in the spec ```restartPolicy: Never``` The pod is never restarted and ends up as a queued task within Airflow until the scheduler is restarted.  
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   Unsure if it is possible to simulate this scenario. 
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes how to use it.
     - All the public functions and the classes in the PR contain docstrings that explain what it does
     - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to a appropriate release
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Kubernetes executor - handle preempted deleted pods - queued tasks
> ------------------------------------------------------------------
>
>                 Key: AIRFLOW-6014
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6014
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: executor-kubernetes
>    Affects Versions: 1.10.6
>            Reporter: afusr
>            Assignee: Daniel Imberman
>            Priority: Minor
>
> We have encountered an issue whereby when using the kubernetes executor, and using autoscaling, airflow pods are preempted and airflow never attempts to rerun these pods. 
> This is partly as a result of having the following set on the pod spec:
> restartPolicy: Never
> This makes sense as if a pod fails when running a task, we don't want kubernetes to retry it, as this should be controlled by airflow. 
> What we believe happens is that when a new node is added by autoscaling, kubernetes schedules a number of airflow pods onto the new node, as well as any pods required by k8s/daemon sets. As these are higher priority, the Airflow pods are preempted, and deleted. You see messages such as:
>  
> Preempted by kube-system/ip-masq-agent-xz77q on node gke-some--airflow-00000000-node-1ltl
>  
> Within the kubernetes executor, these pods end up in a status of pending and an event of deleted is received but not handled. 
> The end result is tasks remain in a queued state forever. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)