Posted to commits@airflow.apache.org by "Aizhamal Nurmamat kyzy (JIRA)" <ji...@apache.org> on 2019/05/18 02:52:02 UTC

[jira] [Updated] (AIRFLOW-4143) Airflow unaware of worker pods being terminated or failing without the Local Executor service being able to update the task state.

     [ https://issues.apache.org/jira/browse/AIRFLOW-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aizhamal Nurmamat kyzy updated AIRFLOW-4143:
--------------------------------------------
         Labels: kubernetes  (was: )
    Component/s: executor

Adding executor component and kubernetes label as part of component refactor.

> Airflow unaware of worker pods being terminated or failing without the Local Executor service being able to update the task state.
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4143
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4143
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor, kubernetes
>    Affects Versions: 1.10.2
>         Environment: Kubernetes, Centos
>            Reporter: Paul Bramhall
>            Assignee: Paul Bramhall
>            Priority: Major
>              Labels: kubernetes
>
> Whenever a worker pod is terminated within Kubernetes in a way that causes the LocalExecutor running on that pod to exit before it can update the task state, Airflow is unaware and blissfully keeps the task in a running state. This can prevent future runs of the DAG from triggering, or the DAG from retrying upon failure.
> When Airflow is restarted, the task is still marked as running and its state is not updated.
> There is a KubernetesJobWatcher process within the Kubernetes Executor, but it has no effect on this condition. It receives a pod event of type DELETED and still reports the task state as 'Running':
> {code:java}
> [2019-03-20 15:37:54,764] {{kubernetes_executor.py:296}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event of type MODIFIED
> [2019-03-20 15:37:54,764] {{kubernetes_executor.py:336}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
> [2019-03-20 15:37:54,767] {{kubernetes_executor.py:296}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event of type DELETED
> [2019-03-20 15:37:54,767] {{kubernetes_executor.py:336}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
> {code}
> ^^ These are the events logged when the pod is killed before the Local Executor running on that worker can update the state. Airflow continues to show the task as Running.
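> A minimal illustrative sketch of why that happens (an assumption about how such a watcher maps events, not the actual Airflow 1.10.2 source; the _map_event helper is hypothetical): the logged task state is derived from the pod's last reported phase, so a DELETED event for a pod whose phase is still 'Running' is simply reported as Running and never turned into a failure.
> {code:python}
> # Hypothetical watcher loop: only the pod phase drives the reported state,
> # so the DELETED event type itself never marks the task as failed.
> from kubernetes import client, config, watch
>
> def _map_event(event_type, pod_phase):
>     if pod_phase == 'Succeeded':
>         return 'success'
>     if pod_phase == 'Failed':
>         return 'failed'
>     return 'running'  # a deleted pod can still carry phase 'Running'
>
> def watch_worker_pods(namespace='airflow'):
>     config.load_incluster_config()
>     v1 = client.CoreV1Api()
>     for event in watch.Watch().stream(v1.list_namespaced_pod, namespace=namespace):
>         pod = event['object']
>         state = _map_event(event['type'], pod.status.phase)
>         print('Event: %s had an event of type %s -> task state %s'
>               % (pod.metadata.name, event['type'], state))
> {code}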
> Ideally we need a check within the Kubernetes Executor to perform the following actions:
> Upon Executor Startup:
>  # Loop through all running DAGs and Task Instances
>  # Check for the existence of worker pods associated with these instances.
>  # If no pod is found, yet the task is shown as 'Running', perform a graceful failure using the 'handle_failure()' function within Airflow itself (see the sketch after this list).
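> A rough sketch of that startup check (handle_failure() is the function named above; the pod label keys, the 'airflow-worker' selector, and the session handling below are assumptions, not the executor's actual code):
> {code:python}
> from airflow.models import DagBag, TaskInstance
> from airflow.utils.db import provide_session
> from airflow.utils.state import State
> from kubernetes import client, config
>
> @provide_session
> def fail_orphaned_tasks(namespace='airflow', session=None):
>     """On executor startup, fail RUNNING task instances that have no worker pod."""
>     config.load_incluster_config()
>     v1 = client.CoreV1Api()
>     # Assumed label keys: worker pods are labelled with their dag/task ids.
>     pods = v1.list_namespaced_pod(namespace, label_selector='airflow-worker')
>     live = {((p.metadata.labels or {}).get('dag_id'),
>              (p.metadata.labels or {}).get('task_id'))
>             for p in pods.items}
>
>     dagbag = DagBag()
>     running = session.query(TaskInstance).filter(
>         TaskInstance.state == State.RUNNING).all()
>     for ti in running:
>         if (ti.dag_id, ti.task_id) in live:
>             continue
>         # No pod backs this task instance: fail it gracefully so retries
>         # and downstream scheduling can proceed.
>         dag = dagbag.get_dag(ti.dag_id)
>         if dag and dag.has_task(ti.task_id):
>             ti.task = dag.get_task(ti.task_id)  # handle_failure() expects the task object
>         ti.handle_failure('Worker pod no longer exists', session=session)
> {code}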
> While the executor is running, we can perform the same (or a similar) check from within the KubernetesJobWatcher class when the following condition is met:
>  * k8s Pod Event type is 'DELETED' _and_ Task State is still set as 'Running'
> This way, if a pod does unexpectedly terminate (for example, because it exceeds the predefined resource limits and is OOM-killed), then the DAG should be marked as failed accordingly. A sketch of this watcher-side check follows.
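> Illustrative only (the label keys and the task-instance lookup are assumptions, not existing Airflow code), this is roughly what handling that condition could look like; a fuller version would call handle_failure() as in the startup sketch so retry logic still applies:
> {code:python}
> from airflow.models import TaskInstance
> from airflow.utils.db import provide_session
> from airflow.utils.state import State
>
> @provide_session
> def on_pod_event(event_type, pod, session=None):
>     """Fail the backing task instance when its pod is DELETED mid-run."""
>     if event_type != 'DELETED':
>         return
>     labels = pod.metadata.labels or {}
>     ti = session.query(TaskInstance).filter(
>         TaskInstance.dag_id == labels.get('dag_id'),
>         TaskInstance.task_id == labels.get('task_id'),
>         TaskInstance.state == State.RUNNING).first()
>     if ti is not None:
>         # The pod vanished (e.g. OOM-killed or evicted) before the worker
>         # could report back, so fail the task rather than leave it RUNNING.
>         ti.state = State.FAILED
>         session.commit()
> {code}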



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)