Posted to commits@airflow.apache.org by "Aizhamal Nurmamat kyzy (JIRA)" <ji...@apache.org> on 2019/05/18 02:52:02 UTC
[jira] [Updated] (AIRFLOW-4143) Airflow unaware of worker pods
being terminated or failing without the Local Executor service
being able to update the task state.
[ https://issues.apache.org/jira/browse/AIRFLOW-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aizhamal Nurmamat kyzy updated AIRFLOW-4143:
--------------------------------------------
Labels: kubernetes (was: )
Component/s: executor
Adding executor component, and kubernetes label as part of component refactor.
> Airflow unaware of worker pods being terminated or failing without the Local Executor service being able to update the task state.
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: AIRFLOW-4143
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4143
> Project: Apache Airflow
> Issue Type: Bug
> Components: executor, kubernetes
> Affects Versions: 1.10.2
> Environment: Kubernetes, Centos
> Reporter: Paul Bramhall
> Assignee: Paul Bramhall
> Priority: Major
> Labels: kubernetes
>
> Whenever a worker pod is terminated within Kubernetes in a way that causes the LocalExecutor running on that pod to exit without being able to update the task state within Airflow, Airflow is unaware and blissfully keeps the task in a running state. This can prevent future runs of the DAG from triggering, or prevent the DAG from retrying upon failure.
> When Airflow is restarted, the task state is still marked as running & is not updated.
> There is a JobWatcher process within the Kubernetes Executor, but it has no effect in this situation: it receives a Pod event of type DELETED yet still reports the task state as 'Running':
> {code:java}
> [2019-03-20 15:37:54,764] {{kubernetes_executor.py:296}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event of type MODIFIED
> [2019-03-20 15:37:54,764] {{kubernetes_executor.py:336}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
> [2019-03-20 15:37:54,767] {{kubernetes_executor.py:296}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 had an event of type DELETED
> [2019-03-20 15:37:54,767] {{kubernetes_executor.py:336}} INFO - Event: kubernetesexecutorworkeradhocdagworkerpodadhoc-73661b6115-ae484 is Running
> {code}
> ^^ These events occur when the pod is killed without the Local Executor on that worker updating the state. Airflow continues to show the task as Running.
> Ideally we need a check within the Kubernetes Executor to perform the following actions:
> Upon Executor Startup:
> # Loop through all running Dags and Task Instances
> # Check for the existence of worker pods associated to these instances.
> # If no pod is found, yet the task is shown as 'Running', perform a graceful failure using the 'handle_failure()' function within Airflow itself.
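The startup check above could be sketched roughly as follows. This is a hypothetical illustration, not actual Airflow code: the helper names (`find_orphaned_tasks`, `reconcile_on_startup`, the `fail_task` callback, and the dict-shaped task records) are placeholders; in Airflow itself the failure step would go through the task instance's `handle_failure()` path.

```python
# Hypothetical sketch of the proposed startup reconciliation.
# Task records and helper names are illustrative placeholders,
# not real Airflow APIs.

def find_orphaned_tasks(running_tasks, live_pod_names):
    """Return tasks marked Running whose worker pod no longer exists."""
    return [task for task in running_tasks
            if task["pod_name"] not in live_pod_names]

def reconcile_on_startup(running_tasks, live_pod_names, fail_task):
    # 1. Loop through task instances believed to be running.
    # 2. Check whether a worker pod still exists for each of them.
    # 3. Gracefully fail any task whose pod has disappeared
    #    (in Airflow, via handle_failure()).
    for task in find_orphaned_tasks(running_tasks, live_pod_names):
        fail_task(task)
```

The real implementation would query the metadata database for running task instances and the Kubernetes API for live worker pods; the sketch only captures the decision logic.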
> While the executor is running, we can perform the same (or a similar) check from within the KubernetesJobWatcher class when the following condition is met:
> * k8s Pod Event type is 'DELETED' _and_ Task State is still set as 'Running'
> This way, if a pod does terminate unexpectedly (for example, because it exceeds its predefined resource limits and is OOM-killed), the DAG will be marked as failed accordingly.
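The watcher-side condition could look something like the sketch below. Again this is a hypothetical illustration: `should_fail_task`, `process_pod_event`, and the `mark_failed` callback are placeholder names standing in for the logic the KubernetesJobWatcher would run and for Airflow's `handle_failure()` path; the event and state strings mirror those shown in the log excerpt above.

```python
# Hypothetical sketch of the proposed watcher-side check.
# Function names are illustrative placeholders, not Airflow APIs.

def should_fail_task(event_type, task_state):
    # A DELETED pod event while the task is still marked Running means
    # the worker died before it could report back, so the task should
    # be failed rather than left Running forever.
    return event_type == "DELETED" and task_state == "Running"

def process_pod_event(event_type, task_state, mark_failed):
    if should_fail_task(event_type, task_state):
        mark_failed()  # in Airflow, via handle_failure()
```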
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)