Posted to commits@airflow.apache.org by "Sam Stephens (JIRA)" <ji...@apache.org> on 2019/08/05 08:08:00 UTC

[jira] [Commented] (AIRFLOW-4910) KubernetesExecutor - KubernetesJobWatcher can silently fail

    [ https://issues.apache.org/jira/browse/AIRFLOW-4910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899870#comment-16899870 ] 

Sam Stephens commented on AIRFLOW-4910:
---------------------------------------

[~dimberman] I've had some additional experiences recently with the KubernetesExecutor that have the same underlying issue, although the "trigger" which exposes it is (possibly?) different.

Recently I've had issues with Pods getting stuck in a ContainerCreating state in Kube. The problem is that the KubernetesExecutor only reacts to certain Pod state changes (specifically, only when a Pod enters Error, Success or Failed), so Pods that are terminated, or that remain stuck in an unexpected state, never get removed from the running queue. We also realised that clearing a task does not cause it to be rescheduled, because when the scheduler tries, the KubernetesExecutor reports that the task is already running (it is still in the running queue). Our "fix" is to clear the task, delete the stuck Pod and restart the Airflow scheduler Pod. On top of that, we need to set up alerting to tell us when this is occurring (see the sketch below).
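
For the alerting piece, the check we have in mind looks roughly like the sketch below. It is only a sketch using the kubernetes Python client: the "airflow-worker" label selector and the 10-minute threshold are assumptions that would need adjusting to match the actual deployment.
{code:python}
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

STUCK_AFTER = timedelta(minutes=10)  # assumed threshold, tune per cluster


def pods_stuck_in_container_creating(namespace="airflow"):
    """Return names of worker Pods that have sat in ContainerCreating for too long."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    stuck = []
    # The label selector is an assumption; use whatever labels your worker Pods carry.
    for pod in v1.list_namespaced_pod(namespace, label_selector="airflow-worker").items:
        if pod.status.phase != "Pending":  # ContainerCreating Pods report phase Pending
            continue
        waiting_reasons = [
            cs.state.waiting.reason
            for cs in (pod.status.container_statuses or [])
            if cs.state and cs.state.waiting
        ]
        age = datetime.now(timezone.utc) - pod.metadata.creation_timestamp
        if "ContainerCreating" in waiting_reasons and age > STUCK_AFTER:
            stuck.append(pod.metadata.name)
    return stuck
{code}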

We're looking at fixing the ContainerCreating issue on the Kubernetes side, since that part is clearly a Kubernetes problem. However, the KubernetesExecutor's internal state diverging from that of Kubernetes is, I think, an Airflow issue. Aside from "reconciling" as suggested above, I have a couple of other proposals:
1. Use Kubernetes Jobs: this should mean the KubernetesExecutor doesn't need to replicate all of the Kube scheduling logic. It would also let things like "activeDeadlineSeconds" be used, giving users the ability to set timeouts for jobs (see the sketch after this list). In general it may be a more "future-proof" solution, since Kube Jobs will presumably stay more tightly integrated with future versions of Kubernetes than the Airflow executor can be.
2. Track Pod state transition times in the KubernetesExecutor, so logic can be applied on top of them (e.g. a timeout for the transition to the running state, or triggering a reconcile after a certain amount of time).
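
To make proposal 1 concrete, the sketch below shows what wrapping a worker Pod spec in a Job could look like, using the kubernetes Python client. The function name, the naming scheme and the timeout parameter are illustrative only and not existing Airflow code.
{code:python}
from kubernetes import client


def build_task_job(pod_spec, task_id, timeout_seconds=3600):
    """Wrap an existing worker Pod spec in a Kubernetes Job with a deadline."""
    # Note: the Pod spec's restartPolicy must be Never or OnFailure for a Job.
    return client.V1Job(
        metadata=client.V1ObjectMeta(name="airflow-task-{}".format(task_id)),
        spec=client.V1JobSpec(
            active_deadline_seconds=timeout_seconds,  # Kube terminates the Job past this
            backoff_limit=0,                          # leave retry logic to Airflow
            template=client.V1PodTemplateSpec(spec=pod_spec),
        ),
    )


# e.g. client.BatchV1Api().create_namespaced_job("airflow", build_task_job(spec, "print_the_context"))
{code}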

Interested to hear thoughts

> KubernetesExecutor - KubernetesJobWatcher can silently fail
> -----------------------------------------------------------
>
>                 Key: AIRFLOW-4910
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4910
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executors
>    Affects Versions: 1.10.3
>            Reporter: Sam Stephens
>            Assignee: Daniel Imberman
>            Priority: Major
>              Labels: kubernetes
>
> After not monitoring Airflow for a while, I noticed that tasks had not been running for several days.
> My setup: Scheduler and web-server running in one pod, with KubernetesExecutor. 4 different DAGs, none of them very large: 1 running once per day, 2 every 30 mins and 1 every 2 minutes.
> Airflow had log messages such as these:
> {code:java}
> {{jobs.py:1144}} INFO - Figuring out tasks to run in Pool(name=None) with 128 open slots and 179 task instances in queue{code}
> {code:java}
> {{jobs.py:1210}} DEBUG - Not handling task ('example_python_operator', 'print_the_context', datetime.datetime(2019, 6, 7, 0, 0, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) as the executor reports it is running{code}
> ... and a bit further down:
> {code:java}
> {{base_executor.py:124}} DEBUG - 32 running task instances{code}
> In the Kubernetes cluster, there were no pods created by Airflow (they'd all finished and been deleted).
> After digging into the logs around the time at which jobs stopped progressing, I noticed that at this point in time the KubernetesJobWatcher stopped logging the state changes of pods - even though I could see log messages for new pods being created.
> It's hard to tell why this happened: if the subprocess running the job watcher had died, it should have been detected by the [heartbeat|https://github.com/apache/airflow/blob/1.10.3/airflow/contrib/executors/kubernetes_executor.py#L442]. If the [Watch threw an exception|https://github.com/apache/airflow/blob/1.10.3/airflow/contrib/executors/kubernetes_executor.py#L295], there should have been logs (which there weren't) and the watcher should then have restarted.
> I have a few theories as to what might have happened:
>  # The Watch hung indefinitely - although I can't see any issues against the Kubernetes python client that suggest other people have had this issue
>  # The KubernetesJobWatcher died, but the heartbeat was not functioning correctly
>  # The Watcher experienced a large gap between watch requests, meaning some relevant events were "lost", leaving the respective tasks in the "running" state
> Unfortunately I don't have the answers, so I'm posting this in the hope that someone has some additional insight.
> As a side note - I'm using Kubernetes Client version 9.0.0.
> My only suggestion for a fix is to periodically check which Pods are actually running, reconcile that against the "running" queue in the executor, and maybe force-restart the job watcher if the state has diverged.
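>
> A rough sketch of what that reconciliation pass might look like, assuming the kubernetes Python client and treating the executor's running set as a plain dict of task key to Pod name (the real internal structure differs; this is purely illustrative):
> {code:python}
> from kubernetes import client
>
>
> def reconcile_running_tasks(running, namespace="airflow"):
>     """Drop tasks from the executor's running set whose Pods no longer exist."""
>     v1 = client.CoreV1Api()
>     live_pods = {p.metadata.name for p in v1.list_namespaced_pod(namespace).items}
>     orphaned = {key: pod for key, pod in running.items() if pod not in live_pods}
>     for key, pod in orphaned.items():
>         # The Pod is gone but the executor still reports the task as running,
>         # so surface it for rescheduling (or failure) and clean up the state.
>         print("Task %s lost its Pod %s; needs rescheduling" % (key, pod))
>         del running[key]
>     return orphaned
> {code}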



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)