Posted to commits@airflow.apache.org by "Sam Stephens (JIRA)" <ji...@apache.org> on 2019/07/08 11:27:00 UTC

[jira] [Updated] (AIRFLOW-4910) KubernetesExecutor - KubernetesJobWatcher can silently fail

     [ https://issues.apache.org/jira/browse/AIRFLOW-4910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sam Stephens updated AIRFLOW-4910:
----------------------------------
    Description: 
After not monitoring Airflow for a while, I noticed that tasks had not been running for several days.

My setup: scheduler and web server running in one pod, with the KubernetesExecutor. Four different DAGs, none of them very large: one running once per day, two every 30 minutes, and one every 2 minutes.

Airflow had log messages such as these:
{code:none}
{jobs.py:1144} INFO - Figuring out tasks to run in Pool(name=None) with 128 open slots and 179 task instances in queue{code}
{code:none}
{jobs.py:1210} DEBUG - Not handling task ('example_python_operator', 'print_the_context', datetime.datetime(2019, 6, 7, 0, 0, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) as the executor reports it is running{code}
... and a bit further down:
{code:none}
{base_executor.py:124} DEBUG - 32 running task instances{code}
In the Kubernetes cluster, there were no pods created by Airflow (they'd all finished and been deleted).

After digging into the logs around the time jobs stopped progressing, I noticed that the KubernetesJobWatcher had stopped logging pod state changes at that point - even though I could still see log messages for new pods being created.

It's hard to tell why this happened - if the subprocess running the job watcher had died, it should have been detected by the [heartbeat|https://github.com/apache/airflow/blob/1.10.3/airflow/contrib/executors/kubernetes_executor.py#L442]. If the [Watch had thrown an exception|https://github.com/apache/airflow/blob/1.10.3/airflow/contrib/executors/kubernetes_executor.py#L295], there should have been log messages (there weren't), and the watcher should then have restarted.
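To make that failure mode concrete, here is a minimal, self-contained sketch of the watcher-subprocess-plus-heartbeat pattern (my own illustration, not Airflow's actual code; the function names are made up). A liveness check based on is_alive() only catches a process that has exited - a watcher that is alive but blocked forever passes it every time:
{code:python}
# Minimal sketch of a watcher subprocess with a heartbeat liveness check.
# Illustration only, not Airflow's code; names are made up.
import multiprocessing
import time


def watch_pods():
    """Stand-in for the job watcher loop: normally iterates over a k8s Watch stream."""
    while True:
        time.sleep(60)  # if the underlying HTTP stream hangs, this loop blocks forever


def heartbeat(watcher):
    """Restart the watcher only if the process has actually exited."""
    if watcher.is_alive():
        # A hung-but-alive watcher is indistinguishable from a healthy one here.
        return watcher
    restarted = multiprocessing.Process(target=watch_pods, daemon=True)
    restarted.start()
    return restarted


if __name__ == '__main__':
    w = multiprocessing.Process(target=watch_pods, daemon=True)
    w.start()
    w = heartbeat(w)  # in the executor this check runs periodically from the sync loop
{code}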

I have a few theories as to what might have happened:
 # The Watch hung indefinitely - although I can't find any issues against the Kubernetes Python client suggesting other people have hit this
 # The KubernetesJobWatcher process died, but the heartbeat was not functioning correctly
 # The Watcher experienced a large gap between watch requests, meaning some relevant events were "lost", leaving the respective tasks in the "running" state (see the watch-loop sketch below)
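For theory 3, this is roughly what a pod watch loop with the Kubernetes Python client looks like - an illustrative sketch only (the namespace and label selector are assumptions, and this is not the Airflow code). The risk is in how the loop resumes after a gap:
{code:python}
# Illustrative pod watch loop with the kubernetes Python client.
# Namespace and label selector are assumptions, not Airflow's actual values.
from kubernetes import client, config, watch

config.load_incluster_config()  # use config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

resource_version = None
while True:
    stream = watch.Watch().stream(
        v1.list_namespaced_pod,
        namespace='airflow',
        label_selector='airflow-worker',
        resource_version=resource_version,
        timeout_seconds=60,
    )
    for event in stream:
        pod = event['object']
        resource_version = pod.metadata.resource_version
        print(event['type'], pod.metadata.name, pod.status.phase)
    # If the loop resumes after a long gap, or with a resource_version the API
    # server no longer retains, pod completions/deletions that happened in the
    # gap may never be delivered as events - the executor then keeps those task
    # instances in its "running" state indefinitely.
{code}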

Unfortunately I don't have the answers, so I'm posting this in the hope that someone has some additional insight.

As a side note - I'm using Kubernetes client version 9.0.0.

My only suggestion for a fix is to periodically check which pods are actually running, reconcile that against the "running" queue in the executor, and perhaps force-restart the job watcher if the state has diverged.
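Something along these lines is what I mean - an illustrative sketch only; the label selector, namespace, and the running_pod_names / restart_watcher hooks are hypothetical, not existing Airflow APIs:
{code:python}
# Sketch of the periodic reconciliation idea. The label selector, namespace and
# the running_pod_names / restart_watcher hooks are hypothetical, not existing
# Airflow APIs.
from kubernetes import client, config


def reconcile(running_pod_names, restart_watcher, namespace='airflow'):
    """Compare pods the cluster reports against what the executor thinks is running.

    running_pod_names: pod names the executor currently counts as running.
    restart_watcher: callback that force-restarts the KubernetesJobWatcher.
    """
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector='airflow-worker')
    live = {p.metadata.name for p in pods.items
            if p.status.phase in ('Pending', 'Running')}

    # Pods the executor still counts as running but that no longer exist (or have
    # finished) most likely completed while the watcher was not delivering events -
    # the state has diverged, so rebuild the watcher.
    orphaned = set(running_pod_names) - live
    if orphaned:
        restart_watcher()
    return orphaned
{code}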


> KubernetesExecutor - KubernetesJobWatcher can silently fail
> ------------------------------------------------------------
>
>                 Key: AIRFLOW-4910
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4910
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executors
>    Affects Versions: 1.10.3
>            Reporter: Sam Stephens
>            Priority: Major
>              Labels: kubernetes
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)