Posted to commits@airflow.apache.org by "Kang Yu (Jira)" <ji...@apache.org> on 2020/01/06 07:53:00 UTC
[jira] [Created] (AIRFLOW-6482) task_instance will be in queued state forever
Kang Yu created AIRFLOW-6482:
--------------------------------
Summary: task_instance will be in queued state forever
Key: AIRFLOW-6482
URL: https://issues.apache.org/jira/browse/AIRFLOW-6482
Project: Apache Airflow
Issue Type: Bug
Components: scheduler
Affects Versions: 1.10.5
Environment: aws EC2 spot instance
Reporter: Kang Yu
We built an Airflow cluster directly on AWS EC2.
We use Spot EC2 instances for workers, so workers are sometimes terminated without warning.
When that happens, the corresponding airflow.job record goes missing, and I believe the zombie detection does not handle this case.
I investigated the code; there is detection logic for zombie tasks:
[https://github.com/apache/airflow/blob/704e48dee368d193f742e064f42461205ef587e2/airflow/models/dagbag.py#L295-L306]
But this logic joins two tables: airflow.job and airflow.task_instance.
The airflow.job record is inserted when the worker starts running a task_instance, so it can be lost if the EC2 instance dies.
I think the zombie detection logic should be enhanced to also work from the task_instance record alone:
check the queued_dttm time against a threshold. *if (now - queued_dttm > THE_THRESHOLD and LJ.state != State.RUNNING) then mark it as failed directly.*
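To illustrate the proposed check, here is a minimal, self-contained sketch in plain Python. It does not use Airflow's actual models or session handling; TaskInstanceStub, ZOMBIE_QUEUED_THRESHOLD, is_queued_zombie, and fail_queued_zombies are all hypothetical names introduced for this example, and the real change would be a SQLAlchemy query over the task_instance table:

```python
from datetime import datetime, timedelta

# Hypothetical threshold (not an existing Airflow setting): how long a
# task may sit in QUEUED before we treat it as a zombie.
ZOMBIE_QUEUED_THRESHOLD = timedelta(minutes=30)

# Minimal stand-ins for Airflow's State constants.
STATE_QUEUED = "queued"
STATE_RUNNING = "running"
STATE_FAILED = "failed"

class TaskInstanceStub:
    """Stand-in for a task_instance row; job_state is the state of the
    joined airflow.job row, or None if that row was lost."""
    def __init__(self, state, queued_dttm, job_state=None):
        self.state = state
        self.queued_dttm = queued_dttm
        self.job_state = job_state

def is_queued_zombie(ti, now):
    """True if the task has been QUEUED past the threshold and its job
    (if any) is not RUNNING -- i.e. the worker likely died."""
    return (
        ti.state == STATE_QUEUED
        and ti.queued_dttm is not None
        and now - ti.queued_dttm > ZOMBIE_QUEUED_THRESHOLD
        and ti.job_state != STATE_RUNNING
    )

def fail_queued_zombies(task_instances, now):
    """Mark detected zombies as FAILED; returns the number updated."""
    failed = 0
    for ti in task_instances:
        if is_queued_zombie(ti, now):
            ti.state = STATE_FAILED
            failed += 1
    return failed
```

For example, a task queued two hours ago with no surviving job row would be failed, while a recently queued task or one whose job is still RUNNING would be left alone.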
--
This message was sent by Atlassian Jira
(v8.3.4#803005)