Posted to commits@airflow.apache.org by "Kang Yu (Jira)" <ji...@apache.org> on 2020/01/06 07:53:00 UTC

[jira] [Created] (AIRFLOW-6482) task_instance will be in queued state forever

Kang Yu created AIRFLOW-6482:
--------------------------------

             Summary: task_instance will be in queued state forever
                 Key: AIRFLOW-6482
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6482
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.10.5
         Environment: AWS EC2 spot instance
            Reporter: Kang Yu


We built an Airflow cluster directly on AWS EC2.

We use EC2 Spot Instances for the workers, so a worker can be terminated at any time.

When that happens the airflow.job record can end up missing, and I believe the zombie detection then fails to catch this case.


I investigated the code; there is detection logic for zombie tasks:

[https://github.com/apache/airflow/blob/704e48dee368d193f742e064f42461205ef587e2/airflow/models/dagbag.py#L295-L306]

However, this logic joins two tables: airflow.job and airflow.task_instance.

The airflow.job record is only inserted when the worker starts running a task_instance, so it can be missing if the EC2 instance died. When the record is missing, the join never returns the stuck task instance.
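
For reference, here is a rough paraphrase of that join-based detection, written against the SQLAlchemy models of Airflow 1.10.x (TaskInstance and LocalTaskJob). The function name and heartbeat limit are illustrative; the real code is at the dagbag.py link above and may differ:

{code:python}
# Rough paraphrase of the join-based zombie detection (illustrative only).
from datetime import timedelta

from sqlalchemy import or_

from airflow import settings
from airflow.jobs import LocalTaskJob as LJ
from airflow.models import TaskInstance as TI
from airflow.utils import timezone
from airflow.utils.state import State


def find_zombies_via_job_join(heartbeat_limit_secs=300):
    """Return task instances whose backing airflow.job row looks dead."""
    limit_dttm = timezone.utcnow() - timedelta(seconds=heartbeat_limit_secs)
    session = settings.Session()
    try:
        # Weak points for the spot-instance case: the INNER join drops any
        # task_instance whose airflow.job row was never written, and the
        # State.RUNNING filter skips instances still stuck in QUEUED.
        return (
            session.query(TI)
            .join(LJ, TI.job_id == LJ.id)
            .filter(TI.state == State.RUNNING)
            .filter(or_(LJ.state != State.RUNNING,
                        LJ.latest_heartbeat < limit_dttm))
            .all()
        )
    finally:
        session.close()
{code}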


I think the zombie detection logic has to be enhanced so it can work from the task_instance record alone:

check queued_dttm against a threshold. *if (now - queued_dttm > THE_THRESHOLD and LJ.state != State.RUNNING) then mark it as failed directly.*
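
A hypothetical sketch of that proposal, against the same 1.10.x models. THE_THRESHOLD and the function name are made up for illustration; the key difference is the outer join, so a task_instance whose airflow.job row was lost with the spot instance still matches:

{code:python}
# Hypothetical sketch of the proposed fix (not existing Airflow code).
from datetime import timedelta

from sqlalchemy import or_

from airflow import settings
from airflow.jobs import LocalTaskJob as LJ
from airflow.models import TaskInstance as TI
from airflow.utils import timezone
from airflow.utils.state import State

THE_THRESHOLD = timedelta(minutes=10)  # illustrative value


def fail_stuck_queued_tasks():
    """Fail task instances stuck in QUEUED longer than THE_THRESHOLD."""
    session = settings.Session()
    try:
        limit_dttm = timezone.utcnow() - THE_THRESHOLD
        stuck = (
            session.query(TI)
            # Outer join: a task whose airflow.job row was lost along with
            # the spot instance still matches (LJ columns come back NULL).
            .outerjoin(LJ, TI.job_id == LJ.id)
            .filter(TI.state == State.QUEUED)
            .filter(TI.queued_dttm < limit_dttm)
            .filter(or_(LJ.id.is_(None), LJ.state != State.RUNNING))
            .all()
        )
        for ti in stuck:
            ti.state = State.FAILED  # mark it as failed directly
        session.commit()
    finally:
        session.close()
{code}

The outer join is what makes this robust to the lost airflow.job record: the existing inner-join logic can never see those rows at all.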


