Posted to commits@airflow.apache.org by "Matt C. Wilson (JIRA)" <ji...@apache.org> on 2019/08/12 12:39:00 UTC

[jira] [Created] (AIRFLOW-5171) Random task gets stuck in queued state despite all dependencies met

Matt C. Wilson created AIRFLOW-5171:
---------------------------------------

             Summary: Random task gets stuck in queued state despite all dependencies met
                 Key: AIRFLOW-5171
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5171
             Project: Apache Airflow
          Issue Type: Bug
          Components: executors, scheduler
    Affects Versions: 1.10.2
            Reporter: Matt C. Wilson
         Attachments: Airflow - Log.png, Airflow - Task Instance Details.htm

We are experiencing an issue similar to those reported in AIRFLOW-1641 and AIRFLOW-4586.  We run two parallel DAGs, both using a common set of pools and both using LocalExecutor.

Roughly once every couple dozen DAG runs, a task reaches the `queued` state and never continues into the `running` state, even after a pool slot opens and all dependencies are met.

Investigating the task instance details confirms the same; Airflow reports that it expects the task to commence shortly once resources are available.  See attachment. [^Airflow - Task Instance Details.htm]

While a task is stuck in this state, the sibling parallel DAG is able to run to completion, even multiple times over.  So we know the issue is not with pool constraints, executor issues, etc.  The problem really seems to be that Airflow has simply lost track of the task and failed to start it.
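
For anyone trying to catch this condition as it happens, a minimal sketch of a detection query follows. It assumes the Airflow metadata database exposes a `task_instance` table with `state` and `queued_dttm` columns; the in-memory SQLite table, the DAG and task names, and the 30-minute threshold are all hypothetical stand-ins for illustration, not part of this report.

```python
import sqlite3
from datetime import datetime, timedelta


def find_stuck_queued(conn, now, max_queued=timedelta(minutes=30)):
    """Return task instances that have sat in `queued` longer than max_queued.

    Assumes a `task_instance` table with dag_id, task_id, execution_date,
    state and queued_dttm columns, with timestamps stored as
    'YYYY-MM-DD HH:MM:SS' text so they compare correctly as strings.
    The '?' placeholder is SQLite style; other drivers may use '%s'.
    """
    cutoff = (now - max_queued).strftime("%Y-%m-%d %H:%M:%S")
    cur = conn.execute(
        "SELECT dag_id, task_id, execution_date, queued_dttm "
        "FROM task_instance "
        "WHERE state = 'queued' AND queued_dttm < ? "
        "ORDER BY queued_dttm",
        (cutoff,),
    )
    return cur.fetchall()


def demo():
    # Hypothetical stand-in for the metadata database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance ("
        "dag_id TEXT, task_id TEXT, execution_date TEXT, "
        "state TEXT, queued_dttm TEXT)"
    )
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?, ?, ?, ?)",
        [
            # Queued hours ago and never started: the case described above.
            ("dag_a", "load_table", "2019-08-12 09:00:00", "queued", "2019-08-12 09:15:00"),
            # Queued only a few minutes ago: still within the threshold.
            ("dag_b", "load_table", "2019-08-12 11:50:00", "queued", "2019-08-12 11:55:00"),
            # Running normally: not of interest.
            ("dag_b", "extract", "2019-08-12 08:00:00", "running", "2019-08-12 08:00:00"),
        ],
    )
    return find_stuck_queued(conn, now=datetime(2019, 8, 12, 12, 0, 0))


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Run periodically against a read-only connection, a check of this kind could at least record when a task became stuck, which may help line the event up with scheduler log output.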

Clearing the task state does not help: the task is not moved back into a `scheduled`, `queued`, or `running` state; it just stays in the `none` state.  The task must be marked as `failed` or `success` to resume normal DAG flow.

This issue has been causing sporadic production degradation for us, with no obvious avenue for troubleshooting.  It's not clear whether changing `dagbag_import_timeout` (as suggested in AIRFLOW-1641) will help, because our task shows no log at all in the Airflow UI.  See screenshot.  !Airflow - Log.png!

I'm open to all recommendations to try to get to the bottom of this.  Please let me know if there is any log data or other info I can provide.

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)