Posted to commits@airflow.apache.org by "Michal TOMA (JIRA)" <ji...@apache.org> on 2016/10/25 13:33:58 UTC

[jira] [Commented] (AIRFLOW-194) Task hangs in up_for_retry state for very long

    [ https://issues.apache.org/jira/browse/AIRFLOW-194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605333#comment-15605333 ] 

Michal TOMA commented on AIRFLOW-194:
-------------------------------------

Some additional information.
The problem seems to be related to one task in one DAG taking a very long time.
I currently have a task doing a large catch-up job. I have 6 DAGs in my setup. 3 of them are in the state described, where the DAG is in the "running" state but all of its tasks are in the "finished" state. 2 are in the finished state, with tasks that started and finished before the very long catch-up task started. Only one is in the "running" state with my catch-up task, which has been running for several hours now.
It looks as if the scheduler is blocked waiting for this task to finish before it can finish any other DAG.

> Task hangs in up_for_retry state for very long
> ----------------------------------------------
>
>                 Key: AIRFLOW-194
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-194
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: Airflow 1.7.0
>         Environment: Airflow 1.7.0 on RHEL 7 and OpenSuse 13.2
>            Reporter: Michal TOMA
>            Assignee: Norman Mu
>         Attachments: screenshot-1.png, screenshot-2.png
>
>
> I can observe this problem on 2 separate Airflow installations.
> The symptoms are:
> - One (and only one) task stays in up_for_retry state even when the last of the retries finished with an OK status.
> - It is yellow in the tree view.
> - The execution somehow resumes several hours later automatically
> - It seems (though I am not certain) to be related to a mode where task execution is "lagging" behind the normal schedule.
> Here is an example of a task that should run every hour "0 * * * *":
> Current date : 2016-05-30T15:31:00+0200
> ----- Run 1 ------
> Run ID: 2016-05-05T21:00:00
> Task start: 2015-05-30T07:38:XX.XXX
> Task end: 2015-05-30T08:23:XX.XXX
> Marked as success
> ----- Run 2 ------
> Run ID: 2016-05-05T22:00:00
> Task start: 2015-05-30T11:10:XX.XXX
> Task end: 2015-05-30T11:56:XX.XXX
> Marked as success
> ----- Run 3 ------
> Run ID: 2016-05-05T23:00:00
> Task start: 2015-05-30T11:56:XX.XXX
> Task end: 2015-05-30T12:41:XX.XXX
> Marked as success
> ----- Run 4 ------
> Run ID: 2016-05-06T00:00:00
> Task start: 2015-05-30T15:12:XX.XXX
> Task end: (Still running now)
> Marked as running
> There are nearly 2 hours between Run-1 and Run-2, and nearly 2 hours as well between Run-3 and Run-4.
> Only Run-3 starts immediately after the end of Run-2, which is the expected behavior as the runs are very late on schedule (the Run ID is 2016-05-06 while today is 2016-05-30).
> This is a high priority issue for our setup. I could try to dig more in depth into this problem but I have no idea where to look to debug this issue.
> Any pointers would be more than welcome.
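
To make the idle periods in the report above easier to compare, here is a minimal Python sketch that computes the gap between the end of each run and the start of the next. The timestamps are copied from the report at minute precision (the seconds are elided there as XX.XXX) and the year is kept as written. The names RUNS, FMT and idle_gaps are illustrative only and are not part of Airflow. At this precision the idle time is 2h47 between Run 1 and Run 2, zero between Run 2 and Run 3, and 2h31 between Run 3 and Run 4.

```python
from datetime import datetime

# (name, task start, task end), copied from the report at minute precision.
# Run 4 was still running when the report was written, so it has no end time.
RUNS = [
    ("Run 1", "2015-05-30T07:38", "2015-05-30T08:23"),
    ("Run 2", "2015-05-30T11:10", "2015-05-30T11:56"),
    ("Run 3", "2015-05-30T11:56", "2015-05-30T12:41"),
    ("Run 4", "2015-05-30T15:12", None),
]

FMT = "%Y-%m-%dT%H:%M"


def idle_gaps(runs):
    """Return (previous run, next run, idle time) for each consecutive pair."""
    gaps = []
    for (prev_name, _, prev_end), (next_name, next_start, _) in zip(runs, runs[1:]):
        gap = datetime.strptime(next_start, FMT) - datetime.strptime(prev_end, FMT)
        gaps.append((prev_name, next_name, gap))
    return gaps


for prev_name, next_name, gap in idle_gaps(RUNS):
    print(f"{prev_name} -> {next_name}: idle for {gap}")
```

Running the same calculation over the task instance start and end dates of a lagging DAG would show whether the idle periods line up with another long-running task.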



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)