Posted to commits@airflow.apache.org by "Michal TOMA (JIRA)" <ji...@apache.org> on 2016/07/06 07:26:11 UTC

[jira] [Commented] (AIRFLOW-194) Task hangs in up_for_retry state for very long

    [ https://issues.apache.org/jira/browse/AIRFLOW-194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363901#comment-15363901 ] 

Michal TOMA commented on AIRFLOW-194:
-------------------------------------

Here is a simple DAG that reproduces this problem. In fact, the DAG run gets stuck not only in the "up for retry" state but also in the "running" state, even when all tasks have finished. Here all tasks run in parallel and the longest one lasts 745 seconds. The DAG is scheduled every 5 minutes, so I expect a new run to start every 745 seconds plus a few seconds of processing and rescheduling. Instead, I sometimes see several hours between DAG runs. This was tested with the 1.7.1.3 git tag.

"""
Code that goes along with the Airflow tutorial located at:
http://airflow.readthedocs.org/en/latest/tutorial.html
"""
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta

default_args = {
	'owner': 'airflow',
	'depends_on_past': False,
	'start_date': datetime(2016, 6, 5, 10, 0, 0),
	'email': ['mt@sicoop.com'],
	'email_on_failure': True,
	'email_on_retry': True,
	'retries': 0,
	'retry_delay': timedelta(minutes=5),
	# 'queue': 'bash_queue',
	# 'pool': 'backfill',
	# 'priority_weight': 10,
	# 'end_date': datetime(2016, 1, 1),
}

# Scheduled every 5 minutes, with at most one active DAG run at a time.
dag = DAG(
	'test-DAG-AIRFLOW-194',
	default_args=default_args,
	schedule_interval='*/5 * * * *',
	concurrency=10,
	max_active_runs=1,
)

task_1 = BashOperator(
	task_id="task_1",
	bash_command="sleep 319",
	dag=dag,
)

task_2 = BashOperator(
	task_id="task_2",
	bash_command="sleep 112",
	dag=dag,
)

task_3 = BashOperator(
	task_id="task_3",
	bash_command="sleep 745",
	dag=dag,
)

task_4 = BashOperator(
	task_id="task_4",
	bash_command="sleep 722",
	dag=dag,
)
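
As a rough check of the expectation stated above, the following sketch computes the spacing one would expect between DAG runs. The durations and schedule are copied from the DAG; the variable names are illustrative only, and the assumption that a late DAG with max_active_runs=1 starts its next run as soon as the previous one ends is the expectation described in this comment, not a statement about the scheduler's actual logic.

```python
from datetime import timedelta

# Sleep durations (seconds) copied from the four BashOperator tasks above.
task_durations = {"task_1": 319, "task_2": 112, "task_3": 745, "task_4": 722}

schedule_interval = timedelta(minutes=5)  # corresponds to '*/5 * * * *'

# The tasks have no dependencies between them, so they run in parallel and
# a DAG run lasts about as long as its slowest task.
run_duration = timedelta(seconds=max(task_durations.values()))

# Assumption: with max_active_runs=1 and a schedule that is already behind,
# the next run can start as soon as the previous one has finished.
expected_spacing = max(run_duration, schedule_interval)

print(expected_spacing)  # 0:12:25
```

Gaps of several hours between runs are therefore far outside what the task durations alone would explain.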


> Task hangs in up_for_retry state for very long
> ----------------------------------------------
>
>                 Key: AIRFLOW-194
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-194
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: Airflow 1.7.0
>         Environment: Airflow 1.7.0 on RHEL 7 and OpenSuse 13.2
>            Reporter: Michal TOMA
>            Assignee: Siddharth Anand
>
> I can observe this problem on 2 separate Airflow installations.
> The symptoms are:
> - One (and only one) task stays in up_for_retry state even when the last of the retries finished with an OK status.
> - It is yellow in the tree view.
> - The execution somehow resumes several hours later automatically.
> - It seems (though I am not certain) to be related to a mode where task execution is "lagging" behind the normal schedule.
> Here is an example of a task that should run every hour "0 * * * *":
> Current date : 2016-05-30T15:31:00+0200
> ----- Run 1 ------
> Run ID: 2016-05-05T21:00:00
> Task start: 2015-05-30T07:38:XX.XXX
> Task end: 2015-05-30T08:23:XX.XXX
> Marked as success
> ----- Run 2 ------
> Run ID: 2016-05-05T22:00:00
> Task start: 2015-05-30T11:10:XX.XXX
> Task end: 2015-05-30T11:56:XX.XXX
> Marked as success
> ----- Run 3 ------
> Run ID: 2016-05-05T23:00:00
> Task start: 2015-05-30T11:56:XX.XXX
> Task end: 2015-05-30T12:41:XX.XXX
> Marked as success
> ----- Run 4 ------
> Run ID: 2016-05-06T00:00:00
> Task start: 2015-05-30T15:12:XX.XXX
> Task end: (Still running now)
> Marked as running
> There are nearly 2 hours between Run-1 and Run-2, and nearly 2 hours as well between Run-3 and Run-4.
> Only Run-3 starts immediately after the end of Run-2, which is the expected behavior since the runs are very late on schedule (the Run ID is 2016-05-06 while today is 2016-05-30).
> This is a high priority issue for our setup. I could try to dig deeper into this problem, but I have no idea where to look to debug this issue.
> Any pointers would be more than welcome.
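
The idle time between consecutive runs in the quoted example can be worked out directly from the listed timestamps. This is a minimal sketch that uses only the hour and minute values from the report (the seconds are elided there as "XX.XXX", so they are ignored); the `runs` list and `parse` helper are illustrative names, not part of Airflow.

```python
from datetime import datetime

# (start, end) times at minute resolution, copied from the quoted report.
runs = [
    ("07:38", "08:23"),  # Run 1
    ("11:10", "11:56"),  # Run 2
    ("11:56", "12:41"),  # Run 3
    ("15:12", None),     # Run 4, still running at the time of the report
]

def parse(hhmm):
    """Parse an HH:MM string; all runs fall on the same day."""
    return datetime.strptime(hhmm, "%H:%M")

# Idle minutes between the end of one run and the start of the next.
gaps_minutes = [
    int((parse(nxt[0]) - parse(prev[1])).total_seconds() // 60)
    for prev, nxt in zip(runs, runs[1:])
]

print(gaps_minutes)  # [167, 0, 151]
```

At minute resolution, only the Run-2 to Run-3 transition is immediate; the other two transitions each leave the DAG idle for more than two hours.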


