Posted to commits@airflow.apache.org by "Emmanuel Brard (JIRA)" <ji...@apache.org> on 2019/04/15 19:11:00 UTC

[jira] [Commented] (AIRFLOW-1555) Backfill job gets killed 1 hour after starting

    [ https://issues.apache.org/jira/browse/AIRFLOW-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818292#comment-16818292 ] 

Emmanuel Brard commented on AIRFLOW-1555:
-----------------------------------------

When the visibility timeout expires, the broker is supposed to redeliver the task, triggering another run of it; that second run should then terminate right away, since each task checks its dependencies at start-up (all dependencies met, not already in the running state, something like this).
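
For context, Celery's Redis transport has a default visibility timeout of one hour, which lines up with the kill interval reported here. A minimal sketch of raising it at the Celery configuration level, using the Celery 3.x setting name (how this gets wired into Airflow's celery_config_options differs between versions, so treat the wiring as an assumption):

{code}
# Celery 3.x config module -- a sketch, assuming a Redis broker.
# 'visibility_timeout' is how long, in seconds, Redis holds an
# unacknowledged task before redelivering it to another worker.
# The Celery default for Redis is 3600 (one hour). Pick a value
# longer than your longest-running task.
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 43200}  # 12 hours
{code}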

> Backfill job gets killed 1 hour after starting
> ----------------------------------------------
>
>                 Key: AIRFLOW-1555
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1555
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: backfill
>    Affects Versions: 1.8.1
>         Environment: Airflow 1.8.1
> Celery 3.1.23 with one coordinator, redis and 3 workers
> Python 3.5.2
> Debian GNU/Linux 8.9 (jessie)
> snakebite uninstalled because it does not work with Python 3.5.2
> MySQL 5.6
>            Reporter: Shreyas Joshi
>            Assignee: Shreyas Joshi
>            Priority: Major
>             Fix For: 1.10.0
>
>
> *What happens?*
> After running for an hour tasks in a backfill die. The task log shows:
> {code}
> ...
> [2017-08-31 06:48:06,425] {jobs.py:2172} WARNING - Recorded pid 5451 is not a descendant of the current pid 21571
> [2017-08-31 06:48:11,884] {jobs.py:2179} WARNING - State of this instance has been externally set to failed. Taking the poison pill. So long.
> [2017-08-31 06:48:11,892] {helpers.py:220} WARNING - Terminating descendant processes of [<REDACTED>] PID: 5451
> [2017-08-31 06:48:11,892] {helpers.py:224} WARNING - Terminating descendant process [<REDACTED>] PID: 5459
> [2017-08-31 06:48:11,896] {helpers.py:231} WARNING - Waiting up to 5s for processes to exit...
> ...
> {code}
> The backfill log shows:
> {code}
> ...
> [2017-08-31 11:23:44,025] {jobs.py:1729} ERROR - Executor reports task instance <TaskInstance: dag_name.task_name 2017-08-30 02:00:00 [running]> finished (failed) although the task says its running. Was the task killed externally?
> [2017-08-31 11:23:44,025] {models.py:1427} ERROR - Executor reports task instance <TaskInstance: analytics_events.page_views 2017-08-30 02:00:00 [running]> finished (failed) although the task says its running. Was the task killed externally?
> ...
> {code}
> The Celery UI shows the following exception, although the task status shows "success":
> {code}
> Traceback (most recent call last):
>   File "/data/airflow-sources/.venv/lib/python3.5/site-packages/airflow/executors/celery_executor.py", line 56, in execute_command
>     subprocess.check_call(command, shell=True)
>   File "/usr/share/pyenv/versions/3.5.2/lib/python3.5/subprocess.py", line 581, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command 'airflow run dag_name task_name 2017-08-30T02:00:00 --pickle 14 --local' returned non-zero exit status 1
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/data/airflow-sources/.venv/lib/python3.5/site-packages/celery/app/trace.py", line 240, in trace_task
>     R = retval = fun(*args, **kwargs)
>   File "/data/airflow-sources/.venv/lib/python3.5/site-packages/celery/app/trace.py", line 438, in __protected_call__
>     return self.run(*args, **kwargs)
>   File "/data/airflow-sources/.venv/lib/python3.5/site-packages/airflow/executors/celery_executor.py", line 59, in execute_command
>     raise AirflowException('Celery command failed')
> airflow.exceptions.AirflowException: Celery command failed
> {code}
> The tasks have execution timeouts explicitly set to 6 hours and an SLA of 5 hours. In the course of debugging this I also set dagrun_timeout to 6 hours; it made no difference.
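> For reference, a minimal sketch of how those limits are declared in the DAG definition (the operator, schedule, and command below are illustrative assumptions, not taken from the real DAG):
> {code}
> from datetime import datetime, timedelta
>
> from airflow import DAG
> from airflow.operators.bash_operator import BashOperator
>
> dag = DAG(
>     dag_id='analytics_events',
>     start_date=datetime(2017, 8, 1),
>     schedule_interval='@daily',
>     dagrun_timeout=timedelta(hours=6),  # run-level limit added while debugging
> )
>
> page_views = BashOperator(
>     task_id='page_views',
>     bash_command='python -m jobs.page_views',  # hypothetical command
>     execution_timeout=timedelta(hours=6),      # per-task hard timeout
>     sla=timedelta(hours=5),                    # SLA alert threshold
>     dag=dag,
> )
> {code}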
> Here is a thread on [Stack Overflow|https://stackoverflow.com/questions/44274381/airflow-long-running-task-in-subdag-marked-as-failed-after-an-hour] that talks about a very similar issue.
> These tasks run fine on our older Airflow 1.7. This is currently blocking our upgrade.


