Posted to commits@airflow.apache.org by "Szymon Grzemski (Jira)" <ji...@apache.org> on 2020/08/05 18:38:00 UTC

[jira] [Comment Edited] (AIRFLOW-5071) Thousand os Executor reports task instance X finished (success) although the task says its queued. Was the task killed externally?

    [ https://issues.apache.org/jira/browse/AIRFLOW-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171669#comment-17171669 ] 

Szymon Grzemski edited comment on AIRFLOW-5071 at 8/5/20, 6:37 PM:
-------------------------------------------------------------------

In our case sensors are marked as failed (even though they were successful) while in the UP_FOR_RESCHEDULE state. As mentioned before:
{code:java}
[2020-07-06 13:24:48,189] {scheduler_job.py:1328} ERROR - Executor reports task instance <TaskInstance: dedupe-emr-job-flow.check_signals_table 2020-07-06 12:00:00+00:00 [queued]> finished (success) although the task says its queued. Was the task killed externally?{code}
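For context, the sequence that seems to lead up to this for a reschedule-mode sensor can be sketched like so (a self-contained Python simulation, not real Airflow code; the state names match Airflow's TaskInstance states, everything else is illustrative):

```python
# Illustrative timeline for a sensor in reschedule mode (pure Python).
events = []

def record(step, ti_state, celery_result=None):
    events.append((step, ti_state, celery_result))

# try 1: the sensor pokes, the condition is not met yet, and the task
# process exits cleanly -> the TI is set to up_for_reschedule, but Celery
# records SUCCESS for that worker process, because the process itself
# finished without error.
record("poke #1 finishes", "up_for_reschedule", "SUCCESS")

# the scheduler then re-queues the sensor for the next poke interval
record("scheduler re-queues TI", "queued")

# the next scheduler heartbeat reads the (stale) SUCCESS result from
# Celery while the TI it refers to has already been moved back to queued
record("executor event processed", "queued", "SUCCESS")

last_step, ti_state, celery_result = events[-1]
mismatch = (celery_result == "SUCCESS" and ti_state == "queued")
```

If that timeline is right, the scheduler's sanity check sees exactly the combination in the log line above: executor says success, TI says queued.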
The error occurs at the end of this function:
[https://github.com/apache/airflow/blob/1.10.9/airflow/jobs/scheduler_job.py#L1441]

which calls:
[https://github.com/apache/airflow/blob/1.10.9/airflow/jobs/scheduler_job.py#L1511-L1530]

but it fails inside:
[https://github.com/apache/airflow/blob/1.10.9/airflow/jobs/scheduler_job.py#L1508]

I think the state mismatch happens somewhere between those lines. Reading the _validate_and_run_task_instances function, you can see that task execution happens just before the executor heartbeat is called (which is responsible for communication with the Celery queue), and the check fails right after the heartbeat completes. So how can the Celery executor read a success state from the queue when the task has only just been queued?
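The guard that emits the error can be sketched as follows (a minimal, self-contained simplification, not the actual scheduler_job.py code; the class and function names here are made up):

```python
# Hedged sketch of the scheduler's executor-event check (illustrative only).
QUEUED = "queued"
SUCCESS = "success"

class FakeTaskInstance:
    def __init__(self, key, state=QUEUED):
        self.key = key
        self.state = state

def process_executor_event(ti, executor_state):
    """Return the error message if the executor-reported state and the
    scheduler's copy of the TaskInstance state disagree, else None."""
    if executor_state == SUCCESS and ti.state == QUEUED:
        return ("Executor reports task instance <TaskInstance: %s [%s]> "
                "finished (%s) although the task says its %s. "
                "Was the task killed externally?"
                % (ti.key, ti.state, executor_state, ti.state))
    return None

ti = FakeTaskInstance(
    "dedupe-emr-job-flow.check_signals_table 2020-07-06 12:00:00+00:00")
msg = process_executor_event(ti, SUCCESS)
```

The point is that the check only compares two snapshots of state; it cannot tell a genuinely killed task from a stale SUCCESS event left over from a previous reschedule cycle.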

[~turbaszek] Am I right or am I lost in the code? :D


> Thousand os Executor reports task instance X finished (success) although the task says its queued. Was the task killed externally?
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5071
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5071
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: DAG, scheduler
>    Affects Versions: 1.10.3
>            Reporter: msempere
>            Priority: Critical
>             Fix For: 1.10.12
>
>         Attachments: image-2020-01-27-18-10-29-124.png, image-2020-07-08-07-58-42-972.png
>
>
> I'm opening this issue because since I updated to 1.10.3 I'm seeing thousands of daily messages like the following in the logs:
>  
> ```
>  {{__init__.py:1580}} ERROR - Executor reports task instance <TaskInstance: X 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says its queued. Was the task killed externally?
> {{jobs.py:1484}} ERROR - Executor reports task instance <TaskInstance: X 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says its queued. Was the task killed externally?
> ```
> -And it looks like this is also triggering thousands of daily emails, because the flag to send email in case of failure is set to True.-
> I have Airflow setup to use Celery and Redis as a backend queue service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)