You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "Szymon Grzemski (Jira)" <ji...@apache.org> on 2020/08/05 18:22:00 UTC

[jira] [Commented] (AIRFLOW-5071) Thousand os Executor reports task instance X finished (success) although the task says its queued. Was the task killed externally?

    [ https://issues.apache.org/jira/browse/AIRFLOW-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171669#comment-17171669 ] 

Szymon Grzemski commented on AIRFLOW-5071:
------------------------------------------

In our case sensors are marked as a failed (even if they were successful) on UP_TO_RESCHEDULE state. As mentioned before:
{code:java}
[2020-07-06 13:24:48,189] {scheduler_job.py:1328} ERROR - Executor reports task instance <TaskInstance: dedupe-emr-job-flow.check_signals_table 2020-07-06 12:00:00+00:00 [queued]> finished (success) although the task says its queued. Was the task killed externally?{code}
The error occures in this funtion:
[https://github.com/apache/airflow/blob/1.10.9/airflow/jobs/scheduler_job.py#L1441]

 

which calls:
[https://github.com/apache/airflow/blob/1.10.9/airflow/jobs/scheduler_job.py#L1511-L1530]

I think the state mismatch may happen somewhere between those lines. After reading the _validate_and_run_task_instances function you can see that executing/queueing the tasks happens just before calling the executor heartbeat (which is reponsible for communication with the Celery queue). So how can Celery executor read a success state from the queue when the task has just been queued.

[~turbaszek] Am I right or am I lost in the code? :D

> Thousand os Executor reports task instance X finished (success) although the task says its queued. Was the task killed externally?
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5071
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5071
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: DAG, scheduler
>    Affects Versions: 1.10.3
>            Reporter: msempere
>            Priority: Critical
>             Fix For: 1.10.12
>
>         Attachments: image-2020-01-27-18-10-29-124.png, image-2020-07-08-07-58-42-972.png
>
>
> I'm opening this issue because since I update to 1.10.3 I'm seeing thousands of daily messages like the following in the logs:
>  
> ```
>  {{__init__.py:1580}} ERROR - Executor reports task instance <TaskInstance: X 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says its queued. Was the task killed externally?
> {{jobs.py:1484}} ERROR - Executor reports task instance <TaskInstance: X 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says its queued. Was the task killed externally?
> ```
> -And looks like this is triggering also thousand of daily emails because the flag to send email in case of failure is set to True.-
> I have Airflow setup to use Celery and Redis as a backend queue service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)