Posted to commits@airflow.apache.org by "Kaxil Naik (JIRA)" <ji...@apache.org> on 2019/05/28 12:45:00 UTC

[jira] [Updated] (AIRFLOW-4485) All tasks stop running when using reschedule mode due to some tasks having a negative try_number

     [ https://issues.apache.org/jira/browse/AIRFLOW-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaxil Naik updated AIRFLOW-4485:
--------------------------------
    Fix Version/s: 1.10.4

> All tasks stop running when using reschedule mode due to some tasks having a negative try_number
> ------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4485
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4485
>             Project: Apache Airflow
>          Issue Type: Bug
>    Affects Versions: 1.10.3
>            Reporter: Teresa Martyny
>            Priority: Major
>             Fix For: 1.10.4
>
>
> When we use reschedule mode for our sensors, about an hour into our core pipeline running, the following happens:
> 1. Negative try_number: On the Scheduler we begin to see `Executor reports execution of [task info here] exited with status success for try_number -1`. The try_number then keeps decrementing until it reaches -4. On every run, -4 is the value at which the following steps play out:
> 2. We see a spike (which then stops) in this error message on the Scheduler: `ERROR - Executor reports task instance {} finished ({}) although the task says its {}. Was the task killed externally?` coming from `airflow/jobs.py#_process_executor_events`
> 3. Sometimes followed by a few instances of the error on a single Worker: `Celery command failed` coming from `airflow/executors/celery_executor.py#execute_command`
> 4. Followed on the Worker by one instance of the error: `ZeroDivisionError`
> 5. Followed by a spike in `ZeroDivisionError` on the Scheduler originating from `airflow/models/__init__.py#next_retry_datetime` line 1183
> 6. The pipeline then grinds to a halt. Tasks sit in the scheduled state in the Scheduler, and Celery won't touch them. If try_numbers go negative but never reach -4, the pipeline does not halt.
>  
> We identified that reschedule mode decrements the `try_number` in `airflow/models/__init__.py#_handle_reschedule`.
> We did not identify why it never re-increments the `try_number` to do what the code ostensibly attempts: reuse the same `try_number` and write to the same log file.
> When we switched the sensors to use poke mode instead, all of the above problems stopped.
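
The report does not show the code behind the `ZeroDivisionError`, so the following is only a minimal sketch of one way a negative `try_number` could produce it. It assumes that exponential backoff is enabled, that the backoff floor is computed as the retry delay times `2 ** (try_number - 2)` truncated to whole seconds, and that a hash-based jitter is then taken modulo that floor. The function names, the 60-second delay, and the hash value are hypothetical and chosen for illustration.

```python
from datetime import timedelta


def min_backoff_seconds(retry_delay, try_number):
    """Backoff floor in whole seconds (assumed formula, not confirmed by the report)."""
    # For try_number < 2 the exponent is negative, so the factor is a fraction
    # and int() can truncate the product down to 0.
    return int(retry_delay.total_seconds() * (2 ** (try_number - 2)))


def jittered_backoff_seconds(retry_delay, try_number, task_hash):
    """Backoff floor plus hash-based jitter (hypothetical helper)."""
    min_backoff = min_backoff_seconds(retry_delay, try_number)
    # Raises ZeroDivisionError when min_backoff has been truncated to 0.
    return min_backoff + task_hash % min_backoff


if __name__ == "__main__":
    delay = timedelta(seconds=60)  # assumed retry delay
    for try_number in (1, 0, -1, -2, -3, -4):
        try:
            result = jittered_backoff_seconds(delay, try_number, task_hash=12345)
        except ZeroDivisionError:
            result = "ZeroDivisionError"
        print(try_number, result)
```

Under these assumptions the floor is still 1 second at `try_number` -3 and first truncates to 0 at -4, which would match the threshold described above. With a different retry delay the threshold would move, so this is a plausible reading rather than a confirmed diagnosis.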


