Posted to commits@airflow.apache.org by "Kaxil Naik (JIRA)" <ji...@apache.org> on 2019/05/28 12:45:00 UTC
[jira] [Updated] (AIRFLOW-4485) All tasks stop running when using
reschedule mode due to some tasks having a negative try_number
[ https://issues.apache.org/jira/browse/AIRFLOW-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kaxil Naik updated AIRFLOW-4485:
--------------------------------
Fix Version/s: 1.10.4
> All tasks stop running when using reschedule mode due to some tasks having a negative try_number
> ------------------------------------------------------------------------------------------------
>
> Key: AIRFLOW-4485
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4485
> Project: Apache Airflow
> Issue Type: Bug
> Affects Versions: 1.10.3
> Reporter: Teresa Martyny
> Priority: Major
> Fix For: 1.10.4
>
>
> When we use reschedule mode for our sensors, the following happens about an hour into our core pipeline run:
> 1. Negative try_number: On the Scheduler we begin to see `Executor reports execution of [task info here] exited with status success for try_number -1`. The try_number then keeps decrementing until it reaches -4. On every run, -4 is the point at which the following steps play out:
> 2. We see a spike in this error message on the Scheduler, which then stops: `ERROR - Executor reports task instance {} finished ({}) although the task says its {}. Was the task killed externally?` coming from `airflow/jobs.py#_process_executor_events`
> 3. Sometimes followed by a few instances of the error on a single Worker: `Celery command failed` coming from `airflow/executors/celery_executor.py#execute_command`
> 4. Followed on the Worker by one instance of the error: `ZeroDivisionError`
> 5. Followed by a spike in `ZeroDivisionError` on the Scheduler originating from `airflow/models/__init__.py#next_retry_datetime` line 1183
> 6. The pipeline then grinds to a halt. Tasks sit in a scheduled state in the scheduler and Celery won't touch them. If try_numbers go negative but never reach -4, the pipeline doesn't grind to a halt.
>
> We identified that reschedule mode decrements the `try_number` in `airflow/models/__init__.py#_handle_reschedule`.
> We did not identify why the `try_number` is never incremented again afterwards, which the code ostensibly intends so that a rescheduled run reuses the same `try_number` and writes to the same log file.
> When we switched the sensors to use poke mode instead, all of the above problems stopped.
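
One possible reading of steps 4 and 5 is that a retry backoff computed as a power of two in `try_number` truncates to zero once `try_number` is negative enough, and that zero is then used as a modulus. The sketch below is an assumption for illustration only, not the code in `airflow/models/__init__.py`: the names `backoff_seconds` and `first_failing_try_number`, the `key` string, and the 60-second delay are all hypothetical, and the report does not say which retry settings were in use.

```python
import hashlib


def backoff_seconds(retry_delay_seconds, try_number, key="dag#task#date"):
    """Hypothetical exponential backoff with deterministic jitter.

    Assumed shape only: the minimum backoff doubles with each try, and a
    hash-derived jitter is added by taking the hash modulo that minimum.
    """
    # A negative exponent yields a float below 1, so int() can truncate to 0.
    min_backoff = int(retry_delay_seconds * (2 ** (try_number - 2)))
    digest = int(
        hashlib.sha1("{}#{}".format(key, try_number).encode("utf-8")).hexdigest(),
        16,
    )
    # Raises ZeroDivisionError when min_backoff == 0.
    return min_backoff + digest % min_backoff


def first_failing_try_number(retry_delay_seconds, start=1, floor=-64):
    """Decrement try_number from `start` until the backoff divides by zero."""
    try_number = start
    while try_number >= floor:
        try:
            backoff_seconds(retry_delay_seconds, try_number)
        except ZeroDivisionError:
            return try_number
        try_number -= 1
    return None


if __name__ == "__main__":
    # With an illustrative 60-second delay the sketch first fails at -4.
    print(first_failing_try_number(60))
```

Under these assumptions the failure threshold depends on the configured delay, so a decrement that is never undone would be harmless for a few reschedules and then fail at a fixed negative value on every run.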
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)