Posted to commits@airflow.apache.org by "Teresa Martyny (JIRA)" <ji...@apache.org> on 2019/05/08 18:15:00 UTC

[jira] [Created] (AIRFLOW-4485) All tasks stop running when using reschedule mode due to some tasks having a negative try_number

Teresa Martyny created AIRFLOW-4485:
---------------------------------------

Summary: All tasks stop running when using reschedule mode due to some tasks having a negative try_number
Key: AIRFLOW-4485
URL: https://issues.apache.org/jira/browse/AIRFLOW-4485
Project: Apache Airflow
Issue Type: Bug
Affects Versions: 1.10.3
Reporter: Teresa Martyny

When we use reschedule mode for our sensors, the following happens about an hour into our core pipeline run:

1. Negative try_number: On the Scheduler we begin to see `Executor reports execution of [task info here] exited with status success for try_number -1`. The try_number then continues to decrement until it reaches -4. On every run, -4 is the point at which the following steps play out:

2. We see a spike (which then stops) in this error message on the Scheduler: `ERROR - Executor reports task instance {} finished ({}) although the task says its {}. Was the task killed externally?` coming from `airflow/jobs.py#_process_executor_events`

3. Sometimes followed by a few instances of the error on a single Worker: `Celery command failed` coming from `airflow/executors/celery_executor.py#execute_command`

4. Followed on the Worker by one instance of the error: `ZeroDivisionError`

5. Followed by a spike in `ZeroDivisionError` on the Scheduler originating from `airflow/models/__init__.py#next_retry_datetime` line 1183

6. The pipeline then grinds to a halt. Tasks sit in a scheduled state in the scheduler and Celery won't touch them. If try_numbers go negative but never reach -4, the pipeline does not grind to a halt.
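
One possible mechanism for the `ZeroDivisionError`, offered as an unconfirmed guess rather than a diagnosis: if the retry delay is computed with exponential backoff and hash-based jitter, a sufficiently negative try_number can truncate the base backoff to zero, and the jitter modulo then divides by zero. The sketch below is a hypothetical standalone function, not Airflow's code. The formula, the `backoff_seconds` name, the `key` argument, and the 60-second delay are all assumptions made for illustration; the report does not state the retry delay or whether exponential backoff was enabled.

```python
import hashlib


def backoff_seconds(delay_seconds, try_number, key="dag.task.execution_date"):
    """Hypothetical exponential backoff with jitter (assumed formula, not Airflow's)."""
    # Base backoff doubles with each try; a negative try_number shrinks it instead.
    min_backoff = int(delay_seconds * (2 ** (try_number - 2)))
    # Deterministic jitter derived from a hash of an identifying key (assumed).
    digest = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
    # Raises ZeroDivisionError when min_backoff has been truncated to 0.
    return min_backoff + digest % min_backoff


for n in (3, -1, -3, -4):
    try:
        print(n, backoff_seconds(60, n))
    except ZeroDivisionError as exc:
        print(n, "ZeroDivisionError:", exc)
```

Under these assumed values, the base backoff is still 1 second at try_number -3 and truncates to 0 at -4, which would be consistent with the threshold we observe, but we have not verified that this is what happens in `next_retry_datetime`.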

We identified that reschedule mode decrements the try_number in `airflow/models/__init__.py#_handle_reschedule`.

We did not identify why the `try_number` is never re-incremented afterwards, which is ostensibly needed for what the code is attempting: to reuse the same `try_number` and write to the same log file.
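
To illustrate the drift described above, here is a toy model, not Airflow code. It assumes that each attempt is meant to increment the counter when it starts and that a reschedule decrements it so the next attempt reuses the same number. The `simulate` function and its `increment_on_start` flag are hypothetical names introduced only for this sketch.

```python
def simulate(cycles, increment_on_start):
    """Return the try_number each attempt runs under across reschedule cycles."""
    try_number = 0
    history = []
    for _ in range(cycles):
        if increment_on_start:
            try_number += 1      # attempt starts: counter goes up (assumed behaviour)
        history.append(try_number)
        try_number -= 1          # reschedule: counter goes back down
    return history


print(simulate(5, increment_on_start=True))   # [1, 1, 1, 1, 1]
print(simulate(5, increment_on_start=False))  # [0, -1, -2, -3, -4]
```

When the increment and the decrement are paired, every attempt runs under the same number. When the increment is missing, the number falls by one per reschedule, which matches the steady decrement we see in the Scheduler log.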

When we switched the sensors to use poke mode instead, all of the above problems stopped.
