Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 08:15:18 UTC

[GitHub] [airflow] shopee-jin opened a new issue #15527: Rescheduled sensor task stuck forever because scheduler receive executor failure event [Extremely Flaky]

shopee-jin opened a new issue #15527:
URL: https://github.com/apache/airflow/issues/15527


   **Apache Airflow version**:
   1.10.9
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   NOT running containers
   
   **Environment**:
   Ubuntu 18.04.5 LTS
   Redis, CeleryExecutor
   
   **What happened**:
   We have >10k sensor tasks in the Airflow cluster, and we use `reschedule` mode to save resources.
   In extremely rare cases (**estimated 1 in a million**), a sensor task can get stuck forever.
    
   https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/jobs/scheduler_job.py#L1316
   
   The scheduler received an executor failure event:
   
   `2021-04-25 11:02:46,402 INFO - Executor reports execution of dag_vision_base_MY.xxxx execution_date=2021-04-24 06:00:00+08:00 exited with status failed for try_number 1`
   
   After that, the task got stuck forever:
   
   `[2021-04-25 11:02:43,074] {scheduler_job.py:1604} INFO - Creating / updating <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [scheduled]> in ORM
   [2021-04-25 11:03:16,631] {logging_mixin.py:112} INFO - [2021-04-25 11:03:16,631] {taskinstance.py:672} DEBUG - <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Not In Retry Period' PASSED: True, The context specified that being in a retry period was permitted.
   [2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25 11:03:16,631] {taskinstance.py:672} DEBUG - <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Trigger Rule' PASSED: True, The task instance did not have any upstream tasks.
   [2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25 11:03:16,632] {taskinstance.py:672} DEBUG - <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
   [2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25 11:03:16,632] {taskinstance.py:672} DEBUG - <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Ready To Reschedule' PASSED: True, The context specified that being in a reschedule period was permitted.
   [2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25 11:03:16,632] {taskinstance.py:655} DEBUG - Dependencies all met for <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]>
   [2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25 11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Not In Retry Period' PASSED: True, The context specified that being in a retry period was permitted.
   [2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25 11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Trigger Rule' PASSED: True, The task instance did not have any upstream tasks.
   [2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25 11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
   [2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25 11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Ready To Reschedule' PASSED: True, The context specified that being in a reschedule period was permitted.
   [2021-04-25 11:03:50,524] {logging_mixin.py:112} INFO - [2021-04-25 11:03:50,523] {taskinstance.py:655} DEBUG - Dependencies all met for <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]>`
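
Because the failure is so rare, it may help to scan scheduler logs for the executor failure message and list the affected task instances. The following is only a sketch: the regular expression is modelled on the single log line quoted in this report, the names `ExecutorFailure` and `find_executor_failures` are hypothetical helpers and not part of Airflow, and it assumes the DAG id contains no dot.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class ExecutorFailure(NamedTuple):
    """One 'Executor reports ... exited with status failed' scheduler event."""
    logged_at: str
    dag_id: str
    task_id: str
    execution_date: str
    try_number: int


# Modelled on the one scheduler line quoted in this report. Other Airflow
# versions may word the message differently, and a DAG id that contains a
# dot would be split at the wrong place (assumption: it does not).
FAILURE_RE = re.compile(
    r"(?P<logged_at>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO - "
    r"Executor reports execution of (?P<dag_id>[^.\s]+)\.(?P<task_id>\S+) "
    r"execution_date=(?P<execution_date>.+?) "
    r"exited with status failed for try_number (?P<try_number>\d+)"
)


def find_executor_failures(lines: Iterable[str]) -> Iterator[ExecutorFailure]:
    """Yield one record per log line that matches the failure message."""
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            yield ExecutorFailure(
                logged_at=match["logged_at"],
                dag_id=match["dag_id"],
                task_id=match["task_id"],
                execution_date=match["execution_date"],
                try_number=int(match["try_number"]),
            )
```

The function accepts any iterable of strings, so an open scheduler log file can be passed in directly. The resulting DAG id, task id, and execution date can then be compared by hand against task instances that are still shown as `queued`.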
   
   **What you expected to happen**:
   All sensor tasks should be rescheduled correctly, regardless of whether the executor reports success or failure.
   
   <!-- What do you think went wrong? -->
   
   
   **How to reproduce it**:
   [Extremely Flaky]
   
   **Anything else we need to know**:
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #15527: Rescheduled sensor task stuck forever because scheduler receive executor failure event [Extremely Flaky]

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #15527:
URL: https://github.com/apache/airflow/issues/15527#issuecomment-826615201


   Thanks for opening your first issue here! Be sure to follow the issue template!
   

