Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/04/11 15:50:11 UTC

[GitHub] [airflow] woodywuuu commented on issue #10790: Copy of [AIRFLOW-5071] JIRA: Thousands of Executor reports task instance X finished (success) although the task says its queued. Was the task killed externally?

woodywuuu commented on issue #10790:
URL: https://github.com/apache/airflow/issues/10790#issuecomment-1095231809

   airflow: 2.2.2 with MySQL 8, HA scheduler, Celery executor (Redis backend)
   
   From the logs, the task instances (TIs) that reported the error `killed externally (status: success)` had been rescheduled:
   1. The scheduler finds a TI to schedule (TI state goes from None to scheduled).
   2. The scheduler queues the TI (TI state goes from scheduled to queued).
   3. The scheduler sends the TI to Celery.
   4. A worker receives the TI.
   5. The worker finds that the TI's state in MySQL is scheduled: https://github.com/apache/airflow/blob/2.2.2/airflow/models/taskinstance.py#L1224
   6. The worker sets the TI's state to None.
   7. The scheduler reschedules the TI.
   8. The scheduler cannot queue the TI again, sees that it has succeeded in Celery, and sets it to failed.
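
The eight steps above can be replayed as a small state trace. This is only a toy sketch in plain Python, not Airflow code: the function `replay` and its parameter `db_state_seen_by_worker` are hypothetical names, and step 8 is simplified to "Celery reports success while the database state is not a finished state".

```python
from typing import List, Optional, Tuple


def replay(db_state_seen_by_worker: str) -> Tuple[List[Optional[str]], str]:
    """Return (database state history, Celery result) for one toy task instance."""
    history: List[Optional[str]] = [None]
    history.append("scheduled")  # step 1: None -> scheduled
    history.append("queued")     # step 2: scheduled -> queued
    # Steps 3-4: the task is sent to Celery and a worker receives it.
    if db_state_seen_by_worker == "queued":
        # Normal path: the worker sees the expected state and runs the task.
        history += ["running", "success"]
        return history, "success"
    # Step 5: the worker reads an unexpected state from the database.
    history.append(db_state_seen_by_worker)
    # Step 6: the worker resets the state to None without running the task.
    history.append(None)
    celery_result = "success"  # the Celery task itself still finishes cleanly
    # Step 7: the scheduler reschedules the task instance.
    history.append("scheduled")
    # Step 8 (simplified): Celery says success, the database disagrees.
    if celery_result == "success" and history[-1] not in ("success", "failed"):
        history.append("failed")
    return history, celery_result


if __name__ == "__main__":
    print(replay("scheduled"))
    print(replay("queued"))
```

In this model the only difference between the healthy path and the failing path is the state the worker reads in step 5.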
   
   From MySQL we can see that none of the failed tasks has an external_executor_id!
   
   We tested with 5000 DAGs, each with 50 dummy tasks, and found that the probability of triggering this problem rises sharply when the following two conditions are met:
   
   1. No external_executor_id is set on a TI that is queued in Celery: https://github.com/apache/airflow/blob/2.2.2/airflow/jobs/scheduler_job.py#L537
      * The SQL query above uses skip_locked, so some TIs queued in Celery may miss their external_executor_id.
   2. A scheduler loop takes very long (more than 60s), so `adopt_or_reset_orphaned_tasks` judges that the SchedulerJob has failed and tries to adopt the orphaned TIs: https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L442
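
How the two conditions above might combine can be sketched with a toy model. Everything here is hypothetical and is not Airflow's implementation: `assign_external_ids` imitates an update that skips locked rows, `looks_orphaned` imitates a heartbeat check with the 60s figure from the observation above, and `reset_candidates` rests on the assumption (suggested by the observations, not confirmed here) that a task instance without an external id cannot be adopted.

```python
from typing import Dict, List, Optional, Set


def assign_external_ids(
    rows: Dict[str, Optional[str]],
    celery_ids: Dict[str, str],
    locked: Set[str],
) -> Dict[str, Optional[str]]:
    """Imitate an update that uses skip-locked semantics: locked rows are silently skipped."""
    updated = dict(rows)
    for key, external_id in celery_ids.items():
        if key in locked:
            continue  # row is locked by another transaction, so it keeps no id
        updated[key] = external_id
    return updated


def looks_orphaned(seconds_since_heartbeat: float, threshold: float = 60.0) -> bool:
    """Toy heartbeat check: a scheduler that is too slow is treated as failed."""
    return seconds_since_heartbeat > threshold


def reset_candidates(
    rows: Dict[str, Optional[str]], seconds_since_heartbeat: float
) -> List[str]:
    """ASSUMPTION: a task instance without an external id cannot be adopted."""
    if not looks_orphaned(seconds_since_heartbeat):
        return []
    return sorted(key for key, external_id in rows.items() if external_id is None)


if __name__ == "__main__":
    rows = assign_external_ids(
        {"ti_a": None, "ti_b": None},
        {"ti_a": "celery-id-a", "ti_b": "celery-id-b"},
        locked={"ti_b"},
    )
    print(rows)
    print(reset_candidates(rows, seconds_since_heartbeat=75.0))
```

In this model neither condition alone produces a reset candidate: a skipped row is harmless while the scheduler heartbeat is fresh, and a slow loop is harmless if every row received its id.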
   
   We ran these tests:
   1. Patch `SchedulerJob._process_executor_events` so that it does not set external_executor_id on queued TIs.
      * 300+ DAGs failed with `killed externally (status: success)` (normally fewer than 10).
   2. Patch `adopt_or_reset_orphaned_tasks` so that it does not adopt orphaned TIs.
      * No DAG failed!
   
   I read the notes [here](https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L442), but I still don't understand the following:
   1. Why should we handle TIs queued in Celery and set this external id?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org