Posted to commits@airflow.apache.org by "droppoint (via GitHub)" <gi...@apache.org> on 2023/12/07 15:24:17 UTC

Re: [PR] Fix race condition in KubernetesExecutor with concurrently running schedulers [airflow]

droppoint commented on PR #35800:
URL: https://github.com/apache/airflow/pull/35800#issuecomment-1845545959

   We've refactored the _adopt_completed_pods function into _delete_orphaned_completed_pods, and it now properly removes completed pods left behind by failed schedulers.
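
   To illustrate the intended behaviour (this is not the code from the PR), here is a minimal sketch of the selection rule: a pod counts as orphaned and completed when it has finished successfully and the scheduler that launched it is no longer alive. The `Pod` class, the `scheduler_job_id` field, and `select_orphaned_completed_pods` are hypothetical names used only for this example; the real executor works against the Kubernetes API rather than in-memory objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    """Hypothetical, simplified stand-in for a worker pod (not the Kubernetes client model)."""

    name: str
    # Kubernetes pod phase; kubectl displays the "Succeeded" phase as "Completed".
    phase: str
    # Assumption: the id of the launching scheduler is recorded on the pod.
    scheduler_job_id: str


def select_orphaned_completed_pods(pods, alive_scheduler_job_ids):
    """Return pods that finished successfully and whose scheduler is gone."""
    return [
        pod
        for pod in pods
        if pod.phase == "Succeeded"
        and pod.scheduler_job_id not in alive_scheduler_job_ids
    ]


pods = [
    Pod(name="task-a", phase="Succeeded", scheduler_job_id="1"),  # scheduler 1 was killed
    Pod(name="task-b", phase="Running", scheduler_job_id="1"),    # still running, not completed
    Pod(name="task-c", phase="Succeeded", scheduler_job_id="2"),  # owner is still alive
]

# Only scheduler 2 is alive, so only task-a qualifies for deletion.
orphaned = select_orphaned_completed_pods(pods, alive_scheduler_job_ids={"2"})
orphaned_names = [pod.name for pod in orphaned]
```

   In this model, only `task-a` is selected: `task-b` has not completed and `task-c` still belongs to a live scheduler.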
   
   Here's a step-by-step breakdown of our test:
   1. Set the number of schedulers in the namespace to 2.
   2. Create a DAG that sleeps for 5 minutes.
   3. Set orphaned_tasks_check_interval to 20 minutes.
   4. Run the DAG on scheduler №1.
   5. Wait until the DAGRun, Job, TaskInstance, and Pod are all in the "Running" state.
   6. Kill scheduler №1 and prevent its restart.
   7. Wait until the pod is in the "Completed" state.
   8. Wait until adoption starts on scheduler №2.
   9. Wait until the cleanup-pods cronjob starts.
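
   For step 3, one way to apply the setting is through an environment variable on the scheduler pods. The sketch below builds that override, assuming the option lives in the `[scheduler]` section and is expressed in seconds, and relying on Airflow's `AIRFLOW__{SECTION}__{KEY}` naming convention. The helper `airflow_env_var` is a hypothetical name for this example only.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build an environment variable name using the AIRFLOW__{SECTION}__{KEY} convention."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Step 3 of the test: 20 minutes, converted to seconds (assumed unit of the option).
ORPHAN_CHECK_MINUTES = 20

env_overrides = {
    airflow_env_var("scheduler", "orphaned_tasks_check_interval"): str(
        ORPHAN_CHECK_MINUTES * 60
    ),
}

for name, value in env_overrides.items():
    print(f"{name}={value}")
```

   A long interval like this leaves a wide window between the pod completing (step 7) and adoption starting on the second scheduler (step 8), which makes the ordering of the results easier to observe.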
   
   Results:
   - TaskInstance/DAGRun/Job status changed to "success" after step 7 but before step 8.
   - The pod was deleted after step 8 but before step 9.
   
   > The task OOMs and therefore cannot report its state by itself.
   
   We'll check that scenario in a few days.

