Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/09/16 20:17:36 UTC

[GitHub] [airflow] wolfier opened a new issue #18304: User induced deadlock for DAGs with `depends_on_past`

wolfier opened a new issue #18304:
URL: https://github.com/apache/airflow/issues/18304


   ### Apache Airflow version
   
   main (development)
   
   ### Operating System
   
   Debian GNU/Linux 10 (buster)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   Barebones deployment with one DAG.
   
   ### What happened
   
   
   I have a DAG with the following attributes:
   -  `depends_on_past` set to true
   -  `max_active_runs` set to one
   -  `catchup` set to true
   
   When my task failed, and my dagrun failed as a result, Airflow moved on to execute the next dagrun. That dagrun then stalled because the previous instance of the task had failed.
   
   Instinctively, I cleared the failed task and the dagrun went back into the queued state.
   
   This is where the deadlock happens: the current dagrun cannot proceed because it depends on the previous dagrun, and the previous dagrun stays in the queued state waiting for an available active dagrun slot.
   
   ![image](https://user-images.githubusercontent.com/5952735/133678178-3fa8d5da-2201-4102-99d0-88d17a4a2fd7.png)
   
   ### What you expected to happen
   
   _No response_
   
   ### How to reproduce
   
   Imagine a DAG with these tasks and dependencies: A -> B -> C.
   
   The DAG has two consecutive dagruns, 1 and 2, where 1 has the earlier execution date.
   
   1. Task B of dagrun 1 fails, so dagrun 1 fails.
   2. Dagrun 2 goes into the running state.
   3. Task A of dagrun 2 succeeds, but the scheduler cannot queue Task B of dagrun 2 because the past instance of Task B is not in the successful state.
   4. The user clears Task B of dagrun 1, but dagrun 1 cannot go into the running state because the limit of maximum active runs is already reached while dagrun 2 is in the running state.
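
The four steps above can be modelled with a small, self-contained sketch. This is a toy model written for illustration, not Airflow code: the `Run` class and the `can_start` and `can_schedule` helpers are hypothetical names, and upstream task dependencies within a run are ignored so that only the two rules involved in the deadlock remain.

```python
from dataclasses import dataclass, field

MAX_ACTIVE_RUNS = 1  # mirrors max_active_runs=1 from the report


@dataclass
class Run:
    """Toy DAG run: a run-level state plus per-task states."""
    state: str  # "queued", "running", "success" or "failed"
    tasks: dict = field(default_factory=dict)  # task name -> state or None


def can_start(run_index, runs):
    """A queued run may start only while an active-run slot is free."""
    active = sum(1 for r in runs if r.state == "running")
    return runs[run_index].state == "queued" and active < MAX_ACTIVE_RUNS


def can_schedule(task, run_index, runs):
    """depends_on_past: the same task in the previous run must have succeeded."""
    if runs[run_index].state != "running":
        return False
    if run_index == 0:
        return True  # the first run has no past instance to wait for
    return runs[run_index - 1].tasks.get(task) == "success"


# State after step 4: Task B of dagrun 1 was cleared, dagrun 2 holds the only slot.
runs = [
    Run("queued", {"A": "success", "B": None, "C": None}),   # dagrun 1
    Run("running", {"A": "success", "B": None, "C": None}),  # dagrun 2
]

run1_can_start = can_start(0, runs)
run2_b_can_schedule = can_schedule("B", 1, runs)
deadlocked = not run1_can_start and not run2_b_can_schedule
print(deadlocked)
```

In this model neither run can make progress: dagrun 1 waits for a slot that dagrun 2 holds, and dagrun 2 waits for Task B of dagrun 1.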
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] nikie edited a comment on issue #18304: User induced deadlock for DAGs with `depends_on_past`

Posted by GitBox <gi...@apache.org>.
nikie edited a comment on issue #18304:
URL: https://github.com/apache/airflow/issues/18304#issuecomment-945172125


   The user is able to unblock the runs as follows:
   5. Clear Task B in dagrun 2. This moves dagrun 2 into the queued state. It appears that clearing a task in the "no_status" state deactivates a running dagrun.
   6. Dagrun 1 completes. Dagrun 2 starts executing.
   
   A possible fix could be to detect the deadlock situation and automatically move dagrun 2 into the scheduled state, clearing the path for dagrun 1. Not sure if this is feasible from a technical perspective.
   It might be worth mentioning the workaround in the documentation if this is not going to be fixed.
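
A rough sketch of why steps 5 and 6 release the deadlock, using run-level states only. It assumes that the scheduler starts queued runs oldest first whenever a slot is free; that ordering, and the `promote_queued` helper, are assumptions made for this illustration rather than a description of the scheduler's actual code.

```python
MAX_ACTIVE_RUNS = 1  # mirrors max_active_runs=1 from the report


def promote_queued(run_states):
    """Start queued runs, oldest first, while an active-run slot is free."""
    for i, state in enumerate(run_states):
        active = run_states.count("running")
        if state == "queued" and active < MAX_ACTIVE_RUNS:
            run_states[i] = "running"
    return run_states


# Deadlocked state after step 4: dagrun 1 queued, dagrun 2 running but blocked.
states = ["queued", "running"]
after_step_4 = list(promote_queued(states))

# Step 5: clearing Task B in dagrun 2 moves dagrun 2 back to queued,
# which frees the slot for the older dagrun 1.
states[1] = "queued"
after_step_5 = list(promote_queued(states))

# Step 6: dagrun 1 completes, so dagrun 2 can start.
states[0] = "success"
after_step_6 = list(promote_queued(states))

print(after_step_4, after_step_5, after_step_6)
```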





[GitHub] [airflow] nikie commented on issue #18304: User induced deadlock for DAGs with `depends_on_past`

Posted by GitBox <gi...@apache.org>.
nikie commented on issue #18304:
URL: https://github.com/apache/airflow/issues/18304#issuecomment-945179427


   This issue is similar to #17375 and might be related to #14205.





[GitHub] [airflow] nikie edited a comment on issue #18304: User induced deadlock for DAGs with `depends_on_past`

Posted by GitBox <gi...@apache.org>.
nikie edited a comment on issue #18304:
URL: https://github.com/apache/airflow/issues/18304#issuecomment-945172125


   The user is able to unblock the runs as follows:
   5. Clear Task B in dagrun 2. This moves dagrun 2 into the queued state. It appears that clearing a task in the "no_status" state (white square) deactivates a running dagrun.
   6. Dagrun 1 completes. Dagrun 2 starts executing.
   
   A possible fix could be to detect the deadlock situation and automatically move dagrun 2 into the queued state, clearing the path for dagrun 1. Not sure if this is feasible from a technical perspective.
   Another option could be to add a "Mark queued" feature for dagruns so that they can be deactivated manually without touching individual tasks, which is not an obvious step.
   
   It might be worth mentioning the workaround in the documentation if this is not going to be fixed.





[GitHub] [airflow] nikie edited a comment on issue #18304: User induced deadlock for DAGs with `depends_on_past`

Posted by GitBox <gi...@apache.org>.
nikie edited a comment on issue #18304:
URL: https://github.com/apache/airflow/issues/18304#issuecomment-945172125


   The user is able to unblock the runs as follows:
   5. Clear Task B in dagrun 2. This moves dagrun 2 into the queued state. It appears that clearing a task in the "no_status" state (white square) deactivates a running dagrun.
   6. Dagrun 1 completes. Dagrun 2 starts executing.
   
   A possible fix could be to detect the deadlock situation and automatically move dagrun 2 into the scheduled state, clearing the path for dagrun 1. Not sure if this is feasible from a technical perspective.
   Another option could be to add a "Mark queued" feature for dagruns so that they can be deactivated without touching individual tasks, which is not an obvious step.
   
   It might be worth mentioning the workaround in the documentation if this is not going to be fixed.





[GitHub] [airflow] nikie commented on issue #18304: User induced deadlock for DAGs with `depends_on_past`

Posted by GitBox <gi...@apache.org>.
nikie commented on issue #18304:
URL: https://github.com/apache/airflow/issues/18304#issuecomment-945172125


   A fast and persistent user is able to unblock the runs as follows:
   5. Mark dagrun 2 as failed.
   6. Once dagrun 1 starts executing, clear Task B in dagrun 2 so that dagrun 2 is scheduled again.
   7(a) Dagrun 1 completes, dagrun 2 starts, and the deadlock is resolved.
   7(b) If dagrun 1 completes before Task B is cleared in dagrun 2, then dagrun 3 will start and will be blocked by dagrun 2. In this case, repeat from (5) with the new dagrun pair :)
   
   A possible fix could be to detect the deadlock situation and automatically move dagrun 2 into the scheduled state, clearing the path for dagrun 1. Not sure if this is feasible from a technical perspective.
   It might be worth mentioning the workaround in the documentation if this is not going to be fixed.





[GitHub] [airflow] nikie edited a comment on issue #18304: User induced deadlock for DAGs with `depends_on_past`

Posted by GitBox <gi...@apache.org>.
nikie edited a comment on issue #18304:
URL: https://github.com/apache/airflow/issues/18304#issuecomment-945172125


   The user is able to unblock the runs as follows:
   5. Clear Task B in dagrun 2. This moves dagrun 2 into the queued state. It appears that clearing a task in the "no_status" state (white square) deactivates a running dagrun.
   6. Dagrun 1 completes. Dagrun 2 starts executing.
   
   A possible fix could be to detect the deadlock situation and automatically move dagrun 2 into the scheduled state, clearing the path for dagrun 1. Not sure if this is feasible from a technical perspective.
   It might be worth mentioning the workaround in the documentation if this is not going to be fixed.





[GitHub] [airflow] uranusjr commented on issue #18304: User induced deadlock for DAGs with `depends_on_past`

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #18304:
URL: https://github.com/apache/airflow/issues/18304#issuecomment-937399559


   What's the actionable item here? Since the deadlock is induced by the user, should we prevent this from happening by not allowing Task B to be cleared? Or do we somehow resolve that deadlock, and if so, how?





[GitHub] [airflow] uranusjr commented on issue #18304: User induced deadlock for DAGs with `depends_on_past`

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #18304:
URL: https://github.com/apache/airflow/issues/18304#issuecomment-945362143


   > A possible fix could be to detect the deadlock situation and automatically move dagrun 2 into queued state, clearing path for the dagrun 1. Not sure if this is feasible from technical perspective.
   > Another option could be to add a feature "Mark queued" for dagruns so that they can be deactivated manually without touching individual tasks, which is not obvious.
   
   I kind of feel maybe we should _always_ do this instead of trying to detect a deadlock at all. So if all the DAG run's "visiting" tasks (not sure what the right term is) are waiting for their respective past instances, the DAG run goes into a "running but not actually" state, which is treated as running in the UI, but queued in the scheduler. When any of those tasks receives its past instance's result, the run resumes its running state.
   
   If that sounds complicated, I think it's because it is (mainly the "all visiting tasks" part). So maybe a more pragmatic approach would be to simply add a button for users to resolve the deadlock themselves...
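
One way to read the "running but not actually" idea is as a function that maps a run's displayed state to the state used for slot accounting. The sketch below is only an interpretation: `effective_state` is a hypothetical helper, and "visiting" tasks are assumed to mean the tasks of a run that are next in line to be scheduled.

```python
def effective_state(run_state, visiting_tasks, past_states):
    """Return the state a scheduler could use when counting active runs.

    run_state:      state shown in the UI ("running", "queued", ...)
    visiting_tasks: tasks of this run that are next in line to be scheduled
    past_states:    task name -> state of the same task in the previous run
    """
    if run_state != "running" or not visiting_tasks:
        return run_state
    # The run only counts as queued while every visiting task is still
    # waiting for its past instance to succeed.
    all_waiting = all(past_states.get(t) != "success" for t in visiting_tasks)
    return "queued" if all_waiting else "running"


# Dagrun 2 from the report: Task B waits for Task B of dagrun 1.
blocked = effective_state("running", ["B"], {"A": "success", "B": None})

# Once Task B of dagrun 1 succeeds, dagrun 2 counts as running again.
resumed = effective_state("running", ["B"], {"A": "success", "B": "success"})

print(blocked, resumed)
```

With this accounting, dagrun 2 would release its slot while blocked, so dagrun 1 could run. Working out the set of visiting tasks is the complicated part mentioned above and is simply passed in here.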





[GitHub] [airflow] nikie edited a comment on issue #18304: User induced deadlock for DAGs with `depends_on_past`

Posted by GitBox <gi...@apache.org>.
nikie edited a comment on issue #18304:
URL: https://github.com/apache/airflow/issues/18304#issuecomment-945172125


   The user is able to unblock the runs as follows:
   5. Clear Task B in dagrun 2. This moves dagrun 2 into the queued state. It appears that clearing a task in the "no_status" state (white square) deactivates a running dagrun.
   6. Dagrun 1 completes. Dagrun 2 starts executing.
   
   A possible fix could be to detect the deadlock situation and automatically move dagrun 2 into the scheduled state, clearing the path for dagrun 1. Not sure if this is feasible from a technical perspective.
   Another option could be to add a "Mark queued" feature for dagruns so that they can be deactivated manually without touching individual tasks, which is not an obvious step.
   
   It might be worth mentioning the workaround in the documentation if this is not going to be fixed.





[GitHub] [airflow] nikie commented on issue #18304: User induced deadlock for DAGs with `depends_on_past`

Posted by GitBox <gi...@apache.org>.
nikie commented on issue #18304:
URL: https://github.com/apache/airflow/issues/18304#issuecomment-949030839


   @uranusjr 
   There is already a case where Airflow kills deadlocked dagruns: https://github.com/apache/airflow/blob/34e586a162ad9756d484d17b275c7b3dc8cefbc2/airflow/models/dagrun.py#L520
   Maybe it would be better to fail the run in our case as well, for consistency? Scheduling is already a fairly magic thing, so adding more magic like "running but not actually" or turning a dagrun off and on would complicate it even more. With the "running but not actually" solution, bug reports about "active dag runs exceed the max_active_runs setting" are likely to appear.
   
   We can try to extend the above check to also fire if `not none_depends_on_past` (i.e. there are some "on past" dependencies), but `max_active_runs` is already reached and there are no running tasks in other runs.
   The "max active runs reached" state could be passed from the method `SchedulerJob._schedule_dag_run`, which calls `DagRun.update_state`.
   What would be the best way to check for running tasks in other runs?
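
The proposed extension can be sketched as a standalone predicate. This is not the code in `dagrun.py`: the function name and every parameter except `none_depends_on_past` are hypothetical, and the first branch is a simplified stand-in for the existing check, whose full set of conditions is not reproduced here.

```python
def should_fail_as_deadlocked(
    *,
    has_unfinished_tasks,
    has_runnable_tasks,
    none_depends_on_past,
    max_active_runs_reached,
    other_runs_have_running_tasks,
):
    """Decide whether a run with stuck tasks should be failed as deadlocked."""
    if not has_unfinished_tasks or has_runnable_tasks:
        return False  # nothing is stuck
    if none_depends_on_past:
        # Simplified stand-in for the existing deadlock check.
        return True
    # Proposed extension: some tasks wait on past instances, but no earlier
    # run can make progress because every slot is taken and nothing is running.
    return max_active_runs_reached and not other_runs_have_running_tasks


# Dagrun 2 from the report: blocked on the past Task B while dagrun 1 is queued.
report_case = should_fail_as_deadlocked(
    has_unfinished_tasks=True,
    has_runnable_tasks=False,
    none_depends_on_past=False,
    max_active_runs_reached=True,
    other_runs_have_running_tasks=False,
)

# Same run, but an earlier run is still executing tasks: keep waiting.
still_waiting = should_fail_as_deadlocked(
    has_unfinished_tasks=True,
    has_runnable_tasks=False,
    none_depends_on_past=False,
    max_active_runs_reached=True,
    other_runs_have_running_tasks=True,
)

print(report_case, still_waiting)
```

Failing dagrun 2 in the reported case would free its slot, which is the same effect the manual workaround achieves. How `other_runs_have_running_tasks` would be computed is the open question raised above.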
   




