Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/11/02 07:30:29 UTC

[GitHub] [airflow] yuqian90 commented on issue #6392: [AIRFLOW-5648] Add ClearTaskOperator for clearing tasks in a DAG

URL: https://github.com/apache/airflow/pull/6392#issuecomment-549019116
 
 
   > You could separate them into 2 DAGs, and use TriggerDagRunOperator after task K in DAG 1 to trigger DAG 2 after sensor passes True.
   > 
   > You can have BranchPythonOperator in DAG 2 to decide if it needs to just run L & M, or run the other branch whose 1st task can be TriggerDagRunOperator to run DAG 1. But note that this can end up in an always-True loop, so make sure that your BranchPythonOperator in DAG 2 has the correct logic.
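   The branch decision described in the quote can be sketched in plain Python. All names here (`run_l_and_m`, `trigger_dag_1`, the callable itself) are hypothetical; in Airflow, this function would be passed to a `BranchPythonOperator` as its `python_callable`, and it returns the `task_id` of the branch to follow:

   ```python
   # Hypothetical branch-decision logic for DAG 2, as described in the
   # quote above. In Airflow, a BranchPythonOperator calls a function
   # like this and follows only the branch whose task_id is returned.

   def choose_branch(upstream_succeeded: bool) -> str:
       """Decide whether DAG 2 only needs to run L & M, or must first
       re-trigger DAG 1 via a TriggerDagRunOperator task."""
       if upstream_succeeded:
           return "run_l_and_m"   # just run L >> M
       return "trigger_dag_1"     # re-run DAG 1 first

   # Guard against the always-True loop the quote warns about: the
   # condition must depend on real upstream state, not a constant.
   assert choose_branch(True) == "run_l_and_m"
   assert choose_branch(False) == "trigger_dag_1"
   ```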
   
   Hi @kaxil, instead of `TriggerDagRunOperator`, what do you think about extracting the sub dag `A, C, E, ..., J` and putting it into a `SubDagOperator`? I.e. like this:
   ```
   A >> C >> E >> G >> H >> I >> J >> K >> L >> M >> Finish
                  ^                   ^          
                  |                   |         
    B >> D >> F >>>>                   |
                                      |
   Sensor >> SubDag_A_to_J >>>>>>>>>>>
   
   
   where SubDag_A_to_J is a SubDagOperator containing these tasks:
   A >> C >> E >> G >> H >> I >> J
   ```
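   To sanity-check the wiring, the layout above can be modelled in plain Python (no Airflow needed); reading the second arrow in the diagram as `SubDag_A_to_J` feeding back into `K` is my interpretation:

   ```python
   # A plain-Python model of the DAG layout sketched above, to check
   # the wiring. Task names come from the diagram; the edge from
   # SubDag_A_to_J into K is an assumed reading of the arrows.

   edges = [
       ("A", "C"), ("C", "E"), ("E", "G"), ("G", "H"), ("H", "I"),
       ("I", "J"), ("J", "K"), ("K", "L"), ("L", "M"), ("M", "Finish"),
       ("B", "D"), ("D", "F"), ("F", "G"),
       ("Sensor", "SubDag_A_to_J"), ("SubDag_A_to_J", "K"),
   ]

   def downstream(task):
       """All tasks reachable from `task` by following edges."""
       seen, stack = set(), [task]
       while stack:
           node = stack.pop()
           for src, dst in edges:
               if src == node and dst not in seen:
                   seen.add(dst)
                   stack.append(dst)
       return seen

   # The rerun path rejoins the main DAG: everything after K, including
   # Finish, sits downstream of the sensor-triggered sub dag.
   assert "Finish" in downstream("Sensor")
   assert "L" in downstream("SubDag_A_to_J")
   ```

   This is the property that makes the `SubDagOperator` approach attractive: the rerun path stays inside one DAG, so downstream tasks still depend on it.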
   The user experience with `SubDagOperator` for this purpose seems great. Users see the rerun abstracted into a single node on the main DAG, and when they click on that node it zooms into the sub dag, which shows what is being run. And most importantly for us, everything is still tightly linked together on the main DAG. If we need to rerun things historically, it works just fine.
   
   However, some research shows that common advice online is not to use `SubDagOperator` because it has some shortcomings, e.g. [this Astronomer page](https://www.astronomer.io/guides/subdags/). The most-cited reason is that `SubDagOperator` used to cause deadlocks, so its default executor in Airflow 1.10.* was changed to `SequentialExecutor`, which means only one task in the sub dag can run at a time. That is not great. But there are some workarounds offered online, such as [this one](https://medium.com/@team_24989/fixing-subdagoperator-deadlock-in-airflow-6c64312ebb10) that uses a dedicated Celery queue to fix the deadlock problem and thus still lets the `SubDagOperator` run tasks in parallel.
   
   And more interestingly, the latest `SubDagOperator` in the master branch has been made into a `BaseSensorOperator` in [this change](https://github.com/apache/airflow/pull/5498). The docstring indicates the deadlock issue can be fixed by setting the mode to "reschedule". It sounds like the performance issue people were complaining about has already been addressed in the latest Airflow, although that change was not released in 1.10.6. Is that the case? If so, it sounds like we just have to wait a while before we can safely use `SubDagOperator`.
   
   ```
   Although SubDagOperator can occupy a pool/concurrency slot,
   user can specify the mode=reschedule so that the slot will be
   released periodically to avoid potential deadlock.
   ```
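   To illustrate why mode="reschedule" helps, here is a toy simulation (not Airflow code, all names invented): a sensor in "poke" mode holds its pool slot while waiting, whereas in "reschedule" mode it gives the slot back between pokes, so the child tasks it is waiting on can actually run.

   ```python
   # Toy simulation of the deadlock described in the docstring above:
   # a parent task waits on child work while both compete for a pool
   # with a limited number of slots.

   def run_with_pool(slots, parent_mode, child_work=3):
       """Return True if the child work finishes within the budget.
       The parent occupies one slot the whole time in "poke" mode,
       but releases it between checks in "reschedule" mode."""
       remaining = child_work
       for _ in range(10):              # scheduler heartbeats
           free = slots
           if parent_mode == "poke":
               free -= 1                # parent keeps holding its slot
           if free >= 1 and remaining > 0:
               remaining -= 1           # a child task makes progress
           if remaining == 0:
               return True
       return False

   # With a single slot, a poking parent starves its own children
   # (deadlock); in reschedule mode the slot is freed and they finish.
   assert run_with_pool(slots=1, parent_mode="poke") is False
   assert run_with_pool(slots=1, parent_mode="reschedule") is True
   ```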
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services