Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/01 09:14:38 UTC

[GitHub] [airflow] hterik opened a new issue, #24778: "Reversing" the branching concept, to execute tasks based on final downstream dependencies.

hterik opened a new issue, #24778:
URL: https://github.com/apache/airflow/issues/24778

   ### Description
   
   A DAG executes from left-to-right. Upstream tasks trigger downstream tasks once completed.
   In this graph you can add [Branching](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html?highlight=branch#branching), deciding which downstream tasks to follow once an upstream task is completed.
   
   I find this branching concept to be a bit backwards and not utilizing the full potential of a dependency graph like a DAG.
   
   In a simple dag, the current BranchOperators are fairly trivial, for example 
   ```
   t1 >> t2 >> t3 >> resultA
   s1 >> s2 >> s3 >> resultB
   branch_op >> [t1, s1]
   ```
   But in more complex DAGs, where some `sX` and `tX` tasks start depending on each other across paths, you would more often like to choose between `resultA` and `resultB` directly, without having to know the web of dependencies between `sX` and `tY`, and to avoid execution of tasks that are not necessary for the final outcome.
   
   Compare this to a conventional build system, like `make` or `ninja`, where the execution order and branching are decided automatically, based on the dependency graph and the final downstream targets you asked it to build. For example "`make flashimages documentation test`"
   
   ----------------
   
   The decision to build `resultA` or `resultB` can itself be complex and based on runtime data, so I think it needs to be a task itself, where it can evaluate DAG parameters and other data, much like a BranchOperator today.
   
   This is different from [dynamically generating dags](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html?highlight=branch#dynamic-dags), since that is only based on static configuration and can't take params or other per-dagrun information into account.
   
   Is there any way to achieve this with existing constructs? If not, is this a good idea to add, and how much effort would such an implementation require?
   
   ### Use case/motivation
   
   Task-execution avoidance, computed from dependencies already declared in the DAG and chosen by the final outcome the user wants to get out of an individual DAG run, with the possibility to evaluate dagrun parameters and other runtime data in the decision.
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] hterik commented on issue #24778: "Reversing" the branching concept, to execute tasks based on final downstream dependencies.

Posted by GitBox <gi...@apache.org>.
hterik commented on issue #24778:
URL: https://github.com/apache/airflow/issues/24778#issuecomment-1172296081

   > If there is a condition of which downstream task can be skipped then place `ShortCircuitOperator` to achieve that.
   
   What I'm looking for is a condition for which _upstream_ tasks can be skipped :)
   The deciding task must still be run first of course, sort of like the "configure" stage of a build system.
   
   Excuse my paint skills, imagine the following (this is not the same dag as my code from OP):
   ![image](https://user-images.githubusercontent.com/89977373/176893427-0dde322a-8def-49db-8d29-fa85a74645b6.png)
   Here, if one decides that A is the desired outcome, then all of x+s1+s2+t1+t2+t3 must be run, but s3 can be skipped.
   With the BranchOperator, one must run both of the tracks completely, because otherwise there is no way to get s2 into t3, if I'm not mistaken.
   This is the simplest case; real-world dag dependencies can be a lot more complex than that.
   
   I don't have experience with the `ShortCircuitOperator`, will look into it a bit more to understand better.
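
The skip decision described in the comment above can be sketched in plain Python, independent of Airflow. This is only an illustration under stated assumptions: the edge list is reconstructed from the text (x feeds both tracks, s2 feeds t3, s3 leads to B, t3 leads to A) and may not match the picture exactly, and `required_tasks`, `tasks_to_skip` and `choose_targets` are hypothetical helper names, not Airflow APIs.

```python
from typing import Dict, Iterable, List, Set

# Assumed graph: task -> its direct upstream tasks. Reconstructed from the
# description in the comment; the original picture may differ in detail.
UPSTREAM: Dict[str, Set[str]] = {
    "x": set(),
    "s1": {"x"},
    "s2": {"s1"},
    "s3": {"s2"},
    "t1": {"x"},
    "t2": {"t1"},
    "t3": {"t2", "s2"},
    "A": {"t3"},
    "B": {"s3"},
}


def required_tasks(targets: Iterable[str]) -> Set[str]:
    """Return the targets plus every task upstream of them."""
    needed: Set[str] = set()
    stack = list(targets)
    while stack:
        task = stack.pop()
        if task not in needed:
            needed.add(task)
            stack.extend(UPSTREAM[task])
    return needed


def tasks_to_skip(targets: Iterable[str]) -> Set[str]:
    """Return every task that no requested target depends on."""
    return set(UPSTREAM) - required_tasks(targets)


def choose_targets(params: dict) -> List[str]:
    """Hypothetical 'configure' step: pick the outcome from run parameters."""
    return [params.get("outcome", "A")]


if __name__ == "__main__":
    targets = choose_targets({"outcome": "A"})
    print(sorted(required_tasks(targets)))
    print(sorted(tasks_to_skip(targets)))
```

With this assumed graph, asking for A keeps x, s1, s2, t1, t2 and t3 and leaves s3 (and B) to be skipped, which matches the case described in the comment. How such a skip set would be applied to a real DAG run is not covered here.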
   




[GitHub] [airflow] eladkal commented on issue #24778: "Reversing" the branching concept, to execute tasks based on final downstream dependencies.

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #24778:
URL: https://github.com/apache/airflow/issues/24778#issuecomment-1172235003

   > avoid execution of tasks that are not necessary for the final outcome.
   
   If there is a condition of which downstream task can be skipped then place `ShortCircuitOperator` to achieve that.
   
   I didn't fully understand what you are actually proposing and what problem you are facing.
   You describe the problem as complex but didn't provide an actual example of the complexity. Could there be parts in your workflow that don't have to be coupled into a single dag (maybe the complexity could be reduced)?




[GitHub] [airflow] potiuk commented on issue #24778: "Reversing" the branching concept, to execute tasks based on final downstream dependencies.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24778:
URL: https://github.com/apache/airflow/issues/24778#issuecomment-1172637873

   Interesting idea, but I think it's mixing "task graph" with "artifact graph" behaviour. And actually we already have AIP-48 in progress, which (IMHO) implements the "concept" you have in mind in a much better way - without actually changing the upstream/downstream behaviour of Airflow or changing the branching concept.
   
   The mechanism you describe is fine for a "build system", where you decide what "target" you want to achieve, but I believe Airflow DAGs describe the "process", not the "target".
   
   The whole premise of DAGs is to describe what processing tasks should happen, not what "data artifact" we want to achieve as a result of the DAG run. Each step in an Airflow DAG might produce an artifact dataset - even more than one - that might be used inside the DAG, but, unlike with `make`, it might also be used outside of the DAG.
   
   The parallel to "build systems" is wrong, because nodes in a "build system" are the "artifacts" themselves. You specify a "binary", "library", "source" as "nodes" in the graph, describe the relations between them, and say "I want to get this artifact, please find out which other artifacts are needed". An Airflow DAG is different - it does not describe artifacts, it describes tasks - i.e. actions that might produce the artifacts. In a build system you do not specify "I want to run a compilation task on X to receive Y"; you specify "I want to get Y and it needs X", and you let the system figure out what task needs to be run to get from X to Y.
   
   For me, the idea you have is great for describing "data dependencies" (build system) but not "task dependencies" (Airflow). And we are going to implement the use case you talk about without changing Airflow's task dependency model.
   
   This is maybe not as clear, but with https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling being implemented, we will enable what you have in mind - but in a much better way. We are not "reversing" the branching concept. We are adding a "dataset" concept to the existing DAG structure (which is good). This is a big part of bringing data lineage into the Airflow world, and I think this is really what you are after. You are not interested in running "taskA" or "taskB". You are really interested in getting dataset "D1" or "D2" instead, and a way to do that.
   
   By implementing data dependencies and scheduling, and adding OpenLineage on top, we are going to add an option for anyone to ask: "I want to generate the dataset X - which tasks should be run to get it?". I believe this is what you are really asking for here.
   
   There was an excellent, closely related talk by Willy Lulciuc at the Airflow Summit: https://airflowsummit.org/sessions/2022/automating-airflow-backfills-with-marquez/
   
   I strongly recommend watching it.
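
The question quoted above, "I want to generate the dataset X - which tasks should be run to get it?", can be illustrated with a small plain-Python sketch. It is not the AIP-48 or OpenLineage API: the `Task` record, the task names and the `tasks_for_dataset` helper are all hypothetical, and only the dataset names "D1" and "D2" come from the comment. It assumes each task declares the datasets it reads and writes, and that each dataset has at most one producing task.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Task:
    name: str
    inputs: Tuple[str, ...]   # datasets the task reads
    outputs: Tuple[str, ...]  # datasets the task writes


# Hypothetical lineage metadata, for illustration only.
TASKS: List[Task] = [
    Task("extract", (), ("raw",)),
    Task("clean", ("raw",), ("cleaned",)),
    Task("build_d1", ("cleaned",), ("D1",)),
    Task("build_d2", ("raw",), ("D2",)),
]


def tasks_for_dataset(dataset: str, tasks: Sequence[Task] = TASKS) -> List[str]:
    """List the tasks needed to produce `dataset`, producers before consumers."""
    producer: Dict[str, Task] = {out: t for t in tasks for out in t.outputs}
    ordered: List[str] = []
    seen: Set[str] = set()

    def visit(ds: str) -> None:
        task = producer.get(ds)
        if task is None or task.name in seen:
            return  # dataset produced elsewhere, or task already planned
        seen.add(task.name)
        for upstream in task.inputs:
            visit(upstream)
        ordered.append(task.name)  # appended only after its inputs are planned

    visit(dataset)
    return ordered


if __name__ == "__main__":
    print(tasks_for_dataset("D1"))
    print(tasks_for_dataset("D2"))
```

The point of the sketch is the change of viewpoint: the graph is walked from the requested dataset back through producing tasks, while the tasks themselves are still declared as ordinary actions with inputs and outputs.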
   




[GitHub] [airflow] potiuk commented on issue #24778: "Reversing" the branching concept, to execute tasks based on final downstream dependencies.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #24778:
URL: https://github.com/apache/airflow/issues/24778#issuecomment-1172638576

   BTW, converting it to a discussion. This is not a feature request. If anything, this is the start of a discussion at an extremely high level.




[GitHub] [airflow] potiuk closed issue #24778: "Reversing" the branching concept, to execute tasks based on final downstream dependencies.

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #24778: "Reversing" the branching concept, to execute tasks based on final downstream dependencies.
URL: https://github.com/apache/airflow/issues/24778

