Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/01 19:03:42 UTC

[GitHub] [airflow] potiuk commented on issue #24778: "Reversing" the branching concept, to execute tasks based on final downstream dependencies.

potiuk commented on issue #24778:
URL: https://github.com/apache/airflow/issues/24778#issuecomment-1172637873

   Interesting idea, but I think it mixes "task graph" with "artifact graph" behaviour. And we already have AIP-48 in progress, which (IMHO) implements the concept you have in mind in a much better way - without actually changing the upstream/downstream behaviour of Airflow or changing the branching concept.
   
   The mechanism you describe is fine for a "build system", where you decide which "target" you want to achieve, but I believe Airflow DAGs describe the "process", not the "target".
   
   The whole premise of DAGs is to describe what processing tasks should happen, not what "data artifact" we want to achieve as a result of the DAG run. Each of the steps in an Airflow DAG might result in an artifact dataset - even more than one - that might be used inside the DAG, but what makes it different from `make` is that the artifact also might be used outside of the DAG.
   
   The parallel to "build systems" is wrong - because nodes in a "build system" are the "artifacts" themselves. You specify a "binary", "library", "source" as nodes in the graph, describe the relations between them, and tell it "I want to get this artifact, please find out which other artifacts are needed". An Airflow DAG is different - it does not describe artifacts, it describes tasks - i.e. actions that might produce the artifacts. In a build system you do not specify "I want to run a compilation task on X to receive Y"; you specify "I want to get Y and it needs X", and you let the system figure out which tasks need to be run to get from X to Y.
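   To make the contrast concrete, here is a minimal toy sketch (not real `make` or Airflow code - all names are made up for illustration) of how a build system resolves an artifact graph: you only name the target artifact, and the tool derives which tasks to run.

```python
# Toy artifact-graph resolver: nodes are artifacts, tasks are derived.
def tasks_for(target, rules, done=None):
    """Return the ordered list of tasks needed to produce `target`.

    `rules` maps artifact -> (prerequisite artifacts, producing task).
    Artifacts with no rule (e.g. source files) are assumed to exist already.
    """
    if done is None:
        done = set()
    if target in done or target not in rules:
        return []
    order = []
    prereqs, task = rules[target]
    for p in prereqs:
        order += tasks_for(p, rules, done)
    done.add(target)
    order.append(task)
    return order

# Hypothetical graph: "app" is linked from "lib.o", which is compiled from "lib.c".
rules = {
    "app":   (["lib.o"], "link"),
    "lib.o": (["lib.c"], "compile"),
}
print(tasks_for("app", rules))  # -> ['compile', 'link']
```

   The user asked for an artifact ("app"), never for a task - the tasks fall out of the graph. An Airflow DAG states the opposite: the tasks and their order, with the artifacts as side effects.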
   
   For me, the idea you have is great for describing "data dependencies" (build system) but not "task dependencies" (Airflow). And we are going to implement the use-case you talk about without changing Airflow's task dependency model.
   
   This is maybe not as clear yet, but with https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling being implemented, we will enable what you have in mind - but in a much better way. We are not "reversing" the branching concept; we are adding a "dataset" concept to the existing DAG structure (which is good). This is a big part of bringing data lineage into the Airflow world, and I think this is really what you are after: you are not interested in running "taskA" or "taskB" - you are really interested in getting dataset "D1" or "D2", and in a way to do that.
   
   By implementing data dependencies and scheduling, and adding OpenLineage on top, we are going to add an option for anyone to ask "I want to generate dataset X - which tasks should be run to get it?". I believe this is what you are really asking for here.
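   A rough sketch of that reverse query (hypothetical names throughout - this is not the AIP-48 API, just the idea): once each task records which datasets it produces, "which tasks produce dataset X" becomes a reverse walk over the ordinary task dependencies.

```python
# Toy model: tasks declare dataset "outlets"; DAG edges stay task-to-task.
task_outlets = {            # task -> datasets it produces
    "extract": ["D_raw"],
    "clean":   ["D1"],
    "report":  ["D2"],
}
task_upstream = {           # task -> tasks it depends on (the DAG edges)
    "clean":  ["extract"],
    "report": ["clean"],
}

def tasks_to_produce(dataset):
    """Tasks, in run order, whose execution yields `dataset`."""
    producers = [t for t, outs in task_outlets.items() if dataset in outs]
    needed, stack = [], list(producers)
    while stack:
        t = stack.pop()
        if t not in needed:
            needed.append(t)
            stack.extend(task_upstream.get(t, []))
    return list(reversed(needed))

print(tasks_to_produce("D2"))  # -> ['extract', 'clean', 'report']
```

   Note that nothing about the DAG's upstream/downstream semantics changed - the dataset layer sits on top of it, which is exactly the point.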
   
   There was an excellent, very related talk by Willy Lulciuc at the Airflow Summit: https://airflowsummit.org/sessions/2022/automating-airflow-backfills-with-marquez/
   
   I strongly recommend watching it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org