Posted to dev@airflow.apache.org by Ash Berlin-Taylor <as...@apache.org> on 2022/02/17 17:06:18 UTC

[DISCUSS] AIP-48 Data Dependency Management and Data Driven Scheduling

Hi everyone,

I'd like to start a discussion about a new AIP that we've been thinking
about at Astronomer, and that has been kicking around our heads since
before I started preparing my keynote for Airflow Summit 2021! At the
time I called it a "New Concept: Data object".

<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling>

This AIP has gone through a number of rounds of growing and shrinking
before we finally settled on what we think is the core of the idea: it
addresses a real need of our users right away, and it gives us the
foundation to add lots of cool features in the future.

In an attempt to distil the essence of the AIP for those of the tl;dr 
persuasion:

We want to make Airflow aware of the datasets that tasks and DAGs 
consume and produce.

We want to allow DAGs to be triggered when datasets are updated, and no
longer only by time-based schedules.

We would like to lay the foundation for automatic data movement (reading
and writing).

There is a lot more detail in the AIP, but this is the core of our idea.
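
As a rough illustration of the second point (this is not code from the AIP,
and it is not Airflow's API), here is a toy, pure-Python sketch of
data-driven triggering. The names Dataset, ToyScheduler, register_dag and
dataset_updated are all hypothetical, chosen only for this example: DAGs
declare which datasets they consume, and an update to a dataset triggers
every DAG that consumes it.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in: a dataset identified only by its URI."""
    uri: str


@dataclass
class ToyScheduler:
    """Toy model of data-driven scheduling (not Airflow's scheduler)."""
    # Maps each dataset to the ids of the DAGs that consume it.
    consumers: dict = field(default_factory=lambda: defaultdict(list))
    # Ids of the DAGs triggered so far, in trigger order.
    triggered: list = field(default_factory=list)

    def register_dag(self, dag_id, consumes):
        """Record that `dag_id` should run whenever one of `consumes` changes."""
        for dataset in consumes:
            self.consumers[dataset].append(dag_id)

    def dataset_updated(self, dataset):
        """Called when a producing task has finished writing `dataset`.

        Returns the ids of the DAGs triggered by this update.
        """
        to_run = list(self.consumers.get(dataset, []))
        self.triggered.extend(to_run)
        return to_run


orders = Dataset("s3://warehouse/orders")
scheduler = ToyScheduler()
scheduler.register_dag("build_report", consumes=[orders])
scheduler.register_dag("train_model", consumes=[orders])

# A producer task reports that it has updated the dataset; both
# consuming DAGs are triggered, with no time-based schedule involved.
runs = scheduler.dataset_updated(orders)
```

The sketch deliberately ignores everything that makes the real problem
hard (DAGs that wait on several datasets, updates arriving during a run,
persistence); those details are what the AIP itself is about.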

We'd love your feedback, and none of the code shown is set in stone, so
I'm happy to hear ideas on how to improve the DAG writer experience.

Ash and Vikram,
Astronomer.io


Re: [DISCUSS] AIP-48 Data Dependency Management and Data Driven Scheduling

Posted by Jarek Potiuk <ja...@potiuk.com>.
I finally had a look.

TL;DR: I really like where this is going, and the OpenLineage unification
(mentioned in the comment by Ash) is cool. But I have a few
questions/concerns that I raised.

I added my comments in the document, but to summarize briefly:

* I think there is a (slight) ambiguity in the definition of
DataSetReference. A URI with authentication info and one without are a bit
different conceptually (with it, it is really a DataProvider reference;
without it, a DataSetReference). Clarifying this and finding good names for
both might help resolve the ambiguity.

* I think a big (and IMHO not yet answered) question is how the "dataset"
schedule (which is essentially task-based) interacts with the "dag" schedule
(which no longer necessarily needs to be time-based). It's not really clear
how we are going to handle cases where different tasks use different
datasets, or what the schedule will look like when datasets change during
the execution of a DagRun. I have a feeling we should discuss and define
this before the AIP is "ready", at least to clarify what the behaviour is
in this case.

* One other important comment, something to consider: I think we should
make an inventory of the (realistic) cases that the current proposal
(single input -> single output per task) will handle. While I think the
vast majority of tasks will be single input -> single output, the
restriction will severely limit the use cases that can be handled as "whole
DAGs" if not even a single task can take multiple inputs or produce
multiple outputs. And handling multiple inputs/outputs is likely not that
much more complex in the end. I am afraid we might end up with a reverse
Pareto rule: rather than "achieve 80% of the result with 20% of the
effort", we will "achieve 20% of the result while spending 80% of the
effort". But I think a simple inventory, with examples of the cases that
would be good to handle (and those that are impossible to handle) with
single input -> single output, might give us a better idea (because right
now it is just my intuition, nothing more).
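
To make the first point above more concrete, here is a minimal sketch
using only Python's standard urllib.parse. The function name
split_reference and the example URI are hypothetical, and this is not how
the AIP defines DataSetReference; it only shows that the credentials can be
separated from the part of the URI that identifies the data.

```python
from urllib.parse import urlsplit, urlunsplit


def split_reference(uri):
    """Split a URI into (dataset_uri, credentials).

    dataset_uri has any user:password part removed; credentials is a
    (username, password) tuple, or None if the URI carries no auth info.
    Note: this sketch does not handle IPv6 host literals.
    """
    parts = urlsplit(uri)
    # Rebuild the network location from host and port only.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    dataset_uri = urlunsplit(
        (parts.scheme, host, parts.path, parts.query, parts.fragment)
    )
    credentials = (parts.username, parts.password) if parts.username else None
    return dataset_uri, credentials


with_auth = split_reference("postgres://user:secret@db.example.com:5432/sales/orders")
without_auth = split_reference("postgres://db.example.com:5432/sales/orders")
```

Both calls yield the same dataset URI, which is the sense in which the
credential-free form identifies the dataset while the credentials describe
how to reach it.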

Those are the "bigger" comments I have for now; more details are in the
AIP. Happy to discuss, of course :)

J.


On Thu, Feb 24, 2022 at 2:30 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Definitely. I will have a close look soon, after the short break I had :).
>
> J
>
> On Wed, Feb 23, 2022 at 9:12 PM Kaxil Naik <ka...@gmail.com> wrote:
>
>> Looking forward to it. We have been talking about making Airflow
>> data-aware for a long time :)

Re: [DISCUSS] AIP-48 Data Dependency Management and Data Driven Scheduling

Posted by Jarek Potiuk <ja...@potiuk.com>.
Definitely. I will have a close look soon, after the short break I had :).

J

On Wed, Feb 23, 2022 at 9:12 PM Kaxil Naik <ka...@gmail.com> wrote:

> Looking forward to it. We have been talking about making Airflow
> data-aware for a long time :)

Re: [DISCUSS] AIP-48 Data Dependency Management and Data Driven Scheduling

Posted by Kaxil Naik <ka...@gmail.com>.
Looking forward to it. We have been talking about making Airflow data-aware
for a long time :)
