Posted to dev@airflow.apache.org by Pablo Estrada <pa...@google.com.INVALID> on 2022/07/17 06:43:54 UTC

Wiki access please?

Hi there!
I would like to start a discussion of an idea that I had for a testing
framework for airflow.
I believe the first step would be to write up an AIP - so could I have
access to write a new one on the cwiki?

Thanks!
-P.

Re: Wiki access please?

Posted by Jarek Potiuk <ja...@potiuk.com>.
Ah. I only saw your answer now, Austin - sorry for that; I am catching up
only now after a deep dive into several issues.

TL;DR: it's just a much longer explanation of why I think it should start
outside (without ruling out the possibility that we will bring it in later).

Pablo, Austin,

> Sounds somewhat like a question of whether to grow the tent of
> contributors, committers, pmc of what is deemed to be "Airflow" (capital
> "A" and in)?  Or err towards things manageable for the existing committers,
> pmc?  With more things deemed not-in, would adding new blood to the project
> be more difficult?

No, not really. Airflow consists of the core "Airflow", where most
contributors and committers commit their code, and of - optional -
providers. Providers are an optional feature of Airflow, and I think our
goal is not only to have Airflow "community"-managed providers for things
that are useful to a vast number of users, but also to build a thriving
ecosystem of people who build their own providers - simply because Airflow
is so popular and is the "backbone" of data processing orchestration. And
we have plenty of room for anyone who wants to contribute to either.

And conceptually, there is no need for "less popular" providers to be part
of the community. We already have 70+ community-managed providers that are
rather popular. There are also many providers that people develop outside
of Airflow - some of them, if they gain popularity, might eventually be
contributed to the community; others, especially those integrating "niche"
services, are likely better off staying outside. I think that is a pretty
natural split.

The existing community providers are, IMHO, a HUGE asset of Airflow. The
fact that when you start using Airflow it comes with AWS, Google,
Databricks and 70+ other integrations that you know you can rely on -
because they are maintained by the community - is a huge selling point for
Airflow as a platform of choice. But there is a law of diminishing returns:
the more "less important" providers we add to the community code, the less
value they bring and the more maintenance burden they cause. I personally
think we have reached the point where we MUST weigh both when we decide
whether to accept new code. Not everyone realises it, but code is more
often a liability than an asset. Accepting new code is not "benefits only":
it often slows you down, limits what you can do and carries the risk of
angry users flooding you with issues when your change breaks their
workflow. We have already refused a few code donations on those grounds -
most notably, CWL (Common Workflow Language) wanted to donate their
integration and we declined, because the cost of maintaining it would far
outweigh the benefits. We decided that if CWL is interested in maintaining
the integration, it is better that it stays with them rather than us taking
over its maintenance. The most important aspect is that this was a
deliberate decision, taken after a long debate and after considering
various voices (you can look it up in the devlist). The tests that Pablo
mentions are rather similar to the CWL case. We might or might not choose
to accept them into the community, but there is literally no problem in
starting outside; once we see how useful it is and how much burden it
brings, we can decide whether to accept it or not. In the case of CWL, when
they approached us it was already working - not only a POC, but something
they had iterated over and developed, because they found it super useful to
integrate and run CWL workflows on Airflow. But it was Airflow 1.10-only at
that time, and it would have required quite an effort to make it work with
Airflow 2.0 and keep up with newer releases. We just felt it was not worth
taking the extra burden onto the community. It would slow us down.

Of course, one could say that this might make Pablo and others less
interested in developing it, knowing that it might not make it into the
"community managed code". But if that is the only motivation, it is a bad
motivation in the first place. If we know that something is useful and we
see that the community will benefit, that should be the main motivator. If
it turns out to be useful and has great value, we will accept it; if not,
maybe it was not a great idea in the first place. The bottom line is that
we are weighing one or a few people interested in making things better for
the community, risking their own time and effort, against the whole team of
maintainers committing to something whose maintenance cost and benefit are
difficult to assess. I think if you believe in the idea, taking on the cost
of developing and showing something like that outside initially is a good
approach - at least to the point where we can see it, see how it can be
used, and see what it can bring.

Side comment: we are gearing up for splitting providers out of the "airflow
core" technically (i.e. putting them in separate repositories). I spoke
about some of the challenges this involves with a number of people. One of
them is that, once we split the repo, separate repos might make people less
inclined to contribute. I think we will be able to avoid that, but it will
require really good communication and building some kind of "umbrella"
experience, so that people who only contribute to a provider do not feel
like lesser "Airflow contributors". This is actually one of the most
important problems to solve in the whole split: how do we keep people
"Airflow" contributors even if they contribute to separate repositories? I
already have a few ideas, but let's leave that for a separate discussion. I
have also spoken to a few ASF old-timers already (I have been preparing for
the split for about a year now, talking to a number of people, and I am
going to have a lot of discussions about it at ApacheCon in New Orleans in
a month), and I even heard voices that 3rd-party integrations should not be
part of the ASF code in the first place. I do not actually agree with that,
but you can see how opinions vary. I would rather think now about how we
can keep the current "popular" providers in, and keep the community around
them when we split, than grow the number of providers when we do not see a
clear "need" from the wider community to bring a provider in. And knowing
that the split is coming, would it make a huge difference whether a new
provider is added to a new "apache/airflow-xxxx" repo or kept in
"xxxx/airflow-provider"? The same goes for the "test" framework. It is
likely that by the time it gets contributed we will already have the
multi-repo structure in place, and bringing it in might simply be easier -
just literally transferring the repo and plugging in our CI. I am preparing
very solid ground for that now. So starting separately might be the way to
go first.

I also think it has nothing to do with growing the number of contributors
and committers. I am not sure if you are aware of it, Austin, but Airflow
has by far the biggest number of contributors of all Apache Software
Foundation projects. By FAR. We are the TOP 1 project. We overtook Apache
Spark in November 2021 (both projects had ca. 1800 contributors on GitHub
at the time). The numbers today are 1829 for Spark and 2180 for Airflow. We
are FLYING when it comes to growing our contributor base. We are also
continuously adding new committers and PMC members.

J.


On Wed, Aug 10, 2022 at 9:35 PM Austin Bennett <wh...@gmail.com>
wrote:

> If much is out, rather than in, is there a different pool from where you
> will draw contributors and eventually committers/pmc?
>
> Sounds somewhat like a question of whether to grow the tent of
> contributors, committers, pmc of what is deemed to be "Airflow" (capital
> "A" and in)?  Or err towards things manageable for the existing committers,
> pmc?  With more things deemed not-in, would adding new blood to the project
> be more difficult?
>
>
>
> On Sat, Aug 6, 2022, 9:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> What do you think, Pablo about the "being out" vs. "being in" the
>> official repo?
>>
>> On Thu, Jul 28, 2022 at 3:51 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> Anyone :) ?
>>>
>>> On Mon, Jul 18, 2022 at 10:38 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> I would love to hear what others think about the "in/out" approach -
>>>> mine is just the line of thought I've been exploring during the last few
>>>> months, where I formed my own views about providers, maintenance, the
>>>> incentives of entities maintaining open-source projects, and especially
>>>> the expectations it creates among users. But those are just my thoughts
>>>> and I'd love to hear what others think.
>>>>
>>>> On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> I had some thoughts about it - this is also connected with recent
>>>>> discussions about mixed governance for providers, and I think it's worth
>>>>> using this discussion to set some rules and "boundaries" on when,
>>>>> how and especially why we want to accept some contributions, while for
>>>>> some other contributions it's better that they stay outside.
>>>>>
>>>>> We are about to start thinking (and discussing) more seriously about
>>>>> how to split the Airflow providers off airflow. And I think we can split
>>>>> off more than providers - this might be a good candidate for a
>>>>> standalone, but still community-maintained, package. If we are going to
>>>>> solve the problem of splitting airflow into N packages, one more package
>>>>> does not matter. And it would nicely solve "version independence". We
>>>>> could even make it airflow 2.0+ compatible if we want.
>>>>>
>>>>> So I think the question of "is it tied to a specific airflow version
>>>>> or not" does not really prevent us from making it part of the community
>>>>> - those two are not related (if we are going to have more repositories
>>>>> anyway).
>>>>>
>>>>> The important part is really how "self-servicing" we can make it, how
>>>>> we make sure it stays relevant for future versions of Airflow, and who
>>>>> does that - namely, who has the incentive and "responsibility" to
>>>>> maintain it. I am sure we will add more features to Airflow DAGs and
>>>>> simplify the way DAGs are written over time, and the test harness will
>>>>> have to adapt to that.
>>>>>
>>>>> There are pros and cons of having such a standalone package "in the
>>>>> community/ASF project" and "out of it". We have a good example (from a
>>>>> similar kind of tool/util) in the past that we can learn from (and maybe
>>>>> Bas can share more insights):
>>>>>
>>>>> https://github.com/BasPH/pylint-airflow - pylint plugin for
>>>>> Airflow DAGs
>>>>>
>>>>> Initially that was "sponsored" by GoDataDriven, where Bas worked and
>>>>> where, I think, it was born. That made sense, as it was likely also
>>>>> useful for the customers of GoDataDriven (here I am guessing). But
>>>>> apparently GoDataDriven's incentives wound down, and it turned out that
>>>>> it was not as useful as hoped (also, I think we all in the Python
>>>>> community learned that Pylint is more of a distraction than real help -
>>>>> we dumped Pylint eventually). The plugin was not maintained beyond some
>>>>> versions of 1.10, and the tool is all but defunct now. Which is
>>>>> perfectly understandable.
>>>>>
>>>>> In this case there is (I think) no risk of a "pylint" like problem,
>>>>> but the question of maintenance and adaptation to future versions of
>>>>> Airflow remains.
>>>>>
>>>>> I think there is one big difference between something that is "in ASF
>>>>> repos" and "out":
>>>>>
>>>>> * if we make it a standalone package in the "asf airflow community", we
>>>>> will have some obligation, and expectations from our users, to maintain
>>>>> it. We can add a test harness (regardless of whether it lives in the
>>>>> airflow repository or in a separate one) to make sure that new airflow
>>>>> "core" changes will not break it, and we can fail our PRs if they do -
>>>>> basically making "core" maintainers take care of this problem rather
>>>>> than delegating to someone else the job of reacting to core changes
>>>>> (this is what has to happen with providers, I believe, even if we split
>>>>> them into a separate repo). I think anything that we as the ASF
>>>>> community release should have such harnesses - making sure that
>>>>> whatever we release and make available to our users works together.
>>>>>
>>>>> * if it is outside of the "ASF community", someone else will have to
>>>>> react to "core airflow" changes. We will not do it in the community, we
>>>>> will not pay attention, and such an "external tool" might break at any
>>>>> time because we introduced a change in a part of the core that the
>>>>> external tool implicitly relied on.
>>>>>
>>>>> For me, the question of whether something should be in or out should
>>>>> be based on:
>>>>>
>>>>> * is it really useful for the community as a whole? -> if yes, we
>>>>> should consider it
>>>>> * is it strongly tied to the core of airflow, in the sense of relying
>>>>> on internals that might change easily? -> if not, there is no need to
>>>>> bring it in; it can easily be maintained outside by anyone
>>>>> * if it is strongly tied to the core -> is there someone (a person or
>>>>> organisation) who wants to take on the burden of maintaining it and has
>>>>> the incentive to do so for quite some time? -> if yes, great, let them
>>>>> do that!
>>>>> * if it is strongly tied, do we as "core airflow maintainers" want to
>>>>> take on the burden of keeping it updated together with the core? -> if
>>>>> yes, we should bring it in
>>>>>
>>>>> If we have a strongly tied tool that we do not want to maintain in the
>>>>> core and there is no entity who would like to do it, then I think this idea
>>>>> should be dropped :).
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>> On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <pi...@umich.edu> wrote:
>>>>>
>>>>>> Hi Pablo,
>>>>>>
>>>>>> Wow, I really love this idea. This will greatly enrich the airflow
>>>>>> ecosystem.
>>>>>>
>>>>>> I agree with Ash, it is better to have it as a standalone package.
>>>>>> And we can use this framework to write airflow core invariants tests, so
>>>>>> that we will run them on every airflow release to guarantee no regressions.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Ping
>>>>>>
>>>>>>
>>>>>> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada
>>>>>> <pa...@google.com.invalid> wrote:
>>>>>>
>>>>>>> Understood!
>>>>>>>
>>>>>>> TL;DR: I propose a testing framework where users can check for 'DAG
>>>>>>> execution invariants' or 'DAG execution expectations' given certain task
>>>>>>> outcomes.
>>>>>>>
>>>>>>> As DAGs grow in complexity, it can become difficult to reason about
>>>>>>> their runtime behavior across many scenarios. Users may want to lay
>>>>>>> out rules, in the form of tests, that verify DAG execution results.
>>>>>>> For example:
>>>>>>>
>>>>>>> - If any of my database_backup_* tasks fails, I want to ensure that
>>>>>>> at least one email_alert_* task will run.
>>>>>>> - If my 'check_authentication' task fails, I want to ensure that the
>>>>>>> whole DAG will fail.
>>>>>>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>>>>>>> PubsubOperator downstream will always run.
>>>>>>>
>>>>>>> These sorts of invariants don't require the DAG to be executed, yet
>>>>>>> they are pretty hard to test today: staging environments can't
>>>>>>> exercise every possible runtime outcome.
>>>>>>>
>>>>>>> In this framework, users would define unit tests like this:
>>>>>>>
>>>>>>> ```
>>>>>>> def test_my_example_dag():
>>>>>>>     the_dag = models.DAG(
>>>>>>>         'the_basic_dag',
>>>>>>>         schedule_interval='@daily',
>>>>>>>         start_date=DEFAULT_DATE,
>>>>>>>     )
>>>>>>>
>>>>>>>     with the_dag:
>>>>>>>         op1 = EmptyOperator(task_id='task_1')
>>>>>>>         op2 = EmptyOperator(task_id='task_2')
>>>>>>>         op3 = EmptyOperator(task_id='task_3')
>>>>>>>
>>>>>>>         op1 >> op2 >> op3
>>>>>>>
>>>>>>>     # DAG invariant: if task_1 and task_2 succeed, then task_3 will
>>>>>>>     # always run
>>>>>>>     assert_that(
>>>>>>>         given(the_dag)
>>>>>>>             .when(task('task_1'), succeeds())
>>>>>>>             .and_(task('task_2'), succeeds())
>>>>>>>             .then(task('task_3'), runs()))
>>>>>>> ```
>>>>>>>
>>>>>>> This is a very simple example - and it's not great, because it only
>>>>>>> duplicates the DAG logic - but you can see more examples in my
>>>>>>> draft PR
>>>>>>> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
>>>>>>> and in my draft AIP
>>>>>>> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
>>>>>>> [2].
>>>>>>>
>>>>>>> I started writing up an AIP in a Google doc[2] which y'all can
>>>>>>> check. It's very close to what I have written here : )
>>>>>>>
>>>>>>> LMK what y'all think. I am also happy to publish this as a separate
>>>>>>> library if y'all wanna be cautious about adding it directly to Airflow.
>>>>>>> -P.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>>>>>> [2]
>>>>>>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <ja...@potiuk.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>>>>>>
>>>>>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > Hi Pablo,
>>>>>>>> >
>>>>>>>> > Could you describe at a high level what you are thinking of? It's
>>>>>>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>>>>>>> significant enough to need an AIP.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Ash
>>>>>>>> >
>>>>>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada
>>>>>>>> <pa...@google.com.INVALID> wrote:
>>>>>>>> >>
>>>>>>>> >> Hi there!
>>>>>>>> >> I would like to start a discussion of an idea that I had for a
>>>>>>>> testing framework for airflow.
>>>>>>>> >> I believe the first step would be to write up an AIP - so could
>>>>>>>> I have access to write a new one on the cwiki?
>>>>>>>> >>
>>>>>>>> >> Thanks!
>>>>>>>> >> -P.
>>>>>>>>
>>>>>>>
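
The invariant in Pablo's quoted example can be checked without executing the
DAG at all. Below is a minimal, self-contained sketch of that idea - purely
illustrative, not the draft-PR API; all helper names (`will_run`,
`check_invariant`) and the outcome model are hypothetical. It enumerates
possible task outcomes and propagates Airflow's default "all upstream
succeeded" (all_success) trigger rule:

```python
# Hypothetical sketch: statically checking a "DAG execution invariant" by
# enumerating task outcomes over the dependency graph, without running tasks.
from itertools import product

def will_run(task, outcomes, upstream):
    # Default Airflow trigger rule (all_success): a task runs only when
    # every upstream task finished with success.
    return all(outcomes.get(u) == 'success' for u in upstream[task])

def check_invariant(upstream, given, expect_runs):
    """True iff, for every outcome assignment consistent with `given`,
    each task in `expect_runs` would be scheduled."""
    free = [t for t in upstream if t not in given and t not in expect_runs]
    for combo in product(['success', 'failed'], repeat=len(free)):
        outcomes = {**given, **dict(zip(free, combo))}
        if not all(will_run(t, outcomes, upstream) for t in expect_runs):
            return False
    return True

# task_1 >> task_2 >> task_3, expressed as an upstream map
upstream = {'task_1': [], 'task_2': ['task_1'], 'task_3': ['task_2']}

# Invariant from the example: if task_1 and task_2 succeed, task_3 always runs.
print(check_invariant(
    upstream,
    given={'task_1': 'success', 'task_2': 'success'},
    expect_runs=['task_3']))  # True

# Counterexample: if task_1 fails, task_3 is no longer guaranteed to run.
print(check_invariant(
    upstream,
    given={'task_1': 'failed'},
    expect_runs=['task_3']))  # False
```

Note this sketch over-approximates the reachable outcome space (it does not
rule out infeasible combinations, such as a task "succeeding" after its
upstream failed), so it is conservative; a real implementation would
propagate trigger rules forward to prune those cases and would support the
other trigger rules (one_failed, all_done, etc.).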

Re: Wiki access please?

Posted by Austin Bennett <wh...@gmail.com>.
If much is out, rather than in, is there a different pool from where you
will draw contributors and eventually committers/pmc?

Sounds somewhat like a question of whether to grow the tent of
contributors, committers, pmc of what is deemed to be "Airflow" (capital
"A" and in)?  Or err towards things manageable for the existing committers,
pmc?  With more things deemed not-in, would adding new blood to the project
be more difficult?

Re: Wiki access please?

Posted by Jarek Potiuk <ja...@potiuk.com>.
What do you think, Pablo about the "being out" vs. "being in" the
official repo?

On Thu, Jul 28, 2022 at 3:51 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Anyone :) ?
>
> On Mon, Jul 18, 2022 at 10:38 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> I would love to hear what others think about the "in/out" approach - mine
>> is just the line of thoughts I've been exploring during the last few months
>> where I prepared my own line of thought about providers, maintenance,
>> incentive of entities maintaining open-source projects, and especially -
>> expectations of the users that it creates. But those are just my thoughts
>> and I'd love to hear what others think about it.
>>
>> On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> I had some thoughts about it - this also connected with recent
>>> discussions about mixed governance for providers, and I think it's worth
>>> using this discussion to set some rules and "boundaries" on when and
>>> how and especially why we want to accept some contributions, while for some
>>> other contributions it's better to be outside.
>>>
>>> We are about to start more seriously thinking (and discussing) on how to
>>> split Airflow providers off airflow. And I think we can split off more than
>>> providers - this might be a good candidate to be a standalone, but still
>>> community maintained package. If we are going to solve the problem of
>>> splitting airflow to N packages, one more package does not matter.
>>> And it would nicely solve "version independence". We could even make it
>>> airflow 2.0+ compliant if we want.
>>>
>>> So I think while the question of "is it tied with a specific airflow
>>> version or not" does not really prevent us from making it part of community
>>> - those two are not related (if we are going to have more repositories
>>> anyway)
>>>
>>> The important part is really how "self-servicing" we can make it, how
>>> we make sure it stays relevant with future versions of Airflow, and who
>>> does that - namely, who has the incentive and "responsibility" to maintain
>>> it. I am sure we will add more features to Airflow DAGs and simplify the
>>> way DAGs are written over time, and the test harness will have to adapt to
>>> it.
>>>
>>> There are pros and cons of having such a standalone package "in the
>>> community/ASF project" vs. "out of it". We have a good example (of a
>>> similar kind of tool/util) in the past that we can learn from (and maybe
>>> Bas can share more insights).
>>>
>>> https://github.com/BasPH/pylint-airflow - pylint plugin for Airflow DAGs
>>>
>>> Initially that was "sponsored" by GoDataDriven where Bas worked and I
>>> think this is where it was born. And that made sense as it was likely also
>>> useful for the customers of GoDataDriven (here I am guessing). But
>>> apparently GoDataDriven's incentives wound down and it turned out
>>> that its usefulness was not as big (also I think we all in the Python
>>> community learned that Pylint is more of a distraction than real help) - we
>>> dumped Pylint eventually and the plugin was not maintained beyond some
>>> versions of 1.10. And the tool is all but defunct now. Which is perfectly
>>> understandable.
>>>
>>> In this case there is (I think) no risk of a "pylint" like problem, but
>>> the question of maintenance and adaptation to future versions of Airflow
>>> remains.
>>>
>>> I think there is one big difference between something that is "in ASF
>>> repos" and "out":
>>>
>>> * if we make it a standalone package in the "asf airflow community" - we
>>> will have some obligation and expectation from our users to maintain it.
>>> We can add some test harness (regardless of whether it will be in the
>>> airflow repository or in a separate one) to make sure that new airflow
>>> "core" changes will not break it (and we can fail our PRs if they do -
>>> basically making "core" maintainers take care of this problem rather than
>>> delegating it to someone else to react to core changes; this is what has
>>> to happen with providers I believe even if we split them to a separate
>>> repo). I think anything that we as the ASF community release should have
>>> such harnesses - making sure that whatever we release and make available
>>> to our users works together.
>>>
>>> * if it is outside of the "ASF community", someone will have to react to
>>> "core airflow" changes. We will not do it in the community and we will not
>>> pay attention, so such an "external tool" might break at any time because
>>> we introduced a change in a part of the core that the external tool
>>> implicitly relied on.
>>>
>>> For me, the question of whether something should be in or out should be
>>> based on:
>>>
>>> * is it really useful for the community as a whole? -> if yes we should
>>> consider it
>>> * is it strongly tied with the core of airflow in the sense of relying
>>> on some internals that might change easily? -> if not, there is no need to
>>> bring it in, it can be easily maintained outside by anyone
>>> * if it is strongly tied with the core -> is there someone (a person or
>>> organisation) who wants to take the burden of maintaining it and has
>>> an incentive to do so for quite some time? -> if yes, great, let them do
>>> that!
>>> * if it is strongly tied, do we want to take the burden as "core airflow
>>> maintainers" to keep it updated together with the core? -> if yes,
>>> we should bring it in
>>>
>>> If we have a strongly tied tool that we do not want to maintain in the
>>> core and there is no entity who would like to do it, then I think this idea
>>> should be dropped :).
>>>
>>> J.
>>>
>>>
>>> On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <pi...@umich.edu> wrote:
>>>
>>>> Hi Pablo,
>>>>
>>>> Wow, I really love this idea. This will greatly enrich the airflow
>>>> ecosystem.
>>>>
>>>> I agree with Ash, it is better to have it as a standalone package. And
>>>> we can use this framework to write airflow core invariant tests, so that
>>>> we will run them on every airflow release to guarantee no regressions.
>>>>
>>>> Thanks,
>>>>
>>>> Ping
>>>>
>>>>
>>>> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada
>>>> <pa...@google.com.invalid> wrote:
>>>>
>>>>> Understood!
>>>>>
>>>>> TL;DR: I propose a testing framework where users can check for 'DAG
>>>>> execution invariants' or 'DAG execution expectations' given certain task
>>>>> outcomes.
>>>>>
>>>>> As DAGs grow in complexity, sometimes it might become difficult to
>>>>> reason about their runtime behavior in many scenarios. Users may want to
>>>>> lay out rules in the form of tests that can verify DAG execution results.
>>>>> For example:
>>>>>
>>>>> - If any of my database_backup_* tasks fails, I want to ensure that at
>>>>> least one email_alert_* task will run.
>>>>> - If my 'check_authentication' task fails, I want to ensure that the
>>>>> whole DAG will fail.
>>>>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>>>>> PubsubOperator downstream will always run.
>>>>>
>>>>> These sorts of invariants don't require the DAG to be executed; but in
>>>>> fact, they are pretty hard to test today: staging environments can't
>>>>> check every possible runtime outcome.
>>>>>
>>>>> In this framework, users would define unit tests like this:
>>>>>
>>>>> ```
>>>>> from airflow import models
>>>>> from airflow.operators.empty import EmptyOperator
>>>>>
>>>>> # assert_that, given, task, succeeds, runs come from the proposed framework
>>>>> def test_my_example_dag():
>>>>>     the_dag = models.DAG(
>>>>>         'the_basic_dag',
>>>>>         schedule_interval='@daily',
>>>>>         start_date=DEFAULT_DATE,
>>>>>     )
>>>>>
>>>>>     with the_dag:
>>>>>         op1 = EmptyOperator(task_id='task_1')
>>>>>         op2 = EmptyOperator(task_id='task_2')
>>>>>         op3 = EmptyOperator(task_id='task_3')
>>>>>
>>>>>         op1 >> op2 >> op3
>>>>>
>>>>>     # DAG invariant: if task_1 and task_2 succeed, then task_3 will
>>>>>     # always run
>>>>>     assert_that(
>>>>>         given(the_dag)
>>>>>             .when(task('task_1'), succeeds())
>>>>>             .and_(task('task_2'), succeeds())
>>>>>             .then(task('task_3'), runs()))
>>>>> ```
>>>>>
>>>>> This is a very simple example - and it's not great, because it only
>>>>> duplicates the DAG logic - but you can see more examples in my draft
>>>>> PR
>>>>> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
>>>>> and in my draft AIP
>>>>> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
>>>>> [2].
>>>>>
>>>>> I started writing up an AIP in a Google doc[2] which y'all can check.
>>>>> It's very close to what I have written here : )
>>>>>
>>>>> LMK what y'all think. I am also happy to publish this as a separate
>>>>> library if y'all wanna be cautious about adding it directly to Airflow.
>>>>> -P.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>>>> [2]
>>>>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>>>
>>>>>
>>>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>>
>>>>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>>>>
>>>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org>
>>>>>> wrote:
>>>>>> >
>>>>>> > Hi Pablo,
>>>>>> >
>>>>>> > Could you describe at a high level what you are thinking of? It's
>>>>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>>>>> significant enough to need an AIP.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Ash
>>>>>> >
>>>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada
>>>>>> <pa...@google.com.INVALID> wrote:
>>>>>> >>
>>>>>> >> Hi there!
>>>>>> >> I would like to start a discussion of an idea that I had for a
>>>>>> testing framework for airflow.
>>>>>> >> I believe the first step would be to write up an AIP - so could I
>>>>>> have access to write a new one on the cwiki?
>>>>>> >>
>>>>>> >> Thanks!
>>>>>> >> -P.
>>>>>>
>>>>>
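The given/when/then helpers in Pablo's example above exist only in his draft PR, so they cannot be run against a released Airflow. As a purely illustrative sketch (plain Python; every name below is invented for this illustration and is not the proposed API), the kind of static invariant check being discussed could boil down to propagating assumed task outcomes through the dependency graph:

```python
# Hypothetical sketch - NOT the draft PR's API. Checks a "DAG execution
# invariant" without running the DAG, assuming every task uses the
# default "all_success" trigger rule.

def downstream_runs(upstreams, outcomes, target):
    """Return True if `target` would run given the assumed outcomes of its
    direct upstream tasks.

    upstreams: task_id -> list of direct upstream task_ids
    outcomes:  task_id -> 'success' or 'failed'
    """
    # all_success: a task runs only if every direct upstream task succeeded
    return all(outcomes.get(u) == 'success' for u in upstreams.get(target, []))

# task_1 >> task_2 >> task_3, as in the example DAG above
upstreams = {'task_2': ['task_1'], 'task_3': ['task_2']}

# Invariant: if task_1 and task_2 succeed, task_3 always runs
assert downstream_runs(upstreams, {'task_1': 'success', 'task_2': 'success'}, 'task_3')
# ...and if task_2 fails, task_3 does not run
assert not downstream_runs(upstreams, {'task_1': 'success', 'task_2': 'failed'}, 'task_3')
```

A real framework would of course have to model all of Airflow's trigger rules and branching behaviour, which is exactly the part that is tightly coupled to the core and motivates the in/out maintenance question discussed in this thread.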

Re: Wiki access please?

Posted by Jarek Potiuk <ja...@potiuk.com>.
Anyone :) ?

On Mon, Jul 18, 2022 at 10:38 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> I would love to hear what others think about the "in/out" approach - mine
> is just the line of thoughts I've been exploring during the last few months
> where I prepared my own line of thought about providers, maintenance,
> incentive of entities maintaining open-source projects, and especially -
> expectations of the users that it creates. But those are just my thoughts
> and I'd love to hear what others think about it.
>
> On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> I had some thoughts about it - this also connected with recent
>> discussions about mixed governance for providers, and I think it's worth
>> using this discussion to set some rules and "boundaries" on when and
>> how and especially why we want to accept some contributions, while for some
>> other contributions it's better to be outside.
>>
>> We are about to start more seriously thinking (and discussing) on how to
>> split Airflow providers off airflow. And I think we can split off more than
>> providers - this might be a good candidate to be a standalone, but still
>> community maintained package. If we are going to solve the problem of
>> splitting airflow to N packages, one more package does not matter.
>> And it would nicely solve "version independence". We could even make it
>> airflow 2.0+ compliant if we want.
>>
>> So I think while the question of "is it tied with a specific airflow
>> version or not" does not really prevent us from making it part of community
>> - those two are not related (if we are going to have more repositories
>> anyway)
>>
>> The important part is really how "self-servicing" we can make it and how
>> we make sure it stays relevant with future versions of Airflow and who does
>> it I think - namely who has the incentive and "responsibility" to maintain
>> it. I am sure we will add more features to Airflow DAGs and simplify the
>> way DAGs are written over time, and the test harness will have to adapt to
>> it.
>>
>> There are pros and cons of having such a standalone package "in the
>> community/ASF project" and "out of it". We have a good example (from
>> similar kinds of tools/utils) in the past that we can learn from(and maybe
>> Bas can share more insights).
>>
>> https://github.com/BasPH/pylint-airflow - pylint plugin for Airflow DAGs
>>
>> Initially that was "sponsored" by GoDataDriven where Bas worked and I
>> think this is where it was born. And that made sense as it was likely also
>> useful for the customers of GoDataDriven (here I am guessing). But
>> apparently both GoDataDriven's incentives winded down and it turned out
>> that usefulness of it was not as big (also I think we all in Python
>> community learned that Pylint is more of a distraction than real help - we
>> dumped Pylint eventually and the plugin was not maintained beyond some
>> versions of 1.10. And the tool is all but defunct now. Which is perfectly
>> understandable.
>>
>> In this case there is (I think) no risk of a "pylint" like problem, but
>> the question of maintenance and adaptation to future versions of Airflow
>> remains.
>>
>> I think there is one big differences of something that is "in ASF repos"
>> and "out":
>>
>> * if we make it a standalone package in "asf airflow community" - we will
>> have some obligation and expectations from our users to maintain it. We can
>> add some test harness (regardless if it will be in airflow repository or in
>> a separate one) to make sure that new airflow "core" changes will not break
>> it (and we can fail our PRs if they do - basically making "core"
>> maintainers take care about this problem rather than delegate it to someone
>> else to react on core changes (this is what has to  happen with providers I
>> believe even if we split them to separate repo).  I think anything that we
>> as the ASF community release should have such harnesses - making sure that
>> whatever we release and make available to our users work together.
>>
>> * if it is outside of the "ASF community", someone will have to react to
>> "core airflow" changes. We will not do it in the community, we will not pay
>> attention, such an "external tool" might break at any time because we
>> introduced a change in part of a core that the external tool implicitly
>> relied on.
>>
>> For me the question is whether something should be in/out should be based
>> on :
>>
>> * is it really useful for the community as a whole? -> if yes we should
>> consider it
>> * is it strongly tied with the core of airflow in the sense of relying on
>> some internals that might change easily? -> if not, there is no need to
>> bring it in, it can be easily maintained outside by anyone
>> * if it is strongly tied with the core - > is there someone (person,
>> organisation) who wants to take the burden of maintaining it and has
>> incentive of doing it for quite some time -> if yes, great, let them do
>> that!
>> * if it is strongly tied, do we want to take a burden as "core airflow
>> maintainers" to keep it updated together with the core if it is? -> if yes,
>> we should bring it in
>>
>> If we have a strongly tied tool that we do not want to maintain in the
>> core and there is no entity who would like to do it, then I think this idea
>> should be dropped :).
>>
>> J.
>>
>>
>> On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <pi...@umich.edu> wrote:
>>
>>> Hi Pablo,
>>>
>>> Wow, I really love this idea. This will greatly enrich the airflow
>>> ecosystem.
>>>
>>> I agree with Ash, it is better to have it as a standalone package. And
>>> we can use this framework to write airflow core invariants tests, so that
>>> we will run them on every airflow release to guarantee no regressions.
>>>
>>> Thanks,
>>>
>>> Ping
>>>
>>>
>>> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <pa...@google.com.invalid>
>>> wrote:
>>>
>>>> Understood!
>>>>
>>>> TL;DR: I propose a testing framework where users can check for 'DAG
>>>> execution invariants' or 'DAG execution expectations' given certain task
>>>> outcomes.
>>>>
>>>> As DAGs grow in complexity, sometimes it might become difficult to
>>>> reason about their runtime behavior in many scenarios. Users may want to
>>>> lay out rules in the form of tests that can verify  DAG execution results.
>>>> For example:
>>>>
>>>> - If any of my database_backup_* tasks fails, I want to ensure that at
>>>> least one email_alert_* task will run.
>>>> - If my 'check_authentication' task fails, I want to ensure that the
>>>> whole DAG will fail.
>>>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>>>> PubsubOperator downstream will always run.
>>>>
>>>> These sorts of invariants don't need the DAG to be executed; but in
>>>> fact, they are pretty hard to test today: Staging environments can't check
>>>> every possible runtime outcome.
>>>>
>>>> In this framework, users would define unit tests like this:
>>>>
>>>> ```
>>>> def test_my_example_dag():
>>>>   the_dag = models.DAG(
>>>>         'the_basic_dag',
>>>>         schedule_interval='@daily',
>>>>         start_date=DEFAULT_DATE,
>>>>     )
>>>>
>>>>     with the_dag:
>>>>         op1 = EmptyOperator(task_id='task_1')
>>>>         op2 = EmptyOperator(task_id='task_2')
>>>>         op3 = EmptyOperator(task_id='task_3')
>>>>
>>>>         op1 >> op2 >> op3
>>>>     # DAG invariant: If task_1 and task_2 succeeds, then task_3 will
>>>> always run
>>>>     assert_that(
>>>>             given(thedag)\
>>>>                 .when(task('task_1'), succeeds())\
>>>>                 .and_(task('task_2'), succeeds())\
>>>>                 .then(task('task_3'), runs()))
>>>> ```
>>>>
>>>> This is a very simple example - and it's not great, because it only
>>>> duplicates the DAG logic - but you can see more examples in my draft PR
>>>> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
>>>> and in my draft AIP
>>>> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
>>>> [2].
>>>>
>>>> I started writing up an AIP in a Google doc[2] which y'all can check.
>>>> It's very close to what I have written here : )
>>>>
>>>> LMK what y'all think. I am also happy to publish this as a separate
>>>> library if y'all wanna be cautious about adding it directly to Airflow.
>>>> -P.
>>>>
>>>> [1]
>>>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>>> [2]
>>>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>>
>>>>
>>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>>>
>>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org>
>>>>> wrote:
>>>>> >
>>>>> > Hi Pablo,
>>>>> >
>>>>> > Could you describe at a high level what you are thinking of? It's
>>>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>>>> significant enough to need an AIP.
>>>>> >
>>>>> > Thanks,
>>>>> > Ash
>>>>> >
>>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada
>>>>> <pa...@google.com.INVALID> wrote:
>>>>> >>
>>>>> >> Hi there!
>>>>> >> I would like to start a discussion of an idea that I had for a
>>>>> testing framework for airflow.
>>>>> >> I believe the first step would be to write up an AIP - so could I
>>>>> have access to write a new one on the cwiki?
>>>>> >>
>>>>> >> Thanks!
>>>>> >> -P.
>>>>>
>>>>

Re: Wiki access please?

Posted by Jarek Potiuk <ja...@potiuk.com>.
I would love to hear what others think about the "in/out" approach - mine
is just the line of thoughts I've been exploring during the last few months
where I prepared my own line of thought about providers, maintenance,
incentive of entities maintaining open-source projects, and especially -
expectations of the users that it creates. But those are just my thoughts
and I'd love to hear what others think about it.

On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> I had some thoughts about it - this also connected with recent discussions
> about mixed governance for providers, and I think it's worth using this
> discussion to set some rules and "boundaries" on when and how and
> especially why we want to accept some contributions, while for some other
> contributions it's better to be outside.
>
> We are about to start more seriously thinking (and discussing) on how to
> split Airflow providers off airflow. And I think we can split off more than
> providers - this might be a good candidate to be a standalone, but still
> community maintained package. If we are going to solve the problem of
> splitting airflow to N packages, one more package does not matter.
> And it would nicely solve "version independence". We could even make it
> airflow 2.0+ compliant if we want.
>
> So I think while the question of "is it tied with a specific airflow
> version or not" does not really prevent us from making it part of community
> - those two are not related (if we are going to have more repositories
> anyway)
>
> The important part is really how "self-servicing" we can make it and how
> we make sure it stays relevant with future versions of Airflow and who does
> it I think - namely who has the incentive and "responsibility" to maintain
> it. I am sure we will add more features to Airflow DAGs and simplify the
> way DAGs are written over time, and the test harness will have to adapt to
> it.
>
> There are pros and cons of having such a standalone package "in the
> community/ASF project" and "out of it". We have a good example (from
> similar kinds of tools/utils) in the past that we can learn from(and maybe
> Bas can share more insights).
>
> https://github.com/BasPH/pylint-airflow - pylint plugin for Airflow DAGs
>
> Initially that was "sponsored" by GoDataDriven where Bas worked and I
> think this is where it was born. And that made sense as it was likely also
> useful for the customers of GoDataDriven (here I am guessing). But
> apparently both GoDataDriven's incentives winded down and it turned out
> that usefulness of it was not as big (also I think we all in Python
> community learned that Pylint is more of a distraction than real help - we
> dumped Pylint eventually and the plugin was not maintained beyond some
> versions of 1.10. And the tool is all but defunct now. Which is perfectly
> understandable.
>
> In this case there is (I think) no risk of a "pylint" like problem, but
> the question of maintenance and adaptation to future versions of Airflow
> remains.
>
> I think there is one big differences of something that is "in ASF repos"
> and "out":
>
> * if we make it a standalone package in "asf airflow community" - we will
> have some obligation and expectations from our users to maintain it. We can
> add some test harness (regardless if it will be in airflow repository or in
> a separate one) to make sure that new airflow "core" changes will not break
> it (and we can fail our PRs if they do - basically making "core"
> maintainers take care about this problem rather than delegate it to someone
> else to react on core changes (this is what has to  happen with providers I
> believe even if we split them to separate repo).  I think anything that we
> as the ASF community release should have such harnesses - making sure that
> whatever we release and make available to our users work together.
>
> * if it is outside of the "ASF community", someone will have to react to
> "core airflow" changes. We will not do it in the community, we will not pay
> attention, such an "external tool" might break at any time because we
> introduced a change in part of a core that the external tool implicitly
> relied on.
>
> For me the question is whether something should be in/out should be based
> on :
>
> * is it really useful for the community as a whole? -> if yes we should
> consider it
> * is it strongly tied with the core of airflow in the sense of relying on
> some internals that might change easily? -> if not, there is no need to
> bring it in, it can be easily maintained outside by anyone
> * if it is strongly tied with the core - > is there someone (person,
> organisation) who wants to take the burden of maintaining it and has
> incentive of doing it for quite some time -> if yes, great, let them do
> that!
> * if it is strongly tied, do we want to take a burden as "core airflow
> maintainers" to keep it updated together with the core if it is? -> if yes,
> we should bring it in
>
> If we have a strongly tied tool that we do not want to maintain in the
> core and there is no entity who would like to do it, then I think this idea
> should be dropped :).
>
> J.
>
>
> On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <pi...@umich.edu> wrote:
>
>> Hi Pablo,
>>
>> Wow, I really love this idea. This will greatly enrich the airflow
>> ecosystem.
>>
>> I agree with Ash, it is better to have it as a standalone package. And we
>> can use this framework to write airflow core invariants tests, so that we
>> will run them on every airflow release to guarantee no regressions.
>>
>> Thanks,
>>
>> Ping
>>
>>
>> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <pa...@google.com.invalid>
>> wrote:
>>
>>> Understood!
>>>
>>> TL;DR: I propose a testing framework where users can check for 'DAG
>>> execution invariants' or 'DAG execution expectations' given certain task
>>> outcomes.
>>>
>>> As DAGs grow in complexity, sometimes it might become difficult to
>>> reason about their runtime behavior in many scenarios. Users may want to
>>> lay out rules in the form of tests that can verify  DAG execution results.
>>> For example:
>>>
>>> - If any of my database_backup_* tasks fails, I want to ensure that at
>>> least one email_alert_* task will run.
>>> - If my 'check_authentication' task fails, I want to ensure that the
>>> whole DAG will fail.
>>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>>> PubsubOperator downstream will always run.
>>>
>>> These sorts of invariants don't need the DAG to be executed; but in
>>> fact, they are pretty hard to test today: Staging environments can't check
>>> every possible runtime outcome.
>>>
>>> In this framework, users would define unit tests like this:
>>>
>>> ```
>>> def test_my_example_dag():
>>>   the_dag = models.DAG(
>>>         'the_basic_dag',
>>>         schedule_interval='@daily',
>>>         start_date=DEFAULT_DATE,
>>>     )
>>>
>>>     with the_dag:
>>>         op1 = EmptyOperator(task_id='task_1')
>>>         op2 = EmptyOperator(task_id='task_2')
>>>         op3 = EmptyOperator(task_id='task_3')
>>>
>>>         op1 >> op2 >> op3
>>>     # DAG invariant: If task_1 and task_2 succeeds, then task_3 will
>>> always run
>>>     assert_that(
>>>             given(thedag)\
>>>                 .when(task('task_1'), succeeds())\
>>>                 .and_(task('task_2'), succeeds())\
>>>                 .then(task('task_3'), runs()))
>>> ```
>>>
>>> This is a very simple example - and it's not great, because it only
>>> duplicates the DAG logic - but you can see more examples in my draft PR
>>> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
>>> and in my draft AIP
>>> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
>>> [2].
>>>
>>> I started writing up an AIP in a Google doc[2] which y'all can check.
>>> It's very close to what I have written here : )
>>>
>>> LMK what y'all think. I am also happy to publish this as a separate
>>> library if y'all wanna be cautious about adding it directly to Airflow.
>>> -P.
>>>
>>> [1]
>>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>> [2]
>>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>
>>>
>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>>
>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org>
>>>> wrote:
>>>> >
>>>> > Hi Pablo,
>>>> >
>>>> > Could you describe at a high level what you are thinking of? It's
>>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>>> significant enough to need an AIP.
>>>> >
>>>> > Thanks,
>>>> > Ash
>>>> >
>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada
>>>> <pa...@google.com.INVALID> wrote:
>>>> >>
>>>> >> Hi there!
>>>> >> I would like to start a discussion of an idea that I had for a
>>>> testing framework for airflow.
>>>> >> I believe the first step would be to write up an AIP - so could I
>>>> have access to write a new one on the cwiki?
>>>> >>
>>>> >> Thanks!
>>>> >> -P.
>>>>
>>>

Re: Wiki access please?

Posted by Jarek Potiuk <ja...@potiuk.com>.
I had some thoughts about it - this also connected with recent discussions
about mixed governance for providers, and I think it's worth using this
discussion to set some rules and "boundaries" on when and how and
especially why we want to accept some contributions, while for some other
contributions it's better to be outside.

We are about to start more seriously thinking (and discussing) on how to
split Airflow providers off airflow. And I think we can split off more than
providers - this might be a good candidate to be a standalone, but still
community maintained package. If we are going to solve the problem of
splitting airflow to N packages, one more package does not matter.
And it would nicely solve "version independence". We could even make it
airflow 2.0+ compliant if we want.

So I think while the question of "is it tied with a specific airflow
version or not" does not really prevent us from making it part of community
- those two are not related (if we are going to have more repositories
anyway)

The important part is really how "self-servicing" we can make it and how we
make sure it stays relevant with future versions of Airflow and who does it
I think - namely who has the incentive and "responsibility" to maintain it.
I am sure we will add more features to Airflow DAGs and simplify the way
DAGs are written over time, and the test harness will have to adapt to it.

There are pros and cons of having such a standalone package "in the
community/ASF project" and "out of it". We have a good example (from
similar kinds of tools/utils) in the past that we can learn from(and maybe
Bas can share more insights).

https://github.com/BasPH/pylint-airflow - pylint plugin for Airflow DAGs

Initially that was "sponsored" by GoDataDriven where Bas worked and I think
this is where it was born. And that made sense as it was likely also useful
for the customers of GoDataDriven (here I am guessing). But apparently both
GoDataDriven's incentives winded down and it turned out that usefulness of
it was not as big (also I think we all in Python community learned that
Pylint is more of a distraction than real help - we dumped Pylint
eventually and the plugin was not maintained beyond some versions of 1.10.
And the tool is all but defunct now. Which is perfectly understandable.

In this case there is (I think) no risk of a "pylint" like problem, but the
question of maintenance and adaptation to future versions of Airflow
remains.

I think there is one big differences of something that is "in ASF repos"
and "out":

* if we make it a standalone package in "asf airflow community" - we will
have some obligation and expectations from our users to maintain it. We can
add some test harness (regardless if it will be in airflow repository or in
a separate one) to make sure that new airflow "core" changes will not break
it (and we can fail our PRs if they do - basically making "core"
maintainers take care about this problem rather than delegate it to someone
else to react on core changes (this is what has to  happen with providers I
believe even if we split them to separate repo).  I think anything that we
as the ASF community release should have such harnesses - making sure that
whatever we release and make available to our users work together.

* if it is outside of the "ASF community", someone will have to react to
"core airflow" changes. We will not do it in the community, we will not pay
attention, such an "external tool" might break at any time because we
introduced a change in part of a core that the external tool implicitly
relied on.

For me the question is whether something should be in/out should be based
on :

* is it really useful for the community as a whole? -> if yes we should
consider it
* is it strongly tied with the core of airflow in the sense of relying on
some internals that might change easily? -> if not, there is no need to
bring it in, it can be easily maintained outside by anyone
* if it is strongly tied with the core - > is there someone (person,
organisation) who wants to take the burden of maintaining it and has
incentive of doing it for quite some time -> if yes, great, let them do
that!
* if it is strongly tied, do we want to take a burden as "core airflow
maintainers" to keep it updated together with the core if it is? -> if yes,
we should bring it in

If we have a strongly tied tool that we do not want to maintain in the core
and there is no entity who would like to do it, then I think this idea
should be dropped :).

J.


On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <pi...@umich.edu> wrote:

> Hi Pablo,
>
> Wow, I really love this idea. This will greatly enrich the airflow
> ecosystem.
>
> I agree with Ash, it is better to have it as a standalone package. And we
> can use this framework to write airflow core invariants tests, so that we
> will run them on every airflow release to guarantee no regressions.
>
> Thanks,
>
> Ping
>
>
> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <pa...@google.com.invalid>
> wrote:
>
>> Understood!
>>
>> TL;DR: I propose a testing framework where users can check for 'DAG
>> execution invariants' or 'DAG execution expectations' given certain task
>> outcomes.
>>
>> As DAGs grow in complexity, sometimes it might become difficult to reason
>> about their runtime behavior in many scenarios. Users may want to lay out
>> rules in the form of tests that can verify  DAG execution results. For
>> example:
>>
>> - If any of my database_backup_* tasks fails, I want to ensure that at
>> least one email_alert_* task will run.
>> - If my 'check_authentication' task fails, I want to ensure that the
>> whole DAG will fail.
>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>> PubsubOperator downstream will always run.
>>
>> These sorts of invariants don't need the DAG to be executed; but in fact,
>> they are pretty hard to test today: Staging environments can't check every
>> possible runtime outcome.
>>
>> In this framework, users would define unit tests like this:
>>
>> ```
>> def test_my_example_dag():
>>   the_dag = models.DAG(
>>         'the_basic_dag',
>>         schedule_interval='@daily',
>>         start_date=DEFAULT_DATE,
>>     )
>>
>>     with the_dag:
>>         op1 = EmptyOperator(task_id='task_1')
>>         op2 = EmptyOperator(task_id='task_2')
>>         op3 = EmptyOperator(task_id='task_3')
>>
>>         op1 >> op2 >> op3
>>     # DAG invariant: If task_1 and task_2 succeed, then task_3 will
>> always run
>>     assert_that(
>>             given(the_dag)\
>>                 .when(task('task_1'), succeeds())\
>>                 .and_(task('task_2'), succeeds())\
>>                 .then(task('task_3'), runs()))
>> ```
>>
>> This is a very simple example - and it's not great, because it only
>> duplicates the DAG logic - but you can see more examples in my draft PR
>> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
>> and in my draft AIP
>> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
>> [2].
>>
>> I started writing up an AIP in a Google doc[2] which y'all can check.
>> It's very close to what I have written here : )
>>
>> LMK what y'all think. I am also happy to publish this as a separate
>> library if y'all wanna be cautious about adding it directly to Airflow.
>> -P.
>>
>> [1]
>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>> [2]
>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>
>>
>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>
>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org>
>>> wrote:
>>> >
>>> > Hi Pablo,
>>> >
>>> > Could you describe at a high level what you are thinking of? It's
>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>> significant enough to need an AIP.
>>> >
>>> > Thanks,
>>> > Ash
>>> >
>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada <pa...@google.com.INVALID>
>>> wrote:
>>> >>
>>> >> Hi there!
>>> >> I would like to start a discussion of an idea that I had for a
>>> testing framework for airflow.
>>> >> I believe the first step would be to write up an AIP - so could I
>>> have access to write a new one on the cwiki?
>>> >>
>>> >> Thanks!
>>> >> -P.
>>>
>>

Re: Wiki access please?

Posted by Ping Zhang <pi...@umich.edu>.
Hi Pablo,

Wow, I really love this idea. This will greatly enrich the airflow
ecosystem.

I agree with Ash, it is better to have it as a standalone package. And we
can use this framework to write airflow core invariant tests, so that we
will run them on every airflow release to guarantee no regressions.

Thanks,

Ping


On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <pa...@google.com.invalid>
wrote:

> Understood!
>
> TL;DR: I propose a testing framework where users can check for 'DAG
> execution invariants' or 'DAG execution expectations' given certain task
> outcomes.
>
> As DAGs grow in complexity, sometimes it might become difficult to reason
> about their runtime behavior in many scenarios. Users may want to lay out
> rules in the form of tests that can verify  DAG execution results. For
> example:
>
> - If any of my database_backup_* tasks fails, I want to ensure that at
> least one email_alert_* task will run.
> - If my 'check_authentication' task fails, I want to ensure that the whole
> DAG will fail.
> - If any of my DataflowOperator tasks fails, I want to ensure that a
> PubsubOperator downstream will always run.
>
> These sorts of invariants don't need the DAG to be executed; but in fact,
> they are pretty hard to test today: Staging environments can't check every
> possible runtime outcome.
>
> In this framework, users would define unit tests like this:
>
> ```
> def test_my_example_dag():
>   the_dag = models.DAG(
>         'the_basic_dag',
>         schedule_interval='@daily',
>         start_date=DEFAULT_DATE,
>     )
>
>     with the_dag:
>         op1 = EmptyOperator(task_id='task_1')
>         op2 = EmptyOperator(task_id='task_2')
>         op3 = EmptyOperator(task_id='task_3')
>
>         op1 >> op2 >> op3
>     # DAG invariant: If task_1 and task_2 succeed, then task_3 will
> always run
>     assert_that(
>             given(the_dag)\
>                 .when(task('task_1'), succeeds())\
>                 .and_(task('task_2'), succeeds())\
>                 .then(task('task_3'), runs()))
> ```
>
> This is a very simple example - and it's not great, because it only
> duplicates the DAG logic - but you can see more examples in my draft PR
> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
> and in my draft AIP
> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
> [2].
>
> I started writing up an AIP in a Google doc[2] which y'all can check. It's
> very close to what I have written here : )
>
> LMK what y'all think. I am also happy to publish this as a separate
> library if y'all wanna be cautious about adding it directly to Airflow.
> -P.
>
> [1]
> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
> [2]
> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>
>
> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Yep. Just outline your proposal on devlist, Pablo :).
>>
>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org>
>> wrote:
>> >
>> > Hi Pablo,
>> >
>> > Could you describe at a high level what you are thinking of? It's
>> entirely possible it doesn't need any changes to core Airflow, or isn't
>> significant enough to need an AIP.
>> >
>> > Thanks,
>> > Ash
>> >
>> > On 17 July 2022 07:43:54 BST, Pablo Estrada <pa...@google.com.INVALID>
>> wrote:
>> >>
>> >> Hi there!
>> >> I would like to start a discussion of an idea that I had for a testing
>> framework for airflow.
>> >> I believe the first step would be to write up an AIP - so could I have
>> access to write a new one on the cwiki?
>> >>
>> >> Thanks!
>> >> -P.
>>
>

Re: Wiki access please?

Posted by Jarek Potiuk <ja...@potiuk.com>.
Yeah. Developing faster than airflow itself is a very valid point Ash.

On Sun, Jul 17, 2022 at 10:36 PM Ash Berlin-Taylor <as...@apache.org> wrote:

> I agree this would be a great addition to the Airflow ecosystem but I
> think it should start out life as an external package for two reasons:
>
> 1. It means you can release and iterate quickly without being beholden to
> the Airflow release process (voting, timelines etc)
> 2. It means we can see how popular it is before we (Airflow maintainers)
> have to commit to supporting it long term.
>
> -a
>
> On 17 July 2022 21:19:21 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> First comment - without looking at the details yet - those kinds of tests
>> are badly needed. We have many questions from our users "How do I test my
>> dags", and also one of the comments I've heard about some other
>> orchestration framework was "I really like how easy it is to run tests".
>> Getting a "built-in" simple test harness for DAG writing would be cool.
>>
>> Whether it is part of Airflow or external library - I think both have
>> pros/cons but as long as it is small and easy to follow and maintain, I am
>> for getting it in (providing that we will have good documentation/guidance
>> for our users how to use it and plenty of examples). I think this is the
>> only thing I'd be worried about when considering accepting such a framework
>> to the community - the code we get in Airflow might become a liability if
>> people who use it will drag more attention and effort of maintainers out of
>> other things. This is basically something that in regular business is
>> called "lost opportunity" cost.
>>
>> So as long as we can get really great documentation, examples and some
>> ways to make our users self-serviced mostly, I am all in.
>>
>> J.
>>
>> On Sun, Jul 17, 2022 at 10:09 PM Pablo Estrada <pa...@google.com.invalid>
>> wrote:
>>
>>> Understood!
>>>
>>> TL;DR: I propose a testing framework where users can check for 'DAG
>>> execution invariants' or 'DAG execution expectations' given certain task
>>> outcomes.
>>>
>>> As DAGs grow in complexity, sometimes it might become difficult to
>>> reason about their runtime behavior in many scenarios. Users may want to
>>> lay out rules in the form of tests that can verify  DAG execution results.
>>> For example:
>>>
>>> - If any of my database_backup_* tasks fails, I want to ensure that at
>>> least one email_alert_* task will run.
>>> - If my 'check_authentication' task fails, I want to ensure that the
>>> whole DAG will fail.
>>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>>> PubsubOperator downstream will always run.
>>>
>>> These sorts of invariants don't need the DAG to be executed; but in
>>> fact, they are pretty hard to test today: Staging environments can't check
>>> every possible runtime outcome.
>>>
>>> In this framework, users would define unit tests like this:
>>>
>>> ```
>>> def test_my_example_dag():
>>>   the_dag = models.DAG(
>>>         'the_basic_dag',
>>>         schedule_interval='@daily',
>>>         start_date=DEFAULT_DATE,
>>>     )
>>>
>>>     with the_dag:
>>>         op1 = EmptyOperator(task_id='task_1')
>>>         op2 = EmptyOperator(task_id='task_2')
>>>         op3 = EmptyOperator(task_id='task_3')
>>>
>>>         op1 >> op2 >> op3
>>>     # DAG invariant: If task_1 and task_2 succeed, then task_3 will
>>> always run
>>>     assert_that(
>>>             given(the_dag)\
>>>                 .when(task('task_1'), succeeds())\
>>>                 .and_(task('task_2'), succeeds())\
>>>                 .then(task('task_3'), runs()))
>>> ```
>>>
>>> This is a very simple example - and it's not great, because it only
>>> duplicates the DAG logic - but you can see more examples in my draft PR
>>> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
>>> and in my draft AIP
>>> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
>>> [2].
>>>
>>> I started writing up an AIP in a Google doc[2] which y'all can check.
>>> It's very close to what I have written here : )
>>>
>>> LMK what y'all think. I am also happy to publish this as a separate
>>> library if y'all wanna be cautious about adding it directly to Airflow.
>>> -P.
>>>
>>> [1]
>>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>> [2]
>>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>
>>>
>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>>
>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org>
>>>> wrote:
>>>> >
>>>> > Hi Pablo,
>>>> >
>>>> > Could you describe at a high level what you are thinking of? It's
>>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>>> significant enough to need an AIP.
>>>> >
>>>> > Thanks,
>>>> > Ash
>>>> >
>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada
>>>> <pa...@google.com.INVALID> wrote:
>>>> >>
>>>> >> Hi there!
>>>> >> I would like to start a discussion of an idea that I had for a
>>>> testing framework for airflow.
>>>> >> I believe the first step would be to write up an AIP - so could I
>>>> have access to write a new one on the cwiki?
>>>> >>
>>>> >> Thanks!
>>>> >> -P.
>>>>
>>>

Re: Wiki access please?

Posted by Ash Berlin-Taylor <as...@apache.org>.
I agree this would be a great addition to the Airflow ecosystem but I think it should start out life as an external package for two reasons: 

1. It means you can release and iterate quickly without being beholden to the Airflow release process (voting, timelines etc)
2. It means we can see how popular it is before we (Airflow maintainers) have to commit to supporting it long term.

-a

On 17 July 2022 21:19:21 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>First comment - without looking at the details yet - those kinds of tests
>are badly needed. We have many questions from our users "How do I test my
>dags", and also one of the comments I've heard about some other
>orchestration framework was "I really like how easy it is to run tests".
>Getting a "built-in" simple test harness for DAG writing would be cool.
>
>Whether it is part of Airflow or external library - I think both have
>pros/cons but as long as it is small and easy to follow and maintain, I am
>for getting it in (providing that we will have good documentation/guidance
>for our users how to use it and plenty of examples). I think this is the
>only thing I'd be worried about when considering accepting such a framework
>to the community - the code we get in Airflow might become a liability if
>people who use it will drag more attention and effort of maintainers out of
>other things. This is basically something that in regular business is
>called "lost opportunity" cost.
>
>So as long as we can get really great documentation, examples and some ways
>to make our users self-serviced mostly, I am all in.
>
>J.
>
>On Sun, Jul 17, 2022 at 10:09 PM Pablo Estrada <pa...@google.com.invalid>
>wrote:
>
>> Understood!
>>
>> TL;DR: I propose a testing framework where users can check for 'DAG
>> execution invariants' or 'DAG execution expectations' given certain task
>> outcomes.
>>
>> As DAGs grow in complexity, sometimes it might become difficult to reason
>> about their runtime behavior in many scenarios. Users may want to lay out
>> rules in the form of tests that can verify  DAG execution results. For
>> example:
>>
>> - If any of my database_backup_* tasks fails, I want to ensure that at
>> least one email_alert_* task will run.
>> - If my 'check_authentication' task fails, I want to ensure that the whole
>> DAG will fail.
>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>> PubsubOperator downstream will always run.
>>
>> These sorts of invariants don't need the DAG to be executed; but in fact,
>> they are pretty hard to test today: Staging environments can't check every
>> possible runtime outcome.
>>
>> In this framework, users would define unit tests like this:
>>
>> ```
>> def test_my_example_dag():
>>   the_dag = models.DAG(
>>         'the_basic_dag',
>>         schedule_interval='@daily',
>>         start_date=DEFAULT_DATE,
>>     )
>>
>>     with the_dag:
>>         op1 = EmptyOperator(task_id='task_1')
>>         op2 = EmptyOperator(task_id='task_2')
>>         op3 = EmptyOperator(task_id='task_3')
>>
>>         op1 >> op2 >> op3
>>     # DAG invariant: If task_1 and task_2 succeed, then task_3 will
>> always run
>>     assert_that(
>>             given(the_dag)\
>>                 .when(task('task_1'), succeeds())\
>>                 .and_(task('task_2'), succeeds())\
>>                 .then(task('task_3'), runs()))
>> ```
>>
>> This is a very simple example - and it's not great, because it only
>> duplicates the DAG logic - but you can see more examples in my draft PR
>> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
>> and in my draft AIP
>> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
>> [2].
>>
>> I started writing up an AIP in a Google doc[2] which y'all can check. It's
>> very close to what I have written here : )
>>
>> LMK what y'all think. I am also happy to publish this as a separate
>> library if y'all wanna be cautious about adding it directly to Airflow.
>> -P.
>>
>> [1]
>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>> [2]
>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>
>>
>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>
>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org>
>>> wrote:
>>> >
>>> > Hi Pablo,
>>> >
>>> > Could you describe at a high level what you are thinking of? It's
>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>> significant enough to need an AIP.
>>> >
>>> > Thanks,
>>> > Ash
>>> >
>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada <pa...@google.com.INVALID>
>>> wrote:
>>> >>
>>> >> Hi there!
>>> >> I would like to start a discussion of an idea that I had for a testing
>>> framework for airflow.
>>> >> I believe the first step would be to write up an AIP - so could I have
>>> access to write a new one on the cwiki?
>>> >>
>>> >> Thanks!
>>> >> -P.
>>>
>>

Re: Wiki access please?

Posted by Jarek Potiuk <ja...@potiuk.com>.
First comment - without looking at the details yet - those kinds of tests
are badly needed. We have many questions from our users "How do I test my
dags", and also one of the comments I've heard about some other
orchestration framework was "I really like how easy it is to run tests".
Getting a "built-in" simple test harness for DAG writing would be cool.

Whether it is part of Airflow or external library - I think both have
pros/cons but as long as it is small and easy to follow and maintain, I am
for getting it in (providing that we will have good documentation/guidance
for our users how to use it and plenty of examples). I think this is the
only thing I'd be worried about when considering accepting such a framework
to the community - the code we get into Airflow might become a liability if
the people who use it drag maintainers' attention and effort away from
other things. This is basically something that in regular business is
called "lost opportunity" cost.

So as long as we can get really great documentation, examples and some ways
to make our users self-serviced mostly, I am all in.

J.

On Sun, Jul 17, 2022 at 10:09 PM Pablo Estrada <pa...@google.com.invalid>
wrote:

> Understood!
>
> TL;DR: I propose a testing framework where users can check for 'DAG
> execution invariants' or 'DAG execution expectations' given certain task
> outcomes.
>
> As DAGs grow in complexity, sometimes it might become difficult to reason
> about their runtime behavior in many scenarios. Users may want to lay out
> rules in the form of tests that can verify  DAG execution results. For
> example:
>
> - If any of my database_backup_* tasks fails, I want to ensure that at
> least one email_alert_* task will run.
> - If my 'check_authentication' task fails, I want to ensure that the whole
> DAG will fail.
> - If any of my DataflowOperator tasks fails, I want to ensure that a
> PubsubOperator downstream will always run.
>
> These sorts of invariants don't need the DAG to be executed; but in fact,
> they are pretty hard to test today: Staging environments can't check every
> possible runtime outcome.
>
> In this framework, users would define unit tests like this:
>
> ```
> def test_my_example_dag():
>   the_dag = models.DAG(
>         'the_basic_dag',
>         schedule_interval='@daily',
>         start_date=DEFAULT_DATE,
>     )
>
>     with the_dag:
>         op1 = EmptyOperator(task_id='task_1')
>         op2 = EmptyOperator(task_id='task_2')
>         op3 = EmptyOperator(task_id='task_3')
>
>         op1 >> op2 >> op3
>     # DAG invariant: If task_1 and task_2 succeeds, then task_3 will
> always run
>     assert_that(
>             given(thedag)\
>                 .when(task('task_1'), succeeds())\
>                 .and_(task('task_2'), succeeds())\
>                 .then(task('task_3'), runs()))
> ```
>
> This is a very simple example - and it's not great, because it only
> duplicates the DAG logic - but you can see more examples in my draft PR
> <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
> and in my draft AIP
> <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
> [2].
>
> I started writing up an AIP in a Google doc[2] which y'all can check. It's
> very close to what I have written here : )
>
> LMK what y'all think. I am also happy to publish this as a separate
> library if y'all wanna be cautious about adding it directly to Airflow.
> -P.
>
> [1]
> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
> [2]
> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>
>
> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Yep. Just outline your proposal on devlist, Pablo :).
>>
>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org>
>> wrote:
>> >
>> > Hi Pablo,
>> >
>> > Could you describe at a high level what you are thinking of? It's
>> entirely possible it doesn't need any changes to core Airflow, or isn't
>> significant enough to need an AIP.
>> >
>> > Thanks,
>> > Ash
>> >
>> > On 17 July 2022 07:43:54 BST, Pablo Estrada <pa...@google.com.INVALID>
>> wrote:
>> >>
>> >> Hi there!
>> >> I would like to start a discussion of an idea that I had for a testing
>> framework for airflow.
>> >> I believe the first step would be to write up an AIP - so could I have
>> access to write a new one on the cwiki?
>> >>
>> >> Thanks!
>> >> -P.
>>
>

Re: Wiki access please?

Posted by Pablo Estrada <pa...@google.com.INVALID>.
Understood!

TL;DR: I propose a testing framework where users can check for 'DAG
execution invariants' or 'DAG execution expectations' given certain task
outcomes.

As DAGs grow in complexity, it can sometimes become difficult to reason
about their runtime behavior in many scenarios. Users may want to lay out
rules in the form of tests that verify DAG execution results. For
example:

- If any of my database_backup_* tasks fails, I want to ensure that at
least one email_alert_* task will run.
- If my 'check_authentication' task fails, I want to ensure that the whole
DAG will fail.
- If any of my DataflowOperator tasks fails, I want to ensure that a
PubsubOperator downstream will always run.

These sorts of invariants don't require the DAG to be executed, yet they
are pretty hard to test today: staging environments can't check every
possible runtime outcome.

In this framework, users would define unit tests like this:

```
def test_my_example_dag():
    the_dag = models.DAG(
        'the_basic_dag',
        schedule_interval='@daily',
        start_date=DEFAULT_DATE,
    )

    with the_dag:
        op1 = EmptyOperator(task_id='task_1')
        op2 = EmptyOperator(task_id='task_2')
        op3 = EmptyOperator(task_id='task_3')

        op1 >> op2 >> op3
    # DAG invariant: If task_1 and task_2 succeed, then task_3 will
    # always run
    assert_that(
            given(the_dag)\
                .when(task('task_1'), succeeds())\
                .and_(task('task_2'), succeeds())\
                .then(task('task_3'), runs()))
```
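Under the hood, a check like this could be done purely statically -
propagating assumed task outcomes through the dependency graph under the
default all_success trigger rule, without running any task. A rough,
self-contained sketch of that idea (hypothetical, not what the draft PR
actually implements; `propagate_states` and its state names are made up
for illustration):

```
# Hypothetical sketch: given assumed outcomes for some tasks, propagate
# states through the DAG (default "all_success" trigger rule) and assert
# on what must or must not run. Not the actual PR implementation.

def propagate_states(upstreams, assumed):
    """upstreams: task_id -> list of upstream task_ids (must form a DAG).
    assumed: task_id -> 'success' | 'failed' (the scenario under test).
    Returns task_id -> 'success' | 'failed' | 'upstream_failed'."""
    states = {}

    def state_of(task):
        if task in states:
            return states[task]
        if any(state_of(up) != 'success' for up in upstreams[task]):
            # all_success rule not met: the task never runs
            result = 'upstream_failed'
        else:
            # the task runs; assume success unless the scenario says otherwise
            result = assumed.get(task, 'success')
        states[task] = result
        return result

    for task in upstreams:
        state_of(task)
    return states

# The three-task chain from the example above.
upstreams = {'task_1': [], 'task_2': ['task_1'], 'task_3': ['task_2']}

# Invariant: if task_1 and task_2 succeed, task_3 runs.
states = propagate_states(upstreams, {'task_1': 'success', 'task_2': 'success'})
assert states['task_3'] != 'upstream_failed'

# Counter-scenario: if task_1 fails, task_3 never runs.
states = propagate_states(upstreams, {'task_1': 'failed'})
assert states['task_3'] == 'upstream_failed'
```

A real implementation would of course also have to model Airflow's other
trigger rules (all_failed, one_success, none_failed, etc.) and branching
operators, which is where most of the complexity lives.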

This is a very simple example - and it's not great, because it only
duplicates the DAG logic - but you can see more examples in my draft PR
<https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1]
and in my draft AIP
<https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>
[2].

I started writing up an AIP in a Google doc[2] which y'all can check. It's
very close to what I have written here : )

LMK what y'all think. I am also happy to publish this as a separate library
if y'all wanna be cautious about adding it directly to Airflow.
-P.

[1]
https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
[2]
https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#


On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Yep. Just outline your proposal on devlist, Pablo :).
>
> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org> wrote:
> >
> > Hi Pablo,
> >
> > Could you describe at a high level what you are thinking of? It's
> entirely possible it doesn't need any changes to core Airflow, or isn't
> significant enough to need an AIP.
> >
> > Thanks,
> > Ash
> >
> > On 17 July 2022 07:43:54 BST, Pablo Estrada <pa...@google.com.INVALID>
> wrote:
> >>
> >> Hi there!
> >> I would like to start a discussion of an idea that I had for a testing
> framework for airflow.
> >> I believe the first step would be to write up an AIP - so could I have
> access to write a new one on the cwiki?
> >>
> >> Thanks!
> >> -P.
>

Re: Wiki access please?

Posted by Jarek Potiuk <ja...@potiuk.com>.
Yep. Just outline your proposal on devlist, Pablo :).

On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <as...@apache.org> wrote:
>
> Hi Pablo,
>
> Could you describe at a high level what you are thinking of? It's entirely possible it doesn't need any changes to core Airflow, or isn't significant enough to need an AIP.
>
> Thanks,
> Ash
>
> On 17 July 2022 07:43:54 BST, Pablo Estrada <pa...@google.com.INVALID> wrote:
>>
>> Hi there!
>> I would like to start a discussion of an idea that I had for a testing framework for airflow.
>> I believe the first step would be to write up an AIP - so could I have access to write a new one on the cwiki?
>>
>> Thanks!
>> -P.

Re: Wiki access please?

Posted by Ash Berlin-Taylor <as...@apache.org>.
Hi Pablo,

Could you describe at a high level what you are thinking of? It's entirely possible it doesn't need any changes to core Airflow, or isn't significant enough to need an AIP.

Thanks,
Ash 

On 17 July 2022 07:43:54 BST, Pablo Estrada <pa...@google.com.INVALID> wrote:
>Hi there!
>I would like to start a discussion of an idea that I had for a testing
>framework for airflow.
>I believe the first step would be to write up an AIP - so could I have
>access to write a new one on the cwiki?
>
>Thanks!
>-P.