Posted to dev@airflow.apache.org by "Oliveira, Niko" <on...@amazon.com.INVALID> on 2022/08/18 23:29:48 UTC

Vending AWS System Test Results Back to the Community

Hey folks,


Those of us on the AWS Airflow team (myself, Dennis F, Vincent B, Seyed H) have been working on a few projects over the past few months:


1. Writing example dags/docs for all existing Operators in the AWS Airflow provider package (done)

2. Writing AWS specific logic in Airflow codebase to support AIP-47 (done)

3. Converting all example dags to AIP-47 compliant system tests (just over halfway done)


All of these efforts culminate in the goal of running these system tests at a regular cadence within Amazon (where we have access to funded AWS accounts). We will run these system tests, triggered by updates to airflow:main, at least once a day.

I'd like to open a discussion on how we can vend these results back to the community in a way that is most consumable for contributors, release managers and users alike.

A quick and easy approach would be to create a publicly viewable CloudWatch Dashboard with at least the following metrics for each system test over time: pass/fail status, duration, and execution count.
This would be a human-readable way to consume the current status of the AWS Operators.
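For illustration, here is a rough sketch of how each system test run could be recorded as CloudWatch metrics. The namespace "AirflowSystemTests" and the dimension name "SystemTest" are hypothetical, not an existing convention:

```python
def build_metric_data(test_id, passed, duration_s):
    """Build the put_metric_data payload for one system test execution."""
    dimensions = [{"Name": "SystemTest", "Value": test_id}]
    return [
        {"MetricName": "Pass", "Dimensions": dimensions,
         "Value": 1.0 if passed else 0.0, "Unit": "Count"},
        {"MetricName": "Duration", "Dimensions": dimensions,
         "Value": float(duration_s), "Unit": "Seconds"},
        {"MetricName": "ExecutionCount", "Dimensions": dimensions,
         "Value": 1.0, "Unit": "Count"},
    ]

def publish(test_id, passed, duration_s):
    """Publish the metrics; requires AWS credentials to be configured."""
    import boto3  # imported lazily so the payload builder stays testable offline
    boto3.client("cloudwatch").put_metric_data(
        Namespace="AirflowSystemTests",
        MetricData=build_metric_data(test_id, passed, duration_s),
    )
```

A dashboard could then graph the Pass metric per test, with Duration and ExecutionCount alongside.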


If a more machine-readable format is required or preferred (e.g. for scripts related to Airflow release management), we could also put together a simple API Gateway endpoint that vends the data in a format such as JSON.
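As a sketch, such an endpoint could be a small Lambda function behind API Gateway. The payload shape below is purely illustrative; no such endpoint or schema exists yet:

```python
import json

def latest_results():
    # Stubbed data for illustration; a real handler would query wherever
    # the test results are stored (e.g. DynamoDB or CloudWatch).
    return {
        "provider": "amazon",
        "results": [
            {"test": "example_sqs", "status": "pass", "duration_s": 41.2},
            {"test": "example_ecs", "status": "fail", "duration_s": 310.7},
        ],
    }

def handler(event, context):
    """Lambda entry point for the API Gateway endpoint (proxy integration)."""
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(latest_results()),
    }
```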

Another interesting option would be for us to publish the CloudFormation templates (or the codebase used to generate the templates) for configuring the system test environment and executing the tests. This could be deployed to an AWS account owned and managed by the Airflow community where tests would be run periodically. AWS has provided some credits in the past which could be used to help fund the account. But this introduces a large component that would need ownership and management by folks within the Airflow community who have access to such AWS accounts and credits (likely only committers/release managers?). So it might not be worth the complexity.


I'd like to hear what folks think!

Cheers,
Niko




Re: Vending AWS System Test Results Back to the Community

Posted by "Oliveira, Niko" <on...@amazon.com.INVALID>.
Hey Jarek,

Thanks for taking the time! I think we are actually well aligned :) The framework we're developing to set up and run these tests in an AWS account has all 4 of the characteristics you mentioned. So the real question is where they run and who owns them.


> Looking at the expectations above - I think it would be better for such tests to be run by the Amazon team for Amazon, the Google team for Google, etc.
> I think it will be far more efficient to get them in the hands of the stakeholders who are most interested in keeping the tests "green"


We on our end are committed to keeping those tests green, or at least triaging them and opening tickets to work with the community to get them green. We'll be doing this either way, whether there is a publicly running copy of the stack or not.


> That is a much more scalable solution from the community point of view. We are not going to publish it to our users, and it does not
> really need to run on our infra. I don't see a particular need for regular community members to even know what infrastructure is
> used to run the tests - the test execution is pretty standardised, and I think we are really interested in the output rather than the infra used to run it.

Agreed, so the question now is what the "API" between us and the community should be if we run this infra:

1. For vending the results, it seems agreeable to publish a public dashboard of the real-time results as we run the tests daily. This will likely be our first goal. We can link to it from somewhere in the Airflow docs/site (perhaps the ecosystem page?) for any interested folks. Though I agree that it will mostly be release managers who are interested, for provider package releases.
I'm not sure that notifying on the Apache Airflow Slack will be useful for many users, and once more cloud providers' system tests are up and running it could become quite spammy. Though I'm interested to hear what others think.

2. For allowing folks to trigger the tests and inspect the logs: this is trickier, but I'm not sure it's actually a blocking issue, at least to start. All of the AWS system tests and the code to run them in Breeze are published in the Airflow code base. So if some system test, say SQS, is failing and someone from the community would like to work on it, there's nothing stopping them from creating an AWS account, deploying credentials to their machine, and running:

breeze testing tests --system amazon tests/system/providers/amazon/aws/example_sqs.py

For an individual developer working on a PR, this will be a faster dev/test cycle anyway.
Some services may not have enough free-tier usage for contributors to do this for free, but I think it will cover most cases in the short term, until we implement a way for folks to remotely trigger these system tests.

So how about this for an incremental plan:

1. We at Amazon will continue to host and run the AWS system tests for the time being, and we will be the "owners" of triaging failing tests - either fixing them or cutting tickets in the Apache Airflow GitHub for other contributors to take on.
2. We'll work next on exposing results via some kind of public dashboard so that the community can see the real-time health of the AWS provider package.
3. Then we'll follow up with some mechanism to trigger these tests from the Airflow community, whether from within a PR or by the release manager. Though I think this one still needs more thought on exactly how it would work and scale.

Cheers,
Niko


________________________________
From: Jarek Potiuk <ja...@potiuk.com>
Sent: Friday, August 19, 2022 2:32:11 PM
To: dev@airflow.apache.org
Subject: RE: [EXTERNAL]Vending AWS System Test Results Back to the Community




Hey Niko,

Very good points to discuss. I think this is something equally needed on AWS/Google but also other "popular" services we have integration with (and eventually fulfilling the goal of AIP-4) :)

Context:

I am a big fan of thinking of CI systems as largely invisible to the "regular" users. The best CI system (an almost impossible goal, but it should always be our compass) is a system you are not aware of until you (or someone) introduces a change that needs an action from someone - when things get broken. The breakage might come from multiple sources in the case of system tests: a code change, a library upgrade, the service changing its API, a required permission change, etc. Unlike in other types of tests, many of those are independent of the PRs merged; it is actually more likely that some external change will impact system tests.

Also, system tests cannot really be run with every PR (they take too long), so there are somewhat different "usage" and "access" needs for the results. This impacts what can trigger such tests. I think it's far more likely that we will run the system tests regularly on a schedule (once a day?) and manually: by the release manager when we prepare a provider's release, to verify that the providers still work, or when we want to test whether we fixed a problem reported by a failing "scheduled" run (from a PR branch).

And I think it's not only about how you access the results, but also how "failure notifications" are delivered and how the tests are triggered - we should answer all of those questions.

Audience of the solution:

This leads to another question - WHO will be interested in seeing notifications and fixing the problems?

I think this is the most important question to answer. I really think regular contributors (even those who contribute to, say, the Amazon provider) will not be interested and will not regularly monitor failure notifications. Unlike regular "test failure" notifications, these should not go to the users who made the PR but to those who are interested in keeping the "provider" green. In a number of cases it will be rather difficult to (automatically) engage a contributor's attention when such system tests start to fail. But eventually (and I think this split is quite obvious) there will be people interested in monitoring the overall health of a given provider. They don't have to fix it; they can merely investigate, see that this or that likely caused the problem, and "pull in" the PR contributors. But there is a watchout there - those contributors might not have, or might not want to use, access to run such system tests manually (it might cost money, might need paid accounts, and has some risks involved, as data is deleted/recreated, etc.). Those contributors should see the logs/results and should be able to fix the problem, but then they should also be able to trigger a system test execution on their PR when they want to check whether the problem is fixed.

Characteristics of the solution:

So I think any solution should have those characteristics:

1) Produce notifications that the executed tests failed/succeeded - these should go to a dedicated, separate per-provider place (for example a Slack channel). With the "low" frequency of such messages, a Slack channel seems like the best idea. People who want to monitor a given provider could simply subscribe to that channel. Seeing a regular "All tests passed" and occasionally a "Some tests failed" message there is a great indication of a) whether the tests continue to work in general, and b) when things fail. We need to have a regular schedule and notify about successes as well - to create a kind of "heartbeat" which tells the people monitoring for errors that things are not working when the heartbeat is missing.
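To make the heartbeat idea concrete, here is a rough sketch of such a per-provider notification, assuming a standard Slack incoming webhook; the URL, channel setup, and message wording are placeholders:

```python
import json
import urllib.request

def build_message(provider, failed_tests):
    """Compose the heartbeat message for the provider's channel."""
    if not failed_tests:
        return f"[{provider}] All system tests passed."
    return (f"[{provider}] {len(failed_tests)} system test(s) failed: "
            + ", ".join(failed_tests))

def notify(webhook_url, provider, failed_tests):
    """POST the message to a Slack incoming webhook."""
    payload = json.dumps({"text": build_message(provider, failed_tests)}).encode()
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # Sent on every scheduled run - successes included - so a missing
    # message itself signals that the pipeline is broken.
    urllib.request.urlopen(req)
```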

2) Seeing logs - the notifications should contain links to logs that can be browsed by anyone (read-only). Luckily we have no "secret" information, so it could be a publicly available link. Ideally, it should be a cloud-based one (CloudWatch for AWS). I think access to logs is absolutely crucial for anyone trying to investigate and fix a problem. And I think this is the only thing needed by anyone outside of the group interested in keeping the provider "green" - such individual contributors might only need to see a "green" log to compare what has changed in a particular build, but they are not really interested in seeing the historical stats.

3) Triggering the tests. This one is tricky. This is something that should be accessible to everyone contributing a PR, but it should be controlled somehow. I have no idea yet how this can be gated/controlled, but it is something we will need to figure out. Who, when, and how should one trigger such a build for their own PR? There might be various ways - a special comment on the PR plus some conditions (approvals?) that the PR/user should fulfill to be able to trigger it - but this is not something I have a complete proposal for.

4) Dashboard - that one is mostly interesting to the release manager and the people interested in keeping a given provider "green". It's OK to make it public, but it does not need to be "beautiful" or anything - it can be very "raw" output.

In this context, answering some of your questions Niko:

* I do not think we need an automated API. Failures are infrequent, and I do not see a reason why we would consume the results programmatically (but this might come in the future as we learn).
* A public dashboard is fine, but public access to logs is far more important IMHO

Who should run the infrastructure ?

Looking at the expectations above - I think it would be better for such tests to be run by the Amazon team for Amazon, the Google team for Google, etc. While the CloudFormation scripts would best be published, I think it will be far more efficient to get them in the hands of the stakeholders who are most interested in keeping the tests "green". That is a much more scalable solution from the community point of view. We are not going to publish it to our users, and it does not really need to run on our infra. I don't see a particular need for regular community members to even know what infrastructure is used to run the tests - the test execution is pretty standardised, and I think we are really interested in the output rather than the infra used to run it.

J.




On Fri, Aug 19, 2022 at 2:53 PM Kamil Breguła <dz...@gmail.com> wrote:
I don't think we have to limit ourselves so that only committers have access to the Amazon account managed by the Airflow community. In the past, committers were supported by other people whom they trust - e.g. a committer asked a co-worker from their company for help when they needed it.

This means there would be no restriction on Amazon employees using this account and maintaining this environment.

We just have to be careful that non-committers do not have write permission to the repository, and that they cannot publish a new version of the application that could be seen as officially released by the Apache Foundation.




