Posted to dev@airflow.apache.org by Julien Le Dem <ju...@astronomer.io.INVALID> on 2023/02/13 17:24:30 UTC

Test discussion AIP-53 OpenLineage in Airflow

[changing the subject line to separate this discussion from the voting
thread]
Thank you Jarek,
Yes, I am expecting most of the testing coverage to be in unit tests.
I think following up on tickets and PRs is appropriate to make sure
coverage is at the right level and tests are in the right place.
I'm looking forward to more discussion on the details.
Julien

On Sat, Feb 11, 2023 at 11:57 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> A little side-track: a small comment on what Shubham wrote.
>
> Yeah. I also noticed AIP-47 mentioned, but I considered that an
> implementation detail. I read it as meaning those will be regular unit
> tests, not tests reaching out to external systems. That would make little
> sense, and we definitely want the OpenLineage tests to run regularly
> with every PR; otherwise we would end up in the same boat as
> now, where the repos are separated out. I believe the AIP-47
> mention there was more an attempt to say "the test coverage will be
> high". Julien, am I right?
>
> On Sat, Feb 11, 2023 at 11:57 PM Mehta, Shubham
> <sh...@amazon.com.invalid> wrote:
> >
> > +1 non-binding. I'll be on the lookout for initial PRs to learn more
> about the implementation details of how System Tests will be extended to
> cover these changes, as well as the ongoing maintenance required from
> providers. The proposed changes should definitely make it easier for
> Airflow customers to adopt lineage and improve stability. I'm looking
> forward to seeing how customers will end up using it!
> >
> >
> > Shubham
> >
> >
> >
> > From: Julien Le Dem <ju...@astronomer.io.INVALID>
> > Reply-To: "dev@airflow.apache.org" <de...@airflow.apache.org>
> > Date: Friday, February 10, 2023 at 3:28 PM
> > To: "dev@airflow.apache.org" <de...@airflow.apache.org>
> > Subject: [EXTERNAL] [VOTE] AIP-53 OpenLineage in Airflow
> >
> >
> >
> > Dear Airflow community,
> >
> >
> >
> > Following the discussion thread over the past few weeks, I'd like to
> call a vote on AIP-53 OpenLineage in Airflow:
> >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> >
> >
> >
> > The discussion thread is linked in the confluence doc if you wish to
> consult the history of the conversation. Thank you to all who contributed!
> >
> >
> >
> > This is my (non-binding!) +1. The vote will last until midnight (UTC) on
> Friday 17th February.
> >
> >
> >
> > Thanks,
> >
> > Julien
> >
> >
> >
> > For reference, the Motivation section in the doc:
> >
> > Operational lineage collection is a common need to understand
> dependencies between data pipelines and track end-to-end provenance of
> data. It enables many use cases from ensuring reliable delivery of data
> through observability to compliance and cost management.
> >
> > Publishing operational lineage is a core Airflow capability to enable
> troubleshooting and governance.
> >
> > OpenLineage is a project of the LF AI & Data Foundation that provides
> a spec standardizing operational lineage collection and sharing across the
> data ecosystem. Although it provides plugins for popular open source projects,
> its intent is very similar to that of OpenTelemetry (also under the Linux
> Foundation umbrella): to remain a spec for lineage exchange that projects -
> open source or proprietary - implement.
> >
> > Built-in OpenLineage support in Airflow will make it easier and more
> reliable for Airflow users to publish their operational lineage through the
> OpenLineage ecosystem.
> >
> > The current external plugin maintained in the OpenLineage project
> depends on Airflow and operator internals and breaks when those
> change. A built-in integration provides better, first-class
> support for exposing lineage: it gets tested alongside other changes and
> is therefore more stable.
> >
> > Today, OpenLineage consumers in the ecosystem include Egeria (bank
> compliance), Marquez (build your own metadata platform, for example for
> compliance), Microsoft Purview (governance, …), Astro (data observability),
> and Amundsen. AWS recently blogged about using OpenLineage in the AWS
> ecosystem. Other projects are at various levels of progress.
> >
> > On the producer side, there is support for open source projects like
> Airflow, dbt, Spark, Flink, and Great Expectations, and for proprietary
> warehouses like Snowflake, BigQuery, and Redshift through API integration
> or SQL parsing.
> >
> > Examples of users talking about their usage of OpenLineage can be found
> on the OpenLineage blog.
> >
> > This integration will also stimulate the continued growth of the
> OpenLineage ecosystem and create more value for Airflow users.
>

Re: Test discussion AIP-53 OpenLineage in Airflow

Posted by Jarek Potiuk <ja...@potiuk.com>.
Cool
