Posted to dev@airflow.apache.org by Julien Le Dem <ju...@astronomer.io.INVALID> on 2023/03/17 03:21:02 UTC

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

We are planning to hold this session next Thursday at 5pm CET / 9am PT. I will
send a Zoom link in advance.
Julien

On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <ja...@potiuk.com> wrote:

> Cool. I am looking forward to it :). It would be great to get some
> insight from those who attempted to get the lineage working in several
> versions of Open Lineage and finally arrived at the current
> specs/integration.
>
> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
> <ju...@astronomer.io.invalid> wrote:
> >
> > Thank you Jarek,
> > I am happy to organize a zoom presentation about OpenLineage and answer
> any questions. It is indeed a spec decoupling the data transformation layer
> from the Metadata store people are using. Just like OpenTelemetry is for
> service metrics/traces.
> > Best,
> > Julien
> >
> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >>
> >> And to add a little "parallel" - I think Open Lineage integration
> replacing our "generic lineage" is a very similar step to the new
> "Multi-tenant"-ready authentication interface we are discussing in
> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
> >>
> >> Yes - we have a generic authentication interface, but no - it's useless
> for the case where multi-tenancy and good level of resource authorization
> is needed. It's just far too simplistic and limited.
> >>
> >> Same with current lineage generic interface - yes, we have it but it's
> only useful in a limited set of cases, and if we want to step it up we need
> to come up with something better (and Open Lineage happens to be one that
> has been developed with Airflow in mind and battle tested).
> >>
> >> J.
> >>
> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>
> >>> Hey Rafał (Eugene, Michal - and others who are looking),
> >>>
> >>> I think I know where your, Eugene's and Michał's concerns are coming from. And I
> think it would be great if we can talk it over a bit.  I believe this is -
> in parts - quite a misunderstanding of what Open Lineage really is, how
> much of an integration it is and what are the reasons why it has been
> implemented the way it was implemented in Airflow.
> >>>
> >>> **Idea**: (Julien -  Maybe you can organize it ?):
> >>>
> >>> Maybe we can have an open-to-everyone presentation/zoom call with
> quite some time foreseen to ask questions, where you would explain those
> integration points to the community (and especially to those people who
> are worried we are losing something by choosing the OpenLineage
> integration). I would love to see such a presentation - specifically
> focused on explaining how Open-Lineage is really improving the current
> lineage approach and what problems it solves that the existing generic
> interface doesn't.
> >>>
> >>> Just to set the tone and focus for such meeting if we have one:
> >>>
> >>> For me - when I look at Open Lineage, it is really "this is how
> lineage generic interface **should** be done in Airflow". The "generic"
> lineage support we have now is very, very basic, I'd even say far too
> simplistic. I would even say, it's useless besides a few, very basic use
> cases. Simply because there was never a good "receiver" of the information
> to cover those cases.
> >>>
> >>> When you look closely at OpenLineage, it's nothing more than a better
> convention for the dictionaries that we send as metadata, better metadata
> in the case of SQL operators (Hooks in the future, hopefully), allowing it to handle
> some cases that the current lineage simply cannot. Also, the OpenLineage
> integration with Airflow covers better handling of the "task" and "dag"
> lifecycle in Airflow, to be able to bind lineage data together. That's my
> understanding of what we get when we integrate OL in.
> >>>
> >>> I think over the last 2 years Datakin/Astronomer people have worked out
> the level of interface that **just works** and if we would like to get the
> lineage information from Airflow as useful as it is in OL, we would have to
> anyway implement pretty much all of the things they already did.
> >>>
> >>> I (and I think many community members) would love to take part in such
> a call to hear about that particular aspect of the OL integration.
> >>>
> >>> J.
> >>>
> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <
> rafalbiegacz@google.com.invalid> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I second/echo the input provided by Eugene and Michal.
> >>>>
> >>>> In general, Airflow should provide generic interfaces to lineage
> backends so it's easy to configure the one preferred by the user. Whether
> it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it should
> be the user's choice.
> >>>>
> >>>> We should avoid close integration with any specific lineage backend
> due to the reasons already mentioned, i.e. to avoid translations between
> lineage backends. Also, we would closely couple one framework (Airflow)
> with another one (Open Lineage) - it makes Airflow more complex and less
> flexible. Loose coupling between lineage backends and Airflow seems to be
> more future-proof.
> >>>>
> >>>> Regards, Rafal.
> >>>>
> >>>>
> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
> <ju...@astronomer.io.invalid> wrote:
> >>>>>
> >>>>> Dear Airflow community,
> >>>>> I have transferred the content of the working google doc I shared a
> few weeks ago to the Airflow confluence:
> >>>>>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> >>>>> All comments have been answered, I added clarifications to the doc
> accordingly and I also added your suggestions to improve the proposal.
> >>>>> All that history is linked from the discussion thread link in the
> confluence doc if you wish to consult it.
> >>>>> Thank you all for your feedback and help in the process.
> >>>>> Best
> >>>>> Julien
> >>>>>
> >>>>>
> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <ju...@astronomer.io>
> wrote:
> >>>>>>
> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
> >>>>>> I do agree with Jarek's assessment. I don't have very much to add
> to his argument, it is very thoughtful!
> >>>>>> OpenLineage was started to avoid the cartesian complexity that
> Eugene mentions. There's actually that specific illustration in the
> OpenLineage doc.
> >>>>>> Lineage consumers want to avoid having to understand the lineage
> format of each individual observed data transformation layer. And
> transformation layers don't want to understand every Metadata store's model
> and protocol.
> >>>>>> Eugene, about your specific proposal about a global vocabulary of
> entities, I think it is a great suggestion.
> >>>>>> We can map those entities to Datasets in OpenLineage. The way
> OpenLineage models this is by allowing specific facets attached to Dataset.
> Facets are pieces of metadata each with their own JsonSchema.
> >>>>>> For example a table from a relational database will have a schema
> facet when a file in GCS might not.
> >>>>>> So I think in Airflow we could have each of the entity classes you
> describe be used in the get_openlineage_facets*() API in the Operators.
> >>>>>> Each of those classes would know what OpenLineage facets they can
> expose.
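A minimal sketch of the idea above, in plain Python and deliberately not using the OpenLineage client library: each entity class knows which facets it can expose when it is mapped to a dataset. The to_dataset() method, the columns field and the dictionary layout are assumptions made for illustration only; they are not the API proposed in the AIP.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PostgresTable:
    """Hypothetical lineage entity for a Postgres table (illustrative only)."""

    host: str
    port: int
    database: str
    schema: str
    table: str
    # Assumed extra field: column descriptions known to the operator, if any.
    columns: List[Dict[str, str]] = field(default_factory=list)

    def to_dataset(self) -> Dict[str, Any]:
        # A relational table can describe its columns, so it attaches a
        # schema facet whenever column information is available.
        facets: Dict[str, Any] = {}
        if self.columns:
            facets["schema"] = {"fields": list(self.columns)}
        return {
            "namespace": f"postgres://{self.host}:{self.port}",
            "name": f"{self.database}.{self.schema}.{self.table}",
            "facets": facets,
        }


@dataclass
class GCSEntity:
    """Hypothetical lineage entity for an object in Google Cloud Storage."""

    bucket: str
    path: str

    def to_dataset(self) -> Dict[str, Any]:
        # A plain file has no known schema here, so no schema facet is attached.
        return {"namespace": f"gs://{self.bucket}", "name": self.path, "facets": {}}
```

In this sketch an operator would only build entity objects; which facets end up attached is decided by the entity class itself, so the table carries a schema facet and the GCS file does not.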
> >>>>>> I'll add a mention in the AIP and I think we can go in more details
> in a ticket.
> >>>>>> Cheers,
> >>>>>> Julien
> >>>>>>
> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com>
> wrote:
> >>>>>>>
> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer
> will
> >>>>>>> be more thoughtful).
> >>>>>>>
> >>>>>>> I think you are right about the "agnostic" part. But I have one
> question
> >>>>>>> - what are we considering "agnostic"?
> >>>>>>>
> >>>>>>>  There is no "widespread" standard for lineage (yet). Open Lineage
> >>>>>>> with its donation to Linux Foundation Data & AI is aspiring to
> become
> >>>>>>> one. And it's a pretty good candidate:
> >>>>>>>
> >>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
> >>>>>>> published as an API from day one)
> >>>>>>> * as of recently, the ownership and governance of Open Lineage is
> with
> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/)  which
> is
> >>>>>>> part of "Linux Foundation Project" - a well-known and respected
> >>>>>>> foundation that, similarly to the ASF, is an umbrella and provides
> >>>>>>> governance rules for a big number of well established OSS projects
> >>>>>>>
> >>>>>>> In essence it is the same approach as the one we already discussed and
> >>>>>>> approved for Open Telemetry (which is governed by CNCF, which is in the
> >>>>>>> same league as LFP when it comes to recognition and governance), though
> >>>>>>> it is not yet implemented. In the case of Open-Telemetry, we decided
> >>>>>>> against developing our "own" standard and opted for one that is out
> >>>>>>> there. Yes, it is a bit more established and popular than Open Lineage
> >>>>>>> is, but I so wish that we had chosen and implemented it already, and
> >>>>>>> earlier, as not having a standard there (except statsd, which is really,
> >>>>>>> really poor) has a great impact on Airflow being just "pluggable" in
> >>>>>>> existing solutions for monitoring. (BTW, I hope we implement it soon,
> >>>>>>> and I hear (and see) there are attempts to do so.)
> >>>>>>>
> >>>>>>> In the case of Open Lineage, the questions are - is there an
> >>>>>>> alternative of the same caliber? Shall we produce our own "agnostic
> >>>>>>> standard" for it instead ? Is there a chance the idea of
> >>>>>>> "airflow-specific" attributes will catch on and many "consumers"
> will
> >>>>>>> be writing their own conversions to the way they can consume it?
> >>>>>>>
> >>>>>>> I would really, really try to avoid the pitfalls nicely summarized
> >>>>>>> here: https://xkcd.com/927/
> >>>>>>>
> >>>>>>> We can of course make a wrong bet and in 2 years Airflow might be
> the
> >>>>>>> only one supporting Open Lineage. That might happen. Though the
> list
> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or
> maybe -
> >>>>>>> more likely - once Airflow implements it, due to Airflow's
> popularity
> >>>>>>> and the fact that there is already competition supporting it (e.g.
> >>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption of
> >>>>>>> Open Lineage. My bet is -  the latter and for the benefit of the
> whole
> >>>>>>> ecosystem. I think we have a chance to influence creation of a new,
> >>>>>>> important standard. Much less so, I think if we just provide our
> own
> >>>>>>> custom solution - with lots and lots of work for others to be able
> to
> >>>>>>> consume it, no time to properly nurture the API and make it easier
> to
> >>>>>>> implement it (which is undoubtedly the main focus of Datakin, Astronomer
> >>>>>>> and now the governance run by LFData & AI).
> >>>>>>>
> >>>>>>> Are there other alternatives we should consider ? Do we want to
> >>>>>>> develop our own standard (and implement all the integrations from
> the
> >>>>>>> ground up)?
> >>>>>>>
> >>>>>>> J.
> >>>>>>>
> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eu...@kosteev.com>
> wrote:
> >>>>>>> >
> >>>>>>> > Hi Julien.
> >>>>>>> >
> >>>>>>> > I reviewed the design doc.
> >>>>>>> > The general idea looks good to me, but I have some concerns that
> I would like to share.
> >>>>>>> >
> >>>>>>> > If I understand correctly the proposed design is to fill in
> "operators" with self-methods to extract lineage metadata from it, and I
> agree with the motivation. If those are decoupled (in the form of extractors
> in a separate package) from the operators themselves, then the downside is that (as
> you mentioned) extractors will be distributed separately and the "operators"
> logic will be out of sync with the "lineage extraction" logic by design.
> >>>>>>> > Also knowledge about internals of operator spills out of the
> operator which is not good at all (at the very least).
> >>>>>>> >
> >>>>>>> > However, if we make every operator expose a method to
> generate lineage metadata in a specific format, e.g. OpenLineage etc.,
> then we will end up with the cartesian complexity of supporting each
> backend format in each provider+operator.
> >>>>>>> >
> >>>>>>> > If you say that the goal is that "operators" will always
> generate OpenLineage format only and each consumer will convert this format
> to their own internal representation, well, if they do this then this seems
> like a working approach. But with the assumption that each consumer will
> support it.
> >>>>>>> >
> >>>>>>> > I think it comes down to the question: is the OpenLineage format
> popular, complete and suitable enough for lineage metadata that every
> consumer will be convinced to support it? We may also consider issues like
> a mismatch in lineage feature parity, e.g. OpenLineage supports field-level
> lineage but a consumer doesn't support it (or not at the moment), so we would
> prefer lineage metadata transferred to the backend to be slightly different
> in this case.
> >>>>>>> >
> >>>>>>> > What do you think about the idea:
> >>>>>>> > 1. make lineage metadata generated by "operators" to be agnostic
> of the specific format, just using entities from big generic vocabulary of
> entities e.g. created here
> https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py.
> We would have there e.g. entities like:
> >>>>>>> >
> --------------------------------------------------------------------
> >>>>>>> > import attr
> >>>>>>> >
> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >>>>>>> > class PostgresTable:
> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
> >>>>>>> >
> >>>>>>> >     host: str = attr.ib()
> >>>>>>> >     port: str = attr.ib()
> >>>>>>> >     database: str = attr.ib()
> >>>>>>> >     schema: str = attr.ib()
> >>>>>>> >     table: str = attr.ib()
> >>>>>>> >
> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >>>>>>> > class GCSEntity:
> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
> >>>>>>> >
> >>>>>>> >     bucket: str = attr.ib()
> >>>>>>> >     path: str = attr.ib()
> >>>>>>> >
> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >>>>>>> > class AWSS3Entity:
> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
> >>>>>>> >
> >>>>>>> >     bucket: str = attr.ib()
> >>>>>>> >     path: str = attr.ib()
> >>>>>>> >
> --------------------------------------------------------------------
> >>>>>>> > 2. Implement "adapters" that will act as a bridge between
> "operators" and backends. Their responsibility will be to convert lineage
> metadata generated by "operators" to a format understandable by specific
> backend.
> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets to
> pass Airflow lineage metadata to the Airflow lineage backend.
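A rough sketch of the two points above, under stated assumptions: operators describe their inputs and outputs once, using format-agnostic entities, and each "adapter" converts those entities to what one backend expects. All class names here (LineageAdapter, OpenLineageStyleAdapter, EdgeListAdapter) and both output layouts are hypothetical; neither Airflow nor OpenLineage defines them, and the dictionaries only loosely imitate real formats.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol


@dataclass(frozen=True)
class GCSEntity:
    """Format-agnostic entity, modelled on the GCSEntity proposed above."""

    bucket: str
    path: str


class LineageAdapter(Protocol):
    """Hypothetical bridge between operator entities and one backend format."""

    def convert(self, inlets: List[GCSEntity], outlets: List[GCSEntity]) -> Dict[str, Any]:
        ...


class OpenLineageStyleAdapter:
    """Emits namespace/name datasets, loosely in the spirit of OpenLineage."""

    def convert(self, inlets: List[GCSEntity], outlets: List[GCSEntity]) -> Dict[str, Any]:
        def dataset(entity: GCSEntity) -> Dict[str, str]:
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}

        return {
            "inputs": [dataset(e) for e in inlets],
            "outputs": [dataset(e) for e in outlets],
        }


class EdgeListAdapter:
    """Imaginary second backend that wants plain (source, target) edges."""

    def convert(self, inlets: List[GCSEntity], outlets: List[GCSEntity]) -> Dict[str, Any]:
        def uri(entity: GCSEntity) -> str:
            return f"gs://{entity.bucket}/{entity.path}"

        # One edge per (input, output) pair.
        return {"edges": [(uri(i), uri(o)) for i in inlets for o in outlets]}


def export_lineage(adapter: LineageAdapter, inlets: List[GCSEntity],
                   outlets: List[GCSEntity]) -> Dict[str, Any]:
    # The operator side never changes; only the adapter that is plugged in does.
    return adapter.convert(inlets, outlets)
```

With this shape, supporting N operators and M backends takes N entity-producing operators plus M adapters, rather than N x M operator-specific converters.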
> >>>>>>> >
> >>>>>>> > I didn't quite get the implementation details of your proposed
> design, but I think maintaining a global vocabulary of entities to use in
> inlets/outlets of operators is crucial for Airflow, as this could be
> leveraged to build various features on top of it, like displaying lineage
> graph in Airflow UI (based on XCOM):)
> >>>>>>> >
> >>>>>>> > Importantly, if we decide to send lineage metadata out from Airflow
> only in OpenLineage format, well, we could then have only
> one "adapter", OpenLineageAdapter. But the "adapters" approach leaves us
> room for adding support for others (following the "pluggable" approach that
> Airflow is mainly known for).
> >>>>>>> >
> >>>>>>> > All in all:
> >>>>>>> > - global vocabulary of entities used across all "operators"
> (with all advantages out of it, mentioned above)
> >>>>>>> > - "adapters" approach
> >>>>>>> > seem to me to be the crucial points in the design.
> >>>>>>> >
> >>>>>>> > What do you think about this?
> >>>>>>> >
> >>>>>>> > - Eugene
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
> <ju...@astronomer.io.invalid> wrote:
> >>>>>>> >>
> >>>>>>> >> Hello Michał,
> >>>>>>> >> Thank you for your input.
> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption
> about the backend being used to store lineage and is an adapter-like layer.
> >>>>>>> >> OpenLineage exists as the spec specifically for that purpose of
> avoiding the problem of every lineage consumer having to understand every
> lineage producer.
> >>>>>>> >> Consumers of lineage want a unified spec for consuming lineage from
> any data transformation layer like Airflow, Spark, Flink, SQL, Warehouses,
> ...
> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently
> of the technology used, so does OpenLineage for lineage.
> >>>>>>> >> Julien
> >>>>>>> >>
> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
> michalmodras@google.com> wrote:
> >>>>>>> >>>
> >>>>>>> >>> Hi everyone,
> >>>>>>> >>>
> >>>>>>> >>> As Airflow already supports lineage functionality through
> pluggable lineage backends, I think OpenLineage and other lineage systems
> integration should follow this path. I think more 'native' integration with
> OpenLineage (or any other lineage system) in Airflow while maintaining the
> generic lineage backend architecture in parallel would make the user
> experience less open and harder to maintain, and the Airflow architecture
> itself more constrained by the logic of a specific system.
> >>>>>>> >>>
> >>>>>>> >>> I think enriching operators with a generic method exposing
> lineage metadata that could be leveraged by lineage backends regardless of
> their implementation is a good idea which the Cloud Composer team would
> gladly contribute to. I believe the translation of the Airflow metadata
> exposed by the operators should be done by lineage backends (or another
> adapter-like layer). Tying Airflow operators' development to a specific
> lineage system like OpenLineage forces operators' contributors to
> understand that system too, which increases both the entry costs and
> maintenance costs. I see it as unnecessary coupling.
> >>>>>>> >>>
> >>>>>>> >>> Best,
> >>>>>>> >>> Michal
> >>>>>>> >>>
> >>>>>>> >>>
> >>>>>>> >>>
> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
> julien@astronomer.io> wrote:
> >>>>>>> >>>>
> >>>>>>> >>>> Thank you Eugen,
> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I
> think this would work well.
> >>>>>>> >>>> Here are the sections in the doc that I think address your
> points:
> >>>>>>> >>>> - generalize lineage metadata extraction as self-method in
> each operator, using generic lineage entities
> >>>>>>> >>>> See: OpenLineage support in providers. It describes how each
> operator exposes its lineage.
> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to Data
> Lineage format, Open Lineage format, etc.
> >>>>>>> >>>> The goal here is that each consumer converts the OpenLineage format
> to their own internal representation, as you are suggesting.
> >>>>>>> >>>> In the motivation section, towards the end, I link to a few
> examples of data catalogs doing just that.
> >>>>>>> >>>>
> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <
> eugen@kosteev.com> wrote:
> >>>>>>> >>>>>
> >>>>>>> >>>>> ++ Michal Modras
> >>>>>>> >>>>>
> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
> eugen@kosteev.com> wrote:
> >>>>>>> >>>>>>
> >>>>>>> >>>>>> Cloud Composer recently launched the "Data lineage with
> Dataplex" feature, which effectively generates lineage from
> DAG/task executions and exports it to Data Lineage (Data Catalog service)
> for further analysis.
> >>>>>>> >>>>>>
> https://cloud.google.com/composer/docs/composer-2/lineage-integration
> >>>>>>> >>>>>>
> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
> >>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage
> backend" feature and methods to extract lineage metadata on task post
> execution events.
> >>>>>>> >>>>>>
> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow
> community in the following form:
> >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method in
> each operator, using generic lineage entities
> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to
> Data Lineage format, Open Lineage format, etc.
> >>>>>>> >>>>>>
> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean
> introducing an additional layer that converts from the OpenLineage format to
> Data Lineage (Data Catalog/Dataplex) format. But this is definitely a
> possibility.
> >>>>>>> >>>>>>
> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> <ju...@astronomer.io.invalid> wrote:
> >>>>>>> >>>>>>>
> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
> >>>>>>> >>>>>>> I am responding in the comments and adding to the doc
> accordingly.
> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
> >>>>>>> >>>>>>> Julien
> >>>>>>> >>>>>>>
> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
> jarek@potiuk.com> wrote:
> >>>>>>> >>>>>>>>
> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is
> (and should be
> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's
> capabilities
> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all working
> on - Airflow
> >>>>>>> >>>>>>>> as a Platform.
> >>>>>>> >>>>>>>>
> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes the
> same
> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry
> goes, where we
> >>>>>>> >>>>>>>> might decide to support certain standards in order to
> expand
> >>>>>>> >>>>>>>> the capabilities of Airflow-as-a-platform and allow plugging in multiple
> >>>>>>> >>>>>>>> external solutions that would use the standard API. After
> Open-Lineage
> >>>>>>> >>>>>>>> graduated recently to  LFAI&Data foundation (I've been
> watching this
> >>>>>>> >>>>>>>> happening from far), it is I think the perfect candidate
> for Airflow
> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the players
> to make use
> >>>>>>> >>>>>>>> of the extra work necessary by the community to make it
> "officially
> >>>>>>> >>>>>>>> supported". I think we have to also get some feedback
> from the big
> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have
> such a
> >>>>>>> >>>>>>>> capability, and another is to get it used in all the ways
> Airflow is
> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which is
> obviously a
> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow is
> exposed by
> >>>>>>> >>>>>>>> others - Astronomer is obviously on board. We see some
> warm words from
> >>>>>>> >>>>>>>> Amazon (mentioned by Julien), and I would love to hear
> whether the
> >>>>>>> >>>>>>>> Composer team at Google would be on board in using the
> open-lineage
> >>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and
> likely more)
> >>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly other
> stakeholders
> >>>>>>> >>>>>>>> might want to say something.
> >>>>>>> >>>>>>>>
> >>>>>>> >>>>>>>>
> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in
> implementing and
> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that is the main
> >>>>>>> >>>>>>>> reason why the Open Lineage community would like to make the
> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart, integrating it in
> >>>>>>> >>>>>>>> a way that allows us to plug it into our CI and verification
> >>>>>>> >>>>>>>> process, and setting some very clear expectations about what it means
> >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can
> make some
> >>>>>>> >>>>>>>> initial investment in making it happen and minimise
> on-going cost,
> >>>>>>> >>>>>>>> while maximising the gain.
> >>>>>>> >>>>>>>>
> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help
> with all that
> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even
> if it will
> >>>>>>> >>>>>>>> take an extra effort, especially since we will have
> experts from Open
> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage
> being the core
> >>>>>>> >>>>>>>> part of the effort. I am actually super excited - this
> might be the
> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position as
> an
> >>>>>>> >>>>>>>> indispensable component of "even more modern data stack".
> >>>>>>> >>>>>>>>
> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking
> forward to
> >>>>>>> >>>>>>>> making it happen :).
> >>>>>>> >>>>>>>>
> >>>>>>> >>>>>>>> J.
> >>>>>>> >>>>>>>>
> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
> >>>>>>> >>>>>>>> <ju...@astronomer.io.invalid> wrote:
> >>>>>>> >>>>>>>> >
> >>>>>>> >>>>>>>> > Dear Airflow Community,
> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
> OpenLineage provider to Airflow.
> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an
> official AIP.
> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
> >>>>>>> >>>>>>>> > Thank you,
> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
> >>>>>>> >>>>>>>> >
> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
> >>>>>>> >>>>>>>> >
> >>>>>>> >>>>>>>> > Operational lineage collection is a common need to
> understand dependencies between data pipelines and track end-to-end
> provenance of data. It enables many use cases from ensuring reliable
> delivery of data through observability to compliance and cost management.
> >>>>>>> >>>>>>>> >
> >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow
> capability to enable troubleshooting and governance.
> >>>>>>> >>>>>>>> >
> >>>>>>> >>>>>>>> > OpenLineage is a project that is part of the LFAI&Data
> foundation and provides a spec standardizing operational lineage
> collection and sharing across the data ecosystem. Although it provides plugins
> for popular open source projects, its intent is very similar to
> OpenTelemetry (also under the Linux Foundation umbrella): to remain a spec
> for lineage exchange that projects - open source or proprietary - implement.
> >>>>>>> >>>>>>>> >
> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it
> easier and more reliable for Airflow users to publish their operational
> lineage through the OpenLineage ecosystem.
> >>>>>>> >>>>>>>> >
> >>>>>>> >>>>>>>> > The current external plugin maintained in the
> OpenLineage project depends on Airflow and operator internals and
> breaks when those change. Having a built-in integration
> ensures better first-class support for exposing lineage that gets tested
> alongside other changes and is therefore more stable.
> >>>>>>> >>>>>>
> >>>>>>> >>>>>>
> >>>>>>> >>>>>>
> >>>>>>> >>>>>> --
> >>>>>>> >>>>>> Eugene
> >>>>>>> >>>>>
> >>>>>>> >>>>>
> >>>>>>> >>>>>
> >>>>>>> >>>>> --
> >>>>>>> >>>>> Eugene
> >>>>>>> >
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > --
> >>>>>>> > Eugene
>

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Julien Le Dem <ju...@astronomer.io.INVALID>.
And here is the recording:
https://youtu.be/fAqvoMzz7Tk

On Fri, Mar 31, 2023 at 1:51 PM Julien Le Dem <ju...@astronomer.io> wrote:

> Thank you all who attended.
> Here are the slides we presented today:
>
> https://docs.google.com/presentation/d/1o8VnXHXME_Vf-eQpQ5qvC7fb951CjBHmrsCsLW1Z1_8/edit?usp=sharing
> I'll also post the recording once available.
> Julien
>
> On Fri, Mar 24, 2023 at 2:40 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Added :)
>>
>> On Fri, Mar 24, 2023 at 4:30 AM Bowrna Prabhakaran <ma...@gmail.com>
>> wrote:
>> >
>> > Can I get added to the invitation as well? (mailbowrna@gmail.com)
>> > Thanks
>> >
>> > On Fri, Mar 24, 2023 at 2:37 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >
>> > > did
>> > >
>> > > On Thu, Mar 23, 2023 at 9:22 PM c c <ch...@gmail.com>
>> wrote:
>> > > >
>> > > > Can I be added to the invitation as well(changcheng12345@gmail.com
>> )?
>> > > > thanks!
>> > > >
>> > > > On Thu, Mar 23, 2023 at 12:59 PM Jarek Potiuk <ja...@potiuk.com>
>> wrote:
>> > > >
>> > > > > I added all those who asked. It's really cool we have so much
>> interest
>> > > :).
>> > > > >
>> > > > > Julien, Maciej: NO PRESSURE
>> > > > >
>> > > > >
>> > > > >
>> ---------------------------------------------------------------------
>> > > > > To unsubscribe, e-mail: dev-unsubscribe@airflow.apache.org
>> > > > > For additional commands, e-mail: dev-help@airflow.apache.org
>> > >
>> > >
>> > >
>> >
>> > --
>> > Regards
>> >
>> > Bowrna Prabhakaran
>>
>>
>>

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Julien Le Dem <ju...@astronomer.io.INVALID>.
Thank you all who attended.
Here are the slides we presented today:
https://docs.google.com/presentation/d/1o8VnXHXME_Vf-eQpQ5qvC7fb951CjBHmrsCsLW1Z1_8/edit?usp=sharing
I'll also post the recording once available.
Julien

On Fri, Mar 24, 2023 at 2:40 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Added :)
>
> On Fri, Mar 24, 2023 at 4:30 AM Bowrna Prabhakaran <ma...@gmail.com>
> wrote:
> >
> > Can I get added to the invitation as well? (mailbowrna@gmail.com)
> > Thanks
> >
> > On Fri, Mar 24, 2023 at 2:37 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > > did
> > >
> > > On Thu, Mar 23, 2023 at 9:22 PM c c <ch...@gmail.com> wrote:
> > > >
> > > > Can I be added to the invitation as well(changcheng12345@gmail.com)?
> > > > thanks!
> > > >
> > > > On Thu, Mar 23, 2023 at 12:59 PM Jarek Potiuk <ja...@potiuk.com>
> wrote:
> > > >
> > > > > I added all those who asked. It's really cool we have so much
> interest
> > > :).
> > > > >
> > > > > Julien, Maciej: NO PRESSURE
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
> >
> > --
> > Regards
> >
> > Bowrna Prabhakaran
>
>
>

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Jarek Potiuk <ja...@potiuk.com>.
Added :)

On Fri, Mar 24, 2023 at 4:30 AM Bowrna Prabhakaran <ma...@gmail.com> wrote:
>
> Can I get added to the invitation as well? (mailbowrna@gmail.com)
> Thanks
>
> On Fri, Mar 24, 2023 at 2:37 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > did
> >
> > On Thu, Mar 23, 2023 at 9:22 PM c c <ch...@gmail.com> wrote:
> > >
> > > Can I be added to the invitation as well(changcheng12345@gmail.com)?
> > > thanks!
> > >
> > > On Thu, Mar 23, 2023 at 12:59 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > I added all those who asked. It's really cool we have so much interest
> > :).
> > > >
> > > > Julien, Maciej: NO PRESSURE
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscribe@airflow.apache.org
> > > > For additional commands, e-mail: dev-help@airflow.apache.org
> >
> >
> >
>
> --
> Regards
>
> Bowrna Prabhakaran


Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Bowrna Prabhakaran <ma...@gmail.com>.
Can I get added to the invitation as well? (mailbowrna@gmail.com)
Thanks

On Fri, Mar 24, 2023 at 2:37 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> did
>
> On Thu, Mar 23, 2023 at 9:22 PM c c <ch...@gmail.com> wrote:
> >
> > Can I be added to the invitation as well(changcheng12345@gmail.com)?
> > thanks!
> >
> > On Thu, Mar 23, 2023 at 12:59 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > > > I added all those who asked. It's really cool we have so much interest :).
> > >
> > > Julien, Maciej: NO PRESSURE
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@airflow.apache.org
> > > For additional commands, e-mail: dev-help@airflow.apache.org
>
>
>

-- 
Regards

Bowrna Prabhakaran

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Jarek Potiuk <ja...@potiuk.com>.
did

On Thu, Mar 23, 2023 at 9:22 PM c c <ch...@gmail.com> wrote:
>
> Can I be added to the invitation as well(changcheng12345@gmail.com)?
> thanks!
>
> On Thu, Mar 23, 2023 at 12:59 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > I added all those who asked. It's really cool we have so much interest :).
> >
> > Julien, Maciej: NO PRESSURE
> >
> >


Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by c c <ch...@gmail.com>.
Can I be added to the invitation as well(changcheng12345@gmail.com)?
thanks!

On Thu, Mar 23, 2023 at 12:59 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I added all those who asked. It's really cool we have so much interest :).
>
> Julien, Maciej: NO PRESSURE
>
>

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Jarek Potiuk <ja...@potiuk.com>.
I added all those who asked. It's really cool we have so much interest :).

Julien, Maciej: NO PRESSURE


Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Marcelo Costa <me...@gmail.com>.
I'd like to join as well! (mesmacosta@gmail.com)

On Thu, 23 Mar 2023 at 19:23 Oliveira, Niko <on...@amazon.com.invalid>
wrote:

> I'd like to join as well! (oliveira.n3@gmail.com)
>
> ________________________________
> From: Igor Kholopov <ik...@google.com.INVALID>
> Sent: Wednesday, March 22, 2023 4:01:40 PM
> To: dev@airflow.apache.org
> Subject: RE: [EXTERNAL]Request for feedback on proposal for new
> OpenLineage provider in Airflow
>
>
> +1, would be happy to join the session! (Please add either
> ikholopov@google.com or kholopovus@gmail.com).
>
> Best,
> Igor
>
> On Wed, Mar 22, 2023 at 11:27 PM Pierre Jeambrun <pi...@gmail.com>
> wrote:
>
> > Same here if you can add me please.
> >
> > Looking forward to this session.
> >
> > On Wed, Mar 22, 2023 at 23:07, Mehta, Shubham <sh...@amazon.com.invalid>
> > wrote:
> >
> > > Please include me, I will try my best to join
> > > (shubhammehta.93@gmail.com)
> > >
> > > Best,
> > > Shubham
> > >
> > > On 2023-03-22, 2:24 PM, "Jarek Potiuk" <jarek@potiuk.com> wrote:
> > >
> > >
> > >
> > > There are some strange behaviours in the calendar entry - I think you
> > > cannot add yourself, only guests can add others :)
> > > I've added you Eugen, maybe if someone wants to be also added - please
> > > post here with your gmail/calendar addresses.
> > >
> > >
> > > J.
> > >
> > >
> > > On Wed, Mar 22, 2023 at 9:56 PM Eugen Kosteev <eugen@kosteev.com> wrote:
> > > >
> > > > Hi Julien.
> > > >
> > > > Can you, please, include me there as well: eugen@kosteev.com or
> > > > kosteev@google.com.
> > > > Looking forward to seeing the presentation.
> > > >
> > > > - Eugene
> > > >
> > > > On Wed, Mar 22, 2023 at 8:36 PM Julien Le Dem <julien@astronomer.io.invalid>
> > > > wrote:
> > > >
> > > > > Hello all,
> > > > > I have to move the OpenLineage presentation to next week.
> > > > > Sorry for the change.
> > > > > It will be Friday next week March 31st at 5pm CET 9am PT.
> > > > >
> > > > >
> > > > > https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
> > > > > Julien
> > > > >
> > > > > On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <julien@astronomer.io>
> > > > > wrote:
> > > > >
> > > > > > We are planning to do this session next Thursday at 5pm CET 9am PT. I
> > > > > > will send a zoom link in advance.
> > > > > > Julien
> > > > > >
> > > > > > On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <jarek@potiuk.com> wrote:
> > > > > >
> > > > > > >> Cool. I am looking forward to it :). It would be great to get some
> > > > > > >> insight from those who attempted to get the lineage working in several
> > > > > > >> versions of Open Lineage and finally arrived at the current
> > > > > > >> specs/integration.
> > > > > > >>
> > > > > > >> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
> > > > > > >> <julien@astronomer.io.invalid> wrote:
> > > > > >> >
> > > > > > >> > Thank you Jarek,
> > > > > > >> > I am happy to organize a zoom presentation about OpenLineage and answer
> > > > > > >> > any question. It is indeed a spec decoupling the data transformation
> > > > > > >> > layer from the Metadata store people are using. Just like OpenTelemetry
> > > > > > >> > is for service metrics/traces.
> > > > > > >> > Best,
> > > > > > >> > Julien
> > > > > > >> >
> > > > > > >> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <jarek@potiuk.com> wrote:
> > > > > > >> >>
> > > > > > >> >> And to add a little "parallel" - I think Open Lineage integration
> > > > > > >> >> replacing our "generic lineage" is very similar step to the new
> > > > > > >> >> "Multi-tenant"-ready authentication interface we are discussing in
> > > > > > >> >> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
> > > > > > >> >>
> > > > > > >> >> Yes - we have a generic authentication interface, but no - it's useless
> > > > > > >> >> for the case where multi-tenancy and good level of resource
> > > > > > >> >> authorization is needed. It's just far too simplistic and limited.
> > > > > > >> >>
> > > > > > >> >> Same with current lineage generic interface - yes, we have it but it's
> > > > > > >> >> only useful in a limited set of cases. and if we want to step-it-up we
> > > > > > >> >> need to come up with something better (and Open Lineage happens to be
> > > > > > >> >> one that has been developed with Airflow in mind and battle tested).
> > > > > > >> >>
> > > > > > >> >> J.
> > > > > > >> >>
> > > > > > >> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <jarek@potiuk.com> wrote:
> > > > > >> >>>
> > > > > > >> >>> Hey Rafał (Eugene, Michal - and others who are looking),
> > > > > > >> >>>
> > > > > > >> >>> I think I know where your/Eugen/Michał concerns are coming from. And I
> > > > > > >> >>> think it would be great if we can talk it over a bit. I believe this
> > > > > > >> >>> is - in parts - quite a misunderstanding of what Open Lineage really
> > > > > > >> >>> is, how much of an integration it is and what are the reasons why it
> > > > > > >> >>> has been implemented the way it was implemented in Airflow.
> > > > > > >> >>>
> > > > > > >> >>> **Idea**: (Julien - Maybe you can organize it ?):
> > > > > > >> >>>
> > > > > > >> >>> Maybe we can have an open-to-everyone presentation/zoom call with
> > > > > > >> >>> quite some time foreseen to ask questions where you would explain to
> > > > > > >> >>> the community those integration points (and especially to those people
> > > > > > >> >>> who are worried we are losing something by choosing the OpenLineage
> > > > > > >> >>> integration). I would love to see such a presentation - specifically
> > > > > > >> >>> focused on explaining how Open-Lineage is really improving the current
> > > > > > >> >>> lineage approach and what problems it solves that the existing generic
> > > > > > >> >>> interface doesn't.
> > > > > > >> >>>
> > > > > > >> >>> Just to set the tone and focus for such meeting if we have one:
> > > > > > >> >>>
> > > > > > >> >>> For me - when I look at Open Lineage, it is really "this is how
> > > > > > >> >>> lineage generic interface **should** be done in Airflow". The
> > > > > > >> >>> "generic" lineage support we have now is very, very basic, I'd even
> > > > > > >> >>> say far too simplistic. I would even say, it's useless besides a few,
> > > > > > >> >>> very basic use cases. Simply because there was never a good "receiver"
> > > > > > >> >>> of the information to cover those cases.
> > > > > > >> >>>
> > > > > > >> >>> When you look closely at OpenLineage, it's nothing more than a better
> > > > > > >> >>> convention of the dictionaries that we send as a metadata, better
> > > > > > >> >>> meta-data in case of SQL operators (Hooks in the future hopefully),
> > > > > > >> >>> allowing handling some cases that current lineage simply cannot. Also,
> > > > > > >> >>> what open-lineage integration with Airflow covers is better handling
> > > > > > >> >>> of the lifecycle of "task" and "dag" in Airflow to be able to bind
> > > > > > >> >>> lineage data together. That's my understanding of what we get when we
> > > > > > >> >>> integrate OL in.
> > > > > > >> >>>
> > > > > > >> >>> I think over the last 2 years Datakin/Astronomer people had worked out
> > > > > > >> >>> the level of interface that **just works** and if we would like to get
> > > > > > >> >>> the lineage information from Airflow as useful as it is in OL, we
> > > > > > >> >>> would have to anyway implement pretty much all of the things they
> > > > > > >> >>> already did.
> > > > > > >> >>>
> > > > > > >> >>> I would love (and I think many community members) to take part in such
> > > > > > >> >>> a call to hear on that particular aspect of the OL integration.
> > > > > > >> >>>
> > > > > > >> >>> J.
> > > > > > >> >>>
> > > > > > >> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz
> > > > > > >> >>> <rafalbiegacz@google.com.invalid> wrote:
> > > > > >> >>>>
> > > > > > >> >>>> Hi,
> > > > > > >> >>>>
> > > > > > >> >>>> I second/echo the input provided by Eugene and Michal.
> > > > > > >> >>>>
> > > > > > >> >>>> In general, Airflow should provide generic interfaces to lineage
> > > > > > >> >>>> backends so it's easy to configure the one preferred by the user.
> > > > > > >> >>>> Whether it's Open Lineage, proprietary solution, Dataplex Lineage,
> > > > > > >> >>>> etc. it should be the user's choice.
> > > > > > >> >>>>
> > > > > > >> >>>> We should avoid close integration with any specific lineage backend
> > > > > > >> >>>> due to the reasons already mentioned, i.e. to avoid translations
> > > > > > >> >>>> between lineage backends. Also, we would closely couple one framework
> > > > > > >> >>>> (Airflow) with another one (Open Lineage) - it makes Airflow more
> > > > > > >> >>>> complex and less flexible. Loose coupling between lineage backends
> > > > > > >> >>>> and Airflow seems to be more future-proof.
> > > > > > >> >>>>
> > > > > > >> >>>> Regards, Rafal.
> > > > > > >> >>>>
> > > > > > >> >>>>
> > > > > > >> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
> > > > > > >> >>>> <julien@astronomer.io.invalid> wrote:
> > > > > >> >>>>>
> > > > > > >> >>>>> Dear Airflow community,
> > > > > > >> >>>>> I have transferred the content of the working google doc I shared a
> > > > > > >> >>>>> few weeks ago to the Airflow confluence:
> > > > > > >> >>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > > > > > >> >>>>> All comments have been answered, I added clarifications to the doc
> > > > > > >> >>>>> accordingly and I also added your suggestions to improve the
> > > > > > >> >>>>> proposal.
> > > > > > >> >>>>> All that history is linked from the discussion thread link in the
> > > > > > >> >>>>> confluence doc if you wish to consult it.
> > > > > > >> >>>>> Thank you all for your feedback and help in the process.
> > > > > > >> >>>>> Best
> > > > > > >> >>>>> Julien
> > > > > > >> >>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <julien@astronomer.io>
> > > > > > >> >>>>> wrote:
> > > > > >> >>>>>>
> > > > > > >> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
> > > > > > >> >>>>>> I do agree with Jarek's assessment. I don't have very much to add
> > > > > > >> >>>>>> to his argument, it is very thoughtful!
> > > > > > >> >>>>>> OpenLineage was started to avoid the cartesian complexity that
> > > > > > >> >>>>>> Eugene mentions. There's actually that specific illustration in
> > > > > > >> >>>>>> the OpenLineage doc.
> > > > > > >> >>>>>> Lineage consumers want to avoid having to understand the lineage
> > > > > > >> >>>>>> format of each individual observed data transformation layer. And
> > > > > > >> >>>>>> transformation layers don't want to understand every Metadata
> > > > > > >> >>>>>> store's model and protocol.
> > > > > > >> >>>>>> Eugene, about your specific proposal about a global vocabulary of
> > > > > > >> >>>>>> entities, I think it is a great suggestion.
> > > > > > >> >>>>>> We can map those entities to Datasets in OpenLineage. The way
> > > > > > >> >>>>>> OpenLineage models this is by allowing specific facets attached to
> > > > > > >> >>>>>> Dataset. Facets are pieces of metadata each with their own
> > > > > > >> >>>>>> JsonSchema.
> > > > > > >> >>>>>> For example a table from a relational database will have a schema
> > > > > > >> >>>>>> facet when a file in GCS might not.
> > > > > > >> >>>>>> So I think in Airflow we could have each of the entity classes you
> > > > > > >> >>>>>> describe be used in the get_openlineage_facets*() API in the
> > > > > > >> >>>>>> Operators.
> > > > > > >> >>>>>> Each of those classes would know what OpenLineage facets they can
> > > > > > >> >>>>>> expose.
> > > > > > >> >>>>>> I'll add a mention in the AIP and I think we can go in more
> > > > > > >> >>>>>> details in a ticket.
> > > > > > >> >>>>>> Cheers,
> > > > > > >> >>>>>> Julien
> > > > > > >> >>>>>>
> > > > > > >> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <jarek@potiuk.com>
> > > > > > >> >>>>>> wrote:
> > > > > >> >>>>>>>
> > > > > > >> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer will
> > > > > > >> >>>>>>> be more thoughtful).
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> I think you are right to the "agnostic" part. But I have one question
> > > > > > >> >>>>>>> - what are we considering "agnostic"?
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> There is no "widespread" standard for lineage (yet). Open Lineage
> > > > > > >> >>>>>>> with its donation to Linux Foundation Data & AI is aspiring to become
> > > > > > >> >>>>>>> one. And it's a pretty good candidate:
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> * designed from grounds-up to be agnostic (Open Lineage was only
> > > > > > >> >>>>>>> published as an API from day one)
> > > > > > >> >>>>>>> * as of recently, the ownership and governance of Open Lineage is with
> > > > > > >> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/) which is
> > > > > > >> >>>>>>> part of "Linux Foundation Project" - well known and respected
> > > > > > >> >>>>>>> foundation that - similarly to the ASF is an umbrella and provides
> > > > > > >> >>>>>>> governance rules for a big number of well established OSS projects
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> In essence it is the same approach as we already discussed and
> > > > > > >> >>>>>>> approved for Open Telemetry (which is governed by CNCF which is in the
> > > > > > >> >>>>>>> same league as recognition and governance to LFP) (not yet implemented
> > > > > > >> >>>>>>> though). In the case of Open-Telemetry, we decided against developing
> > > > > > >> >>>>>>> our "own" existing standard but we opted for one that is out there.
> > > > > > >> >>>>>>> Yes it is a bit more established and popular than Open Lineage is, but
> > > > > > >> >>>>>>> I so wish that we chose and implemented it already (and earlier as not
> > > > > > >> >>>>>>> having a standard there - except statsd which is really, really poor)
> > > > > > >> >>>>>>> has a great impact on Airflow being just "pluggable" in existing
> > > > > > >> >>>>>>> solutions for monitoring. (BTW. I hope we implement it soon and I hear
> > > > > > >> >>>>>>> (and see) there are attempts to do so).
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> In the case of Open Lineage, the questions are - is there an
> > > > > > >> >>>>>>> alternative of the same caliber? Shall we produce our own "agnostic
> > > > > > >> >>>>>>> standard" for it instead ? Is there a chance the idea of
> > > > > > >> >>>>>>> "airflow-specific" attributes will catch up and many "consumers" will
> > > > > > >> >>>>>>> be writing their own conversions to the way they can consume it?
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> I would really, really try to avoid the pitfalls nicely summarized
> > > > > > >> >>>>>>> here: https://xkcd.com/927/
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> We can of course make a wrong bet and in 2 years Airflow might be the
> > > > > > >> >>>>>>> only one supporting Open Lineage. That might happen. Though the list
> > > > > > >> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or maybe -
> > > > > > >> >>>>>>> more likely - once Airflow implements it, due to Airflow's popularity
> > > > > > >> >>>>>>> and the fact that there is already competition supporting it (e.g.
> > > > > > >> >>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption of
> > > > > > >> >>>>>>> Open Lineage. My bet is - the latter and for the benefit of the whole
> > > > > > >> >>>>>>> ecosystem. I think we have a chance to influence creation of a new,
> > > > > > >> >>>>>>> important standard. Much less so, I think if we just provide our own
> > > > > > >> >>>>>>> custom solution - with lots and lots of work for others to be able to
> > > > > > >> >>>>>>> consume it, no time to properly nurture the API and make it easier to
> > > > > > >> >>>>>>> implement it (which is undoubtedly what Datakin, Astronomer and now
> > > > > > >> >>>>>>> LFData & AI run governance main focus is)
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> Are there other alternatives we should consider ? Do we want to
> > > > > > >> >>>>>>> develop our own standard (and implement all the integrations from the
> > > > > > >> >>>>>>> grounds up) ?
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> J.
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eugen@kosteev.com>
> > > > > > >> >>>>>>> wrote:
> > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > Hi Julien.
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > I reviewed the design doc.
> > > > > > >> >>>>>>> > The general idea looks good to me, but I have some concerns that I
> > > > > > >> >>>>>>> > would like to share.
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > If I understand correctly the proposed design is to fill in
> > > > > > >> >>>>>>> > "operators" with self-methods to extract lineage metadata from it,
> > > > > > >> >>>>>>> > and I agree with the motivation. If those are decoupled (in a form
> > > > > > >> >>>>>>> > of extractors in separate package) from operators itself, then the
> > > > > > >> >>>>>>> > downside is that (as you mentioned) - extractors will be
> > > > > > >> >>>>>>> > distributed separately and "operators" logic is out of sync with
> > > > > > >> >>>>>>> > "lineage extraction" logic by design.
> > > > > > >> >>>>>>> > Also knowledge about internals of operator spills out of the
> > > > > > >> >>>>>>> > operator which is not good at all (at the very least).
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > However, if we make every operator expose a method to generate
> > > > > > >> >>>>>>> > lineage metadata of the specific format, e.g. OpenLineage etc.,
> > > > > > >> >>>>>>> > then we will end up with cartesian complexity of supporting in each
> > > > > > >> >>>>>>> > provider+operator each backend format.
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > If you say that the goal is that "operators" will always generate
> > > > > > >> >>>>>>> > OpenLineage format only and each consumer will convert this format
> > > > > > >> >>>>>>> > to their own internal representation, well, if they do this then
> > > > > > >> >>>>>>> > this seems like a working approach. But with the assumption that
> > > > > > >> >>>>>>> > each consumer will support it.
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > I think it comes down to the question: is OpenLineage format
> > > > > > >> >>>>>>> > popular enough, complete and proper for the lineage metadata that
> > > > > > >> >>>>>>> > every consumer will be convinced to support it. We may also
> > > > > > >> >>>>>>> > consider issues like mismatch of lineage feature parity, e.g.
> > > > > > >> >>>>>>> > OpenLineage supports field-level lineage but consumer doesn't
> > > > > > >> >>>>>>> > support (or not at the moment), so we would prefer lineage metadata
> > > > > > >> >>>>>>> > transferred to the backend to be slightly different in this case.
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > What do you think about the idea:
> > > > > > >> >>>>>>> > 1. make lineage metadata generated by "operators" to be agnostic of
> > > > > > >> >>>>>>> > the specific format, just using entities from big generic
> > > > > > >> >>>>>>> > vocabulary of entities e.g. created here
> > > > > > >> >>>>>>> > https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py .
> > > > > > >> >>>>>>> > We would have there e.g. entities like:
> > > > > >> >>>>>>> >
> > > > > >>
> > --------------------------------------------------------------------
> > > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > > >> >>>>>>> > class PostgresTable:
> > > > > >> >>>>>>> > """Airflow lineage entity representing Postgres
> > table."""
> > > > > >> >>>>>>> >
> > > > > >> >>>>>>> > host: str = attr.ib()
> > > > > >> >>>>>>> > port: str = attr.ib()
> > > > > >> >>>>>>> > database: str = attr.ib()
> > > > > >> >>>>>>> > schema: str = attr.ib()
> > > > > >> >>>>>>> > table: str = attr.ib()
> > > > > >> >>>>>>> >
> > > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > > >> >>>>>>> > class GCSEntity:
> > > > > >> >>>>>>> > """Airflow lineage entity representing generic Google
> > > > > Cloud
> > > > > >> Storage entity."""
> > > > > >> >>>>>>> >
> > > > > >> >>>>>>> > bucket: str = attr.ib()
> > > > > >> >>>>>>> > path: str = attr.ib()
> > > > > >> >>>>>>> >
> > > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > > >> >>>>>>> > class AWSS3Entity:
> > > > > >> >>>>>>> > """Airflow lineage entity representing generic AWS S3
> > > > > >> entity."""
> > > > > >> >>>>>>> >
> > > > > >> >>>>>>> > bucket: str = attr.ib()
> > > > > >> >>>>>>> > path: str = attr.ib()
> > > > > >> >>>>>>> >
> > > > > >>
> > --------------------------------------------------------------------
> > > > > > >> >>>>>>> > 2. Implement "adapters" that will act as a bridge between
> > > > > > >> >>>>>>> > "operators" and backends. Their responsibility will be to convert
> > > > > > >> >>>>>>> > lineage metadata generated by "operators" to a format
> > > > > > >> >>>>>>> > understandable by specific backend.
> > > > > > >> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets to
> > > > > > >> >>>>>>> > pass Airflow lineage metadata to the Airflow lineage backend.
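
To make the two points above concrete, here is a minimal sketch of how a
generic entity vocabulary and an "adapter" could fit together. It is only an
illustration, not code from Airflow or OpenLineage. It uses stdlib
dataclasses instead of attrs so that it runs without dependencies, and the
names LineageAdapter, NamespaceNameAdapter and export_lineage, as well as the
namespace/name dict layout, are hypothetical and chosen just for this
example.

```python
from dataclasses import dataclass
from typing import Protocol, Union


@dataclass(frozen=True)
class PostgresTable:
    """Generic entity: a Postgres table (mirrors the fields quoted above)."""

    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    """Generic entity: an object in Google Cloud Storage."""

    bucket: str
    path: str


Entity = Union[PostgresTable, GCSEntity]


class LineageAdapter(Protocol):
    """Hypothetical adapter interface: one implementation per backend format."""

    def convert(self, entity: Entity) -> dict: ...


class NamespaceNameAdapter:
    """Hypothetical adapter producing namespace/name dicts.

    The dict layout is an assumption for illustration only; it is not
    taken from the OpenLineage spec or any other backend.
    """

    def convert(self, entity: Entity) -> dict:
        if isinstance(entity, PostgresTable):
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        if isinstance(entity, GCSEntity):
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
        raise TypeError(f"unsupported entity: {type(entity).__name__}")


def export_lineage(adapter: LineageAdapter, inlets: list, outlets: list) -> dict:
    """Convert an operator's inlets/outlets with whichever adapter is plugged in."""
    return {
        "inputs": [adapter.convert(e) for e in inlets],
        "outputs": [adapter.convert(e) for e in outlets],
    }


if __name__ == "__main__":
    source = PostgresTable(
        host="db", port="5432", database="shop", schema="public", table="orders"
    )
    target = GCSEntity(bucket="exports", path="orders/2023.csv")
    print(export_lineage(NamespaceNameAdapter(), [source], [target]))
```

In this sketch the operator only declares generic entities as inlets and
outlets, and supporting another backend means writing another adapter class
rather than touching each operator.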
> > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > I didn't get exactly implementation details of your proposed
> > > > > > >> >>>>>>> > design, but I think maintaining global vocabulary of entities to
> > > > > > >> >>>>>>> > use in inlets/outlets of operators is crucial for Airflow, as this
> > > > > > >> >>>>>>> > could be leveraged to build various features on top of it, like
> > > > > > >> >>>>>>> > displaying lineage graph in Airflow UI (based on XCOM):)
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > Importantly to note, if we decide to send out from Airflow lineage
> > > > > > >> >>>>>>> > metadata only in OpenLineage format, well, we could have then only
> > > > > > >> >>>>>>> > one "adapter" OpenLineageAdapter. But the "adapters" approach
> > > > > > >> >>>>>>> > leaves us room for adding support to others (following "pluggable"
> > > > > > >> >>>>>>> > approach as Airflow is mainly known/good about).
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > All in all:
> > > > > > >> >>>>>>> > - global vocabulary of entities used across all "operators" (with
> > > > > > >> >>>>>>> > all advantages out of it, mentioned above)
> > > > > > >> >>>>>>> > - "adapters" approach
> > > > > > >> >>>>>>> > seems to me crucial points in the design that make sense to me.
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > What do you think about this?
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > - Eugene
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> >
> > > > > > >> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
> > > > > > >> >>>>>>> > <julien@astronomer.io.invalid> wrote:
> > > > > >> >>>>>>> >>
> > > > > > >> >>>>>>> >> Hello Michał,
> > > > > > >> >>>>>>> >> Thank you for your input.
> > > > > > >> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption about
> > > > > > >> >>>>>>> >> the backend being used to store lineage and is an adapter-like
> > > > > > >> >>>>>>> >> layer.
> > > > > > >> >>>>>>> >> OpenLineage exists as the spec specifically for that purpose of
> > > > > > >> >>>>>>> >> avoiding the problem of every lineage consumer having to
> > > > > > >> >>>>>>> >> understand every lineage producer.
> > > > > > >> >>>>>>> >> Consumers of lineage want a unified spec consuming lineage from
> > > > > > >> >>>>>>> >> any data transformation layer like Airflow, Spark, Flink, SQL,
> > > > > > >> >>>>>>> >> Warehouses, ...
> > > > > > >> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently of
> > > > > > >> >>>>>>> >> the technology used, so does OpenLineage for lineage.
> > > > > > >> >>>>>>> >> Julien
> > > > > > >> >>>>>>> >>
> > > > > > >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras
> > > > > > >> >>>>>>> >> <michalmodras@google.com> wrote:
> > > > > >> >>>>>>> >>>
> > > > > > >> >>>>>>> >>> Hi everyone,
> > > > > > >> >>>>>>> >>>
> > > > > > >> >>>>>>> >>> As Airflow already supports lineage functionality through
> > > > > > >> >>>>>>> >>> pluggable lineage backends, I think OpenLineage and other lineage
> > > > > > >> >>>>>>> >>> systems integration should follow this path. I think more
> > > > > > >> >>>>>>> >>> 'native' integration with OpenLineage (or any other lineage
> > > > > > >> >>>>>>> >>> system) in Airflow while maintaining the generic lineage backend
> > > > > > >> >>>>>>> >>> architecture in parallel would make the user experience less
> > > > > > >> >>>>>>> >>> open, troublesome to maintain, and the Airflow architecture
> > > > > > >> >>>>>>> >>> itself more constrained by a logic of a specific system.
> > > > > > >> >>>>>>> >>>
> > > > > > >> >>>>>>> >>> I think enriching operators with a generic method exposing
> > > > > > >> >>>>>>> >>> lineage metadata that could be leveraged by lineage backends
> > > > > > >> >>>>>>> >>> regardless of their implementation is a good idea which the Cloud
> > > > > > >> >>>>>>> >>> Composer team would gladly contribute to. I believe the
> > > > > > >> >>>>>>> >>> translation of the Airflow metadata exposed by the operators
> > > > > > >> >>>>>>> >>> should be done by lineage backends (or another adapter-like
> > > > > > >> >>>>>>> >>> layer). Tying Airflow operators' development to a specific
> > > > > > >> >>>>>>> >>> lineage system like OpenLineage forces operators' contributors to
> > > > > > >> >>>>>>> >>> understand that system too, which increases both the entry costs
> > > > > > >> >>>>>>> >>> and maintenance costs. I see it as unnecessary coupling.
> > > > > > >> >>>>>>> >>>
> > > > > > >> >>>>>>> >>> Best,
> > > > > > >> >>>>>>> >>> Michal
> > > > > > >> >>>>>>> >>>
> > > > > > >> >>>>>>> >>>
> > > > > > >> >>>>>>> >>>
> > > > > > >> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem
> > > > > > >> >>>>>>> >>> <julien@astronomer.io> wrote:
> > > > > >> >>>>>>> >>>>
> > > > > > >> >>>>>>> >>>> Thank you Eugen,
> > > > > > >> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I
> > > > > > >> >>>>>>> >>>> think this would work well.
> > > > > > >> >>>>>>> >>>> Here are the sections in the doc that I think address your
> > > > > > >> >>>>>>> >>>> points:
> > > > > > >> >>>>>>> >>>> - generalize lineage metadata extraction as self-method in each
> > > > > > >> >>>>>>> >>>> operator, using generic lineage entities
> > > > > > >> >>>>>>> >>>> See: OpenLineage support in providers. It describes how each
> > > > > > >> >>>>>>> >>>> operator exposes its lineage.
> > > > > > >> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to Data
> > > > > > >> >>>>>>> >>>> Lineage format, Open Lineage format, etc.
> > > > > > >> >>>>>>> >>>> The goal here is each consumer turns from OpenLineage format to
> > > > > > >> >>>>>>> >>>> their own internal representation as you are suggesting.
> > > > > > >> >>>>>>> >>>> In the motivation section, towards the end, I link to a few
> > > > > > >> >>>>>>> >>>> examples of data catalogs doing just that.
> > > > > > >> >>>>>>> >>>>
> > > > > > >> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <eugen@kosteev.com>
> > > > > > >> >>>>>>> >>>> wrote:
> > > > > > >> >>>>>>> >>>>>
> > > > > > >> >>>>>>> >>>>> ++ Michal Modras
> > > > > > >> >>>>>>> >>>>>
> > > > > > >> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev
> > > > > > >> >>>>>>> >>>>> <eugen@kosteev.com> wrote:
> > > > > >> >>>>>>> >>>>>>
> > > > > >> >>>>>>> >>>>>> Cloud Composer recently launched the "Data lineage with Dataplex"
> > > > > >> >>>>>>> >>>>>> feature, which generates lineage from DAG/task executions and exports
> > > > > >> >>>>>>> >>>>>> it to Data Lineage (Data Catalog service) for further analysis.
> > > > > >> >>>>>>> >>>>>>
> > > > > >> >>>>>>> >>>>>> https://cloud.google.com/composer/docs/composer-2/lineage-integration
> > > > > >> >>>>>>> >>>>>>
> > > > > >> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
> > > > > >> >>>>>>> >>>>>> The current implementation uses the built-in "Airflow lineage backend"
> > > > > >> >>>>>>> >>>>>> feature and methods to extract lineage metadata on task
> > > > > >> >>>>>>> >>>>>> post-execution events.
> > > > > >> >>>>>>> >>>>>>
> > > > > >> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow community in
> > > > > >> >>>>>>> >>>>>> the following form:
> > > > > >> >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method in each
> > > > > >> >>>>>>> >>>>>>   operator, using generic lineage entities
> > > > > >> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to Data Lineage
> > > > > >> >>>>>>> >>>>>>   format, Open Lineage format, etc.
> > > > > >> >>>>>>> >>>>>>
> > > > > >> >>>>>>> >>>>>> Adopting "Airflow OpenLineage" for Composer would mean introducing an
> > > > > >> >>>>>>> >>>>>> additional layer that converts from the OpenLineage format to the
> > > > > >> >>>>>>> >>>>>> Data Lineage (Data Catalog/Dataplex) format. But this is definitely a
> > > > > >> >>>>>>> >>>>>> possibility.
> > > > > >> >>>>>>> >>>>>>
> > > > > >> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> > > > > >> <julien@astronomer.io.inva <mailto:julien@astronomer.io.inva
> >lid>
> > > wrote:
> > > > > >> >>>>>>> >>>>>>>
> > > > > >> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
> > > > > >> >>>>>>> >>>>>>> I am responding in the comments and adding to
> the
> > > doc
> > > > > >> accordingly.
> > > > > >> >>>>>>> >>>>>>> I would also love to hear from more
> stakeholders.
> > > > > >> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
> > > > > >> >>>>>>> >>>>>>> Julien
> > > > > >> >>>>>>> >>>>>>>
> > > > > >> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
> > > > > >> jarek@potiuk.com <ma...@potiuk.com>> wrote:
> > > > > >> >>>>>>> >>>>>>>>
> > > > > >> >>>>>>> >>>>>>>> General comment from my side: I think Open
> > Lineage
> > > is
> > > > > >> (and should be
> > > > > >> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands
> > > Airflow's
> > > > > >> capabilities
> > > > > >> >>>>>>> >>>>>>>> greatly and opens up the direction we've been
> all
> > > > > >> working on - Airflow
> > > > > >> >>>>>>> >>>>>>>> as a Platform.
> > > > > >> >>>>>>> >>>>>>>>
> > > > > >> >>>>>>> >>>>>>>> I think closely integrating it with
> Open-Lineage
> > > goes
> > > > > >> the same
> > > > > >> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open
> > > Telemetry
> > > > > >> goes, where we
> > > > > >> >>>>>>> >>>>>>>> might decide to support certain standards in
> > order
> > > to
> > > > > >> expand
> > > > > >> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and
> allows
> > to
> > > > > >> plug-in multiple
> > > > > >> >>>>>>> >>>>>>>> external solutions that would use the standard
> > API.
> > > > > >> After Open-Lineage
> > > > > >> >>>>>>> >>>>>>>> graduated recently to LFAI&Data foundation
> (I've
> > > been
> > > > > >> watching this
> > > > > >> >>>>>>> >>>>>>>> happening from far), it is I think the perfect
> > > > > candidate
> > > > > >> for Airflow
> > > > > >> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all
> the
> > > > > players
> > > > > >> to make use
> > > > > >> >>>>>>> >>>>>>>> of the extra work necessary by the community to
> > > make it
> > > > > >> "officially
> > > > > >> >>>>>>> >>>>>>>> supported". I think we have to also get some
> > > feedback
> > > > > >> from the big
> > > > > >> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is
> to
> > > have
> > > > > >> such a
> > > > > >> >>>>>>> >>>>>>>> capability, and another is to get it used in
> all
> > > the
> > > > > >> ways Airflow is
> > > > > >> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users
> > > (which
> > > > > >> is obviously a
> > > > > >> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where
> > > Airflow
> > > > > >> is exposed by
> > > > > >> >>>>>>> >>>>>>>> others - Astronomer is obviously on-board. We see some warm words from
> > > > > >> >>>>>>> >>>>>>>> Amazon (mentioned by Julien), I would love to
> > hear
> > > > > >> whether the
> > > > > >> >>>>>>> >>>>>>>> Composer team at Google would be on board in
> > using
> > > the
> > > > > >> open-lineage
> > > > > >> >>>>>>> >>>>>>>> information exposed this way in their Data
> > Catalog
> > > (and
> > > > > >> likely more)
> > > > > >> >>>>>>> >>>>>>>> offering. We have Amundsen and others and
> > possibly
> > > > > other
> > > > > >> stakeholders
> > > > > >> >>>>>>> >>>>>>>> might want to say something.
> > > > > >> >>>>>>> >>>>>>>>
> > > > > >> >>>>>>> >>>>>>>>
> > > > > >> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in implementing and
> > > > > >> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that is the main
> > > > > >> >>>>>>> >>>>>>>> reason why the Open Lineage community would like to make the
> > > > > >> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart and integrating it in
> > > > > >> >>>>>>> >>>>>>>> a way that allows us to plug it into our CI and verification process,
> > > > > >> >>>>>>> >>>>>>>> and by setting very clear expectations about what it means for
> > > > > >> >>>>>>> >>>>>>>> contributors to Airflow to get it running, we can make some initial
> > > > > >> >>>>>>> >>>>>>>> investment in making it happen and minimise the on-going cost, while
> > > > > >> >>>>>>> >>>>>>>> maximising the gain.
> > > > > >> >>>>>>> >>>>>>>>
> > > > > >> >>>>>>> >>>>>>>> And looking at all the above - I am super happy
> > to
> > > help
> > > > > >> with all that
> > > > > >> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate
> > well,
> > > even
> > > > > >> if it will
> > > > > >> >>>>>>> >>>>>>>> take an extra effort, especially that we will
> > have
> > > > > >> experts from Open
> > > > > >> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open
> > > Lineage
> > > > > >> being the core
> > > > > >> >>>>>>> >>>>>>>> part of the effort. I am actually super
> excited -
> > > this
> > > > > >> might be the
> > > > > >> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its
> > > position
> > > > > as
> > > > > >> an
> > > > > >> >>>>>>> >>>>>>>> indispensable component of "even more modern
> data
> > > > > stack".
> > > > > >> >>>>>>> >>>>>>>>
> > > > > >> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am
> > > looking
> > > > > >> forward to
> > > > > >> >>>>>>> >>>>>>>> making it happen :).
> > > > > >> >>>>>>> >>>>>>>>
> > > > > >> >>>>>>> >>>>>>>> J.
> > > > > >> >>>>>>> >>>>>>>>
> > > > > >> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
> > > > > >> >>>>>>> >>>>>>>> <julien@astronomer.io.inva <mailto:
> > > julien@astronomer.io.inva>lid> wrote:
> > > > > >> >>>>>>> >>>>>>>> >
> > > > > >> >>>>>>> >>>>>>>> > Dear Airflow Community,
> > > > > >> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
> > > > > >> OpenLineage provider to Airflow.
> > > > > >> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to
> post
> > > an
> > > > > >> official AIP.
> > > > > >> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
> > > > > >> >>>>>>> >>>>>>>> > Thank you,
> > > > > >> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
> > > > > >> >>>>>>> >>>>>>>> >
> > > > > >> >>>>>>> >>>>>>>> > For convenience, here is the rationale from
> the
> > > doc:
> > > > > >> >>>>>>> >>>>>>>> >
> > > > > >> >>>>>>> >>>>>>>> > Operational lineage collection is a common
> need
> > > to
> > > > > >> understand dependencies between data pipelines and track
> > end-to-end
> > > > > >> provenance of data. It enables many use cases from ensuring
> > reliable
> > > > > >> delivery of data through observability to compliance and cost
> > > > > management.
> > > > > >> >>>>>>> >>>>>>>> >
> > > > > >> >>>>>>> >>>>>>>> > Publishing operational lineage is a core
> > Airflow
> > > > > >> capability to enable troubleshooting and governance.
> > > > > >> >>>>>>> >>>>>>>> >
> > > > > >> >>>>>>> >>>>>>>> > OpenLineage is a project part of the
> LFAI&Data
> > > > > >> foundation that provides a spec standardizing operational
> lineage
> > > > > >> collection and sharing across the data ecosystem. If it provides
> > > plugins
> > > > > >> for popular open source projects, its intent is very similar to
> > > > > >> OpenTelemetry (also under the Linux Foundation umbrella): to
> > remain
> > > a
> > > > > spec
> > > > > >> for lineage exchange that projects - open source or proprietary
> -
> > > > > implement.
> > > > > >> >>>>>>> >>>>>>>> >
> > > > > >> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will
> > > make it
> > > > > >> easier and more reliable for Airflow users to publish their
> > > operational
> > > > > >> lineage through the OpenLineage ecosystem.
> > > > > >> >>>>>>> >>>>>>>> >
> > > > > >> >>>>>>> >>>>>>>> > The current external plugin maintained in the
> > > > > >> OpenLineage project depends on Airflow and operators internals
> and
> > > gets
> > > > > >> broken when changes are made on those. Having a built-in
> > integration
> > > > > >> ensures a better first class support to expose lineage that gets
> > > tested
> > > > > >> alongside other changes and therefore is more stable.
> > > > > >> >>>>>>> >>>>>>
> > > > > >> >>>>>>> >>>>>>
> > > > > >> >>>>>>> >>>>>>
> > > > > >> >>>>>>> >>>>>> --
> > > > > >> >>>>>>> >>>>>> Eugene
> > > > > >> >>>>>>> >>>>>
> > > > > >> >>>>>>> >>>>>
> > > > > >> >>>>>>> >>>>>
> > > > > >> >>>>>>> >>>>> --
> > > > > >> >>>>>>> >>>>> Eugene
> > > > > >> >>>>>>> >
> > > > > >> >>>>>>> >
> > > > > >> >>>>>>> >
> > > > > >> >>>>>>> > --
> > > > > >> >>>>>>> > Eugene
> > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Eugene
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
>

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by "Oliveira, Niko" <on...@amazon.com.INVALID>.
I'd like to join as well! (oliveira.n3@gmail.com)

________________________________
From: Igor Kholopov <ik...@google.com.INVALID>
Sent: Wednesday, March 22, 2023 4:01:40 PM
To: dev@airflow.apache.org
Subject: RE: [EXTERNAL]Request for feedback on proposal for new OpenLineage provider in Airflow


+1, would be happy to join the session! (Please add either
ikholopov@google.com or kholopovus@gmail.com).

Best,
Igor

On Wed, Mar 22, 2023 at 11:27 PM Pierre Jeambrun <pi...@gmail.com>
wrote:

> Same here if you can add me please.
>
> Looking forward to this session.
>
> Le mer. 22 mars 2023 à 23:07, Mehta, Shubham <sh...@amazon.com.invalid> a
> écrit :
>
> > Please include me, I will try my best to join (shubhammehta.93@gmail.com
> )
> >
> > Best,
> > Shubham
> >
> > On 2023-03-22, 2:24 PM, "Jarek Potiuk" <jarek@potiuk.com <mailto:
> > jarek@potiuk.com>> wrote:
> >
> >
> >
> > There are some strange behaviours in the calendar entry - I think you
> > cannot add yourself, only guests can add others :)
> > I've added you Eugen, maybe if someone wants to be also added - please
> > post here with your gmail/calendar addresses.
> >
> >
> > J.
> >
> >
> > On Wed, Mar 22, 2023 at 9:56 PM Eugen Kosteev <eugen@kosteev.com
> <mailto:
> > eugen@kosteev.com>> wrote:
> > >
> > > Hi Julien.
> > >
> > > Can you, please, include me there as well: eugen@kosteev.com <mailto:
> > eugen@kosteev.com> or
> > > kosteev@google.com <ma...@google.com>.
> > > > Looking forward to seeing the presentation.
> > >
> > > - Eugene
> > >
> > > On Wed, Mar 22, 2023 at 8:36 PM Julien Le Dem
> <julien@astronomer.io.inva
> > <ma...@astronomer.io.inva>lid>
> > > wrote:
> > >
> > > > Hello all,
> > > > I have to move the OpenLineage presentation to next week.
> > > > Sorry for the change.
> > > > It will be Friday next week March 31st at 5pm CET 9am PT.
> > > >
> > > >
> > > > > https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
> > > > Julien
> > > >
> > > > On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <julien@astronomer.io
> > <ma...@astronomer.io>>
> > > > wrote:
> > > >
> > > > > We are planning to do this session next Thursday at 5pm CET 9am
> PT. I
> > > > will
> > > > > send a zoom link in advance.
> > > > > Julien
> > > > >
> > > > > On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <jarek@potiuk.com
> > <ma...@potiuk.com>> wrote:
> > > > >
> > > > >> Cool. I am looking forward to it :). It would be great to get some
> > > > >> insight from those who attempted to get the lineage working in
> > several
> > > > >> versions of Open Lineage and finally arrived at the current
> > > > >> specs/integration.
> > > > >>
> > > > >> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
> > > > >> <julien@astronomer.io.inva <ma...@astronomer.io.inva>lid>
> > wrote:
> > > > >> >
> > > > >> > Thank you Jarek,
> > > > >> > I am happy to organize a zoom presentation about OpenLineage and
> > > > answer
> > > > >> any question. It is indeed a spec decoupling the data
> transformation
> > > > layer
> > > > >> from the Metadata store people are using. Just like OpenTelemetry
> > is for
> > > > >> service metrics/traces.
> > > > >> > Best,
> > > > >> > Julien
> > > > >> >
> > > > >> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <jarek@potiuk.com
> > <ma...@potiuk.com>>
> > > > wrote:
> > > > >> >>
> > > > >> >> And to add a little "parallel" - I think Open Lineage
> integration
> > > > >> replacing our "generic lineage" is very similar step to the new
> > > > >> "Multi-tenant"-ready authentication interface we are discussing in
> > > > >> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
> > > > >> >>
> > > > >> >> Yes - we have a generic authentication interface, but no - it's
> > > > >> useless for the case where multi-tenancy and good level of
> resource
> > > > >> authorization is needed. It's just far too simplistic and limited.
> > > > >> >>
> > > > >> >> Same with current lineage generic interface - yes, we have it
> but
> > > > it's
> > > > >> only useful in a limited set of cases. and if we want to
> step-it-up
> > we
> > > > need
> > > > >> to come up with something better (and Open Lineage happens to be
> one
> > > > that
> > > > >> has been developed with Airflow in mind and battle tested).
> > > > >> >>
> > > > >> >> J.
> > > > >> >>
> > > > >> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <jarek@potiuk.com
> > <ma...@potiuk.com>>
> > > > wrote:
> > > > >> >>>
> > > > >> >>> Hey Rafał (Eugene, Michal - and others who are looking),
> > > > >> >>>
> > > > >> >>> I think I know where your/Eugen/Michał concerns are coming
> > from. And
> > > > >> I think it would be great if we can talk it over a bit. I believe
> > this
> > > > is
> > > > >> - in parts - quite a misunderstanding of what Open Lineage really
> > is,
> > > > how
> > > > >> much of an integration it is and what are the reasons why it has
> > been
> > > > >> implemented the way it was implemented in Airflow.
> > > > >> >>>
> > > > >> >>> **Idea**: (Julien - Maybe you can organize it ?):
> > > > >> >>>
> > > > >> >>> Maybe we can have an open-to-everyone presentation/zoom call
> > with
> > > > >> quite some time foreseen to ask questions where you would explain
> > the
> > > > >> community about those integration points (and especially those
> > people
> > > > who
> > > > >> are worried we are losing something by choosing the OpenLineage
> > > > >> integration). I would love to see such a presentation -
> specifically
> > > > >> focused on explaining how Open-Lineage is really improving the
> > current
> > > > >> lineage approach and what problems it solves that the existing
> > generic
> > > > >> interface doesn't.
> > > > >> >>>
> > > > >> >>> Just to set the tone and focus for such meeting if we have
> one:
> > > > >> >>>
> > > > >> >>> For me - when I look at Open Lineage, it is really "this is
> how
> > > > >> lineage generic interface **should** be done in Airflow". The
> > "generic"
> > > > >> lineage support we have now is very, very basic, I'd even say far
> > too
> > > > >> simplistic. I would even say, it's useless besides a few, very
> > basic use
> > > > >> cases. Simply because there was never a good "receiver" of the
> > > > information
> > > > >> to cover those cases.
> > > > >> >>>
> > > > >> >>> When you look closely at OpenLineage, it's nothing more than a better
> > > > >> >>> convention for the dictionaries that we send as metadata, plus better
> > > > >> >>> metadata in the case of SQL operators (Hooks in the future, hopefully),
> > > > >> >>> which allows handling some cases that the current lineage simply cannot.
> > > > >> >>> The open-lineage integration with Airflow also covers better handling of
> > > > >> >>> the "task" and "dag" lifecycle in Airflow, to be able to bind lineage
> > > > >> >>> data together. That's my understanding of what we get when we integrate
> > > > >> >>> OL in.
> > > > >> >>>
> > > > >> >>> I think over the last 2 years Datakin/Astronomer people had
> > worked
> > > > >> out the level of interface that **just works** and if we would
> like
> > to
> > > > get
> > > > >> the lineage information from Airflow as useful as it is in OL, we
> > would
> > > > >> have to anyway implement pretty much all of the things they
> already
> > did.
> > > > >> >>>
> > > > >> >>> I would love (and I think many community members) to take part
> > in
> > > > >> such a call to hear on that particular aspect of the OL
> integration.
> > > > >> >>>
> > > > >> >>> J.
> > > > >> >>>
> > > > >> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <
> > > > >> rafalbiegacz@google.com.inva <mailto:rafalbiegacz@google.com.inva
> >lid>
> > wrote:
> > > > >> >>>>
> > > > >> >>>> Hi,
> > > > >> >>>>
> > > > >> >>>> I second/echo the input provided by Eugene and Michal.
> > > > >> >>>>
> > > > >> >>>> In general, Airflow should provide generic interfaces to
> > lineage
> > > > >> backends so it's easy to configure the one preferred by the user.
> > > > Whether
> > > > >> it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it
> > > > should
> > > > >> be the user's choice.
> > > > >> >>>>
> > > > >> >>>> We should avoid close integration with any specific lineage backend due
> > > > >> >>>> to the reasons already mentioned, i.e. to avoid translations between
> > > > >> >>>> lineage backends. Also, we would closely couple one framework (Airflow)
> > > > >> >>>> with another one (Open Lineage) - it makes Airflow more complex and less
> > > > >> >>>> flexible. Loose coupling between lineage backends and Airflow seems to
> > > > >> >>>> be more future-proof.
> > > > >> >>>>
> > > > >> >>>> Regards, Rafal.
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
> > > > >> <julien@astronomer.io.inva <ma...@astronomer.io.inva>lid>
> > wrote:
> > > > >> >>>>>
> > > > >> >>>>> Dear Airflow community,
> > > > >> >>>>> I have transferred the content of the working google doc I
> > shared
> > > > a
> > > > >> few weeks ago to the Airflow confluence:
> > > > >> >>>>>
> > > > >> >>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > > > >> >>>>> All comments have been answered, I added clarifications to
> > the doc
> > > > >> accordingly and I also added your suggestions to improve the
> > proposal.
> > > > >> >>>>> All that history is linked from the discussion thread link
> in
> > the
> > > > >> confluence doc if you wish to consult it.
> > > > >> >>>>> Thank you all for your feedback and help in the process.
> > > > >> >>>>> Best
> > > > >> >>>>> Julien
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <
> > > > julien@astronomer.io <ma...@astronomer.io>>
> > > > >> wrote:
> > > > >> >>>>>>
> > > > >> >>>>>> Thank you for the email Jarek, and Eugene for your
> > suggestions,
> > > > >> >>>>>> I do agree with Jarek's assessment. I don't have very much
> > to add
> > > > >> to his argument, it is very thoughtful!
> > > > >> >>>>>> OpenLineage was started to avoid the cartesian complexity
> > that
> > > > >> Eugene mentions. There's actually that specific illustration in
> the
> > > > >> OpenLineage doc.
> > > > >> >>>>>> Lineage consumers want to avoid having to understand the
> > lineage
> > > > >> format of each individual observed data transformation layer. And
> > > > >> transformation layers don't want to understand every Metadata
> > store's
> > > > model
> > > > >> and protocol.
> > > > >> >>>>>> Eugene, about your specific proposal about a global
> > vocabulary of
> > > > >> entities, I think it is a great suggestion.
> > > > >> >>>>>> We can map those entities to Datasets in OpenLineage. OpenLineage models
> > > > >> >>>>>> this by allowing specific facets to be attached to a Dataset. Facets are
> > > > >> >>>>>> pieces of metadata, each with its own JSON Schema.
> > > > >> >>>>>> For example, a table from a relational database will have a schema facet,
> > > > >> >>>>>> whereas a file in GCS might not.
> > > > >> >>>>>> So I think in Airflow we could have each of the entity classes you
> > > > >> >>>>>> describe be used in the get_openlineage_facets*() API in the Operators.
> > > > >> >>>>>> Each of those classes would know what OpenLineage facets they can expose.
> > > > >> >>>>>> I'll add a mention in the AIP and I think we can go into more detail in
> > > > >> >>>>>> a ticket.
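
[Editorial sketch, not part of the original message.] The idea of entity classes that know which facets they can expose could look roughly like the following. This is a minimal sketch under stated assumptions: the class names, the `openlineage_facets()` method, and the simplified facet payload are all hypothetical illustrations, not an existing Airflow or OpenLineage API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PostgresTable:
    """Hypothetical entity for a relational table (not an Airflow class)."""

    database: str
    schema: str
    table: str
    # Column descriptions, e.g. [{"name": "id", "type": "integer"}].
    columns: List[Dict[str, str]] = field(default_factory=list)

    def openlineage_facets(self) -> Dict[str, Any]:
        # A relational table knows its columns, so it can offer a schema facet.
        if not self.columns:
            return {}
        return {"schema": {"fields": list(self.columns)}}


@dataclass
class GCSEntity:
    """Hypothetical entity for an object in Google Cloud Storage."""

    bucket: str
    path: str

    def openlineage_facets(self) -> Dict[str, Any]:
        # A plain object has no known column structure, so no schema facet.
        return {}


orders = PostgresTable(
    database="shop",
    schema="public",
    table="orders",
    columns=[{"name": "id", "type": "integer"}],
)
export = GCSEntity(bucket="exports", path="orders/2023-02-10.csv")

# Show which facet keys each entity would contribute.
for entity in (orders, export):
    print(type(entity).__name__, sorted(entity.openlineage_facets()))
```

Keeping the facet logic on the entity means an operator only has to declare which entities it reads and writes; it does not need to know the facet format itself.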
> > > > >> >>>>>> Cheers,
> > > > >> >>>>>> Julien
> > > > >> >>>>>>
> > > > >> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <
> > jarek@potiuk.com <ma...@potiuk.com>>
> > > > >> wrote:
> > > > >> >>>>>>>
> > > > >> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's
> > answer
> > > > >> will
> > > > >> >>>>>>> be more thoughtful).
> > > > >> >>>>>>>
> > > > >> >>>>>>> I think you are right to the "agnostic" part. But I have
> one
> > > > >> question
> > > > >> >>>>>>> - what are we considering "agnostic"?
> > > > >> >>>>>>>
> > > > >> >>>>>>> There is no "widespread" standard for lineage (yet). Open
> > > > Lineage
> > > > >> >>>>>>> with its donation to Linux Foundation Data & AI is
> aspiring
> > to
> > > > >> become
> > > > >> >>>>>>> one. And it's a pretty good candidate:
> > > > >> >>>>>>>
> > > > >> >>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
> > > > >> >>>>>>> published as an API from day one)
> > > > >> >>>>>>> * as of recently, the ownership and governance of Open Lineage is with
> > > > >> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/), which is
> > > > >> >>>>>>> part of the "Linux Foundation Project" - a well-known and respected
> > > > >> >>>>>>> foundation that, similarly to the ASF, is an umbrella and provides
> > > > >> >>>>>>> governance rules for a large number of well-established OSS projects
> > > > >> >>>>>>>
> > > > >> >>>>>>> In essence it is the same approach as the one we already discussed and
> > > > >> >>>>>>> approved for Open Telemetry (which is governed by CNCF, which is in the
> > > > >> >>>>>>> same league as LFP in terms of recognition and governance), though it is
> > > > >> >>>>>>> not yet implemented. In the case of Open-Telemetry, we decided against
> > > > >> >>>>>>> developing our "own" standard and opted for one that is already out
> > > > >> >>>>>>> there. Yes, it is a bit more established and popular than Open Lineage
> > > > >> >>>>>>> is, but I so wish that we had chosen and implemented it already (and
> > > > >> >>>>>>> earlier), as not having a standard there - except statsd, which is
> > > > >> >>>>>>> really, really poor - has a great impact on Airflow being just
> > > > >> >>>>>>> "pluggable" in existing solutions for monitoring. (BTW, I hope we
> > > > >> >>>>>>> implement it soon, and I hear (and see) there are attempts to do so.)
> > > > >> >>>>>>>
> > > > >> >>>>>>> In the case of Open Lineage, the questions are - is there
> an
> > > > >> >>>>>>> alternative of the same caliber? Shall we produce our own
> > > > >> "agnostic
> > > > >> >>>>>>> standard" for it instead ? Is there a chance the idea of
> > > > >> >>>>>>> "airflow-specific" attributes will catch up and many
> > "consumers"
> > > > >> will
> > > > >> >>>>>>> be writing their own conversions to the way they can
> > consume it?
> > > > >> >>>>>>>
> > > > >> >>>>>>> I would really, really try to avoid the pitfalls nicely
> > > > summarized
> > > > >> >>>>>>> here: https://xkcd.com/927/ <https://xkcd.com/927/>
> > > > >> >>>>>>>
> > > > >> >>>>>>> We can of course make a wrong bet and in 2 years Airflow
> > might
> > > > be
> > > > >> the
> > > > >> >>>>>>> only one supporting Open Lineage. That might happen.
> Though
> > the
> > > > >> list
> > > > >> >>>>>>> of "consumers" of Open Lineage is already pretty good
> IMHO.
> > Or
> > > > >> maybe -
> > > > >> >>>>>>> more likely - once Airflow implements it, due to Airflow's
> > > > >> popularity
> > > > >> >>>>>>> and the fact that there is already competition supporting
> it
> > > > (e.g.
> > > > >> >>>>>>> Amundsen) we will increase the chance of "hockey-stick"
> > adoption
> > > > >> of
> > > > >> >>>>>>> Open Lineage. My bet is - the latter and for the benefit
> of
> > the
> > > > >> whole
> > > > >> >>>>>>> ecosystem. I think we have a chance to influence creation
> > of a
> > > > >> new,
> > > > >> >>>>>>> important standard. Much less so, I think if we just
> > provide our
> > > > >> own
> > > > >> >>>>>>> custom solution - with lots and lots of work for others to
> > be
> > > > >> able to
> > > > >> >>>>>>> consume it, no time to properly nurture the API and make
> it
> > > > >> easier to
> > > > >> >>>>>>> implement it (which is undoubtedly what Datakin,
> Astronomer
> > and
> > > > >> now
> > > > >> >>>>>>> LFData & AI run governance main focus is)
> > > > >> >>>>>>>
> > > > >> >>>>>>> Are there other alternatives we should consider? Do we want to
> > > > >> >>>>>>> develop our own standard (and implement all the integrations from the
> > > > >> >>>>>>> ground up)?
> > > > >> >>>>>>>
> > > > >> >>>>>>> J.
> > > > >> >>>>>>>
> > > > >> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <
> > > > eugen@kosteev.com <ma...@kosteev.com>>
> > > > >> wrote:
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > Hi Julien.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > I reviewed the design doc.
> > > > >> >>>>>>> > The general idea looks good to me, but I have some
> > concerns
> > > > >> that I would like to share.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > If I understand correctly, the proposed design is to fill "operators"
> > > > >> >>>>>>> > with self-methods to extract lineage metadata from them, and I agree
> > > > >> >>>>>>> > with the motivation. If those are decoupled from the operators
> > > > >> >>>>>>> > themselves (in the form of extractors in a separate package), then the
> > > > >> >>>>>>> > downside is that (as you mentioned) extractors will be distributed
> > > > >> >>>>>>> > separately and the "operators" logic is out of sync with the "lineage
> > > > >> >>>>>>> > extraction" logic by design.
> > > > >> >>>>>>> > Also, knowledge about the internals of an operator spills out of the
> > > > >> >>>>>>> > operator, which is not good at all (at the very least).
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > However, if we make every operator expose a method to generate lineage
> > > > >> >>>>>>> > metadata in a specific format, e.g. OpenLineage etc., then we will end
> > > > >> >>>>>>> > up with the cartesian complexity of supporting each backend format in
> > > > >> >>>>>>> > each provider+operator.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > If you say that the goal is that "operators" will always generate the
> > > > >> >>>>>>> > OpenLineage format only and each consumer will convert this format to
> > > > >> >>>>>>> > their own internal representation, well, if they do this then this
> > > > >> >>>>>>> > seems like a working approach. But that comes with the assumption that
> > > > >> >>>>>>> > each consumer will support it.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > I think it comes down to the question: is the OpenLineage format
> > > > >> >>>>>>> > popular, complete and proper enough for lineage metadata that every
> > > > >> >>>>>>> > consumer will be convinced to support it? We may also consider issues
> > > > >> >>>>>>> > like a mismatch of lineage feature parity, e.g. OpenLineage supports
> > > > >> >>>>>>> > field-level lineage but a consumer doesn't support it (or not at the
> > > > >> >>>>>>> > moment), so we would prefer the lineage metadata transferred to the
> > > > >> >>>>>>> > backend to be slightly different in this case.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > What do you think about the idea:
> > > > >> >>>>>>> > 1. make lineage metadata generated by "operators" agnostic of the
> > > > >> >>>>>>> > specific format, just using entities from a big generic vocabulary of
> > > > >> >>>>>>> > entities, e.g. created here:
> > > > >> >>>>>>> > https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py .
> > > > >> >>>>>>> > We would have there e.g. entities like:
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > --------------------------------------------------------------------
> > > > >> >>>>>>> > import attr
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > >> >>>>>>> > class PostgresTable:
> > > > >> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >     host: str = attr.ib()
> > > > >> >>>>>>> >     port: str = attr.ib()
> > > > >> >>>>>>> >     database: str = attr.ib()
> > > > >> >>>>>>> >     schema: str = attr.ib()
> > > > >> >>>>>>> >     table: str = attr.ib()
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > >> >>>>>>> > class GCSEntity:
> > > > >> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >     bucket: str = attr.ib()
> > > > >> >>>>>>> >     path: str = attr.ib()
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > >> >>>>>>> > class AWSS3Entity:
> > > > >> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >     bucket: str = attr.ib()
> > > > >> >>>>>>> >     path: str = attr.ib()
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > --------------------------------------------------------------------
> > > > >> >>>>>>> > 2. Implement "adapters" that will act as a bridge
> between
> > > > >> "operators" and backends. Their responsibility will be to convert
> > > > lineage
> > > > >> metadata generated by "operators" to a format understandable by
> > specific
> > > > >> backend.
> > > > >> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets
> > > > >> >>>>>>> > to pass Airflow lineage metadata to the Airflow lineage backend.
> > > > >> >>>>>>> >
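A minimal sketch of the "adapters" idea described above, under stated assumptions. It uses the standard library's dataclasses in place of attrs so that it runs without extra dependencies. The OpenLineageStyleAdapter class, its convert method, and the namespace/name layout are hypothetical illustrations, not Airflow or OpenLineage APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostgresTable:
    """Generic lineage entity, mirroring the fields proposed in the thread."""
    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    """Generic lineage entity for a Google Cloud Storage object."""
    bucket: str
    path: str


class OpenLineageStyleAdapter:
    """Hypothetical adapter: turns generic entities into namespace/name dicts."""

    def convert(self, entity: object) -> dict:
        if isinstance(entity, PostgresTable):
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        if isinstance(entity, GCSEntity):
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
        # Unknown entity types are rejected rather than silently dropped.
        raise TypeError(f"unsupported entity: {type(entity).__name__}")


orders = PostgresTable(
    host="db", port="5432", database="shop", schema="public", table="orders"
)
print(OpenLineageStyleAdapter().convert(orders))
```

Another backend would get its own adapter class with the same convert method, so operators only ever deal with the generic entities.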
> > > > >> >>>>>>> > I didn't exactly get the implementation details of your
> > > > >> >>>>>>> > proposed design, but I think maintaining a global vocabulary of
> > > > >> >>>>>>> > entities to use in inlets/outlets of operators is crucial for
> > > > >> >>>>>>> > Airflow, as this could be leveraged to build various features on
> > > > >> >>>>>>> > top of it, like displaying a lineage graph in the Airflow UI
> > > > >> >>>>>>> > (based on XCom) :)
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > It is important to note that if we decide to send lineage
> > > > >> >>>>>>> > metadata out of Airflow only in OpenLineage format, we could then
> > > > >> >>>>>>> > have only one "adapter", OpenLineageAdapter. But the "adapters"
> > > > >> >>>>>>> > approach leaves us room for adding support for others (following
> > > > >> >>>>>>> > the "pluggable" approach that Airflow is mainly known for).
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > All in all:
> > > > >> >>>>>>> > - a global vocabulary of entities used across all "operators"
> > > > >> >>>>>>> >   (with all the advantages mentioned above)
> > > > >> >>>>>>> > - the "adapters" approach
> > > > >> >>>>>>> > seem to me to be the crucial points in the design.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > What do you think about this?
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > - Eugene
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
> > > > >> <julien@astronomer.io.inva <ma...@astronomer.io.inva>lid>
> > wrote:
> > > > >> >>>>>>> >>
> > > > >> >>>>>>> >> Hello Michał,
> > > > >> >>>>>>> >> Thank you for your input.
> > > > >> >>>>>>> >> I would clarify that OpenLineage doesn't make any
> > assumption
> > > > >> about the backend being used to store lineage and is an
> adapter-like
> > > > layer.
> > > > >> >>>>>>> >> OpenLineage exists as the spec specifically for that
> > purpose
> > > > >> of avoiding the problem of every lineage consumer having to
> > understand
> > > > >> every lineage producer.
> > > > >> >>>>>>> >> Consumers of lineage want a unified spec consuming
> > lineage
> > > > >> from any data transformation layer like Airflow, Spark, Flink,
> SQL,
> > > > >> Warehouses, ...
> > > > >> >>>>>>> >> Just like OpenTelemetry allows consuming traces
> > independently
> > > > >> of the technology used, so does OpenLineage for lineage.
> > > > >> >>>>>>> >> Julien
> > > > >> >>>>>>> >>
> > > > >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
> > > > >> michalmodras@google.com <ma...@google.com>> wrote:
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> Hi everyone,
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> As Airflow already supports lineage functionality
> > through
> > > > >> pluggable lineage backends, I think OpenLineage and other lineage
> > > > systems
> > > > >> integration should follow this path. I think more 'native'
> > integration
> > > > with
> > > > >> OpenLineage (or any other lineage system) in Airflow while
> > maintaining
> > > > the
> > > > >> generic lineage backend architecture in parallel would make the
> user
> > > > >> experience less open, troublesome to maintain, and the Airflow
> > > > architecture
> > > > >> itself more constrained by a logic of a specific system.
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> I think enriching operators with a generic method
> > exposing
> > > > >> lineage metadata that could be leveraged by lineage backends
> > regardless
> > > > of
> > > > >> their implementation is a good idea which the Cloud Composer team
> > would
> > > > >> gladly contribute to. I believe the translation of the Airflow
> > metadata
> > > > >> exposed by the operators should be done by lineage backends (or
> > another
> > > > >> adapter-like layer). Tying Airflow operators' development to a
> > specific
> > > > >> lineage system like OpenLineage forces operators' contributors to
> > > > >> understand that system too, which increases both the entry costs
> and
> > > > >> maintenance costs. I see it as unnecessary coupling.
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> Best,
> > > > >> >>>>>>> >>> Michal
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
> > > > >> julien@astronomer.io <ma...@astronomer.io>> wrote:
> > > > >> >>>>>>> >>>>
> > > > >> >>>>>>> >>>> Thank you Eugen,
> > > > >> >>>>>>> >>>> This sounds very aligned with the goals of
> OpenLineage
> > and
> > > > I
> > > > >> think this would work well.
> > > > >> >>>>>>> >>>> Here are the sections in the doc that I think address
> > your
> > > > >> points:
> > > > >> >>>>>>> >>>> - generalize lineage metadata extraction as
> > self-method in
> > > > >> each operator, using generic lineage entities
> > > > >> >>>>>>> >>>> See: OpenLineage support in providers. It describes
> how
> > > > each
> > > > >> operator exposes its lineage.
> > > > >> >>>>>>> >>>> - implement "adapter"s to convert generated metadata
> to
> > > > Data
> > > > >> Lineage format, Open Lineage format, etc.
> > > > >> >>>>>>> >>>> The goal here is each consumer turns from OpenLineage
> > > > format
> > > > >> to their own internal representation as you are suggesting.
> > > > >> >>>>>>> >>>> In the motivation section, towards the end, I link to
> > a few
> > > > >> examples of data catalogs doing just that.
> > > > >> >>>>>>> >>>>
> > > > >> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <
> > > > >> eugen@kosteev.com <ma...@kosteev.com>> wrote:
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>> ++ Michal Modras
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
> > > > >> eugen@kosteev.com <ma...@kosteev.com>> wrote:
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage with
> > > > >> Dataplex" feature which effectively means to generate lineage out
> of
> > > > >> DAG/task executions and export it to Data Lineage (Data Catalog
> > service)
> > > > >> for further analysis.
> > > > >> >>>>>>> >>>>>>
> > > > >>
> > https://cloud.google.com/composer/docs/composer-2/lineage-integration <
> > https://cloud.google.com/composer/docs/composer-2/lineage-integration>
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
> > > > >> >>>>>>> >>>>>> The current implementation uses built-in "Airflow
> > lineage
> > > > >> backend" feature and methods to extract lineage metadata on task
> > post
> > > > >> execution events.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> The general idea was to contribute this to the
> > Airflow
> > > > >> community in a form:
> > > > >> >>>>>>> >>>>>> - generalize lineage metadata extraction as
> > self-method
> > > > in
> > > > >> each operator, using generic lineage entities
> > > > >> >>>>>>> >>>>>> - implement "adapter"s to convert generated
> metadata
> > to
> > > > >> Data Lineage format, Open Lineage format, etc.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer
> would
> > mean
> > > > >> to introduce an additional layer of converting from OpenLineage
> > format
> > > > to
> > > > >> Data Lineage (Data Catalog/Dataplex) format. But this is
> definitely
> > a
> > > > >> possibility.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> > > > >> <julien@astronomer.io.inva <ma...@astronomer.io.inva>lid>
> > wrote:
> > > > >> >>>>>>> >>>>>>>
> > > > >> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
> > > > >> >>>>>>> >>>>>>> I am responding in the comments and adding to the
> > doc
> > > > >> accordingly.
> > > > >> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
> > > > >> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
> > > > >> >>>>>>> >>>>>>> Julien
> > > > >> >>>>>>> >>>>>>>
> > > > >> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
> > > > >> jarek@potiuk.com <ma...@potiuk.com>> wrote:
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> General comment from my side: I think Open
> Lineage
> > is
> > > > >> (and should be
> > > > >> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands
> > Airflow's
> > > > >> capabilities
> > > > >> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all
> > > > >> working on - Airflow
> > > > >> >>>>>>> >>>>>>>> as a Platform.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage
> > goes
> > > > >> the same
> > > > >> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open
> > Telemetry
> > > > >> goes, where we
> > > > >> >>>>>>> >>>>>>>> might decide to support certain standards in
> order
> > to
> > > > >> expand
> > > > >> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allows
> to
> > > > >> plug-in multiple
> > > > >> >>>>>>> >>>>>>>> external solutions that would use the standard
> API.
> > > > >> After Open-Lineage
> > > > >> >>>>>>> >>>>>>>> graduated recently to LFAI&Data foundation (I've
> > been
> > > > >> watching this
> > > > >> >>>>>>> >>>>>>>> happening from far), it is I think the perfect
> > > > candidate
> > > > >> for Airflow
> > > > >> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the
> > > > players
> > > > >> to make use
> > > > >> >>>>>>> >>>>>>>> of the extra work necessary by the community to
> > make it
> > > > >> "officially
> > > > >> >>>>>>> >>>>>>>> supported". I think we have to also get some
> > feedback
> > > > >> from the big
> > > > >> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to
> > have
> > > > >> such a
> > > > >> >>>>>>> >>>>>>>> capability, and another is to get it used in all
> > the
> > > > >> ways Airflow is
> > > > >> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users
> > (which
> > > > >> is obviously a
> > > > >> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where
> > Airflow
> > > > >> is exposed by
> > > > >> >>>>>>> >>>>>>>> others - Astronomer is obviously on board, we see some
> > > > >> >>>>>>> >>>>>>>> warm words from Amazon (mentioned by Julien), and I would
> > > > >> >>>>>>> >>>>>>>> love to hear whether the
> > > > >> >>>>>>> >>>>>>>> Composer team at Google would be on board in
> using
> > the
> > > > >> open-lineage
> > > > >> >>>>>>> >>>>>>>> information exposed this way in their Data
> Catalog
> > (and
> > > > >> likely more)
> > > > >> >>>>>>> >>>>>>>> offering. We have Amundsen and others and
> possibly
> > > > other
> > > > >> stakeholders
> > > > >> >>>>>>> >>>>>>>> might want to say something.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in
> > > > >> >>>>>>> >>>>>>>> implementing and keeping it running smoothly (as Julien
> > > > >> >>>>>>> >>>>>>>> mentioned, that is the main reason why the Open Lineage
> > > > >> >>>>>>> >>>>>>>> community would like to make the integration part of
> > > > >> >>>>>>> >>>>>>>> Airflow). But by being smart and integrating it in
> > > > >> >>>>>>> >>>>>>>> the way that will allow to plug-it-in into our
> CI,
> > > > >> verification
> > > > >> >>>>>>> >>>>>>>> process and making some very clear expectations
> > about
> > > > >> what it means
> > > > >> >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we
> > can
> > > > >> make some
> > > > >> >>>>>>> >>>>>>>> initial investment in making it happen and
> minimise
> > > > >> on-going cost,
> > > > >> >>>>>>> >>>>>>>> while maximising the gain.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> And looking at all the above - I am super happy
> to
> > help
> > > > >> with all that
> > > > >> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate
> well,
> > even
> > > > >> if it will
> > > > >> >>>>>>> >>>>>>>> take an extra effort, especially that we will
> have
> > > > >> experts from Open
> > > > >> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open
> > Lineage
> > > > >> being the core
> > > > >> >>>>>>> >>>>>>>> part of the effort. I am actually super excited -
> > this
> > > > >> might be the
> > > > >> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its
> > position
> > > > as
> > > > >> an
> > > > >> >>>>>>> >>>>>>>> indispensable component of "even more modern data
> > > > stack".
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am
> > looking
> > > > >> forward to
> > > > >> >>>>>>> >>>>>>>> making it happen :).
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> J.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
> > > > >> >>>>>>> >>>>>>>> <julien@astronomer.io.inva <mailto:
> > julien@astronomer.io.inva>lid> wrote:
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Dear Airflow Community,
> > > > >> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
> > > > >> OpenLineage provider to Airflow.
> > > > >> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post
> > an
> > > > >> official AIP.
> > > > >> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
> > > > >> >>>>>>> >>>>>>>> > Thank you,
> > > > >> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the
> > doc:
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Operational lineage collection is a common need
> > to
> > > > >> understand dependencies between data pipelines and track
> end-to-end
> > > > >> provenance of data. It enables many use cases from ensuring
> reliable
> > > > >> delivery of data through observability to compliance and cost
> > > > management.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Publishing operational lineage is a core
> Airflow
> > > > >> capability to enable troubleshooting and governance.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > OpenLineage is a project part of the LFAI&Data
> > > > >> foundation that provides a spec standardizing operational lineage
> > > > >> collection and sharing across the data ecosystem. If it provides
> > plugins
> > > > >> for popular open source projects, its intent is very similar to
> > > > >> OpenTelemetry (also under the Linux Foundation umbrella): to
> remain
> > a
> > > > spec
> > > > >> for lineage exchange that projects - open source or proprietary -
> > > > implement.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will
> > make it
> > > > >> easier and more reliable for Airflow users to publish their
> > operational
> > > > >> lineage through the OpenLineage ecosystem.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > The current external plugin maintained in the
> > > > >> OpenLineage project depends on Airflow and operators internals and
> > gets
> > > > >> broken when changes are made on those. Having a built-in
> integration
> > > > >> ensures a better first class support to expose lineage that gets
> > tested
> > > > >> alongside other changes and therefore is more stable.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> --
> > > > >> >>>>>>> >>>>>> Eugene
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>> --
> > > > >> >>>>>>> >>>>> Eugene
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > --
> > > > >> >>>>>>> > Eugene
> > > > >>
> > > > >
> > > >
> > >
> > >
> > > --
> > > Eugene
> >
> >

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Igor Kholopov <ik...@google.com.INVALID>.
+1, would be happy to join the session! (Please add either
ikholopov@google.com or kholopovus@gmail.com).

Best,
Igor

On Wed, Mar 22, 2023 at 11:27 PM Pierre Jeambrun <pi...@gmail.com>
wrote:

> Same here if you can add me please.
>
> Looking forward to this session.
>
> Le mer. 22 mars 2023 à 23:07, Mehta, Shubham <sh...@amazon.com.invalid> a
> écrit :
>
> > Please include me, I will try my best to join (shubhammehta.93@gmail.com
> )
> >
> > Best,
> > Shubham
> >
> > On 2023-03-22, 2:24 PM, "Jarek Potiuk" <jarek@potiuk.com <mailto:
> > jarek@potiuk.com>> wrote:
> >
> >
> > There are some strange behaviours in the calendar entry - I think you
> > cannot add yourself, only guests can add others :)
> > I've added you Eugen, maybe if someone wants to be also added - please
> > post here with your gmail/calendar addresses.
> >
> >
> > J.
> >
> >
> > On Wed, Mar 22, 2023 at 9:56 PM Eugen Kosteev <eugen@kosteev.com
> <mailto:
> > eugen@kosteev.com>> wrote:
> > >
> > > Hi Julien.
> > >
> > > Can you, please, include me there as well: eugen@kosteev.com <mailto:
> > eugen@kosteev.com> or
> > > kosteev@google.com <ma...@google.com>.
> > > Looking forward to see presentation.
> > >
> > > - Eugene
> > >
> > > On Wed, Mar 22, 2023 at 8:36 PM Julien Le Dem
> <julien@astronomer.io.inva
> > <ma...@astronomer.io.inva>lid>
> > > wrote:
> > >
> > > > Hello all,
> > > > I have to move the OpenLineage presentation to next week.
> > > > Sorry for the change.
> > > > It will be Friday next week March 31st at 5pm CET 9am PT.
> > > >
> > > >
> > > > https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
> > > > Julien
> > > >
> > > > On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <julien@astronomer.io
> > <ma...@astronomer.io>>
> > > > wrote:
> > > >
> > > > > We are planning to do this session next Thursday at 5pm CET 9am
> PT. I
> > > > will
> > > > > send a zoom link in advance.
> > > > > Julien
> > > > >
> > > > > On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <jarek@potiuk.com
> > <ma...@potiuk.com>> wrote:
> > > > >
> > > > >> Cool. I am looking forward to it :). It would be great to get some
> > > > >> insight from those who attempted to get the lineage working in
> > several
> > > > >> versions of Open Lineage and finally arrived at the current
> > > > >> specs/integration.
> > > > >>
> > > > >> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
> > > > >> <julien@astronomer.io.inva <ma...@astronomer.io.inva>lid>
> > wrote:
> > > > >> >
> > > > >> > Thank you Jarek,
> > > > >> > I am happy to organize a zoom presentation about OpenLineage and
> > > > answer
> > > > >> any question. It is indeed a spec decoupling the data
> transformation
> > > > layer
> > > > >> from the Metadata store people are using. Just like OpenTelemetry
> > is for
> > > > >> service metrics/traces.
> > > > >> > Best,
> > > > >> > Julien
> > > > >> >
> > > > >> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <jarek@potiuk.com
> > <ma...@potiuk.com>>
> > > > wrote:
> > > > >> >>
> > > > >> >> And to add a little "parallel" - I think Open Lineage
> integration
> > > > >> replacing our "generic lineage" is very similar step to the new
> > > > >> "Multi-tenant"-ready authentication interface we are discussing in
> > > > >> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
> <
> > https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck>
> > > > >> >>
> > > > >> >> Yes - we have a generic authentication interface, but no - it's
> > > > >> useless for the case where multi-tenancy and good level of
> resource
> > > > >> authorization is needed. It's just far too simplistic and limited.
> > > > >> >>
> > > > >> >> Same with current lineage generic interface - yes, we have it
> but
> > > > it's
> > > > >> only useful in a limited set of cases. and if we want to
> step-it-up
> > we
> > > > need
> > > > >> to come up with something better (and Open Lineage happens to be
> one
> > > > that
> > > > >> has been developed with Airflow in mind and battle tested).
> > > > >> >>
> > > > >> >> J.
> > > > >> >>
> > > > >> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <jarek@potiuk.com
> > <ma...@potiuk.com>>
> > > > wrote:
> > > > >> >>>
> > > > >> >>> Hey Rafał (Eugene, Michal - and others who are looking),
> > > > >> >>>
> > > > >> >>> I think I know where your/Eugen/Michał concerns are coming
> > from. And
> > > > >> I think it would be great if we can talk it over a bit. I believe
> > this
> > > > is
> > > > >> - in parts - quite a misunderstanding of what Open Lineage really
> > is,
> > > > how
> > > > >> much of an integration it is and what are the reasons why it has
> > been
> > > > >> implemented the way it was implemented in Airflow.
> > > > >> >>>
> > > > >> >>> **Idea**: (Julien - Maybe you can organize it ?):
> > > > >> >>>
> > > > >> >>> Maybe we can have an open-to-everyone presentation/zoom call
> > with
> > > > >> quite some time foreseen to ask questions where you would explain
> > the
> > > > >> community about those integration points (and especially those
> > people
> > > > who
> > > > >> are worried we are losing something by choosing the OpenLineage
> > > > >> integration). I would love to see such a presentation -
> specifically
> > > > >> focused on explaining how Open-Lineage is really improving the
> > current
> > > > >> lineage approach and what problems it solves that the existing
> > generic
> > > > >> interface doesn't.
> > > > >> >>>
> > > > >> >>> Just to set the tone and focus for such meeting if we have
> one:
> > > > >> >>>
> > > > >> >>> For me - when I look at Open Lineage, it is really "this is
> how
> > > > >> lineage generic interface **should** be done in Airflow". The
> > "generic"
> > > > >> lineage support we have now is very, very basic, I'd even say far
> > too
> > > > >> simplistic. I would even say, it's useless besides a few, very
> > basic use
> > > > >> cases. Simply because there was never a good "receiver" of the
> > > > information
> > > > >> to cover those cases.
> > > > >> >>>
> > > > >> >>> When you look closely at OpenLineage, it's nothing more than a
> > > > >> >>> better convention for the dictionaries that we send as metadata,
> > > > >> >>> better metadata in the case of SQL operators (Hooks in the future,
> > > > >> >>> hopefully), allowing handling of some cases that the current
> > > > >> >>> lineage simply cannot. Also, what the OpenLineage integration with
> > > > >> >>> Airflow covers is better handling of the "task" and "dag"
> > > > >> >>> lifecycle in Airflow, to be able to bind lineage data together.
> > > > >> >>> That's my understanding of what we get when we integrate OL in.
> > > > >> >>>
> > > > >> >>> I think over the last 2 years Datakin/Astronomer people had
> > worked
> > > > >> out the level of interface that **just works** and if we would
> like
> > to
> > > > get
> > > > >> the lineage information from Airflow as useful as it is in OL, we
> > would
> > > > >> have to anyway implement pretty much all of the things they
> already
> > did.
> > > > >> >>>
> > > > >> >>> I would love (and I think many community members) to take part
> > in
> > > > >> such a call to hear on that particular aspect of the OL
> integration.
> > > > >> >>>
> > > > >> >>> J.
> > > > >> >>>
> > > > >> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <
> > > > >> rafalbiegacz@google.com.inva <mailto:rafalbiegacz@google.com.inva
> >lid>
> > wrote:
> > > > >> >>>>
> > > > >> >>>> Hi,
> > > > >> >>>>
> > > > >> >>>> I second/echo the input provided by Eugene and Michal.
> > > > >> >>>>
> > > > >> >>>> In general, Airflow should provide generic interfaces to
> > lineage
> > > > >> backends so it's easy to configure the one preferred by the user.
> > > > Whether
> > > > >> it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it
> > > > should
> > > > >> be the user's choice.
> > > > >> >>>>
> > > > >> >>>> We should avoid close integration with any specific lineage
> > backend
> > > > >> due to the reasons already mentioned, i.e. to avoid translations
> > between
> > > > >> lineage backends. Also, we would closely couple one framework
> > (Airflow)
> > > > >> with another one (Open Lineage) - it makes Airflow more complex
> and
> > less
> > > > >> flexible. Loose coupling between lineage backends and Airflow seems
> > > > >> to be more future-proof.
> > > > >> >>>>
> > > > >> >>>> Regards, Rafal.
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
> > > > >> <julien@astronomer.io.inva <ma...@astronomer.io.inva>lid>
> > wrote:
> > > > >> >>>>>
> > > > >> >>>>> Dear Airflow community,
> > > > >> >>>>> I have transferred the content of the working google doc I
> > shared
> > > > a
> > > > >> few weeks ago to the Airflow confluence:
> > > > >> >>>>>
> > > > >>
> > > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > <
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > >
> > > > >> >>>>> All comments have been answered, I added clarifications to
> > the doc
> > > > >> accordingly and I also added your suggestions to improve the
> > proposal.
> > > > >> >>>>> All that history is linked from the discussion thread link
> in
> > the
> > > > >> confluence doc if you wish to consult it.
> > > > >> >>>>> Thank you all for your feedback and help in the process.
> > > > >> >>>>> Best
> > > > >> >>>>> Julien
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <
> > > > julien@astronomer.io <ma...@astronomer.io>>
> > > > >> wrote:
> > > > >> >>>>>>
> > > > >> >>>>>> Thank you for the email Jarek, and Eugene for your
> > suggestions,
> > > > >> >>>>>> I do agree with Jarek's assessment. I don't have very much
> > to add
> > > > >> to his argument, it is very thoughtful!
> > > > >> >>>>>> OpenLineage was started to avoid the cartesian complexity
> > that
> > > > >> Eugene mentions. There's actually that specific illustration in
> the
> > > > >> OpenLineage doc.
> > > > >> >>>>>> Lineage consumers want to avoid having to understand the
> > lineage
> > > > >> format of each individual observed data transformation layer. And
> > > > >> transformation layers don't want to understand every Metadata
> > store's
> > > > model
> > > > >> and protocol.
> > > > >> >>>>>> Eugene, about your specific proposal about a global
> > vocabulary of
> > > > >> entities, I think it is a great suggestion.
> > > > >> >>>>>> We can map those entities to Datasets in OpenLineage. The
> way
> > > > >> OpenLineage models this is by allowing specific facets attached to
> > > > Dataset.
> > > > >> Facets are pieces of metadata each with their own JsonSchema.
> > > > >> >>>>>> For example a table from a relational database will have a
> > schema
> > > > >> facet when a file in GCS might not.
> > > > >> >>>>>> So I think in Airflow we could have each of the entity
> > classes
> > > > you
> > > > >> describe be used in the get_openlineage_facets*() API in the
> > Operators.
> > > > >> >>>>>> Each of those classes would know what OpenLineage facets
> > they can
> > > > >> expose.
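A rough sketch of the idea that each entity class knows which facets it can expose, under stated assumptions. TableEntity, FileEntity, their facets() methods and to_dataset() are hypothetical names chosen for illustration. The facet layout is simplified and is not the OpenLineage JSON schema, and the signature of the get_openlineage_facets*() API mentioned above is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntity:
    """Relational table: it has columns, so it can expose a schema facet."""
    name: str
    columns: dict = field(default_factory=dict)  # column name -> type name

    def facets(self) -> dict:
        fields = [{"name": n, "type": t} for n, t in self.columns.items()]
        return {"schema": {"fields": fields}}


@dataclass
class FileEntity:
    """Plain file in object storage: no schema is known, so no facets."""
    name: str

    def facets(self) -> dict:
        return {}


def to_dataset(entity) -> dict:
    """Map an entity to a dataset-like dict with its facets attached."""
    return {"name": entity.name, "facets": entity.facets()}


print(to_dataset(TableEntity(name="public.orders", columns={"id": "int"})))
print(to_dataset(FileEntity(name="data/a.csv")))
```

The table carries a schema facet while the file carries none, which matches the relational table versus GCS file contrast drawn above.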
> > > > >> >>>>>> I'll add a mention in the AIP and I think we can go in more
> > > > >> details in a ticket.
> > > > >> >>>>>> Cheers,
> > > > >> >>>>>> Julien
> > > > >> >>>>>>
> > > > >> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <
> > jarek@potiuk.com <ma...@potiuk.com>>
> > > > >> wrote:
> > > > >> >>>>>>>
> > > > >> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer
> > > > >> >>>>>>> will be more thoughtful).
> > > > >> >>>>>>>
> > > > >> >>>>>>> I think you are right about the "agnostic" part. But I have one
> > > > >> >>>>>>> question - what are we considering "agnostic"?
> > > > >> >>>>>>>
> > > > >> >>>>>>> There is no "widespread" standard for lineage (yet). Open
> > > > Lineage
> > > > >> >>>>>>> with its donation to Linux Foundation Data & AI is
> aspiring
> > to
> > > > >> become
> > > > >> >>>>>>> one. And it's a pretty good candidate:
> > > > >> >>>>>>>
> > > > >> >>>>>>> * designed from the ground up to be agnostic (Open Lineage was
> > > > >> >>>>>>> only published as an API from day one)
> > > > >> >>>>>>> * as of recently, the ownership and governance of Open Lineage
> > > > >> >>>>>>> is with Linux Foundation Data & AI (https://lfaidata.foundation/),
> > > > >> >>>>>>> which is part of "Linux Foundation Project" - a well known and
> > > > >> >>>>>>> respected foundation that - similarly to the ASF - is an umbrella
> > > > >> >>>>>>> and provides governance rules for a big number of well established
> > > > >> >>>>>>> OSS projects
> > > > >> >>>>>>>
> > > > >> >>>>>>> In essence it is the same approach as we already discussed
> > and
> > > > >> >>>>>>> approved for Open Telemetry (which is governed by CNCF
> > which is
> > > > >> in the
> > > > >> >>>>>>> same league as recognition and governance to LFP) (not yet
> > > > >> implemented
> > > > >> >>>>>>> though). In the case of Open-Telemetry, we decided against
> > > > >> developing
> > > > >> >>>>>>> our "own" existing standard but we opted for one that is
> out
> > > > >> there.
> > > > >> >>>>>>> Yes it is a bit more established and popular than Open
> > Lineage
> > > > >> is, but
> > > > >> >>>>>>> i so wish that we chose and implemented it already (and
> > earlier
> > > > >> as not
> > > > >> >>>>>>> having a standard there - except statsd which is really,
> > really
> > > > >> poor)
> > > > >> >>>>>>> has a great impact on Airflow being just "pluggable" in
> > existing
> > > > >> >>>>>>> solutions for monitoring. (BTW. I hope we implement it
> soon
> > and
> > > > I
> > > > >> hear
> > > > >> >>>>>>> (and see) there are attempts to do so).
> > > > >> >>>>>>>
> > > > >> >>>>>>> In the case of Open Lineage, the questions are: is there an
> > > > >> >>>>>>> alternative of the same caliber? Shall we produce our own "agnostic
> > > > >> >>>>>>> standard" for it instead? Is there a chance the idea of
> > > > >> >>>>>>> "airflow-specific" attributes will catch on and many "consumers" will
> > > > >> >>>>>>> be writing their own conversions into a form they can consume?
> > > > >> >>>>>>>
> > > > >> >>>>>>> I would really, really try to avoid the pitfalls nicely summarized
> > > > >> >>>>>>> here: https://xkcd.com/927/
> > > > >> >>>>>>>
> > > > >> >>>>>>> We can of course make a wrong bet and in 2 years Airflow might be the
> > > > >> >>>>>>> only one supporting Open Lineage. That might happen. Though the list
> > > > >> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or maybe -
> > > > >> >>>>>>> more likely - once Airflow implements it, due to Airflow's popularity
> > > > >> >>>>>>> and the fact that there is already competition supporting it (e.g.
> > > > >> >>>>>>> Amundsen), we will increase the chance of "hockey-stick" adoption of
> > > > >> >>>>>>> Open Lineage. My bet is on the latter, and for the benefit of the whole
> > > > >> >>>>>>> ecosystem. I think we have a chance to influence the creation of a new,
> > > > >> >>>>>>> important standard. Much less so, I think, if we just provide our own
> > > > >> >>>>>>> custom solution - with lots and lots of work for others to be able to
> > > > >> >>>>>>> consume it, and no time to properly nurture the API and make it easier
> > > > >> >>>>>>> to implement (which is undoubtedly the main focus of Datakin,
> > > > >> >>>>>>> Astronomer and now the governance run by LFData & AI).
> > > > >> >>>>>>>
> > > > >> >>>>>>> Are there other alternatives we should consider? Do we want to
> > > > >> >>>>>>> develop our own standard (and implement all the integrations from the
> > > > >> >>>>>>> ground up)?
> > > > >> >>>>>>>
> > > > >> >>>>>>> J.
> > > > >> >>>>>>>
> > > > >> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eugen@kosteev.com> wrote:
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > Hi Julien.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > I reviewed the design doc.
> > > > >> >>>>>>> > The general idea looks good to me, but I have some concerns that I would like to share.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > If I understand correctly, the proposed design is to give "operators" their own methods for extracting lineage metadata, and I agree with the motivation. If extraction is decoupled from the operators themselves (in the form of extractors in a separate package), then the downside is that (as you mentioned) extractors will be distributed separately, and the "operators" logic is out of sync with the "lineage extraction" logic by design.
> > > > >> >>>>>>> > Also, knowledge about operator internals spills out of the operator, which is not good at all (at the very least).
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > However, if we make every operator expose a method that generates lineage metadata in a specific format, e.g. OpenLineage etc., then we will end up with the cartesian complexity of supporting each backend format in each provider+operator.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > If you say that the goal is that "operators" will always generate OpenLineage format only and each consumer will convert this format to their own internal representation, well, if they do this then this seems like a working approach. But with the assumption that each consumer will support it.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > I think it comes down to the question: is the OpenLineage format popular, complete and suitable enough for lineage metadata that every consumer will be convinced to support it? We may also consider issues like a mismatch in lineage feature parity, e.g. OpenLineage supports field-level lineage but a consumer doesn't support it (or not at the moment), so we would prefer the lineage metadata transferred to that backend to be slightly different in this case.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > What do you think about the idea:
> > > > >> >>>>>>> > 1. Make the lineage metadata generated by "operators" agnostic of any specific format, just using entities from a big generic vocabulary of entities, e.g. the one created here: https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py. There we would have e.g. entities like:
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > --------------------------------------------------------------------
> > > > >> >>>>>>> > import attr
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > >> >>>>>>> > class PostgresTable:
> > > > >> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >     host: str = attr.ib()
> > > > >> >>>>>>> >     port: str = attr.ib()
> > > > >> >>>>>>> >     database: str = attr.ib()
> > > > >> >>>>>>> >     schema: str = attr.ib()
> > > > >> >>>>>>> >     table: str = attr.ib()
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > >> >>>>>>> > class GCSEntity:
> > > > >> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >     bucket: str = attr.ib()
> > > > >> >>>>>>> >     path: str = attr.ib()
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > > >> >>>>>>> > class AWSS3Entity:
> > > > >> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >     bucket: str = attr.ib()
> > > > >> >>>>>>> >     path: str = attr.ib()
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > --------------------------------------------------------------------
> > > > >> >>>>>>> > 2. Implement "adapters" that will act as a bridge between "operators" and backends. Their responsibility will be to convert lineage metadata generated by "operators" to a format understandable by a specific backend.
> > > > >> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets to pass Airflow lineage metadata to the Airflow lineage backend.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > I didn't fully follow the implementation details of your proposed design, but I think maintaining a global vocabulary of entities to use in the inlets/outlets of operators is crucial for Airflow, as this could be leveraged to build various features on top of it, like displaying a lineage graph in the Airflow UI (based on XCOM) :)
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > It is important to note that if we decide to send lineage metadata out of Airflow only in OpenLineage format, well, we could then have only one "adapter", OpenLineageAdapter. But the "adapters" approach leaves us room for adding support for others (following the "pluggable" approach that Airflow is mainly known for and good at).
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > All in all:
> > > > >> >>>>>>> > - global vocabulary of entities used across all "operators" (with all advantages out of it, mentioned above)
> > > > >> >>>>>>> > - "adapters" approach
> > > > >> >>>>>>> > seem to me to be the crucial points in the design.
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > What do you think about this?
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > - Eugene
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem <julien@astronomer.io.invalid> wrote:
> > > > >> >>>>>>> >>
> > > > >> >>>>>>> >> Hello Michał,
> > > > >> >>>>>>> >> Thank you for your input.
> > > > >> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption about the backend being used to store lineage and is an adapter-like layer.
> > > > >> >>>>>>> >> OpenLineage exists as a spec specifically to avoid the problem of every lineage consumer having to understand every lineage producer.
> > > > >> >>>>>>> >> Consumers of lineage want a unified spec for consuming lineage from any data transformation layer like Airflow, Spark, Flink, SQL, Warehouses, ...
> > > > >> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently of the technology used, so does OpenLineage for lineage.
> > > > >> >>>>>>> >> Julien
> > > > >> >>>>>>> >>
> > > > >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <michalmodras@google.com> wrote:
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> Hi everyone,
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> As Airflow already supports lineage functionality through pluggable lineage backends, I think the integration of OpenLineage and other lineage systems should follow this path. I think a more 'native' integration with OpenLineage (or any other lineage system) in Airflow, while maintaining the generic lineage backend architecture in parallel, would make the user experience less open and more troublesome to maintain, and the Airflow architecture itself more constrained by the logic of a specific system.
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> I think enriching operators with a generic method exposing lineage metadata that could be leveraged by lineage backends regardless of their implementation is a good idea which the Cloud Composer team would gladly contribute to. I believe the translation of the Airflow metadata exposed by the operators should be done by lineage backends (or another adapter-like layer). Tying Airflow operators' development to a specific lineage system like OpenLineage forces operators' contributors to understand that system too, which increases both the entry costs and maintenance costs. I see it as unnecessary coupling.
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> Best,
> > > > >> >>>>>>> >>> Michal
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>>
> > > > >> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <julien@astronomer.io> wrote:
> > > > >> >>>>>>> >>>>
> > > > >> >>>>>>> >>>> Thank you Eugen,
> > > > >> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I think this would work well.
> > > > >> >>>>>>> >>>> Here are the sections in the doc that I think address your points:
> > > > >> >>>>>>> >>>> - generalize lineage metadata extraction as self-method in each operator, using generic lineage entities
> > > > >> >>>>>>> >>>> See: OpenLineage support in providers. It describes how each operator exposes its lineage.
> > > > >> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to Data Lineage format, Open Lineage format, etc.
> > > > >> >>>>>>> >>>> The goal here is that each consumer converts from the OpenLineage format to its own internal representation, as you are suggesting.
> > > > >> >>>>>>> >>>> In the motivation section, towards the end, I link to a few examples of data catalogs doing just that.
> > > > >> >>>>>>> >>>>
> > > > >> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <eugen@kosteev.com> wrote:
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>> ++ Michal Modras
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <eugen@kosteev.com> wrote:
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> Cloud Composer recently launched the "Data lineage with Dataplex" feature, which effectively generates lineage out of DAG/task executions and exports it to Data Lineage (Data Catalog service) for further analysis.
> > > > >> >>>>>>> >>>>>> https://cloud.google.com/composer/docs/composer-2/lineage-integration
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
> > > > >> >>>>>>> >>>>>> The current implementation uses the built-in "Airflow lineage backend" feature and methods to extract lineage metadata on task post-execution events.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow community in the following form:
> > > > >> >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method in each operator, using generic lineage entities
> > > > >> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to Data Lineage format, Open Lineage format, etc.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean introducing an additional layer that converts from OpenLineage format to Data Lineage (Data Catalog/Dataplex) format. But this is definitely a possibility.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem <julien@astronomer.io.invalid> wrote:
> > > > >> >>>>>>> >>>>>>>
> > > > >> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
> > > > >> >>>>>>> >>>>>>> I am responding in the comments and adding to the doc accordingly.
> > > > >> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
> > > > >> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
> > > > >> >>>>>>> >>>>>>> Julien
> > > > >> >>>>>>> >>>>>>>
> > > > >> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <jarek@potiuk.com> wrote:
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is (and should be
> > > > >> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's capabilities
> > > > >> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all working on - Airflow
> > > > >> >>>>>>> >>>>>>>> as a Platform.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes in the same
> > > > >> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry goes, where we
> > > > >> >>>>>>> >>>>>>>> might decide to support certain standards in order to expand the
> > > > >> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allow plugging in multiple
> > > > >> >>>>>>> >>>>>>>> external solutions that would use the standard API. After Open-Lineage
> > > > >> >>>>>>> >>>>>>>> graduated recently to the LFAI&Data foundation (I've been watching this
> > > > >> >>>>>>> >>>>>>>> happening from afar), it is I think the perfect candidate for Airflow
> > > > >> >>>>>>> >>>>>>>> to incorporate. I hope this will help all the players to make use
> > > > >> >>>>>>> >>>>>>>> of the extra work the community needs to do to make it "officially
> > > > >> >>>>>>> >>>>>>>> supported". I think we also have to get some feedback from the big
> > > > >> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have such a
> > > > >> >>>>>>> >>>>>>>> capability, and another is to get it used in all the ways Airflow is
> > > > >> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which is obviously a
> > > > >> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow is exposed by
> > > > >> >>>>>>> >>>>>>>> others - Astronomer is obviously on board. We see some warm words from
> > > > >> >>>>>>> >>>>>>>> Amazon (mentioned by Julien), and I would love to hear whether the
> > > > >> >>>>>>> >>>>>>>> Composer team at Google would be on board with using the open-lineage
> > > > >> >>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and likely more)
> > > > >> >>>>>>> >>>>>>>> offering. We have Amundsen and others, and possibly other stakeholders
> > > > >> >>>>>>> >>>>>>>> might want to say something.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in implementing and
> > > > >> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that is the main
> > > > >> >>>>>>> >>>>>>>> reason why the Open Lineage community would like to make the
> > > > >> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart and integrating it in
> > > > >> >>>>>>> >>>>>>>> a way that allows us to plug it into our CI and verification
> > > > >> >>>>>>> >>>>>>>> process, and by setting some very clear expectations about what it
> > > > >> >>>>>>> >>>>>>>> means for contributors to Airflow to get it running, we can make some
> > > > >> >>>>>>> >>>>>>>> initial investment in making it happen and minimise the on-going cost,
> > > > >> >>>>>>> >>>>>>>> while maximising the gain.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help with all that
> > > > >> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even if it will
> > > > >> >>>>>>> >>>>>>>> take an extra effort, especially since we will have experts from Open
> > > > >> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage being the core
> > > > >> >>>>>>> >>>>>>>> part of the effort. I am actually super excited - this might be the
> > > > >> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position as an
> > > > >> >>>>>>> >>>>>>>> indispensable component of the "even more modern data stack".
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking forward to
> > > > >> >>>>>>> >>>>>>>> making it happen :).
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> J.
> > > > >> >>>>>>> >>>>>>>>
> > > > >> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
> > > > >> >>>>>>> >>>>>>>> <julien@astronomer.io.invalid> wrote:
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Dear Airflow Community,
> > > > >> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an OpenLineage provider to Airflow.
> > > > >> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an official AIP.
> > > > >> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
> > > > >> >>>>>>> >>>>>>>> > Thank you,
> > > > >> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Operational lineage collection is a common need to understand dependencies between data pipelines and track end-to-end provenance of data. It enables many use cases from ensuring reliable delivery of data through observability to compliance and cost management.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow capability to enable troubleshooting and governance.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > OpenLineage is a project in the LFAI&Data foundation that provides a spec standardizing operational lineage collection and sharing across the data ecosystem. While it provides plugins for popular open source projects, its intent is very similar to OpenTelemetry (also under the Linux Foundation umbrella): to remain a spec for lineage exchange that projects - open source or proprietary - implement.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it easier and more reliable for Airflow users to publish their operational lineage through the OpenLineage ecosystem.
> > > > >> >>>>>>> >>>>>>>> >
> > > > >> >>>>>>> >>>>>>>> > The current external plugin maintained in the OpenLineage project depends on Airflow and operator internals and breaks when those change. A built-in integration ensures better, first-class support for exposing lineage: it gets tested alongside other changes and is therefore more stable.
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>>
> > > > >> >>>>>>> >>>>>> --
> > > > >> >>>>>>> >>>>>> Eugene
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>>
> > > > >> >>>>>>> >>>>> --
> > > > >> >>>>>>> >>>>> Eugene
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> >
> > > > >> >>>>>>> > --
> > > > >> >>>>>>> > Eugene
> > > > >>
> > > > >
> > > >
> > >
> > >
> > > --
> > > Eugene
> >
> >
>
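
Editorial sketch of the "generic entities plus adapters" idea that Eugen describes in the quoted thread above. It is only an illustration under stated assumptions, not code from Airflow, the AIP, or OpenLineage: it swaps the attrs classes from the quoted mail for stdlib dataclasses, and the names LineageAdapter, OpenLineageLikeAdapter and export_lineage, as well as the namespace/name dictionary layout, are hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class PostgresTable:
    """Generic lineage entity, mirroring the fields quoted in the thread."""
    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    """Generic lineage entity for an object in a GCS bucket."""
    bucket: str
    path: str


class LineageAdapter(Protocol):
    """Hypothetical adapter interface: one implementation per backend format."""
    def convert(self, entity: object) -> dict: ...


class OpenLineageLikeAdapter:
    """Hypothetical adapter producing namespace/name dicts (assumed layout)."""

    def convert(self, entity: object) -> dict:
        if isinstance(entity, PostgresTable):
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        if isinstance(entity, GCSEntity):
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
        raise TypeError(f"unsupported lineage entity: {type(entity).__name__}")


def export_lineage(inlets, outlets, adapter: LineageAdapter) -> dict:
    """Convert an operator's generic inlets/outlets with the chosen adapter."""
    return {
        "inputs": [adapter.convert(e) for e in inlets],
        "outputs": [adapter.convert(e) for e in outlets],
    }
```

In this shape, operators only know the generic entities; supporting another backend means adding another adapter class rather than touching each operator, which is the trade-off the thread debates against emitting OpenLineage directly.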

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Pierre Jeambrun <pi...@gmail.com>.
Same here if you can add me please.

Looking forward to this session.

On Wed, Mar 22, 2023 at 23:07, Mehta, Shubham <sh...@amazon.com.invalid>
wrote:

> Please include me, I will try my best to join (shubhammehta.93@gmail.com)
>
> Best,
> Shubham
>
> On 2023-03-22, 2:24 PM, "Jarek Potiuk" <jarek@potiuk.com> wrote:
>
> There are some strange behaviours in the calendar entry - I think you
> cannot add yourself, only guests can add others :)
> I've added you Eugen, maybe if someone wants to be also added - please
> post here with your gmail/calendar addresses.
>
>
> J.
>
>
> On Wed, Mar 22, 2023 at 9:56 PM Eugen Kosteev <eugen@kosteev.com> wrote:
> >
> > Hi Julien.
> >
> > Can you please include me there as well: eugen@kosteev.com or
> > kosteev@google.com.
> > Looking forward to seeing the presentation.
> >
> > - Eugene
> >
> > On Wed, Mar 22, 2023 at 8:36 PM Julien Le Dem <julien@astronomer.io.invalid>
> > wrote:
> >
> > > Hello all,
> > > I have to move the OpenLineage presentation to next week.
> > > Sorry for the change.
> > > It will be Friday next week March 31st at 5pm CET 9am PT.
> > >
> > >
> > > https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
> > > Julien
> > >
> > > On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <julien@astronomer.io>
> > > wrote:
> > >
> > > > We are planning to do this session next Thursday at 5pm CET 9am PT. I will
> > > > send a zoom link in advance.
> > > > Julien
> > > >
> > > > On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <jarek@potiuk.com> wrote:
> > > >
> > > >> Cool. I am looking forward to it :). It would be great to get some
> > > >> insight from those who attempted to get the lineage working in several
> > > >> versions of Open Lineage and finally arrived at the current
> > > >> specs/integration.
> > > >>
> > > >> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
> > > >> <julien@astronomer.io.invalid> wrote:
> > > >> >
> > > >> > Thank you Jarek,
> > > >> > I am happy to organize a zoom presentation about OpenLineage and answer any question. It is indeed a spec decoupling the data transformation layer from the Metadata store people are using. Just like OpenTelemetry is for service metrics/traces.
> > > >> > Best,
> > > >> > Julien
> > > >> >
> > > >> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <jarek@potiuk.com> wrote:
> > > >> >>
> > > >> >> And to add a little "parallel" - I think Open Lineage integration replacing our "generic lineage" is a very similar step to the new "Multi-tenant"-ready authentication interface we are discussing in https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
> > > >> >>
> > > >> >> Yes - we have a generic authentication interface, but no - it's useless for the case where multi-tenancy and a good level of resource authorization are needed. It's just far too simplistic and limited.
> > > >> >>
> > > >> >> Same with the current generic lineage interface - yes, we have it, but it's only useful in a limited set of cases, and if we want to step it up we need to come up with something better (and Open Lineage happens to be one that has been developed with Airflow in mind and battle tested).
> > > >> >>
> > > >> >> J.
> > > >> >>
> > > >> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <jarek@potiuk.com> wrote:
> > > >> >>>
> > > >> >>> Hey Rafał (Eugene, Michal - and others who are looking),
> > > >> >>>
> > > >> >>> I think I know where your/Eugen/Michał concerns are coming from. And I think it would be great if we can talk it over a bit. I believe this is - in part - quite a misunderstanding of what Open Lineage really is, how much of an integration it is, and what the reasons are why it has been implemented the way it was implemented in Airflow.
> > > >> >>>
> > > >> >>> **Idea**: (Julien - Maybe you can organize it ?):
> > > >> >>>
> > > >> >>> Maybe we can have an open-to-everyone presentation/zoom call, with plenty of time set aside for questions, where you would explain those integration points to the community (and especially to those people who are worried we are losing something by choosing the OpenLineage integration). I would love to see such a presentation - specifically focused on explaining how Open-Lineage is really improving the current lineage approach and what problems it solves that the existing generic interface doesn't.
> > > >> >>>
> > > >> >>> Just to set the tone and focus for such meeting if we have one:
> > > >> >>>
> > > >> >>> For me - when I look at Open Lineage, it is really "this is how lineage generic interface **should** be done in Airflow". The "generic" lineage support we have now is very, very basic, I'd even say far too simplistic. I would even say, it's useless besides a few, very basic use cases. Simply because there was never a good "receiver" of the information to cover those cases.
> > > >> >>>
> > > >> >>> When you look closely at OpenLineage, it's nothing more than a better convention for the dictionaries that we send as metadata, and better metadata in the case of SQL operators (Hooks in the future, hopefully), allowing it to handle some cases that the current lineage simply cannot. Also, what the open-lineage integration with Airflow covers is better handling of the "task" and "dag" lifecycle in Airflow, to be able to bind lineage data together. That's my understanding of what we get when we integrate OL in.
> > > >> >>>
> > > >> >>> I think over the last 2 years Datakin/Astronomer people have worked out the level of interface that **just works**, and if we would like to make the lineage information from Airflow as useful as it is in OL, we would anyway have to implement pretty much all of the things they already did.
> > > >> >>>
> > > >> >>> I (and I think many community members) would love to take part in such a call to hear about that particular aspect of the OL integration.
> > > >> >>>
> > > >> >>> J.
> > > >> >>>
> > > >> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <rafalbiegacz@google.com.invalid> wrote:
> > > >> >>>>
> > > >> >>>> Hi,
> > > >> >>>>
> > > >> >>>> I second/echo the input provided by Eugene and Michal.
> > > >> >>>>
> > > >> >>>> In general, Airflow should provide generic interfaces to lineage backends so it's easy to configure the one preferred by the user. Whether it's Open Lineage, a proprietary solution, Dataplex Lineage, etc., it should be the user's choice.
> > > >> >>>>
> > > >> >>>> We should avoid close integration with any specific lineage backend due to the reasons already mentioned, i.e. to avoid translations between lineage backends. Also, we would closely couple one framework (Airflow) with another one (Open Lineage) - it makes Airflow more complex and less flexible. Loose coupling between lineage backends and Airflow seems to be more future-proof.
> > > >> >>>>
> > > >> >>>> Regards, Rafal.
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem <julien@astronomer.io.invalid> wrote:
> > > >> >>>>>
> > > >> >>>>> Dear Airflow community,
> > > >> >>>>> I have transferred the content of the working google doc I shared a few weeks ago to the Airflow confluence:
> > > >> >>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > > >> >>>>> All comments have been answered, I added clarifications to the doc accordingly and I also added your suggestions to improve the proposal.
> > > >> >>>>> All that history is linked from the discussion thread link in the confluence doc if you wish to consult it.
> > > >> >>>>> Thank you all for your feedback and help in the process.
> > > >> >>>>> Best
> > > >> >>>>> Julien
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <
> > > julien@astronomer.io>
> > > >> wrote:
> > > >> >>>>>>
> > > >> >>>>>> Thank you for the email Jarek, and Eugene for your
> suggestions,
> > > >> >>>>>> I do agree with Jarek's assessment. I don't have very much
> to add
> > > >> to his argument, it is very thoughtful!
> > > >> >>>>>> OpenLineage was started to avoid the cartesian complexity
> that
> > > >> Eugene mentions. There's actually that specific illustration in the
> > > >> OpenLineage doc.
> > > >> >>>>>> Lineage consumers want to avoid having to understand the
> lineage
> > > >> format of each individual observed data transformation layer. And
> > > >> transformation layers don't want to understand every Metadata
> store's
> > > model
> > > >> and protocol.
> > > >> >>>>>> Eugene, about your specific proposal about a global
> vocabulary of
> > > >> entities, I think it is a great suggestion.
> > > >> >>>>>> We can map those entities to Datasets in OpenLineage. The way
> > > >> OpenLineage models this is by allowing specific facets attached to
> > > Dataset.
> > > >> Facets are pieces of metadata each with their own JsonSchema.
> > > >> >>>>>> For example a table from a relational database will have a
> schema
> > > >> facet when a file in GCS might not.
> > > >> >>>>>> So I think in Airflow we could have each of the entity
> classes
> > > you
> > > >> describe be used in the get_openlineage_facets*() API in the
> Operators.
> > > >> >>>>>> Each of those classes would know what OpenLineage facets
> they can
> > > >> expose.
> > > >> >>>>>> I'll add a mention in the AIP and I think we can go in more
> > > >> details in a ticket.
> > > >> >>>>>> Cheers,
> > > >> >>>>>> Julien
> > > >> >>>>>>
> > > >> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <
> jarek@potiuk.com>
> > > >> wrote:
> > > >> >>>>>>>
> > > >> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's
> answer
> > > >> will
> > > >> >>>>>>> be more thoughtful).
> > > >> >>>>>>>
> > > >> >>>>>>> I think you are right to the "agnostic" part. But I have one
> > > >> question
> > > >> >>>>>>> - what are we considering "agnostic"?
> > > >> >>>>>>>
> > > >> >>>>>>> There is no "widespread" standard for lineage (yet). Open
> > > Lineage
> > > >> >>>>>>> with its donation to Linux Foundation Data & AI is aspiring
> to
> > > >> become
> > > >> >>>>>>> one. And it's a pretty good candidate:
> > > >> >>>>>>>
> > > >> >>>>>>> * designed from grounds-up to be agnostic (Open Lineage was
> only
> > > >> >>>>>>> published as an API from day one)
> > > >> >>>>>>> * as of recently, the ownership and governance of Open
> Lineage
> > > is
> > > >> with
> > > >> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/)
> > > which
> > > >> is
> > > >> >>>>>>> part of "Linux Foundation Project" - well known and
> respected
> > > >> >>>>>>> foundation that - similarly to the ASF is an umbrella and
> > > provides
> > > >> >>>>>>> governance rules for a big number of well established OSS
> > > projects
> > > >> >>>>>>>
> > > >> >>>>>>> In essence it is the same approach as we already discussed
> and
> > > >> >>>>>>> approved for Open Telemetry (which is governed by CNCF
> which is
> > > >> in the
> > > >> >>>>>>> same league as recognition and governance to LFP) (not yet
> > > >> implemented
> > > >> >>>>>>> though). In the case of Open-Telemetry, we decided against
> > > >> developing
> > > >> >>>>>>> our "own" existing standard but we opted for one that is out
> > > >> there.
> > > >> >>>>>>> Yes it is a bit more established and popular than Open
> Lineage
> > > >> is, but
> > > >> >>>>>>> I so wish that we chose and implemented it already (and
> earlier
> > > >> as not
> > > >> >>>>>>> having a standard there - except statsd which is really,
> really
> > > >> poor)
> > > >> >>>>>>> has a great impact on Airflow being just "pluggable" in
> existing
> > > >> >>>>>>> solutions for monitoring. (BTW. I hope we implement it soon
> and
> > > I
> > > >> hear
> > > >> >>>>>>> (and see) there are attempts to do so).
> > > >> >>>>>>>
> > > >> >>>>>>> In the case of Open Lineage, the questions are - is there an
> > > >> >>>>>>> alternative of the same caliber? Shall we produce our own
> > > >> "agnostic
> > > >> >>>>>>> standard" for it instead ? Is there a chance the idea of
> > > >> >>>>>>> "airflow-specific" attributes will catch up and many
> "consumers"
> > > >> will
> > > >> >>>>>>> be writing their own conversions to the way they can
> consume it?
> > > >> >>>>>>>
> > > >> >>>>>>> I would really, really try to avoid the pitfalls nicely
> > > summarized
> > > >> >>>>>>> here: https://xkcd.com/927/
> > > >> >>>>>>>
> > > >> >>>>>>> We can of course make a wrong bet and in 2 years Airflow
> might
> > > be
> > > >> the
> > > >> >>>>>>> only one supporting Open Lineage. That might happen. Though
> the
> > > >> list
> > > >> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO.
> Or
> > > >> maybe -
> > > >> >>>>>>> more likely - once Airflow implements it, due to Airflow's
> > > >> popularity
> > > >> >>>>>>> and the fact that there is already competition supporting it
> > > (e.g.
> > > >> >>>>>>> Amundsen) we will increase the chance of "hockey-stick"
> adoption
> > > >> of
> > > >> >>>>>>> Open Lineage. My bet is - the latter and for the benefit of
> the
> > > >> whole
> > > >> >>>>>>> ecosystem. I think we have a chance to influence creation
> of a
> > > >> new,
> > > >> >>>>>>> important standard. Much less so, I think if we just
> provide our
> > > >> own
> > > >> >>>>>>> custom solution - with lots and lots of work for others to
> be
> > > >> able to
> > > >> >>>>>>> consume it, no time to properly nurture the API and make it
> > > >> easier to
> > > >> >>>>>>> implement it (which is undoubtedly what Datakin, Astronomer
> and
> > > >> now
> > > >> >>>>>>> LFData & AI run governance main focus is)
> > > >> >>>>>>>
> > > >> >>>>>>> Are there other alternatives we should consider ? Do we
> want to
> > > >> >>>>>>> develop our own standard (and implement all the integrations
> > > from
> > > >> the
> > > >> >>>>>>> grounds up) ?
> > > >> >>>>>>>
> > > >> >>>>>>> J.
> > > >> >>>>>>>
> > > >> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <
> > > eugen@kosteev.com>
> > > >> wrote:
> > > >> >>>>>>> >
> > > >> >>>>>>> > Hi Julien.
> > > >> >>>>>>> >
> > > >> >>>>>>> > I reviewed the design doc.
> > > >> >>>>>>> > The general idea looks good to me, but I have some
> concerns
> > > >> that I would like to share.
> > > >> >>>>>>> >
> > > >> >>>>>>> > If I understand correctly the proposed design is to fill
> in
> > > >> "operators" with self-methods to extract lineage metadata from it,
> and I
> > > >> agree with the motivation. If those are decoupled (in a form of
> > > extractors
> > > >> in a separate package) from the operators themselves, then the downside is
> that
> > > (as
> > > >> you mentioned) - extractors will be distributed separately and
> > > "operators"
> > > >> logic is out of sync with "lineage extraction" logic by design.
> > > >> >>>>>>> > Also knowledge about internals of operator spills out of
> the
> > > >> operator which is not good at all (at the very least).
> > > >> >>>>>>> >
> > > >> >>>>>>> > However, if we make every operator expose a method
> to
> > > >> generate lineage metadata of the specific format, e.g. OpenLineage
> etc.,
> > > >> then we will end up with cartesian complexity of supporting in each
> > > >> provider+operator each backend format.
> > > >> >>>>>>> >
> > > >> >>>>>>> > If you say that the goal is that "operators" will always
> > > >> generate OpenLineage format only and each consumer will convert this
> > > format
> > > >> to their own internal representation, well, if they do this then
> this
> > > seems
> > > >> like a working approach. But with the assumption that each consumer
> will
> > > >> support it.
> > > >> >>>>>>> >
> > > >> >>>>>>> > I think it comes down to the question: is OpenLineage
> format
> > > >> popular enough, complete and proper for the lineage metadata that
> every
> > > >> consumer will be convinced to support it. We may also consider
> issues
> > > like
> > > >> mismatch of lineage feature parity, e.g. OpenLineage supports
> > > field-level
> > > >> lineage but consumer doesn't support (or not at the moment), so we
> would
> > > >> prefer lineage metadata transferred to the backend to be slightly
> > > different
> > > >> in this case.
> > > >> >>>>>>> >
> > > >> >>>>>>> > What do you think about the idea:
> > > >> >>>>>>> > 1. make lineage metadata generated by "operators" to be
> > > >> agnostic of the specific format, just using entities from big
> generic
> > > >> vocabulary of entities e.g. created here
> > > >>
> https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py
> > > .
> > > >> We would have there e.g. entities like:
> > > >> >>>>>>> >
> > > >> --------------------------------------------------------------------
> > > >> >>>>>>> > import attr
> > > >> >>>>>>> >
> > > >> >>>>>>> >
> > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > >> >>>>>>> > class PostgresTable:
> > > >> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
> > > >> >>>>>>> >
> > > >> >>>>>>> >     host: str = attr.ib()
> > > >> >>>>>>> >     port: str = attr.ib()
> > > >> >>>>>>> >     database: str = attr.ib()
> > > >> >>>>>>> >     schema: str = attr.ib()
> > > >> >>>>>>> >     table: str = attr.ib()
> > > >> >>>>>>> >
> > > >> >>>>>>> >
> > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > >> >>>>>>> > class GCSEntity:
> > > >> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
> > > >> >>>>>>> >
> > > >> >>>>>>> >     bucket: str = attr.ib()
> > > >> >>>>>>> >     path: str = attr.ib()
> > > >> >>>>>>> >
> > > >> >>>>>>> >
> > > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > > >> >>>>>>> > class AWSS3Entity:
> > > >> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
> > > >> >>>>>>> >
> > > >> >>>>>>> >     bucket: str = attr.ib()
> > > >> >>>>>>> >     path: str = attr.ib()
> > > >> >>>>>>> >
> > > >> --------------------------------------------------------------------
> > > >> >>>>>>> > 2. Implement "adapters" that will act as a bridge between
> > > >> "operators" and backends. Their responsibility will be to convert
> > > lineage
> > > >> metadata generated by "operators" to a format understandable by
> specific
> > > >> backend.
> > > >> >>>>>>> > And then we can use the built-in mechanism of
> inlets/outlets
> > > to
> > > >> pass Airflow lineage metadata to the Airflow lineage backend.
> > > >> >>>>>>> >
> > > >> >>>>>>> > I didn't get exactly implementation details of your
> proposed
> > > >> design, but I think maintaining global vocabulary of entities to
> use in
> > > >> inlets/outlets of operators is crucial for Airflow, as this could be
> > > >> leveraged to build various features on top of it, like displaying
> > > lineage
> > > >> graph in Airflow UI (based on XCOM):)
> > > >> >>>>>>> >
> > > >> >>>>>>> > Importantly to note, if we decide to send out from Airflow
> > > >> lineage metadata only in OpenLineage format, well, we could have
> then
> > > only
> > > >> one "adapter" OpenLineageAdapter. But the "adapters" approach
> leaves us
> > > >> room for adding support to others (following "pluggable" approach as
> > > >> Airflow is mainly known for).
> > > >> >>>>>>> >
> > > >> >>>>>>> > All in all:
> > > >> >>>>>>> > - global vocabulary of entities used across all
> "operators"
> > > >> (with all advantages out of it, mentioned above)
> > > >> >>>>>>> > - "adapters" approach
> > > >> >>>>>>> > seems to me crucial points in the design that make sense
> to
> > > me.
> > > >> >>>>>>> >
> > > >> >>>>>>> > What do you think about this?
> > > >> >>>>>>> >
> > > >> >>>>>>> > - Eugene
> > > >> >>>>>>> >
> > > >> >>>>>>> >
> > > >> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
> > > >> <julien@astronomer.io.invalid>
> wrote:
> > > >> >>>>>>> >>
> > > >> >>>>>>> >> Hello Michał,
> > > >> >>>>>>> >> Thank you for your input.
> > > >> >>>>>>> >> I would clarify that OpenLineage doesn't make any
> assumption
> > > >> about the backend being used to store lineage and is an adapter-like
> > > layer.
> > > >> >>>>>>> >> OpenLineage exists as the spec specifically for that
> purpose
> > > >> of avoiding the problem of every lineage consumer having to
> understand
> > > >> every lineage producer.
> > > >> >>>>>>> >> Consumers of lineage want a unified spec consuming
> lineage
> > > >> from any data transformation layer like Airflow, Spark, Flink, SQL,
> > > >> Warehouses, ...
> > > >> >>>>>>> >> Just like OpenTelemetry allows consuming traces
> independently
> > > >> of the technology used, so does OpenLineage for lineage.
> > > >> >>>>>>> >> Julien
> > > >> >>>>>>> >>
> > > >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
> > > >> michalmodras@google.com> wrote:
> > > >> >>>>>>> >>>
> > > >> >>>>>>> >>> Hi everyone,
> > > >> >>>>>>> >>>
> > > >> >>>>>>> >>> As Airflow already supports lineage functionality
> through
> > > >> pluggable lineage backends, I think OpenLineage and other lineage
> > > systems
> > > >> integration should follow this path. I think more 'native'
> integration
> > > with
> > > >> OpenLineage (or any other lineage system) in Airflow while
> maintaining
> > > the
> > > >> generic lineage backend architecture in parallel would make the user
> > > >> experience less open, troublesome to maintain, and the Airflow
> > > architecture
> > > >> itself more constrained by a logic of a specific system.
> > > >> >>>>>>> >>>
> > > >> >>>>>>> >>> I think enriching operators with a generic method
> exposing
> > > >> lineage metadata that could be leveraged by lineage backends
> regardless
> > > of
> > > >> their implementation is a good idea which the Cloud Composer team
> would
> > > >> gladly contribute to. I believe the translation of the Airflow
> metadata
> > > >> exposed by the operators should be done by lineage backends (or
> another
> > > >> adapter-like layer). Tying Airflow operators' development to a
> specific
> > > >> lineage system like OpenLineage forces operators' contributors to
> > > >> understand that system too, which increases both the entry costs and
> > > >> maintenance costs. I see it as unnecessary coupling.
> > > >> >>>>>>> >>>
> > > >> >>>>>>> >>> Best,
> > > >> >>>>>>> >>> Michal
> > > >> >>>>>>> >>>
> > > >> >>>>>>> >>>
> > > >> >>>>>>> >>>
> > > >> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
> > > >> julien@astronomer.io> wrote:
> > > >> >>>>>>> >>>>
> > > >> >>>>>>> >>>> Thank you Eugen,
> > > >> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage
> and
> > > I
> > > >> think this would work well.
> > > >> >>>>>>> >>>> Here are the sections in the doc that I think address
> your
> > > >> points:
> > > >> >>>>>>> >>>> - generalize lineage metadata extraction as
> self-method in
> > > >> each operator, using generic lineage entities
> > > >> >>>>>>> >>>> See: OpenLineage support in providers. It describes how
> > > each
> > > >> operator exposes its lineage.
> > > >> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to
> > > Data
> > > >> Lineage format, Open Lineage format, etc.
> > > >> >>>>>>> >>>> The goal here is each consumer turns from OpenLineage
> > > format
> > > >> to their own internal representation as you are suggesting.
> > > >> >>>>>>> >>>> In the motivation section, towards the end, I link to
> a few
> > > >> examples of data catalogs doing just that.
> > > >> >>>>>>> >>>>
> > > >> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <
> > > >> eugen@kosteev.com> wrote:
> > > >> >>>>>>> >>>>>
> > > >> >>>>>>> >>>>> ++ Michal Modras
> > > >> >>>>>>> >>>>>
> > > >> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
> > > >> eugen@kosteev.com> wrote:
> > > >> >>>>>>> >>>>>>
> > > >> >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage with
> > > >> Dataplex" feature which effectively means to generate lineage out of
> > > >> DAG/task executions and export it to Data Lineage (Data Catalog
> service)
> > > >> for further analysis.
> > > >> >>>>>>> >>>>>>
> > > >>
> https://cloud.google.com/composer/docs/composer-2/lineage-integration
> > > >> >>>>>>> >>>>>>
> > > >> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
> > > >> >>>>>>> >>>>>> The current implementation uses built-in "Airflow
> lineage
> > > >> backend" feature and methods to extract lineage metadata on task
> post
> > > >> execution events.
> > > >> >>>>>>> >>>>>>
> > > >> >>>>>>> >>>>>> The general idea was to contribute this to the
> Airflow
> > > >> community in a form:
> > > >> >>>>>>> >>>>>> - generalize lineage metadata extraction as
> self-method
> > > in
> > > >> each operator, using generic lineage entities
> > > >> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata
> to
> > > >> Data Lineage format, Open Lineage format, etc.
> > > >> >>>>>>> >>>>>>
> > > >> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would
> mean
> > > >> to introduce an additional layer of converting from OpenLineage
> format
> > > to
> > > >> Data Lineage (Data Catalog/Dataplex) format. But this is definitely
> a
> > > >> possibility.
> > > >> >>>>>>> >>>>>>
> > > >> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> > > >> <julien@astronomer.io.invalid>
> wrote:
> > > >> >>>>>>> >>>>>>>
> > > >> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
> > > >> >>>>>>> >>>>>>> I am responding in the comments and adding to the
> doc
> > > >> accordingly.
> > > >> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
> > > >> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
> > > >> >>>>>>> >>>>>>> Julien
> > > >> >>>>>>> >>>>>>>
> > > >> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
> > > >> jarek@potiuk.com> wrote:
> > > >> >>>>>>> >>>>>>>>
> > > >> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage
> is
> > > >> (and should be
> > > >> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands
> Airflow's
> > > >> capabilities
> > > >> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all
> > > >> working on - Airflow
> > > >> >>>>>>> >>>>>>>> as a Platform.
> > > >> >>>>>>> >>>>>>>>
> > > >> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage
> goes
> > > >> the same
> > > >> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open
> Telemetry
> > > >> goes, where we
> > > >> >>>>>>> >>>>>>>> might decide to support certain standards in order
> to
> > > >> expand
> > > >> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allows to
> > > >> plug-in multiple
> > > >> >>>>>>> >>>>>>>> external solutions that would use the standard API.
> > > >> After Open-Lineage
> > > >> >>>>>>> >>>>>>>> graduated recently to LFAI&Data foundation (I've
> been
> > > >> watching this
> > > >> >>>>>>> >>>>>>>> happening from far), it is I think the perfect
> > > candidate
> > > >> for Airflow
> > > >> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the
> > > players
> > > >> to make use
> > > >> >>>>>>> >>>>>>>> of the extra work necessary by the community to
> make it
> > > >> "officially
> > > >> >>>>>>> >>>>>>>> supported". I think we have to also get some
> feedback
> > > >> from the big
> > > >> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to
> have
> > > >> such a
> > > >> >>>>>>> >>>>>>>> capability, and another is to get it used in all
> the
> > > >> ways Airflow is
> > > >> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users
> (which
> > > >> is obviously a
> > > >> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where
> Airflow
> > > >> is exposed by
> > > >> >>>>>>> >>>>>>>> others - Astronomer is obviously on-board. we see
> some
> > > >> warm words from
> > > >> >>>>>>> >>>>>>>> Amazon (mentioned by Julien), I would love to hear
> > > >> whether the
> > > >> >>>>>>> >>>>>>>> Composer team at Google would be on board in using
> the
> > > >> open-lineage
> > > >> >>>>>>> >>>>>>>> information exposed this way in their Data Catalog
> (and
> > > >> likely more)
> > > >> >>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly
> > > other
> > > >> stakeholders
> > > >> >>>>>>> >>>>>>>> might want to say something.
> > > >> >>>>>>> >>>>>>>>
> > > >> >>>>>>> >>>>>>>>
> > > >> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved
> in
> > > >> implementing and
> > > >> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned,
> that
> > > >> is the main
> > > >> >>>>>>> >>>>>>>> reason why the Open Lineage community would like to
> > > make
> > > >> the
> > > >> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart and
> > > >> integrating it in
> > > >> >>>>>>> >>>>>>>> the way that will allow to plug-it-in into our CI,
> > > >> verification
> > > >> >>>>>>> >>>>>>>> process and making some very clear expectations
> about
> > > >> what it means
> > > >> >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we
> can
> > > >> make some
> > > >> >>>>>>> >>>>>>>> initial investment in making it happen and minimise
> > > >> on-going cost,
> > > >> >>>>>>> >>>>>>>> while maximising the gain.
> > > >> >>>>>>> >>>>>>>>
> > > >> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to
> help
> > > >> with all that
> > > >> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well,
> even
> > > >> if it will
> > > >> >>>>>>> >>>>>>>> take an extra effort, especially that we will have
> > > >> experts from Open
> > > >> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open
> Lineage
> > > >> being the core
> > > >> >>>>>>> >>>>>>>> part of the effort. I am actually super excited -
> this
> > > >> might be the
> > > >> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its
> position
> > > as
> > > >> an
> > > >> >>>>>>> >>>>>>>> indispensable component of "even more modern data
> > > stack".
> > > >> >>>>>>> >>>>>>>>
> > > >> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am
> looking
> > > >> forward to
> > > >> >>>>>>> >>>>>>>> making it happen :).
> > > >> >>>>>>> >>>>>>>>
> > > >> >>>>>>> >>>>>>>> J.
> > > >> >>>>>>> >>>>>>>>
> > > >> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
> > > >> >>>>>>> >>>>>>>> <julien@astronomer.io.invalid> wrote:
> > > >> >>>>>>> >>>>>>>> >
> > > >> >>>>>>> >>>>>>>> > Dear Airflow Community,
> > > >> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
> > > >> OpenLineage provider to Airflow.
> > > >> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post
> an
> > > >> official AIP.
> > > >> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
> > > >> >>>>>>> >>>>>>>> > Thank you,
> > > >> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
> > > >> >>>>>>> >>>>>>>> >
> > > >> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the
> doc:
> > > >> >>>>>>> >>>>>>>> >
> > > >> >>>>>>> >>>>>>>> > Operational lineage collection is a common need
> to
> > > >> understand dependencies between data pipelines and track end-to-end
> > > >> provenance of data. It enables many use cases from ensuring reliable
> > > >> delivery of data through observability to compliance and cost
> > > management.
> > > >> >>>>>>> >>>>>>>> >
> > > >> >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow
> > > >> capability to enable troubleshooting and governance.
> > > >> >>>>>>> >>>>>>>> >
> > > >> >>>>>>> >>>>>>>> > OpenLineage is a project part of the LFAI&Data
> > > >> foundation that provides a spec standardizing operational lineage
> > > >> collection and sharing across the data ecosystem. If it provides
> plugins
> > > >> for popular open source projects, its intent is very similar to
> > > >> OpenTelemetry (also under the Linux Foundation umbrella): to remain
> a
> > > spec
> > > >> for lineage exchange that projects - open source or proprietary -
> > > implement.
> > > >> >>>>>>> >>>>>>>> >
> > > >> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will
> make it
> > > >> easier and more reliable for Airflow users to publish their
> operational
> > > >> lineage through the OpenLineage ecosystem.
> > > >> >>>>>>> >>>>>>>> >
> > > >> >>>>>>> >>>>>>>> > The current external plugin maintained in the
> > > >> OpenLineage project depends on Airflow and operators internals and
> gets
> > > >> broken when changes are made on those. Having a built-in integration
> > > >> ensures a better first class support to expose lineage that gets
> tested
> > > >> alongside other changes and therefore is more stable.
> > > >> >>>>>>> >>>>>>
> > > >> >>>>>>> >>>>>>
> > > >> >>>>>>> >>>>>>
> > > >> >>>>>>> >>>>>> --
> > > >> >>>>>>> >>>>>> Eugene
> > > >> >>>>>>> >>>>>
> > > >> >>>>>>> >>>>>
> > > >> >>>>>>> >>>>>
> > > >> >>>>>>> >>>>> --
> > > >> >>>>>>> >>>>> Eugene
> > > >> >>>>>>> >
> > > >> >>>>>>> >
> > > >> >>>>>>> >
> > > >> >>>>>>> > --
> > > >> >>>>>>> > Eugene
> > > >>
> > > >
> > >
> >
> >
> > --
> > Eugene
>
>
>
>
>
>
>
>
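
For readers following the "generic entities plus adapters" idea from Eugene's message in the quoted thread, here is a minimal sketch of what such an adapter could look like. It is only an illustration under stated assumptions, not the design from AIP-53 and not the API of any Airflow or OpenLineage package. The entity classes mirror the ones proposed in the thread (plain dataclasses are used instead of attrs so the snippet needs only the standard library). The `to_dataset` and `adapt` functions are hypothetical names, and the `namespace`/`name`/`facets` dictionary shape is only loosely modelled on how OpenLineage describes datasets.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class PostgresTable:
    """Generic lineage entity for a Postgres table (fields as in the thread)."""

    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    """Generic lineage entity for a Google Cloud Storage object."""

    bucket: str
    path: str


@singledispatch
def to_dataset(entity) -> dict:
    """Hypothetical adapter hook: convert one entity to a backend format."""
    raise TypeError(f"no adapter rule for {type(entity).__name__}")


@to_dataset.register
def _(entity: PostgresTable) -> dict:
    # Assumed naming scheme; facets are left empty in this sketch.
    return {
        "namespace": f"postgres://{entity.host}:{entity.port}",
        "name": f"{entity.database}.{entity.schema}.{entity.table}",
        "facets": {},
    }


@to_dataset.register
def _(entity: GCSEntity) -> dict:
    return {"namespace": f"gs://{entity.bucket}", "name": entity.path, "facets": {}}


def adapt(inlets, outlets) -> dict:
    """Convert an operator's inlets/outlets into backend-shaped datasets."""
    return {
        "inputs": [to_dataset(e) for e in inlets],
        "outputs": [to_dataset(e) for e in outlets],
    }
```

In this sketch an operator would only declare entities, and supporting another backend would mean writing another set of conversion rules. That is the trade-off debated in the thread: one adapter if OpenLineage is the single output format, or one adapter per backend otherwise.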

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by "Mehta, Shubham" <sh...@amazon.com.INVALID>.
Please include me, I will try my best to join (shubhammehta.93@gmail.com)

Best,
Shubham

On 2023-03-22, 2:24 PM, "Jarek Potiuk" <jarek@potiuk.com> wrote:


There are some strange behaviours in the calendar entry - I think you
cannot add yourself, only guests can add others :)
I've added you Eugen, maybe if someone wants to be also added - please
post here with your gmail/calendar addresses.


J.


On Wed, Mar 22, 2023 at 9:56 PM Eugen Kosteev <eugen@kosteev.com> wrote:
>
> Hi Julien.
>
> Can you, please, include me there as well: eugen@kosteev.com or
> kosteev@google.com.
> Looking forward to seeing the presentation.
>
> - Eugene
>
> On Wed, Mar 22, 2023 at 8:36 PM Julien Le Dem <julien@astronomer.io.invalid>
> wrote:
>
> > Hello all,
> > I have to move the OpenLineage presentation to next week.
> > Sorry for the change.
> > It will be Friday next week March 31st at 5pm CET 9am PT.
> >
> > https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
> > Julien
> >
> > On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <julien@astronomer.io>
> > wrote:
> >
> > > We are planning to do this session next Thursday at 5pm CET 9am PT. I
> > will
> > > send a zoom link in advance.
> > > Julien
> > >
> > > On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <jarek@potiuk.com> wrote:
> > >
> > >> Cool. I am looking forward to it :). It would be great to get some
> > >> insight from those who attempted to get the lineage working in several
> > >> versions of Open Lineage and finally arrived at the current
> > >> specs/integration.
> > >>
> > >> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
> > >> <julien@astronomer.io.invalid> wrote:
> > >> >
> > >> > Thank you Jarek,
> > >> > I am happy to organize a zoom presentation about OpenLineage and
> > answer
> > >> any question. It is indeed a spec decoupling the data transformation
> > layer
> > >> from the Metadata store people are using. Just like OpenTelemetry is for
> > >> service metrics/traces.
> > >> > Best,
> > >> > Julien
> > >> >
> > >> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <jarek@potiuk.com>
> > wrote:
> > >> >>
> > >> >> And to add a little "parallel" - I think Open Lineage integration
> > >> replacing our "generic lineage" is very similar step to the new
> > >> "Multi-tenant"-ready authentication interface we are discussing in
> > >> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
> > >> >>
> > >> >> Yes - we have a generic authentication interface, but no - it's
> > >> useless for the case where multi-tenancy and good level of resource
> > >> authorization is needed. It's just far too simplistic and limited.
> > >> >>
> > >> >> Same with current lineage generic interface - yes, we have it but
> > it's
> > >> only useful in a limited set of cases. and if we want to step-it-up we
> > need
> > >> to come up with something better (and Open Lineage happens to be one
> > that
> > >> has been developed with Airflow in mind and battle tested).
> > >> >>
> > >> >> J.
> > >> >>
> > >> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <jarek@potiuk.com>
> > wrote:
> > >> >>>
> > >> >>> Hey Rafał (Eugene, Michal - and others who are looking),
> > >> >>>
> > >> >>> I think I know where your/Eugen/Michał concerns are coming from. And
> > >> I think it would be great if we can talk it over a bit. I believe this
> > is
> > >> - in parts - quite a misunderstanding of what Open Lineage really is,
> > how
> > >> much of an integration it is and what are the reasons why it has been
> > >> implemented the way it was implemented in Airflow.
> > >> >>>
> > >> >>> **Idea**: (Julien - Maybe you can organize it ?):
> > >> >>>
> > >> >>> Maybe we can have an open-to-everyone presentation/zoom call with
> > >> quite some time foreseen to ask questions where you would explain the
> > >> community about those integration points (and especially those people
> > who
> > >> are worried we are losing something by choosing the OpenLineage
> > >> integration). I would love to see such a presentation - specifically
> > >> focused on explaining how Open-Lineage is really improving the current
> > >> lineage approach and what problems it solves that the existing generic
> > >> interface doesn't.
> > >> >>>
> > >> >>> Just to set the tone and focus for such meeting if we have one:
> > >> >>>
> > >> >>> For me - when I look at Open Lineage, it is really "this is how
> > >> lineage generic interface **should** be done in Airflow". The "generic"
> > >> lineage support we have now is very, very basic, I'd even say far too
> > >> simplistic. I would even say, it's useless besides a few, very basic use
> > >> cases. Simply because there was never a good "receiver" of the
> > information
> > >> to cover those cases.
> > >> >>>
> > >> >>> When you look closely at OpenLineage, it's nothing more than a
> > better
> > >> convention of the dictionaries that we send as a metadata, better
> > meta-data
> > >> in case of SQL operators (Hooks in the future hopefully), allowing
> > handling
> > >> some cases that current lineage simply cannot. Also, the Open Lineage
> > >> integration with Airflow covers better handling of the "task" and "dag"
> > >> lifecycle in Airflow to be able to bind lineage data together. That's my
> > >> understanding of what we get when we integrate OL in.
> > >> >>>
> > >> >>> I think over the last 2 years Datakin/Astronomer people had worked
> > >> out the level of interface that **just works** and if we would like to
> > get
> > >> the lineage information from Airflow as useful as it is in OL, we would
> > >> have to anyway implement pretty much all of the things they already did.
> > >> >>>
> > >> >>> I would love (and I think many community members) to take part in
> > >> such a call to hear on that particular aspect of the OL integration.
> > >> >>>
> > >> >>> J.
> > >> >>>
> > >> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <
> > >> rafalbiegacz@google.com.invalid> wrote:
> > >> >>>>
> > >> >>>> Hi,
> > >> >>>>
> > >> >>>> I second/echo the input provided by Eugene and Michal.
> > >> >>>>
> > >> >>>> In general, Airflow should provide generic interfaces to lineage
> > >> backends so it's easy to configure the one preferred by the user.
> > Whether
> > >> it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it
> > should
> > >> be the user's choice.
> > >> >>>>
> > >> >>>> We should avoid close integration with any specific lineage backend
> > >> due to the reasons already mentioned, i.e. to avoid translations between
> > >> lineage backends. Also, we would closely couple one framework (Airflow)
> > >> with another one (Open Lineage) - it makes Airflow more complex and less
> > >> flexible. Loose coupling between lineage backends and Airflow seems to
> > be
> > >> more future-proof.
> > >> >>>>
> > >> >>>> Regards, Rafal.
> > >> >>>>
> > >> >>>>
> > >> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
> > >> <julien@astronomer.io.invalid> wrote:
> > >> >>>>>
> > >> >>>>> Dear Airflow community,
> > >> >>>>> I have transferred the content of the working google doc I shared
> > a
> > >> few weeks ago to the Airflow confluence:
> > >> >>>>>
> > >>
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > >> >>>>> All comments have been answered, I added clarifications to the doc
> > >> accordingly and I also added your suggestions to improve the proposal.
> > >> >>>>> All that history is linked from the discussion thread link in the
> > >> confluence doc if you wish to consult it.
> > >> >>>>> Thank you all for your feedback and help in the process.
> > >> >>>>> Best
> > >> >>>>> Julien
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <
> > julien@astronomer.io>
> > >> wrote:
> > >> >>>>>>
> > >> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
> > >> >>>>>> I do agree with Jarek's assessment. I don't have very much to add
> > >> to his argument, it is very thoughtful!
> > >> >>>>>> OpenLineage was started to avoid the cartesian complexity that
> > >> Eugene mentions. There's actually that specific illustration in the
> > >> OpenLineage doc.
> > >> >>>>>> Lineage consumers want to avoid having to understand the lineage
> > >> format of each individual observed data transformation layer. And
> > >> transformation layers don't want to understand every Metadata store's
> > model
> > >> and protocol.
> > >> >>>>>> Eugene, about your specific proposal about a global vocabulary of
> > >> entities, I think it is a great suggestion.
> > >> >>>>>> We can map those entities to Datasets in OpenLineage. The way
> > >> OpenLineage models this is by allowing specific facets attached to
> > Dataset.
> > >> Facets are pieces of metadata each with their own JsonSchema.
> > >> >>>>>> For example a table from a relational database will have a schema
> > >> facet when a file in GCS might not.
> > >> >>>>>> So I think in Airflow we could have each of the entity classes
> > you
> > >> describe be used in the get_openlineage_facets*() API in the Operators.
> > >> >>>>>> Each of those classes would know what OpenLineage facets they can
> > >> expose.
> > >> >>>>>> I'll add a mention in the AIP and I think we can go in more
> > >> details in a ticket.
> > >> >>>>>> Cheers,
> > >> >>>>>> Julien
> > >> >>>>>>
> > >> >>>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <jarek@potiuk.com>
> > >> wrote:
> > >> >>>>>>>
> > >> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer
> > >> will
> > >> >>>>>>> be more thoughtful).
> > >> >>>>>>>
> > >> >>>>>>> I think you are right about the "agnostic" part. But I have one
> > >> question
> > >> >>>>>>> - what are we considering "agnostic"?
> > >> >>>>>>>
> > >> >>>>>>> There is no "widespread" standard for lineage (yet). Open
> > Lineage
> > >> >>>>>>> with its donation to Linux Foundation Data & AI is aspiring to
> > >> become
> > >> >>>>>>> one. And it's a pretty good candidate:
> > >> >>>>>>>
> > >> >>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
> > >> >>>>>>> published as an API from day one)
> > >> >>>>>>> * as of recently, the ownership and governance of Open Lineage
> > is
> > >> with
> > >> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/)
> > which
> > >> is
> > >> >>>>>>> part of "Linux Foundation Project" - well known and respected
> > >> >>>>>>> foundation that - similarly to the ASF is an umbrella and
> > provides
> > >> >>>>>>> governance rules for a big number of well established OSS
> > projects
> > >> >>>>>>>
> > >> >>>>>>> In essence it is the same approach as we already discussed and
> > >> >>>>>>> approved for Open Telemetry (which is governed by CNCF which is
> > >> in the
> > >> >>>>>>> same league as recognition and governance to LFP) (not yet
> > >> implemented
> > >> >>>>>>> though). In the case of Open-Telemetry, we decided against
> > >> developing
> > >> >>>>>>> our "own" existing standard but we opted for one that is out
> > >> there.
> > >> >>>>>>> Yes it is a bit more established and popular than Open Lineage
> > >> is, but
> > >> >>>>>>> I so wish that we chose and implemented it already (and earlier
> > >> as not
> > >> >>>>>>> having a standard there - except statsd which is really, really
> > >> poor)
> > >> >>>>>>> has a great impact on Airflow being just "pluggable" in existing
> > >> >>>>>>> solutions for monitoring. (BTW. I hope we implement it soon and
> > I
> > >> hear
> > >> >>>>>>> (and see) there are attempts to do so).
> > >> >>>>>>>
> > >> >>>>>>> In the case of Open Lineage, the questions are - is there an
> > >> >>>>>>> alternative of the same caliber? Shall we produce our own
> > >> "agnostic
> > >> >>>>>>> standard" for it instead ? Is there a chance the idea of
> > >> >>>>>>> "airflow-specific" attributes will catch on and many "consumers"
> > >> will
> > >> >>>>>>> be writing their own conversions to the way they can consume it?
> > >> >>>>>>>
> > >> >>>>>>> I would really, really try to avoid the pitfalls nicely
> > summarized
> > >> >>>>>>> here: https://xkcd.com/927/
> > >> >>>>>>>
> > >> >>>>>>> We can of course make a wrong bet and in 2 years Airflow might
> > be
> > >> the
> > >> >>>>>>> only one supporting Open Lineage. That might happen. Though the
> > >> list
> > >> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or
> > >> maybe -
> > >> >>>>>>> more likely - once Airflow implements it, due to Airflow's
> > >> popularity
> > >> >>>>>>> and the fact that there is already competition supporting it
> > (e.g.
> > >> >>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption
> > >> of
> > >> >>>>>>> Open Lineage. My bet is - the latter and for the benefit of the
> > >> whole
> > >> >>>>>>> ecosystem. I think we have a chance to influence creation of a
> > >> new,
> > >> >>>>>>> important standard. Much less so, I think if we just provide our
> > >> own
> > >> >>>>>>> custom solution - with lots and lots of work for others to be
> > >> able to
> > >> >>>>>>> consume it, no time to properly nurture the API and make it
> > >> easier to
> > >> >>>>>>> implement it (which is undoubtedly the main focus of Datakin,
> > >> >>>>>>> Astronomer and now the LF AI & Data-run governance).
> > >> >>>>>>>
> > >> >>>>>>> Are there other alternatives we should consider ? Do we want to
> > >> >>>>>>> develop our own standard (and implement all the integrations
> > from
> > >> the
> > >> >>>>>>> ground up)?
> > >> >>>>>>>
> > >> >>>>>>> J.
> > >> >>>>>>>
> > >> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <
> > eugen@kosteev.com>
> > >> wrote:
> > >> >>>>>>> >
> > >> >>>>>>> > Hi Julien.
> > >> >>>>>>> >
> > >> >>>>>>> > I reviewed the design doc.
> > >> >>>>>>> > The general idea looks good to me, but I have some concerns
> > >> that I would like to share.
> > >> >>>>>>> >
> > >> >>>>>>> > If I understand correctly the proposed design is to fill in
> > >> "operators" with self-methods to extract lineage metadata from it, and I
> > >> agree with the motivation. If those are decoupled (in the form of extractors
> > >> in a separate package) from the operators themselves, then the downside is
> > >> that (as you mentioned) extractors will be distributed separately and
> > >> "operators" logic is out of sync with "lineage extraction" logic by design.
> > >> >>>>>>> > Also knowledge about internals of operator spills out of the
> > >> operator which is not good at all (at the very least).
> > >> >>>>>>> >
> > >> >>>>>>> > However, if we make every operator expose a method to
> > >> generate lineage metadata in a specific format, e.g. OpenLineage etc.,
> > >> then we will end up with the cartesian complexity of supporting each
> > >> backend format in each provider+operator.
> > >> >>>>>>> >
> > >> >>>>>>> > If you say that the goal is that "operators" will always
> > >> generate OpenLineage format only and each consumer will convert this
> > format
> > >> to their own internal representation, well, if they do this then this
> > seems
> > >> like a working approach. But with the assumption that each consumer will
> > >> support it.
> > >> >>>>>>> >
> > >> >>>>>>> > I think it comes down to the question: is the OpenLineage format
> > >> popular, complete and proper enough for lineage metadata that every
> > >> consumer will be convinced to support it? We may also consider issues
> > like
> > >> mismatch of lineage feature parity, e.g. OpenLineage supports
> > field-level
> > >> lineage but consumer doesn't support (or not at the moment), so we would
> > >> prefer lineage metadata transferred to the backend to be slightly
> > different
> > >> in this case.
> > >> >>>>>>> >
> > >> >>>>>>> > What do you think about the idea:
> > >> >>>>>>> > 1. make lineage metadata generated by "operators" to be
> > >> agnostic of the specific format, just using entities from big generic
> > >> vocabulary of entities e.g. created here
> > >> https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py
> > .
> > >> We would have there e.g. entities like:
> > >> >>>>>>> >
> > >> --------------------------------------------------------------------
> > >> >>>>>>> > import attr
> > >> >>>>>>> >
> > >> >>>>>>> >
> > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > >> >>>>>>> > class PostgresTable:
> > >> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
> > >> >>>>>>> >
> > >> >>>>>>> >     host: str = attr.ib()
> > >> >>>>>>> >     port: str = attr.ib()
> > >> >>>>>>> >     database: str = attr.ib()
> > >> >>>>>>> >     schema: str = attr.ib()
> > >> >>>>>>> >     table: str = attr.ib()
> > >> >>>>>>> >
> > >> >>>>>>> >
> > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > >> >>>>>>> > class GCSEntity:
> > >> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
> > >> >>>>>>> >
> > >> >>>>>>> >     bucket: str = attr.ib()
> > >> >>>>>>> >     path: str = attr.ib()
> > >> >>>>>>> >
> > >> >>>>>>> >
> > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > >> >>>>>>> > class AWSS3Entity:
> > >> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
> > >> >>>>>>> >
> > >> >>>>>>> >     bucket: str = attr.ib()
> > >> >>>>>>> >     path: str = attr.ib()
> > >> >>>>>>> >
> > >> --------------------------------------------------------------------
> > >> >>>>>>> > 2. Implement "adapters" that will act as a bridge between
> > >> "operators" and backends. Their responsibility will be to convert
> > lineage
> > >> metadata generated by "operators" to a format understandable by specific
> > >> backend.
> > >> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets to
> > >> pass Airflow lineage metadata to the Airflow lineage backend.
> > >> >>>>>>> >
> > >> >>>>>>> > I didn't get exactly implementation details of your proposed
> > >> design, but I think maintaining global vocabulary of entities to use in
> > >> inlets/outlets of operators is crucial for Airflow, as this could be
> > >> leveraged to build various features on top of it, like displaying
> > lineage
> > >> graph in Airflow UI (based on XCOM):)
> > >> >>>>>>> >
> > >> >>>>>>> > Important to note: if we decide to send out from Airflow
> > >> lineage metadata only in OpenLineage format, well, we could then have only
> > >> one "adapter", OpenLineageAdapter. But the "adapters" approach leaves us
> > >> room for adding support for others (following the "pluggable" approach that
> > >> Airflow is mainly known for).
> > >> >>>>>>> >
> > >> >>>>>>> > All in all:
> > >> >>>>>>> > - global vocabulary of entities used across all "operators"
> > >> (with all advantages out of it, mentioned above)
> > >> >>>>>>> > - "adapters" approach
> > >> >>>>>>> > seem to me crucial points in the design that make sense.
> > >> >>>>>>> >
> > >> >>>>>>> > What do you think about this?
> > >> >>>>>>> >
> > >> >>>>>>> > - Eugene
> > >> >>>>>>> >
> > >> >>>>>>> >
> > >> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
> > >> <julien@astronomer.io.invalid> wrote:
> > >> >>>>>>> >>
> > >> >>>>>>> >> Hello Michał,
> > >> >>>>>>> >> Thank you for your input.
> > >> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption
> > >> about the backend being used to store lineage and is an adapter-like
> > layer.
> > >> >>>>>>> >> OpenLineage exists as the spec specifically for that purpose
> > >> of avoiding the problem of every lineage consumer having to understand
> > >> every lineage producer.
> > >> >>>>>>> >> Consumers of lineage want a unified spec consuming lineage
> > >> from any data transformation layer like Airflow, Spark, Flink, SQL,
> > >> Warehouses, ...
> > >> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently
> > >> of the technology used, so does OpenLineage for lineage.
> > >> >>>>>>> >> Julien
> > >> >>>>>>> >>
> > >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
> > >> michalmodras@google.com> wrote:
> > >> >>>>>>> >>>
> > >> >>>>>>> >>> Hi everyone,
> > >> >>>>>>> >>>
> > >> >>>>>>> >>> As Airflow already supports lineage functionality through
> > >> pluggable lineage backends, I think OpenLineage and other lineage
> > systems
> > >> integration should follow this path. I think more 'native' integration
> > with
> > >> OpenLineage (or any other lineage system) in Airflow while maintaining
> > the
> > >> generic lineage backend architecture in parallel would make the user
> > >> experience less open, troublesome to maintain, and the Airflow
> > architecture
> > >> itself more constrained by a logic of a specific system.
> > >> >>>>>>> >>>
> > >> >>>>>>> >>> I think enriching operators with a generic method exposing
> > >> lineage metadata that could be leveraged by lineage backends regardless
> > of
> > >> their implementation is a good idea which the Cloud Composer team would
> > >> gladly contribute to. I believe the translation of the Airflow metadata
> > >> exposed by the operators should be done by lineage backends (or another
> > >> adapter-like layer). Tying Airflow operators' development to a specific
> > >> lineage system like OpenLineage forces operators' contributors to
> > >> understand that system too, which increases both the entry costs and
> > >> maintenance costs. I see it as unnecessary coupling.
> > >> >>>>>>> >>>
> > >> >>>>>>> >>> Best,
> > >> >>>>>>> >>> Michal
> > >> >>>>>>> >>>
> > >> >>>>>>> >>>
> > >> >>>>>>> >>>
> > >> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
> > >> julien@astronomer.io> wrote:
> > >> >>>>>>> >>>>
> > >> >>>>>>> >>>> Thank you Eugen,
> > >> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and
> > I
> > >> think this would work well.
> > >> >>>>>>> >>>> Here are the sections in the doc that I think address your
> > >> points:
> > >> >>>>>>> >>>> - generalize lineage metadata extraction as self-method in
> > >> each operator, using generic lineage entities
> > >> >>>>>>> >>>> See: OpenLineage support in providers. It describes how
> > each
> > >> operator exposes its lineage.
> > >> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to
> > Data
> > >> Lineage format, Open Lineage format, etc.
> > >> >>>>>>> >>>> The goal here is each consumer turns from OpenLineage
> > format
> > >> to their own internal representation as you are suggesting.
> > >> >>>>>>> >>>> In the motivation section, towards the end, I link to a few
> > >> examples of data catalogs doing just that.
> > >> >>>>>>> >>>>
> > >> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <
> > >> eugen@kosteev.com> wrote:
> > >> >>>>>>> >>>>>
> > >> >>>>>>> >>>>> ++ Michal Modras
> > >> >>>>>>> >>>>>
> > >> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
> > >> eugen@kosteev.com> wrote:
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage with
> > >> Dataplex" feature which effectively means to generate lineage out of
> > >> DAG/task executions and export it to Data Lineage (Data Catalog service)
> > >> for further analysis.
> > >> >>>>>>> >>>>>>
> > >> https://cloud.google.com/composer/docs/composer-2/lineage-integration
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
> > >> >>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage
> > >> backend" feature and methods to extract lineage metadata on task post
> > >> execution events.
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow
> > >> community in a form:
> > >> >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method
> > in
> > >> each operator, using generic lineage entities
> > >> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to
> > >> Data Lineage format, Open Lineage format, etc.
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean
> > >> to introduce an additional layer of converting from OpenLineage format
> > to
> > >> Data Lineage (Data Catalog/Dataplex) format. But this is definitely a
> > >> possibility.
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> > >> <julien@astronomer.io.invalid> wrote:
> > >> >>>>>>> >>>>>>>
> > >> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
> > >> >>>>>>> >>>>>>> I am responding in the comments and adding to the doc
> > >> accordingly.
> > >> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
> > >> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
> > >> >>>>>>> >>>>>>> Julien
> > >> >>>>>>> >>>>>>>
> > >> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
> > >> jarek@potiuk.com> wrote:
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is
> > >> (and should be
> > >> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's
> > >> capabilities
> > >> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all
> > >> working on - Airflow
> > >> >>>>>>> >>>>>>>> as a Platform.
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes
> > >> the same
> > >> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry
> > >> goes, where we
> > >> >>>>>>> >>>>>>>> might decide to support certain standards in order to
> > >> expand
> > >> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allows to
> > >> plug-in multiple
> > >> >>>>>>> >>>>>>>> external solutions that would use the standard API.
> > >> After Open-Lineage
> > >> >>>>>>> >>>>>>>> graduated recently to LFAI&Data foundation (I've been
> > >> watching this
> > >> >>>>>>> >>>>>>>> happening from far), it is I think the perfect
> > candidate
> > >> for Airflow
> > >> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the
> > players
> > >> to make use
> > >> >>>>>>> >>>>>>>> of the extra work necessary by the community to make it
> > >> "officially
> > >> >>>>>>> >>>>>>>> supported". I think we have to also get some feedback
> > >> from the big
> > >> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have
> > >> such a
> > >> >>>>>>> >>>>>>>> capability, and another is to get it used in all the
> > >> ways Airflow is
> > >> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which
> > >> is obviously a
> > >> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow
> > >> is exposed by
> > >> >>>>>>> >>>>>>>> others - Astronomer is obviously on-board. We see some
> > >> warm words from
> > >> >>>>>>> >>>>>>>> Amazon (mentioned by Julien), I would love to hear
> > >> whether the
> > >> >>>>>>> >>>>>>>> Composer team at Google would be on board in using the
> > >> open-lineage
> > >> >>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and
> > >> likely more)
> > >> >>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly
> > other
> > >> stakeholders
> > >> >>>>>>> >>>>>>>> might want to say something.
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in
> > >> implementing and
> > >> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that
> > >> is the main
> > >> >>>>>>> >>>>>>>> reason why the Open Lineage community would like to
> > make
> > >> the
> > >> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart and
> > >> integrating it in
> > >> >>>>>>> >>>>>>>> the way that will allow to plug-it-in into our CI,
> > >> verification
> > >> >>>>>>> >>>>>>>> process and making some very clear expectations about
> > >> what it means
> > >> >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can
> > >> make some
> > >> >>>>>>> >>>>>>>> initial investment in making it happen and minimise
> > >> on-going cost,
> > >> >>>>>>> >>>>>>>> while maximising the gain.
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help
> > >> with all that
> > >> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even
> > >> if it will
> > >> >>>>>>> >>>>>>>> take an extra effort, especially that we will have
> > >> experts from Open
> > >> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage
> > >> being the core
> > >> >>>>>>> >>>>>>>> part of the effort. I am actually super excited - this
> > >> might be the
> > >> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position
> > as
> > >> an
> > >> >>>>>>> >>>>>>>> indispensable component of "even more modern data
> > stack".
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking
> > >> forward to
> > >> >>>>>>> >>>>>>>> making it happen :).
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> J.
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
> > >> >>>>>>> >>>>>>>> <julien@astronomer.io.invalid> wrote:
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > Dear Airflow Community,
> > >> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
> > >> OpenLineage provider to Airflow.
> > >> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an
> > >> official AIP.
> > >> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
> > >> >>>>>>> >>>>>>>> > Thank you,
> > >> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > Operational lineage collection is a common need to
> > >> understand dependencies between data pipelines and track end-to-end
> > >> provenance of data. It enables many use cases from ensuring reliable
> > >> delivery of data through observability to compliance and cost
> > management.
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow
> > >> capability to enable troubleshooting and governance.
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > OpenLineage is a project part of the LFAI&Data
> > >> foundation that provides a spec standardizing operational lineage
> > >> collection and sharing across the data ecosystem. If it provides plugins
> > >> for popular open source projects, its intent is very similar to
> > >> OpenTelemetry (also under the Linux Foundation umbrella): to remain a
> > spec
> > >> for lineage exchange that projects - open source or proprietary -
> > implement.
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it
> > >> easier and more reliable for Airflow users to publish their operational
> > >> lineage through the OpenLineage ecosystem.
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > The current external plugin maintained in the
> > >> OpenLineage project depends on Airflow and operators internals and gets
> > >> broken when changes are made on those. Having a built-in integration
> > >> ensures a better first class support to expose lineage that gets tested
> > >> alongside other changes and therefore is more stable.
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> --
> > >> >>>>>>> >>>>>> Eugene
> > >> >>>>>>> >>>>>
> > >> >>>>>>> >>>>>
> > >> >>>>>>> >>>>>
> > >> >>>>>>> >>>>> --
> > >> >>>>>>> >>>>> Eugene
> > >> >>>>>>> >
> > >> >>>>>>> >
> > >> >>>>>>> >
> > >> >>>>>>> > --
> > >> >>>>>>> > Eugene
> > >>
> > >
> >
>
>
> --
> Eugene
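
For readers following the quoted thread above: the "adapters" idea Eugene
describes (operators emit generic entities, and an adapter converts them into
a backend-specific format) might look roughly like the sketch below. It is only
an illustration under stated assumptions. The OpenLineageAdapter class, the
converter functions and the namespace/name layout are hypothetical and are not
taken from the AIP or from any OpenLineage client library, and plain
dataclasses stand in for the attrs classes used in the proposal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Type


# Hypothetical stand-ins for the generic entities proposed in the thread.
# The thread uses attrs; dataclasses keep this sketch dependency-free.
@dataclass(frozen=True)
class PostgresTable:
    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    bucket: str
    path: str


def postgres_to_dataset(entity: PostgresTable) -> Dict[str, str]:
    # Assumed layout: namespace identifies the server, name the table in it.
    return {
        "namespace": f"postgres://{entity.host}:{entity.port}",
        "name": f"{entity.database}.{entity.schema}.{entity.table}",
    }


def gcs_to_dataset(entity: GCSEntity) -> Dict[str, str]:
    # Assumed layout: namespace identifies the bucket, name the object path.
    return {"namespace": f"gs://{entity.bucket}", "name": entity.path}


class OpenLineageAdapter:
    """Hypothetical adapter: generic entity -> OpenLineage-style dataset dict."""

    def __init__(self) -> None:
        # One converter per entity class; another backend would ship its own
        # adapter with its own converter table.
        self._converters: Dict[Type, Callable[..., Dict[str, str]]] = {
            PostgresTable: postgres_to_dataset,
            GCSEntity: gcs_to_dataset,
        }

    def convert(self, entity: object) -> Dict[str, str]:
        try:
            converter = self._converters[type(entity)]
        except KeyError:
            raise TypeError(
                f"No converter registered for {type(entity).__name__}"
            ) from None
        return converter(entity)
```

In this sketch operators would only need to know the entity classes, while
knowledge of each backend format stays inside its adapter. That is the
trade-off the thread weighs against having operators emit OpenLineage directly.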



Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Jarek Potiuk <ja...@potiuk.com>.
There are some strange behaviours in the calendar entry - I think you
cannot add yourself, only guests can add others :)
I've added you Eugen, maybe if someone wants to be also added - please
post here with your gmail/calendar addresses.

J.

On Wed, Mar 22, 2023 at 9:56 PM Eugen Kosteev <eu...@kosteev.com> wrote:
>
> Hi Julien.
>
> Can you, please, include me there as well: eugen@kosteev.com or
> kosteev@google.com.
> Looking forward to seeing the presentation.
>
> - Eugene
>
> On Wed, Mar 22, 2023 at 8:36 PM Julien Le Dem <ju...@astronomer.io.invalid>
> wrote:
>
> > Hello all,
> > I have to move the OpenLineage presentation to next week.
> > Sorry for the change.
> > It will be Friday next week March 31st at 5pm CET 9am PT.
> >
> > https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
> > Julien
> >
> > On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <ju...@astronomer.io>
> > wrote:
> >
> > > We are planning to do this session next Thursday at 5pm CET 9am PT. I
> > will
> > > send a zoom link in advance.
> > > Julien
> > >
> > > On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > >> Cool. I am looking forward to it :). It would be great to get some
> > >> insight from those who attempted to get the lineage working in several
> > >> versions of Open Lineage and finally arrived at the current
> > >> specs/integration.
> > >>
> > >> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
> > >> <ju...@astronomer.io.invalid> wrote:
> > >> >
> > >> > Thank you Jarek,
> > >> > I am happy to organize a zoom presentation about OpenLineage and
> > answer
> > >> any question. It is indeed a spec decoupling the data transformation
> > layer
> > >> from the Metadata store people are using. Just like OpenTelemetry is for
> > >> service metrics/traces.
> > >> > Best,
> > >> > Julien
> > >> >
> > >> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com>
> > wrote:
> > >> >>
> > >> >> And to add a little "parallel" - I think Open Lineage integration
> > >> replacing our "generic lineage" is very similar step to the new
> > >> "Multi-tenant"-ready authentication interface we are discussing in
> > >> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
> > >> >>
> > >> >> Yes - we have a generic authentication interface, but no - it's
> > >> useless for the case where multi-tenancy and good level of resource
> > >> authorization is needed. It's just far too simplistic and limited.
> > >> >>
> > >> >> Same with current lineage generic interface - yes, we have it but
> > it's
> > >> only useful in a limited set of cases. And if we want to step-it-up we
> > need
> > >> to come up with something better (and Open Lineage happens to be one
> > that
> > >> has been developed with Airflow in mind and battle tested).
> > >> >>
> > >> >> J.
> > >> >>
> > >> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com>
> > wrote:
> > >> >>>
> > >> >>> Hey Rafał (Eugene, Michal - and others who are looking),
> > >> >>>
> > >> >>> I think I know where your/Eugen/Michał concerns are coming from. And
> > >> I think it would be great if we can talk it over a bit.  I believe this
> > is
> > >> - in parts - quite a misunderstanding of what Open Lineage really is,
> > how
> > >> much of an integration it is and what are the reasons why it has been
> > >> implemented the way it was implemented in Airflow.
> > >> >>>
> > >> >>> **Idea**: (Julien -  Maybe you can organize it ?):
> > >> >>>
> > >> >>> Maybe we can have an open-to-everyone presentation/zoom call with
> > >> quite some time foreseen to ask questions where you would explain to the
> > >> community those integration points (and especially to those people
> > who
> > >> are worried we are losing something by choosing the OpenLineage
> > >> integration). I would love to see such a presentation - specifically
> > >> focused on explaining how Open-Lineage is really improving the current
> > >> lineage approach and what problems it solves that the existing generic
> > >> interface doesn't.
> > >> >>>
> > >> >>> Just to set the tone and focus for such meeting if we have one:
> > >> >>>
> > >> >>> For me - when I look at Open Lineage, it is really "this is how
> > >> lineage generic interface **should** be done in Airflow". The "generic"
> > >> lineage support we have now is very, very basic, I'd even say far too
> > >> simplistic. I would even say, it's useless besides a few, very basic use
> > >> cases. Simply because there was never a good "receiver" of the
> > information
> > >> to cover those cases.
> > >> >>>
> > >> >>> When you look closely at OpenLineage, it's nothing more than a
> > better
> > >> convention of the dictionaries that we send as a metadata, better
> > meta-data
> > >> in case of SQL operators (Hooks in the future hopefully), allowing
> > handling
> > >> some cases that current lineage simply cannot.  Also, what the open-lineage
> > >> integration with Airflow covers is better handling of the "task" and
> > >> "dag" lifecycle in Airflow to be able to bind lineage data together. That's my
> > >> understanding of what we get when we integrate OL in.
> > >> >>>
> > >> >>> I think over the last 2 years Datakin/Astronomer people had worked
> > >> out the level of interface that **just works** and if we would like to
> > get
> > >> the lineage information from Airflow as useful as it is in OL, we would
> > >> have to anyway implement pretty much all of the things they already did.
> > >> >>>
> > >> >>> I would love (and I think many community members) to take part in
> > >> such a call to hear on that particular aspect of the OL integration.
> > >> >>>
> > >> >>> J.
> > >> >>>
> > >> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <
> > >> rafalbiegacz@google.com.invalid> wrote:
> > >> >>>>
> > >> >>>> Hi,
> > >> >>>>
> > >> >>>> I second/echo the input provided by Eugene and Michal.
> > >> >>>>
> > >> >>>> In general, Airflow should provide generic interfaces to lineage
> > >> backends so it's easy to configure the one preferred by the user.
> > Whether
> > >> it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it
> > should
> > >> be the user's choice.
> > >> >>>>
> > >> >>>> We should avoid close integration with any specific lineage backend
> > >> due to the reasons already mentioned, i.e. to avoid translations between
> > >> lineage backends. Also, we would closely couple one framework (Airflow)
> > >> with another one (Open Lineage) - it makes Airflow more complex and less
> > >> flexible. Loose coupling between lineage backends and Airflow seems to
> > be
> > >> more future-proof.
> > >> >>>>
> > >> >>>> Regards, Rafal.
> > >> >>>>
> > >> >>>>
> > >> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
> > >> <ju...@astronomer.io.invalid> wrote:
> > >> >>>>>
> > >> >>>>> Dear Airflow community,
> > >> >>>>> I have transferred the content of the working google doc I shared
> > a
> > >> few weeks ago to the Airflow confluence:
> > >> >>>>>
> > >>
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> > >> >>>>> All comments have been answered, I added clarifications to the doc
> > >> accordingly and I also added your suggestions to improve the proposal.
> > >> >>>>> All that history is linked from the discussion thread link in the
> > >> confluence doc if you wish to consult it.
> > >> >>>>> Thank you all for your feedback and help in the process.
> > >> >>>>> Best
> > >> >>>>> Julien
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <
> > julien@astronomer.io>
> > >> wrote:
> > >> >>>>>>
> > >> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
> > >> >>>>>> I do agree with Jarek's assessment. I don't have very much to add
> > >> to his argument, it is very thoughtful!
> > >> >>>>>> OpenLineage was started to avoid the cartesian complexity that
> > >> Eugene mentions. There's actually that specific illustration in the
> > >> OpenLineage doc.
> > >> >>>>>> Lineage consumers want to avoid having to understand the lineage
> > >> format of each individual observed data transformation layer. And
> > >> transformation layers don't want to understand every Metadata store's
> > model
> > >> and protocol.
> > >> >>>>>> Eugene, about your specific proposal about a global vocabulary of
> > >> entities, I think it is a great suggestion.
> > >> >>>>>> We can map those entities to Datasets in OpenLineage. The way
> > >> OpenLineage models this is by allowing specific facets attached to
> > Dataset.
> > >> Facets are pieces of metadata each with their own JsonSchema.
> > >> >>>>>> For example a table from a relational database will have a schema
> > >> facet when a file in GCS might not.
> > >> >>>>>> So I think in Airflow we could have each of the entity classes
> > you
> > >> describe be used in the get_openlineage_facets*() API in the Operators.
> > >> >>>>>> Each of those classes would know what OpenLineage facets they can
> > >> expose.
> > >> >>>>>> I'll add a mention in the AIP and I think we can go in more
> > >> details in a ticket.
> > >> >>>>>> Cheers,
> > >> >>>>>> Julien
> > >> >>>>>>
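The facet idea in the reply above (each entity class knows which OpenLineage facets it can expose) can be sketched roughly as follows. This is only an illustration under stated assumptions: the to_dataset() method, the dictionary layout and the example host/bucket names are hypothetical, standard-library dataclasses stand in for the attrs classes used elsewhere in the thread, and nothing here is the actual OpenLineage client or Airflow provider API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class PostgresTable:
    """Hypothetical lineage entity for a Postgres table."""

    host: str
    port: int
    database: str
    schema: str
    table: str
    # Optional column name -> column type mapping (assumed for illustration).
    columns: Dict[str, str] = field(default_factory=dict)

    def to_dataset(self) -> Dict[str, Any]:
        # A relational table can expose a schema facet when columns are known.
        facets: Dict[str, Any] = {}
        if self.columns:
            facets["schema"] = {
                "fields": [{"name": n, "type": t} for n, t in self.columns.items()]
            }
        return {
            "namespace": f"postgres://{self.host}:{self.port}",
            "name": f"{self.database}.{self.schema}.{self.table}",
            "facets": facets,
        }


@dataclass
class GCSEntity:
    """Hypothetical lineage entity for an object in Google Cloud Storage."""

    bucket: str
    path: str

    def to_dataset(self) -> Dict[str, Any]:
        # A plain file has no schema facet to expose in this sketch.
        return {"namespace": f"gs://{self.bucket}", "name": self.path, "facets": {}}


table = PostgresTable(
    host="db.example.com",
    port=5432,
    database="shop",
    schema="public",
    table="orders",
    columns={"id": "integer", "total": "numeric"},
)
blob = GCSEntity(bucket="exports", path="orders/2023-02-10.csv")
datasets = [entity.to_dataset() for entity in (table, blob)]
```

In this sketch an operator would only need to return entities; the decision about which facets exist for a given kind of dataset stays inside the entity class.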
> > >> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com>
> > >> wrote:
> > >> >>>>>>>
> > >> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer
> > >> will
> > >> >>>>>>> be more thoughtful).
> > >> >>>>>>>
> > >> >>>>>>> I think you are right about the "agnostic" part. But I have one
> > >> question
> > >> >>>>>>> - what are we considering "agnostic"?
> > >> >>>>>>>
> > >> >>>>>>>  There is no "widespread" standard for lineage (yet). Open
> > Lineage
> > >> >>>>>>> with its donation to Linux Foundation Data & AI is aspiring to
> > >> become
> > >> >>>>>>> one. And it's a pretty good candidate:
> > >> >>>>>>>
> > >> >>>>>>> * designed from grounds-up to be agnostic (Open Lineage was only
> > >> >>>>>>> published as an API from day one)
> > >> >>>>>>> * as of recently, the ownership and governance of Open Lineage
> > is
> > >> with
> > >> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/)
> > which
> > >> is
> > >> >>>>>>> part of "Linux Foundation Project" - well known and respected
> > >> >>>>>>> foundation that - similarly to the ASF is an umbrella and
> > provides
> > >> >>>>>>> governance rules for a big number of well established OSS
> > projects
> > >> >>>>>>>
> > >> >>>>>>> In essence it is the same approach as we already discussed and
> > >> >>>>>>> approved for Open Telemetry (which is governed by CNCF which is
> > >> in the
> > >> >>>>>>> same league as recognition and governance to LFP) (not yet
> > >> implemented
> > >> >>>>>>> though). In the case of Open-Telemetry, we decided against
> > >> developing
> > >> >>>>>>> our "own" existing standard but we opted for one that is out
> > >> there.
> > >> >>>>>>> Yes it is a bit more established and popular than Open Lineage
> > >> is, but
> > >> >>>>>>> I so wish that we had chosen and implemented it already (and earlier
> > >> as not
> > >> >>>>>>> having a standard there - except statsd which is really, really
> > >> poor)
> > >> >>>>>>> has a great impact on Airflow being just "pluggable" in existing
> > >> >>>>>>> solutions for monitoring. (BTW. I hope we implement it soon and
> > I
> > >> hear
> > >> >>>>>>> (and see) there are attempts to do so).
> > >> >>>>>>>
> > >> >>>>>>> In the case of Open Lineage, the questions are - is there an
> > >> >>>>>>> alternative of the same caliber? Shall we produce our own
> > >> "agnostic
> > >> >>>>>>> standard" for it instead ? Is there a chance the idea of
> > >> >>>>>>> "airflow-specific" attributes will catch on and many "consumers"
> > >> will
> > >> >>>>>>> be writing their own conversions to the way they can consume it?
> > >> >>>>>>>
> > >> >>>>>>> I would really, really try to avoid the pitfalls nicely
> > summarized
> > >> >>>>>>> here: https://xkcd.com/927/
> > >> >>>>>>>
> > >> >>>>>>> We can of course make a wrong bet and in 2 years Airflow might
> > be
> > >> the
> > >> >>>>>>> only one supporting Open Lineage. That might happen. Though the
> > >> list
> > >> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or
> > >> maybe -
> > >> >>>>>>> more likely - once Airflow implements it, due to Airflow's
> > >> popularity
> > >> >>>>>>> and the fact that there is already competition supporting it
> > (e.g.
> > >> >>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption
> > >> of
> > >> >>>>>>> Open Lineage. My bet is -  the latter and for the benefit of the
> > >> whole
> > >> >>>>>>> ecosystem. I think we have a chance to influence creation of a
> > >> new,
> > >> >>>>>>> important standard. Much less so, I think if we just provide our
> > >> own
> > >> >>>>>>> custom solution - with lots and lots of work for others to be
> > >> able to
> > >> >>>>>>> consume it, no time to properly nurture the API and make it
> > >> easier to
> > >> >>>>>>> implement it (which is undoubtedly the main focus of Datakin,
> > >> >>>>>>> Astronomer and now the LF AI & Data-run governance).
> > >> >>>>>>>
> > >> >>>>>>> Are there other alternatives we should consider ? Do we want to
> > >> >>>>>>> develop our own standard (and implement all the integrations
> > from
> > >> the
> > >> >>>>>>> grounds up) ?
> > >> >>>>>>>
> > >> >>>>>>> J.
> > >> >>>>>>>
> > >> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <
> > eugen@kosteev.com>
> > >> wrote:
> > >> >>>>>>> >
> > >> >>>>>>> > Hi Julien.
> > >> >>>>>>> >
> > >> >>>>>>> > I reviewed the design doc.
> > >> >>>>>>> > The general idea looks good to me, but I have some concerns
> > >> that I would like to share.
> > >> >>>>>>> >
> > >> >>>>>>> > If I understand correctly the proposed design is to fill in
> > >> "operators" with self-methods to extract lineage metadata from it, and I
> > >> agree with the motivation. If those are decoupled (in a form of
> > extractors
> > >> in a separate package) from the operators themselves, then the downside is that
> > (as
> > >> you mentioned) - extractors will be distributed separately and
> > "operators"
> > >> logic is out of sync with "lineage extraction" logic by design.
> > >> >>>>>>> > Also knowledge about internals of operator spills out of the
> > >> operator which is not good at all (at the very least).
> > >> >>>>>>> >
> > >> >>>>>>> > However, if we make every operator expose a method to
> > >> generate lineage metadata of the specific format, e.g. OpenLineage etc.,
> > >> then we will end up with cartesian complexity of supporting in each
> > >> provider+operator each backend format.
> > >> >>>>>>> >
> > >> >>>>>>> > If you say that the goal is that "operators" will always
> > >> generate OpenLineage format only and each consumer will convert this
> > format
> > >> to their own internal representation, well, if they do this then this
> > seems
> > >> like a working approach. But with the assumption that each consumer will
> > >> support it.
> > >> >>>>>>> >
> > >> >>>>>>> > I think it comes down to the question: is OpenLineage format
> > >> popular, complete and proper enough for the lineage metadata that every
> > >> consumer will be convinced to support it. We may also consider issues
> > like
> > >> mismatch of lineage feature parity, e.g. OpenLineage supports
> > field-level
> > >> lineage but consumer doesn't support (or not at the moment), so we would
> > >> prefer lineage metadata transferred to the backend to be slightly
> > different
> > >> in this case.
> > >> >>>>>>> >
> > >> >>>>>>> > What do you think about the idea:
> > >> >>>>>>> > 1. make lineage metadata generated by "operators" to be
> > >> agnostic of the specific format, just using entities from big generic
> > >> vocabulary of entities e.g. created here
> > >> https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py
> > .
> > >> We would have there e.g. entities like:
> > >> >>>>>>> >
> > >> --------------------------------------------------------------------
> > >> >>>>>>> > import attr
> > >> >>>>>>> >
> > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > >> >>>>>>> > class PostgresTable:
> > >> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
> > >> >>>>>>> >
> > >> >>>>>>> >     host: str = attr.ib()
> > >> >>>>>>> >     port: str = attr.ib()
> > >> >>>>>>> >     database: str = attr.ib()
> > >> >>>>>>> >     schema: str = attr.ib()
> > >> >>>>>>> >     table: str = attr.ib()
> > >> >>>>>>> >
> > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > >> >>>>>>> > class GCSEntity:
> > >> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
> > >> >>>>>>> >
> > >> >>>>>>> >     bucket: str = attr.ib()
> > >> >>>>>>> >     path: str = attr.ib()
> > >> >>>>>>> >
> > >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> > >> >>>>>>> > class AWSS3Entity:
> > >> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
> > >> >>>>>>> >
> > >> >>>>>>> >     bucket: str = attr.ib()
> > >> >>>>>>> >     path: str = attr.ib()
> > >> >>>>>>> >
> > >> --------------------------------------------------------------------
> > >> >>>>>>> > 2. Implement "adapters" that will act as a bridge between
> > >> "operators" and backends. Their responsibility will be to convert
> > lineage
> > >> metadata generated by "operators" to a format understandable by specific
> > >> backend.
> > >> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets
> > to
> > >> pass Airflow lineage metadata to the Airflow lineage backend.
> > >> >>>>>>> >
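The "adapters" idea in the two points above could look roughly like the sketch below: operators emit only generic entities, and one adapter per backend converts them, so supporting a new backend means adding one class rather than touching every operator. All names here (LineageAdapter, OpenLineageStyleAdapter, FlatUriAdapter, export) are hypothetical, the output dictionaries are simplified stand-ins rather than any backend's real wire format, and standard-library dataclasses replace attrs to keep the example dependency-free.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class GCSEntity:
    bucket: str
    path: str


@dataclass
class AWSS3Entity:
    bucket: str
    path: str


# URI scheme per entity type, shared by the adapters in this sketch.
SCHEMES = {GCSEntity: "gs", AWSS3Entity: "s3"}


class LineageAdapter(ABC):
    """Hypothetical bridge from generic entities to one backend's format."""

    @abstractmethod
    def convert(self, entity: Any) -> Dict[str, Any]:
        ...


class OpenLineageStyleAdapter(LineageAdapter):
    """Emits namespace/name pairs, loosely modelled on OpenLineage datasets."""

    def convert(self, entity: Any) -> Dict[str, Any]:
        scheme = SCHEMES[type(entity)]
        return {"namespace": f"{scheme}://{entity.bucket}", "name": entity.path}


class FlatUriAdapter(LineageAdapter):
    """Imaginary second backend that only wants a single URI per entity."""

    def convert(self, entity: Any) -> Dict[str, Any]:
        scheme = SCHEMES[type(entity)]
        return {"uri": f"{scheme}://{entity.bucket}/{entity.path}"}


def export(inlets: List[Any], outlets: List[Any], adapter: LineageAdapter) -> Dict[str, Any]:
    """Convert a task's inlets/outlets with whichever adapter is configured."""
    return {
        "inputs": [adapter.convert(e) for e in inlets],
        "outputs": [adapter.convert(e) for e in outlets],
    }


inlets = [GCSEntity(bucket="raw", path="events/2023-02-10.json")]
outlets = [AWSS3Entity(bucket="curated", path="events/2023-02-10.parquet")]
open_lineage_style = export(inlets, outlets, OpenLineageStyleAdapter())
flat = export(inlets, outlets, FlatUriAdapter())
```

The same inlets and outlets feed both adapters, which is the point of the design: the operator side stays format-agnostic, and the cartesian product of operators and backends collapses to one adapter per backend.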
> > >> >>>>>>> > I didn't get exactly implementation details of your proposed
> > >> design, but I think maintaining global vocabulary of entities to use in
> > >> inlets/outlets of operators is crucial for Airflow, as this could be
> > >> leveraged to build various features on top of it, like displaying
> > lineage
> > >> graph in Airflow UI (based on XCOM):)
> > >> >>>>>>> >
> > >> >>>>>>> > Importantly to note, if we decide to send out from Airflow
> > >> lineage metadata only in OpenLineage format, well, we could then have only
> > only
> > >> one "adapter" OpenLineageAdapter. But the "adapters" approach leaves us
> > >> room for adding support to others (following "pluggable" approach as
> > >> Airflow is mainly known/good about).
> > >> >>>>>>> >
> > >> >>>>>>> > All in all:
> > >> >>>>>>> > - global vocabulary of entities used across all "operators"
> > >> (with all advantages out of it, mentioned above)
> > >> >>>>>>> > - "adapters" approach
> > >> >>>>>>> > seems to me crucial points in the design that make sense to
> > me.
> > >> >>>>>>> >
> > >> >>>>>>> > What do you think about this?
> > >> >>>>>>> >
> > >> >>>>>>> > - Eugene
> > >> >>>>>>> >
> > >> >>>>>>> >
> > >> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
> > >> <ju...@astronomer.io.invalid> wrote:
> > >> >>>>>>> >>
> > >> >>>>>>> >> Hello Michał,
> > >> >>>>>>> >> Thank you for your input.
> > >> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption
> > >> about the backend being used to store lineage and is an adapter-like
> > layer.
> > >> >>>>>>> >> OpenLineage exists as the spec specifically for that purpose
> > >> of avoiding the problem of every lineage consumer having to understand
> > >> every lineage producer.
> > >> >>>>>>> >> Consumers of lineage want a unified spec consuming lineage
> > >> from any data transformation layer like Airflow, Spark, Flink, SQL,
> > >> Warehouses, ...
> > >> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently
> > >> of the technology used, so does OpenLineage for lineage.
> > >> >>>>>>> >> Julien
> > >> >>>>>>> >>
> > >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
> > >> michalmodras@google.com> wrote:
> > >> >>>>>>> >>>
> > >> >>>>>>> >>> Hi everyone,
> > >> >>>>>>> >>>
> > >> >>>>>>> >>> As Airflow already supports lineage functionality through
> > >> pluggable lineage backends, I think OpenLineage and other lineage
> > systems
> > >> integration should follow this path. I think more 'native' integration
> > with
> > >> OpenLineage (or any other lineage system) in Airflow while maintaining
> > the
> > >> generic lineage backend architecture in parallel would make the user
> > >> experience less open, troublesome to maintain, and the Airflow
> > architecture
> > >> itself more constrained by a logic of a specific system.
> > >> >>>>>>> >>>
> > >> >>>>>>> >>> I think enriching operators with a generic method exposing
> > >> lineage metadata that could be leveraged by lineage backends regardless
> > of
> > >> their implementation is a good idea which the Cloud Composer team would
> > >> gladly contribute to. I believe the translation of the Airflow metadata
> > >> exposed by the operators should be done by lineage backends (or another
> > >> adapter-like layer). Tying Airflow operators' development to a specific
> > >> lineage system like OpenLineage forces operators' contributors to
> > >> understand that system too, which increases both the entry costs and
> > >> maintenance costs. I see it as unnecessary coupling.
> > >> >>>>>>> >>>
> > >> >>>>>>> >>> Best,
> > >> >>>>>>> >>> Michal
> > >> >>>>>>> >>>
> > >> >>>>>>> >>>
> > >> >>>>>>> >>>
> > >> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
> > >> julien@astronomer.io> wrote:
> > >> >>>>>>> >>>>
> > >> >>>>>>> >>>> Thank you Eugen,
> > >> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and
> > I
> > >> think this would work well.
> > >> >>>>>>> >>>> Here are the sections in the doc that I think address your
> > >> points:
> > >> >>>>>>> >>>> - generalize lineage metadata extraction as self-method in
> > >> each operator, using generic lineage entities
> > >> >>>>>>> >>>> See: OpenLineage support in providers. It describes how
> > each
> > >> operator exposes its lineage.
> > >> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to
> > Data
> > >> Lineage format, Open Lineage format, etc.
> > >> >>>>>>> >>>> The goal here is each consumer turns from OpenLineage
> > format
> > >> to their own internal representation as you are suggesting.
> > >> >>>>>>> >>>> In the motivation section, towards the end, I link to a few
> > >> examples of data catalogs doing just that.
> > >> >>>>>>> >>>>
> > >> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <
> > >> eugen@kosteev.com> wrote:
> > >> >>>>>>> >>>>>
> > >> >>>>>>> >>>>> ++ Michal Modras
> > >> >>>>>>> >>>>>
> > >> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
> > >> eugen@kosteev.com> wrote:
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage with
> > >> Dataplex" feature which effectively means to generate lineage out of
> > >> DAG/task executions and export it to Data Lineage (Data Catalog service)
> > >> for further analysis.
> > >> >>>>>>> >>>>>>
> > >> https://cloud.google.com/composer/docs/composer-2/lineage-integration
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
> > >> >>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage
> > >> backend" feature and methods to extract lineage metadata on task post
> > >> execution events.
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow
> > >> community in a form:
> > >> >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method
> > in
> > >> each operator, using generic lineage entities
> > >> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to
> > >> Data Lineage format, Open Lineage format, etc.
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean
> > >> to introduce an additional layer of converting from OpenLineage format
> > to
> > >> Data Lineage (Data Catalog/Dataplex) format. But this is definitely a
> > >> possibility.
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> > >> <ju...@astronomer.io.invalid> wrote:
> > >> >>>>>>> >>>>>>>
> > >> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
> > >> >>>>>>> >>>>>>> I am responding in the comments and adding to the doc
> > >> accordingly.
> > >> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
> > >> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
> > >> >>>>>>> >>>>>>> Julien
> > >> >>>>>>> >>>>>>>
> > >> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
> > >> jarek@potiuk.com> wrote:
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is
> > >> (and should be
> > >> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's
> > >> capabilities
> > >> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all
> > >> working on - Airflow
> > >> >>>>>>> >>>>>>>> as a Platform.
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes
> > >> the same
> > >> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry
> > >> goes, where we
> > >> >>>>>>> >>>>>>>> might decide to support certain standards in order to
> > >> expand
> > >> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allows to
> > >> plug-in multiple
> > >> >>>>>>> >>>>>>>> external solutions that would use the standard API.
> > >> After Open-Lineage
> > >> >>>>>>> >>>>>>>> graduated recently to  LFAI&Data foundation (I've been
> > >> watching this
> > >> >>>>>>> >>>>>>>> happening from far), it is I think the perfect
> > candidate
> > >> for Airflow
> > >> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the
> > players
> > >> to make use
> > >> >>>>>>> >>>>>>>> of the extra work necessary by the community to make it
> > >> "officially
> > >> >>>>>>> >>>>>>>> supported". I think we have to also get some feedback
> > >> from the big
> > >> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have
> > >> such a
> > >> >>>>>>> >>>>>>>> capability, and another is to get it used in all the
> > >> ways Airflow is
> > >> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which
> > >> is obviously a
> > >> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow
> > >> is exposed by
> > >> >>>>>>> >>>>>>>> others - Astronomer is obviously on-board. we see some
> > >> warm words from
> > >> >>>>>>> >>>>>>>> Amazon (mentioned by Julien), I would love to hear
> > >> whether the
> > >> >>>>>>> >>>>>>>> Composer team at Google would be on board in using the
> > >> open-lineage
> > >> >>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and
> > >> likely more)
> > >> >>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly
> > other
> > >> stakeholders
> > >> >>>>>>> >>>>>>>> might want to say something.
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in
> > >> implementing and
> > >> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that
> > >> is the main
> > >> >>>>>>> >>>>>>>> reason why the Open Lineage community would like to
> > make
> > >> the
> > >> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart and
> > >> integrating it in
> > >> >>>>>>> >>>>>>>> the way that will allow to plug-it-in into our CI,
> > >> verification
> > >> >>>>>>> >>>>>>>> process and making some very clear expectations about
> > >> what it means
> > >> >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can
> > >> make some
> > >> >>>>>>> >>>>>>>> initial investment in making it happen and minimise
> > >> on-going cost,
> > >> >>>>>>> >>>>>>>> while maximising the gain.
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help
> > >> with all that
> > >> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even
> > >> if it will
> > >> >>>>>>> >>>>>>>> take an extra effort, especially that we will have
> > >> experts from Open
> > >> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage
> > >> being the core
> > >> >>>>>>> >>>>>>>> part of the effort. I am actually super excited - this
> > >> might be the
> > >> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position
> > as
> > >> an
> > >> >>>>>>> >>>>>>>> indispensable component of "even more modern data
> > stack".
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking
> > >> forward to
> > >> >>>>>>> >>>>>>>> making it happen :).
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> J.
> > >> >>>>>>> >>>>>>>>
> > >> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
> > >> >>>>>>> >>>>>>>> <ju...@astronomer.io.invalid> wrote:
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > Dear Airflow Community,
> > >> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
> > >> OpenLineage provider to Airflow.
> > >> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an
> > >> official AIP.
> > >> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
> > >> >>>>>>> >>>>>>>> > Thank you,
> > >> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > Operational lineage collection is a common need to
> > >> understand dependencies between data pipelines and track end-to-end
> > >> provenance of data. It enables many use cases from ensuring reliable
> > >> delivery of data through observability to compliance and cost
> > management.
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow
> > >> capability to enable troubleshooting and governance.
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > OpenLineage is a project part of the LFAI&Data
> > >> foundation that provides a spec standardizing operational lineage
> > >> collection and sharing across the data ecosystem. If it provides plugins
> > >> for popular open source projects, its intent is very similar to
> > >> OpenTelemetry (also under the Linux Foundation umbrella): to remain a
> > spec
> > >> for lineage exchange that projects - open source or proprietary -
> > implement.
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it
> > >> easier and more reliable for Airflow users to publish their operational
> > >> lineage through the OpenLineage ecosystem.
> > >> >>>>>>> >>>>>>>> >
> > >> >>>>>>> >>>>>>>> > The current external plugin maintained in the
> > >> OpenLineage project depends on Airflow and operators internals and gets
> > >> broken when changes are made on those. Having a built-in integration
> > >> ensures a better first class support to expose lineage that gets tested
> > >> alongside other changes and therefore is more stable.
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>>
> > >> >>>>>>> >>>>>> --
> > >> >>>>>>> >>>>>> Eugene
> > >> >>>>>>> >>>>>
> > >> >>>>>>> >>>>>
> > >> >>>>>>> >>>>>
> > >> >>>>>>> >>>>> --
> > >> >>>>>>> >>>>> Eugene
> > >> >>>>>>> >
> > >> >>>>>>> >
> > >> >>>>>>> >
> > >> >>>>>>> > --
> > >> >>>>>>> > Eugene
> > >>
> > >
> >
>
>
> --
> Eugene



Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Eugen Kosteev <eu...@kosteev.com>.
Hi Julien.

Can you, please, include me there as well: eugen@kosteev.com or
kosteev@google.com.
Looking forward to seeing the presentation.

- Eugene

On Wed, Mar 22, 2023 at 8:36 PM Julien Le Dem <ju...@astronomer.io.invalid>
wrote:

> Hello all,
> I have to move the OpenLineage presentation to next week.
> Sorry for the change.
> It will be Friday next week March 31st at 5pm CET 9am PT.
>
> https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
> Julien
>
> On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <ju...@astronomer.io>
> wrote:
>
> > We are planning to do this session next Thursday at 5pm CET 9am PT. I
> will
> > send a zoom link in advance.
> > Julien
> >
> > On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> >> Cool. I am looking forward to it :). It would be great to get some
> >> insight from those who attempted to get the lineage working in several
> >> versions of Open Lineage and finally arrived at the current
> >> specs/integration.
> >>
> >> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
> >> <ju...@astronomer.io.invalid> wrote:
> >> >
> >> > Thank you Jarek,
> >> > I am happy to organize a zoom presentation about OpenLineage and
> answer
> >> any question. It is indeed a spec decoupling the data transformation
> layer
> >> from the Metadata store people are using. Just like OpenTelemetry is for
> >> service metrics/traces.
> >> > Best,
> >> > Julien
> >> >
> >> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com>
> wrote:
> >> >>
> >> >> And to add a little "parallel" - I think Open Lineage integration
> >> replacing our "generic lineage" is very similar step to the new
> >> "Multi-tenant"-ready authentication interface we are discussing in
> >> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
> >> >>
> >> >> Yes - we have a generic authentication interface, but no - it's
> >> useless for the case where multi-tenancy and good level of resource
> >> authorization is needed. It's just far too simplistic and limited.
> >> >>
> >> >> Same with current lineage generic interface - yes, we have it but
> it's
> >> only useful in a limited set of cases. and if we want to step-it-up we
> need
> >> to come up with something better (and Open Lineage happens to be one
> that
> >> has been developed with Airflow in mind and battle tested).
> >> >>
> >> >> J.
> >> >>
> >> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com>
> wrote:
> >> >>>
> >> >>> Hey Rafał (Eugene, Michal - and others who are looking),
> >> >>>
> >> >>> I think I know where your/Eugen/Michał concerns are coming from. And
> >> I think it would be great if we can talk it over a bit.  I believe this
> is
> >> - in parts - quite a misunderstanding of what Open Lineage really is,
> how
> >> much of an integration it is and what are the reasons why it has been
> >> implemented the way it was implemented in Airflow.
> >> >>>
> >> >>> **Idea**: (Julien -  Maybe you can organize it ?):
> >> >>>
> >> >>> Maybe we can have an open-to-everyone presentation/zoom call with
> >> quite some time foreseen to ask questions where you would explain the
> >> community about those integration points (and especially those people
> who
> >> are worried we are losing something by choosing the OpenLineage
> >> integration). I would love to see such a presentation - specifically
> >> focused on explaining how Open-Lineage is really improving the current
> >> lineage approach and what problems it solves that the existing generic
> >> interface doesn't.
> >> >>>
> >> >>> Just to set the tone and focus for such meeting if we have one:
> >> >>>
> >> >>> For me - when I look at Open Lineage, it is really "this is how
> >> lineage generic interface **should** be done in Airflow". The "generic"
> >> lineage support we have now is very, very basic, I'd even say far too
> >> simplistic. I would even say, it's useless besides a few, very basic use
> >> cases. Simply because there was never a good "receiver" of the
> information
> >> to cover those cases.
> >> >>>
> >> >>> When you look closely at OpenLineage, it's nothing more than a
> better
> >> convention of the dictionaries that we send as a metadata, better
> meta-data
> >> in case of SQL operators (Hooks in the future hopefully), allowing
> handling
> >> some cases that current lineage simply cannot.  Also, what the open-lineage
> >> integration with Airflow covers is better handling of the "task" and
> >> "dag" lifecycle in Airflow to be able to bind lineage data together. That's my
> >> understanding of what we get when we integrate OL in.
> >> >>>
> >> >>> I think over the last 2 years Datakin/Astronomer people had worked
> >> out the level of interface that **just works** and if we would like to
> get
> >> the lineage information from Airflow as useful as it is in OL, we would
> >> have to anyway implement pretty much all of the things they already did.
> >> >>>
> >> >>> I would love (and I think many community members) to take part in
> >> such a call to hear on that particular aspect of the OL integration.
> >> >>>
> >> >>> J.
> >> >>>
> >> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <
> >> rafalbiegacz@google.com.invalid> wrote:
> >> >>>>
> >> >>>> Hi,
> >> >>>>
> >> >>>> I second/echo the input provided by Eugene and Michal.
> >> >>>>
> >> >>>> In general, Airflow should provide generic interfaces to lineage
> >> backends so it's easy to configure the one preferred by the user.
> Whether
> >> it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it
> should
> >> be the user's choice.
> >> >>>>
> >> >>>> We should avoid close integration with any specific lineage backend
> >> due to the reasons already mentioned, i.e. to avoid translations between
> >> lineage backends. Also, we would closely couple one framework (Airflow)
> >> with another one (Open Lineage) - it makes Airflow more complex and less
> >> flexible. Loose coupling between lineage backends and Airflow seems to be
> >> more future-proof.
> >> >>>>
> >> >>>> Regards, Rafal.
> >> >>>>
> >> >>>>
> >> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
> >> <ju...@astronomer.io.invalid> wrote:
> >> >>>>>
> >> >>>>> Dear Airflow community,
> >> >>>>> I have transferred the content of the working google doc I shared
> a
> >> few weeks ago to the Airflow confluence:
> >> >>>>>
> >>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
> >> >>>>> All comments have been answered, I added clarifications to the doc
> >> accordingly and I also added your suggestions to improve the proposal.
> >> >>>>> All that history is linked from the discussion thread link in the
> >> confluence doc if you wish to consult it.
> >> >>>>> Thank you all for your feedback and help in the process.
> >> >>>>> Best
> >> >>>>> Julien
> >> >>>>>
> >> >>>>>
> >> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <
> julien@astronomer.io>
> >> wrote:
> >> >>>>>>
> >> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
> >> >>>>>> I do agree with Jarek's assessment. I don't have very much to add
> >> to his argument, it is very thoughtful!
> >> >>>>>> OpenLineage was started to avoid the cartesian complexity that
> >> Eugene mentions. There's actually that specific illustration in the
> >> OpenLineage doc.
> >> >>>>>> Lineage consumers want to avoid having to understand the lineage
> >> format of each individual observed data transformation layer. And
> >> transformation layers don't want to understand every Metadata store's
> model
> >> and protocol.
> >> >>>>>> Eugene, about your specific proposal about a global vocabulary of
> >> entities, I think it is a great suggestion.
> >> >>>>>> We can map those entities to Datasets in OpenLineage. The way
> >> OpenLineage models this is by allowing specific facets attached to
> Dataset.
> >> Facets are pieces of metadata each with their own JsonSchema.
> >> >>>>>> For example a table from a relational database will have a schema
> >> facet when a file in GCS might not.
> >> >>>>>> So I think in Airflow we could have each of the entity classes
> you
> >> describe be used in the get_openlineage_facets*() API in the Operators.
> >> >>>>>> Each of those classes would know what OpenLineage facets they can
> >> expose.
> >> >>>>>> I'll add a mention in the AIP and I think we can go in more
> >> details in a ticket.
> >> >>>>>> Cheers,
> >> >>>>>> Julien
> >> >>>>>>
> >> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com>
> >> wrote:
> >> >>>>>>>
> >> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer will
> >> >>>>>>> be more thoughtful).
> >> >>>>>>>
> >> >>>>>>> I think you are right to the "agnostic" part. But I have one
> >> question
> >> >>>>>>> - what are we considering "agnostic"?
> >> >>>>>>>
> >> >>>>>>>  There is no "widespread" standard for lineage (yet). Open
> Lineage
> >> >>>>>>> with its donation to Linux Foundation Data & AI is aspiring to
> >> become
> >> >>>>>>> one. And it's a pretty good candidate:
> >> >>>>>>>
> >> >>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
> >> >>>>>>> published as an API from day one)
> >> >>>>>>> * as of recently, the ownership and governance of Open Lineage is with
> >> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/), which is
> >> >>>>>>> part of "Linux Foundation Project" - a well-known and respected
> >> >>>>>>> foundation that, similarly to the ASF, is an umbrella and provides
> >> >>>>>>> governance rules for a large number of well-established OSS projects
> >> >>>>>>>
> >> >>>>>>> In essence it is the same approach as the one we already discussed and
> >> >>>>>>> approved for Open Telemetry (which is governed by CNCF, which is in the
> >> >>>>>>> same league as LFP in terms of recognition and governance), though it is
> >> >>>>>>> not yet implemented. In the case of Open Telemetry, we decided against
> >> >>>>>>> developing our "own" standard and opted for one that is already out
> >> >>>>>>> there. Yes, it is a bit more established and popular than Open Lineage
> >> >>>>>>> is, but I so wish that we had chosen and implemented it already (and
> >> >>>>>>> earlier), as not having a standard there - except statsd, which is
> >> >>>>>>> really, really poor - has a great impact on Airflow being just
> >> >>>>>>> "pluggable" into existing solutions for monitoring. (BTW, I hope we
> >> >>>>>>> implement it soon, and I hear (and see) there are attempts to do so.)
> >> >>>>>>>
> >> >>>>>>> In the case of Open Lineage, the questions are - is there an
> >> >>>>>>> alternative of the same caliber? Shall we produce our own
> >> "agnostic
> >> >>>>>>> standard" for it instead ? Is there a chance the idea of
> >> >>>>>>> "airflow-specific" attributes will catch up and many "consumers"
> >> will
> >> >>>>>>> be writing their own conversions to the way they can consume it?
> >> >>>>>>>
> >> >>>>>>> I would really, really try to avoid the pitfalls nicely
> summarized
> >> >>>>>>> here: https://xkcd.com/927/
> >> >>>>>>>
> >> >>>>>>> We can of course make a wrong bet and in 2 years Airflow might
> be
> >> the
> >> >>>>>>> only one supporting Open Lineage. That might happen. Though the
> >> list
> >> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or
> >> maybe -
> >> >>>>>>> more likely - once Airflow implements it, due to Airflow's
> >> popularity
> >> >>>>>>> and the fact that there is already competition supporting it
> (e.g.
> >> >>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption
> >> of
> >> >>>>>>> Open Lineage. My bet is -  the latter and for the benefit of the
> >> whole
> >> >>>>>>> ecosystem. I think we have a chance to influence creation of a
> >> new,
> >> >>>>>>> important standard. Much less so, I think if we just provide our
> >> own
> >> >>>>>>> custom solution - with lots and lots of work for others to be
> >> able to
> >> >>>>>>> consume it, no time to properly nurture the API and make it
> >> easier to
> >> >>>>>>> implement it (which is undoubtedly what Datakin, Astronomer and
> >> now
> >> >>>>>>> LFData & AI run governance main focus is)
> >> >>>>>>>
> >> >>>>>>> Are there other alternatives we should consider ? Do we want to
> >> >>>>>>> develop our own standard (and implement all the integrations
> from
> >> the
> >> >>>>>>> grounds up) ?
> >> >>>>>>>
> >> >>>>>>> J.
> >> >>>>>>>
> >> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <
> eugen@kosteev.com>
> >> wrote:
> >> >>>>>>> >
> >> >>>>>>> > Hi Julien.
> >> >>>>>>> >
> >> >>>>>>> > I reviewed the design doc.
> >> >>>>>>> > The general idea looks good to me, but I have some concerns
> >> that I would like to share.
> >> >>>>>>> >
> >> >>>>>>> > If I understand correctly, the proposed design is to equip
> >> "operators" with their own methods to extract lineage metadata, and I
> >> agree with the motivation. If those are decoupled from the operators
> >> themselves (in the form of extractors in a separate package), then the
> >> downside is that (as you mentioned) extractors will be distributed
> >> separately and the "operators" logic will be out of sync with the
> >> "lineage extraction" logic by design.
> >> >>>>>>> > Also, knowledge about the internals of an operator spills out of
> >> the operator, which is not good at all (at the very least).
> >> >>>>>>> >
> >> >>>>>>> > However, if we make every operator expose a method to
> >> generate lineage metadata in a specific format, e.g. OpenLineage etc.,
> >> then we will end up with the cartesian complexity of supporting each
> >> backend format in each provider+operator.
> >> >>>>>>> >
> >> >>>>>>> > If you say that the goal is that "operators" will always
> >> generate OpenLineage format only and each consumer will convert this
> format
> >> to their own internal representation, well, if they do this then this
> seems
> >> like a working approach. But with the assumption that each consumer will
> >> support it.
> >> >>>>>>> >
> >> >>>>>>> > I think it comes down to the question: is the OpenLineage format
> >> popular, complete and proper enough for lineage metadata that every
> >> consumer will be convinced to support it? We may also consider issues like
> >> a mismatch in lineage feature parity, e.g. OpenLineage supports field-level
> >> lineage but a consumer doesn't (or not at the moment), so we would
> >> prefer the lineage metadata transferred to the backend to be slightly
> >> different in this case.
> >> >>>>>>> >
> >> >>>>>>> > What do you think about the idea:
> >> >>>>>>> > 1. make lineage metadata generated by "operators"
> >> agnostic of the specific format, just using entities from a big generic
> >> vocabulary of entities, e.g. created here:
> >> https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py
> >> We would have there e.g. entities like:
> >> >>>>>>> > --------------------------------------------------------------------
> >> >>>>>>> > import attr
> >> >>>>>>> >
> >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >> >>>>>>> > class PostgresTable:
> >> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
> >> >>>>>>> >
> >> >>>>>>> >     host: str = attr.ib()
> >> >>>>>>> >     port: str = attr.ib()
> >> >>>>>>> >     database: str = attr.ib()
> >> >>>>>>> >     schema: str = attr.ib()
> >> >>>>>>> >     table: str = attr.ib()
> >> >>>>>>> >
> >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >> >>>>>>> > class GCSEntity:
> >> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
> >> >>>>>>> >
> >> >>>>>>> >     bucket: str = attr.ib()
> >> >>>>>>> >     path: str = attr.ib()
> >> >>>>>>> >
> >> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
> >> >>>>>>> > class AWSS3Entity:
> >> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
> >> >>>>>>> >
> >> >>>>>>> >     bucket: str = attr.ib()
> >> >>>>>>> >     path: str = attr.ib()
> >> >>>>>>> > --------------------------------------------------------------------
> >> >>>>>>> > 2. Implement "adapters" that will act as a bridge between
> >> "operators" and backends. Their responsibility will be to convert the
> >> lineage metadata generated by "operators" to a format understandable by a
> >> specific backend.
> >> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets to
> >> pass Airflow lineage metadata to the Airflow lineage backend.
> >> >>>>>>> >
> >> >>>>>>> > I didn't get exactly implementation details of your proposed
> >> design, but I think maintaining global vocabulary of entities to use in
> >> inlets/outlets of operators is crucial for Airflow, as this could be
> >> leveraged to build various features on top of it, like displaying
> lineage
> >> graph in Airflow UI (based on XCOM):)
> >> >>>>>>> >
> >> >>>>>>> > It is important to note that if we decide to send out lineage
> >> metadata from Airflow only in the OpenLineage format, we could then have
> >> only one "adapter", OpenLineageAdapter. But the "adapters" approach leaves
> >> us room for adding support for others (following the "pluggable" approach
> >> that Airflow is mainly known for).
> >> >>>>>>> >
> >> >>>>>>> > All in all:
> >> >>>>>>> > - global vocabulary of entities used across all "operators"
> >> (with all advantages out of it, mentioned above)
> >> >>>>>>> > - "adapters" approach
> >> >>>>>>> > seems to me crucial points in the design that make sense to
> me.
> >> >>>>>>> >
> >> >>>>>>> > What do you think about this?
> >> >>>>>>> >
> >> >>>>>>> > - Eugene
> >> >>>>>>> >
> >> >>>>>>> >
> >> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
> >> <ju...@astronomer.io.invalid> wrote:
> >> >>>>>>> >>
> >> >>>>>>> >> Hello Michał,
> >> >>>>>>> >> Thank you for your input.
> >> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption
> >> about the backend being used to store lineage and is an adapter-like
> layer.
> >> >>>>>>> >> OpenLineage exists as the spec specifically for that purpose
> >> of avoiding the problem of every lineage consumer having to understand
> >> every lineage producer.
> >> >>>>>>> >> Consumers of lineage want a unified spec consuming lineage
> >> from any data transformation layer like Airflow, Spark, Flink, SQL,
> >> Warehouses, ...
> >> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently
> >> of the technology used, so does OpenLineage for lineage.
> >> >>>>>>> >> Julien
> >> >>>>>>> >>
> >> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
> >> michalmodras@google.com> wrote:
> >> >>>>>>> >>>
> >> >>>>>>> >>> Hi everyone,
> >> >>>>>>> >>>
> >> >>>>>>> >>> As Airflow already supports lineage functionality through
> >> pluggable lineage backends, I think OpenLineage and other lineage
> systems
> >> integration should follow this path. I think more 'native' integration
> with
> >> OpenLineage (or any other lineage system) in Airflow while maintaining
> the
> >> generic lineage backend architecture in parallel would make the user
> >> experience less open, troublesome to maintain, and the Airflow
> architecture
> >> itself more constrained by a logic of a specific system.
> >> >>>>>>> >>>
> >> >>>>>>> >>> I think enriching operators with a generic method exposing
> >> lineage metadata that could be leveraged by lineage backends regardless
> of
> >> their implementation is a good idea which the Cloud Composer team would
> >> gladly contribute to. I believe the translation of the Airflow metadata
> >> exposed by the operators should be done by lineage backends (or another
> >> adapter-like layer). Tying Airflow operators' development to a specific
> >> lineage system like OpenLineage forces operators' contributors to
> >> understand that system too, which increases both the entry costs and
> >> maintenance costs. I see it as unnecessary coupling.
> >> >>>>>>> >>>
> >> >>>>>>> >>> Best,
> >> >>>>>>> >>> Michal
> >> >>>>>>> >>>
> >> >>>>>>> >>>
> >> >>>>>>> >>>
> >> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
> >> julien@astronomer.io> wrote:
> >> >>>>>>> >>>>
> >> >>>>>>> >>>> Thank you Eugen,
> >> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and
> I
> >> think this would work well.
> >> >>>>>>> >>>> Here are the sections in the doc that I think address your
> >> points:
> >> >>>>>>> >>>> - generalize lineage metadata extraction as self-method in
> >> each operator, using generic lineage entities
> >> >>>>>>> >>>> See: OpenLineage support in providers. It describes how
> each
> >> operator exposes its lineage.
> >> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to
> Data
> >> Lineage format, Open Lineage format, etc.
> >> >>>>>>> >>>> The goal here is each consumer turns from OpenLineage
> format
> >> to their own internal representation as you are suggesting.
> >> >>>>>>> >>>> In the motivation section, towards the end, I link to a few
> >> examples of data catalogs doing just that.
> >> >>>>>>> >>>>
> >> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <
> >> eugen@kosteev.com> wrote:
> >> >>>>>>> >>>>>
> >> >>>>>>> >>>>> ++ Michal Modras
> >> >>>>>>> >>>>>
> >> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
> >> eugen@kosteev.com> wrote:
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> Cloud Composer recently launched "Data lineage with
> >> Dataplex" feature which effectively means to generate lineage out of
> >> DAG/task executions and export it to Data Lineage (Data Catalog service)
> >> for further analysis.
> >> >>>>>>> >>>>>>
> >> https://cloud.google.com/composer/docs/composer-2/lineage-integration
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
> >> >>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage
> >> backend" feature and methods to extract lineage metadata on task post
> >> execution events.
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow
> >> community in a form:
> >> >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method
> in
> >> each operator, using generic lineage entities
> >> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to
> >> Data Lineage format, Open Lineage format, etc.
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean
> >> to introduce an additional layer of converting from OpenLineage format
> to
> >> Data Lineage (Data Catalog/Dataplex) format. But this is definitely a
> >> possibility.
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
> >> <ju...@astronomer.io.invalid> wrote:
> >> >>>>>>> >>>>>>>
> >> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
> >> >>>>>>> >>>>>>> I am responding in the comments and adding to the doc
> >> accordingly.
> >> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
> >> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
> >> >>>>>>> >>>>>>> Julien
> >> >>>>>>> >>>>>>>
> >> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
> >> jarek@potiuk.com> wrote:
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is
> >> (and should be
> >> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's
> >> capabilities
> >> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all
> >> working on - Airflow
> >> >>>>>>> >>>>>>>> as a Platform.
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes
> >> the same
> >> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry
> >> goes, where we
> >> >>>>>>> >>>>>>>> might decide to support certain standards in order to
> >> expand
> >> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allows to
> >> plug-in multiple
> >> >>>>>>> >>>>>>>> external solutions that would use the standard API.
> >> After Open-Lineage
> >> >>>>>>> >>>>>>>> graduated recently to  LFAI&Data foundation (I've been
> >> watching this
> >> >>>>>>> >>>>>>>> happening from far), it is I think the perfect
> candidate
> >> for Airflow
> >> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the
> players
> >> to make use
> >> >>>>>>> >>>>>>>> of the extra work necessary by the community to make it
> >> "officially
> >> >>>>>>> >>>>>>>> supported". I think we have to also get some feedback
> >> from the big
> >> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have
> >> such a
> >> >>>>>>> >>>>>>>> capability, and another is to get it used in all the
> >> ways Airflow is
> >> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which
> >> is obviously a
> >> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow
> >> is exposed by
> >> >>>>>>> >>>>>>>> others - Astronomer is obviously on board, we see some warm words from
> >> >>>>>>> >>>>>>>> Amazon (mentioned by Julien), and I would love to hear
> >> whether the
> >> >>>>>>> >>>>>>>> Composer team at Google would be on board in using the
> >> open-lineage
> >> >>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and
> >> likely more)
> >> >>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly
> other
> >> stakeholders
> >> >>>>>>> >>>>>>>> might want to say something.
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in implementing and
> >> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that is the main
> >> >>>>>>> >>>>>>>> reason why the Open Lineage community would like to make the
> >> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart and
> >> integrating it in
> >> >>>>>>> >>>>>>>> the way that will allow to plug-it-in into our CI,
> >> verification
> >> >>>>>>> >>>>>>>> process and making some very clear expectations about
> >> what it means
> >> >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can
> >> make some
> >> >>>>>>> >>>>>>>> initial investment in making it happen and minimise
> >> on-going cost,
> >> >>>>>>> >>>>>>>> while maximising the gain.
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help
> >> with all that
> >> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even
> >> if it will
> >> >>>>>>> >>>>>>>> take an extra effort, especially that we will have
> >> experts from Open
> >> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage
> >> being the core
> >> >>>>>>> >>>>>>>> part of the effort. I am actually super excited - this
> >> might be the
> >> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position
> as
> >> an
> >> >>>>>>> >>>>>>>> indispensable component of "even more modern data
> stack".
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking
> >> forward to
> >> >>>>>>> >>>>>>>> making it happen :).
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> J.
> >> >>>>>>> >>>>>>>>
> >> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
> >> >>>>>>> >>>>>>>> <ju...@astronomer.io.invalid> wrote:
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > Dear Airflow Community,
> >> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
> >> OpenLineage provider to Airflow.
> >> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an
> >> official AIP.
> >> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
> >> >>>>>>> >>>>>>>> > Thank you,
> >> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > Operational lineage collection is a common need to
> >> understand dependencies between data pipelines and track end-to-end
> >> provenance of data. It enables many use cases from ensuring reliable
> >> delivery of data through observability to compliance and cost
> management.
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow
> >> capability to enable troubleshooting and governance.
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > OpenLineage is a project that is part of the LFAI&Data
> >> foundation and provides a spec standardizing operational lineage
> >> collection and sharing across the data ecosystem. While it provides plugins
> >> for popular open source projects, its intent is very similar to
> >> OpenTelemetry (also under the Linux Foundation umbrella): to remain a spec
> >> for lineage exchange that projects - open source or proprietary - implement.
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it
> >> easier and more reliable for Airflow users to publish their operational
> >> lineage through the OpenLineage ecosystem.
> >> >>>>>>> >>>>>>>> >
> >> >>>>>>> >>>>>>>> > The current external plugin maintained in the
> >> OpenLineage project depends on Airflow and operators internals and gets
> >> broken when changes are made on those. Having a built-in integration
> >> ensures a better first class support to expose lineage that gets tested
> >> alongside other changes and therefore is more stable.
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>>
> >> >>>>>>> >>>>>> --
> >> >>>>>>> >>>>>> Eugene
> >> >>>>>>> >>>>>
> >> >>>>>>> >>>>>
> >> >>>>>>> >>>>>
> >> >>>>>>> >>>>> --
> >> >>>>>>> >>>>> Eugene
> >> >>>>>>> >
> >> >>>>>>> >
> >> >>>>>>> >
> >> >>>>>>> > --
> >> >>>>>>> > Eugene
> >>
> >
>


-- 
Eugene
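
The "generic entities plus adapters" idea debated in the thread above can be
illustrated with a minimal sketch. It is not code from the AIP or from any
Airflow provider: it uses stdlib dataclasses instead of attrs so that it runs
on its own, and the names LineageAdapter, OpenLineageAdapter.convert and
export_lineage are hypothetical. The only OpenLineage notion it relies on is
that a dataset is identified by a namespace and a name; the exact
namespace/name strings built here are an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Protocol, Union


# Generic, backend-agnostic entities (field names follow the example in the thread).
@dataclass(frozen=True)
class PostgresTable:
    host: str
    port: str
    database: str
    schema: str
    table: str


@dataclass(frozen=True)
class GCSEntity:
    bucket: str
    path: str


@dataclass(frozen=True)
class AWSS3Entity:
    bucket: str
    path: str


Entity = Union[PostgresTable, GCSEntity, AWSS3Entity]


class LineageAdapter(Protocol):
    """Hypothetical interface: one adapter per lineage backend format."""

    def convert(self, entity: Entity) -> dict: ...


class OpenLineageAdapter:
    """Hypothetical adapter producing namespace/name dataset dicts.

    The naming scheme below is an assumption for illustration only.
    """

    def convert(self, entity: Entity) -> dict:
        if isinstance(entity, PostgresTable):
            return {
                "namespace": f"postgres://{entity.host}:{entity.port}",
                "name": f"{entity.database}.{entity.schema}.{entity.table}",
            }
        if isinstance(entity, GCSEntity):
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}
        if isinstance(entity, AWSS3Entity):
            return {"namespace": f"s3://{entity.bucket}", "name": entity.path}
        raise TypeError(f"unsupported entity type: {type(entity).__name__}")


def export_lineage(
    adapter: LineageAdapter, inlets: List[Entity], outlets: List[Entity]
) -> dict:
    """Convert a task's inlets/outlets with whichever adapter is configured."""
    return {
        "inputs": [adapter.convert(e) for e in inlets],
        "outputs": [adapter.convert(e) for e in outlets],
    }
```

With this shape, supporting another backend means adding one more adapter
class, while operators keep declaring the same entities in their inlets and
outlets. If OpenLineage is the only format emitted, a single adapter is
enough, which is the case Eugene describes above.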

Re: Request for feedback on proposal for new OpenLineage provider in Airflow

Posted by Julien Le Dem <ju...@astronomer.io.INVALID>.
Hello all,
I have to move the OpenLineage presentation to next week.
Sorry for the change.
It will be Friday next week March 31st at 5pm CET 9am PT.
https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=MTF1bHRrdTdrM29vMGZyamdzc2JuZWFkMHEganVsaWVuQGFzdHJvbm9tZXIuaW8&tmsrc=julien%40astronomer.io
Julien

On Thu, Mar 16, 2023 at 8:21 PM Julien Le Dem <ju...@astronomer.io> wrote:

> We are planning to do this session next Thursday at 5pm CET 9am PT. I will
> send a zoom link in advance.
> Julien
>
> On Sat, Feb 25, 2023 at 05:59 Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Cool. I am looking forward to it :). It would be great to get some
>> insight from those who attempted to get the lineage working in several
>> versions of Open Lineage and finally arrived at the current
>> specs/integration.
>>
>> On Wed, Feb 22, 2023 at 7:02 PM Julien Le Dem
>> <ju...@astronomer.io.invalid> wrote:
>> >
>> > Thank you Jarek,
>> > I am happy to organize a zoom presentation about OpenLineage and answer
>> any question. It is indeed a spec decoupling the data transformation layer
>> from the Metadata store people are using. Just like OpenTelemetry is for
>> service metrics/traces.
>> > Best,
>> > Julien
>> >
>> > On Tue, Feb 21, 2023 at 11:23 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>
>> >> And to add a little "parallel" - I think Open Lineage integration
>> replacing our "generic lineage" is very similar step to the new
>> "Multi-tenant"-ready authentication interface we are discussing in
>> https://lists.apache.org/thread/cc9dj680nwz494k8n51w6qqohzm4wgck
>> >>
>> >> Yes - we have a generic authentication interface, but no - it's
>> useless for the case where multi-tenancy and good level of resource
>> authorization is needed. It's just far too simplistic and limited.
>> >>
>> >> Same with current lineage generic interface - yes, we have it but it's
>> only useful in a limited set of cases. and if we want to step-it-up we need
>> to come up with something better (and Open Lineage happens to be one that
>> has been developed with Airflow in mind and battle tested).
>> >>
>> >> J.
>> >>
>> >> On Wed, Feb 22, 2023 at 8:16 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>
>> >>> Hey Rafał (Eugene, Michal - and others who are looking),
>> >>>
>> >>> I think I know where your/Eugen/Michał concerns are coming from. And
>> I think it would be great if we can talk it over a bit.  I believe this is
>> - in parts - quite a misunderstanding of what Open Lineage really is, how
>> much of an integration it is and what are the reasons why it has been
>> implemented the way it was implemented in Airflow.
>> >>>
>> >>> **Idea**: (Julien -  Maybe you can organize it ?):
>> >>>
>> >>> Maybe we can have an open-to-everyone presentation/zoom call, with
>> plenty of time set aside for questions, where you would explain those
>> integration points to the community (and especially to those people who
>> are worried we are losing something by choosing the OpenLineage
>> integration). I would love to see such a presentation - specifically
>> focused on explaining how Open-Lineage is really improving the current
>> lineage approach and what problems it solves that the existing generic
>> interface doesn't.
>> >>>
>> >>> Just to set the tone and focus for such meeting if we have one:
>> >>>
>> >>> For me - when I look at Open Lineage, it is really "this is how
>> lineage generic interface **should** be done in Airflow". The "generic"
>> lineage support we have now is very, very basic, I'd even say far too
>> simplistic. I would even say, it's useless besides a few, very basic use
>> cases. Simply because there was never a good "receiver" of the information
>> to cover those cases.
>> >>>
>> >>> When you look closely at OpenLineage, it's nothing more than a better
>> convention for the dictionaries that we send as metadata, better metadata
>> in the case of SQL operators (Hooks in the future, hopefully), allowing
>> handling of some cases that current lineage simply cannot. Also, what the
>> OpenLineage integration with Airflow covers is better handling of the
>> "task" and "dag" lifecycle in Airflow, to be able to bind lineage data
>> together. That's my understanding of what we get when we integrate OL in.
>> >>>
>> >>> I think over the last 2 years Datakin/Astronomer people had worked
>> out the level of interface that **just works** and if we would like to get
>> the lineage information from Airflow as useful as it is in OL, we would
>> have to anyway implement pretty much all of the things they already did.
>> >>>
>> >>> I would love (and I think many community members) to take part in
>> such a call to hear on that particular aspect of the OL integration.
>> >>>
>> >>> J.
>> >>>
>> >>> On Wed, Feb 22, 2023 at 12:40 AM Rafal Biegacz <
>> rafalbiegacz@google.com.invalid> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I second/echo the input provided by Eugene and Michal.
>> >>>>
>> >>>> In general, Airflow should provide generic interfaces to lineage
>> backends so it's easy to configure the one preferred by the user. Whether
>> it's Open Lineage, proprietary solution, Dataplex Lineage, etc. it should
>> be the user's choice.
>> >>>>
>> >>>> We should avoid close integration with any specific lineage backend
>> due to the reasons already mentioned, i.e. to avoid translations between
>> lineage backends. Also, we would closely couple one framework (Airflow)
>> with another one (Open Lineage) - it makes Airflow more complex and less
>> flexible. Loose coupling between lineage backends and Airflow seems to be
>> more future-proof.
>> >>>>
>> >>>> Regards, Rafal.
>> >>>>
>> >>>>
>> >>>> On Sat, Feb 11, 2023 at 12:21 AM Julien Le Dem
>> <ju...@astronomer.io.invalid> wrote:
>> >>>>>
>> >>>>> Dear Airflow community,
>> >>>>> I have transferred the content of the working google doc I shared a
>> few weeks ago to the Airflow confluence:
>> >>>>>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow
>> >>>>> All comments have been answered, I added clarifications to the doc
>> accordingly and I also added your suggestions to improve the proposal.
>> >>>>> All that history is linked from the discussion thread link in the
>> confluence doc if you wish to consult it.
>> >>>>> Thank you all for your feedback and help in the process.
>> >>>>> Best
>> >>>>> Julien
>> >>>>>
>> >>>>>
>> >>>>> On Fri, Feb 10, 2023 at 2:55 PM Julien Le Dem <ju...@astronomer.io>
>> wrote:
>> >>>>>>
>> >>>>>> Thank you for the email Jarek, and Eugene for your suggestions,
>> >>>>>> I do agree with Jarek's assessment. I don't have very much to add
>> to his argument, it is very thoughtful!
>> >>>>>> OpenLineage was started to avoid the cartesian complexity that
>> Eugene mentions. There's actually that specific illustration in the
>> OpenLineage doc.
>> >>>>>> Lineage consumers want to avoid having to understand the lineage
>> format of each individual observed data transformation layer. And
>> transformation layers don't want to understand every Metadata store's model
>> and protocol.
>> >>>>>> Eugene, about your specific proposal about a global vocabulary of
>> entities, I think it is a great suggestion.
>> >>>>>> We can map those entities to Datasets in OpenLineage. The way
>> OpenLineage models this is by allowing specific facets attached to Dataset.
>> Facets are pieces of metadata each with their own JsonSchema.
>> >>>>>> For example, a table from a relational database will have a schema
>> facet, whereas a file in GCS might not.
>> >>>>>> So I think in Airflow we could have each of the entity classes you
>> describe be used in the get_openlineage_facets*() API in the Operators.
>> >>>>>> Each of those classes would know what OpenLineage facets they can
>> expose.
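
To make the idea above concrete, here is a minimal sketch of entity classes that each know which facets they can expose. It is an illustration under stated assumptions, not the AIP's API: the method name to_openlineage_dataset() is hypothetical, the "columns" field is added for the example, plain dicts stand in for the OpenLineage client classes, and the namespace/name layout is assumed from the OpenLineage naming conventions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class PostgresTable:
    """Hypothetical lineage entity for a Postgres table."""

    host: str
    port: str
    database: str
    schema: str
    table: str
    # (column name, column type) pairs; empty when the columns are unknown.
    columns: List[Tuple[str, str]] = field(default_factory=list)

    def to_openlineage_dataset(self) -> Dict[str, Any]:
        """Return a dict shaped like an OpenLineage Dataset (assumed layout)."""
        facets: Dict[str, Any] = {}
        if self.columns:
            # A relational table can expose a schema facet.
            facets["schema"] = {
                "fields": [{"name": n, "type": t} for n, t in self.columns]
            }
        return {
            "namespace": f"postgres://{self.host}:{self.port}",
            "name": f"{self.database}.{self.schema}.{self.table}",
            "facets": facets,
        }


@dataclass
class GCSEntity:
    """Hypothetical lineage entity for a Google Cloud Storage object."""

    bucket: str
    path: str

    def to_openlineage_dataset(self) -> Dict[str, Any]:
        # A plain file has no schema facet to expose.
        return {"namespace": f"gs://{self.bucket}", "name": self.path, "facets": {}}
```

An operator could then return such entities from its lineage method, and the provider would only need to call to_openlineage_dataset() on each of them.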
>> >>>>>> I'll add a mention in the AIP and I think we can go into more
>> detail in a ticket.
>> >>>>>> Cheers,
>> >>>>>> Julien
>> >>>>>>
>> >>>>>> On Fri, Feb 10, 2023 at 12:27 PM Jarek Potiuk <ja...@potiuk.com>
>> wrote:
>> >>>>>>>
>> >>>>>>> Just a quick personal view on it, Eugene (I bet Julien's answer
>> will
>> >>>>>>> be more thoughtful).
>> >>>>>>>
>> >>>>>>> I think you are right about the "agnostic" part. But I have one
>> question
>> >>>>>>> - what are we considering "agnostic"?
>> >>>>>>>
>> >>>>>>>  There is no "widespread" standard for lineage (yet). Open Lineage
>> >>>>>>> with its donation to Linux Foundation Data & AI is aspiring to
>> become
>> >>>>>>> one. And it's a pretty good candidate:
>> >>>>>>>
>> >>>>>>> * designed from the ground up to be agnostic (Open Lineage was only
>> >>>>>>> published as an API from day one)
>> >>>>>>> * as of recently, the ownership and governance of Open Lineage is
>> with
>> >>>>>>> Linux Foundation Data & AI (https://lfaidata.foundation/), which
>> is
>> >>>>>>> part of the "Linux Foundation Project" - a well-known and respected
>> >>>>>>> foundation that - similarly to the ASF - is an umbrella and provides
>> >>>>>>> governance rules for a large number of well-established OSS projects
>> >>>>>>>
>> >>>>>>> In essence it is the same approach as we already discussed and
>> >>>>>>> approved for Open Telemetry (which is governed by CNCF, which is
>> in the
>> >>>>>>> same league of recognition and governance as LFP), though it is
>> not yet
>> >>>>>>> implemented. In the case of Open Telemetry, we decided against
>> developing
>> >>>>>>> our "own" standard and opted for one that is already out
>> there.
>> >>>>>>> Yes, it is a bit more established and popular than Open Lineage
>> is, but
>> >>>>>>> I so wish that we had chosen and implemented it earlier, as not
>> >>>>>>> having a standard there (except statsd, which is really, really
>> poor)
>> >>>>>>> has a great impact on Airflow being just "pluggable" into existing
>> >>>>>>> solutions for monitoring. (BTW, I hope we implement it soon, and I
>> hear
>> >>>>>>> (and see) there are attempts to do so.)
>> >>>>>>>
>> >>>>>>> In the case of Open Lineage, the questions are - is there an
>> >>>>>>> alternative of the same caliber? Shall we produce our own
>> "agnostic
>> >>>>>>> standard" for it instead? Is there a chance the idea of
>> >>>>>>> "airflow-specific" attributes will catch on and many "consumers"
>> will
>> >>>>>>> be writing their own conversions to a form they can consume?
>> >>>>>>>
>> >>>>>>> I would really, really try to avoid the pitfalls nicely summarized
>> >>>>>>> here: https://xkcd.com/927/
>> >>>>>>>
>> >>>>>>> We can of course make a wrong bet and in 2 years Airflow might be
>> the
>> >>>>>>> only one supporting Open Lineage. That might happen. Though the
>> list
>> >>>>>>> of "consumers" of Open Lineage is already pretty good IMHO. Or
>> maybe -
>> >>>>>>> more likely - once Airflow implements it, due to Airflow's
>> popularity
>> >>>>>>> and the fact that there is already competition supporting it (e.g.
>> >>>>>>> Amundsen) we will increase the chance of "hockey-stick" adoption
>> of
>> >>>>>>> Open Lineage. My bet is the latter, and for the benefit of the
>> whole
>> >>>>>>> ecosystem. I think we have a chance to influence the creation of a
>> new,
>> >>>>>>> important standard. Much less so, I think, if we just provide our
>> own
>> >>>>>>> custom solution - with lots and lots of work for others to be
>> able to
>> >>>>>>> consume it, and no time to properly nurture the API and make it
>> easier to
>> >>>>>>> implement (which is undoubtedly the main focus of Datakin,
>> Astronomer and now
>> >>>>>>> the governance run by LFData & AI).
>> >>>>>>>
>> >>>>>>> Are there other alternatives we should consider? Do we want to
>> >>>>>>> develop our own standard (and implement all the integrations from
>> the
>> >>>>>>> ground up)?
>> >>>>>>>
>> >>>>>>> J.
>> >>>>>>>
>> >>>>>>> On Fri, Feb 10, 2023 at 11:40 AM Eugen Kosteev <eu...@kosteev.com>
>> wrote:
>> >>>>>>> >
>> >>>>>>> > Hi Julien.
>> >>>>>>> >
>> >>>>>>> > I reviewed the design doc.
>> >>>>>>> > The general idea looks good to me, but I have some concerns
>> that I would like to share.
>> >>>>>>> >
>> >>>>>>> > If I understand correctly, the proposed design is to give
>> "operators" their own methods to extract lineage metadata, and I
>> agree with the motivation. If those are decoupled from the operators
>> themselves (in the form of extractors in a separate package), then the
>> downside is that (as you mentioned) extractors will be distributed
>> separately and the "operators" logic is out of sync with the "lineage
>> extraction" logic by design.
>> >>>>>>> > Also, knowledge about the internals of an operator spills out of
>> the operator, which is not good at all (at the very least).
>> >>>>>>> >
>> >>>>>>> > However, if we make every operator expose a method to
>> generate lineage metadata in a specific format, e.g. OpenLineage etc.,
>> then we will end up with the cartesian complexity of supporting each
>> backend format in each provider+operator.
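
The "cartesian complexity" concern can be shown with simple arithmetic. This is only an illustrative sketch: the function names and the example counts below are assumptions chosen for the illustration, not figures from Airflow.

```python
def direct_integrations(operators: int, backends: int) -> int:
    """Every operator emits every backend's format directly."""
    return operators * backends


def shared_format_integrations(operators: int, backends: int) -> int:
    """Every operator emits one shared format; every backend converts from it."""
    return operators + backends


# Illustrative numbers only: 50 operators and 4 lineage backends.
print(direct_integrations(50, 4))         # 200
print(shared_format_integrations(50, 4))  # 54
```

With a shared intermediate format, adding one more backend costs one new converter instead of a change in every operator.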
>> >>>>>>> >
>> >>>>>>> > If you say that the goal is that "operators" will always
>> generate OpenLineage format only and each consumer will convert this format
>> to their own internal representation, well, if they do this then this seems
>> like a working approach. But with the assumption that each consumer will
>> support it.
>> >>>>>>> >
>> >>>>>>> > I think it comes down to the question: is the OpenLineage format
>> popular, complete and suitable enough for lineage metadata that every
>> consumer will be convinced to support it? We may also consider issues like
>> a mismatch in lineage feature parity, e.g. OpenLineage supports field-level
>> lineage but a consumer doesn't (or not at the moment), so we would
>> prefer the lineage metadata transferred to the backend to be slightly
>> different in this case.
>> >>>>>>> >
>> >>>>>>> > What do you think about the idea:
>> >>>>>>> > 1. make lineage metadata generated by "operators"
>> agnostic of the specific format, just using entities from a big generic
>> vocabulary of entities, e.g. the one created here:
>> https://github.com/apache/airflow/blob/main/airflow/lineage/entities.py.
>> We would have there e.g. entities like:
>> >>>>>>> >
>> --------------------------------------------------------------------
>> >>>>>>> > import attr
>> >>>>>>> >
>> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>> >>>>>>> > class PostgresTable:
>> >>>>>>> >     """Airflow lineage entity representing Postgres table."""
>> >>>>>>> >
>> >>>>>>> >     host: str = attr.ib()
>> >>>>>>> >     port: str = attr.ib()
>> >>>>>>> >     database: str = attr.ib()
>> >>>>>>> >     schema: str = attr.ib()
>> >>>>>>> >     table: str = attr.ib()
>> >>>>>>> >
>> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>> >>>>>>> > class GCSEntity:
>> >>>>>>> >     """Airflow lineage entity representing generic Google Cloud Storage entity."""
>> >>>>>>> >
>> >>>>>>> >     bucket: str = attr.ib()
>> >>>>>>> >     path: str = attr.ib()
>> >>>>>>> >
>> >>>>>>> > @attr.s(auto_attribs=True, kw_only=True)
>> >>>>>>> > class AWSS3Entity:
>> >>>>>>> >     """Airflow lineage entity representing generic AWS S3 entity."""
>> >>>>>>> >
>> >>>>>>> >     bucket: str = attr.ib()
>> >>>>>>> >     path: str = attr.ib()
>> >>>>>>> >
>> --------------------------------------------------------------------
>> >>>>>>> > 2. Implement "adapters" that will act as a bridge between
>> "operators" and backends. Their responsibility will be to convert lineage
>> metadata generated by "operators" to a format understandable by specific
>> backend.
>> >>>>>>> > And then we can use the built-in mechanism of inlets/outlets to
>> pass Airflow lineage metadata to the Airflow lineage backend.
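
The "adapters" idea in point 2 could look roughly like the sketch below. Everything here is hypothetical and only illustrates the shape of the proposal: LineageAdapter, OpenLineageAdapter and EdgeListAdapter are made-up names, the output dicts are simplified stand-ins for real backend formats, and no Airflow or OpenLineage API is used.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class GCSEntity:
    """Generic, backend-agnostic lineage entity (as in the vocabulary above)."""

    bucket: str
    path: str


class LineageAdapter(ABC):
    """Hypothetical bridge from generic entities to one backend's format."""

    @abstractmethod
    def convert(self, inlets: List[GCSEntity], outlets: List[GCSEntity]) -> Dict[str, Any]:
        ...


class OpenLineageAdapter(LineageAdapter):
    """Emits a simplified, OpenLineage-like inputs/outputs structure."""

    def convert(self, inlets, outlets):
        def dataset(entity: GCSEntity) -> Dict[str, str]:
            return {"namespace": f"gs://{entity.bucket}", "name": entity.path}

        return {
            "inputs": [dataset(e) for e in inlets],
            "outputs": [dataset(e) for e in outlets],
        }


class EdgeListAdapter(LineageAdapter):
    """Stand-in for some other backend that wants source -> target edges."""

    def convert(self, inlets, outlets):
        def uri(entity: GCSEntity) -> str:
            return f"gs://{entity.bucket}/{entity.path}"

        return {
            "edges": [
                {"source": uri(i), "target": uri(o)}
                for i in inlets
                for o in outlets
            ]
        }
```

The operator only declares its inlets and outlets once; which adapter runs is a deployment choice, so supporting a new backend does not touch operator code.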
>> >>>>>>> >
>> >>>>>>> > I didn't fully get the implementation details of your proposed
>> design, but I think maintaining a global vocabulary of entities to use in
>> inlets/outlets of operators is crucial for Airflow, as this could be
>> leveraged to build various features on top of it, like displaying a lineage
>> graph in the Airflow UI (based on XCom) :)
>> >>>>>>> >
>> >>>>>>> > It is important to note that if we decide to send out from Airflow
>> lineage metadata only in OpenLineage format, well, we could then have only
>> one "adapter", OpenLineageAdapter. But the "adapters" approach leaves us
>> room for adding support for others (following the "pluggable" approach that
>> Airflow is mainly known for/good at).
>> >>>>>>> >
>> >>>>>>> > All in all:
>> >>>>>>> > - global vocabulary of entities used across all "operators"
>> (with all advantages out of it, mentioned above)
>> >>>>>>> > - "adapters" approach
>> >>>>>>> > seem to me to be the crucial points in the design.
>> >>>>>>> >
>> >>>>>>> > What do you think about this?
>> >>>>>>> >
>> >>>>>>> > - Eugene
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > On Wed, Feb 8, 2023 at 1:01 AM Julien Le Dem
>> <ju...@astronomer.io.invalid> wrote:
>> >>>>>>> >>
>> >>>>>>> >> Hello Michał,
>> >>>>>>> >> Thank you for your input.
>> >>>>>>> >> I would clarify that OpenLineage doesn't make any assumption
>> about the backend being used to store lineage and is an adapter-like layer.
>> >>>>>>> >> OpenLineage exists as a spec specifically for the purpose
>> of avoiding the problem of every lineage consumer having to understand
>> every lineage producer.
>> >>>>>>> >> Consumers of lineage want a unified spec for consuming lineage
>> from any data transformation layer like Airflow, Spark, Flink, SQL,
>> Warehouses, ...
>> >>>>>>> >> Just like OpenTelemetry allows consuming traces independently
>> of the technology used, so does OpenLineage for lineage.
>> >>>>>>> >> Julien
>> >>>>>>> >>
>> >>>>>>> >> On Tue, Feb 7, 2023 at 12:48 AM Michał Modras <
>> michalmodras@google.com> wrote:
>> >>>>>>> >>>
>> >>>>>>> >>> Hi everyone,
>> >>>>>>> >>>
>> >>>>>>> >>> As Airflow already supports lineage functionality through
>> pluggable lineage backends, I think OpenLineage and other lineage systems
>> integration should follow this path. I think more 'native' integration with
>> OpenLineage (or any other lineage system) in Airflow while maintaining the
>> generic lineage backend architecture in parallel would make the user
>> experience less open, the integration troublesome to maintain, and the
>> Airflow architecture itself more constrained by the logic of a specific
>> system.
>> >>>>>>> >>>
>> >>>>>>> >>> I think enriching operators with a generic method exposing
>> lineage metadata that could be leveraged by lineage backends regardless of
>> their implementation is a good idea which the Cloud Composer team would
>> gladly contribute to. I believe the translation of the Airflow metadata
>> exposed by the operators should be done by lineage backends (or another
>> adapter-like layer). Tying Airflow operators' development to a specific
>> lineage system like OpenLineage forces operators' contributors to
>> understand that system too, which increases both the entry costs and
>> maintenance costs. I see it as unnecessary coupling.
>> >>>>>>> >>>
>> >>>>>>> >>> Best,
>> >>>>>>> >>> Michal
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>> On Tue, Jan 31, 2023 at 7:10 PM Julien Le Dem <
>> julien@astronomer.io> wrote:
>> >>>>>>> >>>>
>> >>>>>>> >>>> Thank you Eugen,
>> >>>>>>> >>>> This sounds very aligned with the goals of OpenLineage and I
>> think this would work well.
>> >>>>>>> >>>> Here are the sections in the doc that I think address your
>> points:
>> >>>>>>> >>>> - generalize lineage metadata extraction as self-method in
>> each operator, using generic lineage entities
>> >>>>>>> >>>> See: OpenLineage support in providers. It describes how each
>> operator exposes its lineage.
>> >>>>>>> >>>> - implement "adapter"s to convert generated metadata to Data
>> Lineage format, Open Lineage format, etc.
>> >>>>>>> >>>> The goal here is that each consumer converts from the OpenLineage
>> format to their own internal representation, as you are suggesting.
>> >>>>>>> >>>> In the motivation section, towards the end, I link to a few
>> examples of data catalogs doing just that.
>> >>>>>>> >>>>
>> >>>>>>> >>>> On Tue, Jan 31, 2023 at 8:36 AM Eugen Kosteev <
>> eugen@kosteev.com> wrote:
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> ++ Michal Modras
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> On Tue, Jan 31, 2023 at 3:49 PM Eugen Kosteev <
>> eugen@kosteev.com> wrote:
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> Cloud Composer recently launched the "Data lineage with
>> Dataplex" feature, which effectively generates lineage out of
>> DAG/task executions and exports it to Data Lineage (Data Catalog service)
>> for further analysis.
>> >>>>>>> >>>>>>
>> https://cloud.google.com/composer/docs/composer-2/lineage-integration
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> This feature is as of now in the "Preview" state.
>> >>>>>>> >>>>>> The current implementation uses built-in "Airflow lineage
>> backend" feature and methods to extract lineage metadata on task post
>> execution events.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> The general idea was to contribute this to the Airflow
>> community in the following form:
>> >>>>>>> >>>>>> - generalize lineage metadata extraction as self-method in
>> each operator, using generic lineage entities
>> >>>>>>> >>>>>> - implement "adapter"s to convert generated metadata to
>> Data Lineage format, Open Lineage format, etc.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> Adoption of "Airflow OpenLineage" for Composer would mean
>> introducing an additional layer converting from OpenLineage format to
>> Data Lineage (Data Catalog/Dataplex) format. But this is definitely a
>> possibility.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> On Tue, Jan 31, 2023 at 12:53 AM Julien Le Dem
>> <ju...@astronomer.io.invalid> wrote:
>> >>>>>>> >>>>>>>
>> >>>>>>> >>>>>>> Thank you very much for your input Jarek.
>> >>>>>>> >>>>>>> I am responding in the comments and adding to the doc
>> accordingly.
>> >>>>>>> >>>>>>> I would also love to hear from more stakeholders.
>> >>>>>>> >>>>>>> Thanks to all who provided feedback so far.
>> >>>>>>> >>>>>>> Julien
>> >>>>>>> >>>>>>>
>> >>>>>>> >>>>>>> On Fri, Jan 27, 2023 at 12:57 AM Jarek Potiuk <
>> jarek@potiuk.com> wrote:
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> General comment from my side: I think Open Lineage is
>> (and should be
>> >>>>>>> >>>>>>>> even more) a feature of Airflow that expands Airflow's
>> capabilities
>> >>>>>>> >>>>>>>> greatly and opens up the direction we've been all
>> working on - Airflow
>> >>>>>>> >>>>>>>> as a Platform.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> I think closely integrating it with Open-Lineage goes
>> the same
>> >>>>>>> >>>>>>>> direction (also mentioned in the doc) as Open Telemetry
>> goes, where we
>> >>>>>>> >>>>>>>> might decide to support certain standards in order to
>> expand
>> >>>>>>> >>>>>>>> capabilities of Airflow-as-a-platform and allow plugging
>> in multiple
>> >>>>>>> >>>>>>>> external solutions that would use the standard API.
>> After Open-Lineage
>> >>>>>>> >>>>>>>> graduated recently to  LFAI&Data foundation (I've been
>> watching this
>> >>>>>>> >>>>>>>> happening from far), it is I think the perfect candidate
>> for Airflow
>> >>>>>>> >>>>>>>> to incorporate it. I hope this will help all the players
>> to make use
>> >>>>>>> >>>>>>>> of the extra work necessary by the community to make it
>> "officially
>> >>>>>>> >>>>>>>> supported". I think we have to also get some feedback
>> from the big
>> >>>>>>> >>>>>>>> stakeholders in Airflow - because one thing is to have
>> such a
>> >>>>>>> >>>>>>>> capability, and another is to get it used in all the
>> ways Airflow is
>> >>>>>>> >>>>>>>> used - not only by on-premise/self-hosted users (which
>> is obviously a
>> >>>>>>> >>>>>>>> huge driving factor) but also everywhere where Airflow
>> is exposed by
>> >>>>>>> >>>>>>>> others - Astronomer is obviously on board, we see some
>> warm words from
>> >>>>>>> >>>>>>>> Amazon (mentioned by Julien), and I would love to hear
>> whether the
>> >>>>>>> >>>>>>>> Composer team at Google would be on board in using the
>> open-lineage
>> >>>>>>> >>>>>>>> information exposed this way in their Data Catalog (and
>> likely more)
>> >>>>>>> >>>>>>>> offering. We have Amundsen and others and possibly other
>> stakeholders
>> >>>>>>> >>>>>>>> might want to say something.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> There is - undoubtedly - an extra effort involved in
>> implementing and
>> >>>>>>> >>>>>>>> keeping it running smoothly (as Julien mentioned, that
>> is the main
>> >>>>>>> >>>>>>>> reason why the Open Lineage community would like to make
>> the
>> >>>>>>> >>>>>>>> integration part of Airflow). But by being smart and
>> integrating it in
>> >>>>>>> >>>>>>>> a way that lets us plug it into our CI and
>> verification
>> >>>>>>> >>>>>>>> process, and by setting some very clear expectations about
>> what it means
>> >>>>>>> >>>>>>>> for contributors to Airflow to get it running, we can
>> make some
>> >>>>>>> >>>>>>>> initial investment in making it happen and minimise
>> on-going cost,
>> >>>>>>> >>>>>>>> while maximising the gain.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> And looking at all the above - I am super happy to help
>> with all that
>> >>>>>>> >>>>>>>> to make this easy to "swallow" and integrate well, even
>> if it will
>> >>>>>>> >>>>>>>> take an extra effort, especially that we will have
>> experts from Open
>> >>>>>>> >>>>>>>> Lineage who worked with both Airflow and Open Lineage
>> being the core
>> >>>>>>> >>>>>>>> part of the effort. I am actually super excited - this
>> might be the
>> >>>>>>> >>>>>>>> next-big-thing for Airflow to strengthen its position as
>> an
>> >>>>>>> >>>>>>>> indispensable component of "even more modern data stack".
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> I made my initial comments in the doc, and am looking
>> forward to
>> >>>>>>> >>>>>>>> making it happen :).
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> J.
>> >>>>>>> >>>>>>>>
>> >>>>>>> >>>>>>>> On Fri, Jan 27, 2023 at 2:20 AM Julien Le Dem
>> >>>>>>> >>>>>>>> <ju...@astronomer.io.invalid> wrote:
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Dear Airflow Community,
>> >>>>>>> >>>>>>>> > I have been working on a proposal to bring an
>> OpenLineage provider to Airflow.
>> >>>>>>> >>>>>>>> > I am looking for feedback with the goal to post an
>> official AIP.
>> >>>>>>> >>>>>>>> > Please feel free to comment in the doc above.
>> >>>>>>> >>>>>>>> > Thank you,
>> >>>>>>> >>>>>>>> > Julien (OpenLineage project lead)
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > For convenience, here is the rationale from the doc:
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Operational lineage collection is a common need to
>> understand dependencies between data pipelines and track end-to-end
>> provenance of data. It enables many use cases from ensuring reliable
>> delivery of data through observability to compliance and cost management.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Publishing operational lineage is a core Airflow
>> capability to enable troubleshooting and governance.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > OpenLineage is a project of the LFAI&Data
>> foundation that provides a spec standardizing operational lineage
>> collection and sharing across the data ecosystem. While it provides plugins
>> for popular open source projects, its intent is very similar to
>> OpenTelemetry's (also under the Linux Foundation umbrella): to remain a spec
>> for lineage exchange that projects - open source or proprietary - implement.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > Built-in OpenLineage support in Airflow will make it
>> easier and more reliable for Airflow users to publish their operational
>> lineage through the OpenLineage ecosystem.
>> >>>>>>> >>>>>>>> >
>> >>>>>>> >>>>>>>> > The current external plugin maintained in the
>> OpenLineage project depends on Airflow and operator internals and breaks
>> when changes are made to those. Having a built-in integration
>> ensures better, first-class support for exposing lineage that gets tested
>> alongside other changes and is therefore more stable.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> --
>> >>>>>>> >>>>>> Eugene
>> >>>>>>> >>>>>
>> >>>>>>> >>>>>
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> --
>> >>>>>>> >>>>> Eugene
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > --
>> >>>>>>> > Eugene
>>
>