Posted to user@flink.apache.org by Martijn Visser <ma...@ververica.com> on 2022/01/13 15:27:23 UTC

[FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

Hi everyone,

I'm currently checking out different metadata platforms, such as Amundsen
[1] and Datahub [2]. In short, these types of tools try to address problems
related to topics such as data discovery, data lineage and an overall data
catalogue.

I'm reaching out to the Dev and User mailing lists to get some feedback. It
would really help if you could spend a couple of minutes to let me know
whether you already use one of these two metadata platforms or another one,
or whether you are evaluating such tools. If so, is that for use as a
catalogue, for lineage, or for anything else? Any type of feedback on these
types of tools is appreciated.

Best regards,

Martijn

[1] https://github.com/amundsen-io/amundsen/
[2] https://github.com/linkedin/datahub

Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

Posted by Pedro Silva <pe...@gmail.com>.
Hello,

I'm part of the DataHub community and working in collaboration with the
company behind it: http://acryldata.io
Happy to have a conversation or clarify any questions you may have on
DataHub :)

Have a nice day!

On Thu, Jan 13, 2022 at 15:33, Andrew Otto <ot...@wikimedia.org>
wrote:

> Hello!  The Wikimedia Foundation is currently doing a similar evaluation
> (although we are not currently including any Flink considerations).
>
>
> https://wikitech.wikimedia.org/wiki/Data_Catalog_Application_Evaluation_Rubric
>
> More details will be published there as folks keep working on this.
> Hope that helps a little bit! :)
>
> -Andrew Otto
>
> On Thu, Jan 13, 2022 at 10:27 AM Martijn Visser <ma...@ververica.com>
> wrote:
>
>> Hi everyone,
>>
>> I'm currently checking out different metadata platforms, such as Amundsen
>> [1] and Datahub [2]. In short, these types of tools try to address problems
>> related to topics such as data discovery, data lineage and an overall data
>> catalogue.
>>
>> I'm reaching out to the Dev and User mailing lists to get some feedback.
>> It would really help if you could spend a couple of minutes to let me know
>> if you already use either one of the two mentioned metadata platforms or
>> another one, or are you evaluating such tools? If so, is that for
>> the purpose as a catalogue, for lineage or anything else? Any type of
>> feedback on these types of tools is appreciated.
>>
>> Best regards,
>>
>> Martijn
>>
>> [1] https://github.com/amundsen-io/amundsen/
>> [2] https://github.com/linkedin/datahub
>>
>>
>>

Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

Posted by Andrew Otto <ot...@wikimedia.org>.
Hello! The Wikimedia Foundation is currently doing a similar evaluation
(although we are not currently including any Flink considerations).

https://wikitech.wikimedia.org/wiki/Data_Catalog_Application_Evaluation_Rubric

More details will be published there as folks keep working on this.
Hope that helps a little bit! :)

-Andrew Otto

On Thu, Jan 13, 2022 at 10:27 AM Martijn Visser <ma...@ververica.com>
wrote:

> Hi everyone,
>
> I'm currently checking out different metadata platforms, such as Amundsen
> [1] and Datahub [2]. In short, these types of tools try to address problems
> related to topics such as data discovery, data lineage and an overall data
> catalogue.
>
> I'm reaching out to the Dev and User mailing lists to get some feedback.
> It would really help if you could spend a couple of minutes to let me know
> if you already use either one of the two mentioned metadata platforms or
> another one, or are you evaluating such tools? If so, is that for
> the purpose as a catalogue, for lineage or anything else? Any type of
> feedback on these types of tools is appreciated.
>
> Best regards,
>
> Martijn
>
> [1] https://github.com/amundsen-io/amundsen/
> [2] https://github.com/linkedin/datahub
>
>
>

Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

Posted by Martijn Visser <ma...@apache.org>.
Hi Yun Tang,

Sorry for the late reply. I haven't seen any tickets related to this topic.
I still think this is an important feature for Flink to support, and I would
love to see some volunteers pick up this topic.

Best regards,

Martijn

On Tue, Sep 13, 2022 at 7:47 AM Yun Tang <my...@live.com> wrote:

> An interesting topic, I noticed that the datahub community has launched
> the feature request discussion of Flink Integration [1].
>
> @Martijn Visser <ma...@apache.org> Has the Flink community created
> tickets to track this topic?
> From my current understanding, and as Feng mentioned, Flink's
> FlinkJobListener lacks the rich information needed to send data lineage to
> external systems, which is something Spark supports well.
>
>
>
> [1] https://feature-requests.datahubproject.io/p/flink-integration
>
>
> Best
> Yun Tang
> ------------------------------
> *From:* wangqinghuan <10...@qq.com>
> *Sent:* Monday, January 17, 2022 18:27
> *To:* user@flink.apache.org <us...@flink.apache.org>
> *Subject:* Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage
> integration
>
>
> we are using Datahub to address table-level lineage and column-level
> lineage for Flink SQL.
> On 2022/1/13 23:27, Martijn Visser wrote:
>
> Hi everyone,
>
> I'm currently checking out different metadata platforms, such as Amundsen
> [1] and Datahub [2]. In short, these types of tools try to address problems
> related to topics such as data discovery, data lineage and an overall data
> catalogue.
>
> I'm reaching out to the Dev and User mailing lists to get some feedback.
> It would really help if you could spend a couple of minutes to let me know
> if you already use either one of the two mentioned metadata platforms or
> another one, or are you evaluating such tools? If so, is that for
> the purpose as a catalogue, for lineage or anything else? Any type of
> feedback on these types of tools is appreciated.
>
> Best regards,
>
> Martijn
>
> [1] https://github.com/amundsen-io/amundsen/
> [2] https://github.com/linkedin/datahub
>
>

Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

Posted by Yun Tang <my...@live.com>.
An interesting topic. I noticed that the DataHub community has opened a feature request discussion for Flink integration [1].

@Martijn Visser<ma...@apache.org> Has the Flink community created tickets to track this topic?
From my current understanding, and as Feng mentioned, Flink's FlinkJobListener lacks the rich information needed to send data lineage to external systems, which is something Spark supports well.



[1] https://feature-requests.datahubproject.io/p/flink-integration


Best
Yun Tang
________________________________
From: wangqinghuan <10...@qq.com>
Sent: Monday, January 17, 2022 18:27
To: user@flink.apache.org <us...@flink.apache.org>
Subject: Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration


we are using Datahub to address table-level lineage and column-level lineage for Flink SQL.

On 2022/1/13 23:27, Martijn Visser wrote:
Hi everyone,

I'm currently checking out different metadata platforms, such as Amundsen [1] and Datahub [2]. In short, these types of tools try to address problems related to topics such as data discovery, data lineage and an overall data catalogue.

I'm reaching out to the Dev and User mailing lists to get some feedback. It would really help if you could spend a couple of minutes to let me know if you already use either one of the two mentioned metadata platforms or another one, or are you evaluating such tools? If so, is that for the purpose as a catalogue, for lineage or anything else? Any type of feedback on these types of tools is appreciated.

Best regards,

Martijn

[1] https://github.com/amundsen-io/amundsen/
[2] https://github.com/linkedin/datahub


Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

Posted by wangqinghuan <10...@qq.com>.
We are using DataHub to address table-level and column-level lineage for
Flink SQL.

On 2022/1/13 23:27, Martijn Visser wrote:
> Hi everyone,
>
> I'm currently checking out different metadata platforms, such as 
> Amundsen [1] and Datahub [2]. In short, these types of tools try to 
> address problems related to topics such as data discovery, data 
> lineage and an overall data catalogue.
>
> I'm reaching out to the Dev and User mailing lists to get some 
> feedback. It would really help if you could spend a couple of minutes 
> to let me know if you already use either one of the two mentioned 
> metadata platforms or another one, or are you evaluating such tools? 
> If so, is that for the purpose as a catalogue, for lineage or anything 
> else? Any type of feedback on these types of tools is appreciated.
>
> Best regards,
>
> Martijn
>
> [1] https://github.com/amundsen-io/amundsen/
> [2] https://github.com/linkedin/datahub
>

Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

Posted by JIN FENG <ji...@gmail.com>.
Hi
I am a software engineer from Xiaomi.

Last year we used Metacat (https://github.com/Netflix/metacat) to manage all
of our metadata, including Hive, Kudu, Doris, Iceberg, Elasticsearch, Talos
(Xiaomi's self-developed message queue), MySQL, and TiDB.

Metacat is compatible with the hive-metastore protocol, so we can use
FlinkHiveCatalog to connect to Metacat directly and create different kinds
of tables, including Hive tables and other generic table types.

All systems are abstracted into a catalog.database.table structure, so in
Flink SQL we can access any registered table by its catalog.database.table
name.
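
As a rough illustration of the catalog.database.table abstraction described
above, here is a minimal Python sketch. It is not Metacat's or Flink's API:
TableRef and MetadataRegistry are hypothetical names, the table names are made
up, and the in-memory dictionary only stands in for a real metadata service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A fully qualified table name (hypothetical helper, not a Flink class)."""
    catalog: str
    database: str
    table: str

    @classmethod
    def parse(cls, qualified_name: str) -> "TableRef":
        # Require exactly three non-empty, dot-separated parts.
        parts = qualified_name.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"expected catalog.database.table, got {qualified_name!r}"
            )
        return cls(*parts)


class MetadataRegistry:
    """In-memory stand-in for a unified metadata service (an assumption)."""

    def __init__(self) -> None:
        self._tables = {}

    def register(self, qualified_name: str, **properties) -> TableRef:
        # Tables from any backing system are stored under the same key shape.
        ref = TableRef.parse(qualified_name)
        self._tables[ref] = dict(properties)
        return ref

    def lookup(self, qualified_name: str) -> dict:
        return self._tables[TableRef.parse(qualified_name)]


registry = MetadataRegistry()
registry.register("hive.ods.orders", connector="hive")
registry.register("iceberg.dw.orders_daily", connector="iceberg")
print(registry.lookup("iceberg.dw.orders_daily")["connector"])
```

The point of the sketch is only that every system, whatever its storage
engine, is addressed through the same three-part name, which is also a
natural key for permissions and lineage.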

In addition, Metacat manages all table creation, deletion, and partitioning
operations in one place. By analyzing Metacat's audit log, we can easily
obtain the DDL lineage of different tables.

At the same time, using Ranger (https://github.com/ranger/ranger), we have
added permission control to the Flink framework, and all permission
information is stored in catalog.database.table form.

We also modified the logic related to FlinkJobListener so that it exposes
the JobGraph. By parsing the JobGraph, we can obtain the lineage information
of the job.
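
To make the idea of deriving lineage from a job graph concrete, here is a
minimal Python sketch. It does not use Flink's JobGraph class or the modified
listener mentioned above: the vertex dictionaries, the "kind" and "table"
fields, and the table_lineage function are all assumptions made for
illustration. It walks the graph downstream from every source and records
which sinks are reachable.

```python
from collections import deque


def table_lineage(vertices, edges):
    """Return sorted (source_table, sink_table) pairs.

    vertices: id -> {"kind": "source" | "operator" | "sink", "table": str or None}
    edges:    list of (upstream_id, downstream_id) tuples
    Both shapes are hypothetical simplifications of a real job graph.
    """
    downstream = {}
    for up, down in edges:
        downstream.setdefault(up, []).append(down)

    pairs = set()
    for vertex_id, vertex in vertices.items():
        if vertex["kind"] != "source":
            continue
        # Breadth-first search from this source to every reachable sink.
        seen = {vertex_id}
        queue = deque([vertex_id])
        while queue:
            current = queue.popleft()
            for nxt in downstream.get(current, []):
                if nxt in seen:
                    continue
                seen.add(nxt)
                queue.append(nxt)
                if vertices[nxt]["kind"] == "sink":
                    pairs.add((vertex["table"], vertices[nxt]["table"]))
    return sorted(pairs)


# A made-up job: two sources joined and written to one sink.
vertices = {
    1: {"kind": "source", "table": "hive.ods.orders"},
    2: {"kind": "source", "table": "mysql.shop.users"},
    3: {"kind": "operator", "table": None},
    4: {"kind": "sink", "table": "iceberg.dw.orders_enriched"},
}
edges = [(1, 3), (2, 3), (3, 4)]
lineage = table_lineage(vertices, edges)
print(lineage)
```

This only yields table-level lineage. Column-level lineage would need
information from the query plan that a graph of operators alone does not
carry.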

To sum up, unified metadata management makes it convenient to manage
different systems and connect them to Flink, and it also simplifies unified
permission management and obtaining table-related lineage information.


On Fri, Jan 14, 2022 at 3:14 AM Maciej Obuchowski <
obuchowski.maciej@gmail.com> wrote:

> Hello,
>
> I'm an OpenLineage committer - and previously, a minor Flink contributor.
> OpenLineage community is very interested in conversation about Flink
> metadata, and we'll be happy to cooperate with the Flink community.
>
> Best,
> Maciej Obuchowski
>
>
>
> On Thu, Jan 13, 2022 at 18:12, Martijn Visser <ma...@ververica.com>
> wrote:
> >
> > Hi all,
> >
> > @Andrew thanks for sharing that!
> >
> > @Tero good point, I should have clarified the purpose. I want to
> > understand what "metadata platforms" tools are used or evaluated by the
> > Flink community, what's their purpose for using such a tool (is it as a
> > generic catalogue, as a data discovery tool, is lineage the important
> > part etc) and what problems are people trying to solve with them. This
> > space is developing rapidly and there are many open source and
> > commercial tools popping up/growing, which is also why I'm trying to
> > keep an open vision on how this space is evolving.
> >
> > If the Flink community wants to integrate with metadata tools, I fully
> > agree that ideally we do that via standards. My perception is at this
> > moment that no clear standard has yet been established. You mentioned
> > open-metadata.org, but I believe https://openlineage.io/ is also an
> > alternative standard.
> >
> > Best regards,
> >
> > Martijn
> >
> > On Thu, 13 Jan 2022 at 17:00, Tero Paananen <te...@gmail.com> wrote:
> >
> > > > I'm currently checking out different metadata platforms, such as
> > > > Amundsen [1] and Datahub [2]. In short, these types of tools try to
> > > > address problems related to topics such as data discovery, data
> > > > lineage and an overall data catalogue.
> > > >
> > > > I'm reaching out to the Dev and User mailing lists to get some
> > > > feedback. It would really help if you could spend a couple of minutes
> > > > to let me know if you already use either one of the two mentioned
> > > > metadata platforms or another one, or are you evaluating such tools?
> > > > If so, is that for the purpose as a catalogue, for lineage or
> > > > anything else? Any type of feedback on these types of tools is
> > > > appreciated.
> > >
> > > I hope you don't mind answers off-list.
> > >
> > > You didn't say what purpose you're evaluating these tools for, but if
> > > you're evaluating platforms for integration with Flink, I wouldn't
> > > approach it with a particular product in mind. Rather I'd create some
> > > sort of facility to propagate metadata and/or lineage information in a
> > > generic way and allow Flink users to plug in their favorite metadata
> > > tool. Using standards like OpenLineage, for example. I believe Egeria
> > > > is also trying to create an open standard for metadata.
> > >
> > > If you're evaluating data catalogs for personal use or use in a
> > > particular project, Andrew's answer about the Wikimedia evaluation is
> > > a good start. It's missing OpenMetadata (https://open-metadata.org/).
> > > That one is showing a LOT of promise. Wikimedia's evaluation is also
> > > missing industry leading commercial products (understandably, given
> > > > their mission). Collibra and Alation are probably the ones that pop up
> > > most often.
> > >
> > > I have personally looked into both DataHub and Amundsen. My high level
> > > feedback is that DataHub is overengineered, and using proprietary
> > > LinkedIn technology platform(s), which aren't widely used anywhere.
> > > Amundsen is much less flexible than DataHub and quite basic in its
> > > functionality. If you need anything beyond what it already offers,
> > > good luck.
> > >
> > > We dumped Amundsen in favor of OpenMetadata a few months back. We
> > > don't have enough data points to fully evaluate OpenMetadata yet.
> > >
> > > -TPP
> > >
>

Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

Posted by Maciej Obuchowski <ob...@gmail.com>.
Hello,

I'm an OpenLineage committer and, previously, a minor Flink contributor.
The OpenLineage community is very interested in a conversation about Flink
metadata, and we'll be happy to cooperate with the Flink community.

Best,
Maciej Obuchowski



On Thu, Jan 13, 2022 at 18:12, Martijn Visser <ma...@ververica.com> wrote:
>
> Hi all,
>
> @Andrew thanks for sharing that!
>
> @Tero good point, I should have clarified the purpose. I want to understand
> what "metadata platforms" tools are used or evaluated by the Flink
> community, what's their purpose for using such a tool (is it as a generic
> catalogue, as a data discovery tool, is lineage the important part etc) and
> what problems are people trying to solve with them. This space is
> developing rapidly and there are many open source and commercial tools
> popping up/growing, which is also why I'm trying to keep an open vision on
> how this space is evolving.
>
> If the Flink community wants to integrate with metadata tools, I fully
> agree that ideally we do that via standards. My perception is at this
> moment that no clear standard has yet been established. You mentioned
> open-metadata.org, but I believe https://openlineage.io/ is also an
> alternative standard.
>
> Best regards,
>
> Martijn
>
> On Thu, 13 Jan 2022 at 17:00, Tero Paananen <te...@gmail.com> wrote:
>
> > > I'm currently checking out different metadata platforms, such as
> > > Amundsen [1] and Datahub [2]. In short, these types of tools try to
> > > address problems related to topics such as data discovery, data lineage
> > > and an overall data catalogue.
> > >
> > > I'm reaching out to the Dev and User mailing lists to get some feedback.
> > > It would really help if you could spend a couple of minutes to let me
> > > know if you already use either one of the two mentioned metadata
> > > platforms or another one, or are you evaluating such tools? If so, is
> > > that for the purpose as a catalogue, for lineage or anything else? Any
> > > type of feedback on these types of tools is appreciated.
> >
> > I hope you don't mind answers off-list.
> >
> > You didn't say what purpose you're evaluating these tools for, but if
> > you're evaluating platforms for integration with Flink, I wouldn't
> > approach it with a particular product in mind. Rather I'd create some
> > sort of facility to propagate metadata and/or lineage information in a
> > generic way and allow Flink users to plug in their favorite metadata
> > tool. Using standards like OpenLineage, for example. I believe Egeria
> > is also trying to create an open standard for metadata.
> >
> > If you're evaluating data catalogs for personal use or use in a
> > particular project, Andrew's answer about the Wikimedia evaluation is
> > a good start. It's missing OpenMetadata (https://open-metadata.org/).
> > That one is showing a LOT of promise. Wikimedia's evaluation is also
> > missing industry leading commercial products (understandably, given
> > their mission). Collibra and Alation are probably the ones that pop up
> > most often.
> >
> > I have personally looked into both DataHub and Amundsen. My high level
> > feedback is that DataHub is overengineered, and using proprietary
> > LinkedIn technology platform(s), which aren't widely used anywhere.
> > Amundsen is much less flexible than DataHub and quite basic in its
> > functionality. If you need anything beyond what it already offers,
> > good luck.
> >
> > We dumped Amundsen in favor of OpenMetadata a few months back. We
> > don't have enough data points to fully evaluate OpenMetadata yet.
> >
> > -TPP
> >

Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

Posted by Martijn Visser <ma...@ververica.com>.
Hi all,

@Andrew thanks for sharing that!

@Tero good point, I should have clarified the purpose. I want to understand
which "metadata platform" tools are used or evaluated by the Flink
community, what their purpose is for using such a tool (as a generic
catalogue, as a data discovery tool, primarily for lineage, etc.) and
what problems people are trying to solve with them. This space is
developing rapidly and there are many open source and commercial tools
popping up and growing, which is also why I'm trying to keep an open mind
about how this space is evolving.

If the Flink community wants to integrate with metadata tools, I fully
agree that ideally we do that via standards. My perception at this moment
is that no clear standard has been established yet. You mentioned
open-metadata.org, but I believe https://openlineage.io/ is also an
alternative standard.

Best regards,

Martijn

On Thu, 13 Jan 2022 at 17:00, Tero Paananen <te...@gmail.com> wrote:

> > I'm currently checking out different metadata platforms, such as
> > Amundsen [1] and Datahub [2]. In short, these types of tools try to
> > address problems related to topics such as data discovery, data lineage
> > and an overall data catalogue.
> >
> > I'm reaching out to the Dev and User mailing lists to get some feedback.
> > It would really help if you could spend a couple of minutes to let me
> > know if you already use either one of the two mentioned metadata
> > platforms or another one, or are you evaluating such tools? If so, is
> > that for the purpose as a catalogue, for lineage or anything else? Any
> > type of feedback on these types of tools is appreciated.
>
> I hope you don't mind answers off-list.
>
> You didn't say what purpose you're evaluating these tools for, but if
> you're evaluating platforms for integration with Flink, I wouldn't
> approach it with a particular product in mind. Rather I'd create some
> sort of facility to propagate metadata and/or lineage information in a
> generic way and allow Flink users to plug in their favorite metadata
> tool. Using standards like OpenLineage, for example. I believe Egeria
> is also trying to create an open standard for metadata.
>
> If you're evaluating data catalogs for personal use or use in a
> particular project, Andrew's answer about the Wikimedia evaluation is
> a good start. It's missing OpenMetadata (https://open-metadata.org/).
> That one is showing a LOT of promise. Wikimedia's evaluation is also
> missing industry-leading commercial products (understandably, given
> their mission). Collibra and Alation are probably the ones that pop up
> most often.
>
> I have personally looked into both DataHub and Amundsen. My high level
> feedback is that DataHub is overengineered, and using proprietary
> LinkedIn technology platform(s), which aren't widely used anywhere.
> Amundsen is much less flexible than DataHub and quite basic in its
> functionality. If you need anything beyond what it already offers,
> good luck.
>
> We dumped Amundsen in favor of OpenMetadata a few months back. We
> don't have enough data points to fully evaluate OpenMetadata yet.
>
> -TPP
>
