Posted to dev@gobblin.apache.org by Chris Li <ch...@linkedin.com.INVALID> on 2021/03/22 22:44:13 UTC

[DISCUSS] Connectors in Apache Gobblin

Proposal:

DIL (the LinkedIn-internal project name) is a generic multi-stage Gobblin connector library. The code can be accessed here: https://github.com/linkedin/gobblin-connectors. Its core features and high-level descriptions are shared here: https://engineering.linkedin.com/blog/2021/data-integration-library.

Following initial discussion with members of the Gobblin community, we propose a separate sub-repo for this library.

Why:
Some thoughts on, and justifications for, a sub-repo vs. a module in the main Gobblin repo:

  1.  Gobblin connectors are an important part of the Gobblin ecosystem, but the development of connectors is relatively independent of Gobblin core.
  2.  Connectors are where open source communities can contribute the most, and they will grow much faster than Gobblin core.
  3.  The new connector library is a comprehensive package of unique design patterns. This is where the data integration diversity challenge will be addressed. The importance of this code base grows by the day as more integration scenarios become supported.
  4.  The new connector library evolves and replaces many prior Gobblin connectors under the “gobblin-modules” module. A separate repo will help avoid confusion.
  5.  Separating core and ecosystem modules can help improve isolation and reduce the number of defects.

Regards,
Chris





Re: [DISCUSS] Connectors in Apache Gobblin

Posted by Abhishek Tiwari <ab...@apache.org>.
Hi Chris,

Thanks for starting this thread, and for your contribution. My thoughts:
1. I like the idea of keeping Gobblin's connectors in a separate sub-repo,
because these are fairly independent pluggable connectors.
2. They can bring in a significant number of dependencies, which are better
kept isolated from core.

As for the next steps:
a. I request that Gobblin committers and PMC members review this repo to
ensure it is in good shape to add to Gobblin (including code quality,
structure, dependencies it brings in, etc.)
b. Chris, please confirm that this contribution will be under the Apache 2.0
license and that the code already has the right license headers.

Once we have an affirmative answer for (a) and (b), I will coordinate
with Apache Infra to set up a sub-repo for this in Gobblin, and work with
Chris to bring it in.

Thanks,
Abhishek



Re: [DISCUSS] Connectors in Apache Gobblin

Posted by Sudarshan Vasudevan <su...@linkedin.com.INVALID>.
+1 to Shirshanka Das's proposal for pulling connectors into a separate repo. The Kafka Connect model is worth emulating here.

That said, I prefer DIL connectors to be maintained as a standalone open source repository outside of Apache Gobblin for several reasons:

  1.  As has already been mentioned in the thread below, the connector library will evolve much more rapidly than the Gobblin core libraries. As such, it is better to have separate sets of committers who are more attuned to the pace of development in their respective libraries. This will ultimately lead to faster code reviews, bug fixes, etc.
  2.  I expect the community for connectors to be very different from the community for Gobblin core, and it is better to cultivate and support these communities independently.
  3.  Having DIL connectors outside Apache Gobblin allows Gobblin to support a marketplace of connectors discoverable via a catalog. In this end state, we could have multiple implementations of the same connector with different feature sets catering to different use cases.

Of course, any framework enhancements that are necessary to support DIL connectors can be contributed back to Gobblin core.

HTH,
Sudarshan
________________________________
From: Shirshanka Das <sh...@apache.org>
Sent: Monday, March 22, 2021 11:40 PM
To: dev@gobblin.apache.org <de...@gobblin.apache.org>
Subject: Re: [DISCUSS] Connectors in Apache Gobblin

Hi Chris,
  Thanks for this proposal! I think we have had quite a few issues with our
monolithic repository and I think it has hindered the development and
maintenance of new connectors.
  JB makes some good points that are worth considering.

  My 2c:
   I think separating out the connectors into a separate repo, and in fact
supporting multiple repos that can contain separate connectors, is probably
going to be my vote.
   This will also help us clarify the "public API" of the Gobblin framework
versus internal details that many connectors probably depend on today.

 I would rather follow the Kafka Connect model: the core framework has
APIs and is versioned independently from connector implementations, which
can live in other repositories. Implementations should feature in the
"Connector Matrix" as part of the documentation for discoverability.

There can be an official catalog of supported connectors, and maybe that
can be our first "repo" that Abhishek is proposing. But I would make sure
we are not creating a new monorepo pattern with it.

What do others think?
Shirshanka





On Mon, Mar 22, 2021 at 10:09 PM, Jean-Baptiste Onofre <jb...@nanthrax.net>
wrote:

> Hi Chris,
>
> I agree that connectors are very important. Other Apache projects became
> popular mostly thanks to their connector sets (I’m thinking about Apache Beam,
> Apache Camel, or Apache Karaf Decanter for instance). The connectors allow
> more users to "integrate" Gobblin in their ecosystem, so it would increase
> our user community. It will also increase our dev community, as it’s
> probably easier to contribute to a connector than to the Gobblin core.
>
> About the repo vs module, there are two questions IMHO:
> 1. How do we keep the API/code in sync between Gobblin core and the
> connectors?
> 2. Do we plan to have a different release cycle between core and
> connectors (even if it’s always possible to release a module atomically)?
>
> IMHO, if we plan to do a Gobblin release including core + connectors, then
> a module is easier.
>
> Regards
> JB

Re: [DISCUSS] Connectors in Apache Gobblin

Posted by Shirshanka Das <sh...@apache.org>.
Hi Chris,
  Thanks for this proposal! I think we have had quite a few issues with our
monolithic repository and I think it has hindered the development and
maintenance of new connectors.
  JB makes some good points that are worth considering.

  My 2c:
   I think separating out the connectors into a separate repo, and in fact
supporting multiple repos that can contain separate connectors, is probably
going to be my vote.
   This will also help us clarify the "public API" of the Gobblin framework
versus internal details that many connectors probably depend on today.

 I would rather follow the Kafka Connect model: the core framework has
APIs and is versioned independently from connector implementations, which
can live in other repositories. Implementations should feature in the
"Connector Matrix" as part of the documentation for discoverability.

There can be an official catalog of supported connectors, and maybe that
can be our first "repo" that Abhishek is proposing. But I would make sure
we are not creating a new monorepo pattern with it.

What do others think?
Shirshanka





On Mon, Mar 22, 2021 at 10:09 PM, Jean-Baptiste Onofre <jb...@nanthrax.net>
wrote:

> Hi Chris,
>
> I agree that connectors are very important. Other Apache projects became
> popular mostly thanks to their connector sets (I’m thinking about Apache Beam,
> Apache Camel, or Apache Karaf Decanter for instance). The connectors allow
> more users to "integrate" Gobblin in their ecosystem, so it would increase
> our user community. It will also increase our dev community, as it’s
> probably easier to contribute to a connector than to the Gobblin core.
>
> About the repo vs module, there are two questions IMHO:
> 1. How do we keep the API/code in sync between Gobblin core and the
> connectors?
> 2. Do we plan to have a different release cycle between core and
> connectors (even if it’s always possible to release a module atomically)?
>
> IMHO, if we plan to do a Gobblin release including core + connectors, then
> a module is easier.
>
> Regards
> JB

Re: [DISCUSS] Connectors in Apache Gobblin

Posted by Jean-Baptiste Onofre <jb...@nanthrax.net>.
Hi Chris,

I agree that connectors are very important. Other Apache projects became popular mostly thanks to their connector sets (I’m thinking about Apache Beam, Apache Camel, or Apache Karaf Decanter for instance). The connectors allow more users to "integrate" Gobblin in their ecosystem, so it would increase our user community.
It will also increase our dev community, as it’s probably easier to contribute to a connector than to the Gobblin core.

About the repo vs module, there are two questions IMHO:
1. How do we keep the API/code in sync between Gobblin core and the connectors?
2. Do we plan to have a different release cycle between core and connectors (even if it’s always possible to release a module atomically)?

IMHO, if we plan to do a Gobblin release including core + connectors, then a module is easier.

Regards
JB
