Posted to dev@airflow.apache.org by Jarek Potiuk <ja...@potiuk.com> on 2022/05/18 07:01:23 UTC

Re: [DISCUSS] Approach for new providers of the community

Hello everyone,

I think we have a series of things that make it difficult to focus on
such long-term discussions - 2.3.0 is out, many people are busy
with 2.3.1, which is going to focus on "teething" problems, and we have
Airflow Summit next week (yay!), and I know how many people in our
community are busy preparing either the local events or their talks
:).

I have some ideas and proposals on how we can approach the subject and
would like to continue the discussion (I would still love to hear more
voices), but I think it would be great if we can resume the discussion
after the Summit.

But - Summit is not only a "disruption" - it's also an opportunity to
make the discussion better. I think the Summit with the local events
is a great opportunity to discuss this in person - and in at least 13
separate locations :).

So I have a kind request to everyone - let's talk about it at the
local events. I will be in both London and Warsaw, so if you happen
to be there - happy to share my thoughts with anyone interested and
hear what you have to say :) - and I encourage similar discussions
elsewhere.

I think the decision on how we approach providers in the future is a
very important one; we should take it very seriously and we should
give everyone a chance to participate. It will, to some extent, define
the future of the whole Airflow ecosystem.

J.

On Tue, Apr 26, 2022 at 12:43 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> I think this is a different story (and different discussion).
> And I think we should have good reasons to split the repo. I think we
> do have them, but for different reasons than many people think. We will
> get there sooner rather than later - but I think we should not hijack
> this discussion for it.
> This discussion is more about governance of providers than about which
> repo they are in.
>
> Unless I am mistaken - moving providers to a separate repo does not
> really answer the "should we have more or fewer community providers"
> question. It's really a technical split of code, but if we have a
> separate repo and we still add more providers from the community, we
> will still have to make sure all of them can be installed, run the
> tests on the code, make sure they run with Airflow (released and main)
> and make sure that Airflow changes do not break them.
>
> It means about the same amount of safeguards, protection and CI
> overhead that we have now - only the code will be somewhere else. The
> amount of CI tests, when they are executed, who is allowed to merge
> the code and the approval process will remain the same as long as this
> is an "Apache Airflow PMC" project.
>
> J.
>
> On Tue, Apr 26, 2022 at 12:21 AM Kaxil Naik <ka...@gmail.com> wrote:
> >
> > Hey all,
> >
> > Another alternative is separating out core providers from the Core Airflow Repo into a separate repo within the Apache Org itself, maybe: apache-airflow-providers.
> >
> > That will not decrease the maintenance from the Committers but the Core work and release will be completely separate and untangled from Apache Airflow repo and can move at a faster pace.
> >
> > The benefit and compromise for the community is that all the providers are still officially maintained and released by the committers. However, over time we can invite more committers who show active participation in apache-airflow-providers repo too.
> >
> > This is a compromise to the arguments about Providers being integral to the success of Airflow and as such should be maintained and released officially.
> >
> > Regards,
> > Kaxil
> >
> > On Mon, 25 Apr 2022 at 19:17, Jarek Potiuk <ja...@potiuk.com> wrote:
> >>
> >> > 1. https://registry.astronomer.io/
> >> > 2. Using the new classifier https://pypi.org/search/?o=&c=Framework+%3A%3A+Apache+Airflow+%3A%3A+Provider
> >>
> >> Yep. Precisely what I thought to place at the top of the ecosystem page.
> >>
> >> > On 25 April 2022 18:08:49 BST, "Ferruzzi, Dennis" <fe...@amazon.com.INVALID> wrote:
> >> >>
> >> >> I still think that easy inclusion with a defined pruning process is best, but it's looking like that is the minority opinion.  In which case, IFF we are going to be keeping them separate then I definitely agree that there needs to be a fast/easy/convenient way to find them.
> >> >> ________________________________
> >> >> From: Jarek Potiuk <ja...@potiuk.com>
> >> >> Sent: Monday, April 25, 2022 7:17 AM
> >> >> To: dev@airflow.apache.org
> >> >> Subject: RE: [EXTERNAL][DISCUSS] Approach for new providers of the community
> >> >>
> >> >> Just to come back to it (please everyone, a little patience - I think
> >> >> some people have not chimed in yet due to the 2.3.0 "focus", so this
> >> >> discussion might take a little more time).
> >> >>
> >> >> My current thinking on it so far:
> >> >>
> >> >> * I am not really in the camp of "let's not add any more providers at
> >> >> all" and also not in the "let's accept all providers that are good
> >> >> quality code" camp. I think there are a few providers which "after
> >> >> fulfilling all the criteria" could be added - mostly open-source
> >> >> standards, generic, established technologies - but it should be a
> >> >> rather limited and rare event.
> >> >>
> >> >> * when there is a proprietary service which does not have too broad a
> >> >> reach and it's not likely that we will have some committers who will
> >> >> be maintaining it - because they are users - the default option should
> >> >> be to make standalone per-service providers. The difficulty here is
> >> >> to set the right "non-quality" criteria - but I think we really want
> >> >> to limit any new code to maintain. Here maybe we can have some more
> >> >> concrete criteria proposed - so that we do not have to vote
> >> >> individually on each proposed provider - and so that those who want
> >> >> to propose a provider could check for themselves, by reading the
> >> >> criteria, what's best for them.
> >> >>
> >> >> * we might improve our "providers" list at the "ecosystem" page to make
> >> >> providers stand out a bit more (maybe simply put them on top and make
> >> >> a clearly visible section). We are not going to maintain and keep a
> >> >> nice "registry" similar to Astronomer's one (we could even actually
> >> >> make the link to the Astronomer registry more prominent as the way to
> >> >> "search" for providers on our Ecosystem page). We could also add a link
> >> >> to PyPI with the "airflow provider" classifier at the ecosystem page
> >> >> as another way of searching for providers. All that is, I think,
> >> >> perfectly fine with the ASF policies and spirit. And it will be good
> >> >> for discovery.
> >> >>
> >> >> WDYT?
> >> >>
> >> >> J.
> >> >>
> >> >> On Mon, Apr 18, 2022 at 3:59 PM Samhita Alla <sa...@union.ai> wrote:
> >> >>>
> >> >>>
> >> >>> Hello!
> >> >>>
> >> >>> The reason behind submitting the Flyte provider to the Airflow repository is that we felt it'd be effortless for the Airflow users to use the integration. Moreover, since it'd be under the umbrella of Airflow, we estimated that the Airflow users would not hesitate to use the provider.
> >> >>>
> >> >>> We could definitely have this as a standalone provider, but the easy-to-get-started incentive of Airflow providers seemed like a better option.
> >> >>>
> >> >>> If there's a sophisticated plan in place for having standalone providers in PyPI, we're up for it.
> >> >>>
> >> >>> Thanks,
> >> >>> Samhita
> >> >>>
> >> >>> On Wed, Apr 13, 2022 at 9:58 PM Alex Ott <al...@gmail.com> wrote:
> >> >>>>
> >> >>>>
> >> >>>> Hello all
> >> >>>>
> >> >>>> I want to try to explain a motivation behind submission of the Delta Sharing provider:
> >> >>>>
> >> >>>> Let me start with the fact that the original issue was created against the Airflow repository, and it was accepted as potential new functionality. And the discussion about new providers started almost on the day when the PR was submitted :-)
> >> >>>> Delta Sharing is an OSS project under the umbrella of the Linux Foundation that defines a protocol and reference implementations. It was started by Databricks, but has other contributors as well - that's why it wasn't pushed into the Databricks provider, as it's not specific to Databricks.
> >> >>>> Another thought about submitting it as a separate provider was to get more people interested in this functionality and build additional integrations on top of it.
> >> >>>> Another important aspect of having providers in the Airflow repository is that they are tested together with changes in the core of Airflow.
> >> >>>>
> >> >>>> I completely understand the concerns about more maintenance effort, but my personal point of view (more about it below) is similar to Rafal's & John's - if there are well-defined criteria & plans for decommissioning or something like that, then providers could be part of the releases, etc.
> >> >>>>
> >> >>>> I just want to add that although I'm employed by Databricks, I'm not a part of the development team - I'm on the field team that works with customers, sees how they are using different tools, sees pain points, etc. Most of the work so far was done on my own time - I'm doing some coordination, but most of the new functionality (AAD tokens support, Repos, Databricks SQL operators, etc.) is coming from seeing customers using Airflow together with Databricks.
> >> >>>>
> >> >>>>
> >> >>>> On Mon, Apr 11, 2022 at 9:14 PM Rafal Biegacz <ra...@google.com.invalid> wrote:
> >> >>>>>
> >> >>>>>
> >> >>>>> Hi,
> >> >>>>>
> >> >>>>> I think that we will need to find some middle ground here - we are trying to optimize in many dimensions (Jarek mentioned 3 of them). Maybe I would also add another 4th dimension - Airflow Service Provider, :).
> >> >>>>>
> >> >>>>> Airflow users - whether they run self-managed Airflow or use "managed Airflow" provided by others - are beneficiaries of the fact that Airflow has a decent portfolio of providers.
> >> >>>>> It's not only a guarantee that these providers should work fine and that they meet Airflow coding/testing standards. It's also a kind of guarantee that once they start using Airflow with providers backed by the Airflow community they won't be on their own when it comes to troubleshooting/updating/etc. It will be much easier for them to convince their companies to use Airflow for production use cases, as the Airflow platform (core + providers) is tested/maintained by the Airflow community.
> >> >>>>>
> >> >>>>> Keeping providers within the Airflow repository generates integration and maintenance work on the Airflow community side. On the other hand, if this work is not done within the community then this effort would need to be done by all users to a certain extent. So from this perspective it's more optimal for the community to do it so users can use off-the-shelf Airflow for the majority of their use cases.
> >> >>>>>
> >> >>>>> When it comes to accepting new providers - I like John's suggestions:
> >> >>>>> a) a well-defined standard that needs to be met by providers - passing the "provider qualification" would take some effort, so each service provider would need to decide whether it would be easier to maintain their provider on their own.
> >> >>>>> b) a well-defined lifecycle for providers - which would allow us to identify providers that are obsolete/not popular any more and retire them.
> >> >>>>>
> >> >>>>> Regards, Rafal.
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> On Mon, Apr 11, 2022 at 6:47 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> I've been thinking about it - to make up my mind a little. The good thing for me is that I have no strong opinion and I can rather easily see (or so I think) the points of both sides.
> >> >>>>>>
> >> >>>>>> TL;DR; I think we need an explanation from the "Service Providers" - what they want to achieve by contributing providers to the community and see if we can achieve similar results differently.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Obviously I am a bit biased from the maintainer point of view, but since I cooperate with various stakeholders I spoke to some of them just to see their point of view, and this is what I got:
> >> >>>>>>
> >> >>>>>> It seems that we really have three types of stakeholders that are interested in "providers":
> >> >>>>>>
> >> >>>>>> 1) "Maintainers" - those who mostly maintain Airflow and have to take care of its future and development and the "grand vision" of where we want to be in a few years
> >> >>>>>> 2) "Users" - those who use Airflow and integration with the Service Provider
> >> >>>>>> 3) "Service providers" - those who run the services that Airflow integrates with - via providers (that group might also contain those stakeholders that run Airflow "as a service")
> >> >>>>>>
> >> >>>>>> Let me see it from all the different POVs:
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> From 1) Maintainer POV
> >> >>>>>>
> >> >>>>>> More providers mean slower growth of the platform overall as the more providers we add and manage as a community, the less time we can spend on improving Airflow as a core.
> >> >>>>>> Also the vision I think we all share is that Airflow is not a "standalone orchestrator" any more - due to its popularity, reach and power, it became an "orchestrating platform" and this is the vision that keeps us - maintainers - busy.
> >> >>>>>>
> >> >>>>>> Over the last 2 years pretty much everything we have done makes Airflow "more extensible". You can add custom "secrets managers", "timetables", "deferrers", etc. "Customizability" is now built in and is a "theme" of being a modern platform.
> >> >>>>>> Hell - we even recently added the "Airflow Provider" trove classifier in PyPI: https://pypi.org/search/?c=Framework+%3A%3A+Apache+Airflow+%3A%3A+Provider and the main justification in the discussion was that we expect MORE 3rd-parties to use it, rather than relying on the "apache-airflow-provider" package name.
> >> >>>>>> So from the maintainer POV - having 3rd-party providers as "extensions" to Airflow makes perfect sense and is the way to go.
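
As a rough illustration of the classifier-based discovery mentioned above, the PyPI search link can be rebuilt from the classifier string. This is only a minimal sketch: the function name `provider_search_url` is hypothetical, and only the base URL and the classifier text are taken from the message.

```python
from urllib.parse import urlencode

# Trove classifier named in the message above.
PROVIDER_CLASSIFIER = "Framework :: Apache Airflow :: Provider"


def provider_search_url(classifier: str = PROVIDER_CLASSIFIER) -> str:
    """Build a PyPI search URL that filters packages by a trove classifier.

    Hypothetical helper, for illustration only.
    """
    # urlencode turns spaces into "+" and ":" into "%3A".
    return "https://pypi.org/search/?" + urlencode({"c": classifier})


if __name__ == "__main__":
    print(provider_search_url())
```

A 3rd-party provider would presumably list the same classifier string in its own package metadata so that it shows up in this search.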
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> From 2) User POV
> >> >>>>>>
> >> >>>>>> Users want to use Airflow with all the integrations they use together. But only with those that they actually use. Similarly to maintainers - supporting and needing all 70+ providers is something they usually do not REALLY care about.
> >> >>>>>> They literally care about the few providers they use. We even taught the users that they can upgrade and install providers separately from the core. So they already know they can mix and match Airflow + Providers to get what they want.
> >> >>>>>>
> >> >>>>>> And they do use it - even if they use our image, the image only contains a handful of the providers, and when they need to install new providers - they just install them from PyPI. And for that, the difference between "community providers" and 3rd-party providers - except the stamp of approval of the ASF - is not really visible.
> >> >>>>>> Surely they can use [extras] to install the providers, but that is just a convenience and is definitely not needed by the users.
> >> >>>>>> For example, when they build a custom image they usually extend the Airflow image and simply 'pip install <PROVIDER>'.
> >> >>>>>> As long as someone makes sure that the provider can be installed on certain versions of Airflow - it does not matter.
> >> >>>>>>
> >> >>>>>> Also, from the users' perspective Airflow has become "popular" enough that it no longer needs "more integrations" to be more "appealing" for the users.
> >> >>>>>> They already use Airflow. They like it (hopefully) and the fact that this or that provider is part of the community makes no difference any more.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> From 3) "Service providers" POV
> >> >>>>>>
> >> >>>>>> Here I am not sure. It's not very clear what service providers get from being part of the "community providers".
> >> >>>>>>
> >> >>>>>> I hear that some big services (cloud providers) find it cool that we give them the ASF "Stamp of Approval". And they are willing to pay the price of a slower merge process, dependence on the community and following the strict rules of the ASF.
> >> >>>>>> And the community is also happy to pay the price of maintaining those (including the dependencies which Elad mentioned) to make sure that all the community providers work in concert - because those "Services" are hugely popular and we "want" as a community to invest there.
> >> >>>>>> But maintaining those deps in sync is a huge effort and it will become even worse the more we add. On the other hand, for 3rd-party providers it will be EASIER to keep up.
> >> >>>>>> They don't have to care about all the community providers working together, they can choose a subset. And when they release their libraries they can take care of making sure the dependencies are not broken.
> >> >>>>>>
> >> >>>>>> There are other "drawbacks" for being a "community" provider. For example we have the rule that we support the min-Airflow version for providers from the community 12 months after Airflow release.
> >> >>>>>> This means that users of Airflow 2.1 will not receive updates for the providers after 21st of May. This is the price to pay for community-managed providers. We will not release bug fixes in providers or changes for Airflow 2.1 users after 21st of May.
> >> >>>>>> But if you manage your own provider - you still can support 2.0 or even 1.10 if you want.
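
The 12-month rule described above is simple date arithmetic, sketched below. The helper names are hypothetical, and the Airflow 2.1 release date of 2021-05-21 is an assumption inferred from the "21st of May" cutoff in the message, not something the message states directly.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    # Clamp the day so that e.g. 31 Jan + 1 month lands on the last day of Feb.
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def community_support_cutoff(airflow_release: date, window_months: int = 12) -> date:
    """Date after which community providers may stop supporting a release.

    Hypothetical helper; the 12-month window is the rule quoted in the message.
    """
    return add_months(airflow_release, window_months)


if __name__ == "__main__":
    # Assumed release date of Airflow 2.1 (see the note above).
    print(community_support_cutoff(date(2021, 5, 21)))
```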
> >> >>>>>>
> >> >>>>>> I cannot really see why a Service Provider would want to become an Airflow Community Provider.
> >> >>>>>>
> >> >>>>>> And I am not really sure what Flyte, Delta Sharing, Versatile Data Kit, and Cloudera people think and why they think this is the best choice.
> >> >>>>>>
> >> >>>>>> I think when we understand what the "Service Providers" want to achieve this way, maybe we will be able to come up with some middle ground and at least set some rules when it makes sense and when it does not make sense.
> >> >>>>>> Maybe 'contributing provider' is the way to achieve something else and we simply do not realize that in the new "Airflow as a Platform" world, all the stakeholders can achieve very similar results using different approaches.
> >> >>>>>>
> >> >>>>>> * For example we could think about how we can make it easier for Airflow users to discover and install their providers - without actually taking ownership of the code by the community.
> >> >>>>>> * Or maybe we could introduce a tool to make a 3rd-party provider pass a "compliance check" as suggested above
> >> >>>>>> * Or maybe we could introduce a "breeze" extension to be able to install and test a provider in the "latest airflow" so that the service providers could check it before we even release airflow and dependencies
> >> >>>>>>
> >> >>>>>> So what I think we really need - Alex, Samhita, Andon, Philippe (I think) - could you tell us (every one of you separately) - what were your goals when you came up with the "contribute the new provider" idea?
> >> >>>>>>
> >> >>>>>> J.
> >> >>>>>>
> >> >>>>>> On Wed, Apr 6, 2022 at 11:51 PM Elad Kalif <el...@apache.org> wrote:
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> Ash, what is your recommendation for the users, should we follow your suggestion?
> >> >>>>>>> This means that the big big big joy of using airflow constraints and getting a working environment with all required providers will be no more.
> >> >>>>>>> So users will get a working "Vanilla" Airflow and then will need to figure out how they are going to tackle independent providers that may not be able to coexist one with another.
> >> >>>>>>> This means that users will need to create their own constraints mechanism and maintain it.
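
For context, the constraints mechanism referred to above pins a known-good set of dependency versions per Airflow and Python version. The sketch below builds the corresponding pip command; the helper names are hypothetical, and the URL pattern is the one published in Airflow's installation instructions as far as I am aware, so treat it as an assumption to be checked against the current docs.

```python
from typing import List, Sequence


def constraints_url(airflow_version: str, python_version: str) -> str:
    """URL of the community constraints file (pattern assumed from the Airflow docs)."""
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


def pip_install_args(
    airflow_version: str, python_version: str, providers: Sequence[str] = ()
) -> List[str]:
    """Build a pip command installing Airflow plus providers under the constraints."""
    return [
        "pip",
        "install",
        f"apache-airflow=={airflow_version}",
        *providers,
        "--constraint",
        constraints_url(airflow_version, python_version),
    ]


if __name__ == "__main__":
    # Example values only; any Airflow/Python/provider combination can be passed.
    print(" ".join(pip_install_args("2.3.0", "3.8", ["apache-airflow-providers-google"])))
```

A provider that lives outside the community repo is not covered by that constraints file, which is the gap being pointed out here.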
> >> >>>>>>>
> >> >>>>>>> From my perspective this increases the complexity of getting Airflow to be production ready.
> >> >>>>>>> I know that we say providers vs core, but I think that from the users' perspective providers are an integral part of Airflow.
> >> >>>>>>> Having the best scheduler and the best UI is not enough. Providers are a crucial part that completes the set.
> >> >>>>>>>
> >> >>>>>>> Maybe eventually there should be something like a provider store where there can be official providers and 3rd party providers.
> >> >>>>>>>
> >> >>>>>>> This may be an even greater discussion than the one we are having here. It feels more like Airflow as a product vs Airflow as an ecosystem.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Thu, Apr 7, 2022 at 12:27 AM Collin McNulty <co...@astronomer.io.invalid> wrote:
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> I agree with Ash and Tomasz. If it were not for the history, I think in an ideal world even the providers currently part of the Airflow repo would be managed separately. (I'm not actually suggesting removing any providers.) I don't think it's a matter of gatekeeping, I just think it's actually kind of odd to have providers in the same repo as core Airflow, and it increases confusion about Airflow versions vs provider package versions.
> >> >>>>>>>>
> >> >>>>>>>> Collin McNulty
> >> >>>>>>>>
> >> >>>>>>>> On Wed, Apr 6, 2022 at 4:21 PM Tomasz Urbaszek <tu...@apache.org> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> I’m leaning toward Ash's approach. Having providers maintain the packages may streamline many aspects for providers/companies.
> >> >>>>>>>>>
> >> >>>>>>>>> 1. They are owners so they can merge and release whenever they need.
> >> >>>>>>>>> 2. It’s easier for them to add E2E tests and manage the resources needed for running them.
> >> >>>>>>>>> 3. The development of the package can be incorporated into their company processes - not every company is used to OSS mode.
> >> >>>>>>>>>
> >> >>>>>>>>> Whatever way we go - we should have some basic guidelines and requirements (for example to brand a provider as “recommended by community” or something).
> >> >>>>>>>>>
> >> >>>>>>>>> Cheers,
> >> >>>>>>>>> Tomsk
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> With best wishes,                    Alex Ott
> >> >>>> http://alexott.net/
> >> >>>> Twitter: alexott_en (English), alexott (Russian)