Posted to dev@airflow.apache.org by Jarek Potiuk <Ja...@polidea.com> on 2020/09/09 09:01:48 UTC

Splitting 2.0 into provider packages/dynamic provider discovery

This is a proposal on how we can address dynamic provider discovery in
Airflow 2.0 and 1.10.13 as well.

At the meeting on Monday, we agreed that Airflow 2.0 will be released
using a mechanism based on what we have for backport packages. One of
the problems to solve was the dynamic discovery of packages and some
sort of "dependency injection" of providers into core.

It was mainly about making provider-specific connections
"discoverable" by the core after installing a provider package. We
discussed that we could use a plugin mechanism, but Ash commented that
using plugins might introduce conflicts in dependencies so I went a
different route.

I prepared a POC showing what it could look like.

It is very much WIP, but I have already tested it in various scenarios
(airflow 2.0 from package and from sources, airflow 1.10 using backport
packages, airflow 1.10 using the "providers" symbolic link workaround)
and it seems to handle all the cases. There are some things we will
need to change a bit, and some more "dynamic" parts that I have not
added yet, but it all seems to be doable and easy.

The current WIP is here:

* Master/2.0 -> https://github.com/apache/airflow/pull/10822
* 1.10 backport -> https://github.com/apache/airflow/pull/10823

There are a few things to note:

- it is very fast - on my PC the delay introduced is well under a
second - because I limit the search to sub-packages of
"airflow.providers". For now the limitation is that all the providers
must be installed in the same "path" (but maybe with Ash's help we can
get this one solved). pkgutil's "walk_packages" imports all the
packages (but not modules!) it walks through, so not walking the whole
"airflow" directory is a huge performance boost (walking it takes
several seconds because our __init__.py files load pretty much
everything in airflow).

- we could release a new wave of backport packages soon and implement
changes in 1.10.13 so that all those "dynamic features" of the
backport packages will be available to 1.10 users (that would solve
for example this problem
https://github.com/apache/airflow/issues/10783). Those packages will
not be dynamically discovered on 1.10.12 but then in 1.10.13 they will
work nicely if we make the changes.

- currently, packages that are one level "deeper" (apache/NN or
microsoft/azure) are not discoverable until we create __init__.py in
the parent packages - but it can be solved for sure (Ash - looking for
your help here :) ).

- I have added some additional meta-data to the provider-info.py file
that allows us to do a very nice thing - we could add a UI page and
CLI command to display information about installed (and discovered)
provider packages. I think that would be super-useful. I already
return a dictionary ready to be rendered in the get_provider_info()
method.
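
To make the discovery idea above a bit more concrete, here is a
minimal sketch of walking only the sub-packages of one root package
with pkgutil.walk_packages and collecting a metadata dictionary from
each. It is not the code from the PRs. The names "demo_providers",
"provider_info" and "discover_providers" are made up for illustration
(the module name uses an underscore so that it is importable), and the
package tree is generated in a temporary directory so the example runs
without Airflow installed.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def discover_providers(root_package: str) -> dict:
    """Return {package_name: info_dict} for sub-packages of root_package.

    Hypothetical helper: it assumes each provider package ships a
    `provider_info` module exposing `get_provider_info()`.
    """
    root = importlib.import_module(root_package)
    found = {}
    # walk_packages imports every *package* it descends into, so the walk
    # is limited to root.__path__ instead of the whole distribution.
    for module_info in pkgutil.walk_packages(root.__path__, prefix=root_package + "."):
        if not module_info.ispkg:
            continue  # plain modules are listed but are not providers
        try:
            info_module = importlib.import_module(module_info.name + ".provider_info")
        except ModuleNotFoundError:
            continue  # package without provider metadata
        found[module_info.name] = info_module.get_provider_info()
    return found


def _make_demo_tree(base: Path) -> None:
    """Create a throwaway package tree: demo_providers/{alpha,beta}."""
    for name in ("alpha", "beta"):
        pkg = base / "demo_providers" / name
        pkg.mkdir(parents=True)
        (pkg / "__init__.py").write_text("")
        (pkg / "provider_info.py").write_text(
            "def get_provider_info():\n"
            f"    return {{'name': '{name}', 'version': '0.0.1'}}\n"
        )
    (base / "demo_providers" / "__init__.py").write_text("")


tmp_dir = tempfile.mkdtemp()
_make_demo_tree(Path(tmp_dir))
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()  # make sure the freshly written files are seen

providers = discover_providers("demo_providers")
for package_name, info in sorted(providers.items()):
    print(package_name, info)
```

A dictionary like the one returned here is the kind of thing a CLI
command or UI page could render. Note that in this sketch a directory
without __init__.py is not reported as a package by the walk, which is
the same limitation as the one mentioned above for the "deeper"
packages.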

I have not implemented everything - this is just a POC to show that it
is possible and to start the discussion on how to improve it. I am
sure there are still some things left that I have not found yet :).

There are a few things missing for sure:

- I have not yet solved the "javascript" changes - it can surely be
done with some clever template modifications and by generating
javascript sources - but I am looking for help here from people who
are more familiar with flask/javascript and the UI part.

- we also need to add extra links in a similar way.

Happy to get comments, suggestions, improvements (and help!) - either
here or directly in the PRs.

J.





-- 
Jarek Potiuk
Polidea | Principal Software Engineer

M: +48 660 796 129