You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@airflow.apache.org by Jarek Potiuk <Ja...@polidea.com> on 2019/11/22 12:06:52 UTC

Using implicit namespace packages in Airflow (removing __init__.py where empty)

There is one implementation detail in AIP-21 that I continue to have
questions about, and I wanted to ask community about it.

Since we moved to Python 3, we have the option of using implicit namespace
packages.
https://www.python.org/dev/peps/pep-0420/ . We have now grouping per folder
for services, so it would require a lot more __init__.py files if we
continue using them.

The implicit naming boils down to not requiring __init__.py files if no
initialisation of package is needed. We could do it consistently for all
code which is "internal" to airflow. This means that most packages will not
have __init__.py there will be few exceptions like 'airflow',
'airflow.modules' and likely few others with the __init__.py.

I did a quick check and we have only 25 _init_.py with some logic -
remaining ~ 140 could be removed as they only contain comments.

find . -name '__init__.py' | xargs grep -le '^[^#].*'  | wc -l
      25
find . -name '__init__.py' |wc -l
     164

Those are the files that will be left:
./tests/__init__.py
./tests/utils/log/elasticmock/__init__.py
./tests/utils/log/elasticmock/utilities/__init__.py
./tests/models/__init__.py
./tests/task/__init__.py
./airflow/sensors/__init__.py
./airflow/operators/__init__.py
./airflow/_vendor/nvd3/__init__.py
./airflow/_vendor/slugify/__init__.py
./airflow/serialization/__init__.py
./airflow/__init__.py
./airflow/models/__init__.py
./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
./airflow/ti_deps/deps/__init__.py
./airflow/macros/__init__.py
./airflow/executors/__init__.py
./airflow/lineage/__init__.py
./airflow/lineage/backend/__init__.py
./airflow/lineage/backend/atlas/__init__.py
./airflow/hooks/__init__.py
./airflow/task/task_runner/__init__.py
./airflow/api/__init__.py
./airflow/api/common/experimental/__init__.py
./airflow/api/client/__init__.py
./airflow/jobs/__init__.py

WDYT?

J

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Jarek Potiuk <Ja...@polidea.com>.
The tests in CI were failing - you can see the failures here:
https://travis-ci.org/potiuk/airflow/jobs/647748452

Regarding the non-implicit packages and sequence of installation: from my
tests it does not depend on the sequence of the installation.  I manually
filtered "airflow" package from installation in the provider's package.
This means that provider's package does not contain "airflow/__init__.py".

But to be on the safe side, I also added "apache-airflow~1.10.*" as
dependency to the "apache-airlfow-providers" package. It's impossible to
install "providers" before "apache-airlfow" from 1.10. series because it is
a dependency. I tested both cases that are valid and both work fine:

   - installing "apache-airflow" first and then "apache-airflow-providers"
   (this way you can install "providers" for any 1.10.* version of
   apache-airflow
   - installing just "apacher-airflow-providers" on a clean system - it
   automatically pulls and installs "apache-airflow==1.10.9" when installed

J.

On Sun, Feb 9, 2020 at 10:11 PM Ash Berlin-Taylor <as...@apache.org> wrote:

> And also how it failed when making it an implicit package? I did do some
> quick tests before suggesting it earlier in this thread and pytest and mypy
> ran seemingly without error (but I did only do very quick tests)
>
> My understanding is that come pip install time without implicit packages
> that the order in which modules get installed can change/break the
> behaviour.
> -a
> On Feb 9 2020, at 9:09 pm, Kamil Breguła <ka...@polidea.com>
> wrote:
> > Can you show some code snippet? I am curious about this solution.
> >
> > On Sun, Feb 9, 2020 at 10:06 PM Jarek Potiuk <Ja...@polidea.com>
> wrote:
> > >
> > > I have bad news and good news :)
> > > Bad news - just making "providers" package an implicit one does not
> work -
> > > when you mix implicit and explicit packages there are some problems.
> Tools
> > > like pytest, mypy, pylint do not handle this well and think that
> "google",
> > > "amazon" are top-level packages when analysing the sources.
> > >
> > > Good news - I managed to get providers packages for backporting work
> > > without having to change "providers" to implicit package. It was pretty
> > > much just a matter of filtering found packages in setup.py - I am
> sending
> > > a PR shortly to enable that. I also managed to use breeze environment
> to
> > > test the backported packages with system tests (at least for GCP
> operators
> > > where we have system test).
> > >
> > > We can therefore close the subject of implicit packages. We do not need
> > > them for now. Later we might want to convert to implicit packages but
> this
> > > is something that might never happen.
> > >
> > > J,
> > >
> > >
> > >
> > >
> > > On Wed, Feb 5, 2020 at 2:40 PM Jarek Potiuk <Ja...@polidea.com>
> > > wrote:
> > >
> > > > Yep. Learning from the bigger move I will see how small is the
> "smaller"
> > > > one I would love to see and can make informed decision.
> > > >
> > > > On Wed, Feb 5, 2020 at 1:35 PM Ash Berlin-Taylor <as...@apache.org>
> wrote:
> > > > > Ah right okay, not an impossible task then.
> > > > > Still, I am very strongly in favour of the smallest change for the
> goal
> > > > > (which is to allow back-ported providers to be installed on
> 1.10.x) - both
> > > > > cos then most things are more normal, including code under a
> provider, and
> > > > > then that way we don't have to fight our tools.
> > > > > On Feb 5 2020, at 12:08 pm, Jarek Potiuk <Jarek.Potiuk@polidea.com
> >
> > > > > wrote:
> > > > > > >
> > > > > > > If we make every namespace under airflow.* implict, then we
> have to
> > > > > ensure
> > > > > > > that our modules do not clash with _any and every_ possible
> top level
> > > > > > > package that exists in PyPi, because we cannot know what might
> be
> > > > > >
> > > > >
> > > > > installed
> > > > > > > by our users. That seems like an impossible feat. Have I
> > > > > >
> > > > >
> > > > > mis-understood?
> > > > > > > If I haven't mis-understoond then I this is I think a strong
> argument
> > > > > >
> > > > >
> > > > > for
> > > > > > > _not_ making more than we absolutely have to an implicit
> namepsace.
> > > > > > >
> > > > > >
> > > > > > Nope. This is a problem for MyPy and Pytest only so mostly for
> our
> > > > > tooling.
> > > > > > And only if our own modules import packages with the same name
> as the
> > > > > > module. So we are in full control here.
> > > > > >
> > > > > > >
> > > > > >
> > > > > > > -a
> > > > > >
> > > > > > --
> > > > > > Jarek Potiuk
> > > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > > >
> > > > > > M: +48 660 796 129 <+48660796129>
> > > > > > [image: Polidea] <https://www.polidea.com/>
> > > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129 <+48660796129>
> > > > [image: Polidea] <https://www.polidea.com/>
> > > >
> > > >
> > >
> > > --
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > > [image: Polidea] <https://www.polidea.com/>
> >
> >
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Ash Berlin-Taylor <as...@apache.org>.
And also how it failed when making it an implicit package? I did do some quick tests before suggesting it earlier in this thread and pytest and mypy ran seemingly without error (but I did only do very quick tests)

My understanding is that come pip install time without implicit packages that the order in which modules get installed can change/break the behaviour.
-a
On Feb 9 2020, at 9:09 pm, Kamil Breguła <ka...@polidea.com> wrote:
> Can you show some code snippet? I am curious about this solution.
>
> On Sun, Feb 9, 2020 at 10:06 PM Jarek Potiuk <Ja...@polidea.com> wrote:
> >
> > I have bad news and good news :)
> > Bad news - just making "providers" package an implicit one does not work -
> > when you mix implicit and explicit packages there are some problems. Tools
> > like pytest, mypy, pylint do not handle this well and think that "google",
> > "amazon" are top-level packages when analysing the sources.
> >
> > Good news - I managed to get providers packages for backporting work
> > without having to change "providers" to implicit package. It was pretty
> > much just a matter of filtering found packages in setup.py - I am sending
> > a PR shortly to enable that. I also managed to use breeze environment to
> > test the backported packages with system tests (at least for GCP operators
> > where we have system test).
> >
> > We can therefore close the subject of implicit packages. We do not need
> > them for now. Later we might want to convert to implicit packages but this
> > is something that might never happen.
> >
> > J,
> >
> >
> >
> >
> > On Wed, Feb 5, 2020 at 2:40 PM Jarek Potiuk <Ja...@polidea.com>
> > wrote:
> >
> > > Yep. Learning from the bigger move I will see how small is the "smaller"
> > > one I would love to see and can make informed decision.
> > >
> > > On Wed, Feb 5, 2020 at 1:35 PM Ash Berlin-Taylor <as...@apache.org> wrote:
> > > > Ah right okay, not an impossible task then.
> > > > Still, I am very strongly in favour of the smallest change for the goal
> > > > (which is to allow back-ported providers to be installed on 1.10.x) - both
> > > > cos then most things are more normal, including code under a provider, and
> > > > then that way we don't have to fight our tools.
> > > > On Feb 5 2020, at 12:08 pm, Jarek Potiuk <Ja...@polidea.com>
> > > > wrote:
> > > > > >
> > > > > > If we make every namespace under airflow.* implict, then we have to
> > > > ensure
> > > > > > that our modules do not clash with _any and every_ possible top level
> > > > > > package that exists in PyPi, because we cannot know what might be
> > > > >
> > > >
> > > > installed
> > > > > > by our users. That seems like an impossible feat. Have I
> > > > >
> > > >
> > > > mis-understood?
> > > > > > If I haven't mis-understoond then I this is I think a strong argument
> > > > >
> > > >
> > > > for
> > > > > > _not_ making more than we absolutely have to an implicit namepsace.
> > > > > >
> > > > >
> > > > > Nope. This is a problem for MyPy and Pytest only so mostly for our
> > > > tooling.
> > > > > And only if our own modules import packages with the same name as the
> > > > > module. So we are in full control here.
> > > > >
> > > > > >
> > > > >
> > > > > > -a
> > > > >
> > > > > --
> > > > > Jarek Potiuk
> > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > >
> > > > > M: +48 660 796 129 <+48660796129>
> > > > > [image: Polidea] <https://www.polidea.com/>
> > > > >
> > > >
> > > >
> > >
> > > --
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > > [image: Polidea] <https://www.polidea.com/>
> > >
> > >
> >
> > --
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
>
>


Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Jarek Potiuk <Ja...@polidea.com>.
Separate message coming soon on the devlist :)

On Sun, Feb 9, 2020 at 10:09 PM Kamil Breguła <ka...@polidea.com>
wrote:

> Can you show some code snippet? I am curious about this solution.
>
> On Sun, Feb 9, 2020 at 10:06 PM Jarek Potiuk <Ja...@polidea.com>
> wrote:
> >
> > I have bad news and good news :)
> >
> > Bad news - just making "providers" package an implicit one does not work
> -
> > when you mix implicit and explicit packages there are some problems.
> Tools
> > like pytest, mypy, pylint do not handle this well and think that
> "google",
> > "amazon" are top-level packages when analysing the sources.
> >
> > Good news - I managed  to get providers packages for backporting work
> > without having to change "providers" to implicit package. It was pretty
> > much just a matter of filtering found packages in setup.py  - I am
> sending
> > a PR shortly to enable that. I also managed to use breeze environment to
> > test the backported packages with system tests (at least for GCP
> operators
> > where we have system test).
> >
> > We can therefore close the subject of implicit packages. We do not need
> > them for now. Later we might want to convert to implicit packages but
> this
> > is something that might never happen.
> >
> > J,
> >
> >
> >
> >
> >
> > On Wed, Feb 5, 2020 at 2:40 PM Jarek Potiuk <Ja...@polidea.com>
> > wrote:
> >
> > > Yep. Learning from the bigger move I will see how small is the
> "smaller"
> > > one I would love to see and can make informed decision.
> > >
> > > On Wed, Feb 5, 2020 at 1:35 PM Ash Berlin-Taylor <as...@apache.org>
> wrote:
> > >
> > >> Ah right okay, not an impossible task then.
> > >>
> > >> Still, I am very strongly in favour of the smallest change for the
> goal
> > >> (which is to allow back-ported providers to be installed on 1.10.x) -
> both
> > >> cos then most things are more normal, including code under a
> provider, and
> > >> then that way we don't have to fight our tools.
> > >> On Feb 5 2020, at 12:08 pm, Jarek Potiuk <Ja...@polidea.com>
> > >> wrote:
> > >> > >
> > >> > > If we make every namespace under airflow.* implict, then we have
> to
> > >> ensure
> > >> > > that our modules do not clash with _any and every_ possible top
> level
> > >> > > package that exists in PyPi, because we cannot know what might be
> > >> installed
> > >> > > by our users. That seems like an impossible feat. Have I
> > >> mis-understood?
> > >> > > If I haven't mis-understoond then I this is I think a strong
> argument
> > >> for
> > >> > > _not_ making more than we absolutely have to an implicit
> namepsace.
> > >> > >
> > >> >
> > >> > Nope. This is a problem for MyPy and Pytest only so mostly for our
> > >> tooling.
> > >> > And only if our own modules import packages with the same name as
> the
> > >> > module. So we are in full control here.
> > >> >
> > >> > >
> > >> >
> > >> > > -a
> > >> > >
> > >> >
> > >> > --
> > >> > Jarek Potiuk
> > >> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >> >
> > >> > M: +48 660 796 129 <+48660796129>
> > >> > [image: Polidea] <https://www.polidea.com/>
> > >> >
> > >>
> > >>
> > >
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > > [image: Polidea] <https://www.polidea.com/>
> > >
> > >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Kamil Breguła <ka...@polidea.com>.
Can you show some code snippet? I am curious about this solution.

On Sun, Feb 9, 2020 at 10:06 PM Jarek Potiuk <Ja...@polidea.com> wrote:
>
> I have bad news and good news :)
>
> Bad news - just making "providers" package an implicit one does not work -
> when you mix implicit and explicit packages there are some problems. Tools
> like pytest, mypy, pylint do not handle this well and think that "google",
> "amazon" are top-level packages when analysing the sources.
>
> Good news - I managed  to get providers packages for backporting work
> without having to change "providers" to implicit package. It was pretty
> much just a matter of filtering found packages in setup.py  - I am sending
> a PR shortly to enable that. I also managed to use breeze environment to
> test the backported packages with system tests (at least for GCP operators
> where we have system test).
>
> We can therefore close the subject of implicit packages. We do not need
> them for now. Later we might want to convert to implicit packages but this
> is something that might never happen.
>
> J,
>
>
>
>
>
> On Wed, Feb 5, 2020 at 2:40 PM Jarek Potiuk <Ja...@polidea.com>
> wrote:
>
> > Yep. Learning from the bigger move I will see how small is the "smaller"
> > one I would love to see and can make informed decision.
> >
> > On Wed, Feb 5, 2020 at 1:35 PM Ash Berlin-Taylor <as...@apache.org> wrote:
> >
> >> Ah right okay, not an impossible task then.
> >>
> >> Still, I am very strongly in favour of the smallest change for the goal
> >> (which is to allow back-ported providers to be installed on 1.10.x) - both
> >> cos then most things are more normal, including code under a provider, and
> >> then that way we don't have to fight our tools.
> >> On Feb 5 2020, at 12:08 pm, Jarek Potiuk <Ja...@polidea.com>
> >> wrote:
> >> > >
> >> > > If we make every namespace under airflow.* implict, then we have to
> >> ensure
> >> > > that our modules do not clash with _any and every_ possible top level
> >> > > package that exists in PyPi, because we cannot know what might be
> >> installed
> >> > > by our users. That seems like an impossible feat. Have I
> >> mis-understood?
> >> > > If I haven't mis-understoond then I this is I think a strong argument
> >> for
> >> > > _not_ making more than we absolutely have to an implicit namepsace.
> >> > >
> >> >
> >> > Nope. This is a problem for MyPy and Pytest only so mostly for our
> >> tooling.
> >> > And only if our own modules import packages with the same name as the
> >> > module. So we are in full control here.
> >> >
> >> > >
> >> >
> >> > > -a
> >> > >
> >> >
> >> > --
> >> > Jarek Potiuk
> >> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >> >
> >> > M: +48 660 796 129 <+48660796129>
> >> > [image: Polidea] <https://www.polidea.com/>
> >> >
> >>
> >>
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
> >
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Jarek Potiuk <Ja...@polidea.com>.
I have bad news and good news :)

Bad news - just making "providers" package an implicit one does not work -
when you mix implicit and explicit packages there are some problems. Tools
like pytest, mypy, pylint do not handle this well and think that "google",
"amazon" are top-level packages when analysing the sources.

Good news - I managed  to get providers packages for backporting work
without having to change "providers" to implicit package. It was pretty
much just a matter of filtering found packages in setup.py  - I am sending
a PR shortly to enable that. I also managed to use breeze environment to
test the backported packages with system tests (at least for GCP operators
where we have system test).

We can therefore close the subject of implicit packages. We do not need
them for now. Later we might want to convert to implicit packages but this
is something that might never happen.

J,





On Wed, Feb 5, 2020 at 2:40 PM Jarek Potiuk <Ja...@polidea.com>
wrote:

> Yep. Learning from the bigger move I will see how small is the "smaller"
> one I would love to see and can make informed decision.
>
> On Wed, Feb 5, 2020 at 1:35 PM Ash Berlin-Taylor <as...@apache.org> wrote:
>
>> Ah right okay, not an impossible task then.
>>
>> Still, I am very strongly in favour of the smallest change for the goal
>> (which is to allow back-ported providers to be installed on 1.10.x) - both
>> cos then most things are more normal, including code under a provider, and
>> then that way we don't have to fight our tools.
>> On Feb 5 2020, at 12:08 pm, Jarek Potiuk <Ja...@polidea.com>
>> wrote:
>> > >
>> > > If we make every namespace under airflow.* implict, then we have to
>> ensure
>> > > that our modules do not clash with _any and every_ possible top level
>> > > package that exists in PyPi, because we cannot know what might be
>> installed
>> > > by our users. That seems like an impossible feat. Have I
>> mis-understood?
>> > > If I haven't mis-understoond then I this is I think a strong argument
>> for
>> > > _not_ making more than we absolutely have to an implicit namepsace.
>> > >
>> >
>> > Nope. This is a problem for MyPy and Pytest only so mostly for our
>> tooling.
>> > And only if our own modules import packages with the same name as the
>> > module. So we are in full control here.
>> >
>> > >
>> >
>> > > -a
>> > >
>> >
>> > --
>> > Jarek Potiuk
>> > Polidea <https://www.polidea.com/> | Principal Software Engineer
>> >
>> > M: +48 660 796 129 <+48660796129>
>> > [image: Polidea] <https://www.polidea.com/>
>> >
>>
>>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Jarek Potiuk <Ja...@polidea.com>.
Yep. Learning from the bigger move I will see how small is the "smaller"
one I would love to see and can make informed decision.

On Wed, Feb 5, 2020 at 1:35 PM Ash Berlin-Taylor <as...@apache.org> wrote:

> Ah right okay, not an impossible task then.
>
> Still, I am very strongly in favour of the smallest change for the goal
> (which is to allow back-ported providers to be installed on 1.10.x) - both
> cos then most things are more normal, including code under a provider, and
> then that way we don't have to fight our tools.
> On Feb 5 2020, at 12:08 pm, Jarek Potiuk <Ja...@polidea.com> wrote:
> > >
> > > If we make every namespace under airflow.* implict, then we have to
> ensure
> > > that our modules do not clash with _any and every_ possible top level
> > > package that exists in PyPi, because we cannot know what might be
> installed
> > > by our users. That seems like an impossible feat. Have I
> mis-understood?
> > > If I haven't mis-understoond then I this is I think a strong argument
> for
> > > _not_ making more than we absolutely have to an implicit namepsace.
> > >
> >
> > Nope. This is a problem for MyPy and Pytest only so mostly for our
> tooling.
> > And only if our own modules import packages with the same name as the
> > module. So we are in full control here.
> >
> > >
> >
> > > -a
> > >
> >
> > --
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Ash Berlin-Taylor <as...@apache.org>.
Ah right okay, not an impossible task then.

Still, I am very strongly in favour of the smallest change for the goal (which is to allow back-ported providers to be installed on 1.10.x) - both cos then most things are more normal, including code under a provider, and then that way we don't have to fight our tools.
On Feb 5 2020, at 12:08 pm, Jarek Potiuk <Ja...@polidea.com> wrote:
> >
> > If we make every namespace under airflow.* implict, then we have to ensure
> > that our modules do not clash with _any and every_ possible top level
> > package that exists in PyPi, because we cannot know what might be installed
> > by our users. That seems like an impossible feat. Have I mis-understood?
> > If I haven't mis-understoond then I this is I think a strong argument for
> > _not_ making more than we absolutely have to an implicit namepsace.
> >
>
> Nope. This is a problem for MyPy and Pytest only so mostly for our tooling.
> And only if our own modules import packages with the same name as the
> module. So we are in full control here.
>
> >
>
> > -a
> >
>
> --
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>


Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Jarek Potiuk <Ja...@polidea.com>.
>
> If we make every namespace under airflow.* implict, then we have to ensure
> that our modules do not clash with _any and every_ possible top level
> package that exists in PyPi, because we cannot know what might be installed
> by our users. That seems like an impossible feat. Have I mis-understood?
> If I haven't mis-understoond then I this is I think a strong argument for
> _not_ making more than we absolutely have to an implicit namepsace.
>

Nope. This is a problem for MyPy and Pytest only so mostly for our tooling.
And only if our own modules import packages with the same name as the
module. So we are in full control here.

>

> -a
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Ash Berlin-Taylor <as...@apache.org>.
On Feb 5 2020, at 11:59 am, Jarek Potiuk <Ja...@polidea.com> wrote:
> So I would like that we - as a community - make an informed decision about
> it without over-simplifications and biases. I have a kind request to those
> interested in AIP-21 - can you please take a look and comment? I would not
> like to be held by this for a long time, but I am super happy to discuss it
> within the next few days and make voting about it (and in the meantime I
> will test the "simpler" approach with just removing 1 file (and everything
> that follows) and see if it works. We will then know more about what kind
> and size of change we are comparing it to.
>

If we make every namespace under airflow.* implict, then we have to ensure that our modules do not clash with _any and every_ possible top level package that exists in PyPi, because we cannot know what might be installed by our users. That seems like an impossible feat. Have I mis-understood?
If I haven't mis-understoond then I this is I think a strong argument for _not_ making more than we absolutely have to an implicit namepsace.

-a


Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Jarek Potiuk <Ja...@polidea.com>.
I don't think such quick assessment on a "global level" with the size of
change is complete or even correct - the alternative is more than "1 file"
change.

I think we should all try to understand what is changing, the reasoning and
consequences, and not only to look at the total numbers. Particularly I do
not think "size" of change is the only or even most important criteria in
case of such refactor. It's one of the criteria IMHO, but there are others
- one of them is being forward-compatible as well and preventing/limiting
potential backwards-incompatible changes in the future.

So I would like that we - as a community - make an informed decision about
it without over-simplifications and biases. I have a kind request to those
interested in AIP-21 - can you please take a look and comment? I would not
like to be held by this for a long time, but I am super happy to discuss it
within the next few days and make voting about it (and in the meantime I
will test the "simpler" approach with just removing 1 file (and everything
that follows) and see if it works. We will then know more about what kind
and size of change we are comparing it to.

And if - as a community - we come to the conclusion that it's better to
make smaller change (and we prove that it works) - I am happy to follow
this.

Firstly - just to be precise about stats of the change:

   - it removes 389 files (__init__.py files that contain only commented
   license information) - we would not have to worry about those in the future
   - it moves 296 file that just have been moved in AIP-21 and nobody
   started using them yet. There is no problem with moving them again.
   - it updates 19 files in 'contrib' that only contain deprecation notice
   and have just been moved in AIP-21
   - it corrects single import statements or updates test-metadata, list of
   packages, pre-commit checks, using implicit packages semantics for
   importlib  etc. in the remaining 73 files

Secondly. I would not jump into conclusion that alternative is just "remove
__init__.py" from providers. It's vast over-simplification. Even if it
works for backported packages case, there are a number of changes needed to
make it work for master. Starting from setup.py, doc generation, package
preparation. Some part of the 73 files above will have to be done anyway if
we make the 'providers' package an implicit one. And I have not yet started
changes to setup.py to prepare the backport packages - I wanted to do it
after we make sure we have the "__init__.py" question solved for AIP-21.
And I think making packages implicit will make it simpler.

My tests before tested that everything works if all packages in providers
were made implicit. I do not know - because I have not tested it - if it is
going to work if I just remove __init__.py from providers. I might test it
to see if it works, and will let you know - but in the meantime I would
love your opinion on the statements above - in case we decide to move. And
even if it works, it's not the only change - for one for example we will
have to make changes to auto-api anyway because it does not work currently
with implicit packages at all. and we might find that we have to make other
changes as well.

Thirdly: Let my add my reasoning here as well why I think we should convert
everything to implicit packages now. I think it is for the benefit of our
users and limiting future backwards incompatible changes.

I think we are making an important, backwards incompatible move now
with AIP-21 (we've agreed to that already). My tests with it have shown
that IF we start using implicit packages at some point in time, there are
certain consequences that we have to deal with sooner or later. And my
point here is that if we already know that duplicated module names might be
a problem, and that we are already started using the implicit packages  -
why should we not fix everything now - to avoid another breaking change in
the future. I think this is perfect timing to make such a move now rather
than waiting for the future. We've already made a bold move with AIP-21 and
we are just about (via backporting) to make it used by our users - so we
better make sure we do not have to make them adapt twice in the near
future.

I think it's painless now, where it will be very painful if we make it in
the future.

That's the reason why I worked on detecting and solving such problems now -
to be better prepared for the discussion (and be informed about problems
caused potentially by AIP-21 move) and to make sure that what we finally
come up as AIP-21 (with backporting) is future-proof and we will not have
change it again in the near future.

If my tests will show that we can only do providers.__init__ removal and we
can make everything works for backported packages. I am going to take a
look at it now.

But I would like that as a community we decide which road we take now. I am
happy to cast a vote on it after we hear the opinions of the community.

J.



On Wed, Feb 5, 2020 at 11:58 AM Ash Berlin-Taylor <as...@apache.org> wrote:

> There's two things I don't like: 1 the name, and that two it's more work
> than we are needed for (to what I can see) no gain. Making everything an
> implicit package touches 776 files, when 1 is all that is needed.
>
> To make airflow.providers an implicit namepsace package we need to remove
> a single file. Doing just that mypy doesn't complain about anything. From a
> few quick spot tests they still pass too.
> There is nothing to think about again as everywhere else we follow the
> normal python convention of having an __init__.py
> -ash
> On Feb 5 2020, at 10:48 am, Jarek Potiuk <Ja...@polidea.com> wrote:
> > Hey Ash,
> >
> > What do you think is the downside of changing all the packages to
> implicit
> > Except the -6000 or so useless comments in empty __init__.py ?
> >
> > J.
> > On Wed, Feb 5, 2020 at 11:34 AM Ash Berlin-Taylor <as...@apache.org>
> wrote:
> > > And to be clear: I'd rather we didn't make everything an implicit
> > > namespace package, but only the "official" extension points of
> > > airflow.providers. I cant immediately think of a reason to make
> anything
> > > else a namespace package, and limiting this to a single place makes the
> > > change smaller (tiny even?) and also easier to reason about where
> modules
> > > might be coming from.
> > >
> > > What have I not thought about?
> > > On Feb 5 2020, at 10:31 am, Ash Berlin-Taylor <as...@apache.org> wrote:
> > > > Are you talking about/test with making _ALL_ packages implicit
> > >
> > > namespaces?
> > > >
> > > > I would think that we would only make airflow.providers an implicit
> > > package, but leave all the sub packages as explicit packages: i.e.
> this:
> > > > airflow/__init__.py
> > > > airflow/utils/__init__.py
> > > > airflow/providers/ (no __init__.py)
> > > > airflow/providers/google/__init__.py
> > > >
> > > > Does limiting the implicit namespace to just airflow.providers
> address
> > > 1+2+3?
> > > > -a
> > > > On Feb 5 2020, at 10:27 am, Jarek Potiuk <Ja...@polidea.com>
> > >
> > > wrote:
> > > > > TL;DR; I am asking for opinions on some of the changes required to
> > > >
> > >
> > > introduce implicit native package support. This is the easiest way to
> make
> > > backports of master providers packages to 1.10.x.
> > > > >
> > > > >
> > > > > NOTE!: @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko
> Driesprong
> > > (mailto:fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me
> )
> > > @Kamil Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:
> > > kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:
> > > basharenslak@godatadriven.com) @Philippe Gagnon (mailto:
> > > philgagnon1@gmail.com) - there is one point 1a) below that I would
> like
> > > to get your input.
> > > > >
> > > > > As AIP-21 (import paths) import path move is done we have an
> important
> > > step to complete - we should switch to implicit "native" packages. It
> boils
> > > down to removal of empty (only with comment licences) __init__.py
> files and
> > > relying on python 3.3+ capability of finding packages without having
> to add
> > > __init_.py files. This is needed in order to get backported packaged
> to be
> > > prepared for 1.10.* series. I worked on it during the last few days -
> to
> > > make sure this can be done and I am close to have it. Close enough to
> know
> > > I can solve all the remaining problems. I wanted to check that we can
> do it
> > > for almost all the packages. I already solved most of the initial
> problems
> > > but I have some places where I think I need community opinion on the
> way we
> > > should solve them:
> > > > >
> > > > > The PR that has the changes is here:
> > > https://github.com/apache/airflow/pull/7279/files - it's not 100%
> ready
> > > yet and I want to split it into a few separate PRs so do not comment it
> > > there yet - I extracted the most important changes here and wanted to
> ask
> > > your opinion where I have doubts:
> > > > >
> > > > > 1) Module names identical to some top-level package names (for
> example
> > > email.py):
> > > > >
> > > > > Some of the module names (for example email.py) has to be changed.
> The
> > > problem with implicit packages is that if local module is named the
> same as
> > > the top level package, importing the top-level package from it does not
> > > work:
> > > > >
> > > > > email.py:
> > > > > ...
> > > > > import email <- this won't work - module imports itself.
> > > > >
> > > > > For now in my PR i renamed the modules to be airflow_<old_name>.py
> or
> > > > > <old_name>_utils.py : There are such modules (mostly hooks):
> > > > > email -> email_utils, (utils)
> > > > >
> > > > > cassandra -> airflow_cassandra
> > > > > cloudant -> airflow_cloudant
> > > > > dataproc -> airflow_dataproc
> > > > > grpc -> airlfow_grpc
> > > > > hdfs -> airflow_hdfs
> > > > > jira -> airflow_jira
> > > > > kerberos -> kerberos_security
> > > > > redis -> airflow_redis
> > > > > sendgrid -> airflow_sendgrid
> > > > > snowflake -> airflow_snowflake
> > > > > winrm -> airflow_winrm (operator)
> > > > > datadog -> airflow_datatog(sensor)
> > > > > sqlalchemy -> sqlalchemy_utils (utils)
> > > > >
> > > > >
> > > > >
> > > > > I can also change it to grpc_hooks.py, datadog_sensors.py etc. Or
> > > maybe someone knows an easy way how to keep the module name and
> implicit
> > > packages?
> > > > >
> > > > > I guess this is not a big problem to change it this way - we anyhow
> > > have changed import paths for those. The only two problems with this
> are:
> > > > > a) email was used in "email_backend" configuration : email_backend
> =
> > > >
> > >
> > > airflow.utils.email.send_email_smtp
> > > > > but we can solve it by handling that as special case, raising
> > > >
> > >
> > > deprecation warning and converting it to
> > > airflow.utils.email_utils.send_email_smtp
> > > > > b) we introduce slight inconsistency in hook/operator/sensor names.
> > > > >
> > > > > 2) Mypy fails when checking individual (duplicated) module names
> > > > > Mypy does not work well with duplicate module names in implicit
> > > packages (https://github.com/pre-commit/mirrors-mypy/issues/5) and
> with
> > > AIP-21 we have plenty of them. Repeating modules (http.py in both
> operators
> > > and hooks for example) cause it to fail in case they are provided as
> > > parameters (in case of incremental mypy check in pre-commits. I had to
> > > change it then so that mypy always check 'airflow' and 'test' packages
> > > instead - which makes pre-commit check slightly slower (few seconds).
> > > > >
> > > > > 3) _hooks.py/_operators.py/_sensors.py (
> > > http://hooks.py/_operators.py/_sensors.py) suffix: Some thought
> resulting
> > > from above 1) and 2):
> > > > >
> > > > > I did not want to open pandora's box again but I think removal of
> > > _hooks, _operators, _sensors from the module name might not have been
> the
> > > best decision. While it removes some duplication in module name, it
> > > actually introduces duplicate module names and (as it is with mypy) it
> > > might be a problem in the future. I think now we should change it back
> to
> > > add the suffixes. If we want to reverse it - this is the last moment
> we can
> > > do it (and we still can easily). I would love some quick
> opinion/voting for
> > > that - especially those who voted in AIP-21 but others are welcome as
> well:
> > > @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong (mailto:
> > > fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me) @Kamil
> > > Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:
> > > kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:
> > > basharenslak@godatadriven.com) @Philippe Gagnon (mailto:
> > > philgagnon1@gmail.com) - this was "Case 2" in the original AIP-21
> > > proposal.
> > > > >
> > > > > WDYT?
> > > > > 4) Test module names:
> > > > > I had to change all test module name for providers. For example: .
> > > Pytest does not like repeating implicit test module names. It throws
> error
> > > "duplicate test module). For example:
> > > > > tests/providers/http/hooks/test_http.py →
> > > >
> > >
> > > tests/providers/http/hooks/test_http_hooks.py (
> > >
> https://github.com/apache/airflow/pull/7279/files#diff-593e762ce1d79c250426a27b6fb28908
> > > )
> > > > >
> > > > >
> > > > > (basically I added hooks/operators/sensors everywhere). It is easy
> and
> > > fully automated and it has no impact at all on the main code base, so I
> > > think this is a super-safe change. t The nice thing is that it makes it
> > > easier to understand what kind of tests you are looking at immediately
> and.
> > > That also hints that maybe _hooks.py, _operators.py, _sensors.py in
> > > hooks/operators/sensors could have been a better choice (see 2 a) )
> > > > >
> > > > > 5) Doc generation
> > > > > Autoapi doc generation has not yet released implicit package
> support
> > > (it is in master only). However I managed to monkey-patch 1.2.1
> (latest)
> > > version to add support by cherry-picking the changes (there are
> literally
> > > few methods changed there -
> > > https://github.com/readthedocs/sphinx-autoapi/pull/178) and when new
> > > version is released we will be able to get rid of monkey-patching . I
> am
> > > still resolving a few import issues in that newest version but I am
> > > confident I can fix or workaround all issues with it.
> > > > > I don't expect this to be a problem - it is only in the doc
> generation
> > > >
> > >
> > > code, not the main application code.
> > > > >
> > > > > I verified that pytest test discovery works fine, and that airflow
> can
> > > be installed and works with implicit packages. Let me know what you
> think
> > > about it and I would love to merge it as soon as possible (rebasing
> this
> > > for a long time might be painful).
> > > > >
> > > > > J.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Nov 23, 2019 at 9:22 AM Jarek Potiuk <
> Jarek.Potiuk@polidea.com
> > > (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > > > > I've read a bit more about PEP-0420 so some more extracted
> > > > >
> > > >
> > >
> > > information:
> > > > > >
> > > > > > - implicit namespace packages are very similar to regular
> packages.
> > > The notable difference is that they can be loaded from several
> directories
> > > (so you can have modules in the same package but coming from different
> > > physical directories), so no __file__ is defined for the package (it is
> > > still defined for modules),
> > > > > >
> > > > > > - pytest should have no problems. It has indeed (
> > >
> https://stackoverflow.com/questions/50174130/how-do-i-pytest-a-project-using-pep-420-namespace-packages
> )
> > > problems with implicit packages when tests are co-located with their
> tested
> > > modules (module.py, module_test.py) but we are not in this situation.
> > > > > >
> > > > > > - I do not see how it could be slower - I have not found any
> > > benchmarking yet (but I have not found any complaints about it
> either). I
> > > looked at the loading algorithm for PEP-420 - it traverses all the
> > > directory structure available on the PYTHONPATH bo so does the
> "standard"
> > > mechanism. And we would not have to parse and load all the 140
> licence-only
> > > __init__.py. The presence of module in the package is based on the
> presence
> > > of "<module>.py" in the right folder. If we would have several packages
> > > elsewhere that might become a bit slower, but I think reading files is
> so
> > > much slower than scanning directories that - if anything - we will be
> > > faster (just opening and closing those __init__.files will take quite
> some
> > > time :). We can do some benchmarking before and after to be sure.
> > > > > >
> > > > > > Side comment: I am not 100% sure, but after reading PEP-420
> > > intuition tells me that maybe implicit packages can help in solving
> > > relative imports problems that some of our user raised . I saw recently
> > > several users had problems with relative imports used in DAGs. We are
> > > loading the DAGs in a specific way and in implicit packages, the
> __repr__
> > > of modules parent is dynamic - based on the location of the file rather
> > > than based on the module of the file, so it's likely if we encourage
> dag
> > > developers also to use implicit packages, it might actually solve the
> > > relative imports case.
> > > > > >
> > > > > > J.
> > > > > > On Fri, Nov 22, 2019 at 4:29 PM Daniel Imberman <
> > > > >
> > > >
> > >
> > > daniel.imberman@gmail.com (mailto:daniel.imberman@gmail.com)> wrote:
> > > > > > > Fine by me if it doesn’t break anything. I’m always on the
> side of
> > > > > >
> > > > >
> > > >
> > >
> > > less code :).
> > > > > > >
> > > > > > > via Newton Mail [
> > >
> https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2
> > > ]
> > > > > > > On Fri, Nov 22, 2019 at 8:12 AM, Ash Berlin-Taylor <
> ash@apache.org
> > > > > >
> > > > >
> > > >
> > >
> > > (mailto:ash@apache.org)> wrote:
> > > > > > > Probably worth it, but things to check:
> > > > > > >
> > > > > > > - if nose or pytest needs them under tests/ to find all the
> tests
> > > > > > > - if not having the init files slows down pythons package
> importer
> > > > > >
> > > > >
> > > >
> > >
> > > as it searches more paths?
> > > > > > >
> > > > > > > -a
> > > > > > > > On 22 Nov 2019, at 12:06, Jarek Potiuk <
> Jarek.Potiuk@polidea.com
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > > (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > > > > > >
> > > > > > > > There is one implementation detail in AIP-21 that I continue
> to
> > > have
> > > > > > > > questions about, and I wanted to ask community about it.
> > > > > > > >
> > > > > > > > Since we moved to Python 3, we have the option of using
> implicit
> > > namespace
> > > > > > > > packages.
> > > > > > > > https://www.python.org/dev/peps/pep-0420/ . We have now
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > > grouping per folder
> > > > > > > > for services, so it would require a lot more __init__.py
> files
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > > if we
> > > > > > > > continue using them.
> > > > > > > >
> > > > > > > > The implicit naming boils down to not requiring __init__.py
> > > files if no
> > > > > > > > initialisation of package is needed. We could do it
> consistently
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > > for all
> > > > > > > > code which is "internal" to airflow. This means that most
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > > packages will not
> > > > > > > > have __init__.py there will be few exceptions like 'airflow',
> > > > > > > > 'airflow.modules' and likely few others with the __init__.py.
> > > > > > > >
> > > > > > > > I did a quick check and we have only 25 _init_.py with some
> > > logic -
> > > > > > > > remaining ~ 140 could be removed as they only contain
> comments.
> > > > > > > >
> > > > > > > > find . -name '__init__.py' | xargs grep -le '^[^#].*' | wc -l
> > > > > > > > 25
> > > > > > > > find . -name '__init__.py' |wc -l
> > > > > > > > 164
> > > > > > > >
> > > > > > > > Those are the files that will be left:
> > > > > > > > ./tests/__init__.py
> > > > > > > > ./tests/utils/log/elasticmock/__init__.py
> > > > > > > > ./tests/utils/log/elasticmock/utilities/__init__.py
> > > > > > > > ./tests/models/__init__.py
> > > > > > > > ./tests/task/__init__.py
> > > > > > > > ./airflow/sensors/__init__.py
> > > > > > > > ./airflow/operators/__init__.py
> > > > > > > > ./airflow/_vendor/nvd3/__init__.py
> > > > > > > > ./airflow/_vendor/slugify/__init__.py
> > > > > > > > ./airflow/serialization/__init__.py
> > > > > > > > ./airflow/__init__.py
> > > > > > > > ./airflow/models/__init__.py
> > > > > > > >
> > >
> ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
> > > > > > > > ./airflow/ti_deps/deps/__init__.py
> > > > > > > > ./airflow/macros/__init__.py
> > > > > > > > ./airflow/executors/__init__.py
> > > > > > > > ./airflow/lineage/__init__.py
> > > > > > > > ./airflow/lineage/backend/__init__.py
> > > > > > > > ./airflow/lineage/backend/atlas/__init__.py
> > > > > > > > ./airflow/hooks/__init__.py
> > > > > > > > ./airflow/task/task_runner/__init__.py
> > > > > > > > ./airflow/api/__init__.py
> > > > > > > > ./airflow/api/common/experimental/__init__.py
> > > > > > > > ./airflow/api/client/__init__.py
> > > > > > > > ./airflow/jobs/__init__.py
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > > > J
> > > > > > > > --
> > > > > > > > Jarek Potiuk
> > > > > > > > Polidea <https://www.polidea.com/> | Principal Software
> Engineer
> > > > > > > >
> > > > > > > > M: +48 660 796 129 <+48660796129>
> > > > > > > > [image: Polidea] <https://www.polidea.com/>
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Jarek Potiuk
> > > > > > Polidea (https://www.polidea.com/) | Principal Software Engineer
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > M: +48 660 796 129 (tel:+48660796129)
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Jarek Potiuk
> > > > > Polidea (https://www.polidea.com/) | Principal Software Engineer
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > M: +48 660 796 129 (tel:+48660796129)
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> > --
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Ash Berlin-Taylor <as...@apache.org>.
There's two things I don't like: 1 the name, and that two it's more work than we are needed for (to what I can see) no gain. Making everything an implicit package touches 776 files, when 1 is all that is needed.

To make airflow.providers an implicit namepsace package we need to remove a single file. Doing just that mypy doesn't complain about anything. From a few quick spot tests they still pass too.
There is nothing to think about again as everywhere else we follow the normal python convention of having an __init__.py
-ash
On Feb 5 2020, at 10:48 am, Jarek Potiuk <Ja...@polidea.com> wrote:
> Hey Ash,
>
> What do you think is the downside of changing all the packages to implicit
> Except the -6000 or so useless comments in empty __init__.py ?
>
> J.
> On Wed, Feb 5, 2020 at 11:34 AM Ash Berlin-Taylor <as...@apache.org> wrote:
> > And to be clear: I'd rather we didn't make everything an implicit
> > namespace package, but only the "official" extension points of
> > airflow.providers. I cant immediately think of a reason to make anything
> > else a namespace package, and limiting this to a single place makes the
> > change smaller (tiny even?) and also easier to reason about where modules
> > might be coming from.
> >
> > What have I not thought about?
> > On Feb 5 2020, at 10:31 am, Ash Berlin-Taylor <as...@apache.org> wrote:
> > > Are you talking about/test with making _ALL_ packages implicit
> >
> > namespaces?
> > >
> > > I would think that we would only make airflow.providers an implicit
> > package, but leave all the sub packages as explicit packages: i.e. this:
> > > airflow/__init__.py
> > > airflow/utils/__init__.py
> > > airflow/providers/ (no __init__.py)
> > > airflow/providers/google/__init__.py
> > >
> > > Does limiting the implicit namespace to just airflow.providers address
> > 1+2+3?
> > > -a
> > > On Feb 5 2020, at 10:27 am, Jarek Potiuk <Ja...@polidea.com>
> >
> > wrote:
> > > > TL;DR; I am asking for opinions on some of the changes required to
> > >
> >
> > introduce implicit native package support. This is the easiest way to make
> > backports of master providers packages to 1.10.x.
> > > >
> > > >
> > > > NOTE!: @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong
> > (mailto:fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me)
> > @Kamil Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:
> > kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:
> > basharenslak@godatadriven.com) @Philippe Gagnon (mailto:
> > philgagnon1@gmail.com) - there is one point 1a) below that I would like
> > to get your input.
> > > >
> > > > As AIP-21 (import paths) import path move is done we have an important
> > step to complete - we should switch to implicit "native" packages. It boils
> > down to removal of empty (only with comment licences) __init__.py files and
> > relying on python 3.3+ capability of finding packages without having to add
> > __init_.py files. This is needed in order to get backported packaged to be
> > prepared for 1.10.* series. I worked on it during the last few days - to
> > make sure this can be done and I am close to have it. Close enough to know
> > I can solve all the remaining problems. I wanted to check that we can do it
> > for almost all the packages. I already solved most of the initial problems
> > but I have some places where I think I need community opinion on the way we
> > should solve them:
> > > >
> > > > The PR that has the changes is here:
> > https://github.com/apache/airflow/pull/7279/files - it's not 100% ready
> > yet and I want to split it into a few separate PRs so do not comment it
> > there yet - I extracted the most important changes here and wanted to ask
> > your opinion where I have doubts:
> > > >
> > > > 1) Module names identical to some top-level package names (for example
> > email.py):
> > > >
> > > > Some of the module names (for example email.py) has to be changed. The
> > problem with implicit packages is that if local module is named the same as
> > the top level package, importing the top-level package from it does not
> > work:
> > > >
> > > > email.py:
> > > > ...
> > > > import email <- this won't work - module imports itself.
> > > >
> > > > For now in my PR i renamed the modules to be airflow_<old_name>.py or
> > > > <old_name>_utils.py : There are such modules (mostly hooks):
> > > > email -> email_utils, (utils)
> > > >
> > > > cassandra -> airflow_cassandra
> > > > cloudant -> airflow_cloudant
> > > > dataproc -> airflow_dataproc
> > > > grpc -> airlfow_grpc
> > > > hdfs -> airflow_hdfs
> > > > jira -> airflow_jira
> > > > kerberos -> kerberos_security
> > > > redis -> airflow_redis
> > > > sendgrid -> airflow_sendgrid
> > > > snowflake -> airflow_snowflake
> > > > winrm -> airflow_winrm (operator)
> > > > datadog -> airflow_datatog(sensor)
> > > > sqlalchemy -> sqlalchemy_utils (utils)
> > > >
> > > >
> > > >
> > > > I can also change it to grpc_hooks.py, datadog_sensors.py etc. Or
> > maybe someone knows an easy way how to keep the module name and implicit
> > packages?
> > > >
> > > > I guess this is not a big problem to change it this way - we anyhow
> > have changed import paths for those. The only two problems with this are:
> > > > a) email was used in "email_backend" configuration : email_backend =
> > >
> >
> > airflow.utils.email.send_email_smtp
> > > > but we can solve it by handling that as special case, raising
> > >
> >
> > deprecation warning and converting it to
> > airflow.utils.email_utils.send_email_smtp
> > > > b) we introduce slight inconsistency in hook/operator/sensor names.
> > > >
> > > > 2) Mypy fails when checking individual (duplicated) module names
> > > > Mypy does not work well with duplicate module names in implicit
> > packages (https://github.com/pre-commit/mirrors-mypy/issues/5) and with
> > AIP-21 we have plenty of them. Repeating modules (http.py in both operators
> > and hooks for example) cause it to fail in case they are provided as
> > parameters (in case of incremental mypy check in pre-commits. I had to
> > change it then so that mypy always check 'airflow' and 'test' packages
> > instead - which makes pre-commit check slightly slower (few seconds).
> > > >
> > > > 3) _hooks.py/_operators.py/_sensors.py (
> > http://hooks.py/_operators.py/_sensors.py) suffix: Some thought resulting
> > from above 1) and 2):
> > > >
> > > > I did not want to open pandora's box again but I think removal of
> > _hooks, _operators, _sensors from the module name might not have been the
> > best decision. While it removes some duplication in module name, it
> > actually introduces duplicate module names and (as it is with mypy) it
> > might be a problem in the future. I think now we should change it back to
> > add the suffixes. If we want to reverse it - this is the last moment we can
> > do it (and we still can easily). I would love some quick opinion/voting for
> > that - especially those who voted in AIP-21 but others are welcome as well:
> > @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong (mailto:
> > fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me) @Kamil
> > Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:
> > kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:
> > basharenslak@godatadriven.com) @Philippe Gagnon (mailto:
> > philgagnon1@gmail.com) - this was "Case 2" in the original AIP-21
> > proposal.
> > > >
> > > > WDYT?
> > > > 4) Test module names:
> > > > I had to change all test module name for providers. For example: .
> > Pytest does not like repeating implicit test module names. It throws error
> > "duplicate test module). For example:
> > > > tests/providers/http/hooks/test_http.py →
> > >
> >
> > tests/providers/http/hooks/test_http_hooks.py (
> > https://github.com/apache/airflow/pull/7279/files#diff-593e762ce1d79c250426a27b6fb28908
> > )
> > > >
> > > >
> > > > (basically I added hooks/operators/sensors everywhere). It is easy and
> > fully automated and it has no impact at all on the main code base, so I
> > think this is a super-safe change. t The nice thing is that it makes it
> > easier to understand what kind of tests you are looking at immediately and.
> > That also hints that maybe _hooks.py, _operators.py, _sensors.py in
> > hooks/operators/sensors could have been a better choice (see 2 a) )
> > > >
> > > > 5) Doc generation
> > > > Autoapi doc generation has not yet released implicit package support
> > (it is in master only). However I managed to monkey-patch 1.2.1 (latest)
> > version to add support by cherry-picking the changes (there are literally
> > few methods changed there -
> > https://github.com/readthedocs/sphinx-autoapi/pull/178) and when new
> > version is released we will be able to get rid of monkey-patching . I am
> > still resolving a few import issues in that newest version but I am
> > confident I can fix or workaround all issues with it.
> > > > I don't expect this to be a problem - it is only in the doc generation
> > >
> >
> > code, not the main application code.
> > > >
> > > > I verified that pytest test discovery works fine, and that airflow can
> > be installed and works with implicit packages. Let me know what you think
> > about it and I would love to merge it as soon as possible (rebasing this
> > for a long time might be painful).
> > > >
> > > > J.
> > > >
> > > >
> > > >
> > > >
> > > > On Sat, Nov 23, 2019 at 9:22 AM Jarek Potiuk <Jarek.Potiuk@polidea.com
> > (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > > > I've read a bit more about PEP-0420 so some more extracted
> > > >
> > >
> >
> > information:
> > > > >
> > > > > - implicit namespace packages are very similar to regular packages.
> > The notable difference is that they can be loaded from several directories
> > (so you can have modules in the same package but coming from different
> > physical directories), so no __file__ is defined for the package (it is
> > still defined for modules),
> > > > >
> > > > > - pytest should have no problems. It has indeed (
> > https://stackoverflow.com/questions/50174130/how-do-i-pytest-a-project-using-pep-420-namespace-packages)
> > problems with implicit packages when tests are co-located with their tested
> > modules (module.py, module_test.py) but we are not in this situation.
> > > > >
> > > > > - I do not see how it could be slower - I have not found any
> > benchmarking yet (but I have not found any complaints about it either). I
> > looked at the loading algorithm for PEP-420 - it traverses all the
> > directory structure available on the PYTHONPATH bo so does the "standard"
> > mechanism. And we would not have to parse and load all the 140 licence-only
> > __init__.py. The presence of module in the package is based on the presence
> > of "<module>.py" in the right folder. If we would have several packages
> > elsewhere that might become a bit slower, but I think reading files is so
> > much slower than scanning directories that - if anything - we will be
> > faster (just opening and closing those __init__.files will take quite some
> > time :). We can do some benchmarking before and after to be sure.
> > > > >
> > > > > Side comment: I am not 100% sure, but after reading PEP-420
> > intuition tells me that maybe implicit packages can help in solving
> > relative imports problems that some of our user raised . I saw recently
> > several users had problems with relative imports used in DAGs. We are
> > loading the DAGs in a specific way and in implicit packages, the __repr__
> > of modules parent is dynamic - based on the location of the file rather
> > than based on the module of the file, so it's likely if we encourage dag
> > developers also to use implicit packages, it might actually solve the
> > relative imports case.
> > > > >
> > > > > J.
> > > > > On Fri, Nov 22, 2019 at 4:29 PM Daniel Imberman <
> > > >
> > >
> >
> > daniel.imberman@gmail.com (mailto:daniel.imberman@gmail.com)> wrote:
> > > > > > Fine by me if it doesn’t break anything. I’m always on the side of
> > > > >
> > > >
> > >
> >
> > less code :).
> > > > > >
> > > > > > via Newton Mail [
> > https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2
> > ]
> > > > > > On Fri, Nov 22, 2019 at 8:12 AM, Ash Berlin-Taylor <ash@apache.org
> > > > >
> > > >
> > >
> >
> > (mailto:ash@apache.org)> wrote:
> > > > > > Probably worth it, but things to check:
> > > > > >
> > > > > > - if nose or pytest needs them under tests/ to find all the tests
> > > > > > - if not having the init files slows down pythons package importer
> > > > >
> > > >
> > >
> >
> > as it searches more paths?
> > > > > >
> > > > > > -a
> > > > > > > On 22 Nov 2019, at 12:06, Jarek Potiuk <Jarek.Potiuk@polidea.com
> > > > > >
> > > > >
> > > >
> > >
> >
> > (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > > > > >
> > > > > > > There is one implementation detail in AIP-21 that I continue to
> > have
> > > > > > > questions about, and I wanted to ask community about it.
> > > > > > >
> > > > > > > Since we moved to Python 3, we have the option of using implicit
> > namespace
> > > > > > > packages.
> > > > > > > https://www.python.org/dev/peps/pep-0420/ . We have now
> > > > > >
> > > > >
> > > >
> > >
> >
> > grouping per folder
> > > > > > > for services, so it would require a lot more __init__.py files
> > > > > >
> > > > >
> > > >
> > >
> >
> > if we
> > > > > > > continue using them.
> > > > > > >
> > > > > > > The implicit naming boils down to not requiring __init__.py
> > files if no
> > > > > > > initialisation of package is needed. We could do it consistently
> > > > > >
> > > > >
> > > >
> > >
> >
> > for all
> > > > > > > code which is "internal" to airflow. This means that most
> > > > > >
> > > > >
> > > >
> > >
> >
> > packages will not
> > > > > > > have __init__.py there will be few exceptions like 'airflow',
> > > > > > > 'airflow.modules' and likely few others with the __init__.py.
> > > > > > >
> > > > > > > I did a quick check and we have only 25 _init_.py with some
> > logic -
> > > > > > > remaining ~ 140 could be removed as they only contain comments.
> > > > > > >
> > > > > > > find . -name '__init__.py' | xargs grep -le '^[^#].*' | wc -l
> > > > > > > 25
> > > > > > > find . -name '__init__.py' |wc -l
> > > > > > > 164
> > > > > > >
> > > > > > > Those are the files that will be left:
> > > > > > > ./tests/__init__.py
> > > > > > > ./tests/utils/log/elasticmock/__init__.py
> > > > > > > ./tests/utils/log/elasticmock/utilities/__init__.py
> > > > > > > ./tests/models/__init__.py
> > > > > > > ./tests/task/__init__.py
> > > > > > > ./airflow/sensors/__init__.py
> > > > > > > ./airflow/operators/__init__.py
> > > > > > > ./airflow/_vendor/nvd3/__init__.py
> > > > > > > ./airflow/_vendor/slugify/__init__.py
> > > > > > > ./airflow/serialization/__init__.py
> > > > > > > ./airflow/__init__.py
> > > > > > > ./airflow/models/__init__.py
> > > > > > >
> > ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
> > > > > > > ./airflow/ti_deps/deps/__init__.py
> > > > > > > ./airflow/macros/__init__.py
> > > > > > > ./airflow/executors/__init__.py
> > > > > > > ./airflow/lineage/__init__.py
> > > > > > > ./airflow/lineage/backend/__init__.py
> > > > > > > ./airflow/lineage/backend/atlas/__init__.py
> > > > > > > ./airflow/hooks/__init__.py
> > > > > > > ./airflow/task/task_runner/__init__.py
> > > > > > > ./airflow/api/__init__.py
> > > > > > > ./airflow/api/common/experimental/__init__.py
> > > > > > > ./airflow/api/client/__init__.py
> > > > > > > ./airflow/jobs/__init__.py
> > > > > > >
> > > > > > > WDYT?
> > > > > > > J
> > > > > > > --
> > > > > > > Jarek Potiuk
> > > > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > > > >
> > > > > > > M: +48 660 796 129 <+48660796129>
> > > > > > > [image: Polidea] <https://www.polidea.com/>
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Jarek Potiuk
> > > > > Polidea (https://www.polidea.com/) | Principal Software Engineer
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > M: +48 660 796 129 (tel:+48660796129)
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Jarek Potiuk
> > > > Polidea (https://www.polidea.com/) | Principal Software Engineer
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > M: +48 660 796 129 (tel:+48660796129)
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
> --
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>


Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Jarek Potiuk <Ja...@polidea.com>.
Hey Ash,

What do you think is the downside of changing all the packages to implicit
Except the -6000 or so useless comments in empty __init__.py ?

J.

On Wed, Feb 5, 2020 at 11:34 AM Ash Berlin-Taylor <as...@apache.org> wrote:

> And to be clear: I'd rather we didn't make everything an implicit
> namespace package, but only the "official" extension points of
> airflow.providers. I cant immediately think of a reason to make anything
> else a namespace package, and limiting this to a single place makes the
> change smaller (tiny even?) and also easier to reason about where modules
> might be coming from.
>
> What have I not thought about?
> On Feb 5 2020, at 10:31 am, Ash Berlin-Taylor <as...@apache.org> wrote:
> > Are you talking about/test with making _ALL_ packages implicit
> namespaces?
> >
> > I would think that we would only make airflow.providers an implicit
> package, but leave all the sub packages as explicit packages: i.e. this:
> > airflow/__init__.py
> > airflow/utils/__init__.py
> > airflow/providers/ (no __init__.py)
> > airflow/providers/google/__init__.py
> >
> > Does limiting the implicit namespace to just airflow.providers address
> 1+2+3?
> > -a
> > On Feb 5 2020, at 10:27 am, Jarek Potiuk <Ja...@polidea.com>
> wrote:
> > > TL;DR; I am asking for opinions on some of the changes required to
> introduce implicit native package support. This is the easiest way to make
> backports of master providers packages to 1.10.x.
> > >
> > >
> > > NOTE!: @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong
> (mailto:fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me)
> @Kamil Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:
> kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:
> basharenslak@godatadriven.com) @Philippe Gagnon (mailto:
> philgagnon1@gmail.com) - there is one point 1a) below that I would like
> to get your input.
> > >
> > > As AIP-21 (import paths) import path move is done we have an important
> step to complete - we should switch to implicit "native" packages. It boils
> down to removal of empty (only with comment licences) __init__.py files and
> relying on python 3.3+ capability of finding packages without having to add
> __init_.py files. This is needed in order to get backported packaged to be
> prepared for 1.10.* series. I worked on it during the last few days - to
> make sure this can be done and I am close to have it. Close enough to know
> I can solve all the remaining problems. I wanted to check that we can do it
> for almost all the packages. I already solved most of the initial problems
> but I have some places where I think I need community opinion on the way we
> should solve them:
> > >
> > > The PR that has the changes is here:
> https://github.com/apache/airflow/pull/7279/files - it's not 100% ready
> yet and I want to split it into a few separate PRs so do not comment it
> there yet - I extracted the most important changes here and wanted to ask
> your opinion where I have doubts:
> > >
> > > 1) Module names identical to some top-level package names (for example
> email.py):
> > >
> > > Some of the module names (for example email.py) has to be changed. The
> problem with implicit packages is that if local module is named the same as
> the top level package, importing the top-level package from it does not
> work:
> > >
> > > email.py:
> > > ...
> > > import email <- this won't work - module imports itself.
> > >
> > > For now in my PR i renamed the modules to be airflow_<old_name>.py or
> <old_name>_utils.py : There are such modules (mostly hooks):
> > > email -> email_utils, (utils)
> > >
> > > cassandra -> airflow_cassandra
> > >
> > > cloudant -> airflow_cloudant
> > >
> > > dataproc -> airflow_dataproc
> > >
> > > grpc -> airlfow_grpc
> > >
> > > hdfs -> airflow_hdfs
> > >
> > > jira -> airflow_jira
> > >
> > > kerberos -> kerberos_security
> > >
> > > redis -> airflow_redis
> > >
> > > sendgrid -> airflow_sendgrid
> > >
> > > snowflake -> airflow_snowflake
> > >
> > > winrm -> airflow_winrm (operator)
> > >
> > > datadog -> airflow_datatog(sensor)
> > >
> > > sqlalchemy -> sqlalchemy_utils (utils)
> > >
> > >
> > >
> > > I can also change it to grpc_hooks.py, datadog_sensors.py etc. Or
> maybe someone knows an easy way how to keep the module name and implicit
> packages?
> > >
> > > I guess this is not a big problem to change it this way - we anyhow
> have changed import paths for those. The only two problems with this are:
> > > a) email was used in "email_backend" configuration : email_backend =
> airflow.utils.email.send_email_smtp
> > > but we can solve it by handling that as special case, raising
> deprecation warning and converting it to
> airflow.utils.email_utils.send_email_smtp
> > > b) we introduce slight inconsistency in hook/operator/sensor names.
> > >
> > > 2) Mypy fails when checking individual (duplicated) module names
> > >
> > > Mypy does not work well with duplicate module names in implicit
> packages (https://github.com/pre-commit/mirrors-mypy/issues/5) and with
> AIP-21 we have plenty of them. Repeating modules (http.py in both operators
> and hooks for example) cause it to fail in case they are provided as
> parameters (in case of incremental mypy check in pre-commits. I had to
> change it then so that mypy always check 'airflow' and 'test' packages
> instead - which makes pre-commit check slightly slower (few seconds).
> > >
> > > 3) _hooks.py/_operators.py/_sensors.py (
> http://hooks.py/_operators.py/_sensors.py) suffix: Some thought resulting
> from above 1) and 2):
> > >
> > > I did not want to open pandora's box again but I think removal of
> _hooks, _operators, _sensors from the module name might not have been the
> best decision. While it removes some duplication in module name, it
> actually introduces duplicate module names and (as it is with mypy) it
> might be a problem in the future. I think now we should change it back to
> add the suffixes. If we want to reverse it - this is the last moment we can
> do it (and we still can easily). I would love some quick opinion/voting for
> that - especially those who voted in AIP-21 but others are welcome as well:
> @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong (mailto:
> fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me) @Kamil
> Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:
> kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:
> basharenslak@godatadriven.com) @Philippe Gagnon (mailto:
> philgagnon1@gmail.com) - this was "Case 2" in the original AIP-21
> proposal.
> > >
> > > WDYT?
> > >
> > > 4) Test module names:
> > >
> > > I had to change all test module name for providers. For example: .
> Pytest does not like repeating implicit test module names. It throws error
> "duplicate test module). For example:
> > > tests/providers/http/hooks/test_http.py →
> tests/providers/http/hooks/test_http_hooks.py (
> https://github.com/apache/airflow/pull/7279/files#diff-593e762ce1d79c250426a27b6fb28908
> )
> > >
> > >
> > > (basically I added hooks/operators/sensors everywhere). It is easy and
> fully automated and it has no impact at all on the main code base, so I
> think this is a super-safe change. t The nice thing is that it makes it
> easier to understand what kind of tests you are looking at immediately and.
> That also hints that maybe _hooks.py, _operators.py, _sensors.py in
> hooks/operators/sensors could have been a better choice (see 2 a) )
> > >
> > > 5) Doc generation
> > >
> > > Autoapi doc generation has not yet released implicit package support
> (it is in master only). However I managed to monkey-patch 1.2.1 (latest)
> version to add support by cherry-picking the changes (there are literally
> few methods changed there -
> https://github.com/readthedocs/sphinx-autoapi/pull/178) and when new
> version is released we will be able to get rid of monkey-patching . I am
> still resolving a few import issues in that newest version but I am
> confident I can fix or workaround all issues with it.
> > > I don't expect this to be a problem - it is only in the doc generation
> code, not the main application code.
> > >
> > > I verified that pytest test discovery works fine, and that airflow can
> be installed and works with implicit packages. Let me know what you think
> about it and I would love to merge it as soon as possible (rebasing this
> for a long time might be painful).
> > >
> > > J.
> > >
> > >
> > >
> > >
> > >
> > > On Sat, Nov 23, 2019 at 9:22 AM Jarek Potiuk <Jarek.Potiuk@polidea.com
> (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > > I've read a bit more about PEP-0420 so some more extracted
> information:
> > > >
> > > > - implicit namespace packages are very similar to regular packages.
> The notable difference is that they can be loaded from several directories
> (so you can have modules in the same package but coming from different
> physical directories), so no __file__ is defined for the package (it is
> still defined for modules),
> > > >
> > > > - pytest should have no problems. It has indeed (
> https://stackoverflow.com/questions/50174130/how-do-i-pytest-a-project-using-pep-420-namespace-packages)
> problems with implicit packages when tests are co-located with their tested
> modules (module.py, module_test.py) but we are not in this situation.
> > > >
> > > > - I do not see how it could be slower - I have not found any
> benchmarking yet (but I have not found any complaints about it either). I
> looked at the loading algorithm for PEP-420 - it traverses all the
> directory structure available on the PYTHONPATH bo so does the "standard"
> mechanism. And we would not have to parse and load all the 140 licence-only
> __init__.py. The presence of module in the package is based on the presence
> of "<module>.py" in the right folder. If we would have several packages
> elsewhere that might become a bit slower, but I think reading files is so
> much slower than scanning directories that - if anything - we will be
> faster (just opening and closing those __init__.files will take quite some
> time :). We can do some benchmarking before and after to be sure.
> > > >
> > > > Side comment: I am not 100% sure, but after reading PEP-420
> intuition tells me that maybe implicit packages can help in solving
> relative imports problems that some of our user raised . I saw recently
> several users had problems with relative imports used in DAGs. We are
> loading the DAGs in a specific way and in implicit packages, the __repr__
> of modules parent is dynamic - based on the location of the file rather
> than based on the module of the file, so it's likely if we encourage dag
> developers also to use implicit packages, it might actually solve the
> relative imports case.
> > > >
> > > > J.
> > > > On Fri, Nov 22, 2019 at 4:29 PM Daniel Imberman <
> daniel.imberman@gmail.com (mailto:daniel.imberman@gmail.com)> wrote:
> > > > > Fine by me if it doesn’t break anything. I’m always on the side of
> less code :).
> > > > >
> > > > > via Newton Mail [
> https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2
> ]
> > > > > On Fri, Nov 22, 2019 at 8:12 AM, Ash Berlin-Taylor <ash@apache.org
> (mailto:ash@apache.org)> wrote:
> > > > > Probably worth it, but things to check:
> > > > >
> > > > > - if nose or pytest needs them under tests/ to find all the tests
> > > > > - if not having the init files slows down pythons package importer
> as it searches more paths?
> > > > >
> > > > > -a
> > > > > > On 22 Nov 2019, at 12:06, Jarek Potiuk <Jarek.Potiuk@polidea.com
> (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > > > >
> > > > > > There is one implementation detail in AIP-21 that I continue to
> have
> > > > > > questions about, and I wanted to ask community about it.
> > > > > >
> > > > > > Since we moved to Python 3, we have the option of using implicit
> namespace
> > > > > > packages.
> > > > > > https://www.python.org/dev/peps/pep-0420/ . We have now
> grouping per folder
> > > > > > for services, so it would require a lot more __init__.py files
> if we
> > > > > > continue using them.
> > > > > >
> > > > > > The implicit naming boils down to not requiring __init__.py
> files if no
> > > > > > initialisation of package is needed. We could do it consistently
> for all
> > > > > > code which is "internal" to airflow. This means that most
> packages will not
> > > > > > have __init__.py there will be few exceptions like 'airflow',
> > > > > > 'airflow.modules' and likely few others with the __init__.py.
> > > > > >
> > > > > > I did a quick check and we have only 25 _init_.py with some
> logic -
> > > > > > remaining ~ 140 could be removed as they only contain comments.
> > > > > >
> > > > > > find . -name '__init__.py' | xargs grep -le '^[^#].*' | wc -l
> > > > > > 25
> > > > > > find . -name '__init__.py' |wc -l
> > > > > > 164
> > > > > >
> > > > > > Those are the files that will be left:
> > > > > > ./tests/__init__.py
> > > > > > ./tests/utils/log/elasticmock/__init__.py
> > > > > > ./tests/utils/log/elasticmock/utilities/__init__.py
> > > > > > ./tests/models/__init__.py
> > > > > > ./tests/task/__init__.py
> > > > > > ./airflow/sensors/__init__.py
> > > > > > ./airflow/operators/__init__.py
> > > > > > ./airflow/_vendor/nvd3/__init__.py
> > > > > > ./airflow/_vendor/slugify/__init__.py
> > > > > > ./airflow/serialization/__init__.py
> > > > > > ./airflow/__init__.py
> > > > > > ./airflow/models/__init__.py
> > > > > >
> ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
> > > > > > ./airflow/ti_deps/deps/__init__.py
> > > > > > ./airflow/macros/__init__.py
> > > > > > ./airflow/executors/__init__.py
> > > > > > ./airflow/lineage/__init__.py
> > > > > > ./airflow/lineage/backend/__init__.py
> > > > > > ./airflow/lineage/backend/atlas/__init__.py
> > > > > > ./airflow/hooks/__init__.py
> > > > > > ./airflow/task/task_runner/__init__.py
> > > > > > ./airflow/api/__init__.py
> > > > > > ./airflow/api/common/experimental/__init__.py
> > > > > > ./airflow/api/client/__init__.py
> > > > > > ./airflow/jobs/__init__.py
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > > > J
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Jarek Potiuk
> > > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > > >
> > > > > > M: +48 660 796 129 <+48660796129>
> > > > > > [image: Polidea] <https://www.polidea.com/>
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > >
> > > > Jarek Potiuk
> > > > Polidea (https://www.polidea.com/) | Principal Software Engineer
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > M: +48 660 796 129 (tel:+48660796129)
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > >
> > >
> > > Jarek Potiuk
> > > Polidea (https://www.polidea.com/) | Principal Software Engineer
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > M: +48 660 796 129 (tel:+48660796129)
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Ash Berlin-Taylor <as...@apache.org>.
And to be clear: I'd rather we didn't make everything an implicit namespace package, but only the "official" extension points of airflow.providers. I cant immediately think of a reason to make anything else a namespace package, and limiting this to a single place makes the change smaller (tiny even?) and also easier to reason about where modules might be coming from.

What have I not thought about?
On Feb 5 2020, at 10:31 am, Ash Berlin-Taylor <as...@apache.org> wrote:
> Are you talking about/test with making _ALL_ packages implicit namespaces?
>
> I would think that we would only make airflow.providers an implicit package, but leave all the sub packages as explicit packages: i.e. this:
> airflow/__init__.py
> airflow/utils/__init__.py
> airflow/providers/ (no __init__.py)
> airflow/providers/google/__init__.py
>
> Does limiting the implicit namespace to just airflow.providers address 1+2+3?
> -a
> On Feb 5 2020, at 10:27 am, Jarek Potiuk <Ja...@polidea.com> wrote:
> > TL;DR; I am asking for opinions on some of the changes required to introduce implicit native package support. This is the easiest way to make backports of master providers packages to 1.10.x.
> >
> >
> > NOTE!: @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong (mailto:fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me) @Kamil Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:basharenslak@godatadriven.com) @Philippe Gagnon (mailto:philgagnon1@gmail.com) - there is one point 1a) below that I would like to get your input.
> >
> > As AIP-21 (import paths) import path move is done we have an important step to complete - we should switch to implicit "native" packages. It boils down to removal of empty (only with comment licences) __init__.py files and relying on python 3.3+ capability of finding packages without having to add __init_.py files. This is needed in order to get backported packaged to be prepared for 1.10.* series. I worked on it during the last few days - to make sure this can be done and I am close to have it. Close enough to know I can solve all the remaining problems. I wanted to check that we can do it for almost all the packages. I already solved most of the initial problems but I have some places where I think I need community opinion on the way we should solve them:
> >
> > The PR that has the changes is here: https://github.com/apache/airflow/pull/7279/files - it's not 100% ready yet and I want to split it into a few separate PRs so do not comment it there yet - I extracted the most important changes here and wanted to ask your opinion where I have doubts:
> >
> > 1) Module names identical to some top-level package names (for example email.py):
> >
> > Some of the module names (for example email.py) has to be changed. The problem with implicit packages is that if local module is named the same as the top level package, importing the top-level package from it does not work:
> >
> > email.py:
> > ...
> > import email <- this won't work - module imports itself.
> >
> > For now in my PR i renamed the modules to be airflow_<old_name>.py or <old_name>_utils.py : There are such modules (mostly hooks):
> > email -> email_utils, (utils)
> >
> > cassandra -> airflow_cassandra
> >
> > cloudant -> airflow_cloudant
> >
> > dataproc -> airflow_dataproc
> >
> > grpc -> airlfow_grpc
> >
> > hdfs -> airflow_hdfs
> >
> > jira -> airflow_jira
> >
> > kerberos -> kerberos_security
> >
> > redis -> airflow_redis
> >
> > sendgrid -> airflow_sendgrid
> >
> > snowflake -> airflow_snowflake
> >
> > winrm -> airflow_winrm (operator)
> >
> > datadog -> airflow_datatog(sensor)
> >
> > sqlalchemy -> sqlalchemy_utils (utils)
> >
> >
> >
> > I can also change it to grpc_hooks.py, datadog_sensors.py etc. Or maybe someone knows an easy way how to keep the module name and implicit packages?
> >
> > I guess this is not a big problem to change it this way - we anyhow have changed import paths for those. The only two problems with this are:
> > a) email was used in "email_backend" configuration : email_backend = airflow.utils.email.send_email_smtp
> > but we can solve it by handling that as special case, raising deprecation warning and converting it to airflow.utils.email_utils.send_email_smtp
> > b) we introduce slight inconsistency in hook/operator/sensor names.
> >
> > 2) Mypy fails when checking individual (duplicated) module names
> >
> > Mypy does not work well with duplicate module names in implicit packages (https://github.com/pre-commit/mirrors-mypy/issues/5) and with AIP-21 we have plenty of them. Repeating modules (http.py in both operators and hooks for example) cause it to fail in case they are provided as parameters (in case of incremental mypy check in pre-commits. I had to change it then so that mypy always check 'airflow' and 'test' packages instead - which makes pre-commit check slightly slower (few seconds).
> >
> > 3) _hooks.py/_operators.py/_sensors.py (http://hooks.py/_operators.py/_sensors.py) suffix: Some thought resulting from above 1) and 2):
> >
> > I did not want to open pandora's box again but I think removal of _hooks, _operators, _sensors from the module name might not have been the best decision. While it removes some duplication in module name, it actually introduces duplicate module names and (as it is with mypy) it might be a problem in the future. I think now we should change it back to add the suffixes. If we want to reverse it - this is the last moment we can do it (and we still can easily). I would love some quick opinion/voting for that - especially those who voted in AIP-21 but others are welcome as well: @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong (mailto:fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me) @Kamil Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:basharenslak@godatadriven.com) @Philippe Gagnon (mailto:philgagnon1@gmail.com) - this was "Case 2" in the original AIP-21 proposal.
> >
> > WDYT?
> >
> > 4) Test module names:
> >
> > I had to change all test module name for providers. For example: . Pytest does not like repeating implicit test module names. It throws error "duplicate test module). For example:
> > tests/providers/http/hooks/test_http.py → tests/providers/http/hooks/test_http_hooks.py (https://github.com/apache/airflow/pull/7279/files#diff-593e762ce1d79c250426a27b6fb28908)
> >
> >
> > (basically I added hooks/operators/sensors everywhere). It is easy and fully automated and it has no impact at all on the main code base, so I think this is a super-safe change. t The nice thing is that it makes it easier to understand what kind of tests you are looking at immediately and. That also hints that maybe _hooks.py, _operators.py, _sensors.py in hooks/operators/sensors could have been a better choice (see 2 a) )
> >
> > 5) Doc generation
> >
> > Autoapi doc generation has not yet released implicit package support (it is in master only). However I managed to monkey-patch 1.2.1 (latest) version to add support by cherry-picking the changes (there are literally few methods changed there - https://github.com/readthedocs/sphinx-autoapi/pull/178) and when new version is released we will be able to get rid of monkey-patching . I am still resolving a few import issues in that newest version but I am confident I can fix or workaround all issues with it.
> > I don't expect this to be a problem - it is only in the doc generation code, not the main application code.
> >
> > I verified that pytest test discovery works fine, and that airflow can be installed and works with implicit packages. Let me know what you think about it and I would love to merge it as soon as possible (rebasing this for a long time might be painful).
> >
> > J.
> >
> >
> >
> >
> >
> > On Sat, Nov 23, 2019 at 9:22 AM Jarek Potiuk <Jarek.Potiuk@polidea.com (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > I've read a bit more about PEP-0420 so some more extracted information:
> > >
> > > - implicit namespace packages are very similar to regular packages. The notable difference is that they can be loaded from several directories (so you can have modules in the same package but coming from different physical directories), so no __file__ is defined for the package (it is still defined for modules),
> > >
> > > - pytest should have no problems. It has indeed (https://stackoverflow.com/questions/50174130/how-do-i-pytest-a-project-using-pep-420-namespace-packages) problems with implicit packages when tests are co-located with their tested modules (module.py, module_test.py) but we are not in this situation.
> > >
> > > - I do not see how it could be slower - I have not found any benchmarking yet (but I have not found any complaints about it either). I looked at the loading algorithm for PEP-420 - it traverses all the directory structure available on the PYTHONPATH bo so does the "standard" mechanism. And we would not have to parse and load all the 140 licence-only __init__.py. The presence of module in the package is based on the presence of "<module>.py" in the right folder. If we would have several packages elsewhere that might become a bit slower, but I think reading files is so much slower than scanning directories that - if anything - we will be faster (just opening and closing those __init__.files will take quite some time :). We can do some benchmarking before and after to be sure.
> > >
> > > Side comment: I am not 100% sure, but after reading PEP-420 intuition tells me that maybe implicit packages can help in solving relative imports problems that some of our user raised . I saw recently several users had problems with relative imports used in DAGs. We are loading the DAGs in a specific way and in implicit packages, the __repr__ of modules parent is dynamic - based on the location of the file rather than based on the module of the file, so it's likely if we encourage dag developers also to use implicit packages, it might actually solve the relative imports case.
> > >
> > > J.
> > > On Fri, Nov 22, 2019 at 4:29 PM Daniel Imberman <daniel.imberman@gmail.com (mailto:daniel.imberman@gmail.com)> wrote:
> > > > Fine by me if it doesn’t break anything. I’m always on the side of less code :).
> > > >
> > > > via Newton Mail [https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2]
> > > > On Fri, Nov 22, 2019 at 8:12 AM, Ash Berlin-Taylor <ash@apache.org (mailto:ash@apache.org)> wrote:
> > > > Probably worth it, but things to check:
> > > >
> > > > - if nose or pytest needs them under tests/ to find all the tests
> > > > - if not having the init files slows down pythons package importer as it searches more paths?
> > > >
> > > > -a
> > > > > On 22 Nov 2019, at 12:06, Jarek Potiuk <Jarek.Potiuk@polidea.com (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > > >
> > > > > There is one implementation detail in AIP-21 that I continue to have
> > > > > questions about, and I wanted to ask community about it.
> > > > >
> > > > > Since we moved to Python 3, we have the option of using implicit namespace
> > > > > packages.
> > > > > https://www.python.org/dev/peps/pep-0420/ . We have now grouping per folder
> > > > > for services, so it would require a lot more __init__.py files if we
> > > > > continue using them.
> > > > >
> > > > > The implicit naming boils down to not requiring __init__.py files if no
> > > > > initialisation of package is needed. We could do it consistently for all
> > > > > code which is "internal" to airflow. This means that most packages will not
> > > > > have __init__.py there will be few exceptions like 'airflow',
> > > > > 'airflow.modules' and likely few others with the __init__.py.
> > > > >
> > > > > I did a quick check and we have only 25 _init_.py with some logic -
> > > > > remaining ~ 140 could be removed as they only contain comments.
> > > > >
> > > > > find . -name '__init__.py' | xargs grep -le '^[^#].*' | wc -l
> > > > > 25
> > > > > find . -name '__init__.py' |wc -l
> > > > > 164
> > > > >
> > > > > Those are the files that will be left:
> > > > > ./tests/__init__.py
> > > > > ./tests/utils/log/elasticmock/__init__.py
> > > > > ./tests/utils/log/elasticmock/utilities/__init__.py
> > > > > ./tests/models/__init__.py
> > > > > ./tests/task/__init__.py
> > > > > ./airflow/sensors/__init__.py
> > > > > ./airflow/operators/__init__.py
> > > > > ./airflow/_vendor/nvd3/__init__.py
> > > > > ./airflow/_vendor/slugify/__init__.py
> > > > > ./airflow/serialization/__init__.py
> > > > > ./airflow/__init__.py
> > > > > ./airflow/models/__init__.py
> > > > > ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
> > > > > ./airflow/ti_deps/deps/__init__.py
> > > > > ./airflow/macros/__init__.py
> > > > > ./airflow/executors/__init__.py
> > > > > ./airflow/lineage/__init__.py
> > > > > ./airflow/lineage/backend/__init__.py
> > > > > ./airflow/lineage/backend/atlas/__init__.py
> > > > > ./airflow/hooks/__init__.py
> > > > > ./airflow/task/task_runner/__init__.py
> > > > > ./airflow/api/__init__.py
> > > > > ./airflow/api/common/experimental/__init__.py
> > > > > ./airflow/api/client/__init__.py
> > > > > ./airflow/jobs/__init__.py
> > > > >
> > > > > WDYT?
> > > > >
> > > > > J
> > > > >
> > > > > --
> > > > >
> > > > > Jarek Potiuk
> > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > >
> > > > > M: +48 660 796 129 <+48660796129>
> > > > > [image: Polidea] <https://www.polidea.com/>
> > >
> > >
> > >
> > >
> > > --
> > >
> > >
> > > Jarek Potiuk
> > > Polidea (https://www.polidea.com/) | Principal Software Engineer
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > M: +48 660 796 129 (tel:+48660796129)
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> >
> >
> > Jarek Potiuk
> > Polidea (https://www.polidea.com/) | Principal Software Engineer
> >
> >
> >
> >
> >
> >
> >
> >
> > M: +48 660 796 129 (tel:+48660796129)
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>


Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Jarek Potiuk <Ja...@polidea.com>.
On Wed, Feb 5, 2020 at 11:31 AM Ash Berlin-Taylor <as...@apache.org> wrote:

> Are you talking about/test with making _ALL_ packages implicit namespaces?
>
> I would think that we would only make airflow.providers an implicit
> package, but leave all the sub packages as explicit packages: i.e. this:
>

Nope.

Pretty much all of the problems (except the email, kerberos and sqlachemy)
are caused by implicit providers packages. But I think it's good to change
the names of modules in this case to not be the same as top-level package
names anyway. It's nice that implicit packages helped to detect it.

I think if we are making providers packages implicit, it makes perfect
sense to make all of them implicit where possible (i.e. where we have empty
_init.py). That makes a lot of sense IMHO and frees our brains from making
yet another decision. For almost everything in core it's just __init__.py
removal and everything works. I think in this case consistency is good.

Also we have all the protection to keep it correct in the future. All the
automation we have in CI will detect and clearly report if anyone adds
duplicated module, import package that has the same name as module etc. so
we are very well covered here.

J.


> Does limiting the implicit namespace to just airflow.providers address
> 1+2+3?
> -a
> On Feb 5 2020, at 10:27 am, Jarek Potiuk <Ja...@polidea.com> wrote:
> > TL;DR; I am asking for opinions on some of the changes required to
> introduce implicit native package support. This is the easiest way to make
> backports of master providers packages to 1.10.x.
> >
> >
> > NOTE!: @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong
> (mailto:fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me)
> @Kamil Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:
> kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:
> basharenslak@godatadriven.com) @Philippe Gagnon (mailto:
> philgagnon1@gmail.com) - there is one point 1a) below that I would like
> to get your input.
> >
> > As AIP-21 (import paths) import path move is done we have an important
> step to complete - we should switch to implicit "native" packages. It boils
> down to removal of empty (only with comment licences) __init__.py files and
> relying on python 3.3+ capability of finding packages without having to add
> __init_.py files. This is needed in order to get backported packaged to be
> prepared for 1.10.* series. I worked on it during the last few days - to
> make sure this can be done and I am close to have it. Close enough to know
> I can solve all the remaining problems. I wanted to check that we can do it
> for almost all the packages. I already solved most of the initial problems
> but I have some places where I think I need community opinion on the way we
> should solve them:
> >
> > The PR that has the changes is here:
> https://github.com/apache/airflow/pull/7279/files - it's not 100% ready
> yet and I want to split it into a few separate PRs so do not comment it
> there yet - I extracted the most important changes here and wanted to ask
> your opinion where I have doubts:
> >
> > 1) Module names identical to some top-level package names (for example
> email.py):
> >
> > Some of the module names (for example email.py) has to be changed. The
> problem with implicit packages is that if local module is named the same as
> the top level package, importing the top-level package from it does not
> work:
> >
> > email.py:
> > ...
> > import email <- this won't work - module imports itself.
> >
> > For now in my PR i renamed the modules to be airflow_<old_name>.py or
> <old_name>_utils.py : There are such modules (mostly hooks):
> > email -> email_utils, (utils)
> >
> > cassandra -> airflow_cassandra
> >
> > cloudant -> airflow_cloudant
> >
> > dataproc -> airflow_dataproc
> >
> > grpc -> airlfow_grpc
> >
> > hdfs -> airflow_hdfs
> >
> > jira -> airflow_jira
> >
> > kerberos -> kerberos_security
> >
> > redis -> airflow_redis
> >
> > sendgrid -> airflow_sendgrid
> >
> > snowflake -> airflow_snowflake
> >
> > winrm -> airflow_winrm (operator)
> >
> > datadog -> airflow_datatog(sensor)
> >
> > sqlalchemy -> sqlalchemy_utils (utils)
> >
> >
> >
> > I can also change it to grpc_hooks.py, datadog_sensors.py etc. Or maybe
> someone knows an easy way how to keep the module name and implicit packages?
> >
> > I guess this is not a big problem to change it this way - we anyhow have
> changed import paths for those. The only two problems with this are:
> > a) email was used in "email_backend" configuration : email_backend =
> airflow.utils.email.send_email_smtp
> > but we can solve it by handling that as special case, raising
> deprecation warning and converting it to
> airflow.utils.email_utils.send_email_smtp
> > b) we introduce slight inconsistency in hook/operator/sensor names.
> >
> > 2) Mypy fails when checking individual (duplicated) module names
> >
> > Mypy does not work well with duplicate module names in implicit packages
> (https://github.com/pre-commit/mirrors-mypy/issues/5) and with AIP-21 we
> have plenty of them. Repeating modules (http.py in both operators and hooks
> for example) cause it to fail in case they are provided as parameters (in
> case of incremental mypy check in pre-commits. I had to change it then so
> that mypy always check 'airflow' and 'test' packages instead - which makes
> pre-commit check slightly slower (few seconds).
> >
> > 3) _hooks.py/_operators.py/_sensors.py (
> http://hooks.py/_operators.py/_sensors.py) suffix: Some thought resulting
> from above 1) and 2):
> >
> > I did not want to open pandora's box again but I think removal of
> _hooks, _operators, _sensors from the module name might not have been the
> best decision. While it removes some duplication in module name, it
> actually introduces duplicate module names and (as it is with mypy) it
> might be a problem in the future. I think now we should change it back to
> add the suffixes. If we want to reverse it - this is the last moment we can
> do it (and we still can easily). I would love some quick opinion/voting for
> that - especially those who voted in AIP-21 but others are welcome as well:
> @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong (mailto:
> fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me) @Kamil
> Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:
> kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:
> basharenslak@godatadriven.com) @Philippe Gagnon (mailto:
> philgagnon1@gmail.com) - this was "Case 2" in the original AIP-21
> proposal.
> >
> > WDYT?
> >
> > 4) Test module names:
> >
> > I had to change all test module name for providers. For example: .
> Pytest does not like repeating implicit test module names. It throws error
> "duplicate test module). For example:
> > tests/providers/http/hooks/test_http.py →
> tests/providers/http/hooks/test_http_hooks.py (
> https://github.com/apache/airflow/pull/7279/files#diff-593e762ce1d79c250426a27b6fb28908
> )
> >
> >
> > (basically I added hooks/operators/sensors everywhere). It is easy and
> fully automated and it has no impact at all on the main code base, so I
> think this is a super-safe change. t The nice thing is that it makes it
> easier to understand what kind of tests you are looking at immediately and.
> That also hints that maybe _hooks.py, _operators.py, _sensors.py in
> hooks/operators/sensors could have been a better choice (see 2 a) )
> >
> > 5) Doc generation
> >
> > Autoapi doc generation has not yet released implicit package support (it
> is in master only). However I managed to monkey-patch 1.2.1 (latest)
> version to add support by cherry-picking the changes (there are literally
> few methods changed there -
> https://github.com/readthedocs/sphinx-autoapi/pull/178) and when new
> version is released we will be able to get rid of monkey-patching . I am
> still resolving a few import issues in that newest version but I am
> confident I can fix or workaround all issues with it.
> > I don't expect this to be a problem - it is only in the doc generation
> code, not the main application code.
> >
> > I verified that pytest test discovery works fine, and that airflow can
> be installed and works with implicit packages. Let me know what you think
> about it and I would love to merge it as soon as possible (rebasing this
> for a long time might be painful).
> >
> > J.
> >
> >
> >
> >
> >
> > On Sat, Nov 23, 2019 at 9:22 AM Jarek Potiuk <Jarek.Potiuk@polidea.com
> (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > I've read a bit more about PEP-0420 so some more extracted information:
> > >
> > > - implicit namespace packages are very similar to regular packages.
> The notable difference is that they can be loaded from several directories
> (so you can have modules in the same package but coming from different
> physical directories), so no __file__ is defined for the package (it is
> still defined for modules),
> > >
> > > - pytest should have no problems. It has indeed (
> https://stackoverflow.com/questions/50174130/how-do-i-pytest-a-project-using-pep-420-namespace-packages)
> problems with implicit packages when tests are co-located with their tested
> modules (module.py, module_test.py) but we are not in this situation.
> > >
> > > - I do not see how it could be slower - I have not found any
> benchmarking yet (but I have not found any complaints about it either). I
> looked at the loading algorithm for PEP-420 - it traverses all the
> directory structure available on the PYTHONPATH bo so does the "standard"
> mechanism. And we would not have to parse and load all the 140 licence-only
> __init__.py. The presence of module in the package is based on the presence
> of "<module>.py" in the right folder. If we would have several packages
> elsewhere that might become a bit slower, but I think reading files is so
> much slower than scanning directories that - if anything - we will be
> faster (just opening and closing those __init__.files will take quite some
> time :). We can do some benchmarking before and after to be sure.
> > >
> > > Side comment: I am not 100% sure, but after reading PEP-420 intuition
> tells me that maybe implicit packages can help in solving relative imports
> problems that some of our user raised . I saw recently several users had
> problems with relative imports used in DAGs. We are loading the DAGs in a
> specific way and in implicit packages, the __repr__ of modules parent is
> dynamic - based on the location of the file rather than based on the module
> of the file, so it's likely if we encourage dag developers also to use
> implicit packages, it might actually solve the relative imports case.
> > >
> > > J.
> > > On Fri, Nov 22, 2019 at 4:29 PM Daniel Imberman <
> daniel.imberman@gmail.com (mailto:daniel.imberman@gmail.com)> wrote:
> > > > Fine by me if it doesn’t break anything. I’m always on the side of
> less code :).
> > > >
> > > > via Newton Mail [
> https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2
> ]
> > > > On Fri, Nov 22, 2019 at 8:12 AM, Ash Berlin-Taylor <ash@apache.org
> (mailto:ash@apache.org)> wrote:
> > > > Probably worth it, but things to check:
> > > >
> > > > - if nose or pytest needs them under tests/ to find all the tests
> > > > - if not having the init files slows down pythons package importer
> as it searches more paths?
> > > >
> > > > -a
> > > > > On 22 Nov 2019, at 12:06, Jarek Potiuk <Jarek.Potiuk@polidea.com
> (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > > >
> > > > > There is one implementation detail in AIP-21 that I continue to
> have
> > > > > questions about, and I wanted to ask community about it.
> > > > >
> > > > > Since we moved to Python 3, we have the option of using implicit
> namespace
> > > > > packages.
> > > > > https://www.python.org/dev/peps/pep-0420/ . We have now grouping
> per folder
> > > > > for services, so it would require a lot more __init__.py files if
> we
> > > > > continue using them.
> > > > >
> > > > > The implicit naming boils down to not requiring __init__.py files
> if no
> > > > > initialisation of package is needed. We could do it consistently
> for all
> > > > > code which is "internal" to airflow. This means that most packages
> will not
> > > > > have __init__.py there will be few exceptions like 'airflow',
> > > > > 'airflow.modules' and likely few others with the __init__.py.
> > > > >
> > > > > I did a quick check and we have only 25 _init_.py with some logic -
> > > > > remaining ~ 140 could be removed as they only contain comments.
> > > > >
> > > > > find . -name '__init__.py' | xargs grep -le '^[^#].*' | wc -l
> > > > > 25
> > > > > find . -name '__init__.py' |wc -l
> > > > > 164
> > > > >
> > > > > Those are the files that will be left:
> > > > > ./tests/__init__.py
> > > > > ./tests/utils/log/elasticmock/__init__.py
> > > > > ./tests/utils/log/elasticmock/utilities/__init__.py
> > > > > ./tests/models/__init__.py
> > > > > ./tests/task/__init__.py
> > > > > ./airflow/sensors/__init__.py
> > > > > ./airflow/operators/__init__.py
> > > > > ./airflow/_vendor/nvd3/__init__.py
> > > > > ./airflow/_vendor/slugify/__init__.py
> > > > > ./airflow/serialization/__init__.py
> > > > > ./airflow/__init__.py
> > > > > ./airflow/models/__init__.py
> > > > >
> ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
> > > > > ./airflow/ti_deps/deps/__init__.py
> > > > > ./airflow/macros/__init__.py
> > > > > ./airflow/executors/__init__.py
> > > > > ./airflow/lineage/__init__.py
> > > > > ./airflow/lineage/backend/__init__.py
> > > > > ./airflow/lineage/backend/atlas/__init__.py
> > > > > ./airflow/hooks/__init__.py
> > > > > ./airflow/task/task_runner/__init__.py
> > > > > ./airflow/api/__init__.py
> > > > > ./airflow/api/common/experimental/__init__.py
> > > > > ./airflow/api/client/__init__.py
> > > > > ./airflow/jobs/__init__.py
> > > > >
> > > > > WDYT?
> > > > >
> > > > > J
> > > > >
> > > > > --
> > > > >
> > > > > Jarek Potiuk
> > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > >
> > > > > M: +48 660 796 129 <+48660796129>
> > > > > [image: Polidea] <https://www.polidea.com/>
> > >
> > >
> > >
> > >
> > > --
> > >
> > >
> > > Jarek Potiuk
> > > Polidea (https://www.polidea.com/) | Principal Software Engineer
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > M: +48 660 796 129 (tel:+48660796129)
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> >
> >
> > Jarek Potiuk
> > Polidea (https://www.polidea.com/) | Principal Software Engineer
> >
> >
> >
> >
> >
> >
> >
> > M: +48 660 796 129 (tel:+48660796129)
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Ash Berlin-Taylor <as...@apache.org>.
Are you talking about/test with making _ALL_ packages implicit namespaces?

I would think that we would only make airflow.providers an implicit package, but leave all the sub packages as explicit packages: i.e. this:
airflow/__init__.py
airflow/utils/__init__.py
airflow/providers/ (no __init__.py)
airflow/providers/google/__init__.py

Does limiting the implicit namespace to just airflow.providers address 1+2+3?
-a
On Feb 5 2020, at 10:27 am, Jarek Potiuk <Ja...@polidea.com> wrote:
> TL;DR; I am asking for opinions on some of the changes required to introduce implicit native package support. This is the easiest way to make backports of master providers packages to 1.10.x.
>
>
> NOTE!: @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong (mailto:fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me) @Kamil Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:basharenslak@godatadriven.com) @Philippe Gagnon (mailto:philgagnon1@gmail.com) - there is one point 1a) below that I would like to get your input.
>
> As AIP-21 (import paths) import path move is done we have an important step to complete - we should switch to implicit "native" packages. It boils down to removal of empty (only with comment licences) __init__.py files and relying on python 3.3+ capability of finding packages without having to add __init_.py files. This is needed in order to get backported packaged to be prepared for 1.10.* series. I worked on it during the last few days - to make sure this can be done and I am close to have it. Close enough to know I can solve all the remaining problems. I wanted to check that we can do it for almost all the packages. I already solved most of the initial problems but I have some places where I think I need community opinion on the way we should solve them:
>
> The PR that has the changes is here: https://github.com/apache/airflow/pull/7279/files - it's not 100% ready yet and I want to split it into a few separate PRs so do not comment it there yet - I extracted the most important changes here and wanted to ask your opinion where I have doubts:
>
> 1) Module names identical to some top-level package names (for example email.py):
>
> Some of the module names (for example email.py) has to be changed. The problem with implicit packages is that if local module is named the same as the top level package, importing the top-level package from it does not work:
>
> email.py:
> ...
> import email <- this won't work - module imports itself.
>
> For now in my PR i renamed the modules to be airflow_<old_name>.py or <old_name>_utils.py : There are such modules (mostly hooks):
> email -> email_utils, (utils)
>
> cassandra -> airflow_cassandra
>
> cloudant -> airflow_cloudant
>
> dataproc -> airflow_dataproc
>
> grpc -> airlfow_grpc
>
> hdfs -> airflow_hdfs
>
> jira -> airflow_jira
>
> kerberos -> kerberos_security
>
> redis -> airflow_redis
>
> sendgrid -> airflow_sendgrid
>
> snowflake -> airflow_snowflake
>
> winrm -> airflow_winrm (operator)
>
> datadog -> airflow_datatog(sensor)
>
> sqlalchemy -> sqlalchemy_utils (utils)
>
>
>
> I can also change it to grpc_hooks.py, datadog_sensors.py etc. Or maybe someone knows an easy way how to keep the module name and implicit packages?
>
> I guess this is not a big problem to change it this way - we anyhow have changed import paths for those. The only two problems with this are:
> a) email was used in "email_backend" configuration : email_backend = airflow.utils.email.send_email_smtp
> but we can solve it by handling that as special case, raising deprecation warning and converting it to airflow.utils.email_utils.send_email_smtp
> b) we introduce slight inconsistency in hook/operator/sensor names.
>
> 2) Mypy fails when checking individual (duplicated) module names
>
> Mypy does not work well with duplicate module names in implicit packages (https://github.com/pre-commit/mirrors-mypy/issues/5) and with AIP-21 we have plenty of them. Repeating modules (http.py in both operators and hooks for example) cause it to fail in case they are provided as parameters (in case of incremental mypy check in pre-commits. I had to change it then so that mypy always check 'airflow' and 'test' packages instead - which makes pre-commit check slightly slower (few seconds).
>
> 3) _hooks.py/_operators.py/_sensors.py (http://hooks.py/_operators.py/_sensors.py) suffix: Some thought resulting from above 1) and 2):
>
> I did not want to open pandora's box again but I think removal of _hooks, _operators, _sensors from the module name might not have been the best decision. While it removes some duplication in module name, it actually introduces duplicate module names and (as it is with mypy) it might be a problem in the future. I think now we should change it back to add the suffixes. If we want to reverse it - this is the last moment we can do it (and we still can easily). I would love some quick opinion/voting for that - especially those who voted in AIP-21 but others are welcome as well: @Ash Berlin-Taylor (mailto:ash@apache.org) @Fokko Driesprong (mailto:fokko@driesprong.frl) @Felix Uellendall (mailto:feluelle@pm.me) @Kamil Breguła (mailto:kamil.bregula@polidea.com) @Kaxil Naik (mailto:kaxilnaik@gmail.com) @basharenslak@godatadriven.com (mailto:basharenslak@godatadriven.com) @Philippe Gagnon (mailto:philgagnon1@gmail.com) - this was "Case 2" in the original AIP-21 proposal.
>
> WDYT?
>
> 4) Test module names:
>
> I had to change all test module name for providers. For example: . Pytest does not like repeating implicit test module names. It throws error "duplicate test module). For example:
> tests/providers/http/hooks/test_http.py → tests/providers/http/hooks/test_http_hooks.py (https://github.com/apache/airflow/pull/7279/files#diff-593e762ce1d79c250426a27b6fb28908)
>
>
> (basically I added hooks/operators/sensors everywhere). It is easy and fully automated and it has no impact at all on the main code base, so I think this is a super-safe change. t The nice thing is that it makes it easier to understand what kind of tests you are looking at immediately and. That also hints that maybe _hooks.py, _operators.py, _sensors.py in hooks/operators/sensors could have been a better choice (see 2 a) )
>
> 5) Doc generation
>
> Autoapi doc generation has not yet released implicit package support (it is in master only). However I managed to monkey-patch 1.2.1 (latest) version to add support by cherry-picking the changes (there are literally few methods changed there - https://github.com/readthedocs/sphinx-autoapi/pull/178) and when new version is released we will be able to get rid of monkey-patching . I am still resolving a few import issues in that newest version but I am confident I can fix or workaround all issues with it.
> I don't expect this to be a problem - it is only in the doc generation code, not the main application code.
>
> I verified that pytest test discovery works fine, and that airflow can be installed and works with implicit packages. Let me know what you think about it and I would love to merge it as soon as possible (rebasing this for a long time might be painful).
>
> J.
>
>
>
>
>
> On Sat, Nov 23, 2019 at 9:22 AM Jarek Potiuk <Jarek.Potiuk@polidea.com (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > I've read a bit more about PEP-0420 so some more extracted information:
> >
> > - implicit namespace packages are very similar to regular packages. The notable difference is that they can be loaded from several directories (so you can have modules in the same package but coming from different physical directories), so no __file__ is defined for the package (it is still defined for modules),
> >
> > - pytest should have no problems. It has indeed (https://stackoverflow.com/questions/50174130/how-do-i-pytest-a-project-using-pep-420-namespace-packages) problems with implicit packages when tests are co-located with their tested modules (module.py, module_test.py) but we are not in this situation.
> >
> > - I do not see how it could be slower - I have not found any benchmarking yet (but I have not found any complaints about it either). I looked at the loading algorithm for PEP-420 - it traverses all the directory structure available on the PYTHONPATH bo so does the "standard" mechanism. And we would not have to parse and load all the 140 licence-only __init__.py. The presence of module in the package is based on the presence of "<module>.py" in the right folder. If we would have several packages elsewhere that might become a bit slower, but I think reading files is so much slower than scanning directories that - if anything - we will be faster (just opening and closing those __init__.files will take quite some time :). We can do some benchmarking before and after to be sure.
> >
> > Side comment: I am not 100% sure, but after reading PEP-420 intuition tells me that maybe implicit packages can help in solving relative imports problems that some of our user raised . I saw recently several users had problems with relative imports used in DAGs. We are loading the DAGs in a specific way and in implicit packages, the __repr__ of modules parent is dynamic - based on the location of the file rather than based on the module of the file, so it's likely if we encourage dag developers also to use implicit packages, it might actually solve the relative imports case.
> >
> > J.
> > On Fri, Nov 22, 2019 at 4:29 PM Daniel Imberman <daniel.imberman@gmail.com (mailto:daniel.imberman@gmail.com)> wrote:
> > > Fine by me if it doesn’t break anything. I’m always on the side of less code :).
> > >
> > > via Newton Mail [https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2]
> > > On Fri, Nov 22, 2019 at 8:12 AM, Ash Berlin-Taylor <ash@apache.org (mailto:ash@apache.org)> wrote:
> > > Probably worth it, but things to check:
> > >
> > > - if nose or pytest needs them under tests/ to find all the tests
> > > - if not having the init files slows down pythons package importer as it searches more paths?
> > >
> > > -a
> > > > On 22 Nov 2019, at 12:06, Jarek Potiuk <Jarek.Potiuk@polidea.com (mailto:Jarek.Potiuk@polidea.com)> wrote:
> > > >
> > > > There is one implementation detail in AIP-21 that I continue to have
> > > > questions about, and I wanted to ask community about it.
> > > >
> > > > Since we moved to Python 3, we have the option of using implicit namespace
> > > > packages.
> > > > https://www.python.org/dev/peps/pep-0420/ . We have now grouping per folder
> > > > for services, so it would require a lot more __init__.py files if we
> > > > continue using them.
> > > >
> > > > The implicit naming boils down to not requiring __init__.py files if no
> > > > initialisation of package is needed. We could do it consistently for all
> > > > code which is "internal" to airflow. This means that most packages will not
> > > > have __init__.py there will be few exceptions like 'airflow',
> > > > 'airflow.modules' and likely few others with the __init__.py.
> > > >
> > > > I did a quick check and we have only 25 _init_.py with some logic -
> > > > remaining ~ 140 could be removed as they only contain comments.
> > > >
> > > > find . -name '__init__.py' | xargs grep -le '^[^#].*' | wc -l
> > > > 25
> > > > find . -name '__init__.py' |wc -l
> > > > 164
> > > >
> > > > Those are the files that will be left:
> > > > ./tests/__init__.py
> > > > ./tests/utils/log/elasticmock/__init__.py
> > > > ./tests/utils/log/elasticmock/utilities/__init__.py
> > > > ./tests/models/__init__.py
> > > > ./tests/task/__init__.py
> > > > ./airflow/sensors/__init__.py
> > > > ./airflow/operators/__init__.py
> > > > ./airflow/_vendor/nvd3/__init__.py
> > > > ./airflow/_vendor/slugify/__init__.py
> > > > ./airflow/serialization/__init__.py
> > > > ./airflow/__init__.py
> > > > ./airflow/models/__init__.py
> > > > ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
> > > > ./airflow/ti_deps/deps/__init__.py
> > > > ./airflow/macros/__init__.py
> > > > ./airflow/executors/__init__.py
> > > > ./airflow/lineage/__init__.py
> > > > ./airflow/lineage/backend/__init__.py
> > > > ./airflow/lineage/backend/atlas/__init__.py
> > > > ./airflow/hooks/__init__.py
> > > > ./airflow/task/task_runner/__init__.py
> > > > ./airflow/api/__init__.py
> > > > ./airflow/api/common/experimental/__init__.py
> > > > ./airflow/api/client/__init__.py
> > > > ./airflow/jobs/__init__.py
> > > >
> > > > WDYT?
> > > >
> > > > J
> > > >
> > > > --
> > > >
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129 <+48660796129>
> > > > [image: Polidea] <https://www.polidea.com/>
> >
> >
> >
> >
> > --
> >
> >
> > Jarek Potiuk
> > Polidea (https://www.polidea.com/) | Principal Software Engineer
> >
> >
> >
> >
> >
> >
> >
> > M: +48 660 796 129 (tel:+48660796129)
> >
> >
> >
> >
> >
> >
> >
> >
>
>
> --
>
>
> Jarek Potiuk
> Polidea (https://www.polidea.com/) | Principal Software Engineer
>
>
>
>
>
>
>
> M: +48 660 796 129 (tel:+48660796129)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Jarek Potiuk <Ja...@polidea.com>.
*TL;DR; I am asking for opinions on some of the changes required to
introduce implicit native package support. This is the easiest way to make
backports of master providers packages to 1.10.x. *

NOTE!: @Ash Berlin-Taylor <as...@apache.org> @Fokko Driesprong
<fo...@driesprong.frl> @Felix Uellendall <fe...@pm.me> @Kamil Breguła
<ka...@polidea.com> @Kaxil Naik <ka...@gmail.com>
@basharenslak@godatadriven.com <ba...@godatadriven.com> @Philippe
Gagnon <ph...@gmail.com>  - there is one point 1a) below that I would
like to get your input.

As AIP-21 (import paths) import path move is done we have an important step
to complete - we should switch to implicit "native" packages. It boils down
to removal of empty (only with comment licences) __init__.py files and
relying on python 3.3+ capability of finding packages without having to add
__init_.py files. This is needed in order to get backported packaged to be
prepared for 1.10.* series. I worked on it during the last few days - to
make sure this can be done and I am close to have it. Close enough to know
I can solve all the remaining problems. I wanted to check that we can do it
for almost all the packages. I already solved most of the initial
problems but I have some places where I think I need community opinion on
the way we should solve them:

The PR that has the changes is here:
https://github.com/apache/airflow/pull/7279/files - it's not 100% ready yet
and I want to split it into a few separate PRs so do not comment it there
yet - I extracted the most important changes here and wanted to ask your
opinion where I have doubts:

*1) Module names identical to some top-level package names (for example
email.py):*

Some of the module names (for example email.py) has to be changed. The
problem with implicit packages is that if local module is named the same as
the top level package, importing the top-level package from it does not
work:

email.py:
...
import email <- this won't work - module imports itself.

For now in my PR i renamed the modules to be airflow_<old_name>.py or
<old_name>_utils.py : There are  such modules (mostly hooks):

   - email -> email_utils, (utils)
   - cassandra -> airflow_cassandra
   - cloudant -> airflow_cloudant
   - dataproc -> airflow_dataproc
   - grpc -> airlfow_grpc
   - hdfs -> airflow_hdfs
   - jira -> airflow_jira
   - kerberos -> kerberos_security
   - redis -> airflow_redis
   - sendgrid -> airflow_sendgrid
   - snowflake -> airflow_snowflake
   - winrm -> airflow_winrm (operator)
   - datadog -> airflow_datatog(sensor)
   - sqlalchemy -> sqlalchemy_utils (utils)

I can also change it to grpc_hooks.py, datadog_sensors.py etc. Or maybe
someone knows an easy way how to keep the module name and implicit
packages?

I guess this is not a big problem to change it this way - we anyhow have
changed import paths for those. The only two problems with this are:
a) email was used in "email_backend" configuration : email_backend =
airflow.utils.email.send_email_smtp
but we can solve it by handling that as special case, raising deprecation
warning and converting it to  airflow.utils.email_utils.send_email_smtp
b) we introduce slight inconsistency in hook/operator/sensor names.

*2) Mypy fails when checking individual (duplicated) module names*

Mypy does not work well with duplicate module names in implicit packages (
https://github.com/pre-commit/mirrors-mypy/issues/5) and with AIP-21 we
have plenty of them. Repeating modules (http.py in both operators and hooks
for example) cause it to fail in case they are provided as parameters (in
case of incremental mypy check in pre-commits. I had to change it then so
that mypy always check 'airflow' and 'test' packages instead - which makes
pre-commit check slightly slower (few seconds).

*3) _hooks.py/_operators.py/_sensors.py
<http://hooks.py/_operators.py/_sensors.py> suffix: Some thought resulting
from above 1) and 2): *

I did not want to open pandora's box again but I think removal of _hooks,
_operators, _sensors from the module name might not have been the best
decision. While it removes some duplication in module name, it actually
introduces duplicate module names and (as it is with mypy) it might be a
problem in the future. I think now we should change it back to add the
suffixes. If we want to reverse it - this is the last moment we can do it
(and we still can easily). I would love some quick opinion/voting for that
- especially those who voted in AIP-21 but others are welcome as well: @Ash
Berlin-Taylor <as...@apache.org> @Fokko Driesprong <fo...@driesprong.frl> @Felix
Uellendall <fe...@pm.me> @Kamil Breguła <ka...@polidea.com> @Kaxil
Naik <ka...@gmail.com> @basharenslak@godatadriven.com
<ba...@godatadriven.com> @Philippe Gagnon <ph...@gmail.com>  -
this was "Case 2" in the original AIP-21 proposal.

WDYT?

*4) Test module names:*

I had to change all test module name for providers. For example: . Pytest
does not like repeating implicit test module names. It throws error
"duplicate test module). For example:
tests/providers/http/hooks/test_http.py →
tests/providers/http/hooks/test_http_hooks.py
<https://github.com/apache/airflow/pull/7279/files#diff-593e762ce1d79c250426a27b6fb28908>


(basically I added hooks/operators/sensors everywhere). It is easy and
fully automated and it has no impact at all on the main code base, so I
think this is a super-safe change. t The nice thing is that it makes it
easier to understand what kind of tests you are looking at immediately and.
That also hints that maybe _hooks.py, _operators.py, _sensors.py in
hooks/operators/sensors could have been a better choice (see 2 a) )

*5) Doc generation*

Autoapi doc generation has not yet released implicit package support (it is
in master only). However I managed to monkey-patch 1.2.1 (latest) version
to add support by cherry-picking the changes (there are literally few
methods changed there -
https://github.com/readthedocs/sphinx-autoapi/pull/178) and when new
version is released we will be able to get rid of monkey-patching . I am
still resolving a few import issues in that newest version but I am
confident I can fix or workaround all issues with it.
I don't expect this to be a problem - it is only in the doc generation
code, not the main application code.

I verified that pytest test discovery works fine, and that airflow can be
installed and works with implicit packages. Let me know what you think
about it and I would love to merge it as soon as possible (rebasing this
for a long time might be painful).

J.





On Sat, Nov 23, 2019 at 9:22 AM Jarek Potiuk <Ja...@polidea.com>
wrote:

> I've read a bit more about PEP-0420 so some more extracted information:
>
> - implicit namespace packages are very similar to regular packages. The
> notable difference is that they can be loaded from several directories (so
> you can have modules in the same package but coming from different physical
> directories), so no __file__ is defined for the package (it is still
> defined for modules),
>
> - pytest should have no problems. It has indeed
> <https://stackoverflow.com/questions/50174130/how-do-i-pytest-a-project-using-pep-420-namespace-packages>
> problems with implicit packages when tests are co-located with their tested
> modules (module.py, module_test.py) but we are not in this situation.
>
> - I do not see how it could be slower - I have not found any
> benchmarking yet (but I have not found any complaints about it either). I
> looked at the loading algorithm for PEP-420 - it traverses all the
> directory structure available on the PYTHONPATH bo so does the "standard"
> mechanism. And we would not have to parse and load all the 140 licence-only
> __init__.py. The presence of module in the package is based on the presence
> of "<module>.py" in the right folder. If we would have several packages
> elsewhere that might become a bit slower, but I think reading files is so
> much slower than scanning directories that - if anything - we will be
> faster (just opening and closing those __init__.files will take quite some
> time :). We can do some benchmarking before and after to be sure.
>
> Side comment: I am not 100% sure, but after reading PEP-420  intuition
> tells me that maybe implicit packages can help in solving relative imports
> problems that some of our user raised . I saw recently several users had
> problems with relative imports used in DAGs. We are loading the DAGs in a
> specific way and in implicit packages, the __repr__ of modules parent is
> dynamic - based on the location of the file rather than based on the module
> of the file, so it's likely if we encourage dag developers also to use
> implicit packages, it might actually solve the relative imports case.
>
> J.
>
> On Fri, Nov 22, 2019 at 4:29 PM Daniel Imberman <da...@gmail.com>
> wrote:
>
>> Fine by me if it doesn’t break anything. I’m always on the side of less
>> code :).
>>
>> via Newton Mail [
>> https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2
>> ]
>> On Fri, Nov 22, 2019 at 8:12 AM, Ash Berlin-Taylor <as...@apache.org>
>> wrote:
>> Probably worth it, but things to check:
>>
>> - if nose or pytest needs them under tests/ to find all the tests
>> - if not having the init files slows down pythons package importer as it
>> searches more paths?
>>
>> -a
>>
>> > On 22 Nov 2019, at 12:06, Jarek Potiuk <Ja...@polidea.com>
>> wrote:
>> >
>> > There is one implementation detail in AIP-21 that I continue to have
>> > questions about, and I wanted to ask community about it.
>> >
>> > Since we moved to Python 3, we have the option of using implicit
>> namespace
>> > packages.
>> > https://www.python.org/dev/peps/pep-0420/ . We have now grouping per
>> folder
>> > for services, so it would require a lot more __init__.py files if we
>> > continue using them.
>> >
>> > The implicit naming boils down to not requiring __init__.py files if no
>> > initialisation of package is needed. We could do it consistently for all
>> > code which is "internal" to airflow. This means that most packages will
>> not
>> > have __init__.py there will be few exceptions like 'airflow',
>> > 'airflow.modules' and likely few others with the __init__.py.
>> >
>> > I did a quick check and we have only 25 _init_.py with some logic -
>> > remaining ~ 140 could be removed as they only contain comments.
>> >
>> > find . -name '__init__.py' | xargs grep -le '^[^#].*' | wc -l
>> > 25
>> > find . -name '__init__.py' |wc -l
>> > 164
>> >
>> > Those are the files that will be left:
>> > ./tests/__init__.py
>> > ./tests/utils/log/elasticmock/__init__.py
>> > ./tests/utils/log/elasticmock/utilities/__init__.py
>> > ./tests/models/__init__.py
>> > ./tests/task/__init__.py
>> > ./airflow/sensors/__init__.py
>> > ./airflow/operators/__init__.py
>> > ./airflow/_vendor/nvd3/__init__.py
>> > ./airflow/_vendor/slugify/__init__.py
>> > ./airflow/serialization/__init__.py
>> > ./airflow/__init__.py
>> > ./airflow/models/__init__.py
>> >
>> ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
>> > ./airflow/ti_deps/deps/__init__.py
>> > ./airflow/macros/__init__.py
>> > ./airflow/executors/__init__.py
>> > ./airflow/lineage/__init__.py
>> > ./airflow/lineage/backend/__init__.py
>> > ./airflow/lineage/backend/atlas/__init__.py
>> > ./airflow/hooks/__init__.py
>> > ./airflow/task/task_runner/__init__.py
>> > ./airflow/api/__init__.py
>> > ./airflow/api/common/experimental/__init__.py
>> > ./airflow/api/client/__init__.py
>> > ./airflow/jobs/__init__.py
>> >
>> > WDYT?
>> >
>> > J
>> >
>> > --
>> >
>> > Jarek Potiuk
>> > Polidea <https://www.polidea.com/> | Principal Software Engineer
>> >
>> > M: +48 660 796 129 <+48660796129>
>> > [image: Polidea] <https://www.polidea.com/>
>
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Jarek Potiuk <Ja...@polidea.com>.
I've read a bit more about PEP-0420 so some more extracted information:

- implicit namespace packages are very similar to regular packages. The
notable difference is that they can be loaded from several directories (so
you can have modules in the same package but coming from different physical
directories), so no __file__ is defined for the package (it is still
defined for modules),

- pytest should have no problems. It has indeed
<https://stackoverflow.com/questions/50174130/how-do-i-pytest-a-project-using-pep-420-namespace-packages>
problems with implicit packages when tests are co-located with their tested
modules (module.py, module_test.py) but we are not in this situation.

- I do not see how it could be slower - I have not found any
benchmarking yet (but I have not found any complaints about it either). I
looked at the loading algorithm for PEP-420 - it traverses all the
directory structure available on the PYTHONPATH bo so does the "standard"
mechanism. And we would not have to parse and load all the 140 licence-only
__init__.py. The presence of module in the package is based on the presence
of "<module>.py" in the right folder. If we would have several packages
elsewhere that might become a bit slower, but I think reading files is so
much slower than scanning directories that - if anything - we will be
faster (just opening and closing those __init__.files will take quite some
time :). We can do some benchmarking before and after to be sure.

Side comment: I am not 100% sure, but after reading PEP-420  intuition
tells me that maybe implicit packages can help in solving relative imports
problems that some of our user raised . I saw recently several users had
problems with relative imports used in DAGs. We are loading the DAGs in a
specific way and in implicit packages, the __repr__ of modules parent is
dynamic - based on the location of the file rather than based on the module
of the file, so it's likely if we encourage dag developers also to use
implicit packages, it might actually solve the relative imports case.

J.

On Fri, Nov 22, 2019 at 4:29 PM Daniel Imberman <da...@gmail.com>
wrote:

> Fine by me if it doesn’t break anything. I’m always on the side of less
> code :).
>
> via Newton Mail [
> https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2
> ]
> On Fri, Nov 22, 2019 at 8:12 AM, Ash Berlin-Taylor <as...@apache.org> wrote:
> Probably worth it, but things to check:
>
> - if nose or pytest needs them under tests/ to find all the tests
> - if not having the init files slows down pythons package importer as it
> searches more paths?
>
> -a
>
> > On 22 Nov 2019, at 12:06, Jarek Potiuk <Ja...@polidea.com> wrote:
> >
> > There is one implementation detail in AIP-21 that I continue to have
> > questions about, and I wanted to ask community about it.
> >
> > Since we moved to Python 3, we have the option of using implicit
> namespace
> > packages.
> > https://www.python.org/dev/peps/pep-0420/ . We have now grouping per
> folder
> > for services, so it would require a lot more __init__.py files if we
> > continue using them.
> >
> > The implicit naming boils down to not requiring __init__.py files if no
> > initialisation of package is needed. We could do it consistently for all
> > code which is "internal" to airflow. This means that most packages will
> not
> > have __init__.py there will be few exceptions like 'airflow',
> > 'airflow.modules' and likely few others with the __init__.py.
> >
> > I did a quick check and we have only 25 _init_.py with some logic -
> > remaining ~ 140 could be removed as they only contain comments.
> >
> > find . -name '__init__.py' | xargs grep -le '^[^#].*' | wc -l
> > 25
> > find . -name '__init__.py' |wc -l
> > 164
> >
> > Those are the files that will be left:
> > ./tests/__init__.py
> > ./tests/utils/log/elasticmock/__init__.py
> > ./tests/utils/log/elasticmock/utilities/__init__.py
> > ./tests/models/__init__.py
> > ./tests/task/__init__.py
> > ./airflow/sensors/__init__.py
> > ./airflow/operators/__init__.py
> > ./airflow/_vendor/nvd3/__init__.py
> > ./airflow/_vendor/slugify/__init__.py
> > ./airflow/serialization/__init__.py
> > ./airflow/__init__.py
> > ./airflow/models/__init__.py
> >
> ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
> > ./airflow/ti_deps/deps/__init__.py
> > ./airflow/macros/__init__.py
> > ./airflow/executors/__init__.py
> > ./airflow/lineage/__init__.py
> > ./airflow/lineage/backend/__init__.py
> > ./airflow/lineage/backend/atlas/__init__.py
> > ./airflow/hooks/__init__.py
> > ./airflow/task/task_runner/__init__.py
> > ./airflow/api/__init__.py
> > ./airflow/api/common/experimental/__init__.py
> > ./airflow/api/client/__init__.py
> > ./airflow/jobs/__init__.py
> >
> > WDYT?
> >
> > J
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>



-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Daniel Imberman <da...@gmail.com>.
Fine by me if it doesn’t break anything. I’m always on the side of less code :).

via Newton Mail [https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2]
On Fri, Nov 22, 2019 at 8:12 AM, Ash Berlin-Taylor <as...@apache.org> wrote:
Probably worth it, but things to check:

- if nose or pytest needs them under tests/ to find all the tests
- if not having the init files slows down pythons package importer as it searches more paths?

-a

> On 22 Nov 2019, at 12:06, Jarek Potiuk <Ja...@polidea.com> wrote:
>
> There is one implementation detail in AIP-21 that I continue to have
> questions about, and I wanted to ask community about it.
>
> Since we moved to Python 3, we have the option of using implicit namespace
> packages.
> https://www.python.org/dev/peps/pep-0420/ . We have now grouping per folder
> for services, so it would require a lot more __init__.py files if we
> continue using them.
>
> The implicit naming boils down to not requiring __init__.py files if no
> initialisation of package is needed. We could do it consistently for all
> code which is "internal" to airflow. This means that most packages will not
> have __init__.py there will be few exceptions like 'airflow',
> 'airflow.modules' and likely few others with the __init__.py.
>
> I did a quick check and we have only 25 _init_.py with some logic -
> remaining ~ 140 could be removed as they only contain comments.
>
> find . -name '__init__.py' | xargs grep -le '^[^#].*' | wc -l
> 25
> find . -name '__init__.py' |wc -l
> 164
>
> Those are the files that will be left:
> ./tests/__init__.py
> ./tests/utils/log/elasticmock/__init__.py
> ./tests/utils/log/elasticmock/utilities/__init__.py
> ./tests/models/__init__.py
> ./tests/task/__init__.py
> ./airflow/sensors/__init__.py
> ./airflow/operators/__init__.py
> ./airflow/_vendor/nvd3/__init__.py
> ./airflow/_vendor/slugify/__init__.py
> ./airflow/serialization/__init__.py
> ./airflow/__init__.py
> ./airflow/models/__init__.py
> ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
> ./airflow/ti_deps/deps/__init__.py
> ./airflow/macros/__init__.py
> ./airflow/executors/__init__.py
> ./airflow/lineage/__init__.py
> ./airflow/lineage/backend/__init__.py
> ./airflow/lineage/backend/atlas/__init__.py
> ./airflow/hooks/__init__.py
> ./airflow/task/task_runner/__init__.py
> ./airflow/api/__init__.py
> ./airflow/api/common/experimental/__init__.py
> ./airflow/api/client/__init__.py
> ./airflow/jobs/__init__.py
>
> WDYT?
>
> J
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>

Re: Using implicit namespace packages in Airflow (removing __init__.py where empty)

Posted by Ash Berlin-Taylor <as...@apache.org>.
Probably worth it, but things to check:

- if nose or pytest needs them under tests/ to find all the tests
- if not having the init files slows down pythons package importer as it searches more paths?

-a

> On 22 Nov 2019, at 12:06, Jarek Potiuk <Ja...@polidea.com> wrote:
> 
> There is one implementation detail in AIP-21 that I continue to have
> questions about, and I wanted to ask community about it.
> 
> Since we moved to Python 3, we have the option of using implicit namespace
> packages.
> https://www.python.org/dev/peps/pep-0420/ . We have now grouping per folder
> for services, so it would require a lot more __init__.py files if we
> continue using them.
> 
> The implicit naming boils down to not requiring __init__.py files if no
> initialisation of package is needed. We could do it consistently for all
> code which is "internal" to airflow. This means that most packages will not
> have __init__.py there will be few exceptions like 'airflow',
> 'airflow.modules' and likely few others with the __init__.py.
> 
> I did a quick check and we have only 25 _init_.py with some logic -
> remaining ~ 140 could be removed as they only contain comments.
> 
> find . -name '__init__.py' | xargs grep -le '^[^#].*'  | wc -l
>      25
> find . -name '__init__.py' |wc -l
>     164
> 
> Those are the files that will be left:
> ./tests/__init__.py
> ./tests/utils/log/elasticmock/__init__.py
> ./tests/utils/log/elasticmock/utilities/__init__.py
> ./tests/models/__init__.py
> ./tests/task/__init__.py
> ./airflow/sensors/__init__.py
> ./airflow/operators/__init__.py
> ./airflow/_vendor/nvd3/__init__.py
> ./airflow/_vendor/slugify/__init__.py
> ./airflow/serialization/__init__.py
> ./airflow/__init__.py
> ./airflow/models/__init__.py
> ./airflow/www/node_modules/npm/node_modules/node-gyp/gyp/pylib/gyp/__init__.py
> ./airflow/ti_deps/deps/__init__.py
> ./airflow/macros/__init__.py
> ./airflow/executors/__init__.py
> ./airflow/lineage/__init__.py
> ./airflow/lineage/backend/__init__.py
> ./airflow/lineage/backend/atlas/__init__.py
> ./airflow/hooks/__init__.py
> ./airflow/task/task_runner/__init__.py
> ./airflow/api/__init__.py
> ./airflow/api/common/experimental/__init__.py
> ./airflow/api/client/__init__.py
> ./airflow/jobs/__init__.py
> 
> WDYT?
> 
> J
> 
> -- 
> 
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
> 
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>