Posted to dev@airflow.apache.org by Jarek Potiuk <ja...@potiuk.com> on 2022/05/01 15:26:21 UTC

[DISCUSS] Support "slim" PROD image(s) for Airflow

Hello everyone,

TL;DR: I am looking for consensus on releasing "slim" versions of PROD
images - ones that will be way smaller and contain no providers nor
other extras and would be database-specific.

Context:

Now that we are done with some infra changes that were also released
in 2.3.0, I came back to the issue raised in
https://github.com/apache/airflow/issues/20849, which was originally
about a "vanilla" image for Airflow, but I renamed the idea to "slim"
image (following a similar convention used by various distro and Python
image providers). The issue itself explains why there is a need for such
images.

The idea is to have a very small "base" ("slim") image that users will
be able to extend, in addition to the "regular" (see the relation with
"slim" :D ?) image where we pre-install a set of providers and
support multiple database backends.

The "slim" images also have the advantage that we can use
"no-constraints" dependencies with them - which means that in those
images, the dependencies are the latest versions that Airflow supports,
even if some providers would otherwise limit them.

I looked at what it would mean and really what it translates to is
that we would have to push many more images.

The bad news:

We need to push a matrix of 4 * 3 = 12 new "slim" images (plus some
aliases for "latest"):
*  Python versions: 3.7, 3.8, 3.9, 3.10
*  Database: postgres, mysql, mssql

Postgres images would additionally be multiplatform (AMD64/ARM64), and
for now MySQL and MSSQL would be AMD64 only (until we add ARM support
for those).
That is a lot of images, but this is a rather normal approach if
you look at a number of other images published by multiple
"platform-like" products.
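
To make the size of that matrix concrete, here is a minimal Python sketch that
enumerates the combinations described above. It is only an illustration, not
part of the PR: the function names are hypothetical, and the "linux/amd64" /
"linux/arm64" strings are assumed stand-ins for the AMD64/ARM64 platforms
mentioned in the text.

```python
from itertools import product

# Values taken from the proposal above.
PYTHON_VERSIONS = ["3.7", "3.8", "3.9", "3.10"]
DATABASES = ["postgres", "mysql", "mssql"]


def platforms_for(database):
    """Return the platforms an image would be built for (hypothetical helper).

    Per the proposal, only the postgres images are multiplatform for now.
    The platform strings themselves are an assumption for illustration.
    """
    if database == "postgres":
        return ["linux/amd64", "linux/arm64"]
    return ["linux/amd64"]


def slim_image_matrix():
    """Enumerate every (Python version, database) combination."""
    return [
        {"python": py, "database": db, "platforms": platforms_for(db)}
        for py, db in product(PYTHON_VERSIONS, DATABASES)
    ]


if __name__ == "__main__":
    matrix = slim_image_matrix()
    print(f"{len(matrix)} slim images to build")
    for entry in matrix:
        print(entry["python"], entry["database"], ",".join(entry["platforms"]))
```

Counting the entries gives the 4 * 3 = 12 images from the text; the "latest"
aliases would come on top of that.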

The good news:

We only need to do it at release time, and we already have the right
set of scripts and parameters to enable that. It will take a bit
longer, but those images are much smaller, and building and pushing
them is WAY faster than for the regular image.

Some comparison:

Size (uncompressed): Regular (1.1G), Slim (500MB)
Time to build a single image: Regular (6m), Slim (up to 3m)

Overall the release process would take some 20 mins longer if we
release the slim images (and I already made it a separate step so it
should not block "regular" release).

The very good news:

I've actually prepared a PR:
https://github.com/apache/airflow/pull/23391 to add this feature
(including the docs), and it's a very small change. It does not change
any of the source code of Airflow or the Dockerfile; we basically need to
extend our "dev" script that builds and pushes images to ... build and push
more images. I even prepared and pushed 2.3.0 images of
Airflow to my private dockerhub account so that everyone can see what
it will look like.

You can see it here:
https://hub.docker.com/repository/docker/potiuk/airflow/tags?page=1&ordering=last_updated&name=2.3.0

I **believe** those changes don't even need PMC votes for release, as
this is more a procedural change than a software release, so we
**could** release the "slim" 2.3.0 images even now - so that they are
available as of 2.3.0. I think that if we see that this is a welcome
change (despite the added complexity of our available dockerhub images), it
could even be agreed to via lazy consensus if we see consensus
forming.

J.

Re: [DISCUSS] Support "slim" PROD image(s) for Airflow

Posted by Jed Cunningham <je...@apache.org>.
Cool! Glad it worked out.

Re: [DISCUSS] Support "slim" PROD image(s) for Airflow

Posted by Jarek Potiuk <ja...@potiuk.com>.
The PR is updated - I think that solves the main problem I had with the
ballooning number of images :). I guess that adding just one parallel
"slim" image to the already existing images is far less controversial, so I will
call for a lazy consensus :)

Thanks Jed. It's quite obvious now that you mentioned it; I wonder why I had
not thought about it before :).

This is what the convention will look like then.

+----------------+------------------+---------------------------------+--------------------------------------+
| Image          | Python           | Standard image                  | Slim image                           |
+================+==================+=================================+======================================+
| Latest default | 3.7              | apache/airflow:latest           | apache/airflow:slim-latest           |
| Default        | 3.7              | apache/airflow:X.Y.Z            | apache/airflow:slim-X.Y.Z            |
| Latest         | 3.7,3.8,3.9,3.10 | apache/airflow:latest-pythonN.M | apache/airflow:slim-latest-pythonN.M |
| Specific       | 3.7,3.8,3.9,3.10 | apache/airflow:X.Y.Z-pythonN.M  | apache/airflow:slim-X.Y.Z-pythonN.M  |
+----------------+------------------+---------------------------------+--------------------------------------+
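
The naming convention in the table above can be sketched as a small Python
helper. This is only an illustration of the pattern, not code from the PR or
the release scripts; the function name and its parameters are hypothetical.

```python
def image_tag(airflow_version=None, python_version=None, slim=False):
    """Build an image reference following the convention in the table.

    Hypothetical helper for illustration only:
    - airflow_version=None stands for the "latest" alias,
    - python_version=None stands for the default Python (no suffix),
    - slim=True adds the "slim-" prefix to the tag.
    """
    version_part = airflow_version if airflow_version is not None else "latest"
    tag = f"slim-{version_part}" if slim else version_part
    if python_version is not None:
        tag = f"{tag}-python{python_version}"
    return f"apache/airflow:{tag}"


if __name__ == "__main__":
    print(image_tag())                              # apache/airflow:latest
    print(image_tag(slim=True))                     # apache/airflow:slim-latest
    print(image_tag("2.3.0"))                       # apache/airflow:2.3.0
    print(image_tag("2.3.0", "3.9", slim=True))     # apache/airflow:slim-2.3.0-python3.9
```

The point of the convention is that every standard tag gets exactly one "slim"
counterpart, so the number of tags doubles rather than growing with the number
of database backends.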

J.


On Thu, May 5, 2022 at 10:28 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Yeah. Indeed it's almost no difference, that will simplify things a lot.
> Good Idea Jed. I will update the PR to reflect it :)
>
> On Thu, May 5, 2022 at 10:17 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Good point. Let me try :)
>>
>>
>> On Thu, May 5, 2022 at 5:57 AM Jed Cunningham <je...@apache.org>
>> wrote:
>>
>>> How much bigger would the image be if we included postgres, mysql, and
>>> mssql in the same image? That'd mean we'd have 4 vs 12 (ignoring the
>>> platform piece), and might be worth the tradeoff.
>>>
>>

Re: [DISCUSS] Support "slim" PROD image(s) for Airflow

Posted by Jarek Potiuk <ja...@potiuk.com>.
Yeah. Indeed it's almost no difference, that will simplify things a lot.
Good Idea Jed. I will update the PR to reflect it :)

On Thu, May 5, 2022 at 10:17 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Good point. Let me try :)
>
>
> On Thu, May 5, 2022 at 5:57 AM Jed Cunningham <je...@apache.org>
> wrote:
>
>> How much bigger would the image be if we included postgres, mysql, and
>> mssql in the same image? That'd mean we'd have 4 vs 12 (ignoring the
>> platform piece), and might be worth the tradeoff.
>>
>

Re: [DISCUSS] Support "slim" PROD image(s) for Airflow

Posted by Jarek Potiuk <ja...@potiuk.com>.
Good point. Let me try :)


On Thu, May 5, 2022 at 5:57 AM Jed Cunningham <je...@apache.org>
wrote:

> How much bigger would the image be if we included postgres, mysql, and
> mssql in the same image? That'd mean we'd have 4 vs 12 (ignoring the
> platform piece), and might be worth the tradeoff.
>

Re: [DISCUSS] Support "slim" PROD image(s) for Airflow

Posted by Jed Cunningham <je...@apache.org>.
How much bigger would the image be if we included postgres, mysql, and
mssql in the same image? That'd mean we'd have 4 vs 12 (ignoring the
platform piece), and might be worth the tradeoff.

Re: [DISCUSS] Support "slim" PROD image(s) for Airflow

Posted by Howard Yoo <ho...@gmail.com>.
I also like the idea of SLIM images - always helpful.

Howard

On Wed, May 4, 2022 at 4:53 PM Ping Zhang <pi...@umich.edu> wrote:

> Hi Jarek,
>
> I really like the idea of having a slim airflow docker image.  500MB
> uncompressed is tiny 👍
>
>
> Thanks,
>
> Ping
>
>

Re: [DISCUSS] Support "slim" PROD image(s) for Airflow

Posted by Ping Zhang <pi...@umich.edu>.
Hi Jarek,

I really like the idea of having a slim airflow docker image.  500MB
uncompressed is tiny 👍


Thanks,

Ping


On Sun, May 1, 2022 at 8:41 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> And just to clarify. Those "slim" images are not at all "toothless". You
> can actually do stuff with them :)
>
> The 4 preinstalled providers are:
>
> apache-airflow-providers-ftp    | File Transfer Protocol (FTP) https://tools.ietf.org/html/rfc114             | 2.1.2
> apache-airflow-providers-http   | Hypertext Transfer Protocol (HTTP) https://www.w3.org/Protocols/            | 2.1.2
> apache-airflow-providers-imap   | Internet Message Access Protocol (IMAP) https://tools.ietf.org/html/rfc3501 | 2.2.3
> apache-airflow-providers-sqlite | SQLite https://www.sqlite.org/                                              | 2.1.3
>
> We could probably further slim them down but that would limit the
> extensibility a bit and I consider 500 MB uncompressed as pretty "decent" -
> it's ~ 130-160 MB of compressed data when you pull the image.
>
> J.
>
>
>

Re: [DISCUSS] Support "slim" PROD image(s) for Airflow

Posted by Jarek Potiuk <ja...@potiuk.com>.
And just to clarify. Those "slim" images are not at all "toothless". You
can actually do stuff with them :)

The 4 preinstalled providers are:

apache-airflow-providers-ftp    | File Transfer Protocol (FTP) https://tools.ietf.org/html/rfc114             | 2.1.2
apache-airflow-providers-http   | Hypertext Transfer Protocol (HTTP) https://www.w3.org/Protocols/            | 2.1.2
apache-airflow-providers-imap   | Internet Message Access Protocol (IMAP) https://tools.ietf.org/html/rfc3501 | 2.2.3
apache-airflow-providers-sqlite | SQLite https://www.sqlite.org/                                              | 2.1.3

We could probably further slim them down but that would limit the
extensibility a bit and I consider 500 MB uncompressed as pretty "decent" -
it's ~ 130-160 MB of compressed data when you pull the image.

J.



