Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/10/21 13:21:45 UTC

[GitHub] [airflow] potiuk commented on issue #6266: [AIRFLOW-2439] Production Docker image support including refactoring of build scripts - depends on [AIRFLOW-5704]

URL: https://github.com/apache/airflow/pull/6266#issuecomment-544511732
 
 
   > I've left some review comments (see below) but the main thing I want to think about here:
   > 
   > * What packages are included in the prod image? 1GB is a _very_ heavy image by docker standards. (I have a feeling that most of the space is from hadoop/jvm/cassandra? If this is true?)
   
   I still have to update the documentation. It's wrong. The PROD image size is 387 MB (Python 3.7), 408 MB (Python 3.5) and 410 MB (Python 3.6). I can review it and see if anything else can be removed. The PROD image does not contain Cassandra/Hadoop/JVM, nor NPM or node modules (this is optimised away in the current version). The CI image is around 1 GB and contains a lot of extra packages. The basic list of packages is quite easy to see in the Dockerfile:
   
   - The PROD image is built on top of airflow-base, which contains: apt-utils, build-essential, curl, dirmngr, freetds-bin, freetds-dev, git, gosu, libffi-dev, libkrb5-dev, libpq-dev, libsasl2-2, libsasl2-dev, libsasl2-modules, libssl-dev, locales, netcat, rsync, sasl2-bin, sudo and libmariadb-dev-compat.
   - Then, in a separate stage, the Airflow sources are compiled with NPM and only the resulting 'prod' .js assets are copied into the production image (using Docker's multi-stage `COPY --from` feature).
   - Then, in another separate stage, the docs are built and the resulting HTML is also copied into the production image (again via `COPY --from`, without storing any other artifacts). This way the documentation is also part of the image and reachable via the UI.
   - Finally, `pip install` is executed only once, by default with 'all' dependencies (snakebite is removed after installation, as this is the easiest way for now until we fix snakebite's Python 3 compatibility problem). As discussed before, I separated the 'devel' dependencies out (previously they were installed whenever 'all' was installed). The only thing that may take significant space (it certainly takes quite a lot of time) is the Cassandra Python driver, which requires Cython and build-essential to compile. But this is only the client, and I am not sure we can save much by dropping build-essential, since it may be needed to install other packages.
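   The multi-stage flow above could be sketched roughly like this (a simplified illustration; the stage names, paths and npm script are hypothetical, not the actual Dockerfile):

   ```dockerfile
   # Illustrative sketch only -- stage names and paths are hypothetical
   FROM python:3.7-slim-buster AS airflow-base
   RUN apt-get update \
       && apt-get install -y --no-install-recommends build-essential curl libpq-dev \
       && apt-get clean

   # Separate stage: compile the www assets with NPM
   FROM airflow-base AS www-builder
   COPY airflow/www /www
   RUN cd /www && npm ci && npm run prod

   # Final image: only the compiled .js assets are copied in,
   # so node/npm and the intermediate build artifacts never land here
   FROM airflow-base AS prod
   COPY --from=www-builder /www/static/dist /opt/airflow/www/static/dist
   RUN pip install --no-cache-dir apache-airflow[all]
   ```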
   
   We could potentially do one more optimisation here: add another separate stage that installs the packages with pip's --user switch (so they are all installed into the .local directory) and then copy that directory into the main image. I will take a look and see how much we can save there. This would mean we no longer have to install build-essential (so gcc/g++) and Cython in the final image. It would also solve a problem I have now, where an extra layer for the Airflow sources is added and then removed later; with another stage I can get rid of it. This way we can save maybe 50-60 MB. I will take a look.
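   That optimisation would look roughly like this (a hypothetical sketch, not the actual Dockerfile):

   ```dockerfile
   # Hypothetical sketch of the pip --user staging optimisation
   FROM python:3.7-slim-buster AS builder
   # Build tools exist only in this throw-away stage
   RUN apt-get update && apt-get install -y --no-install-recommends build-essential
   # --user installs everything under /root/.local
   RUN pip install --user --no-cache-dir apache-airflow[all]

   FROM python:3.7-slim-buster AS prod
   # Copy only the installed packages; gcc/g++ and Cython never enter this image
   COPY --from=builder /root/.local /root/.local
   ENV PATH=/root/.local/bin:$PATH
   ```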
   
   > * What about a "prod-slim" image containing just core deps? Or core + postgres, mysql, aws, gcp?
   
   I think that at ~400 MB we are close to the minimal achievable size for a PROD Airflow image. We could save a bit with Alpine, but we all agreed that is not a good idea. I already use the buster-slim Python base image, which is really small. There might be some optimisation in package installation, as described above; I still have to see how much we can save there. I will experiment.
   
   > * How about not including compiler toolchain and *-dev libs in the final prod image?
   
   Yep. This is exactly what I can do using `COPY --from`, as described above. We came to the same conclusion (I only read that comment after I wrote the answer above).
   
   > * For building a prod image (of say 1.10.5) do we need to do more than `pip install apache-airflow==1.10.5`. (Specifically we don't need to do npm as that is already done for packaged releases.)
   
   For an image built from released packages, yes, we could do that. But part of my idea of building the production image is to build it continuously, all the time, from the current sources rather than from released packages. This means it should be buildable from sources, in a similar fashion to the CI image. We even have a "Build PROD image" step on Travis CI in this version to test it (though I only run it in the master/CRON job).
   
   Doing that in CI lets us detect all the packaging/dependency/image problems early, between releases (every time we build from master), rather than during a release (when we are under pressure). It also allows us to easily skip the compiler toolchain (by smart use of `COPY --from`, as described above). And it is much easier to automate on CI for pull requests; otherwise you would have to install Airflow from this particular commit in this particular fork (using TRAVIS variables) rather than from the already checked-out sources.
   
   I think that building the PROD image from sources is better than building it from released packages, and it should be part of the release process rather than a post-release step. But I can be convinced otherwise if there are good reasons. We can of course discuss this and have both: build from sources for regular builds and from GitHub releases when we cut a release. Some more conditionals/variants of the image would be needed for that (`COPY .` is not something you can do conditionally in a Dockerfile).
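   The two variants differ roughly like this (an illustrative sketch; the version pin and paths are hypothetical):

   ```dockerfile
   # Variant A (hypothetical): build from a released package
   FROM python:3.7-slim-buster
   RUN pip install --no-cache-dir apache-airflow==1.10.5
   ```

   ```dockerfile
   # Variant B (hypothetical): build from the checked-out sources.
   # The COPY . instruction is what cannot be made conditional,
   # which is why supporting both needs separate Dockerfile variants.
   FROM python:3.7-slim-buster
   COPY . /opt/airflow
   RUN pip install --no-cache-dir /opt/airflow
   ```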
