Posted to dev@beam.apache.org by Valentyn Tymofieiev via dev <de...@beam.apache.org> on 2023/05/02 14:49:19 UTC

Re: [DISCUSS] Dependency management in Apache Beam Python SDK

Hi All,

just wanted to give a quick update on the effort discussed here:

The action items from the retrospective are tracked in
https://github.com/apache/beam/issues/25652.

Many outdated dependencies were updated in
https://github.com/apache/beam/pull/24599 by Anand Inguva
<an...@google.com>, and the remaining older libraries (dill, apitools,
pandas) are non-trivial and well known; they are tracked separately and
have also made progress. In particular, Bjorn Pedersen
<bj...@google.com> is working on removing aspects of the apitools
dependency, and I took a stab at updating dill
<https://github.com/apache/beam/issues/22893>. We are starting to
actively test Beam against pre-release versions of our dependencies
(you may have seen threads from Anand about it), and I wrote some
guidelines for Python SDK maintainers pertaining to dependency
management in
https://docs.google.com/document/d/1euZogGjbW4VZNJMFrA5AL1keR5gZO5l45H9b9CoQ0SI/edit,
which I plan to merge into the Beam website and/or wiki. Feel free to
take a look, especially if you are committing code to the Python SDK.
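
As a rough illustration (the actual test setup lives in the Beam repo and
may differ), testing against pre-release versions of dependencies can be as
simple as installing with pip's --pre flag before running the test suite:

    # Sketch only: pull in pre-release versions of selected dependencies,
    # then run the SDK tests against them.
    pip install --pre --upgrade pandas numpy pyarrow
    pytest sdks/python  # hypothetical invocation; Beam's actual runners differ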

Once again thanks to everyone who provided feedback so far.

Valentyn

On Fri, Aug 26, 2022 at 3:40 PM Kerry Donny-Clark <ke...@google.com>
wrote:

> Jarek, I really appreciate you sharing your experience and expertise here.
> I think Beam would benefit from adopting some of these practices.
> Kerry
>
> On Fri, Aug 26, 2022, 7:35 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>>
>>> I'm curious Jarek, does Airflow take any dependencies on popular
>>> libraries like pandas, numpy, pyarrow, scipy, etc... which users are likely
>>> to have their own dependency on? I think these dependencies are challenging
>>> in a different way than the client libraries - ideally we would support a
>>> wide version range so as not to require users to upgrade those libraries in
>>> lockstep with Beam. However in some cases our dependency is pretty tight
>>> (e.g. the DataFrame API's dependency on pandas), so we need to make sure to
>>> explicitly test with multiple different versions. Does Airflow have any
>>> similar issues?
>>>
>>
>> Yes, we do (all of those, I think :) ). The complete set of all our deps
>> can be found here:
>> https://github.com/apache/airflow/blob/constraints-main/constraints-3.9.txt
>> (continuously updated, and we have different sets for different Python
>> versions).
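>>
>> As a rough illustration, a constraints file is just a flat list of pinned
>> versions (the entries below are made up for the example; see the link
>> above for the real, CI-tested set):
>>
>>     numpy==1.22.4
>>     pandas==1.4.3
>>     pyarrow==8.0.0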
>>
>> We took a rather interesting and unusual approach (more details in my
>> talk) - mainly because Airflow is both an application to install (for
>> users) and a library to use (for DAG authors), and the two have
>> contradictory expectations (installation stability versus flexibility in
>> upgrading/downgrading dependencies). Our approach does a smart job of
>> making sure water and fire play well with each other.
>>
>> Most of those dependencies come from optional extras (the list of all
>> extras is here:
>> https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html).
>> More often than not, the "problematic" dependencies you mention are
>> transitive dependencies through some client libraries we use (for example,
>> the Apache Beam SDK is a big contributor to those :).
>>
>> Airflow "core" itself has far fewer dependencies
>> https://github.com/apache/airflow/blob/constraints-main/constraints-no-providers-3.9.txt
>> (175 currently), and we actively made sure that all the "pandas" of this
>> world are only optional extra deps.
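>>
>> For instance (illustrative only - the extra names come from the reference
>> page linked above), core versus core-plus-extras looks like:
>>
>>     # Core only - none of the heavy optional libraries:
>>     pip install apache-airflow
>>     # Core plus extras, which is what pulls in pandas and the
>>     # Google client libraries:
>>     pip install "apache-airflow[pandas,google]"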
>>
>> Now - the interesting thing is that we use "constraints" (the links with
>> dependencies that I posted above are those constraints) to pin versions of
>> the dependencies that are "golden" - i.e. we test those continuously in
>> our CI, and we automatically upgrade the constraints when all the unit and
>> integration tests pass.
>> There is a little bit of complexity and sometimes conflicts to handle (as
>> `pip` has to find the right set of deps that will work for all our
>> optional extras), but eventually we have really one "golden" set of
>> constraints at any moment in time on main (or a v2-x branch - we have a
>> separate set for each branch) that we are dealing with. And this is the
>> only "set" of dependency versions that Airflow gets tested with. Note -
>> these are *constraints*, not *requirements* - that makes a whole world of
>> difference.
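>>
>> To make the difference concrete (a minimal sketch; the file names are made
>> up for the example):
>>
>>     # A requirements file tells pip WHAT to install - every listed
>>     # package gets installed:
>>     pip install -r requirements.txt
>>     # A constraints file only pins the version IF a package is being
>>     # installed for some other reason; it installs nothing by itself:
>>     pip install apache-airflow --constraint constraints.txt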
>>
>> Then when we release Airflow, we "freeze" the constraints with the
>> version tag. We know they work because all our tests pass with them in CI.
>>
>> Then we communicate to our users (and we use it in our Docker image) that
>> the only "supported" way of installing Airflow is using `pip` with
>> constraints:
>> https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html.
>> And we do not support poetry or pipenv - we leave it up to users to handle
>> those (until poetry/pipenv support constraints - which we are waiting for,
>> and there is an issue where I explained why it is useful). It looks like
>> this: `pip install "apache-airflow==2.3.4" --constraint "
>> https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"`
>> (there are different constraints for each Airflow version and for the
>> Python version you have).
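>>
>> As a small sketch, picking the constraints file that matches your
>> interpreter can be scripted (this assumes the URL pattern shown above; the
>> variable name is just for the example):
>>
>>     PY_VER="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
>>     pip install "apache-airflow==2.3.4" --constraint \
>>       "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-${PY_VER}.txt"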
>>
>> Constraints have this nice feature that they are only used during the
>> "pip install" phase and are thrown out immediately after the install is
>> complete. They do not create "hard" requirements for Airflow. Airflow
>> still has "lower-bound" limits for a number of dependencies, but we try to
>> avoid putting upper bounds at all (only in specific cases, and we document
>> them), and our bounds are rather relaxed. This way we achieve three
>> things:
>>
>> 1) when someone does not use constraints and has a problem with a broken
>> dependency - we tell them to use constraints. This is what we as a
>> community commit to and support.
>> 2) by using the constraints mechanism we do not limit our users if they
>> want to upgrade or downgrade any dependencies. They are free to do it (as
>> long as it fits the - rather relaxed - lower/upper bounds of Airflow); see
>> the sketch after this list. But "with great powers come great
>> responsibilities" - if they want to do that, THEY have to make sure that
>> Airflow will work. We make no guarantees there.
>> 3) we are not limited by the 3rd-party libraries that come as extras - if
>> you do not use those, the limits do not apply.
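>>
>> A minimal sketch of point 2 (the package chosen arbitrarily for the
>> example):
>>
>>     # Install the supported, CI-tested combination (constraints URL as
>>     # shown above):
>>     pip install "apache-airflow==2.3.4" --constraint \
>>       "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"
>>     # Later, upgrade a single dependency freely - the constraints file is
>>     # no longer consulted; only Airflow's own (relaxed) version bounds
>>     # still apply:
>>     pip install --upgrade numpy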
>>
>> I think this works really well - but it is rather complex to set up and
>> maintain - I built a whole set of scripts, and I have the whole `breeze`
>> ("It's a breeze to develop Airflow" is the theme) development/CI
>> environment based on docker and docker-compose that allows us to automate
>> all of that.
>>
>> J.
>>
>>
>>
>