Posted to dev@beam.apache.org by Brian Hulette <bh...@google.com> on 2019/08/02 21:20:10 UTC

[DISCUSS] Dependency management for Python SDK Container

I recently ran into a portable Python precommit failure that led me to
discover that the Python dependencies for the container are defined in two
different places, in slightly different ways: in setup.py with version
ranges [1], and in a base_image_requirements.txt file in the container
directory with pinned versions [2]. The latter includes some dependencies
that are commonly used in user code, in addition to the Beam dependencies,
and states that "Specifying the versions manually helps to resolve
dependency conflicts with other packages installed in the container."

But isn't the purpose of a package manager to resolve those conflicts? It
seems that we should be able to just list the Beam tar.gz as a dependency
and let pip resolve it. I wrote up a patch [3] to test this approach, and it
seems to work. The only issue I ran into is that Cython _does_ need to be
installed separately before Beam so that it can be used when installing
Beam. Would this be a reasonable approach to remove the duplicate
dependency specifications?
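
Concretely, the container install order would look roughly like this (an
illustrative sketch, not the exact contents of the patch):

    # Cython must be present before Beam so Beam's extensions can compile:
    pip install cython
    # Let pip resolve Beam's own dependencies from setup.py:
    pip install /opt/apache/beam/tars/apache-beam.tar.gz[gcp]
    # The requirements file then only lists the extra user-facing packages:
    pip install -r base_image_requirements.txt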

Apologies if I'm rehashing an old discussion; if there's already an ML
thread or PR discussing this issue, I'd be happy to take a look at it.

Thanks,
Brian

[1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
[2]
https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
[3] https://github.com/apache/beam/pull/9236



PS - What caused my precommit failure:
I tested out the portable runner locally to debug the issue. It worked fine
with the loopback worker, but failed when using the docker worker. I
eventually determined the actual cause: I unintentionally relied on a
protobuf 3.8.0 feature in the code I added. The setup.py requirement, used
locally by the loopback worker, specified a version range which resolved to
3.9.0, but the dependency for the container was pinned to 3.6.1. I've
proposed to resolve the issue by just bumping the lower bound of the range
in setup.py to 3.8.0, to reflect the new requirement.
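
In requirement-specifier terms, the mismatch looked roughly like this (the
range below is an illustrative assumption; only the 3.8.0 lower bound and
the 3.6.1 pin come from this thread):

    # setup.py (loopback worker), after the proposed bump:
    #     protobuf>=3.8.0,<4        # range; resolved to 3.9.0 locally
    # base_image_requirements.txt (docker worker):
    #     protobuf==3.6.1           # stale pin that caused the failure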

Re: [DISCUSS] Dependency management for Python SDK Container

Posted by Valentyn Tymofieiev <va...@google.com>.
Added some ideas discussed here to the SDK container release process
working doc
<https://docs.google.com/document/d/1IKE_aEkrAzkzUE4pD_r_zVuL5amHGetJ1efnbTfmunM/edit#>
that Hannah started.

Re: [DISCUSS] Dependency management for Python SDK Container

Posted by Sam Bourne <sa...@gmail.com>.
On Wed, Aug 7, 2019 at 1:53 AM Ahmet Altay <al...@google.com> wrote:

> Good idea. This also represents an interesting use case where pipelines
> run on a private network with no PyPI access.
>

+1. For what it's worth, we intend to run Beam pipelines (via Flink) on a
private network.


Re: [DISCUSS] Dependency management for Python SDK Container

Posted by Ahmet Altay <al...@google.com>.
On Mon, Aug 5, 2019 at 9:49 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> In addition to git hooks, we could try to check that once we install
> Apache Beam (after installing requirements file dependencies), we don't
> pull anything from PyPI. We could rely on pip output or cut internet
> access to enforce this.
>

Good idea. This also represents an interesting use case where pipelines run
on a private network with no PyPI access.


Re: [DISCUSS] Dependency management for Python SDK Container

Posted by Valentyn Tymofieiev <va...@google.com>.
On Tue, Aug 6, 2019 at 2:29 AM Ahmet Altay <al...@google.com> wrote:

> This process does not address how to keep setup.py in sync with the
> requirements file. We could use git hooks to ensure that files are changed
> at the same time.
>

In addition to git hooks, we could try to check that once we install Apache
Beam (after installing the requirements file dependencies), we don't pull
anything from PyPI. We could rely on pip output or cut internet access to
enforce this.
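
A minimal sketch of that check (not existing Beam tooling; the tarball path
is the one used in the container build):

    # Pre-install everything the requirements file pins:
    pip install -r base_image_requirements.txt
    # With --no-index, pip cannot reach PyPI, so this step fails if
    # setup.py needs anything the requirements file didn't already provide:
    pip install --no-index /opt/apache/beam/tars/apache-beam.tar.gz[gcp]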


Re: [DISCUSS] Dependency management for Python SDK Container

Posted by Ahmet Altay <al...@google.com>.
On Mon, Aug 5, 2019 at 1:43 AM Valentyn Tymofieiev <va...@google.com>
wrote:

> - The purpose of install_requires in setup.py is to define the maximally
> permissive set of requirements for a package [1]. We don't pin a version in
> setup.py without a strong reason; instead we typically pick a lower
> bound we have tested, and set an upper bound at the next major version.
> - The purpose of requirements.txt is to force pip to properly resolve
> dependencies, and create a reproducible execution environment, since pip
> doesn't have true dependency resolution [2].
>
> We currently regularly upgrade setup.py dependencies, but do not update
> base_image_requirements.txt. Also, our requirements file is not exhaustive.
> We need to introduce a process to fix this.
>

base_image_requirements.txt currently does not drive any released Beam
artifact. It would make sense to address this as part of the SDKHarness
container image release process. A similar mechanism might make sense for
figuring out what goes into containers for other SDKs. (Perhaps add your
proposal to
https://cwiki.apache.org/confluence/display/BEAM/%5BWIP%5D+SDKHarness+Container+Image+Release+Process
)


> I would recommend that in the new process:
> - All dependencies of the Beam Python SDK, including transitive
> dependencies, are listed in base_image_requirements.txt (or another
> requirements file). "Explicit is better than implicit."
> - The requirements file is regenerated whenever setup.py changes.
>
> - When we build a container image, we check that the final image has
> exactly the same versions of dependencies that were spelled out in the
> requirements file (no versions added or changed).
>

The above could be done in a two-step process:
- a hand-curated requirements file, like base_image_requirements.txt today,
which holds the set of top-level requirements;
- changes to the first file would produce a generated file with all
transitive dependencies. The second file could be used as the source of
truth for all dependencies at a particular commit, and could be used for
the container builds. A sketch of this generation step follows below.
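
One way to implement that generation step (assuming pip-tools, which this
thread doesn't prescribe; the .in file name is hypothetical):

    # base_image_requirements.in holds the curated top-level requirements;
    # pip-compile expands them into a fully pinned transitive lockfile:
    pip install pip-tools
    pip-compile base_image_requirements.in \
        --output-file base_image_requirements.txt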


> - We also check that there are no dependency conflicts (they typically
> look like: Package X requires version A of dependency Y, but you will have
> B, which is incompatible).
>

+1. Container build could verify this.


> - We update the versions of pinned dependencies periodically. We may
> want to put all dependencies of the SDK harness containers on the radar
> of the Beam dependency checker.
>

+1 but a separate process.

This process does not address how to keep setup.py in sync with the
requirements file. We could use git hooks to ensure that files are changed
at the same time.
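
For instance, a pre-commit hook along these lines (purely a sketch; no such
hook exists in the repo today):

    #!/bin/sh
    # Reject commits that touch setup.py without also touching the
    # container requirements file.
    changed=$(git diff --cached --name-only)
    if echo "$changed" | grep -q 'sdks/python/setup.py' &&
       ! echo "$changed" | grep -q 'sdks/python/container/base_image_requirements.txt'
    then
      echo "setup.py changed; please regenerate base_image_requirements.txt" >&2
      exit 1
    fi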


> Valentyn
>
> [1]
> https://packaging.python.org/discussions/install-requires-vs-requirements/
> [2] https://pip.pypa.io/en/stable/user_guide/#requirements-files
>

Re: [DISCUSS] Dependency management for Python SDK Container

Posted by Valentyn Tymofieiev <va...@google.com>.
- The purpose of install_requires in setup.py is to define the maximally
permissive set of requirements for a package [1]. We don't pin a version in
setup.py without a strong reason; instead we typically pick a lower bound we
have tested, and set an upper bound at the next major version.
- The purpose of requirements.txt is to force pip to properly resolve
dependencies, and create a reproducible execution environment, since pip
doesn't have true dependency resolution [2].
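
For example (specifiers invented for illustration, not Beam's actual
entries):

    # setup.py, install_requires (a tested range):
    #     'protobuf>=3.8.0,<4'
    # base_image_requirements.txt (an exact, reproducible pin):
    #     protobuf==3.9.0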

We currently regularly upgrade setup.py dependencies, but do not update
base_image_requirements.txt. Also, our requirements file is not exhaustive.
We need to introduce a process to fix this. I would recommend that in the
new process:
- All dependencies of the Beam Python SDK, including transitive
dependencies, are listed in base_image_requirements.txt (or another
requirements file). "Explicit is better than implicit."
- The requirements file is regenerated whenever setup.py changes.
- When we build a container image, we check that the final image has
exactly the same versions of dependencies that were spelled out in the
requirements file (no versions added or changed).
- We also check that there are no dependency conflicts (they typically look
like: package X requires version A of dependency Y, but you will have B,
which is incompatible). A sketch of both checks follows after this list.
- We update the versions of pinned dependencies periodically. We may want
to put all dependencies of the SDK harness containers on the radar of the
Beam dependency checker.
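
Both checks could be as simple as the following (a sketch, not Beam's
actual build tooling):

    # Report conflicts of the form "X requires Y==A, but you have B":
    pip check
    # Verify nothing drifted from the pinned requirements file:
    pip freeze | sort > /tmp/installed.txt
    grep -v '^#' base_image_requirements.txt | sort > /tmp/expected.txt
    diff /tmp/expected.txt /tmp/installed.txt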

Valentyn

[1]
https://packaging.python.org/discussions/install-requires-vs-requirements/
[2] https://pip.pypa.io/en/stable/user_guide/#requirements-files

Re: [DISCUSS] Dependency management for Python SDK Container

Posted by Ahmet Altay <al...@google.com>.
On Fri, Aug 2, 2019 at 4:34 PM Brian Hulette <bh...@google.com> wrote:

> Thanks for the reply; I added some responses inline.
>
> On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <al...@google.com> wrote:
> >
> > There is a value in explicitly pinning the dependencies to be used in
> > the containers:
> > - It reproducibly produces the same container. This will be important
> > once we start releasing Beam containers. By looking at a Beam release
> > branch, one could figure out exactly the set of dependencies available
> > in a released container.
> >
> > - The package manager (pip) is notorious for how it resolves versions
> > for sub-dependencies, and sometimes picks incompatible ones.
> > - Similarly, this reproducibility is helpful for tests, which can run
> > on the same container version unaffected by the numerous dependency
> > version changes happening in sub-dependencies.
>
> I thought this might be the reason; it definitely makes sense to pin
> dependencies for reproducible builds. If that is the case, though, I
> think we should at least change the comment in
> base_image_requirements.txt to say so (I'm happy to write a patch for
> that).
>

Sounds good.


>
> > In addition, I will argue that we will want to keep some form of
> > base_image_requirements.txt in order to be able to install additional
> > dependencies on the container, so we would not be able to get rid of
> > this mechanism entirely.
>
> My PR doesn't remove base_image_requirements.txt. It keeps the file
> there but removes all the pinned beam dependencies, and instead just
> lists the beam tar.gz as a dependency directly, along with all of the
> additional dependencies.
>

Got it, thanks for the clarification.


>
> >
> >
> > I suggest, if possible, ensuring that both files are modified
> > synchronously. This might be possible with precommit hooks, although I
> > am not familiar with them.
>
> Yeah, setting up something like that would help, but surely there's a
> better way to lock dependencies than to count on people manually
> updating this file (even with a precommit reminding them)? I'm not
> that familiar with the complexities of python package management, but
> it seems like we should be able to set up something with pip freeze or
> pipenv's Pipfile.lock. The thing that confounds me when I try to think
> through that setup, though, is how to reference a development build of
> beam. If I run pip freeze in the container, the requirement for
> apache-beam is "2.16.0.dev0", which of course doesn't resolve when you
> turn around and try to install it. Maybe an automated solution would
> just replace the apache-beam line with
> /opt/apache/beam/tars/apache-beam.tar.gz[gcp]?
>

I agree, this probably could be automated in a better way.

I believe just doing "pip
install /opt/apache/beam/tars/apache-beam.tar.gz[gcp]" at the beginning
will prevent a later error from "pip install apache-beam==2.16.0.dev0", so
we could have the former line in the Dockerfile before installing the
requirements.
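
A rough sketch of what that ordering might look like in the Dockerfile
(the paths and the [gcp] extra here are illustrative assumptions, not the
actual file):

    # Sketch only: install the local Beam tarball first, so the pinned
    # requirements below see apache-beam==2.16.0.dev0 as already satisfied.
    COPY target/apache-beam.tar.gz /opt/apache/beam/tars/apache-beam.tar.gz
    COPY base_image_requirements.txt /tmp/base_image_requirements.txt
    RUN pip install "/opt/apache/beam/tars/apache-beam.tar.gz[gcp]" \
        && pip install -r /tmp/base_image_requirements.txt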



Re: [DISCUSS] Dependency management for Python SDK Container

Posted by Brian Hulette <bh...@google.com>.
Thanks for the reply; I added some responses inline.

On Fri, Aug 2, 2019 at 2:42 PM Ahmet Altay <al...@google.com> wrote:
>
> There is a value in explicitly pinning the dependencies to be used in the containers:
> - It reproducibly produces the same container. This will be important once we start releasing Beam containers. By looking at a Beam release branch, one could figure out exactly the set of dependencies available in a released container.
>
> - The package manager (pip) is notorious for how it resolves versions for sub-dependencies, and sometimes picks incompatible ones.
> - Similarly, this reproducibility is helpful for tests, which can run on the same container version unaffected by the numerous dependency version changes happening in sub-dependencies.

I thought this might be the reason; it definitely makes sense to pin
dependencies for reproducible builds. If that is the case, though, I
think we should at least change the comment in
base_image_requirements.txt to say so (I'm happy to write a patch for
that).

> In addition, I will argue that we will want to keep some form of base_image_requirements.txt in order to be able to install additional dependencies on the container, so we would not be able to get rid of this mechanism entirely.

My PR doesn't remove base_image_requirements.txt. It keeps the file
there but removes all the pinned beam dependencies, and instead just
lists the beam tar.gz as a dependency directly, along with all of the
additional dependencies.
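
To make that concrete, a slimmed-down base_image_requirements.txt along
those lines might look roughly like this (the extra packages shown are
placeholders for the commonly-used user dependencies, not the actual
list):

    # Beam itself, with pip resolving its transitive dependencies:
    /opt/apache/beam/tars/apache-beam.tar.gz[gcp]
    # Additional packages commonly used in user code (illustrative):
    numpy
    pandas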

>
>
> I suggest, if possible, ensuring that both files are modified synchronously. This might be possible with precommit hooks, although I am not familiar with them.

Yeah, setting up something like that would help, but surely there's a
better way to lock dependencies than to count on people manually
updating this file (even with a precommit reminding them)? I'm not
that familiar with the complexities of python package management, but
it seems like we should be able to set up something with pip freeze or
pipenv's Pipfile.lock. The thing that confounds me when I try to think
through that setup, though, is how to reference a development build of
beam. If I run pip freeze in the container, the requirement for
apache-beam is "2.16.0.dev0", which of course doesn't resolve when you
turn around and try to install it. Maybe an automated solution would
just replace the apache-beam line with
/opt/apache/beam/tars/apache-beam.tar.gz[gcp]?
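
One possible shape for that automation, sketched here with a made-up
image name (and assuming pip can be invoked as the entrypoint):

    # Freeze the resolved versions from a built SDK container, then swap
    # the unresolvable dev pin for the local tarball reference.
    docker run --rm --entrypoint pip <sdk-container-image> freeze \
      | sed 's|^apache-beam==.*|/opt/apache/beam/tars/apache-beam.tar.gz[gcp]|' \
      > base_image_requirements.txt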


Re: [DISCUSS] Dependency management for Python SDK Container

Posted by Ahmet Altay <al...@google.com>.
There is a value in explicitly pinning the dependencies to be used in the
containers:
- It reproducibly produces the same container. This will be important once
we start releasing Beam containers. By looking at a Beam release branch,
one could figure out exactly the set of dependencies available in a
released container.
- The package manager (pip) is notorious for how it resolves versions for
sub-dependencies, and sometimes picks incompatible ones.
- Similarly, this reproducibility is helpful for tests, which can run on
the same container version unaffected by the numerous dependency version
changes happening in sub-dependencies.
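
For example, the protobuf mismatch from Brian's PS maps onto the two
files like this (the setup.py range shown is illustrative; only the
3.6.1 pin and the 3.9.0 resolution come from this thread):

    # setup.py: a range, resolved freshly at install time
    # (it resolved to protobuf 3.9.0 for the loopback worker):
    'protobuf>=3.6.1,<4',

    # base_image_requirements.txt: a pin, frozen when the file was written:
    protobuf==3.6.1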

In addition, I will argue that we will want to keep some form of
base_image_requirements.txt, in order to be able to install additional
dependencies on the container, so we would not be able to get rid of this
mechanism entirely.

I suggest, if possible, ensuring that both files are modified
synchronously. This might be possible with precommit hooks, although I am
not familiar with them.

Ahmet

