Posted to dev@beam.apache.org by Yoshiki Obata <yo...@gmail.com> on 2021/10/11 12:08:20 UTC

Switching Python sdist format

Hello everyone,

I'm working on BEAM-8954[1], which introduces tox isolated_build for the
Python tests.
Related to this issue, I would like opinions on using .tar.gz as the sdist format.

Introducing tox isolated_build means replacing build-requirements.txt
with pyproject.toml[2], and we should use pyproject.toml when creating
the sdist, because we currently install dependencies from
build-requirements.txt before calling "python setup.py sdist".
PEP 517-based build tools like pypa/build will help with this, but they
do not allow .zip as the sdist format[3].
Therefore I think it would be better to switch the sdist format to
.tar.gz when we start using pyproject.toml.

Are there any obstacles to using .tar.gz?
Also, please share any background on why .zip was adopted as Beam's
sdist format (I could not find any discussion about this).

Regards,
yoshiki

[1] https://issues.apache.org/jira/browse/BEAM-8954
[2] https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
[3] https://www.python.org/dev/peps/pep-0517/#source-distributions
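To make the proposal concrete, a hedged sketch of the [build-system] table that would replace build-requirements.txt is shown below. The actual entries would have to mirror the current build-requirements.txt; the pins listed here are purely illustrative assumptions:

```toml
# Illustrative sketch only -- the real requirements would come from
# Beam's build-requirements.txt, not this list.
[build-system]
requires = ["setuptools", "wheel", "cython", "grpcio-tools"]
build-backend = "setuptools.build_meta"
```

With such a table in place, running `python -m build --sdist` installs the build dependencies into an isolated environment and always emits a .tar.gz sdist, which is why the format switch comes up.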

Re: Switching Python sdist format

Posted by Luke Cwik <lc...@google.com>.
500 MB would be noticeable for sure.

On Fri, Oct 15, 2021 at 11:04 AM Valentyn Tymofieiev <va...@google.com>
wrote:

> One other possibility is to make --populate_requirements_cache an opt-in
> instead of a default feature. A few things have changed since it was
> implemented:
>
> 1) We have custom containers
> 2) We have container prebuilding capabilities
>
> Although these may not yet be fully available on all runners / all execution
> modes, and they require access to container building tools.
>
> On Fri, Oct 15, 2021 at 10:44 AM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> Thanks. Chances are that most custom container users will either start
>> from Apache Beam image as a base image or use a compatible Linux platform,
>> so if we were to go with the guess-the-platform route, I would limit
>> bdists to 1 and provide an option to override the platform or
>> --disable_requirements_cache.
>>
>> Overheads would only affect users who pass a custom container AND a
>> --requirements_file AND have a non-standard platform.
>>
>> Overheads from downloading sdists alongside bdists might not be
>> significant. Overheads from starting to download bdists may be noticeable.
>> As an extreme case, Tensorflow currently does not have an sdist, so we don't
>> stage it, but it has a ~500 MB bdist, which we'd have to download and stage
>> if users pass it in a --requirements_file.
>>
>> Also, currently we stage the entire accumulated contents of the
>> requirements cache from previous runs, not just dependencies of the current
>> pipeline, which we may want to revisit now that we are looking into
>> requirements cache and potentially staging more files [1] [2].
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-680
>> [2] https://issues.apache.org/jira/browse/BEAM-10147
>>
>> On Fri, Oct 15, 2021 at 10:10 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> On Fri, Oct 15, 2021 at 9:58 AM Luke Cwik <lc...@google.com> wrote:
>>> >
>>> > When the platform is unknown, would it be too expensive to package the
>>> 2 or 3 most common binary distributions and the sdist as a backup? This way
>>> the container could install a bdist if there is a compatible one, falling
>>> back to the sdist.
>>>
>>> Given that the platform is either a container of our making or a custom
>>> container (in which case it should be rare that the dependencies are not
>>> installed directly), I don't know if we want to have 3-4x the
>>> uploads.
>>>
>>> > Alternatively could the user specify the target platform with a python
>>> specific option that would guide the packaging system to provide the
>>> appropriate bdist?
>>>
>>> +1
>>>
>>> > On Thu, Oct 14, 2021 at 10:44 PM Valentyn Tymofieiev <
>>> valentyn@google.com> wrote:
>>> >>
>>> >> I think I finally understand the problematic behavior here, and
>>> described it in [1].
>>> >>
>>> >> Also had an offline conversation about this issue with Robert and
>>> other folks.
>>> >>
>>> >> A possible solution for populating the requirements cache on the
>>> workers is to download bdists instead of sdists whenever users launch a
>>> pipeline with a requirements file that will run in a default container
>>> image known to the SDK, such as apache/beam_python3.7_sdk:2.xx.0 and
>>> gcr.io/cloud-dataflow/v1beta3/python...  For these container images we
>>> know and control the platform, so when we `pip download` the dependencies,
>>> we can prefer to download bdists for that specific platform. This would fix
>>> BEAM-4032[2] and improve worker startup time, since dependencies
>>> won't have to be built from sources on the workers. Also note that we
>>> currently don't download any build dependencies[3] for packages using
>>> PEP-517, so build dependencies would have to be downloaded on the fly on
>>> the workers, reducing the benefit of populating the requirements cache.
>>> Staging bdists instead of sdists should help with that too, since we won't
>>> need to build packages and thus won't need build deps.
>>> >>
>>> >> For users who pass custom containers (via --sdk_container_image
>>> flags), we wouldn't have the knowledge of the target platform and couldn't
>>> know for sure which bdists to download and stage, but custom container
>>> users can package all necessary Python dependencies in their container
>>> image, and won't need to pass a --requirements_file with their pipeline, so
>>> requirements cache will not be necessary.
>>> >>
>>> >> [1] https://github.com/pypa/pip/issues/10589
>>> >> [2] https://issues.apache.org/jira/browse/BEAM-4032
>>> >> [3] https://github.com/pypa/pip/issues/10589#issuecomment-944006331
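The selection logic proposed above can be sketched as follows. This is an illustrative sketch, not Beam's actual implementation: the image prefixes, the manylinux platform tag, and the function name are assumptions for the example.

```python
# Hypothetical helper: prefer platform-specific bdists when the pipeline
# targets a default Beam container whose platform the SDK knows; fall back
# to sdists for custom containers whose platform is unknown.
KNOWN_DEFAULT_IMAGE_PREFIXES = (
    "apache/beam_python",
    "gcr.io/cloud-dataflow/v1beta3/python",
)

def pip_download_args(container_image, requirements_file):
    """Build `pip download` arguments for populating the requirements cache."""
    args = ["pip", "download", "-r", requirements_file, "--dest", "/tmp/cache"]
    if container_image.startswith(KNOWN_DEFAULT_IMAGE_PREFIXES):
        # Default image: the platform is known, so fetch wheels built for it.
        # pip requires --only-binary :all: when --platform is given.
        args += ["--only-binary", ":all:", "--platform", "manylinux2014_x86_64"]
    else:
        # Custom image: target platform unknown, stay with source dists.
        args += ["--no-binary", ":all:"]
    return args
```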
>>> >>
>>> >>
>>> >> On Thu, Oct 14, 2021 at 7:36 AM Yoshiki Obata <
>>> yoshiki.obata@gmail.com> wrote:
>>> >>>
>>> >>> Thank you for your reply, Robert and Valentyn.
>>> >>>
>>> >>> The concern about slow pipeline submission is very helpful; I hadn't
>>> >>> taken it into account.
>>> >>> It seems that using .tar.gz as the sdist format would be fine, but
>>> >>> introducing PEP 517 to Beam could have a bad impact on users, so
>>> >>> more consideration is needed.
>>> >>>
>>> >>> I will investigate more and post an update if ideal solutions are
>>> discovered.
>>> >>>
>>> >>> Thanks,
>>> >>> yoshiki
>>> >>>
>>> >>> On Wed, Oct 13, 2021 at 10:59 AM Valentyn Tymofieiev
>>> >>> <va...@google.com> wrote:
>>> >>> >
>>> >>> > Hi Yoshiki,
>>> >>> >
>>> >>> > It should be fine to use .tar.gz. There may be some Beam
>>> release-related scripts that expect a .zip and would need updating. Also I
>>> noticed that numpy uses .zip for its sdist[1], and they also use
>>> pyproject.toml[2], so I don't know if the .tar.gz sdist requirement
>>> in PEP-517 is a hard requirement, but we can try to confirm. Perhaps it
>>> matters only when certain PEP-517 hooks are defined.
>>> >>> >
>>> >>> > I would like to discuss further your suggestion to use PEP-517 in
>>> Beam and potential implications. For the benefit of others on this thread,
>>> for a simple explanation of PEP-517, see [3].
>>> >>> >
>>> >>> > I noticed that when several other projects switched to using
>>> PEP-517 (numpy, pyarrow), it impacted the pipeline submission experience
>>> for Beam users. To make sure that the pipeline execution environment has
>>> the necessary pipeline dependencies, Beam downloads distributions for the
>>> packages (and their transitive dependencies) that users supply in
>>> requirements.txt [4], and stages them to the execution environment [5].
>>> Specifically, Beam downloads and stages source distributions (aka sdists)
>>> of the packages. We download sdists because the target execution platform
>>> on the runner is not known to the SDK, so staging binary distributions
>>> (aka bdists, wheels) was never supported (BEAM-4032 [6]). It turns out
>>> that downloading a source distribution of a library from PyPI via
>>> `python -m pip download --dest /tmp pyarrow==5.0.0 --no-deps --no-binary
>>> :all:` makes pip also do a metadata check to verify that the contents of
>>> the downloaded sdist are indeed the package that was requested [7]. For
>>> libraries packaged with setuptools prior to PEP-517, this involved a call
>>> to `python setup.py egg_info` [8], which is fairly fast. However, for
>>> libraries packaged using pyproject.toml/PEP-517, getting the package
>>> metadata involves installing build dependencies, and potentially building
>>> a wheel and extracting the metadata from it. If a library has C
>>> extensions, this step involves compilation, can be quite slow, and may
>>> require dev header packages to be installed on the OS. I am not sure
>>> whether the slowdown happens for every project (with extensions) that
>>> adopts PEP-517, or only if the project is somehow misconfigured; asked on
>>> [9]. So far I have observed such a slowdown with numpy and pyarrow [10].
>>> >>> >
>>> >>> > Users have brought this up to the pip maintainers. For various
>>> reasons, this has proved not easy to address, and long term solutions are
>>> still in the works [11, 12, 13].
>>> >>> >
>>> >>> > What does this mean for Beam:
>>> >>> >
>>> >>> > 1) As more Python packages adopt PEP-517, users may be
>>> affected by slow pipeline submission times if they require
>>> packages in requirements.txt that take a long time to download and build.
>>> >>> > 2) There is a possibility that adopting PEP-517 in Beam will
>>> increase pipeline submission time due to the slowness of the pip download
>>> command, because we download Beam SDK and stage it to the runner during job
>>> submission[14]. Given we don't know the target platform for sure, we stage
>>> both an sdist and a bdist [14]. The platform selected for bdist matches the
>>> platform of Beam's default containers, but with custom containers we can't
>>> guarantee the target platform will always match a predefined default, so we
>>> also pass an sdist.
>>> >>> > 3) Users sometimes include apache-beam in their pipelines'
>>> requirements.txt. Although not necessary, this contributes to the
>>> slowdowns, because numpy and pyarrow are Beam's dependencies and end up
>>> being downloaded.
>>> >>> >
>>> >>> > I am not against using PEP-517, it seems to have been written for
>>> good reason, but we should prevent slowdown in pipeline submissions.
>>> >>> >
>>> >>> > I am curious what the community thinks about ways to adopt it.
>>> Possible avenues:
>>> >>> >
>>> >>> > - Provide some information to the SDK about the runner's platform
>>> at pipeline submission, and stage only binary packages to the runner
>>> whenever possible [6]. Pip is only slow to download sdists; downloading
>>> bdists is fast. Installing bdists on the runner would also be much faster.
>>> >>> > - Rethink dependency staging. Avoid staging the SDK and/or
>>> dependency package sdists by using the container prebuilding workflow[15]
>>> by default for all runners that use containers. During prebuilding,
>>> install packages directly on the containers with `pip install`, avoiding
>>> the `pip download` step. Filter out apache_beam from the requirements.txt
>>> file when users add it there.
>>> >>> > - Wait until solutions for [11] become available, or get involved
>>> to help move that forward.
>>> >>> > - Switch to downloading sdists over HTTP instead of `pip download`
>>> [16], falling back to pip download if not successful.
>>> >>> > - There may be some way to configure PEP-517 that avoids
>>> the slowdown; asked on [9].
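The "download sdists over HTTP" avenue can be sketched against the PyPI JSON API, which exposes each release's files with their package type, letting us fetch the sdist URL without invoking pip's metadata check. The helper names below are hypothetical, not part of Beam:

```python
import json
from urllib.request import urlopen

def sdist_url_from_release(release_files):
    """Pick the sdist URL from a PyPI JSON API 'urls' list, or None if absent."""
    for f in release_files:
        if f.get("packagetype") == "sdist":
            return f["url"]
    return None  # some packages (e.g. tensorflow) publish no sdist at all

def fetch_sdist_url(package, version):
    # Queries e.g. https://pypi.org/pypi/pyarrow/5.0.0/json and picks the sdist.
    with urlopen(f"https://pypi.org/pypi/{package}/{version}/json") as resp:
        return sdist_url_from_release(json.load(resp)["urls"])
```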
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Valentyn
>>> >>> >
>>> >>> > [1]
>>> https://pypi.org/project/numpy/#copy-hash-modal-89500de9-70e8-4e7e-87f5-12adf0808905
>>> >>> > [2] https://github.com/numpy/numpy/blob/main/pyproject.toml
>>> >>> > [3] https://snarky.ca/clarifying-pep-518
>>> >>> > [4]
>>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L640
>>> >>> > [5]
>>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L238
>>> >>> > [6] https://issues.apache.org/jira/browse/BEAM-4032
>>> >>> > [7] https://github.com/pypa/pip/issues/1884#issuecomment-670364453
>>> >>> > [8] https://github.com/pypa/pip/issues/8387#issuecomment-638118900
>>> >>> > [9]
>>> https://github.com/pypa/pip/issues/10195#issuecomment-941811201
>>> >>> > [10]
>>> https://github.com/numpy/numpy/pull/14053#issuecomment-637709988
>>> >>> > [11] https://github.com/pypa/pip/issues/1884
>>> >>> > [12]
>>> https://discuss.python.org/t/pep-625-file-name-of-a-source-distribution/4686
>>> >>> > [13] https://github.com/pypa/pip/issues/10195
>>> >>> > [14]
>>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L715
>>> >>> > [15]
>>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/options/pipeline_options.py#L1096
>>> >>> > [16]
>>> https://github.com/pypa/pip/issues/1884#issuecomment-800483766
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Oct 11, 2021 at 8:51 AM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >>> >>
>>> >>> >> That's fine by me. The only advantage I can think of for .zip is
>>> that
>>> >>> >> it's (generally) better supported on Windows, but as far as I know
>>> >>> >> .tar.gz works on Windows just fine for python package
>>> distribution.
>>> >>> >>
>>> >>> >> On Mon, Oct 11, 2021 at 5:09 AM Yoshiki Obata <
>>> yoshiki.obata@gmail.com> wrote:
>>> >>> >> >
>>> >>> >> > Hello everyone,
>>> >>> >> >
>>> >>> >> > I'm working on BEAM-8954[1] which introduces tox isolated_build
>>> for
>>> >>> >> > python tests.
>>> >>> >> > Concerning this issue, I want opinions about using .tar.gz as
>>> sdist format.
>>> >>> >> >
>>> >>> >> > Introducing tox isolated_build leads replacement of
>>> >>> >> > build-requirements.txt to pyproject.toml[2] and we should use
>>> >>> >> > pyproject.toml when creating sdist because we install
>>> dependencies
>>> >>> >> > with build-requirements.txt before calling "python setup.py
>>> sdist"
>>> >>> >> > PEP 517 based build tools like pypa/build will help to do so,
>>> but it
>>> >>> >> > does not allow .zip as sdist format[3].
>>> >>> >> > Therefore I think it would be better to switch sdist format to
>>> .tar.gz
>>> >>> >> > when starting to use pyproject.toml.
>>> >>> >> >
>>> >>> >> > Are there any obstacles to use .tar.gz?
>>> >>> >> > Please let me know details about adopting .zip as Beam sdist
>>> format(I
>>> >>> >> > could not find discussions about this)
>>> >>> >> >
>>> >>> >> > Regards,
>>> >>> >> > yoshiki
>>> >>> >> >
>>> >>> >> > [1] https://issues.apache.org/jira/browse/BEAM-8954
>>> >>> >> > [2]
>>> https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
>>> >>> >> > [3]
>>> https://www.python.org/dev/peps/pep-0517/#source-distributions
>>>
>>

Re: Switching Python sdist format

Posted by Valentyn Tymofieiev <va...@google.com>.
One other possibility is to make --populate_requirements_cache an opt-in
instead of a default feature. A few things have changed since it was
implemented:

1) We have custom containers
2) We have container prebuilding capabilities

Although may be not fully available on all runners / all execution modes
yet, and requires access to container building tools.

On Fri, Oct 15, 2021 at 10:44 AM Valentyn Tymofieiev <va...@google.com>
wrote:

> Thanks. Chances are that most custom container users will either start
> from Apache Beam image as a base image or use a compatible Linux platform,
> so if we were to go with the guess-the-platform route, I would limit
> bdists to 1 and provide an option to override the platform or
> --disable_requirements_cache.
>
> Overheads would only affect users who pass a custom container AND a
> --requirements_file AND have a non-standard platform.
>
> Overheads from downloading sdists along side bdists might be not
> significant. Overheads from starting to download bdists may be noticeable.
> An extreme case, Tensorflow currently does not have an sdist, so we don't
> stage it, but it has a ~500 mb bdist, which we'd have to download and stage
> if users pass it in a --requirements_file.
>
> Also, currently we stage the entire accumulated contents of the
> requirements cache from previous runs, not just dependencies of the current
> pipeline, which we may want to revisit now that we are looking into
> requirements cache and potentially staging more files [1] [2].
>
> [1] https://issues.apache.org/jira/browse/BEAM-680
> [2] https://issues.apache.org/jira/browse/BEAM-10147
>
> On Fri, Oct 15, 2021 at 10:10 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Fri, Oct 15, 2021 at 9:58 AM Luke Cwik <lc...@google.com> wrote:
>> >
>> > When the platform is unknown, would it be too expensive to package the
>> 2 or 3 most common binary distributions and the sdist as a backup. This way
>> the container could install a bdist if there is a compatible one falling
>> back to sdist.
>>
>> Given that platform is a container of our making, or a custom
>> container (in which case it should be rare that the dependencies not
>> be installed directly), I don't know if we want to have 3-4x the
>> uploads.
>>
>> > Alternatively could the user specify the target platform with a python
>> specific option that would guide the packaging system to provide the
>> appropriate bdist?
>>
>> +1
>>
>> > On Thu, Oct 14, 2021 at 10:44 PM Valentyn Tymofieiev <
>> valentyn@google.com> wrote:
>> >>
>> >> I think I finally understand the problematic behavior here, and
>> described it in [1].
>> >>
>> >> Also had an offline conversation about this issue with Robert and
>> other folks.
>> >>
>> >> A possible solution to go about populating requirements cache on the
>> workers is to download bdists instead of sdists whenever users launch a
>> pipeline with a requirements file that will run in a default container
>> image, known to the SDK, such as  apache/beam_python3.7_sdk:2.xx.0 and
>> gcr.io/cloud-dataflow/v1beta3/python...  For these container images we
>> know and control the platform, so when we `pip download` the dependencies,
>> we can prefer to download bdists for a specific platform. This would fix
>> BEAM-4032[2], and will improve worker startup time   since dependencies
>> won't have to be built from sources on the workers. Also note that we
>> currently don't download any build dependencies[3] for packages using
>> PEP-517 , so build dependencies would have to be downloaded on the fly on
>> the workers, reducing the benefit of populating the requirements cache.
>> Staging bdists instead of sdists should help with that too, since we won't
>> need to build packages, thus won't need build deps.
>> >>
>> >> For users who pass custom containers (via --sdk_container_image
>> flags), we wouldn't have the knowledge of the target platform and couldn't
>> know for sure which bdists to download and stage, but custom container
>> users can package all necessary Python dependencies in their container
>> image, and won't need to pass a --requirements_file with their pipeline, so
>> requirements cache will not be necessary.
>> >>
>> >> [1] https://github.com/pypa/pip/issues/10589
>> >> [2] https://issues.apache.org/jira/browse/BEAM-4032
>> >> [3] https://github.com/pypa/pip/issues/10589#issuecomment-944006331
>> >>
>> >>
>> >> On Thu, Oct 14, 2021 at 7:36 AM Yoshiki Obata <yo...@gmail.com>
>> wrote:
>> >>>
>> >>> Thank you for your reply, Robert and Valentyn.
>> >>>
>> >>> Concerns about slow pipeline submission is very helpful which I
>> >>> couldn't take into account.
>> >>> It seems that using .tar.gz as sdist format would be fine but
>> >>> introducing PEP 517 to Beam should have bad impact to users therefore
>> >>> more consideration must be needed.
>> >>>
>> >>> I will investigate more and post an update if ideal solutions are
>> discovered.
>> >>>
>> >>> Thanks,
>> >>> yoshiki
>> >>>
>> >>> On Wed, Oct 13, 2021 at 10:59 AM Valentyn Tymofieiev
>> >>> <va...@google.com> wrote:
>> >>> >
>> >>> > Hi Yoshiki,
>> >>> >
>> >>> > it should be fine to use tar.gz. There may be some Beam
>> release-related scripts we need to update that expect a .zip. Also I
>> noticed that numpy uses .zip for its sdist[1], and they also use
>> pyproject.toml[2], so I don't know if the requirement to use .tar.gz sdist
>> in PEP-517 is a hard requirement, but we can ask try to confirm. Perhaps it
>> matters when some particular hooks of PEP-517 are defined.
>> >>> >
>> >>> > I would like to discuss further your suggestion to use PEP-517 in
>> Beam and potential implications. For the benefit of others on this thread,
>> for a simple explanation of PEP-517, see [3].
>> >>> >
>> >>> > I noticed that when several other projects switched to using
>> PEP-517 (numpy, pyarrow, it impacted the pipeline submission experience for
>> Beam users. To make sure that the pipeline execution environment has
>> necessary pipeline dependencies, Beam downloads distributions for packages
>> (and their transitive dependencies) that users supply in requirements.txt
>> [4], and stages them to the execution environment[5]. Specifically, Beam
>> downloads and stages source distributions (aka sdists) of the packages. We
>> download sdists, since the target execution platform on the runner is not
>> known to the SDK, so staging binary distributions (aka bdists, wheels) was
>> never supported (BEAM-4032 [6]). It turns out that downloading a source
>> distribution of a library from PyPI via `python -m pip download --dest /tmp
>> pyarrow==5.0.0 --no-deps --no-binary :all:` makes pip also do a metadata
>> check to verify that the contents of the sdist that was downloaded is
>> indeed the package that was requested [7]. For libraries packaged with
>> setuptools / prior to PEP-517, this involved a call `python setup.py
>> egg_info` [8] , which is fairly fast. However, for libraries packaged using
>> pyproject.toml/PEP-517, getting the package metadata involves installing
>> build dependencies, and potentially building a wheel and extracting the
>> metadata from a wheel. If a library has c extensions, this step involves
>> compilation, can be quite slow and require  dev header packages to be
>> installed on the OS. I am not sure whether the slowdown happens on every
>> project (with extensions) that adopts PEP-517, or only if the project is
>> somehow misconfigured, asked on [9]. So far I observed such a slowdown with
>> numpy and pyarrow[10].
>> >>> >
>> >>> > Users have brought this up to the pip maintainers. For various
>> reasons, this has proved not easy to address, and long term solutions are
>> still in the works [11, 12, 13].
>> >>> >
>> >>> > What does it mean for Beam:
>> >>> >
>> >>> > 1) With more Python packages adopting PEP-517, users may be getting
>> affected by slow pipeline submission time if they require certain packages
>> in requirements.txt that take a long time to download+build.
>> >>> > 2) There is a possibility that adopting PEP-517 in Beam will
>> increase pipeline submission time due to the slowness of the pip download
>> command, because we download Beam SDK and stage it to the runner during job
>> submission[14]. Given we don't know the target platform for sure, we stage
>> both an sdist and a bdist [14]. The platform selected for bdist matches the
>> platform of Beam's default containers, but with custom containers we can't
>> guarantee the target platform will always match a predefined default, so we
>> also pass an sdist.
>> >>> > 3) Users sometimes include apache-beam into requirements.txt of
>> users' pipelines. Although not necessary, this contributes to slowdowns
>> because numpy and pyarrow are Beam's dependencies, and they end up being
>> downloaded.
>> >>> >
>> >>> > I am not against using PEP-517, it seems to have been written for
>> good reason, but we should prevent slowdown in pipeline submissions.
>> >>> >
>> >>> > I am curious what the community thinks about ways to adopt it.
>> Possible avenues:
>> >>> >
>> >>> > - Provide some information to the SDK about the runner's platform
>> at pipeline submission, and stage only binary packages to the runner
>> whenever possible [6]. Pip is only slow to download sdists, downloading
>> bdists is fast. Also installing bdists on the runner would be much faster.
>> >>> > - Rethink dependency staging. Avoid staging the SDK, and/or
>> dependency package sdists via container prebuilding workflow[15] by default
>> for all runners that use containers. During prebuilding, install packages
>> directly on the containers with `pip install`, and avoid `pip download`
>> step. Filter out apache_beam from  requirements.txt file when users add it
>> there.
>> >>> > - Wait until solutions for [11] become available or get involved to
>> help move that forward.
>> >>> > - Switch to download sdists over a HTTP instead of `pip download`
>> [16], fall back to pip download if not successful.
>> >>> > - There may be some way to configure PEP-517 properly, that avoids
>> the slowdown, asked on [9].
>> >>> >
>> >>> > Thanks,
>> >>> > Valentyn
>> >>> >
>> >>> > [1]
>> https://pypi.org/project/numpy/#copy-hash-modal-89500de9-70e8-4e7e-87f5-12adf0808905
>> >>> > [2] https://github.com/numpy/numpy/blob/main/pyproject.toml
>> >>> > [3] https://snarky.ca/clarifying-pep-518
>> >>> > [4]
>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L640
>> >>> > [5]
>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L238
>> >>> > [6] https://issues.apache.org/jira/browse/BEAM-4032
>> >>> > [7] https://github.com/pypa/pip/issues/1884#issuecomment-670364453
>> >>> > [8] https://github.com/pypa/pip/issues/8387#issuecomment-638118900
>> >>> > [9] https://github.com/pypa/pip/issues/10195#issuecomment-941811201
>> >>> > [10]
>> https://github.com/numpy/numpy/pull/14053#issuecomment-637709988
>> >>> > [11] https://github.com/pypa/pip/issues/1884
>> >>> > [12]
>> https://discuss.python.org/t/pep-625-file-name-of-a-source-distribution/4686
>> >>> > [13] https://github.com/pypa/pip/issues/10195
>> >>> > [14]
>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L715
>> >>> > [15]
>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/options/pipeline_options.py#L1096
>> >>> > [16] https://github.com/pypa/pip/issues/1884#issuecomment-800483766
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Mon, Oct 11, 2021 at 8:51 AM Robert Bradshaw <
>> robertwb@google.com> wrote:
>> >>> >>
>> >>> >> That's fine by me. The only advantage I can think of for .zip is
>> that
>> >>> >> it's (generally) better supported on Windows, but as far as I know
>> >>> >> .tar.gz works on Windows just fine for python package distribution.
>> >>> >>
>> >>> >> On Mon, Oct 11, 2021 at 5:09 AM Yoshiki Obata <
>> yoshiki.obata@gmail.com> wrote:
>> >>> >> >
>> >>> >> > Hello everyone,
>> >>> >> >
>> >>> >> > I'm working on BEAM-8954[1] which introduces tox isolated_build
>> for
>> >>> >> > python tests.
>> >>> >> > Concerning this issue, I want opinions about using .tar.gz as
>> sdist format.
>> >>> >> >
>> >>> >> > Introducing tox isolated_build leads replacement of
>> >>> >> > build-requirements.txt to pyproject.toml[2] and we should use
>> >>> >> > pyproject.toml when creating sdist because we install
>> dependencies
>> >>> >> > with build-requirements.txt before calling "python setup.py
>> sdist"
>> >>> >> > PEP 517 based build tools like pypa/build will help to do so,
>> but it
>> >>> >> > does not allow .zip as sdist format[3].
>> >>> >> > Therefore I think it would be better to switch sdist format to
>> .tar.gz
>> >>> >> > when starting to use pyproject.toml.
>> >>> >> >
>> >>> >> > Are there any obstacles to use .tar.gz?
>> >>> >> > Please let me know details about adopting .zip as Beam sdist
>> format(I
>> >>> >> > could not find discussions about this)
>> >>> >> >
>> >>> >> > Regards,
>> >>> >> > yoshiki
>> >>> >> >
>> >>> >> > [1] https://issues.apache.org/jira/browse/BEAM-8954
>> >>> >> > [2]
>> https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
>> >>> >> > [3]
>> https://www.python.org/dev/peps/pep-0517/#source-distributions
>>
>

Re: Switching Python sdist format

Posted by Valentyn Tymofieiev <va...@google.com>.
Thanks. Chances are that most custom container users will either start from
Apache Beam image as a base image or use a compatible Linux platform, so if
we were to go with the guess-the-platform route, I would limit  bdists to 1
and provide an option to override the platform or
--disable_requirements_cache.

Overheads would only affect users who pass a custom container AND a
--requirements_file AND have a non-standard platform.

Overheads from downloading sdists along side bdists might be not
significant. Overheads from starting to download bdists may be noticeable.
An extreme case, Tensorflow currently does not have an sdist, so we don't
stage it, but it has a ~500 mb bdist, which we'd have to download and stage
if users pass it in a --requirements_file.

Also, currently we stage the entire accumulated contents of the
requirements cache from previous runs, not just dependencies of the current
pipeline, which we may want to revisit now that we are looking into
requirements cache and potentially staging more files [1] [2].

[1] https://issues.apache.org/jira/browse/BEAM-680
[2] https://issues.apache.org/jira/browse/BEAM-10147

On Fri, Oct 15, 2021 at 10:10 AM Robert Bradshaw <ro...@google.com>
wrote:

> On Fri, Oct 15, 2021 at 9:58 AM Luke Cwik <lc...@google.com> wrote:
> >
> > When the platform is unknown, would it be too expensive to package the 2
> or 3 most common binary distributions and the sdist as a backup. This way
> the container could install a bdist if there is a compatible one falling
> back to sdist.
>
> Given that platform is a container of our making, or a custom
> container (in which case it should be rare that the dependencies not
> be installed directly), I don't know if we want to have 3-4x the
> uploads.
>
> > Alternatively could the user specify the target platform with a python
> specific option that would guide the packaging system to provide the
> appropriate bdist?
>
> +1
>
> > On Thu, Oct 14, 2021 at 10:44 PM Valentyn Tymofieiev <
> valentyn@google.com> wrote:
> >>
> >> I think I finally understand the problematic behavior here, and
> described it in [1].
> >>
> >> Also had an offline conversation about this issue with Robert and other
> folks.
> >>
> >> A possible solution to go about populating requirements cache on the
> workers is to download bdists instead of sdists whenever users launch a
> pipeline with a requirements file that will run in a default container
> image, known to the SDK, such as  apache/beam_python3.7_sdk:2.xx.0 and
> gcr.io/cloud-dataflow/v1beta3/python...  For these container images we
> know and control the platform, so when we `pip download` the dependencies,
> we can prefer to download bdists for a specific platform. This would fix
> BEAM-4032[2], and will improve worker startup time   since dependencies
> won't have to be built from sources on the workers. Also note that we
> currently don't download any build dependencies[3] for packages using
> PEP-517 , so build dependencies would have to be downloaded on the fly on
> the workers, reducing the benefit of populating the requirements cache.
> Staging bdists instead of sdists should help with that too, since we won't
> need to build packages, thus won't need build deps.
> >>
> >> For users who pass custom containers (via --sdk_container_image flags),
> we wouldn't have the knowledge of the target platform and couldn't know for
> sure which bdists to download and stage, but custom container users can
> package all necessary Python dependencies in their container image, and
> won't need to pass a --requirements_file with their pipeline, so
> requirements cache will not be necessary.
> >>
> >> [1] https://github.com/pypa/pip/issues/10589
> >> [2] https://issues.apache.org/jira/browse/BEAM-4032
> >> [3] https://github.com/pypa/pip/issues/10589#issuecomment-944006331
> >>
> >>
> >> On Thu, Oct 14, 2021 at 7:36 AM Yoshiki Obata <yo...@gmail.com>
> wrote:
> >>>
> >>> Thank you for your reply, Robert and Valentyn.
> >>>
> >>> Your concerns about slow pipeline submission are very helpful; I
> >>> couldn't take them into account before.
> >>> It seems that using .tar.gz as the sdist format would be fine, but
> >>> introducing PEP 517 to Beam could have a bad impact on users, so
> >>> more consideration is needed.
> >>>
> >>> I will investigate more and post an update if ideal solutions are
> discovered.
> >>>
> >>> Thanks,
> >>> yoshiki
> >>>
> >>> On Wed, Oct 13, 2021 at 10:59 AM Valentyn Tymofieiev
> >>> <va...@google.com> wrote:
> >>> >
> >>> > Hi Yoshiki,
> >>> >
> >>> > it should be fine to use tar.gz. There may be some Beam
> release-related scripts we need to update that expect a .zip. Also I
> noticed that numpy uses .zip for its sdist[1], and they also use
> pyproject.toml[2], so I don't know if the requirement to use .tar.gz sdist
> in PEP-517 is a hard requirement, but we can try to confirm. Perhaps it
> matters when some particular hooks of PEP-517 are defined.
> >>> >
> >>> > I would like to discuss further your suggestion to use PEP-517 in
> Beam and potential implications. For the benefit of others on this thread,
> for a simple explanation of PEP-517, see [3].
> >>> >
> >>> > I noticed that when several other projects switched to using PEP-517
> (numpy, pyarrow), it impacted the pipeline submission experience for Beam
> users. To make sure that the pipeline execution environment has necessary
> pipeline dependencies, Beam downloads distributions for packages (and their
> transitive dependencies) that users supply in requirements.txt [4], and
> stages them to the execution environment[5]. Specifically, Beam downloads
> and stages source distributions (aka sdists) of the packages. We download
> sdists, since the target execution platform on the runner is not known to
> the SDK, so staging binary distributions (aka bdists, wheels) was never
> supported (BEAM-4032 [6]). It turns out that downloading a source
> distribution of a library from PyPI via `python -m pip download --dest /tmp
> pyarrow==5.0.0 --no-deps --no-binary :all:` makes pip also do a metadata
> check to verify that the contents of the sdist that was downloaded is
> indeed the package that was requested [7]. For libraries packaged with
> setuptools / prior to PEP-517, this involved a call `python setup.py
> egg_info` [8], which is fairly fast. However, for libraries packaged using
> pyproject.toml/PEP-517, getting the package metadata involves installing
> build dependencies, and potentially building a wheel and extracting the
> metadata from a wheel. If a library has c extensions, this step involves
> compilation, can be quite slow and require dev header packages to be
> installed on the OS. I am not sure whether the slowdown happens on every
> project (with extensions) that adopts PEP-517, or only if the project is
> somehow misconfigured, asked on [9]. So far I observed such a slowdown with
> numpy and pyarrow[10].
> >>> >
> >>> > Users have brought this up to the pip maintainers. For various
> reasons, this has proved not easy to address, and long term solutions are
> still in the works [11, 12, 13].
> >>> >
> >>> > What does it mean for Beam:
> >>> >
> >>> > 1) With more Python packages adopting PEP-517, users may be getting
> affected by slow pipeline submission time if they require certain packages
> in requirements.txt that take a long time to download+build.
> >>> > 2) There is a possibility that adopting PEP-517 in Beam will
> increase pipeline submission time due to the slowness of the pip download
> command, because we download Beam SDK and stage it to the runner during job
> submission[14]. Given we don't know the target platform for sure, we stage
> both an sdist and a bdist [14]. The platform selected for bdist matches the
> platform of Beam's default containers, but with custom containers we can't
> guarantee the target platform will always match a predefined default, so we
> also pass an sdist.
> >>> > 3) Users sometimes include apache-beam into requirements.txt of
> users' pipelines. Although not necessary, this contributes to slowdowns
> because numpy and pyarrow are Beam's dependencies, and they end up being
> downloaded.
> >>> >
> >>> > I am not against using PEP-517, it seems to have been written for
> good reason, but we should prevent slowdown in pipeline submissions.
> >>> >
> >>> > I am curious what the community thinks about ways to adopt it.
> Possible avenues:
> >>> >
> >>> > - Provide some information to the SDK about the runner's platform at
> pipeline submission, and stage only binary packages to the runner whenever
> possible [6]. Pip is only slow to download sdists, downloading bdists is
> fast. Also installing bdists on the runner would be much faster.
> >>> > - Rethink dependency staging. Avoid staging the SDK, and/or
> dependency package sdists via container prebuilding workflow[15] by default
> for all runners that use containers. During prebuilding, install packages
> directly on the containers with `pip install`, and avoid `pip download`
> step. Filter out apache_beam from  requirements.txt file when users add it
> there.
> >>> > - Wait until solutions for [11] become available or get involved to
> help move that forward.
> >>> > - Switch to downloading sdists over HTTP instead of `pip download`
> [16], falling back to pip download if not successful.
> >>> > - There may be some way to configure PEP-517 properly, that avoids
> the slowdown, asked on [9].
> >>> >
> >>> > Thanks,
> >>> > Valentyn
> >>> >
> >>> > [1]
> https://pypi.org/project/numpy/#copy-hash-modal-89500de9-70e8-4e7e-87f5-12adf0808905
> >>> > [2] https://github.com/numpy/numpy/blob/main/pyproject.toml
> >>> > [3] https://snarky.ca/clarifying-pep-518
> >>> > [4]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L640
> >>> > [5]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L238
> >>> > [6] https://issues.apache.org/jira/browse/BEAM-4032
> >>> > [7] https://github.com/pypa/pip/issues/1884#issuecomment-670364453
> >>> > [8] https://github.com/pypa/pip/issues/8387#issuecomment-638118900
> >>> > [9] https://github.com/pypa/pip/issues/10195#issuecomment-941811201
> >>> > [10]
> https://github.com/numpy/numpy/pull/14053#issuecomment-637709988
> >>> > [11] https://github.com/pypa/pip/issues/1884
> >>> > [12]
> https://discuss.python.org/t/pep-625-file-name-of-a-source-distribution/4686
> >>> > [13] https://github.com/pypa/pip/issues/10195
> >>> > [14]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L715
> >>> > [15]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/options/pipeline_options.py#L1096
> >>> > [16] https://github.com/pypa/pip/issues/1884#issuecomment-800483766
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Mon, Oct 11, 2021 at 8:51 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>> >>
> >>> >> That's fine by me. The only advantage I can think of for .zip is
> that
> >>> >> it's (generally) better supported on Windows, but as far as I know
> >>> >> .tar.gz works on Windows just fine for python package distribution.
> >>> >>
> >>> >> On Mon, Oct 11, 2021 at 5:09 AM Yoshiki Obata <
> yoshiki.obata@gmail.com> wrote:
> >>> >> >
> >>> >> > Hello everyone,
> >>> >> >
> >>> >> > I'm working on BEAM-8954[1] which introduces tox isolated_build
> for
> >>> >> > python tests.
> >>> >> > Concerning this issue, I want opinions about using .tar.gz as
> sdist format.
> >>> >> >
> >>> >> > Introducing tox isolated_build leads to replacing
> >>> >> > build-requirements.txt with pyproject.toml[2], and we should use
> >>> >> > pyproject.toml when creating the sdist because we install dependencies
> >>> >> > with build-requirements.txt before calling "python setup.py sdist".
> >>> >> > PEP 517-based build tools like pypa/build will help to do so, but
> >>> >> > they do not allow .zip as the sdist format[3].
> >>> >> > Therefore I think it would be better to switch the sdist format to
> >>> >> > .tar.gz when starting to use pyproject.toml.
> >>> >> >
> >>> >> > Are there any obstacles to using .tar.gz?
> >>> >> > Please let me know details about adopting .zip as the Beam sdist
> >>> >> > format (I could not find discussions about this).
> >>> >> >
> >>> >> > Regards,
> >>> >> > yoshiki
> >>> >> >
> >>> >> > [1] https://issues.apache.org/jira/browse/BEAM-8954
> >>> >> > [2]
> https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
> >>> >> > [3]
> https://www.python.org/dev/peps/pep-0517/#source-distributions
>

Re: Switching Python sdist format

Posted by Robert Bradshaw <ro...@google.com>.
On Fri, Oct 15, 2021 at 9:58 AM Luke Cwik <lc...@google.com> wrote:
>
> When the platform is unknown, would it be too expensive to package the 2 or 3 most common binary distributions and the sdist as a backup? This way the container could install a bdist if there is a compatible one, falling back to the sdist.

Given that the platform is a container of our making, or a custom
container (in which case it should be rare for the dependencies not to
be installed directly), I don't know if we want to have 3-4x the
uploads.

> Alternatively, could the user specify the target platform with a Python-specific option that would guide the packaging system to provide the appropriate bdist?

+1

> On Thu, Oct 14, 2021 at 10:44 PM Valentyn Tymofieiev <va...@google.com> wrote:
>>
>> I think I finally understand the problematic behavior here, and described it in [1].
>>
>> Also had an offline conversation about this issue with Robert and other folks.
>>
>> A possible solution to go about populating requirements cache on the workers is to download bdists instead of sdists whenever users launch a pipeline with a requirements file that will run in a default container image, known to the SDK, such as apache/beam_python3.7_sdk:2.xx.0 and gcr.io/cloud-dataflow/v1beta3/python... For these container images we know and control the platform, so when we `pip download` the dependencies, we can prefer to download bdists for a specific platform. This would fix BEAM-4032[2], and will improve worker startup time, since dependencies won't have to be built from sources on the workers. Also note that we currently don't download any build dependencies[3] for packages using PEP-517, so build dependencies would have to be downloaded on the fly on the workers, reducing the benefit of populating the requirements cache. Staging bdists instead of sdists should help with that too, since we won't need to build packages, thus won't need build deps.
>>
>> For users who pass custom containers (via --sdk_container_image flags), we wouldn't have the knowledge of the target platform and couldn't know for sure which bdists to download and stage, but custom container users can package all necessary Python dependencies in their container image, and won't need to pass a --requirements_file with their pipeline, so requirements cache will not be necessary.
>>
>> [1] https://github.com/pypa/pip/issues/10589
>> [2] https://issues.apache.org/jira/browse/BEAM-4032
>> [3] https://github.com/pypa/pip/issues/10589#issuecomment-944006331
>>
>>
>> On Thu, Oct 14, 2021 at 7:36 AM Yoshiki Obata <yo...@gmail.com> wrote:
>>>
>>> Thank you for your reply, Robert and Valentyn.
>>>
>>> Your concerns about slow pipeline submission are very helpful; I
>>> couldn't take them into account before.
>>> It seems that using .tar.gz as the sdist format would be fine, but
>>> introducing PEP 517 to Beam could have a bad impact on users, so
>>> more consideration is needed.
>>>
>>> I will investigate more and post an update if ideal solutions are discovered.
>>>
>>> Thanks,
>>> yoshiki
>>>
>>> On Wed, Oct 13, 2021 at 10:59 AM Valentyn Tymofieiev
>>> <va...@google.com> wrote:
>>> >
>>> > Hi Yoshiki,
>>> >
>>> > it should be fine to use tar.gz. There may be some Beam release-related scripts we need to update that expect a .zip. Also I noticed that numpy uses .zip for its sdist[1], and they also use pyproject.toml[2], so I don't know if the requirement to use .tar.gz sdist in PEP-517 is a hard requirement, but we can try to confirm. Perhaps it matters when some particular hooks of PEP-517 are defined.
>>> >
>>> > I would like to discuss further your suggestion to use PEP-517 in Beam and potential implications. For the benefit of others on this thread, for a simple explanation of PEP-517, see [3].
>>> >
>>> > I noticed that when several other projects switched to using PEP-517 (numpy, pyarrow), it impacted the pipeline submission experience for Beam users. To make sure that the pipeline execution environment has necessary pipeline dependencies, Beam downloads distributions for packages (and their transitive dependencies) that users supply in requirements.txt [4], and stages them to the execution environment[5]. Specifically, Beam downloads and stages source distributions (aka sdists) of the packages. We download sdists, since the target execution platform on the runner is not known to the SDK, so staging binary distributions (aka bdists, wheels) was never supported (BEAM-4032 [6]). It turns out that downloading a source distribution of a library from PyPI via `python -m pip download --dest /tmp pyarrow==5.0.0 --no-deps --no-binary :all:` makes pip also do a metadata check to verify that the contents of the sdist that was downloaded is indeed the package that was requested [7]. For libraries packaged with setuptools / prior to PEP-517, this involved a call `python setup.py egg_info` [8], which is fairly fast. However, for libraries packaged using pyproject.toml/PEP-517, getting the package metadata involves installing build dependencies, and potentially building a wheel and extracting the metadata from a wheel. If a library has c extensions, this step involves compilation, can be quite slow and require dev header packages to be installed on the OS. I am not sure whether the slowdown happens on every project (with extensions) that adopts PEP-517, or only if the project is somehow misconfigured, asked on [9]. So far I observed such a slowdown with numpy and pyarrow[10].
>>> >
>>> > Users have brought this up to the pip maintainers. For various reasons, this has proved not easy to address, and long term solutions are still in the works [11, 12, 13].
>>> >
>>> > What does it mean for Beam:
>>> >
>>> > 1) With more Python packages adopting PEP-517, users may be getting affected by slow pipeline submission time if they require certain packages in requirements.txt that take a long time to download+build.
>>> > 2) There is a possibility that adopting PEP-517 in Beam will increase pipeline submission time due to the slowness of the pip download command, because we download Beam SDK and stage it to the runner during job submission[14]. Given we don't know the target platform for sure, we stage both an sdist and a bdist [14]. The platform selected for bdist matches the platform of Beam's default containers, but with custom containers we can't guarantee the target platform will always match a predefined default, so we also pass an sdist.
>>> > 3) Users sometimes include apache-beam into requirements.txt of users' pipelines. Although not necessary, this contributes to slowdowns because numpy and pyarrow are Beam's dependencies, and they end up being downloaded.
>>> >
>>> > I am not against using PEP-517, it seems to have been written for good reason, but we should prevent slowdown in pipeline submissions.
>>> >
>>> > I am curious what the community thinks about ways to adopt it. Possible avenues:
>>> >
>>> > - Provide some information to the SDK about the runner's platform at pipeline submission, and stage only binary packages to the runner whenever possible [6]. Pip is only slow to download sdists, downloading bdists is fast. Also installing bdists on the runner would be much faster.
>>> > - Rethink dependency staging. Avoid staging the SDK, and/or dependency package sdists via container prebuilding workflow[15] by default for all runners that use containers. During prebuilding, install packages directly on the containers with `pip install`, and avoid `pip download` step. Filter out apache_beam from  requirements.txt file when users add it there.
>>> > - Wait until solutions for [11] become available or get involved to help move that forward.
>>> > - Switch to downloading sdists over HTTP instead of `pip download` [16], falling back to pip download if not successful.
>>> > - There may be some way to configure PEP-517 properly, that avoids the slowdown, asked on [9].
>>> >
>>> > Thanks,
>>> > Valentyn
>>> >
>>> > [1] https://pypi.org/project/numpy/#copy-hash-modal-89500de9-70e8-4e7e-87f5-12adf0808905
>>> > [2] https://github.com/numpy/numpy/blob/main/pyproject.toml
>>> > [3] https://snarky.ca/clarifying-pep-518
>>> > [4] https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L640
>>> > [5] https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L238
>>> > [6] https://issues.apache.org/jira/browse/BEAM-4032
>>> > [7] https://github.com/pypa/pip/issues/1884#issuecomment-670364453
>>> > [8] https://github.com/pypa/pip/issues/8387#issuecomment-638118900
>>> > [9] https://github.com/pypa/pip/issues/10195#issuecomment-941811201
>>> > [10] https://github.com/numpy/numpy/pull/14053#issuecomment-637709988
>>> > [11] https://github.com/pypa/pip/issues/1884
>>> > [12] https://discuss.python.org/t/pep-625-file-name-of-a-source-distribution/4686
>>> > [13] https://github.com/pypa/pip/issues/10195
>>> > [14] https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L715
>>> > [15] https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/options/pipeline_options.py#L1096
>>> > [16] https://github.com/pypa/pip/issues/1884#issuecomment-800483766
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Oct 11, 2021 at 8:51 AM Robert Bradshaw <ro...@google.com> wrote:
>>> >>
>>> >> That's fine by me. The only advantage I can think of for .zip is that
>>> >> it's (generally) better supported on Windows, but as far as I know
>>> >> .tar.gz works on Windows just fine for python package distribution.
>>> >>
>>> >> On Mon, Oct 11, 2021 at 5:09 AM Yoshiki Obata <yo...@gmail.com> wrote:
>>> >> >
>>> >> > Hello everyone,
>>> >> >
>>> >> > I'm working on BEAM-8954[1] which introduces tox isolated_build for
>>> >> > python tests.
>>> >> > Concerning this issue, I want opinions about using .tar.gz as sdist format.
>>> >> >
>>> >> > Introducing tox isolated_build leads to replacing
>>> >> > build-requirements.txt with pyproject.toml[2], and we should use
>>> >> > pyproject.toml when creating the sdist because we install dependencies
>>> >> > with build-requirements.txt before calling "python setup.py sdist".
>>> >> > PEP 517-based build tools like pypa/build will help to do so, but
>>> >> > they do not allow .zip as the sdist format[3].
>>> >> > Therefore I think it would be better to switch the sdist format to .tar.gz
>>> >> > when starting to use pyproject.toml.
>>> >> >
>>> >> > Are there any obstacles to using .tar.gz?
>>> >> > Please let me know details about adopting .zip as the Beam sdist format
>>> >> > (I could not find discussions about this).
>>> >> >
>>> >> > Regards,
>>> >> > yoshiki
>>> >> >
>>> >> > [1] https://issues.apache.org/jira/browse/BEAM-8954
>>> >> > [2] https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
>>> >> > [3] https://www.python.org/dev/peps/pep-0517/#source-distributions
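
For illustration, the pyproject.toml that would replace build-requirements.txt could look roughly like the sketch below; the build requirements listed here are an assumption for the example, not Beam's actual list:

```toml
[build-system]
# Illustrative build requirements; the real list would come from
# Beam's build-requirements.txt.
requires = ["setuptools>=40.8.0", "wheel", "cython", "numpy"]
build-backend = "setuptools.build_meta"
```

With a file like this in place, `python -m build --sdist` installs the build requirements into an isolated environment and produces a .tar.gz sdist.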

Re: Switching Python sdist format

Posted by Luke Cwik <lc...@google.com>.
When the platform is unknown, would it be too expensive to package the 2 or
3 most common binary distributions and the sdist as a backup? This way the
container could install a bdist if there is a compatible one, falling back
to the sdist.
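
Sketched in Python, this container-side choice could lean on wheel tags from the `packaging` library; `pick_staged_artifact` is a hypothetical helper, not existing Beam code:

```python
from packaging.tags import sys_tags
from packaging.utils import parse_wheel_filename


def pick_staged_artifact(staged_files):
    """Return a staged wheel compatible with the running interpreter and
    platform, falling back to the staged sdist (hypothetical helper)."""
    supported = set(sys_tags())  # tags this environment can install
    for name in staged_files:
        if name.endswith(".whl"):
            _, _, _, tags = parse_wheel_filename(name)
            if tags & supported:
                return name
    # No compatible wheel was staged: build from source instead.
    return next(f for f in staged_files if f.endswith(".tar.gz"))
```

A pure-Python wheel (py3-none-any) would match almost everywhere, while a wheel built for a different platform would be skipped in favor of the sdist.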

Alternatively, could the user specify the target platform with a
Python-specific option that would guide the packaging system to provide the
appropriate bdist?
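
One way such an option could be wired up, sketched here with a made-up `target_platform` parameter (note that pip requires `--only-binary :all:` whenever `--platform` is given):

```python
def requirements_download_args(package, target_platform=None):
    """Build a `pip download` command line for populating the requirements
    cache. `target_platform` stands in for a hypothetical pipeline option
    naming a wheel platform tag, e.g. 'manylinux2014_x86_64'."""
    cmd = [
        "python", "-m", "pip", "download",
        "--dest", "/tmp/dataflow-requirements-cache",
        "--no-deps", package,
    ]
    if target_platform:
        # Platform is known: fetch a prebuilt wheel for it.
        cmd += ["--only-binary", ":all:", "--platform", target_platform]
    else:
        # Platform unknown: fall back to the source distribution.
        cmd += ["--no-binary", ":all:"]
    return cmd
```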



On Thu, Oct 14, 2021 at 10:44 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> I think I finally understand the problematic behavior here, and described
> it in [1].
>
> Also had an offline conversation about this issue with Robert and other
> folks.
>
> A possible solution to go about populating requirements cache on the
> workers is to download bdists instead of sdists whenever users launch a
> pipeline with a requirements file that will run in a default container
> image, known to the SDK, such as apache/beam_python3.7_sdk:2.xx.0 and
> gcr.io/cloud-dataflow/v1beta3/python... For these container images we
> know and control the platform, so when we `pip download` the dependencies,
> we can prefer to download bdists for a specific platform. This would fix
> BEAM-4032[2], and will improve worker startup time, since dependencies
> won't have to be built from sources on the workers. Also note that we
> currently don't download any build dependencies[3] for packages using
> PEP-517, so build dependencies would have to be downloaded on the fly on
> the workers, reducing the benefit of populating the requirements cache.
> Staging bdists instead of sdists should help with that too, since we won't
> need to build packages, thus won't need build deps.
>
> For users who pass custom containers (via --sdk_container_image flags), we
> wouldn't have the knowledge of the target platform and couldn't know for
> sure which bdists to download and stage, but custom container users can
> package all necessary Python dependencies in their container image, and
> won't need to pass a --requirements_file with their pipeline, so
> requirements cache will not be necessary.
>
> [1] https://github.com/pypa/pip/issues/10589
> [2] https://issues.apache.org/jira/browse/BEAM-4032
> [3] https://github.com/pypa/pip/issues/10589#issuecomment-944006331
>
>
> On Thu, Oct 14, 2021 at 7:36 AM Yoshiki Obata <yo...@gmail.com>
> wrote:
>
>> Thank you for your reply, Robert and Valentyn.
>>
>> Your concerns about slow pipeline submission are very helpful; I
>> couldn't take them into account before.
>> It seems that using .tar.gz as the sdist format would be fine, but
>> introducing PEP 517 to Beam could have a bad impact on users, so
>> more consideration is needed.
>>
>> I will investigate more and post an update if ideal solutions are
>> discovered.
>>
>> Thanks,
>> yoshiki
>>
>> On Wed, Oct 13, 2021 at 10:59 AM Valentyn Tymofieiev
>> <va...@google.com> wrote:
>> >
>> > Hi Yoshiki,
>> >
>> > it should be fine to use tar.gz. There may be some Beam release-related
>> scripts we need to update that expect a .zip. Also I noticed that numpy
>> uses .zip for its sdist[1], and they also use pyproject.toml[2], so I don't
>> know if the requirement to use .tar.gz sdist in PEP-517 is a hard
>> requirement, but we can try to confirm. Perhaps it matters when some
>> particular hooks of PEP-517 are defined.
>> >
>> > I would like to discuss further your suggestion to use PEP-517 in Beam
>> and potential implications. For the benefit of others on this thread, for a
>> simple explanation of PEP-517, see [3].
>> >
>> > I noticed that when several other projects switched to using PEP-517
>> (numpy, pyarrow), it impacted the pipeline submission experience for Beam
>> users. To make sure that the pipeline execution environment has necessary
>> pipeline dependencies, Beam downloads distributions for packages (and their
>> transitive dependencies) that users supply in requirements.txt [4], and
>> stages them to the execution environment[5]. Specifically, Beam downloads
>> and stages source distributions (aka sdists) of the packages. We download
>> sdists, since the target execution platform on the runner is not known to
>> the SDK, so staging binary distributions (aka bdists, wheels) was never
>> supported (BEAM-4032 [6]). It turns out that downloading a source
>> distribution of a library from PyPI via `python -m pip download --dest /tmp
>> pyarrow==5.0.0 --no-deps --no-binary :all:` makes pip also do a metadata
>> check to verify that the contents of the sdist that was downloaded is
>> indeed the package that was requested [7]. For libraries packaged with
>> setuptools / prior to PEP-517, this involved a call `python setup.py
>> egg_info` [8], which is fairly fast. However, for libraries packaged using
>> pyproject.toml/PEP-517, getting the package metadata involves installing
>> build dependencies, and potentially building a wheel and extracting the
>> metadata from a wheel. If a library has c extensions, this step involves
>> compilation, can be quite slow and require dev header packages to be
>> installed on the OS. I am not sure whether the slowdown happens on every
>> project (with extensions) that adopts PEP-517, or only if the project is
>> somehow misconfigured, asked on [9]. So far I observed such a slowdown with
>> numpy and pyarrow[10].
>> >
>> > Users have brought this up to the pip maintainers. For various reasons,
>> this has proved not easy to address, and long term solutions are still in
>> the works [11, 12, 13].
>> >
>> > What does it mean for Beam:
>> >
>> > 1) With more Python packages adopting PEP-517, users may be getting
>> affected by slow pipeline submission time if they require certain packages
>> in requirements.txt that take a long time to download+build.
>> > 2) There is a possibility that adopting PEP-517 in Beam will increase
>> pipeline submission time due to the slowness of the pip download command,
>> because we download Beam SDK and stage it to the runner during job
>> submission[14]. Given we don't know the target platform for sure, we stage
>> both an sdist and a bdist [14]. The platform selected for bdist matches the
>> platform of Beam's default containers, but with custom containers we can't
>> guarantee the target platform will always match a predefined default, so we
>> also pass an sdist.
>> > 3) Users sometimes include apache-beam into requirements.txt of users'
>> pipelines. Although not necessary, this contributes to slowdowns because
>> numpy and pyarrow are Beam's dependencies, and they end up being downloaded.
>> >
>> > I am not against using PEP-517, it seems to have been written for good
>> reason, but we should prevent slowdown in pipeline submissions.
>> >
>> > I am curious what the community thinks about ways to adopt it. Possible
>> avenues:
>> >
>> > - Provide some information to the SDK about the runner's platform at
>> pipeline submission, and stage only binary packages to the runner whenever
>> possible [6]. Pip is only slow to download sdists, downloading bdists is
>> fast. Also installing bdists on the runner would be much faster.
>> > - Rethink dependency staging. Avoid staging the SDK, and/or dependency
>> package sdists via container prebuilding workflow[15] by default for all
>> runners that use containers. During prebuilding, install packages directly
>> on the containers with `pip install`, and avoid `pip download` step. Filter
>> out apache_beam from  requirements.txt file when users add it there.
>> > - Wait until solutions for [11] become available or get involved to
>> help move that forward.
>> > - Switch to downloading sdists over HTTP instead of `pip download` [16],
>> falling back to pip download if not successful.
>> > - There may be some way to configure PEP-517 properly, that avoids the
>> slowdown, asked on [9].
>> >
>> > Thanks,
>> > Valentyn
>> >
>> > [1]
>> https://pypi.org/project/numpy/#copy-hash-modal-89500de9-70e8-4e7e-87f5-12adf0808905
>> > [2] https://github.com/numpy/numpy/blob/main/pyproject.toml
>> > [3] https://snarky.ca/clarifying-pep-518
>> > [4]
>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L640
>> > [5]
>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L238
>> > [6] https://issues.apache.org/jira/browse/BEAM-4032
>> > [7] https://github.com/pypa/pip/issues/1884#issuecomment-670364453
>> > [8] https://github.com/pypa/pip/issues/8387#issuecomment-638118900
>> > [9] https://github.com/pypa/pip/issues/10195#issuecomment-941811201
>> > [10] https://github.com/numpy/numpy/pull/14053#issuecomment-637709988
>> > [11] https://github.com/pypa/pip/issues/1884
>> > [12]
>> https://discuss.python.org/t/pep-625-file-name-of-a-source-distribution/4686
>> > [13] https://github.com/pypa/pip/issues/10195
>> > [14]
>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L715
>> > [15]
>> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/options/pipeline_options.py#L1096
>> > [16] https://github.com/pypa/pip/issues/1884#issuecomment-800483766
>> >
>> >
>> >
>> >
>> > On Mon, Oct 11, 2021 at 8:51 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >>
>> >> That's fine by me. The only advantage I can think of for .zip is that
>> >> it's (generally) better supported on Windows, but as far as I know
>> >> .tar.gz works on Windows just fine for python package distribution.
>> >>
>> >> On Mon, Oct 11, 2021 at 5:09 AM Yoshiki Obata <yo...@gmail.com>
>> wrote:
>> >> >
>> >> > Hello everyone,
>> >> >
>> >> > I'm working on BEAM-8954[1] which introduces tox isolated_build for
>> >> > python tests.
>> >> > Concerning this issue, I want opinions about using .tar.gz as sdist
>> format.
>> >> >
>> >> > Introducing tox isolated_build leads to replacing
>> >> > build-requirements.txt with pyproject.toml[2], and we should use
>> >> > pyproject.toml when creating the sdist because we install dependencies
>> >> > with build-requirements.txt before calling "python setup.py sdist".
>> >> > PEP 517-based build tools like pypa/build will help to do so, but
>> >> > they do not allow .zip as the sdist format[3].
>> >> > Therefore I think it would be better to switch the sdist format to
>> >> > .tar.gz when starting to use pyproject.toml.
>> >> >
>> >> > Are there any obstacles to using .tar.gz?
>> >> > Please let me know details about adopting .zip as the Beam sdist format
>> >> > (I could not find discussions about this).
>> >> >
>> >> > Regards,
>> >> > yoshiki
>> >> >
>> >> > [1] https://issues.apache.org/jira/browse/BEAM-8954
>> >> > [2]
>> https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
>> >> > [3] https://www.python.org/dev/peps/pep-0517/#source-distributions
>>
>

Re: Switching Python sdist format

Posted by Valentyn Tymofieiev <va...@google.com>.
I think I finally understand the problematic behavior here, and described
it in [1].

Also had an offline conversation about this issue with Robert and other
folks.

A possible solution to go about populating requirements cache on the
workers is to download bdists instead of sdists whenever users launch a
pipeline with a requirements file that will run in a default container
image, known to the SDK, such as apache/beam_python3.7_sdk:2.xx.0 and
gcr.io/cloud-dataflow/v1beta3/python... For these container images we know
and control the platform, so when we `pip download` the dependencies, we
can prefer to download bdists for a specific platform. This would fix
BEAM-4032[2], and will improve worker startup time   since dependencies
won't have to be built from sources on the workers. Also note that we
currently don't download any build dependencies[3] for packages using
PEP-517 , so build dependencies would have to be downloaded on the fly on
the workers, reducing the benefit of populating the requirements cache.
Staging bdists instead of sdists should help with that too, since we won't
need to build packages, thus won't need build deps.
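
To illustrate the idea, here is a minimal sketch of building such a
bdist-only `pip download` invocation (the helper name and the default
platform/version values are hypothetical choices of mine; the flags are
standard `pip download` options):

```python
# Sketch: build a `pip download` command that fetches only wheels
# (bdists) for a known target platform, e.g. the platform of Beam's
# default SDK containers. The defaults below are illustrative only.
def bdist_download_cmd(package, dest="/tmp/requirements-cache",
                       platform="manylinux2014_x86_64",
                       python_version="37"):
    # --only-binary :all: refuses sdists entirely, so nothing is built
    # from source and no PEP-517 build dependencies are fetched.
    return [
        "pip", "download", package,
        "--dest", dest,
        "--no-deps",
        "--only-binary", ":all:",
        "--platform", platform,
        "--python-version", python_version,
        "--implementation", "cp",
    ]
```
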

For users who pass custom containers (via the --sdk_container_image
flag), we wouldn't know the target platform and couldn't be sure which
bdists to download and stage. However, custom container users can
package all necessary Python dependencies in their container image and
won't need to pass a --requirements_file with their pipeline, so the
requirements cache will not be necessary for them.

[1] https://github.com/pypa/pip/issues/10589
[2] https://issues.apache.org/jira/browse/BEAM-4032
[3] https://github.com/pypa/pip/issues/10589#issuecomment-944006331


On Thu, Oct 14, 2021 at 7:36 AM Yoshiki Obata <yo...@gmail.com>
wrote:

> Thank you for your reply, Robert and Valentyn.
>
> Concerns about slow pipeline submission is very helpful which I
> couldn't take into account.
> It seems that using .tar.gz as sdist format would be fine but
> introducing PEP 517 to Beam should have bad impact to users therefore
> more consideration must be needed.
>
> I will investigate more and post an update if ideal solutions are
> discovered.
>
> Thanks,
> yoshiki
>
> On Wed, Oct 13, 2021 at 10:59 AM Valentyn Tymofieiev
> <va...@google.com> wrote:
> >
> > Hi Yoshiki,
> >
> > it should be fine to use tar.gz. There may be some Beam release-related
> scripts we need to update that expect a .zip. Also I noticed that numpy
> uses .zip for its sdist[1], and they also use pyproject.toml[2], so I don't
> know if the requirement to use .tar.gz sdist in PEP-517 is a hard
> requirement, but we can ask try to confirm. Perhaps it matters when some
> particular hooks of PEP-517 are defined.
> >
> > I would like to discuss further your suggestion to use PEP-517 in Beam
> and potential implications. For the benefit of others on this thread, for a
> simple explanation of PEP-517, see [3].
> >
> > I noticed that when several other projects switched to using PEP-517
> (numpy, pyarrow, it impacted the pipeline submission experience for Beam
> users. To make sure that the pipeline execution environment has necessary
> pipeline dependencies, Beam downloads distributions for packages (and their
> transitive dependencies) that users supply in requirements.txt [4], and
> stages them to the execution environment[5]. Specifically, Beam downloads
> and stages source distributions (aka sdists) of the packages. We download
> sdists, since the target execution platform on the runner is not known to
> the SDK, so staging binary distributions (aka bdists, wheels) was never
> supported (BEAM-4032 [6]). It turns out that downloading a source
> distribution of a library from PyPI via `python -m pip download --dest /tmp
> pyarrow==5.0.0 --no-deps --no-binary :all:` makes pip also do a metadata
> check to verify that the contents of the sdist that was downloaded is
> indeed the package that was requested [7]. For libraries packaged with
> setuptools / prior to PEP-517, this involved a call `python setup.py
> egg_info` [8] , which is fairly fast. However, for libraries packaged using
> pyproject.toml/PEP-517, getting the package metadata involves installing
> build dependencies, and potentially building a wheel and extracting the
> metadata from a wheel. If a library has c extensions, this step involves
> compilation, can be quite slow and require  dev header packages to be
> installed on the OS. I am not sure whether the slowdown happens on every
> project (with extensions) that adopts PEP-517, or only if the project is
> somehow misconfigured, asked on [9]. So far I observed such a slowdown with
> numpy and pyarrow[10].
> >
> > Users have brought this up to the pip maintainers. For various reasons,
> this has proved not easy to address, and long term solutions are still in
> the works [11, 12, 13].
> >
> > What does it mean for Beam:
> >
> > 1) With more Python packages adopting PEP-517, users may be getting
> affected by slow pipeline submission time if they require certain packages
> in requirements.txt that take a long time to download+build.
> > 2) There is a possibility that adopting PEP-517 in Beam will increase
> pipeline submission time due to the slowness of the pip download command,
> because we download Beam SDK and stage it to the runner during job
> submission[14]. Given we don't know the target platform for sure, we stage
> both an sdist and a bdist [14]. The platform selected for bdist matches the
> platform of Beam's default containers, but with custom containers we can't
> guarantee the target platform will always match a predefined default, so we
> also pass an sdist.
> > 3) Users sometimes include apache-beam into requirements.txt of users'
> pipelines. Although not necessary, this contributes to slowdowns because
> numpy and pyarrow are Beam's dependencies, and they end up being downloaded.
> >
> > I am not against using PEP-517, it seems to have been written for good
> reason, but we should prevent slowdown in pipeline submissions.
> >
> > I am curious what the community thinks about ways to adopt it. Possible
> avenues:
> >
> > - Provide some information to the SDK about the runner's platform at
> pipeline submission, and stage only binary packages to the runner whenever
> possible [6]. Pip is only slow to download sdists, downloading bdists is
> fast. Also installing bdists on the runner would be much faster.
> > - Rethink dependency staging. Avoid staging the SDK, and/or dependency
> package sdists via container prebuilding workflow[15] by default for all
> runners that use containers. During prebuilding, install packages directly
> on the containers with `pip install`, and avoid `pip download` step. Filter
> out apache_beam from  requirements.txt file when users add it there.
> > - Wait until solutions for [11] become available or get involved to help
> move that forward.
> > - Switch to download sdists over a HTTP instead of `pip download` [16],
> fall back to pip download if not successful.
> > - There may be some way to configure PEP-517 properly, that avoids the
> slowdown, asked on [9].
> >
> > Thanks,
> > Valentyn
> >
> > [1]
> https://pypi.org/project/numpy/#copy-hash-modal-89500de9-70e8-4e7e-87f5-12adf0808905
> > [2] https://github.com/numpy/numpy/blob/main/pyproject.toml
> > [3] https://snarky.ca/clarifying-pep-518
> > [4]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L640
> > [5]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L238
> > [6] https://issues.apache.org/jira/browse/BEAM-4032
> > [7] https://github.com/pypa/pip/issues/1884#issuecomment-670364453
> > [8] https://github.com/pypa/pip/issues/8387#issuecomment-638118900
> > [9] https://github.com/pypa/pip/issues/10195#issuecomment-941811201
> > [10] https://github.com/numpy/numpy/pull/14053#issuecomment-637709988
> > [11] https://github.com/pypa/pip/issues/1884
> > [12]
> https://discuss.python.org/t/pep-625-file-name-of-a-source-distribution/4686
> > [13] https://github.com/pypa/pip/issues/10195
> > [14]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L715
> > [15]
> https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/options/pipeline_options.py#L1096
> > [16] https://github.com/pypa/pip/issues/1884#issuecomment-800483766
> >
> >
> >
> >
> > On Mon, Oct 11, 2021 at 8:51 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>
> >> That's fine by me. The only advantage I can think of for .zip is that
> >> it's (generally) better supported on Windows, but as far as I know
> >> .tar.gz works on Windows just fine for python package distribution.
> >>
> >> On Mon, Oct 11, 2021 at 5:09 AM Yoshiki Obata <yo...@gmail.com>
> wrote:
> >> >
> >> > Hello everyone,
> >> >
> >> > I'm working on BEAM-8954[1] which introduces tox isolated_build for
> >> > python tests.
> >> > Concerning this issue, I want opinions about using .tar.gz as sdist
> format.
> >> >
> >> > Introducing tox isolated_build leads replacement of
> >> > build-requirements.txt to pyproject.toml[2] and we should use
> >> > pyproject.toml when creating sdist because we install dependencies
> >> > with build-requirements.txt before calling "python setup.py sdist"
> >> > PEP 517 based build tools like pypa/build will help to do so, but it
> >> > does not allow .zip as sdist format[3].
> >> > Therefore I think it would be better to switch sdist format to .tar.gz
> >> > when starting to use pyproject.toml.
> >> >
> >> > Are there any obstacles to use .tar.gz?
> >> > Please let me know details about adopting .zip as Beam sdist format(I
> >> > could not find discussions about this)
> >> >
> >> > Regards,
> >> > yoshiki
> >> >
> >> > [1] https://issues.apache.org/jira/browse/BEAM-8954
> >> > [2]
> https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
> >> > [3] https://www.python.org/dev/peps/pep-0517/#source-distributions
>

Re: Switching Python sdist format

Posted by Yoshiki Obata <yo...@gmail.com>.
Thank you for your reply, Robert and Valentyn.

The concerns about slow pipeline submission are very helpful; I hadn't
taken that into account.
It seems that using .tar.gz as the sdist format would be fine, but
introducing PEP 517 to Beam could have a negative impact on users, so
it needs more consideration.

I will investigate further and post an update if I find a good solution.

Thanks,
yoshiki

On Wed, Oct 13, 2021 at 10:59 AM Valentyn Tymofieiev
<va...@google.com> wrote:
>
> Hi Yoshiki,
>
> it should be fine to use tar.gz. There may be some Beam release-related scripts we need to update that expect a .zip. Also I noticed that numpy uses .zip for its sdist[1], and they also use pyproject.toml[2], so I don't know if the requirement to use .tar.gz sdist in PEP-517 is a hard requirement, but we can ask try to confirm. Perhaps it matters when some particular hooks of PEP-517 are defined.
>
> I would like to discuss further your suggestion to use PEP-517 in Beam and potential implications. For the benefit of others on this thread, for a simple explanation of PEP-517, see [3].
>
> I noticed that when several other projects switched to using PEP-517 (numpy, pyarrow, it impacted the pipeline submission experience for Beam users. To make sure that the pipeline execution environment has necessary pipeline dependencies, Beam downloads distributions for packages (and their transitive dependencies) that users supply in requirements.txt [4], and stages them to the execution environment[5]. Specifically, Beam downloads and stages source distributions (aka sdists) of the packages. We download sdists, since the target execution platform on the runner is not known to the SDK, so staging binary distributions (aka bdists, wheels) was never supported (BEAM-4032 [6]). It turns out that downloading a source distribution of a library from PyPI via `python -m pip download --dest /tmp pyarrow==5.0.0 --no-deps --no-binary :all:` makes pip also do a metadata check to verify that the contents of the sdist that was downloaded is indeed the package that was requested [7]. For libraries packaged with setuptools / prior to PEP-517, this involved a call `python setup.py egg_info` [8] , which is fairly fast. However, for libraries packaged using pyproject.toml/PEP-517, getting the package metadata involves installing build dependencies, and potentially building a wheel and extracting the metadata from a wheel. If a library has c extensions, this step involves compilation, can be quite slow and require  dev header packages to be installed on the OS. I am not sure whether the slowdown happens on every  project (with extensions) that adopts PEP-517, or only if the project is somehow misconfigured, asked on [9]. So far I observed such a slowdown with numpy and pyarrow[10].
>
> Users have brought this up to the pip maintainers. For various reasons, this has proved not easy to address, and long term solutions are still in the works [11, 12, 13].
>
> What does it mean for Beam:
>
> 1) With more Python packages adopting PEP-517, users may be getting affected by slow pipeline submission time if they require certain packages in requirements.txt that take a long time to download+build.
> 2) There is a possibility that adopting PEP-517 in Beam will increase pipeline submission time due to the slowness of the pip download command, because we download Beam SDK and stage it to the runner during job submission[14]. Given we don't know the target platform for sure, we stage both an sdist and a bdist [14]. The platform selected for bdist matches the platform of Beam's default containers, but with custom containers we can't guarantee the target platform will always match a predefined default, so we also pass an sdist.
> 3) Users sometimes include apache-beam into requirements.txt of users' pipelines. Although not necessary, this contributes to slowdowns because numpy and pyarrow are Beam's dependencies, and they end up being downloaded.
>
> I am not against using PEP-517, it seems to have been written for good reason, but we should prevent slowdown in pipeline submissions.
>
> I am curious what the community thinks about ways to adopt it. Possible avenues:
>
> - Provide some information to the SDK about the runner's platform at pipeline submission, and stage only binary packages to the runner whenever possible [6]. Pip is only slow to download sdists, downloading bdists is fast. Also installing bdists on the runner would be much faster.
> - Rethink dependency staging. Avoid staging the SDK, and/or dependency package sdists via container prebuilding workflow[15] by default for all runners that use containers. During prebuilding, install packages directly on the containers with `pip install`, and avoid `pip download` step. Filter out apache_beam from  requirements.txt file when users add it there.
> - Wait until solutions for [11] become available or get involved to help move that forward.
> - Switch to download sdists over a HTTP instead of `pip download` [16], fall back to pip download if not successful.
> - There may be some way to configure PEP-517 properly, that avoids the slowdown, asked on [9].
>
> Thanks,
> Valentyn
>
> [1] https://pypi.org/project/numpy/#copy-hash-modal-89500de9-70e8-4e7e-87f5-12adf0808905
> [2] https://github.com/numpy/numpy/blob/main/pyproject.toml
> [3] https://snarky.ca/clarifying-pep-518
> [4] https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L640
> [5] https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L238
> [6] https://issues.apache.org/jira/browse/BEAM-4032
> [7] https://github.com/pypa/pip/issues/1884#issuecomment-670364453
> [8] https://github.com/pypa/pip/issues/8387#issuecomment-638118900
> [9] https://github.com/pypa/pip/issues/10195#issuecomment-941811201
> [10] https://github.com/numpy/numpy/pull/14053#issuecomment-637709988
> [11] https://github.com/pypa/pip/issues/1884
> [12] https://discuss.python.org/t/pep-625-file-name-of-a-source-distribution/4686
> [13] https://github.com/pypa/pip/issues/10195
> [14] https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L715
> [15] https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/options/pipeline_options.py#L1096
> [16] https://github.com/pypa/pip/issues/1884#issuecomment-800483766
>
>
>
>
> On Mon, Oct 11, 2021 at 8:51 AM Robert Bradshaw <ro...@google.com> wrote:
>>
>> That's fine by me. The only advantage I can think of for .zip is that
>> it's (generally) better supported on Windows, but as far as I know
>> .tar.gz works on Windows just fine for python package distribution.
>>
>> On Mon, Oct 11, 2021 at 5:09 AM Yoshiki Obata <yo...@gmail.com> wrote:
>> >
>> > Hello everyone,
>> >
>> > I'm working on BEAM-8954[1] which introduces tox isolated_build for
>> > python tests.
>> > Concerning this issue, I want opinions about using .tar.gz as sdist format.
>> >
>> > Introducing tox isolated_build leads replacement of
>> > build-requirements.txt to pyproject.toml[2] and we should use
>> > pyproject.toml when creating sdist because we install dependencies
>> > with build-requirements.txt before calling "python setup.py sdist"
>> > PEP 517 based build tools like pypa/build will help to do so, but it
>> > does not allow .zip as sdist format[3].
>> > Therefore I think it would be better to switch sdist format to .tar.gz
>> > when starting to use pyproject.toml.
>> >
>> > Are there any obstacles to use .tar.gz?
>> > Please let me know details about adopting .zip as Beam sdist format(I
>> > could not find discussions about this)
>> >
>> > Regards,
>> > yoshiki
>> >
>> > [1] https://issues.apache.org/jira/browse/BEAM-8954
>> > [2] https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
>> > [3] https://www.python.org/dev/peps/pep-0517/#source-distributions

Re: Switching Python sdist format

Posted by Valentyn Tymofieiev <va...@google.com>.
Hi Yoshiki,

it should be fine to use tar.gz. There may be some Beam release-related
scripts that expect a .zip and need to be updated. Also I noticed that
numpy uses .zip for its sdist[1], and they also use pyproject.toml[2],
so I don't know if the .tar.gz sdist requirement in PEP-517 is a hard
one, but we can try to confirm. Perhaps it only matters when certain
PEP-517 hooks are defined.

I would like to discuss further your suggestion to use PEP-517 in Beam and
potential implications. For the benefit of others on this thread, for a
simple explanation of PEP-517, see [3].

I noticed that when several other projects (numpy, pyarrow) switched to
using PEP-517, it impacted the pipeline submission experience for Beam
users. To make sure that the pipeline execution environment has the
necessary pipeline dependencies, Beam downloads distributions for
packages (and their transitive dependencies) that users supply in
requirements.txt [4], and stages them to the execution environment[5].
Specifically, Beam downloads and stages *source* distributions (aka
sdists) of the packages. We download sdists because the target
execution platform on the runner is not known to the SDK, so staging
binary distributions (aka bdists, wheels) was never supported
(BEAM-4032 [6]). It turns out that downloading a source distribution of
a library from PyPI via `python -m pip download --dest /tmp
pyarrow==5.0.0 --no-deps --no-binary :all:` makes pip also do a
metadata check to verify that the contents of the downloaded sdist are
indeed the package that was requested [7]. For libraries packaged with
setuptools (prior to PEP-517), this involved a call to `python setup.py
egg_info` [8], which is fairly fast. However, for libraries packaged
using pyproject.toml/PEP-517, getting the package metadata involves
installing build dependencies, and potentially building a wheel and
extracting the metadata from it. If a library has C extensions, this
step involves compilation, can be quite slow, and may require dev
header packages to be installed on the OS. I am not sure whether the
slowdown happens for every project (with extensions) that adopts
PEP-517, or only if the project is somehow misconfigured; I asked on
[9]. So far I have observed such a slowdown with numpy and pyarrow[10].
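
Part of why the bdist case is cheap is visible in the format itself: a
wheel encodes its name and version in the filename per PEP 427, so no
build step is needed to identify it, whereas an sdist's reliable
metadata may only be obtainable by building. A small sketch (the helper
name is hypothetical):

```python
# Hypothetical helper: wheel filenames follow PEP 427 --
# {name}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl
# (hyphens inside name/version are escaped to underscores), so name and
# version can be read without running any build backend.
def wheel_name_and_version(filename):
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: %s" % filename)
    name, version = filename[:-len(".whl")].split("-")[:2]
    return name, version
```

For example, `wheel_name_and_version("pyarrow-5.0.0-cp37-cp37m-manylinux2014_x86_64.whl")`
returns `("pyarrow", "5.0.0")` without touching a build backend.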

Users have brought this up to the pip maintainers. For various reasons,
this has proved not easy to address, and long term solutions are still in
the works [11, 12, 13].

What does this mean for Beam:

1) With more Python packages adopting PEP-517, users may be affected by
slow pipeline submission times if their requirements.txt includes
packages that take a long time to download and build.
2) There is a possibility that adopting PEP-517 in Beam itself will
increase pipeline submission time due to the slowness of the pip
download command, because we download the Beam SDK and stage it to the
runner during job submission[14]. Given that we don't know the target
platform for sure, we stage both an sdist and a bdist [14]. The
platform selected for the bdist matches the platform of Beam's default
containers, but with custom containers we can't guarantee the target
platform will always match a predefined default, so we also pass an
sdist.
3) Users sometimes include apache-beam in their pipelines'
requirements.txt. Although not necessary, this contributes to slowdowns
because numpy and pyarrow are Beam's dependencies, and they end up
being downloaded.

I am not against using PEP-517, it seems to have been written for good
reason, but we should prevent slowdown in pipeline submissions.

I am curious what the community thinks about ways to adopt it. Possible
avenues:

- Provide some information to the SDK about the runner's platform at
pipeline submission, and stage only binary packages to the runner
whenever possible [6]. Pip is only slow to download sdists; downloading
bdists is fast. Installing bdists on the runner would also be much
faster.
- Rethink dependency staging. Avoid staging the SDK and/or dependency
package sdists by using the container prebuilding workflow[15] by
default for all runners that use containers. During prebuilding,
install packages directly on the containers with `pip install`, and
avoid the `pip download` step. Filter out apache_beam from the
requirements.txt file when users add it there.
- Wait until solutions for [11] become available, or get involved to
help move that forward.
- Switch to downloading sdists over HTTP instead of `pip download`
[16], falling back to pip download if not successful.
- There may be some way to configure PEP-517 that avoids the slowdown;
asked on [9].
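
For the "filter out apache_beam" idea, here is a hedged sketch of what
such a filter could look like (the helper name is hypothetical; name
matching follows PEP 503 normalization):

```python
def drop_apache_beam(requirement_lines):
    """Drop apache-beam entries from requirements.txt lines (sketch)."""
    kept = []
    for line in requirement_lines:
        # Strip environment markers, then cut the project name off at
        # the first version specifier or extras bracket.
        name = line.strip().split(";")[0]
        for sep in ("==", ">=", "<=", "~=", "!=", ">", "<", "[", " "):
            name = name.split(sep)[0]
        # PEP 503 normalization: case-insensitive, '-' and '_' alike.
        if name.strip().lower().replace("_", "-") != "apache-beam":
            kept.append(line)
    return kept
```

This keeps comments and all other requirements untouched while dropping
spellings like `apache-beam==2.33.0` or `apache_beam[gcp]`.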

Thanks,
Valentyn

[1]
https://pypi.org/project/numpy/#copy-hash-modal-89500de9-70e8-4e7e-87f5-12adf0808905
[2] https://github.com/numpy/numpy/blob/main/pyproject.toml
[3] https://snarky.ca/clarifying-pep-518
[4]
https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L640
[5]
https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L238
[6] https://issues.apache.org/jira/browse/BEAM-4032
[7] https://github.com/pypa/pip/issues/1884#issuecomment-670364453
[8] https://github.com/pypa/pip/issues/8387#issuecomment-638118900
[9] https://github.com/pypa/pip/issues/10195#issuecomment-941811201
[10] https://github.com/numpy/numpy/pull/14053#issuecomment-637709988
[11] https://github.com/pypa/pip/issues/1884
[12]
https://discuss.python.org/t/pep-625-file-name-of-a-source-distribution/4686
[13] https://github.com/pypa/pip/issues/10195
[14]
https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/runners/portability/stager.py#L715
[15]
https://github.com/apache/beam/blob/1ce290bab031192c22f643cac92bd6470788798d/sdks/python/apache_beam/options/pipeline_options.py#L1096
[16] https://github.com/pypa/pip/issues/1884#issuecomment-800483766




On Mon, Oct 11, 2021 at 8:51 AM Robert Bradshaw <ro...@google.com> wrote:

> That's fine by me. The only advantage I can think of for .zip is that
> it's (generally) better supported on Windows, but as far as I know
> .tar.gz works on Windows just fine for python package distribution.
>
> On Mon, Oct 11, 2021 at 5:09 AM Yoshiki Obata <yo...@gmail.com>
> wrote:
> >
> > Hello everyone,
> >
> > I'm working on BEAM-8954[1] which introduces tox isolated_build for
> > python tests.
> > Concerning this issue, I want opinions about using .tar.gz as sdist
> format.
> >
> > Introducing tox isolated_build leads replacement of
> > build-requirements.txt to pyproject.toml[2] and we should use
> > pyproject.toml when creating sdist because we install dependencies
> > with build-requirements.txt before calling "python setup.py sdist"
> > PEP 517 based build tools like pypa/build will help to do so, but it
> > does not allow .zip as sdist format[3].
> > Therefore I think it would be better to switch sdist format to .tar.gz
> > when starting to use pyproject.toml.
> >
> > Are there any obstacles to use .tar.gz?
> > Please let me know details about adopting .zip as Beam sdist format(I
> > could not find discussions about this)
> >
> > Regards,
> > yoshiki
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-8954
> > [2]
> https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
> > [3] https://www.python.org/dev/peps/pep-0517/#source-distributions
>

Re: Switching Python sdist format

Posted by Robert Bradshaw <ro...@google.com>.
That's fine by me. The only advantage I can think of for .zip is that
it's (generally) better supported on Windows, but as far as I know
.tar.gz works on Windows just fine for python package distribution.

On Mon, Oct 11, 2021 at 5:09 AM Yoshiki Obata <yo...@gmail.com> wrote:
>
> Hello everyone,
>
> I'm working on BEAM-8954[1] which introduces tox isolated_build for
> python tests.
> Concerning this issue, I want opinions about using .tar.gz as sdist format.
>
> Introducing tox isolated_build leads replacement of
> build-requirements.txt to pyproject.toml[2] and we should use
> pyproject.toml when creating sdist because we install dependencies
> with build-requirements.txt before calling "python setup.py sdist"
> PEP 517 based build tools like pypa/build will help to do so, but it
> does not allow .zip as sdist format[3].
> Therefore I think it would be better to switch sdist format to .tar.gz
> when starting to use pyproject.toml.
>
> Are there any obstacles to use .tar.gz?
> Please let me know details about adopting .zip as Beam sdist format(I
> could not find discussions about this)
>
> Regards,
> yoshiki
>
> [1] https://issues.apache.org/jira/browse/BEAM-8954
> [2] https://tox.wiki/en/latest/config.html?highlight=isolated#conf-isolated_build
> [3] https://www.python.org/dev/peps/pep-0517/#source-distributions