Posted to dev@beam.apache.org by Udi Meiri <eh...@google.com> on 2019/11/27 19:54:18 UTC

Python staging file weirdness

I was investigating a Dataflow postcommit test failure (endpoints_pb2
missing), and saw this in the staging directory:

$ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz


Does anyone know why so many versions of setuptools need to be staged?
Shouldn't 1 be enough?
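
One quick way to spot this kind of duplication in a listing like the one above is to group the staged archive names by package. The following is only an illustrative sketch: `versions_by_package` is a hypothetical helper, not part of Beam, and it assumes sdist-style file names of the form `<name>-<version>.tar.gz` or `<name>-<version>.zip`.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <name>-<version>.tar.gz or <name>-<version>.zip
_ARCHIVE = re.compile(r'^(?P<name>.+)-(?P<version>\d[^-]*?)\.(?:tar\.gz|zip)$')


def versions_by_package(paths):
    """Return {package: [versions]} for packages staged in more than one version."""
    found = defaultdict(list)
    for path in paths:
        # Works for both gs:// URLs and local paths that use '/' separators.
        filename = path.rsplit('/', 1)[-1]
        match = _ARCHIVE.match(filename)
        if match:
            found[match.group('name')].append(match.group('version'))
    # Keep only packages that appear with multiple versions.
    return {name: vers for name, vers in found.items() if len(vers) > 1}
```

Feeding it the lines printed by `gsutil ls` should report setuptools and six as the packages with more than one staged version; files such as pipeline.pb or dataflow-worker.jar are ignored because they do not match the assumed pattern.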

Re: Python staging file weirdness

Posted by Valentyn Tymofieiev <va...@google.com>.
Filed https://issues.apache.org/jira/browse/BEAM-8900 to address the
inefficiency discussed here. Thanks everyone.


Re: Python staging file weirdness

Posted by Valentyn Tymofieiev <va...@google.com>.
Note that so far we have not been staging wheels, since the SDK has no
knowledge of the target platform, but there is
https://issues.apache.org/jira/browse/BEAM-4032 to add this support.


Re: Python staging file weirdness

Posted by Chad Dombrova <ch...@gmail.com>.
On Thu, Dec 5, 2019 at 12:36 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> Ah nice, so then the workflow would be: download [missing] deps from PyPI
> into a long-lived cache directory, then download (i.e. copy) the same deps
> into a short-lived temporary directory, using the long-lived cache directory
> as the source of truth (SoT), then stage files from the short-lived temporary
> directory and clean it up. Is that what you are suggesting, Chad?
>
Yes, I just did a quick test to confirm:

# download or build wheels of anything that's missing from the cache
# note: we're including gcp extras:
pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache
# copy some of those wheels somewhere else
# note: we're excluding gcp extras
pip download apache_beam==2.16 --no-binary --find-links=/tmp/wheel-cache --dest /tmp/wheel-dest/
# rerun to confirm that cached wheels are being re-used instead of downloaded from pypi
pip wheel apache_beam[gcp]==2.16 --wheel-dir /tmp/wheel-cache

/tmp/wheel-dest/ will now have a subset of the deps from /tmp/wheel-cache,
excluding the gcp extras.

Note that for some reason the equal sign after --find-links is required, at
least for me on pip 19.1.1. Using a space resulted in an error.

-chad

Re: Python staging file weirdness

Posted by Valentyn Tymofieiev <va...@google.com>.
Ah nice, so then the workflow would be: download [missing] deps from PyPI
into a long-lived cache directory, then download (i.e. copy) the same deps
into a short-lived temporary directory, using the long-lived cache directory
as the source of truth (SoT), then stage files from the short-lived temporary
directory and clean it up. Is that what you are suggesting, Chad?
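
The workflow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the stager's actual code: `stage_from_cache`, `needed_files` and `stage_fn` are hypothetical names, the list of needed files is assumed to be known already (for example from the dependency resolution step), and a plain file copy stands in for the `pip download` based copy.

```python
import os
import shutil
import tempfile


def stage_from_cache(cache_dir, needed_files, stage_fn):
    """Copy only the needed files out of a long-lived cache, stage them, clean up.

    cache_dir:    long-lived directory that accumulates downloaded packages.
    needed_files: file names required by this particular job (assumed known).
    stage_fn:     hypothetical callable that uploads one local file path.
    """
    # Short-lived directory that holds exactly what this job needs.
    tmp_dir = tempfile.mkdtemp(prefix='requirements-stage-')
    try:
        staged = []
        for name in sorted(needed_files):
            src = os.path.join(cache_dir, name)
            if not os.path.isfile(src):
                raise FileNotFoundError('not in cache: %s' % name)
            dst = os.path.join(tmp_dir, name)
            shutil.copy2(src, dst)
            stage_fn(dst)
            staged.append(name)
        return staged
    finally:
        # The cache stays; only the per-job directory is removed.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

The point of the split is that the cache can keep growing across runs without affecting what gets staged, since only the temporary directory's contents are ever uploaded.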

Re: Python staging file weirdness

Posted by Chad Dombrova <ch...@gmail.com>.
Another way to copy only the deps you care about is to use `pip download`
to do the copy. I believe you can pass the cache dir to `pip download
--find-links` and it will read from there before reading from PyPI (you may
also need to set --wheel-dir to the cache dir), so it acts as a simple copy.

-chad


On Thu, Dec 5, 2019 at 12:07 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> Looked for a bit at the pip download command. The alternative seems to be
> to parse the output of
>
> python -m pip download --dest . -r requirements.txt --exists-action i --no-binary :all:
>
> and see which files were downloaded and/or skipped since they were already
> present, and then stage only the files that appear in the log output. Seems
> doable but may break if pip output changes between pip implementations, so
> we'd have to add a test as well.
>
> On Thu, Dec 5, 2019 at 11:10 AM Luke Cwik <lc...@google.com> wrote:
>
>> I think reusing the same cache directory makes sense during downloading
>> but why do we upload everything that is there?
>>
>> On Thu, Dec 5, 2019 at 9:24 AM Udi Meiri <eh...@google.com> wrote:
>>
>>> Looking at the source, it seems that it should be using a
>>> os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
>>> to create a different tmp directory on each run.
>>>
>>> Also, sampling worker no. 2:
>>>
>>> jenkins@apache-beam-jenkins-2:~$ ls -l /tmp/dataflow-requirements-cache/
>>> total 7172
>>> -rw-rw-r-- 1 jenkins jenkins  27947 Sep  6 22:46 funcsigs-1.0.2.tar.gz
>>> -rw-rw-r-- 1 jenkins jenkins  28126 Sep  6 21:38 mock-3.0.5.tar.gz
>>> -rw-rw-r-- 1 jenkins jenkins 376623 Sep  6 21:38 PyHamcrest-1.9.0.tar.gz
>>> -rw-rw-r-- 1 jenkins jenkins 851251 Sep  6 21:38 setuptools-41.2.0.zip
>>> -rw-rw-r-- 1 jenkins jenkins 855608 Oct  7 06:03 setuptools-41.4.0.zip
>>> -rw-rw-r-- 1 jenkins jenkins 851068 Oct 28 06:10 setuptools-41.5.0.zip
>>> -rw-rw-r-- 1 jenkins jenkins 851097 Oct 28 19:46 setuptools-41.5.1.zip
>>> -rw-rw-r-- 1 jenkins jenkins 852541 Oct 29 14:06 setuptools-41.6.0.zip
>>> -rw-rw-r-- 1 jenkins jenkins 852125 Nov 24 08:10 setuptools-42.0.0.zip
>>> -rw-rw-r-- 1 jenkins jenkins 852264 Nov 25 20:55 setuptools-42.0.1.zip
>>> -rw-rw-r-- 1 jenkins jenkins 858444 Dec  1 18:12 setuptools-42.0.2.zip
>>> -rw-rw-r-- 1 jenkins jenkins  32725 Sep  6 21:38 six-1.12.0.tar.gz
>>> -rw-rw-r-- 1 jenkins jenkins  33726 Nov  5 19:18 six-1.13.0.tar.gz
>>>
>>>
>>> On Wed, Dec 4, 2019 at 8:00 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> Can we filter the cache directory only for the artifacts that we want
>>>> and not everything that is there?
>>>>
>>>> On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev <va...@google.com>
>>>> wrote:
>>>>
>>>>> Luke, I am not sure I understand the question. The caching that
>>>>> happens here is implemented in the SDK for requirements packages:
>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>>>
>>>>>
>>>>> On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> Is there a way to use a cache on disk that is separate from the set
>>>>>> of packages we use as requirements?
>>>>>>
>>>>>> On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri <eh...@google.com> wrote:
>>>>>>
>>>>>>> Thanks!
>>>>>>> Another reason to periodically refresh workers.
>>>>>>>
>>>>>>> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <
>>>>>>> valentyn@google.com> wrote:
>>>>>>>
>>>>>>>> The test jobs specify [1] a requirements.txt file that contains two
>>>>>>>> entries: pyhamcrest, mock.
>>>>>>>>
>>>>>>>> We download [2] the sources of the packages specified in the
>>>>>>>> requirements file, and of the packages they depend on. While doing so,
>>>>>>>> it appears that we use a cache directory on Jenkins to store the sources
>>>>>>>> of the packages [3], perhaps to save a trip to PyPI and reduce PyPI
>>>>>>>> flakiness? Then we stage the entire cache directory [4], which includes
>>>>>>>> all packages ever cached. Over time the versions that our requirements
>>>>>>>> packages need change, but I guess we don't clean the cache on Jenkins
>>>>>>>> workers.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
>>>>>>>> [2]
>>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
>>>>>>>> [3]
>>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>>>>>>
>>>>>>>> [4]
>>>>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>>>>>>>>

Re: Python staging file weirdness

Posted by Valentyn Tymofieiev <va...@google.com>.
Looked for a bit at the pip download command. The alternative seems to be to
parse the output of

python -m pip download --dest . -r requirements.txt --exists-action i --no-binary :all:

and see which files were downloaded and/or skipped since they were already
present, and then stage only the files that appear in the log output. Seems
doable but may break if pip output changes between pip implementations, so
we'd have to add a test as well.
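
A rough sketch of what such log parsing could look like is below. The two line patterns ("Saved <path>" and "File was already downloaded <path>") are assumptions about what pip prints and would need to be checked against the pip versions in use, which is exactly the fragility mentioned above; `files_from_pip_log` is a hypothetical helper name.

```python
import os
import re

# ASSUMPTION: pip logs one of these lines per file it fetched or found on disk.
# The exact wording is not guaranteed and may differ between pip versions.
_SAVED = re.compile(r'^\s*Saved (?P<path>\S+)\s*$')
_SKIPPED = re.compile(r'^\s*File was already downloaded (?P<path>\S+)\s*$')


def files_from_pip_log(log_text):
    """Return the archive file names that a `pip download` run reported using."""
    names = []
    for line in log_text.splitlines():
        match = _SAVED.match(line) or _SKIPPED.match(line)
        if match:
            name = os.path.basename(match.group('path'))
            if name not in names:  # keep first-seen order, drop repeats
                names.append(name)
    return names
```

Only the names returned here would then be staged from the download directory, and a test pinned to a captured log would flag any change in the output format.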


Re: Python staging file weirdness

Posted by Luke Cwik <lc...@google.com>.
I think reusing the same cache directory makes sense during downloading but
why do we upload everything that is there?


Re: Python staging file weirdness

Posted by Udi Meiri <eh...@google.com>.
Looking at the source, it seems that it should be using a
os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')
to create a different tmp directory on each run.

Also, sampling worker no. 2:

jenkins@apache-beam-jenkins-2:~$ ls -l /tmp/dataflow-requirements-cache/
total 7172
-rw-rw-r-- 1 jenkins jenkins  27947 Sep  6 22:46 funcsigs-1.0.2.tar.gz
-rw-rw-r-- 1 jenkins jenkins  28126 Sep  6 21:38 mock-3.0.5.tar.gz
-rw-rw-r-- 1 jenkins jenkins 376623 Sep  6 21:38 PyHamcrest-1.9.0.tar.gz
-rw-rw-r-- 1 jenkins jenkins 851251 Sep  6 21:38 setuptools-41.2.0.zip
-rw-rw-r-- 1 jenkins jenkins 855608 Oct  7 06:03 setuptools-41.4.0.zip
-rw-rw-r-- 1 jenkins jenkins 851068 Oct 28 06:10 setuptools-41.5.0.zip
-rw-rw-r-- 1 jenkins jenkins 851097 Oct 28 19:46 setuptools-41.5.1.zip
-rw-rw-r-- 1 jenkins jenkins 852541 Oct 29 14:06 setuptools-41.6.0.zip
-rw-rw-r-- 1 jenkins jenkins 852125 Nov 24 08:10 setuptools-42.0.0.zip
-rw-rw-r-- 1 jenkins jenkins 852264 Nov 25 20:55 setuptools-42.0.1.zip
-rw-rw-r-- 1 jenkins jenkins 858444 Dec  1 18:12 setuptools-42.0.2.zip
-rw-rw-r-- 1 jenkins jenkins  32725 Sep  6 21:38 six-1.12.0.tar.gz
-rw-rw-r-- 1 jenkins jenkins  33726 Nov  5 19:18 six-1.13.0.tar.gz
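
For reference, a minimal sketch of how the two ways of picking a directory differ, which is consistent with files dating back to September still being present in the listing above. Joining `tempfile.gettempdir()` with a fixed name yields the same path on every call (as long as the temp-dir setting is unchanged), while `tempfile.mkdtemp()` creates a fresh directory each time. The function names `shared_cache_dir` and `per_run_dir` are hypothetical and not taken from stager.py.

```python
import os
import tempfile


def shared_cache_dir():
    """Fixed path under the system temp dir; identical across runs."""
    return os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache')


def per_run_dir():
    """Newly created, uniquely named directory; different on every call."""
    return tempfile.mkdtemp(prefix='dataflow-requirements-')
```

With the first form, anything downloaded in earlier runs is still there on the next run unless something cleans it up; with the second, each run starts from an empty directory and has to remove it afterwards.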


On Wed, Dec 4, 2019 at 8:00 PM Luke Cwik <lc...@google.com> wrote:

> Can we filter the cache directory only for the artifacts that we want and
> not everything that is there?
>
> On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> Luke, I am not sure I understand the question. The caching that happens
>> here is implemented in the SDK for requirements packages:
>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>
>>
>> On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> Is there a way to use a cache on disk that is separate from the set of
>>> packages we use as requirements?
>>>
>>> On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri <eh...@google.com> wrote:
>>>
>>>> Thanks!
>>>> Another reason to periodically refresh workers.
>>>>
>>>> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <
>>>> valentyn@google.com> wrote:
>>>>
>>>>> The test jobs specify [1] a requirements.txt file that contains two
>>>>> entries: pyhamcrest, mock.
>>>>>
>>>>> We download [2] the sources of the packages specified in the requirements
>>>>> file, and of the packages they depend on. While doing so, it appears that
>>>>> we use a cache directory on Jenkins to store the sources of the packages
>>>>> [3], perhaps to save a trip to PyPI and reduce PyPI flakiness? Then we
>>>>> stage the entire cache directory [4], which includes all packages ever
>>>>> cached. Over time the versions that our requirements packages need change,
>>>>> but I guess we don't clean the cache on Jenkins workers.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
>>>>> [2]
>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
>>>>> [3]
>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>>>
>>>>> [4]
>>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>>>>>
>>>>> On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri <eh...@google.com> wrote:
>>>>>
>>>>>> I was investigating a Dataflow postcommit test failure (endpoints_pb2
>>>>>> missing), and saw this in the staging directory:
>>>>>>
>>>>>> $ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
>>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>>>>>>
>>>>>>
>>>>>> Does anyone know why so many versions of setuptools need to be
>>>>>> staged? Shouldn't 1 be enough?
>>>>>>
>>>>>

Re: Python staging file weirdness

Posted by Luke Cwik <lc...@google.com>.
Can we filter the cache directory only for the artifacts that we want and
not everything that is there?
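
To make the question concrete, here is a minimal sketch of what such a filter
could look like. It is not the stager's actual code: the helper names
(parse_sdist_name, select_artifacts), the pinned versions in the demo, and the
assumption that cached files are named "name-version.tar.gz" or
"name-version.zip" are all hypothetical and only for illustration.

```python
import os
import tempfile

# Assumption: cached source distributions use one of these suffixes.
SDIST_SUFFIXES = (".tar.gz", ".zip")


def parse_sdist_name(filename):
    """Split 'name-version.ext' into (lowercased name, version), else None."""
    for suffix in SDIST_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[:-len(suffix)]
            name, sep, version = stem.rpartition("-")
            if sep and name and version:
                return name.lower(), version
    return None


def select_artifacts(cache_dir, wanted):
    """Return the cached files whose (name, version) is pinned in `wanted`.

    `wanted` maps a lowercased package name to the single version to stage.
    """
    selected = []
    for filename in sorted(os.listdir(cache_dir)):
        parsed = parse_sdist_name(filename)
        if parsed and wanted.get(parsed[0]) == parsed[1]:
            selected.append(filename)
    return selected


def demo():
    """Build a throwaway cache with stale versions and filter it."""
    cached = [
        "PyHamcrest-1.9.0.tar.gz",
        "mock-3.0.5.tar.gz",
        "setuptools-41.2.0.zip",
        "setuptools-42.0.1.zip",
        "six-1.12.0.tar.gz",
        "six-1.13.0.tar.gz",
    ]
    # Hypothetical pins; the thread does not say which versions are current.
    wanted = {
        "pyhamcrest": "1.9.0",
        "mock": "3.0.5",
        "setuptools": "42.0.1",
        "six": "1.13.0",
    }
    with tempfile.TemporaryDirectory() as cache_dir:
        for filename in cached:
            open(os.path.join(cache_dir, filename), "w").close()
        return select_artifacts(cache_dir, wanted)


if __name__ == "__main__":
    print(demo())
```

The sketch assumes the set of resolved (name, version) pairs is available at
staging time, which the current download step may not expose.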

On Wed, Dec 4, 2019 at 6:56 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> Luke, I am not sure I understand the question. The caching that happens
> here is implemented in the SDK for requirements packages:
> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>
>
> On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik <lc...@google.com> wrote:
>
>> Is there a way to use a cache on disk that is separate from the set of
>> packages we use as requirements?
>>
>> On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri <eh...@google.com> wrote:
>>
>>> Thanks!
>>> Another reason to periodically refresh workers.
>>>
>>> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <
>>> valentyn@google.com> wrote:
>>>
>>>> The test jobs specify[1] a requirements.txt file that contains two entries:
>>>> pyhamcrest and mock.
>>>>
>>>> We download[2] the sources of the packages specified in the requirements
>>>> file, and of the packages they depend on. While doing so, it appears that we
>>>> use a cache directory on Jenkins to store the package sources[3], perhaps to
>>>> save a trip to PyPI and reduce PyPI flakiness. Then we stage the entire
>>>> cache directory[4], which includes all packages ever cached. Over time, the
>>>> versions that our requirements packages need change, but I guess we don't
>>>> clean the cache on Jenkins workers.
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
>>>> [2]
>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
>>>> [3]
>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>>
>>>> [4]
>>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>>>>
>>>> On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri <eh...@google.com> wrote:
>>>>
>>>>> I was investigating a Dataflow postcommit test failure (endpoints_pb2
>>>>> missing), and saw this in the staging directory:
>>>>>
>>>>> $ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
>>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>>>>>
>>>>>
>>>>> Does anyone know why so many versions of setuptools need to be staged?
>>>>> Shouldn't 1 be enough?
>>>>>
>>>>

Re: Python staging file weirdness

Posted by Valentyn Tymofieiev <va...@google.com>.
Luke, I am not sure I understand the question. The caching that happens
here is implemented in the SDK for requirements packages:
https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161


On Wed, Dec 4, 2019 at 6:19 PM Luke Cwik <lc...@google.com> wrote:

> Is there a way to use a cache on disk that is separate from the set of
> packages we use as requirements?
>
> On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri <eh...@google.com> wrote:
>
>> Thanks!
>> Another reason to periodically refresh workers.
>>
>> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> The test jobs specify[1] a requirements.txt file that contains two entries:
>>> pyhamcrest and mock.
>>>
>>> We download[2] the sources of the packages specified in the requirements
>>> file, and of the packages they depend on. While doing so, it appears that we
>>> use a cache directory on Jenkins to store the package sources[3], perhaps to
>>> save a trip to PyPI and reduce PyPI flakiness. Then we stage the entire
>>> cache directory[4], which includes all packages ever cached. Over time, the
>>> versions that our requirements packages need change, but I guess we don't
>>> clean the cache on Jenkins workers.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
>>> [2]
>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
>>> [3]
>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>>
>>> [4]
>>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>>>
>>> On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri <eh...@google.com> wrote:
>>>
>>>> I was investigating a Dataflow postcommit test failure (endpoints_pb2
>>>> missing), and saw this in the staging directory:
>>>>
>>>> $ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
>>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>>>>
>>>>
>>>> Does anyone know why so many versions of setuptools need to be staged?
>>>> Shouldn't 1 be enough?
>>>>
>>>

Re: Python staging file weirdness

Posted by Luke Cwik <lc...@google.com>.
Is there a way to use a cache on disk that is separate from the set of
packages we use as requirements?

On Wed, Dec 4, 2019 at 5:58 PM Udi Meiri <eh...@google.com> wrote:

> Thanks!
> Another reason to periodically refresh workers.
>
> On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> The test jobs specify[1] a requirements.txt file that contains two entries:
>> pyhamcrest and mock.
>>
>> We download[2] the sources of the packages specified in the requirements
>> file, and of the packages they depend on. While doing so, it appears that we
>> use a cache directory on Jenkins to store the package sources[3], perhaps to
>> save a trip to PyPI and reduce PyPI flakiness. Then we stage the entire
>> cache directory[4], which includes all packages ever cached. Over time, the
>> versions that our requirements packages need change, but I guess we don't
>> clean the cache on Jenkins workers.
>>
>> [1]
>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
>> [2]
>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
>> [3]
>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>>
>> [4]
>> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>>
>> On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri <eh...@google.com> wrote:
>>
>>> I was investigating a Dataflow postcommit test failure (endpoints_pb2
>>> missing), and saw this in the staging directory:
>>>
>>> $ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
>>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>>>
>>>
>>> Does anyone know why so many versions of setuptools need to be staged?
>>> Shouldn't 1 be enough?
>>>
>>

Re: Python staging file weirdness

Posted by Udi Meiri <eh...@google.com>.
Thanks!
Another reason to periodically refresh workers.

On Wed, Nov 27, 2019 at 10:37 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> The test jobs specify[1] a requirements.txt file that contains two entries:
> pyhamcrest and mock.
>
> We download[2] the sources of the packages specified in the requirements
> file, and of the packages they depend on. While doing so, it appears that we
> use a cache directory on Jenkins to store the package sources[3], perhaps to
> save a trip to PyPI and reduce PyPI flakiness. Then we stage the entire
> cache directory[4], which includes all packages ever cached. Over time, the
> versions that our requirements packages need change, but I guess we don't
> clean the cache on Jenkins workers.
>
> [1]
> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
> [2]
> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
> [3]
> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161
>
> [4]
> https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
>
> On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri <eh...@google.com> wrote:
>
>> I was investigating a Dataflow postcommit test failure (endpoints_pb2
>> missing), and saw this in the staging directory:
>>
>> $ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
>> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>>
>>
>> Does anyone know why so many versions of setuptools need to be staged?
>> Shouldn't 1 be enough?
>>
>

Re: Python staging file weirdness

Posted by Valentyn Tymofieiev <va...@google.com>.
The test jobs specify[1] a requirements.txt file that contains two entries:
pyhamcrest and mock.

We download[2] the sources of the packages specified in the requirements
file, and of the packages they depend on. While doing so, it appears that we
use a cache directory on Jenkins to store the package sources[3], perhaps to
save a trip to PyPI and reduce PyPI flakiness. Then we stage the entire
cache directory[4], which includes all packages ever cached. Over time, the
versions that our requirements packages need change, but I guess we don't
clean the cache on Jenkins workers.

[1]
https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/scripts/run_integration_test.sh#L197
[2]
https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L469
[3]
https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L161

[4]
https://github.com/apache/beam/blob/438055c95116f4e6e419e5faa9c42f7d329c421c/sdks/python/apache_beam/runners/portability/stager.py#L172
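
As a rough illustration of the behavior described above (not the stager's
real code), the following sketch simulates a cache directory that persists
across runs, combined with a staging step that uploads everything in it. The
function names download_to_cache and stage_entire_cache, and the specific
versions used in each run, are made up for the example.

```python
import os
import tempfile


def download_to_cache(cache_dir, resolved_files):
    """Stand-in for the download step: adds files, never removes old ones."""
    for filename in resolved_files:
        path = os.path.join(cache_dir, filename)
        if not os.path.exists(path):
            open(path, "w").close()


def stage_entire_cache(cache_dir):
    """Stand-in for the staging step: picks up every file in the cache."""
    return sorted(os.listdir(cache_dir))


def simulate():
    """Run two jobs against the same long-lived cache directory."""
    with tempfile.TemporaryDirectory() as cache_dir:
        # An earlier run resolves an older setuptools (hypothetical choice).
        download_to_cache(
            cache_dir, ["mock-3.0.5.tar.gz", "setuptools-41.2.0.zip"])
        # A later run on the same worker resolves a newer setuptools.
        download_to_cache(
            cache_dir, ["mock-3.0.5.tar.gz", "setuptools-42.0.1.zip"])
        # Staging the whole directory uploads both setuptools versions.
        return stage_entire_cache(cache_dir)


if __name__ == "__main__":
    print(simulate())
```

Because nothing ever prunes the directory, each new upstream release adds one
more file to every subsequent staging upload from that worker.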

On Wed, Nov 27, 2019 at 11:55 AM Udi Meiri <eh...@google.com> wrote:

> I was investigating a Dataflow postcommit test failure (endpoints_pb2
> missing), and saw this in the staging directory:
>
> $ gsutil ls gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/PyHamcrest-1.9.0.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow-worker.jar
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/dataflow_python_sdk.tar
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/funcsigs-1.0.2.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/mock-3.0.5.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/pipeline.pb
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/requirements.txt
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.2.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.4.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.5.1.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-41.6.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.0.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/setuptools-42.0.1.zip
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.12.0.tar.gz
> gs://temp-storage-for-end-to-end-tests/staging-it/beamapp-jenkins-1126202146-314738.1574799706.314882/six-1.13.0.tar.gz
>
>
> Does anyone know why so many versions of setuptools need to be staged?
> Shouldn't 1 be enough?
>