You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Kyle Weaver <kc...@google.com> on 2019/11/06 18:37:44 UTC

Re: How to use a locally built worker image?

Bumping this thread from the other one [1].

> 1. Read sdk version from gradle.properties and use this as the default
tag. Done with Python, need to implement it with Java and Go.

100% agree with this one. Using the same tag for local and release images
has already caused a good deal of confusion. Filed BEAM-8570 and BEAM-8571
[2][3].

> 2. Remove pulling images before executing docker run command. This should
be fixed for Python, Java and Go.

Valentyn (from [1]):
> I think pulling the latest image for the current tag is actually a
desired behavior, in case the external image was updated (due to a bug fix
for example).

There's a PR for this [4]. Once we fix the default tag for Java/Go
containers, the dev and release containers will be distinct, which makes it
seldom important whether or not the image is `docker pull`ed. Anyway, I
agree with Thomas that implicitly running `docker pull` is confusing and
requires some adjustments to work around. The user can always run `docker
pull` themselves if that's the intention.

[1]
https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-8570
[3] https://issues.apache.org/jira/browse/BEAM-8571
[4] https://github.com/apache/beam/pull/9972

On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay <al...@google.com> wrote:

> I do not believe this is a blocker for Beam 2.16. I agree that it would be
> good to fix this.
>
> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang <ha...@verily.com>
> wrote:
>
>> Hi Thomas
>>
>> Thanks for bring this up.
>>
>> Now Python uses sdk version as a default tag, while Java and Go use
>> latest as a default tag. I agree using latest as a tag is problematic. The
>> reason only Python uses sdk version as a default tag is Python has
>> version.py so the version is easy to read. For Java and Go, we need to read
>> it from gradle.properties when creating images with the default tag and
>> when setting the default image.
>>
>> Here is what we need to do:
>> 1. Read sdk version from gradle.properties and use this as the default
>> tag. Done with Python, need to implement it with Java and Go.
>> 2. Remove pulling images before executing docker run command. This should
>> be fixed for Python, Java and Go.
>>
>> Is this a blocker for 2.16? If so and above are too much work for 2.16 at
>> the moment, we can hardcode the default tag for release branch for now.
>>
>> Using timestamp as a tag is an option as well, as long as runners know
>> which timestamp they should use.
>>
>> Hannah
>>
>> On Wed, Oct 2, 2019 at 10:13 AM Alan Myrvold <am...@google.com> wrote:
>>
>>> Yes, using the latest tag is problematic and can lead to unexpected
>>> behavior.
>>> Using a date/time or 2.17.0.dev-$USER tag would be better. The validates
>>> container shell script uses a datetime
>>> <https://github.com/apache/beam/blob/6551d0937ee31a8e310b63b222dbc750ec9331f8/sdks/python/container/run_validatescontainer.sh#L87>
>>> tag, which allows a unique name if no two tests are run in the same second.
>>>
>>> On Wed, Oct 2, 2019 at 10:05 AM Thomas Weise <th...@apache.org> wrote:
>>>
>>>> Want to bump this thread.
>>>>
>>>> If the current behavior is to replace locally built image with the last
>>>> published, then this is not only unexpected for developers but also
>>>> problematic for the CI, where tests should run against what was built from
>>>> source. Or am I missing something?
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> On Tue, Sep 24, 2019 at 7:08 PM Thomas Weise <th...@apache.org> wrote:
>>>>
>>>>> Hi Hannah,
>>>>>
>>>>> I believe this is unexpected from the developer perspective. When
>>>>> building something locally, we do expect that to be used. We may need to
>>>>> change to not pull when the image is available locally, at least when it is
>>>>> a snapshot/master branch. Release images should be immutable anyways.
>>>>>
>>>>> Thomas
>>>>>
>>>>>
>>>>> On Tue, Sep 24, 2019 at 4:13 PM Hannah Jiang <ha...@google.com>
>>>>> wrote:
>>>>>
>>>>>> A minor update, with custom container, the pipeline would not fail,
>>>>>> it throws out warning and moves on to `docker run` command.
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 4:05 PM Hannah Jiang <ha...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Brian
>>>>>>>
>>>>>>> If we pull docker images, it always downloads from remote
>>>>>>> repository, which is expected behavior.
>>>>>>> In case we want to run a local image and pull it only when the image
>>>>>>> is not available at local, we can use `docker run` command directly,
>>>>>>> without pulling it in advance. [1]
>>>>>>> In case we want to pull images only when they are not available at
>>>>>>> local, we can use `docker images -q` to check if images are existing at
>>>>>>> local before pulling it.
>>>>>>> Another option is re-tag your local image, pass your image to
>>>>>>> pipeline and overwrite default one, but the code is still trying to pull,
>>>>>>> so if your image is not pushed to the remote repository, it would fail.
>>>>>>>
>>>>>>> 1. https://github.com/docker/cli/pull/1498
>>>>>>>
>>>>>>> Hannah
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette <bh...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm working on a demo cross-language pipeline on a local flink
>>>>>>>> cluster that relies on my python row coder PR [1]. The PR includes some
>>>>>>>> changes to the Java worker code, so I need to build a Java SDK container
>>>>>>>> locally and use that in the pipeline.
>>>>>>>>
>>>>>>>> Unfortunately, whenever I run the pipeline,
>>>>>>>> the apachebeam/java_sdk:latest tag is moved off of my locally built image
>>>>>>>> to a newly downloaded image with a creation date 2 weeks ago, and that
>>>>>>>> image is used instead. It looks like the reason is we run `docker pull`
>>>>>>>> before running the container [2]. As the comment says this should be a
>>>>>>>> no-op if the image already exists, but that doesn't seem to be the case. If
>>>>>>>> I just run `docker pull apachebeam/java_sdk:latest` on my local machine it
>>>>>>>> downloads the 2 week old image and happily informs me:
>>>>>>>>
>>>>>>>> Status: Downloaded newer image for apachebeam/java_sdk:latest
>>>>>>>>
>>>>>>>> Does anyone know how I can prevent `docker pull` from doing this? I
>>>>>>>> can unblock myself for now just by commenting out the docker pull command,
>>>>>>>> but I'd like to understand what is going on here.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Brian
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/beam/pull/9188
>>>>>>>> [2]
>>>>>>>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>>>>>>>>
>>>>>>>

Re: How to use a locally built worker image?

Posted by Kyle Weaver <kc...@google.com>.
> These are -SNAPSHOT or .dev and distinct from releases. So perhaps this
is just a matter of correct version number management?

In light of the concerns Valentyn & Ahmet raise, it seems safer to change
tags instead of removing the pull. PR for Java:
https://github.com/apache/beam/pull/10017

On Wed, Nov 6, 2019 at 1:41 PM Ahmet Altay <al...@google.com> wrote:

>
>
> On Wed, Nov 6, 2019 at 1:34 PM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> On Wed, Nov 6, 2019 at 11:48 AM Kyle Weaver <kc...@google.com> wrote:
>>
>>> The way the Python SDK currently does this is to use the version as the
>>> default tag, eg 2.16.0. While master uses 2.16.0.dev. This means there
>>> should never be any conflicts between a release and developer image, unless
>>> the user deliberately changes the image tags.
>>>
>>> > if a users' pipeline is relies on a  container image released by Beam
>>> ( or maybe a third party), external updates to such container image may not
>>> propagate to the pipeline workflow without an explicit pull
>>>
>>> There should only be one released container per release. Upgrades to a
>>> container image should not happen independently of the release process.
>>>
>>
>> Fair point, although we have not yet encountered issues requiring  an
>> update of a previously released Docker, so I would not rule out
>> considerations requiring us to re-release the image under the same tag. A
>> scenario that is possible today is multiple pushes of container image to
>> docker repo before the Beam release is finalized, so early adopters may be
>> affected by stale images without  a pull.
>>
>
> This is an interesting problem. It is true that adopters of RCs may get
> stuck pre-release candidates of those images. Could we still docker pull
> only if user is trying to use a default and released image tag?
>
>
>>
>>
>>> Note that so far I've just been discussing defaults. It's always
>>> possible to use a custom container using environment_config, as mentioned
>>> earlier.
>>>
>>
>> My understanding is that to pull or not to pull decision equally applies
>> to custom image provided by environment config.
>>
>>
>>> The goal is to make that unnecessary for most everyday use cases and
>>> development. Using different container images for different transforms is a
>>> more specialized use case worth a separate discussion.
>>>
>>> On Wed, Nov 6, 2019 at 11:33 AM Valentyn Tymofieiev <va...@google.com>
>>> wrote:
>>>
>>>> Anyway, I agree with Thomas that implicitly running `docker pull` is
>>>>> confusing and requires some adjustments to work around. The user can always
>>>>> run `docker pull` themselves if that's the intention.
>>>>
>>>>
>>>> I understand that implicit pull may come across as surprising. However
>>>> I see the  required adjustments as a better practice. I would argue that
>>>> customized containers images should not reuse the same  name:tag
>>>> combination, and it would also help the users avoid a situation where a
>>>> runner may use a different container image in different execution
>>>> environments.
>>>> It may also help avoid issue where a user reports an issue with Beam,
>>>> that others cannot reproduce only because a user was running a customized
>>>> container on their local machine (and forgot about it).
>>>> Also, if a users' pipeline is relies on a  container image released by
>>>> Beam ( or maybe a third party), external updates to such container image
>>>> may not propagate to the pipeline workflow without an explicit pull
>>>>
>>>>> > 1. Read sdk version from gradle.properties and use this as the
>>>>> default tag. Done with Python, need to implement it with Java and Go.
>>>>>
>>>>> 100% agree with this one. Using the same tag for local and release
>>>>> images has already caused a good deal of confusion. Filed BEAM-8570 and
>>>>> BEAM-8571 [2][3].
>>>>>
>>>>> > 2. Remove pulling images before executing docker run command. This
>>>>> should be fixed for Python, Java and Go.
>>>>>
>>>>> Valentyn (from [1]):
>>>>> > I think pulling the latest image for the current tag is actually a
>>>>> desired behavior, in case the external image was updated (due to a bug fix
>>>>> for example).
>>>>>
>>>>> There's a PR for this [4]. Once we fix the default tag for Java/Go
>>>>> containers, the dev and release containers will be distinct, which makes it
>>>>> seldom important whether or not the image is `docker pull`ed. Anyway, I
>>>>> agree with Thomas that implicitly running `docker pull` is confusing and
>>>>> requires some adjustments to work around. The user can always run `docker
>>>>> pull` themselves if that's the intention.
>>>>>
>>>>> [1]
>>>>> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
>>>>> [2] https://issues.apache.org/jira/browse/BEAM-8570
>>>>> [3] https://issues.apache.org/jira/browse/BEAM-8571
>>>>> [4] https://github.com/apache/beam/pull/9972
>>>>>
>>>>> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay <al...@google.com> wrote:
>>>>>
>>>>>> I do not believe this is a blocker for Beam 2.16. I agree that it
>>>>>> would be good to fix this.
>>>>>>
>>>>>> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang <ha...@verily.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Thomas
>>>>>>>
>>>>>>> Thanks for bring this up.
>>>>>>>
>>>>>>> Now Python uses sdk version as a default tag, while Java and Go use
>>>>>>> latest as a default tag. I agree using latest as a tag is problematic. The
>>>>>>> reason only Python uses sdk version as a default tag is Python has
>>>>>>> version.py so the version is easy to read. For Java and Go, we need to read
>>>>>>> it from gradle.properties when creating images with the default tag and
>>>>>>> when setting the default image.
>>>>>>>
>>>>>>> Here is what we need to do:
>>>>>>> 1. Read sdk version from gradle.properties and use this as the
>>>>>>> default tag. Done with Python, need to implement it with Java and Go.
>>>>>>> 2. Remove pulling images before executing docker run command. This
>>>>>>> should be fixed for Python, Java and Go.
>>>>>>>
>>>>>>> Is this a blocker for 2.16? If so and above are too much work for
>>>>>>> 2.16 at the moment, we can hardcode the default tag for release branch for
>>>>>>> now.
>>>>>>>
>>>>>>> Using timestamp as a tag is an option as well, as long as runners
>>>>>>> know which timestamp they should use.
>>>>>>>
>>>>>>> Hannah
>>>>>>>
>>>>>>> On Wed, Oct 2, 2019 at 10:13 AM Alan Myrvold <am...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes, using the latest tag is problematic and can lead to unexpected
>>>>>>>> behavior.
>>>>>>>> Using a date/time or 2.17.0.dev-$USER tag would be better. The
>>>>>>>> validates container shell script uses a datetime
>>>>>>>> <https://github.com/apache/beam/blob/6551d0937ee31a8e310b63b222dbc750ec9331f8/sdks/python/container/run_validatescontainer.sh#L87>
>>>>>>>> tag, which allows a unique name if no two tests are run in the same second.
>>>>>>>>
>>>>>>>> On Wed, Oct 2, 2019 at 10:05 AM Thomas Weise <th...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Want to bump this thread.
>>>>>>>>>
>>>>>>>>> If the current behavior is to replace locally built image with the
>>>>>>>>> last published, then this is not only unexpected for developers but also
>>>>>>>>> problematic for the CI, where tests should run against what was built from
>>>>>>>>> source. Or am I missing something?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Thomas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 7:08 PM Thomas Weise <th...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Hannah,
>>>>>>>>>>
>>>>>>>>>> I believe this is unexpected from the developer perspective. When
>>>>>>>>>> building something locally, we do expect that to be used. We may need to
>>>>>>>>>> change to not pull when the image is available locally, at least when it is
>>>>>>>>>> a snapshot/master branch. Release images should be immutable anyways.
>>>>>>>>>>
>>>>>>>>>> Thomas
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 24, 2019 at 4:13 PM Hannah Jiang <
>>>>>>>>>> hannahjiang@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> A minor update, with custom container, the pipeline would not
>>>>>>>>>>> fail, it throws out warning and moves on to `docker run` command.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 24, 2019 at 4:05 PM Hannah Jiang <
>>>>>>>>>>> hannahjiang@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Brian
>>>>>>>>>>>>
>>>>>>>>>>>> If we pull docker images, it always downloads from remote
>>>>>>>>>>>> repository, which is expected behavior.
>>>>>>>>>>>> In case we want to run a local image and pull it only when the
>>>>>>>>>>>> image is not available at local, we can use `docker run` command directly,
>>>>>>>>>>>> without pulling it in advance. [1]
>>>>>>>>>>>> In case we want to pull images only when they are not
>>>>>>>>>>>> available at local, we can use `docker images -q` to check if images are
>>>>>>>>>>>> existing at local before pulling it.
>>>>>>>>>>>> Another option is re-tag your local image, pass your image to
>>>>>>>>>>>> pipeline and overwrite default one, but the code is still trying to pull,
>>>>>>>>>>>> so if your image is not pushed to the remote repository, it would fail.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. https://github.com/docker/cli/pull/1498
>>>>>>>>>>>>
>>>>>>>>>>>> Hannah
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette <
>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm working on a demo cross-language pipeline on a local flink
>>>>>>>>>>>>> cluster that relies on my python row coder PR [1]. The PR includes some
>>>>>>>>>>>>> changes to the Java worker code, so I need to build a Java SDK container
>>>>>>>>>>>>> locally and use that in the pipeline.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately, whenever I run the pipeline,
>>>>>>>>>>>>> the apachebeam/java_sdk:latest tag is moved off of my locally built image
>>>>>>>>>>>>> to a newly downloaded image with a creation date 2 weeks ago, and that
>>>>>>>>>>>>> image is used instead. It looks like the reason is we run `docker pull`
>>>>>>>>>>>>> before running the container [2]. As the comment says this should be a
>>>>>>>>>>>>> no-op if the image already exists, but that doesn't seem to be the case. If
>>>>>>>>>>>>> I just run `docker pull apachebeam/java_sdk:latest` on my local machine it
>>>>>>>>>>>>> downloads the 2 week old image and happily informs me:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Status: Downloaded newer image for apachebeam/java_sdk:latest
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does anyone know how I can prevent `docker pull` from doing
>>>>>>>>>>>>> this? I can unblock myself for now just by commenting out the docker pull
>>>>>>>>>>>>> command, but I'd like to understand what is going on here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://github.com/apache/beam/pull/9188
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>>>>>>>>>>>>>
>>>>>>>>>>>>

Re: How to use a locally built worker image?

Posted by Ahmet Altay <al...@google.com>.
On Wed, Nov 6, 2019 at 1:34 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> On Wed, Nov 6, 2019 at 11:48 AM Kyle Weaver <kc...@google.com> wrote:
>
>> The way the Python SDK currently does this is to use the version as the
>> default tag, eg 2.16.0. While master uses 2.16.0.dev. This means there
>> should never be any conflicts between a release and developer image, unless
>> the user deliberately changes the image tags.
>>
>> > if a users' pipeline is relies on a  container image released by Beam (
>> or maybe a third party), external updates to such container image may not
>> propagate to the pipeline workflow without an explicit pull
>>
>> There should only be one released container per release. Upgrades to a
>> container image should not happen independently of the release process.
>>
>
> Fair point, although we have not yet encountered issues requiring  an
> update of a previously released Docker, so I would not rule out
> considerations requiring us to re-release the image under the same tag. A
> scenario that is possible today is multiple pushes of container image to
> docker repo before the Beam release is finalized, so early adopters may be
> affected by stale images without  a pull.
>

This is an interesting problem. It is true that adopters of RCs may get
stuck pre-release candidates of those images. Could we still docker pull
only if user is trying to use a default and released image tag?


>
>
>> Note that so far I've just been discussing defaults. It's always possible
>> to use a custom container using environment_config, as mentioned earlier.
>>
>
> My understanding is that to pull or not to pull decision equally applies
> to custom image provided by environment config.
>
>
>> The goal is to make that unnecessary for most everyday use cases and
>> development. Using different container images for different transforms is a
>> more specialized use case worth a separate discussion.
>>
>> On Wed, Nov 6, 2019 at 11:33 AM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> Anyway, I agree with Thomas that implicitly running `docker pull` is
>>>> confusing and requires some adjustments to work around. The user can always
>>>> run `docker pull` themselves if that's the intention.
>>>
>>>
>>> I understand that implicit pull may come across as surprising. However I
>>> see the  required adjustments as a better practice. I would argue that
>>> customized containers images should not reuse the same  name:tag
>>> combination, and it would also help the users avoid a situation where a
>>> runner may use a different container image in different execution
>>> environments.
>>> It may also help avoid issue where a user reports an issue with Beam,
>>> that others cannot reproduce only because a user was running a customized
>>> container on their local machine (and forgot about it).
>>> Also, if a users' pipeline is relies on a  container image released by
>>> Beam ( or maybe a third party), external updates to such container image
>>> may not propagate to the pipeline workflow without an explicit pull
>>>
>>>> > 1. Read sdk version from gradle.properties and use this as the
>>>> default tag. Done with Python, need to implement it with Java and Go.
>>>>
>>>> 100% agree with this one. Using the same tag for local and release
>>>> images has already caused a good deal of confusion. Filed BEAM-8570 and
>>>> BEAM-8571 [2][3].
>>>>
>>>> > 2. Remove pulling images before executing docker run command. This
>>>> should be fixed for Python, Java and Go.
>>>>
>>>> Valentyn (from [1]):
>>>> > I think pulling the latest image for the current tag is actually a
>>>> desired behavior, in case the external image was updated (due to a bug fix
>>>> for example).
>>>>
>>>> There's a PR for this [4]. Once we fix the default tag for Java/Go
>>>> containers, the dev and release containers will be distinct, which makes it
>>>> seldom important whether or not the image is `docker pull`ed. Anyway, I
>>>> agree with Thomas that implicitly running `docker pull` is confusing and
>>>> requires some adjustments to work around. The user can always run `docker
>>>> pull` themselves if that's the intention.
>>>>
>>>> [1]
>>>> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
>>>> [2] https://issues.apache.org/jira/browse/BEAM-8570
>>>> [3] https://issues.apache.org/jira/browse/BEAM-8571
>>>> [4] https://github.com/apache/beam/pull/9972
>>>>
>>>> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>> I do not believe this is a blocker for Beam 2.16. I agree that it
>>>>> would be good to fix this.
>>>>>
>>>>> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang <ha...@verily.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Thomas
>>>>>>
>>>>>> Thanks for bring this up.
>>>>>>
>>>>>> Now Python uses sdk version as a default tag, while Java and Go use
>>>>>> latest as a default tag. I agree using latest as a tag is problematic. The
>>>>>> reason only Python uses sdk version as a default tag is Python has
>>>>>> version.py so the version is easy to read. For Java and Go, we need to read
>>>>>> it from gradle.properties when creating images with the default tag and
>>>>>> when setting the default image.
>>>>>>
>>>>>> Here is what we need to do:
>>>>>> 1. Read sdk version from gradle.properties and use this as the
>>>>>> default tag. Done with Python, need to implement it with Java and Go.
>>>>>> 2. Remove pulling images before executing docker run command. This
>>>>>> should be fixed for Python, Java and Go.
>>>>>>
>>>>>> Is this a blocker for 2.16? If so and above are too much work for
>>>>>> 2.16 at the moment, we can hardcode the default tag for release branch for
>>>>>> now.
>>>>>>
>>>>>> Using timestamp as a tag is an option as well, as long as runners
>>>>>> know which timestamp they should use.
>>>>>>
>>>>>> Hannah
>>>>>>
>>>>>> On Wed, Oct 2, 2019 at 10:13 AM Alan Myrvold <am...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, using the latest tag is problematic and can lead to unexpected
>>>>>>> behavior.
>>>>>>> Using a date/time or 2.17.0.dev-$USER tag would be better. The
>>>>>>> validates container shell script uses a datetime
>>>>>>> <https://github.com/apache/beam/blob/6551d0937ee31a8e310b63b222dbc750ec9331f8/sdks/python/container/run_validatescontainer.sh#L87>
>>>>>>> tag, which allows a unique name if no two tests are run in the same second.
>>>>>>>
>>>>>>> On Wed, Oct 2, 2019 at 10:05 AM Thomas Weise <th...@apache.org> wrote:
>>>>>>>
>>>>>>>> Want to bump this thread.
>>>>>>>>
>>>>>>>> If the current behavior is to replace locally built image with the
>>>>>>>> last published, then this is not only unexpected for developers but also
>>>>>>>> problematic for the CI, where tests should run against what was built from
>>>>>>>> source. Or am I missing something?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 24, 2019 at 7:08 PM Thomas Weise <th...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Hannah,
>>>>>>>>>
>>>>>>>>> I believe this is unexpected from the developer perspective. When
>>>>>>>>> building something locally, we do expect that to be used. We may need to
>>>>>>>>> change to not pull when the image is available locally, at least when it is
>>>>>>>>> a snapshot/master branch. Release images should be immutable anyways.
>>>>>>>>>
>>>>>>>>> Thomas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 4:13 PM Hannah Jiang <
>>>>>>>>> hannahjiang@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> A minor update, with custom container, the pipeline would not
>>>>>>>>>> fail, it throws out warning and moves on to `docker run` command.
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 24, 2019 at 4:05 PM Hannah Jiang <
>>>>>>>>>> hannahjiang@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Brian
>>>>>>>>>>>
>>>>>>>>>>> If we pull docker images, it always downloads from remote
>>>>>>>>>>> repository, which is expected behavior.
>>>>>>>>>>> In case we want to run a local image and pull it only when the
>>>>>>>>>>> image is not available at local, we can use `docker run` command directly,
>>>>>>>>>>> without pulling it in advance. [1]
>>>>>>>>>>> In case we want to pull images only when they are not
>>>>>>>>>>> available at local, we can use `docker images -q` to check if images are
>>>>>>>>>>> existing at local before pulling it.
>>>>>>>>>>> Another option is re-tag your local image, pass your image to
>>>>>>>>>>> pipeline and overwrite default one, but the code is still trying to pull,
>>>>>>>>>>> so if your image is not pushed to the remote repository, it would fail.
>>>>>>>>>>>
>>>>>>>>>>> 1. https://github.com/docker/cli/pull/1498
>>>>>>>>>>>
>>>>>>>>>>> Hannah
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette <
>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm working on a demo cross-language pipeline on a local flink
>>>>>>>>>>>> cluster that relies on my python row coder PR [1]. The PR includes some
>>>>>>>>>>>> changes to the Java worker code, so I need to build a Java SDK container
>>>>>>>>>>>> locally and use that in the pipeline.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, whenever I run the pipeline,
>>>>>>>>>>>> the apachebeam/java_sdk:latest tag is moved off of my locally built image
>>>>>>>>>>>> to a newly downloaded image with a creation date 2 weeks ago, and that
>>>>>>>>>>>> image is used instead. It looks like the reason is we run `docker pull`
>>>>>>>>>>>> before running the container [2]. As the comment says this should be a
>>>>>>>>>>>> no-op if the image already exists, but that doesn't seem to be the case. If
>>>>>>>>>>>> I just run `docker pull apachebeam/java_sdk:latest` on my local machine it
>>>>>>>>>>>> downloads the 2 week old image and happily informs me:
>>>>>>>>>>>>
>>>>>>>>>>>> Status: Downloaded newer image for apachebeam/java_sdk:latest
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone know how I can prevent `docker pull` from doing
>>>>>>>>>>>> this? I can unblock myself for now just by commenting out the docker pull
>>>>>>>>>>>> command, but I'd like to understand what is going on here.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Brian
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/beam/pull/9188
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>>>>>>>>>>>>
>>>>>>>>>>>

Re: How to use a locally built worker image?

Posted by Valentyn Tymofieiev <va...@google.com>.
On Wed, Nov 6, 2019 at 11:48 AM Kyle Weaver <kc...@google.com> wrote:

> The way the Python SDK currently does this is to use the version as the
> default tag, eg 2.16.0. While master uses 2.16.0.dev. This means there
> should never be any conflicts between a release and developer image, unless
> the user deliberately changes the image tags.
>
> > if a users' pipeline is relies on a  container image released by Beam (
> or maybe a third party), external updates to such container image may not
> propagate to the pipeline workflow without an explicit pull
>
> There should only be one released container per release. Upgrades to a
> container image should not happen independently of the release process.
>

Fair point, although we have not yet encountered issues requiring  an
update of a previously released Docker, so I would not rule out
considerations requiring us to re-release the image under the same tag. A
scenario that is possible today is multiple pushes of container image to
docker repo before the Beam release is finalized, so early adopters may be
affected by stale images without  a pull.


> Note that so far I've just been discussing defaults. It's always possible
> to use a custom container using environment_config, as mentioned earlier.
>

My understanding is that to pull or not to pull decision equally applies to
custom image provided by environment config.


> The goal is to make that unnecessary for most everyday use cases and
> development. Using different container images for different transforms is a
> more specialized use case worth a separate discussion.
>
> On Wed, Nov 6, 2019 at 11:33 AM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> Anyway, I agree with Thomas that implicitly running `docker pull` is
>>> confusing and requires some adjustments to work around. The user can always
>>> run `docker pull` themselves if that's the intention.
>>
>>
>> I understand that implicit pull may come across as surprising. However I
>> see the  required adjustments as a better practice. I would argue that
>> customized containers images should not reuse the same  name:tag
>> combination, and it would also help the users avoid a situation where a
>> runner may use a different container image in different execution
>> environments.
>> It may also help avoid issue where a user reports an issue with Beam,
>> that others cannot reproduce only because a user was running a customized
>> container on their local machine (and forgot about it).
>> Also, if a users' pipeline is relies on a  container image released by
>> Beam ( or maybe a third party), external updates to such container image
>> may not propagate to the pipeline workflow without an explicit pull
>>
>>> > 1. Read sdk version from gradle.properties and use this as the default
>>> tag. Done with Python, need to implement it with Java and Go.
>>>
>>> 100% agree with this one. Using the same tag for local and release
>>> images has already caused a good deal of confusion. Filed BEAM-8570 and
>>> BEAM-8571 [2][3].
>>>
>>> > 2. Remove pulling images before executing docker run command. This
>>> should be fixed for Python, Java and Go.
>>>
>>> Valentyn (from [1]):
>>> > I think pulling the latest image for the current tag is actually a
>>> desired behavior, in case the external image was updated (due to a bug fix
>>> for example).
>>>
>>> There's a PR for this [4]. Once we fix the default tag for Java/Go
>>> containers, the dev and release containers will be distinct, which makes it
>>> seldom important whether or not the image is `docker pull`ed. Anyway, I
>>> agree with Thomas that implicitly running `docker pull` is confusing and
>>> requires some adjustments to work around. The user can always run `docker
>>> pull` themselves if that's the intention.
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
>>> [2] https://issues.apache.org/jira/browse/BEAM-8570
>>> [3] https://issues.apache.org/jira/browse/BEAM-8571
>>> [4] https://github.com/apache/beam/pull/9972
>>>
>>> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> I do not believe this is a blocker for Beam 2.16. I agree that it would
>>>> be good to fix this.
>>>>
>>>> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang <ha...@verily.com>
>>>> wrote:
>>>>
>>>>> Hi Thomas
>>>>>
>>>>> Thanks for bring this up.
>>>>>
>>>>> Now Python uses sdk version as a default tag, while Java and Go use
>>>>> latest as a default tag. I agree using latest as a tag is problematic. The
>>>>> reason only Python uses sdk version as a default tag is Python has
>>>>> version.py so the version is easy to read. For Java and Go, we need to read
>>>>> it from gradle.properties when creating images with the default tag and
>>>>> when setting the default image.
>>>>>
>>>>> Here is what we need to do:
>>>>> 1. Read sdk version from gradle.properties and use this as the default
>>>>> tag. Done with Python, need to implement it with Java and Go.
>>>>> 2. Remove pulling images before executing docker run command. This
>>>>> should be fixed for Python, Java and Go.
>>>>>
>>>>> Is this a blocker for 2.16? If so and above are too much work for 2.16
>>>>> at the moment, we can hardcode the default tag for release branch for now.
>>>>>
>>>>> Using timestamp as a tag is an option as well, as long as runners know
>>>>> which timestamp they should use.
>>>>>
>>>>> Hannah
>>>>>
>>>>> On Wed, Oct 2, 2019 at 10:13 AM Alan Myrvold <am...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Yes, using the latest tag is problematic and can lead to unexpected
>>>>>> behavior.
>>>>>> Using a date/time or 2.17.0.dev-$USER tag would be better. The
>>>>>> validates container shell script uses a datetime
>>>>>> <https://github.com/apache/beam/blob/6551d0937ee31a8e310b63b222dbc750ec9331f8/sdks/python/container/run_validatescontainer.sh#L87>
>>>>>> tag, which allows a unique name if no two tests are run in the same second.
>>>>>>
>>>>>> On Wed, Oct 2, 2019 at 10:05 AM Thomas Weise <th...@apache.org> wrote:
>>>>>>
>>>>>>> Want to bump this thread.
>>>>>>>
>>>>>>> If the current behavior is to replace locally built image with the
>>>>>>> last published, then this is not only unexpected for developers but also
>>>>>>> problematic for the CI, where tests should run against what was built from
>>>>>>> source. Or am I missing something?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 7:08 PM Thomas Weise <th...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi Hannah,
>>>>>>>>
>>>>>>>> I believe this is unexpected from the developer perspective. When
>>>>>>>> building something locally, we do expect that to be used. We may need to
>>>>>>>> change to not pull when the image is available locally, at least when it is
>>>>>>>> a snapshot/master branch. Release images should be immutable anyways.
>>>>>>>>
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 24, 2019 at 4:13 PM Hannah Jiang <
>>>>>>>> hannahjiang@google.com> wrote:
>>>>>>>>
>>>>>>>>> A minor update, with custom container, the pipeline would not
>>>>>>>>> fail, it throws out warning and moves on to `docker run` command.
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 4:05 PM Hannah Jiang <
>>>>>>>>> hannahjiang@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Brian
>>>>>>>>>>
>>>>>>>>>> If we pull docker images, it always downloads from remote
>>>>>>>>>> repository, which is expected behavior.
>>>>>>>>>> In case we want to run a local image and pull it only when the
>>>>>>>>>> image is not available at local, we can use `docker run` command directly,
>>>>>>>>>> without pulling it in advance. [1]
>>>>>>>>>> In case we want to pull images only when they are not
>>>>>>>>>> available at local, we can use `docker images -q` to check if images are
>>>>>>>>>> existing at local before pulling it.
>>>>>>>>>> Another option is re-tag your local image, pass your image to
>>>>>>>>>> pipeline and overwrite default one, but the code is still trying to pull,
>>>>>>>>>> so if your image is not pushed to the remote repository, it would fail.
>>>>>>>>>>
>>>>>>>>>> 1. https://github.com/docker/cli/pull/1498
>>>>>>>>>>
>>>>>>>>>> Hannah
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette <
>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm working on a demo cross-language pipeline on a local flink
>>>>>>>>>>> cluster that relies on my python row coder PR [1]. The PR includes some
>>>>>>>>>>> changes to the Java worker code, so I need to build a Java SDK container
>>>>>>>>>>> locally and use that in the pipeline.
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, whenever I run the pipeline,
>>>>>>>>>>> the apachebeam/java_sdk:latest tag is moved off of my locally built image
>>>>>>>>>>> to a newly downloaded image with a creation date 2 weeks ago, and that
>>>>>>>>>>> image is used instead. It looks like the reason is we run `docker pull`
>>>>>>>>>>> before running the container [2]. As the comment says this should be a
>>>>>>>>>>> no-op if the image already exists, but that doesn't seem to be the case. If
>>>>>>>>>>> I just run `docker pull apachebeam/java_sdk:latest` on my local machine it
>>>>>>>>>>> downloads the 2 week old image and happily informs me:
>>>>>>>>>>>
>>>>>>>>>>> Status: Downloaded newer image for apachebeam/java_sdk:latest
>>>>>>>>>>>
>>>>>>>>>>> Does anyone know how I can prevent `docker pull` from doing
>>>>>>>>>>> this? I can unblock myself for now just by commenting out the docker pull
>>>>>>>>>>> command, but I'd like to understand what is going on here.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Brian
>>>>>>>>>>>
>>>>>>>>>>> [1] https://github.com/apache/beam/pull/9188
>>>>>>>>>>> [2]
>>>>>>>>>>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>>>>>>>>>>>
>>>>>>>>>>

Re: How to use a locally built worker image?

Posted by Kyle Weaver <kc...@google.com>.
The way the Python SDK currently does this is to use the version as the
default tag, eg 2.16.0. While master uses 2.16.0.dev. This means there
should never be any conflicts between a release and developer image, unless
the user deliberately changes the image tags.

> if a users' pipeline is relies on a  container image released by Beam (
or maybe a third party), external updates to such container image may not
propagate to the pipeline workflow without an explicit pull

There should only be one released container per release. Upgrades to a
container image should not happen independently of the release process.

Note that so far I've just been discussing defaults. It's always possible
to use a custom container using environment_config, as mentioned earlier.
The goal is to make that unnecessary for most everyday use cases and
development. Using different container images for different transforms is a
more specialized use case worth a separate discussion.

On Wed, Nov 6, 2019 at 11:33 AM Valentyn Tymofieiev <va...@google.com>
wrote:

> Anyway, I agree with Thomas that implicitly running `docker pull` is
>> confusing and requires some adjustments to work around. The user can always
>> run `docker pull` themselves if that's the intention.
>
>
> I understand that implicit pull may come across as surprising. However I
> see the  required adjustments as a better practice. I would argue that
> customized containers images should not reuse the same  name:tag
> combination, and it would also help the users avoid a situation where a
> runner may use a different container image in different execution
> environments.
> It may also help avoid issue where a user reports an issue with Beam, that
> others cannot reproduce only because a user was running a customized
> container on their local machine (and forgot about it).
> Also, if a users' pipeline is relies on a  container image released by
> Beam ( or maybe a third party), external updates to such container image
> may not propagate to the pipeline workflow without an explicit pull
>
>> > 1. Read sdk version from gradle.properties and use this as the default
>> tag. Done with Python, need to implement it with Java and Go.
>>
>> 100% agree with this one. Using the same tag for local and release images
>> has already caused a good deal of confusion. Filed BEAM-8570 and BEAM-8571
>> [2][3].
>>
>> > 2. Remove pulling images before executing docker run command. This
>> should be fixed for Python, Java and Go.
>>
>> Valentyn (from [1]):
>> > I think pulling the latest image for the current tag is actually a
>> desired behavior, in case the external image was updated (due to a bug fix
>> for example).
>>
>> There's a PR for this [4]. Once we fix the default tag for Java/Go
>> containers, the dev and release containers will be distinct, which makes it
>> seldom important whether or not the image is `docker pull`ed. Anyway, I
>> agree with Thomas that implicitly running `docker pull` is confusing and
>> requires some adjustments to work around. The user can always run `docker
>> pull` themselves if that's the intention.
>>
>> [1]
>> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/BEAM-8570
>> [3] https://issues.apache.org/jira/browse/BEAM-8571
>> [4] https://github.com/apache/beam/pull/9972
>>
>> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay <al...@google.com> wrote:
>>
>>> I do not believe this is a blocker for Beam 2.16. I agree that it would
>>> be good to fix this.
>>>
>>> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang <ha...@verily.com>
>>> wrote:
>>>
>>>> Hi Thomas
>>>>
>>>> Thanks for bring this up.
>>>>
>>>> Now Python uses sdk version as a default tag, while Java and Go use
>>>> latest as a default tag. I agree using latest as a tag is problematic. The
>>>> reason only Python uses sdk version as a default tag is Python has
>>>> version.py so the version is easy to read. For Java and Go, we need to read
>>>> it from gradle.properties when creating images with the default tag and
>>>> when setting the default image.
>>>>
>>>> Here is what we need to do:
>>>> 1. Read sdk version from gradle.properties and use this as the default
>>>> tag. Done with Python, need to implement it with Java and Go.
>>>> 2. Remove pulling images before executing docker run command. This
>>>> should be fixed for Python, Java and Go.
>>>>
>>>> Is this a blocker for 2.16? If so and above are too much work for 2.16
>>>> at the moment, we can hardcode the default tag for release branch for now.
>>>>
>>>> Using timestamp as a tag is an option as well, as long as runners know
>>>> which timestamp they should use.
>>>>
>>>> Hannah
>>>>
>>>> On Wed, Oct 2, 2019 at 10:13 AM Alan Myrvold <am...@google.com>
>>>> wrote:
>>>>
>>>>> Yes, using the latest tag is problematic and can lead to unexpected
>>>>> behavior.
>>>>> Using a date/time or 2.17.0.dev-$USER tag would be better. The
>>>>> validates container shell script uses a datetime
>>>>> <https://github.com/apache/beam/blob/6551d0937ee31a8e310b63b222dbc750ec9331f8/sdks/python/container/run_validatescontainer.sh#L87>
>>>>> tag, which allows a unique name if no two tests are run in the same second.
>>>>>
>>>>> On Wed, Oct 2, 2019 at 10:05 AM Thomas Weise <th...@apache.org> wrote:
>>>>>
>>>>>> Want to bump this thread.
>>>>>>
>>>>>> If the current behavior is to replace locally built image with the
>>>>>> last published, then this is not only unexpected for developers but also
>>>>>> problematic for the CI, where tests should run against what was built from
>>>>>> source. Or am I missing something?
>>>>>>
>>>>>> Thanks,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 7:08 PM Thomas Weise <th...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Hannah,
>>>>>>>
>>>>>>> I believe this is unexpected from the developer perspective. When
>>>>>>> building something locally, we do expect that to be used. We may need to
>>>>>>> change to not pull when the image is available locally, at least when it is
>>>>>>> a snapshot/master branch. Release images should be immutable anyways.
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 4:13 PM Hannah Jiang <ha...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> A minor update, with custom container, the pipeline would not fail,
>>>>>>>> it throws out warning and moves on to `docker run` command.
>>>>>>>>
>>>>>>>> On Tue, Sep 24, 2019 at 4:05 PM Hannah Jiang <
>>>>>>>> hannahjiang@google.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Brian
>>>>>>>>>
>>>>>>>>> If we pull docker images, it always downloads from remote
>>>>>>>>> repository, which is expected behavior.
>>>>>>>>> In case we want to run a local image and pull it only when the
>>>>>>>>> image is not available at local, we can use `docker run` command directly,
>>>>>>>>> without pulling it in advance. [1]
>>>>>>>>> In case we want to pull images only when they are not available at
>>>>>>>>> local, we can use `docker images -q` to check if images are existing at
>>>>>>>>> local before pulling it.
>>>>>>>>> Another option is re-tag your local image, pass your image to
>>>>>>>>> pipeline and overwrite default one, but the code is still trying to pull,
>>>>>>>>> so if your image is not pushed to the remote repository, it would fail.
>>>>>>>>>
>>>>>>>>> 1. https://github.com/docker/cli/pull/1498
>>>>>>>>>
>>>>>>>>> Hannah
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette <
>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> I'm working on a demo cross-language pipeline on a local flink
>>>>>>>>>> cluster that relies on my python row coder PR [1]. The PR includes some
>>>>>>>>>> changes to the Java worker code, so I need to build a Java SDK container
>>>>>>>>>> locally and use that in the pipeline.
>>>>>>>>>>
>>>>>>>>>> Unfortunately, whenever I run the pipeline,
>>>>>>>>>> the apachebeam/java_sdk:latest tag is moved off of my locally built image
>>>>>>>>>> to a newly downloaded image with a creation date 2 weeks ago, and that
>>>>>>>>>> image is used instead. It looks like the reason is we run `docker pull`
>>>>>>>>>> before running the container [2]. As the comment says this should be a
>>>>>>>>>> no-op if the image already exists, but that doesn't seem to be the case. If
>>>>>>>>>> I just run `docker pull apachebeam/java_sdk:latest` on my local machine it
>>>>>>>>>> downloads the 2 week old image and happily informs me:
>>>>>>>>>>
>>>>>>>>>> Status: Downloaded newer image for apachebeam/java_sdk:latest
>>>>>>>>>>
>>>>>>>>>> Does anyone know how I can prevent `docker pull` from doing this?
>>>>>>>>>> I can unblock myself for now just by commenting out the docker pull
>>>>>>>>>> command, but I'd like to understand what is going on here.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/beam/pull/9188
>>>>>>>>>> [2]
>>>>>>>>>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>>>>>>>>>>
>>>>>>>>>

Re: How to use a locally built worker image?

Posted by Thomas Weise <th...@apache.org>.
As developer in Beam, I expect something that I built locally to be used.
That's the case with Java and Python dependencies also.

These are -SNAPSHOT or .dev and distinct from releases. So perhaps this is
just a matter of correct version number management?

Users that consume our releases (version) should see the published Docker
images.

It would be great if that just works out of the box, w/o extra pipeline
options magic.

Thomas

On Wed, Nov 6, 2019 at 11:33 AM Valentyn Tymofieiev <va...@google.com>
wrote:

> Anyway, I agree with Thomas that implicitly running `docker pull` is
>> confusing and requires some adjustments to work around. The user can always
>> run `docker pull` themselves if that's the intention.
>
>
> I understand that implicit pull may come across as surprising. However I
> see the  required adjustments as a better practice. I would argue that
> customized containers images should not reuse the same  name:tag
> combination, and it would also help the users avoid a situation where a
> runner may use a different container image in different execution
> environments.
> It may also help avoid issue where a user reports an issue with Beam, that
> others cannot reproduce only because a user was running a customized
> container on their local machine (and forgot about it).
> Also, if a users' pipeline is relies on a  container image released by
> Beam ( or maybe a third party), external updates to such container image
> may not propagate to the pipeline workflow without an explicit pull. Always
> pulling the image may help to ensure a more deterministic behavior.
>
> On Wed, Nov 6, 2019 at 10:38 AM Kyle Weaver <kc...@google.com> wrote:
>
>> Bumping this thread from the other one [1].
>>
>> > 1. Read sdk version from gradle.properties and use this as the default
>> tag. Done with Python, need to implement it with Java and Go.
>>
>> 100% agree with this one. Using the same tag for local and release images
>> has already caused a good deal of confusion. Filed BEAM-8570 and BEAM-8571
>> [2][3].
>>
>> > 2. Remove pulling images before executing docker run command. This
>> should be fixed for Python, Java and Go.
>>
>> Valentyn (from [1]):
>> > I think pulling the latest image for the current tag is actually a
>> desired behavior, in case the external image was updated (due to a bug fix
>> for example).
>>
>> There's a PR for this [4]. Once we fix the default tag for Java/Go
>> containers, the dev and release containers will be distinct, which makes it
>> seldom important whether or not the image is `docker pull`ed. Anyway, I
>> agree with Thomas that implicitly running `docker pull` is confusing and
>> requires some adjustments to work around. The user can always run `docker
>> pull` themselves if that's the intention.
>>
>> [1]
>> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/BEAM-8570
>> [3] https://issues.apache.org/jira/browse/BEAM-8571
>> [4] https://github.com/apache/beam/pull/9972
>>
>> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay <al...@google.com> wrote:
>>
>>> I do not believe this is a blocker for Beam 2.16. I agree that it would
>>> be good to fix this.
>>>
>>> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang <ha...@verily.com>
>>> wrote:
>>>
>>>> Hi Thomas
>>>>
>>>> Thanks for bring this up.
>>>>
>>>> Now Python uses sdk version as a default tag, while Java and Go use
>>>> latest as a default tag. I agree using latest as a tag is problematic. The
>>>> reason only Python uses sdk version as a default tag is Python has
>>>> version.py so the version is easy to read. For Java and Go, we need to read
>>>> it from gradle.properties when creating images with the default tag and
>>>> when setting the default image.
>>>>
>>>> Here is what we need to do:
>>>> 1. Read sdk version from gradle.properties and use this as the default
>>>> tag. Done with Python, need to implement it with Java and Go.
>>>> 2. Remove pulling images before executing docker run command. This
>>>> should be fixed for Python, Java and Go.
>>>>
>>>> Is this a blocker for 2.16? If so and above are too much work for 2.16
>>>> at the moment, we can hardcode the default tag for release branch for now.
>>>>
>>>> Using timestamp as a tag is an option as well, as long as runners know
>>>> which timestamp they should use.
>>>>
>>>> Hannah
>>>>
>>>> On Wed, Oct 2, 2019 at 10:13 AM Alan Myrvold <am...@google.com>
>>>> wrote:
>>>>
>>>>> Yes, using the latest tag is problematic and can lead to unexpected
>>>>> behavior.
>>>>> Using a date/time or 2.17.0.dev-$USER tag would be better. The
>>>>> validates container shell script uses a datetime
>>>>> <https://github.com/apache/beam/blob/6551d0937ee31a8e310b63b222dbc750ec9331f8/sdks/python/container/run_validatescontainer.sh#L87>
>>>>> tag, which allows a unique name if no two tests are run in the same second.
>>>>>
>>>>> On Wed, Oct 2, 2019 at 10:05 AM Thomas Weise <th...@apache.org> wrote:
>>>>>
>>>>>> Want to bump this thread.
>>>>>>
>>>>>> If the current behavior is to replace locally built image with the
>>>>>> last published, then this is not only unexpected for developers but also
>>>>>> problematic for the CI, where tests should run against what was built from
>>>>>> source. Or am I missing something?
>>>>>>
>>>>>> Thanks,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 7:08 PM Thomas Weise <th...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Hannah,
>>>>>>>
>>>>>>> I believe this is unexpected from the developer perspective. When
>>>>>>> building something locally, we do expect that to be used. We may need to
>>>>>>> change to not pull when the image is available locally, at least when it is
>>>>>>> a snapshot/master branch. Release images should be immutable anyways.
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 4:13 PM Hannah Jiang <ha...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> A minor update, with custom container, the pipeline would not fail,
>>>>>>>> it throws out warning and moves on to `docker run` command.
>>>>>>>>
>>>>>>>> On Tue, Sep 24, 2019 at 4:05 PM Hannah Jiang <
>>>>>>>> hannahjiang@google.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Brian
>>>>>>>>>
>>>>>>>>> If we pull docker images, it always downloads from remote
>>>>>>>>> repository, which is expected behavior.
>>>>>>>>> In case we want to run a local image and pull it only when the
>>>>>>>>> image is not available at local, we can use `docker run` command directly,
>>>>>>>>> without pulling it in advance. [1]
>>>>>>>>> In case we want to pull images only when they are not available at
>>>>>>>>> local, we can use `docker images -q` to check if images are existing at
>>>>>>>>> local before pulling it.
>>>>>>>>> Another option is re-tag your local image, pass your image to
>>>>>>>>> pipeline and overwrite default one, but the code is still trying to pull,
>>>>>>>>> so if your image is not pushed to the remote repository, it would fail.
>>>>>>>>>
>>>>>>>>> 1. https://github.com/docker/cli/pull/1498
>>>>>>>>>
>>>>>>>>> Hannah
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette <
>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> I'm working on a demo cross-language pipeline on a local flink
>>>>>>>>>> cluster that relies on my python row coder PR [1]. The PR includes some
>>>>>>>>>> changes to the Java worker code, so I need to build a Java SDK container
>>>>>>>>>> locally and use that in the pipeline.
>>>>>>>>>>
>>>>>>>>>> Unfortunately, whenever I run the pipeline,
>>>>>>>>>> the apachebeam/java_sdk:latest tag is moved off of my locally built image
>>>>>>>>>> to a newly downloaded image with a creation date 2 weeks ago, and that
>>>>>>>>>> image is used instead. It looks like the reason is we run `docker pull`
>>>>>>>>>> before running the container [2]. As the comment says this should be a
>>>>>>>>>> no-op if the image already exists, but that doesn't seem to be the case. If
>>>>>>>>>> I just run `docker pull apachebeam/java_sdk:latest` on my local machine it
>>>>>>>>>> downloads the 2 week old image and happily informs me:
>>>>>>>>>>
>>>>>>>>>> Status: Downloaded newer image for apachebeam/java_sdk:latest
>>>>>>>>>>
>>>>>>>>>> Does anyone know how I can prevent `docker pull` from doing this?
>>>>>>>>>> I can unblock myself for now just by commenting out the docker pull
>>>>>>>>>> command, but I'd like to understand what is going on here.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/beam/pull/9188
>>>>>>>>>> [2]
>>>>>>>>>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>>>>>>>>>>
>>>>>>>>>

Re: How to use a locally built worker image?

Posted by Valentyn Tymofieiev <va...@google.com>.
>
> Anyway, I agree with Thomas that implicitly running `docker pull` is
> confusing and requires some adjustments to work around. The user can always
> run `docker pull` themselves if that's the intention.


I understand that implicit pull may come across as surprising. However I
see the  required adjustments as a better practice. I would argue that
customized containers images should not reuse the same  name:tag
combination, and it would also help the users avoid a situation where a
runner may use a different container image in different execution
environments.
It may also help avoid issue where a user reports an issue with Beam, that
others cannot reproduce only because a user was running a customized
container on their local machine (and forgot about it).
Also, if a users' pipeline is relies on a  container image released by Beam
( or maybe a third party), external updates to such container image may not
propagate to the pipeline workflow without an explicit pull. Always pulling
the image may help to ensure a more deterministic behavior.

On Wed, Nov 6, 2019 at 10:38 AM Kyle Weaver <kc...@google.com> wrote:

> Bumping this thread from the other one [1].
>
> > 1. Read sdk version from gradle.properties and use this as the default
> tag. Done with Python, need to implement it with Java and Go.
>
> 100% agree with this one. Using the same tag for local and release images
> has already caused a good deal of confusion. Filed BEAM-8570 and BEAM-8571
> [2][3].
>
> > 2. Remove pulling images before executing docker run command. This
> should be fixed for Python, Java and Go.
>
> Valentyn (from [1]):
> > I think pulling the latest image for the current tag is actually a
> desired behavior, in case the external image was updated (due to a bug fix
> for example).
>
> There's a PR for this [4]. Once we fix the default tag for Java/Go
> containers, the dev and release containers will be distinct, which makes it
> seldom important whether or not the image is `docker pull`ed. Anyway, I
> agree with Thomas that implicitly running `docker pull` is confusing and
> requires some adjustments to work around. The user can always run `docker
> pull` themselves if that's the intention.
>
> [1]
> https://lists.apache.org/thread.html/0f2ccbbe7969b91dc21ba331c1a30d730e268cc0355c1ac1ba0b7988@%3Cdev.beam.apache.org%3E
> [2] https://issues.apache.org/jira/browse/BEAM-8570
> [3] https://issues.apache.org/jira/browse/BEAM-8571
> [4] https://github.com/apache/beam/pull/9972
>
> On Wed, Oct 2, 2019 at 5:32 PM Ahmet Altay <al...@google.com> wrote:
>
>> I do not believe this is a blocker for Beam 2.16. I agree that it would
>> be good to fix this.
>>
>> On Wed, Oct 2, 2019 at 3:15 PM Hannah Jiang <ha...@verily.com>
>> wrote:
>>
>>> Hi Thomas
>>>
>>> Thanks for bring this up.
>>>
>>> Now Python uses sdk version as a default tag, while Java and Go use
>>> latest as a default tag. I agree using latest as a tag is problematic. The
>>> reason only Python uses sdk version as a default tag is Python has
>>> version.py so the version is easy to read. For Java and Go, we need to read
>>> it from gradle.properties when creating images with the default tag and
>>> when setting the default image.
>>>
>>> Here is what we need to do:
>>> 1. Read sdk version from gradle.properties and use this as the default
>>> tag. Done with Python, need to implement it with Java and Go.
>>> 2. Remove pulling images before executing docker run command. This
>>> should be fixed for Python, Java and Go.
>>>
>>> Is this a blocker for 2.16? If so and above are too much work for 2.16
>>> at the moment, we can hardcode the default tag for release branch for now.
>>>
>>> Using timestamp as a tag is an option as well, as long as runners know
>>> which timestamp they should use.
>>>
>>> Hannah
>>>
>>> On Wed, Oct 2, 2019 at 10:13 AM Alan Myrvold <am...@google.com>
>>> wrote:
>>>
>>>> Yes, using the latest tag is problematic and can lead to unexpected
>>>> behavior.
>>>> Using a date/time or 2.17.0.dev-$USER tag would be better. The
>>>> validates container shell script uses a datetime
>>>> <https://github.com/apache/beam/blob/6551d0937ee31a8e310b63b222dbc750ec9331f8/sdks/python/container/run_validatescontainer.sh#L87>
>>>> tag, which allows a unique name if no two tests are run in the same second.
>>>>
>>>> On Wed, Oct 2, 2019 at 10:05 AM Thomas Weise <th...@apache.org> wrote:
>>>>
>>>>> Want to bump this thread.
>>>>>
>>>>> If the current behavior is to replace locally built image with the
>>>>> last published, then this is not only unexpected for developers but also
>>>>> problematic for the CI, where tests should run against what was built from
>>>>> source. Or am I missing something?
>>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>>
>>>>> On Tue, Sep 24, 2019 at 7:08 PM Thomas Weise <th...@apache.org> wrote:
>>>>>
>>>>>> Hi Hannah,
>>>>>>
>>>>>> I believe this is unexpected from the developer perspective. When
>>>>>> building something locally, we do expect that to be used. We may need to
>>>>>> change to not pull when the image is available locally, at least when it is
>>>>>> a snapshot/master branch. Release images should be immutable anyways.
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 4:13 PM Hannah Jiang <ha...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> A minor update, with custom container, the pipeline would not fail,
>>>>>>> it throws out warning and moves on to `docker run` command.
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 4:05 PM Hannah Jiang <ha...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Brian
>>>>>>>>
>>>>>>>> If we pull docker images, it always downloads from remote
>>>>>>>> repository, which is expected behavior.
>>>>>>>> In case we want to run a local image and pull it only when the
>>>>>>>> image is not available at local, we can use `docker run` command directly,
>>>>>>>> without pulling it in advance. [1]
>>>>>>>> In case we want to pull images only when they are not available at
>>>>>>>> local, we can use `docker images -q` to check if images are existing at
>>>>>>>> local before pulling it.
>>>>>>>> Another option is re-tag your local image, pass your image to
>>>>>>>> pipeline and overwrite default one, but the code is still trying to pull,
>>>>>>>> so if your image is not pushed to the remote repository, it would fail.
>>>>>>>>
>>>>>>>> 1. https://github.com/docker/cli/pull/1498
>>>>>>>>
>>>>>>>> Hannah
>>>>>>>>
>>>>>>>> On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette <bh...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I'm working on a demo cross-language pipeline on a local flink
>>>>>>>>> cluster that relies on my python row coder PR [1]. The PR includes some
>>>>>>>>> changes to the Java worker code, so I need to build a Java SDK container
>>>>>>>>> locally and use that in the pipeline.
>>>>>>>>>
>>>>>>>>> Unfortunately, whenever I run the pipeline,
>>>>>>>>> the apachebeam/java_sdk:latest tag is moved off of my locally built image
>>>>>>>>> to a newly downloaded image with a creation date 2 weeks ago, and that
>>>>>>>>> image is used instead. It looks like the reason is we run `docker pull`
>>>>>>>>> before running the container [2]. As the comment says this should be a
>>>>>>>>> no-op if the image already exists, but that doesn't seem to be the case. If
>>>>>>>>> I just run `docker pull apachebeam/java_sdk:latest` on my local machine it
>>>>>>>>> downloads the 2 week old image and happily informs me:
>>>>>>>>>
>>>>>>>>> Status: Downloaded newer image for apachebeam/java_sdk:latest
>>>>>>>>>
>>>>>>>>> Does anyone know how I can prevent `docker pull` from doing this?
>>>>>>>>> I can unblock myself for now just by commenting out the docker pull
>>>>>>>>> command, but I'd like to understand what is going on here.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Brian
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/beam/pull/9188
>>>>>>>>> [2]
>>>>>>>>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>>>>>>>>>
>>>>>>>>