Posted to dev@beam.apache.org by Sam Bourne <sa...@gmail.com> on 2019/08/02 06:48:29 UTC

Re: Docker Run Options in SDK Container

Thanks for the feedback. I added a response in-line.

On Wed, Jul 17, 2019 at 4:48 AM Ankur Goenka <goenka@google.com> wrote:

> Thanks for summarizing the discussion.
>
> A few comments inline below:
>
> On Mon, Jul 15, 2019 at 5:28 PM Sam Bourne <sa...@gmail.com> wrote:
>
>> Hello Beam devs,
>>
>> I’ve opened a PR (https://github.com/apache/beam/pull/8982) to support
>> passing options/flags to the docker run command executed as part of the
>> portable environment workflow. I’m in need of providing specific volumes
>> and possibly other docker run options as I refine our custom container and
>> workflow.
>>
>> There were requests to bring this up in the mailing list to discuss
>> possible ways to achieve this. There’s an existing PR #8828
>> <https://github.com/apache/beam/pull/8828> but we took quite different
>> approaches. #8828 is limited to only mounting /tmp/ directories with no
>> support for other docker run options/flags so wouldn’t solve my needs.
>>
>> I chose to expand upon the existing flag environment_config and provide
>> the additional docker run options there. This requires the SDK to parse
>> these out when building the DockerPayload protobuf. It’s worth noting that
>> what is provided to environment_config changes depending on the
>> environment_type. e.g. if environment_type is docker, environment_config
>> is currently expected to be the docker container name, but other
>> environment types have completely different expectations, and each uses its
>> own protobuf message type.
>>
>> The current method (using python SDK) looks like this:
>>
>> python -m mymodule --runner PortableRunner --job_endpoint localhost:8099 --environment_type DOCKER --environment_config MY_CONTAINER_NAME
>>
>> My PR expects other run options to be provided before the container name
>> - similar to how you would start the container locally:
>>
>> python -m mymodule --runner PortableRunner --job_endpoint localhost:8099 --environment_type DOCKER --environment_config "-v /Volumes/mnt/foo:/Volumes/mnt/foo -v /Volumes/mnt/bar:/Volumes/mnt/bar --user sambvfx MY_CONTAINER_NAME"
>>
>> The PR’s feedback raises some questions that some of you may have
>> opinions about. A hopefully faithful summary of them and my commentary
>> below:
>>
>> Should we require the environment_config be a json encoded string that
>> mirrors the protobuf?
>>
>> e.g.
>>
>> --environment_config '{"image_name": "MY_CONTAINER_NAME", "run_options": "-v /Volumes/mnt/foo:/Volumes/mnt/foo -v /Volumes/mnt/bar:/Volumes/mnt/bar --user sambvfx"}'
>>
>> I’m not a fan, since it’s not backwards compatible and is difficult to
>> provide via the CLI. Users don’t want to type json into the shell.
>>
> I agree, typing JSON on the command line is really messy. But I think
> having meaningful parts in the config will be easier to maintain and
> compare.
>
> Can we provide a config file which can be read, parsed, and delivered as
> options to the docker environment?
> Something like "--environment_config '~/my_docker_config.json/yaml'"
>
That does provide a better CLI experience than writing json. If we went
that route, we would also want the SDKs to handle either direct json or a
file path. I still prefer my original proposal, which maintains backwards
compatibility and, within the CLI, mirrors how the user normally calls
docker run.
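
As a rough illustration of what that would mean for the SDK, here is a
hypothetical helper (not code from either PR) that accepts the
environment_config value as either an inline json string or a path to a
json file:

    import json
    import os

    def load_environment_config(value):
        # Hypothetical helper: if the value points at a file on disk, read
        # and parse it; otherwise treat the value itself as json.
        if os.path.isfile(value):
            with open(value) as f:
                return json.load(f)
        return json.loads(value)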

> I think passing a user-provided command to start docker might have
> security issues, as users might mount an otherwise non-accessible drive or
> access a prohibited port, etc.
>
I’d argue it’s up to the runner to ensure the desired privileges are set up
for the worker. Doesn’t running a java transform directly via an embedded
environment give you everything the worker has access to as well?

>> Should we not assume docker run ... is the only way to start the container?
>>
>> I think any other method would likely require further changes to the
>> protobuf or a completely new one.
>>
> Yes, I think that makes sense. However, if we add more parameters to the
> docker startup, then the DockerPayload protobuf can be updated to have
> those.
>
I imagine that any changes to the protobuf would only be to support other
methods of starting a docker container - simply because run_options will
contain everything the user wants to pass to docker run.
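
To make the parsing concrete, here is a minimal sketch (a hypothetical
helper, not the code in the PR) of splitting the proposed
environment_config value into docker run options plus an image name,
treating the last token as the image just like docker run [OPTIONS] IMAGE
does:

    import shlex

    def split_docker_environment_config(value):
        # Hypothetical sketch: everything before the final token is passed
        # through to docker run; the final token is the container image.
        tokens = shlex.split(value)
        if not tokens:
            raise ValueError('environment_config must at least name an image')
        return tokens[:-1], tokens[-1]

    # split_docker_environment_config(
    #     '-v /Volumes/mnt/foo:/Volumes/mnt/foo --user sambvfx MY_CONTAINER_NAME')
    # -> (['-v', '/Volumes/mnt/foo:/Volumes/mnt/foo', '--user', 'sambvfx'],
    #     'MY_CONTAINER_NAME')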

>> Should we provide different args for mounting volume(s) and map that to the
>> appropriate docker command within the beam code?
>>
>> This requires a lot of docker-specific code to be included within beam.
>>
>> Any input would be appreciated.
>>
>> Cheers,
>> Sam
>>
>

Re: Docker Run Options in SDK Container

Posted by Lukasz Cwik <lc...@google.com>.
On Fri, Aug 2, 2019 at 11:00 AM Chad Dombrova <ch...@gmail.com> wrote:

> Hi all,
> I’m a bit confused about the desire to use json for the environment_config.
>
Note that complex PipelineOptions are already expected to be in JSON
format[1, 2]. This has solved many string parsing and ambiguity issues.
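
If the DOCKER config did move to json, one way to keep the existing
bare-image-name form working could look like this (a hypothetical sketch,
not current SDK behavior):

    import json

    def parse_docker_config(value):
        # Hypothetical: accept either the json form with image_name /
        # run_options keys, or fall back to a bare image name.
        try:
            config = json.loads(value)
        except ValueError:
            return {'image_name': value}
        return config if isinstance(config, dict) else {'image_name': value}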

> It’s harder to use json on the command line, such that now we’re talking
> about the value being *either* a docker image name *or* a path to a json
> file (OR maybe yaml too!), which is not only less convenient than just
> typing the docker args you want, it's also IMHO a dirty/inconsistent design.
>
> The idea of having each key of the json config map to a docker flag seems
> like a maintenance quagmire with little benefit.  In that case, Beam devs
> would have to maintain parity with docker options and be the arbiters of
> what's "safe" and what's not, and users would have to read additional beam
> documentation (which may or may not exist) to discover what keys are valid,
> rather than simply doing what they know, and passing the docker args.  As
> Sam points out, if security is a concern there are plenty of ways to abuse
> the system already. Security should be handled at the infrastructure
> deployment level, where it’s actually meaningful.
>
Wouldn't supporting every possible docker option also be a backwards
compatibility and portability quagmire?
Users could find that option X worked with Beam Y but no longer works with
Y+1, or works with runner A (which uses docker Z+1) but not with runner B
(which uses docker Z).

Both options have tradeoffs and the important part is whether the
convenience of specifying all options available to users via docker run
outweighs the drawbacks.

> It also seems like there’s already a precedent for encoding environment
> configuration as command line args. Consider the SUBPROCESS_SDK environment:
>
>     options = PipelineOptions()
>     options.view_as(PortableOptions).environment_type = \
>         python_urns.SUBPROCESS_SDK
>     options.view_as(PortableOptions).environment_config = \
>         b'/usr/bin/python -m apache_beam.runners.worker.sdk_worker_main'
>
> This could be encoded as json to avoid someone passing something nasty,
> but luckily that was *not* the choice that was made, because I think this
> is a fine design.
>
> As a result, I think the original proposal was the most elegant and
> consistent with other environment types:
>
> --environment_type DOCKER --environment_config "-v /Volumes/mnt/foo:/Volumes/mnt/foo --user sambvfx MY_CONTAINER_NAME"
>
>
Note that the help docs for --environment_config heavily suggest that the
intent for the command was always JSON:
Set environment configuration for running the user code. For DOCKER: Url
for the docker image.\n For PROCESS: json of the form {"os": "<OS>",
"arch": "<ARCHITECTURE>", "command": "<process to execute>",
"env":{"<Environment variables 1>": "<ENV_VAL>"} }. All fields in the json
are optional except command.
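
For example, the PROCESS form described there would be passed roughly like
this (the command path is purely illustrative):

    --environment_type PROCESS --environment_config '{"command": "/path/to/sdk_worker_boot"}'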

1:
https://github.com/apache/beam/blob/24e9cedcc768d901de795477fa78c7f357635671/sdks/java/core/src/main/java/org/apache/beam/sdk/options/PipelineOptionsFactory.java#L163
2:
https://github.com/apache/beam/blob/24e9cedcc768d901de795477fa78c7f357635671/sdks/python/apache_beam/options/pipeline_options.py#L822

Re: Docker Run Options in SDK Container

Posted by Chad Dombrova <ch...@gmail.com>.
Hi all,
I’m a bit confused about the desire to use json for the environment_config.

It’s harder to use json on the command line, such that now we’re talking
about the value being *either* a docker image name *or* a path to a json
file (OR maybe yaml too!), which is not only less convenient than just
typing the docker args you want, it's also IMHO a dirty/inconsistent design.

The idea of having each key of the json config map to a docker flag seems
like a maintenance quagmire with little benefit.  In that case, Beam devs
would have to maintain parity with docker options and be the arbiters of
what's "safe" and what's not, and users would have to read additional beam
documentation (which may or may not exist) to discover what keys are valid,
rather than simply doing what they know, and passing the docker args.  As
Sam points out, if security is a concern there are plenty of ways to abuse
the system already. Security should be handled at the infrastructure
deployment level, where it’s actually meaningful.

It also seems like there’s already a precedent for encoding environment
configuration as command line args. Consider the SUBPROCESS_SDK environment:

    options = PipelineOptions()
    options.view_as(PortableOptions).environment_type = \
        python_urns.SUBPROCESS_SDK
    options.view_as(PortableOptions).environment_config = \
        b'/usr/bin/python -m apache_beam.runners.worker.sdk_worker_main'

This could be encoded as json to avoid someone passing something nasty, but
luckily that was *not* the choice that was made, because I think this is a
fine design.

As a result, I think the original proposal was the most elegant and
consistent with other environment types:

--environment_type DOCKER --environment_config "-v
/Volumes/mnt/foo:/Volumes/mnt/foo --user sambvfx MY_CONTAINER_NAME"
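
For completeness, the same proposal expressed programmatically would look
roughly like this (a sketch that assumes the run-options-before-image-name
syntax from the PR is accepted):

    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.options.pipeline_options import PortableOptions

    options = PipelineOptions()
    options.view_as(PortableOptions).environment_type = 'DOCKER'
    options.view_as(PortableOptions).environment_config = (
        '-v /Volumes/mnt/foo:/Volumes/mnt/foo --user sambvfx MY_CONTAINER_NAME')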