You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Thomas Weise <th...@apache.org> on 2018/08/06 23:50:03 UTC

Process JobBundleFactory for portable runner

Hi,

Currently the portable Flink runner only works with SDK Docker containers
for execution (DockerJobBundleFactory, besides an in-process (embedded)
factory option for testing [1]). I'm considering adding another out of
process JobBundleFactory implementation that directly forks the processes
on the task manager host, eliminating the need for Docker. This would work
reasonably well in environments where the dependencies (in this case
Python) can easily be tied into the host deployment (also within an
application specific Kubernetes pod).

There was already some discussion about alternative JobBundleFactory
implementation in [2]. There is also a JIRA to make the bundle factory
pluggable [3], pending availability of runner level options.

For a "ProcessBundleFactory", in addition to the Python dependencies the
environment would also need to have the Go boot executable [4] (or a
substitute thereof) to perform the harness initialization.

Is anyone else interested in this SDK execution option or has already
investigated an alternative implementation?

Thanks,
Thomas

[1]
https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83

[2]
https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E

[3] https://issues.apache.org/jira/browse/BEAM-4819

[4] https://github.com/apache/beam/blob/master/sdks/python/container/boot.go

Re: Process JobBundleFactory for portable runner

Posted by Henning Rohde <he...@google.com>.

> Do we expect pipelines to always have a single environment for each
PTransform, thus the SDK is dictating how it is launched/managed or do we
expect for each SDK to say here are a couple of ways to run me, letting the
runner to decide?

Good point. I think we should allow multiple environments in the
representation -- even if it's always a singleton, the cost is only a small
loss in readability. A docker and embedded combo might be useful default
for Java, for example. The staged files needed might be different for
different environments, so there are some sharp edges. However, an SDK can
always send just one environment to avoid those.

One idea: we should perhaps allow multiple staged artifact manifests for a
job and add an artifact manifest id to the env (as opposed to having
multi-env pipelines organize the artifacts internally to make sense of
them). Artifacts are global to the pipeline right now, but it might be
easier to manage if they were per environment.

> What does providing the target os/arch provide in beam:env:process:v1?

It would allow supporting mixed linux/windows workers, for example, as well
as allow the runner to reject unsupported process environments upfront as a
sanity check. I'm thinking the latter would be convenient to catch user
mistakes with local/remote runners of different architecture.

Henning

On Fri, Aug 24, 2018 at 1:22 PM Lukasz Cwik <lc...@google.com> wrote:

> Do we expect pipelines to always have a single environment for each
> PTransform, thus the SDK is dictating how it is launched/managed or do we
> expect for each SDK to say here are a couple of ways to run me, letting the
> runner to decide?
>
> What does providing the target os/arch provide in beam:env:process:v1?
> I would suspect that the user must have supplied a script compatible with
> the cluster being run on. The user knows upfront what is being invoked.
> Note that launching a binary with the current container contract allows the
> binary to do plenty (activate virtualenv, install libraries in some SDK
> specific way oblivious to the runner, ...) so keeping it limited to a
> binary with some args being passed in required by the container contract
> seems to work well.
>
> +1 for making usage of URNs and payloads.
>
> On Fri, Aug 24, 2018 at 1:32 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> I think "external" still needs some way (I was suggesting grpc) to
>> pass the control address, etc. to whatever starts up the workers.
>>
>> Also, +1 to making this a URN. Embedded makes sense too.
>> On Fri, Aug 24, 2018 at 6:00 AM Thomas Weise <th...@apache.org> wrote:
>> >
>> > Option #3 "external" would fit the Kubernetes use case we discussed a
>> while ago also. Container(s) can be part of the same pod and need to find
>> the runner.
>> >
>> > There is another option: "embedded". When the SDK is Java and the
>> runner Flink (or all the other OSS runners), then harness can (optionally)
>> run embedded in the same JVM.
>> >
>> > Thanks,
>> > Thomas
>> >
>> >
>> > On Thu, Aug 23, 2018 at 9:14 AM Henning Rohde <he...@google.com>
>> wrote:
>> >>
>> >> A process-based SDK harness does not IMO imply that the host is fully
>> provisioned by the SDK/user and invoking the user command line in the
>> context of the staged files is a critical aspect for it to work. So I
>> consider staged artifact support needed. Also, I would like to suggest that
>> we move to a concrete environment proto to crystalize what is actually
>> being proposed. I'm not sure what activating a virtualenv would look like,
>> for example. To start things off:
>> >>
>> >> message Environment {
>> >>   string urn = 1;
>> >>   bytes payload = 2;
>> >> }
>> >>
>> >> // urn == "beam:env:docker:v1"
>> >> message DockerPayload {
>> >>   string container_image = 1;  // implicitly linux_amd64.
>> >> }
>> >>
>> >> // urn == "beam:env:process:v1"
>> >> message ProcessPayload {
>> >>   string os = 1;  // "linux", "darwin", ..
>> >>   string arch = 2;  // "amd64", ..
>> >>   string command_line = 3;
>> >> }
>> >>
>> >> // urn == "beam:env:external:v1"
>> >> // (no payload)
>> >>
>> >> A runner may support any subset and reject any unsupported
>> configuration. There are 3 kinds of environments that I think are useful:
>> >>  (1) docker: works as currently. Offers the most flexibility for SDKs
>> and users, especially when the runner is outside the control (such as
>> hosted runners). The runner starts the SDK harnesses.
>> >>  (2) process: as discussed here. The runner starts the SDK harnesses.
>> The semantics is that the shell commandline is invoked in a directory
>> rooted in the staged artifacts with the container contract arguments. It is
>> up to the user and runner deployment to ensure that it makes sense, i.e.,
>> on windows a linux binary or bash script is not specified. Executing the
>> user command in a shell env (bash, zsh, cmd, ..) ensures that paths and so
>> on are set up:, i.e., specifying "java -jar foo" would actually work.
>> Useful for cases where the user controls both the SDK and runner (such as
>> locally) or when docker is not an option. Intended to be minimal and
>> SDK/language agnostic.
>> >>  (3) external: this is what I think Robert was alluding to. The runner
>> does not start any SDK harnesses. Instead it waits for user-controlled SDK
>> harnesses to connect. Useful for manually debugging SDK code (connect from
>> code running in a debugger) or when the user code must run in a special or
>> privileged environment. It's runner-specific how the SDK will need to
>> connect.
>> >>
>> >> Part of the idea of placing this information in the environment is
>> that pipelines can potentially use multiple, such as cross-windows/linux.
>> >>
>> >> Henning
>> >>
>> >> On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <th...@apache.org> wrote:
>> >>>
>> >>> I would see support for staging libraries as optional / nice to have
>> since that can also be done as part of host provisioning (i.e. in the
>> Python case a virtual environment was already setup and just needs to be
>> activated).
>> >>>
>> >>> Depending on how the command that launches the harness is configured,
>> additional steps such as virtualenv activate or setting of other
>> environment variables can be included as well.
>> >>>
>> >>>
>> >>> On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <mx...@apache.org>
>> wrote:
>> >>>>
>> >>>> Just to recap:
>> >>>>
>> >>>>  From this and the other thread ("Bootstraping Beam's Job Server") we
>> >>>> got sufficient evidence that process-based execution is a desired
>> feature.
>> >>>>
>> >>>> Process-based execution as an alternative to dockerized execution
>> >>>> https://issues.apache.org/jira/browse/BEAM-5187
>> >>>>
>> >>>> Which parts are executed as a process?
>> >>>> => The SDK harness for user code
>> >>>>
>> >>>> What configuration options are supported?
>> >>>> => Provide information about the target architecture (OS/CPU)
>> >>>> => Staging libraries, as also supported by Docker
>> >>>> => Activating a pre-existing environment (e.g. virutalenv)
>> >>>>
>> >>>>
>> >>>> On 23.08.18 14:13, Maximilian Michels wrote:
>> >>>> >> One thing to consider that we've talked about in the past. It
>> might
>> >>>> >> make sense to extend the environment proto and have the SDK be
>> >>>> >> explicit about which kinds of environment it support
>> >>>> >
>> >>>> > +1 Encoding environment information there is a good idea.
>> >>>> >
>> >>>> >> Seems it will create a default docker url even if the
>> >>>> >> hardness_docker_image is set to None in pipeline options. Shall
>> we add
>> >>>> >> another option or honor the None in this option to support the
>> process
>> >>>> >> job?
>> >>>> >
>> >>>> > Yes, if no Docker image is set the default one will be used.
>> Currently
>> >>>> > Docker is the only way to execute pipelines with the
>> PortableRunner. If
>> >>>> > the docker_image is not set, execution won't succeed.
>> >>>> >
>> >>>> > On 22.08.18 22:59, Xinyu Liu wrote:
>> >>>> >> We are also interested in this Process JobBundleFactory as we are
>> >>>> >> planning to fork a process to run python sdk in Samza runner,
>> instead
>> >>>> >> of using docker container. So this change will be helpful to us
>> too.
>> >>>> >> On the same note, we are trying out portable_runner.py to submit a
>> >>>> >> python job. Seems it will create a default docker url even if the
>> >>>> >> hardness_docker_image is set to None in pipeline options. Shall
>> we add
>> >>>> >> another option or honor the None in this option to support the
>> process
>> >>>> >> job? I made some local changes right now to walk around this.
>> >>>> >>
>> >>>> >> Thanks,
>> >>>> >> Xinyu
>> >>>> >>
>> >>>> >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <
>> herohde@google.com
>> >>>> >> <ma...@google.com>> wrote:
>> >>>> >>
>> >>>> >>     By "enum" in quotes, I meant the usual open URN style pattern
>> not an
>> >>>> >>     actual enum. Sorry if that wasn't clear.
>> >>>> >>
>> >>>> >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <
>> lcwik@google.com
>> >>>> >>     <ma...@google.com>> wrote:
>> >>>> >>
>> >>>> >>         I would model the environment to be more free form then
>> enums
>> >>>> >>         such that we have forward looking extensibility and would
>> >>>> >>         suggest to follow the same pattern we use on PTransforms
>> (using
>> >>>> >>         an URN and a URN specific payload). Note that in this
>> case we
>> >>>> >>         may want to support a list of supported environments
>> (e.g. java,
>> >>>> >>         docker, python, ...).
>> >>>> >>
>> >>>> >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
>> >>>> >>         <herohde@google.com <ma...@google.com>> wrote:
>> >>>> >>
>> >>>> >>             One thing to consider that we've talked about in the
>> past.
>> >>>> >>             It might make sense to extend the environment proto
>> and have
>> >>>> >>             the SDK be explicit about which kinds of environment
>> it
>> >>>> >>             supports:
>> >>>> >>
>> >>>> >>
>> >>>> >>
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>> >>>> >>
>> >>>> >>
>> >>>> >> <
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>> >
>> >>>> >>
>> >>>> >>
>> >>>> >>             This choice might impact what files are staged or
>> what not.
>> >>>> >>             Some SDKs, such as Go, provide a compiled binary and
>> _need_
>> >>>> >>             to know what the target architecture is. Running on a
>> mac
>> >>>> >>             with and without docker, say, requires a different
>> worker in
>> >>>> >>             each case. If we add an "enum", we can also easily
>> add the
>> >>>> >>             external idea where the SDK/user starts the SDK
>> harnesses
>> >>>> >>             instead of the runner. Each runner may not support
>> all types
>> >>>> >>             of environments.
>> >>>> >>
>> >>>> >>             Henning
>> >>>> >>
>> >>>> >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
>> >>>> >>             <mxm@apache.org <ma...@apache.org>> wrote:
>> >>>> >>
>> >>>> >>                 For reference, here is corresponding JIRA issue
>> for this
>> >>>> >>                 thread:
>> >>>> >>                 https://issues.apache.org/jira/browse/BEAM-5187
>> >>>> >>                 <https://issues.apache.org/jira/browse/BEAM-5187>
>> >>>> >>
>> >>>> >>                 On 16.08.18 11:15, Maximilian Michels wrote:
>> >>>> >>                  > Makes sense to have an option to run the SDK
>> harness
>> >>>> >>                 in a non-dockerized
>> >>>> >>                  > environment.
>> >>>> >>                  >
>> >>>> >>                  > I'm in the process of creating a Docker entry
>> point
>> >>>> >>                 for Flink's
>> >>>> >>                  > JobServer[1]. I suppose you would also prefer
>> to
>> >>>> >>                 execute that one
>> >>>> >>                  > standalone. We should make sure this is also an
>> >>>> >> option.
>> >>>> >>                  >
>> >>>> >>                  > [1]
>> https://issues.apache.org/jira/browse/BEAM-4130
>> >>>> >>                 <https://issues.apache.org/jira/browse/BEAM-4130>
>> >>>> >>                  >
>> >>>> >>                  > On 16.08.18 07:42, Thomas Weise wrote:
>> >>>> >>                  >> Yes, that's the proposal. Everything that
>> would
>> >>>> >>                 otherwise be packaged
>> >>>> >>                  >> into the Docker container would need to be
>> >>>> >>                 pre-installed in the host
>> >>>> >>                  >> environment. In the case of Python SDK, that
>> could
>> >>>> >>                 simply mean a
>> >>>> >>                  >> (frozen) virtual environment that was setup
>> when the
>> >>>> >>                 host was
>> >>>> >>                  >> provisioned - the SDK harness process(es)
>> will then
>> >>>> >>                 just utilize that.
>> >>>> >>                  >> Of course this flavor of SDK harness
>> execution could
>> >>>> >>                 also be useful in
>> >>>> >>                  >> the local development environment, where
>> right now
>> >>>> >>                 someone who already
>> >>>> >>                  >> has the Python environment needs to also
>> install
>> >>>> >>                 Docker and update a
>> >>>> >>                  >> container to launch a Python SDK pipeline on
>> the
>> >>>> >>                 Flink runner.
>> >>>> >>                  >>
>> >>>> >>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel
>> Oliveira
>> >>>> >>                 <danoliveira@google.com <mailto:
>> danoliveira@google.com>
>> >>>> >>                  >> <mailto:danoliveira@google.com
>> >>>> >>                 <ma...@google.com>>> wrote:
>> >>>> >>                  >>
>> >>>> >>                  >>      I just want to clarify that I understand
>> this
>> >>>> >>                 correctly since I'm
>> >>>> >>                  >>      not that familiar with the details
>> behind all
>> >>>> >>                 these execution
>> >>>> >>                  >>      environments yet. Is the proposal to
>> create a
>> >>>> >>                 new JobBundleFactory
>> >>>> >>                  >>      that instead of using Docker to create
>> the
>> >>>> >>                 environment that the new
>> >>>> >>                  >>      processes will execute in, this
>> >>>> >>                 JobBundleFactory would execute the
>> >>>> >>                  >>      new processes directly in the host
>> environment?
>> >>>> >>                 So in practice if I
>> >>>> >>                  >>      ran a pipeline with this
>> JobBundleFactory the
>> >>>> >>                 SDK Harness and Runner
>> >>>> >>                  >>      Harness would both be executing directly
>> on my
>> >>>> >>                 machine and would
>> >>>> >>                  >>      depend on me having the dependencies
>> already
>> >>>> >>                 present on my machine?
>> >>>> >>                  >>
>> >>>> >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur
>> Goenka
>> >>>> >>                 <goenka@google.com <ma...@google.com>
>> >>>> >>                  >>      <mailto:goenka@google.com
>> >>>> >>                 <ma...@google.com>>> wrote:
>> >>>> >>                  >>
>> >>>> >>                  >>          Thanks for starting the discussion.
>> I will
>> >>>> >>                 be happy to help.
>> >>>> >>                  >>          I agree, we should have pluggable
>> >>>> >>                 SDKHarness environment Factory.
>> >>>> >>                  >>          We can register multiple Environment
>> >>>> >>                 factory using service
>> >>>> >>                  >>          registry and use the PipelineOption
>> to pick
>> >>>> >>                 the right one on per
>> >>>> >>                  >>          job basis.
>> >>>> >>                  >>
>> >>>> >>                  >>          There are a couple of things which
>> are
>> >>>> >>                 require to setup before
>> >>>> >>                  >>          launching the process.
>> >>>> >>                  >>
>> >>>> >>                  >>            * Setting up the environment as
>> done in
>> >>>> >>                 boot.go [4]
>> >>>> >>                  >>            * Retrieving and putting the
>> artifacts in
>> >>>> >>                 the right location.
>> >>>> >>                  >>
>> >>>> >>                  >>          You can probably leverage boot.go
>> code to
>> >>>> >>                 setup the environment.
>> >>>> >>                  >>
>> >>>> >>                  >>          Also, it will be useful to enumerate
>> pros
>> >>>> >>                 and cons of different
>> >>>> >>                  >>          Environments to help users choose
>> the right
>> >>>> >>                 one.
>> >>>> >>                  >>
>> >>>> >>                  >>
>> >>>> >>                  >>          On Mon, Aug 6, 2018 at 4:50 PM
>> Thomas Weise
>> >>>> >>                 <thw@apache.org <ma...@apache.org>
>> >>>> >>                  >>          <mailto:thw@apache.org
>> >>>> >>                 <ma...@apache.org>>> wrote:
>> >>>> >>                  >>
>> >>>> >>                  >>              Hi,
>> >>>> >>                  >>
>> >>>> >>                  >>              Currently the portable Flink
>> runner
>> >>>> >>                 only works with SDK
>> >>>> >>                  >>              Docker containers for execution
>> >>>> >>                 (DockerJobBundleFactory,
>> >>>> >>                  >>              besides an in-process (embedded)
>> >>>> >>                 factory option for testing
>> >>>> >>                  >>              [1]). I'm considering adding
>> another
>> >>>> >>                 out of process
>> >>>> >>                  >>              JobBundleFactory implementation
>> that
>> >>>> >>                 directly forks the
>> >>>> >>                  >>              processes on the task manager
>> host,
>> >>>> >>                 eliminating the need for
>> >>>> >>                  >>              Docker. This would work
>> reasonably well
>> >>>> >>                 in environments
>> >>>> >>                  >>              where the dependencies (in this
>> case
>> >>>> >>                 Python) can easily be
>> >>>> >>                  >>              tied into the host deployment
>> (also
>> >>>> >>                 within an application
>> >>>> >>                  >>              specific Kubernetes pod).
>> >>>> >>                  >>
>> >>>> >>                  >>              There was already some
>> discussion about
>> >>>> >>                 alternative
>> >>>> >>                  >>              JobBundleFactory implementation
>> in [2].
>> >>>> >>                 There is also a JIRA
>> >>>> >>                  >>              to make the bundle factory
>> pluggable
>> >>>> >>                 [3], pending
>> >>>> >>                  >>              availability of runner level
>> options.
>> >>>> >>                  >>
>> >>>> >>                  >>              For a "ProcessBundleFactory", in
>> >>>> >>                 addition to the Python
>> >>>> >>                  >>              dependencies the environment
>> would also
>> >>>> >>                 need to have the Go
>> >>>> >>                  >>              boot executable [4] (or a
>> substitute
>> >>>> >>                 thereof) to perform the
>> >>>> >>                  >>              harness initialization.
>> >>>> >>                  >>
>> >>>> >>                  >>              Is anyone else interested in
>> this SDK
>> >>>> >>                 execution option or
>> >>>> >>                  >>              has already investigated an
>> alternative
>> >>>> >>                 implementation?
>> >>>> >>                  >>
>> >>>> >>                  >>              Thanks,
>> >>>> >>                  >>              Thomas
>> >>>> >>                  >>
>> >>>> >>                  >>              [1]
>> >>>> >>                  >>
>> >>>> >>
>> >>>> >>
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>> >>>> >>
>> >>>> >>
>> >>>> >> <
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>> >
>> >>>> >>
>> >>>> >>                  >>
>> >>>> >>                  >>              [2]
>> >>>> >>                  >>
>> >>>> >>
>> >>>> >>
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>> >>>> >>
>> >>>> >>
>> >>>> >> <
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>> >
>> >>>> >>
>> >>>> >>                  >>
>> >>>> >>                  >>              [3]
>> >>>> >>                 https://issues.apache.org/jira/browse/BEAM-4819
>> >>>> >>                 <https://issues.apache.org/jira/browse/BEAM-4819>
>> >>>> >>                  >>
>> >>>> >>                  >>              [4]
>> >>>> >>
>> >>>> >>
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>> >>>> >>
>> >>>> >> <
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
>> >>>> >>
>> >>>> >>                  >>
>> >>>> >>
>> >>>> >>                 --                 Max
>> >>>> >>
>> >>>> >>
>> >>>> >
>> >>>>
>> >>>> --
>> >>>> Max
>>
>

Re: Process JobBundleFactory for portable runner

Posted by Lukasz Cwik <lc...@google.com>.

Do we expect pipelines to always have a single environment for each
PTransform, thus the SDK is dictating how it is launched/managed or do we
expect for each SDK to say here are a couple of ways to run me, letting the
runner to decide?

What does providing the target os/arch provide in beam:env:process:v1?
I would suspect that the user must have supplied a script compatible with
the cluster being run on. The user knows upfront what is being invoked.
Note that launching a binary with the current container contract allows the
binary to do plenty (activate virtualenv, install libraries in some SDK
specific way oblivious to the runner, ...) so keeping it limited to a
binary with some args being passed in required by the container contract
seems to work well.

+1 for making usage of URNs and payloads.

On Fri, Aug 24, 2018 at 1:32 AM Robert Bradshaw <ro...@google.com> wrote:

> I think "external" still needs some way (I was suggesting grpc) to
> pass the control address, etc. to whatever starts up the workers.
>
> Also, +1 to making this a URN. Embedded makes sense too.
> On Fri, Aug 24, 2018 at 6:00 AM Thomas Weise <th...@apache.org> wrote:
> >
> > Option #3 "external" would fit the Kubernetes use case we discussed a
> while ago also. Container(s) can be part of the same pod and need to find
> the runner.
> >
> > There is another option: "embedded". When the SDK is Java and the runner
> Flink (or all the other OSS runners), then harness can (optionally) run
> embedded in the same JVM.
> >
> > Thanks,
> > Thomas
> >
> >
> > On Thu, Aug 23, 2018 at 9:14 AM Henning Rohde <he...@google.com>
> wrote:
> >>
> >> A process-based SDK harness does not IMO imply that the host is fully
> provisioned by the SDK/user and invoking the user command line in the
> context of the staged files is a critical aspect for it to work. So I
> consider staged artifact support needed. Also, I would like to suggest that
> we move to a concrete environment proto to crystalize what is actually
> being proposed. I'm not sure what activating a virtualenv would look like,
> for example. To start things off:
> >>
> >> message Environment {
> >>   string urn = 1;
> >>   bytes payload = 2;
> >> }
> >>
> >> // urn == "beam:env:docker:v1"
> >> message DockerPayload {
> >>   string container_image = 1;  // implicitly linux_amd64.
> >> }
> >>
> >> // urn == "beam:env:process:v1"
> >> message ProcessPayload {
> >>   string os = 1;  // "linux", "darwin", ..
> >>   string arch = 2;  // "amd64", ..
> >>   string command_line = 3;
> >> }
> >>
> >> // urn == "beam:env:external:v1"
> >> // (no payload)
> >>
> >> A runner may support any subset and reject any unsupported
> configuration. There are 3 kinds of environments that I think are useful:
> >>  (1) docker: works as currently. Offers the most flexibility for SDKs
> and users, especially when the runner is outside the control (such as
> hosted runners). The runner starts the SDK harnesses.
> >>  (2) process: as discussed here. The runner starts the SDK harnesses.
> The semantics is that the shell commandline is invoked in a directory
> rooted in the staged artifacts with the container contract arguments. It is
> up to the user and runner deployment to ensure that it makes sense, i.e.,
> on windows a linux binary or bash script is not specified. Executing the
> user command in a shell env (bash, zsh, cmd, ..) ensures that paths and so
> on are set up:, i.e., specifying "java -jar foo" would actually work.
> Useful for cases where the user controls both the SDK and runner (such as
> locally) or when docker is not an option. Intended to be minimal and
> SDK/language agnostic.
> >>  (3) external: this is what I think Robert was alluding to. The runner
> does not start any SDK harnesses. Instead it waits for user-controlled SDK
> harnesses to connect. Useful for manually debugging SDK code (connect from
> code running in a debugger) or when the user code must run in a special or
> privileged environment. It's runner-specific how the SDK will need to
> connect.
> >>
> >> Part of the idea of placing this information in the environment is that
> pipelines can potentially use multiple, such as cross-windows/linux.
> >>
> >> Henning
> >>
> >> On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <th...@apache.org> wrote:
> >>>
> >>> I would see support for staging libraries as optional / nice to have
> since that can also be done as part of host provisioning (i.e. in the
> Python case a virtual environment was already setup and just needs to be
> activated).
> >>>
> >>> Depending on how the command that launches the harness is configured,
> additional steps such as virtualenv activate or setting of other
> environment variables can be included as well.
> >>>
> >>>
> >>> On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <mx...@apache.org>
> wrote:
> >>>>
> >>>> Just to recap:
> >>>>
> >>>>  From this and the other thread ("Bootstraping Beam's Job Server") we
> >>>> got sufficient evidence that process-based execution is a desired
> feature.
> >>>>
> >>>> Process-based execution as an alternative to dockerized execution
> >>>> https://issues.apache.org/jira/browse/BEAM-5187
> >>>>
> >>>> Which parts are executed as a process?
> >>>> => The SDK harness for user code
> >>>>
> >>>> What configuration options are supported?
> >>>> => Provide information about the target architecture (OS/CPU)
> >>>> => Staging libraries, as also supported by Docker
> >>>> => Activating a pre-existing environment (e.g. virutalenv)
> >>>>
> >>>>
> >>>> On 23.08.18 14:13, Maximilian Michels wrote:
> >>>> >> One thing to consider that we've talked about in the past. It might
> >>>> >> make sense to extend the environment proto and have the SDK be
> >>>> >> explicit about which kinds of environment it support
> >>>> >
> >>>> > +1 Encoding environment information there is a good idea.
> >>>> >
> >>>> >> Seems it will create a default docker url even if the
> >>>> >> hardness_docker_image is set to None in pipeline options. Shall we
> add
> >>>> >> another option or honor the None in this option to support the
> process
> >>>> >> job?
> >>>> >
> >>>> > Yes, if no Docker image is set the default one will be used.
> Currently
> >>>> > Docker is the only way to execute pipelines with the
> PortableRunner. If
> >>>> > the docker_image is not set, execution won't succeed.
> >>>> >
> >>>> > On 22.08.18 22:59, Xinyu Liu wrote:
> >>>> >> We are also interested in this Process JobBundleFactory as we are
> >>>> >> planning to fork a process to run python sdk in Samza runner,
> instead
> >>>> >> of using docker container. So this change will be helpful to us
> too.
> >>>> >> On the same note, we are trying out portable_runner.py to submit a
> >>>> >> python job. Seems it will create a default docker url even if the
> >>>> >> hardness_docker_image is set to None in pipeline options. Shall we
> add
> >>>> >> another option or honor the None in this option to support the
> process
> >>>> >> job? I made some local changes right now to walk around this.
> >>>> >>
> >>>> >> Thanks,
> >>>> >> Xinyu
> >>>> >>
> >>>> >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <
> herohde@google.com
> >>>> >> <ma...@google.com>> wrote:
> >>>> >>
> >>>> >>     By "enum" in quotes, I meant the usual open URN style pattern
> not an
> >>>> >>     actual enum. Sorry if that wasn't clear.
> >>>> >>
> >>>> >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lcwik@google.com
> >>>> >>     <ma...@google.com>> wrote:
> >>>> >>
> >>>> >>         I would model the environment to be more free form then
> enums
> >>>> >>         such that we have forward looking extensibility and would
> >>>> >>         suggest to follow the same pattern we use on PTransforms
> (using
> >>>> >>         an URN and a URN specific payload). Note that in this case
> we
> >>>> >>         may want to support a list of supported environments (e.g.
> java,
> >>>> >>         docker, python, ...).
> >>>> >>
> >>>> >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
> >>>> >>         <herohde@google.com <ma...@google.com>> wrote:
> >>>> >>
> >>>> >>             One thing to consider that we've talked about in the
> past.
> >>>> >>             It might make sense to extend the environment proto
> and have
> >>>> >>             the SDK be explicit about which kinds of environment it
> >>>> >>             supports:
> >>>> >>
> >>>> >>
> >>>> >>
> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
> >>>> >>
> >>>> >>
> >>>> >> <
> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
> >
> >>>> >>
> >>>> >>
> >>>> >>             This choice might impact what files are staged or what
> not.
> >>>> >>             Some SDKs, such as Go, provide a compiled binary and
> _need_
> >>>> >>             to know what the target architecture is. Running on a
> mac
> >>>> >>             with and without docker, say, requires a different
> worker in
> >>>> >>             each case. If we add an "enum", we can also easily add
> the
> >>>> >>             external idea where the SDK/user starts the SDK
> harnesses
> >>>> >>             instead of the runner. Each runner may not support all
> types
> >>>> >>             of environments.
> >>>> >>
> >>>> >>             Henning
> >>>> >>
> >>>> >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
> >>>> >>             <mxm@apache.org <ma...@apache.org>> wrote:
> >>>> >>
> >>>> >>                 For reference, here is corresponding JIRA issue
> for this
> >>>> >>                 thread:
> >>>> >>                 https://issues.apache.org/jira/browse/BEAM-5187
> >>>> >>                 <https://issues.apache.org/jira/browse/BEAM-5187>
> >>>> >>
> >>>> >>                 On 16.08.18 11:15, Maximilian Michels wrote:
> >>>> >>                  > Makes sense to have an option to run the SDK
> harness
> >>>> >>                 in a non-dockerized
> >>>> >>                  > environment.
> >>>> >>                  >
> >>>> >>                  > I'm in the process of creating a Docker entry
> point
> >>>> >>                 for Flink's
> >>>> >>                  > JobServer[1]. I suppose you would also prefer to
> >>>> >>                 execute that one
> >>>> >>                  > standalone. We should make sure this is also an
> >>>> >> option.
> >>>> >>                  >
> >>>> >>                  > [1]
> https://issues.apache.org/jira/browse/BEAM-4130
> >>>> >>                 <https://issues.apache.org/jira/browse/BEAM-4130>
> >>>> >>                  >
> >>>> >>                  > On 16.08.18 07:42, Thomas Weise wrote:
> >>>> >>                  >> Yes, that's the proposal. Everything that would
> >>>> >>                 otherwise be packaged
> >>>> >>                  >> into the Docker container would need to be
> >>>> >>                 pre-installed in the host
> >>>> >>                  >> environment. In the case of Python SDK, that
> could
> >>>> >>                 simply mean a
> >>>> >>                  >> (frozen) virtual environment that was setup
> when the
> >>>> >>                 host was
> >>>> >>                  >> provisioned - the SDK harness process(es) will
> then
> >>>> >>                 just utilize that.
> >>>> >>                  >> Of course this flavor of SDK harness execution
> could
> >>>> >>                 also be useful in
> >>>> >>                  >> the local development environment, where right
> now
> >>>> >>                 someone who already
> >>>> >>                  >> has the Python environment needs to also
> install
> >>>> >>                 Docker and update a
> >>>> >>                  >> container to launch a Python SDK pipeline on
> the
> >>>> >>                 Flink runner.
> >>>> >>                  >>
> >>>> >>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel
> Oliveira
> >>>> >>                 <danoliveira@google.com <mailto:
> danoliveira@google.com>
> >>>> >>                  >> <mailto:danoliveira@google.com
> >>>> >>                 <ma...@google.com>>> wrote:
> >>>> >>                  >>
> >>>> >>                  >>      I just want to clarify that I understand
> this
> >>>> >>                 correctly since I'm
> >>>> >>                  >>      not that familiar with the details behind
> all
> >>>> >>                 these execution
> >>>> >>                  >>      environments yet. Is the proposal to
> create a
> >>>> >>                 new JobBundleFactory
> >>>> >>                  >>      that instead of using Docker to create the
> >>>> >>                 environment that the new
> >>>> >>                  >>      processes will execute in, this
> >>>> >>                 JobBundleFactory would execute the
> >>>> >>                  >>      new processes directly in the host
> environment?
> >>>> >>                 So in practice if I
> >>>> >>                  >>      ran a pipeline with this JobBundleFactory
> the
> >>>> >>                 SDK Harness and Runner
> >>>> >>                  >>      Harness would both be executing directly
> on my
> >>>> >>                 machine and would
> >>>> >>                  >>      depend on me having the dependencies
> already
> >>>> >>                 present on my machine?
> >>>> >>                  >>
> >>>> >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur
> Goenka
> >>>> >>                 <goenka@google.com <ma...@google.com>
> >>>> >>                  >>      <mailto:goenka@google.com
> >>>> >>                 <ma...@google.com>>> wrote:
> >>>> >>                  >>
> >>>> >>                  >>          Thanks for starting the discussion. I
> will
> >>>> >>                 be happy to help.
> >>>> >>                  >>          I agree, we should have pluggable
> >>>> >>                 SDKHarness environment Factory.
> >>>> >>                  >>          We can register multiple Environment
> >>>> >>                 factory using service
> >>>> >>                  >>          registry and use the PipelineOption
> to pick
> >>>> >>                 the right one on per
> >>>> >>                  >>          job basis.
> >>>> >>                  >>
> >>>> >>                  >>          There are a couple of things which are
> >>>> >>                 require to setup before
> >>>> >>                  >>          launching the process.
> >>>> >>                  >>
> >>>> >>                  >>            * Setting up the environment as
> done in
> >>>> >>                 boot.go [4]
> >>>> >>                  >>            * Retrieving and putting the
> artifacts in
> >>>> >>                 the right location.
> >>>> >>                  >>
> >>>> >>                  >>          You can probably leverage boot.go
> code to
> >>>> >>                 setup the environment.
> >>>> >>                  >>
> >>>> >>                  >>          Also, it will be useful to enumerate
> pros
> >>>> >>                 and cons of different
> >>>> >>                  >>          Environments to help users choose the
> right
> >>>> >>                 one.
> >>>> >>                  >>
> >>>> >>                  >>
> >>>> >>                  >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas
> Weise
> >>>> >>                 <thw@apache.org <ma...@apache.org>
> >>>> >>                  >>          <mailto:thw@apache.org
> >>>> >>                 <ma...@apache.org>>> wrote:
> >>>> >>                  >>
> >>>> >>                  >>              Hi,
> >>>> >>                  >>
> >>>> >>                  >>              Currently the portable Flink
> runner
> >>>> >>                 only works with SDK
> >>>> >>                  >>              Docker containers for execution
> >>>> >>                 (DockerJobBundleFactory,
> >>>> >>                  >>              besides an in-process (embedded)
> >>>> >>                 factory option for testing
> >>>> >>                  >>              [1]). I'm considering adding
> another
> >>>> >>                 out of process
> >>>> >>                  >>              JobBundleFactory implementation
> that
> >>>> >>                 directly forks the
> >>>> >>                  >>              processes on the task manager
> host,
> >>>> >>                 eliminating the need for
> >>>> >>                  >>              Docker. This would work
> reasonably well
> >>>> >>                 in environments
> >>>> >>                  >>              where the dependencies (in this
> case
> >>>> >>                 Python) can easily be
> >>>> >>                  >>              tied into the host deployment
> (also
> >>>> >>                 within an application
> >>>> >>                  >>              specific Kubernetes pod).
> >>>> >>                  >>
> >>>> >>                  >>              There was already some discussion
> about
> >>>> >>                 alternative
> >>>> >>                  >>              JobBundleFactory implementation
> in [2].
> >>>> >>                 There is also a JIRA
> >>>> >>                  >>              to make the bundle factory
> pluggable
> >>>> >>                 [3], pending
> >>>> >>                  >>              availability of runner level
> options.
> >>>> >>                  >>
> >>>> >>                  >>              For a "ProcessBundleFactory", in
> >>>> >>                 addition to the Python
> >>>> >>                  >>              dependencies the environment
> would also
> >>>> >>                 need to have the Go
> >>>> >>                  >>              boot executable [4] (or a
> substitute
> >>>> >>                 thereof) to perform the
> >>>> >>                  >>              harness initialization.
> >>>> >>                  >>
> >>>> >>                  >>              Is anyone else interested in this
> SDK
> >>>> >>                 execution option or
> >>>> >>                  >>              has already investigated an
> alternative
> >>>> >>                 implementation?
> >>>> >>                  >>
> >>>> >>                  >>              Thanks,
> >>>> >>                  >>              Thomas
> >>>> >>                  >>
> >>>> >>                  >>              [1]
> >>>> >>                  >>
> >>>> >>
> >>>> >>
> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
> >>>> >>
> >>>> >>
> >>>> >> <
> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
> >
> >>>> >>
> >>>> >>                  >>
> >>>> >>                  >>              [2]
> >>>> >>                  >>
> >>>> >>
> >>>> >>
> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
> >>>> >>
> >>>> >>
> >>>> >> <
> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
> >
> >>>> >>
> >>>> >>                  >>
> >>>> >>                  >>              [3]
> >>>> >>                 https://issues.apache.org/jira/browse/BEAM-4819
> >>>> >>                 <https://issues.apache.org/jira/browse/BEAM-4819>
> >>>> >>                  >>
> >>>> >>                  >>              [4]
> >>>> >>
> >>>> >>
> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
> >>>> >>
> >>>> >> <
> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
> >>>> >>
> >>>> >>                  >>
> >>>> >>
> >>>> >>                 --                 Max
> >>>> >>
> >>>> >>
> >>>> >
> >>>>
> >>>> --
> >>>> Max
>

Re: Process JobBundleFactory for portable runner

Posted by Robert Bradshaw <ro...@google.com>.

I think "external" still needs some way (I was suggesting grpc) to
pass the control address, etc. to whatever starts up the workers.

Also, +1 to making this a URN. Embedded makes sense too.
On Fri, Aug 24, 2018 at 6:00 AM Thomas Weise <th...@apache.org> wrote:
>
> Option #3 "external" would fit the Kubernetes use case we discussed a while ago also. Container(s) can be part of the same pod and need to find the runner.
>
> There is another option: "embedded". When the SDK is Java and the runner Flink (or all the other OSS runners), then harness can (optionally) run embedded in the same JVM.
>
> Thanks,
> Thomas
>
>
> On Thu, Aug 23, 2018 at 9:14 AM Henning Rohde <he...@google.com> wrote:
>>
>> A process-based SDK harness does not IMO imply that the host is fully provisioned by the SDK/user and invoking the user command line in the context of the staged files is a critical aspect for it to work. So I consider staged artifact support needed. Also, I would like to suggest that we move to a concrete environment proto to crystalize what is actually being proposed. I'm not sure what activating a virtualenv would look like, for example. To start things off:
>>
>> message Environment {
>>   string urn = 1;
>>   bytes payload = 2;
>> }
>>
>> // urn == "beam:env:docker:v1"
>> message DockerPayload {
>>   string container_image = 1;  // implicitly linux_amd64.
>> }
>>
>> // urn == "beam:env:process:v1"
>> message ProcessPayload {
>>   string os = 1;  // "linux", "darwin", ..
>>   string arch = 2;  // "amd64", ..
>>   string command_line = 3;
>> }
>>
>> // urn == "beam:env:external:v1"
>> // (no payload)
>>
>> A runner may support any subset and reject any unsupported configuration. There are 3 kinds of environments that I think are useful:
>>  (1) docker: works as currently. Offers the most flexibility for SDKs and users, especially when the runner is outside the control (such as hosted runners). The runner starts the SDK harnesses.
>>  (2) process: as discussed here. The runner starts the SDK harnesses. The semantics is that the shell commandline is invoked in a directory rooted in the staged artifacts with the container contract arguments. It is up to the user and runner deployment to ensure that it makes sense, i.e., on windows a linux binary or bash script is not specified. Executing the user command in a shell env (bash, zsh, cmd, ..) ensures that paths and so on are set up:, i.e., specifying "java -jar foo" would actually work. Useful for cases where the user controls both the SDK and runner (such as locally) or when docker is not an option. Intended to be minimal and SDK/language agnostic.
>>  (3) external: this is what I think Robert was alluding to. The runner does not start any SDK harnesses. Instead it waits for user-controlled SDK harnesses to connect. Useful for manually debugging SDK code (connect from code running in a debugger) or when the user code must run in a special or privileged environment. It's runner-specific how the SDK will need to connect.
>>
>> Part of the idea of placing this information in the environment is that pipelines can potentially use multiple, such as cross-windows/linux.
>>
>> Henning
>>
>> On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <th...@apache.org> wrote:
>>>
>>> I would see support for staging libraries as optional / nice to have since that can also be done as part of host provisioning (i.e. in the Python case a virtual environment was already setup and just needs to be activated).
>>>
>>> Depending on how the command that launches the harness is configured, additional steps such as virtualenv activate or setting of other environment variables can be included as well.
>>>
>>>
>>> On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <mx...@apache.org> wrote:
>>>>
>>>> Just to recap:
>>>>
>>>>  From this and the other thread ("Bootstraping Beam's Job Server") we
>>>> got sufficient evidence that process-based execution is a desired feature.
>>>>
>>>> Process-based execution as an alternative to dockerized execution
>>>> https://issues.apache.org/jira/browse/BEAM-5187
>>>>
>>>> Which parts are executed as a process?
>>>> => The SDK harness for user code
>>>>
>>>> What configuration options are supported?
>>>> => Provide information about the target architecture (OS/CPU)
>>>> => Staging libraries, as also supported by Docker
>>>> => Activating a pre-existing environment (e.g. virutalenv)
>>>>
>>>>
>>>> On 23.08.18 14:13, Maximilian Michels wrote:
>>>> >> One thing to consider that we've talked about in the past. It might
>>>> >> make sense to extend the environment proto and have the SDK be
>>>> >> explicit about which kinds of environment it support
>>>> >
>>>> > +1 Encoding environment information there is a good idea.
>>>> >
>>>> >> Seems it will create a default docker url even if the
>>>> >> hardness_docker_image is set to None in pipeline options. Shall we add
>>>> >> another option or honor the None in this option to support the process
>>>> >> job?
>>>> >
>>>> > Yes, if no Docker image is set the default one will be used. Currently
>>>> > Docker is the only way to execute pipelines with the PortableRunner. If
>>>> > the docker_image is not set, execution won't succeed.
>>>> >
>>>> > On 22.08.18 22:59, Xinyu Liu wrote:
>>>> >> We are also interested in this Process JobBundleFactory as we are
>>>> >> planning to fork a process to run python sdk in Samza runner, instead
>>>> >> of using docker container. So this change will be helpful to us too.
>>>> >> On the same note, we are trying out portable_runner.py to submit a
>>>> >> python job. Seems it will create a default docker url even if the
>>>> >> hardness_docker_image is set to None in pipeline options. Shall we add
>>>> >> another option or honor the None in this option to support the process
>>>> >> job? I made some local changes right now to walk around this.
>>>> >>
>>>> >> Thanks,
>>>> >> Xinyu
>>>> >>
>>>> >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <herohde@google.com
>>>> >> <ma...@google.com>> wrote:
>>>> >>
>>>> >>     By "enum" in quotes, I meant the usual open URN style pattern not an
>>>> >>     actual enum. Sorry if that wasn't clear.
>>>> >>
>>>> >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lcwik@google.com
>>>> >>     <ma...@google.com>> wrote:
>>>> >>
>>>> >>         I would model the environment to be more free form then enums
>>>> >>         such that we have forward looking extensibility and would
>>>> >>         suggest to follow the same pattern we use on PTransforms (using
>>>> >>         an URN and a URN specific payload). Note that in this case we
>>>> >>         may want to support a list of supported environments (e.g. java,
>>>> >>         docker, python, ...).
>>>> >>
>>>> >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
>>>> >>         <herohde@google.com <ma...@google.com>> wrote:
>>>> >>
>>>> >>             One thing to consider that we've talked about in the past.
>>>> >>             It might make sense to extend the environment proto and have
>>>> >>             the SDK be explicit about which kinds of environment it
>>>> >>             supports:
>>>> >>
>>>> >>
>>>> >> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>>>> >>
>>>> >>
>>>> >> <https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969>
>>>> >>
>>>> >>
>>>> >>             This choice might impact what files are staged or what not.
>>>> >>             Some SDKs, such as Go, provide a compiled binary and _need_
>>>> >>             to know what the target architecture is. Running on a mac
>>>> >>             with and without docker, say, requires a different worker in
>>>> >>             each case. If we add an "enum", we can also easily add the
>>>> >>             external idea where the SDK/user starts the SDK harnesses
>>>> >>             instead of the runner. Each runner may not support all types
>>>> >>             of environments.
>>>> >>
>>>> >>             Henning
>>>> >>
>>>> >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
>>>> >>             <mxm@apache.org <ma...@apache.org>> wrote:
>>>> >>
>>>> >>                 For reference, here is corresponding JIRA issue for this
>>>> >>                 thread:
>>>> >>                 https://issues.apache.org/jira/browse/BEAM-5187
>>>> >>                 <https://issues.apache.org/jira/browse/BEAM-5187>
>>>> >>
>>>> >>                 On 16.08.18 11:15, Maximilian Michels wrote:
>>>> >>                  > Makes sense to have an option to run the SDK harness
>>>> >>                 in a non-dockerized
>>>> >>                  > environment.
>>>> >>                  >
>>>> >>                  > I'm in the process of creating a Docker entry point
>>>> >>                 for Flink's
>>>> >>                  > JobServer[1]. I suppose you would also prefer to
>>>> >>                 execute that one
>>>> >>                  > standalone. We should make sure this is also an
>>>> >> option.
>>>> >>                  >
>>>> >>                  > [1] https://issues.apache.org/jira/browse/BEAM-4130
>>>> >>                 <https://issues.apache.org/jira/browse/BEAM-4130>
>>>> >>                  >
>>>> >>                  > On 16.08.18 07:42, Thomas Weise wrote:
>>>> >>                  >> Yes, that's the proposal. Everything that would
>>>> >>                 otherwise be packaged
>>>> >>                  >> into the Docker container would need to be
>>>> >>                 pre-installed in the host
>>>> >>                  >> environment. In the case of Python SDK, that could
>>>> >>                 simply mean a
>>>> >>                  >> (frozen) virtual environment that was setup when the
>>>> >>                 host was
>>>> >>                  >> provisioned - the SDK harness process(es) will then
>>>> >>                 just utilize that.
>>>> >>                  >> Of course this flavor of SDK harness execution could
>>>> >>                 also be useful in
>>>> >>                  >> the local development environment, where right now
>>>> >>                 someone who already
>>>> >>                  >> has the Python environment needs to also install
>>>> >>                 Docker and update a
>>>> >>                  >> container to launch a Python SDK pipeline on the
>>>> >>                 Flink runner.
>>>> >>                  >>
>>>> >>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira
>>>> >>                 <danoliveira@google.com <ma...@google.com>
>>>> >>                  >> <mailto:danoliveira@google.com
>>>> >>                 <ma...@google.com>>> wrote:
>>>> >>                  >>
>>>> >>                  >>      I just want to clarify that I understand this
>>>> >>                 correctly since I'm
>>>> >>                  >>      not that familiar with the details behind all
>>>> >>                 these execution
>>>> >>                  >>      environments yet. Is the proposal to create a
>>>> >>                 new JobBundleFactory
>>>> >>                  >>      that instead of using Docker to create the
>>>> >>                 environment that the new
>>>> >>                  >>      processes will execute in, this
>>>> >>                 JobBundleFactory would execute the
>>>> >>                  >>      new processes directly in the host environment?
>>>> >>                 So in practice if I
>>>> >>                  >>      ran a pipeline with this JobBundleFactory the
>>>> >>                 SDK Harness and Runner
>>>> >>                  >>      Harness would both be executing directly on my
>>>> >>                 machine and would
>>>> >>                  >>      depend on me having the dependencies already
>>>> >>                 present on my machine?
>>>> >>                  >>
>>>> >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka
>>>> >>                 <goenka@google.com <ma...@google.com>
>>>> >>                  >>      <mailto:goenka@google.com
>>>> >>                 <ma...@google.com>>> wrote:
>>>> >>                  >>
>>>> >>                  >>          Thanks for starting the discussion. I will
>>>> >>                 be happy to help.
>>>> >>                  >>          I agree, we should have pluggable
>>>> >>                 SDKHarness environment Factory.
>>>> >>                  >>          We can register multiple Environment
>>>> >>                 factory using service
>>>> >>                  >>          registry and use the PipelineOption to pick
>>>> >>                 the right one on per
>>>> >>                  >>          job basis.
>>>> >>                  >>
>>>> >>                  >>          There are a couple of things which are
>>>> >>                 require to setup before
>>>> >>                  >>          launching the process.
>>>> >>                  >>
>>>> >>                  >>            * Setting up the environment as done in
>>>> >>                 boot.go [4]
>>>> >>                  >>            * Retrieving and putting the artifacts in
>>>> >>                 the right location.
>>>> >>                  >>
>>>> >>                  >>          You can probably leverage boot.go code to
>>>> >>                 setup the environment.
>>>> >>                  >>
>>>> >>                  >>          Also, it will be useful to enumerate pros
>>>> >>                 and cons of different
>>>> >>                  >>          Environments to help users choose the right
>>>> >>                 one.
>>>> >>                  >>
>>>> >>                  >>
>>>> >>                  >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise
>>>> >>                 <thw@apache.org <ma...@apache.org>
>>>> >>                  >>          <mailto:thw@apache.org
>>>> >>                 <ma...@apache.org>>> wrote:
>>>> >>                  >>
>>>> >>                  >>              Hi,
>>>> >>                  >>
>>>> >>                  >>              Currently the portable Flink runner
>>>> >>                 only works with SDK
>>>> >>                  >>              Docker containers for execution
>>>> >>                 (DockerJobBundleFactory,
>>>> >>                  >>              besides an in-process (embedded)
>>>> >>                 factory option for testing
>>>> >>                  >>              [1]). I'm considering adding another
>>>> >>                 out of process
>>>> >>                  >>              JobBundleFactory implementation that
>>>> >>                 directly forks the
>>>> >>                  >>              processes on the task manager host,
>>>> >>                 eliminating the need for
>>>> >>                  >>              Docker. This would work reasonably well
>>>> >>                 in environments
>>>> >>                  >>              where the dependencies (in this case
>>>> >>                 Python) can easily be
>>>> >>                  >>              tied into the host deployment (also
>>>> >>                 within an application
>>>> >>                  >>              specific Kubernetes pod).
>>>> >>                  >>
>>>> >>                  >>              There was already some discussion about
>>>> >>                 alternative
>>>> >>                  >>              JobBundleFactory implementation in [2].
>>>> >>                 There is also a JIRA
>>>> >>                  >>              to make the bundle factory pluggable
>>>> >>                 [3], pending
>>>> >>                  >>              availability of runner level options.
>>>> >>                  >>
>>>> >>                  >>              For a "ProcessBundleFactory", in
>>>> >>                 addition to the Python
>>>> >>                  >>              dependencies the environment would also
>>>> >>                 need to have the Go
>>>> >>                  >>              boot executable [4] (or a substitute
>>>> >>                 thereof) to perform the
>>>> >>                  >>              harness initialization.
>>>> >>                  >>
>>>> >>                  >>              Is anyone else interested in this SDK
>>>> >>                 execution option or
>>>> >>                  >>              has already investigated an alternative
>>>> >>                 implementation?
>>>> >>                  >>
>>>> >>                  >>              Thanks,
>>>> >>                  >>              Thomas
>>>> >>                  >>
>>>> >>                  >>              [1]
>>>> >>                  >>
>>>> >>
>>>> >> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>>> >>
>>>> >>
>>>> >> <https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83>
>>>> >>
>>>> >>                  >>
>>>> >>                  >>              [2]
>>>> >>                  >>
>>>> >>
>>>> >> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>>>> >>
>>>> >>
>>>> >> <https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E>
>>>> >>
>>>> >>                  >>
>>>> >>                  >>              [3]
>>>> >>                 https://issues.apache.org/jira/browse/BEAM-4819
>>>> >>                 <https://issues.apache.org/jira/browse/BEAM-4819>
>>>> >>                  >>
>>>> >>                  >>              [4]
>>>> >>
>>>> >> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>>>> >>
>>>> >> <https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
>>>> >>
>>>> >>                  >>
>>>> >>
>>>> >>                 --                 Max
>>>> >>
>>>> >>
>>>> >
>>>>
>>>> --
>>>> Max

Re: Process JobBundleFactory for portable runner

Posted by Thomas Weise <th...@apache.org>.

Option #3 "external" would fit the Kubernetes use case we discussed a while
ago also. Container(s) can be part of the same pod and need to find the
runner.

There is another option: "embedded". When the SDK is Java and the runner
Flink (or all the other OSS runners), then harness can (optionally) run
embedded in the same JVM.

Thanks,
Thomas


On Thu, Aug 23, 2018 at 9:14 AM Henning Rohde <he...@google.com> wrote:

> A process-based SDK harness does not IMO imply that the host is fully
> provisioned by the SDK/user and invoking the user command line in the
> context of the staged files is a critical aspect for it to work. So I
> consider staged artifact support needed. Also, I would like to suggest that
> we move to a concrete environment proto to crystalize what is actually
> being proposed. I'm not sure what activating a virtualenv would look like,
> for example. To start things off:
>
> message Environment {
>   string urn = 1;
>   bytes payload = 2;
> }
>
> // urn == "beam:env:docker:v1"
> message DockerPayload {
>   string container_image = 1;  // implicitly linux_amd64.
> }
>
> // urn == "beam:env:process:v1"
> message ProcessPayload {
>   string os = 1;  // "linux", "darwin", ..
>   string arch = 2;  // "amd64", ..
>   string command_line = 3;
> }
>
> // urn == "beam:env:external:v1"
> // (no payload)
>
> A runner may support any subset and reject any unsupported configuration.
> There are 3 kinds of environments that I think are useful:
>  (1) docker: works as currently. Offers the most flexibility for SDKs and
> users, especially when the runner is outside the control (such as hosted
> runners). The runner starts the SDK harnesses.
>  (2) process: as discussed here. The runner starts the SDK harnesses. The
> semantics is that the shell commandline is invoked in a directory rooted in
> the staged artifacts with the container contract arguments. It is up to the
> user and runner deployment to ensure that it makes sense, i.e., on windows
> a linux binary or bash script is not specified. Executing the user command
> in a shell env (bash, zsh, cmd, ..) ensures that paths and so on are set
> up:, i.e., specifying "java -jar foo" would actually work. Useful for cases
> where the user controls both the SDK and runner (such as locally) or when
> docker is not an option. Intended to be minimal and SDK/language agnostic.
>  (3) external: this is what I think Robert was alluding to. The runner
> does not start any SDK harnesses. Instead it waits for user-controlled SDK
> harnesses to connect. Useful for manually debugging SDK code (connect from
> code running in a debugger) or when the user code must run in a special or
> privileged environment. It's runner-specific how the SDK will need to
> connect.
>
> Part of the idea of placing this information in the environment is that
> pipelines can potentially use multiple, such as cross-windows/linux.
>
> Henning
>
> On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <th...@apache.org> wrote:
>
>> I would see support for staging libraries as optional / nice to have
>> since that can also be done as part of host provisioning (i.e. in the
>> Python case a virtual environment was already setup and just needs to be
>> activated).
>>
>> Depending on how the command that launches the harness is configured,
>> additional steps such as virtualenv activate or setting of other
>> environment variables can be included as well.
>>
>>
>> On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <mx...@apache.org>
>> wrote:
>>
>>> Just to recap:
>>>
>>>  From this and the other thread ("Bootstraping Beam's Job Server") we
>>> got sufficient evidence that process-based execution is a desired
>>> feature.
>>>
>>> Process-based execution as an alternative to dockerized execution
>>> https://issues.apache.org/jira/browse/BEAM-5187
>>>
>>> Which parts are executed as a process?
>>> => The SDK harness for user code
>>>
>>> What configuration options are supported?
>>> => Provide information about the target architecture (OS/CPU)
>>> => Staging libraries, as also supported by Docker
>>> => Activating a pre-existing environment (e.g. virutalenv)
>>>
>>>
>>> On 23.08.18 14:13, Maximilian Michels wrote:
>>> >> One thing to consider that we've talked about in the past. It might
>>> >> make sense to extend the environment proto and have the SDK be
>>> >> explicit about which kinds of environment it support
>>> >
>>> > +1 Encoding environment information there is a good idea.
>>> >
>>> >> Seems it will create a default docker url even if the
>>> >> hardness_docker_image is set to None in pipeline options. Shall we
>>> add
>>> >> another option or honor the None in this option to support the
>>> process
>>> >> job?
>>> >
>>> > Yes, if no Docker image is set the default one will be used. Currently
>>> > Docker is the only way to execute pipelines with the PortableRunner.
>>> If
>>> > the docker_image is not set, execution won't succeed.
>>> >
>>> > On 22.08.18 22:59, Xinyu Liu wrote:
>>> >> We are also interested in this Process JobBundleFactory as we are
>>> >> planning to fork a process to run python sdk in Samza runner, instead
>>> >> of using docker container. So this change will be helpful to us too.
>>> >> On the same note, we are trying out portable_runner.py to submit a
>>> >> python job. Seems it will create a default docker url even if the
>>> >> hardness_docker_image is set to None in pipeline options. Shall we
>>> add
>>> >> another option or honor the None in this option to support the
>>> process
>>> >> job? I made some local changes right now to walk around this.
>>> >>
>>> >> Thanks,
>>> >> Xinyu
>>> >>
>>> >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <herohde@google.com
>>> >> <ma...@google.com>> wrote:
>>> >>
>>> >>     By "enum" in quotes, I meant the usual open URN style pattern not
>>> an
>>> >>     actual enum. Sorry if that wasn't clear.
>>> >>
>>> >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lcwik@google.com
>>> >>     <ma...@google.com>> wrote:
>>> >>
>>> >>         I would model the environment to be more free form then enums
>>> >>         such that we have forward looking extensibility and would
>>> >>         suggest to follow the same pattern we use on PTransforms
>>> (using
>>> >>         an URN and a URN specific payload). Note that in this case we
>>> >>         may want to support a list of supported environments (e.g.
>>> java,
>>> >>         docker, python, ...).
>>> >>
>>> >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
>>> >>         <herohde@google.com <ma...@google.com>> wrote:
>>> >>
>>> >>             One thing to consider that we've talked about in the past.
>>> >>             It might make sense to extend the environment proto and
>>> have
>>> >>             the SDK be explicit about which kinds of environment it
>>> >>             supports:
>>> >>
>>> >>
>>> >>
>>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>>> >>
>>> >>
>>> >> <
>>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969>
>>>
>>> >>
>>> >>
>>> >>             This choice might impact what files are staged or what
>>> not.
>>> >>             Some SDKs, such as Go, provide a compiled binary and
>>> _need_
>>> >>             to know what the target architecture is. Running on a mac
>>> >>             with and without docker, say, requires a different worker
>>> in
>>> >>             each case. If we add an "enum", we can also easily add the
>>> >>             external idea where the SDK/user starts the SDK harnesses
>>> >>             instead of the runner. Each runner may not support all
>>> types
>>> >>             of environments.
>>> >>
>>> >>             Henning
>>> >>
>>> >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
>>> >>             <mxm@apache.org <ma...@apache.org>> wrote:
>>> >>
>>> >>                 For reference, here is corresponding JIRA issue for
>>> this
>>> >>                 thread:
>>> >>                 https://issues.apache.org/jira/browse/BEAM-5187
>>> >>                 <https://issues.apache.org/jira/browse/BEAM-5187>
>>> >>
>>> >>                 On 16.08.18 11:15, Maximilian Michels wrote:
>>> >>                  > Makes sense to have an option to run the SDK
>>> harness
>>> >>                 in a non-dockerized
>>> >>                  > environment.
>>> >>                  >
>>> >>                  > I'm in the process of creating a Docker entry point
>>> >>                 for Flink's
>>> >>                  > JobServer[1]. I suppose you would also prefer to
>>> >>                 execute that one
>>> >>                  > standalone. We should make sure this is also an
>>> >> option.
>>> >>                  >
>>> >>                  > [1]
>>> https://issues.apache.org/jira/browse/BEAM-4130
>>> >>                 <https://issues.apache.org/jira/browse/BEAM-4130>
>>> >>                  >
>>> >>                  > On 16.08.18 07:42, Thomas Weise wrote:
>>> >>                  >> Yes, that's the proposal. Everything that would
>>> >>                 otherwise be packaged
>>> >>                  >> into the Docker container would need to be
>>> >>                 pre-installed in the host
>>> >>                  >> environment. In the case of Python SDK, that could
>>> >>                 simply mean a
>>> >>                  >> (frozen) virtual environment that was setup when
>>> the
>>> >>                 host was
>>> >>                  >> provisioned - the SDK harness process(es) will
>>> then
>>> >>                 just utilize that.
>>> >>                  >> Of course this flavor of SDK harness execution
>>> could
>>> >>                 also be useful in
>>> >>                  >> the local development environment, where right now
>>> >>                 someone who already
>>> >>                  >> has the Python environment needs to also install
>>> >>                 Docker and update a
>>> >>                  >> container to launch a Python SDK pipeline on the
>>> >>                 Flink runner.
>>> >>                  >>
>>> >>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira
>>> >>                 <danoliveira@google.com <mailto:
>>> danoliveira@google.com>
>>> >>                  >> <mailto:danoliveira@google.com
>>> >>                 <ma...@google.com>>> wrote:
>>> >>                  >>
>>> >>                  >>      I just want to clarify that I understand this
>>> >>                 correctly since I'm
>>> >>                  >>      not that familiar with the details behind all
>>> >>                 these execution
>>> >>                  >>      environments yet. Is the proposal to create a
>>> >>                 new JobBundleFactory
>>> >>                  >>      that instead of using Docker to create the
>>> >>                 environment that the new
>>> >>                  >>      processes will execute in, this
>>> >>                 JobBundleFactory would execute the
>>> >>                  >>      new processes directly in the host
>>> environment?
>>> >>                 So in practice if I
>>> >>                  >>      ran a pipeline with this JobBundleFactory the
>>> >>                 SDK Harness and Runner
>>> >>                  >>      Harness would both be executing directly on
>>> my
>>> >>                 machine and would
>>> >>                  >>      depend on me having the dependencies already
>>> >>                 present on my machine?
>>> >>                  >>
>>> >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka
>>> >>                 <goenka@google.com <ma...@google.com>
>>> >>                  >>      <mailto:goenka@google.com
>>> >>                 <ma...@google.com>>> wrote:
>>> >>                  >>
>>> >>                  >>          Thanks for starting the discussion. I
>>> will
>>> >>                 be happy to help.
>>> >>                  >>          I agree, we should have pluggable
>>> >>                 SDKHarness environment Factory.
>>> >>                  >>          We can register multiple Environment
>>> >>                 factory using service
>>> >>                  >>          registry and use the PipelineOption to
>>> pick
>>> >>                 the right one on per
>>> >>                  >>          job basis.
>>> >>                  >>
>>> >>                  >>          There are a couple of things which are
>>> >>                 require to setup before
>>> >>                  >>          launching the process.
>>> >>                  >>
>>> >>                  >>            * Setting up the environment as done in
>>> >>                 boot.go [4]
>>> >>                  >>            * Retrieving and putting the artifacts
>>> in
>>> >>                 the right location.
>>> >>                  >>
>>> >>                  >>          You can probably leverage boot.go code to
>>> >>                 setup the environment.
>>> >>                  >>
>>> >>                  >>          Also, it will be useful to enumerate pros
>>> >>                 and cons of different
>>> >>                  >>          Environments to help users choose the
>>> right
>>> >>                 one.
>>> >>                  >>
>>> >>                  >>
>>> >>                  >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas
>>> Weise
>>> >>                 <thw@apache.org <ma...@apache.org>
>>> >>                  >>          <mailto:thw@apache.org
>>> >>                 <ma...@apache.org>>> wrote:
>>> >>                  >>
>>> >>                  >>              Hi,
>>> >>                  >>
>>> >>                  >>              Currently the portable Flink runner
>>> >>                 only works with SDK
>>> >>                  >>              Docker containers for execution
>>> >>                 (DockerJobBundleFactory,
>>> >>                  >>              besides an in-process (embedded)
>>> >>                 factory option for testing
>>> >>                  >>              [1]). I'm considering adding another
>>> >>                 out of process
>>> >>                  >>              JobBundleFactory implementation that
>>> >>                 directly forks the
>>> >>                  >>              processes on the task manager host,
>>> >>                 eliminating the need for
>>> >>                  >>              Docker. This would work reasonably
>>> well
>>> >>                 in environments
>>> >>                  >>              where the dependencies (in this case
>>> >>                 Python) can easily be
>>> >>                  >>              tied into the host deployment (also
>>> >>                 within an application
>>> >>                  >>              specific Kubernetes pod).
>>> >>                  >>
>>> >>                  >>              There was already some discussion
>>> about
>>> >>                 alternative
>>> >>                  >>              JobBundleFactory implementation in
>>> [2].
>>> >>                 There is also a JIRA
>>> >>                  >>              to make the bundle factory pluggable
>>> >>                 [3], pending
>>> >>                  >>              availability of runner level options.
>>> >>                  >>
>>> >>                  >>              For a "ProcessBundleFactory", in
>>> >>                 addition to the Python
>>> >>                  >>              dependencies the environment would
>>> also
>>> >>                 need to have the Go
>>> >>                  >>              boot executable [4] (or a substitute
>>> >>                 thereof) to perform the
>>> >>                  >>              harness initialization.
>>> >>                  >>
>>> >>                  >>              Is anyone else interested in this SDK
>>> >>                 execution option or
>>> >>                  >>              has already investigated an
>>> alternative
>>> >>                 implementation?
>>> >>                  >>
>>> >>                  >>              Thanks,
>>> >>                  >>              Thomas
>>> >>                  >>
>>> >>                  >>              [1]
>>> >>                  >>
>>> >>
>>> >>
>>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>> >>
>>> >>
>>> >> <
>>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83>
>>>
>>> >>
>>> >>                  >>
>>> >>                  >>              [2]
>>> >>                  >>
>>> >>
>>> >>
>>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>>> >>
>>> >>
>>> >> <
>>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E>
>>>
>>> >>
>>> >>                  >>
>>> >>                  >>              [3]
>>> >>                 https://issues.apache.org/jira/browse/BEAM-4819
>>> >>                 <https://issues.apache.org/jira/browse/BEAM-4819>
>>> >>                  >>
>>> >>                  >>              [4]
>>> >>
>>> >>
>>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>>> >>
>>> >> <
>>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
>>>
>>> >>
>>> >>                  >>
>>> >>
>>> >>                 --                 Max
>>> >>
>>> >>
>>> >
>>>
>>> --
>>> Max
>>>
>>

Re: Process JobBundleFactory for portable runner

Posted by Maximilian Michels <mx...@apache.org>.

+1 I think it makes sense to consolidate yours and Henning's proposal 
for the Environment changes. I've created a JIRA to collect the ideas. 
We can go ahead and implement this next.

https://issues.apache.org/jira/browse/BEAM-5288

On 01.09.18 03:38, Ankur Goenka wrote:
> Sorry for posting on a separate thread. Lets continue the discussion here.
> +1 for having URN to identify environment type. I think URN is better 
> than 'oneof' structure as its more flexible and forward compatible.
> 
> On Fri, Aug 31, 2018 at 9:14 AM Thomas Weise <thw@apache.org 
> <ma...@apache.org>> wrote:
> 
>     FYI the first part of support for (direct) process based job bundle
>     factory was merged (thanks Max!)
> 
>     https://github.com/apache/beam/pull/6287
> 
>     On top of that I have built a customization that runs the Python SDK
>     worker directly:
> 
>     https://github.com/lyft/beam/blob/release-2.8.0-lyft/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/LyftProcessJobBundleFactory.java
> 
>     While doing that, I thought it would be nice to support a custom
>     environment (along with the other options we already discussed) that
>     allows users to hook in their own extensions without having to do
>     stuff like this:
>     https://github.com/lyft/beam/pull/6/files#diff-b74ff692340bcae0032d119a7192624cR61
> 
>     Thanks,
>     Thomas
> 
> 
> 
> 
>     On Mon, Aug 27, 2018 at 2:43 AM Robert Bradshaw <robertwb@google.com
>     <ma...@google.com>> wrote:
> 
>         On Mon, Aug 27, 2018 at 11:23 AM Maximilian Michels
>         <mxm@apache.org <ma...@apache.org>> wrote:
>          >
>          > Thanks for your proposal Henning. +1 for explicit environment
>         messages.
>          > I'm not sure how important it is to support cross-platform
>         pipelines. I
>          > can foresee future use but I wouldn't consider it essential.
> 
>         One may want to execute Tensforflow pipeline segments from the
>         middle
>         of a Java pipeline, or leverage the SQL code (currently
>         implemented in
>         Java) from Python or Go. In my experience, a common pipeline
>         shape is
>         to do a significant amount of (often fairly trivial) filtering
>         at the
>         front of a pipeline, and sophisticated analysis at the end, and the
>         tradeoffs of execution efficiency vs. expressiveness and prototype
>         friendliness are different in these two halves of the same pipeline.
> 
>          > However, it
>          > basically comes for free if we extend the existing environment
>          > information for ExecutableStage. The overhead, as you said,
>         is negligible.
> 
>         +1. Also it should be noted that the runner is ideally not
>         restricted
>         to this set of environments; if it understands the URN it can use
>         whatever environment it finds appropriate. This could be especially
>         useful for optimal choice of environment to avoid unneeded fusion
>         barriers (e.g. the trivial Count transform will has a pair-with-one
>         before the GBK, and a sum-values after the GBK, and assuming those
>         operations are available in nearly every environment it would be
>         preferable to choose the variant according to what
>         precedes/follows it
>         to allow fusion (or, possibly, even embed the operation).
> 
>          > Also agree that artifact staging is important even with
>         process-based
>          > execution. The execution environment might be managed
>         externally but we
>          > still want to be able to execute new pipelines without
>         copying over
>          > required artifact. That said, a first version could come without
>          > artifact staging.
> 
>         One of the parameters passed to the script is the staging endpoint,
>         which it can use to do all the staging itself, so this could also be
>         internal to the script. I can however see wanting the ability to
>         stage
>         artifacts more cheaply (e.g. symlinks), and this is actually
>         also the
>         case with the docker environment (e.g. mount points).
> 
>          > On 23.08.18 18:14, Henning Rohde wrote:
>          > > A process-based SDK harness does not IMO imply that the
>         host is fully
>          > > provisioned by the SDK/user and invoking the user command
>         line in the
>          > > context of the staged files is a critical aspect for it to
>         work. So I
>          > > consider staged artifact support needed. Also, I would like
>         to suggest
>          > > that we move to a concrete environment proto to crystalize
>         what is
>          > > actually being proposed. I'm not sure what activating a
>         virtualenv would
>          > > look like, for example. To start things off:
>          > >
>          > > message Environment {
>          > >    string urn = 1;
>          > >    bytes payload = 2;
>          > > }
>          > >
>          > > // urn == "beam:env:docker:v1"
>          > > message DockerPayload {
>          > >    string container_image = 1;  // implicitly linux_amd64.
>          > > }
>          > >
>          > > // urn == "beam:env:process:v1"
>          > > message ProcessPayload {
>          > >    string os = 1;  // "linux", "darwin", ..
>          > >    string arch = 2;  // "amd64", ..
>          > >    string command_line = 3;
>          > > }
>          > >
>          > > // urn == "beam:env:external:v1"
>          > > // (no payload)
>          > >
>          > > A runner may support any subset and reject any unsupported
>          > > configuration. There are 3 kinds of environments that I
>         think are useful:
>          > >   (1) docker: works as currently. Offers the most
>         flexibility for SDKs
>          > > and users, especially when the runner is outside the
>         control (such
>          > > as hosted runners). The runner starts the SDK harnesses.
>          > >   (2) process: as discussed here. The runner starts the SDK
>         harnesses.
>          > > The semantics is that the shell commandline is invoked in a
>         directory
>          > > rooted in the staged artifacts with the container contract
>         arguments. It
>          > > is up to the user and runner deployment to ensure that it
>         makes sense,
>          > > i.e., on windows a linux binary or bash script is not
>         specified.
>          > > Executing the user command in a shell env (bash, zsh, cmd,
>         ..) ensures
>          > > that paths and so on are set up:, i.e., specifying "java
>         -jar foo" would
>          > > actually work. Useful for cases where the user controls
>         both the SDK and
>          > > runner (such as locally) or when docker is not an option.
>         Intended to be
>          > > minimal and SDK/language agnostic.
>          > >   (3) external: this is what I think Robert was alluding
>         to. The runner
>          > > does not start any SDK harnesses. Instead it waits for
>         user-controlled
>          > > SDK harnesses to connect. Useful for manually debugging SDK
>         code
>          > > (connect from code running in a debugger) or when the user
>         code must run
>          > > in a special or privileged environment. It's
>         runner-specific how the SDK
>          > > will need to connect.
>          > >
>          > > Part of the idea of placing this information in the
>         environment is that
>          > > pipelines can potentially use multiple, such as
>         cross-windows/linux.
>          > >
>          > > Henning
>          > >
>          > > On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise
>         <thw@apache.org <ma...@apache.org>
>          > > <mailto:thw@apache.org <ma...@apache.org>>> wrote:
>          > >
>          > >     I would see support for staging libraries as optional /
>         nice to have
>          > >     since that can also be done as part of host
>         provisioning (i.e. in
>          > >     the Python case a virtual environment was already setup
>         and just
>          > >     needs to be activated).
>          > >
>          > >     Depending on how the command that launches the harness is
>          > >     configured, additional steps such as virtualenv
>         activate or setting
>          > >     of other environment variables can be included as well.
>          > >
>          > >
>          > >     On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels
>         <mxm@apache.org <ma...@apache.org>
>          > >     <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
>          > >
>          > >         Just to recap:
>          > >
>          > >           From this and the other thread ("Bootstraping
>         Beam's Job
>          > >         Server") we
>          > >         got sufficient evidence that process-based
>         execution is a
>          > >         desired feature.
>          > >
>          > >         Process-based execution as an alternative to
>         dockerized execution
>          > > https://issues.apache.org/jira/browse/BEAM-5187
>          > >
>          > >         Which parts are executed as a process?
>          > >         => The SDK harness for user code
>          > >
>          > >         What configuration options are supported?
>          > >         => Provide information about the target
>         architecture (OS/CPU)
>          > >         => Staging libraries, as also supported by Docker
>          > >         => Activating a pre-existing environment (e.g.
>         virutalenv)
>          > >
>          > >
>          > >         On 23.08.18 14:13, Maximilian Michels wrote:
>          > >          >> One thing to consider that we've talked about
>         in the past.
>          > >         It might
>          > >          >> make sense to extend the environment proto and
>         have the SDK be
>          > >          >> explicit about which kinds of environment it
>         support
>          > >          >
>          > >          > +1 Encoding environment information there is a
>         good idea.
>          > >          >
>          > >          >> Seems it will create a default docker url even
>         if the
>          > >          >> hardness_docker_image is set to None in
>         pipeline options.
>          > >         Shall we add
>          > >          >> another option or honor the None in this option
>         to support
>          > >         the process
>          > >          >> job?
>          > >          >
>          > >          > Yes, if no Docker image is set the default one
>         will be used.
>          > >         Currently
>          > >          > Docker is the only way to execute pipelines with the
>          > >         PortableRunner. If
>          > >          > the docker_image is not set, execution won't
>         succeed.
>          > >          >
>          > >          > On 22.08.18 22:59, Xinyu Liu wrote:
>          > >          >> We are also interested in this Process
>         JobBundleFactory as
>          > >         we are
>          > >          >> planning to fork a process to run python sdk in
>         Samza
>          > >         runner, instead
>          > >          >> of using docker container. So this change will
>         be helpful to
>          > >         us too.
>          > >          >> On the same note, we are trying out
>         portable_runner.py to
>          > >         submit a
>          > >          >> python job. Seems it will create a default
>         docker url even
>          > >         if the
>          > >          >> hardness_docker_image is set to None in
>         pipeline options.
>          > >         Shall we add
>          > >          >> another option or honor the None in this option
>         to support
>          > >         the process
>          > >          >> job? I made some local changes right now to
>         walk around this.
>          > >          >>
>          > >          >> Thanks,
>          > >          >> Xinyu
>          > >          >>
>          > >          >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde
>          > >         <herohde@google.com <ma...@google.com>
>         <mailto:herohde@google.com <ma...@google.com>>
>          > >          >> <mailto:herohde@google.com
>         <ma...@google.com> <mailto:herohde@google.com
>         <ma...@google.com>>>> wrote:
>          > >          >>
>          > >          >>     By "enum" in quotes, I meant the usual open
>         URN style
>          > >         pattern not an
>          > >          >>     actual enum. Sorry if that wasn't clear.
>          > >          >>
>          > >          >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik
>          > >         <lcwik@google.com <ma...@google.com>
>         <mailto:lcwik@google.com <ma...@google.com>>
>          > >          >>     <mailto:lcwik@google.com
>         <ma...@google.com> <mailto:lcwik@google.com
>         <ma...@google.com>>>> wrote:
>          > >          >>
>          > >          >>         I would model the environment to be
>         more free form
>          > >         then enums
>          > >          >>         such that we have forward looking
>         extensibility and
>          > >         would
>          > >          >>         suggest to follow the same pattern we
>         use on
>          > >         PTransforms (using
>          > >          >>         an URN and a URN specific payload).
>         Note that in
>          > >         this case we
>          > >          >>         may want to support a list of supported
>         environments
>          > >         (e.g. java,
>          > >          >>         docker, python, ...).
>          > >          >>
>          > >          >>         On Tue, Aug 21, 2018 at 10:37 AM
>         Henning Rohde
>          > >          >>         <herohde@google.com
>         <ma...@google.com> <mailto:herohde@google.com
>         <ma...@google.com>>
>          > >         <mailto:herohde@google.com
>         <ma...@google.com> <mailto:herohde@google.com
>         <ma...@google.com>>>> wrote:
>          > >          >>
>          > >          >>             One thing to consider that we've
>         talked about in
>          > >         the past.
>          > >          >>             It might make sense to extend the
>         environment
>          > >         proto and have
>          > >          >>             the SDK be explicit about which
>         kinds of
>          > >         environment it
>          > >          >>             supports:
>          > >          >>
>          > >          >>
>          > >          >>
>          > >
>         https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>          > >
>          > >          >>
>          > >          >>
>          > >          >>
>          > >       
>           <https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969>
>          > >
>          > >          >>
>          > >          >>
>          > >          >>             This choice might impact what files
>         are staged
>          > >         or what not.
>          > >          >>             Some SDKs, such as Go, provide a
>         compiled binary
>          > >         and _need_
>          > >          >>             to know what the target
>         architecture is. Running
>          > >         on a mac
>          > >          >>             with and without docker, say,
>         requires a
>          > >         different worker in
>          > >          >>             each case. If we add an "enum", we
>         can also
>          > >         easily add the
>          > >          >>             external idea where the SDK/user
>         starts the SDK
>          > >         harnesses
>          > >          >>             instead of the runner. Each runner
>         may not
>          > >         support all types
>          > >          >>             of environments.
>          > >          >>
>          > >          >>             Henning
>          > >          >>
>          > >          >>             On Tue, Aug 21, 2018 at 2:52 AM
>         Maximilian Michels
>          > >          >>             <mxm@apache.org
>         <ma...@apache.org> <mailto:mxm@apache.org
>         <ma...@apache.org>>
>          > >         <mailto:mxm@apache.org <ma...@apache.org>
>         <mailto:mxm@apache.org <ma...@apache.org>>>> wrote:
>          > >          >>
>          > >          >>                 For reference, here is
>         corresponding JIRA
>          > >         issue for this
>          > >          >>                 thread:
>          > >          >> https://issues.apache.org/jira/browse/BEAM-5187
>          > >          >>
>          > >         <https://issues.apache.org/jira/browse/BEAM-5187>
>          > >          >>
>          > >          >>                 On 16.08.18 11:15, Maximilian
>         Michels wrote:
>          > >          >>                  > Makes sense to have an
>         option to run the
>          > >         SDK harness
>          > >          >>                 in a non-dockerized
>          > >          >>                  > environment.
>          > >          >>                  >
>          > >          >>                  > I'm in the process of
>         creating a Docker
>          > >         entry point
>          > >          >>                 for Flink's
>          > >          >>                  > JobServer[1]. I suppose you
>         would also
>          > >         prefer to
>          > >          >>                 execute that one
>          > >          >>                  > standalone. We should make
>         sure this is
>          > >         also an
>          > >          >> option.
>          > >          >>                  >
>          > >          >>                  > [1]
>          > > https://issues.apache.org/jira/browse/BEAM-4130
>          > >          >>
>          > >         <https://issues.apache.org/jira/browse/BEAM-4130>
>          > >          >>                  >
>          > >          >>                  > On 16.08.18 07:42, Thomas
>         Weise wrote:
>          > >          >>                  >> Yes, that's the proposal.
>         Everything
>          > >         that would
>          > >          >>                 otherwise be packaged
>          > >          >>                  >> into the Docker container
>         would need to be
>          > >          >>                 pre-installed in the host
>          > >          >>                  >> environment. In the case of
>         Python SDK,
>          > >         that could
>          > >          >>                 simply mean a
>          > >          >>                  >> (frozen) virtual
>         environment that was
>          > >         setup when the
>          > >          >>                 host was
>          > >          >>                  >> provisioned - the SDK harness
>          > >         process(es) will then
>          > >          >>                 just utilize that.
>          > >          >>                  >> Of course this flavor of
>         SDK harness
>          > >         execution could
>          > >          >>                 also be useful in
>          > >          >>                  >> the local development
>         environment, where
>          > >         right now
>          > >          >>                 someone who already
>          > >          >>                  >> has the Python environment
>         needs to also
>          > >         install
>          > >          >>                 Docker and update a
>          > >          >>                  >> container to launch a
>         Python SDK
>          > >         pipeline on the
>          > >          >>                 Flink runner.
>          > >          >>                  >>
>          > >          >>                  >> On Wed, Aug 15, 2018 at
>         12:40 PM Daniel
>          > >         Oliveira
>          > >          >>                 <danoliveira@google.com
>         <ma...@google.com>
>          > >         <mailto:danoliveira@google.com
>         <ma...@google.com>> <mailto:danoliveira@google.com
>         <ma...@google.com>
>          > >         <mailto:danoliveira@google.com
>         <ma...@google.com>>>
>          > >          >>                  >>
>         <mailto:danoliveira@google.com <ma...@google.com>
>          > >         <mailto:danoliveira@google.com
>         <ma...@google.com>>
>          > >          >>                 <mailto:danoliveira@google.com
>         <ma...@google.com>
>          > >         <mailto:danoliveira@google.com
>         <ma...@google.com>>>>> wrote:
>          > >          >>                  >>
>          > >          >>                  >>      I just want to clarify
>         that I
>          > >         understand this
>          > >          >>                 correctly since I'm
>          > >          >>                  >>      not that familiar with
>         the details
>          > >         behind all
>          > >          >>                 these execution
>          > >          >>                  >>      environments yet. Is
>         the proposal
>          > >         to create a
>          > >          >>                 new JobBundleFactory
>          > >          >>                  >>      that instead of using
>         Docker to
>          > >         create the
>          > >          >>                 environment that the new
>          > >          >>                  >>      processes will execute
>         in, this
>          > >          >>                 JobBundleFactory would execute the
>          > >          >>                  >>      new processes directly
>         in the host
>          > >         environment?
>          > >          >>                 So in practice if I
>          > >          >>                  >>      ran a pipeline with this
>          > >         JobBundleFactory the
>          > >          >>                 SDK Harness and Runner
>          > >          >>                  >>      Harness would both be
>         executing
>          > >         directly on my
>          > >          >>                 machine and would
>          > >          >>                  >>      depend on me having the
>          > >         dependencies already
>          > >          >>                 present on my machine?
>          > >          >>                  >>
>          > >          >>                  >>      On Mon, Aug 13, 2018
>         at 5:50 PM
>          > >         Ankur Goenka
>          > >          >>                 <goenka@google.com
>         <ma...@google.com>
>          > >         <mailto:goenka@google.com
>         <ma...@google.com>> <mailto:goenka@google.com
>         <ma...@google.com>
>          > >         <mailto:goenka@google.com <ma...@google.com>>>
>          > >          >>                  >>     
>         <mailto:goenka@google.com <ma...@google.com>
>          > >         <mailto:goenka@google.com <ma...@google.com>>
>          > >          >>                 <mailto:goenka@google.com
>         <ma...@google.com>
>          > >         <mailto:goenka@google.com
>         <ma...@google.com>>>>> wrote:
>          > >          >>                  >>
>          > >          >>                  >>          Thanks for
>         starting the
>          > >         discussion. I will
>          > >          >>                 be happy to help.
>          > >          >>                  >>          I agree, we should
>         have pluggable
>          > >          >>                 SDKHarness environment Factory.
>          > >          >>                  >>          We can register
>         multiple
>          > >         Environment
>          > >          >>                 factory using service
>          > >          >>                  >>          registry and use the
>          > >         PipelineOption to pick
>          > >          >>                 the right one on per
>          > >          >>                  >>          job basis.
>          > >          >>                  >>
>          > >          >>                  >>          There are a couple
>         of things
>          > >         which are
>          > >          >>                 require to setup before
>          > >          >>                  >>          launching the process.
>          > >          >>                  >>
>          > >          >>                  >>            * Setting up the
>         environment
>          > >         as done in
>          > >          >>                 boot.go [4]
>          > >          >>                  >>            * Retrieving and
>         putting the
>          > >         artifacts in
>          > >          >>                 the right location.
>          > >          >>                  >>
>          > >          >>                  >>          You can probably
>         leverage
>          > >         boot.go code to
>          > >          >>                 setup the environment.
>          > >          >>                  >>
>          > >          >>                  >>          Also, it will be
>         useful to
>          > >         enumerate pros
>          > >          >>                 and cons of different
>          > >          >>                  >>          Environments to
>         help users
>          > >         choose the right
>          > >          >>                 one.
>          > >          >>                  >>
>          > >          >>                  >>
>          > >          >>                  >>          On Mon, Aug 6,
>         2018 at 4:50 PM
>          > >         Thomas Weise
>          > >          >>                 <thw@apache.org
>         <ma...@apache.org> <mailto:thw@apache.org
>         <ma...@apache.org>>
>          > >         <mailto:thw@apache.org <ma...@apache.org>
>         <mailto:thw@apache.org <ma...@apache.org>>>
>          > >          >>                  >>         
>         <mailto:thw@apache.org <ma...@apache.org>
>          > >         <mailto:thw@apache.org <ma...@apache.org>>
>          > >          >>                 <mailto:thw@apache.org
>         <ma...@apache.org>
>          > >         <mailto:thw@apache.org <ma...@apache.org>>>>>
>         wrote:
>          > >          >>                  >>
>          > >          >>                  >>              Hi,
>          > >          >>                  >>
>          > >          >>                  >>              Currently the
>         portable
>          > >         Flink runner
>          > >          >>                 only works with SDK
>          > >          >>                  >>              Docker
>         containers for execution
>          > >          >>                 (DockerJobBundleFactory,
>          > >          >>                  >>              besides an
>         in-process
>          > >         (embedded)
>          > >          >>                 factory option for testing
>          > >          >>                  >>              [1]). I'm
>         considering
>          > >         adding another
>          > >          >>                 out of process
>          > >          >>                  >>              JobBundleFactory
>          > >         implementation that
>          > >          >>                 directly forks the
>          > >          >>                  >>              processes on
>         the task
>          > >         manager host,
>          > >          >>                 eliminating the need for
>          > >          >>                  >>              Docker. This
>         would work
>          > >         reasonably well
>          > >          >>                 in environments
>          > >          >>                  >>              where the
>         dependencies (in
>          > >         this case
>          > >          >>                 Python) can easily be
>          > >          >>                  >>              tied into the host
>          > >         deployment (also
>          > >          >>                 within an application
>          > >          >>                  >>              specific
>         Kubernetes pod).
>          > >          >>                  >>
>          > >          >>                  >>              There was
>         already some
>          > >         discussion about
>          > >          >>                 alternative
>          > >          >>                  >>              JobBundleFactory
>          > >         implementation in [2].
>          > >          >>                 There is also a JIRA
>          > >          >>                  >>              to make the
>         bundle factory
>          > >         pluggable
>          > >          >>                 [3], pending
>          > >          >>                  >>              availability
>         of runner
>          > >         level options.
>          > >          >>                  >>
>          > >          >>                  >>              For a
>          > >         "ProcessBundleFactory", in
>          > >          >>                 addition to the Python
>          > >          >>                  >>              dependencies the
>          > >         environment would also
>          > >          >>                 need to have the Go
>          > >          >>                  >>              boot
>         executable [4] (or a
>          > >         substitute
>          > >          >>                 thereof) to perform the
>          > >          >>                  >>              harness
>         initialization.
>          > >          >>                  >>
>          > >          >>                  >>              Is anyone else
>         interested
>          > >         in this SDK
>          > >          >>                 execution option or
>          > >          >>                  >>              has already
>         investigated an
>          > >         alternative
>          > >          >>                 implementation?
>          > >          >>                  >>
>          > >          >>                  >>              Thanks,
>          > >          >>                  >>              Thomas
>          > >          >>                  >>
>          > >          >>                  >>              [1]
>          > >          >>                  >>
>          > >          >>
>          > >          >>
>          > >
>         https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>          > >
>          > >          >>
>          > >          >>
>          > >          >>
>          > >       
>           <https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83>
>          > >
>          > >          >>
>          > >          >>                  >>
>          > >          >>                  >>              [2]
>          > >          >>                  >>
>          > >          >>
>          > >          >>
>          > >
>         https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>          > >
>          > >          >>
>          > >          >>
>          > >          >>
>          > >       
>           <https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E>
>          > >
>          > >          >>
>          > >          >>                  >>
>          > >          >>                  >>              [3]
>          > >          >> https://issues.apache.org/jira/browse/BEAM-4819
>          > >          >>
>          > >         <https://issues.apache.org/jira/browse/BEAM-4819>
>          > >          >>                  >>
>          > >          >>                  >>              [4]
>          > >          >>
>          > >          >>
>          > >
>         https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>          > >          >>
>          > >          >>
>          > >       
>           <https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
>          > >
>          > >          >>
>          > >          >>                  >>
>          > >          >>
>          > >          >>                 --                 Max
>          > >          >>
>          > >          >>
>          > >          >
>          > >
>          > >         --
>          > >         Max
>          > >
>          >
>          > --
>          > Max
>

Re: Process JobBundleFactory for portable runner

Posted by Ankur Goenka <go...@google.com>.

Sorry for posting on a separate thread. Lets continue the discussion here.
+1 for having URN to identify environment type. I think URN is better than
'oneof' structure as its more flexible and forward compatible.

On Fri, Aug 31, 2018 at 9:14 AM Thomas Weise <th...@apache.org> wrote:

> FYI the first part of support for (direct) process based job bundle
> factory was merged (thanks Max!)
>
> https://github.com/apache/beam/pull/6287
>
> On top of that I have built a customization that runs the Python SDK
> worker directly:
>
>
> https://github.com/lyft/beam/blob/release-2.8.0-lyft/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/LyftProcessJobBundleFactory.java
>
> While doing that, I thought it would be nice to support a custom
> environment (along with the other options we already discussed) that allows
> users to hook in their own extensions without having to do stuff like this:
> https://github.com/lyft/beam/pull/6/files#diff-b74ff692340bcae0032d119a7192624cR61
>
> Thanks,
> Thomas
>
>
>
>
>
> On Mon, Aug 27, 2018 at 2:43 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Mon, Aug 27, 2018 at 11:23 AM Maximilian Michels <mx...@apache.org>
>> wrote:
>> >
>> > Thanks for your proposal Henning. +1 for explicit environment messages.
>> > I'm not sure how important it is to support cross-platform pipelines. I
>> > can foresee future use but I wouldn't consider it essential.
>>
>> One may want to execute Tensforflow pipeline segments from the middle
>> of a Java pipeline, or leverage the SQL code (currently implemented in
>> Java) from Python or Go. In my experience, a common pipeline shape is
>> to do a significant amount of (often fairly trivial) filtering at the
>> front of a pipeline, and sophisticated analysis at the end, and the
>> tradeoffs of execution efficiency vs. expressiveness and prototype
>> friendliness are different in these two halves of the same pipeline.
>>
>> > However, it
>> > basically comes for free if we extend the existing environment
>> > information for ExecutableStage. The overhead, as you said, is
>> negligible.
>>
>> +1. Also it should be noted that the runner is ideally not restricted
>> to this set of environments; if it understands the URN it can use
>> whatever environment it finds appropriate. This could be especially
>> useful for optimal choice of environment to avoid unneeded fusion
>> barriers (e.g. the trivial Count transform will has a pair-with-one
>> before the GBK, and a sum-values after the GBK, and assuming those
>> operations are available in nearly every environment it would be
>> preferable to choose the variant according to what precedes/follows it
>> to allow fusion (or, possibly, even embed the operation).
>>
>> > Also agree that artifact staging is important even with process-based
>> > execution. The execution environment might be managed externally but we
>> > still want to be able to execute new pipelines without copying over
>> > required artifact. That said, a first version could come without
>> > artifact staging.
>>
>> One of the parameters passed to the script is the staging endpoint,
>> which it can use to do all the staging itself, so this could also be
>> internal to the script. I can however see wanting the ability to stage
>> artifacts more cheaply (e.g. symlinks), and this is actually also the
>> case with the docker environment (e.g. mount points).
>>
>> > On 23.08.18 18:14, Henning Rohde wrote:
>> > > A process-based SDK harness does not IMO imply that the host is fully
>> > > provisioned by the SDK/user and invoking the user command line in the
>> > > context of the staged files is a critical aspect for it to work. So I
>> > > consider staged artifact support needed. Also, I would like to suggest
>> > > that we move to a concrete environment proto to crystalize what is
>> > > actually being proposed. I'm not sure what activating a virtualenv
>> would
>> > > look like, for example. To start things off:
>> > >
>> > > message Environment {
>> > >    string urn = 1;
>> > >    bytes payload = 2;
>> > > }
>> > >
>> > > // urn == "beam:env:docker:v1"
>> > > message DockerPayload {
>> > >    string container_image = 1;  // implicitly linux_amd64.
>> > > }
>> > >
>> > > // urn == "beam:env:process:v1"
>> > > message ProcessPayload {
>> > >    string os = 1;  // "linux", "darwin", ..
>> > >    string arch = 2;  // "amd64", ..
>> > >    string command_line = 3;
>> > > }
>> > >
>> > > // urn == "beam:env:external:v1"
>> > > // (no payload)
>> > >
>> > > A runner may support any subset and reject any unsupported
>> > > configuration. There are 3 kinds of environments that I think are
>> useful:
>> > >   (1) docker: works as currently. Offers the most flexibility for SDKs
>> > > and users, especially when the runner is outside the control (such
>> > > as hosted runners). The runner starts the SDK harnesses.
>> > >   (2) process: as discussed here. The runner starts the SDK harnesses.
>> > > The semantics is that the shell commandline is invoked in a directory
>> > > rooted in the staged artifacts with the container contract arguments.
>> It
>> > > is up to the user and runner deployment to ensure that it makes sense,
>> > > i.e., on windows a linux binary or bash script is not specified.
>> > > Executing the user command in a shell env (bash, zsh, cmd, ..) ensures
>> > > that paths and so on are set up:, i.e., specifying "java -jar foo"
>> would
>> > > actually work. Useful for cases where the user controls both the SDK
>> and
>> > > runner (such as locally) or when docker is not an option. Intended to
>> be
>> > > minimal and SDK/language agnostic.
>> > >   (3) external: this is what I think Robert was alluding to. The
>> runner
>> > > does not start any SDK harnesses. Instead it waits for user-controlled
>> > > SDK harnesses to connect. Useful for manually debugging SDK code
>> > > (connect from code running in a debugger) or when the user code must
>> run
>> > > in a special or privileged environment. It's runner-specific how the
>> SDK
>> > > will need to connect.
>> > >
>> > > Part of the idea of placing this information in the environment is
>> that
>> > > pipelines can potentially use multiple, such as cross-windows/linux.
>> > >
>> > > Henning
>> > >
>> > > On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <thw@apache.org
>> > > <ma...@apache.org>> wrote:
>> > >
>> > >     I would see support for staging libraries as optional / nice to
>> have
>> > >     since that can also be done as part of host provisioning (i.e. in
>> > >     the Python case a virtual environment was already setup and just
>> > >     needs to be activated).
>> > >
>> > >     Depending on how the command that launches the harness is
>> > >     configured, additional steps such as virtualenv activate or
>> setting
>> > >     of other environment variables can be included as well.
>> > >
>> > >
>> > >     On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <
>> mxm@apache.org
>> > >     <ma...@apache.org>> wrote:
>> > >
>> > >         Just to recap:
>> > >
>> > >           From this and the other thread ("Bootstraping Beam's Job
>> > >         Server") we
>> > >         got sufficient evidence that process-based execution is a
>> > >         desired feature.
>> > >
>> > >         Process-based execution as an alternative to dockerized
>> execution
>> > >         https://issues.apache.org/jira/browse/BEAM-5187
>> > >
>> > >         Which parts are executed as a process?
>> > >         => The SDK harness for user code
>> > >
>> > >         What configuration options are supported?
>> > >         => Provide information about the target architecture (OS/CPU)
>> > >         => Staging libraries, as also supported by Docker
>> > >         => Activating a pre-existing environment (e.g. virutalenv)
>> > >
>> > >
>> > >         On 23.08.18 14:13, Maximilian Michels wrote:
>> > >          >> One thing to consider that we've talked about in the past.
>> > >         It might
>> > >          >> make sense to extend the environment proto and have the
>> SDK be
>> > >          >> explicit about which kinds of environment it support
>> > >          >
>> > >          > +1 Encoding environment information there is a good idea.
>> > >          >
>> > >          >> Seems it will create a default docker url even if the
>> > >          >> hardness_docker_image is set to None in pipeline options.
>> > >         Shall we add
>> > >          >> another option or honor the None in this option to support
>> > >         the process
>> > >          >> job?
>> > >          >
>> > >          > Yes, if no Docker image is set the default one will be
>> used.
>> > >         Currently
>> > >          > Docker is the only way to execute pipelines with the
>> > >         PortableRunner. If
>> > >          > the docker_image is not set, execution won't succeed.
>> > >          >
>> > >          > On 22.08.18 22:59, Xinyu Liu wrote:
>> > >          >> We are also interested in this Process JobBundleFactory as
>> > >         we are
>> > >          >> planning to fork a process to run python sdk in Samza
>> > >         runner, instead
>> > >          >> of using docker container. So this change will be helpful
>> to
>> > >         us too.
>> > >          >> On the same note, we are trying out portable_runner.py to
>> > >         submit a
>> > >          >> python job. Seems it will create a default docker url even
>> > >         if the
>> > >          >> hardness_docker_image is set to None in pipeline options.
>> > >         Shall we add
>> > >          >> another option or honor the None in this option to support
>> > >         the process
>> > >          >> job? I made some local changes right now to walk around
>> this.
>> > >          >>
>> > >          >> Thanks,
>> > >          >> Xinyu
>> > >          >>
>> > >          >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde
>> > >         <herohde@google.com <ma...@google.com>
>> > >          >> <mailto:herohde@google.com <ma...@google.com>>>
>> wrote:
>> > >          >>
>> > >          >>     By "enum" in quotes, I meant the usual open URN style
>> > >         pattern not an
>> > >          >>     actual enum. Sorry if that wasn't clear.
>> > >          >>
>> > >          >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik
>> > >         <lcwik@google.com <ma...@google.com>
>> > >          >>     <mailto:lcwik@google.com <ma...@google.com>>>
>> wrote:
>> > >          >>
>> > >          >>         I would model the environment to be more free form
>> > >         then enums
>> > >          >>         such that we have forward looking extensibility
>> and
>> > >         would
>> > >          >>         suggest to follow the same pattern we use on
>> > >         PTransforms (using
>> > >          >>         an URN and a URN specific payload). Note that in
>> > >         this case we
>> > >          >>         may want to support a list of supported
>> environments
>> > >         (e.g. java,
>> > >          >>         docker, python, ...).
>> > >          >>
>> > >          >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
>> > >          >>         <herohde@google.com <ma...@google.com>
>> > >         <mailto:herohde@google.com <ma...@google.com>>>
>> wrote:
>> > >          >>
>> > >          >>             One thing to consider that we've talked about
>> in
>> > >         the past.
>> > >          >>             It might make sense to extend the environment
>> > >         proto and have
>> > >          >>             the SDK be explicit about which kinds of
>> > >         environment it
>> > >          >>             supports:
>> > >          >>
>> > >          >>
>> > >          >>
>> > >
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>> > >
>> > >          >>
>> > >          >>
>> > >          >>
>> > >         <
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>> >
>> > >
>> > >          >>
>> > >          >>
>> > >          >>             This choice might impact what files are staged
>> > >         or what not.
>> > >          >>             Some SDKs, such as Go, provide a compiled
>> binary
>> > >         and _need_
>> > >          >>             to know what the target architecture is.
>> Running
>> > >         on a mac
>> > >          >>             with and without docker, say, requires a
>> > >         different worker in
>> > >          >>             each case. If we add an "enum", we can also
>> > >         easily add the
>> > >          >>             external idea where the SDK/user starts the
>> SDK
>> > >         harnesses
>> > >          >>             instead of the runner. Each runner may not
>> > >         support all types
>> > >          >>             of environments.
>> > >          >>
>> > >          >>             Henning
>> > >          >>
>> > >          >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian
>> Michels
>> > >          >>             <mxm@apache.org <ma...@apache.org>
>> > >         <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
>> > >          >>
>> > >          >>                 For reference, here is corresponding JIRA
>> > >         issue for this
>> > >          >>                 thread:
>> > >          >> https://issues.apache.org/jira/browse/BEAM-5187
>> > >          >>
>> > >         <https://issues.apache.org/jira/browse/BEAM-5187>
>> > >          >>
>> > >          >>                 On 16.08.18 11:15, Maximilian Michels
>> wrote:
>> > >          >>                  > Makes sense to have an option to run
>> the
>> > >         SDK harness
>> > >          >>                 in a non-dockerized
>> > >          >>                  > environment.
>> > >          >>                  >
>> > >          >>                  > I'm in the process of creating a Docker
>> > >         entry point
>> > >          >>                 for Flink's
>> > >          >>                  > JobServer[1]. I suppose you would also
>> > >         prefer to
>> > >          >>                 execute that one
>> > >          >>                  > standalone. We should make sure this is
>> > >         also an
>> > >          >> option.
>> > >          >>                  >
>> > >          >>                  > [1]
>> > >         https://issues.apache.org/jira/browse/BEAM-4130
>> > >          >>
>> > >         <https://issues.apache.org/jira/browse/BEAM-4130>
>> > >          >>                  >
>> > >          >>                  > On 16.08.18 07:42, Thomas Weise wrote:
>> > >          >>                  >> Yes, that's the proposal. Everything
>> > >         that would
>> > >          >>                 otherwise be packaged
>> > >          >>                  >> into the Docker container would need
>> to be
>> > >          >>                 pre-installed in the host
>> > >          >>                  >> environment. In the case of Python
>> SDK,
>> > >         that could
>> > >          >>                 simply mean a
>> > >          >>                  >> (frozen) virtual environment that was
>> > >         setup when the
>> > >          >>                 host was
>> > >          >>                  >> provisioned - the SDK harness
>> > >         process(es) will then
>> > >          >>                 just utilize that.
>> > >          >>                  >> Of course this flavor of SDK harness
>> > >         execution could
>> > >          >>                 also be useful in
>> > >          >>                  >> the local development environment,
>> where
>> > >         right now
>> > >          >>                 someone who already
>> > >          >>                  >> has the Python environment needs to
>> also
>> > >         install
>> > >          >>                 Docker and update a
>> > >          >>                  >> container to launch a Python SDK
>> > >         pipeline on the
>> > >          >>                 Flink runner.
>> > >          >>                  >>
>> > >          >>                  >> On Wed, Aug 15, 2018 at 12:40 PM
>> Daniel
>> > >         Oliveira
>> > >          >>                 <danoliveira@google.com
>> > >         <ma...@google.com> <mailto:
>> danoliveira@google.com
>> > >         <ma...@google.com>>
>> > >          >>                  >> <mailto:danoliveira@google.com
>> > >         <ma...@google.com>
>> > >          >>                 <mailto:danoliveira@google.com
>> > >         <ma...@google.com>>>> wrote:
>> > >          >>                  >>
>> > >          >>                  >>      I just want to clarify that I
>> > >         understand this
>> > >          >>                 correctly since I'm
>> > >          >>                  >>      not that familiar with the
>> details
>> > >         behind all
>> > >          >>                 these execution
>> > >          >>                  >>      environments yet. Is the proposal
>> > >         to create a
>> > >          >>                 new JobBundleFactory
>> > >          >>                  >>      that instead of using Docker to
>> > >         create the
>> > >          >>                 environment that the new
>> > >          >>                  >>      processes will execute in, this
>> > >          >>                 JobBundleFactory would execute the
>> > >          >>                  >>      new processes directly in the
>> host
>> > >         environment?
>> > >          >>                 So in practice if I
>> > >          >>                  >>      ran a pipeline with this
>> > >         JobBundleFactory the
>> > >          >>                 SDK Harness and Runner
>> > >          >>                  >>      Harness would both be executing
>> > >         directly on my
>> > >          >>                 machine and would
>> > >          >>                  >>      depend on me having the
>> > >         dependencies already
>> > >          >>                 present on my machine?
>> > >          >>                  >>
>> > >          >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM
>> > >         Ankur Goenka
>> > >          >>                 <goenka@google.com
>> > >         <ma...@google.com> <mailto:goenka@google.com
>> > >         <ma...@google.com>>
>> > >          >>                  >>      <mailto:goenka@google.com
>> > >         <ma...@google.com>
>> > >          >>                 <mailto:goenka@google.com
>> > >         <ma...@google.com>>>> wrote:
>> > >          >>                  >>
>> > >          >>                  >>          Thanks for starting the
>> > >         discussion. I will
>> > >          >>                 be happy to help.
>> > >          >>                  >>          I agree, we should have
>> pluggable
>> > >          >>                 SDKHarness environment Factory.
>> > >          >>                  >>          We can register multiple
>> > >         Environment
>> > >          >>                 factory using service
>> > >          >>                  >>          registry and use the
>> > >         PipelineOption to pick
>> > >          >>                 the right one on per
>> > >          >>                  >>          job basis.
>> > >          >>                  >>
>> > >          >>                  >>          There are a couple of things
>> > >         which are
>> > >          >>                 require to setup before
>> > >          >>                  >>          launching the process.
>> > >          >>                  >>
>> > >          >>                  >>            * Setting up the
>> environment
>> > >         as done in
>> > >          >>                 boot.go [4]
>> > >          >>                  >>            * Retrieving and putting
>> the
>> > >         artifacts in
>> > >          >>                 the right location.
>> > >          >>                  >>
>> > >          >>                  >>          You can probably leverage
>> > >         boot.go code to
>> > >          >>                 setup the environment.
>> > >          >>                  >>
>> > >          >>                  >>          Also, it will be useful to
>> > >         enumerate pros
>> > >          >>                 and cons of different
>> > >          >>                  >>          Environments to help users
>> > >         choose the right
>> > >          >>                 one.
>> > >          >>                  >>
>> > >          >>                  >>
>> > >          >>                  >>          On Mon, Aug 6, 2018 at 4:50
>> PM
>> > >         Thomas Weise
>> > >          >>                 <thw@apache.org <ma...@apache.org>
>> > >         <mailto:thw@apache.org <ma...@apache.org>>
>> > >          >>                  >>          <mailto:thw@apache.org
>> > >         <ma...@apache.org>
>> > >          >>                 <mailto:thw@apache.org
>> > >         <ma...@apache.org>>>> wrote:
>> > >          >>                  >>
>> > >          >>                  >>              Hi,
>> > >          >>                  >>
>> > >          >>                  >>              Currently the portable
>> > >         Flink runner
>> > >          >>                 only works with SDK
>> > >          >>                  >>              Docker containers for
>> execution
>> > >          >>                 (DockerJobBundleFactory,
>> > >          >>                  >>              besides an in-process
>> > >         (embedded)
>> > >          >>                 factory option for testing
>> > >          >>                  >>              [1]). I'm considering
>> > >         adding another
>> > >          >>                 out of process
>> > >          >>                  >>              JobBundleFactory
>> > >         implementation that
>> > >          >>                 directly forks the
>> > >          >>                  >>              processes on the task
>> > >         manager host,
>> > >          >>                 eliminating the need for
>> > >          >>                  >>              Docker. This would work
>> > >         reasonably well
>> > >          >>                 in environments
>> > >          >>                  >>              where the dependencies
>> (in
>> > >         this case
>> > >          >>                 Python) can easily be
>> > >          >>                  >>              tied into the host
>> > >         deployment (also
>> > >          >>                 within an application
>> > >          >>                  >>              specific Kubernetes pod).
>> > >          >>                  >>
>> > >          >>                  >>              There was already some
>> > >         discussion about
>> > >          >>                 alternative
>> > >          >>                  >>              JobBundleFactory
>> > >         implementation in [2].
>> > >          >>                 There is also a JIRA
>> > >          >>                  >>              to make the bundle
>> factory
>> > >         pluggable
>> > >          >>                 [3], pending
>> > >          >>                  >>              availability of runner
>> > >         level options.
>> > >          >>                  >>
>> > >          >>                  >>              For a
>> > >         "ProcessBundleFactory", in
>> > >          >>                 addition to the Python
>> > >          >>                  >>              dependencies the
>> > >         environment would also
>> > >          >>                 need to have the Go
>> > >          >>                  >>              boot executable [4] (or a
>> > >         substitute
>> > >          >>                 thereof) to perform the
>> > >          >>                  >>              harness initialization.
>> > >          >>                  >>
>> > >          >>                  >>              Is anyone else interested
>> > >         in this SDK
>> > >          >>                 execution option or
>> > >          >>                  >>              has already investigated
>> an
>> > >         alternative
>> > >          >>                 implementation?
>> > >          >>                  >>
>> > >          >>                  >>              Thanks,
>> > >          >>                  >>              Thomas
>> > >          >>                  >>
>> > >          >>                  >>              [1]
>> > >          >>                  >>
>> > >          >>
>> > >          >>
>> > >
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>> > >
>> > >          >>
>> > >          >>
>> > >          >>
>> > >         <
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>> >
>> > >
>> > >          >>
>> > >          >>                  >>
>> > >          >>                  >>              [2]
>> > >          >>                  >>
>> > >          >>
>> > >          >>
>> > >
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>> > >
>> > >          >>
>> > >          >>
>> > >          >>
>> > >         <
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>> >
>> > >
>> > >          >>
>> > >          >>                  >>
>> > >          >>                  >>              [3]
>> > >          >> https://issues.apache.org/jira/browse/BEAM-4819
>> > >          >>
>> > >         <https://issues.apache.org/jira/browse/BEAM-4819>
>> > >          >>                  >>
>> > >          >>                  >>              [4]
>> > >          >>
>> > >          >>
>> > >
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>> > >          >>
>> > >          >>
>> > >         <
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
>> > >
>> > >          >>
>> > >          >>                  >>
>> > >          >>
>> > >          >>                 --                 Max
>> > >          >>
>> > >          >>
>> > >          >
>> > >
>> > >         --
>> > >         Max
>> > >
>> >
>> > --
>> > Max
>>
>

Re: Process JobBundleFactory for portable runner

Posted by Thomas Weise <th...@apache.org>.

FYI the first part of support for (direct) process based job bundle factory
was merged (thanks Max!)

https://github.com/apache/beam/pull/6287

On top of that I have built a customization that runs the Python SDK worker
directly:

https://github.com/lyft/beam/blob/release-2.8.0-lyft/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/LyftProcessJobBundleFactory.java

While doing that, I thought it would be nice to support a custom
environment (along with the other options we already discussed) that allows
users to hook in their own extensions without having to do stuff like this:
https://github.com/lyft/beam/pull/6/files#diff-b74ff692340bcae0032d119a7192624cR61

Thanks,
Thomas





On Mon, Aug 27, 2018 at 2:43 AM Robert Bradshaw <ro...@google.com> wrote:

> On Mon, Aug 27, 2018 at 11:23 AM Maximilian Michels <mx...@apache.org>
> wrote:
> >
> > Thanks for your proposal Henning. +1 for explicit environment messages.
> > I'm not sure how important it is to support cross-platform pipelines. I
> > can foresee future use but I wouldn't consider it essential.
>
> One may want to execute Tensforflow pipeline segments from the middle
> of a Java pipeline, or leverage the SQL code (currently implemented in
> Java) from Python or Go. In my experience, a common pipeline shape is
> to do a significant amount of (often fairly trivial) filtering at the
> front of a pipeline, and sophisticated analysis at the end, and the
> tradeoffs of execution efficiency vs. expressiveness and prototype
> friendliness are different in these two halves of the same pipeline.
>
> > However, it
> > basically comes for free if we extend the existing environment
> > information for ExecutableStage. The overhead, as you said, is
> negligible.
>
> +1. Also it should be noted that the runner is ideally not restricted
> to this set of environments; if it understands the URN it can use
> whatever environment it finds appropriate. This could be especially
> useful for optimal choice of environment to avoid unneeded fusion
> barriers (e.g. the trivial Count transform will has a pair-with-one
> before the GBK, and a sum-values after the GBK, and assuming those
> operations are available in nearly every environment it would be
> preferable to choose the variant according to what precedes/follows it
> to allow fusion (or, possibly, even embed the operation).
>
> > Also agree that artifact staging is important even with process-based
> > execution. The execution environment might be managed externally but we
> > still want to be able to execute new pipelines without copying over
> > required artifact. That said, a first version could come without
> > artifact staging.
>
> One of the parameters passed to the script is the staging endpoint,
> which it can use to do all the staging itself, so this could also be
> internal to the script. I can however see wanting the ability to stage
> artifacts more cheaply (e.g. symlinks), and this is actually also the
> case with the docker environment (e.g. mount points).
>
> > On 23.08.18 18:14, Henning Rohde wrote:
> > > A process-based SDK harness does not IMO imply that the host is fully
> > > provisioned by the SDK/user and invoking the user command line in the
> > > context of the staged files is a critical aspect for it to work. So I
> > > consider staged artifact support needed. Also, I would like to suggest
> > > that we move to a concrete environment proto to crystalize what is
> > > actually being proposed. I'm not sure what activating a virtualenv
> would
> > > look like, for example. To start things off:
> > >
> > > message Environment {
> > >    string urn = 1;
> > >    bytes payload = 2;
> > > }
> > >
> > > // urn == "beam:env:docker:v1"
> > > message DockerPayload {
> > >    string container_image = 1;  // implicitly linux_amd64.
> > > }
> > >
> > > // urn == "beam:env:process:v1"
> > > message ProcessPayload {
> > >    string os = 1;  // "linux", "darwin", ..
> > >    string arch = 2;  // "amd64", ..
> > >    string command_line = 3;
> > > }
> > >
> > > // urn == "beam:env:external:v1"
> > > // (no payload)
> > >
> > > A runner may support any subset and reject any unsupported
> > > configuration. There are 3 kinds of environments that I think are
> useful:
> > >   (1) docker: works as currently. Offers the most flexibility for SDKs
> > > and users, especially when the runner is outside the control (such
> > > as hosted runners). The runner starts the SDK harnesses.
> > >   (2) process: as discussed here. The runner starts the SDK harnesses.
> > > The semantics is that the shell commandline is invoked in a directory
> > > rooted in the staged artifacts with the container contract arguments.
> It
> > > is up to the user and runner deployment to ensure that it makes sense,
> > > i.e., on windows a linux binary or bash script is not specified.
> > > Executing the user command in a shell env (bash, zsh, cmd, ..) ensures
> > > that paths and so on are set up:, i.e., specifying "java -jar foo"
> would
> > > actually work. Useful for cases where the user controls both the SDK
> and
> > > runner (such as locally) or when docker is not an option. Intended to
> be
> > > minimal and SDK/language agnostic.
> > >   (3) external: this is what I think Robert was alluding to. The runner
> > > does not start any SDK harnesses. Instead it waits for user-controlled
> > > SDK harnesses to connect. Useful for manually debugging SDK code
> > > (connect from code running in a debugger) or when the user code must
> run
> > > in a special or privileged environment. It's runner-specific how the
> SDK
> > > will need to connect.
> > >
> > > Part of the idea of placing this information in the environment is that
> > > pipelines can potentially use multiple, such as cross-windows/linux.
> > >
> > > Henning
> > >
> > > On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <thw@apache.org
> > > <ma...@apache.org>> wrote:
> > >
> > >     I would see support for staging libraries as optional / nice to
> have
> > >     since that can also be done as part of host provisioning (i.e. in
> > >     the Python case a virtual environment was already setup and just
> > >     needs to be activated).
> > >
> > >     Depending on how the command that launches the harness is
> > >     configured, additional steps such as virtualenv activate or setting
> > >     of other environment variables can be included as well.
> > >
> > >
> > >     On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <mxm@apache.org
> > >     <ma...@apache.org>> wrote:
> > >
> > >         Just to recap:
> > >
> > >           From this and the other thread ("Bootstraping Beam's Job
> > >         Server") we
> > >         got sufficient evidence that process-based execution is a
> > >         desired feature.
> > >
> > >         Process-based execution as an alternative to dockerized
> execution
> > >         https://issues.apache.org/jira/browse/BEAM-5187
> > >
> > >         Which parts are executed as a process?
> > >         => The SDK harness for user code
> > >
> > >         What configuration options are supported?
> > >         => Provide information about the target architecture (OS/CPU)
> > >         => Staging libraries, as also supported by Docker
> > >         => Activating a pre-existing environment (e.g. virutalenv)
> > >
> > >
> > >         On 23.08.18 14:13, Maximilian Michels wrote:
> > >          >> One thing to consider that we've talked about in the past.
> > >         It might
> > >          >> make sense to extend the environment proto and have the
> SDK be
> > >          >> explicit about which kinds of environment it support
> > >          >
> > >          > +1 Encoding environment information there is a good idea.
> > >          >
> > >          >> Seems it will create a default docker url even if the
> > >          >> hardness_docker_image is set to None in pipeline options.
> > >         Shall we add
> > >          >> another option or honor the None in this option to support
> > >         the process
> > >          >> job?
> > >          >
> > >          > Yes, if no Docker image is set the default one will be used.
> > >         Currently
> > >          > Docker is the only way to execute pipelines with the
> > >         PortableRunner. If
> > >          > the docker_image is not set, execution won't succeed.
> > >          >
> > >          > On 22.08.18 22:59, Xinyu Liu wrote:
> > >          >> We are also interested in this Process JobBundleFactory as
> > >         we are
> > >          >> planning to fork a process to run python sdk in Samza
> > >         runner, instead
> > >          >> of using docker container. So this change will be helpful
> to
> > >         us too.
> > >          >> On the same note, we are trying out portable_runner.py to
> > >         submit a
> > >          >> python job. Seems it will create a default docker url even
> > >         if the
> > >          >> hardness_docker_image is set to None in pipeline options.
> > >         Shall we add
> > >          >> another option or honor the None in this option to support
> > >         the process
> > >          >> job? I made some local changes right now to walk around
> this.
> > >          >>
> > >          >> Thanks,
> > >          >> Xinyu
> > >          >>
> > >          >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde
> > >         <herohde@google.com <ma...@google.com>
> > >          >> <mailto:herohde@google.com <ma...@google.com>>>
> wrote:
> > >          >>
> > >          >>     By "enum" in quotes, I meant the usual open URN style
> > >         pattern not an
> > >          >>     actual enum. Sorry if that wasn't clear.
> > >          >>
> > >          >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik
> > >         <lcwik@google.com <ma...@google.com>
> > >          >>     <mailto:lcwik@google.com <ma...@google.com>>>
> wrote:
> > >          >>
> > >          >>         I would model the environment to be more free form
> > >         then enums
> > >          >>         such that we have forward looking extensibility and
> > >         would
> > >          >>         suggest to follow the same pattern we use on
> > >         PTransforms (using
> > >          >>         an URN and a URN specific payload). Note that in
> > >         this case we
> > >          >>         may want to support a list of supported
> environments
> > >         (e.g. java,
> > >          >>         docker, python, ...).
> > >          >>
> > >          >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
> > >          >>         <herohde@google.com <ma...@google.com>
> > >         <mailto:herohde@google.com <ma...@google.com>>>
> wrote:
> > >          >>
> > >          >>             One thing to consider that we've talked about
> in
> > >         the past.
> > >          >>             It might make sense to extend the environment
> > >         proto and have
> > >          >>             the SDK be explicit about which kinds of
> > >         environment it
> > >          >>             supports:
> > >          >>
> > >          >>
> > >          >>
> > >
> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
> > >
> > >          >>
> > >          >>
> > >          >>
> > >         <
> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
> >
> > >
> > >          >>
> > >          >>
> > >          >>             This choice might impact what files are staged
> > >         or what not.
> > >          >>             Some SDKs, such as Go, provide a compiled
> binary
> > >         and _need_
> > >          >>             to know what the target architecture is.
> Running
> > >         on a mac
> > >          >>             with and without docker, say, requires a
> > >         different worker in
> > >          >>             each case. If we add an "enum", we can also
> > >         easily add the
> > >          >>             external idea where the SDK/user starts the SDK
> > >         harnesses
> > >          >>             instead of the runner. Each runner may not
> > >         support all types
> > >          >>             of environments.
> > >          >>
> > >          >>             Henning
> > >          >>
> > >          >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian
> Michels
> > >          >>             <mxm@apache.org <ma...@apache.org>
> > >         <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
> > >          >>
> > >          >>                 For reference, here is corresponding JIRA
> > >         issue for this
> > >          >>                 thread:
> > >          >> https://issues.apache.org/jira/browse/BEAM-5187
> > >          >>
> > >         <https://issues.apache.org/jira/browse/BEAM-5187>
> > >          >>
> > >          >>                 On 16.08.18 11:15, Maximilian Michels
> wrote:
> > >          >>                  > Makes sense to have an option to run the
> > >         SDK harness
> > >          >>                 in a non-dockerized
> > >          >>                  > environment.
> > >          >>                  >
> > >          >>                  > I'm in the process of creating a Docker
> > >         entry point
> > >          >>                 for Flink's
> > >          >>                  > JobServer[1]. I suppose you would also
> > >         prefer to
> > >          >>                 execute that one
> > >          >>                  > standalone. We should make sure this is
> > >         also an
> > >          >> option.
> > >          >>                  >
> > >          >>                  > [1]
> > >         https://issues.apache.org/jira/browse/BEAM-4130
> > >          >>
> > >         <https://issues.apache.org/jira/browse/BEAM-4130>
> > >          >>                  >
> > >          >>                  > On 16.08.18 07:42, Thomas Weise wrote:
> > >          >>                  >> Yes, that's the proposal. Everything
> > >         that would
> > >          >>                 otherwise be packaged
> > >          >>                  >> into the Docker container would need
> to be
> > >          >>                 pre-installed in the host
> > >          >>                  >> environment. In the case of Python SDK,
> > >         that could
> > >          >>                 simply mean a
> > >          >>                  >> (frozen) virtual environment that was
> > >         setup when the
> > >          >>                 host was
> > >          >>                  >> provisioned - the SDK harness
> > >         process(es) will then
> > >          >>                 just utilize that.
> > >          >>                  >> Of course this flavor of SDK harness
> > >         execution could
> > >          >>                 also be useful in
> > >          >>                  >> the local development environment,
> where
> > >         right now
> > >          >>                 someone who already
> > >          >>                  >> has the Python environment needs to
> also
> > >         install
> > >          >>                 Docker and update a
> > >          >>                  >> container to launch a Python SDK
> > >         pipeline on the
> > >          >>                 Flink runner.
> > >          >>                  >>
> > >          >>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel
> > >         Oliveira
> > >          >>                 <danoliveira@google.com
> > >         <ma...@google.com> <mailto:danoliveira@google.com
> > >         <ma...@google.com>>
> > >          >>                  >> <mailto:danoliveira@google.com
> > >         <ma...@google.com>
> > >          >>                 <mailto:danoliveira@google.com
> > >         <ma...@google.com>>>> wrote:
> > >          >>                  >>
> > >          >>                  >>      I just want to clarify that I
> > >         understand this
> > >          >>                 correctly since I'm
> > >          >>                  >>      not that familiar with the details
> > >         behind all
> > >          >>                 these execution
> > >          >>                  >>      environments yet. Is the proposal
> > >         to create a
> > >          >>                 new JobBundleFactory
> > >          >>                  >>      that instead of using Docker to
> > >         create the
> > >          >>                 environment that the new
> > >          >>                  >>      processes will execute in, this
> > >          >>                 JobBundleFactory would execute the
> > >          >>                  >>      new processes directly in the host
> > >         environment?
> > >          >>                 So in practice if I
> > >          >>                  >>      ran a pipeline with this
> > >         JobBundleFactory the
> > >          >>                 SDK Harness and Runner
> > >          >>                  >>      Harness would both be executing
> > >         directly on my
> > >          >>                 machine and would
> > >          >>                  >>      depend on me having the
> > >         dependencies already
> > >          >>                 present on my machine?
> > >          >>                  >>
> > >          >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM
> > >         Ankur Goenka
> > >          >>                 <goenka@google.com
> > >         <ma...@google.com> <mailto:goenka@google.com
> > >         <ma...@google.com>>
> > >          >>                  >>      <mailto:goenka@google.com
> > >         <ma...@google.com>
> > >          >>                 <mailto:goenka@google.com
> > >         <ma...@google.com>>>> wrote:
> > >          >>                  >>
> > >          >>                  >>          Thanks for starting the
> > >         discussion. I will
> > >          >>                 be happy to help.
> > >          >>                  >>          I agree, we should have
> pluggable
> > >          >>                 SDKHarness environment Factory.
> > >          >>                  >>          We can register multiple
> > >         Environment
> > >          >>                 factory using service
> > >          >>                  >>          registry and use the
> > >         PipelineOption to pick
> > >          >>                 the right one on per
> > >          >>                  >>          job basis.
> > >          >>                  >>
> > >          >>                  >>          There are a couple of things
> > >         which are
> > >          >>                 require to setup before
> > >          >>                  >>          launching the process.
> > >          >>                  >>
> > >          >>                  >>            * Setting up the environment
> > >         as done in
> > >          >>                 boot.go [4]
> > >          >>                  >>            * Retrieving and putting the
> > >         artifacts in
> > >          >>                 the right location.
> > >          >>                  >>
> > >          >>                  >>          You can probably leverage
> > >         boot.go code to
> > >          >>                 setup the environment.
> > >          >>                  >>
> > >          >>                  >>          Also, it will be useful to
> > >         enumerate pros
> > >          >>                 and cons of different
> > >          >>                  >>          Environments to help users
> > >         choose the right
> > >          >>                 one.
> > >          >>                  >>
> > >          >>                  >>
> > >          >>                  >>          On Mon, Aug 6, 2018 at 4:50 PM
> > >         Thomas Weise
> > >          >>                 <thw@apache.org <ma...@apache.org>
> > >         <mailto:thw@apache.org <ma...@apache.org>>
> > >          >>                  >>          <mailto:thw@apache.org
> > >         <ma...@apache.org>
> > >          >>                 <mailto:thw@apache.org
> > >         <ma...@apache.org>>>> wrote:
> > >          >>                  >>
> > >          >>                  >>              Hi,
> > >          >>                  >>
> > >          >>                  >>              Currently the portable
> > >         Flink runner
> > >          >>                 only works with SDK
> > >          >>                  >>              Docker containers for
> execution
> > >          >>                 (DockerJobBundleFactory,
> > >          >>                  >>              besides an in-process
> > >         (embedded)
> > >          >>                 factory option for testing
> > >          >>                  >>              [1]). I'm considering
> > >         adding another
> > >          >>                 out of process
> > >          >>                  >>              JobBundleFactory
> > >         implementation that
> > >          >>                 directly forks the
> > >          >>                  >>              processes on the task
> > >         manager host,
> > >          >>                 eliminating the need for
> > >          >>                  >>              Docker. This would work
> > >         reasonably well
> > >          >>                 in environments
> > >          >>                  >>              where the dependencies (in
> > >         this case
> > >          >>                 Python) can easily be
> > >          >>                  >>              tied into the host
> > >         deployment (also
> > >          >>                 within an application
> > >          >>                  >>              specific Kubernetes pod).
> > >          >>                  >>
> > >          >>                  >>              There was already some
> > >         discussion about
> > >          >>                 alternative
> > >          >>                  >>              JobBundleFactory
> > >         implementation in [2].
> > >          >>                 There is also a JIRA
> > >          >>                  >>              to make the bundle factory
> > >         pluggable
> > >          >>                 [3], pending
> > >          >>                  >>              availability of runner
> > >         level options.
> > >          >>                  >>
> > >          >>                  >>              For a
> > >         "ProcessBundleFactory", in
> > >          >>                 addition to the Python
> > >          >>                  >>              dependencies the
> > >         environment would also
> > >          >>                 need to have the Go
> > >          >>                  >>              boot executable [4] (or a
> > >         substitute
> > >          >>                 thereof) to perform the
> > >          >>                  >>              harness initialization.
> > >          >>                  >>
> > >          >>                  >>              Is anyone else interested
> > >         in this SDK
> > >          >>                 execution option or
> > >          >>                  >>              has already investigated
> an
> > >         alternative
> > >          >>                 implementation?
> > >          >>                  >>
> > >          >>                  >>              Thanks,
> > >          >>                  >>              Thomas
> > >          >>                  >>
> > >          >>                  >>              [1]
> > >          >>                  >>
> > >          >>
> > >          >>
> > >
> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
> > >
> > >          >>
> > >          >>
> > >          >>
> > >         <
> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
> >
> > >
> > >          >>
> > >          >>                  >>
> > >          >>                  >>              [2]
> > >          >>                  >>
> > >          >>
> > >          >>
> > >
> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
> > >
> > >          >>
> > >          >>
> > >          >>
> > >         <
> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
> >
> > >
> > >          >>
> > >          >>                  >>
> > >          >>                  >>              [3]
> > >          >> https://issues.apache.org/jira/browse/BEAM-4819
> > >          >>
> > >         <https://issues.apache.org/jira/browse/BEAM-4819>
> > >          >>                  >>
> > >          >>                  >>              [4]
> > >          >>
> > >          >>
> > >
> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
> > >          >>
> > >          >>
> > >         <
> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
> > >
> > >          >>
> > >          >>                  >>
> > >          >>
> > >          >>                 --                 Max
> > >          >>
> > >          >>
> > >          >
> > >
> > >         --
> > >         Max
> > >
> >
> > --
> > Max
>

Re: Process JobBundleFactory for portable runner

Posted by Robert Bradshaw <ro...@google.com>.

On Mon, Aug 27, 2018 at 11:23 AM Maximilian Michels <mx...@apache.org> wrote:
>
> Thanks for your proposal Henning. +1 for explicit environment messages.
> I'm not sure how important it is to support cross-platform pipelines. I
> can foresee future use but I wouldn't consider it essential.

One may want to execute Tensforflow pipeline segments from the middle
of a Java pipeline, or leverage the SQL code (currently implemented in
Java) from Python or Go. In my experience, a common pipeline shape is
to do a significant amount of (often fairly trivial) filtering at the
front of a pipeline, and sophisticated analysis at the end, and the
tradeoffs of execution efficiency vs. expressiveness and prototype
friendliness are different in these two halves of the same pipeline.

> However, it
> basically comes for free if we extend the existing environment
> information for ExecutableStage. The overhead, as you said, is negligible.

+1. Also it should be noted that the runner is ideally not restricted
to this set of environments; if it understands the URN it can use
whatever environment it finds appropriate. This could be especially
useful for optimal choice of environment to avoid unneeded fusion
barriers (e.g. the trivial Count transform will has a pair-with-one
before the GBK, and a sum-values after the GBK, and assuming those
operations are available in nearly every environment it would be
preferable to choose the variant according to what precedes/follows it
to allow fusion (or, possibly, even embed the operation).

> Also agree that artifact staging is important even with process-based
> execution. The execution environment might be managed externally but we
> still want to be able to execute new pipelines without copying over
> required artifact. That said, a first version could come without
> artifact staging.

One of the parameters passed to the script is the staging endpoint,
which it can use to do all the staging itself, so this could also be
internal to the script. I can however see wanting the ability to stage
artifacts more cheaply (e.g. symlinks), and this is actually also the
case with the docker environment (e.g. mount points).

> On 23.08.18 18:14, Henning Rohde wrote:
> > A process-based SDK harness does not IMO imply that the host is fully
> > provisioned by the SDK/user and invoking the user command line in the
> > context of the staged files is a critical aspect for it to work. So I
> > consider staged artifact support needed. Also, I would like to suggest
> > that we move to a concrete environment proto to crystalize what is
> > actually being proposed. I'm not sure what activating a virtualenv would
> > look like, for example. To start things off:
> >
> > message Environment {
> >    string urn = 1;
> >    bytes payload = 2;
> > }
> >
> > // urn == "beam:env:docker:v1"
> > message DockerPayload {
> >    string container_image = 1;  // implicitly linux_amd64.
> > }
> >
> > // urn == "beam:env:process:v1"
> > message ProcessPayload {
> >    string os = 1;  // "linux", "darwin", ..
> >    string arch = 2;  // "amd64", ..
> >    string command_line = 3;
> > }
> >
> > // urn == "beam:env:external:v1"
> > // (no payload)
> >
> > A runner may support any subset and reject any unsupported
> > configuration. There are 3 kinds of environments that I think are useful:
> >   (1) docker: works as currently. Offers the most flexibility for SDKs
> > and users, especially when the runner is outside the control (such
> > as hosted runners). The runner starts the SDK harnesses.
> >   (2) process: as discussed here. The runner starts the SDK harnesses.
> > The semantics is that the shell commandline is invoked in a directory
> > rooted in the staged artifacts with the container contract arguments. It
> > is up to the user and runner deployment to ensure that it makes sense,
> > i.e., on windows a linux binary or bash script is not specified.
> > Executing the user command in a shell env (bash, zsh, cmd, ..) ensures
> > that paths and so on are set up:, i.e., specifying "java -jar foo" would
> > actually work. Useful for cases where the user controls both the SDK and
> > runner (such as locally) or when docker is not an option. Intended to be
> > minimal and SDK/language agnostic.
> >   (3) external: this is what I think Robert was alluding to. The runner
> > does not start any SDK harnesses. Instead it waits for user-controlled
> > SDK harnesses to connect. Useful for manually debugging SDK code
> > (connect from code running in a debugger) or when the user code must run
> > in a special or privileged environment. It's runner-specific how the SDK
> > will need to connect.
> >
> > Part of the idea of placing this information in the environment is that
> > pipelines can potentially use multiple, such as cross-windows/linux.
> >
> > Henning
> >
> > On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <thw@apache.org
> > <ma...@apache.org>> wrote:
> >
> >     I would see support for staging libraries as optional / nice to have
> >     since that can also be done as part of host provisioning (i.e. in
> >     the Python case a virtual environment was already setup and just
> >     needs to be activated).
> >
> >     Depending on how the command that launches the harness is
> >     configured, additional steps such as virtualenv activate or setting
> >     of other environment variables can be included as well.
> >
> >
> >     On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <mxm@apache.org
> >     <ma...@apache.org>> wrote:
> >
> >         Just to recap:
> >
> >           From this and the other thread ("Bootstraping Beam's Job
> >         Server") we
> >         got sufficient evidence that process-based execution is a
> >         desired feature.
> >
> >         Process-based execution as an alternative to dockerized execution
> >         https://issues.apache.org/jira/browse/BEAM-5187
> >
> >         Which parts are executed as a process?
> >         => The SDK harness for user code
> >
> >         What configuration options are supported?
> >         => Provide information about the target architecture (OS/CPU)
> >         => Staging libraries, as also supported by Docker
> >         => Activating a pre-existing environment (e.g. virutalenv)
> >
> >
> >         On 23.08.18 14:13, Maximilian Michels wrote:
> >          >> One thing to consider that we've talked about in the past.
> >         It might
> >          >> make sense to extend the environment proto and have the SDK be
> >          >> explicit about which kinds of environment it support
> >          >
> >          > +1 Encoding environment information there is a good idea.
> >          >
> >          >> Seems it will create a default docker url even if the
> >          >> hardness_docker_image is set to None in pipeline options.
> >         Shall we add
> >          >> another option or honor the None in this option to support
> >         the process
> >          >> job?
> >          >
> >          > Yes, if no Docker image is set the default one will be used.
> >         Currently
> >          > Docker is the only way to execute pipelines with the
> >         PortableRunner. If
> >          > the docker_image is not set, execution won't succeed.
> >          >
> >          > On 22.08.18 22:59, Xinyu Liu wrote:
> >          >> We are also interested in this Process JobBundleFactory as
> >         we are
> >          >> planning to fork a process to run python sdk in Samza
> >         runner, instead
> >          >> of using docker container. So this change will be helpful to
> >         us too.
> >          >> On the same note, we are trying out portable_runner.py to
> >         submit a
> >          >> python job. Seems it will create a default docker url even
> >         if the
> >          >> hardness_docker_image is set to None in pipeline options.
> >         Shall we add
> >          >> another option or honor the None in this option to support
> >         the process
> >          >> job? I made some local changes right now to walk around this.
> >          >>
> >          >> Thanks,
> >          >> Xinyu
> >          >>
> >          >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde
> >         <herohde@google.com <ma...@google.com>
> >          >> <mailto:herohde@google.com <ma...@google.com>>> wrote:
> >          >>
> >          >>     By "enum" in quotes, I meant the usual open URN style
> >         pattern not an
> >          >>     actual enum. Sorry if that wasn't clear.
> >          >>
> >          >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik
> >         <lcwik@google.com <ma...@google.com>
> >          >>     <mailto:lcwik@google.com <ma...@google.com>>> wrote:
> >          >>
> >          >>         I would model the environment to be more free form
> >         then enums
> >          >>         such that we have forward looking extensibility and
> >         would
> >          >>         suggest to follow the same pattern we use on
> >         PTransforms (using
> >          >>         an URN and a URN specific payload). Note that in
> >         this case we
> >          >>         may want to support a list of supported environments
> >         (e.g. java,
> >          >>         docker, python, ...).
> >          >>
> >          >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
> >          >>         <herohde@google.com <ma...@google.com>
> >         <mailto:herohde@google.com <ma...@google.com>>> wrote:
> >          >>
> >          >>             One thing to consider that we've talked about in
> >         the past.
> >          >>             It might make sense to extend the environment
> >         proto and have
> >          >>             the SDK be explicit about which kinds of
> >         environment it
> >          >>             supports:
> >          >>
> >          >>
> >          >>
> >         https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
> >
> >          >>
> >          >>
> >          >>
> >         <https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969>
> >
> >          >>
> >          >>
> >          >>             This choice might impact what files are staged
> >         or what not.
> >          >>             Some SDKs, such as Go, provide a compiled binary
> >         and _need_
> >          >>             to know what the target architecture is. Running
> >         on a mac
> >          >>             with and without docker, say, requires a
> >         different worker in
> >          >>             each case. If we add an "enum", we can also
> >         easily add the
> >          >>             external idea where the SDK/user starts the SDK
> >         harnesses
> >          >>             instead of the runner. Each runner may not
> >         support all types
> >          >>             of environments.
> >          >>
> >          >>             Henning
> >          >>
> >          >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
> >          >>             <mxm@apache.org <ma...@apache.org>
> >         <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
> >          >>
> >          >>                 For reference, here is corresponding JIRA
> >         issue for this
> >          >>                 thread:
> >          >> https://issues.apache.org/jira/browse/BEAM-5187
> >          >>
> >         <https://issues.apache.org/jira/browse/BEAM-5187>
> >          >>
> >          >>                 On 16.08.18 11:15, Maximilian Michels wrote:
> >          >>                  > Makes sense to have an option to run the
> >         SDK harness
> >          >>                 in a non-dockerized
> >          >>                  > environment.
> >          >>                  >
> >          >>                  > I'm in the process of creating a Docker
> >         entry point
> >          >>                 for Flink's
> >          >>                  > JobServer[1]. I suppose you would also
> >         prefer to
> >          >>                 execute that one
> >          >>                  > standalone. We should make sure this is
> >         also an
> >          >> option.
> >          >>                  >
> >          >>                  > [1]
> >         https://issues.apache.org/jira/browse/BEAM-4130
> >          >>
> >         <https://issues.apache.org/jira/browse/BEAM-4130>
> >          >>                  >
> >          >>                  > On 16.08.18 07:42, Thomas Weise wrote:
> >          >>                  >> Yes, that's the proposal. Everything
> >         that would
> >          >>                 otherwise be packaged
> >          >>                  >> into the Docker container would need to be
> >          >>                 pre-installed in the host
> >          >>                  >> environment. In the case of Python SDK,
> >         that could
> >          >>                 simply mean a
> >          >>                  >> (frozen) virtual environment that was
> >         setup when the
> >          >>                 host was
> >          >>                  >> provisioned - the SDK harness
> >         process(es) will then
> >          >>                 just utilize that.
> >          >>                  >> Of course this flavor of SDK harness
> >         execution could
> >          >>                 also be useful in
> >          >>                  >> the local development environment, where
> >         right now
> >          >>                 someone who already
> >          >>                  >> has the Python environment needs to also
> >         install
> >          >>                 Docker and update a
> >          >>                  >> container to launch a Python SDK
> >         pipeline on the
> >          >>                 Flink runner.
> >          >>                  >>
> >          >>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel
> >         Oliveira
> >          >>                 <danoliveira@google.com
> >         <ma...@google.com> <mailto:danoliveira@google.com
> >         <ma...@google.com>>
> >          >>                  >> <mailto:danoliveira@google.com
> >         <ma...@google.com>
> >          >>                 <mailto:danoliveira@google.com
> >         <ma...@google.com>>>> wrote:
> >          >>                  >>
> >          >>                  >>      I just want to clarify that I
> >         understand this
> >          >>                 correctly since I'm
> >          >>                  >>      not that familiar with the details
> >         behind all
> >          >>                 these execution
> >          >>                  >>      environments yet. Is the proposal
> >         to create a
> >          >>                 new JobBundleFactory
> >          >>                  >>      that instead of using Docker to
> >         create the
> >          >>                 environment that the new
> >          >>                  >>      processes will execute in, this
> >          >>                 JobBundleFactory would execute the
> >          >>                  >>      new processes directly in the host
> >         environment?
> >          >>                 So in practice if I
> >          >>                  >>      ran a pipeline with this
> >         JobBundleFactory the
> >          >>                 SDK Harness and Runner
> >          >>                  >>      Harness would both be executing
> >         directly on my
> >          >>                 machine and would
> >          >>                  >>      depend on me having the
> >         dependencies already
> >          >>                 present on my machine?
> >          >>                  >>
> >          >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM
> >         Ankur Goenka
> >          >>                 <goenka@google.com
> >         <ma...@google.com> <mailto:goenka@google.com
> >         <ma...@google.com>>
> >          >>                  >>      <mailto:goenka@google.com
> >         <ma...@google.com>
> >          >>                 <mailto:goenka@google.com
> >         <ma...@google.com>>>> wrote:
> >          >>                  >>
> >          >>                  >>          Thanks for starting the
> >         discussion. I will
> >          >>                 be happy to help.
> >          >>                  >>          I agree, we should have pluggable
> >          >>                 SDKHarness environment Factory.
> >          >>                  >>          We can register multiple
> >         Environment
> >          >>                 factory using service
> >          >>                  >>          registry and use the
> >         PipelineOption to pick
> >          >>                 the right one on per
> >          >>                  >>          job basis.
> >          >>                  >>
> >          >>                  >>          There are a couple of things
> >         which are
> >          >>                 require to setup before
> >          >>                  >>          launching the process.
> >          >>                  >>
> >          >>                  >>            * Setting up the environment
> >         as done in
> >          >>                 boot.go [4]
> >          >>                  >>            * Retrieving and putting the
> >         artifacts in
> >          >>                 the right location.
> >          >>                  >>
> >          >>                  >>          You can probably leverage
> >         boot.go code to
> >          >>                 setup the environment.
> >          >>                  >>
> >          >>                  >>          Also, it will be useful to
> >         enumerate pros
> >          >>                 and cons of different
> >          >>                  >>          Environments to help users
> >         choose the right
> >          >>                 one.
> >          >>                  >>
> >          >>                  >>
> >          >>                  >>          On Mon, Aug 6, 2018 at 4:50 PM
> >         Thomas Weise
> >          >>                 <thw@apache.org <ma...@apache.org>
> >         <mailto:thw@apache.org <ma...@apache.org>>
> >          >>                  >>          <mailto:thw@apache.org
> >         <ma...@apache.org>
> >          >>                 <mailto:thw@apache.org
> >         <ma...@apache.org>>>> wrote:
> >          >>                  >>
> >          >>                  >>              Hi,
> >          >>                  >>
> >          >>                  >>              Currently the portable
> >         Flink runner
> >          >>                 only works with SDK
> >          >>                  >>              Docker containers for execution
> >          >>                 (DockerJobBundleFactory,
> >          >>                  >>              besides an in-process
> >         (embedded)
> >          >>                 factory option for testing
> >          >>                  >>              [1]). I'm considering
> >         adding another
> >          >>                 out of process
> >          >>                  >>              JobBundleFactory
> >         implementation that
> >          >>                 directly forks the
> >          >>                  >>              processes on the task
> >         manager host,
> >          >>                 eliminating the need for
> >          >>                  >>              Docker. This would work
> >         reasonably well
> >          >>                 in environments
> >          >>                  >>              where the dependencies (in
> >         this case
> >          >>                 Python) can easily be
> >          >>                  >>              tied into the host
> >         deployment (also
> >          >>                 within an application
> >          >>                  >>              specific Kubernetes pod).
> >          >>                  >>
> >          >>                  >>              There was already some
> >         discussion about
> >          >>                 alternative
> >          >>                  >>              JobBundleFactory
> >         implementation in [2].
> >          >>                 There is also a JIRA
> >          >>                  >>              to make the bundle factory
> >         pluggable
> >          >>                 [3], pending
> >          >>                  >>              availability of runner
> >         level options.
> >          >>                  >>
> >          >>                  >>              For a
> >         "ProcessBundleFactory", in
> >          >>                 addition to the Python
> >          >>                  >>              dependencies the
> >         environment would also
> >          >>                 need to have the Go
> >          >>                  >>              boot executable [4] (or a
> >         substitute
> >          >>                 thereof) to perform the
> >          >>                  >>              harness initialization.
> >          >>                  >>
> >          >>                  >>              Is anyone else interested
> >         in this SDK
> >          >>                 execution option or
> >          >>                  >>              has already investigated an
> >         alternative
> >          >>                 implementation?
> >          >>                  >>
> >          >>                  >>              Thanks,
> >          >>                  >>              Thomas
> >          >>                  >>
> >          >>                  >>              [1]
> >          >>                  >>
> >          >>
> >          >>
> >         https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
> >
> >          >>
> >          >>
> >          >>
> >         <https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83>
> >
> >          >>
> >          >>                  >>
> >          >>                  >>              [2]
> >          >>                  >>
> >          >>
> >          >>
> >         https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
> >
> >          >>
> >          >>
> >          >>
> >         <https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E>
> >
> >          >>
> >          >>                  >>
> >          >>                  >>              [3]
> >          >> https://issues.apache.org/jira/browse/BEAM-4819
> >          >>
> >         <https://issues.apache.org/jira/browse/BEAM-4819>
> >          >>                  >>
> >          >>                  >>              [4]
> >          >>
> >          >>
> >         https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
> >          >>
> >          >>
> >         <https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
> >
> >          >>
> >          >>                  >>
> >          >>
> >          >>                 --                 Max
> >          >>
> >          >>
> >          >
> >
> >         --
> >         Max
> >
>
> --
> Max

Re: Process JobBundleFactory for portable runner

Posted by Maximilian Michels <mx...@apache.org>.

Thanks for your proposal Henning. +1 for explicit environment messages. 
I'm not sure how important it is to support cross-platform pipelines. I 
can foresee future use but I wouldn't consider it essential. However, it 
basically comes for free if we extend the existing environment 
information for ExecutableStage. The overhead, as you said, is negligible.

Also agree that artifact staging is important even with process-based 
execution. The execution environment might be managed externally but we 
still want to be able to execute new pipelines without copying over 
required artifact. That said, a first version could come without 
artifact staging.

On 23.08.18 18:14, Henning Rohde wrote:
> A process-based SDK harness does not IMO imply that the host is fully 
> provisioned by the SDK/user and invoking the user command line in the 
> context of the staged files is a critical aspect for it to work. So I 
> consider staged artifact support needed. Also, I would like to suggest 
> that we move to a concrete environment proto to crystalize what is 
> actually being proposed. I'm not sure what activating a virtualenv would 
> look like, for example. To start things off:
> 
> message Environment {
>    string urn = 1;
>    bytes payload = 2;
> }
> 
> // urn == "beam:env:docker:v1"
> message DockerPayload {
>    string container_image = 1;  // implicitly linux_amd64.
> }
> 
> // urn == "beam:env:process:v1"
> message ProcessPayload {
>    string os = 1;  // "linux", "darwin", ..
>    string arch = 2;  // "amd64", ..
>    string command_line = 3;
> }
> 
> // urn == "beam:env:external:v1"
> // (no payload)
> 
> A runner may support any subset and reject any unsupported 
> configuration. There are 3 kinds of environments that I think are useful:
>   (1) docker: works as currently. Offers the most flexibility for SDKs 
> and users, especially when the runner is outside the control (such 
> as hosted runners). The runner starts the SDK harnesses.
>   (2) process: as discussed here. The runner starts the SDK harnesses. 
> The semantics is that the shell commandline is invoked in a directory 
> rooted in the staged artifacts with the container contract arguments. It 
> is up to the user and runner deployment to ensure that it makes sense, 
> i.e., on windows a linux binary or bash script is not specified. 
> Executing the user command in a shell env (bash, zsh, cmd, ..) ensures 
> that paths and so on are set up:, i.e., specifying "java -jar foo" would 
> actually work. Useful for cases where the user controls both the SDK and 
> runner (such as locally) or when docker is not an option. Intended to be 
> minimal and SDK/language agnostic.
>   (3) external: this is what I think Robert was alluding to. The runner 
> does not start any SDK harnesses. Instead it waits for user-controlled 
> SDK harnesses to connect. Useful for manually debugging SDK code 
> (connect from code running in a debugger) or when the user code must run 
> in a special or privileged environment. It's runner-specific how the SDK 
> will need to connect.
> 
> Part of the idea of placing this information in the environment is that 
> pipelines can potentially use multiple, such as cross-windows/linux.
> 
> Henning
> 
> On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <thw@apache.org 
> <ma...@apache.org>> wrote:
> 
>     I would see support for staging libraries as optional / nice to have
>     since that can also be done as part of host provisioning (i.e. in
>     the Python case a virtual environment was already setup and just
>     needs to be activated).
> 
>     Depending on how the command that launches the harness is
>     configured, additional steps such as virtualenv activate or setting
>     of other environment variables can be included as well.
> 
> 
>     On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <mxm@apache.org
>     <ma...@apache.org>> wrote:
> 
>         Just to recap:
> 
>           From this and the other thread ("Bootstraping Beam's Job
>         Server") we
>         got sufficient evidence that process-based execution is a
>         desired feature.
> 
>         Process-based execution as an alternative to dockerized execution
>         https://issues.apache.org/jira/browse/BEAM-5187
> 
>         Which parts are executed as a process?
>         => The SDK harness for user code
> 
>         What configuration options are supported?
>         => Provide information about the target architecture (OS/CPU)
>         => Staging libraries, as also supported by Docker
>         => Activating a pre-existing environment (e.g. virutalenv)
> 
> 
>         On 23.08.18 14:13, Maximilian Michels wrote:
>          >> One thing to consider that we've talked about in the past.
>         It might
>          >> make sense to extend the environment proto and have the SDK be
>          >> explicit about which kinds of environment it support
>          >
>          > +1 Encoding environment information there is a good idea.
>          >
>          >> Seems it will create a default docker url even if the
>          >> hardness_docker_image is set to None in pipeline options.
>         Shall we add
>          >> another option or honor the None in this option to support
>         the process
>          >> job?
>          >
>          > Yes, if no Docker image is set the default one will be used.
>         Currently
>          > Docker is the only way to execute pipelines with the
>         PortableRunner. If
>          > the docker_image is not set, execution won't succeed.
>          >
>          > On 22.08.18 22:59, Xinyu Liu wrote:
>          >> We are also interested in this Process JobBundleFactory as
>         we are
>          >> planning to fork a process to run python sdk in Samza
>         runner, instead
>          >> of using docker container. So this change will be helpful to
>         us too.
>          >> On the same note, we are trying out portable_runner.py to
>         submit a
>          >> python job. Seems it will create a default docker url even
>         if the
>          >> hardness_docker_image is set to None in pipeline options.
>         Shall we add
>          >> another option or honor the None in this option to support
>         the process
>          >> job? I made some local changes right now to walk around this.
>          >>
>          >> Thanks,
>          >> Xinyu
>          >>
>          >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde
>         <herohde@google.com <ma...@google.com>
>          >> <mailto:herohde@google.com <ma...@google.com>>> wrote:
>          >>
>          >>     By "enum" in quotes, I meant the usual open URN style
>         pattern not an
>          >>     actual enum. Sorry if that wasn't clear.
>          >>
>          >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik
>         <lcwik@google.com <ma...@google.com>
>          >>     <mailto:lcwik@google.com <ma...@google.com>>> wrote:
>          >>
>          >>         I would model the environment to be more free form
>         then enums
>          >>         such that we have forward looking extensibility and
>         would
>          >>         suggest to follow the same pattern we use on
>         PTransforms (using
>          >>         an URN and a URN specific payload). Note that in
>         this case we
>          >>         may want to support a list of supported environments
>         (e.g. java,
>          >>         docker, python, ...).
>          >>
>          >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
>          >>         <herohde@google.com <ma...@google.com>
>         <mailto:herohde@google.com <ma...@google.com>>> wrote:
>          >>
>          >>             One thing to consider that we've talked about in
>         the past.
>          >>             It might make sense to extend the environment
>         proto and have
>          >>             the SDK be explicit about which kinds of
>         environment it
>          >>             supports:
>          >>
>          >>
>          >>
>         https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
> 
>          >>
>          >>
>          >>
>         <https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969>
> 
>          >>
>          >>
>          >>             This choice might impact what files are staged
>         or what not.
>          >>             Some SDKs, such as Go, provide a compiled binary
>         and _need_
>          >>             to know what the target architecture is. Running
>         on a mac
>          >>             with and without docker, say, requires a
>         different worker in
>          >>             each case. If we add an "enum", we can also
>         easily add the
>          >>             external idea where the SDK/user starts the SDK
>         harnesses
>          >>             instead of the runner. Each runner may not
>         support all types
>          >>             of environments.
>          >>
>          >>             Henning
>          >>
>          >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
>          >>             <mxm@apache.org <ma...@apache.org>
>         <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
>          >>
>          >>                 For reference, here is corresponding JIRA
>         issue for this
>          >>                 thread:
>          >> https://issues.apache.org/jira/browse/BEAM-5187
>          >>                
>         <https://issues.apache.org/jira/browse/BEAM-5187>
>          >>
>          >>                 On 16.08.18 11:15, Maximilian Michels wrote:
>          >>                  > Makes sense to have an option to run the
>         SDK harness
>          >>                 in a non-dockerized
>          >>                  > environment.
>          >>                  >
>          >>                  > I'm in the process of creating a Docker
>         entry point
>          >>                 for Flink's
>          >>                  > JobServer[1]. I suppose you would also
>         prefer to
>          >>                 execute that one
>          >>                  > standalone. We should make sure this is
>         also an
>          >> option.
>          >>                  >
>          >>                  > [1]
>         https://issues.apache.org/jira/browse/BEAM-4130
>          >>                
>         <https://issues.apache.org/jira/browse/BEAM-4130>
>          >>                  >
>          >>                  > On 16.08.18 07:42, Thomas Weise wrote:
>          >>                  >> Yes, that's the proposal. Everything
>         that would
>          >>                 otherwise be packaged
>          >>                  >> into the Docker container would need to be
>          >>                 pre-installed in the host
>          >>                  >> environment. In the case of Python SDK,
>         that could
>          >>                 simply mean a
>          >>                  >> (frozen) virtual environment that was
>         setup when the
>          >>                 host was
>          >>                  >> provisioned - the SDK harness
>         process(es) will then
>          >>                 just utilize that.
>          >>                  >> Of course this flavor of SDK harness
>         execution could
>          >>                 also be useful in
>          >>                  >> the local development environment, where
>         right now
>          >>                 someone who already
>          >>                  >> has the Python environment needs to also
>         install
>          >>                 Docker and update a
>          >>                  >> container to launch a Python SDK
>         pipeline on the
>          >>                 Flink runner.
>          >>                  >>
>          >>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel
>         Oliveira
>          >>                 <danoliveira@google.com
>         <ma...@google.com> <mailto:danoliveira@google.com
>         <ma...@google.com>>
>          >>                  >> <mailto:danoliveira@google.com
>         <ma...@google.com>
>          >>                 <mailto:danoliveira@google.com
>         <ma...@google.com>>>> wrote:
>          >>                  >>
>          >>                  >>      I just want to clarify that I
>         understand this
>          >>                 correctly since I'm
>          >>                  >>      not that familiar with the details
>         behind all
>          >>                 these execution
>          >>                  >>      environments yet. Is the proposal
>         to create a
>          >>                 new JobBundleFactory
>          >>                  >>      that instead of using Docker to
>         create the
>          >>                 environment that the new
>          >>                  >>      processes will execute in, this
>          >>                 JobBundleFactory would execute the
>          >>                  >>      new processes directly in the host
>         environment?
>          >>                 So in practice if I
>          >>                  >>      ran a pipeline with this
>         JobBundleFactory the
>          >>                 SDK Harness and Runner
>          >>                  >>      Harness would both be executing
>         directly on my
>          >>                 machine and would
>          >>                  >>      depend on me having the
>         dependencies already
>          >>                 present on my machine?
>          >>                  >>
>          >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM
>         Ankur Goenka
>          >>                 <goenka@google.com
>         <ma...@google.com> <mailto:goenka@google.com
>         <ma...@google.com>>
>          >>                  >>      <mailto:goenka@google.com
>         <ma...@google.com>
>          >>                 <mailto:goenka@google.com
>         <ma...@google.com>>>> wrote:
>          >>                  >>
>          >>                  >>          Thanks for starting the
>         discussion. I will
>          >>                 be happy to help.
>          >>                  >>          I agree, we should have pluggable
>          >>                 SDKHarness environment Factory.
>          >>                  >>          We can register multiple
>         Environment
>          >>                 factory using service
>          >>                  >>          registry and use the
>         PipelineOption to pick
>          >>                 the right one on per
>          >>                  >>          job basis.
>          >>                  >>
>          >>                  >>          There are a couple of things
>         which are
>          >>                 require to setup before
>          >>                  >>          launching the process.
>          >>                  >>
>          >>                  >>            * Setting up the environment
>         as done in
>          >>                 boot.go [4]
>          >>                  >>            * Retrieving and putting the
>         artifacts in
>          >>                 the right location.
>          >>                  >>
>          >>                  >>          You can probably leverage
>         boot.go code to
>          >>                 setup the environment.
>          >>                  >>
>          >>                  >>          Also, it will be useful to
>         enumerate pros
>          >>                 and cons of different
>          >>                  >>          Environments to help users
>         choose the right
>          >>                 one.
>          >>                  >>
>          >>                  >>
>          >>                  >>          On Mon, Aug 6, 2018 at 4:50 PM
>         Thomas Weise
>          >>                 <thw@apache.org <ma...@apache.org>
>         <mailto:thw@apache.org <ma...@apache.org>>
>          >>                  >>          <mailto:thw@apache.org
>         <ma...@apache.org>
>          >>                 <mailto:thw@apache.org
>         <ma...@apache.org>>>> wrote:
>          >>                  >>
>          >>                  >>              Hi,
>          >>                  >>
>          >>                  >>              Currently the portable
>         Flink runner
>          >>                 only works with SDK
>          >>                  >>              Docker containers for execution
>          >>                 (DockerJobBundleFactory,
>          >>                  >>              besides an in-process
>         (embedded)
>          >>                 factory option for testing
>          >>                  >>              [1]). I'm considering
>         adding another
>          >>                 out of process
>          >>                  >>              JobBundleFactory
>         implementation that
>          >>                 directly forks the
>          >>                  >>              processes on the task
>         manager host,
>          >>                 eliminating the need for
>          >>                  >>              Docker. This would work
>         reasonably well
>          >>                 in environments
>          >>                  >>              where the dependencies (in
>         this case
>          >>                 Python) can easily be
>          >>                  >>              tied into the host
>         deployment (also
>          >>                 within an application
>          >>                  >>              specific Kubernetes pod).
>          >>                  >>
>          >>                  >>              There was already some
>         discussion about
>          >>                 alternative
>          >>                  >>              JobBundleFactory
>         implementation in [2].
>          >>                 There is also a JIRA
>          >>                  >>              to make the bundle factory
>         pluggable
>          >>                 [3], pending
>          >>                  >>              availability of runner
>         level options.
>          >>                  >>
>          >>                  >>              For a
>         "ProcessBundleFactory", in
>          >>                 addition to the Python
>          >>                  >>              dependencies the
>         environment would also
>          >>                 need to have the Go
>          >>                  >>              boot executable [4] (or a
>         substitute
>          >>                 thereof) to perform the
>          >>                  >>              harness initialization.
>          >>                  >>
>          >>                  >>              Is anyone else interested
>         in this SDK
>          >>                 execution option or
>          >>                  >>              has already investigated an
>         alternative
>          >>                 implementation?
>          >>                  >>
>          >>                  >>              Thanks,
>          >>                  >>              Thomas
>          >>                  >>
>          >>                  >>              [1]
>          >>                  >>
>          >>
>          >>
>         https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
> 
>          >>
>          >>
>          >>
>         <https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83>
> 
>          >>
>          >>                  >>
>          >>                  >>              [2]
>          >>                  >>
>          >>
>          >>
>         https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
> 
>          >>
>          >>
>          >>
>         <https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E>
> 
>          >>
>          >>                  >>
>          >>                  >>              [3]
>          >> https://issues.apache.org/jira/browse/BEAM-4819
>          >>                
>         <https://issues.apache.org/jira/browse/BEAM-4819>
>          >>                  >>
>          >>                  >>              [4]
>          >>
>          >>
>         https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>          >>
>          >>
>         <https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
> 
>          >>
>          >>                  >>
>          >>
>          >>                 --                 Max
>          >>
>          >>
>          >
> 
>         -- 
>         Max
> 

-- 
Max

Re: Process JobBundleFactory for portable runner

Posted by Henning Rohde <he...@google.com>.

A process-based SDK harness does not IMO imply that the host is fully
provisioned by the SDK/user and invoking the user command line in the
context of the staged files is a critical aspect for it to work. So I
consider staged artifact support needed. Also, I would like to suggest that
we move to a concrete environment proto to crystalize what is actually
being proposed. I'm not sure what activating a virtualenv would look like,
for example. To start things off:

message Environment {
  string urn = 1;
  bytes payload = 2;
}

// urn == "beam:env:docker:v1"
message DockerPayload {
  string container_image = 1;  // implicitly linux_amd64.
}

// urn == "beam:env:process:v1"
message ProcessPayload {
  string os = 1;  // "linux", "darwin", ..
  string arch = 2;  // "amd64", ..
  string command_line = 3;
}

// urn == "beam:env:external:v1"
// (no payload)

A runner may support any subset and reject any unsupported configuration.
There are 3 kinds of environments that I think are useful:
 (1) docker: works as currently. Offers the most flexibility for SDKs and
users, especially when the runner is outside the control (such as hosted
runners). The runner starts the SDK harnesses.
 (2) process: as discussed here. The runner starts the SDK harnesses. The
semantics is that the shell commandline is invoked in a directory rooted in
the staged artifacts with the container contract arguments. It is up to the
user and runner deployment to ensure that it makes sense, i.e., on windows
a linux binary or bash script is not specified. Executing the user command
in a shell env (bash, zsh, cmd, ..) ensures that paths and so on are set
up:, i.e., specifying "java -jar foo" would actually work. Useful for cases
where the user controls both the SDK and runner (such as locally) or when
docker is not an option. Intended to be minimal and SDK/language agnostic.
 (3) external: this is what I think Robert was alluding to. The runner does
not start any SDK harnesses. Instead it waits for user-controlled SDK
harnesses to connect. Useful for manually debugging SDK code (connect from
code running in a debugger) or when the user code must run in a special or
privileged environment. It's runner-specific how the SDK will need to
connect.

Part of the idea of placing this information in the environment is that
pipelines can potentially use multiple, such as cross-windows/linux.

Henning

On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <th...@apache.org> wrote:

> I would see support for staging libraries as optional / nice to have since
> that can also be done as part of host provisioning (i.e. in the Python case
> a virtual environment was already setup and just needs to be activated).
>
> Depending on how the command that launches the harness is configured,
> additional steps such as virtualenv activate or setting of other
> environment variables can be included as well.
>
>
> On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <mx...@apache.org> wrote:
>
>> Just to recap:
>>
>>  From this and the other thread ("Bootstraping Beam's Job Server") we
>> got sufficient evidence that process-based execution is a desired feature.
>>
>> Process-based execution as an alternative to dockerized execution
>> https://issues.apache.org/jira/browse/BEAM-5187
>>
>> Which parts are executed as a process?
>> => The SDK harness for user code
>>
>> What configuration options are supported?
>> => Provide information about the target architecture (OS/CPU)
>> => Staging libraries, as also supported by Docker
>> => Activating a pre-existing environment (e.g. virutalenv)
>>
>>
>> On 23.08.18 14:13, Maximilian Michels wrote:
>> >> One thing to consider that we've talked about in the past. It might
>> >> make sense to extend the environment proto and have the SDK be
>> >> explicit about which kinds of environment it support
>> >
>> > +1 Encoding environment information there is a good idea.
>> >
>> >> Seems it will create a default docker url even if the
>> >> hardness_docker_image is set to None in pipeline options. Shall we add
>> >> another option or honor the None in this option to support the process
>> >> job?
>> >
>> > Yes, if no Docker image is set the default one will be used. Currently
>> > Docker is the only way to execute pipelines with the PortableRunner. If
>> > the docker_image is not set, execution won't succeed.
>> >
>> > On 22.08.18 22:59, Xinyu Liu wrote:
>> >> We are also interested in this Process JobBundleFactory as we are
>> >> planning to fork a process to run python sdk in Samza runner, instead
>> >> of using docker container. So this change will be helpful to us too.
>> >> On the same note, we are trying out portable_runner.py to submit a
>> >> python job. Seems it will create a default docker url even if the
>> >> hardness_docker_image is set to None in pipeline options. Shall we add
>> >> another option or honor the None in this option to support the process
>> >> job? I made some local changes right now to walk around this.
>> >>
>> >> Thanks,
>> >> Xinyu
>> >>
>> >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <herohde@google.com
>> >> <ma...@google.com>> wrote:
>> >>
>> >>     By "enum" in quotes, I meant the usual open URN style pattern not
>> an
>> >>     actual enum. Sorry if that wasn't clear.
>> >>
>> >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lcwik@google.com
>> >>     <ma...@google.com>> wrote:
>> >>
>> >>         I would model the environment to be more free form then enums
>> >>         such that we have forward looking extensibility and would
>> >>         suggest to follow the same pattern we use on PTransforms (using
>> >>         an URN and a URN specific payload). Note that in this case we
>> >>         may want to support a list of supported environments (e.g.
>> java,
>> >>         docker, python, ...).
>> >>
>> >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
>> >>         <herohde@google.com <ma...@google.com>> wrote:
>> >>
>> >>             One thing to consider that we've talked about in the past.
>> >>             It might make sense to extend the environment proto and
>> have
>> >>             the SDK be explicit about which kinds of environment it
>> >>             supports:
>> >>
>> >>
>> >>
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>> >>
>> >>
>> >> <
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969>
>>
>> >>
>> >>
>> >>             This choice might impact what files are staged or what not.
>> >>             Some SDKs, such as Go, provide a compiled binary and _need_
>> >>             to know what the target architecture is. Running on a mac
>> >>             with and without docker, say, requires a different worker
>> in
>> >>             each case. If we add an "enum", we can also easily add the
>> >>             external idea where the SDK/user starts the SDK harnesses
>> >>             instead of the runner. Each runner may not support all
>> types
>> >>             of environments.
>> >>
>> >>             Henning
>> >>
>> >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
>> >>             <mxm@apache.org <ma...@apache.org>> wrote:
>> >>
>> >>                 For reference, here is corresponding JIRA issue for
>> this
>> >>                 thread:
>> >>                 https://issues.apache.org/jira/browse/BEAM-5187
>> >>                 <https://issues.apache.org/jira/browse/BEAM-5187>
>> >>
>> >>                 On 16.08.18 11:15, Maximilian Michels wrote:
>> >>                  > Makes sense to have an option to run the SDK harness
>> >>                 in a non-dockerized
>> >>                  > environment.
>> >>                  >
>> >>                  > I'm in the process of creating a Docker entry point
>> >>                 for Flink's
>> >>                  > JobServer[1]. I suppose you would also prefer to
>> >>                 execute that one
>> >>                  > standalone. We should make sure this is also an
>> >> option.
>> >>                  >
>> >>                  > [1] https://issues.apache.org/jira/browse/BEAM-4130
>> >>                 <https://issues.apache.org/jira/browse/BEAM-4130>
>> >>                  >
>> >>                  > On 16.08.18 07:42, Thomas Weise wrote:
>> >>                  >> Yes, that's the proposal. Everything that would
>> >>                 otherwise be packaged
>> >>                  >> into the Docker container would need to be
>> >>                 pre-installed in the host
>> >>                  >> environment. In the case of Python SDK, that could
>> >>                 simply mean a
>> >>                  >> (frozen) virtual environment that was setup when
>> the
>> >>                 host was
>> >>                  >> provisioned - the SDK harness process(es) will then
>> >>                 just utilize that.
>> >>                  >> Of course this flavor of SDK harness execution
>> could
>> >>                 also be useful in
>> >>                  >> the local development environment, where right now
>> >>                 someone who already
>> >>                  >> has the Python environment needs to also install
>> >>                 Docker and update a
>> >>                  >> container to launch a Python SDK pipeline on the
>> >>                 Flink runner.
>> >>                  >>
>> >>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira
>> >>                 <danoliveira@google.com <mailto:danoliveira@google.com
>> >
>> >>                  >> <mailto:danoliveira@google.com
>> >>                 <ma...@google.com>>> wrote:
>> >>                  >>
>> >>                  >>      I just want to clarify that I understand this
>> >>                 correctly since I'm
>> >>                  >>      not that familiar with the details behind all
>> >>                 these execution
>> >>                  >>      environments yet. Is the proposal to create a
>> >>                 new JobBundleFactory
>> >>                  >>      that instead of using Docker to create the
>> >>                 environment that the new
>> >>                  >>      processes will execute in, this
>> >>                 JobBundleFactory would execute the
>> >>                  >>      new processes directly in the host
>> environment?
>> >>                 So in practice if I
>> >>                  >>      ran a pipeline with this JobBundleFactory the
>> >>                 SDK Harness and Runner
>> >>                  >>      Harness would both be executing directly on my
>> >>                 machine and would
>> >>                  >>      depend on me having the dependencies already
>> >>                 present on my machine?
>> >>                  >>
>> >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka
>> >>                 <goenka@google.com <ma...@google.com>
>> >>                  >>      <mailto:goenka@google.com
>> >>                 <ma...@google.com>>> wrote:
>> >>                  >>
>> >>                  >>          Thanks for starting the discussion. I will
>> >>                 be happy to help.
>> >>                  >>          I agree, we should have pluggable
>> >>                 SDKHarness environment Factory.
>> >>                  >>          We can register multiple Environment
>> >>                 factory using service
>> >>                  >>          registry and use the PipelineOption to
>> pick
>> >>                 the right one on per
>> >>                  >>          job basis.
>> >>                  >>
>> >>                  >>          There are a couple of things which are
>> >>                 require to setup before
>> >>                  >>          launching the process.
>> >>                  >>
>> >>                  >>            * Setting up the environment as done in
>> >>                 boot.go [4]
>> >>                  >>            * Retrieving and putting the artifacts
>> in
>> >>                 the right location.
>> >>                  >>
>> >>                  >>          You can probably leverage boot.go code to
>> >>                 setup the environment.
>> >>                  >>
>> >>                  >>          Also, it will be useful to enumerate pros
>> >>                 and cons of different
>> >>                  >>          Environments to help users choose the
>> right
>> >>                 one.
>> >>                  >>
>> >>                  >>
>> >>                  >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas
>> Weise
>> >>                 <thw@apache.org <ma...@apache.org>
>> >>                  >>          <mailto:thw@apache.org
>> >>                 <ma...@apache.org>>> wrote:
>> >>                  >>
>> >>                  >>              Hi,
>> >>                  >>
>> >>                  >>              Currently the portable Flink runner
>> >>                 only works with SDK
>> >>                  >>              Docker containers for execution
>> >>                 (DockerJobBundleFactory,
>> >>                  >>              besides an in-process (embedded)
>> >>                 factory option for testing
>> >>                  >>              [1]). I'm considering adding another
>> >>                 out of process
>> >>                  >>              JobBundleFactory implementation that
>> >>                 directly forks the
>> >>                  >>              processes on the task manager host,
>> >>                 eliminating the need for
>> >>                  >>              Docker. This would work reasonably
>> well
>> >>                 in environments
>> >>                  >>              where the dependencies (in this case
>> >>                 Python) can easily be
>> >>                  >>              tied into the host deployment (also
>> >>                 within an application
>> >>                  >>              specific Kubernetes pod).
>> >>                  >>
>> >>                  >>              There was already some discussion
>> about
>> >>                 alternative
>> >>                  >>              JobBundleFactory implementation in
>> [2].
>> >>                 There is also a JIRA
>> >>                  >>              to make the bundle factory pluggable
>> >>                 [3], pending
>> >>                  >>              availability of runner level options.
>> >>                  >>
>> >>                  >>              For a "ProcessBundleFactory", in
>> >>                 addition to the Python
>> >>                  >>              dependencies the environment would
>> also
>> >>                 need to have the Go
>> >>                  >>              boot executable [4] (or a substitute
>> >>                 thereof) to perform the
>> >>                  >>              harness initialization.
>> >>                  >>
>> >>                  >>              Is anyone else interested in this SDK
>> >>                 execution option or
>> >>                  >>              has already investigated an
>> alternative
>> >>                 implementation?
>> >>                  >>
>> >>                  >>              Thanks,
>> >>                  >>              Thomas
>> >>                  >>
>> >>                  >>              [1]
>> >>                  >>
>> >>
>> >>
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>> >>
>> >>
>> >> <
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83>
>>
>> >>
>> >>                  >>
>> >>                  >>              [2]
>> >>                  >>
>> >>
>> >>
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>> >>
>> >>
>> >> <
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E>
>>
>> >>
>> >>                  >>
>> >>                  >>              [3]
>> >>                 https://issues.apache.org/jira/browse/BEAM-4819
>> >>                 <https://issues.apache.org/jira/browse/BEAM-4819>
>> >>                  >>
>> >>                  >>              [4]
>> >>
>> >>
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>> >>
>> >> <
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
>>
>> >>
>> >>                  >>
>> >>
>> >>                 --                 Max
>> >>
>> >>
>> >
>>
>> --
>> Max
>>
>

Re: Process JobBundleFactory for portable runner

Posted by Thomas Weise <th...@apache.org>.

I would see support for staging libraries as optional / nice to have since
that can also be done as part of host provisioning (i.e. in the Python case
a virtual environment was already setup and just needs to be activated).

Depending on how the command that launches the harness is configured,
additional steps such as virtualenv activate or setting of other
environment variables can be included as well.


On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <mx...@apache.org> wrote:

> Just to recap:
>
>  From this and the other thread ("Bootstraping Beam's Job Server") we
> got sufficient evidence that process-based execution is a desired feature.
>
> Process-based execution as an alternative to dockerized execution
> https://issues.apache.org/jira/browse/BEAM-5187
>
> Which parts are executed as a process?
> => The SDK harness for user code
>
> What configuration options are supported?
> => Provide information about the target architecture (OS/CPU)
> => Staging libraries, as also supported by Docker
> => Activating a pre-existing environment (e.g. virutalenv)
>
>
> On 23.08.18 14:13, Maximilian Michels wrote:
> >> One thing to consider that we've talked about in the past. It might
> >> make sense to extend the environment proto and have the SDK be
> >> explicit about which kinds of environment it support
> >
> > +1 Encoding environment information there is a good idea.
> >
> >> Seems it will create a default docker url even if the
> >> hardness_docker_image is set to None in pipeline options. Shall we add
> >> another option or honor the None in this option to support the process
> >> job?
> >
> > Yes, if no Docker image is set the default one will be used. Currently
> > Docker is the only way to execute pipelines with the PortableRunner. If
> > the docker_image is not set, execution won't succeed.
> >
> > On 22.08.18 22:59, Xinyu Liu wrote:
> >> We are also interested in this Process JobBundleFactory as we are
> >> planning to fork a process to run python sdk in Samza runner, instead
> >> of using docker container. So this change will be helpful to us too.
> >> On the same note, we are trying out portable_runner.py to submit a
> >> python job. Seems it will create a default docker url even if the
> >> hardness_docker_image is set to None in pipeline options. Shall we add
> >> another option or honor the None in this option to support the process
> >> job? I made some local changes right now to walk around this.
> >>
> >> Thanks,
> >> Xinyu
> >>
> >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <herohde@google.com
> >> <ma...@google.com>> wrote:
> >>
> >>     By "enum" in quotes, I meant the usual open URN style pattern not an
> >>     actual enum. Sorry if that wasn't clear.
> >>
> >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lcwik@google.com
> >>     <ma...@google.com>> wrote:
> >>
> >>         I would model the environment to be more free form then enums
> >>         such that we have forward looking extensibility and would
> >>         suggest to follow the same pattern we use on PTransforms (using
> >>         an URN and a URN specific payload). Note that in this case we
> >>         may want to support a list of supported environments (e.g. java,
> >>         docker, python, ...).
> >>
> >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
> >>         <herohde@google.com <ma...@google.com>> wrote:
> >>
> >>             One thing to consider that we've talked about in the past.
> >>             It might make sense to extend the environment proto and have
> >>             the SDK be explicit about which kinds of environment it
> >>             supports:
> >>
> >>
> >>
> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
> >>
> >>
> >> <
> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969>
>
> >>
> >>
> >>             This choice might impact what files are staged or what not.
> >>             Some SDKs, such as Go, provide a compiled binary and _need_
> >>             to know what the target architecture is. Running on a mac
> >>             with and without docker, say, requires a different worker in
> >>             each case. If we add an "enum", we can also easily add the
> >>             external idea where the SDK/user starts the SDK harnesses
> >>             instead of the runner. Each runner may not support all types
> >>             of environments.
> >>
> >>             Henning
> >>
> >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
> >>             <mxm@apache.org <ma...@apache.org>> wrote:
> >>
> >>                 For reference, here is corresponding JIRA issue for this
> >>                 thread:
> >>                 https://issues.apache.org/jira/browse/BEAM-5187
> >>                 <https://issues.apache.org/jira/browse/BEAM-5187>
> >>
> >>                 On 16.08.18 11:15, Maximilian Michels wrote:
> >>                  > Makes sense to have an option to run the SDK harness
> >>                 in a non-dockerized
> >>                  > environment.
> >>                  >
> >>                  > I'm in the process of creating a Docker entry point
> >>                 for Flink's
> >>                  > JobServer[1]. I suppose you would also prefer to
> >>                 execute that one
> >>                  > standalone. We should make sure this is also an
> >> option.
> >>                  >
> >>                  > [1] https://issues.apache.org/jira/browse/BEAM-4130
> >>                 <https://issues.apache.org/jira/browse/BEAM-4130>
> >>                  >
> >>                  > On 16.08.18 07:42, Thomas Weise wrote:
> >>                  >> Yes, that's the proposal. Everything that would
> >>                 otherwise be packaged
> >>                  >> into the Docker container would need to be
> >>                 pre-installed in the host
> >>                  >> environment. In the case of Python SDK, that could
> >>                 simply mean a
> >>                  >> (frozen) virtual environment that was setup when the
> >>                 host was
> >>                  >> provisioned - the SDK harness process(es) will then
> >>                 just utilize that.
> >>                  >> Of course this flavor of SDK harness execution could
> >>                 also be useful in
> >>                  >> the local development environment, where right now
> >>                 someone who already
> >>                  >> has the Python environment needs to also install
> >>                 Docker and update a
> >>                  >> container to launch a Python SDK pipeline on the
> >>                 Flink runner.
> >>                  >>
> >>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira
> >>                 <danoliveira@google.com <ma...@google.com>
> >>                  >> <mailto:danoliveira@google.com
> >>                 <ma...@google.com>>> wrote:
> >>                  >>
> >>                  >>      I just want to clarify that I understand this
> >>                 correctly since I'm
> >>                  >>      not that familiar with the details behind all
> >>                 these execution
> >>                  >>      environments yet. Is the proposal to create a
> >>                 new JobBundleFactory
> >>                  >>      that instead of using Docker to create the
> >>                 environment that the new
> >>                  >>      processes will execute in, this
> >>                 JobBundleFactory would execute the
> >>                  >>      new processes directly in the host environment?
> >>                 So in practice if I
> >>                  >>      ran a pipeline with this JobBundleFactory the
> >>                 SDK Harness and Runner
> >>                  >>      Harness would both be executing directly on my
> >>                 machine and would
> >>                  >>      depend on me having the dependencies already
> >>                 present on my machine?
> >>                  >>
> >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka
> >>                 <goenka@google.com <ma...@google.com>
> >>                  >>      <mailto:goenka@google.com
> >>                 <ma...@google.com>>> wrote:
> >>                  >>
> >>                  >>          Thanks for starting the discussion. I will
> >>                 be happy to help.
> >>                  >>          I agree, we should have pluggable
> >>                 SDKHarness environment Factory.
> >>                  >>          We can register multiple Environment
> >>                 factory using service
> >>                  >>          registry and use the PipelineOption to pick
> >>                 the right one on per
> >>                  >>          job basis.
> >>                  >>
> >>                  >>          There are a couple of things which are
> >>                 require to setup before
> >>                  >>          launching the process.
> >>                  >>
> >>                  >>            * Setting up the environment as done in
> >>                 boot.go [4]
> >>                  >>            * Retrieving and putting the artifacts in
> >>                 the right location.
> >>                  >>
> >>                  >>          You can probably leverage boot.go code to
> >>                 setup the environment.
> >>                  >>
> >>                  >>          Also, it will be useful to enumerate pros
> >>                 and cons of different
> >>                  >>          Environments to help users choose the right
> >>                 one.
> >>                  >>
> >>                  >>
> >>                  >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise
> >>                 <thw@apache.org <ma...@apache.org>
> >>                  >>          <mailto:thw@apache.org
> >>                 <ma...@apache.org>>> wrote:
> >>                  >>
> >>                  >>              Hi,
> >>                  >>
> >>                  >>              Currently the portable Flink runner
> >>                 only works with SDK
> >>                  >>              Docker containers for execution
> >>                 (DockerJobBundleFactory,
> >>                  >>              besides an in-process (embedded)
> >>                 factory option for testing
> >>                  >>              [1]). I'm considering adding another
> >>                 out of process
> >>                  >>              JobBundleFactory implementation that
> >>                 directly forks the
> >>                  >>              processes on the task manager host,
> >>                 eliminating the need for
> >>                  >>              Docker. This would work reasonably well
> >>                 in environments
> >>                  >>              where the dependencies (in this case
> >>                 Python) can easily be
> >>                  >>              tied into the host deployment (also
> >>                 within an application
> >>                  >>              specific Kubernetes pod).
> >>                  >>
> >>                  >>              There was already some discussion about
> >>                 alternative
> >>                  >>              JobBundleFactory implementation in [2].
> >>                 There is also a JIRA
> >>                  >>              to make the bundle factory pluggable
> >>                 [3], pending
> >>                  >>              availability of runner level options.
> >>                  >>
> >>                  >>              For a "ProcessBundleFactory", in
> >>                 addition to the Python
> >>                  >>              dependencies the environment would also
> >>                 need to have the Go
> >>                  >>              boot executable [4] (or a substitute
> >>                 thereof) to perform the
> >>                  >>              harness initialization.
> >>                  >>
> >>                  >>              Is anyone else interested in this SDK
> >>                 execution option or
> >>                  >>              has already investigated an alternative
> >>                 implementation?
> >>                  >>
> >>                  >>              Thanks,
> >>                  >>              Thomas
> >>                  >>
> >>                  >>              [1]
> >>                  >>
> >>
> >>
> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
> >>
> >>
> >> <
> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83>
>
> >>
> >>                  >>
> >>                  >>              [2]
> >>                  >>
> >>
> >>
> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
> >>
> >>
> >> <
> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E>
>
> >>
> >>                  >>
> >>                  >>              [3]
> >>                 https://issues.apache.org/jira/browse/BEAM-4819
> >>                 <https://issues.apache.org/jira/browse/BEAM-4819>
> >>                  >>
> >>                  >>              [4]
> >>
> >>
> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
> >>
> >> <
> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
> >>
> >>                  >>
> >>
> >>                 --                 Max
> >>
> >>
> >
>
> --
> Max
>

Re: Process JobBundleFactory for portable runner

Posted by Maximilian Michels <mx...@apache.org>.

Just to recap:

 From this and the other thread ("Bootstraping Beam's Job Server") we 
got sufficient evidence that process-based execution is a desired feature.

Process-based execution as an alternative to dockerized execution
https://issues.apache.org/jira/browse/BEAM-5187

Which parts are executed as a process?
=> The SDK harness for user code

What configuration options are supported?
=> Provide information about the target architecture (OS/CPU)
=> Staging libraries, as also supported by Docker
=> Activating a pre-existing environment (e.g. virutalenv)


On 23.08.18 14:13, Maximilian Michels wrote:
>> One thing to consider that we've talked about in the past. It might 
>> make sense to extend the environment proto and have the SDK be 
>> explicit about which kinds of environment it support
> 
> +1 Encoding environment information there is a good idea.
> 
>> Seems it will create a default docker url even if the 
>> hardness_docker_image is set to None in pipeline options. Shall we add 
>> another option or honor the None in this option to support the process 
>> job? 
> 
> Yes, if no Docker image is set the default one will be used. Currently 
> Docker is the only way to execute pipelines with the PortableRunner. If 
> the docker_image is not set, execution won't succeed.
> 
> On 22.08.18 22:59, Xinyu Liu wrote:
>> We are also interested in this Process JobBundleFactory as we are 
>> planning to fork a process to run python sdk in Samza runner, instead 
>> of using docker container. So this change will be helpful to us too. 
>> On the same note, we are trying out portable_runner.py to submit a 
>> python job. Seems it will create a default docker url even if the 
>> hardness_docker_image is set to None in pipeline options. Shall we add 
>> another option or honor the None in this option to support the process 
>> job? I made some local changes right now to walk around this.
>>
>> Thanks,
>> Xinyu
>>
>> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <herohde@google.com 
>> <ma...@google.com>> wrote:
>>
>>     By "enum" in quotes, I meant the usual open URN style pattern not an
>>     actual enum. Sorry if that wasn't clear.
>>
>>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lcwik@google.com
>>     <ma...@google.com>> wrote:
>>
>>         I would model the environment to be more free form then enums
>>         such that we have forward looking extensibility and would
>>         suggest to follow the same pattern we use on PTransforms (using
>>         an URN and a URN specific payload). Note that in this case we
>>         may want to support a list of supported environments (e.g. java,
>>         docker, python, ...).
>>
>>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
>>         <herohde@google.com <ma...@google.com>> wrote:
>>
>>             One thing to consider that we've talked about in the past.
>>             It might make sense to extend the environment proto and have
>>             the SDK be explicit about which kinds of environment it
>>             supports:
>>
>>             
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969 
>>
>>             
>> <https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969> 
>>
>>
>>             This choice might impact what files are staged or what not.
>>             Some SDKs, such as Go, provide a compiled binary and _need_
>>             to know what the target architecture is. Running on a mac
>>             with and without docker, say, requires a different worker in
>>             each case. If we add an "enum", we can also easily add the
>>             external idea where the SDK/user starts the SDK harnesses
>>             instead of the runner. Each runner may not support all types
>>             of environments.
>>
>>             Henning
>>
>>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
>>             <mxm@apache.org <ma...@apache.org>> wrote:
>>
>>                 For reference, here is corresponding JIRA issue for this
>>                 thread:
>>                 https://issues.apache.org/jira/browse/BEAM-5187
>>                 <https://issues.apache.org/jira/browse/BEAM-5187>
>>
>>                 On 16.08.18 11:15, Maximilian Michels wrote:
>>                  > Makes sense to have an option to run the SDK harness
>>                 in a non-dockerized
>>                  > environment.
>>                  >
>>                  > I'm in the process of creating a Docker entry point
>>                 for Flink's
>>                  > JobServer[1]. I suppose you would also prefer to
>>                 execute that one
>>                  > standalone. We should make sure this is also an 
>> option.
>>                  >
>>                  > [1] https://issues.apache.org/jira/browse/BEAM-4130
>>                 <https://issues.apache.org/jira/browse/BEAM-4130>
>>                  >
>>                  > On 16.08.18 07:42, Thomas Weise wrote:
>>                  >> Yes, that's the proposal. Everything that would
>>                 otherwise be packaged
>>                  >> into the Docker container would need to be
>>                 pre-installed in the host
>>                  >> environment. In the case of Python SDK, that could
>>                 simply mean a
>>                  >> (frozen) virtual environment that was setup when the
>>                 host was
>>                  >> provisioned - the SDK harness process(es) will then
>>                 just utilize that.
>>                  >> Of course this flavor of SDK harness execution could
>>                 also be useful in
>>                  >> the local development environment, where right now
>>                 someone who already
>>                  >> has the Python environment needs to also install
>>                 Docker and update a
>>                  >> container to launch a Python SDK pipeline on the
>>                 Flink runner.
>>                  >>
>>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira
>>                 <danoliveira@google.com <ma...@google.com>
>>                  >> <mailto:danoliveira@google.com
>>                 <ma...@google.com>>> wrote:
>>                  >>
>>                  >>      I just want to clarify that I understand this
>>                 correctly since I'm
>>                  >>      not that familiar with the details behind all
>>                 these execution
>>                  >>      environments yet. Is the proposal to create a
>>                 new JobBundleFactory
>>                  >>      that instead of using Docker to create the
>>                 environment that the new
>>                  >>      processes will execute in, this
>>                 JobBundleFactory would execute the
>>                  >>      new processes directly in the host environment?
>>                 So in practice if I
>>                  >>      ran a pipeline with this JobBundleFactory the
>>                 SDK Harness and Runner
>>                  >>      Harness would both be executing directly on my
>>                 machine and would
>>                  >>      depend on me having the dependencies already
>>                 present on my machine?
>>                  >>
>>                  >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka
>>                 <goenka@google.com <ma...@google.com>
>>                  >>      <mailto:goenka@google.com
>>                 <ma...@google.com>>> wrote:
>>                  >>
>>                  >>          Thanks for starting the discussion. I will
>>                 be happy to help.
>>                  >>          I agree, we should have pluggable
>>                 SDKHarness environment Factory.
>>                  >>          We can register multiple Environment
>>                 factory using service
>>                  >>          registry and use the PipelineOption to pick
>>                 the right one on per
>>                  >>          job basis.
>>                  >>
>>                  >>          There are a couple of things which are
>>                 require to setup before
>>                  >>          launching the process.
>>                  >>
>>                  >>            * Setting up the environment as done in
>>                 boot.go [4]
>>                  >>            * Retrieving and putting the artifacts in
>>                 the right location.
>>                  >>
>>                  >>          You can probably leverage boot.go code to
>>                 setup the environment.
>>                  >>
>>                  >>          Also, it will be useful to enumerate pros
>>                 and cons of different
>>                  >>          Environments to help users choose the right
>>                 one.
>>                  >>
>>                  >>
>>                  >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise
>>                 <thw@apache.org <ma...@apache.org>
>>                  >>          <mailto:thw@apache.org
>>                 <ma...@apache.org>>> wrote:
>>                  >>
>>                  >>              Hi,
>>                  >>
>>                  >>              Currently the portable Flink runner
>>                 only works with SDK
>>                  >>              Docker containers for execution
>>                 (DockerJobBundleFactory,
>>                  >>              besides an in-process (embedded)
>>                 factory option for testing
>>                  >>              [1]). I'm considering adding another
>>                 out of process
>>                  >>              JobBundleFactory implementation that
>>                 directly forks the
>>                  >>              processes on the task manager host,
>>                 eliminating the need for
>>                  >>              Docker. This would work reasonably well
>>                 in environments
>>                  >>              where the dependencies (in this case
>>                 Python) can easily be
>>                  >>              tied into the host deployment (also
>>                 within an application
>>                  >>              specific Kubernetes pod).
>>                  >>
>>                  >>              There was already some discussion about
>>                 alternative
>>                  >>              JobBundleFactory implementation in [2].
>>                 There is also a JIRA
>>                  >>              to make the bundle factory pluggable
>>                 [3], pending
>>                  >>              availability of runner level options.
>>                  >>
>>                  >>              For a "ProcessBundleFactory", in
>>                 addition to the Python
>>                  >>              dependencies the environment would also
>>                 need to have the Go
>>                  >>              boot executable [4] (or a substitute
>>                 thereof) to perform the
>>                  >>              harness initialization.
>>                  >>
>>                  >>              Is anyone else interested in this SDK
>>                 execution option or
>>                  >>              has already investigated an alternative
>>                 implementation?
>>                  >>
>>                  >>              Thanks,
>>                  >>              Thomas
>>                  >>
>>                  >>              [1]
>>                  >>
>>                 
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83 
>>
>>                 
>> <https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83> 
>>
>>                  >>
>>                  >>              [2]
>>                  >>
>>                 
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E 
>>
>>                 
>> <https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E> 
>>
>>                  >>
>>                  >>              [3]
>>                 https://issues.apache.org/jira/browse/BEAM-4819
>>                 <https://issues.apache.org/jira/browse/BEAM-4819>
>>                  >>
>>                  >>              [4]
>>                 
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>>                 
>> <https://github.com/apache/beam/blob/master/sdks/python/container/boot.go> 
>>
>>                  >>
>>
>>                 --                 Max
>>
>>
> 

-- 
Max

Re: Process JobBundleFactory for portable runner

Posted by Maximilian Michels <mx...@apache.org>.

> One thing to consider that we've talked about in the past. It might make sense to extend the environment proto and have the SDK be explicit about which kinds of environment it support

+1 Encoding environment information there is a good idea.

> Seems it will create a default docker url even if the 
> hardness_docker_image is set to None in pipeline options. Shall we add 
> another option or honor the None in this option to support the process 
> job? 

Yes, if no Docker image is set the default one will be used. Currently 
Docker is the only way to execute pipelines with the PortableRunner. If 
the docker_image is not set, execution won't succeed.

On 22.08.18 22:59, Xinyu Liu wrote:
> We are also interested in this Process JobBundleFactory as we are 
> planning to fork a process to run python sdk in Samza runner, instead of 
> using docker container. So this change will be helpful to us too. On the 
> same note, we are trying out portable_runner.py to submit a python job. 
> Seems it will create a default docker url even if the 
> hardness_docker_image is set to None in pipeline options. Shall we add 
> another option or honor the None in this option to support the process 
> job? I made some local changes right now to walk around this.
> 
> Thanks,
> Xinyu
> 
> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <herohde@google.com 
> <ma...@google.com>> wrote:
> 
>     By "enum" in quotes, I meant the usual open URN style pattern not an
>     actual enum. Sorry if that wasn't clear.
> 
>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lcwik@google.com
>     <ma...@google.com>> wrote:
> 
>         I would model the environment to be more free form then enums
>         such that we have forward looking extensibility and would
>         suggest to follow the same pattern we use on PTransforms (using
>         an URN and a URN specific payload). Note that in this case we
>         may want to support a list of supported environments (e.g. java,
>         docker, python, ...).
> 
>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
>         <herohde@google.com <ma...@google.com>> wrote:
> 
>             One thing to consider that we've talked about in the past.
>             It might make sense to extend the environment proto and have
>             the SDK be explicit about which kinds of environment it
>             supports:
> 
>             https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>             <https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969>
> 
>             This choice might impact what files are staged or what not.
>             Some SDKs, such as Go, provide a compiled binary and _need_
>             to know what the target architecture is. Running on a mac
>             with and without docker, say, requires a different worker in
>             each case. If we add an "enum", we can also easily add the
>             external idea where the SDK/user starts the SDK harnesses
>             instead of the runner. Each runner may not support all types
>             of environments.
> 
>             Henning
> 
>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
>             <mxm@apache.org <ma...@apache.org>> wrote:
> 
>                 For reference, here is corresponding JIRA issue for this
>                 thread:
>                 https://issues.apache.org/jira/browse/BEAM-5187
>                 <https://issues.apache.org/jira/browse/BEAM-5187>
> 
>                 On 16.08.18 11:15, Maximilian Michels wrote:
>                  > Makes sense to have an option to run the SDK harness
>                 in a non-dockerized
>                  > environment.
>                  >
>                  > I'm in the process of creating a Docker entry point
>                 for Flink's
>                  > JobServer[1]. I suppose you would also prefer to
>                 execute that one
>                  > standalone. We should make sure this is also an option.
>                  >
>                  > [1] https://issues.apache.org/jira/browse/BEAM-4130
>                 <https://issues.apache.org/jira/browse/BEAM-4130>
>                  >
>                  > On 16.08.18 07:42, Thomas Weise wrote:
>                  >> Yes, that's the proposal. Everything that would
>                 otherwise be packaged
>                  >> into the Docker container would need to be
>                 pre-installed in the host
>                  >> environment. In the case of Python SDK, that could
>                 simply mean a
>                  >> (frozen) virtual environment that was setup when the
>                 host was
>                  >> provisioned - the SDK harness process(es) will then
>                 just utilize that.
>                  >> Of course this flavor of SDK harness execution could
>                 also be useful in
>                  >> the local development environment, where right now
>                 someone who already
>                  >> has the Python environment needs to also install
>                 Docker and update a
>                  >> container to launch a Python SDK pipeline on the
>                 Flink runner.
>                  >>
>                  >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira
>                 <danoliveira@google.com <ma...@google.com>
>                  >> <mailto:danoliveira@google.com
>                 <ma...@google.com>>> wrote:
>                  >>
>                  >>      I just want to clarify that I understand this
>                 correctly since I'm
>                  >>      not that familiar with the details behind all
>                 these execution
>                  >>      environments yet. Is the proposal to create a
>                 new JobBundleFactory
>                  >>      that instead of using Docker to create the
>                 environment that the new
>                  >>      processes will execute in, this
>                 JobBundleFactory would execute the
>                  >>      new processes directly in the host environment?
>                 So in practice if I
>                  >>      ran a pipeline with this JobBundleFactory the
>                 SDK Harness and Runner
>                  >>      Harness would both be executing directly on my
>                 machine and would
>                  >>      depend on me having the dependencies already
>                 present on my machine?
>                  >>
>                  >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka
>                 <goenka@google.com <ma...@google.com>
>                  >>      <mailto:goenka@google.com
>                 <ma...@google.com>>> wrote:
>                  >>
>                  >>          Thanks for starting the discussion. I will
>                 be happy to help.
>                  >>          I agree, we should have pluggable
>                 SDKHarness environment Factory.
>                  >>          We can register multiple Environment
>                 factory using service
>                  >>          registry and use the PipelineOption to pick
>                 the right one on per
>                  >>          job basis.
>                  >>
>                  >>          There are a couple of things which are
>                 require to setup before
>                  >>          launching the process.
>                  >>
>                  >>            * Setting up the environment as done in
>                 boot.go [4]
>                  >>            * Retrieving and putting the artifacts in
>                 the right location.
>                  >>
>                  >>          You can probably leverage boot.go code to
>                 setup the environment.
>                  >>
>                  >>          Also, it will be useful to enumerate pros
>                 and cons of different
>                  >>          Environments to help users choose the right
>                 one.
>                  >>
>                  >>
>                  >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise
>                 <thw@apache.org <ma...@apache.org>
>                  >>          <mailto:thw@apache.org
>                 <ma...@apache.org>>> wrote:
>                  >>
>                  >>              Hi,
>                  >>
>                  >>              Currently the portable Flink runner
>                 only works with SDK
>                  >>              Docker containers for execution
>                 (DockerJobBundleFactory,
>                  >>              besides an in-process (embedded)
>                 factory option for testing
>                  >>              [1]). I'm considering adding another
>                 out of process
>                  >>              JobBundleFactory implementation that
>                 directly forks the
>                  >>              processes on the task manager host,
>                 eliminating the need for
>                  >>              Docker. This would work reasonably well
>                 in environments
>                  >>              where the dependencies (in this case
>                 Python) can easily be
>                  >>              tied into the host deployment (also
>                 within an application
>                  >>              specific Kubernetes pod).
>                  >>
>                  >>              There was already some discussion about
>                 alternative
>                  >>              JobBundleFactory implementation in [2].
>                 There is also a JIRA
>                  >>              to make the bundle factory pluggable
>                 [3], pending
>                  >>              availability of runner level options.
>                  >>
>                  >>              For a "ProcessBundleFactory", in
>                 addition to the Python
>                  >>              dependencies the environment would also
>                 need to have the Go
>                  >>              boot executable [4] (or a substitute
>                 thereof) to perform the
>                  >>              harness initialization.
>                  >>
>                  >>              Is anyone else interested in this SDK
>                 execution option or
>                  >>              has already investigated an alternative
>                 implementation?
>                  >>
>                  >>              Thanks,
>                  >>              Thomas
>                  >>
>                  >>              [1]
>                  >>
>                 https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>                 <https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83>
>                  >>
>                  >>              [2]
>                  >>
>                 https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>                 <https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E>
>                  >>
>                  >>              [3]
>                 https://issues.apache.org/jira/browse/BEAM-4819
>                 <https://issues.apache.org/jira/browse/BEAM-4819>
>                  >>
>                  >>              [4]
>                 https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>                 <https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
>                  >>
> 
>                 -- 
>                 Max
> 
> 

-- 
Max

Re: Process JobBundleFactory for portable runner

Posted by Xinyu Liu <xi...@gmail.com>.

We are also interested in this Process JobBundleFactory as we are planning
to fork a process to run python sdk in Samza runner, instead of using
docker container. So this change will be helpful to us too. On the same
note, we are trying out portable_runner.py to submit a python job. Seems it
will create a default docker url even if the hardness_docker_image is set
to None in pipeline options. Shall we add another option or honor the None
in this option to support the process job? I made some local changes right
now to walk around this.

Thanks,
Xinyu

On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <he...@google.com> wrote:

> By "enum" in quotes, I meant the usual open URN style pattern not an
> actual enum. Sorry if that wasn't clear.
>
> On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> I would model the environment to be more free form then enums such that
>> we have forward looking extensibility and would suggest to follow the same
>> pattern we use on PTransforms (using an URN and a URN specific payload).
>> Note that in this case we may want to support a list of supported
>> environments (e.g. java, docker, python, ...).
>>
>> On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde <he...@google.com>
>> wrote:
>>
>>> One thing to consider that we've talked about in the past. It might make
>>> sense to extend the environment proto and have the SDK be explicit about
>>> which kinds of environment it supports:
>>>
>>>         https://github.com/apache/beam/blob/
>>> 8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/
>>> main/proto/beam_runner_api.proto#L969
>>>
>>> This choice might impact what files are staged or what not. Some SDKs,
>>> such as Go, provide a compiled binary and _need_ to know what the target
>>> architecture is. Running on a mac with and without docker, say, requires a
>>> different worker in each case. If we add an "enum", we can also easily add
>>> the external idea where the SDK/user starts the SDK harnesses instead of
>>> the runner. Each runner may not support all types of environments.
>>>
>>> Henning
>>>
>>> On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels <mx...@apache.org>
>>> wrote:
>>>
>>>> For reference, here is corresponding JIRA issue for this thread:
>>>> https://issues.apache.org/jira/browse/BEAM-5187
>>>>
>>>> On 16.08.18 11:15, Maximilian Michels wrote:
>>>> > Makes sense to have an option to run the SDK harness in a
>>>> non-dockerized
>>>> > environment.
>>>> >
>>>> > I'm in the process of creating a Docker entry point for Flink's
>>>> > JobServer[1]. I suppose you would also prefer to execute that one
>>>> > standalone. We should make sure this is also an option.
>>>> >
>>>> > [1] https://issues.apache.org/jira/browse/BEAM-4130
>>>> >
>>>> > On 16.08.18 07:42, Thomas Weise wrote:
>>>> >> Yes, that's the proposal. Everything that would otherwise be packaged
>>>> >> into the Docker container would need to be pre-installed in the host
>>>> >> environment. In the case of Python SDK, that could simply mean a
>>>> >> (frozen) virtual environment that was setup when the host was
>>>> >> provisioned - the SDK harness process(es) will then just utilize
>>>> that.
>>>> >> Of course this flavor of SDK harness execution could also be useful
>>>> in
>>>> >> the local development environment, where right now someone who
>>>> already
>>>> >> has the Python environment needs to also install Docker and update a
>>>> >> container to launch a Python SDK pipeline on the Flink runner.
>>>> >>
>>>> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <
>>>> danoliveira@google.com
>>>> >> <ma...@google.com>> wrote:
>>>> >>
>>>> >>      I just want to clarify that I understand this correctly since
>>>> I'm
>>>> >>      not that familiar with the details behind all these execution
>>>> >>      environments yet. Is the proposal to create a new
>>>> JobBundleFactory
>>>> >>      that instead of using Docker to create the environment that the
>>>> new
>>>> >>      processes will execute in, this JobBundleFactory would execute
>>>> the
>>>> >>      new processes directly in the host environment? So in practice
>>>> if I
>>>> >>      ran a pipeline with this JobBundleFactory the SDK Harness and
>>>> Runner
>>>> >>      Harness would both be executing directly on my machine and would
>>>> >>      depend on me having the dependencies already present on my
>>>> machine?
>>>> >>
>>>> >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <goenka@google.com
>>>> >>      <ma...@google.com>> wrote:
>>>> >>
>>>> >>          Thanks for starting the discussion. I will be happy to help.
>>>> >>          I agree, we should have pluggable SDKHarness environment
>>>> Factory.
>>>> >>          We can register multiple Environment factory using service
>>>> >>          registry and use the PipelineOption to pick the right one
>>>> on per
>>>> >>          job basis.
>>>> >>
>>>> >>          There are a couple of things which are require to setup
>>>> before
>>>> >>          launching the process.
>>>> >>
>>>> >>            * Setting up the environment as done in boot.go [4]
>>>> >>            * Retrieving and putting the artifacts in the right
>>>> location.
>>>> >>
>>>> >>          You can probably leverage boot.go code to setup the
>>>> environment.
>>>> >>
>>>> >>          Also, it will be useful to enumerate pros and cons of
>>>> different
>>>> >>          Environments to help users choose the right one.
>>>> >>
>>>> >>
>>>> >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <thw@apache.org
>>>> >>          <ma...@apache.org>> wrote:
>>>> >>
>>>> >>              Hi,
>>>> >>
>>>> >>              Currently the portable Flink runner only works with SDK
>>>> >>              Docker containers for execution (DockerJobBundleFactory,
>>>> >>              besides an in-process (embedded) factory option for
>>>> testing
>>>> >>              [1]). I'm considering adding another out of process
>>>> >>              JobBundleFactory implementation that directly forks the
>>>> >>              processes on the task manager host, eliminating the
>>>> need for
>>>> >>              Docker. This would work reasonably well in environments
>>>> >>              where the dependencies (in this case Python) can easily
>>>> be
>>>> >>              tied into the host deployment (also within an
>>>> application
>>>> >>              specific Kubernetes pod).
>>>> >>
>>>> >>              There was already some discussion about alternative
>>>> >>              JobBundleFactory implementation in [2]. There is also a
>>>> JIRA
>>>> >>              to make the bundle factory pluggable [3], pending
>>>> >>              availability of runner level options.
>>>> >>
>>>> >>              For a "ProcessBundleFactory", in addition to the Python
>>>> >>              dependencies the environment would also need to have
>>>> the Go
>>>> >>              boot executable [4] (or a substitute thereof) to
>>>> perform the
>>>> >>              harness initialization.
>>>> >>
>>>> >>              Is anyone else interested in this SDK execution option
>>>> or
>>>> >>              has already investigated an alternative implementation?
>>>> >>
>>>> >>              Thanks,
>>>> >>              Thomas
>>>> >>
>>>> >>              [1]
>>>> >>              https://github.com/apache/beam/blob/
>>>> 7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/
>>>> test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>>> >>
>>>> >>              [2]
>>>> >>              https://lists.apache.org/thread.html/
>>>> d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%
>>>> 3Cdev.beam.apache.org%3E
>>>> >>
>>>> >>              [3] https://issues.apache.org/jira/browse/BEAM-4819
>>>> >>
>>>> >>              [4] https://github.com/apache/
>>>> beam/blob/master/sdks/python/container/boot.go
>>>> >>
>>>>
>>>> --
>>>> Max
>>>>
>>>

Re: Process JobBundleFactory for portable runner

Posted by Henning Rohde <he...@google.com>.

By "enum" in quotes, I meant the usual open URN style pattern not an actual
enum. Sorry if that wasn't clear.

On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lc...@google.com> wrote:

> I would model the environment to be more free form then enums such that we
> have forward looking extensibility and would suggest to follow the same
> pattern we use on PTransforms (using an URN and a URN specific payload).
> Note that in this case we may want to support a list of supported
> environments (e.g. java, docker, python, ...).
>
> On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde <he...@google.com> wrote:
>
>> One thing to consider that we've talked about in the past. It might make
>> sense to extend the environment proto and have the SDK be explicit about
>> which kinds of environment it supports:
>>
>>
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>>
>> This choice might impact what files are staged or what not. Some SDKs,
>> such as Go, provide a compiled binary and _need_ to know what the target
>> architecture is. Running on a mac with and without docker, say, requires a
>> different worker in each case. If we add an "enum", we can also easily add
>> the external idea where the SDK/user starts the SDK harnesses instead of
>> the runner. Each runner may not support all types of environments.
>>
>> Henning
>>
>> On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels <mx...@apache.org>
>> wrote:
>>
>>> For reference, here is corresponding JIRA issue for this thread:
>>> https://issues.apache.org/jira/browse/BEAM-5187
>>>
>>> On 16.08.18 11:15, Maximilian Michels wrote:
>>> > Makes sense to have an option to run the SDK harness in a
>>> non-dockerized
>>> > environment.
>>> >
>>> > I'm in the process of creating a Docker entry point for Flink's
>>> > JobServer[1]. I suppose you would also prefer to execute that one
>>> > standalone. We should make sure this is also an option.
>>> >
>>> > [1] https://issues.apache.org/jira/browse/BEAM-4130
>>> >
>>> > On 16.08.18 07:42, Thomas Weise wrote:
>>> >> Yes, that's the proposal. Everything that would otherwise be packaged
>>> >> into the Docker container would need to be pre-installed in the host
>>> >> environment. In the case of Python SDK, that could simply mean a
>>> >> (frozen) virtual environment that was setup when the host was
>>> >> provisioned - the SDK harness process(es) will then just utilize that.
>>> >> Of course this flavor of SDK harness execution could also be useful in
>>> >> the local development environment, where right now someone who already
>>> >> has the Python environment needs to also install Docker and update a
>>> >> container to launch a Python SDK pipeline on the Flink runner.
>>> >>
>>> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <
>>> danoliveira@google.com
>>> >> <ma...@google.com>> wrote:
>>> >>
>>> >>      I just want to clarify that I understand this correctly since I'm
>>> >>      not that familiar with the details behind all these execution
>>> >>      environments yet. Is the proposal to create a new
>>> JobBundleFactory
>>> >>      that instead of using Docker to create the environment that the
>>> new
>>> >>      processes will execute in, this JobBundleFactory would execute
>>> the
>>> >>      new processes directly in the host environment? So in practice
>>> if I
>>> >>      ran a pipeline with this JobBundleFactory the SDK Harness and
>>> Runner
>>> >>      Harness would both be executing directly on my machine and would
>>> >>      depend on me having the dependencies already present on my
>>> machine?
>>> >>
>>> >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <goenka@google.com
>>> >>      <ma...@google.com>> wrote:
>>> >>
>>> >>          Thanks for starting the discussion. I will be happy to help.
>>> >>          I agree, we should have pluggable SDKHarness environment
>>> Factory.
>>> >>          We can register multiple Environment factory using service
>>> >>          registry and use the PipelineOption to pick the right one on
>>> per
>>> >>          job basis.
>>> >>
>>> >>          There are a couple of things which are require to setup
>>> before
>>> >>          launching the process.
>>> >>
>>> >>            * Setting up the environment as done in boot.go [4]
>>> >>            * Retrieving and putting the artifacts in the right
>>> location.
>>> >>
>>> >>          You can probably leverage boot.go code to setup the
>>> environment.
>>> >>
>>> >>          Also, it will be useful to enumerate pros and cons of
>>> different
>>> >>          Environments to help users choose the right one.
>>> >>
>>> >>
>>> >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <thw@apache.org
>>> >>          <ma...@apache.org>> wrote:
>>> >>
>>> >>              Hi,
>>> >>
>>> >>              Currently the portable Flink runner only works with SDK
>>> >>              Docker containers for execution (DockerJobBundleFactory,
>>> >>              besides an in-process (embedded) factory option for
>>> testing
>>> >>              [1]). I'm considering adding another out of process
>>> >>              JobBundleFactory implementation that directly forks the
>>> >>              processes on the task manager host, eliminating the need
>>> for
>>> >>              Docker. This would work reasonably well in environments
>>> >>              where the dependencies (in this case Python) can easily
>>> be
>>> >>              tied into the host deployment (also within an application
>>> >>              specific Kubernetes pod).
>>> >>
>>> >>              There was already some discussion about alternative
>>> >>              JobBundleFactory implementation in [2]. There is also a
>>> JIRA
>>> >>              to make the bundle factory pluggable [3], pending
>>> >>              availability of runner level options.
>>> >>
>>> >>              For a "ProcessBundleFactory", in addition to the Python
>>> >>              dependencies the environment would also need to have the
>>> Go
>>> >>              boot executable [4] (or a substitute thereof) to perform
>>> the
>>> >>              harness initialization.
>>> >>
>>> >>              Is anyone else interested in this SDK execution option or
>>> >>              has already investigated an alternative implementation?
>>> >>
>>> >>              Thanks,
>>> >>              Thomas
>>> >>
>>> >>              [1]
>>> >>
>>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>> >>
>>> >>              [2]
>>> >>
>>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>>> >>
>>> >>              [3] https://issues.apache.org/jira/browse/BEAM-4819
>>> >>
>>> >>              [4]
>>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>>> >>
>>>
>>> --
>>> Max
>>>
>>

Re: Process JobBundleFactory for portable runner

Posted by Lukasz Cwik <lc...@google.com>.

I would model the environment to be more free form then enums such that we
have forward looking extensibility and would suggest to follow the same
pattern we use on PTransforms (using an URN and a URN specific payload).
Note that in this case we may want to support a list of supported
environments (e.g. java, docker, python, ...).

On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde <he...@google.com> wrote:

> One thing to consider that we've talked about in the past. It might make
> sense to extend the environment proto and have the SDK be explicit about
> which kinds of environment it supports:
>
>
> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>
> This choice might impact what files are staged or what not. Some SDKs,
> such as Go, provide a compiled binary and _need_ to know what the target
> architecture is. Running on a mac with and without docker, say, requires a
> different worker in each case. If we add an "enum", we can also easily add
> the external idea where the SDK/user starts the SDK harnesses instead of
> the runner. Each runner may not support all types of environments.
>
> Henning
>
> On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels <mx...@apache.org> wrote:
>
>> For reference, here is corresponding JIRA issue for this thread:
>> https://issues.apache.org/jira/browse/BEAM-5187
>>
>> On 16.08.18 11:15, Maximilian Michels wrote:
>> > Makes sense to have an option to run the SDK harness in a non-dockerized
>> > environment.
>> >
>> > I'm in the process of creating a Docker entry point for Flink's
>> > JobServer[1]. I suppose you would also prefer to execute that one
>> > standalone. We should make sure this is also an option.
>> >
>> > [1] https://issues.apache.org/jira/browse/BEAM-4130
>> >
>> > On 16.08.18 07:42, Thomas Weise wrote:
>> >> Yes, that's the proposal. Everything that would otherwise be packaged
>> >> into the Docker container would need to be pre-installed in the host
>> >> environment. In the case of Python SDK, that could simply mean a
>> >> (frozen) virtual environment that was setup when the host was
>> >> provisioned - the SDK harness process(es) will then just utilize that.
>> >> Of course this flavor of SDK harness execution could also be useful in
>> >> the local development environment, where right now someone who already
>> >> has the Python environment needs to also install Docker and update a
>> >> container to launch a Python SDK pipeline on the Flink runner.
>> >>
>> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <
>> danoliveira@google.com
>> >> <ma...@google.com>> wrote:
>> >>
>> >>      I just want to clarify that I understand this correctly since I'm
>> >>      not that familiar with the details behind all these execution
>> >>      environments yet. Is the proposal to create a new JobBundleFactory
>> >>      that instead of using Docker to create the environment that the
>> new
>> >>      processes will execute in, this JobBundleFactory would execute the
>> >>      new processes directly in the host environment? So in practice if
>> I
>> >>      ran a pipeline with this JobBundleFactory the SDK Harness and
>> Runner
>> >>      Harness would both be executing directly on my machine and would
>> >>      depend on me having the dependencies already present on my
>> machine?
>> >>
>> >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <goenka@google.com
>> >>      <ma...@google.com>> wrote:
>> >>
>> >>          Thanks for starting the discussion. I will be happy to help.
>> >>          I agree, we should have pluggable SDKHarness environment
>> Factory.
>> >>          We can register multiple Environment factory using service
>> >>          registry and use the PipelineOption to pick the right one on
>> per
>> >>          job basis.
>> >>
>> >>          There are a couple of things which are require to setup before
>> >>          launching the process.
>> >>
>> >>            * Setting up the environment as done in boot.go [4]
>> >>            * Retrieving and putting the artifacts in the right
>> location.
>> >>
>> >>          You can probably leverage boot.go code to setup the
>> environment.
>> >>
>> >>          Also, it will be useful to enumerate pros and cons of
>> different
>> >>          Environments to help users choose the right one.
>> >>
>> >>
>> >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <thw@apache.org
>> >>          <ma...@apache.org>> wrote:
>> >>
>> >>              Hi,
>> >>
>> >>              Currently the portable Flink runner only works with SDK
>> >>              Docker containers for execution (DockerJobBundleFactory,
>> >>              besides an in-process (embedded) factory option for
>> testing
>> >>              [1]). I'm considering adding another out of process
>> >>              JobBundleFactory implementation that directly forks the
>> >>              processes on the task manager host, eliminating the need
>> for
>> >>              Docker. This would work reasonably well in environments
>> >>              where the dependencies (in this case Python) can easily be
>> >>              tied into the host deployment (also within an application
>> >>              specific Kubernetes pod).
>> >>
>> >>              There was already some discussion about alternative
>> >>              JobBundleFactory implementation in [2]. There is also a
>> JIRA
>> >>              to make the bundle factory pluggable [3], pending
>> >>              availability of runner level options.
>> >>
>> >>              For a "ProcessBundleFactory", in addition to the Python
>> >>              dependencies the environment would also need to have the
>> Go
>> >>              boot executable [4] (or a substitute thereof) to perform
>> the
>> >>              harness initialization.
>> >>
>> >>              Is anyone else interested in this SDK execution option or
>> >>              has already investigated an alternative implementation?
>> >>
>> >>              Thanks,
>> >>              Thomas
>> >>
>> >>              [1]
>> >>
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>> >>
>> >>              [2]
>> >>
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>> >>
>> >>              [3] https://issues.apache.org/jira/browse/BEAM-4819
>> >>
>> >>              [4]
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>> >>
>>
>> --
>> Max
>>
>

Re: Process JobBundleFactory for portable runner

Posted by Henning Rohde <he...@google.com>.

One thing to consider that we've talked about in the past. It might make
sense to extend the environment proto and have the SDK be explicit about
which kinds of environment it supports:


https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969

This choice might impact what files are staged or what not. Some SDKs, such
as Go, provide a compiled binary and _need_ to know what the target
architecture is. Running on a mac with and without docker, say, requires a
different worker in each case. If we add an "enum", we can also easily add
the external idea where the SDK/user starts the SDK harnesses instead of
the runner. Each runner may not support all types of environments.

Henning

On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels <mx...@apache.org> wrote:

> For reference, here is corresponding JIRA issue for this thread:
> https://issues.apache.org/jira/browse/BEAM-5187
>
> On 16.08.18 11:15, Maximilian Michels wrote:
> > Makes sense to have an option to run the SDK harness in a non-dockerized
> > environment.
> >
> > I'm in the process of creating a Docker entry point for Flink's
> > JobServer[1]. I suppose you would also prefer to execute that one
> > standalone. We should make sure this is also an option.
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-4130
> >
> > On 16.08.18 07:42, Thomas Weise wrote:
> >> Yes, that's the proposal. Everything that would otherwise be packaged
> >> into the Docker container would need to be pre-installed in the host
> >> environment. In the case of Python SDK, that could simply mean a
> >> (frozen) virtual environment that was setup when the host was
> >> provisioned - the SDK harness process(es) will then just utilize that.
> >> Of course this flavor of SDK harness execution could also be useful in
> >> the local development environment, where right now someone who already
> >> has the Python environment needs to also install Docker and update a
> >> container to launch a Python SDK pipeline on the Flink runner.
> >>
> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <
> danoliveira@google.com
> >> <ma...@google.com>> wrote:
> >>
> >>      I just want to clarify that I understand this correctly since I'm
> >>      not that familiar with the details behind all these execution
> >>      environments yet. Is the proposal to create a new JobBundleFactory
> >>      that instead of using Docker to create the environment that the new
> >>      processes will execute in, this JobBundleFactory would execute the
> >>      new processes directly in the host environment? So in practice if I
> >>      ran a pipeline with this JobBundleFactory the SDK Harness and
> Runner
> >>      Harness would both be executing directly on my machine and would
> >>      depend on me having the dependencies already present on my machine?
> >>
> >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <goenka@google.com
> >>      <ma...@google.com>> wrote:
> >>
> >>          Thanks for starting the discussion. I will be happy to help.
> >>          I agree, we should have pluggable SDKHarness environment
> Factory.
> >>          We can register multiple Environment factory using service
> >>          registry and use the PipelineOption to pick the right one on
> per
> >>          job basis.
> >>
> >>          There are a couple of things which are require to setup before
> >>          launching the process.
> >>
> >>            * Setting up the environment as done in boot.go [4]
> >>            * Retrieving and putting the artifacts in the right location.
> >>
> >>          You can probably leverage boot.go code to setup the
> environment.
> >>
> >>          Also, it will be useful to enumerate pros and cons of different
> >>          Environments to help users choose the right one.
> >>
> >>
> >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <thw@apache.org
> >>          <ma...@apache.org>> wrote:
> >>
> >>              Hi,
> >>
> >>              Currently the portable Flink runner only works with SDK
> >>              Docker containers for execution (DockerJobBundleFactory,
> >>              besides an in-process (embedded) factory option for testing
> >>              [1]). I'm considering adding another out of process
> >>              JobBundleFactory implementation that directly forks the
> >>              processes on the task manager host, eliminating the need
> for
> >>              Docker. This would work reasonably well in environments
> >>              where the dependencies (in this case Python) can easily be
> >>              tied into the host deployment (also within an application
> >>              specific Kubernetes pod).
> >>
> >>              There was already some discussion about alternative
> >>              JobBundleFactory implementation in [2]. There is also a
> JIRA
> >>              to make the bundle factory pluggable [3], pending
> >>              availability of runner level options.
> >>
> >>              For a "ProcessBundleFactory", in addition to the Python
> >>              dependencies the environment would also need to have the Go
> >>              boot executable [4] (or a substitute thereof) to perform
> the
> >>              harness initialization.
> >>
> >>              Is anyone else interested in this SDK execution option or
> >>              has already investigated an alternative implementation?
> >>
> >>              Thanks,
> >>              Thomas
> >>
> >>              [1]
> >>
> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
> >>
> >>              [2]
> >>
> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
> >>
> >>              [3] https://issues.apache.org/jira/browse/BEAM-4819
> >>
> >>              [4]
> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
> >>
>
> --
> Max
>

Re: Process JobBundleFactory for portable runner

Posted by Maximilian Michels <mx...@apache.org>.

For reference, here is corresponding JIRA issue for this thread: 
https://issues.apache.org/jira/browse/BEAM-5187

On 16.08.18 11:15, Maximilian Michels wrote:
> Makes sense to have an option to run the SDK harness in a non-dockerized
> environment.
> 
> I'm in the process of creating a Docker entry point for Flink's
> JobServer[1]. I suppose you would also prefer to execute that one
> standalone. We should make sure this is also an option.
> 
> [1] https://issues.apache.org/jira/browse/BEAM-4130
> 
> On 16.08.18 07:42, Thomas Weise wrote:
>> Yes, that's the proposal. Everything that would otherwise be packaged
>> into the Docker container would need to be pre-installed in the host
>> environment. In the case of Python SDK, that could simply mean a
>> (frozen) virtual environment that was setup when the host was
>> provisioned - the SDK harness process(es) will then just utilize that.
>> Of course this flavor of SDK harness execution could also be useful in
>> the local development environment, where right now someone who already
>> has the Python environment needs to also install Docker and update a
>> container to launch a Python SDK pipeline on the Flink runner.
>>
>> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <danoliveira@google.com
>> <ma...@google.com>> wrote:
>>
>>      I just want to clarify that I understand this correctly since I'm
>>      not that familiar with the details behind all these execution
>>      environments yet. Is the proposal to create a new JobBundleFactory
>>      that instead of using Docker to create the environment that the new
>>      processes will execute in, this JobBundleFactory would execute the
>>      new processes directly in the host environment? So in practice if I
>>      ran a pipeline with this JobBundleFactory the SDK Harness and Runner
>>      Harness would both be executing directly on my machine and would
>>      depend on me having the dependencies already present on my machine?
>>
>>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <goenka@google.com
>>      <ma...@google.com>> wrote:
>>
>>          Thanks for starting the discussion. I will be happy to help.
>>          I agree, we should have pluggable SDKHarness environment Factory.
>>          We can register multiple Environment factory using service
>>          registry and use the PipelineOption to pick the right one on per
>>          job basis.
>>
>>          There are a couple of things which are require to setup before
>>          launching the process.
>>
>>            * Setting up the environment as done in boot.go [4]
>>            * Retrieving and putting the artifacts in the right location.
>>
>>          You can probably leverage boot.go code to setup the environment.
>>
>>          Also, it will be useful to enumerate pros and cons of different
>>          Environments to help users choose the right one.
>>
>>
>>          On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <thw@apache.org
>>          <ma...@apache.org>> wrote:
>>
>>              Hi,
>>
>>              Currently the portable Flink runner only works with SDK
>>              Docker containers for execution (DockerJobBundleFactory,
>>              besides an in-process (embedded) factory option for testing
>>              [1]). I'm considering adding another out of process
>>              JobBundleFactory implementation that directly forks the
>>              processes on the task manager host, eliminating the need for
>>              Docker. This would work reasonably well in environments
>>              where the dependencies (in this case Python) can easily be
>>              tied into the host deployment (also within an application
>>              specific Kubernetes pod).
>>
>>              There was already some discussion about alternative
>>              JobBundleFactory implementation in [2]. There is also a JIRA
>>              to make the bundle factory pluggable [3], pending
>>              availability of runner level options.
>>
>>              For a "ProcessBundleFactory", in addition to the Python
>>              dependencies the environment would also need to have the Go
>>              boot executable [4] (or a substitute thereof) to perform the
>>              harness initialization.
>>
>>              Is anyone else interested in this SDK execution option or
>>              has already investigated an alternative implementation?
>>
>>              Thanks,
>>              Thomas
>>
>>              [1]
>>              https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>
>>              [2]
>>              https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>>
>>              [3] https://issues.apache.org/jira/browse/BEAM-4819
>>
>>              [4] https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>>

-- 
Max

Re: Process JobBundleFactory for portable runner

Posted by Maximilian Michels <mx...@apache.org>.

Makes sense to have an option to run the SDK harness in a non-dockerized
environment.

I'm in the process of creating a Docker entry point for Flink's
JobServer[1]. I suppose you would also prefer to execute that one
standalone. We should make sure this is also an option.

[1] https://issues.apache.org/jira/browse/BEAM-4130

On 16.08.18 07:42, Thomas Weise wrote:
> Yes, that's the proposal. Everything that would otherwise be packaged
> into the Docker container would need to be pre-installed in the host
> environment. In the case of Python SDK, that could simply mean a
> (frozen) virtual environment that was setup when the host was
> provisioned - the SDK harness process(es) will then just utilize that.
> Of course this flavor of SDK harness execution could also be useful in
> the local development environment, where right now someone who already
> has the Python environment needs to also install Docker and update a
> container to launch a Python SDK pipeline on the Flink runner.
> 
> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <danoliveira@google.com
> <ma...@google.com>> wrote:
> 
>     I just want to clarify that I understand this correctly since I'm
>     not that familiar with the details behind all these execution
>     environments yet. Is the proposal to create a new JobBundleFactory
>     that instead of using Docker to create the environment that the new
>     processes will execute in, this JobBundleFactory would execute the
>     new processes directly in the host environment? So in practice if I
>     ran a pipeline with this JobBundleFactory the SDK Harness and Runner
>     Harness would both be executing directly on my machine and would
>     depend on me having the dependencies already present on my machine?
> 
>     On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <goenka@google.com
>     <ma...@google.com>> wrote:
> 
>         Thanks for starting the discussion. I will be happy to help.
>         I agree, we should have pluggable SDKHarness environment Factory.
>         We can register multiple Environment factory using service
>         registry and use the PipelineOption to pick the right one on per
>         job basis.
> 
>         There are a couple of things which are require to setup before
>         launching the process.
> 
>           * Setting up the environment as done in boot.go [4]
>           * Retrieving and putting the artifacts in the right location.
> 
>         You can probably leverage boot.go code to setup the environment.
> 
>         Also, it will be useful to enumerate pros and cons of different
>         Environments to help users choose the right one.
> 
> 
>         On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <thw@apache.org
>         <ma...@apache.org>> wrote:
> 
>             Hi,
> 
>             Currently the portable Flink runner only works with SDK
>             Docker containers for execution (DockerJobBundleFactory,
>             besides an in-process (embedded) factory option for testing
>             [1]). I'm considering adding another out of process
>             JobBundleFactory implementation that directly forks the
>             processes on the task manager host, eliminating the need for
>             Docker. This would work reasonably well in environments
>             where the dependencies (in this case Python) can easily be
>             tied into the host deployment (also within an application
>             specific Kubernetes pod).
> 
>             There was already some discussion about alternative
>             JobBundleFactory implementation in [2]. There is also a JIRA
>             to make the bundle factory pluggable [3], pending
>             availability of runner level options.
> 
>             For a "ProcessBundleFactory", in addition to the Python
>             dependencies the environment would also need to have the Go
>             boot executable [4] (or a substitute thereof) to perform the
>             harness initialization.
> 
>             Is anyone else interested in this SDK execution option or
>             has already investigated an alternative implementation?
> 
>             Thanks,
>             Thomas
> 
>             [1]
>             https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
> 
>             [2]
>             https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
> 
>             [3] https://issues.apache.org/jira/browse/BEAM-4819
> 
>             [4] https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>

Re: Process JobBundleFactory for portable runner

Posted by Thomas Weise <th...@apache.org>.

Yes, that's the proposal. Everything that would otherwise be packaged into
the Docker container would need to be pre-installed in the host
environment. In the case of Python SDK, that could simply mean a (frozen)
virtual environment that was setup when the host was provisioned - the SDK
harness process(es) will then just utilize that. Of course this flavor of
SDK harness execution could also be useful in the local development
environment, where right now someone who already has the Python environment
needs to also install Docker and update a container to launch a Python SDK
pipeline on the Flink runner.

On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <da...@google.com>
wrote:

> I just want to clarify that I understand this correctly since I'm not that
> familiar with the details behind all these execution environments yet. Is
> the proposal to create a new JobBundleFactory that instead of using Docker
> to create the environment that the new processes will execute in, this
> JobBundleFactory would execute the new processes directly in the host
> environment? So in practice if I ran a pipeline with this JobBundleFactory
> the SDK Harness and Runner Harness would both be executing directly on my
> machine and would depend on me having the dependencies already present on
> my machine?
>
> On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <go...@google.com> wrote:
>
>> Thanks for starting the discussion. I will be happy to help.
>> I agree, we should have pluggable SDKHarness environment Factory.
>> We can register multiple Environment factory using service registry and
>> use the PipelineOption to pick the right one on per job basis.
>>
>> There are a couple of things which are require to setup before launching
>> the process.
>>
>>    - Setting up the environment as done in boot.go [4]
>>    - Retrieving and putting the artifacts in the right location.
>>
>> You can probably leverage boot.go code to setup the environment.
>>
>> Also, it will be useful to enumerate pros and cons of different
>> Environments to help users choose the right one.
>>
>>
>> On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <th...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> Currently the portable Flink runner only works with SDK Docker
>>> containers for execution (DockerJobBundleFactory, besides an in-process
>>> (embedded) factory option for testing [1]). I'm considering adding another
>>> out of process JobBundleFactory implementation that directly forks the
>>> processes on the task manager host, eliminating the need for Docker. This
>>> would work reasonably well in environments where the dependencies (in this
>>> case Python) can easily be tied into the host deployment (also within an
>>> application specific Kubernetes pod).
>>>
>>> There was already some discussion about alternative JobBundleFactory
>>> implementation in [2]. There is also a JIRA to make the bundle factory
>>> pluggable [3], pending availability of runner level options.
>>>
>>> For a "ProcessBundleFactory", in addition to the Python dependencies the
>>> environment would also need to have the Go boot executable [4] (or a
>>> substitute thereof) to perform the harness initialization.
>>>
>>> Is anyone else interested in this SDK execution option or has already
>>> investigated an alternative implementation?
>>>
>>> Thanks,
>>> Thomas
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>>
>>> [2]
>>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>>>
>>> [3] https://issues.apache.org/jira/browse/BEAM-4819
>>>
>>> [4]
>>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>>>
>>>

Re: Process JobBundleFactory for portable runner

Posted by Daniel Oliveira <da...@google.com>.

I just want to clarify that I understand this correctly since I'm not that
familiar with the details behind all these execution environments yet. Is
the proposal to create a new JobBundleFactory that instead of using Docker
to create the environment that the new processes will execute in, this
JobBundleFactory would execute the new processes directly in the host
environment? So in practice if I ran a pipeline with this JobBundleFactory
the SDK Harness and Runner Harness would both be executing directly on my
machine and would depend on me having the dependencies already present on
my machine?

On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <go...@google.com> wrote:

> Thanks for starting the discussion. I will be happy to help.
> I agree, we should have pluggable SDKHarness environment Factory.
> We can register multiple Environment factory using service registry and
> use the PipelineOption to pick the right one on per job basis.
>
> There are a couple of things which are require to setup before launching
> the process.
>
>    - Setting up the environment as done in boot.go [4]
>    - Retrieving and putting the artifacts in the right location.
>
> You can probably leverage boot.go code to setup the environment.
>
> Also, it will be useful to enumerate pros and cons of different
> Environments to help users choose the right one.
>
>
> On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <th...@apache.org> wrote:
>
>> Hi,
>>
>> Currently the portable Flink runner only works with SDK Docker containers
>> for execution (DockerJobBundleFactory, besides an in-process (embedded)
>> factory option for testing [1]). I'm considering adding another out of
>> process JobBundleFactory implementation that directly forks the processes
>> on the task manager host, eliminating the need for Docker. This would work
>> reasonably well in environments where the dependencies (in this case
>> Python) can easily be tied into the host deployment (also within an
>> application specific Kubernetes pod).
>>
>> There was already some discussion about alternative JobBundleFactory
>> implementation in [2]. There is also a JIRA to make the bundle factory
>> pluggable [3], pending availability of runner level options.
>>
>> For a "ProcessBundleFactory", in addition to the Python dependencies the
>> environment would also need to have the Go boot executable [4] (or a
>> substitute thereof) to perform the harness initialization.
>>
>> Is anyone else interested in this SDK execution option or has already
>> investigated an alternative implementation?
>>
>> Thanks,
>> Thomas
>>
>> [1]
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>
>> [2]
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>>
>> [3] https://issues.apache.org/jira/browse/BEAM-4819
>>
>> [4]
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>>
>>

Re: Process JobBundleFactory for portable runner

Posted by Ankur Goenka <go...@google.com>.

Thanks for starting the discussion. I will be happy to help.
I agree, we should have pluggable SDKHarness environment Factory.
We can register multiple Environment factory using service registry and use
the PipelineOption to pick the right one on per job basis.

There are a couple of things which are require to setup before launching
the process.

   - Setting up the environment as done in boot.go [4]
   - Retrieving and putting the artifacts in the right location.

You can probably leverage boot.go code to setup the environment.

Also, it will be useful to enumerate pros and cons of different
Environments to help users choose the right one.


On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <th...@apache.org> wrote:

> Hi,
>
> Currently the portable Flink runner only works with SDK Docker containers
> for execution (DockerJobBundleFactory, besides an in-process (embedded)
> factory option for testing [1]). I'm considering adding another out of
> process JobBundleFactory implementation that directly forks the processes
> on the task manager host, eliminating the need for Docker. This would work
> reasonably well in environments where the dependencies (in this case
> Python) can easily be tied into the host deployment (also within an
> application specific Kubernetes pod).
>
> There was already some discussion about alternative JobBundleFactory
> implementation in [2]. There is also a JIRA to make the bundle factory
> pluggable [3], pending availability of runner level options.
>
> For a "ProcessBundleFactory", in addition to the Python dependencies the
> environment would also need to have the Go boot executable [4] (or a
> substitute thereof) to perform the harness initialization.
>
> Is anyone else interested in this SDK execution option or has already
> investigated an alternative implementation?
>
> Thanks,
> Thomas
>
> [1]
> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>
> [2]
> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>
> [3] https://issues.apache.org/jira/browse/BEAM-4819
>
> [4]
> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>
>