Posted to dev@beam.apache.org by Kyle Weaver <kc...@google.com> on 2019/08/07 00:03:02 UTC

(mini-doc) Beam (Flink) portable job templates

Hi all,

Following up on discussion about portable Beam on Flink on Kubernetes [1],
I have drafted a short document on how I propose we bundle portable Beam
applications into jars that can be run on OSS runners, similar to Dataflow
templates (but without the actual template part, at least for the first
iteration). It's pretty straightforward, but I thought I would broadcast it
here in case anyone is interested.

https://docs.google.com/document/d/1kj_9JWxGWOmSGeZ5hbLVDXSTv-zBrx4kQRqOq85RYD4/edit#

[1]
https://lists.apache.org/thread.html/a12dd939c4af254694481796bc08b05bb1321cfaadd1a79cd3866584@%3Cdev.beam.apache.org%3E

Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Kyle Weaver <kc...@google.com>.
> For example, there could be access to a file system or other service to
fetch metadata that is required to build the pipeline.

That's a good point. It's totally up to users to decide how they want to
deploy. I just think the jar solution would provide a useful option for
many, but not all use cases.

> That requires the (matching) Java environment on the Python developer's
machine.

IIUC, this recent PR, which fetches the job server from Maven for the Python
SDK, should help with that: https://github.com/apache/beam/pull/9043
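
As a rough sketch of the developer experience that enables (the runner name
and option handling here are my assumptions, not the final API):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical usage: the SDK fetches the matching job server jar from
    # Maven behind the scenes, so the developer never builds or runs Java.
    options = PipelineOptions(['--runner=FlinkRunner'])
    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create([1, 2, 3]) | beam.Map(print)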

Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com


On Wed, Aug 7, 2019 at 8:59 AM Thomas Weise <th...@apache.org> wrote:

>
>>
>>
>> > * The pipeline construction code itself may need access to cluster
>> resources. In such cases the jar file cannot be created offline.
>>
>> Could you elaborate?
>>
>
> The entry point is arbitrary code written by the user, not limited to Beam
> pipeline construction alone. For example, there could be access to a file
> system or other service to fetch metadata that is required to build the
> pipeline. Such services can be accessed when the code runs within the
> infrastructure, but typically not in a development environment.
>
>
>> > * For k8s deployment, a container image with the SDK and application
>> code is required for the worker. The jar file (which is really a derived
>> artifact) would need to be built in addition to the container image.
>>
>> Yes. For standard use, a vanilla released Beam SDK container + staged
>> artifacts should be sufficient.
>>
>> > * To build such jar file, the user would need a build environment with
>> job server and application code. Do we want to make that assumption?
>>
>> Actually, it's probably much easier than that. A jar file is just a
>> zip file with a standard structure, to which one can easily add (data)
>> files without having a full build environment. The (pre-compiled) main
>> class would know how to read this data to construct the pipeline and
>> kick off the job just like any other Flink job.
>>
>
> Before assembling the jar, the job server runs to create the ingredients.
> That requires the (matching) Java environment on the Python developer's
> machine.
>
>
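
To make the quoted point concrete: a minimal sketch of adding a pipeline
proto to a pre-built jar from Python, with the entry name and layout being
assumptions rather than an agreed convention:

    import zipfile

    # Hypothetical layout: the jar's main class would look for this entry.
    PIPELINE_ENTRY = 'BEAM-PIPELINE/pipeline.pb'

    def augment_jar(jar_path, pipeline_proto_bytes):
        # A jar is just a zip file, so entries can be appended without a
        # Java build environment.
        with zipfile.ZipFile(jar_path, 'a') as jar:
            jar.writestr(PIPELINE_ENTRY, pipeline_proto_bytes)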

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Thomas Weise <th...@apache.org>.
I also added this as an option for pipeline submission to the k8s discussion:

https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.iov21d695qx5


On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:

> Hi Kyle,
>
> It might also be useful to have the option to just output the proto and
> artifacts, as alternative to the jar file.
>
> For the Flink entry point we would need to allow for the job server to be
> used as a library. It would probably not be too hard to have the Flink job
> constructed via the context execution environment, which would require no
> changes on the Flink side.
>
> Thanks,
> Thomas
>
>
> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com> wrote:
>
>> Re Javaless/serverless solution:
>> I take it this would probably mean that we would construct the jar
>> directly from the SDK. There are advantages to this: full separation of
>> Python and Java environments, no need for a job server, and likely a
>> simpler implementation, since we'd no longer have to work within the
>> constraints of the existing job server infrastructure. The only downside I
>> can think of is the additional cost of implementing/maintaining jar
>> creation code in each SDK, but that cost may be acceptable if it's simple
>> enough.
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>>
>>
>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org> wrote:
>>
>>>
>>>
>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> > Before assembling the jar, the job server runs to create the
>>>> ingredients. That requires the (matching) Java environment on the Python
>>>> developer's machine.
>>>>
>>>> We can run the job server and have it create the jar (and if we keep
>>>> the job server running we can use it to interact with the running
>>>> job). However, if the jar layout is simple enough, there's no need to
>>>> even build it from Java.
>>>>
>>>> Taken to the extreme, this is a one-shot, jar-based JobService API. We
>>>> choose a standard layout of where to put the pipeline description and
>>>> artifacts, and can "augment" an existing jar (that has a
>>>> runner-specific main class whose entry point knows how to read this
>>>> data to kick off a pipeline as if it were a user's driver code) into
>>>> one that has a portable pipeline packaged into it for submission to a
>>>> cluster.
>>>>
>>>
>>> It would be nice if the Python developer doesn't have to run anything
>>> Java at all.
>>>
>>> As we just discussed offline, this could be accomplished by including
>>> the proto that is produced by the SDK into the pre-existing jar.
>>>
>>> And if the jar has an entry point that creates the Flink job in the
>>> prescribed manner [1], it can be directly submitted to the Flink REST API.
>>> That would allow for a Java-free client.
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>
>>>
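
A sketch of what such a Java-free submission could look like from Python,
using Flink's existing REST endpoints (host, jar name, and entry class are
placeholders):

    import requests

    FLINK = 'http://jobmanager:8081'  # placeholder JobManager address

    # Upload the pre-built, proto-augmented jar, then run its entry point.
    resp = requests.post(FLINK + '/jars/upload',
                         files={'jarfile': open('beam-pipeline.jar', 'rb')})
    jar_id = resp.json()['filename'].split('/')[-1]
    requests.post(FLINK + '/jars/%s/run' % jar_id,
                  json={'entryClass': 'org.example.PortableEntryPoint'})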

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Chad Dombrova <ch...@gmail.com>.
Thanks for the follow-up, Thomas.

On Mon, Oct 28, 2019 at 7:55 PM Thomas Weise <th...@apache.org> wrote:

> Follow-up for users looking to run portable pipelines on Flink:
>
> After prototyping the generate-jar-file approach for internal deployment
> and some related discussion, the conclusion was that it is too limiting.
> The sticking point is that the jar file would need to be generated at
> container build time. That does not allow us to execute any logic in the
> Python driver program that depends on the deploy environment, such as
> retrieval of environment variables for configuration/credentials, setting a
> submission timestamp for stream positioning etc.
>
> What worked well was that no job server was required to submit the Flink
> job and the jar file could be used with the existing Flink tooling; there
> was no need to change the FlinkK8sOperator
> <https://github.com/lyft/flinkk8soperator> at all.
>
> I then looked for a way to eliminate the build time translation and
> execute the Python driver program when the job is submitted, but still as a
> Flink entry point w/o extra job server deployment and client side
> dependencies. How can that work?
>
> https://issues.apache.org/jira/browse/BEAM-8471
>
> The main point was that there should be no requirement to install things
> on the client. FlinkK8sOperator is talking to the Flink REST API, w/o
> Python or Java. The Python dependencies need to be present on the Flink job
> manager host at the time the job is started through the REST API. That was
> something we had already solved for our container image build, and from
> conversations with a few other folks this was their preferred container build
> approach also.
>
> In the future we may seek the ability to separate Flink and
> SDK/application bits into different images. For the SDK worker, this is
> intended via the external environment and sidecar container. For the client
> driver program, a similar approach could be implemented through an
> "external client environment" instead of a local process execution.
>
> The new Flink runner can be used as an entry point for the REST API, the
> Flink CLI or standalone, especially for Flink centric automation. Of course
> portable pipelines can also be directly submitted through the SDK language
> client, via job server or other tooling, like the Python Flink client that
> Robert contributed recently.
>
> Thanks,
> Thomas
>
>
> On Thu, Aug 22, 2019 at 12:58 PM Kyle Weaver <kc...@google.com> wrote:
>
>> Following up on discussion in this morning's OSS runners meeting, I have
>> uploaded a draft PR for the full implementation (job creation + execution):
>> https://github.com/apache/beam/pull/9408
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>>
>>
>> On Tue, Aug 20, 2019 at 1:24 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> The point of expansion services is to run at pipeline construction
>>> time so that the caller can build on top of the outputs. E.g. we're
>>> hoping to expose Beam's SQL transforms to other languages via an
>>> expansion service and *not* duplicate the logic of parsing the SQL
>>> statements to determine the type(s) of the outputs. Even for simpler
>>> IOs, we would like to take advantage of schema information (e.g.
>>> looked up at construction time) to produce results and validate (or
>>> even inform) subsequent construction.
>>>
>>> I think we're also making a mistake in talking about "the" expansion
>>> service here, as if there were only one well-defined service that all
>>> pipelines used. If we go the route of deferring some expansion to the
>>> runner, we need a way of naming expansion services. It seems like this
>>> proposal is simply isomorphic to defining new primitive transforms
>>> which some (all?) runners are just expected to understand.
>>>
>>> On Tue, Aug 20, 2019 at 10:11 AM Thomas Weise <th...@apache.org> wrote:
>>> >
>>> >
>>> >
>>> > On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik <lc...@google.com> wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay <al...@google.com> wrote:
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise <th...@apache.org>
>>> wrote:
>>> >>>>
>>> >>>> There is a PR open for this:
>>> https://github.com/apache/beam/pull/9331
>>> >>>>
>>> >>>> (it wasn't tagged with the JIRA and therefore not linked)
>>> >>>>
>>> >>>> I think it is worthwhile to explore how we could further detangle
>>> the client side Python and Java dependencies.
>>> >>>>
>>> >>>> The expansion service is one more dependency to consider in a build
>>> environment. Is it really necessary to expand external transforms prior to
>>> submission to the job service?
>>> >>>
>>> >>>
>>> >>> +1, this will make it easier to use external transforms from the
>>> already familiar client environments.
>>> >>>
>>> >>
>>> >>
>>> >> The intent is to make it so that you CAN (not MUST) run an expansion
>>> service separate from a Runner. Creating a single endpoint that hosts both
>>> the Job and Expansion service is something that gRPC does very easily since
>>> you can host multiple service definitions on a single port.
>>> >
>>> >
>>> > Yes, that's fine. The point here is when the expansion occurs. I
>>> believe the runner can also invoke the expansion service, thereby
>>> eliminating the expansion service interaction from the client side.
>>> >
>>> >
>>> >>
>>> >>
>>> >>>>
>>> >>>>
>>> >>>> Can we come up with a partially constructed proto that can be
>>> produced by just running the Python entry point? Note this would also
>>> require pushing the pipeline options parsing into the job service.
>>> >>>
>>> >>>
>>> >>> Why would this require pushing the pipeline options parsing to the
>>> job service? Presumably Python will have enough idea about the external
>>> transform to know what options it will need. The necessary bit could be converted
>>> to arguments and be part of that partially constructed proto.
>>> >>>
>>> >>>>
>>> >>>>
>>> >>>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <
>>> ecanzonieri@gmail.com> wrote:
>>> >>>>>
>>> >>>>> I found the tracking ticket at BEAM-7966
>>> >>>>>
>>> >>>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <
>>> ecanzonieri@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> Is this alternative still being considered? Creating a portable
>>> jar sounds like a good solution to re-use the existing runner specific
>>> deployment mechanism (e.g. Flink k8s operator) and in general simplify the
>>> deployment story.
>>> >>>>>>
>>> >>>>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >>>>>>>
>>> >>>>>>> The expansion service is a separate service. (The flink jar
>>> happens to
>>> >>>>>>> bring both up.) However, there is negotiation to
>>> receive/validate the
>>> >>>>>>> pipeline options.
>>> >>>>>>>
>>> >>>>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org>
>>> wrote:
>>> >>>>>>> >
>>> >>>>>>> > We would also need to consider cross-language pipelines that
>>> (currently) assume the interaction with an expansion service at
>>> construction time.
>>> >>>>>>> >
>>> >>>>>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com>
>>> wrote:
>>> >>>>>>> >>
>>> >>>>>>> >> > It might also be useful to have the option to just output
>>> the proto and artifacts, as alternative to the jar file.
>>> >>>>>>> >>
>>> >>>>>>> >> Sure, that wouldn't be too big a change if we were to decide
>>> to go the SDK route.
>>> >>>>>>> >>
>>> >>>>>>> >> > For the Flink entry point we would need to allow for the
>>> job server to be used as a library.
>>> >>>>>>> >>
>>> >>>>>>> >> We don't need the whole job server, we only need to add a
>>> main method to FlinkPipelineRunner [1] as the entry point, which would
>>> basically just do the setup described in the doc then call
>>> FlinkPipelineRunner::run.
>>> >>>>>>> >>
>>> >>>>>>> >> [1]
>>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>> >>>>>>> >>
>>> >>>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>> kcweaver@google.com
>>> >>>>>>> >>
>>> >>>>>>> >>
>>> >>>>>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org>
>>> wrote:
>>> >>>>>>> >>>
>>> >>>>>>> >>> Hi Kyle,
>>> >>>>>>> >>>
>>> >>>>>>> >>> It might also be useful to have the option to just output
>>> the proto and artifacts, as alternative to the jar file.
>>> >>>>>>> >>>
>>> >>>>>>> >>> For the Flink entry point we would need to allow for the job
>>> server to be used as a library. It would probably not be too hard to have
>>> the Flink job constructed via the context execution environment, which
>>> would require no changes on the Flink side.
>>> >>>>>>> >>>
>>> >>>>>>> >>> Thanks,
>>> >>>>>>> >>> Thomas
>>> >>>>>>> >>>
>>> >>>>>>> >>>
>>> >>>>>>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <
>>> kcweaver@google.com> wrote:
>>> >>>>>>> >>>>
>>> >>>>>>> >>>> Re Javaless/serverless solution:
>>> >>>>>>> >>>> I take it this would probably mean that we would construct
>>> the jar directly from the SDK. There are advantages to this: full
>>> separation of Python and Java environments, no need for a job server, and
>>> likely a simpler implementation, since we'd no longer have to work within
>>> the constraints of the existing job server infrastructure. The only
>>> downside I can think of is the additional cost of implementing/maintaining
>>> jar creation code in each SDK, but that cost may be acceptable if it's
>>> simple enough.
>>> >>>>>>> >>>>
>>> >>>>>>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>>> kcweaver@google.com
>>> >>>>>>> >>>>
>>> >>>>>>> >>>>
>>> >>>>>>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org>
>>> wrote:
>>> >>>>>>> >>>>>
>>> >>>>>>> >>>>>
>>> >>>>>>> >>>>>
>>> >>>>>>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >>>>>>> >>>>>>
>>> >>>>>>> >>>>>> > Before assembling the jar, the job server runs to
>>> create the ingredients. That requires the (matching) Java environment on
>>> the Python developer's machine.
>>> >>>>>>> >>>>>>
>>> >>>>>>> >>>>>> We can run the job server and have it create the jar (and
>>> if we keep
>>> >>>>>>> >>>>>> the job server running we can use it to interact with the
>>> running
>>> >>>>>>> >>>>>> job). However, if the jar layout is simple enough,
>>> there's no need to
>>> >>>>>>> >>>>>> even build it from Java.
>>> >>>>>>> >>>>>>
>>> >>>>>>> >>>>>> Taken to the extreme, this is a one-shot, jar-based
>>> JobService API. We
>>> >>>>>>> >>>>>> choose a standard layout of where to put the pipeline
>>> description and
>>> >>>>>>> >>>>>> artifacts, and can "augment" an existing jar (that has a
>>> >>>>>>> >>>>>> runner-specific main class whose entry point knows how to
>>> read this
>>> >>>>>> data to kick off a pipeline as if it were a user's driver
>>> code) into
>>> >>>>>>> >>>>>> one that has a portable pipeline packaged into it for
>>> submission to a
>>> >>>>>>> >>>>>> cluster.
>>> >>>>>>> >>>>>
>>> >>>>>>> >>>>>
>>> >>>>>>> >>>>> It would be nice if the Python developer doesn't have to
>>> run anything Java at all.
>>> >>>>>>> >>>>>
>>> >>>>>>> >>>>> As we just discussed offline, this could be accomplished
>>> by including the proto that is produced by the SDK into the pre-existing
>>> jar.
>>> >>>>>>> >>>>>
>>> >>>>>>> >>>>> And if the jar has an entry point that creates the Flink
>>> job in the prescribed manner [1], it can be directly submitted to the Flink
>>> REST API. That would allow for a Java-free client.
>>> >>>>>>> >>>>>
>>> >>>>>>> >>>>> [1]
>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>> >>>>>>> >>>>>
>>>
>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Thomas Weise <th...@apache.org>.
Follow-up for users looking to run portable pipelines on Flink:

After prototyping the generate-jar-file approach for internal deployment
and some related discussion, the conclusion was that it is too limiting.
The sticking point is that the jar file would need to be generated at
container build time. That does not allow us to execute any logic in the
Python driver program that depends on the deploy environment, such as
retrieval of environment variables for configuration/credentials, setting a
submission timestamp for stream positioning etc.
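
For example (names invented for illustration), the kind of logic that has to
run at submission time rather than at image build time:

    import os
    import time

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Resolved in the deploy environment when the driver runs, not when the
    # container image (or jar) is built:
    api_token = os.environ.get('PIPELINE_API_TOKEN', '')  # hypothetical credential
    submission_ts = int(time.time() * 1000)  # e.g. for stream positioning

    with beam.Pipeline(options=PipelineOptions()) as p:
        _ = p | beam.Create([submission_ts]) | beam.Map(print)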

What worked well was that no job server was required to submit the Flink
job and the jar file could be used with the existing Flink tooling; there
was no need to change the FlinkK8sOperator
<https://github.com/lyft/flinkk8soperator> at all.

I then looked for a way to eliminate the build time translation and execute
the Python driver program when the job is submitted, but still as a Flink
entry point w/o extra job server deployment and client side dependencies.
How can that work?

https://issues.apache.org/jira/browse/BEAM-8471

The main point was that there should be no requirement to install things on
the client. FlinkK8sOperator is talking to the Flink REST API, w/o Python
or Java. The Python dependencies need to be present on the Flink job
manager host at the time the job is started through the REST API. That was
something we had already solved for our container image build, and from
conversations with a few other folks this was their preferred container build
approach also.

In the future we may seek the ability to separate Flink and SDK/application
bits into different images. For the SDK worker, this is intended via the
external environment and sidecar container. For the client driver program,
a similar approach could be implemented through an "external client
environment" instead of a local process execution.

The new Flink runner can be used as an entry point for the REST API, the Flink
CLI or standalone, especially for Flink centric automation. Of course
portable pipelines can also be directly submitted through the SDK language
client, via job server or other tooling, like the Python Flink client that
Robert contributed recently.

Thanks,
Thomas


On Thu, Aug 22, 2019 at 12:58 PM Kyle Weaver <kc...@google.com> wrote:

> Following up on discussion in this morning's OSS runners meeting, I have
> uploaded a draft PR for the full implementation (job creation + execution):
> https://github.com/apache/beam/pull/9408
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>
>
> On Tue, Aug 20, 2019 at 1:24 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> The point of expansion services is to run at pipeline construction
>> time so that the caller can build on top of the outputs. E.g. we're
>> hoping to expose Beam's SQL transforms to other languages via an
>> expansion service and *not* duplicate the logic of parsing the SQL
>> statements to determine the type(s) of the outputs. Even for simpler
>> IOs, we would like to take advantage of schema information (e.g.
>> looked up at construction time) to produce results and validate (or
>> even inform) subsequent construction.
>>
>> I think we're also making a mistake in talking about "the" expansion
>> service here, as if there were only one well-defined service that all
>> pipelines used. If we go the route of deferring some expansion to the
>> runner, we need a way of naming expansion services. It seems like this
>> proposal is simply isomorphic to defining new primitive transforms
>> which some (all?) runners are just expected to understand.
>>
>> On Tue, Aug 20, 2019 at 10:11 AM Thomas Weise <th...@apache.org> wrote:
>> >
>> >
>> >
>> > On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik <lc...@google.com> wrote:
>> >>
>> >>
>> >>
>> >> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay <al...@google.com> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise <th...@apache.org> wrote:
>> >>>>
>> >>>> There is a PR open for this:
>> https://github.com/apache/beam/pull/9331
>> >>>>
>> >>>> (it wasn't tagged with the JIRA and therefore not linked)
>> >>>>
>> >>>> I think it is worthwhile to explore how we could further detangle
>> the client side Python and Java dependencies.
>> >>>>
>> >>>> The expansion service is one more dependency to consider in a build
>> environment. Is it really necessary to expand external transforms prior to
>> submission to the job service?
>> >>>
>> >>>
>> >>> +1, this will make it easier to use external transforms from the
>> already familiar client environments.
>> >>>
>> >>
>> >>
>> >> The intent is to make it so that you CAN (not MUST) run an expansion
>> service separate from a Runner. Creating a single endpoint that hosts both
>> the Job and Expansion service is something that gRPC does very easily since
>> you can host multiple service definitions on a single port.
>> >
>> >
>> > Yes, that's fine. The point here is when the expansion occurs. I
>> believe the runner can also invoke the expansion service, thereby
>> eliminating the expansion service interaction from the client side.
>> >
>> >
>> >>
>> >>
>> >>>>
>> >>>>
>> >>>> Can we come up with a partially constructed proto that can be
>> produced by just running the Python entry point? Note this would also
>> require pushing the pipeline options parsing into the job service.
>> >>>
>> >>>
>> >>> Why would this require pushing the pipeline options parsing to the
>> job service? Presumably Python will have enough idea about the external
>> transform to know what options it will need. The necessary bit could be converted
>> to arguments and be part of that partially constructed proto.
>> >>>
>> >>>>
>> >>>>
>> >>>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <
>> ecanzonieri@gmail.com> wrote:
>> >>>>>
>> >>>>> I found the tracking ticket at BEAM-7966
>> >>>>>
>> >>>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <
>> ecanzonieri@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Is this alternative still being considered? Creating a portable
>> jar sounds like a good solution to re-use the existing runner specific
>> deployment mechanism (e.g. Flink k8s operator) and in general simplify the
>> deployment story.
>> >>>>>>
>> >>>>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <
>> robertwb@google.com> wrote:
>> >>>>>>>
>> >>>>>>> The expansion service is a separate service. (The flink jar
>> happens to
>> >>>>>>> bring both up.) However, there is negotiation to receive/validate
>> the
>> >>>>>>> pipeline options.
>> >>>>>>>
>> >>>>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org>
>> wrote:
>> >>>>>>> >
>> >>>>>>> > We would also need to consider cross-language pipelines that
>> (currently) assume the interaction with an expansion service at
>> construction time.
>> >>>>>>> >
>> >>>>>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com>
>> wrote:
>> >>>>>>> >>
>> >>>>>>> >> > It might also be useful to have the option to just output
>> the proto and artifacts, as alternative to the jar file.
>> >>>>>>> >>
>> >>>>>>> >> Sure, that wouldn't be too big a change if we were to decide
>> to go the SDK route.
>> >>>>>>> >>
>> >>>>>>> >> > For the Flink entry point we would need to allow for the job
>> server to be used as a library.
>> >>>>>>> >>
>> >>>>>>> >> We don't need the whole job server, we only need to add a main
>> method to FlinkPipelineRunner [1] as the entry point, which would basically
>> just do the setup described in the doc then call FlinkPipelineRunner::run.
>> >>>>>>> >>
>> >>>>>>> >> [1]
>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>> >>>>>>> >>
>> >>>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcweaver@google.com
>> >>>>>>> >>
>> >>>>>>> >>
>> >>>>>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org>
>> wrote:
>> >>>>>>> >>>
>> >>>>>>> >>> Hi Kyle,
>> >>>>>>> >>>
>> >>>>>>> >>> It might also be useful to have the option to just output the
>> proto and artifacts, as alternative to the jar file.
>> >>>>>>> >>>
>> >>>>>>> >>> For the Flink entry point we would need to allow for the job
>> server to be used as a library. It would probably not be too hard to have
>> the Flink job constructed via the context execution environment, which
>> would require no changes on the Flink side.
>> >>>>>>> >>>
>> >>>>>>> >>> Thanks,
>> >>>>>>> >>> Thomas
>> >>>>>>> >>>
>> >>>>>>> >>>
>> >>>>>>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <
>> kcweaver@google.com> wrote:
>> >>>>>>> >>>>
>> >>>>>>> >>>> Re Javaless/serverless solution:
>> >>>>>>> >>>> I take it this would probably mean that we would construct
>> the jar directly from the SDK. There are advantages to this: full
>> separation of Python and Java environments, no need for a job server, and
>> likely a simpler implementation, since we'd no longer have to work within
>> the constraints of the existing job server infrastructure. The only
>> downside I can think of is the additional cost of implementing/maintaining
>> jar creation code in each SDK, but that cost may be acceptable if it's
>> simple enough.
>> >>>>>>> >>>>
>> >>>>>>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcweaver@google.com
>> >>>>>>> >>>>
>> >>>>>>> >>>>
>> >>>>>>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org>
>> wrote:
>> >>>>>>> >>>>>
>> >>>>>>> >>>>>
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <
>> robertwb@google.com> wrote:
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> > Before assembling the jar, the job server runs to create
>> the ingredients. That requires the (matching) Java environment on the
>> Python developer's machine.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> We can run the job server and have it create the jar (and
>> if we keep
>> >>>>>>> >>>>>> the job server running we can use it to interact with the
>> running
>> >>>>>>> >>>>>> job). However, if the jar layout is simple enough, there's
>> no need to
>> >>>>>>> >>>>>> even build it from Java.
>> >>>>>>> >>>>>>
>> >>>>>>> >>>>>> Taken to the extreme, this is a one-shot, jar-based
>> JobService API. We
>> >>>>>>> >>>>>> choose a standard layout of where to put the pipeline
>> description and
>> >>>>>>> >>>>>> artifacts, and can "augment" an existing jar (that has a
>> >>>>>>> >>>>>> runner-specific main class whose entry point knows how to
>> read this
>> >>>>>>> >>>>>> data to kick off a pipeline as if it were a user's driver
>> code) into
>> >>>>>>> >>>>>> one that has a portable pipeline packaged into it for
>> submission to a
>> >>>>>>> >>>>>> cluster.
>> >>>>>>> >>>>>
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> It would be nice if the Python developer doesn't have to
>> run anything Java at all.
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> As we just discussed offline, this could be accomplished
>> by including the proto that is produced by the SDK into the pre-existing
>> jar.
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> And if the jar has an entry point that creates the Flink
>> job in the prescribed manner [1], it can be directly submitted to the Flink
>> REST API. That would allow for a Java-free client.
>> >>>>>>> >>>>>
>> >>>>>>> >>>>> [1]
>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>> >>>>>>> >>>>>
>>
>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Kyle Weaver <kc...@google.com>.
Following up on discussion in this morning's OSS runners meeting, I have
uploaded a draft PR for the full implementation (job creation + execution):
https://github.com/apache/beam/pull/9408

Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com


On Tue, Aug 20, 2019 at 1:24 PM Robert Bradshaw <ro...@google.com> wrote:

> The point of expansion services is to run at pipeline construction
> time so that the caller can build on top of the outputs. E.g. we're
> hoping to expose Beam's SQL transforms to other languages via an
> expansion service and *not* duplicate the logic of parsing the SQL
> statements to determine the type(s) of the outputs. Even for simpler
> IOs, we would like to take advantage of schema information (e.g.
> looked up at construction time) to produce results and validate (or
> even inform) subsequent construction.
>
> I think we're also making a mistake in talking about "the" expansion
> service here, as if there were only one well-defined service that all
> pipelines used. If we go the route of deferring some expansion to the
> runner, we need a way of naming expansion services. It seems like this
> proposal is simply isomorphic to defining new primitive transforms
> which some (all?) runners are just expected to understand.
>
> On Tue, Aug 20, 2019 at 10:11 AM Thomas Weise <th...@apache.org> wrote:
> >
> >
> >
> > On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik <lc...@google.com> wrote:
> >>
> >>
> >>
> >> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay <al...@google.com> wrote:
> >>>
> >>>
> >>>
> >>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise <th...@apache.org> wrote:
> >>>>
> >>>> There is a PR open for this: https://github.com/apache/beam/pull/9331
> >>>>
> >>>> (it wasn't tagged with the JIRA and therefore not linked)
> >>>>
> >>>> I think it is worthwhile to explore how we could further detangle the
> client side Python and Java dependencies.
> >>>>
> >>>> The expansion service is one more dependency to consider in a build
> environment. Is it really necessary to expand external transforms prior to
> submission to the job service?
> >>>
> >>>
> >>> +1, this will make it easier to use external transforms from the
> already familiar client environments.
> >>>
> >>
> >>
> >> The intent is to make it so that you CAN (not MUST) run an expansion
> service separate from a Runner. Creating a single endpoint that hosts both
> the Job and Expansion service is something that gRPC does very easily since
> you can host multiple service definitions on a single port.
> >
> >
> > Yes, that's fine. The point here is when the expansion occurs. I believe
> the runner can also invoke the expansion service, thereby eliminating the
> expansion service interaction from the client side.
> >
> >
> >>
> >>
> >>>>
> >>>>
> >>>> Can we come up with a partially constructed proto that can be
> produced by just running the Python entry point? Note this would also
> require pushing the pipeline options parsing into the job service.
> >>>
> >>>
> >>> Why would this require pushing the pipeline options parsing to the job
> service? Presumably Python will have enough idea about the external
> transform to know what options it will need. The necessary bit could be converted
> to arguments and be part of that partially constructed proto.
> >>>
> >>>>
> >>>>
> >>>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <
> ecanzonieri@gmail.com> wrote:
> >>>>>
> >>>>> I found the tracking ticket at BEAM-7966
> >>>>>
> >>>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <
> ecanzonieri@gmail.com> wrote:
> >>>>>>
> >>>>>> Is this alternative still being considered? Creating a portable jar
> sounds like a good solution to re-use the existing runner specific
> deployment mechanism (e.g. Flink k8s operator) and in general simplify the
> deployment story.
> >>>>>>
> >>>>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >>>>>>>
> >>>>>>> The expansion service is a separate service. (The flink jar
> happens to
> >>>>>>> bring both up.) However, there is negotiation to receive/validate
> the
> >>>>>>> pipeline options.
> >>>>>>>
> >>>>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org>
> wrote:
> >>>>>>> >
> >>>>>>> > We would also need to consider cross-language pipelines that
> (currently) assume the interaction with an expansion service at
> construction time.
> >>>>>>> >
> >>>>>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com>
> wrote:
> >>>>>>> >>
> >>>>>>> >> > It might also be useful to have the option to just output the
> proto and artifacts, as alternative to the jar file.
> >>>>>>> >>
> >>>>>>> >> Sure, that wouldn't be too big a change if we were to decide to
> go the SDK route.
> >>>>>>> >>
> >>>>>>> >> > For the Flink entry point we would need to allow for the job
> server to be used as a library.
> >>>>>>> >>
> >>>>>>> >> We don't need the whole job server, we only need to add a main
> method to FlinkPipelineRunner [1] as the entry point, which would basically
> just do the setup described in the doc then call FlinkPipelineRunner::run.
> >>>>>>> >>
> >>>>>>> >> [1]
> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
> >>>>>>> >>
> >>>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcweaver@google.com
> >>>>>>> >>
> >>>>>>> >>
> >>>>>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org>
> wrote:
> >>>>>>> >>>
> >>>>>>> >>> Hi Kyle,
> >>>>>>> >>>
> >>>>>>> >>> It might also be useful to have the option to just output the
> proto and artifacts, as alternative to the jar file.
> >>>>>>> >>>
> >>>>>>> >>> For the Flink entry point we would need to allow for the job
> server to be used as a library. It would probably not be too hard to have
> the Flink job constructed via the context execution environment, which
> would require no changes on the Flink side.
> >>>>>>> >>>
> >>>>>>> >>> Thanks,
> >>>>>>> >>> Thomas
> >>>>>>> >>>
> >>>>>>> >>>
> >>>>>>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <
> kcweaver@google.com> wrote:
> >>>>>>> >>>>
> >>>>>>> >>>> Re Javaless/serverless solution:
> >>>>>>> >>>> I take it this would probably mean that we would construct
> the jar directly from the SDK. There are advantages to this: full
> separation of Python and Java environments, no need for a job server, and
> likely a simpler implementation, since we'd no longer have to work within
> the constraints of the existing job server infrastructure. The only
> downside I can think of is the additional cost of implementing/maintaining
> jar creation code in each SDK, but that cost may be acceptable if it's
> simple enough.
> >>>>>>> >>>>
> >>>>>>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcweaver@google.com
> >>>>>>> >>>>
> >>>>>>> >>>>
> >>>>>>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org>
> wrote:
> >>>>>>> >>>>>
> >>>>>>> >>>>>
> >>>>>>> >>>>>
> >>>>>>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >>>>>>> >>>>>>
> >>>>>>> >>>>>> > Before assembling the jar, the job server runs to create
> the ingredients. That requires the (matching) Java environment on the
> Python developer's machine.
> >>>>>>> >>>>>>
> >>>>>>> >>>>>> We can run the job server and have it create the jar (and
> if we keep
> >>>>>>> >>>>>> the job server running we can use it to interact with the
> running
> >>>>>>> >>>>>> job). However, if the jar layout is simple enough, there's
> no need to
> >>>>>>> >>>>>> even build it from Java.
> >>>>>>> >>>>>>
> >>>>>>> >>>>>> Taken to the extreme, this is a one-shot, jar-based
> JobService API. We
> >>>>>>> >>>>>> choose a standard layout of where to put the pipeline
> description and
> >>>>>>> >>>>>> artifacts, and can "augment" an existing jar (that has a
> >>>>>>> >>>>>> runner-specific main class whose entry point knows how to
> read this
> >>>>>>> >>>>>> data to kick off a pipeline as if it were a user's driver
> code) into
> >>>>>>> >>>>>> one that has a portable pipeline packaged into it for
> submission to a
> >>>>>>> >>>>>> cluster.
> >>>>>>> >>>>>
> >>>>>>> >>>>>
> >>>>>>> >>>>> It would be nice if the Python developer doesn't have to run
> anything Java at all.
> >>>>>>> >>>>>
> >>>>>>> >>>>> As we just discussed offline, this could be accomplished by
> including the proto that is produced by the SDK into the pre-existing jar.
> >>>>>>> >>>>>
> >>>>>>> >>>>> And if the jar has an entry point that creates the Flink job
> in the prescribed manner [1], it can be directly submitted to the Flink
> REST API. That would allow for a Java-free client.
> >>>>>>> >>>>>
> >>>>>>> >>>>> [1]
> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
> >>>>>>> >>>>>
>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Robert Bradshaw <ro...@google.com>.
The point of expansion services is to run at pipeline construction
time so that the caller can build on top of the outputs. E.g. we're
hoping to expose Beam's SQL transforms to other languages via an
expansion service and *not* duplicate the logic of parsing the SQL
statements to determine the type(s) of the outputs. Even for simpler
IOs, we would like to take advantage of schema information (e.g.
looked up at construction time) to produce results and validate (or
even inform) subsequent construction.

I think we're also making a mistake in talking about "the" expansion
service here, as if there were only one well-defined service that all
pipelines used. If we go the route of deferring some expansion to the
runner, we need a way of naming expansion services. It seems like this
proposal is simply isomorphic to defining new primitive transforms
which some (all?) runners are just expected to understand.
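
As one illustration of why construction-time expansion matters (the Python
SQL API shown here is a sketch of the direction, not a settled interface):

    import apache_beam as beam
    from apache_beam.transforms.sql import SqlTransform

    with beam.Pipeline() as p:
        rows = (p
                | beam.Create([beam.Row(word='hi', length=2)])
                | SqlTransform(
                    'SELECT word, length * 2 AS doubled FROM PCOLLECTION'))
        # The next step depends on the output schema reported by the
        # expansion service (the 'doubled' field), which is only known once
        # the SQL has been parsed at construction time.
        _ = rows | beam.Map(lambda row: row.doubled)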

On Tue, Aug 20, 2019 at 10:11 AM Thomas Weise <th...@apache.org> wrote:
>
>
>
> On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik <lc...@google.com> wrote:
>>
>>
>>
>> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>>
>>>
>>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise <th...@apache.org> wrote:
>>>>
>>>> There is a PR open for this: https://github.com/apache/beam/pull/9331
>>>>
>>>> (it wasn't tagged with the JIRA and therefore not linked)
>>>>
>>>> I think it is worthwhile to explore how we could further detangle the client side Python and Java dependencies.
>>>>
>>>> The expansion service is one more dependency to consider in a build environment. Is it really necessary to expand external transforms prior to submission to the job service?
>>>
>>>
>>> +1, this will make it easier to use external transforms from the already familiar client environments.
>>>
>>
>>
>> The intent is to make it so that you CAN (not MUST) run an expansion service separate from a Runner. Creating a single endpoint that hosts both the Job and Expansion service is something that gRPC does very easily since you can host multiple service definitions on a single port.
>
>
> Yes, that's fine. The point here is when the expansion occurs. I believe the runner can also invoke the expansion service, thereby eliminating the expansion service interaction from the client side.
>
>
>>
>>
>>>>
>>>>
>>>> Can we come up with a partially constructed proto that can be produced by just running the Python entry point? Note this would also require pushing the pipeline options parsing into the job service.
>>>
>>>
>>> Why would this require pushing the pipeline options parsing to the job service? Presumably Python will have enough idea about the external transform to know what options it will need. The necessary bit could be converted to arguments and be part of that partially constructed proto.
>>>
>>>>
>>>>
>>>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <ec...@gmail.com> wrote:
>>>>>
>>>>> I found the tracking ticket at BEAM-7966
>>>>>
>>>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <ec...@gmail.com> wrote:
>>>>>>
>>>>>> Is this alternative still being considered? Creating a portable jar sounds like a good solution to re-use the existing runner specific deployment mechanism (e.g. Flink k8s operator) and in general simplify the deployment story.
>>>>>>
>>>>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>>
>>>>>>> The expansion service is a separate service. (The flink jar happens to
>>>>>>> bring both up.) However, there is negotiation to receive/validate the
>>>>>>> pipeline options.
>>>>>>>
>>>>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org> wrote:
>>>>>>> >
>>>>>>> > We would also need to consider cross-language pipelines that (currently) assume the interaction with an expansion service at construction time.
>>>>>>> >
>>>>>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com> wrote:
>>>>>>> >>
>>>>>>> >> > It might also be useful to have the option to just output the proto and artifacts, as alternative to the jar file.
>>>>>>> >>
>>>>>>> >> Sure, that wouldn't be too big a change if we were to decide to go the SDK route.
>>>>>>> >>
>>>>>>> >> > For the Flink entry point we would need to allow for the job server to be used as a library.
>>>>>>> >>
>>>>>>> >> We don't need the whole job server, we only need to add a main method to FlinkPipelineRunner [1] as the entry point, which would basically just do the setup described in the doc then call FlinkPipelineRunner::run.
>>>>>>> >>
>>>>>>> >> [1] https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>>>>>> >>
>>>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:
>>>>>>> >>>
>>>>>>> >>> Hi Kyle,
>>>>>>> >>>
>>>>>>> >>> It might also be useful to have the option to just output the proto and artifacts, as alternative to the jar file.
>>>>>>> >>>
>>>>>>> >>> For the Flink entry point we would need to allow for the job server to be used as a library. It would probably not be too hard to have the Flink job constructed via the context execution environment, which would require no changes on the Flink side.
>>>>>>> >>>
>>>>>>> >>> Thanks,
>>>>>>> >>> Thomas
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com> wrote:
>>>>>>> >>>>
>>>>>>> >>>> Re Javaless/serverless solution:
>>>>>>> >>>> I take it this would probably mean that we would construct the jar directly from the SDK. There are advantages to this: full separation of Python and Java environments, no need for a job server, and likely a simpler implementation, since we'd no longer have to work within the constraints of the existing job server infrastructure. The only downside I can think of is the additional cost of implementing/maintaining jar creation code in each SDK, but that cost may be acceptable if it's simple enough.
>>>>>>> >>>>
>>>>>>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> > Before assembling the jar, the job server runs to create the ingredients. That requires the (matching) Java environment on the Python developer's machine.
>>>>>>> >>>>>>
>>>>>>> >>>>>> We can run the job server and have it create the jar (and if we keep
>>>>>>> >>>>>> the job server running we can use it to interact with the running
>>>>>>> >>>>>> job). However, if the jar layout is simple enough, there's no need to
>>>>>>> >>>>>> even build it from Java.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Taken to the extreme, this is a one-shot, jar-based JobService API. We
>>>>>>> >>>>>> choose a standard layout of where to put the pipeline description and
>>>>>>> >>>>>> artifacts, and can "augment" an existing jar (that has a
>>>>>>> >>>>>> runner-specific main class whose entry point knows how to read this
>>>>>>> >>>>>> data to kick off a pipeline as if it were a user's driver code) into
>>>>>>> >>>>>> one that has a portable pipeline packaged into it for submission to a
>>>>>>> >>>>>> cluster.
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> It would be nice if the Python developer doesn't have to run anything Java at all.
>>>>>>> >>>>>
>>>>>>> >>>>> As we just discussed offline, this could be accomplished by including the proto that is produced by the SDK into the pre-existing jar.
>>>>>>> >>>>>
>>>>>>> >>>>> And if the jar has an entry point that creates the Flink job in the prescribed manner [1], it can be directly submitted to the Flink REST API. That would allow for a Java-free client.
>>>>>>> >>>>>
>>>>>>> >>>>> [1] https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>>>>> >>>>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Thomas Weise <th...@apache.org>.
On Tue, Aug 20, 2019 at 8:56 AM Lukasz Cwik <lc...@google.com> wrote:

>
>
> On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay <al...@google.com> wrote:
>
>>
>>
>> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise <th...@apache.org> wrote:
>>
>>> There is a PR open for this: https://github.com/apache/beam/pull/9331
>>>
>>> (it wasn't tagged with the JIRA and therefore not linked)
>>>
>>> I think it is worthwhile to explore how we could further detangle the
>>> client side Python and Java dependencies.
>>>
>>> The expansion service is one more dependency to consider in a build
>>> environment. Is it really necessary to expand external transforms prior to
>>> submission to the job service?
>>>
>>
>> +1, this will make it easier to use external transforms from the already
>> familiar client environments.
>>
>>
>
> The intent is to make it so that you CAN (not MUST) run an expansion
> service separate from a Runner. Creating a single endpoint that hosts both
> the Job and Expansion service is something that gRPC does very easily since
> you can host multiple service definitions on a single port.
>

Yes, that's fine. The point here is when the expansion occurs. I believe
the runner can also invoke the expansion service, thereby eliminating the
expansion service interaction from the client side.



>
>
>>
>>> Can we come up with a partially constructed proto that can be produced
>>> by just running the Python entry point? Note this would also require
>>> pushing the pipeline options parsing into the job service.
>>>
>>
>> Why would this require pushing the pipeline options parsing to the job
>> service? Presumably Python will have enough idea about the external
>> transform to know what options it will need. The necessary bit could be converted
>> to arguments and be part of that partially constructed proto.
>>
>>
>>>
>>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <
>>> ecanzonieri@gmail.com> wrote:
>>>
>>>> I found the tracking ticket at BEAM-7966
>>>> <https://jira.apache.org/jira/browse/BEAM-7966>
>>>>
>>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <
>>>> ecanzonieri@gmail.com> wrote:
>>>>
>>>>> Is this alternative still being considered? Creating a portable jar
>>>>> sounds like a good solution to re-use the existing runner specific
>>>>> deployment mechanism (e.g. Flink k8s operator) and in general simplify the
>>>>> deployment story.
>>>>>
>>>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> The expansion service is a separate service. (The flink jar happens to
>>>>>> bring both up.) However, there is negotiation to receive/validate the
>>>>>> pipeline options.
>>>>>>
>>>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org> wrote:
>>>>>> >
>>>>>> > We would also need to consider cross-language pipelines that
>>>>>> (currently) assume the interaction with an expansion service at
>>>>>> construction time.
>>>>>> >
>>>>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> > It might also be useful to have the option to just output the
>>>>>> proto and artifacts, as alternative to the jar file.
>>>>>> >>
>>>>>> >> Sure, that wouldn't be too big a change if we were to decide to go
>>>>>> the SDK route.
>>>>>> >>
>>>>>> >> > For the Flink entry point we would need to allow for the job
>>>>>> server to be used as a library.
>>>>>> >>
>>>>>> >> We don't need the whole job server, we only need to add a main
>>>>>> method to FlinkPipelineRunner [1] as the entry point, which would basically
>>>>>> just do the setup described in the doc then call FlinkPipelineRunner::run.
>>>>>> >>
>>>>>> >> [1]
>>>>>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>>>>> >>
>>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>>>> kcweaver@google.com
>>>>>> >>
>>>>>> >>
>>>>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> Hi Kyle,
>>>>>> >>>
>>>>>> >>> It might also be useful to have the option to just output the
>>>>>> proto and artifacts, as alternative to the jar file.
>>>>>> >>>
>>>>>> >>> For the Flink entry point we would need to allow for the job
>>>>>> server to be used as a library. It would probably not be too hard to have
>>>>>> the Flink job constructed via the context execution environment, which
>>>>>> would require no changes on the Flink side.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> Thomas
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com>
>>>>>> wrote:
>>>>>> >>>>
>>>>>> >>>> Re Javaless/serverless solution:
>>>>>> >>>> I take it this would probably mean that we would construct the
>>>>>> jar directly from the SDK. There are advantages to this: full separation of
>>>>>> Python and Java environments, no need for a job server, and likely a
>>>>>> simpler implementation, since we'd no longer have to work within the
>>>>>> constraints of the existing job server infrastructure. The only downside I
>>>>>> can think of is the additional cost of implementing/maintaining jar
>>>>>> creation code in each SDK, but that cost may be acceptable if it's simple
>>>>>> enough.
>>>>>> >>>>
>>>>>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>>>> kcweaver@google.com
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org>
>>>>>> wrote:
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <
>>>>>> robertwb@google.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> > Before assembling the jar, the job server runs to create the
>>>>>> ingredients. That requires the (matching) Java environment on the Python
>>>>>> developer's machine.
>>>>>> >>>>>>
>>>>>> >>>>>> We can run the job server and have it create the jar (and if
>>>>>> we keep
>>>>>> >>>>>> the job server running we can use it to interact with the
>>>>>> running
>>>>>> >>>>>> job). However, if the jar layout is simple enough, there's no
>>>>>> need to
>>>>>> >>>>>> even build it from Java.
>>>>>> >>>>>>
>>>>>> >>>>>> Taken to the extreme, this is a one-shot, jar-based JobService
>>>>>> API. We
>>>>>> >>>>>> choose a standard layout of where to put the pipeline
>>>>>> description and
>>>>>> >>>>>> artifacts, and can "augment" an existing jar (that has a
>>>>>> >>>>>> runner-specific main class whose entry point knows how to read
>>>>>> this
>>>>>> >>>>>> data to kick off a pipeline as if it were a user's driver code)
>>>>>> into
>>>>>> >>>>>> one that has a portable pipeline packaged into it for
>>>>>> submission to a
>>>>>> >>>>>> cluster.
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> It would be nice if the Python developer doesn't have to run
>>>>>> anything Java at all.
>>>>>> >>>>>
>>>>>> >>>>> As we just discussed offline, this could be accomplished by
>>>>>> including the proto that is produced by the SDK into the pre-existing jar.
>>>>>> >>>>>
>>>>>> >>>>> And if the jar has an entry point that creates the Flink job in
>>>>>> the prescribed manner [1], it can be directly submitted to the Flink REST
>>>>>> API. That would allow for a Java-free client.
>>>>>> >>>>>
>>>>>> >>>>> [1]
>>>>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>>>> >>>>>
>>>>>>
>>>>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Lukasz Cwik <lc...@google.com>.
On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay <al...@google.com> wrote:

>
>
> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise <th...@apache.org> wrote:
>
>> There is a PR open for this: https://github.com/apache/beam/pull/9331
>>
>> (it wasn't tagged with the JIRA and therefore not linked)
>>
>> I think it is worthwhile to explore how we could further detangle the
>> client side Python and Java dependencies.
>>
>> The expansion service is one more dependency to consider in a build
>> environment. Is it really necessary to expand external transforms prior to
>> submission to the job service?
>>
>
> +1, this will make it easier to use external transforms from the already
> familiar client environments.
>
>

The intent is to make it so that you CAN (not MUST) run an expansion
service separate from a Runner. Creating a single endpoint that hosts both
the Job and Expansion service is something that gRPC does very easily since
you can host multiple service definitions on a single port.
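
For instance, a minimal sketch in Python (module paths follow Beam's
generated portability stubs; the no-op base servicers stand in for real
implementations):

    from concurrent import futures

    import grpc
    from apache_beam.portability.api import beam_expansion_api_pb2_grpc
    from apache_beam.portability.api import beam_job_api_pb2_grpc

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    # Both service definitions are registered on the same server and port.
    beam_job_api_pb2_grpc.add_JobServiceServicer_to_server(
        beam_job_api_pb2_grpc.JobServiceServicer(), server)
    beam_expansion_api_pb2_grpc.add_ExpansionServiceServicer_to_server(
        beam_expansion_api_pb2_grpc.ExpansionServiceServicer(), server)
    server.add_insecure_port('[::]:8099')
    server.start()
    server.wait_for_termination()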


>
>> Can we come up with a partially constructed proto that can be produced by
>> just running the Python entry point? Note this would also require pushing
>> the pipeline options parsing into the job service.
>>
>
> Why would this require pushing the pipeline options parsing to the job
> service? Presumably Python will have enough idea about the external
> transform to know what options it will need. The necessary bit could be converted
> to arguments and be part of that partially constructed proto.
>
>
>>
>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <ec...@gmail.com>
>> wrote:
>>
>>> I found the tracking ticket at BEAM-7966
>>> <https://jira.apache.org/jira/browse/BEAM-7966>
>>>
>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <
>>> ecanzonieri@gmail.com> wrote:
>>>
>>>> Is this alternative still being considered? Creating a portable jar
>>>> sounds like a good solution to re-use the existing runner specific
>>>> deployment mechanism (e.g. Flink k8s operator) and in general simplify the
>>>> deployment story.
>>>>
>>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> The expansion service is a separate service. (The flink jar happens to
>>>>> bring both up.) However, there is negotiation to receive/validate the
>>>>> pipeline options.
>>>>>
>>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org> wrote:
>>>>> >
>>>>> > We would also need to consider cross-language pipelines that
>>>>> (currently) assume the interaction with an expansion service at
>>>>> construction time.
>>>>> >
>>>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com>
>>>>> wrote:
>>>>> >>
>>>>> >> > It might also be useful to have the option to just output the
>>>>> proto and artifacts, as an alternative to the jar file.
>>>>> >>
>>>>> >> Sure, that wouldn't be too big a change if we were to decide to go
>>>>> the SDK route.
>>>>> >>
>>>>> >> > For the Flink entry point we would need to allow for the job
>>>>> server to be used as a library.
>>>>> >>
>>>>> >> We don't need the whole job server, we only need to add a main
>>>>> method to FlinkPipelineRunner [1] as the entry point, which would basically
>>>>> just do the setup described in the doc then call FlinkPipelineRunner::run.
>>>>> >>
>>>>> >> [1]
>>>>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>>>> >>
>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>>> kcweaver@google.com
>>>>> >>
>>>>> >>
>>>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:
>>>>> >>>
>>>>> >>> Hi Kyle,
>>>>> >>>
>>>>> >>> It might also be useful to have the option to just output the
>>>>> proto and artifacts, as an alternative to the jar file.
>>>>> >>>
>>>>> >>> For the Flink entry point we would need to allow for the job
>>>>> server to be used as a library. It would probably not be too hard to have
>>>>> the Flink job constructed via the context execution environment, which
>>>>> would require no changes on the Flink side.
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> Thomas
>>>>> >>>
>>>>> >>>
>>>>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Re Javaless/serverless solution:
>>>>> >>>> I take it this would probably mean that we would construct the
>>>>> jar directly from the SDK. There are advantages to this: full separation of
>>>>> Python and Java environments, no need for a job server, and likely a
>>>>> simpler implementation, since we'd no longer have to work within the
>>>>> constraints of the existing job server infrastructure. The only downside I
>>>>> can think of is the additional cost of implementing/maintaining jar
>>>>> creation code in each SDK, but that cost may be acceptable if it's simple
>>>>> enough.
>>>>> >>>>
>>>>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>>> kcweaver@google.com
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org>
>>>>> wrote:
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <
>>>>> robertwb@google.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>> > Before assembling the jar, the job server runs to create the
>>>>> ingredients. That requires the (matching) Java environment on the Python
>>>>> developer's machine.
>>>>> >>>>>>
>>>>> >>>>>> We can run the job server and have it create the jar (and if we
>>>>> keep
>>>>> >>>>>> the job server running we can use it to interact with the
>>>>> running
>>>>> >>>>>> job). However, if the jar layout is simple enough, there's no
>>>>> need to
>>>>> >>>>>> even build it from Java.
>>>>> >>>>>>
>>>>> >>>>>> Taken to the extreme, this is a one-shot, jar-based JobService
>>>>> API. We
>>>>> >>>>>> choose a standard layout of where to put the pipeline
>>>>> description and
>>>>> >>>>>> artifacts, and can "augment" an existing jar (that has a
>>>>> >>>>>> runner-specific main class whose entry point knows how to read
>>>>> this
>>>>> >>>>>> data to kick off a pipeline as if it were a user's driver code)
>>>>> into
>>>>> >>>>>> one that has a portable pipeline packaged into it for
>>>>> submission to a
>>>>> >>>>>> cluster.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> It would be nice if the Python developer doesn't have to run
>>>>> anything Java at all.
>>>>> >>>>>
>>>>> >>>>> As we just discussed offline, this could be accomplished by
>>>>> including the proto that is produced by the SDK into the pre-existing jar.
>>>>> >>>>>
>>>>> >>>>> And if the jar has an entry point that creates the Flink job in
>>>>> the prescribed manner [1], it can be directly submitted to the Flink REST
>>>>> >>>>> API. That would allow for a Java-free client.
>>>>> >>>>>
>>>>> >>>>> [1]
>>>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>>> >>>>>
>>>>>
>>>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Thomas Weise <th...@gmail.com>.
On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay <al...@google.com> wrote:

>
> Can we come up with a partially constructed proto that can be produced by
>> just running the Python entry point? Note this would also require pushing
>> the pipeline options parsing into the job service.
>>
>
> Why would this require pushing the pipeline options parsing to the job
> service? Assuming that Python has enough of an idea of what options the
> external transform will need, the necessary bits could be converted
> to arguments and be part of that partially constructed proto.
>
>

The pipeline option discovery is separate from external transforms. The
client gets the supported options from the job service so it can validate
and submit the pipeline with the correct options.

But this can also be solved by sending "unknown options" as part of the
partially constructed proto and having the job service take care of
validation and conversion.



>
>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <ec...@gmail.com>
>> wrote:
>>
>>> I found the tracking ticket at BEAM-7966
>>> <https://jira.apache.org/jira/browse/BEAM-7966>
>>>
>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <
>>> ecanzonieri@gmail.com> wrote:
>>>
>>>> Is this alternative still being considered? Creating a portable jar
>>>> sounds like a good solution to re-use the existing runner-specific
>>>> deployment mechanism (e.g. Flink k8s operator) and in general simplify the
>>>> deployment story.
>>>>
>>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> The expansion service is a separate service. (The Flink jar happens to
>>>>> bring both up.) However, there is negotiation to receive/validate the
>>>>> pipeline options.
>>>>>
>>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org> wrote:
>>>>> >
>>>>> > We would also need to consider cross-language pipelines that
>>>>> (currently) assume the interaction with an expansion service at
>>>>> construction time.
>>>>> >
>>>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com>
>>>>> wrote:
>>>>> >>
>>>>> >> > It might also be useful to have the option to just output the
>>>>> proto and artifacts, as an alternative to the jar file.
>>>>> >>
>>>>> >> Sure, that wouldn't be too big a change if we were to decide to go
>>>>> the SDK route.
>>>>> >>
>>>>> >> > For the Flink entry point we would need to allow for the job
>>>>> server to be used as a library.
>>>>> >>
>>>>> >> We don't need the whole job server, we only need to add a main
>>>>> method to FlinkPipelineRunner [1] as the entry point, which would basically
>>>>> just do the setup described in the doc then call FlinkPipelineRunner::run.
>>>>> >>
>>>>> >> [1]
>>>>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>>>> >>
>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>>> kcweaver@google.com
>>>>> >>
>>>>> >>
>>>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:
>>>>> >>>
>>>>> >>> Hi Kyle,
>>>>> >>>
>>>>> >>> It might also be useful to have the option to just output the
>>>>> proto and artifacts, as an alternative to the jar file.
>>>>> >>>
>>>>> >>> For the Flink entry point we would need to allow for the job
>>>>> server to be used as a library. It would probably not be too hard to have
>>>>> the Flink job constructed via the context execution environment, which
>>>>> would require no changes on the Flink side.
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> Thomas
>>>>> >>>
>>>>> >>>
>>>>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Re Javaless/serverless solution:
>>>>> >>>> I take it this would probably mean that we would construct the
>>>>> jar directly from the SDK. There are advantages to this: full separation of
>>>>> Python and Java environments, no need for a job server, and likely a
>>>>> simpler implementation, since we'd no longer have to work within the
>>>>> constraints of the existing job server infrastructure. The only downside I
>>>>> can think of is the additional cost of implementing/maintaining jar
>>>>> creation code in each SDK, but that cost may be acceptable if it's simple
>>>>> enough.
>>>>> >>>>
>>>>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>>> kcweaver@google.com
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org>
>>>>> wrote:
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <
>>>>> robertwb@google.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>> > Before assembling the jar, the job server runs to create the
>>>>> ingredients. That requires the (matching) Java environment on the Python
>>>>> developer's machine.
>>>>> >>>>>>
>>>>> >>>>>> We can run the job server and have it create the jar (and if we
>>>>> keep
>>>>> >>>>>> the job server running we can use it to interact with the
>>>>> running
>>>>> >>>>>> job). However, if the jar layout is simple enough, there's no
>>>>> need to
>>>>> >>>>>> even build it from Java.
>>>>> >>>>>>
>>>>> >>>>>> Taken to the extreme, this is a one-shot, jar-based JobService
>>>>> API. We
>>>>> >>>>>> choose a standard layout of where to put the pipeline
>>>>> description and
>>>>> >>>>>> artifacts, and can "augment" an existing jar (that has a
>>>>> >>>>>> runner-specific main class whose entry point knows how to read
>>>>> this
>>>>> >>>>>> data to kick off a pipeline as if it were a user's driver code)
>>>>> into
>>>>> >>>>>> one that has a portable pipeline packaged into it for
>>>>> submission to a
>>>>> >>>>>> cluster.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> It would be nice if the Python developer doesn't have to run
>>>>> anything Java at all.
>>>>> >>>>>
>>>>> >>>>> As we just discussed offline, this could be accomplished by
>>>>> including the proto that is produced by the SDK into the pre-existing jar.
>>>>> >>>>>
>>>>> >>>>> And if the jar has an entry point that creates the Flink job in
>>>>> the prescribed manner [1], it can be directly submitted to the Flink REST
>>>>> >>>>> API. That would allow for a Java-free client.
>>>>> >>>>>
>>>>> >>>>> [1]
>>>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>>> >>>>>
>>>>>
>>>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Ahmet Altay <al...@google.com>.
On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise <th...@apache.org> wrote:

> There is a PR open for this: https://github.com/apache/beam/pull/9331
>
> (it wasn't tagged with the JIRA and therefore not linked)
>
> I think it is worthwhile to explore how we could further detangle the
> client-side Python and Java dependencies.
>
> The expansion service is one more dependency to consider in a build
> environment. Is it really necessary to expand external transforms prior to
> submission to the job service?
>

+1, this will make it easier to use external transforms from the already
familiar client environments.


>
> Can we come up with a partially constructed proto that can be produced by
> just running the Python entry point? Note this would also require pushing
> the pipeline options parsing into the job service.
>

Why would this require pushing the pipeline options parsing to the job
service? Assuming that Python has enough of an idea of what options the
external transform will need, the necessary bits could be converted
to arguments and be part of that partially constructed proto.


>
> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <ec...@gmail.com>
> wrote:
>
>> I found the tracking ticket at BEAM-7966
>> <https://jira.apache.org/jira/browse/BEAM-7966>
>>
>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <ec...@gmail.com>
>> wrote:
>>
>>> Is this alternative still being considered? Creating a portable jar
>>> sounds like a good solution to re-use the existing runner-specific
>>> deployment mechanism (e.g. Flink k8s operator) and in general simplify the
>>> deployment story.
>>>
>>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> The expansion service is a separate service. (The Flink jar happens to
>>>> bring both up.) However, there is negotiation to receive/validate the
>>>> pipeline options.
>>>>
>>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org> wrote:
>>>> >
>>>> > We would also need to consider cross-language pipelines that
>>>> (currently) assume the interaction with an expansion service at
>>>> construction time.
>>>> >
>>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com> wrote:
>>>> >>
>>>> >> > It might also be useful to have the option to just output the
>>>> proto and artifacts, as an alternative to the jar file.
>>>> >>
>>>> >> Sure, that wouldn't be too big a change if we were to decide to go
>>>> the SDK route.
>>>> >>
>>>> >> > For the Flink entry point we would need to allow for the job
>>>> server to be used as a library.
>>>> >>
>>>> >> We don't need the whole job server, we only need to add a main
>>>> method to FlinkPipelineRunner [1] as the entry point, which would basically
>>>> just do the setup described in the doc then call FlinkPipelineRunner::run.
>>>> >>
>>>> >> [1]
>>>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>>> >>
>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>> kcweaver@google.com
>>>> >>
>>>> >>
>>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:
>>>> >>>
>>>> >>> Hi Kyle,
>>>> >>>
>>>> >>> It might also be useful to have the option to just output the proto
>>>> and artifacts, as an alternative to the jar file.
>>>> >>>
>>>> >>> For the Flink entry point we would need to allow for the job server
>>>> to be used as a library. It would probably not be too hard to have the
>>>> Flink job constructed via the context execution environment, which would
>>>> require no changes on the Flink side.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Thomas
>>>> >>>
>>>> >>>
>>>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> Re Javaless/serverless solution:
>>>> >>>> I take it this would probably mean that we would construct the jar
>>>> directly from the SDK. There are advantages to this: full separation of
>>>> Python and Java environments, no need for a job server, and likely a
>>>> simpler implementation, since we'd no longer have to work within the
>>>> constraints of the existing job server infrastructure. The only downside I
>>>> can think of is the additional cost of implementing/maintaining jar
>>>> creation code in each SDK, but that cost may be acceptable if it's simple
>>>> enough.
>>>> >>>>
>>>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>> kcweaver@google.com
>>>> >>>>
>>>> >>>>
>>>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org>
>>>> wrote:
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <
>>>> robertwb@google.com> wrote:
>>>> >>>>>>
>>>> >>>>>> > Before assembling the jar, the job server runs to create the
>>>> ingredients. That requires the (matching) Java environment on the Python
>>>> developer's machine.
>>>> >>>>>>
>>>> >>>>>> We can run the job server and have it create the jar (and if we
>>>> keep
>>>> >>>>>> the job server running we can use it to interact with the running
>>>> >>>>>> job). However, if the jar layout is simple enough, there's no
>>>> need to
>>>> >>>>>> even build it from Java.
>>>> >>>>>>
>>>> >>>>>> Taken to the extreme, this is a one-shot, jar-based JobService
>>>> API. We
>>>> >>>>>> choose a standard layout of where to put the pipeline
>>>> description and
>>>> >>>>>> artifacts, and can "augment" an existing jar (that has a
>>>> >>>>>> runner-specific main class whose entry point knows how to read
>>>> this
>>>> >>>>>> data to kick off a pipeline as if it were a user's driver code)
>>>> into
>>>> >>>>>> one that has a portable pipeline packaged into it for submission
>>>> to a
>>>> >>>>>> cluster.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> It would be nice if the Python developer doesn't have to run
>>>> anything Java at all.
>>>> >>>>>
>>>> >>>>> As we just discussed offline, this could be accomplished by
>>>> including the proto that is produced by the SDK into the pre-existing jar.
>>>> >>>>>
>>>> >>>>> And if the jar has an entry point that creates the Flink job in
>>>> the prescribed manner [1], it can be directly submitted to the Flink REST
>>>> API. That would allow for a Java-free client.
>>>> >>>>>
>>>> >>>>> [1]
>>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>> >>>>>
>>>>
>>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Thomas Weise <th...@apache.org>.
There is a PR open for this: https://github.com/apache/beam/pull/9331

(it wasn't tagged with the JIRA and therefore not linked)

I think it is worthwhile to explore how we could further detangle the
client-side Python and Java dependencies.

The expansion service is one more dependency to consider in a build
environment. Is it really necessary to expand external transforms prior to
submission to the job service?

Can we come up with a partially constructed proto that can be produced by
just running the Python entry point? Note this would also require pushing
the pipeline options parsing into the job service.

On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri <ec...@gmail.com>
wrote:

> I found the tracking ticket at BEAM-7966
> <https://jira.apache.org/jira/browse/BEAM-7966>
>
> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <ec...@gmail.com>
> wrote:
>
>> Is this alternative still being considered? Creating a portable jar
>> sounds like a good solution to re-use the existing runner-specific
>> deployment mechanism (e.g. Flink k8s operator) and in general simplify the
>> deployment story.
>>
>> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> The expansion service is a separate service. (The Flink jar happens to
>>> bring both up.) However, there is negotiation to receive/validate the
>>> pipeline options.
>>>
>>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org> wrote:
>>> >
>>> > We would also need to consider cross-language pipelines that
>>> (currently) assume the interaction with an expansion service at
>>> construction time.
>>> >
>>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com> wrote:
>>> >>
>>> >> > It might also be useful to have the option to just output the proto
>>> and artifacts, as an alternative to the jar file.
>>> >>
>>> >> Sure, that wouldn't be too big a change if we were to decide to go
>>> the SDK route.
>>> >>
>>> >> > For the Flink entry point we would need to allow for the job server
>>> to be used as a library.
>>> >>
>>> >> We don't need the whole job server, we only need to add a main method
>>> to FlinkPipelineRunner [1] as the entry point, which would basically just
>>> do the setup described in the doc then call FlinkPipelineRunner::run.
>>> >>
>>> >> [1]
>>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>> >>
>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>> kcweaver@google.com
>>> >>
>>> >>
>>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:
>>> >>>
>>> >>> Hi Kyle,
>>> >>>
>>> >>> It might also be useful to have the option to just output the proto
>>> and artifacts, as an alternative to the jar file.
>>> >>>
>>> >>> For the Flink entry point we would need to allow for the job server
>>> to be used as a library. It would probably not be too hard to have the
>>> Flink job constructed via the context execution environment, which would
>>> require no changes on the Flink side.
>>> >>>
>>> >>> Thanks,
>>> >>> Thomas
>>> >>>
>>> >>>
>>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com>
>>> wrote:
>>> >>>>
>>> >>>> Re Javaless/serverless solution:
>>> >>>> I take it this would probably mean that we would construct the jar
>>> directly from the SDK. There are advantages to this: full separation of
>>> Python and Java environments, no need for a job server, and likely a
>>> simpler implementation, since we'd no longer have to work within the
>>> constraints of the existing job server infrastructure. The only downside I
>>> can think of is the additional cost of implementing/maintaining jar
>>> creation code in each SDK, but that cost may be acceptable if it's simple
>>> enough.
>>> >>>>
>>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>>> kcweaver@google.com
>>> >>>>
>>> >>>>
>>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org> wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <
>>> robertwb@google.com> wrote:
>>> >>>>>>
>>> >>>>>> > Before assembling the jar, the job server runs to create the
>>> ingredients. That requires the (matching) Java environment on the Python
>>> developer's machine.
>>> >>>>>>
>>> >>>>>> We can run the job server and have it create the jar (and if we
>>> keep
>>> >>>>>> the job server running we can use it to interact with the running
>>> >>>>>> job). However, if the jar layout is simple enough, there's no
>>> need to
>>> >>>>>> even build it from Java.
>>> >>>>>>
>>> >>>>>> Taken to the extreme, this is a one-shot, jar-based JobService
>>> API. We
>>> >>>>>> choose a standard layout of where to put the pipeline description
>>> and
>>> >>>>>> artifacts, and can "augment" an existing jar (that has a
>>> >>>>>> runner-specific main class whose entry point knows how to read
>>> this
>>> >>>>>> data to kick off a pipeline as if it were a user's driver code)
>>> into
>>> >>>>>> one that has a portable pipeline packaged into it for submission
>>> to a
>>> >>>>>> cluster.
>>> >>>>>
>>> >>>>>
>>> >>>>> It would be nice if the Python developer doesn't have to run
>>> anything Java at all.
>>> >>>>>
>>> >>>>> As we just discussed offline, this could be accomplished by
>>> including the proto that is produced by the SDK into the pre-existing jar.
>>> >>>>>
>>> >>>>> And if the jar has an entry point that creates the Flink job in
>>> the prescribed manner [1], it can be directly submitted to the Flink REST
>>> API. That would allow for a Java-free client.
>>> >>>>>
>>> >>>>> [1]
>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>> >>>>>
>>>
>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by enrico canzonieri <ec...@gmail.com>.
I found the tracking ticket at BEAM-7966
<https://jira.apache.org/jira/browse/BEAM-7966>

On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <ec...@gmail.com>
wrote:

> Is this alternative still being considered? Creating a portable jar sounds
> like a good solution to re-use the existing runner-specific deployment
> mechanism (e.g. Flink k8s operator) and in general simplify the deployment
> story.
>
> On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> The expansion service is a separate service. (The Flink jar happens to
>> bring both up.) However, there is negotiation to receive/validate the
>> pipeline options.
>>
>> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org> wrote:
>> >
>> > We would also need to consider cross-language pipelines that
>> (currently) assume the interaction with an expansion service at
>> construction time.
>> >
>> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com> wrote:
>> >>
>> >> > It might also be useful to have the option to just output the proto
>> and artifacts, as an alternative to the jar file.
>> >>
>> >> Sure, that wouldn't be too big a change if we were to decide to go the
>> SDK route.
>> >>
>> >> > For the Flink entry point we would need to allow for the job server
>> to be used as a library.
>> >>
>> >> We don't need the whole job server, we only need to add a main method
>> to FlinkPipelineRunner [1] as the entry point, which would basically just
>> do the setup described in the doc then call FlinkPipelineRunner::run.
>> >>
>> >> [1]
>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>> >>
>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcweaver@google.com
>> >>
>> >>
>> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:
>> >>>
>> >>> Hi Kyle,
>> >>>
>> >>> It might also be useful to have the option to just output the proto
>> and artifacts, as an alternative to the jar file.
>> >>>
>> >>> For the Flink entry point we would need to allow for the job server
>> to be used as a library. It would probably not be too hard to have the
>> Flink job constructed via the context execution environment, which would
>> require no changes on the Flink side.
>> >>>
>> >>> Thanks,
>> >>> Thomas
>> >>>
>> >>>
>> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com>
>> wrote:
>> >>>>
>> >>>> Re Javaless/serverless solution:
>> >>>> I take it this would probably mean that we would construct the jar
>> directly from the SDK. There are advantages to this: full separation of
>> Python and Java environments, no need for a job server, and likely a
>> simpler implementation, since we'd no longer have to work within the
>> constraints of the existing job server infrastructure. The only downside I
>> can think of is the additional cost of implementing/maintaining jar
>> creation code in each SDK, but that cost may be acceptable if it's simple
>> enough.
>> >>>>
>> >>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcweaver@google.com
>> >>>>
>> >>>>
>> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org> wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >>>>>>
>> >>>>>> > Before assembling the jar, the job server runs to create the
>> ingredients. That requires the (matching) Java environment on the Python
>> developer's machine.
>> >>>>>>
>> >>>>>> We can run the job server and have it create the jar (and if we
>> keep
>> >>>>>> the job server running we can use it to interact with the running
>> >>>>>> job). However, if the jar layout is simple enough, there's no need
>> to
>> >>>>>> even build it from Java.
>> >>>>>>
>> >>>>>> Taken to the extreme, this is a one-shot, jar-based JobService
>> API. We
>> >>>>>> choose a standard layout of where to put the pipeline description
>> and
>> >>>>>> artifacts, and can "augment" an existing jar (that has a
>> >>>>>> runner-specific main class whose entry point knows how to read this
>> >>>>>> data to kick off a pipeline as if it were a user's driver code) into
>> >>>>>> one that has a portable pipeline packaged into it for submission
>> to a
>> >>>>>> cluster.
>> >>>>>
>> >>>>>
>> >>>>> It would be nice if the Python developer doesn't have to run
>> anything Java at all.
>> >>>>>
>> >>>>> As we just discussed offline, this could be accomplished by
>> including the proto that is produced by the SDK into the pre-existing jar.
>> >>>>>
>> >>>>> And if the jar has an entry point that creates the Flink job in the
>> prescribed manner [1], it can be directly submitted to the Flink REST API.
>> That would allow for a Java-free client.
>> >>>>>
>> >>>>> [1]
>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>> >>>>>
>>
>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by enrico canzonieri <ec...@gmail.com>.
Is this alternative still being considered? Creating a portable jar sounds
like a good solution to re-use the existing runner-specific deployment
mechanism (e.g. Flink k8s operator) and in general simplify the deployment
story.

On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw <ro...@google.com> wrote:

> The expansion service is a separate service. (The Flink jar happens to
> bring both up.) However, there is negotiation to receive/validate the
> pipeline options.
>
> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org> wrote:
> >
> > We would also need to consider cross-language pipelines that (currently)
> assume the interaction with an expansion service at construction time.
> >
> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com> wrote:
> >>
> >> > It might also be useful to have the option to just output the proto
> and artifacts, as an alternative to the jar file.
> >>
> >> Sure, that wouldn't be too big a change if we were to decide to go the
> SDK route.
> >>
> >> > For the Flink entry point we would need to allow for the job server
> to be used as a library.
> >>
> >> We don't need the whole job server, we only need to add a main method
> to FlinkPipelineRunner [1] as the entry point, which would basically just
> do the setup described in the doc then call FlinkPipelineRunner::run.
> >>
> >> [1]
> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
> >>
> >> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcweaver@google.com
> >>
> >>
> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:
> >>>
> >>> Hi Kyle,
> >>>
> >>> It might also be useful to have the option to just output the proto
> and artifacts, as an alternative to the jar file.
> >>>
> >>> For the Flink entry point we would need to allow for the job server to
> be used as a library. It would probably not be too hard to have the Flink
> job constructed via the context execution environment, which would require
> no changes on the Flink side.
> >>>
> >>> Thanks,
> >>> Thomas
> >>>
> >>>
> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com>
> wrote:
> >>>>
> >>>> Re Javaless/serverless solution:
> >>>> I take it this would probably mean that we would construct the jar
> directly from the SDK. There are advantages to this: full separation of
> Python and Java environments, no need for a job server, and likely a
> simpler implementation, since we'd no longer have to work within the
> constraints of the existing job server infrastructure. The only downside I
> can think of is the additional cost of implementing/maintaining jar
> creation code in each SDK, but that cost may be acceptable if it's simple
> enough.
> >>>>
> >>>> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcweaver@google.com
> >>>>
> >>>>
> >>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>>>>>
> >>>>>> > Before assembling the jar, the job server runs to create the
> ingredients. That requires the (matching) Java environment on the Python
> developer's machine.
> >>>>>>
> >>>>>> We can run the job server and have it create the jar (and if we keep
> >>>>>> the job server running we can use it to interact with the running
> >>>>>> job). However, if the jar layout is simple enough, there's no need
> to
> >>>>>> even build it from Java.
> >>>>>>
> >>>>>> Taken to the extreme, this is a one-shot, jar-based JobService API.
> We
> >>>>>> choose a standard layout of where to put the pipeline description
> and
> >>>>>> artifacts, and can "augment" an existing jar (that has a
> >>>>>> runner-specific main class whose entry point knows how to read this
> >>>>>> data to kick off a pipeline as if it were a user's driver code) into
> >>>>>> one that has a portable pipeline packaged into it for submission to
> a
> >>>>>> cluster.
> >>>>>
> >>>>>
> >>>>> It would be nice if the Python developer doesn't have to run
> anything Java at all.
> >>>>>
> >>>>> As we just discussed offline, this could be accomplished by
> including the proto that is produced by the SDK into the pre-existing jar.
> >>>>>
> >>>>> And if the jar has an entry point that creates the Flink job in the
> prescribed manner [1], it can be directly submitted to the Flink REST API.
> That would allow for a Java-free client.
> >>>>>
> >>>>> [1]
> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
> >>>>>
>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Robert Bradshaw <ro...@google.com>.
The expansion service is a separate service. (The Flink jar happens to
bring both up.) However, there is negotiation to receive/validate the
pipeline options.

On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise <th...@apache.org> wrote:
>
> We would also need to consider cross-language pipelines that (currently) assume the interaction with an expansion service at construction time.
>
> On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com> wrote:
>>
>> > It might also be useful to have the option to just output the proto and artifacts, as an alternative to the jar file.
>>
>> Sure, that wouldn't be too big a change if we were to decide to go the SDK route.
>>
>> > For the Flink entry point we would need to allow for the job server to be used as a library.
>>
>> We don't need the whole job server, we only need to add a main method to FlinkPipelineRunner [1] as the entry point, which would basically just do the setup described in the doc then call FlinkPipelineRunner::run.
>>
>> [1] https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>>
>>
>> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:
>>>
>>> Hi Kyle,
>>>
>>> It might also be useful to have the option to just output the proto and artifacts, as an alternative to the jar file.
>>>
>>> For the Flink entry point we would need to allow for the job server to be used as a library. It would probably not be too hard to have the Flink job constructed via the context execution environment, which would require no changes on the Flink side.
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com> wrote:
>>>>
>>>> Re Javaless/serverless solution:
>>>> I take it this would probably mean that we would construct the jar directly from the SDK. There are advantages to this: full separation of Python and Java environments, no need for a job server, and likely a simpler implementation, since we'd no longer have to work within the constraints of the existing job server infrastructure. The only downside I can think of is the additional cost of implementing/maintaining jar creation code in each SDK, but that cost may be acceptable if it's simple enough.
>>>>
>>>> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>>>>
>>>>
>>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>
>>>>>> > Before assembling the jar, the job server runs to create the ingredients. That requires the (matching) Java environment on the Python developer's machine.
>>>>>>
>>>>>> We can run the job server and have it create the jar (and if we keep
>>>>>> the job server running we can use it to interact with the running
>>>>>> job). However, if the jar layout is simple enough, there's no need to
>>>>>> even build it from Java.
>>>>>>
>>>>>> Taken to the extreme, this is a one-shot, jar-based JobService API. We
>>>>>> choose a standard layout of where to put the pipeline description and
>>>>>> artifacts, and can "augment" an existing jar (that has a
>>>>>> runner-specific main class whose entry point knows how to read this
>>>>>> data to kick off a pipeline as if it were a user's driver code) into
>>>>>> one that has a portable pipeline packaged into it for submission to a
>>>>>> cluster.
>>>>>
>>>>>
>>>>> It would be nice if the Python developer doesn't have to run anything Java at all.
>>>>>
>>>>> As we just discussed offline, this could be accomplished by including the proto that is produced by the SDK into the pre-existing jar.
>>>>>
>>>>> And if the jar has an entry point that creates the Flink job in the prescribed manner [1], it can be directly submitted to the Flink REST API. That would allow for a Java-free client.
>>>>>
>>>>> [1] https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Thomas Weise <th...@apache.org>.
We would also need to consider cross-language pipelines that (currently)
assume the interaction with an expansion service at construction time.

On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver <kc...@google.com> wrote:

> > It might also be useful to have the option to just output the proto and
> artifacts, as an alternative to the jar file.
>
> Sure, that wouldn't be too big a change if we were to decide to go the SDK
> route.
>
> > For the Flink entry point we would need to allow for the job server to
> be used as a library.
>
> We don't need the whole job server, we only need to add a main method to
> FlinkPipelineRunner [1] as the entry point, which would basically just do
> the setup described in the doc then call FlinkPipelineRunner::run.
>
> [1]
> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>
>
> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:
>
>> Hi Kyle,
>>
>> It might also be useful to have the option to just output the proto and
>> artifacts, as an alternative to the jar file.
>>
>> For the Flink entry point we would need to allow for the job server to be
>> used as a library. It would probably not be too hard to have the Flink job
>> constructed via the context execution environment, which would require no
>> changes on the Flink side.
>>
>> Thanks,
>> Thomas
>>
>>
>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com> wrote:
>>
>>> Re Javaless/serverless solution:
>>> I take it this would probably mean that we would construct the jar
>>> directly from the SDK. There are advantages to this: full separation of
>>> Python and Java environments, no need for a job server, and likely a
>>> simpler implementation, since we'd no longer have to work within the
>>> constraints of the existing job server infrastructure. The only downside I
>>> can think of is the additional cost of implementing/maintaining jar
>>> creation code in each SDK, but that cost may be acceptable if it's simple
>>> enough.
>>>
>>> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>>>
>>>
>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> > Before assembling the jar, the job server runs to create the
>>>>> ingredients. That requires the (matching) Java environment on the Python
>>>>> developer's machine.
>>>>>
>>>>> We can run the job server and have it create the jar (and if we keep
>>>>> the job server running we can use it to interact with the running
>>>>> job). However, if the jar layout is simple enough, there's no need to
>>>>> even build it from Java.
>>>>>
>>>>> Taken to the extreme, this is a one-shot, jar-based JobService API. We
>>>>> choose a standard layout of where to put the pipeline description and
>>>>> artifacts, and can "augment" an existing jar (that has a
>>>>> runner-specific main class whose entry point knows how to read this
>>>>> data to kick off a pipeline as if it were a user's driver code) into
>>>>> one that has a portable pipeline packaged into it for submission to a
>>>>> cluster.
>>>>>
>>>>
>>>> It would be nice if the Python developer doesn't have to run anything
>>>> Java at all.
>>>>
>>>> As we just discussed offline, this could be accomplished by including
>>>> the proto that is produced by the SDK into the pre-existing jar.
>>>>
>>>> And if the jar has an entry point that creates the Flink job in the
>>>> prescribed manner [1], it can be directly submitted to the Flink REST API.
>>>> That would allow for a Java-free client.
>>>>
>>>> [1]
>>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>>
>>>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Kyle Weaver <kc...@google.com>.
> It might also be useful to have the option to just output the proto and
artifacts, as an alternative to the jar file.

Sure, that wouldn't be too big a change if we were to decide to go the SDK
route.

> For the Flink entry point we would need to allow for the job server to be
used as a library.

We don't need the whole job server, we only need to add a main method to
FlinkPipelineRunner [1] as the entry point, which would basically just do
the setup described in the doc then call FlinkPipelineRunner::run.

[1]
https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53

Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com


On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise <th...@apache.org> wrote:

> Hi Kyle,
>
> It might also be useful to have the option to just output the proto and
> artifacts, as an alternative to the jar file.
>
> For the Flink entry point we would need to allow for the job server to be
> used as a library. It would probably not be too hard to have the Flink job
> constructed via the context execution environment, which would require no
> changes on the Flink side.
>
> Thanks,
> Thomas
>
>
> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com> wrote:
>
>> Re Javaless/serverless solution:
>> I take it this would probably mean that we would construct the jar
>> directly from the SDK. There are advantages to this: full separation of
>> Python and Java environments, no need for a job server, and likely a
>> simpler implementation, since we'd no longer have to work within the
>> constraints of the existing job server infrastructure. The only downside I
>> can think of is the additional cost of implementing/maintaining jar
>> creation code in each SDK, but that cost may be acceptable if it's simple
>> enough.
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>>
>>
>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org> wrote:
>>
>>>
>>>
>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> > Before assembling the jar, the job server runs to create the
>>>> ingredients. That requires the (matching) Java environment on the Python
>>>> developer's machine.
>>>>
>>>> We can run the job server and have it create the jar (and if we keep
>>>> the job server running we can use it to interact with the running
>>>> job). However, if the jar layout is simple enough, there's no need to
>>>> even build it from Java.
>>>>
>>>> Taken to the extreme, this is a one-shot, jar-based JobService API. We
>>>> choose a standard layout of where to put the pipeline description and
>>>> artifacts, and can "augment" an existing jar (that has a
>>>> runner-specific main class whose entry point knows how to read this
>>>> data to kick off a pipeline as if it were a user's driver code) into
>>>> one that has a portable pipeline packaged into it for submission to a
>>>> cluster.
>>>>
>>>
>>> It would be nice if the Python developer doesn't have to run anything
>>> Java at all.
>>>
>>> As we just discussed offline, this could be accomplished by including
>>> the proto that is produced by the SDK into the pre-existing jar.
>>>
>>> And if the jar has an entry point that creates the Flink job in the
>>> prescribed manner [1], it can be directly submitted to the Flink REST API.
>>> That would allow for a Java-free client.
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>
>>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Thomas Weise <th...@apache.org>.
Hi Kyle,

It might also be useful to have the option to just output the proto and
artifacts, as an alternative to the jar file.

For the Flink entry point we would need to allow for the job server to be
used as a library. It would probably not be too hard to have the Flink job
constructed via the context execution environment, which would require no
changes on the Flink side.

Thanks,
Thomas


On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver <kc...@google.com> wrote:

> Re Javaless/serverless solution:
> I take it this would probably mean that we would construct the jar
> directly from the SDK. There are advantages to this: full separation of
> Python and Java environments, no need for a job server, and likely a
> simpler implementation, since we'd no longer have to work within the
> constraints of the existing job server infrastructure. The only downside I
> can think of is the additional cost of implementing/maintaining jar
> creation code in each SDK, but that cost may be acceptable if it's simple
> enough.
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>
>
> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org> wrote:
>
>>
>>
>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> > Before assembling the jar, the job server runs to create the
>>> ingredients. That requires the (matching) Java environment on the Python
>>> developer's machine.
>>>
>>> We can run the job server and have it create the jar (and if we keep
>>> the job server running we can use it to interact with the running
>>> job). However, if the jar layout is simple enough, there's no need to
>>> even build it from Java.
>>>
>>> Taken to the extreme, this is a one-shot, jar-based JobService API. We
>>> choose a standard layout of where to put the pipeline description and
>>> artifacts, and can "augment" an existing jar (that has a
>>> runner-specific main class whose entry point knows how to read this
>>> data to kick off a pipeline as if it were a user's driver code) into
>>> one that has a portable pipeline packaged into it for submission to a
>>> cluster.
>>>
>>
>> It would be nice if the Python developer doesn't have to run anything
>> Java at all.
>>
>> As we just discussed offline, this could be accomplished by including
>> the proto that is produced by the SDK into the pre-existing jar.
>>
>> And if the jar has an entry point that creates the Flink job in the
>> prescribed manner [1], it can be directly submitted to the Flink REST API.
>> That would allow for a Java-free client.
>>
>> [1]
>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>
>>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Kyle Weaver <kc...@google.com>.
Re Javaless/serverless solution:
I take it this would probably mean that we would construct the jar directly
from the SDK. There are advantages to this: full separation of Python and
Java environments, no need for a job server, and likely a simpler
implementation, since we'd no longer have to work within the constraints of
the existing job server infrastructure. The only downside I can think of is
the additional cost of implementing/maintaining jar creation code in each
SDK, but that cost may be acceptable if it's simple enough.
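As a rough sketch of the SDK side (pipeline construction and proto
serialization use existing Beam Python APIs; the output file name is
arbitrary):

    # Build a pipeline and emit its portable proto; a jar creation step
    # could then package this file without any Java tooling.
    import apache_beam as beam

    p = beam.Pipeline()
    _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)

    proto = p.to_runner_api()  # portable beam_runner_api_pb2.Pipeline
    with open('pipeline.pb', 'wb') as f:
        f.write(proto.SerializeToString())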

Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com


On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise <th...@apache.org> wrote:

>
>
> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> > Before assembling the jar, the job server runs to create the
>> ingredients. That requires the (matching) Java environment on the Python
>> developer's machine.
>>
>> We can run the job server and have it create the jar (and if we keep
>> the job server running we can use it to interact with the running
>> job). However, if the jar layout is simple enough, there's no need to
>> even build it from Java.
>>
>> Taken to the extreme, this is a one-shot, jar-based JobService API. We
>> choose a standard layout of where to put the pipeline description and
>> artifacts, and can "augment" an existing jar (that has a
>> runner-specific main class whose entry point knows how to read this
>> data to kick off a pipeline as if it were a user's driver code) into
>> one that has a portable pipeline packaged into it for submission to a
>> cluster.
>>
>
> It would be nice if the Python developer doesn't have to run anything Java
> at all.
>
> As we just discussed offline, this could be accomplished by including the
> proto that is produced by the SDK into the pre-existing jar.
>
> And if the jar has an entry point that creates the Flink job in the
> prescribed manner [1], it can be directly submitted to the Flink REST API.
> That would allow for a Java-free client.
>
> [1]
> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>
>

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Thomas Weise <th...@apache.org>.
On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <ro...@google.com> wrote:

> > Before assembling the jar, the job server runs to create the
> ingredients. That requires the (matching) Java environment on the Python
> developer's machine.
>
> We can run the job server and have it create the jar (and if we keep
> the job server running we can use it to interact with the running
> job). However, if the jar layout is simple enough, there's no need to
> even build it from Java.
>
> Taken to the extreme, this is a one-shot, jar-based JobService API. We
> choose a standard layout of where to put the pipeline description and
> artifacts, and can "augment" an existing jar (that has a
> runner-specific main class whose entry point knows how to read this
> data to kick off a pipeline as if it were a user's driver code) into
> one that has a portable pipeline packaged into it for submission to a
> cluster.
>

It would be nice if the Python developer doesn't have to run anything Java
at all.

As we just discussed offline, this could be accomplished by including the
proto that is produced by the SDK into the pre-existing jar.

And if the jar has an entry point that creates the Flink job in the
prescribed manner [1], it can be directly submitted to the Flink REST API.
That would allow for a Java-free client.

[1]
https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
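For illustration, a minimal sketch of such a Java-free submission in Python
(/jars/upload and /jars/:jarid/run are Flink's documented REST endpoints;
the entry class refers to the main method proposed for FlinkPipelineRunner,
which does not exist yet):

    # Upload the self-contained jar to the Flink REST API and run it.
    import requests

    base = 'http://flink-jobmanager:8081'
    with open('my-pipeline.jar', 'rb') as f:
        resp = requests.post(base + '/v1/jars/upload',
                             files={'jarfile': ('my-pipeline.jar', f)})
    # The returned filename ends with the server-side jar id.
    jar_id = resp.json()['filename'].split('/')[-1]

    requests.post(
        base + '/v1/jars/%s/run' % jar_id,
        json={'entryClass':
              'org.apache.beam.runners.flink.FlinkPipelineRunner'})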

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Robert Bradshaw <ro...@google.com>.
On Wed, Aug 7, 2019 at 5:59 PM Thomas Weise <th...@apache.org> wrote:
>
>> > * The pipeline construction code itself may need access to cluster resources. In such cases the jar file cannot be created offline.
>>
>> Could you elaborate?
>
>
> The entry point is arbitrary code written by the user, not limited to Beam pipeline construction alone. For example, there could be access to a file system or other service to fetch metadata that is required to build the pipeline. Such services can be accessed when the code runs within the infrastructure, but typically not in a development environment.

Yes, this may be limited to the case where the pipeline construction
can be done on the user's machine before submission (remotely staging
and executing the Python (or Go, or ...) code within the
infrastructure to build the pipeline and then running the job server
there is a bit more complicated). We control the entry point from then
on.

>> > * For k8s deployment, a container image with the SDK and application code is required for the worker. The jar file (which is really a derived artifact) would need to be built in addition to the container image.
>>
>> Yes. For standard use, a vanilla SDK container published with a Beam
>> release + staged artifacts should be sufficient.
>>
>> > * To build such a jar file, the user would need a build environment with
>>
>> Actually, it's probably much easier than that. A jar file is just a
>> zip file with a standard structure, to which one can easily add (data)
>> files without having a full build environment. The (pre-compiled) main
>> class would know how to read this data to construct the pipeline and
>> kick off the job just like any other Flink job.
>
> Before assembling the jar, the job server runs to create the ingredients. That requires the (matching) Java environment on the Python developer's machine.

We can run the job server and have it create the jar (and if we keep
the job server running we can use it to interact with the running
job). However, if the jar layout is simple enough, there's no need to
even build it from Java.

Taken to the extreme, this is a one-shot, jar-based JobService API. We
choose a standard layout of where to put the pipeline description and
artifacts, and can "augment" an existing jar (that has a
runner-specific main class whose entry point knows how to read this
data to kick off a pipeline as if it were a user's driver code) into
one that has a portable pipeline packaged into it for submission to a
cluster.

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Thomas Weise <th...@apache.org>.
-->

>
>
> > * The pipeline construction code itself may need access to cluster
> resources. In such cases the jar file cannot be created offline.
>
> Could you elaborate?
>

The entry point is arbitrary code written by the user, not limited to Beam
pipeline construction alone. For example, there could be access to a file
system or other service to fetch metadata that is required to build the
pipeline. Such services can be accessed when the code runs within the
infrastructure, but typically not in a development environment.


> > * For k8s deployment, a container image with the SDK and application
> code is required for the worker. The jar file (which is really a derived
> artifact) would need to be built in addition to the container image.
>
> Yes. For standard use, a vanilla SDK container published with a Beam
> release + staged artifacts should be sufficient.
>
> > * To build such a jar file, the user would need a build environment with
> job server and application code. Do we want to make that assumption?
>
> Actually, it's probably much easier than that. A jar file is just a
> zip file with a standard structure, to which one can easily add (data)
> files without having a full build environment. The (pre-compiled) main
> class would know how to read this data to construct the pipeline and
> kick off the job just like any other Flink job.
>

Before assembling the jar, the job server runs to create the ingredients.
That requires the (matching) Java environment on the Python developer's
machine.

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Kyle Weaver <kc...@google.com>.
Added comment access to the doc link.

On Wed, Aug 7, 2019 at 3:00 AM Robert Bradshaw <ro...@google.com> wrote:

> On Wed, Aug 7, 2019 at 6:20 AM Thomas Weise <th...@apache.org> wrote:
> >
> > Hi Kyle,
> >
> > [document doesn't have comments enabled currently]
> >
> > As noted, worker deployment is an open question. I believe pipeline
> submission and worker execution need to be considered together for a
> complete deployment story. The idea of creating a self-contained jar file
> is interesting, but there are trade-offs:
> >
> > * The pipeline construction code itself may need access to cluster
> resources. In such cases the jar file cannot be created offline.
>
> Could you elaborate?
>
> > * For k8s deployment, a container image with the SDK and application
> code is required for the worker. The jar file (which is really a derived
> artifact) would need to be built in addition to the container image.
>
> Yes. For standard use, a vanilla SDK container published with a Beam
> release + staged artifacts should be sufficient.
>
> > * To build such a jar file, the user would need a build environment with
> job server and application code. Do we want to make that assumption?
>
> Actually, it's probably much easier than that. A jar file is just a
> zip file with a standard structure, to which one can easily add (data)
> files without having a full build environment. The (pre-compiled) main
> class would know how to read this data to construct the pipeline and
> kick off the job just like any other Flink job.
>
> > The document that I had shared discusses options for pipeline
> submission. It might be interesting to explore if your proposal for
> building such a jar can be integrated or if you have other comments?
> >
> > Thomas
> >
> >
> >
> > On Tue, Aug 6, 2019 at 5:03 PM Kyle Weaver <kc...@google.com> wrote:
> >>
> >> Hi all,
> >>
> >> Following up on discussion about portable Beam on Flink on Kubernetes
> [1], I have drafted a short document on how I propose we bundle portable
> Beam applications into jars that can be run on OSS runners, similar to
> Dataflow templates (but without the actual template part, at least for the
> first iteration). It's pretty straightforward, but I thought I would
> broadcast it here in case anyone is interested.
> >>
> >>
> https://docs.google.com/document/d/1kj_9JWxGWOmSGeZ5hbLVDXSTv-zBrx4kQRqOq85RYD4/edit#
> >>
> >> [1]
> https://lists.apache.org/thread.html/a12dd939c4af254694481796bc08b05bb1321cfaadd1a79cd3866584@%3Cdev.beam.apache.org%3E
> >>
> >> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcweaver@google.com
>
-- 
Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Robert Bradshaw <ro...@google.com>.
On Wed, Aug 7, 2019 at 6:20 AM Thomas Weise <th...@apache.org> wrote:
>
> Hi Kyle,
>
> [document doesn't have comments enabled currently]
>
> As noted, worker deployment is an open question. I believe pipeline submission and worker execution need to be considered together for a complete deployment story. The idea of creating a self-contained jar file is interesting, but there are trade-offs:
>
> * The pipeline construction code itself may need access to cluster resources. In such cases the jar file cannot be created offline.

Could you elaborate?

> * For k8s deployment, a container image with the SDK and application code is required for the worker. The jar file (which is really a derived artifact) would need to be built in addition to the container image.

Yes. For standard use, a vanilla released Beam published SDK container
+ staged artifacts should be sufficient.

> * To build such jar file, the user would need a build environment with job server and application code. Do we want to make that assumption?

Actually, it's probably much easier than that. A jar file is just a
zip file with a standard structure, to which one can easily add (data)
files without having a full build environment. The (pre-compiled) main
class would know how to read this data to construct the pipeline and
kick off the job just like any other Flink job.

> The document that I had shared discusses options for pipeline submission. It might be interesting to explore if your proposal for building such a jar can be integrated or if you have other comments?
>
> Thomas
>
>
>
> On Tue, Aug 6, 2019 at 5:03 PM Kyle Weaver <kc...@google.com> wrote:
>>
>> Hi all,
>>
>> Following up on discussion about portable Beam on Flink on Kubernetes [1], I have drafted a short document on how I propose we bundle portable Beam applications into jars that can be run on OSS runners, similar to Dataflow templates (but without the actual template part, at least for the first iteration). It's pretty straightforward, but I thought I would broadcast it here in case anyone is interested.
>>
>> https://docs.google.com/document/d/1kj_9JWxGWOmSGeZ5hbLVDXSTv-zBrx4kQRqOq85RYD4/edit#
>>
>> [1] https://lists.apache.org/thread.html/a12dd939c4af254694481796bc08b05bb1321cfaadd1a79cd3866584@%3Cdev.beam.apache.org%3E
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com

Re: (mini-doc) Beam (Flink) portable job templates

Posted by Thomas Weise <th...@apache.org>.
Hi Kyle,

[document doesn't have comments enabled currently]

As noted, worker deployment is an open question. I believe pipeline
submission and worker execution need to be considered together for a
complete deployment story. The idea of creating a self-contained jar file
is interesting, but there are trade-offs:

* The pipeline construction code itself may need access to cluster
resources. In such cases the jar file cannot be created offline.
* For k8s deployment, a container image with the SDK and application code
is required for the worker. The jar file (which is really a derived
artifact) would need to be built in addition to the container image.
* To build such jar file, the user would need a build environment with job
server and application code. Do we want to make that assumption?

The document that I had shared discusses options for pipeline submission.
It might be interesting to explore if your proposal for building such a jar
can be integrated or if you have other comments?

Thomas



On Tue, Aug 6, 2019 at 5:03 PM Kyle Weaver <kc...@google.com> wrote:

> Hi all,
>
> Following up on discussion about portable Beam on Flink on Kubernetes [1],
> I have drafted a short document on how I propose we bundle portable Beam
> applications into jars that can be run on OSS runners, similar to Dataflow
> templates (but without the actual template part, at least for the first
> iteration). It's pretty straightforward, but I thought I would broadcast it
> here in case anyone is interested.
>
>
> https://docs.google.com/document/d/1kj_9JWxGWOmSGeZ5hbLVDXSTv-zBrx4kQRqOq85RYD4/edit#
>
> [1]
> https://lists.apache.org/thread.html/a12dd939c4af254694481796bc08b05bb1321cfaadd1a79cd3866584@%3Cdev.beam.apache.org%3E
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>