You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Can Gencer <ca...@hazelcast.com> on 2019/02/15 06:35:56 UTC

Hazelcast Jet Runner

We at Hazelcast are looking into writing a Beam runner for Hazelcast Jet (
https://github.com/hazelcast/hazelcast-jet). I wanted to introduce myself
as we'll likely have questions as we start development.

Some of the things I'm wondering about currently:

* Currently there seems to be a guide available at
https://beam.apache.org/contribute/runner-guide/ , is this up to date? Is
there anything in specific to be aware of when starting with a new runner
that's not covered here?
* Should we be targeting the latest master which is at 2.12-SNAPSHOT or a
stable version?
* After a runner is developed, how is the maintenance typically handled, as
the runners seems to be part of Beam codebase?

Re: Hazelcast Jet Runner

Posted by Ankur Goenka <go...@google.com>.

Hi Can,

Like GreedyPipelineFuser, we have added many more components which makes
building a Portable Runner easy. Here is a link [1] to slides which
explains at a very high level what is needed to add a new portable runner.
Still adding a portable runner will be more complex than adding a native
runner but with these components it should be relatively easier than
originally expected.

[1]
https://docs.google.com/presentation/d/1JRNUSpOC8qaA4uLDuyGsuuyf6Tk8Xi9LAukhgl-hT_w/edit?usp=sharing

Thanks,
Ankur

On Wed, Mar 20, 2019 at 7:19 AM Maximilian Michels <mx...@apache.org> wrote:

> Documentation on portability is still a bit sparse although there are
> many design documents:
> https://beam.apache.org/contribute/design-documents/#portability
>
> The structure of portable Runners is not fundamentally different, but
> some of the operations are deferred to the SDK which runs code for all
> supported languages. The Runner needs to provide an integration with it.
>
> Eventually, the old Runners will become obsolete though that won't
> happen very soon. Performance should be slightly better on the old Runners.
>
> I think writing an old-style Runner now will give you enough experience
> to port it to the new language-portable style later on.
>
> Cheers,
> Max
>
> On 20.03.19 14:52, Can Gencer wrote:
> > I had a look at "GreedyPipelineFuser" and indeed this was what exactly I
> > was talking about.
> >
> > Is https://beam.apache.org/roadmap/portability/ still the best
> > information about the portable runners or is there a more in-depth guide
> > available anywhere?
> >
> > On Wed, Mar 20, 2019 at 2:29 PM Can Gencer <can@hazelcast.com
> > <ma...@hazelcast.com>> wrote:
> >
> >     Hi Max,
> >
> >     Thanks. When you mean "old-style runner"  is this meant that this
> >     style of runners will be obsolete and only the portable one will be
> >     supported? The documentation for portable runners wasn't quite
> >     complete and the barrier to entry for writing an old style runner
> >     seemed easier for us and the old style runner should have better
> >     performance?
> >
> >     On Wed, Mar 20, 2019 at 1:36 PM Maximilian Michels <mxm@apache.org
> >     <ma...@apache.org>> wrote:
> >
> >         Hi Can,
> >
> >         Thanks for the update. Interesting question. Flink has an
> >         optimization
> >         built in called chaining which works together nicely with Beam.
> >         Essentially, operators which share the same partitioning get
> >         executed
> >         one after another inside a master operator. This saves resources.
> >
> >         Interestingly, Beam's Fuser for portable Runners does something
> >         similar.
> >         AFAIK there is no built-in solution for the old-style Runners. I
> >         think
> >         it would be possible to build something like this on top of the
> >         existing
> >         translation.
> >
> >         Cheers,
> >         Max
> >
> >         On 20.03.19 13:07, Can Gencer wrote:
> >          > Hi again,
> >          >
> >          > We've made some progress on the runner since writing this
> >         more than a
> >          > month ago, the repo is available here publicly:
> >          > https://github.com/hazelcast/hazelcast-jet-beam-runner
> >          >
> >          > Still very much a work in progress though. One of the issues
> >         I wanted to
> >          > raise is that currently we're translating each PTransform to
> >         a Jet
> >          > Vertex (could be consider analogous to a Flink operator or a
> >         vertex in
> >          > Tez). This is sub-optimal, since Beam creates lots of
> >         transforms for
> >          > computations that could be performed inside the same Vertex,
> >         such as
> >          > subsequent mapping transforms and many others. Ideally you
> >         only need
> >          > distinct vertices where the data is re-partitioned and/or
> >         shuffled. I'm
> >          > curious if Beam offers some way of translating the PTransform
> >         graph to a
> >          > more minimal set of transforms, i.e. some kind of planner or
> >         would this
> >          > have to be custom code? We've done a similar integration with
> >         Cascading
> >          > in the past and it offered a planner which given a set of
> >         rules would
> >          > partition the Cascading DAG into a minimal set of vertices
> >         for the same
> >          > DAG. Curious if Beam has any similar functionality?
> >          >
> >          >
> >          >
> >          > On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles
> >         <kenn@apache.org <ma...@apache.org>
> >          > <mailto:kenn@apache.org <ma...@apache.org>>> wrote:
> >          >
> >          >     Elaborating on what Robert alluded to: when I wrote that
> >         runner
> >          >     author guide, portability was in its infancy. Now Beam
> >         Python can be
> >          >     run on Flink. So that guide is primarily focused on the
> >         "deserialize
> >          >     a Java DoFn and call its methods" approach. A decent
> >         amount of it is
> >          >     still really important to know, but is now the
> >         responsibility of the
> >          >     "SDK harness", aka language-specific coprocessor. For
> >         Python & Go &
> >          >     <insert new SDK language here> you really want to use the
> >          >     portability protos and the portable Flink runner is the
> >         best model.
> >          >
> >          >     Kenn
> >          >
> >          >
> >          >     On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw
> >         <robertwb@google.com <ma...@google.com>
> >          >     <mailto:robertwb@google.com
> >         <ma...@google.com>>> wrote:
> >          >
> >          >         On Fri, Feb 15, 2019 at 7:36 AM Can Gencer
> >         <can@hazelcast.com <ma...@hazelcast.com>
> >          >         <mailto:can@hazelcast.com
> >         <ma...@hazelcast.com>>> wrote:
> >          >          >
> >          >          > We at Hazelcast are looking into writing a Beam
> >         runner for
> >          >         Hazelcast Jet
> >         (https://github.com/hazelcast/hazelcast-jet). I
> >          >         wanted to introduce myself as we'll likely have
> >         questions as we
> >          >         start development.
> >          >
> >          >         Welcome!
> >          >
> >          >         Hazelcast looks interesting, a Beam runner for it
> >         would be very
> >          >         cool.
> >          >
> >          >          > Some of the things I'm wondering about currently:
> >          >          >
> >          >          > * Currently there seems to be a guide available at
> >          > https://beam.apache.org/contribute/runner-guide/ , is this
> up to
> >          >         date? Is there anything in specific to be aware of
> >         when starting
> >          >         with a new runner that's not covered here?
> >          >
> >          >         That looks like a pretty good starting point. At a
> >         quick glance, I
> >          >         don't see anything that looks out of date. Another
> >         resource that
> >          >         might
> >          >         be helpful is a talk from last year on writing an SDK
> >         (but as it
> >          >         mostly covers the runner-sdk interaction, it's also
> >         quite useful for
> >          >         understanding the runner side:
> >          >
> >
> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
> >          >         And please feel free to ask any questions on this
> >         list as well; we'd
> >          >         be happy to help.
> >          >
> >          >          > * Should we be targeting the latest master which
> is at
> >          >         2.12-SNAPSHOT or a stable version?
> >          >
> >          >         I would target the latest master.
> >          >
> >          >          > * After a runner is developed, how is the
> maintenance
> >          >         typically handled, as the runners seems to be part of
> >         Beam codebase?
> >          >
> >          >         Either is possible. Several runner adapters are part
> >         of the Beam
> >          >         codebase, but for example the IMB Streams Beam runner
> >         is not. There
> >          >         are certainly pros and cons (certainly early on when
> >         the APIs
> >          >         themselves were under heavy development it was easier
> >         to keep things
> >          >         in sync in the same codebase, but things have mostly
> >         stabilized
> >          >         now).
> >          >         A runner only becomes part of the Beam codebase if
> >         there are members
> >          >         of the community committed to maintaining it (which
> >         could include
> >          >         you). Both approaches are fine.
> >          >
> >          >         - Robert
> >          >
> >
>

Re: Hazelcast Jet Runner

Posted by Maximilian Michels <mx...@apache.org>.

Documentation on portability is still a bit sparse although there are 
many design documents: 
https://beam.apache.org/contribute/design-documents/#portability

The structure of portable Runners is not fundamentally different, but 
some of the operations are deferred to the SDK which runs code for all 
supported languages. The Runner needs to provide an integration with it.

Eventually, the old Runners will become obsolete though that won't 
happen very soon. Performance should be slightly better on the old Runners.

I think writing an old-style Runner now will give you enough experience 
to port it to the new language-portable style later on.

Cheers,
Max

On 20.03.19 14:52, Can Gencer wrote:
> I had a look at "GreedyPipelineFuser" and indeed this was what exactly I 
> was talking about.
> 
> Is https://beam.apache.org/roadmap/portability/ still the best 
> information about the portable runners or is there a more in-depth guide 
> available anywhere?
> 
> On Wed, Mar 20, 2019 at 2:29 PM Can Gencer <can@hazelcast.com 
> <ma...@hazelcast.com>> wrote:
> 
>     Hi Max,
> 
>     Thanks. When you mean "old-style runner"  is this meant that this
>     style of runners will be obsolete and only the portable one will be
>     supported? The documentation for portable runners wasn't quite
>     complete and the barrier to entry for writing an old style runner
>     seemed easier for us and the old style runner should have better
>     performance?
> 
>     On Wed, Mar 20, 2019 at 1:36 PM Maximilian Michels <mxm@apache.org
>     <ma...@apache.org>> wrote:
> 
>         Hi Can,
> 
>         Thanks for the update. Interesting question. Flink has an
>         optimization
>         built in called chaining which works together nicely with Beam.
>         Essentially, operators which share the same partitioning get
>         executed
>         one after another inside a master operator. This saves resources.
> 
>         Interestingly, Beam's Fuser for portable Runners does something
>         similar.
>         AFAIK there is no built-in solution for the old-style Runners. I
>         think
>         it would be possible to build something like this on top of the
>         existing
>         translation.
> 
>         Cheers,
>         Max
> 
>         On 20.03.19 13:07, Can Gencer wrote:
>          > Hi again,
>          >
>          > We've made some progress on the runner since writing this
>         more than a
>          > month ago, the repo is available here publicly:
>          > https://github.com/hazelcast/hazelcast-jet-beam-runner
>          >
>          > Still very much a work in progress though. One of the issues
>         I wanted to
>          > raise is that currently we're translating each PTransform to
>         a Jet
>          > Vertex (could be consider analogous to a Flink operator or a
>         vertex in
>          > Tez). This is sub-optimal, since Beam creates lots of
>         transforms for
>          > computations that could be performed inside the same Vertex,
>         such as
>          > subsequent mapping transforms and many others. Ideally you
>         only need
>          > distinct vertices where the data is re-partitioned and/or
>         shuffled. I'm
>          > curious if Beam offers some way of translating the PTransform
>         graph to a
>          > more minimal set of transforms, i.e. some kind of planner or
>         would this
>          > have to be custom code? We've done a similar integration with
>         Cascading
>          > in the past and it offered a planner which given a set of
>         rules would
>          > partition the Cascading DAG into a minimal set of vertices
>         for the same
>          > DAG. Curious if Beam has any similar functionality?
>          >
>          >
>          >
>          > On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles
>         <kenn@apache.org <ma...@apache.org>
>          > <mailto:kenn@apache.org <ma...@apache.org>>> wrote:
>          >
>          >     Elaborating on what Robert alluded to: when I wrote that
>         runner
>          >     author guide, portability was in its infancy. Now Beam
>         Python can be
>          >     run on Flink. So that guide is primarily focused on the
>         "deserialize
>          >     a Java DoFn and call its methods" approach. A decent
>         amount of it is
>          >     still really important to know, but is now the
>         responsibility of the
>          >     "SDK harness", aka language-specific coprocessor. For
>         Python & Go &
>          >     <insert new SDK language here> you really want to use the
>          >     portability protos and the portable Flink runner is the
>         best model.
>          >
>          >     Kenn
>          >
>          >
>          >     On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw
>         <robertwb@google.com <ma...@google.com>
>          >     <mailto:robertwb@google.com
>         <ma...@google.com>>> wrote:
>          >
>          >         On Fri, Feb 15, 2019 at 7:36 AM Can Gencer
>         <can@hazelcast.com <ma...@hazelcast.com>
>          >         <mailto:can@hazelcast.com
>         <ma...@hazelcast.com>>> wrote:
>          >          >
>          >          > We at Hazelcast are looking into writing a Beam
>         runner for
>          >         Hazelcast Jet
>         (https://github.com/hazelcast/hazelcast-jet). I
>          >         wanted to introduce myself as we'll likely have
>         questions as we
>          >         start development.
>          >
>          >         Welcome!
>          >
>          >         Hazelcast looks interesting, a Beam runner for it
>         would be very
>          >         cool.
>          >
>          >          > Some of the things I'm wondering about currently:
>          >          >
>          >          > * Currently there seems to be a guide available at
>          > https://beam.apache.org/contribute/runner-guide/ , is this up to
>          >         date? Is there anything in specific to be aware of
>         when starting
>          >         with a new runner that's not covered here?
>          >
>          >         That looks like a pretty good starting point. At a
>         quick glance, I
>          >         don't see anything that looks out of date. Another
>         resource that
>          >         might
>          >         be helpful is a talk from last year on writing an SDK
>         (but as it
>          >         mostly covers the runner-sdk interaction, it's also
>         quite useful for
>          >         understanding the runner side:
>          >
>         https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
>          >         And please feel free to ask any questions on this
>         list as well; we'd
>          >         be happy to help.
>          >
>          >          > * Should we be targeting the latest master which is at
>          >         2.12-SNAPSHOT or a stable version?
>          >
>          >         I would target the latest master.
>          >
>          >          > * After a runner is developed, how is the maintenance
>          >         typically handled, as the runners seems to be part of
>         Beam codebase?
>          >
>          >         Either is possible. Several runner adapters are part
>         of the Beam
>          >         codebase, but for example the IMB Streams Beam runner
>         is not. There
>          >         are certainly pros and cons (certainly early on when
>         the APIs
>          >         themselves were under heavy development it was easier
>         to keep things
>          >         in sync in the same codebase, but things have mostly
>         stabilized
>          >         now).
>          >         A runner only becomes part of the Beam codebase if
>         there are members
>          >         of the community committed to maintaining it (which
>         could include
>          >         you). Both approaches are fine.
>          >
>          >         - Robert
>          >
>

Re: Hazelcast Jet Runner

Posted by Can Gencer <ca...@hazelcast.com>.

I had a look at "GreedyPipelineFuser" and indeed this was what exactly I
was talking about.

Is https://beam.apache.org/roadmap/portability/ still the best information
about the portable runners or is there a more in-depth guide available
anywhere?

On Wed, Mar 20, 2019 at 2:29 PM Can Gencer <ca...@hazelcast.com> wrote:

> Hi Max,
>
> Thanks. When you mean "old-style runner"  is this meant that this style of
> runners will be obsolete and only the portable one will be supported? The
> documentation for portable runners wasn't quite complete and the barrier to
> entry for writing an old style runner seemed easier for us and the old
> style runner should have better performance?
>
> On Wed, Mar 20, 2019 at 1:36 PM Maximilian Michels <mx...@apache.org> wrote:
>
>> Hi Can,
>>
>> Thanks for the update. Interesting question. Flink has an optimization
>> built in called chaining which works together nicely with Beam.
>> Essentially, operators which share the same partitioning get executed
>> one after another inside a master operator. This saves resources.
>>
>> Interestingly, Beam's Fuser for portable Runners does something similar.
>> AFAIK there is no built-in solution for the old-style Runners. I think
>> it would be possible to build something like this on top of the existing
>> translation.
>>
>> Cheers,
>> Max
>>
>> On 20.03.19 13:07, Can Gencer wrote:
>> > Hi again,
>> >
>> > We've made some progress on the runner since writing this more than a
>> > month ago, the repo is available here publicly:
>> > https://github.com/hazelcast/hazelcast-jet-beam-runner
>> >
>> > Still very much a work in progress though. One of the issues I wanted
>> to
>> > raise is that currently we're translating each PTransform to a Jet
>> > Vertex (could be consider analogous to a Flink operator or a vertex in
>> > Tez). This is sub-optimal, since Beam creates lots of transforms for
>> > computations that could be performed inside the same Vertex, such as
>> > subsequent mapping transforms and many others. Ideally you only need
>> > distinct vertices where the data is re-partitioned and/or shuffled. I'm
>> > curious if Beam offers some way of translating the PTransform graph to
>> a
>> > more minimal set of transforms, i.e. some kind of planner or would this
>> > have to be custom code? We've done a similar integration with Cascading
>> > in the past and it offered a planner which given a set of rules would
>> > partition the Cascading DAG into a minimal set of vertices for the same
>> > DAG. Curious if Beam has any similar functionality?
>> >
>> >
>> >
>> > On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles <kenn@apache.org
>> > <ma...@apache.org>> wrote:
>> >
>> >     Elaborating on what Robert alluded to: when I wrote that runner
>> >     author guide, portability was in its infancy. Now Beam Python can be
>> >     run on Flink. So that guide is primarily focused on the "deserialize
>> >     a Java DoFn and call its methods" approach. A decent amount of it is
>> >     still really important to know, but is now the responsibility of the
>> >     "SDK harness", aka language-specific coprocessor. For Python & Go &
>> >     <insert new SDK language here> you really want to use the
>> >     portability protos and the portable Flink runner is the best model.
>> >
>> >     Kenn
>> >
>> >
>> >     On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw <
>> robertwb@google.com
>> >     <ma...@google.com>> wrote:
>> >
>> >         On Fri, Feb 15, 2019 at 7:36 AM Can Gencer <can@hazelcast.com
>> >         <ma...@hazelcast.com>> wrote:
>> >          >
>> >          > We at Hazelcast are looking into writing a Beam runner for
>> >         Hazelcast Jet (https://github.com/hazelcast/hazelcast-jet). I
>> >         wanted to introduce myself as we'll likely have questions as we
>> >         start development.
>> >
>> >         Welcome!
>> >
>> >         Hazelcast looks interesting, a Beam runner for it would be very
>> >         cool.
>> >
>> >          > Some of the things I'm wondering about currently:
>> >          >
>> >          > * Currently there seems to be a guide available at
>> >         https://beam.apache.org/contribute/runner-guide/ , is this up
>> to
>> >         date? Is there anything in specific to be aware of when starting
>> >         with a new runner that's not covered here?
>> >
>> >         That looks like a pretty good starting point. At a quick
>> glance, I
>> >         don't see anything that looks out of date. Another resource that
>> >         might
>> >         be helpful is a talk from last year on writing an SDK (but as it
>> >         mostly covers the runner-sdk interaction, it's also quite
>> useful for
>> >         understanding the runner side:
>> >
>> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
>> >         And please feel free to ask any questions on this list as well;
>> we'd
>> >         be happy to help.
>> >
>> >          > * Should we be targeting the latest master which is at
>> >         2.12-SNAPSHOT or a stable version?
>> >
>> >         I would target the latest master.
>> >
>> >          > * After a runner is developed, how is the maintenance
>> >         typically handled, as the runners seems to be part of Beam
>> codebase?
>> >
>> >         Either is possible. Several runner adapters are part of the Beam
>> >         codebase, but for example the IMB Streams Beam runner is not.
>> There
>> >         are certainly pros and cons (certainly early on when the APIs
>> >         themselves were under heavy development it was easier to keep
>> things
>> >         in sync in the same codebase, but things have mostly stabilized
>> >         now).
>> >         A runner only becomes part of the Beam codebase if there are
>> members
>> >         of the community committed to maintaining it (which could
>> include
>> >         you). Both approaches are fine.
>> >
>> >         - Robert
>> >
>>
>

Re: Hazelcast Jet Runner

Posted by Can Gencer <ca...@hazelcast.com>.

Hi Max,

Thanks. When you mean "old-style runner"  is this meant that this style of
runners will be obsolete and only the portable one will be supported? The
documentation for portable runners wasn't quite complete and the barrier to
entry for writing an old style runner seemed easier for us and the old
style runner should have better performance?

On Wed, Mar 20, 2019 at 1:36 PM Maximilian Michels <mx...@apache.org> wrote:

> Hi Can,
>
> Thanks for the update. Interesting question. Flink has an optimization
> built in called chaining which works together nicely with Beam.
> Essentially, operators which share the same partitioning get executed
> one after another inside a master operator. This saves resources.
>
> Interestingly, Beam's Fuser for portable Runners does something similar.
> AFAIK there is no built-in solution for the old-style Runners. I think
> it would be possible to build something like this on top of the existing
> translation.
>
> Cheers,
> Max
>
> On 20.03.19 13:07, Can Gencer wrote:
> > Hi again,
> >
> > We've made some progress on the runner since writing this more than a
> > month ago, the repo is available here publicly:
> > https://github.com/hazelcast/hazelcast-jet-beam-runner
> >
> > Still very much a work in progress though. One of the issues I wanted to
> > raise is that currently we're translating each PTransform to a Jet
> > Vertex (could be consider analogous to a Flink operator or a vertex in
> > Tez). This is sub-optimal, since Beam creates lots of transforms for
> > computations that could be performed inside the same Vertex, such as
> > subsequent mapping transforms and many others. Ideally you only need
> > distinct vertices where the data is re-partitioned and/or shuffled. I'm
> > curious if Beam offers some way of translating the PTransform graph to a
> > more minimal set of transforms, i.e. some kind of planner or would this
> > have to be custom code? We've done a similar integration with Cascading
> > in the past and it offered a planner which given a set of rules would
> > partition the Cascading DAG into a minimal set of vertices for the same
> > DAG. Curious if Beam has any similar functionality?
> >
> >
> >
> > On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles <kenn@apache.org
> > <ma...@apache.org>> wrote:
> >
> >     Elaborating on what Robert alluded to: when I wrote that runner
> >     author guide, portability was in its infancy. Now Beam Python can be
> >     run on Flink. So that guide is primarily focused on the "deserialize
> >     a Java DoFn and call its methods" approach. A decent amount of it is
> >     still really important to know, but is now the responsibility of the
> >     "SDK harness", aka language-specific coprocessor. For Python & Go &
> >     <insert new SDK language here> you really want to use the
> >     portability protos and the portable Flink runner is the best model.
> >
> >     Kenn
> >
> >
> >     On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw <robertwb@google.com
> >     <ma...@google.com>> wrote:
> >
> >         On Fri, Feb 15, 2019 at 7:36 AM Can Gencer <can@hazelcast.com
> >         <ma...@hazelcast.com>> wrote:
> >          >
> >          > We at Hazelcast are looking into writing a Beam runner for
> >         Hazelcast Jet (https://github.com/hazelcast/hazelcast-jet). I
> >         wanted to introduce myself as we'll likely have questions as we
> >         start development.
> >
> >         Welcome!
> >
> >         Hazelcast looks interesting, a Beam runner for it would be very
> >         cool.
> >
> >          > Some of the things I'm wondering about currently:
> >          >
> >          > * Currently there seems to be a guide available at
> >         https://beam.apache.org/contribute/runner-guide/ , is this up to
> >         date? Is there anything in specific to be aware of when starting
> >         with a new runner that's not covered here?
> >
> >         That looks like a pretty good starting point. At a quick glance,
> I
> >         don't see anything that looks out of date. Another resource that
> >         might
> >         be helpful is a talk from last year on writing an SDK (but as it
> >         mostly covers the runner-sdk interaction, it's also quite useful
> for
> >         understanding the runner side:
> >
> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
> >         And please feel free to ask any questions on this list as well;
> we'd
> >         be happy to help.
> >
> >          > * Should we be targeting the latest master which is at
> >         2.12-SNAPSHOT or a stable version?
> >
> >         I would target the latest master.
> >
> >          > * After a runner is developed, how is the maintenance
> >         typically handled, as the runners seems to be part of Beam
> codebase?
> >
> >         Either is possible. Several runner adapters are part of the Beam
> >         codebase, but for example the IMB Streams Beam runner is not.
> There
> >         are certainly pros and cons (certainly early on when the APIs
> >         themselves were under heavy development it was easier to keep
> things
> >         in sync in the same codebase, but things have mostly stabilized
> >         now).
> >         A runner only becomes part of the Beam codebase if there are
> members
> >         of the community committed to maintaining it (which could include
> >         you). Both approaches are fine.
> >
> >         - Robert
> >
>

Re: Hazelcast Jet Runner

Posted by Maximilian Michels <mx...@apache.org>.

Hi Can,

Thanks for the update. Interesting question. Flink has an optimization 
built in called chaining which works together nicely with Beam. 
Essentially, operators which share the same partitioning get executed 
one after another inside a master operator. This saves resources.

Interestingly, Beam's Fuser for portable Runners does something similar. 
AFAIK there is no built-in solution for the old-style Runners. I think 
it would be possible to build something like this on top of the existing 
translation.

Cheers,
Max

On 20.03.19 13:07, Can Gencer wrote:
> Hi again,
> 
> We've made some progress on the runner since writing this more than a 
> month ago, the repo is available here publicly: 
> https://github.com/hazelcast/hazelcast-jet-beam-runner
> 
> Still very much a work in progress though. One of the issues I wanted to 
> raise is that currently we're translating each PTransform to a Jet 
> Vertex (could be consider analogous to a Flink operator or a vertex in 
> Tez). This is sub-optimal, since Beam creates lots of transforms for 
> computations that could be performed inside the same Vertex, such as 
> subsequent mapping transforms and many others. Ideally you only need 
> distinct vertices where the data is re-partitioned and/or shuffled. I'm 
> curious if Beam offers some way of translating the PTransform graph to a 
> more minimal set of transforms, i.e. some kind of planner or would this 
> have to be custom code? We've done a similar integration with Cascading 
> in the past and it offered a planner which given a set of rules would 
> partition the Cascading DAG into a minimal set of vertices for the same 
> DAG. Curious if Beam has any similar functionality?
> 
> 
> 
> On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles <kenn@apache.org 
> <ma...@apache.org>> wrote:
> 
>     Elaborating on what Robert alluded to: when I wrote that runner
>     author guide, portability was in its infancy. Now Beam Python can be
>     run on Flink. So that guide is primarily focused on the "deserialize
>     a Java DoFn and call its methods" approach. A decent amount of it is
>     still really important to know, but is now the responsibility of the
>     "SDK harness", aka language-specific coprocessor. For Python & Go &
>     <insert new SDK language here> you really want to use the
>     portability protos and the portable Flink runner is the best model.
> 
>     Kenn
> 
> 
>     On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw <robertwb@google.com
>     <ma...@google.com>> wrote:
> 
>         On Fri, Feb 15, 2019 at 7:36 AM Can Gencer <can@hazelcast.com
>         <ma...@hazelcast.com>> wrote:
>          >
>          > We at Hazelcast are looking into writing a Beam runner for
>         Hazelcast Jet (https://github.com/hazelcast/hazelcast-jet). I
>         wanted to introduce myself as we'll likely have questions as we
>         start development.
> 
>         Welcome!
> 
>         Hazelcast looks interesting, a Beam runner for it would be very
>         cool.
> 
>          > Some of the things I'm wondering about currently:
>          >
>          > * Currently there seems to be a guide available at
>         https://beam.apache.org/contribute/runner-guide/ , is this up to
>         date? Is there anything in specific to be aware of when starting
>         with a new runner that's not covered here?
> 
>         That looks like a pretty good starting point. At a quick glance, I
>         don't see anything that looks out of date. Another resource that
>         might
>         be helpful is a talk from last year on writing an SDK (but as it
>         mostly covers the runner-sdk interaction, it's also quite useful for
>         understanding the runner side:
>         https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
>         And please feel free to ask any questions on this list as well; we'd
>         be happy to help.
> 
>          > * Should we be targeting the latest master which is at
>         2.12-SNAPSHOT or a stable version?
> 
>         I would target the latest master.
> 
>          > * After a runner is developed, how is the maintenance
>         typically handled, as the runners seems to be part of Beam codebase?
> 
>         Either is possible. Several runner adapters are part of the Beam
>         codebase, but for example the IMB Streams Beam runner is not. There
>         are certainly pros and cons (certainly early on when the APIs
>         themselves were under heavy development it was easier to keep things
>         in sync in the same codebase, but things have mostly stabilized
>         now).
>         A runner only becomes part of the Beam codebase if there are members
>         of the community committed to maintaining it (which could include
>         you). Both approaches are fine.
> 
>         - Robert
>

Re: Hazelcast Jet Runner

Posted by Can Gencer <ca...@hazelcast.com>.

Hi again,

We've made some progress on the runner since writing this more than a month
ago, the repo is available here publicly:
https://github.com/hazelcast/hazelcast-jet-beam-runner

Still very much a work in progress though. One of the issues I wanted to
raise is that currently we're translating each PTransform to a Jet Vertex
(could be consider analogous to a Flink operator or a vertex in Tez). This
is sub-optimal, since Beam creates lots of transforms for computations that
could be performed inside the same Vertex, such as subsequent mapping
transforms and many others. Ideally you only need distinct vertices where
the data is re-partitioned and/or shuffled. I'm curious if Beam offers some
way of translating the PTransform graph to a more minimal set of
transforms, i.e. some kind of planner or would this have to be custom code?
We've done a similar integration with Cascading in the past and it offered
a planner which given a set of rules would partition the Cascading DAG into
a minimal set of vertices for the same DAG. Curious if Beam has any similar
functionality?



On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles <ke...@apache.org> wrote:

> Elaborating on what Robert alluded to: when I wrote that runner author
> guide, portability was in its infancy. Now Beam Python can be run on Flink.
> So that guide is primarily focused on the "deserialize a Java DoFn and call
> its methods" approach. A decent amount of it is still really important to
> know, but is now the responsibility of the "SDK harness", aka
> language-specific coprocessor. For Python & Go & <insert new SDK language
> here> you really want to use the portability protos and the portable Flink
> runner is the best model.
>
> Kenn
>
>
> On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Fri, Feb 15, 2019 at 7:36 AM Can Gencer <ca...@hazelcast.com> wrote:
>> >
>> > We at Hazelcast are looking into writing a Beam runner for Hazelcast
>> Jet (https://github.com/hazelcast/hazelcast-jet). I wanted to introduce
>> myself as we'll likely have questions as we start development.
>>
>> Welcome!
>>
>> Hazelcast looks interesting, a Beam runner for it would be very cool.
>>
>> > Some of the things I'm wondering about currently:
>> >
>> > * Currently there seems to be a guide available at
>> https://beam.apache.org/contribute/runner-guide/ , is this up to date?
>> Is there anything in specific to be aware of when starting with a new
>> runner that's not covered here?
>>
>> That looks like a pretty good starting point. At a quick glance, I
>> don't see anything that looks out of date. Another resource that might
>> be helpful is a talk from last year on writing an SDK (but as it
>> mostly covers the runner-sdk interaction, it's also quite useful for
>> understanding the runner side:
>>
>> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
>> And please feel free to ask any questions on this list as well; we'd
>> be happy to help.
>>
>> > * Should we be targeting the latest master which is at 2.12-SNAPSHOT or
>> a stable version?
>>
>> I would target the latest master.
>>
>> > * After a runner is developed, how is the maintenance typically
>> handled, as the runners seems to be part of Beam codebase?
>>
>> Either is possible. Several runner adapters are part of the Beam
>> codebase, but for example the IMB Streams Beam runner is not. There
>> are certainly pros and cons (certainly early on when the APIs
>> themselves were under heavy development it was easier to keep things
>> in sync in the same codebase, but things have mostly stabilized now).
>> A runner only becomes part of the Beam codebase if there are members
>> of the community committed to maintaining it (which could include
>> you). Both approaches are fine.
>>
>> - Robert
>>
>

Re: Hazelcast Jet Runner

Posted by Kenneth Knowles <ke...@apache.org>.

Elaborating on what Robert alluded to: when I wrote that runner author
guide, portability was in its infancy. Now Beam Python can be run on Flink.
So that guide is primarily focused on the "deserialize a Java DoFn and call
its methods" approach. A decent amount of it is still really important to
know, but is now the responsibility of the "SDK harness", aka
language-specific coprocessor. For Python & Go & <insert new SDK language
here> you really want to use the portability protos and the portable Flink
runner is the best model.

Kenn


On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw <ro...@google.com> wrote:

> On Fri, Feb 15, 2019 at 7:36 AM Can Gencer <ca...@hazelcast.com> wrote:
> >
> > We at Hazelcast are looking into writing a Beam runner for Hazelcast Jet
> (https://github.com/hazelcast/hazelcast-jet). I wanted to introduce
> myself as we'll likely have questions as we start development.
>
> Welcome!
>
> Hazelcast looks interesting, a Beam runner for it would be very cool.
>
> > Some of the things I'm wondering about currently:
> >
> > * Currently there seems to be a guide available at
> https://beam.apache.org/contribute/runner-guide/ , is this up to date? Is
> there anything in specific to be aware of when starting with a new runner
> that's not covered here?
>
> That looks like a pretty good starting point. At a quick glance, I
> don't see anything that looks out of date. Another resource that might
> be helpful is a talk from last year on writing an SDK (but as it
> mostly covers the runner-sdk interaction, it's also quite useful for
> understanding the runner side:
>
> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
> And please feel free to ask any questions on this list as well; we'd
> be happy to help.
>
> > * Should we be targeting the latest master which is at 2.12-SNAPSHOT or
> a stable version?
>
> I would target the latest master.
>
> > * After a runner is developed, how is the maintenance typically handled,
> as the runners seems to be part of Beam codebase?
>
> Either is possible. Several runner adapters are part of the Beam
> codebase, but for example the IMB Streams Beam runner is not. There
> are certainly pros and cons (certainly early on when the APIs
> themselves were under heavy development it was easier to keep things
> in sync in the same codebase, but things have mostly stabilized now).
> A runner only becomes part of the Beam codebase if there are members
> of the community committed to maintaining it (which could include
> you). Both approaches are fine.
>
> - Robert
>

Re: Hazelcast Jet Runner

Posted by Robert Bradshaw <ro...@google.com>.

On Fri, Feb 15, 2019 at 7:36 AM Can Gencer <ca...@hazelcast.com> wrote:
>
> We at Hazelcast are looking into writing a Beam runner for Hazelcast Jet (https://github.com/hazelcast/hazelcast-jet). I wanted to introduce myself as we'll likely have questions as we start development.

Welcome!

Hazelcast looks interesting, a Beam runner for it would be very cool.

> Some of the things I'm wondering about currently:
>
> * Currently there seems to be a guide available at https://beam.apache.org/contribute/runner-guide/ , is this up to date? Is there anything in specific to be aware of when starting with a new runner that's not covered here?

That looks like a pretty good starting point. At a quick glance, I
don't see anything that looks out of date. Another resource that might
be helpful is a talk from last year on writing an SDK (but as it
mostly covers the runner-sdk interaction, it's also quite useful for
understanding the runner side:
https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
And please feel free to ask any questions on this list as well; we'd
be happy to help.

> * Should we be targeting the latest master which is at 2.12-SNAPSHOT or a stable version?

I would target the latest master.

> * After a runner is developed, how is the maintenance typically handled, as the runners seems to be part of Beam codebase?

Either is possible. Several runner adapters are part of the Beam
codebase, but for example the IMB Streams Beam runner is not. There
are certainly pros and cons (certainly early on when the APIs
themselves were under heavy development it was easier to keep things
in sync in the same codebase, but things have mostly stabilized now).
A runner only becomes part of the Beam codebase if there are members
of the community committed to maintaining it (which could include
you). Both approaches are fine.

- Robert