Posted to dev@beam.apache.org by Pulasthi Supun Wickramasinghe <pu...@gmail.com> on 2020/03/04 19:03:12 UTC

Re: Contributing Twister2 runner to Apache Beam

Hi

I believe the pull request is pretty complete now with the help of Ismaël.
Kenn, would you be able to take a look at it and suggest any changes if
needed? The build checks and validation tests are passing at the moment.
I will start working on the documentation that you mentioned in an earlier
email separately.

Best Regards,
Pulasthi





On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <
pulasthi911@gmail.com> wrote:

> Hi All,
>
> I have created the initial pull request [1] to contribute the Twister2
> Beam runner to the Apache Beam codebase. More information on Twister2 can
> be found here[2] and the Twister2 codebase is available here[3]. At the
> moment only batch mode is supported in the runner, but we are planning to
> add stream support and implement a portable runner for Twister2 in the near
> future.
>
> As Kenn pointed out in an earlier email it would be great to have inputs
> from the community regarding this contribution since it is a sizable one. I
> am sure there are many improvements that can be done in the contributed
> codebase with input from the community.
>
> [1] https://github.com/apache/beam/pull/10888
> [2] https://twister2.org/
> [3] https://github.com/DSC-SPIDAL/twister2
>
> Best Regards,
> Pulasthi
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035
>


-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Re: Contributing Twister2 runner to Apache Beam

Posted by Pulasthi Supun Wickramasinghe <pu...@gmail.com>.
Hi Ismaël,

Thanks for the update. No problem at all; please take your time and let me
know if my assistance is needed. The virus has affected everyone's
timetables. I hope you are safe.

Best Regards,
Pulasthi

On Fri, Apr 3, 2020 at 12:14 PM Ismaël Mejía <ie...@gmail.com> wrote:

> Hello Pulasthi,
>
> Please excuse my delay. I have had probably 1/3 of my usual
> available time since the coronavirus lockdown, so I have not advanced
> as expected. I hope to catch up rapidly and ping you. Our expected
> target of merging it before the 2.21.0 release seems hard to reach
> at this point because the branch will be cut next week. I hope this is
> not a problem, but if it is, please excuse me.
>
> I would also like to take this opportunity to ask whether any other Beamer
> with more free cycles at the moment could give me an extra hand with the review.
>
> Regards,
> Ismaël


-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Re: Contributing Twister2 runner to Apache Beam

Posted by Ismaël Mejía <ie...@gmail.com>.
Hello Pulasthi,

Please excuse my delay. I have had probably 1/3 of my usual
available time since the coronavirus lockdown, so I have not advanced
as expected. I hope to catch up rapidly and ping you. Our expected
target of merging it before the 2.21.0 release seems hard to reach
at this point because the branch will be cut next week. I hope this is
not a problem, but if it is, please excuse me.

I would also like to take this opportunity to ask whether any other Beamer
with more free cycles at the moment could give me an extra hand with the review.

Regards,
Ismaël


On Fri, Apr 3, 2020 at 4:16 AM Pulasthi Supun Wickramasinghe
<pu...@gmail.com> wrote:
>
> Hi Ismaël
>
> Did you get some free time to perform a code review on the pull request?
>
> Best Regards
> Pulasthi
>
> On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik <lc...@google.com> wrote:
>>
>> I have to disagree. Allowing runners within the Apache Beam repo and SDKs to reach into each other's implementation details creates usability, feature development, maintenance, and complexity problems.
>>
>> The usability issue comes from our public core-facing APIs exposing methods that runners "need" so they can introspect details that shouldn't be visible to them (e.g. setWindowingStrategyInternal on PCollection). Getting to (1) would remove the pipeline-construction-time instances but not the execution-side ones, and there are currently 100+ usages of the @Internal annotation.
>>
>> The feature development and maintenance issues both stem from duplication of work. We need to have at least two copies of how to do something: one for runner -> SDK direct and one for the Fn API. An example of this is the timer family work, which was started and completed for the non-portable implementation, yet the portable implementation was left as future work.
>>
>> Finally, the complexity comes from how many layers we have that wrap existing components to create variants for different use cases. I'm looking at all the DoFnRunners and each of their variants and how those have layers within themselves within the SDK and how additional layers have been made to interface with runner specific internal details.
>>
>>
>> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>
>>> I do support all the efforts to get Dataflow, Flink, and Spark to (3) (Fn API). But I disagree with it as a requirement; the whole point of PTransforms with URNs is that if the runner can figure out how to execute it according to semantics, then it is fine. A runner that meets (1) and (2) but can run only a certain subset of DoFns is allowed by design (whether the subset is based on language, state/timer support, etc.).
>>>
>>> Kenn
>>>
>>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>> I would like to move away from having runners access APIs that are related to pipeline construction and other internal SDK APIs, and I would like SDKs not to inspect internal runner APIs. This would enable the community to improve each independently without needing to fix the world all the time. It would also enable the community to run a cluster that supports multiple Beam versions at the same time, and would allow the cluster to be updated independently of the pipelines it runs.
>>>>
>>>> As a community, I believe we need to achieve 1, 2 and 3. Outside of the Apache Beam repo, anyone can do whatever they want but there should be no compatibility guarantees.
>>>>
>>>> 4 and 5 are extensions that enable a richer set of pipelines to run and are optional, like many other capabilities such as whether a runner supports metrics aggregation or dynamic work rebalancing.
>>>>
>>>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>>>
>>>>> There are a lot of different meanings to "portable runner". Here are some:
>>>>>
>>>>> (1) A runner that accepts a pipeline proto and either runs it or says it cannot run it
>>>>> (2) A runner that accepts jobs via the job management APIs
>>>>> (3) A runner that executes UDFs via the Fn API
>>>>> (4) A runner that can execute multiple languages
>>>>> (5) A runner that can run cross-language transforms aka multiple languages in the same pipeline
>>>>>
>>>>> I think (1) is a very good bar, and (2) is a nice addition on top of that. Then we have a unified way to submit pipelines and understand their status.
>>>>>
>>>>> I think (3) is optional - a runner can run things however it likes, including with native implementations. And then (4) and (5) as well are just levels of feature capabilities.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <lc...@google.com> wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <al...@vanboxel.be> wrote:
>>>>>>>
>>>>>>> One last thing, for any runner after this one... wouldn't it be a good acceptance criterion to accept only portable implementations from now on?
>>>>>>>
>>>>>>>  _/
>>>>>>> _/ Alex Van Boxel
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Good points Kenn. I think we mostly agree on what has been discussed in this
>>>>>>>> thread (the pros/cons of having runners in our repository), but this is probably
>>>>>>>> not the best moment in time to change any policy in that respect.
>>>>>>>>
>>>>>>>> So if nobody objects I think we can proceed. I am OOO this week, so I have less
>>>>>>>> time to continue with the code review, but I will be back to finish the review
>>>>>>>> and hopefully finally get this merged with Pulasthi next week (sorry for the
>>>>>>>> delay).
>>>>>>>>
>>>>>>>> > (don't wait for me on code review - if Ismaël said it is good, then it is
>>>>>>>> > good.)
>>>>>>>>
>>>>>>>> Thanks for your confidence. The Twister2 runner looks good so far, but I will
>>>>>>>> confirm 100% next week :) In the meantime, if someone has some extra cycles to
>>>>>>>> take a look, extra feedback is always welcome.
>>>>>>>>
>>>>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>>>>>> >
>>>>>>>> > I haven't heard anyone suggest that we need a vote. I haven't heard anyone object to this being merged to master. Some time ago, we mostly decided to favor master instead of branches, because it is so much smoother for contributors and users.
>>>>>>>> >
>>>>>>>> > So I am poking this thread one last time and otherwise I would consider it consensus that once code review is done the runner is a part of Beam (experimental!).
>>>>>>>> >
>>>>>>>> > (don't wait for me on code review - if Ismaël said it is good, then it is good.)
>>>>>>>> >
>>>>>>>> > Kenn
>>>>>>>> >
>>>>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <pu...@gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> I understand that the discussion is at a broader level than the Twister2 runner. From my experience developing the runner, the main advantage of being inside the Beam project was the easy access to the wide range of tests and other core/utility code, as Kyle pointed out. Unmerging runners that are not properly maintained and updated would be the most logical path to follow, since the internals of the runners are only well understood by developers of that particular project. It would be unreasonable to expect the Beam community to maintain them. And since the runners do not alter the core APIs, I assume they would be easy to unmerge if the need arises.
>>>>>>>> >>
>>>>>>>> >> Talking specifically about the Twister2 runner, we hope to continue developing it to add streaming capability and to develop a portable runner as well. The team behind Twister2 is working towards getting the project into the Apache Incubator in the near future (hopefully submitting the proposal in the next couple of months).
>>>>>>>> >>
>>>>>>>> >> Best Regards,
>>>>>>>> >> Pulasthi
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>> I think we will get to a point where it makes sense for runners to
>>>>>>>> >>> live in their own repositories, with their own release cadence, but
>>>>>>>> >>> we're not at that point yet. One prerequisite is a stable API--we're
>>>>>>>> >>> closing in on that with the portability protos, but many (java)
>>>>>>>> >>> runners actually share the common runner core libraries and that is
>>>>>>>> >>> even less set in stone.
>>>>>>>> >>>
>>>>>>>> >>> On the other hand, taking responsibility for maintaining all runners
>>>>>>>> >>> is not a tenable or scalable position for the Beam project. If a
>>>>>>>> >>> runner is merged, it should be understood that it can be "un-merged"
>>>>>>>> >>> if it causes a maintenance burden. A completely separate
>>>>>>>> >>> project/repository makes this less messy.
>>>>>>>> >>>
>>>>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>>>>>> >>> >
>>>>>>>> >>> > I agree with both of you, mostly :-)
>>>>>>>> >>> >
>>>>>>>> >>> > The monorepo approach doesn't work/scale well for shipped libraries (name a Google library that silently just works and never causes any dependency problems) and the pain we feel has been constant and increasing, but I don't think we are at the breaking point.
>>>>>>>> >>> >
>>>>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to what Kyle describes. In the early stages, not having to think too hard about build/test infra, and being able to share it everywhere, is a big help, and it scales well. Eventually, shipping test utility libraries and compliance suites can be equivalent. And to your point - it is very helpful for users to know that they can use CassandraIO with the other Beam artifacts. This is why Google requires the whole big repo to depend on a single version of any externally-controlled artifact. But, yes, as a consequence it is preposterously difficult to stay up to date, since literally anything can block progress. You need a unified escalation chain for that policy to make sense. It is the definition of a healthy Apache project to *not* have that (PMC is different).
>>>>>>>> >>> >
>>>>>>>> >>> > Independent dependencies, independent git histories, and independent release cadence/process are all separate discussions.
>>>>>>>> >>> >
>>>>>>>> >>> > It is a broader question than this particular contribution, so let's merge this runner before changing our whole way of doing things :-)
>>>>>>>> >>> >
>>>>>>>> >>> > Kenn
>>>>>>>> >>> >
>>>>>>>> >>> > [1] https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext (really quite a balanced analysis)
>>>>>>>> >>> >
>>>>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <kc...@google.com> wrote:
>>>>>>>> >>> >>
>>>>>>>> >>> >> > Should runners, current and future, be in the same repository as Beam
>>>>>>>> >>> >> > core?
>>>>>>>> >>> >>
>>>>>>>> >>> >> In the distant past, runners lived in their own repositories, and then were donated to Beam. But Beam's current uber-repo setup allows a lot of convenience. For example, a ton of code (including core functionality and tests) is shared directly between runners, which is useful for keeping runners up to date and ensuring consistent behavior between them (in other words, maintainable and reliable).
>>>>>>>> >>> >>
>>>>>>>> >>> >> Generally, it is up to the authors of a particular Beam related project/subproject to decide whether to host their code in Beam or in a different repo, and up to the community to decide whether to take on the donation, as discussed in previous threads on the Twister2 runner. In this case, it seems there is agreement between the Twister2 runner authors and the community that the runner can be hosted in Beam proper.
>>>>>>>> >>> >>
>>>>>>>> >>> >> There are examples of successful independent Beam projects, such as Spotify's Scio, but having an independent project with its own releases requires a lot of dedicated resources, and the bar for entry for extending Beam should not be that high. All that's required of subproject authors is that they keep the subproject in step with Beam. If they can't maintain it any longer, the subproject can be allowed to bitrot without getting in anyone's way. On the other hand, I'm not sure of the details with Cassandra, but in general, a subproject should not have "the ability to block progress" just because it is contained in the Beam uber-repo.
>>>>>>>> >>> >>
>>>>>>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam. Exceptions are few enough to be handled on a case-by-case basis.
>>>>>>>> >>> >>
>>>>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <el...@ibiblio.org> wrote:
>>>>>>>> >>> >>>
>>>>>>>> >>> >>> Generic question without commenting on Twister2 specifically:
>>>>>>>> >>> >>>
>>>>>>>> >>> >>> Should runners, current and future, be in the same repository as Beam
>>>>>>>> >>> >>> core? Can or should they be completely separate products with their
>>>>>>>> >>> >>> own release cycles?
>>>>>>>> >>> >>>
>>>>>>>> >>> >>> Generally, loose coupling leads to more maintainable, reliable
>>>>>>>> >>> >>> projects. Specifically, Cassandra is holding back some other changes
>>>>>>>> >>> >>> in Beam and I really wish it didn't have the ability to block
>>>>>>>> >>> >>> progress. The more different runners we have in core, the worse this
>>>>>>>> >>> >>> problem is likely to become.
>>>>>>>> >>> >>>
>>>>>>>> >>> >>>
>>>>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>>>>>>> >>> >>> <pu...@gmail.com> wrote:
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> > Hi
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> > I believe the pull request is pretty complete now with the help of Ismaël. Kenn, would you be able to take a look at it and suggest any changes if needed? The build checks and validation tests are passing at the moment. I will start working on the documentation that you mentioned in an earlier email separately.
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> > Best Regards,
>>>>>>>> >>> >>> > Pulasthi
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <pu...@gmail.com> wrote:
>>>>>>>> >>> >>> >>
>>>>>>>> >>> >>> >> Hi All,
>>>>>>>> >>> >>> >>
>>>>>>>> >>> >>> >> I have created the initial pull request [1] to contribute the Twister2 Beam runner to the Apache Beam codebase. More information on Twister2 can be found here[2] and the Twister2 codebase is available here[3]. At the moment only batch mode is supported in the runner, but we are planning to add stream support and implement a portable runner for Twister2 in the near future.
>>>>>>>> >>> >>> >>
>>>>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be great to have inputs from the community regarding this contribution since it is a sizable one. I am sure there are many improvements that can be done in the contributed codebase with input from the community.
>>>>>>>> >>> >>> >>
>>>>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>>>>>>> >>> >>> >> [2] https://twister2.org/
>>>>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>>>>>>> >>> >>> >>
>>>>>>>> >>> >>> >> Best Regards,
>>>>>>>> >>> >>> >> Pulasthi
>>>>>>>> >>> >>> >> --
>>>>>>>> >>> >>> >> Pulasthi S. Wickramasinghe
>>>>>>>> >>> >>> >> PhD Candidate  | Research Assistant
>>>>>>>> >>> >>> >> School of Informatics and Computing | Digital Science Center
>>>>>>>> >>> >>> >> Indiana University, Bloomington
>>>>>>>> >>> >>> >> cell: 224-386-9035
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> >
>>>>>>>> >>> >>> > --
>>>>>>>> >>> >>> > Pulasthi S. Wickramasinghe
>>>>>>>> >>> >>> > PhD Candidate  | Research Assistant
>>>>>>>> >>> >>> > School of Informatics and Computing | Digital Science Center
>>>>>>>> >>> >>> > Indiana University, Bloomington
>>>>>>>> >>> >>> > cell: 224-386-9035
>>>>>>>> >>> >>>
>>>>>>>> >>> >>>
>>>>>>>> >>> >>>
>>>>>>>> >>> >>> --
>>>>>>>> >>> >>> Elliotte Rusty Harold
>>>>>>>> >>> >>> elharo@ibiblio.org
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >> Pulasthi S. Wickramasinghe
>>>>>>>> >> PhD Candidate  | Research Assistant
>>>>>>>> >> School of Informatics and Computing | Digital Science Center
>>>>>>>> >> Indiana University, Bloomington
>>>>>>>> >> cell: 224-386-9035
>
>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035

Re: Contributing Twister2 runner to Apache Beam

Posted by Pulasthi Supun Wickramasinghe <pu...@gmail.com>.
Hi Ismaël,

Did you get some free time to perform a code review on the pull request?

Best Regards,
Pulasthi

On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik <lc...@google.com> wrote:

> I have to disagree. Allowing runners within the Apache Beam repo and SDKs
> to reach into each other's implementation details creates usability,
> feature development, maintenance, and complexity problems.
>
> The usability issue comes from our public core facing APIs exposing
> methods that runners "need" so they can introspect details that shouldn't
> be visible to them (e.g. setWindowingStrategyInternal on PCollection).
> Getting to (1) would remove the pipeline construction-time instances but
> not the execution-side ones, and there are currently 100+ usages of
> the @Internal annotation.
>
> The feature development and maintenance issues both stem from duplication
> of work. We need to have at least two copies of how to do something, one
> that is for runner -> SDK direct and one for Fn API. An example of this is
> the timer family work, which was started and completed for the non-portable
> implementation, yet the portable implementation was left as future work.
>
> Finally, the complexity comes from how many layers we have that wrap
> existing components to create variants for different use cases. I'm looking
> at all the DoFnRunners and each of their variants, how those have layers
> within themselves within the SDK, and how additional layers have been made
> to interface with runner-specific internal details.
>
>
> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>> I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn
>> API). But I disagree with it as a requirement; the whole point of
>> PTransforms with URNs is that if the runner can figure out how to execute
>> it according to semantics, then it is fine. A runner that meets (1) and (2)
>> but can only run a certain subset of DoFns is allowed by design (whether
>> the subset is based on language, state/timer support, etc).
>>
>> Kenn
>>
>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> I would like to move away from having runners access APIs that are
>>> related to pipeline construction and other internal SDK APIs and I would
>>> like for SDKs to not inspect internal runner APIs. This would enable the
>>> community to improve each independently without needing to fix the world
>>> all the time and would enable the community to run a cluster that supports
>>> multiple Beam versions at the same time and would also allow for the
>>> cluster to be updated independently of the pipelines it runs.
>>>
>>> As a community, I believe we need to achieve 1, 2 and 3. Outside of the
>>> Apache Beam repo, anyone can do whatever they want but there should be no
>>> compatibility guarantees.
>>>
>>> 4 and 5 are extensions that enable a richer set of pipelines to run and
>>> are optional like many other parts such as if a runner supports metrics
>>> aggregation or dynamic work rebalancing.
>>>
>>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>
>>>> There are a lot of different meanings to "portable runner". Here are
>>>> some:
>>>>
>>>> (1) A runner that accepts a pipeline proto and either runs it or says
>>>> it cannot run it
>>>> (2) A runner that accepts jobs via the job management APIs
>>>> (3) A runner that executes UDFs via the Fn API
>>>> (4) A runner that can execute multiple languages
>>>> (5) A runner that can run cross-language transforms aka multiple
>>>> languages in the same pipeline
>>>>
>>>> I think (1) is a very good bar, and (2) is a nice addition on top of
>>>> that. Then we have a unified way to submit pipelines and understand their
>>>> status.
>>>>
>>>> I think (3) is optional - a runner can run things however it likes,
>>>> including with native implementations. And then (4) and (5) as well are
>>>> just levels of feature capabilities.
>>>>
>>>> Kenn
>>>>
>>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <al...@vanboxel.be>
>>>>> wrote:
>>>>>
>>>>>> One last thing, for any runner after this one... wouldn't it be a
>>>>>> good acceptance criterion to accept only portable implementations
>>>>>> from now on?
>>>>>>
>>>>>>  _/
>>>>>> _/ Alex Van Boxel
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <ie...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Good points Kenn. I think we mostly agree on what has been discussed
>>>>>>> in this thread (the pros/cons of having runners in our repository),
>>>>>>> but this is probably not the best moment in time to change any policy
>>>>>>> in that respect.
>>>>>>>
>>>>>>> So if nobody objects I think we can proceed. I am OOO this week so I
>>>>>>> have less time to continue with the code review, but I will be back to
>>>>>>> finish the review and hopefully finally get this merged with Pulasthi
>>>>>>> next week (sorry for the delay).
>>>>>>>
>>>>>>> > (don't wait for me on code review - if Ismaël said it is good,
>>>>>>> > then it is good.)
>>>>>>>
>>>>>>> Thanks for your confidence. The Twister2 runner looks good so far, but
>>>>>>> I will confirm 100% next week :) In the meantime, if someone has some
>>>>>>> extra cycles to take a look, extra feedback is always welcome.
>>>>>>>
>>>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <ke...@apache.org>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > I haven't heard anyone suggest that we need a vote. I haven't heard anyone object to this being merged to master. Some time ago, we mostly decided to favor master instead of branches, because it is so much smoother for contributors and users.
>>>>>>> >
>>>>>>> > So I am poking this thread one last time and otherwise I would consider it consensus that once code review is done the runner is a part of Beam (experimental!).
>>>>>>> >
>>>>>>> > (don't wait for me on code review - if Ismaël said it is good, then it is good.)
>>>>>>> >
>>>>>>> > Kenn
>>>>>>> >
>>>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <pulasthi911@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> I understand that the discussion is on a broader level than the Twister2 runner. From my experience developing the runner, the main advantage of being inside the Beam project was the easy access to the wide range of tests and other core/utility code, as Kyle pointed out. Unmerging runners that are not properly maintained and updated would be the most logical path to follow, since the internals of the runners are only well understood by developers of that particular project. It would be unreasonable to expect the Beam community to maintain them. And since the runners do not alter the core APIs, I assume they would be easy to unmerge if the need arises.
>>>>>>> >>
>>>>>>> >> Talking specifically about the Twister2 runner, we hope to continue developing the runner in the future to add streaming capability and to develop a portable runner as well. The team behind Twister2 is working towards the goal of getting the project into the Apache Incubator in the near future (hopefully submitting the proposal in the next couple of months).
>>>>>>> >>
>>>>>>> >> Best Regards,
>>>>>>> >> Pulasthi
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <robertwb@google.com> wrote:
>>>>>>> >>>
>>>>>>> >>> I think we will get to a point where it makes sense for runners to
>>>>>>> >>> live in their own repositories, with their own release cadence, but
>>>>>>> >>> we're not at that point yet. One prerequisite is a stable API--we're
>>>>>>> >>> closing in on that with the portability protos, but many (java)
>>>>>>> >>> runners actually share the common runner core libraries and that is
>>>>>>> >>> even less set in stone.
>>>>>>> >>>
>>>>>>> >>> On the other hand, taking responsibility for maintaining all runners
>>>>>>> >>> is not a tenable or scalable position for the Beam project. If a
>>>>>>> >>> runner is merged, it should be understood that it can be "un-merged"
>>>>>>> >>> if it causes a maintenance burden. A completely separate
>>>>>>> >>> project/repository makes this less messy.
>>>>>>> >>>
>>>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>>>>> >>> >
>>>>>>> >>> > I agree with both of you, mostly :-)
>>>>>>> >>> >
>>>>>>> >>> > The monorepo approach doesn't work/scale well for shipped libraries (name a Google library that silently just works and never causes any dependency problems) and the pain we feel has been constant and increasing, but I don't think we are at the breaking point.
>>>>>>> >>> >
>>>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to what Kyle describes. In the early stages the benefit of not having to think too hard about build/test infra and share it everywhere is a big help, and it scales well. Eventually, shipping test utility libraries and compliance suites can be equivalent. And to your point - it is very helpful for users to know that they can use CassandraIO with the other Beam artifacts. This is why Google requires the whole big repo to depend on a single version of any externally-controlled artifact. But, yes, as a consequence it is preposterously difficult to stay up to date, since literally anything can block progress. You need a unified escalation chain for that policy to make sense. It is the definition of a healthy Apache project to *not* have that (PMC is different).
>>>>>>> >>> >
>>>>>>> >>> > Independent dependencies, independent git histories, and independent release cadence/process are all separate discussions.
>>>>>>> >>> >
>>>>>>> >>> > It is a broader question than this particular contribution, so let's merge this runner before changing our whole way of doing things :-)
>>>>>>> >>> >
>>>>>>> >>> > Kenn
>>>>>>> >>> >
>>>>>>> >>> > [1] https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext (really quite a balanced analysis)
>>>>>>> >>> >
>>>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <kcweaver@google.com> wrote:
>>>>>>> >>> >>
>>>>>>> >>> >> > Should runners, current and future, be in the same repository as Beam
>>>>>>> >>> >> > core?
>>>>>>> >>> >>
>>>>>>> >>> >> In the distant past, runners lived in their own repositories, and then were donated to Beam. But Beam's current uber-repo setup allows a lot of convenience. For example, a ton of code (including core functionality and tests) is shared directly between runners, which is useful for keeping runners up to date and ensuring consistent behavior between them (in other words, maintainable and reliable).
>>>>>>> >>> >>
>>>>>>> >>> >> Generally, it is up to the authors of a particular Beam related project/subproject to decide whether to host their code in Beam or in a different repo, and up to the community to decide whether to take on the donation, as discussed in previous threads on the Twister2 runner. In this case, it seems there is agreement between the Twister2 runner authors and the community that the runner can be hosted in Beam proper.
>>>>>>> >>> >>
>>>>>>> >>> >> There are examples of successful independent Beam projects, such as Spotify's Scio, but having an independent project with its own releases requires a lot of dedicated resources, and the bar for entry for extending Beam should not be that high. All that's required of subproject authors is that they keep the subproject in step with Beam. If they can't maintain it any longer, the subproject can be allowed to bitrot without getting in anyone's way. On the other hand, I'm not sure of the details with Cassandra, but in general, a subproject should not have "the ability to block progress" just because it is contained in the Beam uber-repo.
>>>>>>> >>> >>
>>>>>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam. Exceptions are few enough to be handled on a case-by-case basis.
>>>>>>> >>> >>
>>>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <elharo@ibiblio.org> wrote:
>>>>>>> >>> >>>
>>>>>>> >>> >>> Generic question without commenting on Twister2 specifically:
>>>>>>> >>> >>>
>>>>>>> >>> >>> Should runners, current and future, be in the same repository as Beam
>>>>>>> >>> >>> core? Can or should they be completely separate products with their
>>>>>>> >>> >>> own release cycles?
>>>>>>> >>> >>>
>>>>>>> >>> >>> Generally, loose coupling leads to more maintainable, reliable
>>>>>>> >>> >>> projects. Specifically, Cassandra is holding back some other changes
>>>>>>> >>> >>> in Beam and I really wish it didn't have the ability to block
>>>>>>> >>> >>> progress. The more different runners we have in core, the worse this
>>>>>>> >>> >>> problem is likely to become.
>>>>>>> >>> >>>
>>>>>>> >>> >>>
>>>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>>>>>> >>> >>> <pu...@gmail.com> wrote:
>>>>>>> >>> >>> >
>>>>>>> >>> >>> > Hi
>>>>>>> >>> >>> >
>>>>>>> >>> >>> > I believe the pull request is pretty complete now with the help of Ismaël. Kenn, would you be able to take a look at it and suggest any changes if needed? The build checks and validation tests are passing at the moment. I will start working on the documentation that you mentioned in an earlier email separately.
>>>>>>> >>> >>> >
>>>>>>> >>> >>> > Best Regards,
>>>>>>> >>> >>> > Pulasthi
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <pu...@gmail.com> wrote:
>>>>>>> >>> >>> >>
>>>>>>> >>> >>> >> Hi All,
>>>>>>> >>> >>> >>
>>>>>>> >>> >>> >> I have created the initial pull request [1] to contribute the Twister2 Beam runner to the Apache Beam codebase. More information on Twister2 can be found here[2] and the Twister2 codebase is available here[3]. At the moment only batch mode is supported in the runner, but we are planning to add stream support and implement a portable runner for Twister2 in the near future.
>>>>>>> >>> >>> >>
>>>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be great to have inputs from the community regarding this contribution since it is a sizable one. I am sure there are many improvements that can be done in the contributed codebase with input from the community.
>>>>>>> >>> >>> >>
>>>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>>>>>> >>> >>> >> [2] https://twister2.org/
>>>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>>>>>> >>> >>> >>
>>>>>>> >>> >>> >> Best Regards,
>>>>>>> >>> >>> >> Pulasthi
>>>>>>> >>> >>> >> --
>>>>>>> >>> >>> >> Pulasthi S. Wickramasinghe
>>>>>>> >>> >>> >> PhD Candidate  | Research Assistant
>>>>>>> >>> >>> >> School of Informatics and Computing | Digital Science Center
>>>>>>> >>> >>> >> Indiana University, Bloomington
>>>>>>> >>> >>> >> cell: 224-386-9035
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> >
>>>>>>> >>> >>> > --
>>>>>>> >>> >>> > Pulasthi S. Wickramasinghe
>>>>>>> >>> >>> > PhD Candidate  | Research Assistant
>>>>>>> >>> >>> > School of Informatics and Computing | Digital Science Center
>>>>>>> >>> >>> > Indiana University, Bloomington
>>>>>>> >>> >>> > cell: 224-386-9035
>>>>>>> >>> >>>
>>>>>>> >>> >>>
>>>>>>> >>> >>>
>>>>>>> >>> >>> --
>>>>>>> >>> >>> Elliotte Rusty Harold
>>>>>>> >>> >>> elharo@ibiblio.org
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> --
>>>>>>> >> Pulasthi S. Wickramasinghe
>>>>>>> >> PhD Candidate  | Research Assistant
>>>>>>> >> School of Informatics and Computing | Digital Science Center
>>>>>>> >> Indiana University, Bloomington
>>>>>>> >> cell: 224-386-9035
>>>>>>>
>>>>>>

-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Re: Contributing Twister2 runner to Apache Beam

Posted by Luke Cwik <lc...@google.com>.
I have to disagree. Allowing runners within the Apache Beam repo and SDKs
to reach into each other's implementation details creates usability,
feature development, maintenance, and complexity problems.

The usability issue comes from our public core facing APIs exposing methods
that runners "need" so they can introspect details that shouldn't be
visible to them (e.g. setWindowingStrategyInternal on PCollection). Getting
to (1) would remove the pipeline construction-time instances but not the
execution-side ones, and there are currently 100+ usages of the @Internal
annotation.

The feature development and maintenance issues both stem from duplication
of work. We need to have at least two copies of how to do something, one
that is for runner -> SDK direct and one for Fn API. An example of this is
the timer family work, which was started and completed for the non-portable
implementation, yet the portable implementation was left as future work.

Finally, the complexity comes from how many layers we have that wrap
existing components to create variants for different use cases. I'm looking
at all the DoFnRunners and each of their variants, how those have layers
within themselves within the SDK, and how additional layers have been made
to interface with runner-specific internal details.


On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles <ke...@apache.org> wrote:

> I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn
> API). But I disagree with it as a requirement; the whole point of
> PTransforms with URNs is that if the runner can figure out how to execute
> it according to semantics, then it is fine. A runner that meets (1) and (2)
> but can only run a certain subset of DoFns is allowed by design (whether
> the subset is based on language, state/timer support, etc).
>
> Kenn
>
> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <lc...@google.com> wrote:
>
>> I would like to move away from having runners access APIs that are
>> related to pipeline construction and other internal SDK APIs and I would
>> like for SDKs to not inspect internal runner APIs. This would enable the
>> community to improve each independently without needing to fix the world
>> all the time and would enable the community to run a cluster that supports
>> multiple Beam versions at the same time and would also allow for the
>> cluster to be updated independently of the pipelines it runs.
>>
>> As a community, I believe we need to achieve 1, 2 and 3. Outside of the
>> Apache Beam repo, anyone can do whatever they want but there should be no
>> compatibility guarantees.
>>
>> 4 and 5 are extensions that enable a richer set of pipelines to run and
>> are optional like many other parts such as if a runner supports metrics
>> aggregation or dynamic work rebalancing.
>>
>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>> There are a lot of different meanings to "portable runner". Here are
>>> some:
>>>
>>> (1) A runner that accepts a pipeline proto and either runs it or says it
>>> cannot run it
>>> (2) A runner that accepts jobs via the job management APIs
>>> (3) A runner that executes UDFs via the Fn API
>>> (4) A runner that can execute multiple languages
>>> (5) A runner that can run cross-language transforms aka multiple
>>> languages in the same pipeline
>>>
>>> I think (1) is a very good bar, and (2) is a nice addition on top of
>>> that. Then we have a unified way to submit pipelines and understand their
>>> status.
>>>
>>> I think (3) is optional - a runner can run things however it likes,
>>> including with native implementations. And then (4) and (5) as well are
>>> just levels of feature capabilities.
>>>
>>> Kenn
>>>
>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <al...@vanboxel.be>
>>>> wrote:
>>>>
>>>>> One last thing, for any runner after this one... wouldn't it be a good
>>>>> acceptance criterion to accept only portable implementations from now
>>>>> on?
>>>>>
>>>>>  _/
>>>>> _/ Alex Van Boxel
>>>>>
>>>>>
>>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <ie...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Good points Kenn. I think we mostly agree on what has been discussed
>>>>>> in this thread (the pros/cons of having runners in our repository),
>>>>>> but this is probably not the best moment in time to change any policy
>>>>>> in that respect.
>>>>>>
>>>>>> So if nobody objects I think we can proceed. I am OOO this week so I
>>>>>> have less time to continue with the code review, but I will be back to
>>>>>> finish the review and hopefully finally get this merged with Pulasthi
>>>>>> next week (sorry for the delay).
>>>>>>
>>>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>>>> > it is good.)
>>>>>>
>>>>>> Thanks for your confidence. The Twister2 runner looks good so far, but
>>>>>> I will confirm 100% next week :) In the meantime, if someone has some
>>>>>> extra cycles to take a look, extra feedback is always welcome.
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <ke...@apache.org>
>>>>>> wrote:
>>>>>> >
>>>>>> > I haven't heard anyone suggest that we need a vote. I haven't heard anyone object to this being merged to master. Some time ago, we mostly decided to favor master instead of branches, because it is so much smoother for contributors and users.
>>>>>> >
>>>>>> > So I am poking this thread one last time and otherwise I would consider it consensus that once code review is done the runner is a part of Beam (experimental!).
>>>>>> >
>>>>>> > (don't wait for me on code review - if Ismaël said it is good, then it is good.)
>>>>>> >
>>>>>> > Kenn
>>>>>> >
>>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <pulasthi911@gmail.com> wrote:
>>>>>> >>
>>>>>> >> I understand that the discussion is on a broader level than the Twister2 runner. From my experience developing the runner, the main advantage of being inside the Beam project was the easy access to the wide range of tests and other core/utility code, as Kyle pointed out. Unmerging runners that are not properly maintained and updated would be the most logical path to follow, since the internals of the runners are only well understood by developers of that particular project. It would be unreasonable to expect the Beam community to maintain them. And since the runners do not alter the core APIs, I assume they would be easy to unmerge if the need arises.
>>>>>> >>
>>>>>> >> Talking specifically about the Twister2 runner, we hope to continue developing the runner in the future to add streaming capability and to develop a portable runner as well. The team behind Twister2 is working towards the goal of getting the project into the Apache Incubator in the near future (hopefully submitting the proposal in the next couple of months).
>>>>>> >>
>>>>>> >> Best Regards,
>>>>>> >> Pulasthi
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <
>>>>>> robertwb@google.com> wrote:
>>>>>> >>>
>>>>>> >>> I think we will get to a point where it makes sense for runners to
>>>>>> >>> live in their own repositories, with their own release cadence,
>>>>>> but
>>>>>> >>> we're not at that point yet. One prerequisite is a stable
>>>>>> API--we're
>>>>>> >>> closing in on that with the portability protos, but many (java)
>>>>>> >>> runners actually share the common runner core libraries and that
>>>>>> is
>>>>>> >>> even less set in stone.
>>>>>> >>>
>>>>>> >>> On the other hand, taking responsibility for maintaining all
>>>>>> runners
>>>>>> >>> is not a tenable or scalable position for the Beam project. If a
>>>>>> >>> runner is merged, it should be understood that it can be
>>>>>> "un-merged"
>>>>>> >>> if it causes a maintenance burden. A completely separate
>>>>>> >>> project/repository makes this less messy.
>>>>>> >>>
>>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <ke...@apache.org>
>>>>>> wrote:
>>>>>> >>> >
>>>>>> >>> > I agree with both of you, mostly :-)
>>>>>> >>> >
>>>>>> >>> > The monorepo approach doesn't work/scale well for shipped
>>>>>> libraries (name a Google library that silently just works and never causes
>>>>>> any dependency problems) and the pain we feel has been constant and
>>>>>> increasing, but I don't think we are at the breaking point.
>>>>>> >>> >
>>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to
>>>>>> what Kyle describes. In the early stages the benefit of not having to think
>>>>>> too hard about build/test infra and share it everywhere is a big help, and
>>>>>> it scales well. Eventually, shipping test utility libraries and compliance
>>>>>> suites can be equivalent. And to your point - it is very helpful for users
>>>>>> to know that they can use CassandraIO with the other Beam artifacts. This
>>>>>> is why Google requires the whole big repo to depend on a single version of
>>>>>> any externally-controlled artifact. But, yes, as a consequence it is
>>>>>> preposterously difficult to stay up to date, since literally anything can
>>>>>> block progress. You need a unified escalation chain for that policy to make
>>>>>> sense. It is the definition of a healthy Apache project to *not* have that
>>>>>> (PMC is different).
>>>>>> >>> >
>>>>>> >>> > Independent dependencies, independent git histories, and
>>>>>> independent release cadence/process are all separate discussions.
>>>>>> >>> >
>>>>>> >>> > It is a broader question than this particular contribution, so
>>>>>> let's merge this runner before changing our whole way of doing things :-)
>>>>>> >>> >
>>>>>> >>> > Kenn
>>>>>> >>> >
>>>>>> >>> > [1]
>>>>>> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
>>>>>> (really quite a balanced analysis)
>>>>>> >>> >
>>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <
>>>>>> kcweaver@google.com> wrote:
>>>>>> >>> >>
>>>>>> >>> >> > Should runners, current and future, be in the same
>>>>>> repository as Beam
>>>>>> >>> >> > core?
>>>>>> >>> >>
>>>>>> >>> >> In the distant past, runners lived in their own repositories,
>>>>>> and then were donated to Beam. But Beam's current uber-repo setup allows a
>>>>>> lot of convenience. For example, a ton of code (including core
>>>>>> functionality and tests) is shared directly between runners, which is
>>>>>> useful for keeping runners up to date and ensuring consistent behavior
>>>>>> between them (in other words, maintainable and reliable).
>>>>>> >>> >>
>>>>>> >>> >> Generally, it is up to the authors of a particular Beam
>>>>>> related project/subproject to decide whether to host their code in Beam or
>>>>>> in a different repo, and up to the community to decide whether to take on
>>>>>> the donation, as discussed in previous threads on the Twister2 runner. In
>>>>>> this case, it seems there is agreement between the Twister2 runner authors
>>>>>> and the community that the runner can be hosted in Beam proper.
>>>>>> >>> >>
>>>>>> >>> >> There are examples of successful independent Beam projects,
>>>>>> such as Spotify's Scio, but having an independent project with its own
>>>>>> releases requires a lot of dedicated resources, and the bar for entry for
>>>>>> extending Beam should not be that high. All that's required of subproject
>>>>>> authors is that they keep the subproject in step with Beam. If they can't
>>>>>> maintain it any longer, the subproject can be allowed to bitrot without
>>>>>> getting in anyone's way. On the other hand, I'm not sure of the details
>>>>>> with Cassandra, but in general, a subproject should not have "the ability
>>>>>> to block progress" just because it is contained in the Beam uber-repo.
>>>>>> >>> >>
>>>>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam.
>>>>>> Exceptions are few enough to be handled on a case-by-case basis.
>>>>>> >>> >>
>>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <
>>>>>> elharo@ibiblio.org> wrote:
>>>>>> >>> >>>
>>>>>> >>> >>> Generic question without commenting on Twister2 specifically:
>>>>>> >>> >>>
>>>>>> >>> >>> Should runners, current and future, be in the same repository
>>>>>> as Beam
>>>>>> >>> >>> core? Can or should they be completely separate products with
>>>>>> their
>>>>>> >>> >>> own release cycles?
>>>>>> >>> >>>
>>>>>> >>> >>> Generally, loose coupling leads to more maintainable, reliable
>>>>>> >>> >>> projects. Specifically, Cassandra is holding back some other
>>>>>> changes
>>>>>> >>> >>> in Beam and I really wish it didn't have the ability to block
>>>>>> >>> >>> progress. The more different runners we have in core, the
>>>>>> worse this
>>>>>> >>> >>> problem is likely to become.
>>>>>> >>> >>>
>>>>>> >>> >>>
>>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>>>>> >>> >>> <pu...@gmail.com> wrote:
>>>>>> >>> >>> >
>>>>>> >>> >>> > Hi
>>>>>> >>> >>> >
>>>>>> >>> >>> > I believe the pull request is pretty complete now with the
>>>>>> help of Ismaël. Kenn, would you be able to take a look at it and suggest
>>>>>> any changes if needed?. The build checks and validations tests are passing
>>>>>> at the moment.  I will start working on the documentation that you
>>>>>> mentioned in an earlier email separately.
>>>>>> >>> >>> >
>>>>>> >>> >>> > Best Regards,
>>>>>> >>> >>> > Pulasthi
>>>>>> >>> >>> >
>>>>>> >>> >>> >
>>>>>> >>> >>> >
>>>>>> >>> >>> >
>>>>>> >>> >>> >
>>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun
>>>>>> Wickramasinghe <pu...@gmail.com> wrote:
>>>>>> >>> >>> >>
>>>>>> >>> >>> >> Hi All,
>>>>>> >>> >>> >>
>>>>>> >>> >>> >> I have created the initial pull request [1] to contribute
>>>>>> the Twister2 Beam runner to the Apache Beam codebase. More information on
>>>>>> Twister2 can be found here[2] and the Twister2 codebase is available
>>>>>> here[3]. At the moment only batch mode is supported in the runner, but we
>>>>>> are planning to add stream support and implement a portable runner for
>>>>>> Twister2 in the near future.
>>>>>> >>> >>> >>
>>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be great
>>>>>> to have inputs from the community regarding this contribution since it is a
>>>>>> sizable one. I am sure there are many improvements that can be done in the
>>>>>> contributed codebase with input from the community.
>>>>>> >>> >>> >>
>>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>>>>> >>> >>> >> [2] https://twister2.org/
>>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>>>>> >>> >>> >>
>>>>>> >>> >>> >> Best Regards,
>>>>>> >>> >>> >> Pulasthi
>>>>>> >>> >>> >> --
>>>>>> >>> >>> >> Pulasthi S. Wickramasinghe
>>>>>> >>> >>> >> PhD Candidate  | Research Assistant
>>>>>> >>> >>> >> School of Informatics and Computing | Digital Science
>>>>>> Center
>>>>>> >>> >>> >> Indiana University, Bloomington
>>>>>> >>> >>> >> cell: 224-386-9035
>>>>>> >>> >>> >
>>>>>> >>> >>> >
>>>>>> >>> >>> >
>>>>>> >>> >>> > --
>>>>>> >>> >>> > Pulasthi S. Wickramasinghe
>>>>>> >>> >>> > PhD Candidate  | Research Assistant
>>>>>> >>> >>> > School of Informatics and Computing | Digital Science Center
>>>>>> >>> >>> > Indiana University, Bloomington
>>>>>> >>> >>> > cell: 224-386-9035
>>>>>> >>> >>>
>>>>>> >>> >>>
>>>>>> >>> >>>
>>>>>> >>> >>> --
>>>>>> >>> >>> Elliotte Rusty Harold
>>>>>> >>> >>> elharo@ibiblio.org
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> Pulasthi S. Wickramasinghe
>>>>>> >> PhD Candidate  | Research Assistant
>>>>>> >> School of Informatics and Computing | Digital Science Center
>>>>>> >> Indiana University, Bloomington
>>>>>> >> cell: 224-386-9035
>>>>>>
>>>>>

Re: Contributing Twister2 runner to Apache Beam

Posted by Kenneth Knowles <ke...@apache.org>.
I do support all the efforts to get Dataflow, Flink, and Spark to (3) (Fn
API). But I disagree with it as a requirement; the whole point of
PTransforms with URNs is that if the runner can figure out how to execute
them according to their semantics, then it is fine. A runner that meets (1)
and (2) but can only run a certain subset of DoFns is allowed by design
(whether the subset is based on language, state/timer support, etc.).

Kenn

On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <lc...@google.com> wrote:

> I would like to move away from having runners access APIs that are related
> to pipeline construction and other internal SDK APIs, and I would like for
> SDKs to not inspect internal runner APIs. This would let the community
> improve each independently without needing to fix the world all the time.
> It would also let the community run a cluster that supports multiple Beam
> versions at the same time, and allow the cluster to be updated
> independently of the pipelines it runs.
>
> As a community, I believe we need to achieve 1, 2 and 3. Outside of the
> Apache Beam repo, anyone can do whatever they want but there should be no
> compatibility guarantees.
>
> 4 and 5 are extensions that enable a richer set of pipelines to run. They
> are optional, like many other capabilities, such as whether a runner
> supports metrics aggregation or dynamic work rebalancing.
>
> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>> There are a lot of different meanings to "portable runner". Here are some:
>>
>> (1) A runner that accepts a pipeline proto and either runs it or says it
>> cannot run it
>> (2) A runner that accepts jobs via the job management APIs
>> (3) A runner that executes UDFs via the Fn API
>> (4) A runner that can execute multiple languages
>> (5) A runner that can run cross-language transforms aka multiple
>> languages in the same pipeline
>>
>> I think (1) is a very good bar, and (2) is a nice addition on top of
>> that. Then we have a unified way to submit pipelines and understand their
>> status.
>>
>> I think (3) is optional - a runner can run things however it likes,
>> including with native implementations. And then (4) and (5) as well are
>> just levels of feature capabilities.
>>
>> Kenn
>>
>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> +1
>>>
>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <al...@vanboxel.be>
>>> wrote:
>>>
>>>> One last thing, for any runner after this one... wouldn't it be a good
>>>> acceptance criteria to only accept portable implementations anymore?
>>>>
>>>>  _/
>>>> _/ Alex Van Boxel
>>>>
>>>>
>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>
>>>>> Good points Kenn. I think we mostly agree on what has been discussed
>>>>> in this
>>>>> thread the pros/cons of having runners on our repository, but this is
>>>>> probably
>>>>> not the best moment in time to change any policy in that aspect.
>>>>>
>>>>> So if nobody objects I think we can proceed. I am OOO this week so
>>>>> with less
>>>>> time to continue with the code review, but I will be back to finish
>>>>> the review
>>>>> and hopefully finally get this merged with Pulasthi next week (sorry
>>>>> for the
>>>>> delay).
>>>>>
>>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>>> it is
>>>>> > good.)
>>>>>
>>>>> Thanks for your confidence. Twister2 runners looks good so far, but I
>>>>> will
>>>>> confirm 100% next week :) In the meantime if someone has some extra
>>>>> cycles to
>>>>> take a look extra feedback is always welcome.
>>>>>
>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>> >
>>>>> > I haven't heard anyone suggest that we need a vote. I haven't heard
>>>>> anyone object to this being merged to master. Some time ago, we mostly
>>>>> decided to favor master instead of branches, because it is so much smoother
>>>>> for contributors and users.
>>>>> >
>>>>> > So I am poking this thread one last time and otherwise I would
>>>>> consider it consensus that once code review is done the runner is a part of
>>>>> Beam (experimental!).
>>>>> >
>>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>>> it is good.)
>>>>> >
>>>>> > Kenn
>>>>> >
>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <
>>>>> pulasthi911@gmail.com> wrote:
>>>>> >>
>>>>> >> I understand that the discussion is on a more broad level than the
>>>>> Twister2 runner. From my experience developing the runner the main
>>>>> advantage of being inside the beam project was the easy access to the wide
>>>>> range of tests and other core/utility code as Kyle pointed out. Unmerging
>>>>> runners that are not properly maintained and updated would be the most
>>>>> logical path to follow since the internals of the runners are only well
>>>>> understood by developers of that particular project. It would be
>>>>> unreasonable to expect the Beam community to maintain them. And since the
>>>>> runners do not alter the core API's I assume they would be easy to unmerge
>>>>> if the need arises.
>>>>> >>
>>>>> >> Talking specifically about Twister2 runner, we hope to continue
>>>>> developing the runner in the future to add both streaming capability and
>>>>> develop a portable runner as well. The team behind Twister2 is working
>>>>> towards the goal to get the project into Apache Incubator in the near
>>>>> future (Hopefully to submit the proposal in the next couple of months).
>>>>> >>
>>>>> >> Best Regards,
>>>>> >> Pulasthi
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>> I think we will get to a point where it makes sense for runners to
>>>>> >>> live in their own repositories, with their own release cadence, but
>>>>> >>> we're not at that point yet. One prerequisite is a stable
>>>>> API--we're
>>>>> >>> closing in on that with the portability protos, but many (java)
>>>>> >>> runners actually share the common runner core libraries and that is
>>>>> >>> even less set in stone.
>>>>> >>>
>>>>> >>> On the other hand, taking responsibility for maintaining all
>>>>> runners
>>>>> >>> is not a tenable or scalable position for the Beam project. If a
>>>>> >>> runner is merged, it should be understood that it can be
>>>>> "un-merged"
>>>>> >>> if it causes a maintenance burden. A completely separate
>>>>> >>> project/repository makes this less messy.
>>>>> >>>
>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>> >>> >
>>>>> >>> > I agree with both of you, mostly :-)
>>>>> >>> >
>>>>> >>> > The monorepo approach doesn't work/scale well for shipped
>>>>> libraries (name a Google library that silently just works and never causes
>>>>> any dependency problems) and the pain we feel has been constant and
>>>>> increasing, but I don't think we are at the breaking point.
>>>>> >>> >
>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to
>>>>> what Kyle describes. In the early stages the benefit of not having to think
>>>>> too hard about build/test infra and share it everywhere is a big help, and
>>>>> it scales well. Eventually, shipping test utility libraries and compliance
>>>>> suites can be equivalent. And to your point - it is very helpful for users
>>>>> to know that they can use CassandraIO with the other Beam artifacts. This
>>>>> is why Google requires the whole big repo to depend on a single version of
>>>>> any externally-controlled artifact. But, yes, as a consequence it is
>>>>> preposterously difficult to stay up to date, since literally anything can
>>>>> block progress. You need a unified escalation chain for that policy to make
>>>>> sense. It is the definition of a healthy Apache project to *not* have that
>>>>> (PMC is different).
>>>>> >>> >
>>>>> >>> > Independent dependencies, independent git histories, and
>>>>> independent release cadence/process are all separate discussions.
>>>>> >>> >
>>>>> >>> > It is a broader question than this particular contribution, so
>>>>> let's merge this runner before changing our whole way of doing things :-)
>>>>> >>> >
>>>>> >>> > Kenn
>>>>> >>> >
>>>>> >>> > [1]
>>>>> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
>>>>> (really quite a balanced analysis)
>>>>> >>> >
>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <kc...@google.com>
>>>>> wrote:
>>>>> >>> >>
>>>>> >>> >> > Should runners, current and future, be in the same repository
>>>>> as Beam
>>>>> >>> >> > core?
>>>>> >>> >>
>>>>> >>> >> In the distant past, runners lived in their own repositories,
>>>>> and then were donated to Beam. But Beam's current uber-repo setup allows a
>>>>> lot of convenience. For example, a ton of code (including core
>>>>> functionality and tests) is shared directly between runners, which is
>>>>> useful for keeping runners up to date and ensuring consistent behavior
>>>>> between them (in other words, maintainable and reliable).
>>>>> >>> >>
>>>>> >>> >> Generally, it is up to the authors of a particular Beam related
>>>>> project/subproject to decide whether to host their code in Beam or in a
>>>>> different repo, and up to the community to decide whether to take on the
>>>>> donation, as discussed in previous threads on the Twister2 runner. In this
>>>>> case, it seems there is agreement between the Twister2 runner authors and
>>>>> the community that the runner can be hosted in Beam proper.
>>>>> >>> >>
>>>>> >>> >> There are examples of successful independent Beam projects,
>>>>> such as Spotify's Scio, but having an independent project with its own
>>>>> releases requires a lot of dedicated resources, and the bar for entry for
>>>>> extending Beam should not be that high. All that's required of subproject
>>>>> authors is that they keep the subproject in step with Beam. If they can't
>>>>> maintain it any longer, the subproject can be allowed to bitrot without
>>>>> getting in anyone's way. On the other hand, I'm not sure of the details
>>>>> with Cassandra, but in general, a subproject should not have "the ability
>>>>> to block progress" just because it is contained in the Beam uber-repo.
>>>>> >>> >>
>>>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam.
>>>>> Exceptions are few enough to be handled on a case-by-case basis.
>>>>> >>> >>
>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <
>>>>> elharo@ibiblio.org> wrote:
>>>>> >>> >>>
>>>>> >>> >>> Generic question without commenting on Twister2 specifically:
>>>>> >>> >>>
>>>>> >>> >>> Should runners, current and future, be in the same repository
>>>>> as Beam
>>>>> >>> >>> core? Can or should they be completely separate products with
>>>>> their
>>>>> >>> >>> own release cycles?
>>>>> >>> >>>
>>>>> >>> >>> Generally, loose coupling leads to more maintainable, reliable
>>>>> >>> >>> projects. Specifically, Cassandra is holding back some other
>>>>> changes
>>>>> >>> >>> in Beam and I really wish it didn't have the ability to block
>>>>> >>> >>> progress. The more different runners we have in core, the
>>>>> worse this
>>>>> >>> >>> problem is likely to become.
>>>>> >>> >>>
>>>>> >>> >>>
>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>>>> >>> >>> <pu...@gmail.com> wrote:
>>>>> >>> >>> >
>>>>> >>> >>> > Hi
>>>>> >>> >>> >
>>>>> >>> >>> > I believe the pull request is pretty complete now with the
>>>>> help of Ismaël. Kenn, would you be able to take a look at it and suggest
>>>>> any changes if needed?. The build checks and validations tests are passing
>>>>> at the moment.  I will start working on the documentation that you
>>>>> mentioned in an earlier email separately.
>>>>> >>> >>> >
>>>>> >>> >>> > Best Regards,
>>>>> >>> >>> > Pulasthi
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun
>>>>> Wickramasinghe <pu...@gmail.com> wrote:
>>>>> >>> >>> >>
>>>>> >>> >>> >> Hi All,
>>>>> >>> >>> >>
>>>>> >>> >>> >> I have created the initial pull request [1] to contribute
>>>>> the Twister2 Beam runner to the Apache Beam codebase. More information on
>>>>> Twister2 can be found here[2] and the Twister2 codebase is available
>>>>> here[3]. At the moment only batch mode is supported in the runner, but we
>>>>> are planning to add stream support and implement a portable runner for
>>>>> Twister2 in the near future.
>>>>> >>> >>> >>
>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be great
>>>>> to have inputs from the community regarding this contribution since it is a
>>>>> sizable one. I am sure there are many improvements that can be done in the
>>>>> contributed codebase with input from the community.
>>>>> >>> >>> >>
>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>>>> >>> >>> >> [2] https://twister2.org/
>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>>>> >>> >>> >>
>>>>> >>> >>> >> Best Regards,
>>>>> >>> >>> >> Pulasthi
>>>>> >>> >>> >> --
>>>>> >>> >>> >> Pulasthi S. Wickramasinghe
>>>>> >>> >>> >> PhD Candidate  | Research Assistant
>>>>> >>> >>> >> School of Informatics and Computing | Digital Science Center
>>>>> >>> >>> >> Indiana University, Bloomington
>>>>> >>> >>> >> cell: 224-386-9035
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> >
>>>>> >>> >>> > --
>>>>> >>> >>> > Pulasthi S. Wickramasinghe
>>>>> >>> >>> > PhD Candidate  | Research Assistant
>>>>> >>> >>> > School of Informatics and Computing | Digital Science Center
>>>>> >>> >>> > Indiana University, Bloomington
>>>>> >>> >>> > cell: 224-386-9035
>>>>> >>> >>>
>>>>> >>> >>>
>>>>> >>> >>>
>>>>> >>> >>> --
>>>>> >>> >>> Elliotte Rusty Harold
>>>>> >>> >>> elharo@ibiblio.org
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Pulasthi S. Wickramasinghe
>>>>> >> PhD Candidate  | Research Assistant
>>>>> >> School of Informatics and Computing | Digital Science Center
>>>>> >> Indiana University, Bloomington
>>>>> >> cell: 224-386-9035
>>>>>
>>>>

Re: Contributing Twister2 runner to Apache Beam

Posted by Luke Cwik <lc...@google.com>.
I would like to move away from having runners access APIs that are related
to pipeline construction and other internal SDK APIs, and I would like for
SDKs to not inspect internal runner APIs. This would let the community
improve each independently without needing to fix the world all the time.
It would also let the community run a cluster that supports multiple Beam
versions at the same time, and allow the cluster to be updated
independently of the pipelines it runs.

As a community, I believe we need to achieve 1, 2 and 3. Outside of the
Apache Beam repo, anyone can do whatever they want but there should be no
compatibility guarantees.

4 and 5 are extensions that enable a richer set of pipelines to run. They
are optional, like many other capabilities, such as whether a runner
supports metrics aggregation or dynamic work rebalancing.

On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <ke...@apache.org> wrote:

> There are a lot of different meanings to "portable runner". Here are some:
>
> (1) A runner that accepts a pipeline proto and either runs it or says it
> cannot run it
> (2) A runner that accepts jobs via the job management APIs
> (3) A runner that executes UDFs via the Fn API
> (4) A runner that can execute multiple languages
> (5) A runner that can run cross-language transforms aka multiple languages
> in the same pipeline
>
> I think (1) is a very good bar, and (2) is a nice addition on top of that.
> Then we have a unified way to submit pipelines and understand their status.
>
> I think (3) is optional - a runner can run things however it likes,
> including with native implementations. And then (4) and (5) as well are
> just levels of feature capabilities.
>
> Kenn
>
> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <lc...@google.com> wrote:
>
>> +1
>>
>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <al...@vanboxel.be> wrote:
>>
>>> One last thing, for any runner after this one... wouldn't it be a good
>>> acceptance criteria to only accept portable implementations anymore?
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <ie...@gmail.com> wrote:
>>>
>>>> Good points Kenn. I think we mostly agree on what has been discussed in
>>>> this
>>>> thread the pros/cons of having runners on our repository, but this is
>>>> probably
>>>> not the best moment in time to change any policy in that aspect.
>>>>
>>>> So if nobody objects I think we can proceed. I am OOO this week so with
>>>> less
>>>> time to continue with the code review, but I will be back to finish the
>>>> review
>>>> and hopefully finally get this merged with Pulasthi next week (sorry
>>>> for the
>>>> delay).
>>>>
>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>> it is
>>>> > good.)
>>>>
>>>> Thanks for your confidence. Twister2 runners looks good so far, but I
>>>> will
>>>> confirm 100% next week :) In the meantime if someone has some extra
>>>> cycles to
>>>> take a look extra feedback is always welcome.
>>>>
>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>> >
>>>> > I haven't heard anyone suggest that we need a vote. I haven't heard
>>>> anyone object to this being merged to master. Some time ago, we mostly
>>>> decided to favor master instead of branches, because it is so much smoother
>>>> for contributors and users.
>>>> >
>>>> > So I am poking this thread one last time and otherwise I would
>>>> consider it consensus that once code review is done the runner is a part of
>>>> Beam (experimental!).
>>>> >
>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>> it is good.)
>>>> >
>>>> > Kenn
>>>> >
>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <
>>>> pulasthi911@gmail.com> wrote:
>>>> >>
>>>> >> I understand that the discussion is on a more broad level than the
>>>> Twister2 runner. From my experience developing the runner the main
>>>> advantage of being inside the beam project was the easy access to the wide
>>>> range of tests and other core/utility code as Kyle pointed out. Unmerging
>>>> runners that are not properly maintained and updated would be the most
>>>> logical path to follow since the internals of the runners are only well
>>>> understood by developers of that particular project. It would be
>>>> unreasonable to expect the Beam community to maintain them. And since the
>>>> runners do not alter the core API's I assume they would be easy to unmerge
>>>> if the need arises.
>>>> >>
>>>> >> Talking specifically about Twister2 runner, we hope to continue
>>>> developing the runner in the future to add both streaming capability and
>>>> develop a portable runner as well. The team behind Twister2 is working
>>>> towards the goal to get the project into Apache Incubator in the near
>>>> future (Hopefully to submit the proposal in the next couple of months).
>>>> >>
>>>> >> Best Regards,
>>>> >> Pulasthi
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>> >>>
>>>> >>> I think we will get to a point where it makes sense for runners to
>>>> >>> live in their own repositories, with their own release cadence, but
>>>> >>> we're not at that point yet. One prerequisite is a stable API--we're
>>>> >>> closing in on that with the portability protos, but many (java)
>>>> >>> runners actually share the common runner core libraries and that is
>>>> >>> even less set in stone.
>>>> >>>
>>>> >>> On the other hand, taking responsibility for maintaining all runners
>>>> >>> is not a tenable or scalable position for the Beam project. If a
>>>> >>> runner is merged, it should be understood that it can be "un-merged"
>>>> >>> if it causes a maintenance burden. A completely separate
>>>> >>> project/repository makes this less messy.
>>>> >>>
>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <ke...@apache.org>
>>>> wrote:
>>>> >>> >
>>>> >>> > I agree with both of you, mostly :-)
>>>> >>> >
>>>> >>> > The monorepo approach doesn't work/scale well for shipped
>>>> libraries (name a Google library that silently just works and never causes
>>>> any dependency problems) and the pain we feel has been constant and
>>>> increasing, but I don't think we are at the breaking point.
>>>> >>> >
>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to
>>>> what Kyle describes. In the early stages the benefit of not having to think
>>>> too hard about build/test infra and share it everywhere is a big help, and
>>>> it scales well. Eventually, shipping test utility libraries and compliance
>>>> suites can be equivalent. And to your point - it is very helpful for users
>>>> to know that they can use CassandraIO with the other Beam artifacts. This
>>>> is why Google requires the whole big repo to depend on a single version of
>>>> any externally-controlled artifact. But, yes, as a consequence it is
>>>> preposterously difficult to stay up to date, since literally anything can
>>>> block progress. You need a unified escalation chain for that policy to make
>>>> sense. It is the definition of a healthy Apache project to *not* have that
>>>> (PMC is different).
>>>> >>> >
>>>> >>> > Independent dependencies, independent git histories, and
>>>> independent release cadence/process are all separate discussions.
>>>> >>> >
>>>> >>> > It is a broader question than this particular contribution, so
>>>> let's merge this runner before changing our whole way of doing things :-)
>>>> >>> >
>>>> >>> > Kenn
>>>> >>> >
>>>> >>> > [1]
>>>> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
>>>> (really quite a balanced analysis)
>>>> >>> >
>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <kc...@google.com>
>>>> wrote:
>>>> >>> >>
>>>> >>> >> > Should runners, current and future, be in the same repository
>>>> as Beam
>>>> >>> >> > core?
>>>> >>> >>
>>>> >>> >> In the distant past, runners lived in their own repositories,
>>>> and then were donated to Beam. But Beam's current uber-repo setup allows a
>>>> lot of convenience. For example, a ton of code (including core
>>>> functionality and tests) is shared directly between runners, which is
>>>> useful for keeping runners up to date and ensuring consistent behavior
>>>> between them (in other words, maintainable and reliable).
>>>> >>> >>
>>>> >>> >> Generally, it is up to the authors of a particular Beam related
>>>> project/subproject to decide whether to host their code in Beam or in a
>>>> different repo, and up to the community to decide whether to take on the
>>>> donation, as discussed in previous threads on the Twister2 runner. In this
>>>> case, it seems there is agreement between the Twister2 runner authors and
>>>> the community that the runner can be hosted in Beam proper.
>>>> >>> >>
>>>> >>> >> There are examples of successful independent Beam projects, such
>>>> as Spotify's Scio, but having an independent project with its own releases
>>>> requires a lot of dedicated resources, and the bar for entry for extending
>>>> Beam should not be that high. All that's required of subproject authors is
>>>> that they keep the subproject in step with Beam. If they can't maintain it
>>>> any longer, the subproject can be allowed to bitrot without getting in
>>>> anyone's way. On the other hand, I'm not sure of the details with
>>>> Cassandra, but in general, a subproject should not have "the ability to
>>>> block progress" just because it is contained in the Beam uber-repo.
>>>> >>> >>
>>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam.
>>>> Exceptions are few enough to be handled on a case-by-case basis.
>>>> >>> >>
>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <
>>>> elharo@ibiblio.org> wrote:
>>>> >>> >>>
>>>> >>> >>> Generic question without commenting on Twister2 specifically:
>>>> >>> >>>
>>>> >>> >>> Should runners, current and future, be in the same repository
>>>> as Beam
>>>> >>> >>> core? Can or should they be completely separate products with
>>>> their
>>>> >>> >>> own release cycles?
>>>> >>> >>>
>>>> >>> >>> Generally, loose coupling leads to more maintainable, reliable
>>>> >>> >>> projects. Specifically, Cassandra is holding back some other
>>>> changes
>>>> >>> >>> in Beam and I really wish it didn't have the ability to block
>>>> >>> >>> progress. The more different runners we have in core, the worse
>>>> this
>>>> >>> >>> problem is likely to become.
>>>> >>> >>>
>>>> >>> >>>
>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>>> >>> >>> <pu...@gmail.com> wrote:
>>>> >>> >>> >
>>>> >>> >>> > Hi
>>>> >>> >>> >
>>>> >>> >>> > I believe the pull request is pretty complete now with the
>>>> help of Ismaël. Kenn, would you be able to take a look at it and suggest
>>>> any changes if needed?. The build checks and validations tests are passing
>>>> at the moment.  I will start working on the documentation that you
>>>> mentioned in an earlier email separately.
>>>> >>> >>> >
>>>> >>> >>> > Best Regards,
>>>> >>> >>> > Pulasthi
>>>> >>> >>> >
>>>> >>> >>> >
>>>> >>> >>> >
>>>> >>> >>> >
>>>> >>> >>> >
>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe
>>>> <pu...@gmail.com> wrote:
>>>> >>> >>> >>
>>>> >>> >>> >> Hi All,
>>>> >>> >>> >>
>>>> >>> >>> >> I have created the initial pull request [1] to contribute
>>>> the Twister2 Beam runner to the Apache Beam codebase. More information on
>>>> Twister2 can be found here[2] and the Twister2 codebase is available
>>>> here[3]. At the moment only batch mode is supported in the runner, but we
>>>> are planning to add stream support and implement a portable runner for
>>>> Twister2 in the near future.
>>>> >>> >>> >>
>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be great to
>>>> have inputs from the community regarding this contribution since it is a
>>>> sizable one. I am sure there are many improvements that can be done in the
>>>> contributed codebase with input from the community.
>>>> >>> >>> >>
>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>>> >>> >>> >> [2] https://twister2.org/
>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>>> >>> >>> >>
>>>> >>> >>> >> Best Regards,
>>>> >>> >>> >> Pulasthi
>>>> >>> >>> >> --
>>>> >>> >>> >> Pulasthi S. Wickramasinghe
>>>> >>> >>> >> PhD Candidate  | Research Assistant
>>>> >>> >>> >> School of Informatics and Computing | Digital Science Center
>>>> >>> >>> >> Indiana University, Bloomington
>>>> >>> >>> >> cell: 224-386-9035
>>>> >>> >>> >
>>>> >>> >>> >
>>>> >>> >>> >
>>>> >>> >>> > --
>>>> >>> >>> > Pulasthi S. Wickramasinghe
>>>> >>> >>> > PhD Candidate  | Research Assistant
>>>> >>> >>> > School of Informatics and Computing | Digital Science Center
>>>> >>> >>> > Indiana University, Bloomington
>>>> >>> >>> > cell: 224-386-9035
>>>> >>> >>>
>>>> >>> >>>
>>>> >>> >>>
>>>> >>> >>> --
>>>> >>> >>> Elliotte Rusty Harold
>>>> >>> >>> elharo@ibiblio.org
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Pulasthi S. Wickramasinghe
>>>> >> PhD Candidate  | Research Assistant
>>>> >> School of Informatics and Computing | Digital Science Center
>>>> >> Indiana University, Bloomington
>>>> >> cell: 224-386-9035
>>>>
>>>

Re: Contributing Twister2 runner to Apache Beam

Posted by Kenneth Knowles <ke...@apache.org>.
There are a lot of different meanings to "portable runner". Here are some:

(1) A runner that accepts a pipeline proto and either runs it or says it
cannot run it
(2) A runner that accepts jobs via the job management APIs
(3) A runner that executes UDFs via the Fn API
(4) A runner that can execute multiple languages
(5) A runner that can run cross-language transforms aka multiple languages
in the same pipeline

I think (1) is a very good bar, and (2) is a nice addition on top of that.
Then we have a unified way to submit pipelines and understand their status.

I think (3) is optional - a runner can run things however it likes,
including with native implementations. (4) and (5) are likewise just
levels of feature capability.
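
To make the distinction concrete, here is a minimal sketch that treats the
five meanings as independent capability flags and checks only (1) as the
acceptance bar. It is not taken from the Beam codebase: the `Portability`
enum and the `meets_bar` helper are hypothetical names chosen for
illustration, and the "only (1) is required" rule is an assumption drawn
from the paragraph above.

```python
from enum import Flag, auto


class Portability(Flag):
    """Hypothetical labels for the five meanings above; not a Beam API."""
    PIPELINE_PROTO = auto()  # (1) accepts a pipeline proto, or says it cannot run it
    JOB_API = auto()         # (2) accepts jobs via the job management APIs
    FN_API = auto()          # (3) executes UDFs via the Fn API
    MULTI_LANGUAGE = auto()  # (4) can execute multiple languages
    CROSS_LANGUAGE = auto()  # (5) cross-language transforms in one pipeline


def meets_bar(caps: Portability) -> bool:
    """Assumed acceptance bar: only (1) is required; the rest are optional levels."""
    return Portability.PIPELINE_PROTO in caps


if __name__ == "__main__":
    # A made-up runner that supports (1) and (2) but runs UDFs natively.
    example = Portability.PIPELINE_PROTO | Portability.JOB_API
    print(meets_bar(example))
    # A made-up runner that only speaks the Fn API does not meet the bar.
    print(meets_bar(Portability.FN_API))
```

Modelling the meanings as flags rather than as an ordered scale reflects the
point that (3), (4) and (5) are separate capabilities a runner may or may not
have, not steps that must be reached in sequence.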

Kenn

On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <lc...@google.com> wrote:

> +1
>
> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <al...@vanboxel.be> wrote:
>
>> One last thing, for any runner after this one... wouldn't it be a good
>> acceptance criterion to accept only portable implementations from now on?
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <ie...@gmail.com> wrote:
>>
>>> Good points, Kenn. I think we mostly agree on what has been discussed in
>>> this thread (the pros/cons of having runners in our repository), but
>>> this is probably not the best moment to change any policy in that respect.
>>>
>>> So if nobody objects I think we can proceed. I am OOO this week, so I have
>>> less time to continue with the code review, but I will be back to finish
>>> the review and hopefully finally get this merged with Pulasthi next week
>>> (sorry for the delay).
>>>
>>> > (don't wait for me on code review - if Ismaël said it is good, then it
>>> > is good.)
>>>
>>> Thanks for your confidence. The Twister2 runner looks good so far, but I
>>> will confirm 100% next week :) In the meantime, if someone has some extra
>>> cycles to take a look, extra feedback is always welcome.
>>>
>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <ke...@apache.org> wrote:
>>> >
>>> > I haven't heard anyone suggest that we need a vote. I haven't heard
>>> > anyone object to this being merged to master. Some time ago, we mostly
>>> > decided to favor master instead of branches, because it is so much
>>> > smoother for contributors and users.
>>> >
>>> > So I am poking this thread one last time and otherwise I would
>>> > consider it consensus that once code review is done the runner is a
>>> > part of Beam (experimental!).
>>> >
>>> > (don't wait for me on code review - if Ismaël said it is good, then it
>>> > is good.)
>>> >
>>> > Kenn
>>> >
>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe
>>> > <pulasthi911@gmail.com> wrote:
>>> >>
>>> >> I understand that the discussion is on a broader level than the
>>> >> Twister2 runner. From my experience developing the runner, the main
>>> >> advantage of being inside the Beam project was the easy access to the
>>> >> wide range of tests and other core/utility code, as Kyle pointed out.
>>> >> Unmerging runners that are not properly maintained and updated would be
>>> >> the most logical path to follow, since the internals of a runner are
>>> >> well understood only by the developers of that particular project. It
>>> >> would be unreasonable to expect the Beam community to maintain them.
>>> >> And since the runners do not alter the core APIs, I assume they would
>>> >> be easy to unmerge if the need arises.
>>> >>
>>> >> Talking specifically about the Twister2 runner, we hope to continue
>>> >> developing it to add streaming capability and to develop a portable
>>> >> runner as well. The team behind Twister2 is working towards getting the
>>> >> project into the Apache Incubator in the near future (hopefully
>>> >> submitting the proposal in the next couple of months).
>>> >>
>>> >> Best Regards,
>>> >> Pulasthi
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <ro...@google.com>
>>> >> wrote:
>>> >>>
>>> >>> I think we will get to a point where it makes sense for runners to
>>> >>> live in their own repositories, with their own release cadence, but
>>> >>> we're not at that point yet. One prerequisite is a stable API--we're
>>> >>> closing in on that with the portability protos, but many (java)
>>> >>> runners actually share the common runner core libraries and that is
>>> >>> even less set in stone.
>>> >>>
>>> >>> On the other hand, taking responsibility for maintaining all runners
>>> >>> is not a tenable or scalable position for the Beam project. If a
>>> >>> runner is merged, it should be understood that it can be "un-merged"
>>> >>> if it causes a maintenance burden. A completely separate
>>> >>> project/repository makes this less messy.
>>> >>>
>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <ke...@apache.org>
>>> >>> wrote:
>>> >>> >
>>> >>> > I agree with both of you, mostly :-)
>>> >>> >
>>> >>> > The monorepo approach doesn't work/scale well for shipped libraries
>>> >>> > (name a Google library that silently just works and never causes
>>> >>> > any dependency problems) and the pain we feel has been constant and
>>> >>> > increasing, but I don't think we are at the breaking point.
>>> >>> >
>>> >>> > But Google's big monorepo [1] demonstrates similar benefits to what
>>> >>> > Kyle describes. In the early stages, not having to think too hard
>>> >>> > about build/test infra, and being able to share it everywhere, is a
>>> >>> > big help, and it scales well. Eventually, shipping test utility
>>> >>> > libraries and compliance suites can be equivalent. And to your
>>> >>> > point - it is very helpful for users to know that they can use
>>> >>> > CassandraIO with the other Beam artifacts. This is why Google
>>> >>> > requires the whole big repo to depend on a single version of any
>>> >>> > externally-controlled artifact. But, yes, as a consequence it is
>>> >>> > preposterously difficult to stay up to date, since literally
>>> >>> > anything can block progress. You need a unified escalation chain
>>> >>> > for that policy to make sense. It is the definition of a healthy
>>> >>> > Apache project to *not* have that (PMC is different).
>>> >>> >
>>> >>> > Independent dependencies, independent git histories, and
>>> >>> > independent release cadence/process are all separate discussions.
>>> >>> >
>>> >>> > It is a broader question than this particular contribution, so
>>> >>> > let's merge this runner before changing our whole way of doing
>>> >>> > things :-)
>>> >>> >
>>> >>> > Kenn
>>> >>> >
>>> >>> > [1]
>>> >>> > https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
>>> >>> > (really quite a balanced analysis)
>>> >>> >
>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <kc...@google.com>
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> > Should runners, current and future, be in the same repository
>>> >>> >> > as Beam core?
>>> >>> >>
>>> >>> >> In the distant past, runners lived in their own repositories and
>>> >>> >> were then donated to Beam. But Beam's current uber-repo setup
>>> >>> >> allows a lot of convenience. For example, a ton of code (including
>>> >>> >> core functionality and tests) is shared directly between runners,
>>> >>> >> which is useful for keeping runners up to date and ensuring
>>> >>> >> consistent behavior between them (in other words, maintainable and
>>> >>> >> reliable).
>>> >>> >>
>>> >>> >> Generally, it is up to the authors of a particular Beam-related
>>> >>> >> project/subproject to decide whether to host their code in Beam or
>>> >>> >> in a different repo, and up to the community to decide whether to
>>> >>> >> take on the donation, as discussed in previous threads on the
>>> >>> >> Twister2 runner. In this case, it seems there is agreement between
>>> >>> >> the Twister2 runner authors and the community that the runner can
>>> >>> >> be hosted in Beam proper.
>>> >>> >>
>>> >>> >> There are examples of successful independent Beam projects, such
>>> >>> >> as Spotify's Scio, but having an independent project with its own
>>> >>> >> releases requires a lot of dedicated resources, and the bar for
>>> >>> >> entry for extending Beam should not be that high. All that's
>>> >>> >> required of subproject authors is that they keep the subproject in
>>> >>> >> step with Beam. If they can't maintain it any longer, the
>>> >>> >> subproject can be allowed to bitrot without getting in anyone's
>>> >>> >> way. On the other hand, I'm not sure of the details with
>>> >>> >> Cassandra, but in general, a subproject should not have "the
>>> >>> >> ability to block progress" just because it is contained in the
>>> >>> >> Beam uber-repo.
>>> >>> >>
>>> >>> >> tl;dr Having an uber repo generally seems to work for Beam.
>>> >>> >> Exceptions are few enough to be handled on a case-by-case basis.
>>> >>> >>
>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold
>>> >>> >> <elharo@ibiblio.org> wrote:
>>> >>> >>>
>>> >>> >>> Generic question without commenting on Twister2 specifically:
>>> >>> >>>
>>> >>> >>> Should runners, current and future, be in the same repository as
>>> >>> >>> Beam core? Can or should they be completely separate products
>>> >>> >>> with their own release cycles?
>>> >>> >>>
>>> >>> >>> Generally, loose coupling leads to more maintainable, reliable
>>> >>> >>> projects. Specifically, Cassandra is holding back some other
>>> >>> >>> changes in Beam and I really wish it didn't have the ability to
>>> >>> >>> block progress. The more different runners we have in core, the
>>> >>> >>> worse this problem is likely to become.
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>> >>> >>> <pu...@gmail.com> wrote:
>>> >>> >>> >
>>> >>> >>> > Hi
>>> >>> >>> >
>>> >>> >>> > I believe the pull request is pretty complete now with the
>>> >>> >>> > help of Ismaël. Kenn, would you be able to take a look at it
>>> >>> >>> > and suggest any changes if needed? The build checks and
>>> >>> >>> > validation tests are passing at the moment. I will start
>>> >>> >>> > working on the documentation that you mentioned in an earlier
>>> >>> >>> > email separately.
>>> >>> >>> >
>>> >>> >>> > Best Regards,
>>> >>> >>> > Pulasthi
>>> >>> >>> >
>>> >>> >>> >
>>> >>> >>> >
>>> >>> >>> >
>>> >>> >>> >
>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe
>>> >>> >>> > <pulasthi911@gmail.com> wrote:
>>> >>> >>> >>
>>> >>> >>> >> Hi All,
>>> >>> >>> >>
>>> >>> >>> >> I have created the initial pull request [1] to contribute the
>>> >>> >>> >> Twister2 Beam runner to the Apache Beam codebase. More
>>> >>> >>> >> information on Twister2 can be found here [2] and the
>>> >>> >>> >> Twister2 codebase is available here [3]. At the moment only
>>> >>> >>> >> batch mode is supported in the runner, but we are planning to
>>> >>> >>> >> add streaming support and implement a portable runner for
>>> >>> >>> >> Twister2 in the near future.
>>> >>> >>> >>
>>> >>> >>> >> As Kenn pointed out in an earlier email, it would be great to
>>> >>> >>> >> have input from the community regarding this contribution
>>> >>> >>> >> since it is a sizable one. I am sure there are many
>>> >>> >>> >> improvements that can be made in the contributed codebase
>>> >>> >>> >> with input from the community.
>>> >>> >>> >>
>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>> >>> >>> >> [2] https://twister2.org/
>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>> >>> >>> >>
>>> >>> >>> >> Best Regards,
>>> >>> >>> >> Pulasthi
>>> >>> >>> >> --
>>> >>> >>> >> Pulasthi S. Wickramasinghe
>>> >>> >>> >> PhD Candidate  | Research Assistant
>>> >>> >>> >> School of Informatics and Computing | Digital Science Center
>>> >>> >>> >> Indiana University, Bloomington
>>> >>> >>> >> cell: 224-386-9035
>>> >>> >>> >
>>> >>> >>> >
>>> >>> >>> >
>>> >>> >>> > --
>>> >>> >>> > Pulasthi S. Wickramasinghe
>>> >>> >>> > PhD Candidate  | Research Assistant
>>> >>> >>> > School of Informatics and Computing | Digital Science Center
>>> >>> >>> > Indiana University, Bloomington
>>> >>> >>> > cell: 224-386-9035
>>> >>> >>>
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> --
>>> >>> >>> Elliotte Rusty Harold
>>> >>> >>> elharo@ibiblio.org
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Pulasthi S. Wickramasinghe
>>> >> PhD Candidate  | Research Assistant
>>> >> School of Informatics and Computing | Digital Science Center
>>> >> Indiana University, Bloomington
>>> >> cell: 224-386-9035
>>>
>>

Re: Contributing Twister2 runner to Apache Beam

Posted by Luke Cwik <lc...@google.com>.
+1

On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <al...@vanboxel.be> wrote:

> One last thing, for any runner after this one... wouldn't it be a good
> acceptance criterion to accept only portable implementations from now on?
>
>  _/
> _/ Alex Van Boxel
>

Re: Contributing Twister2 runner to Apache Beam

Posted by Alex Van Boxel <al...@vanboxel.be>.
One last thing, for any runner after this one... wouldn't it be a good
acceptance criterion to accept only portable implementations from now on?

 _/
_/ Alex Van Boxel


On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <ie...@gmail.com> wrote:

> Good points, Kenn. I think we mostly agree on what has been discussed in
> this thread (the pros/cons of having runners in our repository), but
> this is probably not the best moment to change any policy in that respect.
>
> So if nobody objects I think we can proceed. I am OOO this week, so I have
> less time to continue with the code review, but I will be back to finish
> the review and hopefully finally get this merged with Pulasthi next week
> (sorry for the delay).
>
> > (don't wait for me on code review - if Ismaël said it is good, then it is
> > good.)
>
> Thanks for your confidence. The Twister2 runner looks good so far, but I
> will confirm 100% next week :) In the meantime, if someone has some extra
> cycles to take a look, extra feedback is always welcome.
>
> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > I haven't heard anyone suggest that we need a vote. I haven't heard
> anyone object to this being merged to master. Some time ago, we mostly
> decided to favor master instead of branches, because it is so much smoother
> for contributors and users.
> >
> > So I am poking this thread one last time and otherwise I would consider
> it consensus that once code review is done the runner is a part of Beam
> (experimental!).
> >
> > (don't wait for me on code review - if Ismaël said it is good, then it
> is good.)
> >
> > Kenn
> >
> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <
> pulasthi911@gmail.com> wrote:
> >>
> >> I understand that the discussion is on a more broad level than the
> Twister2 runner. From my experience developing the runner the main
> advantage of being inside the beam project was the easy access to the wide
> range of tests and other core/utility code as Kyle pointed out. Unmerging
> runners that are not properly maintained and updated would be the most
> logical path to follow since the internals of the runners are only well
> understood by developers of that particular project. It would be
> unreasonable to expect the Beam community to maintain them. And since the
> runners do not alter the core API's I assume they would be easy to unmerge
> if the need arises.
> >>
> >> Talking specifically about Twister2 runner, we hope to continue
> developing the runner in the future to add both streaming capability and
> develop a portable runner as well. The team behind Twister2 is working
> towards the goal to get the project into Apache Incubator in the near
> future (Hopefully to submit the proposal in the next couple of months).
> >>
> >> Best Regards,
> >> Pulasthi
> >>
> >>
> >>
> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >>>
> >>> I think we will get to a point where it makes sense for runners to
> >>> live in their own repositories, with their own release cadence, but
> >>> we're not at that point yet. One prerequisite is a stable API--we're
> >>> closing in on that with the portability protos, but many (java)
> >>> runners actually share the common runner core libraries and that is
> >>> even less set in stone.
> >>>
> >>> On the other hand, taking responsibility for maintaining all runners
> >>> is not a tenable or scalable position for the Beam project. If a
> >>> runner is merged, it should be understood that it can be "un-merged"
> >>> if it causes a maintenance burden. A completely separate
> >>> project/repository makes this less messy.
> >>>
> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <ke...@apache.org>
> wrote:
> >>> >
> >>> > I agree with both of you, mostly :-)
> >>> >
> >>> > The monorepo approach doesn't work/scale well for shipped libraries
> (name a Google library that silently just works and never causes any
> dependency problems) and the pain we feel has been constant and increasing,
> but I don't think we are at the breaking point.
> >>> >
> >>> > But Google's big monorepo [1] demonstrates similar benefits to what
> Kyle describes. In the early stages the benefit of not having to think too
> hard about build/test infra and share it everywhere is a big help, and it
> scales well. Eventually, shipping test utility libraries and compliance
> suites can be equivalent. And to your point - it is very helpful for users
> to know that they can use CassandraIO with the other Beam artifacts. This
> is why Google requires the whole big repo to depend on a single version of
> any externally-controlled artifact. But, yes, as a consequence it is
> preposterously difficult to stay up to date, since literally anything can
> block progress. You need a unified escalation chain for that policy to make
> sense. It is the definition of a healthy Apache project to *not* have that
> (PMC is different).
> >>> >
> >>> > Independent dependencies, independent git histories, and independent
> release cadence/process are all separate discussions.
> >>> >
> >>> > It is a broader question than this particular contribution, so let's
> merge this runner before changing our whole way of doing things :-)
> >>> >
> >>> > Kenn
> >>> >
> >>> > [1]
> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
> (really quite a balanced analysis)
> >>> >
> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <kc...@google.com>
> wrote:
> >>> >>
> >>> >> > Should runners, current and future, be in the same repository as
> Beam
> >>> >> > core?
> >>> >>
> >>> >> In the distant past, runners lived in their own repositories, and
> then were donated to Beam. But Beam's current uber-repo setup allows a lot
> of convenience. For example, a ton of code (including core functionality
> and tests) is shared directly between runners, which is useful for keeping
> runners up to date and ensuring consistent behavior between them (in other
> words, maintainable and reliable).
> >>> >>
> >>> >> Generally, it is up to the authors of a particular Beam related
> project/subproject to decide whether to host their code in Beam or in a
> different repo, and up to the community to decide whether to take on the
> donation, as discussed in previous threads on the Twister2 runner. In this
> case, it seems there is agreement between the Twister2 runner authors and
> the community that the runner can be hosted in Beam proper.
> >>> >>
> >>> >> There are examples of successful independent Beam projects, such as
> Spotify's Scio, but having an independent project with its own releases
> requires a lot of dedicated resources, and the bar for entry for extending
> Beam should not be that high. All that's required of subproject authors is
> that they keep the subproject in step with Beam. If they can't maintain it
> any longer, the subproject can be allowed to bitrot without getting in
> anyone's way. On the other hand, I'm not sure of the details with
> Cassandra, but in general, a subproject should not have "the ability to
> block progress" just because it is contained in the Beam uber-repo.
> >>> >>
> >>> >> tl;dr Having an uber repo generally seems to work for Beam.
> Exceptions are few enough to be handled on a case-by-case basis.
> >>> >>
> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <
> elharo@ibiblio.org> wrote:
> >>> >>>
> >>> >>> Generic question without commenting on Twister2 specifically:
> >>> >>>
> >>> >>> Should runners, current and future, be in the same repository as
> Beam
> >>> >>> core? Can or should they be completely separate products with their
> >>> >>> own release cycles?
> >>> >>>
> >>> >>> Generally, loose coupling leads to more maintainable, reliable
> >>> >>> projects. Specifically, Cassandra is holding back some other
> changes
> >>> >>> in Beam and I really wish it didn't have the ability to block
> >>> >>> progress. The more different runners we have in core, the worse
> this
> >>> >>> problem is likely to become.
> >>> >>>
> >>> >>>
> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
> >>> >>> <pu...@gmail.com> wrote:
> >>> >>> >
> >>> >>> > Hi
> >>> >>> >
> >>> >>> > I believe the pull request is pretty complete now with the help
> of Ismaël. Kenn, would you be able to take a look at it and suggest any
> changes if needed? The build checks and validation tests are passing at
> the moment. I will start working on the documentation that you mentioned
> in an earlier email separately.
> >>> >>> >
> >>> >>> > Best Regards,
> >>> >>> > Pulasthi
> >>> >>> >
> >>> >>> >
> >>> >>> >
> >>> >>> >
> >>> >>> >
> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <
> pulasthi911@gmail.com> wrote:
> >>> >>> >>
> >>> >>> >> Hi All,
> >>> >>> >>
> >>> >>> >> I have created the initial pull request [1] to contribute the
> Twister2 Beam runner to the Apache Beam codebase. More information on
> Twister2 can be found here[2] and the Twister2 codebase is available
> here[3]. At the moment only batch mode is supported in the runner, but we
> are planning to add stream support and implement a portable runner for
> Twister2 in the near future.
> >>> >>> >>
> >>> >>> >> As Kenn pointed out in an earlier email it would be great to
> have inputs from the community regarding this contribution since it is a
> sizable one. I am sure there are many improvements that can be done in the
> contributed codebase with input from the community.
> >>> >>> >>
> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
> >>> >>> >> [2] https://twister2.org/
> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
> >>> >>> >>
> >>> >>> >> Best Regards,
> >>> >>> >> Pulasthi
> >>> >>> >> --
> >>> >>> >> Pulasthi S. Wickramasinghe
> >>> >>> >> PhD Candidate  | Research Assistant
> >>> >>> >> School of Informatics and Computing | Digital Science Center
> >>> >>> >> Indiana University, Bloomington
> >>> >>> >> cell: 224-386-9035
> >>> >>> >
> >>> >>> >
> >>> >>> >
> >>> >>> > --
> >>> >>> > Pulasthi S. Wickramasinghe
> >>> >>> > PhD Candidate  | Research Assistant
> >>> >>> > School of Informatics and Computing | Digital Science Center
> >>> >>> > Indiana University, Bloomington
> >>> >>> > cell: 224-386-9035
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> --
> >>> >>> Elliotte Rusty Harold
> >>> >>> elharo@ibiblio.org
> >>
> >>
> >>
> >> --
> >> Pulasthi S. Wickramasinghe
> >> PhD Candidate  | Research Assistant
> >> School of Informatics and Computing | Digital Science Center
> >> Indiana University, Bloomington
> >> cell: 224-386-9035
>

Re: Contributing Twister2 runner to Apache Beam

Posted by Ismaël Mejía <ie...@gmail.com>.
Good points, Kenn. I think we mostly agree on what has been discussed in this
thread (the pros and cons of having runners in our repository), but this is
probably not the best moment to change any policy in that respect.

So if nobody objects, I think we can proceed. I am OOO this week, so I have less
time to continue with the code review, but I will be back to finish the review
and, I hope, finally get this merged with Pulasthi next week (sorry for the
delay).

> (don't wait for me on code review - if Ismaël said it is good, then it is
> good.)

Thanks for your confidence. The Twister2 runner looks good so far, but I will
confirm 100% next week :) In the meantime, if someone has some extra cycles to
take a look, extra feedback is always welcome.

On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <ke...@apache.org> wrote:
>
> I haven't heard anyone suggest that we need a vote. I haven't heard anyone object to this being merged to master. Some time ago, we mostly decided to favor master instead of branches, because it is so much smoother for contributors and users.
>
> So I am poking this thread one last time and otherwise I would consider it consensus that once code review is done the runner is a part of Beam (experimental!).
>
> (don't wait for me on code review - if Ismaël said it is good, then it is good.)
>
> Kenn
>
> On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <pu...@gmail.com> wrote:
>>
>> I understand that the discussion is at a broader level than the Twister2 runner. From my experience developing the runner, the main advantage of being inside the Beam project was the easy access to the wide range of tests and other core/utility code, as Kyle pointed out. Unmerging runners that are not properly maintained and updated would be the most logical path to follow, since the internals of the runners are only well understood by developers of that particular project. It would be unreasonable to expect the Beam community to maintain them. And since the runners do not alter the core APIs, I assume they would be easy to unmerge if the need arises.
>>
>> Talking specifically about the Twister2 runner, we hope to continue developing it to add streaming capability and a portable runner as well. The team behind Twister2 is working towards getting the project into the Apache Incubator in the near future (hopefully submitting the proposal in the next couple of months).
>>
>> Best Regards,
>> Pulasthi
>>
>>
>>
>> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <ro...@google.com> wrote:
>>>
>>> I think we will get to a point where it makes sense for runners to
>>> live in their own repositories, with their own release cadence, but
>>> we're not at that point yet. One prerequisite is a stable API--we're
>>> closing in on that with the portability protos, but many (java)
>>> runners actually share the common runner core libraries and that is
>>> even less set in stone.
>>>
>>> On the other hand, taking responsibility for maintaining all runners
>>> is not a tenable or scalable position for the Beam project. If a
>>> runner is merged, it should be understood that it can be "un-merged"
>>> if it causes a maintenance burden. A completely separate
>>> project/repository makes this less messy.
>>>
>>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <ke...@apache.org> wrote:
>>> >
>>> > I agree with both of you, mostly :-)
>>> >
>>> > The monorepo approach doesn't work/scale well for shipped libraries (name a Google library that silently just works and never causes any dependency problems) and the pain we feel has been constant and increasing, but I don't think we are at the breaking point.
>>> >
>>> > But Google's big monorepo [1] demonstrates similar benefits to what Kyle describes. In the early stages the benefit of not having to think too hard about build/test infra and share it everywhere is a big help, and it scales well. Eventually, shipping test utility libraries and compliance suites can be equivalent. And to your point - it is very helpful for users to know that they can use CassandraIO with the other Beam artifacts. This is why Google requires the whole big repo to depend on a single version of any externally-controlled artifact. But, yes, as a consequence it is preposterously difficult to stay up to date, since literally anything can block progress. You need a unified escalation chain for that policy to make sense. It is the definition of a healthy Apache project to *not* have that (PMC is different).
>>> >
>>> > Independent dependencies, independent git histories, and independent release cadence/process are all separate discussions.
>>> >
>>> > It is a broader question than this particular contribution, so let's merge this runner before changing our whole way of doing things :-)
>>> >
>>> > Kenn
>>> >
>>> > [1] https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext (really quite a balanced analysis)
>>> >
>>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <kc...@google.com> wrote:
>>> >>
>>> >> > Should runners, current and future, be in the same repository as Beam
>>> >> > core?
>>> >>
>>> >> In the distant past, runners lived in their own repositories, and then were donated to Beam. But Beam's current uber-repo setup allows a lot of convenience. For example, a ton of code (including core functionality and tests) is shared directly between runners, which is useful for keeping runners up to date and ensuring consistent behavior between them (in other words, maintainable and reliable).
>>> >>
>>> >> Generally, it is up to the authors of a particular Beam related project/subproject to decide whether to host their code in Beam or in a different repo, and up to the community to decide whether to take on the donation, as discussed in previous threads on the Twister2 runner. In this case, it seems there is agreement between the Twister2 runner authors and the community that the runner can be hosted in Beam proper.
>>> >>
>>> >> There are examples of successful independent Beam projects, such as Spotify's Scio, but having an independent project with its own releases requires a lot of dedicated resources, and the bar for entry for extending Beam should not be that high. All that's required of subproject authors is that they keep the subproject in step with Beam. If they can't maintain it any longer, the subproject can be allowed to bitrot without getting in anyone's way. On the other hand, I'm not sure of the details with Cassandra, but in general, a subproject should not have "the ability to block progress" just because it is contained in the Beam uber-repo.
>>> >>
>>> >> tl;dr Having an uber repo generally seems to work for Beam. Exceptions are few enough to be handled on a case-by-case basis.
>>> >>
>>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <el...@ibiblio.org> wrote:
>>> >>>
>>> >>> Generic question without commenting on Twister2 specifically:
>>> >>>
>>> >>> Should runners, current and future, be in the same repository as Beam
>>> >>> core? Can or should they be completely separate products with their
>>> >>> own release cycles?
>>> >>>
>>> >>> Generally, loose coupling leads to more maintainable, reliable
>>> >>> projects. Specifically, Cassandra is holding back some other changes
>>> >>> in Beam and I really wish it didn't have the ability to block
>>> >>> progress. The more different runners we have in core, the worse this
>>> >>> problem is likely to become.
>>> >>>
>>> >>>
>>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>> >>> <pu...@gmail.com> wrote:
>>> >>> >
>>> >>> > Hi
>>> >>> >
>>> >>> > I believe the pull request is pretty complete now with the help of Ismaël. Kenn, would you be able to take a look at it and suggest any changes if needed? The build checks and validation tests are passing at the moment. I will start working on the documentation that you mentioned in an earlier email separately.
>>> >>> >
>>> >>> > Best Regards,
>>> >>> > Pulasthi
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <pu...@gmail.com> wrote:
>>> >>> >>
>>> >>> >> Hi All,
>>> >>> >>
>>> >>> >> I have created the initial pull request [1] to contribute the Twister2 Beam runner to the Apache Beam codebase. More information on Twister2 can be found here[2] and the Twister2 codebase is available here[3]. At the moment only batch mode is supported in the runner, but we are planning to add stream support and implement a portable runner for Twister2 in the near future.
>>> >>> >>
>>> >>> >> As Kenn pointed out in an earlier email it would be great to have inputs from the community regarding this contribution since it is a sizable one. I am sure there are many improvements that can be done in the contributed codebase with input from the community.
>>> >>> >>
>>> >>> >> [1] https://github.com/apache/beam/pull/10888
>>> >>> >> [2] https://twister2.org/
>>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>>> >>> >>
>>> >>> >> Best Regards,
>>> >>> >> Pulasthi
>>> >>> >> --
>>> >>> >> Pulasthi S. Wickramasinghe
>>> >>> >> PhD Candidate  | Research Assistant
>>> >>> >> School of Informatics and Computing | Digital Science Center
>>> >>> >> Indiana University, Bloomington
>>> >>> >> cell: 224-386-9035
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > --
>>> >>> > Pulasthi S. Wickramasinghe
>>> >>> > PhD Candidate  | Research Assistant
>>> >>> > School of Informatics and Computing | Digital Science Center
>>> >>> > Indiana University, Bloomington
>>> >>> > cell: 224-386-9035
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Elliotte Rusty Harold
>>> >>> elharo@ibiblio.org
>>
>>
>>
>> --
>> Pulasthi S. Wickramasinghe
>> PhD Candidate  | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> cell: 224-386-9035

Re: Contributing Twister2 runner to Apache Beam

Posted by Kenneth Knowles <ke...@apache.org>.
I haven't heard anyone suggest that we need a vote. I haven't heard anyone
object to this being merged to master. Some time ago, we mostly decided to
favor master instead of branches, because it is so much smoother for
contributors and users.

So I am poking this thread one last time and otherwise I would consider it
consensus that once code review is done the runner is a part of Beam
(experimental!).

(don't wait for me on code review - if Ismaël said it is good, then it is
good.)

Kenn

On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <
pulasthi911@gmail.com> wrote:

> I understand that the discussion is at a broader level than the
> Twister2 runner. From my experience developing the runner, the main
> advantage of being inside the Beam project was the easy access to the wide
> range of tests and other core/utility code, as Kyle pointed out. Unmerging
> runners that are not properly maintained and updated would be the most
> logical path to follow, since the internals of the runners are only well
> understood by developers of that particular project. It would be
> unreasonable to expect the Beam community to maintain them. And since the
> runners do not alter the core APIs, I assume they would be easy to unmerge
> if the need arises.
>
> Talking specifically about the Twister2 runner, we hope to continue
> developing it to add streaming capability and a portable runner as well.
> The team behind Twister2 is working towards getting the project into the
> Apache Incubator in the near future (hopefully submitting the proposal in
> the next couple of months).
>
> Best Regards,
> Pulasthi
>
>
>
> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> I think we will get to a point where it makes sense for runners to
>> live in their own repositories, with their own release cadence, but
>> we're not at that point yet. One prerequisite is a stable API--we're
>> closing in on that with the portability protos, but many (java)
>> runners actually share the common runner core libraries and that is
>> even less set in stone.
>>
>> On the other hand, taking responsibility for maintaining all runners
>> is not a tenable or scalable position for the Beam project. If a
>> runner is merged, it should be understood that it can be "un-merged"
>> if it causes a maintenance burden. A completely separate
>> project/repository makes this less messy.
>>
>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <ke...@apache.org> wrote:
>> >
>> > I agree with both of you, mostly :-)
>> >
>> > The monorepo approach doesn't work/scale well for shipped libraries
>> (name a Google library that silently just works and never causes any
>> dependency problems) and the pain we feel has been constant and increasing,
>> but I don't think we are at the breaking point.
>> >
>> > But Google's big monorepo [1] demonstrates similar benefits to what
>> Kyle describes. In the early stages the benefit of not having to think too
>> hard about build/test infra and share it everywhere is a big help, and it
>> scales well. Eventually, shipping test utility libraries and compliance
>> suites can be equivalent. And to your point - it is very helpful for users
>> to know that they can use CassandraIO with the other Beam artifacts. This
>> is why Google requires the whole big repo to depend on a single version of
>> any externally-controlled artifact. But, yes, as a consequence it is
>> preposterously difficult to stay up to date, since literally anything can
>> block progress. You need a unified escalation chain for that policy to make
>> sense. It is the definition of a healthy Apache project to *not* have that
>> (PMC is different).
>> >
>> > Independent dependencies, independent git histories, and independent
>> release cadence/process are all separate discussions.
>> >
>> > It is a broader question than this particular contribution, so let's
>> merge this runner before changing our whole way of doing things :-)
>> >
>> > Kenn
>> >
>> > [1]
>> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
>> (really quite a balanced analysis)
>> >
>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <kc...@google.com>
>> wrote:
>> >>
>> >> > Should runners, current and future, be in the same repository as Beam
>> >> > core?
>> >>
>> >> In the distant past, runners lived in their own repositories, and then
>> were donated to Beam. But Beam's current uber-repo setup allows a lot of
>> convenience. For example, a ton of code (including core functionality and
>> tests) is shared directly between runners, which is useful for keeping
>> runners up to date and ensuring consistent behavior between them (in other
>> words, maintainable and reliable).
>> >>
>> >> Generally, it is up to the authors of a particular Beam related
>> project/subproject to decide whether to host their code in Beam or in a
>> different repo, and up to the community to decide whether to take on the
>> donation, as discussed in previous threads on the Twister2 runner. In this
>> case, it seems there is agreement between the Twister2 runner authors and
>> the community that the runner can be hosted in Beam proper.
>> >>
>> >> There are examples of successful independent Beam projects, such as
>> Spotify's Scio, but having an independent project with its own releases
>> requires a lot of dedicated resources, and the bar for entry for extending
>> Beam should not be that high. All that's required of subproject authors is
>> that they keep the subproject in step with Beam. If they can't maintain it
>> any longer, the subproject can be allowed to bitrot without getting in
>> anyone's way. On the other hand, I'm not sure of the details with
>> Cassandra, but in general, a subproject should not have "the ability to
>> block progress" just because it is contained in the Beam uber-repo.
>> >>
>> >> tl;dr Having an uber repo generally seems to work for Beam. Exceptions
>> are few enough to be handled on a case-by-case basis.
>> >>
>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <
>> elharo@ibiblio.org> wrote:
>> >>>
>> >>> Generic question without commenting on Twister2 specifically:
>> >>>
>> >>> Should runners, current and future, be in the same repository as Beam
>> >>> core? Can or should they be completely separate products with their
>> >>> own release cycles?
>> >>>
>> >>> Generally, loose coupling leads to more maintainable, reliable
>> >>> projects. Specifically, Cassandra is holding back some other changes
>> >>> in Beam and I really wish it didn't have the ability to block
>> >>> progress. The more different runners we have in core, the worse this
>> >>> problem is likely to become.
>> >>>
>> >>>
>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>> >>> <pu...@gmail.com> wrote:
>> >>> >
>> >>> > Hi
>> >>> >
>> >>> > I believe the pull request is pretty complete now with the help of
>> Ismaël. Kenn, would you be able to take a look at it and suggest any
>> changes if needed? The build checks and validation tests are passing at
>> the moment. I will start working on the documentation that you mentioned
>> in an earlier email separately.
>> >>> >
>> >>> > Best Regards,
>> >>> > Pulasthi
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <
>> pulasthi911@gmail.com> wrote:
>> >>> >>
>> >>> >> Hi All,
>> >>> >>
>> >>> >> I have created the initial pull request [1] to contribute the
>> Twister2 Beam runner to the Apache Beam codebase. More information on
>> Twister2 can be found here[2] and the Twister2 codebase is available
>> here[3]. At the moment only batch mode is supported in the runner, but we
>> are planning to add stream support and implement a portable runner for
>> Twister2 in the near future.
>> >>> >>
>> >>> >> As Kenn pointed out in an earlier email it would be great to have
>> inputs from the community regarding this contribution since it is a sizable
>> one. I am sure there are many improvements that can be done in the
>> contributed codebase with input from the community.
>> >>> >>
>> >>> >> [1] https://github.com/apache/beam/pull/10888
>> >>> >> [2] https://twister2.org/
>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
>> >>> >>
>> >>> >> Best Regards,
>> >>> >> Pulasthi
>> >>> >> --
>> >>> >> Pulasthi S. Wickramasinghe
>> >>> >> PhD Candidate  | Research Assistant
>> >>> >> School of Informatics and Computing | Digital Science Center
>> >>> >> Indiana University, Bloomington
>> >>> >> cell: 224-386-9035
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > Pulasthi S. Wickramasinghe
>> >>> > PhD Candidate  | Research Assistant
>> >>> > School of Informatics and Computing | Digital Science Center
>> >>> > Indiana University, Bloomington
>> >>> > cell: 224-386-9035
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Elliotte Rusty Harold
>> >>> elharo@ibiblio.org
>>
>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035
>

Re: Contributing Twister2 runner to Apache Beam

Posted by Pulasthi Supun Wickramasinghe <pu...@gmail.com>.
I understand that the discussion is at a broader level than the Twister2
runner. From my experience developing the runner, the main advantage of
being inside the Beam project was the easy access to the wide range of
tests and other core/utility code, as Kyle pointed out. Unmerging runners
that are not properly maintained and updated would be the most logical path
to follow, since the internals of the runners are only well understood by
developers of that particular project. It would be unreasonable to expect
the Beam community to maintain them. And since the runners do not alter the
core APIs, I assume they would be easy to unmerge if the need arises.

Talking specifically about the Twister2 runner, we hope to continue developing
it to add streaming capability and a portable runner as well. The team
behind Twister2 is working towards getting the project into the Apache
Incubator in the near future (hopefully submitting the proposal in the next
couple of months).

Best Regards,
Pulasthi



On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <ro...@google.com> wrote:

> I think we will get to a point where it makes sense for runners to
> live in their own repositories, with their own release cadence, but
> we're not at that point yet. One prerequisite is a stable API--we're
> closing in on that with the portability protos, but many (java)
> runners actually share the common runner core libraries and that is
> even less set in stone.
>
> On the other hand, taking responsibility for maintaining all runners
> is not a tenable or scalable position for the Beam project. If a
> runner is merged, it should be understood that it can be "un-merged"
> if it causes a maintenance burden. A completely separate
> project/repository makes this less messy.
>
> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > I agree with both of you, mostly :-)
> >
> > The monorepo approach doesn't work/scale well for shipped libraries
> (name a Google library that silently just works and never causes any
> dependency problems) and the pain we feel has been constant and increasing,
> but I don't think we are at the breaking point.
> >
> > But Google's big monorepo [1] demonstrates similar benefits to what Kyle
> describes. In the early stages the benefit of not having to think too hard
> about build/test infra and share it everywhere is a big help, and it scales
> well. Eventually, shipping test utility libraries and compliance suites can
> be equivalent. And to your point - it is very helpful for users to know
> that they can use CassandraIO with the other Beam artifacts. This is why
> Google requires the whole big repo to depend on a single version of any
> externally-controlled artifact. But, yes, as a consequence it is
> preposterously difficult to stay up to date, since literally anything can
> block progress. You need a unified escalation chain for that policy to make
> sense. It is the definition of a healthy Apache project to *not* have that
> (PMC is different).
> >
> > Independent dependencies, independent git histories, and independent
> release cadence/process are all separate discussions.
> >
> > It is a broader question than this particular contribution, so let's
> merge this runner before changing our whole way of doing things :-)
> >
> > Kenn
> >
> > [1]
> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
> (really quite a balanced analysis)
> >
> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <kc...@google.com> wrote:
> >>
> >> > Should runners, current and future, be in the same repository as Beam
> >> > core?
> >>
> >> In the distant past, runners lived in their own repositories, and then
> were donated to Beam. But Beam's current uber-repo setup allows a lot of
> convenience. For example, a ton of code (including core functionality and
> tests) is shared directly between runners, which is useful for keeping
> runners up to date and ensuring consistent behavior between them (in other
> words, maintainable and reliable).
> >>
> >> Generally, it is up to the authors of a particular Beam related
> project/subproject to decide whether to host their code in Beam or in a
> different repo, and up to the community to decide whether to take on the
> donation, as discussed in previous threads on the Twister2 runner. In this
> case, it seems there is agreement between the Twister2 runner authors and
> the community that the runner can be hosted in Beam proper.
> >>
> >> There are examples of successful independent Beam projects, such as
> Spotify's Scio, but having an independent project with its own releases
> requires a lot of dedicated resources, and the bar for entry for extending
> Beam should not be that high. All that's required of subproject authors is
> that they keep the subproject in step with Beam. If they can't maintain it
> any longer, the subproject can be allowed to bitrot without getting in
> anyone's way. On the other hand, I'm not sure of the details with
> Cassandra, but in general, a subproject should not have "the ability to
> block progress" just because it is contained in the Beam uber-repo.
> >>
> >> tl;dr Having an uber repo generally seems to work for Beam. Exceptions
> are few enough to be handled on a case-by-case basis.
> >>
> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <
> elharo@ibiblio.org> wrote:
> >>>
> >>> Generic question without commenting on Twister2 specifically:
> >>>
> >>> Should runners, current and future, be in the same repository as Beam
> >>> core? Can or should they be completely separate products with their
> >>> own release cycles?
> >>>
> >>> Generally, loose coupling leads to more maintainable, reliable
> >>> projects. Specifically, Cassandra is holding back some other changes
> >>> in Beam and I really wish it didn't have the ability to block
> >>> progress. The more different runners we have in core, the worse this
> >>> problem is likely to become.
> >>>
> >>>
> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
> >>> <pu...@gmail.com> wrote:
> >>> >
> >>> > Hi
> >>> >
> >>> > I believe the pull request is pretty complete now with the help of
> Ismaël. Kenn, would you be able to take a look at it and suggest any
> changes if needed? The build checks and validation tests are passing at
> the moment. I will start working on the documentation that you mentioned
> in an earlier email separately.
> >>> >
> >>> > Best Regards,
> >>> > Pulasthi
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <
> pulasthi911@gmail.com> wrote:
> >>> >>
> >>> >> Hi All,
> >>> >>
> >>> >> I have created the initial pull request [1] to contribute the
> Twister2 Beam runner to the Apache Beam codebase. More information on
> Twister2 can be found here[2] and the Twister2 codebase is available
> here[3]. At the moment only batch mode is supported in the runner, but we
> are planning to add stream support and implement a portable runner for
> Twister2 in the near future.
> >>> >>
> >>> >> As Kenn pointed out in an earlier email it would be great to have
> inputs from the community regarding this contribution since it is a sizable
> one. I am sure there are many improvements that can be done in the
> contributed codebase with input from the community.
> >>> >>
> >>> >> [1] https://github.com/apache/beam/pull/10888
> >>> >> [2] https://twister2.org/
> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
> >>> >>
> >>> >> Best Regards,
> >>> >> Pulasthi
> >>> >> --
> >>> >> Pulasthi S. Wickramasinghe
> >>> >> PhD Candidate  | Research Assistant
> >>> >> School of Informatics and Computing | Digital Science Center
> >>> >> Indiana University, Bloomington
> >>> >> cell: 224-386-9035
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > Pulasthi S. Wickramasinghe
> >>> > PhD Candidate  | Research Assistant
> >>> > School of Informatics and Computing | Digital Science Center
> >>> > Indiana University, Bloomington
> >>> > cell: 224-386-9035
> >>>
> >>>
> >>>
> >>> --
> >>> Elliotte Rusty Harold
> >>> elharo@ibiblio.org
>


-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Re: Contributing Twister2 runner to Apache Beam

Posted by Robert Bradshaw <ro...@google.com>.
I think we will get to a point where it makes sense for runners to
live in their own repositories, with their own release cadence, but
we're not at that point yet. One prerequisite is a stable API--we're
closing in on that with the portability protos, but many (Java)
runners actually share the common runner core libraries and that is
even less set in stone.

On the other hand, taking responsibility for maintaining all runners
is not a tenable or scalable position for the Beam project. If a
runner is merged, it should be understood that it can be "un-merged"
if it causes a maintenance burden. A completely separate
project/repository makes this less messy.


Re: Contributing Twister2 runner to Apache Beam

Posted by Kenneth Knowles <ke...@apache.org>.
I agree with both of you, mostly :-)

The monorepo approach doesn't work/scale well for shipped libraries (name a
Google library that silently just works and never causes any dependency
problems) and the pain we feel has been constant and increasing, but I
don't think we are at the breaking point.

But Google's big monorepo [1] demonstrates similar benefits to what Kyle
describes. In the early stages, not having to think too hard about
build/test infra, and being able to share it everywhere, is a big help, and
it scales well. Eventually, shipping test utility libraries and compliance
suites can be equivalent. And to your point - it is very helpful for users to know
that they can use CassandraIO with the other Beam artifacts. This is why
Google requires the whole big repo to depend on a single version of any
externally-controlled artifact. But, yes, as a consequence it is
preposterously difficult to stay up to date, since literally anything can
block progress. You need a unified escalation chain for that policy to make
sense. It is the definition of a healthy Apache project to *not* have that
(PMC is different).

Independent dependencies, independent git histories, and independent
release cadence/process are all separate discussions.

It is a broader question than this particular contribution, so let's merge
this runner before changing our whole way of doing things :-)

Kenn

[1] https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
(really quite a balanced analysis)


Re: Contributing Twister2 runner to Apache Beam

Posted by Kyle Weaver <kc...@google.com>.
> Should runners, current and future, be in the same repository as Beam
> core?

In the distant past, runners lived in their own repositories, and then were
donated to Beam. But Beam's current uber-repo setup allows a lot of
convenience. For example, a ton of code (including core functionality and
tests) is shared directly between runners, which is useful for keeping
runners up to date and ensuring consistent behavior between them (in other
words, maintainable and reliable).

Generally, it is up to the authors of a particular Beam-related
project/subproject to decide whether to host their code in Beam or in a
different repo, and up to the community to decide whether to take on the
donation, as discussed in previous threads on the Twister2 runner. In this
case, it seems there is agreement between the Twister2 runner authors and
the community that the runner can be hosted in Beam proper.

There are examples of successful independent Beam projects, such as
Spotify's Scio, but having an independent project with its own releases
requires a lot of dedicated resources, and the bar for entry for extending
Beam should not be that high. All that's required of subproject authors is
that they keep the subproject in step with Beam. If they can't maintain it
any longer, the subproject can be allowed to bitrot without getting in
anyone's way. On the other hand, I'm not sure of the details with
Cassandra, but in general, a subproject should not have "the ability to
block progress" just because it is contained in the Beam uber-repo.

tl;dr Having an uber-repo generally seems to work for Beam. Exceptions are
few enough to be handled on a case-by-case basis.


Re: Contributing Twister2 runner to Apache Beam

Posted by Elliotte Rusty Harold <el...@ibiblio.org>.
Generic question without commenting on Twister2 specifically:

Should runners, current and future, be in the same repository as Beam
core? Can or should they be completely separate products with their
own release cycles?

Generally, loose coupling leads to more maintainable, reliable
projects. Specifically, Cassandra is holding back some other changes
in Beam and I really wish it didn't have the ability to block
progress. The more different runners we have in core, the worse this
problem is likely to become.



-- 
Elliotte Rusty Harold
elharo@ibiblio.org