Posted to dev@beam.apache.org by Etienne Chauchot <ec...@apache.org> on 2019/10/10 07:39:37 UTC

[spark structured streaming runner] merge to master?

Hi guys,

You probably know that work has been under way for several months on a 
new Spark runner based on the Spark Structured Streaming framework. 
This work lives in a feature branch here: 
https://github.com/apache/beam/tree/spark-runner_structured-streaming

To attract more contributors and get some user feedback, we think it is 
time to merge it to master. Before doing so, two steps need to be completed:

- finish the work on Spark Encoders (which allow calling Beam coders), 
because right now the runner is in an unstable state: some transforms 
use the new way of doing ser/de and some use the old one, making a 
pipeline inconsistent with respect to serialization (see the sketch 
after this list)

- clean the history: it contains commits going back to November 2018, 
so there is a good amount of work and thus a considerable number of 
commits. They have already been squashed, but not those from September 
2019 onward
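
For concreteness, here is a minimal sketch of the kind of bridge the 
Encoders work involves: deferring Spark's ser/de to a Beam Coder by 
going through byte arrays. This is only an illustration of the idea, 
not the branch's actual implementation, and the helper name is made up:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.beam.sdk.coders.Coder;

    // Hypothetical helper: round-trips an element through its Beam coder
    // so that Spark can carry it as an opaque byte[] payload.
    public class BeamCoderBridge {

      public static <T> byte[] toBytes(Coder<T> coder, T value) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        coder.encode(value, out);   // the Beam coder does the serialization
        return out.toByteArray();
      }

      public static <T> T fromBytes(Coder<T> coder, byte[] bytes) throws IOException {
        return coder.decode(new ByteArrayInputStream(bytes));   // ... and the deserialization
      }
    }

Consistently routing every transform through one such path is what 
finishing the Encoders work amounts to.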

Regarding status:

- the runner passes 89% of the ValidatesRunner tests in batch mode. We 
hope to pass more with the new Encoders

- Streaming mode is barely started (waiting for multi-aggregation 
support in the Spark Structured Streaming framework from the Spark 
community)

- Runner can execute Nexmark

- Some things are not wired up yet

     - Beam Schemas not wired with Spark Schemas

     - Optional features of the model not implemented: state API, timer 
API, splittable DoFn API, …

WDYT: can we merge it to master once these two steps are done?

Best

Etienne


Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi Kenn,

Comments inline

On 10/10/2019 21:02, Kenneth Knowles wrote:
> +1
>
> I think our experiences with things that go to master early have been 
> very good. So I am in favor ASAP. We can exclude it from releases 
> easily until it is ready for end users.
I have mixed emotions on exclusion: I'd like users to try it for 
feedback, but I don't want to confuse them or make promises about 
feature completeness. How do you think we should proceed?
>
> I have the same question as Robert - how much is modifications and how 
> much is new? I notice it is in a subdirectory of the 
> beam-runners-spark module.
See my answers to Robert.
>
> I did not see any major changes to dependencies but I will also ask if 
> it has major version differences so that you might want a separate 
> artifact?

There is no dependency change at all; the two runners are in sync on 
dependencies/versions. The trick is that Spark 2.4.x provides both 
RDD/DStream and Structured Streaming.


Etienne

>
> Kenn
>
> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <robertwb@google.com> wrote:
>
>     On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot
>     <echauchot@apache.org> wrote:
>     >
>     > Hi guys,
>     >
>     > You probably know that there has been work for several months on
>     > developing a new Spark runner based on Spark Structured Streaming
>     > framework. This work is located in a feature branch here:
>     >
>     https://github.com/apache/beam/tree/spark-runner_structured-streaming
>     >
>     > To attract more contributors and get some user feedback, we
>     think it is
>     > time to merge it to master. Before doing so, some steps need to
>     be achieved:
>     >
>     > - finish the work on spark Encoders (that allow to call Beam coders)
>     > because, right now, the runner is in an unstable state (some
>     transforms
>     > use the new way of doing ser/de and some use the old one, making a
>     > pipeline inconsistent with respect to serialization)
>     >
>     > - clean history: The history contains commits from November 2018, so
>     > there is a good amount of work, thus a considerable number of commits.
>     > They have already been squashed, but not those from September 2019 onward
>
>     I don't think the number of commits should be an issue--we shouldn't
>     just squash years worth of history away. (OTOH, if this is a case of
>     this branch containing lots of little, irrelevant commits that would
>     have normally been squashed away in the normal review process we do
>     for the main branch, then, yes, some cleanup could be nice.)
>
>     > Regarding status:
>     >
>     > - the runner passes 89% of the validates runner tests in batch
>     mode. We
>     > hope to pass more with the new Encoders
>     >
>     > - Streaming mode is barely started (waiting for the
>     multi-aggregations
>     > support in spark SS framework from the Spark community)
>     >
>     > - Runner can execute Nexmark
>     >
>     > - Some things are not wired up yet
>     >
>     >      - Beam Schemas not wired with Spark Schemas
>     >
>     >      - Optional features of the model not implemented: state
>     api, timer
>     > api, splittable doFn api, …
>     >
>     > WDYT, can we merge it to master once the 2 steps are done ?
>
>     I think that as long as it sits parallel to the existing runner, and
>     is clearly marked with its status, it makes sense to me. How many
>     changes does it make to the existing codebase (as opposed to adding new
>     code)?
>

Re: [spark structured streaming runner] merge to master?

Posted by Kenneth Knowles <ke...@apache.org>.
On Thu, Oct 10, 2019 at 2:44 PM Ismaël Mejía <ie...@gmail.com> wrote:

> +1
>
> The earlier we get to master the better, to encourage not only code
> contributions but, just as importantly, early user feedback.
>
> > Question is: do we keep the "old" spark runner for a while or not (or
> just keep on previous version/tag on git) ?
>
> It is still too early to even start discussing when to remove the
> classical runner, given that the new runner is still a WIP. The
> overall goal is that this runner becomes the de facto one once its VR
> tests and performance are at least equal to the classical runner's.
> In the meantime, the best option for users is that the two co-exist;
> let's not forget that the other runner has been battle-tested for
> more than 3 years and has had lots of improvements in the last year.
>
> > I don't think the number of commits should be an issue--we shouldn't
> > just squash years worth of history away. (OTOH, if this is a case of
> > this branch containing lots of little, irrelevant commits that would
> > have normally been squashed away in the normal review process we do
> > for the main branch, then, yes, some cleanup could be nice.)
>
> About the commits: we should encourage a clear history, but we also
> have to remove useless commits that are still present in the branch.
> Commits of the “Fix errorprone” / “Cleaning” kind, and even commits
> that make better narrative sense together, should probably be
> squashed, because they do not bring much to the history. It is not
> about more or fewer commits; it is about their relevance, as Robert
> mentions.
>
> > I think our experiences with things that go to master early have been
> very good. So I am in favor ASAP. We can exclude it from releases easily
> until it is ready for end users.
> > I have the same question as Robert - how much is modifications and how
> much is new? I notice it is in a subdirectory of the beam-runners-spark
> module.
>
> In its current form we cannot exclude it, but this relates to the other
> question, so better to explain a bit of history: the new runner used
> to live in its own module and subdirectory because it is a full
> blank-page rewrite, and the decision was not to use any of the
> classical runner's classes so as not to be constrained by its evolution.
>
> However, the reason to put it back in the same module, as a
> subdirectory, was to encourage early use. In more detail: the way you
> deploy Spark jobs today is usually by packaging and staging an uber
> jar (~200 MB of pure dependency joy) that contains the user pipeline
> classes, the Spark runner module, and its dependencies. If we had two
> Spark runners in separate modules, users would need to repackage and
> redeploy their pipelines every time they wanted to switch from the
> classical Spark runner to the Structured Streaming runner, which is
> painful and time- and space-consuming compared with the one-module
> approach, where they just change the name of the runner class and
> that's it. The idea here is to make it easy for users to test the new
> runner, but at the same time to make it easy to come back to the
> classical runner in case of any issue.
>
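
For concreteness, the one-flag switch described above could look like 
the minimal sketch below. It assumes the new runner registers as a 
regular PipelineRunner; the SparkStructuredStreamingRunner name is 
illustrative:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class Launcher {
      public static void main(String[] args) {
        // Same uber jar, same pipeline code; only the flag changes:
        //   --runner=SparkRunner                     (classic RDD/DStream runner)
        //   --runner=SparkStructuredStreamingRunner  (new runner, name illustrative)
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);
        // ... apply the user pipeline's transforms here ...
        pipeline.run().waitUntilFinish();
      }
    }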

That makes sense, and it is a good reason to release it: it can just be
activated by adventurous users.

Kenn


> Ismaël
>
> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > +1
> >
> > I think our experiences with things that go to master early have been
> very good. So I am in favor ASAP. We can exclude it from releases easily
> until it is ready for end users.
> >
> > I have the same question as Robert - how much is modifications and how
> much is new? I notice it is in a subdirectory of the beam-runners-spark
> module.
> >
> > I did not see any major changes to dependencies but I will also ask if
> it has major version differences so that you might want a separate artifact?
> >
> > Kenn
> >
> > On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>
> >> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <ec...@apache.org>
> wrote:
> >> >
> >> > Hi guys,
> >> >
> >> > You probably know that there has been work for several months on
> >> > developing a new Spark runner based on Spark Structured Streaming
> >> > framework. This work is located in a feature branch here:
> >> > https://github.com/apache/beam/tree/spark-runner_structured-streaming
> >> >
> >> > To attract more contributors and get some user feedback, we think it
> is
> >> > time to merge it to master. Before doing so, some steps need to be
> achieved:
> >> >
> >> > - finish the work on spark Encoders (that allow to call Beam coders)
> >> > because, right now, the runner is in an unstable state (some
> transforms
> >> > use the new way of doing ser/de and some use the old one, making a
> >> > pipeline inconsistent with respect to serialization)
> >> >
> >> > - clean history: The history contains commits from November 2018, so
> >> > there is a good amount of work, thus a considerable number of commits.
> >> > They have already been squashed, but not those from September 2019 onward
> >>
> >> I don't think the number of commits should be an issue--we shouldn't
> >> just squash years worth of history away. (OTOH, if this is a case of
> >> this branch containing lots of little, irrelevant commits that would
> >> have normally been squashed away in the normal review process we do
> >> for the main branch, then, yes, some cleanup could be nice.)
> >>
> >> > Regarding status:
> >> >
> >> > - the runner passes 89% of the validates runner tests in batch mode.
> We
> >> > hope to pass more with the new Encoders
> >> >
> >> > - Streaming mode is barely started (waiting for the multi-aggregations
> >> > support in spark SS framework from the Spark community)
> >> >
> >> > - Runner can execute Nexmark
> >> >
> >> > - Some things are not wired up yet
> >> >
> >> >      - Beam Schemas not wired with Spark Schemas
> >> >
> >> >      - Optional features of the model not implemented:  state api,
> timer
> >> > api, splittable doFn api, …
> >> >
> >> > WDYT, can we merge it to master once the 2 steps are done ?
> >>
> >> I think that as long as it sits parallel to the existing runner, and
> >> is clearly marked with its status, it makes sense to me. How many
> >> changes does it make to the existing codebase (as opposed to add new
> >> code)?
>

Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
Another thing:

A little feedback on performance: I ran Nexmark yesterday on this 
runner and compared the results with the current runner.

There are still some failures: Q4 and Q5.

Performance is at a very early stage because the runner has not 
received much attention yet:

   - As the runner uses a binary schema (for now):
      - Catalyst cannot resolve fields and thus cannot apply some 
field-based optimizations
      - Catalyst can still apply plan simplification rules based on 
tree nodes
   - Blindly serializing the whole content of the dataset can hurt 
performance (e.g. if we filter on one field); see the sketch after 
this list
   - Performance improvements made by David Moravek on the current 
runner are not yet ported to the new runner.
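
To illustrate that last point with plain Spark code (not the runner's 
actual implementation; the "events.json" input and "color" field are 
made up): a predicate Catalyst can resolve appears in the optimized 
plan, while a black-box predicate over an opaque payload forces full 
deserialization of every element first.

    import org.apache.spark.api.java.function.FilterFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;

    public class CatalystVisibility {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]").appName("catalyst-visibility").getOrCreate();
        Dataset<Row> events = spark.read().json("events.json");

        // Field-based predicate: Catalyst resolves "color", so it can prune
        // columns and push the filter down (visible in the explained plan).
        events.filter(col("color").equalTo("blue")).explain();

        // Opaque predicate: Catalyst sees only a black-box function, so every
        // row is fully materialized before the filter runs. This is the cost
        // a binary schema implies for a filter on one field.
        events
            .filter((FilterFunction<Row>) row -> "blue".equals(row.<String>getAs("color")))
            .explain();

        spark.stop();
      }
    }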

Best

Etienne

On 24/10/2019 15:44, Etienne Chauchot wrote:
>
> Hi guys,
>
> I'm glad to announce that the PR for the merge to master of the new 
> runner based on the Spark Structured Streaming framework is submitted:
>
> https://github.com/apache/beam/pull/9866
>
>
> 1. Regarding the status of the runner:
>
> -the runner passes 93% of the validates runner tests in batch mode.
>
> -Streaming mode is barely started (waiting for the multi-aggregations 
> support in spark Structured Streaming framework from the Spark community)
>
> -Runner can execute Nexmark
>
> -Some things are not wired up yet
>
>   -Beam Schemas not wired with Spark Schemas
>
>   -Optional features of the model not implemented: state api, timer 
> api, splittable doFn api, …
>
>
> 2. Regarding the communication to users:
>
> - for reasons explained by Ismael: the runner is in the same module as 
> the "older" one. But it is in a different sub-package and both runners 
> share the same build.
>
> - How should we communicate to users:
>
>     - shall we filter out the package name from the release?
>
>     - should we release 2 jars: one for the old and one for the new ?
>
>     - should we release 3 jars: one for the new, one for the old and 
> one for both ?
>
>     - should we create a special entry to the capability matrix ?
>
> WDYT ?
>
> Best
>
> Etienne
>
>
> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>> +1 to merge.
>>
>> It is worth keeping things in master with explicitly marked status. 
>> It will make effort more visible to users and easier to get feedback 
>> upon.
>>
>> --Mikhail
>>
>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot 
>> <echauchot@apache.org> wrote:
>>
>>     Hi guys,
>>
>>     The new spark runner now supports beam coders and passes 93% of
>>     the batch validates runner tests (+4%). I think it is time to
>>     merge it to master. I will submit a PR in the coming days.
>>
>>     next steps: support schemas and thus better leverage the Catalyst
>>     optimizer (among other things, optimizations based on data), and
>>     port the performance optimizations that were done in the current
>>     runner.
>>
>>     Best
>>
>>     Etienne
>>
>>     On 11/10/2019 22:48, Pablo Estrada wrote:
>>>     +1 for merging : )
>>>
>>>     On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw
>>>     <robertwb@google.com> wrote:
>>>
>>>         Sounds like a good plan to me.
>>>
>>>         On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot
>>>         <echauchot@apache.org> wrote:
>>>
>>>             Comments inline
>>>
>>>             On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>             +1
>>>>
>>>>             The earlier we get to master the better to encourage not only code
>>>>             contributions but as important to have early user feedback.
>>>>
>>>>>             Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git) ?
>>>>             It is still too early to even start discussing when to remove the
>>>>             classical runner given that the new runner is still a WIP. However the
>>>>             overall goal is that this runner becomes the de-facto one once the VR
>>>>             tests and the performance become at least equal to the classical
>>>>             runner, in the meantime the best for users is that they co-exist,
>>>>             let’s not forget that the other runner has been already battle tested
>>>>             for more than 3 years and has had lots of improvements in the last
>>>>             year.
>>>
>>>             +1 on what Ismael says: no removal any time soon.
>>>
>>>             The plan I had in mind at first (the one I showed at the
>>>             ApacheCon) was this, but I'm proposing to move the first
>>>             gray label to before the red box.
>>>
>>>
>>>>>             I don't think the number of commits should be an issue--we shouldn't
>>>>>             just squash years worth of history away. (OTOH, if this is a case of
>>>>>             this branch containing lots of little, irrelevant commits that would
>>>>>             have normally been squashed away in the normal review process we do
>>>>>             for the main branch, then, yes, some cleanup could be nice.)
>>>>             About the commits we should encourage a clear history but we have also
>>>>             to remove useless commits that are still present in the branch,
>>>>             commits of the “Fix errorprone” / “Cleaning” kind and even commits
>>>>             that make a better narrative sense together should be probably
>>>>             squashed, because they do not bring much to the history. It is not
>>>>             about more or less commits it is about its relevance as Robert
>>>>             mentions.
>>>>
>>>>>             I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>             I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>             In its current form we cannot exclude it but this relates to the other
>>>>             question, so better to explain a bit of history: The new runner used
>>>>             to live in its own module and subdirectory because it is a full blank
>>>>             page rewrite and the decision was not to use any of the classical
>>>>             runner classes to not be constrained by its evolution.
>>>>
>>>>             However the reason to put it back in the same module as a subdirectory
>>>>             was to encourage early use, in more detail: The way you deploy spark
>>>>             jobs today is usually by packaging and staging an uber jar (~200MB of
>>>>             pure dependency joy) that contains the user pipeline classes, the
>>>>             spark runner module and its dependencies. If we have two spark runners
>>>>             in separate modules the user would need to repackage and redeploy
>>>>             their pipelines every time they want to switch from the classical
>>>>             Spark runner to the structured streaming runner which is painful and
>>>>             time and space consuming compared with the one module approach where
>>>>             they just change the name of the runner class and that’s it. The idea
>>>>             here is to make easy for users to test the new runner, but at the same
>>>>             time to make easy to come back to the classical runner in case of any
>>>>             issue.
>>>>
>>>>             Ismaël
>>>>
>>>>             On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>>>             +1
>>>>>
>>>>>             I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>
>>>>>             I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>
>>>>>             I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
>>>>>
>>>>>             Kenn
>>>>>
>>>>>             On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>             On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <ec...@apache.org> wrote:
>>>>>>>             Hi guys,
>>>>>>>
>>>>>>>             You probably know that there has been work for several months on
>>>>>>>             developing a new Spark runner based on Spark Structured Streaming
>>>>>>>             framework. This work is located in a feature branch here:
>>>>>>>             https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>
>>>>>>>             To attract more contributors and get some user feedback, we think it is
>>>>>>>             time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>>>
>>>>>>>             - finish the work on spark Encoders (that allow to call Beam coders)
>>>>>>>             because, right now, the runner is in an unstable state (some transforms
>>>>>>>             use the new way of doing ser/de and some use the old one, making a
>>>>>>>             pipeline inconsistent with respect to serialization)
>>>>>>>
>>>>>>>             - clean history: The history contains commits from November 2018, so
>>>>>>>             there is a good amount of work, thus a considerable number of commits.
>>>>>>>             They have already been squashed, but not those from September 2019 onward
>>>>>>             I don't think the number of commits should be an issue--we shouldn't
>>>>>>             just squash years worth of history away. (OTOH, if this is a case of
>>>>>>             this branch containing lots of little, irrelevant commits that would
>>>>>>             have normally been squashed away in the normal review process we do
>>>>>>             for the main branch, then, yes, some cleanup could be nice.)
>>>>>>
>>>>>>>             Regarding status:
>>>>>>>
>>>>>>>             - the runner passes 89% of the validates runner tests in batch mode. We
>>>>>>>             hope to pass more with the new Encoders
>>>>>>>
>>>>>>>             - Streaming mode is barely started (waiting for the multi-aggregations
>>>>>>>             support in spark SS framework from the Spark community)
>>>>>>>
>>>>>>>             - Runner can execute Nexmark
>>>>>>>
>>>>>>>             - Some things are not wired up yet
>>>>>>>
>>>>>>>                   - Beam Schemas not wired with Spark Schemas
>>>>>>>
>>>>>>>                   - Optional features of the model not implemented:  state api, timer
>>>>>>>             api, splittable doFn api, …
>>>>>>>
>>>>>>>             WDYT, can we merge it to master once the 2 steps are done ?
>>>>>>             I think that as long as it sits parallel to the existing runner, and
>>>>>>             is clearly marked with its status, it makes sense to me. How many
>>>>>>             changes does it make to the existing codebase (as opposed to add new
>>>>>>             code)?
>>>

Re: [spark structured streaming runner] merge to master?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
I agree, that would be the easiest way, and it would allow users to 
switch easily as well, using a single artifact.

Regards
JB

On 29/10/2019 23:54, Kenneth Knowles wrote:
> Is it just as easy to have two jars and build an uber jar with both 
> included? Then the runner can still be toggled with a flag.
>
> Kenn
>
> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko 
> <aromanenko.dev@gmail.com> wrote:
>
>     Hmm, I don’t think that jar size should play a big role compared
>     to the whole size of the shaded jar of a user's job. Even more, I
>     think it will be quite confusing for users to choose which jar to
>     use if we have 3 different ones for similar purposes. Though, let’s
>     see what others think.
>
>>     On 29 Oct 2019, at 15:32, Etienne Chauchot <echauchot@apache.org> wrote:
>>
>>     Hi Alexey,
>>
>>     Thanks for your opinion !
>>
>>     Comments inline
>>
>>     Etienne
>>
>>     On 28/10/2019 17:34, Alexey Romanenko wrote:
>>>     Let me share some of my thoughts on this.
>>>>>
>>>>>         - shall we filter out the package name from the release?
>>>>>
>>>     Until the new runner is ready to be used in production (or, at
>>>     least, for beta testing, with users clearly warned about its
>>>     status), I believe we need to filter out its classes from the
>>>     published jar to avoid confusion.
>>     Yes that is what I think also
>>>>>
>>>>>         - should we release 2 jars: one for the old and one for
>>>>>     the new ?
>>>>>
>>>>>         - should we release 3 jars: one for the new, one for the
>>>>>     old and one for both ?
>>>>>
>>>     Once the new runner is released, I think we need to provide
>>>     only one single jar and allow users to switch between the
>>>     different Spark runners with a CLI option.
>>
>>     I would vote for 3 jars: one for the new, one for the old, and one
>>     for both. Indeed, in some cases, users look very closely at the
>>     size of jars. This solution meets all use cases.
>>
>>>>>         - should we create a special entry to the capability matrix ?
>>>>>
>>>     Sure, since it has its own unique characteristics and
>>>     implementation, but again, only once the new runner is
>>>     "officially released".
>>     +1
>>>
>>>
>>>>     On 28 Oct 2019, at 10:27, Etienne Chauchot <echauchot@apache.org> wrote:
>>>>
>>>>     Hi guys,
>>>>
>>>>     Any opinions on the point2 communication to users ?
>>>>
>>>>     Etienne
>>>>
>>>>     On 24/10/2019 15:44, Etienne Chauchot wrote:
>>>>>
>>>>>     Hi guys,
>>>>>
>>>>>     I'm glad to announce that the PR for the merge to master of
>>>>>     the new runner based on Spark Structured Streaming framework
>>>>>     is submitted:
>>>>>
>>>>>     https://github.com/apache/beam/pull/9866
>>>>>
>>>>>
>>>>>     1. Regarding the status of the runner:
>>>>>
>>>>>     -the runner passes 93% of the validates runner tests in batch
>>>>>     mode.
>>>>>
>>>>>     -Streaming mode is barely started (waiting for the
>>>>>     multi-aggregations support in spark Structured Streaming
>>>>>     framework from the Spark community)
>>>>>
>>>>>     -Runner can execute Nexmark
>>>>>
>>>>>     -Some things are not wired up yet
>>>>>
>>>>>       -Beam Schemas not wired with Spark Schemas
>>>>>
>>>>>       -Optional features of the model not implemented: state api,
>>>>>     timer api, splittable doFn api, …
>>>>>
>>>>>
>>>>>     2. Regarding the communication to users:
>>>>>
>>>>>     - for reasons explained by Ismael: the runner is in the same
>>>>>     module as the "older" one. But it is in a different
>>>>>     sub-package and both runners share the same build.
>>>>>
>>>>>     - How should we communicate to users:
>>>>>
>>>>>         - shall we filter out the package name from the release?
>>>>>
>>>>>         - should we release 2 jars: one for the old and one for
>>>>>     the new ?
>>>>>
>>>>>         - should we release 3 jars: one for the new, one for the
>>>>>     old and one for both ?
>>>>>
>>>>>         - should we create a special entry to the capability matrix ?
>>>>>
>>>>>     WDYT ?
>>>>>
>>>>>     Best
>>>>>
>>>>>     Etienne
>>>>>
>>>>>
>>>>>     On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>>>>>     +1 to merge.
>>>>>>
>>>>>>     It is worth keeping things in master with explicitly marked
>>>>>>     status. It will make effort more visible to users and easier
>>>>>>     to get feedback upon.
>>>>>>
>>>>>>     --Mikhail
>>>>>>
>>>>>>     On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot
>>>>>>     <echauchot@apache.org> wrote:
>>>>>>
>>>>>>         Hi guys,
>>>>>>
>>>>>>         The new spark runner now supports beam coders and passes
>>>>>>         93% of the batch validates runner tests (+4%). I think it
>>>>>>         is time to merge it to master. I will submit a PR in the
>>>>>>         coming days.
>>>>>>
>>>>>>         next steps: support schemas and thus better leverage
>>>>>>         catalyst optimizer (among other things optims based on
>>>>>>         data), port perfs optims that were done in the current
>>>>>>         runner.
>>>>>>
>>>>>>         Best
>>>>>>
>>>>>>         Etienne
>>>>>>
>>>>>>         On 11/10/2019 22:48, Pablo Estrada wrote:
>>>>>>>         +1 for merging : )
>>>>>>>
>>>>>>>         On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw
>>>>>>>         <robertwb@google.com> wrote:
>>>>>>>
>>>>>>>             Sounds like a good plan to me.
>>>>>>>
>>>>>>>             On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot
>>>>>>>             <echauchot@apache.org> wrote:
>>>>>>>
>>>>>>>                 Comments inline
>>>>>>>
>>>>>>>                 On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>>>>>                 +1
>>>>>>>>
>>>>>>>>                 The earlier we get to master the better to encourage not only code
>>>>>>>>                 contributions but as important to have early user feedback.
>>>>>>>>
>>>>>>>>>                 Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git) ?
>>>>>>>>                 It is still too early to even start discussing when to remove the
>>>>>>>>                 classical runner given that the new runner is still a WIP. However the
>>>>>>>>                 overall goal is that this runner becomes the de-facto one once the VR
>>>>>>>>                 tests and the performance become at least equal to the classical
>>>>>>>>                 runner, in the meantime the best for users is that they co-exist,
>>>>>>>>                 let’s not forget that the other runner has been already battle tested
>>>>>>>>                 for more than 3 years and has had lots of improvements in the last
>>>>>>>>                 year.
>>>>>>>
>>>>>>>                 +1 on what Ismael says: no soon removal,
>>>>>>>
>>>>>>>                 The plan I had in mind at first (that I showed
>>>>>>>                 at the apacheCon) was this but I'm proposing
>>>>>>>                 moving the first gray label to before the red box.
>>>>>>>
>>>>>>>                 <beogijnhpieapoll.png>
>>>>>>>
>>>>>>>
>>>>>>>>>                 I don't think the number of commits should be an issue--we shouldn't
>>>>>>>>>                 just squash years worth of history away. (OTOH, if this is a case of
>>>>>>>>>                 this branch containing lots of little, irrelevant commits that would
>>>>>>>>>                 have normally been squashed away in the normal review process we do
>>>>>>>>>                 for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>>                 About the commits we should encourage a clear history but we have also
>>>>>>>>                 to remove useless commits that are still present in the branch,
>>>>>>>>                 commits of the “Fix errorprone” / “Cleaning” kind and even commits
>>>>>>>>                 that make a better narrative sense together should be probably
>>>>>>>>                 squashed, because they do not bring much to the history. It is not
>>>>>>>>                 about more or less commits it is about its relevance as Robert
>>>>>>>>                 mentions.
>>>>>>>>
>>>>>>>>>                 I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>>>>>                 I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>>>>                 In its current form we cannot exclude it but this relates to the other
>>>>>>>>                 question, so better to explain a bit of history: The new runner used
>>>>>>>>                 to live in its own module and subdirectory because it is a full blank
>>>>>>>>                 page rewrite and the decision was not to use any of the classical
>>>>>>>>                 runner classes to not be constrained by its evolution.
>>>>>>>>
>>>>>>>>                 However the reason to put it back in the same module as a subdirectory
>>>>>>>>                 was to encourage early use, in more detail: The way you deploy spark
>>>>>>>>                 jobs today is usually by packaging and staging an uber jar (~200MB of
>>>>>>>>                 pure dependency joy) that contains the user pipeline classes, the
>>>>>>>>                 spark runner module and its dependencies. If we have two spark runners
>>>>>>>>                 in separate modules the user would need to repackage and redeploy
>>>>>>>>                 their pipelines every time they want to switch from the classical
>>>>>>>>                 Spark runner to the structured streaming runner which is painful and
>>>>>>>>                 time and space consuming compared with the one module approach where
>>>>>>>>                 they just change the name of the runner class and that’s it. The idea
>>>>>>>>                 here is to make easy for users to test the new runner, but at the same
>>>>>>>>                 time to make easy to come back to the classical runner in case of any
>>>>>>>>                 issue.
>>>>>>>>
>>>>>>>>                 Ismaël
>>>>>>>>
>>>>>>>>                 On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>>>>>>>                 +1
>>>>>>>>>
>>>>>>>>>                 I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>>>>>
>>>>>>>>>                 I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>>>>>
>>>>>>>>>                 I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
>>>>>>>>>
>>>>>>>>>                 Kenn
>>>>>>>>>
>>>>>>>>>                 On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>>>>>                 On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <ec...@apache.org> wrote:
>>>>>>>>>>>                 Hi guys,
>>>>>>>>>>>
>>>>>>>>>>>                 You probably know that there has been work for several months on
>>>>>>>>>>>                 developing a new Spark runner based on Spark Structured Streaming
>>>>>>>>>>>                 framework. This work is located in a feature branch here:
>>>>>>>>>>>                 https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>>>>>
>>>>>>>>>>>                 To attract more contributors and get some user feedback, we think it is
>>>>>>>>>>>                 time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>>>>>>>
>>>>>>>>>>>                 - finish the work on spark Encoders (that allow to call Beam coders)
>>>>>>>>>>>                 because, right now, the runner is in an unstable state (some transforms
>>>>>>>>>>>                 use the new way of doing ser/de and some use the old one, making a
>>>>>>>>>>>                 pipeline inconsistent with respect to serialization)
>>>>>>>>>>>
>>>>>>>>>>>                 - clean history: The history contains commits from November 2018, so
>>>>>>>>>>>                 there is a good amount of work, thus a considerable number of commits.
>>>>>>>>>>>                 They have already been squashed, but not those from September 2019 onward
>>>>>>>>>>                 I don't think the number of commits should be an issue--we shouldn't
>>>>>>>>>>                 just squash years worth of history away. (OTOH, if this is a case of
>>>>>>>>>>                 this branch containing lots of little, irrelevant commits that would
>>>>>>>>>>                 have normally been squashed away in the normal review process we do
>>>>>>>>>>                 for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>>>>
>>>>>>>>>>>                 Regarding status:
>>>>>>>>>>>
>>>>>>>>>>>                 - the runner passes 89% of the validates runner tests in batch mode. We
>>>>>>>>>>>                 hope to pass more with the new Encoders
>>>>>>>>>>>
>>>>>>>>>>>                 - Streaming mode is barely started (waiting for the multi-aggregations
>>>>>>>>>>>                 support in spark SS framework from the Spark community)
>>>>>>>>>>>
>>>>>>>>>>>                 - Runner can execute Nexmark
>>>>>>>>>>>
>>>>>>>>>>>                 - Some things are not wired up yet
>>>>>>>>>>>
>>>>>>>>>>>                       - Beam Schemas not wired with Spark Schemas
>>>>>>>>>>>
>>>>>>>>>>>                       - Optional features of the model not implemented:  state api, timer
>>>>>>>>>>>                 api, splittable doFn api, …
>>>>>>>>>>>
>>>>>>>>>>>                 WDYT, can we merge it to master once the 2 steps are done ?
>>>>>>>>>>                 I think that as long as it sits parallel to the existing runner, and
>>>>>>>>>>                 is clearly marked with its status, it makes sense to me. How many
>>>>>>>>>>                 changes does it make to the existing codebase (as opposed to add new
>>>>>>>>>>                 code)?
>>>>>>>
>>>
>

Re: [spark structured streaming runner] merge to master?

Posted by Alexey Romanenko <ar...@gmail.com>.
I’m also, like Kenn, strongly against 3 jars, but I don’t see a big difference between 1 and 2 jars, as long as the 2 jars don’t contain the same classes or cross-dependencies between them. I don’t have strong arguments for or against either option but, as I mentioned before, achieving this with one jar would perhaps be easier and less error-prone.

> On 31 Oct 2019, at 02:06, Kenneth Knowles <ke...@apache.org> wrote:
> 
> Very good points. We definitely ship a lot of code/features in very early stages, and there seems to be no problem.
> 
> I intend mostly to leave this judgment to people like you who know better about Spark users.
> 
> But I do think 1 or 2 jars is better than 3. I really don't like "3 jars" and I did give two reasons:
> 
> 1. diamond deps where things overlap
> 2. figuring out which thing to depend on
> 
> Both are annoying for users. I am not certain if it could lead to a real unsolvable situation. This is just a Java ecosystem problem so I feel qualified to comment.
> 
> I did also ask if there were major dependency differences between the two that could cause problems for users. This question was dropped and no one cares to comment so I assume it is not an issue. So then I favor having just 1 jar with both runners.
> 
> Kenn
> 
> On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía <iemejia@gmail.com> wrote:
> I am still a bit lost about why we are discussing options without giving any
> arguments or reasons for the options? Why is 2 modules better than 3 or 3 better
> than 2, or even better, what forces us to have something different than a single
> module?
> 
> What are the reasons for wanting to have separate jars? If the issue is that the
> code is unfinished or not passing the tests, the impact for end users is minimal
> because they cannot accidentally end up running the new runner, and if they
> decide to do so we can warn them it is at their own risk and not ready for
> production in the documentation + runner.
> 
> If the fear is that new code may end up being intertwined with the classic and
> portable runners and have some side effects. We have the ValidatesRunner +
> Nexmark in the CI to cover this so again I do not see what is the problem that
> requires modules to be separate.
> 
> If the issue is being uncomfortable about having in-progress code in released
> artifacts we have been doing this in Beam forever, for example most of the work
> on portability and Schema/SQL, and all of those were still part of artifacts
> long time before they were ready for prime use, so I still don't see why this
> case is different to require different artifacts.
> 
> I have the impression we are trying to solve a non-issue by adding a lot of
> artificial complexity (in particular to the users), or am I missing something
> else?
> 
> On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles <kenn@apache.org> wrote:
> >
> > Oh, I mean that we ship just 2 jars.
> >
> > And since Spark users always build an uber jar, they can still depend on both of ours and be able to switch runners with a flag.
> >
> > I really dislike projects shipping overlapping jars. It is confusing and causes major diamond dependency problems.
> >
> > Kenn
> >
> > On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko <aromanenko.dev@gmail.com> wrote:
> >>
> >> Yes, agree, two jars included in the uber jar will work in a similar way. Though having 3 jars still looks quite confusing to me.
> >>
> >> On 29 Oct 2019, at 23:54, Kenneth Knowles <kenn@apache.org> wrote:
> >>
> >> Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.
> >>
> >> Kenn
> >>
> >> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <aromanenko.dev@gmail.com> wrote:
> >>>
> >>> Hmm, I don’t think that jar size should play a big role comparing to the whole size of shaded jar of users job. Even more, I think it will be quite confusing for users to choose which jar to use if we will have 3 different ones for similar purposes. Though, let’s see what others think.
> >>>
> >>> On 29 Oct 2019, at 15:32, Etienne Chauchot <echauchot@apache.org> wrote:
> >>>
> >>> Hi Alexey,
> >>>
> >>> Thanks for your opinion !
> >>>
> >>> Comments inline
> >>>
> >>> Etienne
> >>>
> >>> On 28/10/2019 17:34, Alexey Romanenko wrote:
> >>>
> >>> Let me share some of my thoughts on this.
> >>>
> >>>     - shall we filter out the package name from the release?
> >>>
> >>> Until the new runner is ready to be used in production (or, at least, for beta testing, with users clearly warned about its status), I believe we need to filter out its classes from the published jar to avoid confusion.
> >>>
> >>> Yes that is what I think also
> >>>
> >>>     - should we release 2 jars: one for the old and one for the new ?
> >>>
> >>>     - should we release 3 jars: one for the new, one for the old and one for both ?
> >>>
> >>> Once new runner will be released, then I think we need to provide only one single jar and allow user to switch between different Spark runners with CLI option.
> >>>
> >>> I would vote for 3 jars: one for new, one for old, and one for both. Indeed, in some cases, users are looking very closely at the size of jars. This solution meets all use cases
> >>>
> >>>     - should we create a special entry to the capability matrix ?
> >>>
> >>> Sure, since it has its own uniq characteristics and implementation, but again, only once new runner will be "officially released".
> >>>
> >>> +1
> >>>
> >>>
> >>>
> >>> On 28 Oct 2019, at 10:27, Etienne Chauchot <echauchot@apache.org> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> Any opinions on the point2 communication to users ?
> >>>
> >>> Etienne
> >>>
> >>> On 24/10/2019 15:44, Etienne Chauchot wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> I'm glad to announce that the PR for the merge to master of the new runner based on Spark Structured Streaming framework is submitted:
> >>>
> >>> https://github.com/apache/beam/pull/9866
> >>>
> >>>
> >>> 1. Regarding the status of the runner:
> >>>
> >>> -the runner passes 93% of the validates runner tests in batch mode.
> >>>
> >>> -Streaming mode is barely started (waiting for the multi-aggregations support in spark Structured Streaming framework from the Spark community)
> >>>
> >>> -Runner can execute Nexmark
> >>>
> >>> -Some things are not wired up yet
> >>>
> >>>   -Beam Schemas not wired with Spark Schemas
> >>>
> >>>   -Optional features of the model not implemented: state api, timer api, splittable doFn api, …
> >>>
> >>>
> >>> 2. Regarding the communication to users:
> >>>
> >>> - for reasons explained by Ismael: the runner is in the same module as the "older" one. But it is in a different sub-package and both runners share the same build.
> >>>
> >>> - How should we communicate to users:
> >>>
> >>>     - shall we filter out the package name from the release?
> >>>
> >>>     - should we release 2 jars: one for the old and one for the new ?
> >>>
> >>>     - should we release 3 jars: one for the new, one for the old and one for both ?
> >>>
> >>>     - should we create a special entry to the capability matrix ?
> >>>
> >>> WDYT ?
> >>>
> >>> Best
> >>>
> >>> Etienne
> >>>
> >>>
> >>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
> >>>
> >>> +1 to merge.
> >>>
> >>> It is worth keeping things in master with explicitly marked status. It will make effort more visible to users and easier to get feedback upon.
> >>>
> >>> --Mikhail
> >>>
> >>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauchot@apache.org> wrote:
> >>>>
> >>>> Hi guys,
> >>>>
> >>>> The new spark runner now supports beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
> >>>>
> >>>> next steps: support schemas and thus better leverage catalyst optimizer (among other things optims based on data), port perfs optims that were done in the current runner.
> >>>>
> >>>> Best
> >>>>
> >>>> Etienne
> >>>>
> >>>> On 11/10/2019 22:48, Pablo Estrada wrote:
> >>>>
> >>>> +1 for merging : )
> >>>>
> >>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <robertwb@google.com> wrote:
> >>>>>
> >>>>> Sounds like a good plan to me.
> >>>>>
> >>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <echauchot@apache.org> wrote:
> >>>>>>
> >>>>>> Comments inline
> >>>>>>
> >>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> The earlier we get to master the better to encourage not only code
> >>>>>> contributions but as important to have early user feedback.
> >>>>>>
> >>>>>> Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git) ?
> >>>>>>
> >>>>>> It is still too early to even start discussing when to remove the
> >>>>>> classical runner given that the new runner is still a WIP. However the
> >>>>>> overall goal is that this runner becomes the de-facto one once the VR
> >>>>>> tests and the performance become at least equal to the classical
> >>>>>> runner, in the meantime the best for users is that they co-exist,
> >>>>>> let’s not forget that the other runner has been already battle tested
> >>>>>> for more than 3 years and has had lots of improvements in the last
> >>>>>> year.
> >>>>>>
> >>>>>> +1 on what Ismael says: no soon removal,
> >>>>>>
> >>>>>> The plan I had in mind at first (that I showed at the apacheCon) was this but I'm proposing moving the first gray label to before the red box.
> >>>>>>
> >>>>>> <beogijnhpieapoll.png>
> >>>>>>
> >>>>>>
> >>>>>> I don't think the number of commits should be an issue--we shouldn't
> >>>>>> just squash years worth of history away. (OTOH, if this is a case of
> >>>>>> this branch containing lots of little, irrelevant commits that would
> >>>>>> have normally been squashed away in the normal review process we do
> >>>>>> for the main branch, then, yes, some cleanup could be nice.)
> >>>>>>
> >>>>>> About the commits we should encourage a clear history but we have also
> >>>>>> to remove useless commits that are still present in the branch,
> >>>>>> commits of the “Fix errorprone” / “Cleaning” kind and even commits
> >>>>>> that make a better narrative sense together should be probably
> >>>>>> squashed, because they do not bring much to the history. It is not
> >>>>>> about more or less commits it is about its relevance as Robert
> >>>>>> mentions.
> >>>>>>
> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
> >>>>>>
> >>>>>> In its current form we cannot exclude it but this relates to the other
> >>>>>> question, so better to explain a bit of history: The new runner used
> >>>>>> to live in its own module and subdirectory because it is a full blank
> >>>>>> page rewrite and the decision was not to use any of the classical
> >>>>>> runner classes to not be constrained by its evolution.
> >>>>>>
> >>>>>> However the reason to put it back in the same module as a subdirectory
> >>>>>> was to encourage early use, in more detail: The way you deploy spark
> >>>>>> jobs today is usually by packaging and staging an uber jar (~200MB of
> >>>>>> pure dependency joy) that contains the user pipeline classes, the
> >>>>>> spark runner module and its dependencies. If we have two spark runners
> >>>>>> in separate modules the user would need to repackage and redeploy
> >>>>>> their pipelines every time they want to switch from the classical
> >>>>>> Spark runner to the structured streaming runner which is painful and
> >>>>>> time and space consuming compared with the one module approach where
> >>>>>> they just change the name of the runner class and that’s it. The idea
> >>>>>> here is to make easy for users to test the new runner, but at the same
> >>>>>> time to make easy to come back to the classical runner in case of any
> >>>>>> issue.
> >>>>>>
> >>>>>> Ismaël
> >>>>>>
> >>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <kenn@apache.org> wrote:
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
> >>>>>>
> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
> >>>>>>
> >>>>>> I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
> >>>>>>
> >>>>>> Kenn
> >>>>>>
> >>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <robertwb@google.com> wrote:
> >>>>>>
> >>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <echauchot@apache.org> wrote:
> >>>>>>
> >>>>>> Hi guys,
> >>>>>>
> >>>>>> You probably know that there has been work for several months on
> >>>>>> developing a new Spark runner based on Spark Structured Streaming
> >>>>>> framework. This work is located in a feature branch here:
> >>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
> >>>>>>
> >>>>>> To attract more contributors and get some user feedback, we think it is
> >>>>>> time to merge it to master. Before doing so, some steps need to be achieved:
> >>>>>>
> >>>>>> - finish the work on spark Encoders (that allow to call Beam coders)
> >>>>>> because, right now, the runner is in an unstable state (some transforms
> >>>>>> use the new way of doing ser/de and some use the old one, making a
> >>>>>> pipeline inconsistent with respect to serialization)
> >>>>>>
> >>>>>> - clean history: The history contains commits from November 2018, so
> >>>>>> there is a good amount of work, thus a considerable number of commits.
> >>>>>> They have already been squashed, but not those from September 2019 onward
> >>>>>>
> >>>>>> I don't think the number of commits should be an issue--we shouldn't
> >>>>>> just squash years worth of history away. (OTOH, if this is a case of
> >>>>>> this branch containing lots of little, irrelevant commits that would
> >>>>>> have normally been squashed away in the normal review process we do
> >>>>>> for the main branch, then, yes, some cleanup could be nice.)
> >>>>>>
> >>>>>> Regarding status:
> >>>>>>
> >>>>>> - the runner passes 89% of the validates runner tests in batch mode. We
> >>>>>> hope to pass more with the new Encoders
> >>>>>>
> >>>>>> - Streaming mode is barely started (waiting for the multi-aggregations
> >>>>>> support in spark SS framework from the Spark community)
> >>>>>>
> >>>>>> - Runner can execute Nexmark
> >>>>>>
> >>>>>> - Some things are not wired up yet
> >>>>>>
> >>>>>>      - Beam Schemas not wired with Spark Schemas
> >>>>>>
> >>>>>>      - Optional features of the model not implemented:  state api, timer
> >>>>>> api, splittable doFn api, …
> >>>>>>
> >>>>>> WDYT, can we merge it to master once the 2 steps are done ?
> >>>>>>
> >>>>>> I think that as long as it sits parallel to the existing runner, and
> >>>>>> is clearly marked with its status, it makes sense to me. How many
> >>>>>> changes does it make to the existing codebase (as opposed to add new
> >>>>>> code)?
> >>>
> >>>
> >>>
> >>


Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
OK for 1 jar with the 2 runners, then.
I'll add the banner to the logs and the Experimental annotation in the 
code and in the javadocs (see the sketch below).

Thanks for your opinions, guys!

Etienne

On 08/11/2019 18:50, Kenneth Knowles wrote:
> On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot <echauchot@apache.org> wrote:
> >
> > Hi guys
> >
> > @Kenn,
> >
> > I just wanted to mention that I did answer your question on 
> dependencies here: 
> https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E
>
> Ah, sorry! In that case there is no problem at all.
>
>
> > I'm not in favor of having the 2 runners in one jar, the point about 
> having 2 jars was to:
> >
> > - avoid making promises to users on a work in progress runner (make 
> it explicit with a different jar)
> > - avoid confusion for them (why are there 2 pipeline options? etc....)
> >
> > If the community believes that there is no confusion or wrong 
> promises with the one jar solution, we could leave the 2 runners in 
> one jar.
> >
> > Maybe we could start a vote on that?
>
> It seems unanimous among others to have one jar. There were some 
> suggestions of how to avoid promises and confusion, like Ryan's most 
> recent email. Did any of the ideas sound good to you?
>
> Kenn
>
>
>     I have no objection to putting the experimental runner alongside the
>     stable, mature runner. We have some precedent with the portable
>     spark runner, and that's worked out pretty well -- at least, I haven't
>     heard any complaints from confused users!
>
>     That being said:
>
>     1.  It really should be marked @Experimental in the code *and* come
>     with clear warnings in the API (javadoc) and documentation.
>
>     2.  Ideally, I'd like to see a warning banner in the logs when it's
>     used, pointing to the stable SparkRunner and/or documentation on the
>     current known issues.
>
>     All my best, Ryan
>
>
>
>
>
>
>     > regarding jars:
>     >
>     > I don't like 3 jars either.
>     >
>     >
>     > Etienne
>     >
>     > On 31/10/2019 02:06, Kenneth Knowles wrote:
>     >
>     > Very good points. We definitely ship a lot of code/features in
>     very early stages, and there seems to be no problem.
>     >
>     > I intend mostly to leave this judgment to people like you who
>     know better about Spark users.
>     >
>     > But I do think 1 or 2 jars is better than 3. I really don't like
>     "3 jars" and I did give two reasons:
>     >
>     > 1. diamond deps where things overlap
>     > 2. figuring out which thing to depend on
>     >
>     > Both are annoying for users. I am not certain if it could lead
>     to a real unsolvable situation. This is just a Java ecosystem
>     problem so I feel qualified to comment.
>     >
>     > I did also ask if there were major dependency differences
>     between the two that could cause problems for users. This question
>     was dropped and no one cares to comment so I assume it is not an
>     issue. So then I favor having just 1 jar with both runners.
>     >
>     > Kenn
>     >
>     > On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía <iemejia@gmail.com> wrote:
>     >>
>     >> I am still a bit lost about why we are discussing options
>     without giving any
>     >> arguments or reasons for the options? Why is 2 modules better
>     than 3 or 3 better
>     >> than 2, or even better, what forces us to have something
>     different than a single
>     >> module?
>     >>
>     >> What are the reasons for wanting to have separate jars? If the
>     issue is that the
>     >> code is unfinished or not passing the tests, the impact for end
>     users is minimal
>     >> because they cannot accidentally end up running the new runner,
>     and if they
>     >> decide to do so, we can warn them, in the documentation and the
>     runner itself,
>     >> that it is at their own risk and not ready for production.
>     >>
>     >> If the fear is that new code may end up being intertwined with
>     the classic and
>     >> portable runners and have some side effects, we have the
>     ValidatesRunner +
>     >> Nexmark suites in the CI to cover this, so again I do not see
>     what problem
>     >> requires the modules to be separate.
>     >>
>     >> If the issue is being uncomfortable about having in-progress
>     code in released
>     >> artifacts we have been doing this in Beam forever, for example
>     most of the work
>     >> on portability and Schema/SQL, and all of those were still part
>     of artifacts
>     >> a long time before they were ready for prime-time use, so I still
>     don't see why this
>     >> case is different to require different artifacts.
>     >>
>     >> I have the impression we are trying to solve a non-issue by
>     adding a lot of
>     >> artificial complexity (in particular to the users), or am I
>     missing something
>     >> else?
>     >>
>     >> On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles
>     <kenn@apache.org <ma...@apache.org>> wrote:
>     >> >
>     >> > Oh, I mean that we ship just 2 jars.
>     >> >
>     >> > And since Spark users always build an uber jar, they can
>     still depend on both of ours and be able to switch runners with a
>     flag.
>     >> >
>     >> > I really dislike projects shipping overlapping jars. It is
>     confusing and causes major diamond dependency problems.
>     >> >
>     >> > Kenn
>     >> >
>     >> > On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko
>     <aromanenko.dev@gmail.com <ma...@gmail.com>> wrote:
>     >> >>
>     >> >> Yes, agree, two jars included in an uber jar will work in a
>     similar way. Though having 3 jars still looks quite confusing to me.
>     >> >>
>     >> >> On 29 Oct 2019, at 23:54, Kenneth Knowles <kenn@apache.org
>     <ma...@apache.org>> wrote:
>     >> >>
>     >> >> Is it just as easy to have two jars and build an uber jar
>     with both included? Then the runner can still be toggled with a flag.
>     >> >>
>     >> >> Kenn
>     >> >>
>     >> >> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko
>     <aromanenko.dev@gmail.com <ma...@gmail.com>> wrote:
>     >> >>>
>     >> >>> Hmm, I don’t think that jar size should play a big role
>     compared to the whole size of the shaded jar of a user's job. Even more,
>     I think it will be quite confusing for users to choose which jar
>     to use if we have 3 different ones for similar purposes.
>     Though, let’s see what others think.
>     >> >>>
>     >> >>> On 29 Oct 2019, at 15:32, Etienne Chauchot
>     <echauchot@apache.org <ma...@apache.org>> wrote:
>     >> >>>
>     >> >>> Hi Alexey,
>     >> >>>
>     >> >>> Thanks for your opinion !
>     >> >>>
>     >> >>> Comments inline
>     >> >>>
>     >> >>> Etienne
>     >> >>>
>     >> >>> On 28/10/2019 17:34, Alexey Romanenko wrote:
>     >> >>>
>     >> >>> Let me share some of my thoughts on this.
>     >> >>>
>     >> >>>     - shall we filter out the package name from the release?
>     >> >>>
>     >> >>> Until the new runner is ready to be used in production (or,
>     at least, for beta testing, in which case users should be clearly
>     warned about that), I believe we need to filter out
>     its classes from the published jar to avoid confusion.
>     >> >>>
>     >> >>> Yes that is what I think also
>     >> >>>
>     >> >>>     - should we release 2 jars: one for the old and one for
>     the new ?
>     >> >>>
>     >> >>>     - should we release 3 jars: one for the new, one for
>     the old and one for both ?
>     >> >>>
>     >> >>> Once the new runner is released, I think we need to
>     provide only a single jar and allow users to switch between
>     different Spark runners with a CLI option.
>     >> >>>
>     >> >>> I would vote for 3 jars: one for new, one for old, and one
>     for both. Indeed, in some cases, users are looking very closely at
>     the size of jars. This solution meets all use cases
>     >> >>>
>     >> >>>     - should we create a special entry to the capability
>     matrix ?
>     >> >>>
>     >> >>> Sure, since it has its own unique characteristics and
>     implementation, but again, only once the new runner is
>     "officially released".
>     >> >>>
>     >> >>> +1
>     >> >>>
>     >> >>>
>     >> >>>
>     >> >>> On 28 Oct 2019, at 10:27, Etienne Chauchot
>     <echauchot@apache.org <ma...@apache.org>> wrote:
>     >> >>>
>     >> >>> Hi guys,
>     >> >>>
>     >> >>> Any opinions on the point2 communication to users ?
>     >> >>>
>     >> >>> Etienne
>     >> >>>
>     >> >>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>     >> >>>
>     >> >>> Hi guys,
>     >> >>>
>     >> >>> I'm glad to announce that the PR for the merge to master of
>     the new runner based on Spark Structured Streaming framework is
>     submitted:
>     >> >>>
>     >> >>> https://github.com/apache/beam/pull/9866
>     >> >>>
>     >> >>>
>     >> >>> 1. Regarding the status of the runner:
>     >> >>>
>     >> >>> -the runner passes 93% of the validates runner tests in
>     batch mode.
>     >> >>>
>     >> >>> -Streaming mode is barely started (waiting for the
>     multi-aggregations support in spark Structured Streaming framework
>     from the Spark community)
>     >> >>>
>     >> >>> -Runner can execute Nexmark
>     >> >>>
>     >> >>> -Some things are not wired up yet
>     >> >>>
>     >> >>>   -Beam Schemas not wired with Spark Schemas
>     >> >>>
>     >> >>>   -Optional features of the model not implemented: state
>     api, timer api, splittable doFn api, …
>     >> >>>
>     >> >>>
>     >> >>> 2. Regarding the communication to users:
>     >> >>>
>     >> >>> - for reasons explained by Ismael: the runner is in the
>     same module as the "older" one. But it is in a different
>     sub-package and both runners share the same build.
>     >> >>>
>     >> >>> - How should we communicate to users:
>     >> >>>
>     >> >>>     - shall we filter out the package name from the release?
>     >> >>>
>     >> >>>     - should we release 2 jars: one for the old and one for
>     the new ?
>     >> >>>
>     >> >>>     - should we release 3 jars: one for the new, one for
>     the old and one for both ?
>     >> >>>
>     >> >>>     - should we create a special entry to the capability
>     matrix ?
>     >> >>>
>     >> >>> WDYT ?
>     >> >>>
>     >> >>> Best
>     >> >>>
>     >> >>> Etienne
>     >> >>>
>     >> >>>
>     >> >>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>     >> >>>
>     >> >>> +1 to merge.
>     >> >>>
>     >> >>> It is worth keeping things in master with explicitly marked
>     status. It will make the effort more visible to users and easier to
>     get feedback upon.
>     >> >>>
>     >> >>> --Mikhail
>     >> >>>
>     >> >>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot
>     <echauchot@apache.org <ma...@apache.org>> wrote:
>     >> >>>>
>     >> >>>> Hi guys,
>     >> >>>>
>     >> >>>> The new spark runner now supports beam coders and passes
>     93% of the batch validates runner tests (+4%). I think it is time
>     to merge it to master. I will submit a PR in the coming days.
>     >> >>>>
>     >> >>>> next steps: support schemas and thus better leverage the
>     Catalyst optimizer (among other things, optimizations based on data),
>     and port the performance optimizations that were done in the current runner.
>     >> >>>>
>     >> >>>> Best
>     >> >>>>
>     >> >>>> Etienne
>     >> >>>>
>     >> >>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>     >> >>>>
>     >> >>>> +1 for merging : )
>     >> >>>>
>     >> >>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw
>     <robertwb@google.com <ma...@google.com>> wrote:
>     >> >>>>>
>     >> >>>>> Sounds like a good plan to me.
>     >> >>>>>
>     >> >>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot
>     <echauchot@apache.org <ma...@apache.org>> wrote:
>     >> >>>>>>
>     >> >>>>>> Comments inline
>     >> >>>>>>
>     >> >>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>     >> >>>>>>
>     >> >>>>>> +1
>     >> >>>>>>
>     >> >>>>>> The earlier we get to master the better, to encourage not
>     only code
>     >> >>>>>> contributions but, just as important, to get early user feedback.
>     >> >>>>>>
>     >> >>>>>> Question is: do we keep the "old" spark runner for a
>     while or not (or just keep it at a previous version/tag in git)?
>     >> >>>>>>
>     >> >>>>>> It is still too early to even start discussing when to
>     remove the
>     >> >>>>>> classical runner given that the new runner is still a
>     WIP. However the
>     >> >>>>>> overall goal is that this runner becomes the de-facto
>     one once the VR
>     >> >>>>>> tests and the performance become at least equal to the
>     classical
>     >> >>>>>> runner, in the meantime the best for users is that they
>     co-exist,
>     >> >>>>>> let’s not forget that the other runner has already been
>     battle tested
>     >> >>>>>> for more than 3 years and has had lots of improvements
>     in the last
>     >> >>>>>> year.
>     >> >>>>>>
>     >> >>>>>> +1 on what Ismael says: no soon removal,
>     >> >>>>>>
>     >> >>>>>> The plan I had in mind at first (that I showed at
>     ApacheCon) was this, but I'm proposing moving the first gray label
>     to before the red box.
>     >> >>>>>>
>     >> >>>>>> [inline image: plan/roadmap diagram]
>     >> >>>>>>
>     >> >>>>>>
>     >> >>>>>> I don't think the number of commits should be an
>     issue--we shouldn't
>     >> >>>>>> just squash years worth of history away. (OTOH, if this
>     is a case of
>     >> >>>>>> this branch containing lots of little, irrelevant
>     commits that would
>     >> >>>>>> have normally been squashed away in the normal review
>     process we do
>     >> >>>>>> for the main branch, then, yes, some cleanup could be nice.)
>     >> >>>>>>
>     >> >>>>>> About the commits we should encourage a clear history
>     but we have also
>     >> >>>>>> to remove useless commits that are still present in the
>     branch,
>     >> >>>>>> commits of the “Fix errorprone” / “Cleaning” kind and
>     even commits
>     >> >>>>>> that make better narrative sense together should
>     probably be
>     >> >>>>>> squashed, because they do not bring much to the history.
>     It is not
>     >> >>>>>> about more or fewer commits, it is about their relevance, as
>     Robert
>     >> >>>>>> mentions.
>     >> >>>>>>
>     >> >>>>>> I think our experiences with things that go to master
>     early have been very good. So I am in favor ASAP. We can exclude
>     it from releases easily until it is ready for end users.
>     >> >>>>>> I have the same question as Robert - how much is
>     modifications and how much is new? I notice it is in a
>     subdirectory of the beam-runners-spark module.
>     >> >>>>>>
>     >> >>>>>> In its current form we cannot exclude it but this
>     relates to the other
>     >> >>>>>> question, so better to explain a bit of history: The new
>     runner used
>     >> >>>>>> to live in its own module and subdirectory because it is
>     a full blank
>     >> >>>>>> page rewrite and the decision was not to use any of the
>     classical
>     >> >>>>>> runner classes to not be constrained by its evolution.
>     >> >>>>>>
>     >> >>>>>> However the reason to put it back in the same module as
>     a subdirectory
>     >> >>>>>> was to encourage early use, in more detail: The way you
>     deploy spark
>     >> >>>>>> jobs today is usually by packaging and staging an uber
>     jar (~200MB of
>     >> >>>>>> pure dependency joy) that contains the user pipeline
>     classes, the
>     >> >>>>>> spark runner module and its dependencies. If we have two
>     spark runners
>     >> >>>>>> in separate modules the user would need to repackage and
>     redeploy
>     >> >>>>>> their pipelines every time they want to switch from the
>     classical
>     >> >>>>>> Spark runner to the structured streaming runner which is
>     painful and
>     >> >>>>>> time and space consuming compared with the one module
>     approach where
>     >> >>>>>> they just change the name of the runner class and that’s
>     it. The idea
>     >> >>>>>> here is to make it easy for users to test the new runner,
>     but at the same
>     >> >>>>>> time to make it easy to come back to the classical runner
>     in case of any
>     >> >>>>>> issue.
>     >> >>>>>>
>     >> >>>>>> Ismaël
>     >> >>>>>>
>     >> >>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles
>     <kenn@apache.org <ma...@apache.org>> wrote:
>     >> >>>>>>
>     >> >>>>>> +1
>     >> >>>>>>
>     >> >>>>>> I think our experiences with things that go to master
>     early have been very good. So I am in favor ASAP. We can exclude
>     it from releases easily until it is ready for end users.
>     >> >>>>>>
>     >> >>>>>> I have the same question as Robert - how much is
>     modifications and how much is new? I notice it is in a
>     subdirectory of the beam-runners-spark module.
>     >> >>>>>>
>     >> >>>>>> I did not see any major changes to dependencies but I
>     will also ask if it has major version differences so that you
>     might want a separate artifact?
>     >> >>>>>>
>     >> >>>>>> Kenn
>     >> >>>>>>
>     >> >>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw
>     <robertwb@google.com <ma...@google.com>> wrote:
>     >> >>>>>>
>     >> >>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot
>     <echauchot@apache.org <ma...@apache.org>> wrote:
>     >> >>>>>>
>     >> >>>>>> Hi guys,
>     >> >>>>>>
>     >> >>>>>> You probably know that for several months there has been
>     work
>     >> >>>>>> developing a new Spark runner based on Spark Structured
>     Streaming
>     >> >>>>>> framework. This work is located in a feature branch here:
>     >> >>>>>>
>     https://github.com/apache/beam/tree/spark-runner_structured-streaming
>     >> >>>>>>
>     >> >>>>>> To attract more contributors and get some user feedback,
>     we think it is
>     >> >>>>>> time to merge it to master. Before doing so, some steps
>     need to be achieved:
>     >> >>>>>>
>     >> >>>>>> - finish the work on spark Encoders (that allow calling
>     Beam coders)
>     >> >>>>>> because, right now, the runner is in an unstable state
>     (some transforms
>     >> >>>>>> use the new way of doing ser/de and some use the old
>     one, making a
>     >> >>>>>> pipeline inconsistent with regard to serialization)
>     >> >>>>>>
>     >> >>>>>> - clean history: The history contains commits from
>     November 2018, so
>     >> >>>>>> there is a good amount of work, thus a considerable number
>     of commits.
>     >> >>>>>> They were already squashed but not from September 2019
>     >> >>>>>>
>     >> >>>>>> I don't think the number of commits should be an
>     issue--we shouldn't
>     >> >>>>>> just squash years worth of history away. (OTOH, if this
>     is a case of
>     >> >>>>>> this branch containing lots of little, irrelevant
>     commits that would
>     >> >>>>>> have normally been squashed away in the normal review
>     process we do
>     >> >>>>>> for the main branch, then, yes, some cleanup could be nice.)
>     >> >>>>>>
>     >> >>>>>> Regarding status:
>     >> >>>>>>
>     >> >>>>>> - the runner passes 89% of the validates runner tests in
>     batch mode. We
>     >> >>>>>> hope to pass more with the new Encoders
>     >> >>>>>>
>     >> >>>>>> - Streaming mode is barely started (waiting for the
>     multi-aggregations
>     >> >>>>>> support in spark SS framework from the Spark community)
>     >> >>>>>>
>     >> >>>>>> - Runner can execute Nexmark
>     >> >>>>>>
>     >> >>>>>> - Some things are not wired up yet
>     >> >>>>>>
>     >> >>>>>>      - Beam Schemas not wired with Spark Schemas
>     >> >>>>>>
>     >> >>>>>>      - Optional features of the model not implemented: 
>     state api, timer
>     >> >>>>>> api, splittable doFn api, …
>     >> >>>>>>
>     >> >>>>>> WDYT, can we merge it to master once the 2 steps are done ?
>     >> >>>>>>
>     >> >>>>>> I think that as long as it sits parallel to the existing
>     runner, and
>     >> >>>>>> is clearly marked with its status, it makes sense to me.
>     How many
>     >> >>>>>> changes does it make to the existing codebase (as
>     opposed to adding new
>     >> >>>>>> code)?
>     >> >>>
>     >> >>>
>     >> >>>
>     >> >>
>
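
A technical aside on the Encoders work discussed throughout this thread:
"calling Beam coders" from Spark essentially means round-tripping values
through a Beam Coder so that Spark can handle them as bytes. A rough sketch
of that bridging idea (hypothetical helper for illustration; the branch
presumably plugs into Spark's Encoder machinery rather than shuffling raw
byte arrays like this):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.beam.sdk.coders.Coder;

/** Illustrative bridge between Beam coders and Spark ser/de. */
public class BeamCoderBridge {

  // Encode a value with its Beam coder into bytes that Spark can move around.
  public static <T> byte[] toByteArray(T value, Coder<T> coder) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    coder.encode(value, baos);
    return baos.toByteArray();
  }

  // Decode the bytes back into a value using the same Beam coder.
  public static <T> T fromByteArray(byte[] bytes, Coder<T> coder) throws IOException {
    return coder.decode(new ByteArrayInputStream(bytes));
  }
}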

Re: [spark structured streaming runner] merge to master?

Posted by Kyle Weaver <kc...@google.com>.
+1 to "one uber jar to rule them all." As Ryan said, we snuck the portable
runner into master in a much less usable state than the structured
streaming runner, and it didn't seem to cause any issues (although in
retrospect we probably should have tagged it as @Experimental; I wasn't
aware that was a thing at the time).

In general, until we add visible documentation on the website, I suspect
most users will be completely unaware that the new runner is even there.
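
For context, once both runners ship in the user's uber jar, switching really
is a one-flag change. A minimal, hypothetical user pipeline stays identical
across the two runners:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MyBeamPipeline {
  public static void main(String[] args) {
    // Passing --runner=SparkRunner or --runner=SparkStructuredStreamingRunner
    // on the command line picks the runner; the pipeline code is untouched.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    // ... apply transforms here ...
    pipeline.run().waitUntilFinish();
  }
}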

On Fri, Nov 8, 2019 at 9:50 AM Kenneth Knowles <ke...@apache.org> wrote:

> On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot <ec...@apache.org>
> wrote:
> >
> > Hi guys
> >
> > @Kenn,
> >
> > I just wanted to mention that I did answer your question on
> dependencies here:
> https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E
>
> Ah, sorry! In that case there is no problem at all.
>
>
> > I'm not in favor of having the 2 runners in one jar, the point about
> having 2 jars was to:
> >
> > - avoid making promises to users on a work in progress runner (make it
> explicit with a different jar)
> > - avoid confusion for them (why are there 2 pipeline options? etc....)
> >
> > If the community believes that there is no confusion or wrong promises
> with the one jar solution, we could leave the 2 runners in one jar.
> >
> > Maybe we could start a vote on that?
>
> It seems unanimous among others to have one jar. There were some
> suggestions of how to avoid promises and confusion, like Ryan's most recent
> email. Did any of the ideas sound good to you?
>
> Kenn

Re: [spark structured streaming runner] merge to master?

Posted by Kenneth Knowles <ke...@apache.org>.
On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot <ec...@apache.org>
wrote:
>
> Hi guys
>
> @Kenn,
>
> I just wanted to mention that I did answer your question on
dependencies here:
https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E

Ah, sorry! In that case there is no problem at all.


> I'm not in favor of having the 2 runners in one jar, the point about
having 2 jars was to:
>
> - avoid making promises to users on a work in progress runner (make it
explicit with a different jar)
> - avoid confusion for them (why are there 2 pipeline options? etc....)
>
> If the community believes that there is no confusion or wrong promises
with the one jar solution, we could leave the 2 runners in one jar.
>
> Maybe we could start a vote on that?

It seems unanimous among others to have one jar. There were some
suggestions of how to avoid promises and confusion, like Ryan's most recent
email. Did any of the ideas sound good to you?

Kenn



Re: [spark structured streaming runner] merge to master?

Posted by Ryan Skraba <ry...@skraba.com>.
Just a personal opinion!  I would prefer ONE jar including all spark
runners, and I think the new Spark runner should be present in the
release artifacts even if it's in an incomplete state.

I have no objection to putting the experimental runner alongside the
stable, mature runner.  We have some precedent with the portable
spark runner, and that's worked out pretty well -- at least, I haven't
heard any complaints from confused users!

That being said:

1.  It really should be marked @Experimental in the code *and* clearly
warned in API (javadoc) and documentation.

2.  Ideally, I'd like to see a warning banner in the logs when it's
used, pointing to the stable SparkRunner and/or documentation on the
current known issues.

All my best, Ryan





On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot <ec...@apache.org> wrote:
>
> Hi guys
>
> @Kenn,
>
> I just wanted to mention that I did answer your question on dependencies here: https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E
>
> regarding jars:
>
> I don't like 3 jars either.
>
> I'm not in favor of having the 2 runners in one jar, the point about having 2 jars was to:
>
> - avoid making promises to users on a work in progress runner (make it explicit with a different jar)
>
> - avoid confusion for them (why are there 2 pipeline options? etc....)
>
> If the community believes that there is no confusion or wrong promises with the one jar solution, we could leave the 2 runners in one jar.
>
> Maybe we could start a vote on that?
>
> Etienne
>> >>>
>> >>> Sure, since it has its own uniq characteristics and implementation, but again, only once new runner will be "officially released".
>> >>>
>> >>> +1
>> >>>
>> >>>
>> >>>
>> >>> On 28 Oct 2019, at 10:27, Etienne Chauchot <ec...@apache.org> wrote:
>> >>>
>> >>> Hi guys,
>> >>>
>> >>> Any opinions on the point2 communication to users ?
>> >>>
>> >>> Etienne
>> >>>
>> >>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>> >>>
>> >>> Hi guys,
>> >>>
>> >>> I'm glad to announce that the PR for the merge to master of the new runner based on Spark Structured Streaming framework is submitted:
>> >>>
>> >>> https://github.com/apache/beam/pull/9866
>> >>>
>> >>>
>> >>> 1. Regarding the status of the runner:
>> >>>
>> >>> -the runner passes 93% of the validates runner tests in batch mode.
>> >>>
>> >>> -Streaming mode is barely started (waiting for the multi-aggregations support in spark Structured Streaming framework from the Spark community)
>> >>>
>> >>> -Runner can execute Nexmark
>> >>>
>> >>> -Some things are not wired up yet
>> >>>
>> >>>   -Beam Schemas not wired with Spark Schemas
>> >>>
>> >>>   -Optional features of the model not implemented: state api, timer api, splittable doFn api, …
>> >>>
>> >>>
>> >>> 2. Regarding the communication to users:
>> >>>
>> >>> - for reasons explained by Ismael: the runner is in the same module as the "older" one. But it is in a different sub-package and both runners share the same build.
>> >>>
>> >>> - How should we communicate to users:
>> >>>
>> >>>     - shall we filter out the package name from the release?
>> >>>
>> >>>     - should we release 2 jars: one for the old and one for the new ?
>> >>>
>> >>>     - should we release 3 jars: one for the new, one for the new and one for both ?
>> >>>
>> >>>     - should we create a special entry to the capability matrix ?
>> >>>
>> >>> WDYT ?
>> >>>
>> >>> Best
>> >>>
>> >>> Etienne
>> >>>
>> >>>
>> >>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>> >>>
>> >>> +1 to merge.
>> >>>
>> >>> It is worth keeping things in master with explicitly marked status. It will make effort more visible to users and easier to get feedback upon.
>> >>>
>> >>> --Mikhail
>> >>>
>> >>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <ec...@apache.org> wrote:
>> >>>>
>> >>>> Hi guys,
>> >>>>
>> >>>> The new spark runner now supports beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
>> >>>>
>> >>>> next steps: support schemas and thus better leverage catalyst optimizer (among other things optims based on data), port perfs optims that were done in the current runner.
>> >>>>
>> >>>> Best
>> >>>>
>> >>>> Etienne
>> >>>>
>> >>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>> >>>>
>> >>>> +1 for merging : )
>> >>>>
>> >>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <ro...@google.com> wrote:
>> >>>>>
>> >>>>> Sounds like a good plan to me.
>> >>>>>
>> >>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <ec...@apache.org> wrote:
>> >>>>>>
>> >>>>>> Comments inline
>> >>>>>>
>> >>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>> >>>>>>
>> >>>>>> +1
>> >>>>>>
>> >>>>>> The earlier we get to master the better to encourage not only code
>> >>>>>> contributions but as important to have early user feedback.
>> >>>>>>
>> >>>>>> Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git) ?
>> >>>>>>
>> >>>>>> It is still too early to even start discussing when to remove the
>> >>>>>> classical runner given that the new runner is still a WIP. However the
>> >>>>>> overall goal is that this runner becomes the de-facto one once the VR
>> >>>>>> tests and the performance become at least equal to the classical
>> >>>>>> runner, in the meantime the best for users is that they co-exist,
>> >>>>>> let’s not forget that the other runner has been already battle tested
>> >>>>>> for more than 3 years and has had lots of improvements in the last
>> >>>>>> year.
>> >>>>>>
>> >>>>>> +1 on what Ismael says: no soon removal,
>> >>>>>>
>> >>>>>> The plan I had in mind at first (that I showed at the apacheCon) was this but I'm proposing moving the first gray label to before the red box.
>> >>>>>>
>> >>>>>> <beogijnhpieapoll.png>
>> >>>>>>
>> >>>>>>
>> >>>>>> I don't think the number of commits should be an issue--we shouldn't
>> >>>>>> just squash years worth of history away. (OTOH, if this is a case of
>> >>>>>> this branch containing lots of little, irrelevant commits that would
>> >>>>>> have normally been squashed away in the normal review process we do
>> >>>>>> for the main branch, then, yes, some cleanup could be nice.)
>> >>>>>>
>> >>>>>> About the commits we should encourage a clear history but we have also
>> >>>>>> to remove useless commits that are still present in the branch,
>> >>>>>> commits of the “Fix errorprone” / “Cleaning” kind and even commits
>> >>>>>> that make a better narrative sense together should be probably
>> >>>>>> squashed, because they do not bring much to the history. It is not
>> >>>>>> about more or less commits it is about its relevance as Robert
>> >>>>>> mentions.
>> >>>>>>
>> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>> >>>>>>
>> >>>>>> In its current form we cannot exclude it but this relates to the other
>> >>>>>> question, so better to explain a bit of history: The new runner used
>> >>>>>> to live in its own module and subdirectory because it is a full blank
>> >>>>>> page rewrite and the decision was not to use any of the classical
>> >>>>>> runner classes to not be constrained by its evolution.
>> >>>>>>
>> >>>>>> However the reason to put it back in the same module as a subdirectory
>> >>>>>> was to encourage early use, in more detail: The way you deploy spark
>> >>>>>> jobs today is usually by packaging and staging an uber jar (~200MB of
>> >>>>>> pure dependency joy) that contains the user pipeline classes, the
>> >>>>>> spark runner module and its dependencies. If we have two spark runners
>> >>>>>> in separate modules the user would need to repackage and redeploy
>> >>>>>> their pipelines every time they want to switch from the classical
>> >>>>>> Spark runner to the structured streaming runner which is painful and
>> >>>>>> time and space consuming compared with the one module approach where
>> >>>>>> they just change the name of the runner class and that’s it. The idea
>> >>>>>> here is to make easy for users to test the new runner, but at the same
>> >>>>>> time to make easy to come back to the classical runner in case of any
>> >>>>>> issue.
>> >>>>>>
>> >>>>>> Ismaël
>> >>>>>>
>> >>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <ke...@apache.org> wrote:
>> >>>>>>
>> >>>>>> +1
>> >>>>>>
>> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>> >>>>>>
>> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>> >>>>>>
>> >>>>>> I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
>> >>>>>>
>> >>>>>> Kenn
>> >>>>>>
>> >>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <ro...@google.com> wrote:
>> >>>>>>
>> >>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <ec...@apache.org> wrote:
>> >>>>>>
>> >>>>>> Hi guys,
>> >>>>>>
>> >>>>>> You probably know that there has been for several months an work
>> >>>>>> developing a new Spark runner based on Spark Structured Streaming
>> >>>>>> framework. This work is located in a feature branch here:
>> >>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>> >>>>>>
>> >>>>>> To attract more contributors and get some user feedback, we think it is
>> >>>>>> time to merge it to master. Before doing so, some steps need to be achieved:
>> >>>>>>
>> >>>>>> - finish the work on spark Encoders (that allow to call Beam coders)
>> >>>>>> because, right now, the runner is in an unstable state (some transforms
>> >>>>>> use the new way of doing ser/de and some use the old one, making a
>> >>>>>> pipeline incoherent toward serialization)
>> >>>>>>
>> >>>>>> - clean history: The history contains commits from November 2018, so
>> >>>>>> there is a good amount of work, thus a consequent number of commits.
>> >>>>>> They were already squashed but not from September 2019
>> >>>>>>
>> >>>>>> I don't think the number of commits should be an issue--we shouldn't
>> >>>>>> just squash years worth of history away. (OTOH, if this is a case of
>> >>>>>> this branch containing lots of little, irrelevant commits that would
>> >>>>>> have normally been squashed away in the normal review process we do
>> >>>>>> for the main branch, then, yes, some cleanup could be nice.)
>> >>>>>>
>> >>>>>> Regarding status:
>> >>>>>>
>> >>>>>> - the runner passes 89% of the validates runner tests in batch mode. We
>> >>>>>> hope to pass more with the new Encoders
>> >>>>>>
>> >>>>>> - Streaming mode is barely started (waiting for the multi-aggregations
>> >>>>>> support in spark SS framework from the Spark community)
>> >>>>>>
>> >>>>>> - Runner can execute Nexmark
>> >>>>>>
>> >>>>>> - Some things are not wired up yet
>> >>>>>>
>> >>>>>>      - Beam Schemas not wired with Spark Schemas
>> >>>>>>
>> >>>>>>      - Optional features of the model not implemented:  state api, timer
>> >>>>>> api, splittable doFn api, …
>> >>>>>>
>> >>>>>> WDYT, can we merge it to master once the 2 steps are done ?
>> >>>>>>
>> >>>>>> I think that as long as it sits parallel to the existing runner, and
>> >>>>>> is clearly marked with its status, it makes sense to me. How many
>> >>>>>> changes does it make to the existing codebase (as opposed to add new
>> >>>>>> code)?
>> >>>
>> >>>
>> >>>
>> >>

Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi guys

@Kenn,

I just wanted to mention that I did answer your question on 
dependencies here: 
https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E

Regarding jars:

I don't like 3 jars either.

I'm not in favor of having the 2 runners in one jar; the point of 
having 2 jars was to:

- avoid making promises to users on a work in progress runner (make it 
explicit with a different jar)

- avoid confusion for them (why are there 2 pipeline options? etc.)

If the community believes that there is no risk of confusion or broken 
promises with the one-jar solution, we could leave the 2 runners in one jar.
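
For reference, with the one-jar solution the switch is a single flag
for users. A quick sketch (the new runner's name here is indicative,
not final):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class MyPipelineMain {
      public static void main(String[] args) {
        // one jar on the classpath, runner chosen by a single flag, e.g.:
        //   --runner=SparkRunner                     (classic runner)
        //   --runner=SparkStructuredStreamingRunner  (new runner)
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);
        // ... apply transforms here ...
        pipeline.run().waitUntilFinish();
      }
    }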

Maybe we could start a vote on that?

Etienne

Re: [spark structured streaming runner] merge to master?

Posted by Kenneth Knowles <ke...@apache.org>.
Very good points. We definitely ship a lot of code/features in very early
stages, and there seems to be no problem.

I intend mostly to leave this judgment to people like you who know Spark
users better.

But I do think 1 or 2 jars is better than 3. I really don't like "3 jars"
and I did give two reasons:

1. diamond deps where things overlap
2. figuring out which thing to depend on

Both are annoying for users. I am not certain whether it could lead to a
real unsolvable situation. This is just a Java ecosystem problem, so I feel
qualified to comment.

I did also ask if there were major dependency differences between the two
that could cause problems for users. This question was dropped and no one
cared to comment, so I assume it is not an issue. So then I favor having
just 1 jar with both runners.

Kenn

Re: [spark structured streaming runner] merge to master?

Posted by Ismaël Mejía <ie...@gmail.com>.
I am still a bit lost about why we are discussing options without giving any
arguments or reasons for them. Why are 2 modules better than 3, or 3 better
than 2, or, even better, what forces us to have anything other than a single
module?

What are the reasons for wanting to have separate jars? If the issue is that
the code is unfinished or not passing the tests, the impact for end users is
minimal because they cannot accidentally end up running the new runner, and
if they decide to do so we can warn them, in the documentation and in the
runner itself, that it is at their own risk and not ready for production.

If the fear is that the new code may end up intertwined with the classic and
portable runners and cause side effects, we have the ValidatesRunner +
Nexmark suites in the CI to cover that, so again I do not see what problem
requires the modules to be separate.
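
To be concrete about the coverage meant here: the ValidatesRunner suite is
plain JUnit tests tagged with a category and executed against each runner.
A representative sketch (the test itself is made up for illustration):

    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.ValidatesRunner;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    public class ExampleValidatesRunnerTest {

      @Rule public final transient TestPipeline p = TestPipeline.create();

      @Test
      @Category(ValidatesRunner.class)
      public void testCreate() {
        PCollection<Integer> output = p.apply(Create.of(1, 2, 3));
        PAssert.that(output).containsInAnyOrder(1, 2, 3);
        p.run().waitUntilFinish();
      }
    }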

If the issue is being uncomfortable about having in-progress code in released
artifacts, we have been doing this in Beam forever, for example most of the
work on portability and Schema/SQL, and all of those were part of released
artifacts a long time before they were ready for prime time, so I still don't
see why this case is different enough to require separate artifacts.

I have the impression we are trying to solve a non-issue by adding a lot of
artificial complexity (in particular for the users), or am I missing
something else?

On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles <ke...@apache.org> wrote:
>
> Oh, I mean that we ship just 2 jars.
>
> And since Spark users always build an uber jar, they can still depend on both of ours and be able to switch runners with a flag.
>
> I really dislike projects shipping overlapping jars. It is confusing and causes major diamond dependency problems.
>
> Kenn
>
> On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko <ar...@gmail.com> wrote:
>>
>> Yes, agree, two jars included in uber jar will work in the similar way. Though having 3 jars looks still quite confusing for me.
>>
>> On 29 Oct 2019, at 23:54, Kenneth Knowles <ke...@apache.org> wrote:
>>
>> Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.
>>
>> Kenn
>>
>> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <ar...@gmail.com> wrote:
>>>
>>> Hmm, I don’t think that jar size should play a big role comparing to the whole size of shaded jar of users job. Even more, I think it will be quite confusing for users to choose which jar to use if we will have 3 different ones for similar purposes. Though, let’s see what others think.
>>>
>>> On 29 Oct 2019, at 15:32, Etienne Chauchot <ec...@apache.org> wrote:
>>>
>>> Hi Alexey,
>>>
>>> Thanks for your opinion !
>>>
>>> Comments inline
>>>
>>> Etienne
>>>
>>> On 28/10/2019 17:34, Alexey Romanenko wrote:
>>>
>>> Let me share some of my thoughts on this.
>>>
>>>     - shall we filter out the package name from the release?
>>>
>>> Until new runner is not ready to be used in production (or, at least, be used for beta testing but users should be clearly warned about that in this case), I believe we need to filter out its classes from published jar to avoid a confusion.
>>>
>>> Yes that is what I think also
>>>
>>>     - should we release 2 jars: one for the old and one for the new ?
>>>
>>>     - should we release 3 jars: one for the new, one for the new and one for both ?
>>>
>>> Once new runner will be released, then I think we need to provide only one single jar and allow user to switch between different Spark runners with CLI option.
>>>
>>> I would vote for 3 jars: one for new, one for old, and one for both. Indeed, in some cases, users are looking very closely at the size of jars. This solution meets all use cases
>>>
>>>     - should we create a special entry to the capability matrix ?
>>>
>>> Sure, since it has its own uniq characteristics and implementation, but again, only once new runner will be "officially released".
>>>
>>> +1
>>>
>>>
>>>
>>> On 28 Oct 2019, at 10:27, Etienne Chauchot <ec...@apache.org> wrote:
>>>
>>> Hi guys,
>>>
>>> Any opinions on the point2 communication to users ?
>>>
>>> Etienne
>>>
>>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>>>
>>> Hi guys,
>>>
>>> I'm glad to announce that the PR for the merge to master of the new runner based on Spark Structured Streaming framework is submitted:
>>>
>>> https://github.com/apache/beam/pull/9866
>>>
>>>
>>> 1. Regarding the status of the runner:
>>>
>>> -the runner passes 93% of the validates runner tests in batch mode.
>>>
>>> -Streaming mode is barely started (waiting for the multi-aggregations support in spark Structured Streaming framework from the Spark community)
>>>
>>> -Runner can execute Nexmark
>>>
>>> -Some things are not wired up yet
>>>
>>>   -Beam Schemas not wired with Spark Schemas
>>>
>>>   -Optional features of the model not implemented: state api, timer api, splittable doFn api, …
>>>
>>>
>>> 2. Regarding the communication to users:
>>>
>>> - for reasons explained by Ismael: the runner is in the same module as the "older" one. But it is in a different sub-package and both runners share the same build.
>>>
>>> - How should we communicate to users:
>>>
>>>     - shall we filter out the package name from the release?
>>>
>>>     - should we release 2 jars: one for the old and one for the new ?
>>>
>>>     - should we release 3 jars: one for the new, one for the new and one for both ?
>>>
>>>     - should we create a special entry to the capability matrix ?
>>>
>>> WDYT ?
>>>
>>> Best
>>>
>>> Etienne
>>>
>>>
>>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>>
>>> +1 to merge.
>>>
>>> It is worth keeping things in master with explicitly marked status. It will make effort more visible to users and easier to get feedback upon.
>>>
>>> --Mikhail
>>>
>>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <ec...@apache.org> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> The new spark runner now supports beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
>>>>
>>>> next steps: support schemas and thus better leverage catalyst optimizer (among other things optims based on data), port perfs optims that were done in the current runner.
>>>>
>>>> Best
>>>>
>>>> Etienne

Re: [spark structured streaming runner] merge to master?

Posted by Kenneth Knowles <ke...@apache.org>.
Oh, I mean that we ship just 2 jars.

And since Spark users always build an uber jar, they can still depend on
both of ours and be able to switch runners with a flag.
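
For illustration, the user-facing side of such a toggle could be as small as the sketch below. The flag values are assumptions here: the classical runner registers as SparkRunner, and the new one is assumed to register under a name like SparkStructuredStreamingRunner:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerToggleSketch {
      public static void main(String[] args) {
        // With both runners inside the user's uber jar, the runner is picked
        // at launch time from the command line, e.g.:
        //   --runner=SparkRunner                     (classical runner)
        //   --runner=SparkStructuredStreamingRunner  (assumed registered name)
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);
        // ... build the same pipeline regardless of the chosen runner ...
        pipeline.run().waitUntilFinish();
      }
    }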

I really dislike projects shipping overlapping jars. It is confusing and
causes major diamond dependency problems.

Kenn

On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko <ar...@gmail.com>
wrote:

> Yes, agree, two jars included in an uber jar will work in a similar way.
> Though having 3 jars still looks quite confusing to me.

Re: [spark structured streaming runner] merge to master?

Posted by Alexey Romanenko <ar...@gmail.com>.
Yes, agree, two jars included in an uber jar will work in a similar way. Though having 3 jars still looks quite confusing to me.

> On 29 Oct 2019, at 23:54, Kenneth Knowles <ke...@apache.org> wrote:
> 
> Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.
> 
> Kenn


Re: [spark structured streaming runner] merge to master?

Posted by Kenneth Knowles <ke...@apache.org>.
Is it just as easy to have two jars and build an uber jar with both
included? Then the runner can still be toggled with a flag.

Kenn

On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <ar...@gmail.com>
wrote:

> Hmm, I don’t think that jar size should play a big role compared to the
> whole size of the shaded jar of a user's job. Moreover, I think it will be
> quite confusing for users to choose which jar to use if we have 3 different
> ones for similar purposes. Though, let’s see what others think.

Re: [spark structured streaming runner] merge to master?

Posted by Alexey Romanenko <ar...@gmail.com>.
Hmm, I don’t think that jar size should play a big role compared to the whole size of the shaded jar of a user's job. Moreover, I think it will be quite confusing for users to choose which jar to use if we have 3 different ones for similar purposes. Though, let’s see what others think.

> On 29 Oct 2019, at 15:32, Etienne Chauchot <ec...@apache.org> wrote:
> 
> Hi Alexey, 
> Thanks for your opinion !
> 
> Comments inline
> 
> Etienne
> On 28/10/2019 17:34, Alexey Romanenko wrote:
>> Let me share some of my thoughts on this.
>>>>     - shall we filter out the package name from the release? 
>> As long as the new runner is not ready to be used in production (or, at least, to be used for beta testing with users clearly warned about that), I believe we need to filter out its classes from the published jar to avoid confusion.
> Yes that is what I think also
>>>>     - should we release 2 jars: one for the old and one for the new ? 
>>>>     - should we release 3 jars: one for the old, one for the new, and one for both ?
>>>> 
>> Once the new runner is released, then I think we need to provide only one single jar and allow the user to switch between the different Spark runners with a CLI option.
> I would vote for 3 jars: one for the new, one for the old, and one for both. Indeed, in some cases, users are looking very closely at the size of jars. This solution meets all use cases.
>>>>     - should we create a special entry to the capability matrix ?
>>>> 
>> 
>> Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".
> +1


Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi Alexey,

Thanks for your opinion !

Comments inline

Etienne

On 28/10/2019 17:34, Alexey Romanenko wrote:
> Let me share some of my thoughts on this.
>>>
>>>     - shall we filter out the package name from the release?
>>>
> As long as the new runner is not ready to be used in production (or, at 
> least, to be used for beta testing with users clearly warned about 
> that), I believe we need to filter out its classes from the 
> published jar to avoid confusion.
Yes that is what I think also
>>>
>>>     - should we release 2 jars: one for the old and one for the new ?
>>>
>>>     - should we release 3 jars: one for the old, one for the new, and 
>>> one for both ?
>>>
> Once the new runner is released, then I think we need to provide only 
> one single jar and allow the user to switch between the different Spark 
> runners with a CLI option.

I would vote for 3 jars: one for the new, one for the old, and one for 
both. Indeed, in some cases, users are looking very closely at the size 
of jars. This solution meets all use cases.

>>>     - should we create a special entry to the capability matrix ?
>>>
> Sure, since it has its own unique characteristics and implementation, 
> but again, only once the new runner is "officially released".
+1
>
>
>> On 28 Oct 2019, at 10:27, Etienne Chauchot <echauchot@apache.org 
>> <ma...@apache.org>> wrote:
>>
>> Hi guys,
>>
>> Any opinions on point 2, the communication to users ?
>>
>> Etienne
>>
>> On 24/10/2019 15:44, Etienne Chauchot wrote:
>>>
>>> Hi guys,
>>>
>>> I'm glad to announce that the PR for the merge to master of the new 
>>> runner based on Spark Structured Streaming framework is submitted:
>>>
>>> https://github.com/apache/beam/pull/9866
>>>
>>>
>>> 1. Regarding the status of the runner:
>>>
>>> -the runner passes 93% of the validates runner tests in batch mode.
>>>
>>> -Streaming mode is barely started (waiting for the 
>>> multi-aggregations support in spark Structured Streaming framework 
>>> from the Spark community)
>>>
>>> -Runner can execute Nexmark
>>>
>>> -Some things are not wired up yet
>>>
>>>   -Beam Schemas not wired with Spark Schemas
>>>
>>>   -Optional features of the model not implemented: state api, timer 
>>> api, splittable doFn api, …
>>>
>>>
>>> 2. Regarding the communication to users:
>>>
>>> - for reasons explained by Ismael: the runner is in the same module 
>>> as the "older" one. But it is in a different sub-package and both 
>>> runners share the same build.
>>>
>>> - How should we communicate to users:
>>>
>>>     - shall we filter out the package name from the release?
>>>
>>>     - should we release 2 jars: one for the old and one for the new ?
>>>
>>>     - should we release 3 jars: one for the old, one for the new, and 
>>> one for both ?
>>>
>>>     - should we create a special entry to the capability matrix ?
>>>
>>> WDYT ?
>>>
>>> Best
>>>
>>> Etienne
>>>
>>>
>>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>>> +1 to merge.
>>>>
>>>> It is worth keeping things in master with explicitly marked status. 
>>>> It will make effort more visible to users and easier to get 
>>>> feedback upon.
>>>>
>>>> --Mikhail
>>>>
>>>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot 
>>>> <echauchot@apache.org <ma...@apache.org>> wrote:
>>>>
>>>>     Hi guys,
>>>>
>>>>     The new spark runner now supports beam coders and passes 93% of
>>>>     the batch validates runner tests (+4%). I think it is time to
>>>>     merge it to master. I will submit a PR in the coming days.
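
As a side note on what "supports beam coders" can mean concretely: a deliberately naive way to reuse a Beam Coder for Spark ser/de is to round-trip elements through byte arrays with Spark's binary encoder, as in the sketch below. This is a simplified illustration (names like BeamCoderBridgeSketch are invented), not the actual Encoders work from the branch, which plugs in at a lower level:

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.util.CoderUtils;
    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    public class BeamCoderBridgeSketch {
      // Serializes each element with the given Beam coder; the payload then
      // travels through Spark as plain byte arrays.
      public static <T> Dataset<byte[]> toBytes(Dataset<T> input, Coder<T> coder) {
        return input.map(
            (MapFunction<T, byte[]>) value -> CoderUtils.encodeToByteArray(coder, value),
            Encoders.BINARY());
      }

      // Decodes the byte arrays back, here with Beam's StringUtf8Coder as an example.
      public static Dataset<String> fromBytes(Dataset<byte[]> input) {
        return input.map(
            (MapFunction<byte[], String>) bytes ->
                CoderUtils.decodeFromByteArray(StringUtf8Coder.of(), bytes),
            Encoders.STRING());
      }
    }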
>>>>
>>>>     next steps: support schemas and thus better leverage the Catalyst
>>>>     optimizer (among other things, optimizations based on data), and
>>>>     port the performance optimizations that were done in the current runner.
>>>>
>>>>     Best
>>>>
>>>>     Etienne
>>>>
>>>>     On 11/10/2019 22:48, Pablo Estrada wrote:
>>>>>     +1 for merging : )
>>>>>
>>>>>     On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw
>>>>>     <robertwb@google.com <ma...@google.com>> wrote:
>>>>>
>>>>>         Sounds like a good plan to me.
>>>>>
>>>>>         On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot
>>>>>         <echauchot@apache.org <ma...@apache.org>> wrote:
>>>>>
>>>>>             Comments inline
>>>>>
>>>>>             On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>>>             +1
>>>>>>
>>>>>>             The earlier we get to master the better, to encourage not only code
>>>>>>             contributions but, just as importantly, early user feedback.
>>>>>>
>>>>>>>             Question is: do we keep the "old" spark runner for a while or not (or just keep it at a previous version/tag on git) ?
>>>>>>             It is still too early to even start discussing when to remove the
>>>>>>             classical runner, given that the new runner is still a WIP. However, the
>>>>>>             overall goal is that this runner becomes the de-facto one once the VR
>>>>>>             tests and the performance become at least equal to the classical
>>>>>>             runner's. In the meantime, the best for users is that they co-exist;
>>>>>>             let’s not forget that the other runner has already been battle tested
>>>>>>             for more than 3 years and has had lots of improvements in the last
>>>>>>             year.
>>>>>
>>>>>             +1 on what Ismael says: no removal anytime soon.
>>>>>
>>>>>             The plan I had in mind at first (that I showed at
>>>>>             ApacheCon) was this, but I'm proposing moving the first
>>>>>             gray label to before the red box.
>>>>>
>>>>>             <beogijnhpieapoll.png>
>>>>>
>>>>>
>>>>>>>             I don't think the number of commits should be an issue--we shouldn't
>>>>>>>             just squash years worth of history away. (OTOH, if this is a case of
>>>>>>>             this branch containing lots of little, irrelevant commits that would
>>>>>>>             have normally been squashed away in the normal review process we do
>>>>>>>             for the main branch, then, yes, some cleanup could be nice.)
>>>>>>             About the commits we should encourage a clear history but we have also
>>>>>>             to remove useless commits that are still present in the branch,
>>>>>>             commits of the “Fix errorprone” / “Cleaning” kind and even commits
>>>>>>             that make a better narrative sense together should be probably
>>>>>>             squashed, because they do not bring much to the history. It is not
>>>>>>             about more or fewer commits, it is about their relevance, as Robert
>>>>>>             mentions.
>>>>>>
>>>>>>>             I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>>>             I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>>             In its current form we cannot exclude it but this relates to the other
>>>>>>             question, so better to explain a bit of history: The new runner used
>>>>>>             to live in its own module and subdirectory because it is a full blank
>>>>>>             page rewrite and the decision was not to use any of the classical
>>>>>>             runner classes to not be constrained by its evolution.
>>>>>>
>>>>>>             However the reason to put it back in the same module as a subdirectory
>>>>>>             was to encourage early use, in more detail: The way you deploy spark
>>>>>>             jobs today is usually by packaging and staging an uber jar (~200MB of
>>>>>>             pure dependency joy) that contains the user pipeline classes, the
>>>>>>             spark runner module and its dependencies. If we have two spark runners
>>>>>>             in separate modules the user would need to repackage and redeploy
>>>>>>             their pipelines every time they want to switch from the classical
>>>>>>             Spark runner to the structured streaming runner which is painful and
>>>>>>             time and space consuming compared with the one module approach where
>>>>>>             they just change the name of the runner class and that’s it. The idea
>>>>>>             here is to make it easy for users to test the new runner, but at the same
>>>>>>             time to make it easy to come back to the classical runner in case of any
>>>>>>             issue.
>>>>>>
>>>>>>             Ismaël
>>>>>>
>>>>>>             On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles<ke...@apache.org>  <ma...@apache.org>  wrote:
>>>>>>>             +1
>>>>>>>
>>>>>>>             I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>>>
>>>>>>>             I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>>>
>>>>>>>             I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
>>>>>>>
>>>>>>>             Kenn
>>>>>>>
>>>>>>>             On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw<ro...@google.com>  <ma...@google.com>  wrote:
>>>>>>>>             On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot<ec...@apache.org>  <ma...@apache.org>  wrote:
>>>>>>>>>             Hi guys,
>>>>>>>>>
>>>>>>>>>             You probably know that for several months there has been work underway
>>>>>>>>>             developing a new Spark runner based on the Spark Structured Streaming
>>>>>>>>>             framework. This work is located in a feature branch here:
>>>>>>>>>             https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>>>
>>>>>>>>>             To attract more contributors and get some user feedback, we think it is
>>>>>>>>>             time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>>>>>
>>>>>>>>>             - finish the work on spark Encoders (that allow calling Beam coders)
>>>>>>>>>             because, right now, the runner is in an unstable state (some transforms
>>>>>>>>>             use the new way of doing ser/de and some use the old one, making the
>>>>>>>>>             pipeline inconsistent with respect to serialization)
>>>>>>>>>
>>>>>>>>>             - clean history: The history contains commits from November 2018, so
>>>>>>>>>             there is a good amount of work and thus a sizeable number of commits.
>>>>>>>>>             They were already squashed, but not the ones from September 2019 onward.
>>>>>>>>             I don't think the number of commits should be an issue--we shouldn't
>>>>>>>>             just squash years worth of history away. (OTOH, if this is a case of
>>>>>>>>             this branch containing lots of little, irrelevant commits that would
>>>>>>>>             have normally been squashed away in the normal review process we do
>>>>>>>>             for the main branch, then, yes, some cleanup could be nice.)
>>>>>>>>
>>>>>>>>>             Regarding status:
>>>>>>>>>
>>>>>>>>>             - the runner passes 89% of the validates runner tests in batch mode. We
>>>>>>>>>             hope to pass more with the new Encoders
>>>>>>>>>
>>>>>>>>>             - Streaming mode is barely started (waiting for the multi-aggregations
>>>>>>>>>             support in spark SS framework from the Spark community)
>>>>>>>>>
>>>>>>>>>             - Runner can execute Nexmark
>>>>>>>>>
>>>>>>>>>             - Some things are not wired up yet
>>>>>>>>>
>>>>>>>>>                   - Beam Schemas not wired with Spark Schemas
>>>>>>>>>
>>>>>>>>>                   - Optional features of the model not implemented:  state api, timer
>>>>>>>>>             api, splittable doFn api, …
>>>>>>>>>
>>>>>>>>>             WDYT, can we merge it to master once the 2 steps are done ?
>>>>>>>>             I think that as long as it sits parallel to the existing runner, and
>>>>>>>>             is clearly marked with its status, it makes sense to me. How many
>>>>>>>>             changes does it make to the existing codebase (as opposed to add new
>>>>>>>>             code)?
>>>>>
>

Re: [spark structured streaming runner] merge to master?

Posted by Alexey Romanenko <ar...@gmail.com>.
Let me share some of my thoughts on this.
>>     - shall we filter out the package name from the release? 
>> 
Until the new runner is ready to be used in production (or, at least, used for beta testing, in which case users should be clearly warned about its status), I believe we need to filter out its classes from the published jar to avoid confusion.
>>     - should we release 2 jars: one for the old and one for the new ? 
>> 
>>     - should we release 3 jars: one for the old, one for the new and one for both ?
>> 
Once the new runner is released, I think we should provide a single jar and allow users to switch between the different Spark runners with a CLI option.
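
To make that concrete, here is a minimal sketch of what switching runners from the CLI could look like. The runner names below (SparkRunner for the classical runner, SparkStructuredStreamingRunner for the new one) are assumptions for illustration, since the new runner is not merged yet:

    // Same pipeline code for both runners; only the --runner flag changes,
    // e.g. --runner=SparkRunner or --runner=SparkStructuredStreamingRunner.
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class MyPipelineOnSpark {
      public static void main(String[] args) {
        // Parses --runner and the other pipeline options from the command line.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here ...
        p.run().waitUntilFinish();
      }
    }
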
>>     - should we create a special entry to the capability matrix ?
>> 

Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".


> On 28 Oct 2019, at 10:27, Etienne Chauchot <ec...@apache.org> wrote:
> 
> Hi guys,
> 
> Any opinions on point 2, the communication to users?
> 
> Etienne
> On 24/10/2019 15:44, Etienne Chauchot wrote:
>> Hi guys,
>> 
>> I'm glad to announce that the PR for the merge to master of the new runner, based on the Spark Structured Streaming framework, has been submitted:
>> 
>> https://github.com/apache/beam/pull/9866 <https://github.com/apache/beam/pull/9866>
>> 
>> 1. Regarding the status of the runner: 
>> -the runner passes 93% of the validates runner tests in batch mode.
>> 
>> -Streaming mode is barely started (waiting for the multi-aggregations support in spark Structured Streaming framework from the Spark community)
>> 
>> -Runner can execute Nexmark
>> 
>> -Some things are not wired up yet
>> 
>>   -Beam Schemas not wired with Spark Schemas
>> 
>>   -Optional features of the model not implemented: state api, timer api, splittable doFn api, …
>> 
>> 2. Regarding the communication to users:
>> 
>> - for reasons explained by Ismael: the runner is in the same module as the "older" one. But it is in a different sub-package and both runners share the same build.  
>> - How should we communicate to users: 
>>     - shall we filter out the package name from the release? 
>>     - should we release 2 jars: one for the old and one for the new ? 
>>     - should we release 3 jars: one for the old, one for the new and one for both ?
>> 
>>     - should we create a special entry to the capability matrix ?
>> 
>> WDYT ?
>> Best
>> 
>> Etienne
>> 
>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>>> +1 to merge.
>>> 
>>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and easier to get feedback on.
>>> 
>>> --Mikhail
>>> 
>>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauchot@apache.org <ma...@apache.org>> wrote:
>>> Hi guys,
>>> 
>>> The new spark runner now supports beam coders and passes 93% of the batch validates runner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
>>> 
>>> next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on data), and port the performance optimizations that were done in the current runner.
>>> Best
>>> Etienne
>>> On 11/10/2019 22:48, Pablo Estrada wrote:
>>>> +1 for merging : )
>>>> 
>>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <robertwb@google.com <ma...@google.com>> wrote:
>>>> Sounds like a good plan to me. 
>>>> 
>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <echauchot@apache.org <ma...@apache.org>> wrote:
>>>> Comments inline
>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>> +1
>>>>> 
>>>>> The earlier we get to master the better to encourage not only code
>>>>> contributions but, just as importantly, to get early user feedback.
>>>>> 
>>>>>> Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git) ?
>>>>> It is still too early to even start discussing when to remove the
>>>>> classical runner given that the new runner is still a WIP. However the
>>>>> overall goal is that this runner becomes the de-facto one once the VR
>>>>> tests and the performance become at least equal to the classical
>>>>> runner, in the meantime the best for users is that they co-exist,
>>>>> let’s not forget that the other runner has already been battle tested
>>>>> for more than 3 years and has had lots of improvements in the last
>>>>> year.
>>>> +1 on what Ismael says: no removal any time soon. 
>>>> The plan I had in mind at first (that I showed at ApacheCon) was this but I'm proposing moving the first gray label to before the red box. 
>>>> 
>>>> 
>>>>>> I don't think the number of commits should be an issue--we shouldn't
>>>>>> just squash years worth of history away. (OTOH, if this is a case of
>>>>>> this branch containing lots of little, irrelevant commits that would
>>>>>> have normally been squashed away in the normal review process we do
>>>>>> for the main branch, then, yes, some cleanup could be nice.)
>>>>> About the commits we should encourage a clear history but we have also
>>>>> to remove useless commits that are still present in the branch,
>>>>> commits of the “Fix errorprone” / “Cleaning” kind and even commits
>>>>> that make a better narrative sense together should be probably
>>>>> squashed, because they do not bring much to the history. It is not
>>>>> about more or fewer commits, it is about their relevance, as Robert
>>>>> mentions.
>>>>> 
>>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>> In its current form we cannot exclude it but this relates to the other
>>>>> question, so better to explain a bit of history: The new runner used
>>>>> to live in its own module and subdirectory because it is a full blank
>>>>> page rewrite and the decision was not to use any of the classical
>>>>> runner classes to not be constrained by its evolution.
>>>>> 
>>>>> However the reason to put it back in the same module as a subdirectory
>>>>> was to encourage early use, in more detail: The way you deploy spark
>>>>> jobs today is usually by packaging and staging an uber jar (~200MB of
>>>>> pure dependency joy) that contains the user pipeline classes, the
>>>>> spark runner module and its dependencies. If we have two spark runners
>>>>> in separate modules the user would need to repackage and redeploy
>>>>> their pipelines every time they want to switch from the classical
>>>>> Spark runner to the structured streaming runner which is painful and
>>>>> time and space consuming compared with the one module approach where
>>>>> they just change the name of the runner class and that’s it. The idea
>>>>> here is to make it easy for users to test the new runner, but at the same
>>>>> time to make it easy to come back to the classical runner in case of any
>>>>> issue.
>>>>> 
>>>>> Ismaël
>>>>> 
>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <ke...@apache.org> <ma...@apache.org> wrote:
>>>>>> +1
>>>>>> 
>>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>> 
>>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>> 
>>>>>> I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
>>>>>> 
>>>>>> Kenn
>>>>>> 
>>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <ro...@google.com> <ma...@google.com> wrote:
>>>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <ec...@apache.org> <ma...@apache.org> wrote:
>>>>>>>> Hi guys,
>>>>>>>> 
>>>>>>>> You probably know that for several months there has been work underway
>>>>>>>> developing a new Spark runner based on the Spark Structured Streaming
>>>>>>>> framework. This work is located in a feature branch here:
>>>>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>> 
>>>>>>>> To attract more contributors and get some user feedback, we think it is
>>>>>>>> time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>>>> 
>>>>>>>> - finish the work on spark Encoders (that allow calling Beam coders)
>>>>>>>> because, right now, the runner is in an unstable state (some transforms
>>>>>>>> use the new way of doing ser/de and some use the old one, making the
>>>>>>>> pipeline inconsistent with respect to serialization)
>>>>>>>> 
>>>>>>>> - clean history: The history contains commits from November 2018, so
>>>>>>>> there is a good amount of work and thus a sizeable number of commits.
>>>>>>>> They were already squashed, but not the ones from September 2019 onward.
>>>>>>> I don't think the number of commits should be an issue--we shouldn't
>>>>>>> just squash years worth of history away. (OTOH, if this is a case of
>>>>>>> this branch containing lots of little, irrelevant commits that would
>>>>>>> have normally been squashed away in the normal review process we do
>>>>>>> for the main branch, then, yes, some cleanup could be nice.)
>>>>>>> 
>>>>>>>> Regarding status:
>>>>>>>> 
>>>>>>>> - the runner passes 89% of the validates runner tests in batch mode. We
>>>>>>>> hope to pass more with the new Encoders
>>>>>>>> 
>>>>>>>> - Streaming mode is barely started (waiting for the multi-aggregations
>>>>>>>> support in spark SS framework from the Spark community)
>>>>>>>> 
>>>>>>>> - Runner can execute Nexmark
>>>>>>>> 
>>>>>>>> - Some things are not wired up yet
>>>>>>>> 
>>>>>>>>      - Beam Schemas not wired with Spark Schemas
>>>>>>>> 
>>>>>>>>      - Optional features of the model not implemented:  state api, timer
>>>>>>>> api, splittable doFn api, …
>>>>>>>> 
>>>>>>>> WDYT, can we merge it to master once the 2 steps are done ?
>>>>>>> I think that as long as it sits parallel to the existing runner, and
>>>>>>> is clearly marked with its status, it makes sense to me. How many
>>>>>>> changes does it make to the existing codebase (as opposed to add new
>>>>>>> code)?


Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi guys,

Any opinions on point 2, the communication to users?

Etienne

On 24/10/2019 15:44, Etienne Chauchot wrote:
>
> Hi guys,
>
> I'm glad to announce that the PR for the merge to master of the new 
> runner, based on the Spark Structured Streaming framework, has been submitted:
>
> https://github.com/apache/beam/pull/9866
>
>
> 1. Regarding the status of the runner:
>
> -the runner passes 93% of the validates runner tests in batch mode.
>
> -Streaming mode is barely started (waiting for the multi-aggregations 
> support in spark Structured Streaming framework from the Spark community)
>
> -Runner can execute Nexmark
>
> -Some things are not wired up yet
>
>   -Beam Schemas not wired with Spark Schemas
>
>   -Optional features of the model not implemented: state api, timer 
> api, splittable doFn api, …
>
>
> 2. Regarding the communication to users:
>
> - for reasons explained by Ismael: the runner is in the same module as 
> the "older" one. But it is in a different sub-package and both runners 
> share the same build.
>
> - How should we communicate to users:
>
>     - shall we filter out the package name from the release?
>
>     - should we release 2 jars: one for the old and one for the new ?
>
>     - should we release 3 jars: one for the old, one for the new and 
> one for both ?
>
>     - should we create a special entry to the capability matrix ?
>
> WDYT ?
>
> Best
>
> Etienne
>
>
> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
>> +1 to merge.
>>
>> It is worth keeping things in master with an explicitly marked status. 
>> It will make the effort more visible to users and easier to get feedback 
>> on.
>>
>> --Mikhail
>>
>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot 
>> <echauchot@apache.org <ma...@apache.org>> wrote:
>>
>>     Hi guys,
>>
>>     The new spark runner now supports beam coders and passes 93% of
>>     the batch validates runner tests (+4%). I think it is time to
>>     merge it to master. I will submit a PR in the coming days.
>>
>>     next steps: support schemas and thus better leverage the Catalyst
>>     optimizer (among other things, optimizations based on data), and port
>>     the performance optimizations that were done in the current runner.
>>
>>     Best
>>
>>     Etienne
>>
>>     On 11/10/2019 22:48, Pablo Estrada wrote:
>>>     +1 for merging : )
>>>
>>>     On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw
>>>     <robertwb@google.com <ma...@google.com>> wrote:
>>>
>>>         Sounds like a good plan to me.
>>>
>>>         On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot
>>>         <echauchot@apache.org <ma...@apache.org>> wrote:
>>>
>>>             Comments inline
>>>
>>>             On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>>             +1
>>>>
>>>>             The earlier we get to master the better to encourage not only code
>>>>             contributions but, just as importantly, to get early user feedback.
>>>>
>>>>>             Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git) ?
>>>>             It is still too early to even start discussing when to remove the
>>>>             classical runner given that the new runner is still a WIP. However the
>>>>             overall goal is that this runner becomes the de-facto one once the VR
>>>>             tests and the performance become at least equal to the classical
>>>>             runner, in the meantime the best for users is that they co-exist,
>>>>             let’s not forget that the other runner has already been battle tested
>>>>             for more than 3 years and has had lots of improvements in the last
>>>>             year.
>>>
>>>             +1 on what Ismael says: no removal any time soon.
>>>
>>>             The plan I had in mind at first (that I showed at
>>>             ApacheCon) was this but I'm proposing moving the first
>>>             gray label to before the red box.
>>>
>>>
>>>>>             I don't think the number of commits should be an issue--we shouldn't
>>>>>             just squash years worth of history away. (OTOH, if this is a case of
>>>>>             this branch containing lots of little, irrelevant commits that would
>>>>>             have normally been squashed away in the normal review process we do
>>>>>             for the main branch, then, yes, some cleanup could be nice.)
>>>>             About the commits we should encourage a clear history but we have also
>>>>             to remove useless commits that are still present in the branch,
>>>>             commits of the “Fix errorprone” / “Cleaning” kind and even commits
>>>>             that make a better narrative sense together should be probably
>>>>             squashed, because they do not bring much to the history. It is not
>>>>             about more or fewer commits, it is about their relevance, as Robert
>>>>             mentions.
>>>>
>>>>>             I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>             I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>             In its current form we cannot exclude it but this relates to the other
>>>>             question, so better to explain a bit of history: The new runner used
>>>>             to live in its own module and subdirectory because it is a full blank
>>>>             page rewrite and the decision was not to use any of the classical
>>>>             runner classes to not be constrained by its evolution.
>>>>
>>>>             However the reason to put it back in the same module as a subdirectory
>>>>             was to encourage early use, in more detail: The way you deploy spark
>>>>             jobs today is usually by packaging and staging an uber jar (~200MB of
>>>>             pure dependency joy) that contains the user pipeline classes, the
>>>>             spark runner module and its dependencies. If we have two spark runners
>>>>             in separate modules the user would need to repackage and redeploy
>>>>             their pipelines every time they want to switch from the classical
>>>>             Spark runner to the structured streaming runner which is painful and
>>>>             time and space consuming compared with the one module approach where
>>>>             they just change the name of the runner class and that’s it. The idea
>>>>             here is to make it easy for users to test the new runner, but at the same
>>>>             time to make it easy to come back to the classical runner in case of any
>>>>             issue.
>>>>
>>>>             Ismaël
>>>>
>>>>             On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles<ke...@apache.org>  <ma...@apache.org>  wrote:
>>>>>             +1
>>>>>
>>>>>             I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>>
>>>>>             I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>>
>>>>>             I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
>>>>>
>>>>>             Kenn
>>>>>
>>>>>             On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw<ro...@google.com>  <ma...@google.com>  wrote:
>>>>>>             On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot<ec...@apache.org>  <ma...@apache.org>  wrote:
>>>>>>>             Hi guys,
>>>>>>>
>>>>>>>             You probably know that for several months there has been work underway
>>>>>>>             developing a new Spark runner based on the Spark Structured Streaming
>>>>>>>             framework. This work is located in a feature branch here:
>>>>>>>             https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>>
>>>>>>>             To attract more contributors and get some user feedback, we think it is
>>>>>>>             time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>>>
>>>>>>>             - finish the work on spark Encoders (that allow calling Beam coders)
>>>>>>>             because, right now, the runner is in an unstable state (some transforms
>>>>>>>             use the new way of doing ser/de and some use the old one, making the
>>>>>>>             pipeline inconsistent with respect to serialization)
>>>>>>>
>>>>>>>             - clean history: The history contains commits from November 2018, so
>>>>>>>             there is a good amount of work and thus a sizeable number of commits.
>>>>>>>             They were already squashed, but not the ones from September 2019 onward.
>>>>>>             I don't think the number of commits should be an issue--we shouldn't
>>>>>>             just squash years worth of history away. (OTOH, if this is a case of
>>>>>>             this branch containing lots of little, irrelevant commits that would
>>>>>>             have normally been squashed away in the normal review process we do
>>>>>>             for the main branch, then, yes, some cleanup could be nice.)
>>>>>>
>>>>>>>             Regarding status:
>>>>>>>
>>>>>>>             - the runner passes 89% of the validates runner tests in batch mode. We
>>>>>>>             hope to pass more with the new Encoders
>>>>>>>
>>>>>>>             - Streaming mode is barely started (waiting for the multi-aggregations
>>>>>>>             support in spark SS framework from the Spark community)
>>>>>>>
>>>>>>>             - Runner can execute Nexmark
>>>>>>>
>>>>>>>             - Some things are not wired up yet
>>>>>>>
>>>>>>>                   - Beam Schemas not wired with Spark Schemas
>>>>>>>
>>>>>>>                   - Optional features of the model not implemented:  state api, timer
>>>>>>>             api, splittable doFn api, …
>>>>>>>
>>>>>>>             WDYT, can we merge it to master once the 2 steps are done ?
>>>>>>             I think that as long as it sits parallel to the existing runner, and
>>>>>>             is clearly marked with its status, it makes sense to me. How many
>>>>>>             changes does it make to the existing codebase (as opposed to add new
>>>>>>             code)?
>>>

Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi guys,

I'm glad to announce that the PR for the merge to master of the new 
runner, based on the Spark Structured Streaming framework, has been submitted:

https://github.com/apache/beam/pull/9866


1. Regarding the status of the runner:

- the runner passes 93% of the validates runner tests in batch mode (a 
sketch of what such a test looks like follows this list)

- streaming mode is barely started (waiting for the multi-aggregations 
support in the Spark Structured Streaming framework from the Spark community)

- the runner can execute Nexmark

- some things are not wired up yet:

   - Beam Schemas not wired with Spark Schemas

   - optional features of the model not implemented: state api, timer 
api, splittable doFn api, …
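
For readers who have not seen them, the validates runner tests are ordinary Beam pipeline tests executed against the runner under test. A minimal sketch of the shape such a test takes (this particular test body is illustrative, not one of the actual suite's tests):

    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.ValidatesRunner;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    public class CountValidatesRunnerTest {
      // TestPipeline reads the runner from the test pipeline options, so the
      // same test can validate whichever runner it is pointed at.
      @Rule public final transient TestPipeline p = TestPipeline.create();

      @Test
      @Category(ValidatesRunner.class)
      public void testCountGlobally() {
        PCollection<Long> counts =
            p.apply(Create.of("a", "b", "c")).apply(Count.globally());
        PAssert.that(counts).containsInAnyOrder(3L);
        p.run().waitUntilFinish();
      }
    }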


2. Regarding the communication to users:

- for reasons explained by Ismael: the runner is in the same module as 
the "older" one. But it is in a different sub-package and both runners 
share the same build.

- How should we communicate to users:

     - shall we filter out the package name from the release?

     - should we release 2 jars: one for the old and one for the new ?

     - should we release 3 jars: one for the old, one for the new and 
one for both ?

     - should we create a special entry to the capability matrix ?

WDYT ?

Best

Etienne


On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
> +1 to merge.
>
> It is worth keeping things in master with an explicitly marked status. It 
> will make the effort more visible to users and easier to get feedback on.
>
> --Mikhail
>
> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauchot@apache.org 
> <ma...@apache.org>> wrote:
>
>     Hi guys,
>
>     The new spark runner now supports beam coders and passes 93% of
>     the batch validates runner tests (+4%). I think it is time to
>     merge it to master. I will submit a PR in the coming days.
>
>     next steps: support schemas and thus better leverage the Catalyst
>     optimizer (among other things, optimizations based on data), and port
>     the performance optimizations that were done in the current runner.
>
>     Best
>
>     Etienne
>
>     On 11/10/2019 22:48, Pablo Estrada wrote:
>>     +1 for merging : )
>>
>>     On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw
>>     <robertwb@google.com <ma...@google.com>> wrote:
>>
>>         Sounds like a good plan to me.
>>
>>         On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot
>>         <echauchot@apache.org <ma...@apache.org>> wrote:
>>
>>             Comments inline
>>
>>             On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>             +1
>>>
>>>             The earlier we get to master the better to encourage not only code
>>>             contributions but, just as importantly, to get early user feedback.
>>>
>>>>             Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git) ?
>>>             It is still too early to even start discussing when to remove the
>>>             classical runner given that the new runner is still a WIP. However the
>>>             overall goal is that this runner becomes the de-facto one once the VR
>>>             tests and the performance become at least equal to the classical
>>>             runner, in the meantime the best for users is that they co-exist,
>>>             let’s not forget that the other runner has already been battle tested
>>>             for more than 3 years and has had lots of improvements in the last
>>>             year.
>>
>>             +1 on what Ismael says: no removal any time soon.
>>
>>             The plan I had in mind at first (that I showed at
>>             ApacheCon) was this but I'm proposing moving the first
>>             gray label to before the red box.
>>
>>
>>>>             I don't think the number of commits should be an issue--we shouldn't
>>>>             just squash years worth of history away. (OTOH, if this is a case of
>>>>             this branch containing lots of little, irrelevant commits that would
>>>>             have normally been squashed away in the normal review process we do
>>>>             for the main branch, then, yes, some cleanup could be nice.)
>>>             About the commits we should encourage a clear history but we have also
>>>             to remove useless commits that are still present in the branch,
>>>             commits of the “Fix errorprone” / “Cleaning” kind and even commits
>>>             that make a better narrative sense together should be probably
>>>             squashed, because they do not bring much to the history. It is not
>>>             about more or fewer commits, it is about their relevance, as Robert
>>>             mentions.
>>>
>>>>             I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>             I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>             In its current form we cannot exclude it but this relates to the other
>>>             question, so better to explain a bit of history: The new runner used
>>>             to live in its own module and subdirectory because it is a full blank
>>>             page rewrite and the decision was not to use any of the classical
>>>             runner classes to not be constrained by its evolution.
>>>
>>>             However the reason to put it back in the same module as a subdirectory
>>>             was to encourage early use, in more detail: The way you deploy spark
>>>             jobs today is usually by packaging and staging an uber jar (~200MB of
>>>             pure dependency joy) that contains the user pipeline classes, the
>>>             spark runner module and its dependencies. If we have two spark runners
>>>             in separate modules the user would need to repackage and redeploy
>>>             their pipelines every time they want to switch from the classical
>>>             Spark runner to the structured streaming runner which is painful and
>>>             time and space consuming compared with the one module approach where
>>>             they just change the name of the runner class and that’s it. The idea
>>>             here is to make it easy for users to test the new runner, but at the same
>>>             time to make it easy to come back to the classical runner in case of any
>>>             issue.
>>>
>>>             Ismaël
>>>
>>>             On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles<ke...@apache.org>  <ma...@apache.org>  wrote:
>>>>             +1
>>>>
>>>>             I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>>
>>>>             I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>>
>>>>             I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
>>>>
>>>>             Kenn
>>>>
>>>>             On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw<ro...@google.com>  <ma...@google.com>  wrote:
>>>>>             On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot<ec...@apache.org>  <ma...@apache.org>  wrote:
>>>>>>             Hi guys,
>>>>>>
>>>>>>             You probably know that for several months there has been work underway
>>>>>>             developing a new Spark runner based on the Spark Structured Streaming
>>>>>>             framework. This work is located in a feature branch here:
>>>>>>             https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>>
>>>>>>             To attract more contributors and get some user feedback, we think it is
>>>>>>             time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>>
>>>>>>             - finish the work on spark Encoders (that allow calling Beam coders)
>>>>>>             because, right now, the runner is in an unstable state (some transforms
>>>>>>             use the new way of doing ser/de and some use the old one, making the
>>>>>>             pipeline inconsistent with respect to serialization)
>>>>>>
>>>>>>             - clean history: The history contains commits from November 2018, so
>>>>>>             there is a good amount of work and thus a sizeable number of commits.
>>>>>>             They were already squashed, but not the ones from September 2019 onward.
>>>>>             I don't think the number of commits should be an issue--we shouldn't
>>>>>             just squash years worth of history away. (OTOH, if this is a case of
>>>>>             this branch containing lots of little, irrelevant commits that would
>>>>>             have normally been squashed away in the normal review process we do
>>>>>             for the main branch, then, yes, some cleanup could be nice.)
>>>>>
>>>>>>             Regarding status:
>>>>>>
>>>>>>             - the runner passes 89% of the validates runner tests in batch mode. We
>>>>>>             hope to pass more with the new Encoders
>>>>>>
>>>>>>             - Streaming mode is barely started (waiting for the multi-aggregations
>>>>>>             support in spark SS framework from the Spark community)
>>>>>>
>>>>>>             - Runner can execute Nexmark
>>>>>>
>>>>>>             - Some things are not wired up yet
>>>>>>
>>>>>>                   - Beam Schemas not wired with Spark Schemas
>>>>>>
>>>>>>                   - Optional features of the model not implemented:  state api, timer
>>>>>>             api, splittable doFn api, …
>>>>>>
>>>>>>             WDYT, can we merge it to master once the 2 steps are done ?
>>>>>             I think that as long as it sits parallel to the existing runner, and
>>>>>             is clearly marked with its status, it makes sense to me. How many
>>>>>             changes does it make to the existing codebase (as opposed to add new
>>>>>             code)?
>>

Re: [spark structured streaming runner] merge to master?

Posted by Mikhail Gryzykhin <gr...@gmail.com>.
+1 to merge.

It is worth keeping things in master with an explicitly marked status. It will
make the effort more visible to users and easier to get feedback on.

--Mikhail

On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <ec...@apache.org>
wrote:

> Hi guys,
>
> The new spark runner now supports beam coders and passes 93% of the batch
> validates runner tests (+4%). I think it is time to merge it to master. I
> will submit a PR in the coming days.
>
> next steps: support schemas and thus better leverage the Catalyst optimizer
> (among other things, optimizations based on data), and port the performance
> optimizations that were done in the current runner.
>
> Best
>
> Etienne
> On 11/10/2019 22:48, Pablo Estrada wrote:
>
> +1 for merging : )
>
> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Sounds like a good plan to me.
>>
>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <ec...@apache.org>
>> wrote:
>>
>>> Comments inline
>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>>
>>> +1
>>>
>>> The earlier we get to master the better to encourage not only code
>>> contributions but, just as importantly, to get early user feedback.
>>>
>>>
>>> Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git) ?
>>>
>>> It is still too early to even start discussing when to remove the
>>> classical runner given that the new runner is still a WIP. However the
>>> overall goal is that this runner becomes the de-facto one once the VR
>>> tests and the performance become at least equal to the classical
>>> runner, in the meantime the best for users is that they co-exist,
>>> let’s not forget that the other runner has already been battle tested
>>> for more than 3 years and has had lots of improvements in the last
>>> year.
>>>
>>> +1 on what Ismael says: no removal any time soon.
>>>
>>> The plan I had in mind at first (that I showed at ApacheCon) was
>>> this but I'm proposing moving the first gray label to before the red box.
>>>
>>>
>>> I don't think the number of commits should be an issue--we shouldn't
>>> just squash years worth of history away. (OTOH, if this is a case of
>>> this branch containing lots of little, irrelevant commits that would
>>> have normally been squashed away in the normal review process we do
>>> for the main branch, then, yes, some cleanup could be nice.)
>>>
>>> About the commits we should encourage a clear history but we have also
>>> to remove useless commits that are still present in the branch,
>>> commits of the “Fix errorprone” / “Cleaning” kind and even commits
>>> that make a better narrative sense together should be probably
>>> squashed, because they do not bring much to the history. It is not
>>> about more or fewer commits, it is about their relevance, as Robert
>>> mentions.
>>>
>>>
>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>
>>> In its current form we cannot exclude it but this relates to the other
>>> question, so better to explain a bit of history: The new runner used
>>> to live in its own module and subdirectory because it is a full blank
>>> page rewrite and the decision was not to use any of the classical
>>> runner classes to not be constrained by its evolution.
>>>
>>> However the reason to put it back in the same module as a subdirectory
>>> was to encourage early use, in more detail: The way you deploy spark
>>> jobs today is usually by packaging and staging an uber jar (~200MB of
>>> pure dependency joy) that contains the user pipeline classes, the
>>> spark runner module and its dependencies. If we have two spark runners
>>> in separate modules the user would need to repackage and redeploy
>>> their pipelines every time they want to switch from the classical
>>> Spark runner to the structured streaming runner which is painful and
>>> time and space consuming compared with the one module approach where
>>> they just change the name of the runner class and that’s it. The idea
>>> here is to make it easy for users to test the new runner, but at the same
>>> time to make it easy to come back to the classical runner in case of any
>>> issue.
>>>
>>> Ismaël
>>>
>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <ke...@apache.org> <ke...@apache.org> wrote:
>>>
>>> +1
>>>
>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>
>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>
>>> I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
>>>
>>> Kenn
>>>
>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <ro...@google.com> <ro...@google.com> wrote:
>>>
>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <ec...@apache.org> <ec...@apache.org> wrote:
>>>
>>> Hi guys,
>>>
>>> You probably know that for several months there has been work underway
>>> developing a new Spark runner based on the Spark Structured Streaming
>>> framework. This work is located in a feature branch here:
>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>
>>> To attract more contributors and get some user feedback, we think it is
>>> time to merge it to master. Before doing so, some steps need to be achieved:
>>>
>>> - finish the work on spark Encoders (that allow calling Beam coders)
>>> because, right now, the runner is in an unstable state (some transforms
>>> use the new way of doing ser/de and some use the old one, making the
>>> pipeline inconsistent with respect to serialization)
>>>
>>> - clean history: The history contains commits from November 2018, so
>>> there is a good amount of work and thus a sizeable number of commits.
>>> They were already squashed, but not the ones from September 2019 onward.
>>>
>>> I don't think the number of commits should be an issue--we shouldn't
>>> just squash years worth of history away. (OTOH, if this is a case of
>>> this branch containing lots of little, irrelevant commits that would
>>> have normally been squashed away in the normal review process we do
>>> for the main branch, then, yes, some cleanup could be nice.)
>>>
>>>
>>> Regarding status:
>>>
>>> - the runner passes 89% of the validates runner tests in batch mode. We
>>> hope to pass more with the new Encoders
>>>
>>> - Streaming mode is barely started (waiting for the multi-aggregations
>>> support in spark SS framework from the Spark community)
>>>
>>> - Runner can execute Nexmark
>>>
>>> - Some things are not wired up yet
>>>
>>>      - Beam Schemas not wired with Spark Schemas
>>>
>>>      - Optional features of the model not implemented:  state api, timer
>>> api, splittable doFn api, …
>>>
>>> WDYT, can we merge it to master once the 2 steps are done ?
>>>
>>> I think that as long as it sits parallel to the existing runner, and
>>> is clearly marked with its status, it makes sense to me. How many
>>> changes does it make to the existing codebase (as opposed to add new
>>> code)?
>>>
>>>

Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi guys,

The new spark runner now supports beam coders and passes 93% of the 
batch validates runner tests (+4%). I think it is time to merge it to 
master. I will submit a PR in the coming days.
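
For context on what "supports beam coders" means at the data level: the runner's Spark Encoders ultimately round-trip every element through a Beam coder to and from bytes. A simplified sketch of that core operation using the Beam SDK's CoderUtils (illustrative only; the runner's actual Encoders are built on Catalyst expressions):

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.util.CoderUtils;

    public class BeamCoderRoundTrip {
      public static void main(String[] args) throws Exception {
        // Encode an element to the exact bytes a Beam coder produces,
        // then decode it back, as the runner does inside its Encoders.
        Coder<String> coder = StringUtf8Coder.of();
        byte[] bytes = CoderUtils.encodeToByteArray(coder, "hello");
        String roundTripped = CoderUtils.decodeFromByteArray(coder, bytes);
        System.out.println(roundTripped); // prints "hello"
      }
    }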

next steps: support schemas and thus better leverage the Catalyst optimizer 
(among other things, optimizations based on data), and port the performance 
optimizations that were done in the current runner.
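
As a rough illustration of the schema work: wiring Beam Schemas to Spark Schemas essentially means translating a Beam Schema, field by field, into a Spark StructType that Catalyst can reason about. A hypothetical sketch covering just two field types (the real mapping would have to handle the full type system, nested rows included):

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class SchemaBridge {
      // Translate a Beam Schema into the equivalent Spark StructType.
      static StructType toSparkSchema(Schema beamSchema) {
        StructType sparkSchema = new StructType();
        for (Schema.Field field : beamSchema.getFields()) {
          switch (field.getType().getTypeName()) {
            case STRING:
              sparkSchema = sparkSchema.add(
                  field.getName(), DataTypes.StringType, field.getType().getNullable());
              break;
            case INT64:
              sparkSchema = sparkSchema.add(
                  field.getName(), DataTypes.LongType, field.getType().getNullable());
              break;
            default:
              throw new UnsupportedOperationException(
                  "Type not mapped in this sketch: " + field.getType());
          }
        }
        return sparkSchema;
      }
    }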

Best

Etienne

On 11/10/2019 22:48, Pablo Estrada wrote:
> +1 for merging : )
>
> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <robertwb@google.com 
> <ma...@google.com>> wrote:
>
>     Sounds like a good plan to me.
>
>     On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot
>     <echauchot@apache.org <ma...@apache.org>> wrote:
>
>         Comments inline
>
>         On 10/10/2019 23:44, Ismaël Mejía wrote:
>>         +1
>>
>>         The earlier we get to master the better to encourage not only code
>>         contributions but, just as importantly, to get early user feedback.
>>
>>>         Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git) ?
>>         It is still too early to even start discussing when to remove the
>>         classical runner given that the new runner is still a WIP. However the
>>         overall goal is that this runner becomes the de-facto one once the VR
>>         tests and the performance become at least equal to the classical
>>         runner, in the meantime the best for users is that they co-exist,
>>         let’s not forget that the other runner has already been battle tested
>>         for more than 3 years and has had lots of improvements in the last
>>         year.
>
>         +1 on what Ismael says: no removal any time soon.
>
>         The plan I had in mind at first (that I showed at
>         ApacheCon) was this but I'm proposing moving the first gray
>         label to before the red box.
>
>
>>>         I don't think the number of commits should be an issue--we shouldn't
>>>         just squash years worth of history away. (OTOH, if this is a case of
>>>         this branch containing lots of little, irrelevant commits that would
>>>         have normally been squashed away in the normal review process we do
>>>         for the main branch, then, yes, some cleanup could be nice.)
>>         About the commits we should encourage a clear history but we have also
>>         to remove useless commits that are still present in the branch,
>>         commits of the “Fix errorprone” / “Cleaning” kind and even commits
>>         that make a better narrative sense together should be probably
>>         squashed, because they do not bring much to the history. It is not
>>         about more or fewer commits, it is about their relevance, as Robert
>>         mentions.
>>
>>>         I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>         I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>         In its current form we cannot exclude it but this relates to the other
>>         question, so better to explain a bit of history: The new runner used
>>         to live in its own module and subdirectory because it is a full blank
>>         page rewrite and the decision was not to use any of the classical
>>         runner classes to not be constrained by its evolution.
>>
>>         However the reason to put it back in the same module as a subdirectory
>>         was to encourage early use, in more detail: The way you deploy spark
>>         jobs today is usually by packaging and staging an uber jar (~200MB of
>>         pure dependency joy) that contains the user pipeline classes, the
>>         spark runner module and its dependencies. If we have two spark runners
>>         in separate modules the user would need to repackage and redeploy
>>         their pipelines every time they want to switch from the classical
>>         Spark runner to the structured streaming runner which is painful and
>>         time and space consuming compared with the one module approach where
>>         they just change the name of the runner class and that’s it. The idea
>>         here is to make it easy for users to test the new runner, but at the same
>>         time to make it easy to come back to the classical runner in case of any
>>         issue.
>>
>>         Ismaël
>>
>>         On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles<ke...@apache.org>  <ma...@apache.org>  wrote:
>>>         +1
>>>
>>>         I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>>
>>>         I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>>
>>>         I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
>>>
>>>         Kenn
>>>
>>>         On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw<ro...@google.com>  <ma...@google.com>  wrote:
>>>>         On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot<ec...@apache.org>  <ma...@apache.org>  wrote:
>>>>>         Hi guys,
>>>>>
>>>>>         You probably know that for several months there has been work underway
>>>>>         developing a new Spark runner based on the Spark Structured Streaming
>>>>>         framework. This work is located in a feature branch here:
>>>>>         https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>>>>
>>>>>         To attract more contributors and get some user feedback, we think it is
>>>>>         time to merge it to master. Before doing so, some steps need to be achieved:
>>>>>
>>>>>         - finish the work on spark Encoders (that allow calling Beam coders)
>>>>>         because, right now, the runner is in an unstable state (some transforms
>>>>>         use the new way of doing ser/de and some use the old one, making the
>>>>>         pipeline inconsistent with respect to serialization)
>>>>>
>>>>>         - clean history: The history contains commits from November 2018, so
>>>>>         there is a good amount of work and thus a sizeable number of commits.
>>>>>         They were already squashed, but not the ones from September 2019 onward.
>>>>         I don't think the number of commits should be an issue--we shouldn't
>>>>         just squash years worth of history away. (OTOH, if this is a case of
>>>>         this branch containing lots of little, irrelevant commits that would
>>>>         have normally been squashed away in the normal review process we do
>>>>         for the main branch, then, yes, some cleanup could be nice.)
>>>>
>>>>>         Regarding status:
>>>>>
>>>>>         - the runner passes 89% of the validates runner tests in batch mode. We
>>>>>         hope to pass more with the new Encoders
>>>>>
>>>>>         - Streaming mode is barely started (waiting for the multi-aggregations
>>>>>         support in spark SS framework from the Spark community)
>>>>>
>>>>>         - Runner can execute Nexmark
>>>>>
>>>>>         - Some things are not wired up yet
>>>>>
>>>>>               - Beam Schemas not wired with Spark Schemas
>>>>>
>>>>>               - Optional features of the model not implemented:  state api, timer
>>>>>         api, splittable doFn api, …
>>>>>
>>>>>         WDYT, can we merge it to master once the 2 steps are done ?
>>>>         I think that as long as it sits parallel to the existing runner, and
>>>>         is clearly marked with its status, it makes sense to me. How many
>>>>         changes does it make to the existing codebase (as opposed to add new
>>>>         code)?
>

Re: [spark structured streaming runner] merge to master?

Posted by Pablo Estrada <pa...@google.com>.
+1 for merging : )

On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <ro...@google.com>
wrote:

> Sounds like a good plan to me.
>
> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <ec...@apache.org>
> wrote:
>
>> Comments inline
>> On 10/10/2019 23:44, Ismaël Mejía wrote:
>>
>> +1
>>
>> The earlier we get to master the better to encourage not only code
>> contributions but, just as importantly, to get early user feedback.
>>
>>
>> Question is: do we keep the "old" spark runner for a while or not (or just keep on previous version/tag on git) ?
>>
>> It is still too early to even start discussing when to remove the
>> classical runner given that the new runner is still a WIP. However the
>> overall goal is that this runner becomes the de-facto one once the VR
>> tests and the performance become at least equal to the classical
>> runner, in the meantime the best for users is that they co-exist,
>> let’s not forget that the other runner has already been battle tested
>> for more than 3 years and has had lots of improvements in the last
>> year.
>>
>> +1 on what Ismael says: no removal any time soon.
>>
>> The plan I had in mind at first (that I showed at ApacheCon) was this
>> but I'm proposing moving the first gray label to before the red box.
>>
>>
>> I don't think the number of commits should be an issue--we shouldn't
>> just squash years worth of history away. (OTOH, if this is a case of
>> this branch containing lots of little, irrelevant commits that would
>> have normally been squashed away in the normal review process we do
>> for the main branch, then, yes, some cleanup could be nice.)
>>
>> About the commits we should encourage a clear history but we have also
>> to remove useless commits that are still present in the branch,
>> commits of the “Fix errorprone” / “Cleaning” kind and even commits
>> that make a better narrative sense together should be probably
>> squashed, because they do not bring much to the history. It is not
>> about more or fewer commits, it is about their relevance, as Robert
>> mentions.
>>
>>
>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>
>> In its current form we cannot exclude it but this relates to the other
>> question, so better to explain a bit of history: The new runner used
>> to live in its own module and subdirectory because it is a full blank
>> page rewrite and the decision was not to use any of the classical
>> runner classes to not be constrained by its evolution.
>>
>> However the reason to put it back in the same module as a subdirectory
>> was to encourage early use, in more detail: The way you deploy spark
>> jobs today is usually by packaging and staging an uber jar (~200MB of
>> pure dependency joy) that contains the user pipeline classes, the
>> spark runner module and its dependencies. If we have two spark runners
>> in separate modules the user would need to repackage and redeploy
>> their pipelines every time they want to switch from the classical
>> Spark runner to the structured streaming runner which is painful and
>> time and space consuming compared with the one module approach where
>> they just change the name of the runner class and that’s it. The idea
>> here is to make it easy for users to test the new runner, but at the same
>> time to make it easy to come back to the classical runner in case of any
>> issue.
>>
>> Ismaël
>>
>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <ke...@apache.org> <ke...@apache.org> wrote:
>>
>> +1
>>
>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>>
>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
>>
>> I did not see any major changes to dependencies but I will also ask if it has major version differences so that you might want a separate artifact?
>>
>> Kenn
>>
>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <ro...@google.com> <ro...@google.com> wrote:
>>
>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <ec...@apache.org> <ec...@apache.org> wrote:
>>
>> Hi guys,
>>
>> You probably know that there has been for several months an work
>> developing a new Spark runner based on Spark Structured Streaming
>> framework. This work is located in a feature branch here:https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>
>> To attract more contributors and get some user feedback, we think it is
>> time to merge it to master. Before doing so, some steps need to be achieved:
>>
>> - finish the work on spark Encoders (that allow to call Beam coders)
>> because, right now, the runner is in an unstable state (some transforms
>> use the new way of doing ser/de and some use the old one, making a
>> pipeline incoherent toward serialization)
>>
>> - clean history: The history contains commits from November 2018, so
>> there is a good amount of work, thus a consequent number of commits.
>> They were already squashed but not from September 2019
>>
>> I don't think the number of commits should be an issue--we shouldn't
>> just squash years worth of history away. (OTOH, if this is a case of
>> this branch containing lots of little, irrelevant commits that would
>> have normally been squashed away in the normal review process we do
>> for the main branch, then, yes, some cleanup could be nice.)
>>
>>
>> Regarding status:
>>
>> - the runner passes 89% of the validates runner tests in batch mode. We
>> hope to pass more with the new Encoders
>>
>> - Streaming mode is barely started (waiting for the multi-aggregations
>> support in spark SS framework from the Spark community)
>>
>> - Runner can execute Nexmark
>>
>> - Some things are not wired up yet
>>
>>      - Beam Schemas not wired with Spark Schemas
>>
>>      - Optional features of the model not implemented:  state api, timer
>> api, splittable doFn api, …
>>
>> WDYT, can we merge it to master once the 2 steps are done ?
>>
>> I think that as long as it sits parallel to the existing runner, and
>> is clearly marked with its status, it makes sense to me. How many
>> changes does it make to the existing codebase (as opposed to add new
>> code)?
>>
>>

Re: [spark structured streaming runner] merge to master?

Posted by Robert Bradshaw <ro...@google.com>.
Sounds like a good plan to me.


Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
Comments inline

On 10/10/2019 23:44, Ismaël Mejía wrote:
> +1
>
> The earlier we get to master the better, to encourage not only code
> contributions but, just as importantly, early user feedback.
>
>> Question is: do we keep the "old" spark runner for a while or not (or just keep it at a previous version/tag in git)?
> It is still too early to even start discussing when to remove the
> classical runner, given that the new runner is still a WIP. However the
> overall goal is that this runner becomes the de-facto one once the VR
> tests and the performance become at least equal to the classical
> runner; in the meantime the best for users is that they co-exist.
> Let’s not forget that the other runner has already been battle-tested
> for more than 3 years and has had lots of improvements in the last
> year.

+1 on what Ismael says: no removal any time soon.

The plan I had in mind at first (the one I showed at ApacheCon) was
this, but I'm proposing to move the first gray label to before the red box.


>
>> I don't think the number of commits should be an issue--we shouldn't
>> just squash years worth of history away. (OTOH, if this is a case of
>> this branch containing lots of little, irrelevant commits that would
>> have normally been squashed away in the normal review process we do
>> for the main branch, then, yes, some cleanup could be nice.)
> About the commits: we should encourage a clear history, but we also
> have to remove useless commits that are still present in the branch.
> Commits of the “Fix errorprone” / “Cleaning” kind, and even commits
> that make better narrative sense together, should probably be
> squashed, because they do not bring much to the history. It is not
> about more or fewer commits, it is about their relevance, as Robert
> mentions.
>
>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
> In its current form we cannot exclude it, but this relates to the other
> question, so better to explain a bit of history: the new runner used
> to live in its own module and subdirectory because it is a full
> blank-page rewrite, and the decision was not to use any of the classical
> runner’s classes so as not to be constrained by its evolution.
>
> However, the reason to put it back in the same module, as a
> subdirectory, was to encourage early use. In more detail: the way you
> deploy Spark jobs today is usually by packaging and staging an uber
> jar (~200MB of pure dependency joy) that contains the user pipeline
> classes, the Spark runner module and its dependencies. If we had two
> Spark runners in separate modules, the user would need to repackage
> and redeploy their pipelines every time they wanted to switch from the
> classical Spark runner to the structured streaming runner, which is
> painful and time- and space-consuming compared with the one-module
> approach, where they just change the name of the runner class and
> that’s it. The idea here is to make it easy for users to test the new
> runner, but at the same time to make it easy to come back to the
> classical runner in case of any issue.
>
> Ismaël

Re: [spark structured streaming runner] merge to master?

Posted by Ismaël Mejía <ie...@gmail.com>.
+1

The earlier we get to master the better, to encourage not only code
contributions but, just as importantly, early user feedback.

> Question is: do we keep the "old" spark runner for a while or not (or just keep it at a previous version/tag in git)?

It is still too early to even start discussing when to remove the
classical runner, given that the new runner is still a WIP. However the
overall goal is that this runner becomes the de-facto one once the VR
tests and the performance become at least equal to the classical
runner; in the meantime the best for users is that they co-exist.
Let’s not forget that the other runner has already been battle-tested
for more than 3 years and has had lots of improvements in the last
year.

> I don't think the number of commits should be an issue--we shouldn't
> just squash years worth of history away. (OTOH, if this is a case of
> this branch containing lots of little, irrelevant commits that would
> have normally been squashed away in the normal review process we do
> for the main branch, then, yes, some cleanup could be nice.)

About the commits: we should encourage a clear history, but we also
have to remove useless commits that are still present in the branch.
Commits of the “Fix errorprone” / “Cleaning” kind, and even commits
that make better narrative sense together, should probably be
squashed, because they do not bring much to the history. It is not
about more or fewer commits, it is about their relevance, as Robert
mentions.

> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.

In its current form we cannot exclude it, but this relates to the other
question, so better to explain a bit of history: the new runner used
to live in its own module and subdirectory because it is a full
blank-page rewrite, and the decision was not to use any of the classical
runner’s classes so as not to be constrained by its evolution.

However, the reason to put it back in the same module, as a
subdirectory, was to encourage early use. In more detail: the way you
deploy Spark jobs today is usually by packaging and staging an uber
jar (~200MB of pure dependency joy) that contains the user pipeline
classes, the Spark runner module and its dependencies. If we had two
Spark runners in separate modules, the user would need to repackage
and redeploy their pipelines every time they wanted to switch from the
classical Spark runner to the structured streaming runner, which is
painful and time- and space-consuming compared with the one-module
approach, where they just change the name of the runner class and
that’s it. The idea here is to make it easy for users to test the new
runner, but at the same time to make it easy to come back to the
classical runner in case of any issue.
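
To make that concrete: with both runners packaged in the same artifact,
switching is a one-line change at submission time. A minimal sketch
(assuming the new runner class ends up being named
SparkStructuredStreamingRunner; the classical one is SparkRunner):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSwitchExample {
  public static void main(String[] args) {
    // Same uber jar, same pipeline; only the runner name changes, e.g.
    //   --runner=SparkRunner                     (classical RDD/DStream runner)
    //   --runner=SparkStructuredStreamingRunner  (new runner, name assumed here)
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run().waitUntilFinish();
  }
}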

Ismaël


Re: [spark structured streaming runner] merge to master?

Posted by Kenneth Knowles <ke...@apache.org>.
+1

I think our experiences with things that go to master early have been very
good. So I am in favor ASAP. We can exclude it from releases easily until
it is ready for end users.

I have the same question as Robert - how much is modifications and how much
is new? I notice it is in a subdirectory of the beam-runners-spark module.

I did not see any major changes to dependencies, but I will also ask if it
has major version differences such that you might want a separate artifact.

Kenn


Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
Hi Robert, comments inline:

On 10/10/2019 20:49, Robert Bradshaw wrote:
> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <ec...@apache.org> wrote:
>> Hi guys,
>>
>> You probably know that for several months there has been work
>> developing a new Spark runner based on Spark Structured Streaming
>> framework. This work is located in a feature branch here:
>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>>
>> To attract more contributors and get some user feedback, we think it is
>> time to merge it to master. Before doing so, some steps need to be achieved:
>>
>> - finish the work on spark Encoders (that allow to call Beam coders)
>> because, right now, the runner is in an unstable state (some transforms
>> use the new way of doing ser/de and some use the old one, making a
>> pipeline inconsistent with respect to serialization)
>>
>> - clean history: The history contains commits from November 2018, so
>> there is a good amount of work, and thus a substantial number of commits.
>> They have already been squashed, except those from September 2019 onward.

> I don't think the number of commits should be an issue--we shouldn't
> just squash years worth of history away. (OTOH, if this is a case of
> this branch containing lots of little, irrelevant commits that would
> have normally been squashed away in the normal review process we do
> for the main branch, then, yes, some cleanup could be nice.)
+1. Yes, the tiny ones were already squashed, except from September until now.
>
>> Regarding status:
>>
>> - the runner passes 89% of the validates runner tests in batch mode. We
>> hope to pass more with the new Encoders
>>
>> - Streaming mode is barely started (waiting for the multi-aggregations
>> support in spark SS framework from the Spark community)
>>
>> - Runner can execute Nexmark
>>
>> - Some things are not wired up yet
>>
>>       - Beam Schemas not wired with Spark Schemas
>>
>>       - Optional features of the model not implemented:  state api, timer
>> api, splittable doFn api, …
>>
>> WDYT, can we merge it to master once the 2 steps are done?
> I think that as long as it sits parallel to the existing runner, and
> is clearly marked with its status, it makes sense to me. How many
> changes does it make to the existing codebase (as opposed to adding new
> code)?

The 2 runners live in the same module but in different Java packages.
The new runner was recoded from scratch (cf. the explanations here:
https://www.slideshare.net/EtienneChauchot/etienne-chauchot-spark-structured-streaming-runner).

There is no shared code between the runners (except the Beam SDK and
core, of course :) ).
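
As background on the Encoders item quoted above: the goal is Spark
Encoders whose ser/de delegates to the pipeline’s Beam coders. A rough
byte-level sketch of that bridge (a hypothetical helper, not the actual
branch code; CoderUtils and the coder classes are real Beam API):

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.util.CoderUtils;

/** Hypothetical bridge: ser/de through a Beam coder, as a Spark Encoder would do. */
public final class BeamCoderBridge {

  /** Encode a value to bytes using the pipeline's Beam coder. */
  public static <T> byte[] toBytes(Coder<T> coder, T value) throws CoderException {
    return CoderUtils.encodeToByteArray(coder, value);
  }

  /** Decode bytes back with the same Beam coder. */
  public static <T> T fromBytes(Coder<T> coder, byte[] bytes) throws CoderException {
    return CoderUtils.decodeFromByteArray(coder, bytes);
  }

  public static void main(String[] args) throws CoderException {
    byte[] bytes = toBytes(StringUtf8Coder.of(), "beam on spark");
    System.out.println(fromBytes(StringUtf8Coder.of(), bytes)); // prints "beam on spark"
  }
}

The hard remaining part is wrapping such calls into an actual
org.apache.spark.sql.Encoder, since Spark expects Catalyst
expression-level ser/de rather than plain byte arrays.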

Etienne



Re: [spark structured streaming runner] merge to master?

Posted by Robert Bradshaw <ro...@google.com>.
On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <ec...@apache.org> wrote:
>
> Hi guys,
>
> You probably know that for several months there has been work
> developing a new Spark runner based on Spark Structured Streaming
> framework. This work is located in a feature branch here:
> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>
> To attract more contributors and get some user feedback, we think it is
> time to merge it to master. Before doing so, some steps need to be achieved:
>
> - finish the work on spark Encoders (that allow to call Beam coders)
> because, right now, the runner is in an unstable state (some transforms
> use the new way of doing ser/de and some use the old one, making a
> pipeline inconsistent with respect to serialization)
>
> - clean history: The history contains commits from November 2018, so
> there is a good amount of work, and thus a substantial number of commits.
> They have already been squashed, except those from September 2019 onward.

I don't think the number of commits should be an issue--we shouldn't
just squash years worth of history away. (OTOH, if this is a case of
this branch containing lots of little, irrelevant commits that would
have normally been squashed away in the normal review process we do
for the main branch, then, yes, some cleanup could be nice.)

> Regarding status:
>
> - the runner passes 89% of the validates runner tests in batch mode. We
> hope to pass more with the new Encoders
>
> - Streaming mode is barely started (waiting for the multi-aggregations
> support in spark SS framework from the Spark community)
>
> - Runner can execute Nexmark
>
> - Some things are not wired up yet
>
>      - Beam Schemas not wired with Spark Schemas
>
>      - Optional features of the model not implemented:  state api, timer
> api, splittable doFn api, …
>
> WDYT, can we merge it to master once the 2 steps are done?

I think that as long as it sits parallel to the existing runner, and
is clearly marked with its status, it makes sense to me. How many
changes does it make to the existing codebase (as opposed to adding new
code)?

Re: [spark structured streaming runner] merge to master?

Posted by Xinyu Liu <xi...@gmail.com>.
+1 for merging to master. It's going to help a lot for us to try it out,
and also to contribute back the missing features.

Thanks,
Xinyu


Re: [spark structured streaming runner] merge to master?

Posted by Etienne Chauchot <ec...@apache.org>.
I think it is important to also provide, in the build, a jar that
contains only the old runner, for people who want to ship only one.

Etienne


Re: [spark structured streaming runner] merge to master?

Posted by Alexey Romanenko <ar...@gmail.com>.
+1 for merging this new runner too (even if it’s not 100% ready for the moment), as long as it doesn’t break/fail/affect all other tests and Jenkins jobs. I mean, it should be transparent to other Beam components.

Also, since it won’t be officially “released” right after merging, we need to clearly warn users that it’s not ready to use in production.



Re: [spark structured streaming runner] merge to master?

Posted by Ryan Skraba <ry...@skraba.com>.
Merging to master sounds like a really good idea, even if it is not
feature-complete yet.

It's already a pretty big accomplishment getting it to the current
state (great job, all!). Merging it into master would give it a pretty
good boost in visibility and encourage some discussion about where
it's going.

I don't think there's any question about removing the RDD-based
(a.k.a. old/legacy/stable) spark runner yet!

All my best, Ryan



Re: [spark structured streaming runner] merge to master?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
+1

As the runner seems almost "equivalent" to the one we have, it makes sense.

Question is: do we keep the "old" spark runner for a while or not (or
just keep it at a previous version/tag in git)?

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com