Posted to dev@beam.apache.org by Alexey Romanenko <ar...@gmail.com> on 2019/11/04 17:48:45 UTC

Re: [spark structured streaming runner] merge to master?

Like Kenn, I’m also strongly against 3 jars, but I don’t see a big difference between 1 and 2 jars, as long as the 2 jars don’t contain the same classes or cross-dependencies between them. I don’t have strong arguments “for” or “against” either option but, as I mentioned before, achieving this with one jar would probably be easier and less error-prone.

> On 31 Oct 2019, at 02:06, Kenneth Knowles <ke...@apache.org> wrote:
> 
> Very good points. We definitely ship a lot of code/features in very early stages, and there seems to be no problem.
> 
> I intend mostly to leave this judgment to people like you who know better about Spark users.
> 
> But I do think 1 or 2 jars is better than 3. I really don't like "3 jars" and I did give two reasons:
> 
> 1. diamond deps where things overlap
> 2. figuring out which thing to depend on
> 
> Both are annoying for users. I am not certain whether it could lead to a truly unsolvable situation. This is just a Java ecosystem problem so I feel qualified to comment.
> 
> I did also ask if there were major dependency differences between the two that could cause problems for users. This question was dropped and no one cared to comment, so I assume it is not an issue. In that case I favor having just 1 jar with both runners.
> 
> Kenn
> 
> On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía <iemejia@gmail.com> wrote:
> I am still a bit lost about why we are discussing options without giving any
> arguments or reasons for them. Why is 2 modules better than 3, or 3 better
> than 2, or, even better, what forces us to have anything other than a single
> module?
> 
> What are the reasons for wanting separate jars? If the issue is that the
> code is unfinished or not passing the tests, the impact on end users is minimal
> because they cannot accidentally end up running the new runner, and if they
> decide to do so we can warn them, in the documentation and in the runner itself,
> that it is at their own risk and not ready for production.
> 
> If the fear is that the new code may end up intertwined with the classic and
> portable runners and cause side effects: we have ValidatesRunner +
> Nexmark in the CI to cover this, so again I do not see a problem that
> requires the modules to be separate.
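> 
> (To make that concrete: a ValidatesRunner test is just a JUnit test tagged with the ValidatesRunner category, so the same suite exercises each runner. A minimal, illustrative sketch, not code from the branch:)
> 
>     import org.apache.beam.sdk.testing.PAssert;
>     import org.apache.beam.sdk.testing.TestPipeline;
>     import org.apache.beam.sdk.testing.ValidatesRunner;
>     import org.apache.beam.sdk.transforms.Create;
>     import org.junit.Rule;
>     import org.junit.Test;
>     import org.junit.experimental.categories.Category;
> 
>     public class ExampleValidatesRunnerTest {
>       // The pipeline runs against whichever runner the test configuration selects.
>       @Rule public final transient TestPipeline p = TestPipeline.create();
> 
>       @Test
>       @Category(ValidatesRunner.class)
>       public void testCreate() {
>         // Assert the runner produces the expected elements.
>         PAssert.that(p.apply(Create.of(1, 2, 3))).containsInAnyOrder(1, 2, 3);
>         p.run();
>       }
>     }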
> 
> If the issue is being uncomfortable about having in-progress code in released
> artifacts, we have been doing this in Beam forever, for example with most of the
> work on portability and Schema/SQL, all of which was part of released artifacts
> a long time before it was ready for prime time, so I still don't see why this
> case is different enough to require different artifacts.
> 
> I have the impression we are trying to solve a non-issue by adding a lot of
> artificial complexity (in particular for users), or am I missing something
> else?
> 
> On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles <kenn@apache.org> wrote:
> >
> > Oh, I mean that we ship just 2 jars.
> >
> > And since Spark users always build an uber jar, they can still depend on both of ours and be able to switch runners with a flag.
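> >
> > For illustration, a minimal sketch of what that flag switch looks like (the structured streaming runner's class name here is an assumption based on the feature branch layout, not a final API):
> >
> >     // Run the same uber jar with either:
> >     //   --runner=SparkRunner                      (classic runner)
> >     //   --runner=SparkStructuredStreamingRunner   (new runner; name assumed)
> >     import org.apache.beam.sdk.Pipeline;
> >     import org.apache.beam.sdk.options.PipelineOptions;
> >     import org.apache.beam.sdk.options.PipelineOptionsFactory;
> >
> >     public class RunnerToggleExample {
> >       public static void main(String[] args) {
> >         // The runner is picked up from the --runner flag; the pipeline
> >         // code itself does not change when switching runners.
> >         PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
> >         Pipeline pipeline = Pipeline.create(options);
> >         // ... build the pipeline here, then:
> >         pipeline.run().waitUntilFinish();
> >       }
> >     }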
> >
> > I really dislike projects shipping overlapping jars. It is confusing and causes major diamond dependency problems.
> >
> > Kenn
> >
> > On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko <aromanenko.dev@gmail.com> wrote:
> >>
> >> Yes, agreed, two jars included in an uber jar will work in a similar way. Though having 3 jars still looks quite confusing to me.
> >>
> >> On 29 Oct 2019, at 23:54, Kenneth Knowles <kenn@apache.org> wrote:
> >>
> >> Is it just as easy to have two jars and build an uber jar with both included? Then the runner can still be toggled with a flag.
> >>
> >> Kenn
> >>
> >> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <aromanenko.dev@gmail.com> wrote:
> >>>
> >>> Hmm, I don’t think the jar size should play a big role compared to the whole size of the shaded jar of a user's job. Moreover, I think it will be quite confusing for users to choose which jar to use if we have 3 different ones for similar purposes. Though, let’s see what others think.
> >>>
> >>> On 29 Oct 2019, at 15:32, Etienne Chauchot <echauchot@apache.org> wrote:
> >>>
> >>> Hi Alexey,
> >>>
> >>> Thanks for your opinion!
> >>>
> >>> Comments inline
> >>>
> >>> Etienne
> >>>
> >>> On 28/10/2019 17:34, Alexey Romanenko wrote:
> >>>
> >>> Let me share some of my thoughts on this.
> >>>
> >>>     - shall we filter out the package name from the release?
> >>>
> >>> Until the new runner is ready to be used in production (or, at least, ready for beta testing, in which case users should be clearly warned about its status), I believe we need to filter out its classes from the published jar to avoid confusion.
> >>>
> >>> Yes, that is what I think as well.
> >>>
> >>>     - should we release 2 jars: one for the old and one for the new?
> >>>
> >>>     - should we release 3 jars: one for the old, one for the new, and one for both?
> >>>
> >>> Once the new runner is released, I think we should provide only one single jar and allow users to switch between the different Spark runners with a CLI option.
> >>>
> >>> I would vote for 3 jars: one for the new, one for the old, and one for both. Indeed, in some cases, users look very closely at the size of jars. This solution covers all the use cases.
> >>>
> >>>     - should we create a special entry in the capability matrix?
> >>>
> >>> Sure, since it has its own unique characteristics and implementation, but again, only once the new runner is "officially released".
> >>>
> >>> +1
> >>>
> >>>
> >>>
> >>> On 28 Oct 2019, at 10:27, Etienne Chauchot <echauchot@apache.org> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> Any opinions on point 2, the communication to users?
> >>>
> >>> Etienne
> >>>
> >>> On 24/10/2019 15:44, Etienne Chauchot wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> I'm glad to announce that the PR for the merge to master of the new runner, based on the Spark Structured Streaming framework, has been submitted:
> >>>
> >>> https://github.com/apache/beam/pull/9866
> >>>
> >>>
> >>> 1. Regarding the status of the runner:
> >>>
> >>> - The runner passes 93% of the ValidatesRunner tests in batch mode.
> >>>
> >>> - Streaming mode is barely started (waiting for multi-aggregations support in the Spark Structured Streaming framework from the Spark community).
> >>>
> >>> - The runner can execute Nexmark.
> >>>
> >>> - Some things are not wired up yet:
> >>>
> >>>   - Beam Schemas are not wired with Spark Schemas
> >>>
> >>>   - Optional features of the model are not implemented: state API, timer API, splittable DoFn API, …
> >>>
> >>>
> >>> 2. Regarding the communication to users:
> >>>
> >>> - For the reasons explained by Ismaël, the runner is in the same module as the "older" one, but it is in a different sub-package and both runners share the same build.
> >>>
> >>> - How should we communicate to users:
> >>>
> >>>     - shall we filter out the package name from the release?
> >>>
> >>>     - should we release 2 jars: one for the old and one for the new?
> >>>
> >>>     - should we release 3 jars: one for the old, one for the new, and one for both?
> >>>
> >>>     - should we create a special entry in the capability matrix?
> >>>
> >>> WDYT?
> >>>
> >>> Best
> >>>
> >>> Etienne
> >>>
> >>>
> >>> On 23/10/2019 19:11, Mikhail Gryzykhin wrote:
> >>>
> >>> +1 to merge.
> >>>
> >>> It is worth keeping things in master with an explicitly marked status. It will make the effort more visible to users and make it easier to gather feedback.
> >>>
> >>> --Mikhail
> >>>
> >>> On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauchot@apache.org> wrote:
> >>>>
> >>>> Hi guys,
> >>>>
> >>>> The new Spark runner now supports Beam coders and passes 93% of the batch ValidatesRunner tests (+4%). I think it is time to merge it to master. I will submit a PR in the coming days.
> >>>>
> >>>> Next steps: support schemas and thus better leverage the Catalyst optimizer (among other things, optimizations based on the data), and port the performance optimizations that were done in the current runner.
> >>>>
> >>>> Best
> >>>>
> >>>> Etienne
> >>>>
> >>>> On 11/10/2019 22:48, Pablo Estrada wrote:
> >>>>
> >>>> +1 for merging : )
> >>>>
> >>>> On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <robertwb@google.com> wrote:
> >>>>>
> >>>>> Sounds like a good plan to me.
> >>>>>
> >>>>> On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <echauchot@apache.org> wrote:
> >>>>>>
> >>>>>> Comments inline
> >>>>>>
> >>>>>> On 10/10/2019 23:44, Ismaël Mejía wrote:
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> The earlier we get to master the better, to encourage not only code
> >>>>>> contributions but, just as importantly, early user feedback.
> >>>>>>
> >>>>>> The question is: do we keep the "old" Spark runner for a while or not (or just keep it at a previous version/tag on git)?
> >>>>>>
> >>>>>> It is still too early to even start discussing when to remove the
> >>>>>> classical runner, given that the new runner is still a WIP. However, the
> >>>>>> overall goal is that this runner becomes the de-facto one once the VR
> >>>>>> tests and the performance become at least equal to those of the classical
> >>>>>> runner. In the meantime, the best thing for users is that they co-exist;
> >>>>>> let’s not forget that the other runner has already been battle-tested
> >>>>>> for more than 3 years and has had lots of improvements in the last
> >>>>>> year.
> >>>>>>
> >>>>>> +1 to what Ismaël says: no removal anytime soon.
> >>>>>>
> >>>>>> The plan I had in mind at first (the one I showed at ApacheCon) was the following, but I'm proposing to move the first gray label to before the red box.
> >>>>>>
> >>>>>> [attached image: roadmap diagram (beogijnhpieapoll.png), not included in the plain-text archive]
> >>>>>>
> >>>>>>
> >>>>>> I don't think the number of commits should be an issue--we shouldn't
> >>>>>> just squash years' worth of history away. (OTOH, if this is a case of
> >>>>>> this branch containing lots of little, irrelevant commits that would
> >>>>>> have normally been squashed away in the normal review process we do
> >>>>>> for the main branch, then, yes, some cleanup could be nice.)
> >>>>>>
> >>>>>> About the commits: we should encourage a clear history, but we also have
> >>>>>> to remove useless commits that are still present in the branch. Commits
> >>>>>> of the “Fix errorprone” / “Cleaning” kind, and even commits
> >>>>>> that make better narrative sense together, should probably be
> >>>>>> squashed, because they do not bring much to the history. It is not
> >>>>>> about more or fewer commits, it is about their relevance, as Robert
> >>>>>> mentions.
> >>>>>>
> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
> >>>>>>
> >>>>>> In its current form we cannot exclude it, but this relates to the other
> >>>>>> question, so it is better to explain a bit of history: the new runner used
> >>>>>> to live in its own module and subdirectory because it is a full
> >>>>>> blank-page rewrite, and the decision was not to use any of the classical
> >>>>>> runner's classes so as not to be constrained by its evolution.
> >>>>>>
> >>>>>> However, the reason for putting it back in the same module, as a
> >>>>>> subdirectory, was to encourage early use. In more detail: the way you
> >>>>>> deploy Spark jobs today is usually by packaging and staging an uber jar
> >>>>>> (~200MB of pure dependency joy) that contains the user pipeline classes,
> >>>>>> the Spark runner module and its dependencies. If we had two Spark runners
> >>>>>> in separate modules, the user would need to repackage and redeploy
> >>>>>> their pipelines every time they wanted to switch from the classical
> >>>>>> Spark runner to the structured streaming runner, which is painful and
> >>>>>> time- and space-consuming compared with the one-module approach, where
> >>>>>> they just change the name of the runner class and that’s it. The idea
> >>>>>> here is to make it easy for users to test the new runner, but at the same
> >>>>>> time to make it easy to come back to the classical runner in case of any
> >>>>>> issue.
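> >>>>>>
> >>>>>> Concretely, a sketch of that "just change the runner class" switch (the new runner's fully-qualified name is assumed from the feature branch layout, not a committed API):
> >>>>>>
> >>>>>>     import org.apache.beam.sdk.options.PipelineOptions;
> >>>>>>     import org.apache.beam.sdk.options.PipelineOptionsFactory;
> >>>>>>
> >>>>>>     public class SwitchRunnerSketch {
> >>>>>>       // Sketch only; both runner classes live in the same module/jar,
> >>>>>>       // so switching requires no repackaging of the user's uber jar.
> >>>>>>       static PipelineOptions newRunnerOptions() {
> >>>>>>         PipelineOptions options = PipelineOptionsFactory.create();
> >>>>>>         // The classical runner would be:
> >>>>>>         //   options.setRunner(org.apache.beam.runners.spark.SparkRunner.class);
> >>>>>>         // The new runner (class name assumed from the branch layout):
> >>>>>>         options.setRunner(org.apache.beam.runners.spark.structuredstreaming
> >>>>>>             .SparkStructuredStreamingRunner.class);
> >>>>>>         return options;
> >>>>>>       }
> >>>>>>     }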
> >>>>>>
> >>>>>> Ismaël
> >>>>>>
> >>>>>> On Thu, Oct 10, 2019 at 9:02 PM Kenneth Knowles <kenn@apache.org> wrote:
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> I think our experiences with things that go to master early have been very good. So I am in favor ASAP. We can exclude it from releases easily until it is ready for end users.
> >>>>>>
> >>>>>> I have the same question as Robert - how much is modifications and how much is new? I notice it is in a subdirectory of the beam-runners-spark module.
> >>>>>>
> >>>>>> I did not see any major changes to dependencies, but I will also ask: does it have major version differences such that you might want a separate artifact?
> >>>>>>
> >>>>>> Kenn
> >>>>>>
> >>>>>> On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw <robertwb@google.com> wrote:
> >>>>>>
> >>>>>> On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot <echauchot@apache.org> wrote:
> >>>>>>
> >>>>>> Hi guys,
> >>>>>>
> >>>>>> You probably know that for several months there has been work on
> >>>>>> developing a new Spark runner based on the Spark Structured Streaming
> >>>>>> framework. This work is located in a feature branch here:
> >>>>>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
> >>>>>>
> >>>>>> To attract more contributors and get some user feedback, we think it is
> >>>>>> time to merge it to master. Before doing so, two steps need to be completed:
> >>>>>>
> >>>>>> - finish the work on Spark Encoders (which allow calling Beam coders)
> >>>>>> because, right now, the runner is in an unstable state (some transforms
> >>>>>> use the new way of doing ser/de and some use the old one, making a
> >>>>>> pipeline incoherent with regard to serialization); see the sketch after
> >>>>>> this list
> >>>>>>
> >>>>>> - clean up the history: the history contains commits going back to
> >>>>>> November 2018, so there is a good amount of work and thus a considerable
> >>>>>> number of commits. They were already squashed, but not those from
> >>>>>> September 2019 onward
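> >>>>>>
> >>>>>> As promised above, a conceptual sketch of the Encoder idea (this is not
> >>>>>> the branch's implementation; CoderUtils is real Beam API, while the
> >>>>>> helper class itself is purely illustrative):
> >>>>>>
> >>>>>>     import org.apache.beam.sdk.coders.Coder;
> >>>>>>     import org.apache.beam.sdk.coders.CoderException;
> >>>>>>     import org.apache.beam.sdk.util.CoderUtils;
> >>>>>>
> >>>>>>     // The core idea: Spark's Encoder machinery delegates ser/de to a
> >>>>>>     // Beam Coder, so every transform round-trips elements consistently
> >>>>>>     // instead of mixing two serialization schemes.
> >>>>>>     class BeamCoderSerDe {
> >>>>>>       static <T> byte[] toBytes(Coder<T> coder, T value) throws CoderException {
> >>>>>>         return CoderUtils.encodeToByteArray(coder, value);
> >>>>>>       }
> >>>>>>
> >>>>>>       static <T> T fromBytes(Coder<T> coder, byte[] bytes) throws CoderException {
> >>>>>>         return CoderUtils.decodeFromByteArray(coder, bytes);
> >>>>>>       }
> >>>>>>     }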
> >>>>>>
> >>>>>> I don't think the number of commits should be an issue--we shouldn't
> >>>>>> just squash years' worth of history away. (OTOH, if this is a case of
> >>>>>> this branch containing lots of little, irrelevant commits that would
> >>>>>> have normally been squashed away in the normal review process we do
> >>>>>> for the main branch, then, yes, some cleanup could be nice.)
> >>>>>>
> >>>>>> Regarding status:
> >>>>>>
> >>>>>> - the runner passes 89% of the ValidatesRunner tests in batch mode. We
> >>>>>> hope to pass more with the new Encoders
> >>>>>>
> >>>>>> - Streaming mode is barely started (waiting for multi-aggregations
> >>>>>> support in the Spark Structured Streaming framework from the Spark community)
> >>>>>>
> >>>>>> - The runner can execute Nexmark
> >>>>>>
> >>>>>> - Some things are not wired up yet
> >>>>>>
> >>>>>>      - Beam Schemas not wired with Spark Schemas
> >>>>>>
> >>>>>>      - Optional features of the model not implemented: state API, timer
> >>>>>> API, splittable DoFn API, …
> >>>>>>
> >>>>>> WDYT, can we merge it to master once the 2 steps are done?
> >>>>>>
> >>>>>> I think that as long as it sits parallel to the existing runner, and
> >>>>>> is clearly marked with its status, it makes sense to me. How many
> >>>>>> changes does it make to the existing codebase (as opposed to adding new
> >>>>>> code)?
> >>>
> >>>
> >>>
> >>