You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Maximilian Michels <mx...@apache.org> on 2019/01/30 11:26:30 UTC
Re: [BEAM-5442] Store duplicate unknown (runner) options in a list
argument
Thomas was so kind to implement Option 3) in
https://github.com/apache/beam/pull/7597
Heads-up to the Go SDK people to eventually implement the new
DescribePipelineOptionsRequest. Tracking issue:
https://issues.apache.org/jira/browse/BEAM-6549
Also related, we will have to follow-up with proper scoping of pipeline options:
https://issues.apache.org/jira/browse/BEAM-6537
Thanks,
Max
On 13.11.18 19:05, Robert Burke wrote:
> +1 to Option 3
>
> I'd rather have each SDK have a single point of well defined complexity to do
> something general, than have to make tiny but simple changes. Less toil and
> maintenance in the long run per SDK.
>
> Similarly I don't have time to make it happen right now.
>
> On Tue, Nov 13, 2018, 9:22 AM Thomas Weise <thw@apache.org
> <ma...@apache.org>> wrote:
>
> Discovering options from the job server would be the only way to perform
> full validation (and provide upfront help to the user).
>
> The runner cannot perform full validation, since it is not aware of the user
> and SDK options (that it has to forward to the SDK worker).
>
> Special runner options flag to forward unknown options also wouldn't fully
> solve the problem (besides being subject to change in the future). Let's say
> runner understands --fancy-int-option and the user repeats that option
> multiple times. Not knowing the type of option, the SDK will pass it as a
> list and the runner will fail.
>
> Replicating SDK options is a workaround for known runners but it really goes
> against the idea of portability (making assumptions outside the API
> contract). We already have runners implemented outside of Beam and hope for
> the ecosystem to grow. What we do for options should also work for those
> runners.
>
> I'm with Luke here that options discovery provides the best user experience
> and can address the other issues. Even the scenario of multiple intermediate
> runners could be addressed by forwarding the unparsed options with the
> discovery call. I don't see SDK implementation complexity as a significant
> drawback so far.
>
> Thomas
>
>
> On Mon, Nov 12, 2018 at 2:30 PM Lukasz Cwik <lcwik@google.com
> <ma...@google.com>> wrote:
>
>
> On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <mxm@apache.org
> <ma...@apache.org>> wrote:
>
> Thank you Robert and Lukasz for your points.
>
> > Note that I believe that we will want to have multiple URLs to
> support cross language pipelines since we will want to be able to
> ask other SDK languages/versions for their "list" of supported
> PipelineOptions.
>
> Why is that? The Runner itself is the source of truth for its options.
>
>
> Because other languages (or even different versions of the same
> language) may have their own options. For example, the Go SDK talks to a
> Java service which applies a SQL transform and returns the expanded form
> (this may require knowledge of some options like credentials for the
> filesystem, ...) and then talks to a Python service that performs
> another transform expansion. Finally the pipeline containing Go, Java
> and Python transforms is submitted to a runner and it performs its own
> internal replacements/expansions related to executing the pipeline.
>
> Everything else is SDK-related and should be validated there.
>
> I imagined the process to go like this:
>
> a) Parse options to find JobServer URL
> a) Retrieve options from JobServer
> c) Parse all options
> ...continue as always...
>
> An option is just represented by a name and a type. There is nothing
> more to it, at least as of now. So it should be possible to parse them
> in the SDK without much further work.
>
> Nevertheless, I agree with your sentiment, Robert. The "runner_option"
> flag would prevent additional complexity. I still don't prefer it
> because it's not nice from an end user perspective. If we were to
> implement it, I would definitely go for the "option promotion" which
> you
> mentioned.
>
> I hadn't thought about delegating runners, although the PortableRunner
> is basically a delegating Runner. If that was an important feature, I
> suppose the "runner_option" would be the preferred way.
>
> All in all, since there doesn't seem to be an excitement to implement
> JobServer option retrieval and we will need the help of all SDK
> developers, "runner_option" seems to be the more likely path.
>
>
> I would say its a lack of time for people to improve this story over
> others but it can be revisited at some point in the future and I agree
> that using --runner_option as an interim provides value.
>
>
> -Max
>
> On 08.11.18 21:50, Lukasz Cwik wrote:
> > The purpose of the spec would be to provide the names, type and
> > descriptions of the options. We don't need anything beyond the JSON
> > types (string, number, bool, object, list) because the only
> ambiguity we
> > get is how do we parse command line string into the JSON type
> (and that
> > ambiguity is actually only between string and non-string since
> all the
> > other JSON types are unambiguous).
> >
> > Also, I believe the flow would be
> > 1) Parse options
> > a) Find the URL from args specified and/or additional methods on
> > PipelineOptions that exposes a programmatic way to set the URL
> during
> > parsing.
> > b) Query URL for option specs
> > c) Parse the remainder of the options
> > 2) Construct pipeline
> > 3) Choose runner
> > 4) Submit job to runner
> >
> > Note that I believe that we will want to have multiple URLs to
> support
> > cross language pipelines since we will want to be able to ask
> other SDK
> > languages/versions for their "list" of supported PipelineOptions.
> >
> > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw
> <robertwb@google.com <ma...@google.com>
> > <mailto:robertwb@google.com <ma...@google.com>>> wrote:
> >
> > There's two questions here:
> >
> > (A) What do we do in the short term?
> >
> > I think adding every runner option to every SDK is not
> sustainable
> > (n*m work, assuming every SDK knows about every runner), and
> having a
> > patchwork of options that were added as one-offs to SDKs is not
> > desirable either. Furthermore, it seems difficult to parse
> unknown
> > options as if they were valid options, so my preference here
> would be
> > to just use a special runner_option flag. (One could also
> pass a set
> > of unparsed/unvalidated runner options to the runner, even if
> they're
> > not distinguished for the user, and runners (or any
> intermediates)
> > could run a "promote" operation that promotes any of these
> unknowns
> > that they recognize to real options before further
> processing. The
> > parsing would be done as repeated-string, and not be
> intermingled with
> > the actually validated options. This is essential a variant of
> > option 1.)
> >
> > (B) What do do in the long term? While the JobServer approach
> sounds
> > nice, I think it introduces a lot of complexity (we have too
> much of
> > that already) and still doesn't completely solve the problem. In
> > particular, it changes the flow from
> >
> > 1. Parse options
> > 2. Construct pipeline
> > 3. Choose runner
> > 4. Submit job to runner
> >
> > to
> >
> > 1. Parse options
> > 2. Construct pipeline
> > 3. Choose runner
> > 4a. Query runner for option specs
> > 4b. Re-parse options
> > 4c. Submit job to runner
> >
> > In particular, doing 4b in the SDK rather than just let the
> runner
> > itself do the validation as part of (4) doesn't save much and
> forces
> > us to come up with a (probably incomplete) spec as to how to
> define
> > options, their types, and their validations. It also means that a
> > delegating runner must choose and interact with its downstream
> > runner(s) synchronously, else we haven't actually solved the
> issue.
> >
> > For these reasons, I don't think we even want to go with the
> JobServer
> > approach in the long term, which has bearing on (A).
> >
> > - Robert
> >
> >
> > On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels
> <mxm@apache.org <ma...@apache.org>
> > <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
> > >
> > > +1
> > >
> > > If the preferred approach is to eventually have the JobServer
> > serve the
> > > options, then the best intermediate solution is to
> replicate common
> > > options in the SDKs.
> > >
> > > If we went down the "--runner_option" path, we would end
> up with
> > > multiple ways of specifying the same options. We would
> eventually
> > have
> > > to deprecate "runner options" once we have the JobServer
> > approach. I'd
> > > like to avoid that.
> > >
> > > For the upcoming release we can revert the changes again
> and add the
> > > most common missing options to the SDKs. Then hopefully we
> should
> > have
> > > fetching implemented for the release after.
> > >
> > > Do you think that is feasible?
> > >
> > > Thanks,
> > > Max
> > >
> > > On 30.10.18 23:00, Lukasz Cwik wrote:
> > > > I still like #3 the most, just can't devote the time to
> get it
> > done.
> > > >
> > > > Instead of going with a fully implemented #3, we could
> hardcode
> > the a
> > > > subset of options and types within each SDK until the
> job server is
> > > > ready to provide this information and then migrate to the
> > "full" list.
> > > > This would be an easy path for SDKs to take on. They could
> > "know" of a
> > > > few well known options, and if they want to support all
> > options, they
> > > > implement the integration with the job server.
> > > >
> > > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
> > <mxm@apache.org <ma...@apache.org>
> <mailto:mxm@apache.org <ma...@apache.org>>
> > > > <mailto:mxm@apache.org <ma...@apache.org>
> <mailto:mxm@apache.org <ma...@apache.org>>>> wrote:
> > > >
> > > > > I would prefer we don't introduce a (quirky) way
> of passing
> > > > unknown options that forces users to type JSON into the
> > command line
> > > > (or similar acrobatics)
> > > > Same here, the JSON approach seems technically nice
> but too
> > bulky
> > > > for users.
> > > >
> > > > > To someone wanting to run a pipeline, all options are
> > equally
> > > > important, whether they are application specific, SDK
> > specific or
> > > > runner specific.
> > > >
> > > > I'm also reluctant to force users to use
> `--runner_option=`
> > because the
> > > > division into "Runner" options and other options seems
> > rather arbitrary
> > > > to users. Most built-in options are also Runner-related.
> > > >
> > > > > It should be possible to *optionally*
> qualify/scope (to
> > cover
> > > > cases where there is ambiguity), but otherwise I
> prefer the
> > format
> > > > we currently have.
> > > >
> > > > Yes, namespacing is a problem. What happens if the user
> > defines a
> > > > custom
> > > > PipelineOption which clashes with one of the builtin
> ones?
> > If both are
> > > >
> > > > set, which one is actually applied?
> > > >
> > > >
> > > > Note that PipelineOptions so far has been treating name
> > equality to mean
> > > > option equality and the Java implementation has a bunch of
> > strict checks
> > > > to make sure that default values aren't used for duplicate
> > definitions,
> > > > they have the same type, etc...
> > > > With 1), you fail the job if the runner can't understand
> your
> > option
> > > > because its not represented the same way. User then
> needs to fix-up
> > > > their declaration of the option name.
> > > > With 2), there are no name conflicts, the SDK will need to
> > validate that
> > > > the option isn't set in both formats and error out if it
> is before
> > > > pipeline submission time.
> > > > With 3), you can prefetch all the options and error out
> to the user
> > > > during argument parsing time.
> > > >
> > > >
> > > >
> > > > Here is a summary of the possible paths going forward:
> > > >
> > > >
> > > > 1) Validate PipelineOptions at Runner side
> > > > ==========================================
> > > >
> > > > The main issue raised here was that we want to move away
> > from parsing
> > > > arguments which look like options without validating
> them.
> > An easy fix
> > > > would be to actually validate them on the Runner
> side. This
> > could be
> > > > done by changing the deserialization code of
> > PipelineOptions which so
> > > > far ignores unknown JSON options.
> > > >
> > > > See: PipelineOptionsTranslation.fromProto(Struct
> protoOptions)
> > > >
> > > > Actually, this wouldn't work for user-defined
> > PipelineOptions because
> > > > they might not be known to the Runner (if they are
> defined
> > in Python).
> > > >
> > > >
> > > > 2) Introduce a Runner-Option Flag
> > > > =================================
> > > >
> > > > In this approach we would try to add as many pipeline
> > options for a
> > > > Runner to the SDK, but allow additional Runner
> options to
> > be passed
> > > > using the `--runner-option=key=val` flag. The
> Runner, like
> > in 1), would
> > > > have to ensure validation. I think this has been the
> most
> > favored
> > > > way so
> > > > far. Going forward, that means that
> `--parallelism=4` and
> > > > `--runner-option=parallelism=4` will have the same
> effect
> > for the Flink
> > > > Runner.
> > > >
> > > >
> > > > 3) Implement Fetching of Options from JobServer
> > > > ===============================================
> > > >
> > > > The options are retrieved from the JobServer before
> > submitting the
> > > > pipeline. I think this would be ideal but, as mentioned
> > before, it
> > > > increases the complexity for implementing new SDKs and
> > might overall
> > > > just not be worth the effort.
> > > >
> > > >
> > > > What do you think? I'd implement 2) for the next
> release,
> > unless there
> > > > are advocates for a different approach.
> > > >
> > > > Cheers,
> > > > Max
> >
>