You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Maximilian Michels <mx...@apache.org> on 2019/01/30 11:26:30 UTC
Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Thomas was so kind to implement Option 3) in 
https://github.com/apache/beam/pull/7597

Heads-up to the Go SDK people to eventually implement the new 
DescribePipelineOptionsRequest. Tracking issue: 
https://issues.apache.org/jira/browse/BEAM-6549

Also related, we will have to follow-up with proper scoping of pipeline options: 
https://issues.apache.org/jira/browse/BEAM-6537

Thanks,
Max

On 13.11.18 19:05, Robert Burke wrote:
> +1 to Option 3
> 
> I'd rather have each SDK have a single point of well defined complexity to do 
> something general, than have to make tiny but simple changes. Less toil and 
> maintenance in the long run per SDK.
> 
> Similarly I don't have time to make it happen right now.
> 
> On Tue, Nov 13, 2018, 9:22 AM Thomas Weise <thw@apache.org 
> <ma...@apache.org>> wrote:
> 
>     Discovering options from the job server would be the only way to perform
>     full validation (and provide upfront help to the user).
> 
>     The runner cannot perform full validation, since it is not aware of the user
>     and SDK options (that it has to forward to the SDK worker).
> 
>     Special runner options flag to forward unknown options also wouldn't fully
>     solve the problem (besides being subject to change in the future). Let's say
>     runner understands --fancy-int-option and the user repeats that option
>     multiple times. Not knowing the type of option, the SDK will pass it as a
>     list and the runner will fail.
> 
>     Replicating SDK options is a workaround for known runners but it really goes
>     against the idea of portability (making assumptions outside the API
>     contract). We already have runners implemented outside of Beam and hope for
>     the ecosystem to grow. What we do for options should also work for those
>     runners.
> 
>     I'm with Luke here that options discovery provides the best user experience
>     and can address the other issues. Even the scenario of multiple intermediate
>     runners could be addressed by forwarding the unparsed options with the
>     discovery call. I don't see SDK implementation complexity as a significant
>     drawback so far.
> 
>     Thomas
> 
> 
>     On Mon, Nov 12, 2018 at 2:30 PM Lukasz Cwik <lcwik@google.com
>     <ma...@google.com>> wrote:
> 
> 
>         On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <mxm@apache.org
>         <ma...@apache.org>> wrote:
> 
>             Thank you Robert and Lukasz for your points.
> 
>              > Note that I believe that we will want to have multiple URLs to
>             support cross language pipelines since we will want to be able to
>             ask other SDK languages/versions for their "list" of supported
>             PipelineOptions.
> 
>             Why is that? The Runner itself is the source of truth for its options.
> 
> 
>         Because other languages (or even different versions of the same
>         language) may have their own options. For example, the Go SDK talks to a
>         Java service which applies a SQL transform and returns the expanded form
>         (this may require knowledge of some options like credentials for the
>         filesystem, ...) and then talks to a Python service that performs
>         another transform expansion. Finally the pipeline containing Go, Java
>         and Python transforms is submitted to a runner and it performs its own
>         internal replacements/expansions related to executing the pipeline.
> 
>             Everything else is SDK-related and should be validated there.
> 
>             I imagined the process to go like this:
> 
>                 a) Parse options to find JobServer URL
>                 a) Retrieve options from JobServer
>                 c) Parse all options
>                 ...continue as always...
> 
>             An option is just represented by a name and a type. There is nothing
>             more to it, at least as of now. So it should be possible to parse them
>             in the SDK without much further work.
> 
>             Nevertheless, I agree with your sentiment, Robert. The "runner_option"
>             flag would prevent additional complexity. I still don't prefer it
>             because it's not nice from an end user perspective. If we were to
>             implement it, I would definitely go for the "option promotion" which
>             you
>             mentioned.
> 
>             I hadn't thought about delegating runners, although the PortableRunner
>             is basically a delegating Runner. If that was an important feature, I
>             suppose the "runner_option" would be the preferred way.
> 
>             All in all, since there doesn't seem to be an excitement to implement
>             JobServer option retrieval and we will need the help of all SDK
>             developers, "runner_option" seems to be the more likely path.
> 
> 
>         I would say its a lack of time for people to improve this story over
>         others but it can be revisited at some point in the future and I agree
>         that using --runner_option as an interim provides value.
> 
> 
>             -Max
> 
>             On 08.11.18 21:50, Lukasz Cwik wrote:
>              > The purpose of the spec would be to provide the names, type and
>              > descriptions of the options. We don't need anything beyond the JSON
>              > types (string, number, bool, object, list) because the only
>             ambiguity we
>              > get is how do we parse command line string into the JSON type
>             (and that
>              > ambiguity is actually only between string and non-string since
>             all the
>              > other JSON types are unambiguous).
>              >
>              > Also, I believe the flow would be
>              > 1) Parse options
>              >    a) Find the URL from args specified and/or additional methods on
>              > PipelineOptions that exposes a programmatic way to set the URL
>             during
>              > parsing.
>              >    b) Query URL for option specs
>              >    c) Parse the remainder of the options
>              > 2) Construct pipeline
>              > 3) Choose runner
>              > 4) Submit job to runner
>              >
>              > Note that I believe that we will want to have multiple URLs to
>             support
>              > cross language pipelines since we will want to be able to ask
>             other SDK
>              > languages/versions for their "list" of supported PipelineOptions.
>              >
>              > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw
>             <robertwb@google.com <ma...@google.com>
>              > <mailto:robertwb@google.com <ma...@google.com>>> wrote:
>              >
>              >     There's two questions here:
>              >
>              >     (A) What do we do in the short term?
>              >
>              >     I think adding every runner option to every SDK is not
>             sustainable
>              >     (n*m work, assuming every SDK knows about every runner), and
>             having a
>              >     patchwork of options that were added as one-offs to SDKs is not
>              >     desirable either. Furthermore, it seems difficult to parse
>             unknown
>              >     options as if they were valid options, so my preference here
>             would be
>              >     to just use a special runner_option flag. (One could also
>             pass a set
>              >     of unparsed/unvalidated runner options to the runner, even if
>             they're
>              >     not distinguished for the user, and runners (or any
>             intermediates)
>              >     could run a "promote" operation that promotes any of these
>             unknowns
>              >     that they recognize to real options before further
>             processing. The
>              >     parsing would be done as repeated-string, and not be
>             intermingled with
>              >     the actually validated options. This is essential a variant of
>              >     option 1.)
>              >
>              >     (B) What do do in the long term? While the JobServer approach
>             sounds
>              >     nice, I think it introduces a lot of complexity (we have too
>             much of
>              >     that already) and still doesn't completely solve the problem. In
>              >     particular, it changes the flow from
>              >
>              >     1. Parse options
>              >     2. Construct pipeline
>              >     3. Choose runner
>              >     4. Submit job to runner
>              >
>              >     to
>              >
>              >     1. Parse options
>              >     2. Construct pipeline
>              >     3. Choose runner
>              >     4a. Query runner for option specs
>              >     4b. Re-parse options
>              >     4c. Submit job to runner
>              >
>              >     In particular, doing 4b in the SDK rather than just let the
>             runner
>              >     itself do the validation as part of (4) doesn't save much and
>             forces
>              >     us to come up with a (probably incomplete) spec as to how to
>             define
>              >     options, their types, and their validations. It also means that a
>              >     delegating runner must choose and interact with its downstream
>              >     runner(s) synchronously, else we haven't actually solved the
>             issue.
>              >
>              >     For these reasons, I don't think we even want to go with the
>             JobServer
>              >     approach in the long term, which has bearing on (A).
>              >
>              >     - Robert
>              >
>              >
>              >     On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels
>             <mxm@apache.org <ma...@apache.org>
>              >     <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
>              >      >
>              >      > +1
>              >      >
>              >      > If the preferred approach is to eventually have the JobServer
>              >     serve the
>              >      > options, then the best intermediate solution is to
>             replicate common
>              >      > options in the SDKs.
>              >      >
>              >      > If we went down the "--runner_option" path, we would end
>             up with
>              >      > multiple ways of specifying the same options. We would
>             eventually
>              >     have
>              >      > to deprecate "runner options" once we have the JobServer
>              >     approach. I'd
>              >      > like to avoid that.
>              >      >
>              >      > For the upcoming release we can revert the changes again
>             and add the
>              >      > most common missing options to the SDKs. Then hopefully we
>             should
>              >     have
>              >      > fetching implemented for the release after.
>              >      >
>              >      > Do you think that is feasible?
>              >      >
>              >      > Thanks,
>              >      > Max
>              >      >
>              >      > On 30.10.18 23:00, Lukasz Cwik wrote:
>              >      > > I still like #3 the most, just can't devote the time to
>             get it
>              >     done.
>              >      > >
>              >      > > Instead of going with a fully implemented #3, we could
>             hardcode
>              >     the a
>              >      > > subset of options and types within each SDK until the
>             job server is
>              >      > > ready to provide this information and then migrate to the
>              >     "full" list.
>              >      > > This would be an easy path for SDKs to take on. They could
>              >     "know" of a
>              >      > > few well known options, and if they want to support all
>              >     options, they
>              >      > > implement the integration with the job server.
>              >      > >
>              >      > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
>              >     <mxm@apache.org <ma...@apache.org>
>             <mailto:mxm@apache.org <ma...@apache.org>>
>              >      > > <mailto:mxm@apache.org <ma...@apache.org>
>             <mailto:mxm@apache.org <ma...@apache.org>>>> wrote:
>              >      > >
>              >      > >      > I would prefer we don't introduce a (quirky) way
>             of passing
>              >      > >     unknown options that forces users to type JSON into the
>              >     command line
>              >      > >     (or similar acrobatics)
>              >      > >     Same here, the JSON approach seems technically nice
>             but too
>              >     bulky
>              >      > >     for users.
>              >      > >
>              >      > >      > To someone wanting to run a pipeline, all options are
>              >     equally
>              >      > >     important, whether they are application specific, SDK
>              >     specific or
>              >      > >     runner specific.
>              >      > >
>              >      > >     I'm also reluctant to force users to use
>             `--runner_option=`
>              >     because the
>              >      > >     division into "Runner" options and other options seems
>              >     rather arbitrary
>              >      > >     to users. Most built-in options are also Runner-related.
>              >      > >
>              >      > >      > It should be possible to *optionally*
>             qualify/scope (to
>              >     cover
>              >      > >     cases where there is ambiguity), but otherwise I
>             prefer the
>              >     format
>              >      > >     we currently have.
>              >      > >
>              >      > >     Yes, namespacing is a problem. What happens if the user
>              >     defines a
>              >      > >     custom
>              >      > >     PipelineOption which clashes with one of the builtin
>             ones?
>              >     If both are
>              >      > >
>              >      > >     set, which one is actually applied?
>              >      > >
>              >      > >
>              >      > > Note that PipelineOptions so far has been treating name
>              >     equality to mean
>              >      > > option equality and the Java implementation has a bunch of
>              >     strict checks
>              >      > > to make sure that default values aren't used for duplicate
>              >     definitions,
>              >      > > they have the same type, etc...
>              >      > > With 1), you fail the job if the runner can't understand
>             your
>              >     option
>              >      > > because its not represented the same way. User then
>             needs to fix-up
>              >      > > their declaration of the option name.
>              >      > > With 2), there are no name conflicts, the SDK will need to
>              >     validate that
>              >      > > the option isn't set in both formats and error out if it
>             is before
>              >      > > pipeline submission time.
>              >      > > With 3), you can prefetch all the options and error out
>             to the user
>              >      > > during argument parsing time.
>              >      > >
>              >      > >
>              >      > >
>              >      > >     Here is a summary of the possible paths going forward:
>              >      > >
>              >      > >
>              >      > >     1) Validate PipelineOptions at Runner side
>              >      > >     ==========================================
>              >      > >
>              >      > >     The main issue raised here was that we want to move away
>              >     from parsing
>              >      > >     arguments which look like options without validating
>             them.
>              >     An easy fix
>              >      > >     would be to actually validate them on the Runner
>             side. This
>              >     could be
>              >      > >     done by changing the deserialization code of
>              >     PipelineOptions which so
>              >      > >     far ignores unknown JSON options.
>              >      > >
>              >      > >     See: PipelineOptionsTranslation.fromProto(Struct
>             protoOptions)
>              >      > >
>              >      > >     Actually, this wouldn't work for user-defined
>              >     PipelineOptions because
>              >      > >     they might not be known to the Runner (if they are
>             defined
>              >     in Python).
>              >      > >
>              >      > >
>              >      > >     2) Introduce a Runner-Option Flag
>              >      > >     =================================
>              >      > >
>              >      > >     In this approach we would try to add as many pipeline
>              >     options for a
>              >      > >     Runner to the SDK, but allow additional Runner
>             options to
>              >     be passed
>              >      > >     using the `--runner-option=key=val` flag. The
>             Runner, like
>              >     in 1), would
>              >      > >     have to ensure validation. I think this has been the
>             most
>              >     favored
>              >      > >     way so
>              >      > >     far. Going forward, that means that
>             `--parallelism=4` and
>              >      > >     `--runner-option=parallelism=4` will have the same
>             effect
>              >     for the Flink
>              >      > >     Runner.
>              >      > >
>              >      > >
>              >      > >     3) Implement Fetching of Options from JobServer
>              >      > >     ===============================================
>              >      > >
>              >      > >     The options are retrieved from the JobServer before
>              >     submitting the
>              >      > >     pipeline. I think this would be ideal but, as mentioned
>              >     before, it
>              >      > >     increases the complexity for implementing new SDKs and
>              >     might overall
>              >      > >     just not be worth the effort.
>              >      > >
>              >      > >
>              >      > >     What do you think? I'd implement 2) for the next
>             release,
>              >     unless there
>              >      > >     are advocates for a different approach.
>              >      > >
>              >      > >     Cheers,
>              >      > >     Max
>              >
>