Posted to dev@beam.apache.org by Thomas Weise <th...@apache.org> on 2018/10/12 15:50:08 UTC

[BEAM-5442] Store duplicate unknown (runner) options in a list argument

[moving to the list]

The requirement driving this part of the change was to allow a user to
specify pipeline options that a runner supports without having to declare
those in each language SDK.

In the specific scenario, we have options that the Flink runner supports
(and can validate), that are not enumerated in the Python SDK.

I think we have a bigger problem scoping pipeline options. For example, the
runner options are dumped into the SDK worker. There is also a possibility
of name collisions. So I think this would benefit from broader feedback.

Thanks,
Thomas
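
For readers following along, here is a minimal sketch of the behavior under discussion: flags the SDK has not declared are kept, and a flag given more than once is stored as a list. It uses plain argparse and is not the Beam implementation; the function name and the `--tag` flag are hypothetical.

```python
import argparse
from collections import defaultdict


def parse_with_unknown(argv):
    """Parse declared flags and keep unknown --key=value flags separately."""
    # allow_abbrev=False stops argparse from matching unknown flags by prefix.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--runner", default="DirectRunner")  # a declared option
    known, unknown = parser.parse_known_args(argv)

    collected = defaultdict(list)
    for arg in unknown:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            collected[key].append(value)

    # One occurrence stays a string; repeated occurrences become a list.
    extras = {k: (v[0] if len(v) == 1 else v) for k, v in collected.items()}
    return vars(known), extras


known, extras = parse_with_unknown(
    ["--runner=PortableRunner", "--parallelism=4", "--tag=a", "--tag=b"]
)
```

Because nothing declares the type of `parallelism` or `tag`, the SDK can only guess from the number of occurrences, which is part of the scoping concern raised above.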


---------- Forwarded message ---------
From: Charles Chen <no...@github.com>
Date: Fri, Oct 12, 2018 at 8:36 AM
Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown options in a
list argument (#6600)
To: apache/beam <be...@noreply.github.com>
Cc: Thomas Weise <th...@gmail.com>, Mention <
mention@noreply.github.com>


CC: @tweise <https://github.com/tweise>


Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Udi Meiri <eh...@google.com>.

+1 for explicit --runner_option=param=val,...
It's hard to tell otherwise where an option is going.
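
A rough sketch of what an explicit, repeatable `--runner_option=KEY=VALUE` flag could look like with argparse. This is an assumption about one possible shape, not an existing Beam flag; it handles the repeated form only (not a comma-separated list), and the option names in the usage line are hypothetical.

```python
import argparse


def parse_runner_options(argv):
    """Collect repeated --runner_option=KEY=VALUE flags into a dict."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument(
        "--runner_option", action="append", default=[], metavar="KEY=VALUE"
    )
    args, rest = parser.parse_known_args(argv)

    runner_options = {}
    for item in args.runner_option:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"--runner_option expects KEY=VALUE, got {item!r}")
        runner_options[key] = value  # values stay strings; the runner validates
    return runner_options, rest


opts, rest = parse_runner_options(
    ["--runner_option=parallelism=4", "--runner_option=max_bundle_size=100",
     "--job_name=demo"]
)
```

The point of the explicit flag is that everything in `opts` is visibly destined for the runner, while `rest` goes through the SDK's normal option parsing.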

On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <ro...@google.com> wrote:

> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mx...@apache.org> wrote:
> >
> > I agree that the current approach breaks the pipeline options contract
> > because "unknown" options get parsed in the same way as options which
> > have been defined by the user.
>
> FWIW, I think we're already breaking this "contract." Unknown options
> are silently ignored; with this change we just change how we record
> them. It still feels a bit hacky though.
>
> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
> > true/false flags. We want to pass all types of pipeline options to the
> > Runner.
>
> Experiments is an arbitrary set of strings, which can be of the form
> "param=value" if that's useful. (Dataflow does this.) There is, again,
> no namespacing on the param names, but we could use URNs or impose
> some other structure here.
>
> > How to solve this?
> >
> > 1) Add all options of all Runners to each SDK
> > We added some of the FlinkRunner options to the Python SDK but realized
> > syncing is rather cumbersome in the long term. However, we want the most
> > important options to be validated on the client side.
>
> I don't think this is sustainable in the long run. However, thinking
> about this, in the worst case validation happens after construction
> but before execution (as with much of our other validation) so it
> isn't that bad.
>
> > 2) Pass "unknown" options via a separate list in the Proto which can
> > only be accessed internally by the Runners. This still allows passing
> > arbitrary options but we wouldn't leak unknown options and display them
> > as top-level options.
>
> I think there needs to be a way for the user to communicate values
> directly to the runner regardless of the SDK. My preference would be
> to make this explicit, e.g. (repeated) --runner_option=..., rather
> than scooping up all unknown flags at command line parsing time.
> Perhaps an SDK that is aware of some runners could choose to lift
> these as top-level options, but still pass them as runner options.
>
> > On 13.10.18 02:34, Charles Chen wrote:
> > > The current release branch
> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after
> > > the revert went in.  Sent out https://github.com/apache/beam/pull/6683 as
> > > a revert of the revert.  Regarding your comment above, I can help out
> > > with the design / PR reviews for common Python code as you suggest.
> > >
> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org
> > > <ma...@apache.org>> wrote:
> > >
> > >     Thanks, will tag you and looking forward to feedback so we can
> > >     ensure that changes work for everyone.
> > >
> > >     Looking at the PR, I see agreement from Max to revert the change on
> > >     the release branch, but not in master. Would you mind restoring it
> > >     in master?
> > >
> > >     Thanks
> > >
> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com
> > >     <ma...@google.com>> wrote:
> > >
> > >
> > >
> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@google.com
> > >         <ma...@google.com>> wrote:
> > >
> > >             What I mean is that a user may find that it works for them
> > >             to pass "--myarg blah" and access it as "options.myarg"
> > >             without explicitly defining a "my_arg" flag due to the added
> > >             logic.  This is not the intended behavior and we may want to
> > >             change this implementation detail in the future.  However,
> > >             having this logic in a released version makes it hard to
> > >             change this behavior since users may erroneously depend on
> > >             this undocumented behavior.  Instead, we should namespace /
> > >             scope this so that it is obvious that this is meant for
> > >             runner (and not Beam user) consumption.
> > >
> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
> > >             <thw@apache.org <ma...@apache.org>> wrote:
> > >
> > >                 Can you please elaborate on what practical problems
> > >                 this introduces for users?
> > >
> > >                 I can see that this change allows a user to specify a
> > >                 runner specific option, which in the future may change
> > >                 because we decide to scope differently. If this only
> > >                 affects users of the portable Flink runner (like us),
> > >                 then no need to revert, because at this early stage we
> > >                 prefer something that works over being blocked.
> > >
> > >                 It would also be really great if some of the core Python
> > >                 SDK developers could help out with the design aspects
> > >                 and PR reviews of changes that affect common Python
> > >                 code. Anyone who specifically wants to be tagged on
> > >                 relevant JIRAs and PRs?
> > >
> > >
> > >         I would be happy to be tagged, and I can also help with
> > >         including other relevant folks whenever possible. In general I
> > >         think Robert, Charles, and I are good candidates.
> > >
> > >
> > >                 Thanks
> > >
> > >
> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
> > >                 <altay@google.com <ma...@google.com>> wrote:
> > >
> > >
> > >
> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
> > >                     <ccy@google.com <ma...@google.com>> wrote:
> > >
> > >                         For context, I made comments on
> > >                         https://github.com/apache/beam/pull/6600 noting
> > >                         that the changes being made were not good for
> > >                         Beam backwards-compatibility.  The change as is
> > >                         allows users to use pipeline options without
> > >                         explicitly defining them, which is not the type
> > >                         of usage we would like to encourage since we
> > >                         prefer to be explicit whenever possible.  If
> > >                         users write pipelines with this sort of pattern,
> > >                         they will potentially encounter pain when
> > >                         upgrading to a later version since this is an
> > >                         implementation detail and not an officially
> > >                         supported pattern.  I agree with the comments
> > >                         above that this is ultimately a scoping issue.
> > >                         I would not have a problem with these changes if
> > >                         they were explicitly scoped under either a
> > >                         runner or unparsed options namespace.
> > >
> > >                         As a second note, since the 2.8.0 release is
> > >                         being cut right now, because of these
> > >                         backwards-compatibility concerns, I would
> > >                         suggest reverting these changes, at least until
> > >                         2.8.0 is cut, so we can have a discussion here
> > >                         before committing to and releasing any API-level
> > >                         changes.
> > >
> > >
> > >                     +1 I would like to revert the changes in order not to
> > >                     rush this into the release. Once this discussion
> > >                     results in an agreement, changes can be brought back.
> > >
> > >
> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
> > >                         <herohde@google.com <mailto:herohde@google.com>>
> > >                         wrote:
> > >
> > >                             Agree that pipeline options lack some
> > >                             mechanism for scoping. It is also not always
> > >                             possible to distinguish options meant to be
> > >                             consumed at pipeline construction time, by
> > >                             the runner, by the SDK harness, by the user
> > >                             code or any combination -- and this causes
> > >                             confusion every now and then.
> > >
> > >                             For Dataflow, we have been using
> > >                             "experiments" for arbitrary runner-specific
> > >                             options. It's simply a string list pipeline
> > >                             option that all SDKs support and, for Go at
> > >                             least, is sent to portable runners. Flink
> > >                             can do the same in the short term to move
> > >                             forward.
> > >
> > >                             Henning
> > >
> > >
> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
> > >                             <thw@apache.org <ma...@apache.org>> wrote:
> > >
> > >                                 [moving to the list]
> > >
> > >                                 The requirement driving this part of the
> > >                                 change was to allow a user to specify
> > >                                 pipeline options that a runner supports
> > >                                 without having to declare those in each
> > >                                 language SDK.
> > >
> > >                                 In the specific scenario, we have
> > >                                 options that the Flink runner supports
> > >                                 (and can validate), that are not
> > >                                 enumerated in the Python SDK.
> > >
> > >                                 I think we have a bigger problem scoping
> > >                                 pipeline options. For example, the
> > >                                 runner options are dumped into the SDK
> > >                                 worker. There is also a possibility of
> > >                                 name collisions. So I think this would
> > >                                 benefit from broader feedback.
> > >
> > >                                 Thanks,
> > >                                 Thomas
> > >
> > >
> > >                                 ---------- Forwarded message ---------
> > >                                 From: *Charles Chen*
> > >                                 <notifications@github.com
> > >                                 <ma...@github.com>>
> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
> > >                                 Subject: Re: [apache/beam] [BEAM-5442]
> > >                                 Store duplicate unknown options in a
> > >                                 list argument (#6600)
> > >                                 To: apache/beam <beam@noreply.github.com
> > >                                 <ma...@noreply.github.com>>
> > >                                 Cc: Thomas Weise <thomas.weise@gmail.com
> > >                                 <ma...@gmail.com>>,
> > >                                 Mention <mention@noreply.github.com
> > >                                 <ma...@noreply.github.com>>
> > >
> > >
> > >                                 CC: @tweise <https://github.com/tweise>
> > >
>
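
The "experiments" idea in the quoted thread (an arbitrary list of strings, some of the form "param=value") can be sketched as below. This is only an illustration of splitting such a list into bare flags and key/value pairs; the function name and the sample entries are hypothetical, and no runner's actual handling is implied.

```python
def parse_experiments(experiments):
    """Split a list of experiment strings into bare flags and param=value pairs."""
    flags = set()
    params = {}
    for entry in experiments:
        name, sep, value = entry.partition("=")
        if sep:
            params[name] = value  # "param=value" form
        else:
            flags.add(name)       # bare flag, treated as enabled
    return flags, params


flags, params = parse_experiments(["use_feature_x", "parallelism=4"])
```

As noted in the thread, nothing here namespaces the parameter names, so two consumers could still interpret the same key differently.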

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Maximilian Michels <mx...@apache.org>.

Thomas was kind enough to implement Option 3) in
https://github.com/apache/beam/pull/7597

Heads-up to the Go SDK people to eventually implement the new 
DescribePipelineOptionsRequest. Tracking issue: 
https://issues.apache.org/jira/browse/BEAM-6549

Also related, we will have to follow up with proper scoping of pipeline options:
https://issues.apache.org/jira/browse/BEAM-6537

Thanks,
Max
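
To make the Option 3 flow concrete (find the job server, retrieve its option descriptors, then parse everything), here is a small self-contained sketch. It does not use the real DescribePipelineOptionsRequest API: `describe_options` is a stand-in for the remote call, and the descriptor shape (name plus a JSON-like type), the endpoint value, and the option names are all assumptions.

```python
import argparse

# Hypothetical descriptors, keyed by job server endpoint.
FAKE_JOB_SERVERS = {
    "localhost:8099": [
        {"name": "parallelism", "type": "number"},
        {"name": "extra_jar", "type": "list"},
    ]
}


def _number(text):
    try:
        return int(text)
    except ValueError:
        return float(text)


CONVERTERS = {
    "string": str,
    "number": _number,
    "bool": lambda s: s.lower() in ("1", "true", "yes"),
}


def describe_options(endpoint):
    """Stand-in for asking the job server which options it supports."""
    return FAKE_JOB_SERVERS[endpoint]


def parse_all(argv):
    # a) Parse just enough to find the job server.
    bootstrap = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    bootstrap.add_argument("--job_endpoint", required=True)
    boot, rest = bootstrap.parse_known_args(argv)

    # b) Retrieve the option descriptors.
    descriptors = describe_options(boot.job_endpoint)

    # c) Parse the remaining options with full type information.
    full = argparse.ArgumentParser(allow_abbrev=False)
    for d in descriptors:
        if d["type"] == "list":
            full.add_argument("--" + d["name"], action="append", default=[])
        else:
            full.add_argument("--" + d["name"], type=CONVERTERS[d["type"]])
    return boot.job_endpoint, vars(full.parse_args(rest))


endpoint, options = parse_all(
    ["--job_endpoint=localhost:8099", "--parallelism=4",
     "--extra_jar=a.jar", "--extra_jar=b.jar"]
)
```

With the descriptors in hand, an unknown or mistyped flag fails at argument parsing time on the client, before any pipeline is submitted.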

On 13.11.18 19:05, Robert Burke wrote:
> +1 to Option 3
> 
> I'd rather have each SDK have a single point of well defined complexity to do 
> something general, than have to make tiny but simple changes. Less toil and 
> maintenance in the long run per SDK.
> 
> Similarly I don't have time to make it happen right now.
> 
> On Tue, Nov 13, 2018, 9:22 AM Thomas Weise <thw@apache.org 
> <ma...@apache.org>> wrote:
> 
>     Discovering options from the job server would be the only way to perform
>     full validation (and provide upfront help to the user).
> 
>     The runner cannot perform full validation, since it is not aware of the user
>     and SDK options (that it has to forward to the SDK worker).
> 
>     Special runner options flag to forward unknown options also wouldn't fully
>     solve the problem (besides being subject to change in the future). Let's say
>     runner understands --fancy-int-option and the user repeats that option
>     multiple times. Not knowing the type of option, the SDK will pass it as a
>     list and the runner will fail.
> 
>     Replicating SDK options is a workaround for known runners but it really goes
>     against the idea of portability (making assumptions outside the API
>     contract). We already have runners implemented outside of Beam and hope for
>     the ecosystem to grow. What we do for options should also work for those
>     runners.
> 
>     I'm with Luke here that options discovery provides the best user experience
>     and can address the other issues. Even the scenario of multiple intermediate
>     runners could be addressed by forwarding the unparsed options with the
>     discovery call. I don't see SDK implementation complexity as a significant
>     drawback so far.
> 
>     Thomas
> 
> 
>     On Mon, Nov 12, 2018 at 2:30 PM Lukasz Cwik <lcwik@google.com
>     <ma...@google.com>> wrote:
> 
> 
>         On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <mxm@apache.org
>         <ma...@apache.org>> wrote:
> 
>             Thank you Robert and Lukasz for your points.
> 
>              > Note that I believe that we will want to have multiple URLs to
>             support cross language pipelines since we will want to be able to
>             ask other SDK languages/versions for their "list" of supported
>             PipelineOptions.
> 
>             Why is that? The Runner itself is the source of truth for its options.
> 
> 
>         Because other languages (or even different versions of the same
>         language) may have their own options. For example, the Go SDK talks to a
>         Java service which applies a SQL transform and returns the expanded form
>         (this may require knowledge of some options like credentials for the
>         filesystem, ...) and then talks to a Python service that performs
>         another transform expansion. Finally the pipeline containing Go, Java
>         and Python transforms is submitted to a runner and it performs its own
>         internal replacements/expansions related to executing the pipeline.
> 
>             Everything else is SDK-related and should be validated there.
> 
>             I imagined the process to go like this:
> 
>                 a) Parse options to find JobServer URL
>                 b) Retrieve options from JobServer
>                 c) Parse all options
>                 ...continue as always...
> 
>             An option is just represented by a name and a type. There is nothing
>             more to it, at least as of now. So it should be possible to parse them
>             in the SDK without much further work.
> 
>             Nevertheless, I agree with your sentiment, Robert. The "runner_option"
>             flag would prevent additional complexity. I still don't prefer it
>             because it's not nice from an end user perspective. If we were to
>             implement it, I would definitely go for the "option promotion" which
>             you mentioned.
> 
>             I hadn't thought about delegating runners, although the PortableRunner
>             is basically a delegating Runner. If that was an important feature, I
>             suppose the "runner_option" would be the preferred way.
> 
>             All in all, since there doesn't seem to be much excitement to implement
>             JobServer option retrieval and we will need the help of all SDK
>             developers, "runner_option" seems to be the more likely path.
> 
> 
>         I would say it's a lack of time for people to improve this story over
>         others but it can be revisited at some point in the future and I agree
>         that using --runner_option as an interim provides value.
> 
> 
>             -Max
> 
>             On 08.11.18 21:50, Lukasz Cwik wrote:
>              > The purpose of the spec would be to provide the names, types and
>              > descriptions of the options. We don't need anything beyond the JSON
>              > types (string, number, bool, object, list) because the only
>              > ambiguity we get is how we parse a command line string into the
>              > JSON type (and that ambiguity is actually only between string and
>              > non-string since all the other JSON types are unambiguous).
>              >
>              > Also, I believe the flow would be
>              > 1) Parse options
>              >    a) Find the URL from args specified and/or additional methods on
>              > PipelineOptions that expose a programmatic way to set the URL
>              > during parsing.
>              >    b) Query URL for option specs
>              >    c) Parse the remainder of the options
>              > 2) Construct pipeline
>              > 3) Choose runner
>              > 4) Submit job to runner
>              >
>              > Note that I believe that we will want to have multiple URLs to
>              > support cross language pipelines since we will want to be able to
>              > ask other SDK languages/versions for their "list" of supported
>              > PipelineOptions.
>              >
>              > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw
>             <robertwb@google.com <ma...@google.com>
>              > <mailto:robertwb@google.com <ma...@google.com>>> wrote:
>              >
>              >     There are two questions here:
>              >
>              >     (A) What do we do in the short term?
>              >
>              >     I think adding every runner option to every SDK is not
>              >     sustainable (n*m work, assuming every SDK knows about every
>              >     runner), and having a patchwork of options that were added as
>              >     one-offs to SDKs is not desirable either. Furthermore, it seems
>              >     difficult to parse unknown options as if they were valid
>              >     options, so my preference here would be to just use a special
>              >     runner_option flag. (One could also pass a set of
>              >     unparsed/unvalidated runner options to the runner, even if
>              >     they're not distinguished for the user, and runners (or any
>              >     intermediates) could run a "promote" operation that promotes any
>              >     of these unknowns that they recognize to real options before
>              >     further processing. The parsing would be done as
>              >     repeated-string, and not be intermingled with the actually
>              >     validated options. This is essentially a variant of option 1.)
>              >
>              >     (B) What do we do in the long term? While the JobServer approach
>              >     sounds nice, I think it introduces a lot of complexity (we have
>              >     too much of that already) and still doesn't completely solve the
>              >     problem. In particular, it changes the flow from
>              >
>              >     1. Parse options
>              >     2. Construct pipeline
>              >     3. Choose runner
>              >     4. Submit job to runner
>              >
>              >     to
>              >
>              >     1. Parse options
>              >     2. Construct pipeline
>              >     3. Choose runner
>              >     4a. Query runner for option specs
>              >     4b. Re-parse options
>              >     4c. Submit job to runner
>              >
>              >     In particular, doing 4b in the SDK rather than just letting the
>              >     runner itself do the validation as part of (4) doesn't save much
>              >     and forces us to come up with a (probably incomplete) spec as to
>              >     how to define options, their types, and their validations. It
>              >     also means that a delegating runner must choose and interact
>              >     with its downstream runner(s) synchronously, else we haven't
>              >     actually solved the issue.
>              >
>              >     For these reasons, I don't think we even want to go with the
>              >     JobServer approach in the long term, which has bearing on (A).
>              >
>              >     - Robert
>              >
>              >
>              >     On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels
>             <mxm@apache.org <ma...@apache.org>
>              >     <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
>              >      >
>              >      > +1
>              >      >
>              >      > If the preferred approach is to eventually have the JobServer
>              >      > serve the options, then the best intermediate solution is to
>              >      > replicate common options in the SDKs.
>              >      >
>              >      > If we went down the "--runner_option" path, we would end up
>              >      > with multiple ways of specifying the same options. We would
>              >      > eventually have to deprecate "runner options" once we have the
>              >      > JobServer approach. I'd like to avoid that.
>              >      >
>              >      > For the upcoming release we can revert the changes again and
>              >      > add the most common missing options to the SDKs. Then hopefully
>              >      > we should have fetching implemented for the release after.
>              >      >
>              >      > Do you think that is feasible?
>              >      >
>              >      > Thanks,
>              >      > Max
>              >      >
>              >      > On 30.10.18 23:00, Lukasz Cwik wrote:
>              >      > > I still like #3 the most, just can't devote the time to get
>              >      > > it done.
>              >      > >
>              >      > > Instead of going with a fully implemented #3, we could
>              >      > > hardcode a subset of options and types within each SDK until
>              >      > > the job server is ready to provide this information and then
>              >      > > migrate to the "full" list. This would be an easy path for
>              >      > > SDKs to take on. They could "know" of a few well known
>              >      > > options, and if they want to support all options, they
>              >      > > implement the integration with the job server.
>              >      > >
>              >      > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
>              >     <mxm@apache.org <ma...@apache.org>
>             <mailto:mxm@apache.org <ma...@apache.org>>
>              >      > > <mailto:mxm@apache.org <ma...@apache.org>
>             <mailto:mxm@apache.org <ma...@apache.org>>>> wrote:
>              >      > >
>              >      > >      > I would prefer we don't introduce a (quirky) way of
>              >      > >     passing unknown options that forces users to type JSON
>              >      > >     into the command line (or similar acrobatics)
>              >      > >     Same here, the JSON approach seems technically nice but
>              >      > >     too bulky for users.
>              >      > >
>              >      > >      > To someone wanting to run a pipeline, all options are
>              >      > >     equally important, whether they are application specific,
>              >      > >     SDK specific or runner specific.
>              >      > >
>              >      > >     I'm also reluctant to force users to use `--runner_option=`
>              >      > >     because the division into "Runner" options and other
>              >      > >     options seems rather arbitrary to users. Most built-in
>              >      > >     options are also Runner-related.
>              >      > >
>              >      > >      > It should be possible to *optionally* qualify/scope (to
>              >      > >     cover cases where there is ambiguity), but otherwise I
>              >      > >     prefer the format we currently have.
>              >      > >
>              >      > >     Yes, namespacing is a problem. What happens if the user
>              >      > >     defines a custom PipelineOption which clashes with one of
>              >      > >     the builtin ones? If both are set, which one is actually
>              >      > >     applied?
>              >      > >
>              >      > >
>              >      > > Note that PipelineOptions so far has been treating name
>              >      > > equality to mean option equality and the Java implementation
>              >      > > has a bunch of strict checks to make sure that default values
>              >      > > aren't used for duplicate definitions, they have the same
>              >      > > type, etc...
>              >      > > With 1), you fail the job if the runner can't understand your
>              >      > > option because it's not represented the same way. User then
>              >      > > needs to fix-up their declaration of the option name.
>              >      > > With 2), there are no name conflicts, the SDK will need to
>              >      > > validate that the option isn't set in both formats and error
>              >      > > out if it is before pipeline submission time.
>              >      > > With 3), you can prefetch all the options and error out to the
>              >      > > user during argument parsing time.
>              >      > >
>              >      > >
>              >      > >
>              >      > >     Here is a summary of the possible paths going forward:
>              >      > >
>              >      > >
>              >      > >     1) Validate PipelineOptions at Runner side
>              >      > >     ==========================================
>              >      > >
>              >      > >     The main issue raised here was that we want to move away
>              >      > >     from parsing arguments which look like options without
>              >      > >     validating them. An easy fix would be to actually validate
>              >      > >     them on the Runner side. This could be done by changing the
>              >      > >     deserialization code of PipelineOptions which so far
>              >      > >     ignores unknown JSON options.
>              >      > >
>              >      > >     See: PipelineOptionsTranslation.fromProto(Struct protoOptions)
>              >      > >
>              >      > >     Actually, this wouldn't work for user-defined
>              >      > >     PipelineOptions because they might not be known to the
>              >      > >     Runner (if they are defined in Python).
>              >      > >
>              >      > >
>              >      > >     2) Introduce a Runner-Option Flag
>              >      > >     =================================
>              >      > >
>              >      > >     In this approach we would try to add as many pipeline
>              >      > >     options for a Runner to the SDK as possible, but allow
>              >      > >     additional Runner options to be passed using the
>              >      > >     `--runner-option=key=val` flag. The Runner, like in 1),
>              >      > >     would have to ensure validation. I think this has been the
>              >      > >     most favored way so far. Going forward, that means that
>              >      > >     `--parallelism=4` and `--runner-option=parallelism=4` will
>              >      > >     have the same effect for the Flink Runner.
>              >      > >
>              >      > >
>              >      > >     3) Implement Fetching of Options from JobServer
>              >      > >     ===============================================
>              >      > >
>              >      > >     The options are retrieved from the JobServer before
>              >     submitting the
>              >      > >     pipeline. I think this would be ideal but, as mentioned
>              >     before, it
>              >      > >     increases the complexity for implementing new SDKs and
>              >     might overall
>              >      > >     just not be worth the effort.
>              >      > >
>              >      > >
>              >      > >     What do you think? I'd implement 2) for the next
>             release,
>              >     unless there
>              >      > >     are advocates for a different approach.
>              >      > >
>              >      > >     Cheers,
>              >      > >     Max
>              >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Robert Burke <ro...@frantil.com>.

+1 to Option 3

I'd rather have each SDK have a single point of well defined complexity to
do something general, than have to make tiny but simple changes. Less toil
and maintenance in the long run per SDK.

Similarly I don't have time to make it happen right now.

On Tue, Nov 13, 2018, 9:22 AM Thomas Weise <th...@apache.org> wrote:

> Discovering options from the job server would be the only way to perform
> full validation (and provide upfront help to the user).
>
> The runner cannot perform full validation, since it is not aware of the
> user and SDK options (that it has to forward to the SDK worker).
>
> A special runner options flag to forward unknown options also wouldn't fully
> solve the problem (besides being subject to change in the future). Let's
> say the runner understands --fancy-int-option and the user repeats that
> option multiple times. Not knowing the type of the option, the SDK will pass
> it as a list and the runner will fail.
>
> Replicating SDK options is a workaround for known runners but it really
> goes against the idea of portability (making assumptions outside the API
> contract). We already have runners implemented outside of Beam and hope for
> the ecosystem to grow. What we do for options should also work for those
> runners.
>
> I'm with Luke here that options discovery provides the best user
> experience and can address the other issues. Even the scenario of multiple
> intermediate runners could be addressed by forwarding the unparsed options
> with the discovery call. I don't see SDK implementation complexity as a
> significant drawback so far.
>
> Thomas
>
>
> On Mon, Nov 12, 2018 at 2:30 PM Lukasz Cwik <lc...@google.com> wrote:
>
>>
>> On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <mx...@apache.org>
>> wrote:
>>
>>> Thank you Robert and Lukasz for your points.
>>>
>>> > Note that I believe that we will want to have multiple URLs to support
>>> cross language pipelines since we will want to be able to ask other SDK
>>> languages/versions for their "list" of supported PipelineOptions.
>>>
>>> Why is that? The Runner itself is the source of truth for its options.
>>>
>>
>> Because other languages (or even different versions of the same language)
>> may have their own options. For example, the Go SDK talks to a Java service
>> which applies a SQL transform and returns the expanded form (this may
>> require knowledge of some options like credentials for the filesystem, ...)
>> and then talks to a Python service that performs another transform
>> expansion. Finally the pipeline containing Go, Java and Python transforms
>> is submitted to a runner and it performs its own internal
>> replacements/expansions related to executing the pipeline.
>>
>>
>>> Everything else is SDK-related and should be validated there.
>>>
>>> I imagined the process to go like this:
>>>
>>>    a) Parse options to find JobServer URL
>>>    b) Retrieve options from JobServer
>>>    c) Parse all options
>>>    ...continue as always...
>>>
>>> An option is just represented by a name and a type. There is nothing
>>> more to it, at least as of now. So it should be possible to parse them
>>> in the SDK without much further work.
>>>
>>> Nevertheless, I agree with your sentiment, Robert. The "runner_option"
>>> flag would prevent additional complexity. I still don't prefer it
>>> because it's not nice from an end user perspective. If we were to
>>> implement it, I would definitely go for the "option promotion" which you
>>> mentioned.
>>>
>>> I hadn't thought about delegating runners, although the PortableRunner
>>> is basically a delegating Runner. If that was an important feature, I
>>> suppose the "runner_option" would be the preferred way.
>>>
>>> All in all, since there doesn't seem to be an excitement to implement
>>> JobServer option retrieval and we will need the help of all SDK
>>> developers, "runner_option" seems to be the more likely path.
>>>
>>
>> I would say it's a lack of time for people to improve this story over
>> others but it can be revisited at some point in the future and I agree that
>> using --runner_option as an interim provides value.
>>
>>
>>>
>>> -Max
>>>
>>> On 08.11.18 21:50, Lukasz Cwik wrote:
>>> > The purpose of the spec would be to provide the names, type and
>>> > descriptions of the options. We don't need anything beyond the JSON
>>> > types (string, number, bool, object, list) because the only ambiguity
>>> we
>>> > get is how do we parse command line string into the JSON type (and
>>> that
>>> > ambiguity is actually only between string and non-string since all the
>>> > other JSON types are unambiguous).
>>> >
>>> > Also, I believe the flow would be
>>> > 1) Parse options
>>> >    a) Find the URL from args specified and/or additional methods on
>>> > PipelineOptions that exposes a programmatic way to set the URL during
>>> > parsing.
>>> >    b) Query URL for option specs
>>> >    c) Parse the remainder of the options
>>> > 2) Construct pipeline
>>> > 3) Choose runner
>>> > 4) Submit job to runner
>>> >
>>> > Note that I believe that we will want to have multiple URLs to support
>>> > cross language pipelines since we will want to be able to ask other
>>> SDK
>>> > languages/versions for their "list" of supported PipelineOptions.
>>> >
>>> > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw <robertwb@google.com
>>> > <ma...@google.com>> wrote:
>>> >
>>> >     There's two questions here:
>>> >
>>> >     (A) What do we do in the short term?
>>> >
>>> >     I think adding every runner option to every SDK is not sustainable
>>> >     (n*m work, assuming every SDK knows about every runner), and
>>> having a
>>> >     patchwork of options that were added as one-offs to SDKs is not
>>> >     desirable either. Furthermore, it seems difficult to parse unknown
>>> >     options as if they were valid options, so my preference here would
>>> be
>>> >     to just use a special runner_option flag. (One could also pass a
>>> set
>>> >     of unparsed/unvalidated runner options to the runner, even if
>>> they're
>>> >     not distinguished for the user, and runners (or any intermediates)
>>> >     could run a "promote" operation that promotes any of these unknowns
>>> >     that they recognize to real options before further processing. The
>>> >     parsing would be done as repeated-string, and not be intermingled
>>> with
>>> >     the actually validated options. This is essentially a variant of
>>> >     option 1.)
>>> >
>>> >     (B) What do we do in the long term? While the JobServer approach
>>> sounds
>>> >     nice, I think it introduces a lot of complexity (we have too much
>>> of
>>> >     that already) and still doesn't completely solve the problem. In
>>> >     particular, it changes the flow from
>>> >
>>> >     1. Parse options
>>> >     2. Construct pipeline
>>> >     3. Choose runner
>>> >     4. Submit job to runner
>>> >
>>> >     to
>>> >
>>> >     1. Parse options
>>> >     2. Construct pipeline
>>> >     3. Choose runner
>>> >     4a. Query runner for option specs
>>> >     4b. Re-parse options
>>> >     4c. Submit job to runner
>>> >
>>> >     In particular, doing 4b in the SDK rather than just let the runner
>>> >     itself do the validation as part of (4) doesn't save much and
>>> forces
>>> >     us to come up with a (probably incomplete) spec as to how to define
>>> >     options, their types, and their validations. It also means that a
>>> >     delegating runner must choose and interact with its downstream
>>> >     runner(s) synchronously, else we haven't actually solved the issue.
>>> >
>>> >     For these reasons, I don't think we even want to go with the
>>> JobServer
>>> >     approach in the long term, which has bearing on (A).
>>> >
>>> >     - Robert
>>> >
>>> >
>>> >     On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <mxm@apache.org
>>> >     <ma...@apache.org>> wrote:
>>> >      >
>>> >      > +1
>>> >      >
>>> >      > If the preferred approach is to eventually have the JobServer
>>> >     serve the
>>> >      > options, then the best intermediate solution is to replicate
>>> common
>>> >      > options in the SDKs.
>>> >      >
>>> >      > If we went down the "--runner_option" path, we would end up with
>>> >      > multiple ways of specifying the same options. We would
>>> eventually
>>> >     have
>>> >      > to deprecate "runner options" once we have the JobServer
>>> >     approach. I'd
>>> >      > like to avoid that.
>>> >      >
>>> >      > For the upcoming release we can revert the changes again and
>>> add the
>>> >      > most common missing options to the SDKs. Then hopefully we
>>> should
>>> >     have
>>> >      > fetching implemented for the release after.
>>> >      >
>>> >      > Do you think that is feasible?
>>> >      >
>>> >      > Thanks,
>>> >      > Max
>>> >      >
>>> >      > On 30.10.18 23:00, Lukasz Cwik wrote:
>>> >      > > I still like #3 the most, just can't devote the time to get it
>>> >     done.
>>> >      > >
>>> >      > > Instead of going with a fully implemented #3, we could
>>> hardcode
>>> >     a
>>> >      > > subset of options and types within each SDK until the job
>>> server is
>>> >      > > ready to provide this information and then migrate to the
>>> >     "full" list.
>>> >      > > This would be an easy path for SDKs to take on. They could
>>> >     "know" of a
>>> >      > > few well known options, and if they want to support all
>>> >     options, they
>>> >      > > implement the integration with the job server.
>>> >      > >
>>> >      > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
>>> >     <mxm@apache.org <ma...@apache.org>
>>> >      > > <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
>>> >      > >
>>> >      > >      > I would prefer we don't introduce a (quirky) way of
>>> passing
>>> >      > >     unknown options that forces users to type JSON into the
>>> >     command line
>>> >      > >     (or similar acrobatics)
>>> >      > >     Same here, the JSON approach seems technically nice but
>>> too
>>> >     bulky
>>> >      > >     for users.
>>> >      > >
>>> >      > >      > To someone wanting to run a pipeline, all options are
>>> >     equally
>>> >      > >     important, whether they are application specific, SDK
>>> >     specific or
>>> >      > >     runner specific.
>>> >      > >
>>> >      > >     I'm also reluctant to force users to use
>>> `--runner_option=`
>>> >     because the
>>> >      > >     division into "Runner" options and other options seems
>>> >     rather arbitrary
>>> >      > >     to users. Most built-in options are also Runner-related.
>>> >      > >
>>> >      > >      > It should be possible to *optionally* qualify/scope (to
>>> >     cover
>>> >      > >     cases where there is ambiguity), but otherwise I prefer
>>> the
>>> >     format
>>> >      > >     we currently have.
>>> >      > >
>>> >      > >     Yes, namespacing is a problem. What happens if the user
>>> >     defines a
>>> >      > >     custom
>>> >      > >     PipelineOption which clashes with one of the builtin ones?
>>> >     If both are
>>> >      > >
>>> >      > >     set, which one is actually applied?
>>> >      > >
>>> >      > >
>>> >      > > Note that PipelineOptions so far has been treating name
>>> >     equality to mean
>>> >      > > option equality and the Java implementation has a bunch of
>>> >     strict checks
>>> >      > > to make sure that default values aren't used for duplicate
>>> >     definitions,
>>> >      > > they have the same type, etc...
>>> >      > > With 1), you fail the job if the runner can't understand your
>>> >     option
>>> >      > > because it's not represented the same way. User then needs to
>>> fix-up
>>> >      > > their declaration of the option name.
>>> >      > > With 2), there are no name conflicts, the SDK will need to
>>> >     validate that
>>> >      > > the option isn't set in both formats and error out if it is
>>> before
>>> >      > > pipeline submission time.
>>> >      > > With 3), you can prefetch all the options and error out to
>>> the user
>>> >      > > during argument parsing time.
>>> >      > >
>>> >      > >
>>> >      > >
>>> >      > >     Here is a summary of the possible paths going forward:
>>> >      > >
>>> >      > >
>>> >      > >     1) Validate PipelineOptions at Runner side
>>> >      > >     ==========================================
>>> >      > >
>>> >      > >     The main issue raised here was that we want to move away
>>> >     from parsing
>>> >      > >     arguments which look like options without validating them.
>>> >     An easy fix
>>> >      > >     would be to actually validate them on the Runner side.
>>> This
>>> >     could be
>>> >      > >     done by changing the deserialization code of
>>> >     PipelineOptions which so
>>> >      > >     far ignores unknown JSON options.
>>> >      > >
>>> >      > >     See: PipelineOptionsTranslation.fromProto(Struct
>>> protoOptions)
>>> >      > >
>>> >      > >     Actually, this wouldn't work for user-defined
>>> >     PipelineOptions because
>>> >      > >     they might not be known to the Runner (if they are defined
>>> >     in Python).
>>> >      > >
>>> >      > >
>>> >      > >     2) Introduce a Runner-Option Flag
>>> >      > >     =================================
>>> >      > >
>>> >      > >     In this approach we would try to add as many pipeline
>>> >     options for a
>>> >      > >     Runner to the SDK, but allow additional Runner options to
>>> >     be passed
>>> >      > >     using the `--runner-option=key=val` flag. The Runner, like
>>> >     in 1), would
>>> >      > >     have to ensure validation. I think this has been the most
>>> >     favored
>>> >      > >     way so
>>> >      > >     far. Going forward, that means that `--parallelism=4` and
>>> >      > >     `--runner-option=parallelism=4` will have the same effect
>>> >     for the Flink
>>> >      > >     Runner.
>>> >      > >
>>> >      > >
>>> >      > >     3) Implement Fetching of Options from JobServer
>>> >      > >     ===============================================
>>> >      > >
>>> >      > >     The options are retrieved from the JobServer before
>>> >     submitting the
>>> >      > >     pipeline. I think this would be ideal but, as mentioned
>>> >     before, it
>>> >      > >     increases the complexity for implementing new SDKs and
>>> >     might overall
>>> >      > >     just not be worth the effort.
>>> >      > >
>>> >      > >
>>> >      > >     What do you think? I'd implement 2) for the next release,
>>> >     unless there
>>> >      > >     are advocates for a different approach.
>>> >      > >
>>> >      > >     Cheers,
>>> >      > >     Max
>>> >
>>>
>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Thomas Weise <th...@apache.org>.

Discovering options from the job server would be the only way to perform
full validation (and provide upfront help to the user).

The runner cannot perform full validation, since it is not aware of the
user and SDK options (that it has to forward to the SDK worker).

A special runner options flag to forward unknown options also wouldn't fully
solve the problem (besides being subject to change in the future). Let's
say the runner understands --fancy-int-option and the user repeats that
option multiple times. Not knowing the type of the option, the SDK will pass
it as a list and the runner will fail.
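
A minimal sketch of that failure mode, assuming a hypothetical SDK-side
collector (`parse_unknown_options`) and a hypothetical runner-side check
(`runner_validate`). Neither is Beam's actual code; they only illustrate why
a repeated option of unknown type ends up as a list that the runner rejects.

```python
def parse_unknown_options(argv):
    """Hypothetical SDK side: collect undeclared options without type info."""
    collected = {}
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            continue
        name, value = arg[2:].split("=", 1)
        if name in collected:
            # The SDK cannot know whether repeats are legal, so it builds a list.
            existing = collected[name]
            if isinstance(existing, list):
                existing.append(value)
            else:
                collected[name] = [existing, value]
        else:
            collected[name] = value
    return collected


def runner_validate(options):
    """Hypothetical runner side: fancy-int-option must be a single integer."""
    raw = options.get("fancy-int-option")
    if isinstance(raw, list):
        raise TypeError("fancy-int-option expects an int, got a list: %r" % raw)
    return int(raw)
```

Passing `--fancy-int-option=1 --fancy-int-option=2` through
`parse_unknown_options` yields a list of two strings, and `runner_validate`
then raises, which is only discovered after the SDK has already accepted the
arguments.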

Replicating SDK options is a workaround for known runners but it really
goes against the idea of portability (making assumptions outside the API
contract). We already have runners implemented outside of Beam and hope for
the ecosystem to grow. What we do for options should also work for those
runners.

I'm with Luke here that options discovery provides the best user experience
and can address the other issues. Even the scenario of multiple
intermediate runners could be addressed by forwarding the unparsed options
with the discovery call. I don't see SDK implementation complexity as a
significant drawback so far.
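
A rough sketch of what discovery-based parsing could look like in an SDK,
using Python's argparse. The spec format (option name mapped to a JSON type
name), the stubbed `fetch_option_specs` call, and the use of `--job_endpoint`
to locate the job server are all assumptions for illustration; no such
discovery API is defined in this thread.

```python
import argparse

# Hypothetical specs a job server might return: option name -> JSON type name.
FAKE_JOB_SERVER_SPECS = {"parallelism": "number", "checkpoint_dir": "string"}

# How a command-line string is converted for each (assumed) JSON type.
JSON_TYPE_PARSERS = {
    "string": str,
    "number": float,
    "bool": lambda s: s.lower() == "true",
}


def fetch_option_specs(job_endpoint):
    """Stand-in for the discovery call; a real SDK would query job_endpoint."""
    return FAKE_JOB_SERVER_SPECS


def parse_with_discovery(argv):
    # a) Parse only what is needed to locate the job server.
    bootstrap = argparse.ArgumentParser(add_help=False)
    bootstrap.add_argument("--job_endpoint", required=True)
    known, remaining = bootstrap.parse_known_args(argv)

    # b) Retrieve the option specs (stubbed here).
    specs = fetch_option_specs(known.job_endpoint)

    # c) Parse the remaining arguments against the discovered specs.
    full = argparse.ArgumentParser()
    for name, json_type in specs.items():
        full.add_argument("--" + name, type=JSON_TYPE_PARSERS[json_type])
    options = vars(full.parse_args(remaining))
    options["job_endpoint"] = known.job_endpoint
    return options
```

With this shape, an option the job server does not advertise is rejected at
argument parsing time rather than after submission, which is the upfront
validation and help that discovery is meant to provide.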

Thomas


On Mon, Nov 12, 2018 at 2:30 PM Lukasz Cwik <lc...@google.com> wrote:

>
> On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <mx...@apache.org> wrote:
>
>> Thank you Robert and Lukasz for your points.
>>
>> > Note that I believe that we will want to have multiple URLs to support
>> cross language pipelines since we will want to be able to ask other SDK
>> languages/versions for their "list" of supported PipelineOptions.
>>
>> Why is that? The Runner itself is the source of truth for its options.
>>
>
> Because other languages (or even different versions of the same language)
> may have their own options. For example, the Go SDK talks to a Java service
> which applies a SQL transform and returns the expanded form (this may
> require knowledge of some options like credentials for the filesystem, ...)
> and then talks to a Python service that performs another transform
> expansion. Finally the pipeline containing Go, Java and Python transforms
> is submitted to a runner and it performs its own internal
> replacements/expansions related to executing the pipeline.
>
>
>> Everything else is SDK-related and should be validated there.
>>
>> I imagined the process to go like this:
>>
>>    a) Parse options to find JobServer URL
>>    b) Retrieve options from JobServer
>>    c) Parse all options
>>    ...continue as always...
>>
>> An option is just represented by a name and a type. There is nothing
>> more to it, at least as of now. So it should be possible to parse them
>> in the SDK without much further work.
>>
>> Nevertheless, I agree with your sentiment, Robert. The "runner_option"
>> flag would prevent additional complexity. I still don't prefer it
>> because it's not nice from an end user perspective. If we were to
>> implement it, I would definitely go for the "option promotion" which you
>> mentioned.
>>
>> I hadn't thought about delegating runners, although the PortableRunner
>> is basically a delegating Runner. If that was an important feature, I
>> suppose the "runner_option" would be the preferred way.
>>
>> All in all, since there doesn't seem to be an excitement to implement
>> JobServer option retrieval and we will need the help of all SDK
>> developers, "runner_option" seems to be the more likely path.
>>
>
> I would say it's a lack of time for people to improve this story over
> others but it can be revisited at some point in the future and I agree that
> using --runner_option as an interim provides value.
>
>
>>
>> -Max
>>
>> On 08.11.18 21:50, Lukasz Cwik wrote:
>> > The purpose of the spec would be to provide the names, type and
>> > descriptions of the options. We don't need anything beyond the JSON
>> > types (string, number, bool, object, list) because the only ambiguity
>> we
>> > get is how do we parse command line string into the JSON type (and that
>> > ambiguity is actually only between string and non-string since all the
>> > other JSON types are unambiguous).
>> >
>> > Also, I believe the flow would be
>> > 1) Parse options
>> >    a) Find the URL from args specified and/or additional methods on
>> > PipelineOptions that exposes a programmatic way to set the URL during
>> > parsing.
>> >    b) Query URL for option specs
>> >    c) Parse the remainder of the options
>> > 2) Construct pipeline
>> > 3) Choose runner
>> > 4) Submit job to runner
>> >
>> > Note that I believe that we will want to have multiple URLs to support
>> > cross language pipelines since we will want to be able to ask other SDK
>> > languages/versions for their "list" of supported PipelineOptions.
>> >
>> > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw <robertwb@google.com
>> > <ma...@google.com>> wrote:
>> >
>> >     There's two questions here:
>> >
>> >     (A) What do we do in the short term?
>> >
>> >     I think adding every runner option to every SDK is not sustainable
>> >     (n*m work, assuming every SDK knows about every runner), and having
>> a
>> >     patchwork of options that were added as one-offs to SDKs is not
>> >     desirable either. Furthermore, it seems difficult to parse unknown
>> >     options as if they were valid options, so my preference here would
>> be
>> >     to just use a special runner_option flag. (One could also pass a set
>> >     of unparsed/unvalidated runner options to the runner, even if
>> they're
>> >     not distinguished for the user, and runners (or any intermediates)
>> >     could run a "promote" operation that promotes any of these unknowns
>> >     that they recognize to real options before further processing. The
>> >     parsing would be done as repeated-string, and not be intermingled
>> with
>> >     the actually validated options. This is essentially a variant of
>> >     option 1.)
>> >
>> >     (B) What do we do in the long term? While the JobServer approach sounds
>> >     nice, I think it introduces a lot of complexity (we have too much of
>> >     that already) and still doesn't completely solve the problem. In
>> >     particular, it changes the flow from
>> >
>> >     1. Parse options
>> >     2. Construct pipeline
>> >     3. Choose runner
>> >     4. Submit job to runner
>> >
>> >     to
>> >
>> >     1. Parse options
>> >     2. Construct pipeline
>> >     3. Choose runner
>> >     4a. Query runner for option specs
>> >     4b. Re-parse options
>> >     4c. Submit job to runner
>> >
>> >     In particular, doing 4b in the SDK rather than just let the runner
>> >     itself do the validation as part of (4) doesn't save much and forces
>> >     us to come up with a (probably incomplete) spec as to how to define
>> >     options, their types, and their validations. It also means that a
>> >     delegating runner must choose and interact with its downstream
>> >     runner(s) synchronously, else we haven't actually solved the issue.
>> >
>> >     For these reasons, I don't think we even want to go with the
>> JobServer
>> >     approach in the long term, which has bearing on (A).
>> >
>> >     - Robert
>> >
>> >
>> >     On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <mxm@apache.org
>> >     <ma...@apache.org>> wrote:
>> >      >
>> >      > +1
>> >      >
>> >      > If the preferred approach is to eventually have the JobServer
>> >     serve the
>> >      > options, then the best intermediate solution is to replicate
>> common
>> >      > options in the SDKs.
>> >      >
>> >      > If we went down the "--runner_option" path, we would end up with
>> >      > multiple ways of specifying the same options. We would eventually
>> >     have
>> >      > to deprecate "runner options" once we have the JobServer
>> >     approach. I'd
>> >      > like to avoid that.
>> >      >
>> >      > For the upcoming release we can revert the changes again and add
>> the
>> >      > most common missing options to the SDKs. Then hopefully we should
>> >     have
>> >      > fetching implemented for the release after.
>> >      >
>> >      > Do you think that is feasible?
>> >      >
>> >      > Thanks,
>> >      > Max
>> >      >
>> >      > On 30.10.18 23:00, Lukasz Cwik wrote:
>> >      > > I still like #3 the most, just can't devote the time to get it
>> >     done.
>> >      > >
>> >      > > Instead of going with a fully implemented #3, we could hardcode
>> >     a
>> >      > > subset of options and types within each SDK until the job
>> server is
>> >      > > ready to provide this information and then migrate to the
>> >     "full" list.
>> >      > > This would be an easy path for SDKs to take on. They could
>> >     "know" of a
>> >      > > few well known options, and if they want to support all
>> >     options, they
>> >      > > implement the integration with the job server.
>> >      > >
>> >      > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
>> >     <mxm@apache.org <ma...@apache.org>
>> >      > > <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
>> >      > >
>> >      > >      > I would prefer we don't introduce a (quirky) way of
>> passing
>> >      > >     unknown options that forces users to type JSON into the
>> >     command line
>> >      > >     (or similar acrobatics)
>> >      > >     Same here, the JSON approach seems technically nice but too
>> >     bulky
>> >      > >     for users.
>> >      > >
>> >      > >      > To someone wanting to run a pipeline, all options are
>> >     equally
>> >      > >     important, whether they are application specific, SDK
>> >     specific or
>> >      > >     runner specific.
>> >      > >
>> >      > >     I'm also reluctant to force users to use `--runner_option=`
>> >     because the
>> >      > >     division into "Runner" options and other options seems
>> >     rather arbitrary
>> >      > >     to users. Most built-in options are also Runner-related.
>> >      > >
>> >      > >      > It should be possible to *optionally* qualify/scope (to
>> >     cover
>> >      > >     cases where there is ambiguity), but otherwise I prefer the
>> >     format
>> >      > >     we currently have.
>> >      > >
>> >      > >     Yes, namespacing is a problem. What happens if the user
>> >     defines a
>> >      > >     custom
>> >      > >     PipelineOption which clashes with one of the builtin ones?
>> >     If both are
>> >      > >
>> >      > >     set, which one is actually applied?
>> >      > >
>> >      > >
>> >      > > Note that PipelineOptions so far has been treating name
>> >     equality to mean
>> >      > > option equality and the Java implementation has a bunch of
>> >     strict checks
>> >      > > to make sure that default values aren't used for duplicate
>> >     definitions,
>> >      > > they have the same type, etc...
>> >      > > With 1), you fail the job if the runner can't understand your
>> >     option
>> >      > > because it's not represented the same way. User then needs to
>> fix-up
>> >      > > their declaration of the option name.
>> >      > > With 2), there are no name conflicts, the SDK will need to
>> >     validate that
>> >      > > the option isn't set in both formats and error out if it is
>> before
>> >      > > pipeline submission time.
>> >      > > With 3), you can prefetch all the options and error out to the
>> user
>> >      > > during argument parsing time.
>> >      > >
>> >      > >
>> >      > >
>> >      > >     Here is a summary of the possible paths going forward:
>> >      > >
>> >      > >
>> >      > >     1) Validate PipelineOptions at Runner side
>> >      > >     ==========================================
>> >      > >
>> >      > >     The main issue raised here was that we want to move away
>> >     from parsing
>> >      > >     arguments which look like options without validating them.
>> >     An easy fix
>> >      > >     would be to actually validate them on the Runner side. This
>> >     could be
>> >      > >     done by changing the deserialization code of
>> >     PipelineOptions which so
>> >      > >     far ignores unknown JSON options.
>> >      > >
>> >      > >     See: PipelineOptionsTranslation.fromProto(Struct
>> protoOptions)
>> >      > >
>> >      > >     Actually, this wouldn't work for user-defined
>> >     PipelineOptions because
>> >      > >     they might not be known to the Runner (if they are defined
>> >     in Python).
>> >      > >
>> >      > >
>> >      > >     2) Introduce a Runner-Option Flag
>> >      > >     =================================
>> >      > >
>> >      > >     In this approach we would try to add as many pipeline
>> >     options for a
>> >      > >     Runner to the SDK, but allow additional Runner options to
>> >     be passed
>> >      > >     using the `--runner-option=key=val` flag. The Runner, like
>> >     in 1), would
>> >      > >     have to ensure validation. I think this has been the most
>> >     favored
>> >      > >     way so
>> >      > >     far. Going forward, that means that `--parallelism=4` and
>> >      > >     `--runner-option=parallelism=4` will have the same effect
>> >     for the Flink
>> >      > >     Runner.
>> >      > >
>> >      > >
>> >      > >     3) Implement Fetching of Options from JobServer
>> >      > >     ===============================================
>> >      > >
>> >      > >     The options are retrieved from the JobServer before
>> >     submitting the
>> >      > >     pipeline. I think this would be ideal but, as mentioned
>> >     before, it
>> >      > >     increases the complexity for implementing new SDKs and
>> >     might overall
>> >      > >     just not be worth the effort.
>> >      > >
>> >      > >
>> >      > >     What do you think? I'd implement 2) for the next release,
>> >     unless there
>> >      > >     are advocates for a different approach.
>> >      > >
>> >      > >     Cheers,
>> >      > >     Max
>> >
>>
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Lukasz Cwik <lc...@google.com>.

On Mon, Nov 12, 2018 at 9:38 AM Maximilian Michels <mx...@apache.org> wrote:

> Thank you Robert and Lukasz for your points.
>
> > Note that I believe that we will want to have multiple URLs to support
> cross language pipelines since we will want to be able to ask other SDK
> languages/versions for their "list" of supported PipelineOptions.
>
> Why is that? The Runner itself is the source of truth for its options.
>

Because other languages (or even different versions of the same language)
may have their own options. For example, the Go SDK talks to a Java service
which applies a SQL transform and returns the expanded form (this may
require knowledge of some options like credentials for the filesystem, ...)
and then talks to a Python service that performs another transform
expansion. Finally the pipeline containing Go, Java and Python transforms
is submitted to a runner and it performs its own internal
replacements/expansions related to executing the pipeline.


> Everything else is SDK-related and should be validated there.
>
> I imagined the process to go like this:
>
>    a) Parse options to find JobServer URL
>    b) Retrieve options from JobServer
>    c) Parse all options
>    ...continue as always...
>
> An option is just represented by a name and a type. There is nothing
> more to it, at least as of now. So it should be possible to parse them
> in the SDK without much further work.
>
> Nevertheless, I agree with your sentiment, Robert. The "runner_option"
> flag would prevent additional complexity. I still don't prefer it
> because it's not nice from an end user perspective. If we were to
> implement it, I would definitely go for the "option promotion" which you
> mentioned.
>
> I hadn't thought about delegating runners, although the PortableRunner
> is basically a delegating Runner. If that was an important feature, I
> suppose the "runner_option" would be the preferred way.
>
> All in all, since there doesn't seem to be much excitement to implement
> JobServer option retrieval and we will need the help of all SDK
> developers, "runner_option" seems to be the more likely path.
>

I would say it's a lack of time for people to improve this story over others,
but it can be revisited at some point in the future, and I agree that using
--runner_option as an interim solution provides value.
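
As a rough illustration of the interim approach, an SDK could collect every
--runner_option occurrence as an unparsed string and leave typing to the
runner. This is only a sketch under assumptions: the function name
parse_with_runner_options and the key=val splitting are hypothetical and not
taken from any Beam SDK.

```python
import argparse


def parse_with_runner_options(argv):
    """Split argv into SDK-known options, raw runner options, and leftovers."""
    parser = argparse.ArgumentParser()
    # An option the SDK itself knows about and can validate.
    parser.add_argument("--runner", default="DirectRunner")
    # Repeated flag: every occurrence is kept as an unparsed "key=val" string.
    parser.add_argument("--runner_option", action="append", default=[])
    known, unknown = parser.parse_known_args(argv)

    runner_options = {}
    for item in known.runner_option:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError("expected key=val, got %r" % item)
        # Values stay strings; repeated keys accumulate in order.
        runner_options.setdefault(key, []).append(value)
    return known, runner_options, unknown
```

Keeping the values as repeated strings means the SDK never guesses a type for
an option it does not know.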


>
> -Max
>
> On 08.11.18 21:50, Lukasz Cwik wrote:
> > The purpose of the spec would be to provide the names, type and
> > descriptions of the options. We don't need anything beyond the JSON
> > types (string, number, bool, object, list) because the only ambiguity we
> > get is how do we parse command line string into the JSON type (and that
> > ambiguity is actually only between string and non-string since all the
> > other JSON types are unambiguous).
> >
> > Also, I believe the flow would be
> > 1) Parse options
> >    a) Find the URL from args specified and/or additional methods on
> > PipelineOptions that exposes a programmatic way to set the URL during
> > parsing.
> >    b) Query URL for option specs
> >    c) Parse the remainder of the options
> > 2) Construct pipeline
> > 3) Choose runner
> > 4) Submit job to runner
> >
> > Note that I believe that we will want to have multiple URLs to support
> > cross language pipelines since we will want to be able to ask other SDK
> > languages/versions for their "list" of supported PipelineOptions.
> >
> > On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw <robertwb@google.com
> > <ma...@google.com>> wrote:
> >
> >     There's two questions here:
> >
> >     (A) What do we do in the short term?
> >
> >     I think adding every runner option to every SDK is not sustainable
> >     (n*m work, assuming every SDK knows about every runner), and having a
> >     patchwork of options that were added as one-offs to SDKs is not
> >     desirable either. Furthermore, it seems difficult to parse unknown
> >     options as if they were valid options, so my preference here would be
> >     to just use a special runner_option flag. (One could also pass a set
> >     of unparsed/unvalidated runner options to the runner, even if they're
> >     not distinguished for the user, and runners (or any intermediates)
> >     could run a "promote" operation that promotes any of these unknowns
> >     that they recognize to real options before further processing. The
> >     parsing would be done as repeated-string, and not be intermingled
> with
> >     the actually validated options. This is essentially a variant of
> >     option 1.)
> >
> >     (B) What do we do in the long term? While the JobServer approach sounds
> >     nice, I think it introduces a lot of complexity (we have too much of
> >     that already) and still doesn't completely solve the problem. In
> >     particular, it changes the flow from
> >
> >     1. Parse options
> >     2. Construct pipeline
> >     3. Choose runner
> >     4. Submit job to runner
> >
> >     to
> >
> >     1. Parse options
> >     2. Construct pipeline
> >     3. Choose runner
> >     4a. Query runner for option specs
> >     4b. Re-parse options
> >     4c. Submit job to runner
> >
> >     In particular, doing 4b in the SDK rather than just let the runner
> >     itself do the validation as part of (4) doesn't save much and forces
> >     us to come up with a (probably incomplete) spec as to how to define
> >     options, their types, and their validations. It also means that a
> >     delegating runner must choose and interact with its downstream
> >     runner(s) synchronously, else we haven't actually solved the issue.
> >
> >     For these reasons, I don't think we even want to go with the
> JobServer
> >     approach in the long term, which has bearing on (A).
> >
> >     - Robert
> >
> >
> >     On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <mxm@apache.org
> >     <ma...@apache.org>> wrote:
> >      >
> >      > +1
> >      >
> >      > If the preferred approach is to eventually have the JobServer
> >     serve the
> >      > options, then the best intermediate solution is to replicate
> common
> >      > options in the SDKs.
> >      >
> >      > If we went down the "--runner_option" path, we would end up with
> >      > multiple ways of specifying the same options. We would eventually
> >     have
> >      > to deprecate "runner options" once we have the JobServer
> >     approach. I'd
> >      > like to avoid that.
> >      >
> >      > For the upcoming release we can revert the changes again and add
> the
> >      > most common missing options to the SDKs. Then hopefully we should
> >     have
> >      > fetching implemented for the release after.
> >      >
> >      > Do you think that is feasible?
> >      >
> >      > Thanks,
> >      > Max
> >      >
> >      > On 30.10.18 23:00, Lukasz Cwik wrote:
> >      > > I still like #3 the most, just can't devote the time to get it
> >     done.
> >      > >
> >      > > Instead of going with a fully implemented #3, we could hardcode
> >     the a
> >      > > subset of options and types within each SDK until the job
> server is
> >      > > ready to provide this information and then migrate to the
> >     "full" list.
> >      > > This would be an easy path for SDKs to take on. They could
> >     "know" of a
> >      > > few well known options, and if they want to support all
> >     options, they
> >      > > implement the integration with the job server.
> >      > >
> >      > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
> >     <mxm@apache.org <ma...@apache.org>
> >      > > <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
> >      > >
> >      > >      > I would prefer we don't introduce a (quirky) way of
> passing
> >      > >     unknown options that forces users to type JSON into the
> >     command line
> >      > >     (or similar acrobatics)
> >      > >     Same here, the JSON approach seems technically nice but too
> >     bulky
> >      > >     for users.
> >      > >
> >      > >      > To someone wanting to run a pipeline, all options are
> >     equally
> >      > >     important, whether they are application specific, SDK
> >     specific or
> >      > >     runner specific.
> >      > >
> >      > >     I'm also reluctant to force users to use `--runner_option=`
> >     because the
> >      > >     division into "Runner" options and other options seems
> >     rather arbitrary
> >      > >     to users. Most built-in options are also Runner-related.
> >      > >
> >      > >      > It should be possible to *optionally* qualify/scope (to
> >     cover
> >      > >     cases where there is ambiguity), but otherwise I prefer the
> >     format
> >      > >     we currently have.
> >      > >
> >      > >     Yes, namespacing is a problem. What happens if the user
> >     defines a
> >      > >     custom
> >      > >     PipelineOption which clashes with one of the builtin ones?
> >     If both are
> >      > >
> >      > >     set, which one is actually applied?
> >      > >
> >      > >
> >      > > Note that PipelineOptions so far has been treating name
> >     equality to mean
> >      > > option equality and the Java implementation has a bunch of
> >     strict checks
> >      > > to make sure that default values aren't used for duplicate
> >     definitions,
> >      > > they have the same type, etc...
> >      > > With 1), you fail the job if the runner can't understand your
> >     option
> >      > > because its not represented the same way. User then needs to
> fix-up
> >      > > their declaration of the option name.
> >      > > With 2), there are no name conflicts, the SDK will need to
> >     validate that
> >      > > the option isn't set in both formats and error out if it is
> before
> >      > > pipeline submission time.
> >      > > With 3), you can prefetch all the options and error out to the
> user
> >      > > during argument parsing time.
> >      > >
> >      > >
> >      > >
> >      > >     Here is a summary of the possible paths going forward:
> >      > >
> >      > >
> >      > >     1) Validate PipelineOptions at Runner side
> >      > >     ==========================================
> >      > >
> >      > >     The main issue raised here was that we want to move away
> >     from parsing
> >      > >     arguments which look like options without validating them.
> >     An easy fix
> >      > >     would be to actually validate them on the Runner side. This
> >     could be
> >      > >     done by changing the deserialization code of
> >     PipelineOptions which so
> >      > >     far ignores unknown JSON options.
> >      > >
> >      > >     See: PipelineOptionsTranslation.fromProto(Struct
> protoOptions)
> >      > >
> >      > >     Actually, this wouldn't work for user-defined
> >     PipelineOptions because
> >      > >     they might not be known to the Runner (if they are defined
> >     in Python).
> >      > >
> >      > >
> >      > >     2) Introduce a Runner-Option Flag
> >      > >     =================================
> >      > >
> >      > >     In this approach we would try to add as many pipeline
> >     options for a
> >      > >     Runner to the SDK, but allow additional Runner options to
> >     be passed
> >      > >     using the `--runner-option=key=val` flag. The Runner, like
> >     in 1), would
> >      > >     have to ensure validation. I think this has been the most
> >     favored
> >      > >     way so
> >      > >     far. Going forward, that means that `--parallelism=4` and
> >      > >     `--runner-option=parallelism=4` will have the same effect
> >     for the Flink
> >      > >     Runner.
> >      > >
> >      > >
> >      > >     3) Implement Fetching of Options from JobServer
> >      > >     ===============================================
> >      > >
> >      > >     The options are retrieved from the JobServer before
> >     submitting the
> >      > >     pipeline. I think this would be ideal but, as mentioned
> >     before, it
> >      > >     increases the complexity for implementing new SDKs and
> >     might overall
> >      > >     just not be worth the effort.
> >      > >
> >      > >
> >      > >     What do you think? I'd implement 2) for the next release,
> >     unless there
> >      > >     are advocates for a different approach.
> >      > >
> >      > >     Cheers,
> >      > >     Max
> >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Maximilian Michels <mx...@apache.org>.

Thank you Robert and Lukasz for your points.

> Note that I believe that we will want to have multiple URLs to support cross language pipelines since we will want to be able to ask other SDK languages/versions for their "list" of supported PipelineOptions.

Why is that? The Runner itself is the source of truth for its options. 
Everything else is SDK-related and should be validated there.

I imagined the process to go like this:

   a) Parse options to find JobServer URL
   b) Retrieve options from JobServer
   c) Parse all options
   ...continue as always...

An option is just represented by a name and a type. There is nothing 
more to it, at least as of now. So it should be possible to parse them 
in the SDK without much further work.
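
The three steps above could look roughly like the following sketch. It assumes
a spec is a dict with "name" and "type" keys using the JSON type names from
this thread; the --job_server_url flag, fetch_option_specs, and the example_*
option names are hypothetical, and the fetch is stubbed rather than a real
JobServer call.

```python
import argparse
import json

# JSON type names from the thread mapped to the Python types json.loads returns.
_JSON_TYPES = {"number": (int, float), "bool": (bool,),
               "object": (dict,), "list": (list,)}


def _converter(type_name):
    """Build an argparse type function for one declared option type."""
    def convert(text):
        if type_name == "string":
            return text  # strings are taken verbatim, no JSON quoting needed
        value = json.loads(text)
        ok = isinstance(value, _JSON_TYPES[type_name])
        if type_name == "number" and isinstance(value, bool):
            ok = False  # bool is a subclass of int in Python
        if not ok:
            raise argparse.ArgumentTypeError(
                "expected %s, got %r" % (type_name, text))
        return value
    return convert


def fetch_option_specs(job_server_url):
    """Stub for step b); a real version would query the JobServer."""
    return [{"name": "parallelism", "type": "number"},
            {"name": "example_flag", "type": "bool"},
            {"name": "example_label", "type": "string"}]


def parse_options(argv):
    # a) Parse only enough to find the JobServer URL.
    bootstrap = argparse.ArgumentParser(add_help=False)
    bootstrap.add_argument("--job_server_url")
    early, _ = bootstrap.parse_known_args(argv)
    # b) Retrieve the option specs.
    specs = fetch_option_specs(early.job_server_url)
    # c) Parse all options with full type information.
    parser = argparse.ArgumentParser(parents=[bootstrap])
    for spec in specs:
        parser.add_argument("--" + spec["name"],
                            type=_converter(spec["type"]))
    return parser.parse_args(argv)
```

With the specs in hand, an unknown flag becomes a normal argparse error at
parse time instead of being passed through unvalidated.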

Nevertheless, I agree with your sentiment, Robert. The "runner_option" 
flag would prevent additional complexity. I still don't prefer it 
because it's not nice from an end user perspective. If we were to 
implement it, I would definitely go for the "option promotion" which you 
mentioned.
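
A minimal sketch of what such "option promotion" might look like on the
runner side, assuming the raw runner options arrive as a dict of repeated
strings. The promote function and the converter table are hypothetical, and
treating every recognized option as single-valued is a simplification.

```python
def promote(runner_options, known_converters):
    """Move recognized raw options into typed options.

    runner_options: dict mapping option name to a list of raw strings.
    known_converters: dict mapping option name to a str -> value callable.
    Returns (promoted, remaining); remaining keeps the unrecognized options
    untouched so a downstream runner can try to promote them itself.
    """
    promoted, remaining = {}, {}
    for name, raw_values in runner_options.items():
        convert = known_converters.get(name)
        if convert is None:
            remaining[name] = list(raw_values)
            continue
        if len(raw_values) != 1:
            # Assumption: recognized options here take exactly one value.
            raise ValueError("option %r given %d times"
                             % (name, len(raw_values)))
        promoted[name] = convert(raw_values[0])
    return promoted, remaining
```

Because the unrecognized options are returned unchanged, a delegating runner
could pass them on without having to understand them.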

I hadn't thought about delegating runners, although the PortableRunner 
is basically a delegating Runner. If that was an important feature, I 
suppose the "runner_option" would be the preferred way.

All in all, since there doesn't seem to be much excitement to implement
JobServer option retrieval and we will need the help of all SDK 
developers, "runner_option" seems to be the more likely path.

-Max

On 08.11.18 21:50, Lukasz Cwik wrote:
> The purpose of the spec would be to provide the names, type and 
> descriptions of the options. We don't need anything beyond the JSON 
> types (string, number, bool, object, list) because the only ambiguity we 
> get is how do we parse command line string into the JSON type (and that 
> ambiguity is actually only between string and non-string since all the 
> other JSON types are unambiguous).
> 
> Also, I believe the flow would be
> 1) Parse options
>    a) Find the URL from args specified and/or additional methods on 
> PipelineOptions that exposes a programmatic way to set the URL during 
> parsing.
>    b) Query URL for option specs
>    c) Parse the remainder of the options
> 2) Construct pipeline
> 3) Choose runner
> 4) Submit job to runner
> 
> Note that I believe that we will want to have multiple URLs to support 
> cross language pipelines since we will want to be able to ask other SDK 
> languages/versions for their "list" of supported PipelineOptions.
> 
> On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw <robertwb@google.com 
> <ma...@google.com>> wrote:
> 
>     There's two questions here:
> 
>     (A) What do we do in the short term?
> 
>     I think adding every runner option to every SDK is not sustainable
>     (n*m work, assuming every SDK knows about every runner), and having a
>     patchwork of options that were added as one-offs to SDKs is not
>     desirable either. Furthermore, it seems difficult to parse unknown
>     options as if they were valid options, so my preference here would be
>     to just use a special runner_option flag. (One could also pass a set
>     of unparsed/unvalidated runner options to the runner, even if they're
>     not distinguished for the user, and runners (or any intermediates)
>     could run a "promote" operation that promotes any of these unknowns
>     that they recognize to real options before further processing. The
>     parsing would be done as repeated-string, and not be intermingled with
>     the actually validated options. This is essentially a variant of
>     option 1.)
> 
>     (B) What do we do in the long term? While the JobServer approach sounds
>     nice, I think it introduces a lot of complexity (we have too much of
>     that already) and still doesn't completely solve the problem. In
>     particular, it changes the flow from
> 
>     1. Parse options
>     2. Construct pipeline
>     3. Choose runner
>     4. Submit job to runner
> 
>     to
> 
>     1. Parse options
>     2. Construct pipeline
>     3. Choose runner
>     4a. Query runner for option specs
>     4b. Re-parse options
>     4c. Submit job to runner
> 
>     In particular, doing 4b in the SDK rather than just let the runner
>     itself do the validation as part of (4) doesn't save much and forces
>     us to come up with a (probably incomplete) spec as to how to define
>     options, their types, and their validations. It also means that a
>     delegating runner must choose and interact with its downstream
>     runner(s) synchronously, else we haven't actually solved the issue.
> 
>     For these reasons, I don't think we even want to go with the JobServer
>     approach in the long term, which has bearing on (A).
> 
>     - Robert
> 
> 
>     On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <mxm@apache.org
>     <ma...@apache.org>> wrote:
>      >
>      > +1
>      >
>      > If the preferred approach is to eventually have the JobServer
>     serve the
>      > options, then the best intermediate solution is to replicate common
>      > options in the SDKs.
>      >
>      > If we went down the "--runner_option" path, we would end up with
>      > multiple ways of specifying the same options. We would eventually
>     have
>      > to deprecate "runner options" once we have the JobServer
>     approach. I'd
>      > like to avoid that.
>      >
>      > For the upcoming release we can revert the changes again and add the
>      > most common missing options to the SDKs. Then hopefully we should
>     have
>      > fetching implemented for the release after.
>      >
>      > Do you think that is feasible?
>      >
>      > Thanks,
>      > Max
>      >
>      > On 30.10.18 23:00, Lukasz Cwik wrote:
>      > > I still like #3 the most, just can't devote the time to get it
>     done.
>      > >
>      > > Instead of going with a fully implemented #3, we could hardcode a
>      > > subset of options and types within each SDK until the job server is
>      > > ready to provide this information and then migrate to the
>     "full" list.
>      > > This would be an easy path for SDKs to take on. They could
>     "know" of a
>      > > few well known options, and if they want to support all
>     options, they
>      > > implement the integration with the job server.
>      > >
>      > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
>     <mxm@apache.org <ma...@apache.org>
>      > > <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
>      > >
>      > >      > I would prefer we don't introduce a (quirky) way of passing
>      > >     unknown options that forces users to type JSON into the
>     command line
>      > >     (or similar acrobatics)
>      > >     Same here, the JSON approach seems technically nice but too
>     bulky
>      > >     for users.
>      > >
>      > >      > To someone wanting to run a pipeline, all options are
>     equally
>      > >     important, whether they are application specific, SDK
>     specific or
>      > >     runner specific.
>      > >
>      > >     I'm also reluctant to force users to use `--runner_option=`
>     because the
>      > >     division into "Runner" options and other options seems
>     rather arbitrary
>      > >     to users. Most built-in options are also Runner-related.
>      > >
>      > >      > It should be possible to *optionally* qualify/scope (to
>     cover
>      > >     cases where there is ambiguity), but otherwise I prefer the
>     format
>      > >     we currently have.
>      > >
>      > >     Yes, namespacing is a problem. What happens if the user
>     defines a
>      > >     custom
>      > >     PipelineOption which clashes with one of the builtin ones?
>     If both are
>      > >
>      > >     set, which one is actually applied?
>      > >
>      > >
>      > > Note that PipelineOptions so far has been treating name
>     equality to mean
>      > > option equality and the Java implementation has a bunch of
>     strict checks
>      > > to make sure that default values aren't used for duplicate
>     definitions,
>      > > they have the same type, etc...
>      > > With 1), you fail the job if the runner can't understand your
>     option
>      > > because it's not represented the same way. User then needs to fix up
>      > > their declaration of the option name.
>      > > With 2), there are no name conflicts, the SDK will need to
>     validate that
>      > > the option isn't set in both formats and error out if it is before
>      > > pipeline submission time.
>      > > With 3), you can prefetch all the options and error out to the user
>      > > during argument parsing time.
>      > >
>      > >
>      > >
>      > >     Here is a summary of the possible paths going forward:
>      > >
>      > >
>      > >     1) Validate PipelineOptions at Runner side
>      > >     ==========================================
>      > >
>      > >     The main issue raised here was that we want to move away
>     from parsing
>      > >     arguments which look like options without validating them.
>     An easy fix
>      > >     would be to actually validate them on the Runner side. This
>     could be
>      > >     done by changing the deserialization code of
>     PipelineOptions which so
>      > >     far ignores unknown JSON options.
>      > >
>      > >     See: PipelineOptionsTranslation.fromProto(Struct protoOptions)
>      > >
>      > >     Actually, this wouldn't work for user-defined
>     PipelineOptions because
>      > >     they might not be known to the Runner (if they are defined
>     in Python).
>      > >
>      > >
>      > >     2) Introduce a Runner-Option Flag
>      > >     =================================
>      > >
>      > >     In this approach we would try to add as many pipeline
>     options for a
>      > >     Runner to the SDK, but allow additional Runner options to
>     be passed
>      > >     using the `--runner-option=key=val` flag. The Runner, like
>     in 1), would
>      > >     have to ensure validation. I think this has been the most
>     favored
>      > >     way so
>      > >     far. Going forward, that means that `--parallelism=4` and
>      > >     `--runner-option=parallelism=4` will have the same effect
>     for the Flink
>      > >     Runner.
>      > >
>      > >
>      > >     3) Implement Fetching of Options from JobServer
>      > >     ===============================================
>      > >
>      > >     The options are retrieved from the JobServer before
>     submitting the
>      > >     pipeline. I think this would be ideal but, as mentioned
>     before, it
>      > >     increases the complexity for implementing new SDKs and
>     might overall
>      > >     just not be worth the effort.
>      > >
>      > >
>      > >     What do you think? I'd implement 2) for the next release,
>     unless there
>      > >     are advocates for a different approach.
>      > >
>      > >     Cheers,
>      > >     Max
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Lukasz Cwik <lc...@google.com>.

I'm not sure how feasible fetching is because no one has said they will do
the work.
I cannot pick it up as I am heads down on splittable DoFn right now and
integrating that into portability.

On Wed, Nov 7, 2018 at 11:50 AM Maximilian Michels <mx...@apache.org> wrote:

> +1
>
> If the preferred approach is to eventually have the JobServer serve the
> options, then the best intermediate solution is to replicate common
> options in the SDKs.
>
> If we went down the "--runner_option" path, we would end up with
> multiple ways of specifying the same options. We would eventually have
> to deprecate "runner options" once we have the JobServer approach. I'd
> like to avoid that.
>
> For the upcoming release we can revert the changes again and add the
> most common missing options to the SDKs. Then hopefully we should have
> fetching implemented for the release after.
>
> Do you think that is feasible?
>
> Thanks,
> Max
>
> On 30.10.18 23:00, Lukasz Cwik wrote:
> > I still like #3 the most, just can't devote the time to get it done.
> >
> > Instead of going with a fully implemented #3, we could hardcode a
> > subset of options and types within each SDK until the job server is
> > ready to provide this information and then migrate to the "full" list.
> > This would be an easy path for SDKs to take on. They could "know" of a
> > few well known options, and if they want to support all options, they
> > implement the integration with the job server.
> >
> > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels <mxm@apache.org
> > <ma...@apache.org>> wrote:
> >
> >      > I would prefer we don't introduce a (quirky) way of passing
> >     unknown options that forces users to type JSON into the command line
> >     (or similar acrobatics)
> >     Same here, the JSON approach seems technically nice but too bulky
> >     for users.
> >
> >      > To someone wanting to run a pipeline, all options are equally
> >     important, whether they are application specific, SDK specific or
> >     runner specific.
> >
> >     I'm also reluctant to force users to use `--runner_option=` because
> the
> >     division into "Runner" options and other options seems rather
> arbitrary
> >     to users. Most built-in options are also Runner-related.
> >
> >      > It should be possible to *optionally* qualify/scope (to cover
> >     cases where there is ambiguity), but otherwise I prefer the format
> >     we currently have.
> >
> >     Yes, namespacing is a problem. What happens if the user defines a
> >     custom
> >     PipelineOption which clashes with one of the builtin ones? If both
> are
> >
> >     set, which one is actually applied?
> >
> >
> > Note that PipelineOptions so far has been treating name equality to mean
> > option equality and the Java implementation has a bunch of strict checks
> > to make sure that default values aren't used for duplicate definitions,
> > they have the same type, etc...
> > With 1), you fail the job if the runner can't understand your option
> > because it's not represented the same way. User then needs to fix up
> > their declaration of the option name.
> > With 2), there are no name conflicts, the SDK will need to validate that
> > the option isn't set in both formats and error out if it is before
> > pipeline submission time.
> > With 3), you can prefetch all the options and error out to the user
> > during argument parsing time.
> >
> >
> >
> >     Here is a summary of the possible paths going forward:
> >
> >
> >     1) Validate PipelineOptions at Runner side
> >     ==========================================
> >
> >     The main issue raised here was that we want to move away from parsing
> >     arguments which look like options without validating them. An easy
> fix
> >     would be to actually validate them on the Runner side. This could be
> >     done by changing the deserialization code of PipelineOptions which so
> >     far ignores unknown JSON options.
> >
> >     See: PipelineOptionsTranslation.fromProto(Struct protoOptions)
> >
> >     Actually, this wouldn't work for user-defined PipelineOptions because
> >     they might not be known to the Runner (if they are defined in
> Python).
> >
> >
> >     2) Introduce a Runner-Option Flag
> >     =================================
> >
> >     In this approach we would try to add as many pipeline options for a
> >     Runner to the SDK, but allow additional Runner options to be passed
> >     using the `--runner-option=key=val` flag. The Runner, like in 1),
> would
> >     have to ensure validation. I think this has been the most favored
> >     way so
> >     far. Going forward, that means that `--parallelism=4` and
> >     `--runner-option=parallelism=4` will have the same effect for the
> Flink
> >     Runner.
> >
> >
> >     3) Implement Fetching of Options from JobServer
> >     ===============================================
> >
> >     The options are retrieved from the JobServer before submitting the
> >     pipeline. I think this would be ideal but, as mentioned before, it
> >     increases the complexity for implementing new SDKs and might overall
> >     just not be worth the effort.
> >
> >
> >     What do you think? I'd implement 2) for the next release, unless
> there
> >     are advocates for a different approach.
> >
> >     Cheers,
> >     Max
> >
> >     On 25.10.18 21:19, Thomas Weise wrote:
> >      > Reminder that this is something we ideally address before the next
> >      > release...
> >      >
> >      > Considering the discussion so far, my preference is that we get
> away
> >      > from unknown options and discover valid options from the runner
> (by
> >      > expanding the job service).
> >      >
> >      > Once the SDK is aware of all valid options, it is possible to
> >     provide
> >      > meaningful feedback to the user (validate or help), and correctly
> >     handle
> >      > scopes and types.
> >      >
> >      > I would prefer we don't introduce a (quirky) way of passing
> unknown
> >      > options that forces users to type JSON into the command line (or
> >     similar
> >      > acrobatics). To someone wanting to run a pipeline, all options are
> >      > equally important, whether they are application specific, SDK
> >     specific
> >      > or runner specific. It should be possible to *optionally*
> >     qualify/scope
> >      > (to cover cases where there is ambiguity), but otherwise I prefer
> >     the
> >      > format we currently have.
> >      >
> >      > Regarding type inference: Correct handling of numeric types
> >     matters, see
> >      > following issue with protobuf (not JSON):
> >      > https://issues.apache.org/jira/browse/BEAM-5509
> >      >
> >      > Thomas
> >      >
> >      >
> >      > On Thu, Oct 18, 2018 at 6:55 AM Robert Bradshaw
> >     <robertwb@google.com <ma...@google.com>
> >      > <mailto:robertwb@google.com <ma...@google.com>>> wrote:
> >      >
> >      >     On Wed, Oct 17, 2018 at 11:35 PM Lukasz Cwik
> >     <lcwik@google.com <ma...@google.com>
> >      >     <mailto:lcwik@google.com <ma...@google.com>>> wrote:
> >      >
> >      >
> >      >         On Tue, Oct 16, 2018 at 11:51 AM Robert Bradshaw
> >      >         <robertwb@google.com <ma...@google.com>
> >     <mailto:robertwb@google.com <ma...@google.com>>> wrote:
> >      >
> >      >             On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik
> >      >             <lcwik@google.com <ma...@google.com>
> >     <mailto:lcwik@google.com <ma...@google.com>>> wrote:
> >      >              >
> >      >              > For all unknown options, the SDK can require that
> all
> >      >             flag values be specified explicitly as a valid JSON
> type.
> >      >              > starts with { -> object
> >      >              > starts with [ -> list
> >      >              > starts with " -> string
> >      >              > is null / true / false -> null / true / false
> >      >              > otherwise is number.
> >      >              >
> >      >              > This isn't great for strings but works well for
> >     all the
> >      >             other types.
> >      >              >
> >      >              > Thus for known options, the additional typing
> >     information
> >      >             would disambiguate whether something should be a
> >      >             string/boolean/number/object/list but for unknown
> >     options we
> >      >             would expect the user to use valid JSON explicitly
> >     and write:
> >      >              > --foo={"object": "value"}
> >      >              > --foo=["value", "value2"]
> >      >              > --foo="string value"
> >      >
> >      >             Due to shell escaping, one would have to write
> >      >
> >      >             --foo=\"string value\"
> >      >
> >      >             or actually, due to the space
> >      >
> >      >             --foo='"string value"'
> >      >
> >      >             or some other variation on that, which is really
> >      >             unfortunate. (The JSON list/objects would need similar
> >      >             quoting, but that's less surprising.) Also, does this
> >     mean
> >      >             we'd only have one kind of number (not integer vs.
> float,
> >      >             i.e. --parallelism=5.0 works)? I suppose that is JSON.
> >      >
> >      >
> >      >         Yes, I was suspecting that users would need to type the second
> >      >         variant, as I found \"...\" more burdensome than '"..."'
> >      >
> >      >
> >      >              > --foo=3.5 --foo=-4
> >      >              > --foo=true --foo=false
> >      >              > --foo=null
> >      >              > This also works if the flag is repeated, so
> --foo=3.5
> >      >             --foo=-4 is [3.5, -4]
> >      >
> >      >             The thing that sparked this discussion was what to do
> >     when
> >      >             unknown foo is repeated, but only one value given.
> >      >
> >      >
> >      >         If the person only specifies one value, then they have to
> >      >         disambiguate and put it in a list, only if they specify more
> >      >         than one value will they have to turn it into a list.
> >      >
> >      >         I believe we could come up with other schemes on how to
> >     convert
> >      >         unknown options to JSON where we prefer strings over
> >     non-string
> >      >         types like null/true/false/numbers/list/object and
> >     require the
> >      >         user to escape out of the string default but anything
> that is
> >      >         too different from strict JSON would cause headaches when
> >      >         attempting to explain the format to users. I think a happy
> >      >         middle ground would be that we will only require escaping
> for
> >      >         strings which are ambiguous, so things like true, null,
> >     false,
> >      >         ... to be treated as strings would require the user to
> >     escape them.
> >      >
> >      >
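A minimal sketch of the strict-JSON reading of unknown options described above, assuming the unknown flags have already been split into (name, raw value) pairs. The function name parse_unknown_as_json is hypothetical and not part of any Beam SDK. A flag given once keeps its single JSON value; a repeated flag becomes a list.

```python
import json


def parse_unknown_as_json(pairs):
    """Interpret each raw value as strict JSON; repeated names become lists.

    `pairs` is an iterable of (name, raw_value) tuples, for example the
    result of splitting "--foo=3.5" into ("foo", "3.5"). Hypothetical helper.
    """
    collected = {}
    for name, raw in pairs:
        # json.loads raises json.JSONDecodeError for anything that is not
        # valid JSON, e.g. an unquoted string.
        collected.setdefault(name, []).append(json.loads(raw))
    # A single occurrence stays a scalar; several occurrences form a list.
    return {
        name: values[0] if len(values) == 1 else values
        for name, values in collected.items()
    }
```

Under this sketch a bare value such as `string value` is not valid JSON and raises json.JSONDecodeError, which is the quoting burden raised earlier in the thread; the user has to pass `"string value"` with the quotes surviving the shell.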
> >      >     I'd prefer to avoid inferring the type of an unknown argument
> >     based
> >      >     on its contents, which can lead to surprises. We could
> >     declare every
> >      >     unknown type to be repeated string, and let any
> >     parsing/validation
> >      >     occur on the runner. If desired, we could pass these around
> as a
> >      >     single "runner options" dict that runners could inspect and
> >     use to
> >      >     populate the actual dict rather than mixing parsed and
> unparsed
> >      >     options.
> >      >
> >      >
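The "every unknown option is a repeated string" idea can be sketched with the standard argparse module. This is only an illustration under stated assumptions: the function name split_known_and_runner_options is hypothetical, the single declared flag stands in for whatever options an SDK actually knows, and unknown flags are assumed to use the --name=value form. No type is inferred; values stay strings for the runner to parse and validate.

```python
import argparse


def split_known_and_runner_options(argv):
    """Return (known_namespace, runner_options) without inferring types.

    Hypothetical helper: unknown "--name=value" flags are collected into a
    separate dict mapping each name to a list of raw strings.
    """
    # allow_abbrev=False keeps an unknown flag like --run from being
    # silently treated as an abbreviation of a declared flag.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--runner", default="DirectRunner")
    known, unknown = parser.parse_known_args(argv)

    runner_options = {}
    for arg in unknown:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError("expected --name=value, got %r" % arg)
        name, value = arg[2:].split("=", 1)
        # Always a list of strings, even for a single occurrence.
        runner_options.setdefault(name, []).append(value)
    return known, runner_options
```

Keeping the unknown values in their own dict, rather than merging them into the parsed namespace, mirrors the suggestion not to mix parsed and unparsed options.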
> >      >
> >      >              > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise
> >      >             <thw@apache.org <ma...@apache.org>
> >     <mailto:thw@apache.org <ma...@apache.org>>> wrote:
> >      >              >>
> >      >              >> Discovering options from the job server seems
> >     preferable
> >      >             over replicating runner options in SDKs.
> >      >              >>
> >      >              >> Runners evolve on their own, and with portability
> the
> >      >             SDK does not need to know anything about the runner.
> >      >              >>
> >      >              >> Regarding --runner-option. It is true that this
> looks
> >      >             less user friendly. On the other hand it eliminates
> the
> >      >             possibility of name collisions.
> >      >              >>
> >      >              >> But if options are discovered, the SDK can
> >     perform full
> >      >             validation. It would only be necessary to use explicit
> >      >             scoping when there is ambiguity.
> >      >              >>
> >      >              >> Thomas
> >      >              >>
> >      >              >>
> >      >              >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels
> >      >             <mxm@apache.org <ma...@apache.org>
> >     <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
> >      >              >>>
> >      >              >>> Fetching options directly from the Runner's
> >     JobServer
> >      >             seems like the
> >      >              >>> ideal solution. I agree with Robert that it
> creates
> >      >             additional
> >      >              >>> complexity for SDK authors, so the
> `--runner-option`
> >      >             flag would be an
> >      >              >>> easy and explicit way to specify additional
> >     Runner options.
> >      >              >>>
> >      >              >>> The format I prefer would be:
> >     --runner_option=key1=val1
> >      >              >>> --runner_option=key2=val2
> >      >              >>>
> >      >              >>> Now, from the perspective of end users, I think
> >     it is
> >      >             neither convenient
> >      >              >>> nor reasonable to require the use of the
> >      >             `--runner-option` flag. To the
> >      >              >>> user it seems nebulous why some pipeline options
> >     live
> >      >             in the top-level
> >      >              >>> option namespace while others need to be nested
> >     within
> >      >             an option. This
> >      >              >>> is amplified by there being two Runners the user
> >     needs
> >      >             to be aware of,
> >      >              >>> i.e. PortableRunner and the actual Runner
> >      >             (Dataflow/Flink/Spark..).
> >      >              >>>
> >      >              >>> I feel like we would eventually replicate all
> >     options
> >      >             in the SDK because
> >      >              >>> otherwise users have to use the
> >     `--runner-option`, but
> >      >             at least we can
> >      >              >>> specify options which have not been replicated
> yet.
> >      >              >>>
> >      >              >>> -Max
> >      >              >>>
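A minimal sketch of the repeated --runner_option=key=val format, again using the standard argparse module. The function name parse_runner_options is hypothetical and this is not how any Beam SDK is known to implement it; it only shows that the explicit flag keeps runner options in their own namespace, so they cannot collide with top-level options.

```python
import argparse


def parse_runner_options(argv):
    """Collect repeated --runner_option=key=val flags into a dict of lists.

    Hypothetical helper; values are kept as raw strings.
    """
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--runner_option", action="append", metavar="KEY=VALUE")
    # Other flags are ignored here; a real parser would declare them.
    args, _ = parser.parse_known_args(argv)

    options = {}
    for item in args.runner_option or []:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError("expected key=val, got %r" % item)
        options.setdefault(key, []).append(value)
    return options
```

Repeating a key simply appends to its list, which sidesteps the question of how to merge structured values at the cost of leaving all interpretation to the runner.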
> >      >              >>> On 16.10.18 10:27, Robert Bradshaw wrote:
> >      >              >>> > Yes, we don't know how to parse and/or
> >     validate it.
> >      >              >>> >
> >      >              >>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik
> >      >             <lcwik@google.com <ma...@google.com>
> >     <mailto:lcwik@google.com <ma...@google.com>>
> >      >              >>> > <mailto:lcwik@google.com
> >     <ma...@google.com> <mailto:lcwik@google.com
> >     <ma...@google.com>>>>
> >      >             wrote:
> >      >              >>> >
> >      >              >>> >     I see, is the issue that we currently are
> >     using a
> >      >             JSON
> >      >              >>> >     representation for options when being
> >     serialized
> >      >             and when we get
> >      >              >>> >     some unknown option, we don't know how to
> >     convert
> >      >             it into its JSON form?
> >      >              >>> >
> >      >              >>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert
> >     Bradshaw
> >      >             <robertwb@google.com <ma...@google.com>
> >     <mailto:robertwb@google.com <ma...@google.com>>
> >      >              >>> >     <mailto:robertwb@google.com
> >     <ma...@google.com>
> >      >             <mailto:robertwb@google.com
> >     <ma...@google.com>>>> wrote:
> >      >              >>> >
> >      >              >>> >         On Mon, Oct 15, 2018 at 11:30 PM
> >     Lukasz Cwik
> >      >             <lcwik@google.com <ma...@google.com>
> >     <mailto:lcwik@google.com <ma...@google.com>>
> >      >              >>> >         <mailto:lcwik@google.com
> >     <ma...@google.com>
> >      >             <mailto:lcwik@google.com <ma...@google.com>>>>
> >     wrote:
> >      >              >>> >          >
> >      >              >>> >          > On Mon, Oct 15, 2018 at 1:17 PM
> Robert
> >      >             Bradshaw
> >      >              >>> >         <robertwb@google.com
> >     <ma...@google.com>
> >      >             <mailto:robertwb@google.com
> >     <ma...@google.com>> <mailto:robertwb@google.com
> >     <ma...@google.com>
> >      >             <mailto:robertwb@google.com
> >     <ma...@google.com>>>> wrote:
> >      >              >>> >          >>
> >      >              >>> >          >> On Mon, Oct 15, 2018 at 7:50 PM
> >     Lukasz Cwik
> >      >              >>> >         <lcwik@google.com
> >     <ma...@google.com> <mailto:lcwik@google.com
> >     <ma...@google.com>>
> >      >             <mailto:lcwik@google.com <ma...@google.com>
> >     <mailto:lcwik@google.com <ma...@google.com>>>> wrote:
> >      >              >>> >          >> >
> >      >              >>> >          >> > I agree with the sentiment for
> >     better
> >      >             error checking.
> >      >              >>> >          >> >
> >      >              >>> >          >> > We can try to make it such that
> >     the SDK
> >      >             can "fetch" the
> >      >              >>> >         set of options that the runner
> supports by
> >      >             making a call to the
> >      >              >>> >         Job API. The API could return a list of
> >      >             option names
> >      >              >>> >         (descriptions for --help purposes and
> also
> >      >             potentially the
> >      >              >>> >         expected format) which would remove
> >     the worry
> >      >             around "unknown"
> >      >              >>> >         options. Yes I understand to be able
> >     to make
> >      >             the Job API call,
> >      >              >>> >         we may need to parse some options from
> the
> >      >             args parameters first
> >      >              >>> >         and then parse the unknown options
> >     after they
> >      >             are fetched.
> >      >              >>> >          >>
> >      >              >>> >          >> This is an interesting idea, but
> >     seems it
> >      >             could get quite
> >      >              >>> >         complicated.
> >      >              >>> >          >> E.g. for delegating runners, one
> would
> >      >             first read the options to
> >      >              >>> >          >> determine which runner to fetch the
> >      >             options from, which
> >      >              >>> >         would then
> >      >              >>> >          >> return a set of options that
> possibly
> >      >             depends on the values
> >      >              >>> >         of some of
> >      >              >>> >          >> its options...
> >      >              >>> >          >>
> >      >              >>> >          >> > Alternatively, we can choose an
> >      >             explicit format upfront.
> >      >              >>> >          >> > To expand on the exact format for
> >      >             --runner_option=...,
> >      >              >>> >         here are some different ideas:
> >      >              >>> >          >> > 1) Specified multiple times,
> >     each one
> >      >             is an explicit flag
> >      >              >>> >          >> > --runner_option=--blah=bar
> >      >             --runner_option=--foo=baz1
> >      >              >>> >         --runner_option=--foo=baz2
> >      >              >>> >          >>
> >      >              >>> >          >> I'm -1 on this format. We should
> move
> >      >             away from the idea
> >      >              >>> >         that options
> >      >              >>> >          >> == flags (as that doesn't compose
> well
> >      >             with other libraries
> >      >              >>> >         that do
> >      >              >>> >          >> their own flags parsing). The
> >     ability to
> >      >             parse a set of
> >      >              >>> >         flags into
> >      >              >>> >          >> options is just a convenience that
> an
> >      >             author may (or may
> >      >              >>> >         not) choose
> >      >              >>> >          >> to use (e.g. when running
> pipelines in a
> >      >             long-lived process like a
> >      >              >>> >          >> service or a notebook, the command
> >     line
> >      >             flags are almost
> >      >              >>> >         certainly not
> >      >              >>> >          >> the right interface).
> >      >              >>> >          >>
> >      >              >>> >          >> > 2) specified multiple times, we
> drop
> >      >             the explicit flag
> >      >              >>> >          >> > --runner_option=blah=bar
> >      >             --runner_option=foo=baz1
> >      >              >>> >         --runner_option=foo=baz2
> >      >              >>> >          >>
> >      >              >>> >          >> This or (4) is my preference.
> >      >              >>> >          >>
> >      >              >>> >          >> > 3) we use a string which the
> >     runner can
> >      >             choose to
> >      >              >>> >         interpret however they want (JSON/XML
> >     shown
> >      >             below)
> >      >              >>> >          >> > --runner_option='{"blah": "bar",
> >     "foo":
> >      >             ["baz1", "baz2"]}'
> >      >              >>> >          >> >
> >      >              >>> >          >> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
> >      >              >>> >          >>
> >      >              >>> >          >> This would make validation hard.
> >     Also, I
> >      >             think it makes
> >      >              >>> >         sense for some
> >      >              >>> >          >> runner options to be "shared"
> >      >             ("parallelism") by convention,
> >      >              >>> >         so letting
> >      >              >>> >          >> it be a free-form string wouldn't
> >     allow
> >      >             different runners to
> >      >              >>> >         inspect
> >      >              >>> >          >> different bits.
> >      >              >>> >          >>
> >      >              >>> >          >> We should consider if we should
> >     use urns
> >      >             for namespacing, and
> >      >              >>> >          >> assigning semantic meaning to
> >     strings, here.
> >      >              >>> >          >>
> >      >              >>> >          >> > 4) we use a string which must be
> a
> >      >             specific format such as
> >      >              >>> >         JSON (allows the SDK to do simple
> >     validation):
> >      >              >>> >          >> > --runner_option='{"blah": "bar",
> >     "foo":
> >      >             ["baz1", "baz2"]}'
> >      >              >>> >          >>
> >      >              >>> >          >> I like this in that at least some
> >      >             validation can be
> >      >              >>> >         performed, and
> >      >              >>> >          >> expectations of how to format
> richer
> >      >             types. On the other
> >      >              >>> >         hand it gets
> >      >              >>> >          >> a bit verbose, given that most (I'd
> >      >             imagine) options will be
> >      >              >>> >         simple.
> >      >              >>> >          >> As with normal options,
> >      >              >>> >          >>
> >      >              >>> >          >>     --option1=value1
> --option2=value2
> >      >              >>> >          >>
> >      >              >>> >          >> is shorthand for {"option1":
> value1,
> >      >             "option2": value2}.
> >      >              >>> >          >>
> >      >              >>> >          > I lean to 4 the most. With 2, you
> >     run into
> >      >             issues of what
> >      >              >>> >         does --runner_option=foo=["a", "b"]
> >      >             --runner_option=foo=["c",
> >      >              >>> >         "d"] mean?
> >      >              >>> >          > Is it an error or list of lists or
> >      >             concatenated. Similar
> >      >              >>> >         issues for map types represented via
> JSON
> >      >             object {...}
> >      >              >>> >
> >      >              >>> >         We can err to be on the safe side
> >      >             unless/until an argument can
> >      >              >>> >         be made
> >      >              >>> >         that merging is more natural. I just
> think
> >      >             this will be excessively
> >      >              >>> >         verbose to use.
> >      >              >>> >
> >      >              >>> >          >> > I would strongly suggest that we
> go
> >      >             with the "fetch"
> >      >              >>> >         approach, since this makes the set of
> >     options
> >      >             discoverable and
> >      >              >>> >         helps users find errors much earlier
> >     in their
> >      >             pipeline.
> >      >              >>> >          >>
> >      >              >>> >          >> This seems like an advanced
> >     feature that
> >      >             SDKs may want to
> >      >              >>> >         support, but
> >      >              >>> >          >> I wouldn't want to require this
> >      >             complexity for bootstrapping
> >      >              >>> >         an SDK.
> >      >              >>> >          >>
> >      >              >>> >          > SDKs that are starting off wouldn't
> >     need
> >      >             to "fetch" options,
> >      >              >>> >         they could choose to not support runner
> >      >             options or they could
> >      >              >>> >         choose to pass all options through to
> the
> >      >             runner blindly.
> >      >              >>> >         Fetching the options only provides the
> SDK
> >      >             the ability to
> >      >              >>> >         provide error checking upfront and
> useful
> >      >             error/help messages.
> >      >              >>> >
> >      >              >>> >         But how to even pass all options
> through
> >      >             blindly is exactly the
> >      >              >>> >         difficulty we're running into here.
> >      >              >>> >
> >      >              >>> >          >> Regarding always keeping runner
> >     options
> >      >             separate, +1, though
> >      >              >>> >         I'm not
> >      >              >>> >          >> sure the line is always clear.
> >      >              >>> >          >>
> >      >              >>> >          >>
> >      >              >>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM
> >     Robert
> >      >             Bradshaw
> >      >              >>> >         <robertwb@google.com
> >     <ma...@google.com>
> >      >             <mailto:robertwb@google.com
> >     <ma...@google.com>> <mailto:robertwb@google.com
> >     <ma...@google.com>
> >      >             <mailto:robertwb@google.com
> >     <ma...@google.com>>>> wrote:
> >      >              >>> >          >> >>
> >      >              >>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM
> >      >             Maximilian Michels
> >      >              >>> >         <mxm@apache.org
> >     <ma...@apache.org> <mailto:mxm@apache.org <mailto:
> mxm@apache.org>>
> >      >             <mailto:mxm@apache.org <ma...@apache.org>
> >     <mailto:mxm@apache.org <ma...@apache.org>>>> wrote:
> >      >              >>> >          >> >> >
> >      >              >>> >          >> >> > I agree that the current
> approach
> >      >             breaks the pipeline
> >      >              >>> >         options contract
> >      >              >>> >          >> >> > because "unknown" options get
> >     parsed
> >      >             in the same way as
> >      >              >>> >         options which
> >      >              >>> >          >> >> > have been defined by the user.
> >      >              >>> >          >> >>
> >      >              >>> >          >> >> FWIW, I think we're already
> >     breaking
> >      >             this "contract."
> >      >              >>> >         Unknown options
> >      >              >>> >          >> >> are silently ignored; with this
> >     change
> >      >             we just change how
> >      >              >>> >         we record
> >      >              >>> >          >> >> them. It still feels a bit
> >     hacky though.
> >      >              >>> >          >> >>
> >      >              >>> >          >> >> > I'm not sure the
> >     `experiments` flag
> >      >             works for us. AFAIK
> >      >              >>> >         it only allows
> >      >              >>> >          >> >> > true/false flags. We want to
> pass
> >      >             all types of pipeline
> >      >              >>> >         options to the
> >      >              >>> >          >> >> > Runner.
> >      >              >>> >          >> >>
> >      >              >>> >          >> >> Experiments is an arbitrary set
> of
> >      >             strings, which can be
> >      >              >>> >         of the form
> >      >              >>> >          >> >> "param=value" if that's useful.
> >      >             (Dataflow does this.)
> >      >              >>> >         There is, again,
> >      >              >>> >          >> >> no namespacing on the param
> >     names, but
> >      >             we could use urns
> >      >              >>> >         or impose
> >      >              >>> >          >> >> some other structure here.
> >      >              >>> >          >> >>
> >      >              >>> >          >> >> > How to solve this?
> >      >              >>> >          >> >> >
> >      >              >>> >          >> >> > 1) Add all options of all
> >     Runners to
> >      >             each SDK
> >      >              >>> >          >> >> > We added some of the
> FlinkRunner
> >      >             options to the Python
> >      >              >>> >         SDK but realized
> >      >              >>> >          >> >> > syncing is rather cumbersome
> >     in the
> >      >             long term. However,
> >      >              >>> >         we want the most
> >      >              >>> >          >> >> > important options to be
> >     validated on
> >      >             the client side.
> >      >              >>> >          >> >>
> >      >              >>> >          >> >> I don't think this is
> >     sustainable in
> >      >             the long run.
> >      >              >>> >         However, thinking
> >      >              >>> >          >> >> about this, in the worst case
> >      >             validation happens after
> >      >              >>> >         construction
> >      >              >>> >          >> >> but before execution (as with
> >     much of
> >      >             our other
> >      >              >>> >         validation) so it
> >      >              >>> >          >> >> isn't that bad.
> >      >              >>> >          >> >>
> >      >              >>> >          >> >> > 2) Pass "unknown" options via
> a
> >      >             separate list in the
> >      >              >>> >         Proto which can
> >      >              >>> >          >> >> > only be accessed internally
> >     by the
> >      >             Runners. This still
> >      >              >>> >         allows passing
> >      >              >>> >          >> >> > arbitrary options but we
> wouldn't
> >      >             leak unknown options
> >      >              >>> >         and display them
> >      >              >>> >          >> >> > as top-level options.
> >      >              >>> >          >> >>
> >      >              >>> >          >> >> I think there needs to be a way
> for
> >      >             the user to
> >      >              >>> >         communicate values
> >      >              >>> >          >> >> directly to the runner
> >     regardless of
> >      >             the SDK. My
> >      >              >>> >         preference would be
> >      >              >>> >          >> >> to make this explicit, e.g.
> >     (repeated)
> >      >              >>> >         --runner_option=..., rather
> >      >              >>> >          >> >> than scooping up all unknown
> >     flags at
> >      >             command line
> >      >              >>> >         parsing time.
> >      >              >>> >          >> >> Perhaps an SDK that is aware of
> >     some
> >      >             runners could choose
> >      >              >>> >         to lift
> >      >              >>> >          >> >> these as top-level options, but
> >     still
> >      >             pass them as runner
> >      >              >>> >         options.
> >      >              >>> >          >> >>
> >      >              >>> >          >> >> > On 13.10.18 02:34, Charles
> >     Chen wrote:
> >      >              >>> >          >> >> > > The current release branch
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut
> >      >              >>> >         after the
> >      >              >>> >          >> >> > > revert went in.  Sent out
> >      >              >>> > https://github.com/apache/beam/pull/6683 as a
> >      >              >>> >          >> >> > > revert of the revert.
> >     Regarding
> >      >             your comment above,
> >      >              >>> >         I can help out with
> >      >              >>> >          >> >> > > the design / PR reviews for
> >     common
> >      >             Python code as you
> >      >              >>> >         suggest.
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > > On Fri, Oct 12, 2018 at
> 4:48 PM
> >      >             Thomas Weise
> >      >              >>> >         <thw@apache.org
> >     <ma...@apache.org> <mailto:thw@apache.org <mailto:
> thw@apache.org>>
> >      >             <mailto:thw@apache.org <ma...@apache.org>
> >     <mailto:thw@apache.org <ma...@apache.org>>>
> >      >              >>> >          >> >> > > <mailto:thw@apache.org
> >     <ma...@apache.org>
> >      >             <mailto:thw@apache.org <ma...@apache.org>>
> >     <mailto:thw@apache.org <ma...@apache.org>
> >      >             <mailto:thw@apache.org <ma...@apache.org>>>>>
> wrote:
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > >     Thanks, will tag you and
> >      >             looking forward to
> >      >              >>> >         feedback so we can
> >      >              >>> >          >> >> > >     ensure that changes
> >     work for
> >      >             everyone.
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > >     Looking at the PR, I see
> >      >             agreement from Max to
> >      >              >>> >         revert the change on
> >      >              >>> >          >> >> > >     the release branch, but
> >     not in
> >      >             master. Would you
> >      >              >>> >         mind restoring it
> >      >              >>> >          >> >> > >     in master?
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > >     Thanks
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > >     On Fri, Oct 12, 2018 at
> >     4:40
> >      >             PM Ahmet Altay
> >      >              >>> >         <altay@google.com
> >     <ma...@google.com> <mailto:altay@google.com
> >     <ma...@google.com>>
> >      >             <mailto:altay@google.com <ma...@google.com>
> >     <mailto:altay@google.com <ma...@google.com>>>
> >      >              >>> >          >> >> > >
> >     <mailto:altay@google.com <ma...@google.com>
> >      >             <mailto:altay@google.com <ma...@google.com>>
> >      >              >>> >         <mailto:altay@google.com
> >     <ma...@google.com>
> >      >             <mailto:altay@google.com
> >     <ma...@google.com>>>>> wrote:
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > >         On Fri, Oct 12,
> 2018 at
> >      >             11:31 AM, Charles
> >      >              >>> >         Chen <ccy@google.com
> >     <ma...@google.com> <mailto:ccy@google.com <mailto:
> ccy@google.com>>
> >      >             <mailto:ccy@google.com <ma...@google.com>
> >     <mailto:ccy@google.com <ma...@google.com>>>
> >      >              >>> >          >> >> > >
> >     <mailto:ccy@google.com <ma...@google.com>
> >      >             <mailto:ccy@google.com <ma...@google.com>>
> >      >              >>> >         <mailto:ccy@google.com
> >     <ma...@google.com>
> >      >             <mailto:ccy@google.com <ma...@google.com>>>>>
> wrote:
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > >             What I mean is
> >     that a
> >      >             user may find that
> >      >              >>> >         it works for them
> >      >              >>> >          >> >> > >             to pass
> >     "--myarg blah"
> >      >             and access it as
> >      >              >>> >         "options.myarg"
> >      >              >>> >          >> >> > >             without
> explicitly
> >      >             defining a "my_arg"
> >      >              >>> >         flag due to the added
> >      >              >>> >          >> >> > >             logic.  This is
> not
> >      >             the intended behavior
> >      >              >>> >         and we may want to
> >      >              >>> >          >> >> > >             change this
> >      >             implementation detail in the
> >      >              >>> >         future.  However,
> >      >              >>> >          >> >> > >             having this
> >     logic in a
> >      >             released version
> >      >              >>> >         makes it hard to
> >      >              >>> >          >> >> > >             change this
> >     behavior
> >      >             since users may
> >      >              >>> >         erroneously depend on
> >      >              >>> >          >> >> > >             this
> undocumented
> >      >             behavior.  Instead, we
> >      >              >>> >         should namespace /
> >      >              >>> >          >> >> > >             scope this so
> >     that it
> >      >             is obvious that
> >      >              >>> >         this is meant for
> >      >              >>> >          >> >> > >             runner (and not
> >     Beam
> >      >             user) consumption.
> >      >              >>> >          >> >> > >
> >      >              >>> >          >> >> > >             On Fri, Oct 12,
> >     2018
> >      >             at 10:48 AM Thomas Weise
> >      >              >>> >          >> >> > >             <thw@apache.org
> >     <ma...@apache.org>
> >      >             <mailto:thw@apache.org <ma...@apache.org>>
> >     <mailto:thw@apache.org <ma...@apache.org>
> >      >             <mailto:thw@apache.org <ma...@apache.org>>>
> >      >              >>> >         <mailto:thw@apache.org
> >     <ma...@apache.org>
> >      >             <mailto:thw@apache.org <ma...@apache.org>>
> >     <mailto:thw@apache.org <ma...@apache.org>
> >      >             <mailto:thw@apache.org <ma...@apache.org>>>>>
> wrote:
> > > >
> > > > Can you please elaborate more what practical problems this
> > > > introduces for users?
> > > >
> > > > I can see that this change allows a user to specify a runner
> > > > specific option, which in the future may change because we decide
> > > > to scope differently. If this only affects users of the portable
> > > > Flink runner (like us), then no need to revert, because at this
> > > > early stage we prefer something that works over being blocked.
> > > >
> > > > It would also be really great if some of the core Python SDK
> > > > developers could help out with the design aspects and PR reviews
> > > > of changes that affect common Python code. Anyone who specifically
> > > > wants to be tagged on relevant JIRAs and PRs?
> > >
> > > I would be happy to be tagged, and I can also help with including
> > > other relevant folks whenever possible. In general I think Robert,
> > > Charles, myself are good candidates.
> > >
> > > > Thanks
> > > >
> > > > On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay <altay@google.com>
> > > > wrote:
> > > >
> > > > > On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen <ccy@google.com>
> > > > > wrote:
> > > > >
> > > > > > For context, I made comments on
> > > > > > https://github.com/apache/beam/pull/6600 noting that the
> > > > > > changes being made were not good for Beam
> > > > > > backwards-compatibility. The change as is allows users to use
> > > > > > pipeline options without explicitly defining them, which is
> > > > > > not the type of usage we would like to encourage since we
> > > > > > prefer to be explicit whenever possible. If users write
> > > > > > pipelines with this sort of pattern, they will potentially
> > > > > > encounter pain when upgrading to a later version since this is
> > > > > > an implementation detail and not an officially supported
> > > > > > pattern. I agree with the comments above that this is
> > > > > > ultimately a scoping issue. I would not have a problem with
> > > > > > these changes if they were explicitly scoped under either a
> > > > > > runner or unparsed options namespace.
> > > > > >
> > > > > > As a second note, since the 2.8.0 release is being cut right
> > > > > > now, because of these backwards-compatibility concerns, I
> > > > > > would suggest reverting these changes, at least until 2.8.0 is
> > > > > > cut, so we can have a discussion here before committing to and
> > > > > > releasing any API-level changes.
> > > > >
> > > > > +1 I would like to revert the changes in order not rush this
> > > > > into the release. Once this discussion results in an agreement
> > > > > changes can be brought back.
> > > > >
> > > > > > On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
> > > > > > <herohde@google.com> wrote:
> > > > > >
> > > > > > > Agree that pipeline options lack some mechanism for scoping.
> > > > > > > It is also not always possible distinguish options meant to
> > > > > > > be consumed at pipeline construction time, by the runner, by
> > > > > > > the SDK harness, by the user code or any combination -- and
> > > > > > > this causes confusion every now and then.
> > > > > > >
> > > > > > > For Dataflow, we have been using "experiments" for arbitrary
> > > > > > > runner-specific options. It's simply a string list pipeline
> > > > > > > option that all SDKs support and, for Go at least, is sent
> > > > > > > to portable runners. Flink can do the same in the short term
> > > > > > > to move forward.
> > > > > > >
> > > > > > > Henning
> > > > > > >
> > > > > > > On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
> > > > > > > <thw@apache.org> wrote:
> > > > > > >
> > > > > > > > [moving to the list]
> > > > > > > >
> > > > > > > > The requirement driving this part of the change was to
> > > > > > > > allow a user to specify pipeline options that a runner
> > > > > > > > supports without having to declare those in each language
> > > > > > > > SDK.
> > > > > > > >
> > > > > > > > In the specific scenario, we have options that the Flink
> > > > > > > > runner supports (and can validate), that are not
> > > > > > > > enumerated in the Python SDK.
> > > > > > > >
> > > > > > > > I think we have a bigger problem scoping pipeline options.
> > > > > > > > For example, the runner options are dumped into the SDK
> > > > > > > > worker. There is also a possibility of name collisions. So
> > > > > > > > I think this would benefit from broader feedback.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Thomas
> > > > > > > >
> > > > > > > > ---------- Forwarded message ---------
> > > > > > > > From: *Charles Chen* <notifications@github.com>
> > > > > > > > Date: Fri, Oct 12, 2018 at 8:36 AM
> > > > > > > > Subject: Re: [apache/beam] [BEAM-5442] Store duplicate
> > > > > > > > unknown options in a list argument (#6600)
> > > > > > > > To: apache/beam <beam@noreply.github.com>
> > > > > > > > Cc: Thomas Weise <thomas.weise@gmail.com>, Mention
> > > > > > > > <mention@noreply.github.com>
> > > > > > > >
> > > > > > > > CC: @tweise <https://github.com/tweise>
> >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Lukasz Cwik <lc...@google.com>.

The purpose of the spec would be to provide the names, types, and
descriptions of the options. We don't need anything beyond the JSON types
(string, number, bool, object, list) because the only ambiguity we get is
how to parse a command-line string into the JSON type (and that ambiguity
is actually only between string and non-string, since all the other JSON
types are unambiguous).
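
A minimal sketch of that string versus non-string ambiguity, assuming a
hypothetical spec that maps option names to JSON type names. OPTION_SPEC
and parse_option are illustrative names, not part of any Beam SDK:

```python
import json

# Hypothetical option spec: option name -> JSON type name. Not a Beam API.
OPTION_SPEC = {
    "job_name": "string",
    "parallelism": "number",
    "streaming": "bool",
    "experiments": "list",
}

# Python types that correspond to each non-string JSON type.
PY_TYPES = {"number": (int, float), "bool": bool, "object": dict, "list": list}


def parse_option(name, raw, spec=OPTION_SPEC):
    """Convert a raw command-line string to the JSON type the spec declares."""
    json_type = spec[name]
    if json_type == "string":
        # The only ambiguous case: "4" or "true" must stay a string here.
        return raw
    value = json.loads(raw)  # raises a ValueError subclass on malformed input
    # bool is a subclass of int in Python, so reject it explicitly for numbers.
    wrong_bool = json_type == "number" and isinstance(value, bool)
    if wrong_bool or not isinstance(value, PY_TYPES[json_type]):
        raise ValueError(f"--{name}={raw!r} is not a JSON {json_type}")
    return value
```

With such a spec, `--job_name=4` stays the string "4" while
`--parallelism=4` becomes the number 4; without a spec the parser would
have to guess.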

Also, I believe the flow would be
1) Parse options
  a) Find the URL from the args specified and/or from additional methods on
PipelineOptions that expose a programmatic way to set the URL during
parsing.
  b) Query URL for option specs
  c) Parse the remainder of the options
2) Construct pipeline
3) Choose runner
4) Submit job to runner

Note that I believe that we will want to have multiple URLs to support
cross language pipelines since we will want to be able to ask other SDK
languages/versions for their "list" of supported PipelineOptions.
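
Steps 1a to 1c above could be sketched roughly as follows. The flag name
`--options_spec_url`, the function name, and the `fetch_spec` callable
are all assumptions made for illustration; a real SDK would query the
job server (or another SDK) where `fetch_spec` is called:

```python
import json


def parse_with_remote_specs(argv, fetch_spec):
    """Find spec URLs, query them, then parse the remaining options.

    fetch_spec is a hypothetical callable mapping a URL to a dict of
    {option name: JSON type name}.
    """
    raw = [arg[2:].partition("=") for arg in argv if arg.startswith("--")]
    # 1a) Collect every spec URL; several are allowed for cross-language use.
    urls = [value for name, _, value in raw if name == "options_spec_url"]
    # 1b) Query each URL and merge the specs (later URLs win on name clashes).
    spec = {}
    for url in urls:
        spec.update(fetch_spec(url))
    # 1c) Parse the remainder; unknown names fail at argument-parsing time.
    options = {}
    for name, _, value in raw:
        if name == "options_spec_url":
            continue
        if name not in spec:
            raise ValueError(f"unknown option --{name}")
        options[name] = value if spec[name] == "string" else json.loads(value)
    return options
```

Merging with a plain dict update silently resolves name clashes between
specs; a stricter version would compare the declared types and fail on a
mismatch.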

On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw <ro...@google.com> wrote:

> There's two questions here:
>
> (A) What do we do in the short term?
>
> I think adding every runner option to every SDK is not sustainable
> (n*m work, assuming every SDK knows about every runner), and having a
> patchwork of options that were added as one-offs to SDKs is not
> desirable either. Furthermore, it seems difficult to parse unknown
> options as if they were valid options, so my preference here would be
> to just use a special runner_option flag. (One could also pass a set
> of unparsed/unvalidated runner options to the runner, even if they're
> not distinguished for the user, and runners (or any intermediates)
> could run a "promote" operation that promotes any of these unknowns
> that they recognize to real options before further processing. The
> parsing would be done as repeated-string, and not be intermingled with
> the actually validated options. This is essentially a variant of option 1.)
>
> (B) What do we do in the long term? While the JobServer approach sounds
> nice, I think it introduces a lot of complexity (we have too much of
> that already) and still doesn't completely solve the problem. In
> particular, it changes the flow from
>
> 1. Parse options
> 2. Construct pipeline
> 3. Choose runner
> 4. Submit job to runner
>
> to
>
> 1. Parse options
> 2. Construct pipeline
> 3. Choose runner
> 4a. Query runner for option specs
> 4b. Re-parse options
> 4c. Submit job to runner
>
> In particular, doing 4b in the SDK rather than just letting the runner
> itself do the validation as part of (4) doesn't save much and forces
> us to come up with a (probably incomplete) spec as to how to define
> options, their types, and their validations. It also means that a
> delegating runner must choose and interact with its downstream
> runner(s) synchronously, else we haven't actually solved the issue.
>
> For these reasons, I don't think we even want to go with the JobServer
> approach in the long term, which has bearing on (A).
>
> - Robert
>
>
> On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <mx...@apache.org> wrote:
> >
> > +1
> >
> > If the preferred approach is to eventually have the JobServer serve the
> > options, then the best intermediate solution is to replicate common
> > options in the SDKs.
> >
> > If we went down the "--runner_option" path, we would end up with
> > multiple ways of specifying the same options. We would eventually have
> > to deprecate "runner options" once we have the JobServer approach. I'd
> > like to avoid that.
> >
> > For the upcoming release we can revert the changes again and add the
> > most common missing options to the SDKs. Then hopefully we should have
> > fetching implemented for the release after.
> >
> > Do you think that is feasible?
> >
> > Thanks,
> > Max
> >
> > On 30.10.18 23:00, Lukasz Cwik wrote:
> > > I still like #3 the most, just can't devote the time to get it done.
> > >
> > > Instead of going with a fully implemented #3, we could hardcode a
> > > subset of options and types within each SDK until the job server is
> > > ready to provide this information and then migrate to the "full" list.
> > > This would be an easy path for SDKs to take on. They could "know" of a
> > > few well known options, and if they want to support all options, they
> > > implement the integration with the job server.
> > >
> > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels <mxm@apache.org
> > > <ma...@apache.org>> wrote:
> > >
> > >      > I would prefer we don't introduce a (quirky) way of passing
> > >     unknown options that forces users to type JSON into the command line
> > >     (or similar acrobatics)
> > >     Same here, the JSON approach seems technically nice but too bulky
> > >     for users.
> > >
> > >      > To someone wanting to run a pipeline, all options are equally
> > >     important, whether they are application specific, SDK specific or
> > >     runner specific.
> > >
> > >     I'm also reluctant to force users to use `--runner_option=` because the
> > >     division into "Runner" options and other options seems rather arbitrary
> > >     to users. Most built-in options are also Runner-related.
> > >
> > >      > It should be possible to *optionally* qualify/scope (to cover
> > >     cases where there is ambiguity), but otherwise I prefer the format
> > >     we currently have.
> > >
> > >     Yes, namespacing is a problem. What happens if the user defines a
> > >     custom PipelineOption which clashes with one of the builtin ones?
> > >     If both are set, which one is actually applied?
> > >
> > >
> > > Note that PipelineOptions so far has been treating name equality to mean
> > > option equality and the Java implementation has a bunch of strict checks
> > > to make sure that default values aren't used for duplicate definitions,
> > > they have the same type, etc...
> > > With 1), you fail the job if the runner can't understand your option
> > > because it's not represented the same way. User then needs to fix-up
> > > their declaration of the option name.
> > > With 2), there are no name conflicts, the SDK will need to validate that
> > > the option isn't set in both formats and error out if it is before
> > > pipeline submission time.
> > > With 3), you can prefetch all the options and error out to the user
> > > during argument parsing time.
> > >
> > >
> > >
> > >     Here is a summary of the possible paths going forward:
> > >
> > >
> > >     1) Validate PipelineOptions at Runner side
> > >     ==========================================
> > >
> > >     The main issue raised here was that we want to move away from parsing
> > >     arguments which look like options without validating them. An easy fix
> > >     would be to actually validate them on the Runner side. This could be
> > >     done by changing the deserialization code of PipelineOptions which so
> > >     far ignores unknown JSON options.
> > >
> > >     See: PipelineOptionsTranslation.fromProto(Struct protoOptions)
> > >
> > >     Actually, this wouldn't work for user-defined PipelineOptions because
> > >     they might not be known to the Runner (if they are defined in Python).
> > >
> > >
> > >     2) Introduce a Runner-Option Flag
> > >     =================================
> > >
> > >     In this approach we would try to add as many pipeline options for a
> > >     Runner to the SDK, but allow additional Runner options to be passed
> > >     using the `--runner-option=key=val` flag. The Runner, like in 1), would
> > >     have to ensure validation. I think this has been the most favored
> > >     way so far. Going forward, that means that `--parallelism=4` and
> > >     `--runner-option=parallelism=4` will have the same effect for the Flink
> > >     Runner.
> > >
> > >
> > >     3) Implement Fetching of Options from JobServer
> > >     ===============================================
> > >
> > >     The options are retrieved from the JobServer before submitting the
> > >     pipeline. I think this would be ideal but, as mentioned before, it
> > >     increases the complexity for implementing new SDKs and might overall
> > >     just not be worth the effort.
> > >
> > >
> > >     What do you think? I'd implement 2) for the next release, unless there
> > >     are advocates for a different approach.
> > >
> > >     Cheers,
> > >     Max
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Robert Bradshaw <ro...@google.com>.

There's two questions here:

(A) What do we do in the short term?

I think adding every runner option to every SDK is not sustainable
(n*m work, assuming every SDK knows about every runner), and having a
patchwork of options that were added as one-offs to SDKs is not
desirable either. Furthermore, it seems difficult to parse unknown
options as if they were valid options, so my preference here would be
to just use a special runner_option flag. (One could also pass a set
of unparsed/unvalidated runner options to the runner, even if they're
not distinguished for the user, and runners (or any intermediates)
could run a "promote" operation that promotes any of these unknowns
that they recognize to real options before further processing. The
parsing would be done as repeated-string, and not be intermingled with
the actually validated options. This is essentially a variant of option 1.)
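
A minimal sketch of the repeated-string flag and the "promote" step
described above, assuming a hypothetical `--runner_option` flag whose
values are "key=value" strings. The names parse_args and promote, and
the `known` converter table, are illustrative and not an existing Beam
API:

```python
import argparse


def parse_args(argv):
    """Parse known SDK options; keep runner options as raw repeated strings."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--job_name", default="job")  # a validated SDK option
    # Hypothetical flag: each occurrence appends one "key=value" string,
    # so runner options never mix with the validated options.
    parser.add_argument("--runner_option", action="append", default=[])
    return parser.parse_args(argv)


def promote(runner_options, known):
    """Runner side: promote recognized "key=value" strings to typed options.

    known maps option name -> converter. Unrecognized entries are returned
    untouched so a downstream runner can try to promote them itself.
    """
    promoted, remaining = {}, []
    for item in runner_options:
        key, sep, value = item.partition("=")
        if sep and key in known:
            promoted[key] = known[key](value)
        else:
            remaining.append(item)
    return promoted, remaining
```

For example, a runner that knows `parallelism` as an integer would
promote `--runner_option=parallelism=4` to a real option and leave any
other entries in the list for whatever it delegates to.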

(B) What do we do in the long term? While the JobServer approach sounds
nice, I think it introduces a lot of complexity (we have too much of
that already) and still doesn't completely solve the problem. In
particular, it changes the flow from

1. Parse options
2. Construct pipeline
3. Choose runner
4. Submit job to runner

to

1. Parse options
2. Construct pipeline
3. Choose runner
4a. Query runner for option specs
4b. Re-parse options
4c. Submit job to runner

In particular, doing 4b in the SDK rather than just letting the runner
itself do the validation as part of (4) doesn't save much and forces
us to come up with a (probably incomplete) spec as to how to define
options, their types, and their validations. It also means that a
delegating runner must choose and interact with its downstream
runner(s) synchronously, else we haven't actually solved the issue.

For these reasons, I don't think we even want to go with the JobServer
approach in the long term, which has bearing on (A).

- Robert


On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <mx...@apache.org> wrote:
>
> +1
>
> If the preferred approach is to eventually have the JobServer serve the
> options, then the best intermediate solution is to replicate common
> options in the SDKs.
>
> If we went down the "--runner_option" path, we would end up with
> multiple ways of specifying the same options. We would eventually have
> to deprecate "runner options" once we have the JobServer approach. I'd
> like to avoid that.
>
> For the upcoming release we can revert the changes again and add the
> most common missing options to the SDKs. Then hopefully we should have
> fetching implemented for the release after.
>
> Do you think that is feasible?
>
> Thanks,
> Max
>
> On 30.10.18 23:00, Lukasz Cwik wrote:
> > I still like #3 the most, just can't devote the time to get it done.
> >
> > Instead of going with a fully implemented #3, we could hardcode a
> > subset of options and types within each SDK until the job server is
> > ready to provide this information and then migrate to the "full" list.
> > This would be an easy path for SDKs to take on. They could "know" of a
> > few well known options, and if they want to support all options, they
> > implement the integration with the job server.
> >
> > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels <mxm@apache.org
> > <ma...@apache.org>> wrote:
> >
> >      > I would prefer we don't introduce a (quirky) way of passing
> >     unknown options that forces users to type JSON into the command line
> >     (or similar acrobatics)
> >     Same here, the JSON approach seems technically nice but too bulky
> >     for users.
> >
> >      > To someone wanting to run a pipeline, all options are equally
> >     important, whether they are application specific, SDK specific or
> >     runner specific.
> >
> >     I'm also reluctant to force users to use `--runner_option=` because the
> >     division into "Runner" options and other options seems rather arbitrary
> >     to users. Most built-in options are also Runner-related.
> >
> >      > It should be possible to *optionally* qualify/scope (to cover
> >     cases where there is ambiguity), but otherwise I prefer the format
> >     we currently have.
> >
> >     Yes, namespacing is a problem. What happens if the user defines a
> >     custom PipelineOption which clashes with one of the builtin ones?
> >     If both are set, which one is actually applied?
> >
> >
> > Note that PipelineOptions so far has been treating name equality to mean
> > option equality and the Java implementation has a bunch of strict checks
> > to make sure that default values aren't used for duplicate definitions,
> > they have the same type, etc...
> > With 1), you fail the job if the runner can't understand your option
> > because it's not represented the same way. User then needs to fix-up
> > their declaration of the option name.
> > With 2), there are no name conflicts, the SDK will need to validate that
> > the option isn't set in both formats and error out if it is before
> > pipeline submission time.
> > With 3), you can prefetch all the options and error out to the user
> > during argument parsing time.
> >
> >
> >
> >     Here is a summary of the possible paths going forward:
> >
> >
> >     1) Validate PipelineOptions at Runner side
> >     ==========================================
> >
> >     The main issue raised here was that we want to move away from parsing
> >     arguments which look like options without validating them. An easy fix
> >     would be to actually validate them on the Runner side. This could be
> >     done by changing the deserialization code of PipelineOptions which so
> >     far ignores unknown JSON options.
> >
> >     See: PipelineOptionsTranslation.fromProto(Struct protoOptions)
> >
> >     Actually, this wouldn't work for user-defined PipelineOptions because
> >     they might not be known to the Runner (if they are defined in Python).
> >
> >
> >     2) Introduce a Runner-Option Flag
> >     =================================
> >
> >     In this approach we would try to add as many pipeline options for a
> >     Runner to the SDK, but allow additional Runner options to be passed
> >     using the `--runner-option=key=val` flag. The Runner, like in 1), would
> >     have to ensure validation. I think this has been the most favored
> >     way so far. Going forward, that means that `--parallelism=4` and
> >     `--runner-option=parallelism=4` will have the same effect for the Flink
> >     Runner.
> >
> >
> >     3) Implement Fetching of Options from JobServer
> >     ===============================================
> >
> >     The options are retrieved from the JobServer before submitting the
> >     pipeline. I think this would be ideal but, as mentioned before, it
> >     increases the complexity for implementing new SDKs and might overall
> >     just not be worth the effort.
> >
> >
> >     What do you think? I'd implement 2) for the next release, unless there
> >     are advocates for a different approach.
> >
> >     Cheers,
> >     Max

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Maximilian Michels <mx...@apache.org>.

+1

If the preferred approach is to eventually have the JobServer serve the 
options, then the best intermediate solution is to replicate common 
options in the SDKs.

If we went down the "--runner_option" path, we would end up with 
multiple ways of specifying the same options. We would eventually have 
to deprecate "runner options" once we have the JobServer approach. I'd 
like to avoid that.
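
To illustrate the "multiple ways of specifying the same options" concern,
here is a minimal Python sketch, not the actual SDK code. The function name
parse_options is hypothetical, and treating a repeated key as an error is an
assumption made for this sketch; the thread leaves the handling of repeated
options open.

```python
import argparse


def parse_options(argv):
    parser = argparse.ArgumentParser()
    # An option the SDK already knows about (replicated from the runner).
    parser.add_argument("--parallelism", type=int)
    # Repeatable catch-all flag; each value is a "key=val" string.
    parser.add_argument("--runner_option", action="append", default=[])
    args = parser.parse_args(argv)

    options = {k: v for k, v in vars(args).items()
               if k != "runner_option" and v is not None}
    for entry in args.runner_option:
        key, sep, value = entry.partition("=")
        if not sep:
            raise ValueError("expected key=val, got %r" % entry)
        if key in options:
            # Same option given in both formats: refuse rather than guess.
            raise ValueError("option %r specified twice" % key)
        # Values stay strings; the runner is left to interpret them.
        options[key] = value
    return options
```

With this sketch, --parallelism=4 yields the integer 4 while
--runner_option=parallelism=4 yields the string "4", so the two spellings
are not quite equivalent until the runner validates and converts the value.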

For the upcoming release we can revert the changes again and add the 
most common missing options to the SDKs. Then hopefully we should have 
fetching implemented for the release after.

Do you think that is feasible?

Thanks,
Max

On 30.10.18 23:00, Lukasz Cwik wrote:
> I still like #3 the most, just can't devote the time to get it done.
> 
> Instead of going with a fully implemented #3, we could hardcode a
> subset of options and types within each SDK until the job server is 
> ready to provide this information and then migrate to the "full" list. 
> This would be an easy path for SDKs to take on. They could "know" of a 
> few well known options, and if they want to support all options, they 
> implement the integration with the job server.
> 
> On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels <mxm@apache.org 
> <ma...@apache.org>> wrote:
> 
>      > I would prefer we don't introduce a (quirky) way of passing
>     unknown options that forces users to type JSON into the command line
>     (or similar acrobatics)
>     Same here, the JSON approach seems technically nice but too bulky
>     for users.
> 
>      > To someone wanting to run a pipeline, all options are equally
>     important, whether they are application specific, SDK specific or
>     runner specific.
> 
>     I'm also reluctant to force users to use `--runner_option=` because the
>     division into "Runner" options and other options seems rather arbitrary
>     to users. Most built-in options are also Runner-related.
> 
>      > It should be possible to *optionally* qualify/scope (to cover
>     cases where there is ambiguity), but otherwise I prefer the format
>     we currently have.
> 
>     Yes, namespacing is a problem. What happens if the user defines a
>     custom PipelineOption which clashes with one of the builtin ones?
>     If both are set, which one is actually applied?
> 
> 
> Note that PipelineOptions so far has been treating name equality to mean 
> option equality and the Java implementation has a bunch of strict checks 
> to make sure that default values aren't used for duplicate definitions, 
> they have the same type, etc...
> With 1), you fail the job if the runner can't understand your option 
> because it's not represented the same way. User then needs to fix-up
> their declaration of the option name.
> With 2), there are no name conflicts, the SDK will need to validate that 
> the option isn't set in both formats and error out if it is before 
> pipeline submission time.
> With 3), you can prefetch all the options and error out to the user 
> during argument parsing time.
> 
> 
> 
>     Here is a summary of the possible paths going forward:
> 
> 
>     1) Validate PipelineOptions at Runner side
>     ==========================================
> 
>     The main issue raised here was that we want to move away from parsing
>     arguments which look like options without validating them. An easy fix
>     would be to actually validate them on the Runner side. This could be
>     done by changing the deserialization code of PipelineOptions which so
>     far ignores unknown JSON options.
> 
>     See: PipelineOptionsTranslation.fromProto(Struct protoOptions)
> 
>     Actually, this wouldn't work for user-defined PipelineOptions because
>     they might not be known to the Runner (if they are defined in Python).
> 
> 
>     2) Introduce a Runner-Option Flag
>     =================================
> 
>     In this approach we would try to add as many pipeline options for a
>     Runner to the SDK, but allow additional Runner options to be passed
>     using the `--runner-option=key=val` flag. The Runner, like in 1), would
>     have to ensure validation. I think this has been the most favored
>     way so far. Going forward, that means that `--parallelism=4` and
>     `--runner-option=parallelism=4` will have the same effect for the Flink
>     Runner.
> 
> 
>     3) Implement Fetching of Options from JobServer
>     ===============================================
> 
>     The options are retrieved from the JobServer before submitting the
>     pipeline. I think this would be ideal but, as mentioned before, it
>     increases the complexity for implementing new SDKs and might overall
>     just not be worth the effort.
> 
> 
>     What do you think? I'd implement 2) for the next release, unless there
>     are advocates for a different approach.
> 
>     Cheers,
>     Max
> 
>     On 25.10.18 21:19, Thomas Weise wrote:
>      > Reminder that this is something we ideally address before the next
>      > release...
>      >
>      > Considering the discussion so far, my preference is that we get away
>      > from unknown options and discover valid options from the runner (by
>      > expanding the job service).
>      >
>      > Once the SDK is aware of all valid options, it is possible to
>     provide
>      > meaningful feedback to the user (validate or help), and correctly
>     handle
>      > scopes and types.
>      >
>      > I would prefer we don't introduce a (quirky) way of passing unknown
>      > options that forces users to type JSON into the command line (or
>     similar
>      > acrobatics). To someone wanting to run a pipeline, all options are
>      > equally important, whether they are application specific, SDK
>     specific
>      > or runner specific. It should be possible to *optionally*
>     qualify/scope
>      > (to cover cases where there is ambiguity), but otherwise I prefer
>     the
>      > format we currently have.
>      >
>      > Regarding type inference: Correct handling of numeric types
>     matters, see
>      > following issue with protobuf (not JSON):
>      > https://issues.apache.org/jira/browse/BEAM-5509
>      >
>      > Thomas
>      >
>      >
>      > On Thu, Oct 18, 2018 at 6:55 AM Robert Bradshaw
>     <robertwb@google.com <ma...@google.com>
>      > <mailto:robertwb@google.com <ma...@google.com>>> wrote:
>      >
>      >     On Wed, Oct 17, 2018 at 11:35 PM Lukasz Cwik
>     <lcwik@google.com <ma...@google.com>
>      >     <mailto:lcwik@google.com <ma...@google.com>>> wrote:
>      >
>      >
>      >         On Tue, Oct 16, 2018 at 11:51 AM Robert Bradshaw
>      >         <robertwb@google.com <ma...@google.com>
>     <mailto:robertwb@google.com <ma...@google.com>>> wrote:
>      >
>      >             On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik
>      >             <lcwik@google.com <ma...@google.com>
>     <mailto:lcwik@google.com <ma...@google.com>>> wrote:
>      >              >
>      >              > For all unknown options, the SDK can require that all
>      >             flag values be specified explicitly as a valid JSON type.
>      >              > starts with { -> object
>      >              > starts with [ -> list
>      >              > starts with " -> string
>      >              > is null / true / false -> null / true / false
>      >              > otherwise is number.
>      >              >
>      >              > This isn't great for strings but works well for
>     all the
>      >             other types.
>      >              >
>      >              > Thus for known options, the additional typing
>     information
>      >             would disambiguate whether something should be a
>      >             string/boolean/number/object/list but for unknown
>     options we
>      >             would expect the user to use valid JSON explicitly
>     and write:
>      >              > --foo={"object": "value"}
>      >              > --foo=["value", "value2"]
>      >              > --foo="string value"
>      >
>      >             Due to shell escaping, one would have to write
>      >
>      >             --foo=\"string value\"
>      >
>      >             or actually, due to the space
>      >
>      >             --foo='"string value"'
>      >
>      >             or some other variation on that, which is really
>      >             unfortunate. (The JSON list/objects would need similar
>      >             quoting, but that's less surprising.) Also, does this
>     mean
>      >             we'd only have one kind of number (not integer vs. float,
>      >             i.e. --parallelism=5.0 works)? I suppose that is JSON.
>      >
>      >
>      >         Yes, I was suspecting that users would need to type the
>      >         second variant as \"...\" I found more burdensome than '"..."'
>      >
>      >
>      >              > --foo=3.5 --foo=-4
>      >              > --foo=true --foo=false
>      >              > --foo=null
>      >              > This also works if the flag is repeated, so --foo=3.5
>      >             --foo=-4 is [3.5, -4]
>      >
>      >             The thing that sparked this discussion was what to do
>     when
>      >             unknown foo is repeated, but only one value given.
>      >
>      >
>      >         If the person only specifies one value, then they have to
>      >         disambiguate and put it in a list, only if they specify more
>      >         than one value will they have to turn it into a list.
>      >
>      >         I believe we could come up with other schemes on how to
>     convert
>      >         unknown options to JSON where we prefer strings over
>     non-string
>      >         types like null/true/false/numbers/list/object and
>     require the
>      >         user to escape out of the string default but anything that is
>      >         too different from strict JSON would cause headaches when
>      >         attempting to explain the format to users. I think a happy
>      >         middle ground would be that we will only require escaping for
>      >         strings which are ambiguous, so things like true, null,
>     false,
>      >         ... to be treated as strings would require the user to
>     escape them.
>      >
>      >
>      >     I'd prefer to avoid inferring the type of an unknown argument
>     based
>      >     on its contents, which can lead to surprises. We could
>     declare every
>      >     unknown type to be repeated string, and let any
>     parsing/validation
>      >     occur on the runner. If desired, we could pass these around as a
>      >     single "runner options" dict that runners could inspect and
>     use to
>      >     populate the actual dict rather than mixing parsed and unparsed
>      >     options.
>      >
>      >
>      >
>      >              > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise
>      >             <thw@apache.org <ma...@apache.org>
>     <mailto:thw@apache.org <ma...@apache.org>>> wrote:
>      >              >>
>      >              >> Discovering options from the job server seems
>     preferable
>      >             over replicating runner options in SDKs.
>      >              >>
>      >              >> Runners evolve on their own, and with portability the
>      >             SDK does not need to know anything about the runner.
>      >              >>
>      >              >> Regarding --runner-option. It is true that this looks
>      >             less user friendly. On the other hand it eliminates the
>      >             possibility of name collisions.
>      >              >>
>      >              >> But if options are discovered, the SDK can
>     perform full
>      >             validation. It would only be necessary to use explicit
>      >             scoping when there is ambiguity.
>      >              >>
>      >              >> Thomas
>      >              >>
>      >              >>
>      >              >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels
>      >             <mxm@apache.org <ma...@apache.org>
>     <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
>      >              >>>
>      >              >>> Fetching options directly from the Runner's
>     JobServer
>      >             seems like the
>      >              >>> ideal solution. I agree with Robert that it creates
>      >             additional
>      >              >>> complexity for SDK authors, so the `--runner-option`
>      >             flag would be an
>      >              >>> easy and explicit way to specify additional
>     Runner options.
>      >              >>>
>      >              >>> The format I prefer would be:
>     --runner_option=key1=val1
>      >              >>> --runner_option=key2=val2
>      >              >>>
>      >              >>> Now, from the perspective of end users, I think
>     it is
>      >             neither convenient
>      >              >>> nor reasonable to require the use of the
>      >             `--runner-option` flag. To the
>      >              >>> user it seems nebulous why some pipeline options
>     live
>      >             in the top-level
>      >              >>> option namespace while others need to be nested
>     within
>      >             an option. This
>      >              >>> is amplified by there being two Runners the user
>     needs
>      >             to be aware of,
>      >              >>> i.e. PortableRunner and the actual Runner
>      >             (Dataflow/Flink/Spark..).
>      >              >>>
>      >              >>> I feel like we would eventually replicate all
>     options
>      >             in the SDK because
>      >              >>> otherwise users have to use the
>     `--runner-option`, but
>      >             at least we can
>      >              >>> specify options which have not been replicated yet.
>      >              >>>
>      >              >>> -Max
>      >              >>>
>      >              >>> On 16.10.18 10:27, Robert Bradshaw wrote:
>      >              >>> > Yes, we don't know how to parse and/or
>     validate it.
>      >              >>> >
>      >              >>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik
>      >             <lcwik@google.com <ma...@google.com>
>     <mailto:lcwik@google.com <ma...@google.com>>
>      >              >>> > <mailto:lcwik@google.com
>     <ma...@google.com> <mailto:lcwik@google.com
>     <ma...@google.com>>>>
>      >             wrote:
>      >              >>> >
>      >              >>> >     I see, is the issue that we currently are
>     using a
>      >             JSON
>      >              >>> >     representation for options when being
>     serialized
>      >             and when we get
>      >              >>> >     some unknown option, we don't know how to
>     convert
>      >             it into its JSON form?
>      >              >>> >
>      >              >>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert
>     Bradshaw
>      >             <robertwb@google.com <ma...@google.com>
>     <mailto:robertwb@google.com <ma...@google.com>>
>      >              >>> >     <mailto:robertwb@google.com
>     <ma...@google.com>
>      >             <mailto:robertwb@google.com
>     <ma...@google.com>>>> wrote:
>      >              >>> >
>      >              >>> >         On Mon, Oct 15, 2018 at 11:30 PM
>     Lukasz Cwik
>      >             <lcwik@google.com <ma...@google.com>
>     <mailto:lcwik@google.com <ma...@google.com>>
>      >              >>> >         <mailto:lcwik@google.com
>     <ma...@google.com>
>      >             <mailto:lcwik@google.com <ma...@google.com>>>>
>     wrote:
>      >              >>> >          >
>      >              >>> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert
>      >             Bradshaw
>      >              >>> >         <robertwb@google.com
>     <ma...@google.com>
>      >             <mailto:robertwb@google.com
>     <ma...@google.com>> <mailto:robertwb@google.com
>     <ma...@google.com>
>      >             <mailto:robertwb@google.com
>     <ma...@google.com>>>> wrote:
>      >              >>> >          >>
>      >              >>> >          >> On Mon, Oct 15, 2018 at 7:50 PM
>     Lukasz Cwik
>      >              >>> >         <lcwik@google.com
>     <ma...@google.com> <mailto:lcwik@google.com
>     <ma...@google.com>>
>      >             <mailto:lcwik@google.com <ma...@google.com>
>     <mailto:lcwik@google.com <ma...@google.com>>>> wrote:
>      >              >>> >          >> >
>      >              >>> >          >> > I agree with the sentiment for
>     better
>      >             error checking.
>      >              >>> >          >> >
>      >              >>> >          >> > We can try to make it such that
>     the SDK
>      >             can "fetch" the
>      >              >>> >         set of options that the runner supports by
>      >             making a call to the
>      >              >>> >         Job API. The API could return a list of
>      >             option names
>      >              >>> >         (descriptions for --help purposes and also
>      >             potentially the
>      >              >>> >         expected format) which would remove
>     the worry
>      >             around "unknown"
>      >              >>> >         options. Yes I understand to be able
>     to make
>      >             the Job API call,
>      >              >>> >         we may need to parse some options from the
>      >             args parameters first
>      >              >>> >         and then parse the unknown options
>     after they
>      >             are fetched.
>      >              >>> >          >>
>      >              >>> >          >> This is an interesting idea, but
>     seems it
>      >             could get quite
>      >              >>> >         complicated.
>      >              >>> >          >> E.g. for delegating runners, one would
>      >             first read the options to
>      >              >>> >          >> determine which runner to fetch the
>      >             options from, which
>      >              >>> >         would then
>      >              >>> >          >> return a set of options that possibly
>      >             depends on the values
>      >              >>> >         of some of
>      >              >>> >          >> its options...
>      >              >>> >          >>
>      >              >>> >          >> > Alternatively, we can choose an
>      >             explicit format upfront.
>      >              >>> >          >> > To expand on the exact format for
>      >             --runner_option=...,
>      >              >>> >         here are some different ideas:
>      >              >>> >          >> > 1) Specified multiple times,
>     each one
>      >             is an explicit flag
>      >              >>> >          >> > --runner_option=--blah=bar
>      >             --runner_option=--foo=baz1
>      >              >>> >         --runner_option=--foo=baz2
>      >              >>> >          >>
>      >              >>> >          >> I'm -1 on this format. We should move
>      >             away from the idea
>      >              >>> >         that options
>      >              >>> >          >> == flags (as that doesn't compose well
>      >             with other libraries
>      >              >>> >         that do
>      >              >>> >          >> their own flags parsing). The
>     ability to
>      >             parse a set of
>      >              >>> >         flags into
>      >              >>> >          >> options is just a convenience that an
>      >             author may (or may
>      >              >>> >         not) choose
>      >              >>> >          >> to use (e.g. when running pipelines in a
>      >             long-lived process like a
>      >              >>> >          >> service or a notebook, the command
>     line
>      >             flags are almost
>      >              >>> >         certainly not
>      >              >>> >          >> the right interface).
>      >              >>> >          >>
>      >              >>> >          >> > 2) specified multiple times, we drop
>      >             the explicit flag
>      >              >>> >          >> > --runner_option=blah=bar
>      >             --runner_option=foo=baz1
>      >              >>> >         --runner_option=foo=baz2
>      >              >>> >          >>
>      >              >>> >          >> This or (4) is my preference.
>      >              >>> >          >>
>      >              >>> >          >> > 3) we use a string which the
>     runner can
>      >             choose to
>      >              >>> >         interpret however they want (JSON/XML
>     shown
>      >             below)
>      >              >>> >          >> > --runner_option='{"blah": "bar",
>     "foo":
>      >             ["baz1", "baz2"]}'
>      >              >>> >          >> >
>      >              >>> >
>      >           
>       --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>      >              >>> >          >>
>      >              >>> >          >> This would make validation hard.
>     Also, I
>      >             think it makes
>      >              >>> >         sense for some
>      >              >>> >          >> runner options to be "shared"
>      >             ("parallelism") by convention,
>      >              >>> >         so letting
>      >              >>> >          >> it be a free-form string wouldn't
>     allow
>      >             different runners to
>      >              >>> >         inspect
>      >              >>> >          >> different bits.
>      >              >>> >          >>
>      >              >>> >          >> We should consider if we should
>     use urns
>      >             for namespacing, and
>      >              >>> >          >> assigning semantic meaning to
>     strings, here.
>      >              >>> >          >>
>      >              >>> >          >> > 4) we use a string which must be a
>      >             specific format such as
>      >              >>> >         JSON (allows the SDK to do simple
>     validation):
>      >              >>> >          >> > --runner_option='{"blah": "bar",
>     "foo":
>      >             ["baz1", "baz2"]}'
>      >              >>> >          >>
>      >              >>> >          >> I like this in that at least some
>      >             validation can be
>      >              >>> >         performed, and
>      >              >>> >          >> expectations of how to format richer
>      >             types. On the other
>      >              >>> >         hand it gets
>      >              >>> >          >> a bit verbose, given that most (I'd
>      >             imagine) options will be
>      >              >>> >         simple.
>      >              >>> >          >> As with normal options,
>      >              >>> >          >>
>      >              >>> >          >>     --option1=value1 --option2=value2
>      >              >>> >          >>
>      >              >>> >          >> is shorthand for {"option1": value1,
>      >             "option2": value2}.
>      >              >>> >          >>
>      >              >>> >          > I lean to 4 the most. With 2, you
>     run into
>      >             issues of what
>      >              >>> >         does --runner_option=foo=["a", "b"]
>      >             --runner_option=foo=["c",
>      >              >>> >         "d"] mean?
>      >              >>> >          > Is it an error or list of lists or
>      >             concatenated. Similar
>      >              >>> >         issues for map types represented via JSON
>      >             object {...}
>      >              >>> >
>      >              >>> >         We can err to be on the safe side
>      >             unless/until an argument can
>      >              >>> >         be made
>      >              >>> >         that merging is more natural. I just think
>      >             this will be excessively
>      >              >>> >         verbose to use.
>      >              >>> >
>      >              >>> >          >> > I would strongly suggest that we go
>      >             with the "fetch"
>      >              >>> >         approach, since this makes the set of
>     options
>      >             discoverable and
>      >              >>> >         helps users find errors much earlier
>     in their
>      >             pipeline.
>      >              >>> >          >>
>      >              >>> >          >> This seems like an advanced
>     feature that
>      >             SDKs may want to
>      >              >>> >         support, but
>      >              >>> >          >> I wouldn't want to require this
>      >             complexity for bootstrapping
>      >              >>> >         an SDK.
>      >              >>> >          >>
>      >              >>> >          > SDKs that are starting off wouldn't
>     need
>      >             to "fetch" options,
>      >              >>> >         they could choose to not support runner
>      >             options or they could
>      >              >>> >         choose to pass all options through to the
>      >             runner blindly.
>      >              >>> >         Fetching the options only provides the SDK
>      >             the ability to
>      >              >>> >         provide error checking upfront and useful
>      >             error/help messages.
>      >              >>> >
>      >              >>> >         But how to even pass all options through
>      >             blindly is exactly the
>      >              >>> >         difficulty we're running into here.
>      >              >>> >
>      >              >>> >          >> Regarding always keeping runner
>     options
>      >             separate, +1, though
>      >              >>> >         I'm not
>      >              >>> >          >> sure the line is always clear.
>      >              >>> >          >>
>      >              >>> >          >>
>      >              >>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM
>     Robert
>      >             Bradshaw
>      >              >>> >         <robertwb@google.com
>     <ma...@google.com>
>      >             <mailto:robertwb@google.com
>     <ma...@google.com>> <mailto:robertwb@google.com
>     <ma...@google.com>
>      >             <mailto:robertwb@google.com
>     <ma...@google.com>>>> wrote:
>      >              >>> >          >> >>
>      >              >>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM
>      >             Maximilian Michels
>      >              >>> >         <mxm@apache.org
>     <ma...@apache.org> <mailto:mxm@apache.org <ma...@apache.org>>
>      >             <mailto:mxm@apache.org <ma...@apache.org>
>     <mailto:mxm@apache.org <ma...@apache.org>>>> wrote:
>      >              >>> >          >> >> >
>      >              >>> >          >> >> > I agree that the current approach
>      >             breaks the pipeline
>      >              >>> >         options contract
>      >              >>> >          >> >> > because "unknown" options get
>     parsed
>      >             in the same way as
>      >              >>> >         options which
>      >              >>> >          >> >> > have been defined by the user.
>      >              >>> >          >> >>
>      >              >>> >          >> >> FWIW, I think we're already
>     breaking
>      >             this "contract."
>      >              >>> >         Unknown options
>      >              >>> >          >> >> are silently ignored; with this
>     change
>      >             we just change how
>      >              >>> >         we record
>      >              >>> >          >> >> them. It still feels a bit
>     hacky though.
>      >              >>> >          >> >>
>      >              >>> >          >> >> > I'm not sure the
>     `experiments` flag
>      >             works for us. AFAIK
>      >              >>> >         it only allows
>      >              >>> >          >> >> > true/false flags. We want to pass
>      >             all types of pipeline
>      >              >>> >         options to the
>      >              >>> >          >> >> > Runner.
>      >              >>> >          >> >>
>      >              >>> >          >> >> Experiments is an arbitrary set of
>      >             strings, which can be
>      >              >>> >         of the form
>      >              >>> >          >> >> "param=value" if that's useful.
>      >             (Dataflow does this.)
>      >              >>> >         There is, again,
>      >              >>> >          >> >> no namespacing on the param
>     names, but
>      >             we could use URNs
>      >              >>> >         or impose
>      >              >>> >          >> >> some other structure here.
>      >              >>> >          >> >>
>      >              >>> >          >> >> > How to solve this?
>      >              >>> >          >> >> >
>      >              >>> >          >> >> > 1) Add all options of all
>     Runners to
>      >             each SDK
>      >              >>> >          >> >> > We added some of the FlinkRunner
>      >             options to the Python
>      >              >>> >         SDK but realized
>      >              >>> >          >> >> > syncing is rather cumbersome
>     in the
>      >             long term. However,
>      >              >>> >         we want the most
>      >              >>> >          >> >> > important options to be
>     validated on
>      >             the client side.
>      >              >>> >          >> >>
>      >              >>> >          >> >> I don't think this is
>     sustainable in
>      >             the long run.
>      >              >>> >         However, thinking
>      >              >>> >          >> >> about this, in the worst case
>      >             validation happens after
>      >              >>> >         construction
>      >              >>> >          >> >> but before execution (as with
>     much of
>      >             our other
>      >              >>> >         validation) so it
>      >              >>> >          >> >> isn't that bad.
>      >              >>> >          >> >>
>      >              >>> >          >> >> > 2) Pass "unknown" options via a
>      >             separate list in the
>      >              >>> >         Proto which can
>      >              >>> >          >> >> > only be accessed internally
>     by the
>      >             Runners. This still
>      >              >>> >         allows passing
>      >              >>> >          >> >> > arbitrary options but we wouldn't
>      >             leak unknown options
>      >              >>> >         and display them
>      >              >>> >          >> >> > as top-level options.
>      >              >>> >          >> >>
>      >              >>> >          >> >> I think there needs to be a way for
>      >             the user to
>      >              >>> >         communicate values
>      >              >>> >          >> >> directly to the runner
>     regardless of
>      >             the SDK. My
>      >              >>> >         preference would be
>      >              >>> >          >> >> to make this explicit, e.g.
>     (repeated)
>      >              >>> >         --runner_option=..., rather
>      >              >>> >          >> >> than scooping up all unknown
>     flags at
>      >             command line
>      >              >>> >         parsing time.
>      >              >>> >          >> >> Perhaps an SDK that is aware of
>     some
>      >             runners could choose
>      >              >>> >         to lift
>      >              >>> >          >> >> these as top-level options, but
>     still
>      >             pass them as runner
>      >              >>> >         options.
>      >              >>> >          >> >>
>      >              >>> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
>      >              >>> >          >> >> > > The current release branch
>      >              >>> >          >> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after the
>      >              >>> >          >> >> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a
>      >              >>> >          >> >> > > revert of the revert.  Regarding your comment above, I can help out with
>      >              >>> >          >> >> > > the design / PR reviews for common Python code as you suggest.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org> wrote:
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >     Thanks, will tag you and
>      >             looking forward to
>      >              >>> >         feedback so we can
>      >              >>> >          >> >> > >     ensure that changes
>     work for
>      >             everyone.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >     Looking at the PR, I see
>      >             agreement from Max to
>      >              >>> >         revert the change on
>      >              >>> >          >> >> > >     the release branch, but
>     not in
>      >             master. Would you
>      >              >>> >         mind restoring it
>      >              >>> >          >> >> > >     in master?
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >     Thanks
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com> wrote:
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@google.com> wrote:
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >             What I mean is
>     that a
>      >             user may find that
>      >              >>> >         it works for them
>      >              >>> >          >> >> > >             to pass
>     "--myarg blah"
>      >             and access it as
>      >              >>> >         "options.myarg"
>      >              >>> >          >> >> > >             without explicitly
>      >             defining a "my_arg"
>      >              >>> >         flag due to the added
>      >              >>> >          >> >> > >             logic.  This is not
>      >             the intended behavior
>      >              >>> >         and we may want to
>      >              >>> >          >> >> > >             change this
>      >             implementation detail in the
>      >              >>> >         future.  However,
>      >              >>> >          >> >> > >             having this
>     logic in a
>      >             released version
>      >              >>> >         makes it hard to
>      >              >>> >          >> >> > >             change this
>     behavior
>      >             since users may
>      >              >>> >         erroneously depend on
>      >              >>> >          >> >> > >             this undocumented
>      >             behavior.  Instead, we
>      >              >>> >         should namespace /
>      >              >>> >          >> >> > >             scope this so
>     that it
>      >             is obvious that
>      >              >>> >         this is meant for
>      >              >>> >          >> >> > >             runner (and not
>     Beam
>      >             user) consumption.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise <thw@apache.org> wrote:
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                 Can you please
>      >             elaborate more on what
>      >              >>> >         practical problems
>      >              >>> >          >> >> > >                 this introduces
>      >             for users?
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                 I can see that
>      >             this change allows a
>      >              >>> >         user to specify a
>      >              >>> >          >> >> > >                 runner specific
>      >             option, which in the
>      >              >>> >         future may change
>      >              >>> >          >> >> > >                 because we
>     decide
>      >             to scope
>      >              >>> >         differently. If this only
>      >              >>> >          >> >> > >                 affects
>     users of
>      >             the portable Flink
>      >              >>> >         runner (like us),
>      >              >>> >          >> >> > >                 then no need to
>      >             revert, because at
>      >              >>> >         this early stage we
>      >              >>> >          >> >> > >                 prefer
>     something
>      >             that works over
>      >              >>> >         being blocked.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                 It would
>     also be
>      >             really great if some
>      >              >>> >         of the core Python
>      >              >>> >          >> >> > >                 SDK developers
>      >             could help out with
>      >              >>> >         the design aspects
>      >              >>> >          >> >> > >                 and PR
>     reviews of
>      >             changes that affect
>      >              >>> >         common Python
>      >              >>> >          >> >> > >                 code.
>     Anyone who
>      >             specifically wants
>      >              >>> >         to be tagged on
>      >              >>> >          >> >> > >                 relevant
>     JIRAs and
>      >             PRs?
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >         I would be happy to be
>      >             tagged, and I can also
>      >              >>> >         help with
>      >              >>> >          >> >> > >         including other
>     relevant
>      >             folks whenever
>      >              >>> >         possible. In general I
>      >              >>> >          >> >> > >         think Robert, Charles,
>      >             and myself are good
>      >              >>> >         candidates.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                 Thanks
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay <altay@google.com> wrote:
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen <ccy@google.com> wrote:
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                         For
>      >             context, I made comments on
>      >              >>> >          >> >> > >
>      > https://github.com/apache/beam/pull/6600 noting
>      >              >>> >          >> >> > >                        
>     that the
>      >             changes being made
>      >              >>> >         were not good for
>      >              >>> >          >> >> > >                         Beam
>      >              >>> >         backwards-compatibility.  The change as is
>      >              >>> >          >> >> > >                         allows
>      >             users to use pipeline
>      >              >>> >         options without
>      >              >>> >          >> >> > >                        
>     explicitly
>      >             defining them,
>      >              >>> >         which is not the type
>      >              >>> >          >> >> > >                         of
>     usage
>      >             we would like to
>      >              >>> >         encourage since we
>      >              >>> >          >> >> > >                        
>     prefer to
>      >             be explicit
>      >              >>> >         whenever possible.  If
>      >              >>> >          >> >> > >                         users
>      >             write pipelines with
>      >              >>> >         this sort of pattern,
>      >              >>> >          >> >> > >                        
>     they will
>      >             potentially
>      >              >>> >         encounter pain when
>      >              >>> >          >> >> > >                        
>     upgrading
>      >             to a later version
>      >              >>> >         since this is an
>      >              >>> >          >> >> > >
>      >             implementation detail and not
>      >              >>> >         an officially
>      >              >>> >          >> >> > >                        
>     supported
>      >             pattern.  I agree
>      >              >>> >         with the comments
>      >              >>> >          >> >> > >                        
>     above that
>      >             this is ultimately
>      >              >>> >         a scoping issue.
>      >              >>> >          >> >> > >                         I would
>      >             not have a problem
>      >              >>> >         with these changes if
>      >              >>> >          >> >> > >                        
>     they were
>      >             explicitly scoped
>      >              >>> >         under either a
>      >              >>> >          >> >> > >                        
>     runner or
>      >             unparsed options
>      >              >>> >         namespace.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                         As a
>      >             second note, since the
>      >              >>> >         2.8.0 release is
>      >              >>> >          >> >> > >                        
>     being cut
>      >             right now, because
>      >              >>> >         of these
>      >              >>> >          >> >> > >
>      >             backwards-compatibility
>      >              >>> >         concerns, I would
>      >              >>> >          >> >> > >                         suggest
>      >             reverting these
>      >              >>> >         changes, at least until
>      >              >>> >          >> >> > >                        
>     2.8.0 is
>      >             cut, so we can have
>      >              >>> >         a discussion here
>      >              >>> >          >> >> > >                         before
>      >             committing to and
>      >              >>> >         releasing any API-level
>      >              >>> >          >> >> > >                        
>     changes.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                     +1 I would
>      >             like to revert the
>      >              >>> >         changes in order not
>      >              >>> >          >> >> > >                     to rush
>     this into
>      >             the release. Once
>      >              >>> >         this discussion
>      >              >>> >          >> >> > >                     results
>     in an
>      >             agreement changes
>      >              >>> >         can be brought back.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde <herohde@google.com> wrote:
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                            
>     Agree
>      >             that pipeline
>      >              >>> >         options lack some
>      >              >>> >          >> >> > >
>      >             mechanism for scoping. It
>      >              >>> >         is also not always
>      >              >>> >          >> >> > >
>      >             possible to distinguish
>      >              >>> >         options meant to be
>      >              >>> >          >> >> > >
>      >             consumed at pipeline
>      >              >>> >         construction time, by
>      >              >>> >          >> >> > >                             the
>      >             runner, by the SDK
>      >              >>> >         harness, by the user
>      >              >>> >          >> >> > >                            
>     code
>      >             or any combination
>      >              >>> >         -- and this causes
>      >              >>> >          >> >> > >
>      >             confusion every now and then.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                             For
>      >             Dataflow, we have
>      >              >>> >         been using
>      >              >>> >          >> >> > >
>      >             "experiments" for
>      >              >>> >         arbitrary runner-specific
>      >              >>> >          >> >> > >
>      >             options. It's simply a
>      >              >>> >         string list pipeline
>      >              >>> >          >> >> > >                            
>     option
>      >             that all SDKs
>      >              >>> >         support and, for Go at
>      >              >>> >          >> >> > >                            
>     least,
>      >             is sent to
>      >              >>> >         portable runners. Flink
>      >              >>> >          >> >> > >                            
>     can do
>      >             the same in the
>      >              >>> >         short term to move
>      >              >>> >          >> >> > >                            
>     forward.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                            
>     Henning
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <thw@apache.org> wrote:
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >             [moving to the list]
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >             The requirement
>      >              >>> >         driving this part of the
>      >              >>> >          >> >> > >
>      >             change was to allow a
>      >              >>> >         user to specify
>      >              >>> >          >> >> > >
>      >             pipeline options that
>      >              >>> >         a runner supports
>      >              >>> >          >> >> > >
>      >             without having to
>      >              >>> >         declare those in each
>      >              >>> >          >> >> > >
>      >             language SDK.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                            
>          In
>      >             the specific
>      >              >>> >         scenario, we have
>      >              >>> >          >> >> > >
>      >             options that the
>      >              >>> >         Flink runner supports
>      >              >>> >          >> >> > >
>      >             (and can validate),
>      >              >>> >         that are not
>      >              >>> >          >> >> > >
>      >             enumerated in the
>      >              >>> >         Python SDK.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >                            
>          I
>      >             think we have a
>      >              >>> >         bigger problem scoping
>      >              >>> >          >> >> > >
>      >             pipeline options. For
>      >              >>> >         example, the
>      >              >>> >          >> >> > >
>      >             runner options are
>      >              >>> >         dumped into the SDK
>      >              >>> >          >> >> > >
>      >             worker. There is also
>      >              >>> >         a possibility of
>      >              >>> >          >> >> > >
>      >             name collisions. So I
>      >              >>> >         think this would
>      >              >>> >          >> >> > >
>      >             benefit from broader
>      >              >>> >         feedback.
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >             Thanks,
>      >              >>> >          >> >> > >                            
>          Thomas
>      >              >>> >          >> >> > >
>      >              >>> >          >> >> > >
>      >              >>> >
>      >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Lukasz Cwik <lc...@google.com>.

I still like #3 the most, just can't devote the time to get it done.

Instead of going with a fully implemented #3, we could hardcode a
subset of options and types within each SDK until the job server is ready
to provide this information and then migrate to the "full" list. This would
be an easy path for SDKs to take on. They could "know" of a few well known
options, and if they want to support all options, they implement the
integration with the job server.
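
A minimal sketch of that interim path, in Python. Everything here is an
assumption made for illustration: the WELL_KNOWN_RUNNER_OPTIONS table, the
helper names, and the shape of the list a job server might return are
hypothetical and not an existing Beam API. Only `parallelism` is taken from
this thread.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

OptionParser = Callable[[str], object]

# Hypothetical hardcoded subset the SDK "knows" until the job server can
# supply the full list. Only "parallelism" is taken from this thread.
WELL_KNOWN_RUNNER_OPTIONS: Dict[str, OptionParser] = {
    "parallelism": int,
}


def known_options(
    fetched: Optional[Dict[str, OptionParser]] = None,
) -> Dict[str, OptionParser]:
    """Prefer the list fetched from the job server; else use the subset."""
    if fetched is not None:
        return dict(fetched)
    return dict(WELL_KNOWN_RUNNER_OPTIONS)


def parse_options(
    pairs: Iterable[Tuple[str, str]],
    known: Dict[str, OptionParser],
) -> Dict[str, object]:
    """Convert raw (name, value) pairs, rejecting names that are not known."""
    parsed: Dict[str, object] = {}
    for name, raw in pairs:
        if name not in known:
            raise ValueError("unknown option: --%s" % name)
        parsed[name] = known[name](raw)
    return parsed
```

An SDK that never implements the job server integration would always call
known_options() with no argument; one that does would pass the fetched
mapping and get validation for every option the runner reports.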

On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels <mx...@apache.org> wrote:

> > I would prefer we don't introduce a (quirky) way of passing unknown
> options that forces users to type JSON into the command line (or similar
> acrobatics)
> Same here, the JSON approach seems technically nice but too bulky for
> users.
>
> > To someone wanting to run a pipeline, all options are equally important,
> whether they are application specific, SDK specific or runner specific.
>
> I'm also reluctant to force users to use `--runner_option=` because the
> division into "Runner" options and other options seems rather arbitrary
> to users. Most built-in options are also Runner-related.
>
> > It should be possible to *optionally* qualify/scope (to cover cases
> where there is ambiguity), but otherwise I prefer the format we currently
> have.
>
> Yes, namespacing is a problem. What happens if the user defines a custom
> PipelineOption which clashes with one of the builtin ones? If both are
> set, which one is actually applied?


Note that PipelineOptions so far has been treating name equality to mean
option equality, and the Java implementation has a bunch of strict checks to
make sure that default values aren't used for duplicate definitions, that
they have the same type, etc.
With 1), you fail the job if the runner can't understand your option
because it's not represented the same way. The user then needs to fix up
their declaration of the option name.
With 2), there are no name conflicts, the SDK will need to validate that
the option isn't set in both formats and error out if it is before pipeline
submission time.
With 3), you can prefetch all the options and error out to the user during
argument parsing time.
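
To make the check in 2) concrete, here is a rough Python sketch using the
standard argparse module. It assumes the repeated
`--runner_option=key=val` form discussed in this thread; the function name
split_options and the choice to collect repeated keys into lists are
illustrative assumptions, not the behavior of any SDK.

```python
import argparse
from typing import Dict, List, Tuple


def split_options(argv: List[str]) -> Tuple[Dict[str, object], Dict[str, List[str]]]:
    """Parse argv and fail if an option is set in both formats."""
    parser = argparse.ArgumentParser()
    # One option the SDK declares itself (name taken from this thread).
    parser.add_argument("--parallelism", type=int, default=None)
    # Repeated flag carrying options the SDK does not declare.
    parser.add_argument("--runner_option", action="append", default=[])
    args = parser.parse_args(argv)

    runner_options: Dict[str, List[str]] = {}
    for item in args.runner_option:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError("expected key=val, got %r" % item)
        # Values stay unparsed strings; the runner validates them.
        runner_options.setdefault(key, []).append(value)

    top_level = {
        name: value
        for name, value in vars(args).items()
        if name != "runner_option" and value is not None
    }

    clashes = sorted(set(top_level) & set(runner_options))
    if clashes:
        raise ValueError("option set in both formats: %s" % ", ".join(clashes))
    return top_level, runner_options
```

With this shape, `--parallelism=4 --runner_option=parallelism=4` is rejected
before submission, while `--runner_option=foo=baz1 --runner_option=foo=baz2`
yields a list of two string values for the runner to interpret.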


>
> Here is a summary of the possible paths going forward:
>
>
> 1) Validate PipelineOptions at Runner side
> ==========================================
>
> The main issue raised here was that we want to move away from parsing
> arguments which look like options without validating them. An easy fix
> would be to actually validate them on the Runner side. This could be
> done by changing the deserialization code of PipelineOptions which so
> far ignores unknown JSON options.
>
> See: PipelineOptionsTranslation.fromProto(Struct protoOptions)
>
> Actually, this wouldn't work for user-defined PipelineOptions because
> they might not be known to the Runner (if they are defined in Python).
>
>
> 2) Introduce a Runner-Option Flag
> =================================
>
> In this approach we would try to add as many pipeline options for a
> Runner to the SDK, but allow additional Runner options to be passed
> using the `--runner-option=key=val` flag. The Runner, like in 1), would
> have to ensure validation. I think this has been the most favored way so
> far. Going forward, that means that `--parallelism=4` and
> `--runner-option=parallelism=4` will have the same effect for the Flink
> Runner.
>
>
> 3) Implement Fetching of Options from JobServer
> ===============================================
>
> The options are retrieved from the JobServer before submitting the
> pipeline. I think this would be ideal but, as mentioned before, it
> increases the complexity for implementing new SDKs and might overall
> just not be worth the effort.
>
>
> What do you think? I'd implement 2) for the next release, unless there
> are advocates for a different approach.
>
> Cheers,
> Max
>
> On 25.10.18 21:19, Thomas Weise wrote:
> > Reminder that this is something we ideally address before the next
> > release...
> >
> > Considering the discussion so far, my preference is that we get away
> > from unknown options and discover valid options from the runner (by
> > expanding the job service).
> >
> > Once the SDK is aware of all valid options, it is possible to provide
> > meaningful feedback to the user (validate or help), and correctly handle
> > scopes and types.
> >
> > I would prefer we don't introduce a (quirky) way of passing unknown
> > options that forces users to type JSON into the command line (or similar
> > acrobatics). To someone wanting to run a pipeline, all options are
> > equally important, whether they are application specific, SDK specific
> > or runner specific. It should be possible to *optionally* qualify/scope
> > (to cover cases where there is ambiguity), but otherwise I prefer the
> > format we currently have.
> >
> > Regarding type inference: Correct handling of numeric types matters, see
> > following issue with protobuf (not JSON):
> > https://issues.apache.org/jira/browse/BEAM-5509
> >
> > Thomas
> >
> >
> > On Thu, Oct 18, 2018 at 6:55 AM Robert Bradshaw <robertwb@google.com
> > <ma...@google.com>> wrote:
> >
> >     On Wed, Oct 17, 2018 at 11:35 PM Lukasz Cwik <lcwik@google.com
> >     <ma...@google.com>> wrote:
> >
> >
> >         On Tue, Oct 16, 2018 at 11:51 AM Robert Bradshaw
> >         <robertwb@google.com <ma...@google.com>> wrote:
> >
> >             On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik
> >             <lcwik@google.com <ma...@google.com>> wrote:
> >              >
> >              > For all unknown options, the SDK can require that all
> >             flag values be specified explicitly as a valid JSON type.
> >              > starts with { -> object
> >              > starts with [ -> list
> >              > starts with " -> string
> >              > is null / true / false -> null / true / false
> >              > otherwise is number.
> >              >
> >              > This isn't great for strings but works well for all the
> >             other types.
> >              >
> >              > Thus for known options, the additional typing information
> >             would disambiguate whether something should be a
> >             string/boolean/number/object/list but for unknown options we
> >             would expect the user to use valid JSON explicitly and write:
> >              > --foo={"object": "value"}
> >              > --foo=["value", "value2"]
> >              > --foo="string value"
> >
> >             Due to shell escaping, one would have to write
> >
> >             --foo=\"string value\"
> >
> >             or actually, due to the space
> >
> >             --foo='"string value"'
> >
> >             or some other variation on that, which is really
> >             unfortunate. (The JSON list/objects would need similar
> >             quoting, but that's less surprising.) Also, does this mean
> >             we'd only have one kind of number (not integer vs. float,
> >             i.e. --parallelism=5.0 works)? I suppose that is JSON.
> >
> >
> >         Yes, I was suspecting that users would need to type the second
> >         variant, as I found \"...\" more burdensome than '"..."'.
> >
> >
> >              > --foo=3.5 --foo=-4
> >              > --foo=true --foo=false
> >              > --foo=null
> >              > This also works if the flag is repeated, so --foo=3.5
> >             --foo=-4 is [3.5, -4]
> >
> >             The thing that sparked this discussion was what to do when
> >             unknown foo is repeated, but only one value given.
> >
> >
> >         If the person only specifies one value, then they have to
> >         disambiguate and put it in a list, only if they specify more
> >         than one value will they have to turn it into a list.
> >
> >         I believe we could come up with other schemes on how to convert
> >         unknown options to JSON where we prefer strings over non-string
> >         types like null/true/false/numbers/list/object and require the
> >         user to escape out of the string default but anything that is
> >         too different from strict JSON would cause headaches when
> >         attempting to explain the format to users. I think a happy
> >         middle ground would be that we will only require escaping for
> >         strings which are ambiguous, so things like true, null, false,
> >         ... to be treated as strings would require the user to escape
> >         them.
> >
> >
> >     I'd prefer to avoid inferring the type of an unknown argument based
> >     on its contents, which can lead to surprises. We could declare every
> >     unknown type to be repeated string, and let any parsing/validation
> >     occur on the runner. If desired, we could pass these around as a
> >     single "runner options" dict that runners could inspect and use to
> >     populate the actual dict rather than mixing parsed and unparsed
> >     options.
> >
> >
> >
> >              > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise
> >             <thw@apache.org <ma...@apache.org>> wrote:
> >              >>
> >              >> Discovering options from the job server seems preferable
> >             over replicating runner options in SDKs.
> >              >>
> >              >> Runners evolve on their own, and with portability the
> >             SDK does not need to know anything about the runner.
> >              >>
> >              >> Regarding --runner-option. It is true that this looks
> >             less user friendly. On the other hand it eliminates the
> >             possibility of name collisions.
> >              >>
> >              >> But if options are discovered, the SDK can perform full
> >             validation. It would only be necessary to use explicit
> >             scoping when there is ambiguity.
> >              >>
> >              >> Thomas
> >              >>
> >              >>
> >              >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels
> >             <mxm@apache.org <ma...@apache.org>> wrote:
> >              >>>
> >              >>> Fetching options directly from the Runner's JobServer
> >             seems like the
> >              >>> ideal solution. I agree with Robert that it creates
> >             additional
> >              >>> complexity for SDK authors, so the `--runner-option`
> >             flag would be an
> >              >>> easy and explicit way to specify additional Runner
> >              >>> options.
> >              >>>
> >              >>> The format I prefer would be: --runner_option=key1=val1
> >              >>> --runner_option=key2=val2
> >              >>>
> >              >>> Now, from the perspective of end users, I think it is
> >             neither convenient
> >              >>> nor reasonable to require the use of the
> >             `--runner-option` flag. To the
> >              >>> user it seems nebulous why some pipeline options live
> >             in the top-level
> >              >>> option namespace while others need to be nested within
> >             an option. This
> >              >>> is amplified by there being two Runners the user needs
> >             to be aware of,
> >              >>> i.e. PortableRunner and the actual Runner
> >             (Dataflow/Flink/Spark..).
> >              >>>
> >              >>> I feel like we would eventually replicate all options
> >             in the SDK because
> >              >>> otherwise users have to use the `--runner-option`, but
> >             at least we can
> >              >>> specify options which have not been replicated yet.
> >              >>>
> >              >>> -Max
> >              >>>
> >              >>> On 16.10.18 10:27, Robert Bradshaw wrote:
> >              >>> > Yes, we don't know how to parse and/or validate it.
> >              >>> >
> >              >>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik
> >             <lcwik@google.com <ma...@google.com>
> >              >>> > <mailto:lcwik@google.com <ma...@google.com>>>
> >             wrote:
> >              >>> >
> >              >>> >     I see, is the issue that we currently are using a
> >             JSON
> >              >>> >     representation for options when being serialized
> >             and when we get
> >              >>> >     some unknown option, we don't know how to convert
> >             it into its JSON form?
> >              >>> >
> >              >>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw
> >             <robertwb@google.com <ma...@google.com>
> >              >>> >     <mailto:robertwb@google.com
> >             <ma...@google.com>>> wrote:
> >              >>> >
> >              >>> >         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik
> >             <lcwik@google.com <ma...@google.com>
> >              >>> >         <mailto:lcwik@google.com
> >             <ma...@google.com>>> wrote:
> >              >>> >          >
> >              >>> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert
> >             Bradshaw
> >              >>> >         <robertwb@google.com
> >             <ma...@google.com> <mailto:robertwb@google.com
> >             <ma...@google.com>>> wrote:
> >              >>> >          >>
> >              >>> >          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik
> >              >>> >         <lcwik@google.com <ma...@google.com>
> >             <mailto:lcwik@google.com <ma...@google.com>>> wrote:
> >              >>> >          >> >
> >              >>> >          >> > I agree with the sentiment for better
> >             error checking.
> >              >>> >          >> >
> >              >>> >          >> > We can try to make it such that the SDK
> >             can "fetch" the
> >              >>> >         set of options that the runner supports by
> >             making a call to the
> >              >>> >         Job API. The API could return a list of
> >             option names
> >              >>> >         (descriptions for --help purposes and also
> >             potentially the
> >              >>> >         expected format) which would remove the worry
> >             around "unknown"
> >              >>> >         options. Yes I understand to be able to make
> >             the Job API call,
> >              >>> >         we may need to parse some options from the
> >             args parameters first
> >              >>> >         and then parse the unknown options after they
> >             are fetched.
> >              >>> >          >>
> >              >>> >          >> This is an interesting idea, but seems it
> >             could get quite
> >              >>> >         complicated.
> >              >>> >          >> E.g. for delegating runners, one would
> >             first read the options to
> >              >>> >          >> determine which runner to fetch the
> >             options from, which
> >              >>> >         would then
> >              >>> >          >> return a set of options that possibly
> >             depends on the values
> >              >>> >         of some of
> >              >>> >          >> its options...
> >              >>> >          >>
> >              >>> >          >> > Alternatively, we can choose an
> >             explicit format upfront.
> >              >>> >          >> > To expand on the exact format for
> >             --runner_option=...,
> >              >>> >         here are some different ideas:
> >              >>> >          >> > 1) Specified multiple times, each one
> >             is an explicit flag
> >              >>> >          >> > --runner_option=--blah=bar
> >             --runner_option=--foo=baz1
> >              >>> >         --runner_option=--foo=baz2
> >              >>> >          >>
> >              >>> >          >> I'm -1 on this format. We should move
> >             away from the idea
> >              >>> >         that options
> >              >>> >          >> == flags (as that doesn't compose well
> >             with other libraries
> >              >>> >         that do
> >              >>> >          >> their own flags parsing). The ability to
> >             parse a set of
> >              >>> >         flags into
> >              >>> >          >> options is just a convenience that an
> >             author may (or may
> >              >>> >         not) choose
> >              >>> >          >> to use (e.g. when running pipelines in a
> >             long-lived process like a
> >              >>> >          >> service or a notebook, the command line
> >             flags are almost
> >              >>> >         certainly not
> >              >>> >          >> the right interface).
> >              >>> >          >>
> >              >>> >          >> > 2) specified multiple times, we drop
> >             the explicit flag
> >              >>> >          >> > --runner_option=blah=bar
> >             --runner_option=foo=baz1
> >              >>> >         --runner_option=foo=baz2
> >              >>> >          >>
> >              >>> >          >> This or (4) is my preference.
> >              >>> >          >>
> >              >>> >          >> > 3) we use a string which the runner can
> >             choose to
> >              >>> >         interpret however they want (JSON/XML shown
> >             below)
> >              >>> >          >> > --runner_option='{"blah": "bar", "foo":
> >             ["baz1", "baz2"]}'
> >              >>> >          >> >
> >              >>> >          >> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
> >              >>> >          >>
> >              >>> >          >> This would make validation hard. Also, I
> >             think it makes
> >              >>> >         sense for some
> >              >>> >          >> runner options to be "shared"
> >             ("parallelism") by convention,
> >              >>> >         so letting
> >              >>> >          >> it be a free-form string wouldn't allow
> >             different runners to
> >              >>> >         inspect
> >              >>> >          >> different bits.
> >              >>> >          >>
> >              >>> >          >> We should consider if we should use urns
> >             for namespacing, and
> >              >>> >          >> assigning semantic meaning to strings, here.
> >              >>> >          >>
> >              >>> >          >> > 4) we use a string which must be a
> >             specific format such as
> >              >>> >         JSON (allows the SDK to do simple validation):
> >              >>> >          >> > --runner_option='{"blah": "bar", "foo":
> >             ["baz1", "baz2"]}'
> >              >>> >          >>
> >              >>> >          >> I like this in that at least some
> >             validation can be
> >              >>> >         performed, and
> >              >>> >          >> expectations of how to format richer
> >             types. On the other
> >              >>> >         hand it gets
> >              >>> >          >> a bit verbose, given that most (I'd
> >             imagine) options will be
> >              >>> >         simple.
> >              >>> >          >> As with normal options,
> >              >>> >          >>
> >              >>> >          >>     --option1=value1 --option2=value2
> >              >>> >          >>
> >              >>> >          >> is shorthand for {"option1": value1,
> >             "option2": value2}.
> >              >>> >          >>
> >              >>> >          > I lean to 4 the most. With 2, you run into
> >             issues of what
> >              >>> >         does --runner_option=foo=["a", "b"]
> >             --runner_option=foo=["c",
> >              >>> >         "d"] mean?
> >              >>> >          > Is it an error or list of lists or
> >             concatenated. Similar
> >              >>> >         issues for map types represented via JSON
> >             object {...}
> >              >>> >
> >              >>> >         We can err to be on the safe side
> >             unless/until an argument can
> >              >>> >         be made
> >              >>> >         that merging is more natural. I just think
> >             this will be excessively
> >              >>> >         verbose to use.
> >              >>> >
> >              >>> >          >> > I would strongly suggest that we go
> >             with the "fetch"
> >              >>> >         approach, since this makes the set of options
> >             discoverable and
> >              >>> >         helps users find errors much earlier in their
> >             pipeline.
> >              >>> >          >>
> >              >>> >          >> This seems like an advanced feature that
> >             SDKs may want to
> >              >>> >         support, but
> >              >>> >          >> I wouldn't want to require this
> >             complexity for bootstrapping
> >              >>> >         an SDK.
> >              >>> >          >>
> >              >>> >          > SDKs that are starting off wouldn't need
> >             to "fetch" options,
> >              >>> >         they could choose to not support runner
> >             options or they could
> >              >>> >         choose to pass all options through to the
> >             runner blindly.
> >              >>> >         Fetching the options only provides the SDK
> >             the ability to
> >              >>> >         provide error checking upfront and useful
> >             error/help messages.
> >              >>> >
> >              >>> >         But how to even pass all options through
> >             blindly is exactly the
> >              >>> >         difficulty we're running into here.
> >              >>> >
> >              >>> >          >> Regarding always keeping runner options
> >             separate, +1, though
> >              >>> >         I'm not
> >              >>> >          >> sure the line is always clear.
> >              >>> >          >>
> >              >>> >          >>
> >              >>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert
> >             Bradshaw
> >              >>> >         <robertwb@google.com
> >             <ma...@google.com> <mailto:robertwb@google.com
> >             <ma...@google.com>>> wrote:
> >              >>> >          >> >>
> >              >>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM
> >             Maximilian Michels
> >              >>> >         <mxm@apache.org <ma...@apache.org>
> >             <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
> >              >>> >          >> >> >
> >              >>> >          >> >> > I agree that the current approach
> >             breaks the pipeline
> >              >>> >         options contract
> >              >>> >          >> >> > because "unknown" options get parsed
> >             in the same way as
> >              >>> >         options which
> >              >>> >          >> >> > have been defined by the user.
> >              >>> >          >> >>
> >              >>> >          >> >> FWIW, I think we're already breaking
> >             this "contract."
> >              >>> >         Unknown options
> >              >>> >          >> >> are silently ignored; with this change
> >             we just change how
> >              >>> >         we record
> >              >>> >          >> >> them. It still feels a bit hacky though.
> >              >>> >          >> >>
> >              >>> >          >> >> > I'm not sure the `experiments` flag
> >             works for us. AFAIK
> >              >>> >         it only allows
> >              >>> >          >> >> > true/false flags. We want to pass
> >             all types of pipeline
> >              >>> >         options to the
> >              >>> >          >> >> > Runner.
> >              >>> >          >> >>
> >              >>> >          >> >> Experiments is an arbitrary set of
> >             strings, which can be
> >              >>> >         of the form
> >              >>> >          >> >> "param=value" if that's useful.
> >             (Dataflow does this.)
> >              >>> >         There is, again,
> >              >>> >          >> >> no namespacing on the param names, but
> >             we could use URNs
> >              >>> >         or impose
> >              >>> >          >> >> some other structure here.
> >              >>> >          >> >>
> >              >>> >          >> >> > How to solve this?
> >              >>> >          >> >> >
> >              >>> >          >> >> > 1) Add all options of all Runners to
> >             each SDK
> >              >>> >          >> >> > We added some of the FlinkRunner
> >             options to the Python
> >              >>> >         SDK but realized
> >              >>> >          >> >> > syncing is rather cumbersome in the
> >             long term. However,
> >              >>> >         we want the most
> >              >>> >          >> >> > important options to be validated on
> >             the client side.
> >              >>> >          >> >>
> >              >>> >          >> >> I don't think this is sustainable in
> >             the long run.
> >              >>> >         However, thinking
> >              >>> >          >> >> about this, in the worst case
> >             validation happens after
> >              >>> >         construction
> >              >>> >          >> >> but before execution (as with much of
> >             our other
> >              >>> >         validation) so it
> >              >>> >          >> >> isn't that bad.
> >              >>> >          >> >>
> >              >>> >          >> >> > 2) Pass "unknown" options via a
> >             separate list in the
> >              >>> >         Proto which can
> >              >>> >          >> >> > only be accessed internally by the
> >             Runners. This still
> >              >>> >         allows passing
> >              >>> >          >> >> > arbitrary options but we wouldn't
> >             leak unknown options
> >              >>> >         and display them
> >              >>> >          >> >> > as top-level options.
> >              >>> >          >> >>
> >              >>> >          >> >> I think there needs to be a way for
> >             the user to
> >              >>> >         communicate values
> >              >>> >          >> >> directly to the runner regardless of
> >             the SDK. My
> >              >>> >         preference would be
> >              >>> >          >> >> to make this explicit, e.g. (repeated)
> >              >>> >         --runner_option=..., rather
> >              >>> >          >> >> than scooping up all unknown flags at
> >             command line
> >              >>> >         parsing time.
> >              >>> >          >> >> Perhaps an SDK that is aware of some
> >             runners could choose
> >              >>> >         to lift
> >              >>> >          >> >> these as top-level options, but still
> >             pass them as runner
> >              >>> >         options.
> >              >>> >          >> >>
> >              >>> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
> >              >>> >          >> >> > > The current release branch
> >              >>> >          >> >> > > (https://github.com/apache/beam/commits/release-2.8.0)
> >              >>> >          >> >> > > was cut after the revert went in.  Sent out
> >              >>> >          >> >> > > https://github.com/apache/beam/pull/6683 as a
> >              >>> >          >> >> > > revert of the revert.  Regarding
> >             your comment above,
> >              >>> >         I can help out with
> >              >>> >          >> >> > > the design / PR reviews for common
> >             Python code as you
> >              >>> >         suggest.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM
> >             Thomas Weise
> >              >>> >         <thw@apache.org <ma...@apache.org>
> >             <mailto:thw@apache.org <ma...@apache.org>>
> >              >>> >          >> >> > > <mailto:thw@apache.org
> >             <ma...@apache.org> <mailto:thw@apache.org
> >             <ma...@apache.org>>>> wrote:
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >     Thanks, will tag you and
> >             looking forward to
> >              >>> >         feedback so we can
> >              >>> >          >> >> > >     ensure that changes work for
> >             everyone.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >     Looking at the PR, I see
> >             agreement from Max to
> >              >>> >         revert the change on
> >              >>> >          >> >> > >     the release branch, but not in
> >             master. Would you
> >              >>> >         mind to restore it
> >              >>> >          >> >> > >     in master?
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >     Thanks
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40
> >             PM Ahmet Altay
> >              >>> >         <altay@google.com <ma...@google.com>
> >             <mailto:altay@google.com <ma...@google.com>>
> >              >>> >          >> >> > >     <mailto:altay@google.com
> >             <ma...@google.com>
> >              >>> >         <mailto:altay@google.com
> >             <ma...@google.com>>>> wrote:
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >         On Fri, Oct 12, 2018 at
> >             11:31 AM, Charles
> >              >>> >         Chen <ccy@google.com <ma...@google.com>
> >             <mailto:ccy@google.com <ma...@google.com>>
> >              >>> >          >> >> > >         <mailto:ccy@google.com
> >             <ma...@google.com>
> >              >>> >         <mailto:ccy@google.com
> >             <ma...@google.com>>>> wrote:
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >             What I mean is that a
> >             user may find that
> >              >>> >         it works for them
> >              >>> >          >> >> > >             to pass "--myarg blah"
> >             and access it as
> >              >>> >         "options.myarg"
> >              >>> >          >> >> > >             without explicitly
> >             defining a "myarg"
> >              >>> >         flag due to the added
> >              >>> >          >> >> > >             logic.  This is not
> >             the intended behavior
> >              >>> >         and we may want to
> >              >>> >          >> >> > >             change this
> >             implementation detail in the
> >              >>> >         future.  However,
> >              >>> >          >> >> > >             having this logic in a
> >             released version
> >              >>> >         makes it hard to
> >              >>> >          >> >> > >             change this behavior
> >             since users may
> >              >>> >         erroneously depend on
> >              >>> >          >> >> > >             this undocumented
> >             behavior.  Instead, we
> >              >>> >         should namespace /
> >              >>> >          >> >> > >             scope this so that it
> >             is obvious that
> >              >>> >         this is meant for
> >              >>> >          >> >> > >             runner (and not Beam
> >             user) consumption.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise <thw@apache.org> wrote:
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                 Can you please elaborate more what practical problems
> >              >>> >          >> >> > >                 this introduces for users?
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                 I can see that this change allows a user to specify a
> >              >>> >          >> >> > >                 runner specific option, which in the future may change
> >              >>> >          >> >> > >                 because we decide to scope differently. If this only
> >              >>> >          >> >> > >                 affects users of the portable Flink runner (like us),
> >              >>> >          >> >> > >                 then no need to revert, because at this early stage we
> >              >>> >          >> >> > >                 prefer something that works over being blocked.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                 It would also be really great if some of the core Python
> >              >>> >          >> >> > >                 SDK developers could help out with the design aspects
> >              >>> >          >> >> > >                 and PR reviews of changes that affect common Python
> >              >>> >          >> >> > >                 code. Anyone who specifically wants to be tagged on
> >              >>> >          >> >> > >                 relevant JIRAs and PRs?
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >         I would be happy to be tagged, and I can also help with
> >              >>> >          >> >> > >         including other relevant folks whenever possible. In general I
> >              >>> >          >> >> > >         think Robert, Charles, myself are good candidates.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                 Thanks
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay <altay@google.com> wrote:
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen <ccy@google.com> wrote:
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                         For context, I made comments on
> >              >>> >          >> >> > >                         https://github.com/apache/beam/pull/6600 noting
> >              >>> >          >> >> > >                         that the changes being made were not good for
> >              >>> >          >> >> > >                         Beam backwards-compatibility.  The change as is
> >              >>> >          >> >> > >                         allows users to use pipeline options without
> >              >>> >          >> >> > >                         explicitly defining them, which is not the type
> >              >>> >          >> >> > >                         of usage we would like to encourage since we
> >              >>> >          >> >> > >                         prefer to be explicit whenever possible.  If
> >              >>> >          >> >> > >                         users write pipelines with this sort of pattern,
> >              >>> >          >> >> > >                         they will potentially encounter pain when
> >              >>> >          >> >> > >                         upgrading to a later version since this is an
> >              >>> >          >> >> > >                         implementation detail and not an officially
> >              >>> >          >> >> > >                         supported pattern.  I agree with the comments
> >              >>> >          >> >> > >                         above that this is ultimately a scoping issue.
> >              >>> >          >> >> > >                         I would not have a problem with these changes if
> >              >>> >          >> >> > >                         they were explicitly scoped under either a
> >              >>> >          >> >> > >                         runner or unparsed options namespace.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                         As a second note, since the 2.8.0 release is
> >              >>> >          >> >> > >                         being cut right now, because of these
> >              >>> >          >> >> > >                         backwards-compatibility concerns, I would
> >              >>> >          >> >> > >                         suggest reverting these changes, at least until
> >              >>> >          >> >> > >                         2.8.0 is cut, so we can have a discussion here
> >              >>> >          >> >> > >                         before committing to and releasing any API-level
> >              >>> >          >> >> > >                         changes.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                     +1 I would like to revert the changes in order not to
> >              >>> >          >> >> > >                     rush this into the release. Once this discussion
> >              >>> >          >> >> > >                     results in an agreement, changes can be brought back.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde <herohde@google.com> wrote:
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                             Agree that pipeline options lack some
> >              >>> >          >> >> > >                             mechanism for scoping. It is also not always
> >              >>> >          >> >> > >                             possible to distinguish options meant to be
> >              >>> >          >> >> > >                             consumed at pipeline construction time, by
> >              >>> >          >> >> > >                             the runner, by the SDK harness, by the user
> >              >>> >          >> >> > >                             code or any combination -- and this causes
> >              >>> >          >> >> > >                             confusion every now and then.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                             For Dataflow, we have been using
> >              >>> >          >> >> > >                             "experiments" for arbitrary runner-specific
> >              >>> >          >> >> > >                             options. It's simply a string list pipeline
> >              >>> >          >> >> > >                             option that all SDKs support and, for Go at
> >              >>> >          >> >> > >                             least, is sent to portable runners. Flink
> >              >>> >          >> >> > >                             can do the same in the short term to move
> >              >>> >          >> >> > >                             forward.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                             Henning
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <thw@apache.org> wrote:
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                                 [moving to the list]
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                                 The requirement driving this part of the
> >              >>> >          >> >> > >                                 change was to allow a user to specify
> >              >>> >          >> >> > >                                 pipeline options that a runner supports
> >              >>> >          >> >> > >                                 without having to declare those in each
> >              >>> >          >> >> > >                                 language SDK.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                                 In the specific scenario, we have
> >              >>> >          >> >> > >                                 options that the Flink runner supports
> >              >>> >          >> >> > >                                 (and can validate), that are not
> >              >>> >          >> >> > >                                 enumerated in the Python SDK.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                                 I think we have a bigger problem scoping
> >              >>> >          >> >> > >                                 pipeline options. For example, the
> >              >>> >          >> >> > >                                 runner options are dumped into the SDK
> >              >>> >          >> >> > >                                 worker. There is also a possibility of
> >              >>> >          >> >> > >                                 name collisions. So I think this would
> >              >>> >          >> >> > >                                 benefit from broader feedback.
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                                 Thanks,
> >              >>> >          >> >> > >                                 Thomas
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                                 ---------- Forwarded message ---------
> >              >>> >          >> >> > >                                 From: *Charles Chen* <notifications@github.com>
> >              >>> >          >> >> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
> >              >>> >          >> >> > >                                 Subject: Re: [apache/beam] [BEAM-5442] Store duplicate
> >              >>> >          >> >> > >                                 unknown options in a list argument (#6600)
> >              >>> >          >> >> > >                                 To: apache/beam <beam@noreply.github.com>
> >              >>> >          >> >> > >                                 Cc: Thomas Weise <thomas.weise@gmail.com>,
> >              >>> >          >> >> > >                                 Mention <mention@noreply.github.com>
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >                                 CC: @tweise <https://github.com/tweise>
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >          >> >> > >
> >              >>> >
> >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Maximilian Michels <mx...@apache.org>.

> I would prefer we don't introduce a (quirky) way of passing unknown options that forces users to type JSON into the command line (or similar acrobatics)

Same here, the JSON approach seems technically nice but too bulky for users.

> To someone wanting to run a pipeline, all options are equally important, whether they are application specific, SDK specific or runner specific.

I'm also reluctant to force users to use `--runner_option=` because the 
division into "Runner" options and other options seems rather arbitrary 
to users. Most built-in options are also Runner-related.

> It should be possible to *optionally* qualify/scope (to cover cases where there is ambiguity), but otherwise I prefer the format we currently have.  

Yes, namespacing is a problem. What happens if the user defines a custom 
PipelineOption which clashes with one of the builtin ones? If both are 
set, which one is actually applied?


Here is a summary of the possible paths going forward:


1) Validate PipelineOptions at Runner side
==========================================

The main issue raised here was that we want to move away from parsing 
arguments which look like options without validating them. An easy fix 
would be to actually validate them on the Runner side. This could be 
done by changing the deserialization code of PipelineOptions which so 
far ignores unknown JSON options.

See: PipelineOptionsTranslation.fromProto(Struct protoOptions)

Actually, this wouldn't work for user-defined PipelineOptions because 
they might not be known to the Runner (if they are defined in Python).
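
To make the limitation concrete, here is a minimal Python sketch of strict 
Runner-side validation. It is not the actual 
PipelineOptionsTranslation code (which is Java); the helper name and the 
option names are made up for illustration. It only shows that a strict 
check would also reject options the user legitimately defined in their 
own pipeline.

```python
def validate_on_runner(received, runner_known):
    """Reject any option name the runner does not recognise (hypothetical helper)."""
    unknown = sorted(set(received) - set(runner_known))
    if unknown:
        raise ValueError("unknown pipeline options: " + ", ".join(unknown))
    return dict(received)


# Hypothetical option names, for illustration only.
RUNNER_KNOWN = {"parallelism", "job_name"}

# Options the runner knows about pass through unchanged.
ok = validate_on_runner({"parallelism": 4}, RUNNER_KNOWN)

# A user-defined option declared only in the Python pipeline is rejected too,
# which is the problem with strict validation on the Runner side.
try:
    validate_on_runner({"parallelism": 4, "my_custom_option": "x"}, RUNNER_KNOWN)
    rejected = False
except ValueError:
    rejected = True
```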


2) Introduce a Runner-Option Flag
=================================

In this approach we would add as many of a Runner's pipeline options as 
possible to the SDK, but allow additional Runner options to be passed 
using the `--runner-option=key=val` flag. As in 1), the Runner would 
have to validate them. I think this has been the most favored way so 
far. Going forward, that means that `--parallelism=4` and 
`--runner-option=parallelism=4` will have the same effect for the Flink 
Runner.
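
A rough sketch of how an SDK could collect such a flag, using only 
Python's argparse. This is an assumption about one possible shape, not 
existing Beam code: the function name is hypothetical, the flag is 
spelled `--runner_option` as in the key=val format proposed earlier in 
the thread, and values are kept as repeated strings so that typing and 
validation stay with the Runner.

```python
import argparse


def parse_runner_options(argv):
    """Collect repeated --runner_option=key=val flags into {key: [val, ...]}."""
    # allow_abbrev=False keeps a flag such as --runner from being treated
    # as an abbreviation of --runner_option.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--runner_option", action="append", default=[],
                        metavar="KEY=VAL")
    known, _other_args = parser.parse_known_args(argv)

    runner_options = {}
    for item in known.runner_option:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError("expected KEY=VAL, got %r" % item)
        # Values stay unparsed strings; the Runner decides how to read them.
        runner_options.setdefault(key, []).append(value)
    return runner_options


example = parse_runner_options([
    "--runner=PortableRunner",
    "--runner_option=parallelism=4",
    "--runner_option=foo=baz1",
    "--runner_option=foo=baz2",
])
```

Keeping every value a list of strings sidesteps the question of what to 
do when an unknown option is repeated versus given once.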


3) Implement Fetching of Options from JobServer
===============================================

The options are retrieved from the JobServer before submitting the 
pipeline. I think this would be ideal but, as mentioned before, it 
increases the complexity for implementing new SDKs and might overall 
just not be worth the effort.
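
For comparison, a sketch of the two-phase parsing this would require in 
an SDK. Everything about the JobServer side is assumed: no such Job API 
call exists in this discussion, so `describe_options` below is a local 
stand-in that returns made-up option descriptors, and the default 
endpoint value is arbitrary.

```python
import argparse

_TYPES = {"int": int, "float": float, "str": str}


def describe_options(job_endpoint):
    """Stand-in for a hypothetical JobServer call returning option descriptors."""
    return [
        {"name": "parallelism", "type": "int", "description": "Job parallelism"},
        {"name": "checkpoint_dir", "type": "str", "description": "Checkpoint path"},
    ]


def parse_with_discovered_options(argv, describe=describe_options):
    # Phase 1: parse only what is needed to reach the JobServer.
    bootstrap = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    bootstrap.add_argument("--job_endpoint", default="localhost:8099")
    known, remaining = bootstrap.parse_known_args(argv)

    # Phase 2: build a parser from the discovered options and validate the
    # remaining arguments strictly, with proper types and --help text.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    for desc in describe(known.job_endpoint):
        parser.add_argument("--" + desc["name"], type=_TYPES[desc["type"]],
                            help=desc["description"])
    return known, parser.parse_args(remaining)


known, runner_opts = parse_with_discovered_options(
    ["--job_endpoint=host:1", "--parallelism=4"])
```

With this shape, an option the Runner does not advertise fails at parse 
time in the SDK instead of being passed along unvalidated.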


What do you think? I'd implement 2) for the next release, unless there 
are advocates for a different approach.

Cheers,
Max

On 25.10.18 21:19, Thomas Weise wrote:
> Reminder that this is something we ideally address before the next 
> release...
> 
> Considering the discussion so far, my preference is that we get away 
> from unknown options and discover valid options from the runner (by 
> expanding the job service).
> 
> Once the SDK is aware of all valid options, it is possible to provide 
> meaningful feedback to the user (validate or help), and correctly handle 
> scopes and types.
> 
> I would prefer we don't introduce a (quirky) way of passing unknown 
> options that forces users to type JSON into the command line (or similar 
> acrobatics). To someone wanting to run a pipeline, all options are 
> equally important, whether they are application specific, SDK specific 
> or runner specific. It should be possible to *optionally* qualify/scope 
> (to cover cases where there is ambiguity), but otherwise I prefer the 
> format we currently have.
> 
> Regarding type inference: Correct handling of numeric types matters, see 
> following issue with protobuf (not JSON): 
> https://issues.apache.org/jira/browse/BEAM-5509
> 
> Thomas
> 
> 
> On Thu, Oct 18, 2018 at 6:55 AM Robert Bradshaw <robertwb@google.com> wrote:
> 
>     On Wed, Oct 17, 2018 at 11:35 PM Lukasz Cwik <lcwik@google.com> wrote:
> 
> 
>         On Tue, Oct 16, 2018 at 11:51 AM Robert Bradshaw <robertwb@google.com> wrote:
> 
>             On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik <lcwik@google.com> wrote:
>              >
>              > For all unknown options, the SDK can require that all
>              > flag values be specified explicitly as a valid JSON type.
>              > starts with { -> object
>              > starts with [ -> list
>              > starts with " -> string
>              > is null / true / false -> null / true / false
>              > otherwise is number.
>              >
>              > This isn't great for strings but works well for all the
>              > other types.
>              >
>              > Thus for known options, the additional typing information
>              > would disambiguate whether something should be a
>              > string/boolean/number/object/list but for unknown options we
>              > would expect the user to use valid JSON explicitly and write:
>              > --foo={"object": "value"}
>              > --foo=["value", "value2"]
>              > --foo="string value"
> 
>             Due to shell escaping, one would have to write
> 
>             --foo=\"string value\"
> 
>             or actually, due to the space
> 
>             --foo='"string value"'
> 
>             or some other variation on that, which is really
>             unfortunate. (The JSON list/objects would need similar
>             quoting, but that's less surprising.) Also, does this mean
>             we'd only have one kind of number (not integer vs. float,
>             i.e. --parallelism=5.0 works)? I suppose that is JSON. 
> 
> 
>         Yes, I suspected that users would need to type the second
>         variant, since I found \"...\" more burdensome than '"..."'.
> 
> 
>              > --foo=3.5 --foo=-4
>              > --foo=true --foo=false
>              > --foo=null
>              > This also works if the flag is repeated, so --foo=3.5
>              > --foo=-4 is [3.5, -4]
> 
>             The thing that sparked this discussion was what to do when
>             unknown foo is repeated, but only one value given.
> 
> 
>         If the person only specifies one value, then they have to
>         disambiguate and put it in a list; only if they specify more
>         than one value will they have to turn it into a list.
> 
>         I believe we could come up with other schemes on how to convert
>         unknown options to JSON where we prefer strings over non-string
>         types like null/true/false/numbers/list/object and require the
>         user to escape out of the string default but anything that is
>         too different from strict JSON would cause headaches when
>         attempting to explain the format to users. I think a happy
>         middle ground would be that we will only require escaping for
>         strings which are ambiguous, so things like true, null, false,
>         ... to be treated as strings would require the user to escape them.
> 
> 
>     I'd prefer to avoid inferring the type of an unknown argument based
>     on its contents, which can lead to surprises. We could declare every
>     unknown type to be repeated string, and let any parsing/validation
>     occur on the runner. If desired, we could pass these around as a
>     single "runner options" dict that runners could inspect and use to
>     populate the actual dict rather than mixing parsed and unparsed
>     options.
> 
> 
> 
>              > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise <thw@apache.org> wrote:
>              >>
>              >> Discovering options from the job server seems preferable
>              >> over replicating runner options in SDKs.
>              >>
>              >> Runners evolve on their own, and with portability the
>              >> SDK does not need to know anything about the runner.
>              >>
>              >> Regarding --runner-option. It is true that this looks
>              >> less user friendly. On the other hand it eliminates the
>              >> possibility of name collisions.
>              >>
>              >> But if options are discovered, the SDK can perform full
>              >> validation. It would only be necessary to use explicit
>              >> scoping when there is ambiguity.
>              >>
>              >> Thomas
>              >>
>              >>
>              >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels <mxm@apache.org> wrote:
>              >>>
>              >>> Fetching options directly from the Runner's JobServer seems like the
>              >>> ideal solution. I agree with Robert that it creates additional
>              >>> complexity for SDK authors, so the `--runner-option` flag would be an
>              >>> easy and explicit way to specify additional Runner options.
>              >>>
>              >>> The format I prefer would be: --runner_option=key1=val1
>              >>> --runner_option=key2=val2
>              >>>
>              >>> Now, from the perspective of end users, I think it is neither convenient
>              >>> nor reasonable to require the use of the `--runner-option` flag. To the
>              >>> user it seems nebulous why some pipeline options live in the top-level
>              >>> option namespace while others need to be nested within an option. This
>              >>> is amplified by there being two Runners the user needs to be aware of,
>              >>> i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark..).
>              >>>
>              >>> I feel like we would eventually replicate all options in the SDK because
>              >>> otherwise users have to use the `--runner-option`, but at least we can
>              >>> specify options which have not been replicated yet.
>              >>>
>              >>> -Max
>              >>>
>              >>> On 16.10.18 10:27, Robert Bradshaw wrote:
>              >>> > Yes, we don't know how to parse and/or validate it.
>              >>> >
>              >>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lcwik@google.com> wrote:
>              >>> >
>              >>> >     I see, is the issue that we currently are using a JSON
>              >>> >     representation for options when being serialized and when we get
>              >>> >     some unknown option, we don't know how to convert it into its JSON form?
>              >>> >
>              >>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <robertwb@google.com> wrote:
>              >>> >
>              >>> >         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lcwik@google.com> wrote:
>              >>> >          >
>              >>> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <robertwb@google.com> wrote:
>              >>> >          >>
>              >>> >          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lcwik@google.com> wrote:
>              >>> >          >> >
>              >>> >          >> > I agree with the sentiment for better error checking.
>              >>> >          >> >
>              >>> >          >> > We can try to make it such that the SDK can "fetch" the
>              >>> >          >> > set of options that the runner supports by making a call to the
>              >>> >          >> > Job API. The API could return a list of option names
>              >>> >          >> > (descriptions for --help purposes and also potentially the
>              >>> >          >> > expected format) which would remove the worry around "unknown"
>              >>> >          >> > options. Yes I understand to be able to make the Job API call,
>              >>> >          >> > we may need to parse some options from the args parameters first
>              >>> >          >> > and then parse the unknown options after they are fetched.
>              >>> >          >>
>              >>> >          >> This is an interesting idea, but seems it could get quite complicated.
>              >>> >          >> E.g. for delegating runners, one would first read the options to
>              >>> >          >> determine which runner to fetch the options from, which would then
>              >>> >          >> return a set of options that possibly depends on the values of some of
>              >>> >          >> its options...
>              >>> >          >>
>              >>> >          >> > Alternatively, we can choose an
>             explicit format upfront.
>              >>> >          >> > To expand on the exact format for
>             --runner_option=...,
>              >>> >         here are some different ideas:
>              >>> >          >> > 1) Specified multiple times, each one
>             is an explicit flag
>              >>> >          >> > --runner_option=--blah=bar
>             --runner_option=--foo=baz1
>              >>> >         --runner_option=--foo=baz2
>              >>> >          >>
>              >>> >          >> I'm -1 on this format. We should move
>             away from the idea
>              >>> >         that options
>              >>> >          >> == flags (as that doesn't compose well
>             with other libraries
>              >>> >         that do
>              >>> >          >> their own flags parsing). The ability to
>             parse a set of
>              >>> >         flags into
>              >>> >          >> options is just a convenience that an
>             author may (or may
>              >>> >         not) choose
>              >>> >          >> to use (e.g. when running pipelines in a
>             long-lived process like a
>              >>> >          >> service or a notebook, the command line
>             flags are almost
>              >>> >         certainly not
>              >>> >          >> the right interface).
>              >>> >          >>
>              >>> >          >> > 2) specified multiple times, we drop
>             the explicit flag
>              >>> >          >> > --runner_option=blah=bar
>             --runner_option=foo=baz1
>              >>> >         --runner_option=foo=baz2
>              >>> >          >>
>              >>> >          >> This or (4) is my preference.
>              >>> >          >>
>              >>> >          >> > 3) we use a string which the runner can
>             choose to
>              >>> >         interpret however they want (JSON/XML shown
>             below)
>              >>> >          >> > --runner_option='{"blah": "bar", "foo":
>             ["baz1", "baz2"]}'
>              >>> >          >> >
>              >>> >        
>             --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>              >>> >          >>
>              >>> >          >> This would make validation hard. Also, I
>             think it makes
>              >>> >         sense for some
>              >>> >          >> runner options to be "shared"
>             ("parallelism") by convention,
>              >>> >         so letting
>              >>> >          >> it be a free-form string wouldn't allow
>             different runners to
>              >>> >         inspect
>              >>> >          >> different bits.
>              >>> >          >>
>              >>> >          >> We should consider if we should use urns
>             for namespacing, and
>              >>> >          >> assigning semantic meaning to strings, here.
>              >>> >          >>
>              >>> >          >> > 4) we use a string which must be a
>             specific format such as
>              >>> >         JSON (allows the SDK to do simple validation):
>              >>> >          >> > --runner_option='{"blah": "bar", "foo":
>             ["baz1", "baz2"]}'
>              >>> >          >>
>              >>> >          >> I like this in that at least some
>             validation can be
>              >>> >         performed, and
>              >>> >          >> expectations of how to format richer
>             types. On the other
>              >>> >         hand it gets
>              >>> >          >> a bit verbose, given that most (I'd
>             imagine) options will be
>              >>> >         simple.
>              >>> >          >> As with normal options,
>              >>> >          >>
>              >>> >          >>     --option1=value1 --option2=value2
>              >>> >          >>
>              >>> >          >> is shorthand for {"option1": value1,
>             "option2": value2}.
>              >>> >          >>
>              >>> >          > I lean to 4 the most. With 2, you run into
>             issues of what
>              >>> >         does --runner_option=foo=["a", "b"]
>             --runner_option=foo=["c",
>              >>> >         "d"] mean?
>              >>> >          > Is it an error, a list of lists, or a
>             concatenation? Similar
>              >>> >         issues for map types represented via JSON
>             object {...}
>              >>> >
>              >>> >         We can err to be on the safe side
>             unless/until an argument can
>              >>> >         be made
>              >>> >         that merging is more natural. I just think
>             this will be excessively
>              >>> >         verbose to use.
>              >>> >
>              >>> >          >> > I would strongly suggest that we go
>             with the "fetch"
>              >>> >         approach, since this makes the set of options
>             discoverable and
>              >>> >         helps users find errors much earlier in their
>             pipeline.
>              >>> >          >>
>              >>> >          >> This seems like an advanced feature that
>             SDKs may want to
>              >>> >         support, but
>              >>> >          >> I wouldn't want to require this
>             complexity for bootstrapping
>              >>> >         an SDK.
>              >>> >          >>
>              >>> >          > SDKs that are starting off wouldn't need
>             to "fetch" options,
>              >>> >         they could choose to not support runner
>             options or they could
>              >>> >         choose to pass all options through to the
>             runner blindly.
>              >>> >         Fetching the options only provides the SDK
>             the ability to
>              >>> >         provide error checking upfront and useful
>             error/help messages.
>              >>> >
>              >>> >         But how to even pass all options through
>             blindly is exactly the
>              >>> >         difficulty we're running into here.
>              >>> >
>              >>> >          >> Regarding always keeping runner options
>             separate, +1, though
>              >>> >         I'm not
>              >>> >          >> sure the line is always clear.
>              >>> >          >>
>              >>> >          >>
>              >>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert
>             Bradshaw
>              >>> >         <robertwb@google.com
>             <ma...@google.com> <mailto:robertwb@google.com
>             <ma...@google.com>>> wrote:
>              >>> >          >> >>
>              >>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM
>             Maximilian Michels
>              >>> >         <mxm@apache.org <ma...@apache.org>
>             <mailto:mxm@apache.org <ma...@apache.org>>> wrote:
>              >>> >          >> >> >
>              >>> >          >> >> > I agree that the current approach
>             breaks the pipeline
>              >>> >         options contract
>              >>> >          >> >> > because "unknown" options get parsed
>             in the same way as
>              >>> >         options which
>              >>> >          >> >> > have been defined by the user.
>              >>> >          >> >>
>              >>> >          >> >> FWIW, I think we're already breaking
>             this "contract."
>              >>> >         Unknown options
>              >>> >          >> >> are silently ignored; with this change
>             we just change how
>              >>> >         we record
>              >>> >          >> >> them. It still feels a bit hacky though.
>              >>> >          >> >>
>              >>> >          >> >> > I'm not sure the `experiments` flag
>             works for us. AFAIK
>              >>> >         it only allows
>              >>> >          >> >> > true/false flags. We want to pass
>             all types of pipeline
>              >>> >         options to the
>              >>> >          >> >> > Runner.
>              >>> >          >> >>
>              >>> >          >> >> Experiments is an arbitrary set of
>             strings, which can be
>              >>> >         of the form
>              >>> >          >> >> "param=value" if that's useful.
>             (Dataflow does this.)
>              >>> >         There is, again,
>              >>> >          >> >> no namespacing on the param names, but
>             we could use urns
>              >>> >         or impose
>              >>> >          >> >> some other structure here.
>              >>> >          >> >>
>              >>> >          >> >> > How to solve this?
>              >>> >          >> >> >
>              >>> >          >> >> > 1) Add all options of all Runners to
>             each SDK
>              >>> >          >> >> > We added some of the FlinkRunner
>             options to the Python
>              >>> >         SDK but realized
>              >>> >          >> >> > syncing is rather cumbersome in the
>             long term. However,
>              >>> >         we want the most
>              >>> >          >> >> > important options to be validated on
>             the client side.
>              >>> >          >> >>
>              >>> >          >> >> I don't think this is sustainable in
>             the long run.
>              >>> >         However, thinking
>              >>> >          >> >> about this, in the worst case
>             validation happens after
>              >>> >         construction
>              >>> >          >> >> but before execution (as with much of
>             our other
>              >>> >         validation) so it
>              >>> >          >> >> isn't that bad.
>              >>> >          >> >>
>              >>> >          >> >> > 2) Pass "unknown" options via a
>             separate list in the
>              >>> >         Proto which can
>              >>> >          >> >> > only be accessed internally by the
>             Runners. This still
>              >>> >         allows passing
>              >>> >          >> >> > arbitrary options but we wouldn't
>             leak unknown options
>              >>> >         and display them
>              >>> >          >> >> > as top-level options.
>              >>> >          >> >>
>              >>> >          >> >> I think there needs to be a way for
>             the user to
>              >>> >         communicate values
>              >>> >          >> >> directly to the runner regardless of
>             the SDK. My
>              >>> >         preference would be
>              >>> >          >> >> to make this explicit, e.g. (repeated)
>              >>> >         --runner_option=..., rather
>              >>> >          >> >> than scooping up all unknown flags at
>             command line
>              >>> >         parsing time.
>              >>> >          >> >> Perhaps an SDK that is aware of some
>             runners could choose
>              >>> >         to lift
>              >>> >          >> >> these as top-level options, but still
>             pass them as runner
>              >>> >         options.
>              >>> >          >> >>
>              >>> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
>              >>> >          >> >> > > The current release branch
>              >>> >          >> >> > >
>              >>> >        
>             (https://github.com/apache/beam/commits/release-2.8.0) was cut
>              >>> >         after the
>              >>> >          >> >> > > revert went in.  Sent out
>              >>> > https://github.com/apache/beam/pull/6683 as a
>              >>> >          >> >> > > revert of the revert.  Regarding
>             your comment above,
>              >>> >         I can help out with
>              >>> >          >> >> > > the design / PR reviews for common
>             Python code as you
>              >>> >         suggest.
>              >>> >          >> >> > >
>              >>> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM
>             Thomas Weise
>              >>> >         <thw@apache.org <ma...@apache.org>
>             <mailto:thw@apache.org <ma...@apache.org>>
>              >>> >          >> >> > > <mailto:thw@apache.org
>             <ma...@apache.org> <mailto:thw@apache.org
>             <ma...@apache.org>>>> wrote:
>              >>> >          >> >> > >
>              >>> >          >> >> > >     Thanks, will tag you and
>             looking forward to
>              >>> >         feedback so we can
>              >>> >          >> >> > >     ensure that changes work for
>             everyone.
>              >>> >          >> >> > >
>              >>> >          >> >> > >     Looking at the PR, I see
>             agreement from Max to
>              >>> >         revert the change on
>              >>> >          >> >> > >     the release branch, but not in
>             master. Would you
>              >>> >         mind restoring it
>              >>> >          >> >> > >     in master?
>              >>> >          >> >> > >
>              >>> >          >> >> > >     Thanks
>              >>> >          >> >> > >
>              >>> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40
>             PM Ahmet Altay
>              >>> >         <altay@google.com <ma...@google.com>
>             <mailto:altay@google.com <ma...@google.com>>
>              >>> >          >> >> > >     <mailto:altay@google.com
>             <ma...@google.com>
>              >>> >         <mailto:altay@google.com
>             <ma...@google.com>>>> wrote:
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >         On Fri, Oct 12, 2018 at
>             11:31 AM, Charles
>              >>> >         Chen <ccy@google.com <ma...@google.com>
>             <mailto:ccy@google.com <ma...@google.com>>
>              >>> >          >> >> > >         <mailto:ccy@google.com
>             <ma...@google.com>
>              >>> >         <mailto:ccy@google.com
>             <ma...@google.com>>>> wrote:
>              >>> >          >> >> > >
>              >>> >          >> >> > >             What I mean is that a
>             user may find that
>              >>> >         it works for them
>              >>> >          >> >> > >             to pass "--myarg blah"
>             and access it as
>              >>> >         "options.myarg"
>              >>> >          >> >> > >             without explicitly
>             defining a "my_arg"
>              >>> >         flag due to the added
>              >>> >          >> >> > >             logic.  This is not
>             the intended behavior
>              >>> >         and we may want to
>              >>> >          >> >> > >             change this
>             implementation detail in the
>              >>> >         future.  However,
>              >>> >          >> >> > >             having this logic in a
>             released version
>              >>> >         makes it hard to
>              >>> >          >> >> > >             change this behavior
>             since users may
>              >>> >         erroneously depend on
>              >>> >          >> >> > >             this undocumented
>             behavior.  Instead, we
>              >>> >         should namespace /
>              >>> >          >> >> > >             scope this so that it
>             is obvious that
>              >>> >         this is meant for
>              >>> >          >> >> > >             runner (and not Beam
>             user) consumption.
>              >>> >          >> >> > >
>              >>> >          >> >> > >             On Fri, Oct 12, 2018
>             at 10:48 AM Thomas Weise
>              >>> >          >> >> > >             <thw@apache.org
>             <ma...@apache.org> <mailto:thw@apache.org
>             <ma...@apache.org>>
>              >>> >         <mailto:thw@apache.org
>             <ma...@apache.org> <mailto:thw@apache.org
>             <ma...@apache.org>>>> wrote:
>              >>> >          >> >> > >
>              >>> >          >> >> > >                 Can you please
>             elaborate more on what
>              >>> >         practical problems
>              >>> >          >> >> > >                 this introduces
>             for users?
>              >>> >          >> >> > >
>              >>> >          >> >> > >                 I can see that
>             this change allows a
>              >>> >         user to specify a
>              >>> >          >> >> > >                 runner specific
>             option, which in the
>              >>> >         future may change
>              >>> >          >> >> > >                 because we decide
>             to scope
>              >>> >         differently. If this only
>              >>> >          >> >> > >                 affects users of
>             the portable Flink
>              >>> >         runner (like us),
>              >>> >          >> >> > >                 then no need to
>             revert, because at
>              >>> >         this early stage we
>              >>> >          >> >> > >                 prefer something
>             that works over
>              >>> >         being blocked.
>              >>> >          >> >> > >
>              >>> >          >> >> > >                 It would also be
>             really great if some
>              >>> >         of the core Python
>              >>> >          >> >> > >                 SDK developers
>             could help out with
>              >>> >         the design aspects
>              >>> >          >> >> > >                 and PR reviews of
>             changes that affect
>              >>> >         common Python
>              >>> >          >> >> > >                 code. Anyone who
>             specifically wants
>              >>> >         to be tagged on
>              >>> >          >> >> > >                 relevant JIRAs and
>             PRs?
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >         I would be happy to be
>             tagged, and I can also
>              >>> >         help with
>              >>> >          >> >> > >         including other relevant
>             folks whenever
>              >>> >         possible. In general I
>              >>> >          >> >> > >         think Robert, Charles,
>             myself are good
>              >>> >         candidates.
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >                 Thanks
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >                 On Fri, Oct 12,
>             2018 at 10:20 AM
>              >>> >         Ahmet Altay
>              >>> >          >> >> > >                 <altay@google.com
>             <ma...@google.com>
>              >>> >         <mailto:altay@google.com
>             <ma...@google.com>> <mailto:altay@google.com
>             <ma...@google.com>
>              >>> >         <mailto:altay@google.com
>             <ma...@google.com>>>> wrote:
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >                     On Fri, Oct
>             12, 2018 at 10:11 AM,
>              >>> >         Charles Chen
>              >>> >          >> >> > >                    
>             <ccy@google.com <ma...@google.com>
>              >>> >         <mailto:ccy@google.com
>             <ma...@google.com>> <mailto:ccy@google.com
>             <ma...@google.com>
>              >>> >         <mailto:ccy@google.com
>             <ma...@google.com>>>> wrote:
>              >>> >          >> >> > >
>              >>> >          >> >> > >                         For
>             context, I made comments on
>              >>> >          >> >> > >
>             https://github.com/apache/beam/pull/6600 noting
>              >>> >          >> >> > >                         that the
>             changes being made
>              >>> >         were not good for
>              >>> >          >> >> > >                         Beam
>              >>> >         backwards-compatibility.  The change as is
>              >>> >          >> >> > >                         allows
>             users to use pipeline
>              >>> >         options without
>              >>> >          >> >> > >                         explicitly
>             defining them,
>              >>> >         which is not the type
>              >>> >          >> >> > >                         of usage
>             we would like to
>              >>> >         encourage since we
>              >>> >          >> >> > >                         prefer to
>             be explicit
>              >>> >         whenever possible.  If
>              >>> >          >> >> > >                         users
>             write pipelines with
>              >>> >         this sort of pattern,
>              >>> >          >> >> > >                         they will
>             potentially
>              >>> >         encounter pain when
>              >>> >          >> >> > >                         upgrading
>             to a later version
>              >>> >         since this is an
>              >>> >          >> >> > >                        
>             implementation detail and not
>              >>> >         an officially
>              >>> >          >> >> > >                         supported
>             pattern.  I agree
>              >>> >         with the comments
>              >>> >          >> >> > >                         above that
>             this is ultimately
>              >>> >         a scoping issue.
>              >>> >          >> >> > >                         I would
>             not have a problem
>              >>> >         with these changes if
>              >>> >          >> >> > >                         they were
>             explicitly scoped
>              >>> >         under either a
>              >>> >          >> >> > >                         runner or
>             unparsed options
>              >>> >         namespace.
>              >>> >          >> >> > >
>              >>> >          >> >> > >                         As a
>             second note, since the
>              >>> >         2.8.0 release is
>              >>> >          >> >> > >                         being cut
>             right now, because
>              >>> >         of these
>              >>> >          >> >> > >                        
>             backwards-compatibility
>              >>> >         concerns, I would
>              >>> >          >> >> > >                         suggest
>             reverting these
>              >>> >         changes, at least until
>              >>> >          >> >> > >                         2.8.0 is
>             cut, so we can have
>              >>> >         a discussion here
>              >>> >          >> >> > >                         before
>             committing to and
>              >>> >         releasing any API-level
>              >>> >          >> >> > >                         changes.
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >                     +1 I would
>             like to revert the
>              >>> >         changes in order not to
>              >>> >          >> >> > >                     rush this into
>             the release. Once
>              >>> >         this discussion
>              >>> >          >> >> > >                     results in an
>             agreement changes
>              >>> >         can be brought back.
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >                         On Fri,
>             Oct 12, 2018 at 9:26
>              >>> >         AM Henning Rohde
>              >>> >          >> >> > >                        
>             <herohde@google.com <ma...@google.com>
>              >>> >         <mailto:herohde@google.com
>             <ma...@google.com>> <mailto:herohde@google.com
>             <ma...@google.com>
>              >>> >         <mailto:herohde@google.com
>             <ma...@google.com>>>>
>              >>> >          >> >> > >                         wrote:
>              >>> >          >> >> > >
>              >>> >          >> >> > >                             Agree
>             that pipeline
>              >>> >         options lack some
>              >>> >          >> >> > >                            
>             mechanism for scoping. It
>              >>> >         is also not always
>              >>> >          >> >> > >                            
>             possible to distinguish
>              >>> >         options meant to be
>              >>> >          >> >> > >                            
>             consumed at pipeline
>              >>> >         construction time, by
>              >>> >          >> >> > >                             the
>             runner, by the SDK
>              >>> >         harness, by the user
>              >>> >          >> >> > >                             code
>             or any combination
>              >>> >         -- and this causes
>              >>> >          >> >> > >                            
>             confusion every now and then.
>              >>> >          >> >> > >
>              >>> >          >> >> > >                             For
>             Dataflow, we have
>              >>> >         been using
>              >>> >          >> >> > >                            
>             "experiments" for
>              >>> >         arbitrary runner-specific
>              >>> >          >> >> > >                            
>             options. It's simply a
>              >>> >         string list pipeline
>              >>> >          >> >> > >                             option
>             that all SDKs
>              >>> >         support and, for Go at
>              >>> >          >> >> > >                             least,
>             is sent to
>              >>> >         portable runners. Flink
>              >>> >          >> >> > >                             can do
>             the same in the
>              >>> >         short term to move
>              >>> >          >> >> > >                             forward.
>              >>> >          >> >> > >
>              >>> >          >> >> > >                             Henning
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >                             On
>             Fri, Oct 12, 2018 at
>              >>> >         8:50 AM Thomas Weise
>              >>> >          >> >> > >                            
>             <thw@apache.org <ma...@apache.org>
>              >>> >         <mailto:thw@apache.org
>             <ma...@apache.org>> <mailto:thw@apache.org
>             <ma...@apache.org>
>              >>> >         <mailto:thw@apache.org
>             <ma...@apache.org>>>> wrote:
>              >>> >          >> >> > >
>              >>> >          >> >> > >                                
>             [moving to the list]
>              >>> >          >> >> > >
>              >>> >          >> >> > >                                
>             The requirement
>              >>> >         driving this part of the
>              >>> >          >> >> > >                                
>             change was to allow a
>              >>> >         user to specify
>              >>> >          >> >> > >                                
>             pipeline options that
>              >>> >         a runner supports
>              >>> >          >> >> > >                                
>             without having to
>              >>> >         declare those in each
>              >>> >          >> >> > >                                
>             language SDK.
>              >>> >          >> >> > >
>              >>> >          >> >> > >                                 In
>             the specific
>              >>> >         scenario, we have
>              >>> >          >> >> > >                                
>             options that the
>              >>> >         Flink runner supports
>              >>> >          >> >> > >                                
>             (and can validate),
>              >>> >         that are not
>              >>> >          >> >> > >                                
>             enumerated in the
>              >>> >         Python SDK.
>              >>> >          >> >> > >
>              >>> >          >> >> > >                                 I
>             think we have a
>              >>> >         bigger problem scoping
>              >>> >          >> >> > >                                
>             pipeline options. For
>              >>> >         example, the
>              >>> >          >> >> > >                                
>             runner options are
>              >>> >         dumped into the SDK
>              >>> >          >> >> > >                                
>             worker. There is also
>              >>> >         a possibility of
>              >>> >          >> >> > >                                
>             name collisions. So I
>              >>> >         think this would
>              >>> >          >> >> > >                                
>             benefit from broader
>              >>> >         feedback.
>              >>> >          >> >> > >
>              >>> >          >> >> > >                                
>             Thanks,
>              >>> >          >> >> > >                                 Thomas
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >          >> >> > >
>              >>> >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Thomas Weise <th...@apache.org>.

Reminder that this is something we ideally address before the next
release...

Considering the discussion so far, my preference is that we get away from
unknown options and discover valid options from the runner (by expanding
the job service).

Once the SDK is aware of all valid options, it is possible to provide
meaningful feedback to the user (validate or help), and correctly handle
scopes and types.
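
As a rough sketch of what discovery would enable (this is not an existing
Beam API): assume the job service could return a list of option descriptors,
each with a name, a type and a help string. The DISCOVERED list and its
fields are made up for illustration; "parallelism" is borrowed from this
thread and "checkpoint_dir" is invented. With such a list the SDK could build
an ordinary parser and get validation, help text and numeric types from it:

```python
import argparse

# Hypothetical result of a discovery call to the job service. Neither the
# call nor this descriptor shape is an existing Beam interface.
DISCOVERED = [
    {"name": "parallelism", "type": int, "help": "Default operator parallelism."},
    {"name": "checkpoint_dir", "type": str, "help": "Where to store checkpoints."},
]


def build_parser(descriptors):
    """Builds an argparse parser with one typed flag per discovered option."""
    parser = argparse.ArgumentParser()
    for d in descriptors:
        parser.add_argument("--" + d["name"], type=d["type"], help=d["help"])
    return parser


parser = build_parser(DISCOVERED)
known, unknown = parser.parse_known_args(
    ["--parallelism=5", "--checkpoint_dir=/tmp/cp", "--typo_option=1"])

# parallelism arrives as an int, not as a string or a float.
print(known.parallelism, type(known.parallelism).__name__)
# Whatever is left over was not advertised and can be reported as an error.
print(unknown)
```

Explicit scoping would then only be needed when two sources advertise the
same option name.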

I would prefer we don't introduce a (quirky) way of passing unknown options
that forces users to type JSON into the command line (or similar
acrobatics). To someone wanting to run a pipeline, all options are equally
important, whether they are application specific, SDK specific or runner
specific. It should be possible to *optionally* qualify/scope (to cover
cases where there is ambiguity), but otherwise I prefer the format we
currently have.

Regarding type inference: correct handling of numeric types matters; see the
following issue with protobuf (not JSON):
https://issues.apache.org/jira/browse/BEAM-5509

Thomas


On Thu, Oct 18, 2018 at 6:55 AM Robert Bradshaw <ro...@google.com> wrote:

> On Wed, Oct 17, 2018 at 11:35 PM Lukasz Cwik <lc...@google.com> wrote:
>
>>
>> On Tue, Oct 16, 2018 at 11:51 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik <lc...@google.com> wrote:
>>> >
>>> > For all unknown options, the SDK can require that all flag values be
>>> > specified explicitly as a valid JSON type:
>>> >   starts with { -> object
>>> >   starts with [ -> list
>>> >   starts with " -> string
>>> >   is null / true / false -> null / true / false
>>> >   otherwise it is a number.
>>> >
>>> > This isn't great for strings but works well for all the other types.
>>> >
>>> > Thus for known options, the additional typing information would
>>> > disambiguate whether something should be a
>>> > string/boolean/number/object/list, but for unknown options we would
>>> > expect the user to use valid JSON explicitly and write:
>>> > --foo={"object": "value"}
>>> > --foo=["value", "value2"]
>>> > --foo="string value"
>>>
>>> Due to shell escaping, one would have to write
>>>
>>> --foo=\"string value\"
>>>
>>> or actually, due to the space
>>>
>>> --foo='"string value"'
>>>
>>> or some other variation on that, which is really unfortunate. (The JSON
>>> list/objects would need similar quoting, but that's less surprising.) Also,
>>> does this mean we'd only have one kind of number (not integer vs. float,
>>> i.e. --parallelism=5.0 works)? I suppose that is JSON.
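
The quoting problem described above can be reproduced without a shell. This
is a small sketch that uses Python's shlex.split as a stand-in for POSIX
shell word splitting (an approximation, good enough for these three cases);
the helper name argv_for is made up:

```python
import json
import shlex


def argv_for(command_line):
    """Returns the argument list a POSIX shell would roughly produce."""
    return shlex.split(command_line)


# Double quotes are consumed by the shell, so the JSON quotes are lost.
plain = argv_for('--foo="string value"')
# Backslash-escaped quotes survive, but the space splits the argument in two.
escaped = argv_for(r'--foo=\"string value\"')
# Single quotes protect both the inner double quotes and the space.
quoted = argv_for("""--foo='"string value"'""")

print(plain)
print(escaped)
print(quoted)

# Only the last form hands a JSON string literal to the program.
value = json.loads(quoted[0].split("=", 1)[1])
print(value)
```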
>>>
>>
>> Yes, I suspected that users would need to type the second variant, since
>> I found \"...\" more burdensome than '"..."'.
>>
>>
>>>
>>> > --foo=3.5 --foo=-4
>>> > --foo=true --foo=false
>>> > --foo=null
>>> > This also works if the flag is repeated, so --foo=3.5 --foo=-4 is
>>> > [3.5, -4]
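
For reference, a minimal sketch of the JSON-typed scheme quoted above, as I
read it. The function name parse_unknown_flags is made up, and the sketch
assumes every unknown flag is written as --name=value with the shell quoting
already removed:

```python
import json


def parse_unknown_flags(args):
    """Parses --name=<json> flags; a flag given several times becomes a list."""
    values = {}
    for arg in args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError("expected --name=value, got %r" % (arg,))
        name, raw = arg[2:].split("=", 1)
        # json.loads rejects bare words such as: string value
        values.setdefault(name, []).append(json.loads(raw))
    # A single occurrence keeps its value as is; repeats are collected.
    return {k: v[0] if len(v) == 1 else v for k, v in values.items()}


print(parse_unknown_flags(["--foo=3.5", "--foo=-4"]))
print(parse_unknown_flags(['--foo="string value"', "--bar=true", "--baz=null"]))
```

Python's json module happens to keep 5 and 5.0 apart (int vs. float); JSON
itself has only one number type, so other SDKs may not make that distinction.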
>>>
>>> The thing that sparked this discussion was what to do when unknown foo
>>> is repeated, but only one value given.
>>>
>>
>> If the person only specifies one value, then they have to disambiguate
>> and put it in a list; only if they specify more than one value will they
>> have to turn it into a list.
>>
>> I believe we could come up with other schemes on how to convert unknown
>> options to JSON where we prefer strings over non-string types like
>> null/true/false/numbers/list/object and require the user to escape out of
>> the string default but anything that is too different from strict JSON
>> would cause headaches when attempting to explain the format to users. I
>> think a happy middle ground would be that we will only require escaping for
>> strings which are ambiguous, so things like true, null, false, ... to be
>> treated as strings would require the user to escape them.
>>
>
> I'd prefer to avoid inferring the type of an unknown argument based on its
> contents, which can lead to surprises. We could declare every unknown type
> to be repeated string, and let any parsing/validation occur on the runner.
> If desired, we could pass these around as a single "runner options" dict
> that runners could inspect and use to populate the actual dict rather than
> mixing parsed and unparsed options.
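
A sketch of the "repeated string" alternative described above, assuming the
SDK parses its own flags with argparse and that unknown flags are written as
--name=value. The function name split_options and the flag --job_name are
only illustrative:

```python
import argparse


def split_options(argv, parser):
    """Returns (known SDK options, runner options as name -> list of str)."""
    known, unknown = parser.parse_known_args(argv)
    runner_options = {}
    for arg in unknown:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError("expected --name=value, got %r" % (arg,))
        name, value = arg[2:].split("=", 1)
        # No type inference here: every value stays a string and every
        # option is a list, so a single occurrence and a repeat look alike.
        runner_options.setdefault(name, []).append(value)
    return vars(known), runner_options


sdk_parser = argparse.ArgumentParser()
sdk_parser.add_argument("--job_name")

known, runner = split_options(
    ["--job_name=x", "--parallelism=5", "--foo=a", "--foo=b"], sdk_parser)
print(known)
print(runner)

# Parsing and validation happen on the runner side, which knows the types.
parallelism = int(runner["parallelism"][-1])
```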
>
>
>>
>>
>>> > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise <th...@apache.org> wrote:
>>> >>
>>> >> Discovering options from the job server seems preferable over
>>> >> replicating runner options in SDKs.
>>> >>
>>> >> Runners evolve on their own, and with portability the SDK does not
>>> >> need to know anything about the runner.
>>> >>
>>> >> Regarding --runner-option. It is true that this looks less user
>>> >> friendly. On the other hand it eliminates the possibility of name
>>> >> collisions.
>>> >>
>>> >> But if options are discovered, the SDK can perform full validation.
>>> >> It would only be necessary to use explicit scoping when there is
>>> >> ambiguity.
>>> >> Thomas
>>> >>
>>> >>
>>> >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels <mx...@apache.org>
>>> wrote:
>>> >>>
>>> >>> Fetching options directly from the Runner's JobServer seems like the
>>> >>> ideal solution. I agree with Robert that it creates additional
>>> >>> complexity for SDK authors, so the `--runner-option` flag would be an
>>> >>> easy and explicit way to specify additional Runner options.
>>> >>>
>>> >>> The format I prefer would be: --runner_option=key1=val1
>>> >>> --runner_option=key2=val2
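
The repeated key=value format above maps directly onto an argparse "append"
action. A minimal sketch; only the flag name --runner_option comes from the
thread, the function name and the error handling are my assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--runner_option", action="append", metavar="KEY=VALUE")


def runner_options(argv):
    """Collects repeated --runner_option=key=value flags into a dict."""
    args, _ = parser.parse_known_args(argv)
    options = {}
    for item in args.runner_option or []:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError("expected KEY=VALUE, got %r" % (item,))
        # Values stay strings; a repeated key accumulates its values.
        options.setdefault(key, []).append(value)
    return options


print(runner_options(["--runner_option=key1=val1", "--runner_option=key2=val2"]))
print(runner_options(["--runner_option=foo=baz1", "--runner_option=foo=baz2"]))
```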
>>> >>>
>>> >>> Now, from the perspective of end users, I think it is neither convenient
>>> >>> nor reasonable to require the use of the `--runner-option` flag. To the
>>> >>> user it seems nebulous why some pipeline options live in the top-level
>>> >>> option namespace while others need to be nested within an option. This
>>> >>> is amplified by there being two Runners the user needs to be aware of,
>>> >>> i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark..).
>>> >>>
>>> >>> I feel like we would eventually replicate all options in the SDK because
>>> >>> otherwise users have to use the `--runner-option`, but at least we can
>>> >>> specify options which have not been replicated yet.
>>> >>>
>>> >>> -Max
>>> >>>
>>> >>> On 16.10.18 10:27, Robert Bradshaw wrote:
>>> >>> > Yes, we don't know how to parse and/or validate it.
>>> >>> >
>>> >>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lcwik@google.com
>>> >>> > <ma...@google.com>> wrote:
>>> >>> >
>>> >>> >     I see, is the issue that we currently are using a JSON
>>> >>> >     representation for options when being serialized and when we
>>> get
>>> >>> >     some unknown option, we don't know how to convert it into its
>>> JSON form?
>>> >>> >
>>> >>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <
>>> robertwb@google.com
>>> >>> >     <ma...@google.com>> wrote:
>>> >>> >
>>> >>> >         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <
>>> lcwik@google.com
>>> >>> >         <ma...@google.com>> wrote:
>>> >>> >          >
>>> >>> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw
>>> >>> >         <robertwb@google.com <ma...@google.com>> wrote:
>>> >>> >          >>
>>> >>> >          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik
>>> >>> >         <lcwik@google.com <ma...@google.com>> wrote:
>>> >>> >          >> >
>>> >>> >          >> > I agree with the sentiment for better error checking.
>>> >>> >          >> >
>>> >>> >          >> > We can try to make it such that the SDK can "fetch"
>>> the
>>> >>> >         set of options that the runner supports by making a call
>>> to the
>>> >>> >         Job API. The API could return a list of option names
>>> >>> >         (descriptions for --help purposes and also potentially the
>>> >>> >         expected format) which would remove the worry around
>>> "unknown"
>>> >>> >         options. Yes I understand to be able to make the Job API
>>> call,
>>> >>> >         we may need to parse some options from the args parameters
>>> first
>>> >>> >         and then parse the unknown options after they are fetched.
>>> >>> >          >>
>>> >>> >          >> This is an interesting idea, but seems it could get
>>> quite
>>> >>> >         complicated.
>>> >>> >          >> E.g. for delegating runners, one would first read the
>>> options to
>>> >>> >          >> determine which runner to fetch the options from, which
>>> >>> >         would then
>>> >>> >          >> return a set of options that possibly depends on the
>>> values
>>> >>> >         of some of
>>> >>> >          >> its options...
>>> >>> >          >>
>>> >>> >          >> > Alternatively, we can choose an explicit format
>>> upfront.
>>> >>> >          >> > To expand on the exact format for
>>> --runner_option=...,
>>> >>> >         here are some different ideas:
>>> >>> >          >> > 1) Specified multiple times, each one is an explicit
>>> flag
>>> >>> >          >> > --runner_option=--blah=bar --runner_option=--foo=baz1
>>> >>> >         --runner_option=--foo=baz2
>>> >>> >          >>
>>> >>> >          >> I'm -1 on this format. We should move away from the
>>> idea
>>> >>> >         that options
>>> >>> >          >> == flags (as that doesn't compose well with other
>>> libraries
>>> >>> >         that do
>>> >>> >          >> their own flags parsing). The ability to parse a set of
>>> >>> >         flags into
>>> >>> >          >> options is just a convenience that an author may (or
>>> may
>>> >>> >         not) choose
>>> >>> >          >> to use (e.g. when running pipelines in a long-lived
>>> process like a
>>> >>> >          >> service or a notebook, the command line flags are
>>> almost
>>> >>> >         certainly not
>>> >>> >          >> the right interface).
>>> >>> >          >>
>>> >>> >          >> > 2) specified multiple times, we drop the explicit
>>> flag
>>> >>> >          >> > --runner_option=blah=bar --runner_option=foo=baz1
>>> >>> >         --runner_option=foo=baz2
>>> >>> >          >>
>>> >>> >          >> This or (4) is my preference.
>>> >>> >          >>
>>> >>> >          >> > 3) we use a string which the runner can choose to
>>> >>> >         interpret however they want (JSON/XML shown below)
>>> >>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1",
>>> "baz2"]}'
>>> >>> >          >> >
>>> >>> >
>>> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>>> >>> >          >>
>>> >>> >          >> This would make validation hard. Also, I think it makes
>>> >>> >         sense for some
>>> >>> >          >> runner options to be "shared" ("parallelism") by
>>> convention,
>>> >>> >         so letting
>>> >>> >          >> it be a free-form string wouldn't allow different
>>> runners to
>>> >>> >         inspect
>>> >>> >          >> different bits.
>>> >>> >          >>
>>> >>> >          >> We should consider if we should use urns for
>>> namespacing, and
>>> >>> >          >> assigning semantic meaning to strings, here.
>>> >>> >          >>
>>> >>> >          >> > 4) we use a string which must be a specific format
>>> such as
>>> >>> >         JSON (allows the SDK to do simple validation):
>>> >>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1",
>>> "baz2"]}'
>>> >>> >          >>
>>> >>> >          >> I like this in that at least some validation can be
>>> >>> >         performed, and
>>> >>> >          >> expectations of how to format richer types. On the
>>> other
>>> >>> >         hand it gets
>>> >>> >          >> a bit verbose, given that most (I'd imagine) options
>>> will be
>>> >>> >         simple.
>>> >>> >          >> As with normal options,
>>> >>> >          >>
>>> >>> >          >>     --option1=value1 --option2=value2
>>> >>> >          >>
>>> >>> >          >> is shorthand for {"option1": value1, "option2":
>>> value2}.
>>> >>> >          >>
>>> >>> >          > I lean to 4 the most. With 2, you run into issues of
>>> what
>>> >>> >         does --runner_option=foo=["a", "b"]
>>> --runner_option=foo=["c",
>>> >>> >         "d"] mean?
>>> >>> >          > Is it an error or list of lists or concatenated. Similar
>>> >>> >         issues for map types represented via JSON object {...}
>>> >>> >
>>> >>> >         We can err to be on the safe side unless/until an argument
>>> can
>>> >>> >         be made
>>> >>> >         that merging is more natural. I just think this will be
>>> excessively
>>> >>> >         verbose to use.
>>> >>> >
>>> >>> >          >> > I would strongly suggest that we go with the "fetch"
>>> >>> >         approach, since this makes the set of options discoverable
>>> and
>>> >>> >         helps users find errors much earlier in their pipeline.
>>> >>> >          >>
>>> >>> >          >> This seems like an advanced feature that SDKs may want
>>> to
>>> >>> >         support, but
>>> >>> >          >> I wouldn't want to require this complexity for
>>> bootstrapping
>>> >>> >         an SDK.
>>> >>> >          >>
>>> >>> >          > SDKs that are starting off wouldn't need to "fetch"
>>> options,
>>> >>> >         they could choose to not support runner options or they
>>> could
>>> >>> >         choose to pass all options through to the runner blindly.
>>> >>> >         Fetching the options only provides the SDK the ability to
>>> >>> >         provide error checking upfront and useful error/help
>>> messages.
>>> >>> >
>>> >>> >         But how to even pass all options through blindly is
>>> exactly the
>>> >>> >         difficulty we're running into here.
>>> >>> >
>>> >>> >          >> Regarding always keeping runner options separate, +1,
>>> though
>>> >>> >         I'm not
>>> >>> >          >> sure the line is always clear.
>>> >>> >          >>
>>> >>> >          >>
>>> >>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw
>>> >>> >         <robertwb@google.com <ma...@google.com>> wrote:
>>> >>> >          >> >>
>>> >>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels
>>> >>> >         <mxm@apache.org <ma...@apache.org>> wrote:
>>> >>> >          >> >> >
>>> >>> >          >> >> > I agree that the current approach breaks the
>>> pipeline
>>> >>> >         options contract
>>> >>> >          >> >> > because "unknown" options get parsed in the same
>>> way as
>>> >>> >         options which
>>> >>> >          >> >> > have been defined by the user.
>>> >>> >          >> >>
>>> >>> >          >> >> FWIW, I think we're already breaking this
>>> "contract."
>>> >>> >         Unknown options
>>> >>> >          >> >> are silently ignored; with this change we just
>>> change how
>>> >>> >         we record
>>> >>> >          >> >> them. It still feels a bit hacky though.
>>> >>> >          >> >>
>>> >>> >          >> >> > I'm not sure the `experiments` flag works for us.
>>> AFAIK
>>> >>> >         it only allows
>>> >>> >          >> >> > true/false flags. We want to pass all types of
>>> pipeline
>>> >>> >         options to the
>>> >>> >          >> >> > Runner.
>>> >>> >          >> >>
>>> >>> >          >> >> Experiments is an arbitrary set of strings, which
>>> can be
>>> >>> >         of the form
>>> >>> >          >> >> "param=value" if that's useful. (Dataflow does
>>> this.)
>>> >>> >         There is, again,
>>> >>> >          >> >> no namespacing on the param names, but we could
>>> use urns
>>> >>> >         or impose
>>> >>> >          >> >> some other structure here.
>>> >>> >          >> >>
>>> >>> >          >> >> > How to solve this?
>>> >>> >          >> >> >
>>> >>> >          >> >> > 1) Add all options of all Runners to each SDK
>>> >>> >          >> >> > We added some of the FlinkRunner options to the
>>> Python
>>> >>> >         SDK but realized
>>> >>> >          >> >> > syncing is rather cumbersome in the long term.
>>> However,
>>> >>> >         we want the most
>>> >>> >          >> >> > important options to be validated on the client
>>> side.
>>> >>> >          >> >>
>>> >>> >          >> >> I don't think this is sustainable in the long run.
>>> >>> >         However, thinking
>>> >>> >          >> >> about this, in the worst case validation happens
>>> after
>>> >>> >         construction
>>> >>> >          >> >> but before execution (as with much of our other
>>> >>> >         validation) so it
>>> >>> >          >> >> isn't that bad.
>>> >>> >          >> >>
>>> >>> >          >> >> > 2) Pass "unknown" options via a separate list in
>>> the
>>> >>> >         Proto which can
>>> >>> >          >> >> > only be accessed internally by the Runners. This
>>> still
>>> >>> >         allows passing
>>> >>> >          >> >> > arbitrary options but we wouldn't leak unknown
>>> options
>>> >>> >         and display them
>>> >>> >          >> >> > as top-level options.
>>> >>> >          >> >>
>>> >>> >          >> >> I think there needs to be a way for the user to
>>> >>> >         communicate values
>>> >>> >          >> >> directly to the runner regardless of the SDK. My
>>> >>> >         preference would be
>>> >>> >          >> >> to make this explicit, e.g. (repeated)
>>> >>> >         --runner_option=..., rather
>>> >>> >          >> >> than scooping up all unknown flags at command line
>>> >>> >         parsing time.
>>> >>> >          >> >> Perhaps an SDK that is aware of some runners could
>>> choose
>>> >>> >         to lift
>>> >>> >          >> >> these as top-level options, but still pass them as
>>> runner
>>> >>> >         options.
>>> >>> >          >> >>
>>> >>> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
>>> >>> >          >> >> > > The current release branch
>>> >>> >          >> >> > >
>>> >>> >         (https://github.com/apache/beam/commits/release-2.8.0)
>>> was cut
>>> >>> >         after the
>>> >>> >          >> >> > > revert went in.  Sent out
>>> >>> >         https://github.com/apache/beam/pull/6683 as a
>>> >>> >          >> >> > > revert of the revert.  Regarding your comment
>>> above,
>>> >>> >         I can help out with
>>> >>> >          >> >> > > the design / PR reviews for common Python code
>>> as you
>>> >>> >         suggest.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise
>>> >>> >         <thw@apache.org <ma...@apache.org>
>>> >>> >          >> >> > > <mailto:thw@apache.org <ma...@apache.org>>>
>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >     Thanks, will tag you and looking forward to
>>> >>> >         feedback so we can
>>> >>> >          >> >> > >     ensure that changes work for everyone.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >     Looking at the PR, I see agreement from Max
>>> to
>>> >>> >         revert the change on
>>> >>> >          >> >> > >     the release branch, but not in master.
>>> Would you
>>> >>> >         mind to restore it
>>> >>> >          >> >> > >     in master?
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >     Thanks
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay
>>> >>> >         <altay@google.com <ma...@google.com>
>>> >>> >          >> >> > >     <mailto:altay@google.com
>>> >>> >         <ma...@google.com>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM,
>>> Charles
>>> >>> >         Chen <ccy@google.com <ma...@google.com>
>>> >>> >          >> >> > >         <mailto:ccy@google.com
>>> >>> >         <ma...@google.com>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >             What I mean is that a user may find
>>> that
>>> >>> >         it works for them
>>> >>> >          >> >> > >             to pass "--myarg blah" and access
>>> it as
>>> >>> >         "options.myarg"
>>> >>> >          >> >> > >             without explicitly defining a
>>> "my_arg"
>>> >>> >         flag due to the added
>>> >>> >          >> >> > >             logic.  This is not the intended
>>> behavior
>>> >>> >         and we may want to
>>> >>> >          >> >> > >             change this implementation detail
>>> in the
>>> >>> >         future.  However,
>>> >>> >          >> >> > >             having this logic in a released
>>> version
>>> >>> >         makes it hard to
>>> >>> >          >> >> > >             change this behavior since users may
>>> >>> >         erroneously depend on
>>> >>> >          >> >> > >             this undocumented behavior.
>>> Instead, we
>>> >>> >         should namespace /
>>> >>> >          >> >> > >             scope this so that it is obvious
>>> that
>>> >>> >         this is meant for
>>> >>> >          >> >> > >             runner (and not Beam user)
>>> consumption.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM
>>> Thomas Weise
>>> >>> >          >> >> > >             <thw@apache.org <mailto:
>>> thw@apache.org>
>>> >>> >         <mailto:thw@apache.org <ma...@apache.org>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                 Can you please elaborate more
>>> what
>>> >>> >         practical problems
>>> >>> >          >> >> > >                 this introduces for users?
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                 I can see that this change
>>> allows a
>>> >>> >         user to specify a
>>> >>> >          >> >> > >                 runner specific option, which
>>> in the
>>> >>> >         future may change
>>> >>> >          >> >> > >                 because we decide to scope
>>> >>> >         differently. If this only
>>> >>> >          >> >> > >                 affects users of the portable
>>> Flink
>>> >>> >         runner (like us),
>>> >>> >          >> >> > >                 then no need to revert, because
>>> at
>>> >>> >         this early stage we
>>> >>> >          >> >> > >                 prefer something that works over
>>> >>> >         being blocked.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                 It would also be really great
>>> if some
>>> >>> >         of the core Python
>>> >>> >          >> >> > >                 SDK developers could help out
>>> with
>>> >>> >         the design aspects
>>> >>> >          >> >> > >                 and PR reviews of changes that
>>> affect
>>> >>> >         common Python
>>> >>> >          >> >> > >                 code. Anyone who specifically
>>> wants
>>> >>> >         to be tagged on
>>> >>> >          >> >> > >                 relevant JIRAs and PRs?
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >         I would be happy to be tagged, and I
>>> can also
>>> >>> >         help with
>>> >>> >          >> >> > >         including other relevant folks whenever
>>> >>> >         possible. In general I
>>> >>> >          >> >> > >         think Robert, Charles, myself are good
>>> >>> >         candidates.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                 Thanks
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM
>>> >>> >         Ahmet Altay
>>> >>> >          >> >> > >                 <altay@google.com
>>> >>> >         <ma...@google.com> <mailto:altay@google.com
>>> >>> >         <ma...@google.com>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                     On Fri, Oct 12, 2018 at
>>> 10:11 AM,
>>> >>> >         Charles Chen
>>> >>> >          >> >> > >                     <ccy@google.com
>>> >>> >         <ma...@google.com> <mailto:ccy@google.com
>>> >>> >         <ma...@google.com>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                         For context, I made
>>> comments on
>>> >>> >          >> >> > > https://github.com/apache/beam/pull/6600 noting
>>> >>> >          >> >> > >                         that the changes being
>>> made
>>> >>> >         were not good for
>>> >>> >          >> >> > >                         Beam
>>> >>> >         backwards-compatibility.  The change as is
>>> >>> >          >> >> > >                         allows users to use
>>> pipeline
>>> >>> >         options without
>>> >>> >          >> >> > >                         explicitly defining
>>> them,
>>> >>> >         which is not the type
>>> >>> >          >> >> > >                         of usage we would like
>>> to
>>> >>> >         encourage since we
>>> >>> >          >> >> > >                         prefer to be explicit
>>> >>> >         whenever possible.  If
>>> >>> >          >> >> > >                         users write pipelines
>>> with
>>> >>> >         this sort of pattern,
>>> >>> >          >> >> > >                         they will potentially
>>> >>> >         encounter pain when
>>> >>> >          >> >> > >                         upgrading to a later
>>> version
>>> >>> >         since this is an
>>> >>> >          >> >> > >                         implementation detail
>>> and not
>>> >>> >         an officially
>>> >>> >          >> >> > >                         supported pattern.  I
>>> agree
>>> >>> >         with the comments
>>> >>> >          >> >> > >                         above that this is
>>> ultimately
>>> >>> >         a scoping issue.
>>> >>> >          >> >> > >                         I would not have a
>>> problem
>>> >>> >         with these changes if
>>> >>> >          >> >> > >                         they were explicitly
>>> scoped
>>> >>> >         under either a
>>> >>> >          >> >> > >                         runner or unparsed
>>> options
>>> >>> >         namespace.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                         As a second note, since
>>> the
>>> >>> >         2.8.0 release is
>>> >>> >          >> >> > >                         being cut right now,
>>> because
>>> >>> >         of these
>>> >>> >          >> >> > >                         backwards-compatibility
>>> >>> >         concerns, I would
>>> >>> >          >> >> > >                         suggest reverting these
>>> >>> >         changes, at least until
>>> >>> >          >> >> > >                         2.8.0 is cut, so we can
>>> have
>>> >>> >         a discussion here
>>> >>> >          >> >> > >                         before committing to and
>>> >>> >         releasing any API-level
>>> >>> >          >> >> > >                         changes.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                     +1 I would like to revert
>>> the
>>> >>> >         changes in order not
>>> >>> >          >> >> > >                     to rush this into the release.
>>> Once
>>> >>> >         this discussion
>>> >>> >          >> >> > >                     results in an agreement
>>> changes
>>> >>> >         can be brought back.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                         On Fri, Oct 12, 2018 at
>>> 9:26
>>> >>> >         AM Henning Rohde
>>> >>> >          >> >> > >                         <herohde@google.com
>>> >>> >         <ma...@google.com> <mailto:herohde@google.com
>>> >>> >         <ma...@google.com>>>
>>> >>> >          >> >> > >                         wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                             Agree that pipeline
>>> >>> >         options lack some
>>> >>> >          >> >> > >                             mechanism for
>>> scoping. It
>>> >>> >         is also not always
>>> >>> >          >> >> > >                             possible to distinguish
>>> >>> >         options meant to be
>>> >>> >          >> >> > >                             consumed at pipeline
>>> >>> >         construction time, by
>>> >>> >          >> >> > >                             the runner, by the
>>> SDK
>>> >>> >         harness, by the user
>>> >>> >          >> >> > >                             code or any
>>> combination
>>> >>> >         -- and this causes
>>> >>> >          >> >> > >                             confusion every now
>>> and then.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                             For Dataflow, we
>>> have
>>> >>> >         been using
>>> >>> >          >> >> > >                             "experiments" for
>>> >>> >         arbitrary runner-specific
>>> >>> >          >> >> > >                             options. It's
>>> simply a
>>> >>> >         string list pipeline
>>> >>> >          >> >> > >                             option that all SDKs
>>> >>> >         support and, for Go at
>>> >>> >          >> >> > >                             least, is sent to
>>> >>> >         portable runners. Flink
>>> >>> >          >> >> > >                             can do the same in
>>> the
>>> >>> >         short term to move
>>> >>> >          >> >> > >                             forward.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                             Henning
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                             On Fri, Oct 12,
>>> 2018 at
>>> >>> >         8:50 AM Thomas Weise
>>> >>> >          >> >> > >                             <thw@apache.org
>>> >>> >         <ma...@apache.org> <mailto:thw@apache.org
>>> >>> >         <ma...@apache.org>>> wrote:
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 [moving to the
>>> list]
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 The requirement
>>> >>> >         driving this part of the
>>> >>> >          >> >> > >                                 change was to
>>> allow a
>>> >>> >         user to specify
>>> >>> >          >> >> > >                                 pipeline
>>> options that
>>> >>> >         a runner supports
>>> >>> >          >> >> > >                                 without having
>>> to
>>> >>> >         declare those in each
>>> >>> >          >> >> > >                                 language SDK.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 In the specific
>>> >>> >         scenario, we have
>>> >>> >          >> >> > >                                 options that the
>>> >>> >         Flink runner supports
>>> >>> >          >> >> > >                                 (and can
>>> validate),
>>> >>> >         that are not
>>> >>> >          >> >> > >                                 enumerated in
>>> the
>>> >>> >         Python SDK.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 I think we have
>>> a
>>> >>> >         bigger problem scoping
>>> >>> >          >> >> > >                                 pipeline
>>> options. For
>>> >>> >         example, the
>>> >>> >          >> >> > >                                 runner options
>>> are
>>> >>> >         dumped into the SDK
>>> >>> >          >> >> > >                                 worker. There
>>> is also
>>> >>> >         a possibility of
>>> >>> >          >> >> > >                                 name
>>> collisions. So I
>>> >>> >         think this would
>>> >>> >          >> >> > >                                 benefit from
>>> broader
>>> >>> >         feedback.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 Thanks,
>>> >>> >          >> >> > >                                 Thomas
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 ----------
>>> Forwarded
>>> >>> >         message ---------
>>> >>> >          >> >> > >                                 From: *Charles
>>> Chen*
>>> >>> >          >> >> > >
>>> >>> >           <notifications@github.com <mailto:
>>> notifications@github.com>
>>> >>> >          >> >> > >
>>> >>> >           <mailto:notifications@github.com
>>> >>> >         <ma...@github.com>>>
>>> >>> >          >> >> > >                                 Date: Fri, Oct
>>> 12,
>>> >>> >         2018 at 8:36 AM
>>> >>> >          >> >> > >                                 Subject: Re:
>>> >>> >         [apache/beam] [BEAM-5442]
>>> >>> >          >> >> > >                                 Store duplicate
>>> >>> >         unknown options in a
>>> >>> >          >> >> > >                                 list argument
>>> (#6600)
>>> >>> >          >> >> > >                                 To: apache/beam
>>> >>> >         <beam@noreply.github.com <ma...@noreply.github.com>
>>> >>> >          >> >> > >
>>> >>> >           <mailto:beam@noreply.github.com <mailto:
>>> beam@noreply.github.com>>>
>>> >>> >          >> >> > >                                 Cc: Thomas Weise
>>> >>> >         <thomas.weise@gmail.com <ma...@gmail.com>
>>> >>> >          >> >> > >
>>> >>> >           <mailto:thomas.weise@gmail.com <mailto:
>>> thomas.weise@gmail.com>>>,
>>> >>> >          >> >> > >                                 Mention
>>> >>> >         <mention@noreply.github.com <mailto:
>>> mention@noreply.github.com>
>>> >>> >          >> >> > >
>>> >>> >           <mailto:mention@noreply.github.com
>>> >>> >         <ma...@noreply.github.com>>>
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 CC: @tweise
>>> >>> >         <https://github.com/tweise>
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >                                 —
>>> >>> >          >> >> > >                                 You are
>>> receiving
>>> >>> >         this because you were
>>> >>> >          >> >> > >                                 mentioned.
>>> >>> >          >> >> > >                                 Reply to this
>>> email
>>> >>> >         directly, view it on
>>> >>> >          >> >> > >                                 GitHub
>>> >>> >          >> >> > >
>>> >>> >           <
>>> https://github.com/apache/beam/pull/6600#issuecomment-429367754>,
>>> >>> >          >> >> > >                                 or mute the
>>> thread
>>> >>> >          >> >> > >
>>> >>> >           <
>>> https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T
>>> >.
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >          >> >> > >
>>> >>> >
>>>
>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Robert Bradshaw <ro...@google.com>.

On Wed, Oct 17, 2018 at 11:35 PM Lukasz Cwik <lc...@google.com> wrote:

>
> On Tue, Oct 16, 2018 at 11:51 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik <lc...@google.com> wrote:
>> >
>> > For all unknown options, the SDK can require that all flag values be
>> specified explicitly as a valid JSON type.
>> > starts with { -> object
>> > starts with [ -> list
>> > starts with " -> string
>> > is null / true / false -> null / true / false
>> > otherwise is number.
>> >
>> > This isn't great for strings but works well for all the other types.
>> >
>> > Thus for known options, the additional typing information would
>> disambiguate whether something should be a
>> string/boolean/number/object/list but for unknown options we would expect
>> the user to use valid JSON explicitly and write:
>> > --foo={"object": "value"}
>> > --foo=["value", "value2"]
>> > --foo="string value"
>>
>> Due to shell escaping, one would have to write
>>
>> --foo=\"string value\"
>>
>> or actually, due to the space
>>
>> --foo='"string value"'
>>
>> or some other variation on that, which is really unfortunate. (The JSON
>> list/objects would need similar quoting, but that's less surprising.) Also,
>> does this mean we'd only have one kind of number (not integer vs. float,
>> i.e. --parallelism=5.0 works)? I suppose that is JSON.
>>
>
> Yes, I suspected that users would need to type the second variant, as I
> found \"...\" more burdensome than '"..."'.
>
>
>>
>> > --foo=3.5 --foo=-4
>> > --foo=true --foo=false
>> > --foo=null
>> > This also works if the flag is repeated, so --foo=3.5 --foo=-4 is [3.5,
>> -4]
>>
>> The thing that sparked this discussion was what to do when unknown foo is
>> repeated, but only one value given.
>>
>
> If the person only specifies one value, then they have to disambiguate and
> put it in a list, only if they specify more than one value will they have
> to turn it into a list.
>
> I believe we could come up with other schemes on how to convert unknown
> options to JSON where we prefer strings over non-string types like
> null/true/false/numbers/list/object and require the user to escape out of
> the string default but anything that is too different from strict JSON
> would cause headaches when attempting to explain the format to users. I
> think a happy middle ground would be that we will only require escaping for
> strings which are ambiguous, so things like true, null, false, ... to be
> treated as strings would require the user to escape them.
>

I'd prefer to avoid inferring the type of an unknown argument based on its
contents, which can lead to surprises. We could declare every unknown option
to be a repeated string, and let any parsing/validation occur on the runner.
If desired, we could pass these around as a single "runner options" dict
that runners could inspect and use to populate the actual dict rather than
mixing parsed and unparsed options.
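
A minimal sketch of the "repeated string" idea, for illustration only. It
assumes Python's standard argparse module; the function name
split_known_and_runner_options and the single declared --runner option are
hypothetical stand-ins, not part of any Beam API. Declared options are parsed
as usual, while every unknown --name=value flag is kept as a list of raw
strings, with no type inferred from its contents.

```python
import argparse
from collections import defaultdict


def split_known_and_runner_options(argv):
    """Parse declared options; keep all other flags as repeated strings."""
    parser = argparse.ArgumentParser()
    # '--runner' stands in for any option the SDK declares itself.
    parser.add_argument("--runner", default=None)
    known, unknown = parser.parse_known_args(argv)

    # Each unknown name maps to a list of raw string values. Nothing is
    # interpreted here; parsing and validation are left to the runner.
    runner_options = defaultdict(list)
    for arg in unknown:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError("expected --name=value, got %r" % arg)
        name, value = arg[2:].split("=", 1)
        runner_options[name].append(value)
    return vars(known), dict(runner_options)


known, runner_opts = split_known_and_runner_options(
    ["--runner=PortableRunner", "--parallelism=5", "--foo=true", "--foo=null"]
)
```

Here "5", "true" and "null" all stay strings, and a flag given once still
yields a one-element list, so the single-value versus repeated-value
ambiguity does not arise on the SDK side.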


>
>
>> > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise <th...@apache.org> wrote:
>> >>
>> >> Discovering options from the job server seems preferable over
>> replicating runner options in SDKs.
>> >>
>> >> Runners evolve on their own, and with portability the SDK does not
>> need to know anything about the runner.
>> >>
>> >> Regarding --runner-option. It is true that this looks less user
>> friendly. On the other hand it eliminates the possibility of name
>> collisions.
>> >>
>> >> But if options are discovered, the SDK can perform full validation. It
>> would only be necessary to use explicit scoping when there is ambiguity.
>> >>
>> >> Thomas
>> >>
>> >>
>> >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels <mx...@apache.org>
>> wrote:
>> >>>
>> >>> Fetching options directly from the Runner's JobServer seems like the
>> >>> ideal solution. I agree with Robert that it creates additional
>> >>> complexity for SDK authors, so the `--runner-option` flag would be an
>> >>> easy and explicit way to specify additional Runner options.
>> >>>
>> >>> The format I prefer would be: --runner_option=key1=val1
>> >>> --runner_option=key2=val2
>> >>>
>> >>> Now, from the perspective of end users, I think it is neither
>> convenient
>> >>> nor reasonable to require the use of the `--runner-option` flag. To
>> the
>> >>> user it seems nebulous why some pipeline options live in the top-level
>> >>> option namespace while others need to be nested within an option. This
>> >>> is amplified by there being two Runners the user needs to be aware of,
>> >>> i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark..).
>> >>>
>> >>> I feel like we would eventually replicate all options in the SDK
>> because
>> >>> otherwise users have to use the `--runner-option`, but at least we can
>> >>> specify options which have not been replicated yet.
>> >>>
>> >>> -Max
>> >>>
>> >>> On 16.10.18 10:27, Robert Bradshaw wrote:
>> >>> > Yes, we don't know how to parse and/or validate it.
>> >>> >
>> >>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lcwik@google.com
>> >>> > <ma...@google.com>> wrote:
>> >>> >
>> >>> >     I see, is the issue that we currently are using a JSON
>> >>> >     representation for options when being serialized and when we get
>> >>> >     some unknown option, we don't know how to convert it into its
>> JSON form?
>> >>> >
>> >>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <
>> robertwb@google.com
>> >>> >     <ma...@google.com>> wrote:
>> >>> >
>> >>> >         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <
>> lcwik@google.com
>> >>> >         <ma...@google.com>> wrote:
>> >>> >          >
>> >>> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw
>> >>> >         <robertwb@google.com <ma...@google.com>> wrote:
>> >>> >          >>
>> >>> >          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik
>> >>> >         <lcwik@google.com <ma...@google.com>> wrote:
>> >>> >          >> >
>> >>> >          >> > I agree with the sentiment for better error checking.
>> >>> >          >> >
>> >>> >          >> > We can try to make it such that the SDK can "fetch"
>> the
>> >>> >         set of options that the runner supports by making a call to
>> the
>> >>> >         Job API. The API could return a list of option names
>> >>> >         (descriptions for --help purposes and also potentially the
>> >>> >         expected format) which would remove the worry around
>> "unknown"
>> >>> >         options. Yes I understand to be able to make the Job API
>> call,
>> >>> >         we may need to parse some options from the args parameters
>> first
>> >>> >         and then parse the unknown options after they are fetched.
>> >>> >          >>
>> >>> >          >> This is an interesting idea, but seems it could get
>> quite
>> >>> >         complicated.
>> >>> >          >> E.g. for delegating runners, one would first read the
>> options to
>> >>> >          >> determine which runner to fetch the options from, which
>> >>> >         would then
>> >>> >          >> return a set of options that possibly depends on the
>> values
>> >>> >         of some of
>> >>> >          >> its options...
>> >>> >          >>
>> >>> >          >> > Alternatively, we can choose an explicit format
>> upfront.
>> >>> >          >> > To expand on the exact format for --runner_option=...,
>> >>> >         here are some different ideas:
>> >>> >          >> > 1) Specified multiple times, each one is an explicit
>> flag
>> >>> >          >> > --runner_option=--blah=bar --runner_option=--foo=baz1
>> >>> >         --runner_option=--foo=baz2
>> >>> >          >>
>> >>> >          >> I'm -1 on this format. We should move away from the idea
>> >>> >         that options
>> >>> >          >> == flags (as that doesn't compose well with other
>> libraries
>> >>> >         that do
>> >>> >          >> their own flags parsing). The ability to parse a set of
>> >>> >         flags into
>> >>> >          >> options is just a convenience that an author may (or may
>> >>> >         not) choose
>> >>> >          >> to use (e.g. when running pipelines in a long-lived
>> process like a
>> >>> >          >> service or a notebook, the command line flags are almost
>> >>> >         certainly not
>> >>> >          >> the right interface).
>> >>> >          >>
>> >>> >          >> > 2) specified multiple times, we drop the explicit flag
>> >>> >          >> > --runner_option=blah=bar --runner_option=foo=baz1
>> >>> >         --runner_option=foo=baz2
>> >>> >          >>
>> >>> >          >> This or (4) is my preference.
>> >>> >          >>
>> >>> >          >> > 3) we use a string which the runner can choose to
>> >>> >         interpret however they want (JSON/XML shown below)
>> >>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1",
>> "baz2"]}'
>> >>> >          >> >
>> >>> >
>> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>> >>> >          >>
>> >>> >          >> This would make validation hard. Also, I think it makes
>> >>> >         sense for some
>> >>> >          >> runner options to be "shared" ("parallelism") by
>> convention,
>> >>> >         so letting
>> >>> >          >> it be a free-form string wouldn't allow different
>> runners to
>> >>> >         inspect
>> >>> >          >> different bits.
>> >>> >          >>
>> >>> >          >> We should consider if we should use urns for
>> namespacing, and
>> >>> >          >> assigning semantic meaning to strings, here.
>> >>> >          >>
>> >>> >          >> > 4) we use a string which must be a specific format
>> such as
>> >>> >         JSON (allows the SDK to do simple validation):
>> >>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1",
>> "baz2"]}'
>> >>> >          >>
>> >>> >          >> I like this in that at least some validation can be
>> >>> >         performed, and
>> >>> >          >> expectations of how to format richer types. On the other
>> >>> >         hand it gets
>> >>> >          >> a bit verbose, given that most (I'd imagine) options
>> will be
>> >>> >         simple.
>> >>> >          >> As with normal options,
>> >>> >          >>
>> >>> >          >>     --option1=value1 --option2=value2
>> >>> >          >>
>> >>> >          >> is shorthand for {"option1": value1, "option2": value2}.
>> >>> >          >>
>> >>> >          > I lean to 4 the most. With 2, you run into issues of what
>> >>> >         does --runner_option=foo=["a", "b"]
>> --runner_option=foo=["c",
>> >>> >         "d"] mean?
>> >>> >          > Is it an error or list of lists or concatenated. Similar
>> >>> >         issues for map types represented via JSON object {...}
>> >>> >
>> >>> >         We can err to be on the safe side unless/until an argument
>> can
>> >>> >         be made
>> >>> >         that merging is more natural. I just think this will be
>> excessively
>> >>> >         verbose to use.
>> >>> >
>> >>> >          >> > I would strongly suggest that we go with the "fetch"
>> >>> >         approach, since this makes the set of options discoverable
>> and
>> >>> >         helps users find errors much earlier in their pipeline.
>> >>> >          >>
>> >>> >          >> This seems like an advanced feature that SDKs may want
>> to
>> >>> >         support, but
>> >>> >          >> I wouldn't want to require this complexity for
>> bootstrapping
>> >>> >         an SDK.
>> >>> >          >>
>> >>> >          > SDKs that are starting off wouldn't need to "fetch"
>> options,
>> >>> >         they could choose to not support runner options or they
>> could
>> >>> >         choose to pass all options through to the runner blindly.
>> >>> >         Fetching the options only provides the SDK the ability to
>> >>> >         provide error checking upfront and useful error/help
>> messages.
>> >>> >
>> >>> >         But how to even pass all options through blindly is exactly
>> the
>> >>> >         difficulty we're running into here.
>> >>> >
>> >>> >          >> Regarding always keeping runner options separate, +1,
>> though
>> >>> >         I'm not
>> >>> >          >> sure the line is always clear.
>> >>> >          >>
>> >>> >          >>
>> >>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw
>> >>> >         <robertwb@google.com <ma...@google.com>> wrote:
>> >>> >          >> >>
>> >>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels
>> >>> >         <mxm@apache.org <ma...@apache.org>> wrote:
>> >>> >          >> >> >
>> >>> >          >> >> > I agree that the current approach breaks the
>> pipeline
>> >>> >         options contract
>> >>> >          >> >> > because "unknown" options get parsed in the same
>> way as
>> >>> >         options which
>> >>> >          >> >> > have been defined by the user.
>> >>> >          >> >>
>> >>> >          >> >> FWIW, I think we're already breaking this "contract."
>> >>> >         Unknown options
>> >>> >          >> >> are silently ignored; with this change we just
>> change how
>> >>> >         we record
>> >>> >          >> >> them. It still feels a bit hacky though.
>> >>> >          >> >>
>> >>> >          >> >> > I'm not sure the `experiments` flag works for us.
>> AFAIK
>> >>> >         it only allows
>> >>> >          >> >> > true/false flags. We want to pass all types of
>> pipeline
>> >>> >         options to the
>> >>> >          >> >> > Runner.
>> >>> >          >> >>
>> >>> >          >> >> Experiments is an arbitrary set of strings, which
>> can be
>> >>> >         of the form
>> >>> >          >> >> "param=value" if that's useful. (Dataflow does this.)
>> >>> >         There is, again,
>> >>> >          >> >> no namespacing on the param names, but we could use
>> urns
>> >>> >         or impose
>> >>> >          >> >> some other structure here.
>> >>> >          >> >>
>> >>> >          >> >> > How to solve this?
>> >>> >          >> >> >
>> >>> >          >> >> > 1) Add all options of all Runners to each SDK
>> >>> >          >> >> > We added some of the FlinkRunner options to the
>> Python
>> >>> >         SDK but realized
>> >>> >          >> >> > syncing is rather cumbersome in the long term.
>> However,
>> >>> >         we want the most
>> >>> >          >> >> > important options to be validated on the client
>> side.
>> >>> >          >> >>
>> >>> >          >> >> I don't think this is sustainable in the long run.
>> >>> >         However, thinking
>> >>> >          >> >> about this, in the worst case validation happens
>> after
>> >>> >         construction
>> >>> >          >> >> but before execution (as with much of our other
>> >>> >         validation) so it
>> >>> >          >> >> isn't that bad.
>> >>> >          >> >>
>> >>> >          >> >> > 2) Pass "unknown" options via a separate list in
>> the
>> >>> >         Proto which can
>> >>> >          >> >> > only be accessed internally by the Runners. This
>> still
>> >>> >         allows passing
>> >>> >          >> >> > arbitrary options but we wouldn't leak unknown
>> options
>> >>> >         and display them
>> >>> >          >> >> > as top-level options.
>> >>> >          >> >>
>> >>> >          >> >> I think there needs to be a way for the user to
>> >>> >         communicate values
>> >>> >          >> >> directly to the runner regardless of the SDK. My
>> >>> >         preference would be
>> >>> >          >> >> to make this explicit, e.g. (repeated)
>> >>> >         --runner_option=..., rather
>> >>> >          >> >> than scooping up all unknown flags at command line
>> >>> >         parsing time.
>> >>> >          >> >> Perhaps an SDK that is aware of some runners could
>> choose
>> >>> >         to lift
>> >>> >          >> >> these as top-level options, but still pass them as
>> runner
>> >>> >         options.
>> >>> >          >> >>
>> >>> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
>> >>> >          >> >> > > The current release branch
>> >>> >          >> >> > >
>> >>> >         (https://github.com/apache/beam/commits/release-2.8.0) was
>> cut
>> >>> >         after the
>> >>> >          >> >> > > revert went in.  Sent out
>> >>> >         https://github.com/apache/beam/pull/6683 as a
>> >>> >          >> >> > > revert of the revert.  Regarding your comment
>> above,
>> >>> >         I can help out with
>> >>> >          >> >> > > the design / PR reviews for common Python code
>> as you
>> >>> >         suggest.
>> >>> >          >> >> > >
>> >>> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise
>> >>> >         <thw@apache.org <ma...@apache.org>
>> >>> >          >> >> > > <mailto:thw@apache.org <ma...@apache.org>>>
>> wrote:
>> >>> >          >> >> > >
>> >>> >          >> >> > >     Thanks, will tag you and looking forward to
>> >>> >         feedback so we can
>> >>> >          >> >> > >     ensure that changes work for everyone.
>> >>> >          >> >> > >
>> >>> >          >> >> > >     Looking at the PR, I see agreement from Max
>> to
>> >>> >         revert the change on
>> >>> >          >> >> > >     the release branch, but not in master. Would
>> you
>> >>> >         mind to restore it
>> >>> >          >> >> > >     in master?
>> >>> >          >> >> > >
>> >>> >          >> >> > >     Thanks
>> >>> >          >> >> > >
>> >>> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay
>> >>> >         <altay@google.com <ma...@google.com>
>> >>> >          >> >> > >     <mailto:altay@google.com
>> >>> >         <ma...@google.com>>> wrote:
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles
>> >>> >         Chen <ccy@google.com <ma...@google.com>
>> >>> >          >> >> > >         <mailto:ccy@google.com
>> >>> >         <ma...@google.com>>> wrote:
>> >>> >          >> >> > >
>> >>> >          >> >> > >             What I mean is that a user may find
>> that
>> >>> >         it works for them
>> >>> >          >> >> > >             to pass "--myarg blah" and access it
>> as
>> >>> >         "options.myarg"
>> >>> >          >> >> > >             without explicitly defining a
>> "my_arg"
>> >>> >         flag due to the added
>> >>> >          >> >> > >             logic.  This is not the intended
>> behavior
>> >>> >         and we may want to
>> >>> >          >> >> > >             change this implementation detail in
>> the
>> >>> >         future.  However,
>> >>> >          >> >> > >             having this logic in a released
>> version
>> >>> >         makes it hard to
>> >>> >          >> >> > >             change this behavior since users may
>> >>> >         erroneously depend on
>> >>> >          >> >> > >             this undocumented behavior.
>> Instead, we
>> >>> >         should namespace /
>> >>> >          >> >> > >             scope this so that it is obvious that
>> >>> >         this is meant for
>> >>> >          >> >> > >             runner (and not Beam user)
>> consumption.
>> >>> >          >> >> > >
>> >>> >          >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM
>> Thomas Weise
>> >>> >          >> >> > >             <thw@apache.org <mailto:
>> thw@apache.org>
>> >>> >         <mailto:thw@apache.org <ma...@apache.org>>> wrote:
>> >>> >          >> >> > >
>> >>> >          >> >> > >                 Can you please elaborate more
>> what
>> >>> >         practical problems
>> >>> >          >> >> > >                 this introduces for users?
>> >>> >          >> >> > >
>> >>> >          >> >> > >                 I can see that this change
>> allows a
>> >>> >         user to specify a
>> >>> >          >> >> > >                 runner specific option, which in
>> the
>> >>> >         future may change
>> >>> >          >> >> > >                 because we decide to scope
>> >>> >         differently. If this only
>> >>> >          >> >> > >                 affects users of the portable
>> Flink
>> >>> >         runner (like us),
>> >>> >          >> >> > >                 then no need to revert, because
>> at
>> >>> >         this early stage we
>> >>> >          >> >> > >                 prefer something that works over
>> >>> >         being blocked.
>> >>> >          >> >> > >
>> >>> >          >> >> > >                 It would also be really great if
>> some
>> >>> >         of the core Python
>> >>> >          >> >> > >                 SDK developers could help out
>> with
>> >>> >         the design aspects
>> >>> >          >> >> > >                 and PR reviews of changes that
>> affect
>> >>> >         common Python
>> >>> >          >> >> > >                 code. Anyone who specifically
>> wants
>> >>> >         to be tagged on
>> >>> >          >> >> > >                 relevant JIRAs and PRs?
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >         I would be happy to be tagged, and I can
>> also
>> >>> >         help with
>> >>> >          >> >> > >         including other relevant folks whenever
>> >>> >         possible. In general I
>> >>> >          >> >> > >         think Robert, Charles, myself are good
>> >>> >         candidates.
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >                 Thanks
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM
>> >>> >         Ahmet Altay
>> >>> >          >> >> > >                 <altay@google.com
>> >>> >         <ma...@google.com> <mailto:altay@google.com
>> >>> >         <ma...@google.com>>> wrote:
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >                     On Fri, Oct 12, 2018 at
>> 10:11 AM,
>> >>> >         Charles Chen
>> >>> >          >> >> > >                     <ccy@google.com
>> >>> >         <ma...@google.com> <mailto:ccy@google.com
>> >>> >         <ma...@google.com>>> wrote:
>> >>> >          >> >> > >
>> >>> >          >> >> > >                         For context, I made
>> comments on
>> >>> >          >> >> > > https://github.com/apache/beam/pull/6600 noting
>> >>> >          >> >> > >                         that the changes being
>> made
>> >>> >         were not good for
>> >>> >          >> >> > >                         Beam
>> >>> >         backwards-compatibility.  The change as is
>> >>> >          >> >> > >                         allows users to use
>> pipeline
>> >>> >         options without
>> >>> >          >> >> > >                         explicitly defining them,
>> >>> >         which is not the type
>> >>> >          >> >> > >                         of usage we would like to
>> >>> >         encourage since we
>> >>> >          >> >> > >                         prefer to be explicit
>> >>> >         whenever possible.  If
>> >>> >          >> >> > >                         users write pipelines
>> with
>> >>> >         this sort of pattern,
>> >>> >          >> >> > >                         they will potentially
>> >>> >         encounter pain when
>> >>> >          >> >> > >                         upgrading to a later
>> version
>> >>> >         since this is an
>> >>> >          >> >> > >                         implementation detail
>> and not
>> >>> >         an officially
>> >>> >          >> >> > >                         supported pattern.  I
>> agree
>> >>> >         with the comments
>> >>> >          >> >> > >                         above that this is
>> ultimately
>> >>> >         a scoping issue.
>> >>> >          >> >> > >                         I would not have a
>> problem
>> >>> >         with these changes if
>> >>> >          >> >> > >                         they were explicitly
>> scoped
>> >>> >         under either a
>> >>> >          >> >> > >                         runner or unparsed
>> options
>> >>> >         namespace.
>> >>> >          >> >> > >
>> >>> >          >> >> > >                         As a second note, since
>> the
>> >>> >         2.8.0 release is
>> >>> >          >> >> > >                         being cut right now,
>> because
>> >>> >         of these
>> >>> >          >> >> > >                         backwards-compatibility
>> >>> >         concerns, I would
>> >>> >          >> >> > >                         suggest reverting these
>> >>> >         changes, at least until
>> >>> >          >> >> > >                         2.8.0 is cut, so we can
>> have
>> >>> >         a discussion here
>> >>> >          >> >> > >                         before committing to and
>> >>> >         releasing any API-level
>> >>> >          >> >> > >                         changes.
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >                     +1 I would like to revert the
>> >>> >         changes in order not
>> >>> >          >> >> > >                     rush this into the release.
>> Once
>> >>> >         this discussion
>> >>> >          >> >> > >                     results in an agreement
>> changes
>> >>> >         can be brought back.
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >                         On Fri, Oct 12, 2018 at
>> 9:26
>> >>> >         AM Henning Rohde
>> >>> >          >> >> > >                         <herohde@google.com
>> >>> >         <ma...@google.com> <mailto:herohde@google.com
>> >>> >         <ma...@google.com>>>
>> >>> >          >> >> > >                         wrote:
>> >>> >          >> >> > >
>> >>> >          >> >> > >                             Agree that pipeline
>> >>> >         options lack some
>> >>> >          >> >> > >                             mechanism for
>> scoping. It
>> >>> >         is also not always
>> >>> >          >> >> > >                             possible to distinguish
>> >>> >         options meant to be
>> >>> >          >> >> > >                             consumed at pipeline
>> >>> >         construction time, by
>> >>> >          >> >> > >                             the runner, by the
>> SDK
>> >>> >         harness, by the user
>> >>> >          >> >> > >                             code or any
>> combination
>> >>> >         -- and this causes
>> >>> >          >> >> > >                             confusion every now
>> and then.
>> >>> >          >> >> > >
>> >>> >          >> >> > >                             For Dataflow, we have
>> >>> >         been using
>> >>> >          >> >> > >                             "experiments" for
>> >>> >         arbitrary runner-specific
>> >>> >          >> >> > >                             options. It's simply
>> a
>> >>> >         string list pipeline
>> >>> >          >> >> > >                             option that all SDKs
>> >>> >         support and, for Go at
>> >>> >          >> >> > >                             least, is sent to
>> >>> >         portable runners. Flink
>> >>> >          >> >> > >                             can do the same in
>> the
>> >>> >         short term to move
>> >>> >          >> >> > >                             forward.
>> >>> >          >> >> > >
>> >>> >          >> >> > >                             Henning
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >                             On Fri, Oct 12, 2018
>> at
>> >>> >         8:50 AM Thomas Weise
>> >>> >          >> >> > >                             <thw@apache.org
>> >>> >         <ma...@apache.org> <mailto:thw@apache.org
>> >>> >         <ma...@apache.org>>> wrote:
>> >>> >          >> >> > >
>> >>> >          >> >> > >                                 [moving to the
>> list]
>> >>> >          >> >> > >
>> >>> >          >> >> > >                                 The requirement
>> >>> >         driving this part of the
>> >>> >          >> >> > >                                 change was to
>> allow a
>> >>> >         user to specify
>> >>> >          >> >> > >                                 pipeline options
>> that
>> >>> >         a runner supports
>> >>> >          >> >> > >                                 without having to
>> >>> >         declare those in each
>> >>> >          >> >> > >                                 language SDK.
>> >>> >          >> >> > >
>> >>> >          >> >> > >                                 In the specific
>> >>> >         scenario, we have
>> >>> >          >> >> > >                                 options that the
>> >>> >         Flink runner supports
>> >>> >          >> >> > >                                 (and can
>> validate),
>> >>> >         that are not
>> >>> >          >> >> > >                                 enumerated in the
>> >>> >         Python SDK.
>> >>> >          >> >> > >
>> >>> >          >> >> > >                                 I think we have a
>> >>> >         bigger problem scoping
>> >>> >          >> >> > >                                 pipeline
>> options. For
>> >>> >         example, the
>> >>> >          >> >> > >                                 runner options
>> are
>> >>> >         dumped into the SDK
>> >>> >          >> >> > >                                 worker. There is
>> also
>> >>> >         a possibility of
>> >>> >          >> >> > >                                 name collisions.
>> So I
>> >>> >         think this would
>> >>> >          >> >> > >                                 benefit from
>> broader
>> >>> >         feedback.
>> >>> >          >> >> > >
>> >>> >          >> >> > >                                 Thanks,
>> >>> >          >> >> > >                                 Thomas
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >                                 ----------
>> Forwarded
>> >>> >         message ---------
>> >>> >          >> >> > >                                 From: *Charles
>> Chen*
>> >>> >          >> >> > >
>> >>> >           <notifications@github.com <mailto:
>> notifications@github.com>
>> >>> >          >> >> > >
>> >>> >           <mailto:notifications@github.com
>> >>> >         <ma...@github.com>>>
>> >>> >          >> >> > >                                 Date: Fri, Oct
>> 12,
>> >>> >         2018 at 8:36 AM
>> >>> >          >> >> > >                                 Subject: Re:
>> >>> >         [apache/beam] [BEAM-5442]
>> >>> >          >> >> > >                                 Store duplicate
>> >>> >         unknown options in a
>> >>> >          >> >> > >                                 list argument
>> (#6600)
>> >>> >          >> >> > >                                 To: apache/beam
>> >>> >         <beam@noreply.github.com <ma...@noreply.github.com>
>> >>> >          >> >> > >
>> >>> >           <mailto:beam@noreply.github.com <mailto:
>> beam@noreply.github.com>>>
>> >>> >          >> >> > >                                 Cc: Thomas Weise
>> >>> >         <thomas.weise@gmail.com <ma...@gmail.com>
>> >>> >          >> >> > >
>> >>> >           <mailto:thomas.weise@gmail.com <mailto:
>> thomas.weise@gmail.com>>>,
>> >>> >          >> >> > >                                 Mention
>> >>> >         <mention@noreply.github.com <mailto:
>> mention@noreply.github.com>
>> >>> >          >> >> > >
>> >>> >           <mailto:mention@noreply.github.com
>> >>> >         <ma...@noreply.github.com>>>
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >                                 CC: @tweise
>> >>> >         <https://github.com/tweise>
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >          >> >> > >
>> >>> >
>>
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Lukasz Cwik <lc...@google.com>.

On Tue, Oct 16, 2018 at 11:51 AM Robert Bradshaw <ro...@google.com>
wrote:

> On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik <lc...@google.com> wrote:
> >
> > For all unknown options, the SDK can require that all flag values be
> > specified explicitly as a valid JSON type.
> > starts with { -> object
> > starts with [ -> list
> > starts with " -> string
> > is null / true / false -> null / true / false
> > otherwise is number.
> >
> > This isn't great for strings but works well for all the other types.
> >
> > Thus for known options, the additional typing information would
> > disambiguate whether something should be a
> > string/boolean/number/object/list but for unknown options we would expect
> > the user to use valid JSON explicitly and write:
> > --foo={"object": "value"}
> > --foo=["value", "value2"]
> > --foo="string value"
>
> Due to shell escaping, one would have to write
>
> --foo=\"string value\"
>
> or actually, due to the space
>
> --foo='"string value"'
>
> or some other variation on that, which is really unfortunate. (The JSON
> list/objects would need similar quoting, but that's less surprising.) Also,
> does this mean we'd only have one kind of number (not integer vs. float,
> i.e. --parallelism=5.0 works)? I suppose that is JSON.
>

Yes, I suspected that users would need to type the second variant, as I
found \"...\" more burdensome than '"..."'.
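
To illustrate the quoting issue, here is a small sketch that uses Python's
shlex module as a stand-in for POSIX shell word splitting. The helper name
is hypothetical and real shells have more rules than shlex models; the point
is only to show what the program receives for each spelling of the flag.

```python
import shlex


def argv_seen_by_program(command_line):
    """Split a command line roughly the way a POSIX shell would."""
    return shlex.split(command_line)


# Plain double quotes are consumed by the shell, so the value that arrives
# is no longer a JSON string literal.
unquoted = argv_seen_by_program('--foo="string value"')

# Backslash-escaped quotes survive, but the unprotected space splits the
# flag into two separate arguments.
backslashed = argv_seen_by_program('--foo=\\"string value\\"')

# Single quotes around the whole JSON literal keep both the double quotes
# and the space intact.
single_quoted = argv_seen_by_program("--foo='\"string value\"'")

print(unquoted)
print(backslashed)
print(single_quoted)
```

Only the single-quoted form hands the parser a value that is still valid
JSON for a string.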


>
> > --foo=3.5 --foo=-4
> > --foo=true --foo=false
> > --foo=null
> > This also works if the flag is repeated, so --foo=3.5 --foo=-4 is
> > [3.5, -4]
>
> The thing that sparked this discussion was what to do when unknown foo is
> repeated, but only one value given.
>

If the person only specifies one value, then they have to disambiguate and
put it in a list; only if they specify more than one value will they have
to turn it into a list.
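
A minimal sketch of the strict-JSON rule for unknown options, under my own
assumptions: the function name is hypothetical, a flag seen once keeps the
JSON value it was given, and a flag seen several times is collected into a
list in order of appearance. This is not Beam's actual option parser.

```python
import json


def parse_unknown_flags(args):
    """Parse ['--name=<json>', ...] into a dict of decoded JSON values."""
    grouped = {}
    for arg in args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError("expected --name=<json value>, got %r" % arg)
        name, raw = arg[2:].split("=", 1)
        # json.loads raises a ValueError subclass if raw is not valid JSON,
        # e.g. for a bare, unquoted string.
        grouped.setdefault(name, []).append(json.loads(raw))
    # A flag given once keeps its value as written; a repeated flag
    # becomes a list of its values.
    return {
        name: values[0] if len(values) == 1 else values
        for name, values in grouped.items()
    }


print(parse_unknown_flags(["--foo=3.5", "--foo=-4"]))
print(parse_unknown_flags(["--foo=[5]"]))
print(parse_unknown_flags(["--foo=5"]))
```

Under this sketch a user who wants a one-element list for a flag given once
has to write the list explicitly, as in --foo=[5].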

I believe we could come up with other schemes on how to convert unknown
options to JSON where we prefer strings over non-string types like
null/true/false/numbers/list/object and require the user to escape out of
the string default but anything that is too different from strict JSON
would cause headaches when attempting to explain the format to users. I
think a happy middle ground would be that we will only require escaping for
strings which are ambiguous, so things like true, null, false, ... to be
treated as strings would require the user to escape them.
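
One possible reading of that middle ground, as a sketch only: try to decode
the value as JSON and fall back to a plain string when that fails. The
function name is hypothetical, and the exact set of values that need
escaping would depend on the parser chosen.

```python
import json


def parse_lenient(raw):
    """Decode raw as JSON if possible, otherwise keep it as a string."""
    try:
        return json.loads(raw)
    except ValueError:
        # Not valid JSON, so treat the value as an unquoted string.
        return raw


print(parse_lenient("string value"))   # falls back to a string
print(parse_lenient("true"))           # decoded as a JSON boolean
print(parse_lenient('"true"'))         # escaped, so it stays a string
print(parse_lenient('["a", "b"]'))     # decoded as a JSON list
```

With this rule only ambiguous strings such as true, null, false or numbers
need the extra quoting; everything else can be typed as-is.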


> > On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise <th...@apache.org> wrote:
> >>
> >> Discovering options from the job server seems preferable over
> >> replicating runner options in SDKs.
> >>
> >> Runners evolve on their own, and with portability the SDK does not need
> >> to know anything about the runner.
> >>
> >> Regarding --runner-option. It is true that this looks less user
> >> friendly. On the other hand it eliminates the possibility of name
> >> collisions.
> >>
> >> But if options are discovered, the SDK can perform full validation. It
> >> would only be necessary to use explicit scoping when there is ambiguity.
> >>
> >> Thomas
> >>
> >>
> >> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels <mx...@apache.org> wrote:
> >>>
> >>> Fetching options directly from the Runner's JobServer seems like the
> >>> ideal solution. I agree with Robert that it creates additional
> >>> complexity for SDK authors, so the `--runner-option` flag would be an
> >>> easy and explicit way to specify additional Runner options.
> >>>
> >>> The format I prefer would be: --runner_option=key1=val1
> >>> --runner_option=key2=val2
> >>>
> >>> Now, from the perspective of end users, I think it is neither convenient
> >>> nor reasonable to require the use of the `--runner-option` flag. To the
> >>> user it seems nebulous why some pipeline options live in the top-level
> >>> option namespace while others need to be nested within an option. This
> >>> is amplified by there being two Runners the user needs to be aware of,
> >>> i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark..).
> >>>
> >>> I feel like we would eventually replicate all options in the SDK because
> >>> otherwise users have to use the `--runner-option`, but at least we can
> >>> specify options which have not been replicated yet.
> >>>
> >>> -Max
> >>>
> >>> On 16.10.18 10:27, Robert Bradshaw wrote:
> >>> > Yes, we don't know how to parse and/or validate it.
> >>> >
> >>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lcwik@google.com
> >>> > <ma...@google.com>> wrote:
> >>> >
> >>> >     I see, is the issue that we currently are using a JSON
> >>> >     representation for options when being serialized and when we get
> >>> >     some unknown option, we don't know how to convert it into its
> JSON form?
> >>> >
> >>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <
> robertwb@google.com
> >>> >     <ma...@google.com>> wrote:
> >>> >
> >>> >         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <
> lcwik@google.com
> >>> >         <ma...@google.com>> wrote:
> >>> >          >
> >>> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw
> >>> >         <robertwb@google.com <ma...@google.com>> wrote:
> >>> >          >>
> >>> >          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik
> >>> >         <lcwik@google.com <ma...@google.com>> wrote:
> >>> >          >> >
> >>> >          >> > I agree with the sentiment for better error checking.
> >>> >          >> >
> >>> >          >> > We can try to make it such that the SDK can "fetch" the
> >>> >         set of options that the runner supports by making a call to
> the
> >>> >         Job API. The API could return a list of option names
> >>> >         (descriptions for --help purposes and also potentially the
> >>> >         expected format) which would remove the worry around
> "unknown"
> >>> >         options. Yes I understand to be able to make the Job API
> call,
> >>> >         we may need to parse some options from the args parameters
> first
> >>> >         and then parse the unknown options after they are fetched.
> >>> >          >>
> >>> >          >> This is an interesting idea, but seems it could get quite
> >>> >         complicated.
> >>> >          >> E.g. for delegating runners, one would first read the
> options to
> >>> >          >> determine which runner to fetch the options from, which
> >>> >         would then
> >>> >          >> return a set of options that possibly depends on the
> values
> >>> >         of some of
> >>> >          >> its options...
> >>> >          >>
> >>> >          >> > Alternatively, we can choose an explicit format
> upfront.
> >>> >          >> > To expand on the exact format for --runner_option=...,
> >>> >         here are some different ideas:
> >>> >          >> > 1) Specified multiple times, each one is an explicit
> flag
> >>> >          >> > --runner_option=--blah=bar --runner_option=--foo=baz1
> >>> >         --runner_option=--foo=baz2
> >>> >          >>
> >>> >          >> I'm -1 on this format. We should move away from the idea
> >>> >         that options
> >>> >          >> == flags (as that doesn't compose well with other
> libraries
> >>> >         that do
> >>> >          >> their own flags parsing). The ability to parse a set of
> >>> >         flags into
> >>> >          >> options is just a convenience that an author may (or may
> >>> >         not) choose
> >>> >          >> >> to use (e.g. when running pipelines in a long-lived process
> like a
> >>> >          >> service or a notebook, the command line flags are almost
> >>> >         certainly not
> >>> >          >> the right interface).
> >>> >          >>
> >>> >          >> > 2) specified multiple times, we drop the explicit flag
> >>> >          >> > --runner_option=blah=bar --runner_option=foo=baz1
> >>> >         --runner_option=foo=baz2
> >>> >          >>
> >>> >          >> This or (4) is my preference.
> >>> >          >>
> >>> >          >> > 3) we use a string which the runner can choose to
> >>> >         interpret however they want (JSON/XML shown below)
> >>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1",
> "baz2"]}'
> >>> >          >> >
> >>> >
> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
> >>> >          >>
> >>> >          >> This would make validation hard. Also, I think it makes
> >>> >         sense for some
> >>> >          >> >> runner options to be "shared" ("parallelism") by
> convention,
> >>> >         so letting
> >>> >          >> it be a free-form string wouldn't allow different
> runners to
> >>> >         inspect
> >>> >          >> different bits.
> >>> >          >>
> >>> >          >> We should consider if we should use urns for
> namespacing, and
> >>> >          >> assigning semantic meaning to strings, here.
> >>> >          >>
> >>> >          >> > 4) we use a string which must be a specific format
> such as
> >>> >         JSON (allows the SDK to do simple validation):
> >>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1",
> "baz2"]}'
> >>> >          >>
> >>> >          >> I like this in that at least some validation can be
> >>> >         performed, and
> >>> >          >> expectations of how to format richer types. On the other
> >>> >         hand it gets
> >>> >          >> a bit verbose, given that most (I'd imagine) options
> will be
> >>> >         simple.
> >>> >          >> As with normal options,
> >>> >          >>
> >>> >          >>     --option1=value1 --option2=value2
> >>> >          >>
> >>> >          >> is shorthand for {"option1": value1, "option2": value2}.
> >>> >          >>
> >>> >          > I lean to 4 the most. With 2, you run into issues of what
> >>> >         does --runner_option=foo=["a", "b"] --runner_option=foo=["c",
> >>> >         "d"] mean?
> >>> >          > Is it an error or list of lists or concatenated. Similar
> >>> >         issues for map types represented via JSON object {...}
> >>> >
> >>> >         We can err to be on the safe side unless/until an argument
> can
> >>> >         be made
> >>> >         that merging is more natural. I just think this will be
> excessively
> >>> >         verbose to use.
> >>> >
> >>> >          >> > I would strongly suggest that we go with the "fetch"
> >>> >         approach, since this makes the set of options discoverable
> and
> >>> >         helps users find errors much earlier in their pipeline.
> >>> >          >>
> >>> >          >> This seems like an advanced feature that SDKs may want to
> >>> >         support, but
> >>> >          >> I wouldn't want to require this complexity for
> bootstrapping
> >>> >         an SDK.
> >>> >          >>
> >>> >          > SDKs that are starting off wouldn't need to "fetch"
> options,
> >>> >         they could choose to not support runner options or they could
> >>> >         choose to pass all options through to the runner blindly.
> >>> >         Fetching the options only provides the SDK the ability to
> >>> >         provide error checking upfront and useful error/help
> messages.
> >>> >
> >>> >         But how to even pass all options through blindly is exactly
> the
> >>> >         difficulty we're running into here.
> >>> >
> >>> >          >> Regarding always keeping runner options separate, +1,
> though
> >>> >         I'm not
> >>> >          >> sure the line is always clear.
> >>> >          >>
> >>> >          >>
> >>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw
> >>> >         <robertwb@google.com <ma...@google.com>> wrote:
> >>> >          >> >>
> >>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels
> >>> >         <mxm@apache.org <ma...@apache.org>> wrote:
> >>> >          >> >> >
> >>> >          >> >> > I agree that the current approach breaks the
> pipeline
> >>> >         options contract
> >>> >          >> >> > because "unknown" options get parsed in the same
> way as
> >>> >         options which
> >>> >          >> >> > have been defined by the user.
> >>> >          >> >>
> >>> >          >> >> FWIW, I think we're already breaking this "contract."
> >>> >         Unknown options
> >>> >          >> >> are silently ignored; with this change we just change
> how
> >>> >         we record
> >>> >          >> >> them. It still feels a bit hacky though.
> >>> >          >> >>
> >>> >          >> >> > I'm not sure the `experiments` flag works for us.
> AFAIK
> >>> >         it only allows
> >>> >          >> >> > true/false flags. We want to pass all types of
> pipeline
> >>> >         options to the
> >>> >          >> >> > Runner.
> >>> >          >> >>
> >>> >          >> >> Experiments is an arbitrary set of strings, which can
> be
> >>> >         of the form
> >>> >          >> >> "param=value" if that's useful. (Dataflow does this.)
> >>> >         There is, again,
> >>> >          >> >> no namespacing on the param names, but we could use
> urns
> >>> >         or impose
> >>> >          >> >> some other structure here.
> >>> >          >> >>
> >>> >          >> >> > How to solve this?
> >>> >          >> >> >
> >>> >          >> >> > 1) Add all options of all Runners to each SDK
> >>> >          >> >> > We added some of the FlinkRunner options to the
> Python
> >>> >         SDK but realized
> >>> >          >> >> > syncing is rather cumbersome in the long term.
> However,
> >>> >         we want the most
> >>> >          >> >> > important options to be validated on the client
> side.
> >>> >          >> >>
> >>> >          >> >> I don't think this is sustainable in the long run.
> >>> >         However, thinking
> >>> >          >> >> about this, in the worst case validation happens after
> >>> >         construction
> >>> >          >> >> but before execution (as with much of our other
> >>> >         validation) so it
> >>> >          >> >> isn't that bad.
> >>> >          >> >>
> >>> >          >> >> > 2) Pass "unknown" options via a separate list in the
> >>> >         Proto which can
> >>> >          >> >> > only be accessed internally by the Runners. This
> still
> >>> >         allows passing
> >>> >          >> >> > arbitrary options but we wouldn't leak unknown
> options
> >>> >         and display them
> >>> >          >> >> > as top-level options.
> >>> >          >> >>
> >>> >          >> >> I think there needs to be a way for the user to
> >>> >         communicate values
> >>> >          >> >> directly to the runner regardless of the SDK. My
> >>> >         preference would be
> >>> >          >> >> to make this explicit, e.g. (repeated)
> >>> >         --runner_option=..., rather
> >>> >          >> >> than scooping up all unknown flags at command line
> >>> >         parsing time.
> >>> >          >> >> Perhaps an SDK that is aware of some runners could
> choose
> >>> >         to lift
> >>> >          >> >> these as top-level options, but still pass them as
> runner
> >>> >         options.
> >>> >          >> >>
> >>> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
> >>> >          >> >> > > The current release branch
> >>> >          >> >> > >
> >>> >         (https://github.com/apache/beam/commits/release-2.8.0) was
> cut
> >>> >         after the
> >>> >          >> >> > > revert went in.  Sent out
> >>> >         https://github.com/apache/beam/pull/6683 as a
> >>> >          >> >> > > revert of the revert.  Regarding your comment
> above,
> >>> >         I can help out with
> >>> >          >> >> > > the design / PR reviews for common Python code as
> you
> >>> >         suggest.
> >>> >          >> >> > >
> >>> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise
> >>> >         <thw@apache.org <ma...@apache.org>
> >>> >          >> >> > > <mailto:thw@apache.org <ma...@apache.org>>>
> wrote:
> >>> >          >> >> > >
> >>> >          >> >> > >     Thanks, will tag you and looking forward to
> >>> >         feedback so we can
> >>> >          >> >> > >     ensure that changes work for everyone.
> >>> >          >> >> > >
> >>> >          >> >> > >     Looking at the PR, I see agreement from Max to
> >>> >         revert the change on
> >>> >          >> >> > >     the release branch, but not in master. Would
> you
> >>> >         mind to restore it
> >>> >          >> >> > >     in master?
> >>> >          >> >> > >
> >>> >          >> >> > >     Thanks
> >>> >          >> >> > >
> >>> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay
> >>> >         <altay@google.com <ma...@google.com>
> >>> >          >> >> > >     <mailto:altay@google.com
> >>> >         <ma...@google.com>>> wrote:
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles
> >>> >         Chen <ccy@google.com <ma...@google.com>
> >>> >          >> >> > >         <mailto:ccy@google.com
> >>> >         <ma...@google.com>>> wrote:
> >>> >          >> >> > >
> >>> >          >> >> > >             What I mean is that a user may find
> that
> >>> >         it works for them
> >>> >          >> >> > >             to pass "--myarg blah" and access it
> as
> >>> >         "options.myarg"
> >>> >          >> >> > >             without explicitly defining a "my_arg"
> >>> >         flag due to the added
> >>> >          >> >> > >             logic.  This is not the intended
> behavior
> >>> >         and we may want to
> >>> >          >> >> > >             change this implementation detail in
> the
> >>> >         future.  However,
> >>> >          >> >> > >             having this logic in a released
> version
> >>> >         makes it hard to
> >>> >          >> >> > >             change this behavior since users may
> >>> >         erroneously depend on
> >>> >          >> >> > >             this undocumented behavior.  Instead,
> we
> >>> >         should namespace /
> >>> >          >> >> > >             scope this so that it is obvious that
> >>> >         this is meant for
> >>> >          >> >> > >             runner (and not Beam user)
> consumption.
> >>> >          >> >> > >
> >>> >          >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM
> Thomas Weise
> >>> >          >> >> > >             <thw@apache.org <mailto:
> thw@apache.org>
> >>> >         <mailto:thw@apache.org <ma...@apache.org>>> wrote:
> >>> >          >> >> > >
> >>> >          >> >> > >                 Can you please elaborate more what
> >>> >         practical problems
> >>> >          >> >> > >                 this introduces for users?
> >>> >          >> >> > >
> >>> >          >> >> > >                 I can see that this change allows
> a
> >>> >         user to specify a
> >>> >          >> >> > >                 runner specific option, which in
> the
> >>> >         future may change
> >>> >          >> >> > >                 because we decide to scope
> >>> >         differently. If this only
> >>> >          >> >> > >                 affects users of the portable
> Flink
> >>> >         runner (like us),
> >>> >          >> >> > >                 then no need to revert, because at
> >>> >         this early stage we
> >>> >          >> >> > >                 prefer something that works over
> >>> >         being blocked.
> >>> >          >> >> > >
> >>> >          >> >> > >                 It would also be really great if
> some
> >>> >         of the core Python
> >>> >          >> >> > >                 SDK developers could help out with
> >>> >         the design aspects
> >>> >          >> >> > >                 and PR reviews of changes that
> affect
> >>> >         common Python
> >>> >          >> >> > >                 code. Anyone who specifically
> wants
> >>> >         to be tagged on
> >>> >          >> >> > >                 relevant JIRAs and PRs?
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >         I would be happy to be tagged, and I can
> also
> >>> >         help with
> >>> >          >> >> > >         including other relevant folks whenever
> >>> >         possible. In general I
> >>> >          >> >> > >         think Robert, Charles, myself are good
> >>> >         candidates.
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >                 Thanks
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM
> >>> >         Ahmet Altay
> >>> >          >> >> > >                 <altay@google.com
> >>> >         <ma...@google.com> <mailto:altay@google.com
> >>> >         <ma...@google.com>>> wrote:
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >                     On Fri, Oct 12, 2018 at 10:11
> AM,
> >>> >         Charles Chen
> >>> >          >> >> > >                     <ccy@google.com
> >>> >         <ma...@google.com> <mailto:ccy@google.com
> >>> >         <ma...@google.com>>> wrote:
> >>> >          >> >> > >
> >>> >          >> >> > >                         For context, I made
> comments on
> >>> >          >> >> > > https://github.com/apache/beam/pull/6600 noting
> >>> >          >> >> > >                         that the changes being
> made
> >>> >         were not good for
> >>> >          >> >> > >                         Beam
> >>> >         backwards-compatibility.  The change as is
> >>> >          >> >> > >                         allows users to use
> pipeline
> >>> >         options without
> >>> >          >> >> > >                         explicitly defining them,
> >>> >         which is not the type
> >>> >          >> >> > >                         of usage we would like to
> >>> >         encourage since we
> >>> >          >> >> > >                         prefer to be explicit
> >>> >         whenever possible.  If
> >>> >          >> >> > >                         users write pipelines with
> >>> >         this sort of pattern,
> >>> >          >> >> > >                         they will potentially
> >>> >         encounter pain when
> >>> >          >> >> > >                         upgrading to a later
> version
> >>> >         since this is an
> >>> >          >> >> > >                         implementation detail and
> not
> >>> >         an officially
> >>> >          >> >> > >                         supported pattern.  I
> agree
> >>> >         with the comments
> >>> >          >> >> > >                         above that this is
> ultimately
> >>> >         a scoping issue.
> >>> >          >> >> > >                         I would not have a problem
> >>> >         with these changes if
> >>> >          >> >> > >                         they were explicitly
> scoped
> >>> >         under either a
> >>> >          >> >> > >                         runner or unparsed options
> >>> >         namespace.
> >>> >          >> >> > >
> >>> >          >> >> > >                         As a second note, since
> the
> >>> >         2.8.0 release is
> >>> >          >> >> > >                         being cut right now,
> because
> >>> >         of these
> >>> >          >> >> > >                         backwards-compatibility
> >>> >         concerns, I would
> >>> >          >> >> > >                         suggest reverting these
> >>> >         changes, at least until
> >>> >          >> >> > >                         2.8.0 is cut, so we can
> have
> >>> >         a discussion here
> >>> >          >> >> > >                         before committing to and
> >>> >         releasing any API-level
> >>> >          >> >> > >                         changes.
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >                     +1 I would like to revert the
> >>> >         changes in order not
> >>> >          >> >> > >                     rush this into the release.
> Once
> >>> >         this discussion
> >>> >          >> >> > >                     results in an agreement
> changes
> >>> >         can be brought back.
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >                         On Fri, Oct 12, 2018 at
> 9:26
> >>> >         AM Henning Rohde
> >>> >          >> >> > >                         <herohde@google.com
> >>> >         <ma...@google.com> <mailto:herohde@google.com
> >>> >         <ma...@google.com>>>
> >>> >          >> >> > >                         wrote:
> >>> >          >> >> > >
> >>> >          >> >> > >                             Agree that pipeline
> >>> >         options lack some
> >>> >          >> >> > >                             mechanism for
> scoping. It
> >>> >         is also not always
> >>> >          >> >> > >                             possible to distinguish
> >>> >         options meant to be
> >>> >          >> >> > >                             consumed at pipeline
> >>> >         construction time, by
> >>> >          >> >> > >                             the runner, by the SDK
> >>> >         harness, by the user
> >>> >          >> >> > >                             code or any
> combination
> >>> >         -- and this causes
> >>> >          >> >> > >                             confusion every now
> and then.
> >>> >          >> >> > >
> >>> >          >> >> > >                             For Dataflow, we have
> >>> >         been using
> >>> >          >> >> > >                             "experiments" for
> >>> >         arbitrary runner-specific
> >>> >          >> >> > >                             options. It's simply a
> >>> >         string list pipeline
> >>> >          >> >> > >                             option that all SDKs
> >>> >         support and, for Go at
> >>> >          >> >> > >                             least, is sent to
> >>> >         portable runners. Flink
> >>> >          >> >> > >                             can do the same in the
> >>> >         short term to move
> >>> >          >> >> > >                             forward.
> >>> >          >> >> > >
> >>> >          >> >> > >                             Henning
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
> >>> >          >> >> > >                             <thw@apache.org> wrote:
> >>> >          >> >> > >
> >>> >          >> >> > >                                 [moving to the list]
> >>> >          >> >> > >
> >>> >          >> >> > >                                 The requirement driving this part of the
> >>> >          >> >> > >                                 change was to allow a user to specify
> >>> >          >> >> > >                                 pipeline options that a runner supports
> >>> >          >> >> > >                                 without having to declare those in each
> >>> >          >> >> > >                                 language SDK.
> >>> >          >> >> > >
> >>> >          >> >> > >                                 In the specific scenario, we have
> >>> >          >> >> > >                                 options that the Flink runner supports
> >>> >          >> >> > >                                 (and can validate), that are not
> >>> >          >> >> > >                                 enumerated in the Python SDK.
> >>> >          >> >> > >
> >>> >          >> >> > >                                 I think we have a bigger problem scoping
> >>> >          >> >> > >                                 pipeline options. For example, the
> >>> >          >> >> > >                                 runner options are dumped into the SDK
> >>> >          >> >> > >                                 worker. There is also a possibility of
> >>> >          >> >> > >                                 name collisions. So I think this would
> >>> >          >> >> > >                                 benefit from broader feedback.
> >>> >          >> >> > >
> >>> >          >> >> > >                                 Thanks,
> >>> >          >> >> > >                                 Thomas
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >                                 ---------- Forwarded message ---------
> >>> >          >> >> > >                                 From: Charles Chen <notifications@github.com>
> >>> >          >> >> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
> >>> >          >> >> > >                                 Subject: Re: [apache/beam] [BEAM-5442]
> >>> >          >> >> > >                                 Store duplicate unknown options in a
> >>> >          >> >> > >                                 list argument (#6600)
> >>> >          >> >> > >                                 To: apache/beam <beam@noreply.github.com>
> >>> >          >> >> > >                                 Cc: Thomas Weise <thomas.weise@gmail.com>,
> >>> >          >> >> > >                                 Mention <mention@noreply.github.com>
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >                                 CC: @tweise <https://github.com/tweise>
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >          >> >> > >
> >>> >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Robert Bradshaw <ro...@google.com>.

On Tue, Oct 16, 2018 at 7:03 PM Lukasz Cwik <lc...@google.com> wrote:
>
> For all unknown options, the SDK can require that all flag values be
> specified explicitly as a valid JSON type.
> starts with { -> object
> starts with [ -> list
> starts with " -> string
> is null / true / false -> null / true / false
> otherwise is number.
>
> This isn't great for strings but works well for all the other types.
>
> Thus for known options, the additional typing information would
> disambiguate whether something should be a string/boolean/number/object/list,
> but for unknown options we would expect the user to use valid JSON
> explicitly and write:
> --foo={"object": "value"}
> --foo=["value", "value2"]
> --foo="string value"

Due to shell escaping, one would have to write

--foo=\"string value\"

or actually, due to the space

--foo='"string value"'

or some other variation on that, which is really unfortunate. (The JSON
list/objects would need similar quoting, but that's less surprising.) Also,
does this mean we'd only have one kind of number (not integer vs. float,
i.e. --parallelism=5.0 works)? I suppose that is JSON.
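
To make the quoting problem concrete, here is a minimal Python sketch. It
assumes POSIX shell tokenization as approximated by the standard-library
shlex module, and parse_unknown_flag is a hypothetical helper for
illustration, not part of the Beam SDK. It also shows that Python's json
module decodes 5 as an int and 5.0 as a float, even though JSON itself has
only one number type.

```python
import json
import shlex


def parse_unknown_flag(arg):
    """Split '--name=value' and decode the value as JSON.

    Hypothetical helper for illustration only; not a Beam API.
    Raises ValueError if the value is not valid JSON.
    """
    name, _, raw = arg.lstrip("-").partition("=")
    return name, json.loads(raw)


# The shell strips the double quotes, so the program receives
# '--foo=string value', whose value is not valid JSON.
unquoted = shlex.split('--foo="string value"')

# Wrapping the JSON string in single quotes keeps the double quotes intact.
quoted = shlex.split("--foo='\"string value\"'")

print(unquoted)
print(quoted)
print(parse_unknown_flag(quoted[0]))
print(parse_unknown_flag("--parallelism=5.0"))
print(parse_unknown_flag("--parallelism=5"))
```

Calling parse_unknown_flag(unquoted[0]) raises a ValueError, which is the
surprise a user would hit when writing --foo="string value" on the command
line.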

> --foo=3.5 --foo=-4
> --foo=true --foo=false
> --foo=null
> This also works if the flag is repeated, so --foo=3.5 --foo=-4 is [3.5, -4]

The thing that sparked this discussion was what to do when an unknown option
foo is meant to be repeated, but only one value is given.
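
The ambiguity can be sketched as follows. This is a hypothetical
illustration in Python, not the code from the PR: collect_unknown_options
and collapse_singletons are made-up names, and the values are assumed to be
JSON as proposed above.

```python
import json
from collections import defaultdict


def collect_unknown_options(args):
    """Group repeated '--name=value' flags into a list per option name."""
    values = defaultdict(list)
    for arg in args:
        name, _, raw = arg.lstrip("-").partition("=")
        values[name].append(json.loads(raw))
    return dict(values)


def collapse_singletons(options):
    """Turn one-element lists into scalars (one possible policy)."""
    return {k: v[0] if len(v) == 1 else v for k, v in options.items()}


twice = collect_unknown_options(["--foo=3.5", "--foo=-4"])
once = collect_unknown_options(["--foo=3.5"])

print(twice)
print(once)
print(collapse_singletons(once))
```

Without type information for foo, the SDK cannot tell whether a single
occurrence should be recorded as the one-element list or as the collapsed
scalar; either policy is wrong for some options.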

> On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise <th...@apache.org> wrote:
>>
>> Discovering options from the job server seems preferable over
>> replicating runner options in SDKs.
>>
>> Runners evolve on their own, and with portability the SDK does not need
>> to know anything about the runner.
>>
>> Regarding --runner-option: it is true that this looks less
>> user-friendly. On the other hand, it eliminates the possibility of name
>> collisions.
>>
>> But if options are discovered, the SDK can perform full validation. It
>> would only be necessary to use explicit scoping when there is ambiguity.
>>
>> Thomas
>>
>>
>> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels <mx...@apache.org> wrote:
>>>
>>> Fetching options directly from the Runner's JobServer seems like the
>>> ideal solution. I agree with Robert that it creates additional
>>> complexity for SDK authors, so the `--runner-option` flag would be an
>>> easy and explicit way to specify additional Runner options.
>>>
>>> The format I prefer would be: --runner_option=key1=val1
>>> --runner_option=key2=val2
>>>
>>> Now, from the perspective of end users, I think it is neither convenient
>>> nor reasonable to require the use of the `--runner-option` flag. To the
>>> user it seems nebulous why some pipeline options live in the top-level
>>> option namespace while others need to be nested within an option. This
>>> is amplified by there being two Runners the user needs to be aware of,
>>> i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark..).
>>>
>>> I feel like we would eventually replicate all options in the SDK because
>>> otherwise users have to use the `--runner-option`, but at least we can
>>> specify options which have not been replicated yet.
>>>
>>> -Max
>>>
>>> On 16.10.18 10:27, Robert Bradshaw wrote:
>>> > Yes, we don't know how to parse and/or validate it.
>>> >
>>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lcwik@google.com> wrote:
>>> >
>>> >     I see, is the issue that we currently are using a JSON
>>> >     representation for options when being serialized and when we get
>>> >     some unknown option, we don't know how to convert it into its
>>> >     JSON form?
>>> >
>>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <robertwb@google.com> wrote:
>>> >
>>> >         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lcwik@google.com> wrote:
>>> >          >
>>> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <robertwb@google.com> wrote:
>>> >          >>
>>> >          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lcwik@google.com> wrote:
>>> >          >> >
>>> >          >> > I agree with the sentiment for better error checking.
>>> >          >> >
>>> >          >> > We can try to make it such that the SDK can "fetch" the
>>> >          >> > set of options that the runner supports by making a call to the
>>> >          >> > Job API. The API could return a list of option names
>>> >          >> > (descriptions for --help purposes and also potentially the
>>> >          >> > expected format) which would remove the worry around "unknown"
>>> >          >> > options. Yes, I understand that to be able to make the Job API call,
>>> >          >> > we may need to parse some options from the args parameters first
>>> >          >> > and then parse the unknown options after they are fetched.
>>> >          >>
>>> >          >> This is an interesting idea, but seems it could get quite complicated.
>>> >          >> E.g. for delegating runners, one would first read the options to
>>> >          >> determine which runner to fetch the options from, which would then
>>> >          >> return a set of options that possibly depends on the values of some of
>>> >          >> its options...
>>> >          >>
>>> >          >> > Alternatively, we can choose an explicit format upfront.
>>> >          >> > To expand on the exact format for --runner_option=..., here are some different ideas:
>>> >          >> > 1) Specified multiple times, each one is an explicit flag
>>> >          >> > --runner_option=--blah=bar --runner_option=--foo=baz1 --runner_option=--foo=baz2
>>> >          >>
>>> >          >> I'm -1 on this format. We should move away from the idea that options
>>> >          >> == flags (as that doesn't compose well with other libraries that do
>>> >          >> their own flags parsing). The ability to parse a set of flags into
>>> >          >> options is just a convenience that an author may (or may not) choose
>>> >          >> to use (e.g. when running pipelines in a long-lived process like a
>>> >          >> service or a notebook, the command line flags are almost certainly not
>>> >          >> the right interface).
>>> >          >>
>>> >          >> > 2) specified multiple times, we drop the explicit flag
>>> >          >> > --runner_option=blah=bar --runner_option=foo=baz1 --runner_option=foo=baz2
>>> >          >>
>>> >          >> This or (4) is my preference.
>>> >          >>
>>> >          >> > 3) we use a string which the runner can choose to interpret however they want (JSON/XML shown below)
>>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>>> >          >> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>>> >          >>
>>> >          >> This would make validation hard. Also, I think it makes sense for some
>>> >          >> runner options to be "shared" ("parallelism") by convention, so letting
>>> >          >> it be a free-form string wouldn't allow different runners to inspect
>>> >          >> different bits.
>>> >          >>
>>> >          >> We should consider if we should use URNs for namespacing, and
>>> >          >> assigning semantic meaning to strings, here.
>>> >          >>
>>> >          >> > 4) we use a string which must be a specific format such as JSON (allows the SDK to do simple validation):
>>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>>> >          >>
>>> >          >> I like this in that at least some validation can be performed, and
>>> >          >> expectations of how to format richer types. On the other hand it gets
>>> >          >> a bit verbose, given that most (I'd imagine) options will be simple.
>>> >          >> As with normal options,
>>> >          >>
>>> >          >>     --option1=value1 --option2=value2
>>> >          >>
>>> >          >> is shorthand for {"option1": value1, "option2": value2}.
>>> >          >>
>>> >          > I lean to 4 the most. With 2, you run into issues of what
>>> >          > does --runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] mean?
>>> >          > Is it an error, a list of lists, or a concatenation? Similar
>>> >          > issues arise for map types represented via JSON object {...}
>>> >
>>> >         We can err to be on the safe side unless/until an argument can be made
>>> >         that merging is more natural. I just think this will be excessively
>>> >         verbose to use.
>>> >
>>> >          >> > I would strongly suggest that we go with the "fetch" approach,
>>> >          >> > since this makes the set of options discoverable and
>>> >          >> > helps users find errors much earlier in their pipeline.
>>> >          >>
>>> >          >> This seems like an advanced feature that SDKs may want to support, but
>>> >          >> I wouldn't want to require this complexity for bootstrapping an SDK.
>>> >          >>
>>> >          > SDKs that are starting off wouldn't need to "fetch" options,
>>> >          > they could choose to not support runner options or they could
>>> >          > choose to pass all options through to the runner blindly.
>>> >          > Fetching the options only provides the SDK the ability to
>>> >          > provide error checking upfront and useful error/help messages.
>>> >
>>> >         But how to even pass all options through blindly is exactly the
>>> >         difficulty we're running into here.
>>> >
>>> >          >> Regarding always keeping runner options separate, +1, though I'm not
>>> >          >> sure the line is always clear.
>>> >          >>
>>> >          >>
>>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <robertwb@google.com> wrote:
>>> >          >> >>
>>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mxm@apache.org> wrote:
>>> >          >> >> >
>>> >          >> >> > I agree that the current approach breaks the pipeline options contract
>>> >          >> >> > because "unknown" options get parsed in the same way as options which
>>> >          >> >> > have been defined by the user.
>>> >          >> >>
>>> >          >> >> FWIW, I think we're already breaking this "contract." Unknown options
>>> >          >> >> are silently ignored; with this change we just change how we record
>>> >          >> >> them. It still feels a bit hacky though.
>>> >          >> >>
>>> >          >> >> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
>>> >          >> >> > true/false flags. We want to pass all types of pipeline options to the
>>> >          >> >> > Runner.
>>> >          >> >>
>>> >          >> >> Experiments is an arbitrary set of strings, which can be of the form
>>> >          >> >> "param=value" if that's useful. (Dataflow does this.) There is, again,
>>> >          >> >> no namespacing on the param names, but we could use URNs or impose
>>> >          >> >> some other structure here.
>>> >          >> >>
>>> >          >> >> > How to solve this?
>>> >          >> >> >
>>> >          >> >> > 1) Add all options of all Runners to each SDK
>>> >          >> >> > We added some of the FlinkRunner options to the Python SDK but realized
>>> >          >> >> > syncing is rather cumbersome in the long term. However, we want the most
>>> >          >> >> > important options to be validated on the client side.
>>> >          >> >>
>>> >          >> >> I don't think this is sustainable in the long run. However, thinking
>>> >          >> >> about this, in the worst case validation happens after construction
>>> >          >> >> but before execution (as with much of our other validation) so it
>>> >          >> >> isn't that bad.
>>> >          >> >>
>>> >          >> >> > 2) Pass "unknown" options via a separate list in the Proto which can
>>> >          >> >> > only be accessed internally by the Runners. This still allows passing
>>> >          >> >> > arbitrary options but we wouldn't leak unknown options and display them
>>> >          >> >> > as top-level options.
>>> >          >> >>
>>> >          >> >> I think there needs to be a way for the user to communicate values
>>> >          >> >> directly to the runner regardless of the SDK. My preference would be
>>> >          >> >> to make this explicit, e.g. (repeated) --runner_option=..., rather
>>> >          >> >> than scooping up all unknown flags at command line parsing time.
>>> >          >> >> Perhaps an SDK that is aware of some runners could choose to lift
>>> >          >> >> these as top-level options, but still pass them as runner options.
>>> >          >> >>
>>> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
>>> >          >> >> > > The current release branch
>>> >          >> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after the
>>> >          >> >> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a
>>> >          >> >> > > revert of the revert.  Regarding your comment above, I can help out with
>>> >          >> >> > > the design / PR reviews for common Python code as you suggest.
>>> >          >> >> > >
>>> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org> wrote:
>>> >          >> >> > >
>>> >          >> >> > >     Thanks, will tag you and looking forward to feedback so we can
>>> >          >> >> > >     ensure that changes work for everyone.
>>> >          >> >> > >
>>> >          >> >> > >     Looking at the PR, I see agreement from Max to revert the change on
>>> >          >> >> > >     the release branch, but not in master. Would you mind restoring it
>>> >          >> >> > >     in master?
>>> >          >> >> > >
>>> >          >> >> > >     Thanks
>>> >          >> >> > >
>>> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com> wrote:
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@google.com> wrote:
>>> >          >> >> > >
>>> >          >> >> > >             What I mean is that a user may find that it works for them
>>> >          >> >> > >             to pass "--myarg blah" and access it as "options.myarg"
>>> >          >> >> > >             without explicitly defining a "myarg" flag due to the added
>>> >          >> >> > >             logic.  This is not the intended behavior and we may want to
>>> >          >> >> > >             change this implementation detail in the future.  However,
>>> >          >> >> > >             having this logic in a released version makes it hard to
>>> >          >> >> > >             change this behavior since users may erroneously depend on
>>> >          >> >> > >             this undocumented behavior.  Instead, we should namespace /
>>> >          >> >> > >             scope this so that it is obvious that this is meant for
>>> >          >> >> > >             runner (and not Beam user) consumption.
>>> >          >> >> > >
>>> >          >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise <thw@apache.org> wrote:
>>> >          >> >> > >
>>> >          >> >> > >                 Can you please elaborate more what practical problems
>>> >          >> >> > >                 this introduces for users?
>>> >          >> >> > >
>>> >          >> >> > >                 I can see that this change allows a user to specify a
>>> >          >> >> > >                 runner specific option, which in the future may change
>>> >          >> >> > >                 because we decide to scope differently. If this only
>>> >          >> >> > >                 affects users of the portable Flink runner (like us),
>>> >          >> >> > >                 then no need to revert, because at this early stage we
>>> >          >> >> > >                 prefer something that works over being blocked.
>>> >          >> >> > >
>>> >          >> >> > >                 It would also be really great if some of the core Python
>>> >          >> >> > >                 SDK developers could help out with the design aspects
>>> >          >> >> > >                 and PR reviews of changes that affect common Python
>>> >          >> >> > >                 code. Anyone who specifically wants to be tagged on
>>> >          >> >> > >                 relevant JIRAs and PRs?
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >         I would be happy to be tagged, and I can also help with
>>> >          >> >> > >         including other relevant folks whenever possible. In general I
>>> >          >> >> > >         think Robert, Charles, myself are good candidates.
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >                 Thanks
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
>>> >          >> >> > >                 <altay@google.com> wrote:
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
>>> >          >> >> > >                     <ccy@google.com> wrote:
>>> >          >> >> > >
>>> >          >> >> > >                         For context, I made comments on
>>> >          >> >> > >                         https://github.com/apache/beam/pull/6600 noting
>>> >          >> >> > >                         that the changes being made were not good for
>>> >          >> >> > >                         Beam backwards-compatibility.  The change as is
>>> >          >> >> > >                         allows users to use pipeline options without
>>> >          >> >> > >                         explicitly defining them, which is not the type
>>> >          >> >> > >                         of usage we would like to encourage since we
>>> >          >> >> > >                         prefer to be explicit whenever possible.  If
>>> >          >> >> > >                         users write pipelines with this sort of pattern,
>>> >          >> >> > >                         they will potentially encounter pain when
>>> >          >> >> > >                         upgrading to a later version since this is an
>>> >          >> >> > >                         implementation detail and not an officially
>>> >          >> >> > >                         supported pattern.  I agree with the comments
>>> >          >> >> > >                         above that this is ultimately a scoping issue.
>>> >          >> >> > >                         I would not have a problem with these changes if
>>> >          >> >> > >                         they were explicitly scoped under either a
>>> >          >> >> > >                         runner or unparsed options namespace.
>>> >          >> >> > >
>>> >          >> >> > >                         As a second note, since the 2.8.0 release is
>>> >          >> >> > >                         being cut right now, because of these
>>> >          >> >> > >                         backwards-compatibility concerns, I would
>>> >          >> >> > >                         suggest reverting these changes, at least until
>>> >          >> >> > >                         2.8.0 is cut, so we can have a discussion here
>>> >          >> >> > >                         before committing to and releasing any API-level
>>> >          >> >> > >                         changes.
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >                     +1 I would like to revert the changes in order not to
>>> >          >> >> > >                     rush this into the release. Once this discussion
>>> >          >> >> > >                     results in an agreement, the changes can be brought back.
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
>>> >          >> >> > >                         <herohde@google.com> wrote:
>>> >          >> >> > >
>>> >          >> >> > >                             Agree that pipeline options lack some
>>> >          >> >> > >                             mechanism for scoping. It is also not always
>>> >          >> >> > >                             possible to distinguish options meant to be
>>> >          >> >> > >                             consumed at pipeline construction time, by
>>> >          >> >> > >                             the runner, by the SDK harness, by the user
>>> >          >> >> > >                             code or any combination -- and this causes
>>> >          >> >> > >                             confusion every now and then.
>>> >          >> >> > >
>>> >          >> >> > >                             For Dataflow, we have been using
>>> >          >> >> > >                             "experiments" for arbitrary runner-specific
>>> >          >> >> > >                             options. It's simply a string list pipeline
>>> >          >> >> > >                             option that all SDKs support and, for Go at
>>> >          >> >> > >                             least, is sent to portable runners. Flink
>>> >          >> >> > >                             can do the same in the short term to move
>>> >          >> >> > >                             forward.
>>> >          >> >> > >
>>> >          >> >> > >                             Henning
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
>>> >          >> >> > >                             <thw@apache.org> wrote:
>>> >          >> >> > >
>>> >          >> >> > >                                 [moving to the list]
>>> >          >> >> > >
>>> >          >> >> > >                                 The requirement driving this part of the
>>> >          >> >> > >                                 change was to allow a user to specify
>>> >          >> >> > >                                 pipeline options that a runner supports
>>> >          >> >> > >                                 without having to declare those in each
>>> >          >> >> > >                                 language SDK.
>>> >          >> >> > >
>>> >          >> >> > >                                 In the specific scenario, we have
>>> >          >> >> > >                                 options that the Flink runner supports
>>> >          >> >> > >                                 (and can validate), that are not
>>> >          >> >> > >                                 enumerated in the Python SDK.
>>> >          >> >> > >
>>> >          >> >> > >                                 I think we have a bigger problem scoping
>>> >          >> >> > >                                 pipeline options. For example, the
>>> >          >> >> > >                                 runner options are dumped into the SDK
>>> >          >> >> > >                                 worker. There is also a possibility of
>>> >          >> >> > >                                 name collisions. So I think this would
>>> >          >> >> > >                                 benefit from broader feedback.
>>> >          >> >> > >
>>> >          >> >> > >                                 Thanks,
>>> >          >> >> > >                                 Thomas
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >                                 ---------- Forwarded message ---------
>>> >          >> >> > >                                 From: Charles Chen <notifications@github.com>
>>> >          >> >> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
>>> >          >> >> > >                                 Subject: Re: [apache/beam] [BEAM-5442]
>>> >          >> >> > >                                 Store duplicate unknown options in a
>>> >          >> >> > >                                 list argument (#6600)
>>> >          >> >> > >                                 To: apache/beam <beam@noreply.github.com>
>>> >          >> >> > >                                 Cc: Thomas Weise <thomas.weise@gmail.com>,
>>> >          >> >> > >                                 Mention <mention@noreply.github.com>
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >                                 CC: @tweise <https://github.com/tweise>
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >
>>> >          >> >> > >
>>> >

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Lukasz Cwik <lc...@google.com>.

For all unknown options, the SDK can require that every flag value be
specified explicitly as a valid JSON value, with the type determined by how
the value starts:
starts with { -> object
starts with [ -> list
starts with " -> string
is null / true / false -> null / true / false
otherwise -> number

This isn't great for strings but works well for all the other types.

Thus for known options, the additional typing information would
disambiguate whether something should be a
string/boolean/number/object/list but for unknown options we would expect
the user to use valid JSON explicitly and write:
--foo={"object": "value"}
--foo=["value", "value2"]
--foo="string value"
--foo=3.5 --foo=-4
--foo=true --foo=false
--foo=null
This also works if the flag is repeated: --foo=3.5 --foo=-4 yields [3.5, -4].
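
A minimal sketch of the rule above in Python, assuming unknown options arrive
as plain --name=value strings. The function name parse_unknown_options is
hypothetical and not part of the Beam SDK; collapsing a single occurrence to a
scalar and repeated occurrences to a list is one possible reading of the
proposal.

```python
import json


def parse_unknown_options(args):
    """Parse --name=value flags whose values must be valid JSON.

    A flag that appears once maps to its decoded value; a flag that is
    repeated maps to the list of its decoded values, in order.
    """
    collected = {}
    for arg in args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError("expected --name=<json>, got %r" % arg)
        # Split on the first "=" only, so JSON values may contain "=".
        name, raw = arg[2:].split("=", 1)
        try:
            value = json.loads(raw)
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError.
            raise ValueError("--%s: %r is not valid JSON" % (name, raw))
        collected.setdefault(name, []).append(value)
    return {name: vals[0] if len(vals) == 1 else vals
            for name, vals in collected.items()}
```

With this sketch, ["--foo=3.5", "--foo=-4"] parses to {"foo": [3.5, -4]}, while
an unquoted value such as --foo=bare is rejected. On a real command line the
string case needs shell quoting around the JSON quotes, for example
--foo='"string value"', which is the inconvenience for strings noted above.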

On Tue, Oct 16, 2018 at 7:56 AM Thomas Weise <th...@apache.org> wrote:

> Discovering options from the job server seems preferable over replicating
> runner options in SDKs.
>
> Runners evolve on their own, and with portability the SDK does not need to
> know anything about the runner.
>
> Regarding --runner-option. It is true that this looks less user friendly.
> On the other hand it eliminates the possibility of name collisions.
>
> But if options are discovered, the SDK can perform full validation. It
> would only be necessary to use explicit scoping when there is ambiguity.
>
> Thomas
>
>
> On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels <mx...@apache.org> wrote:
>
>> Fetching options directly from the Runner's JobServer seems like the
>> ideal solution. I agree with Robert that it creates additional
>> complexity for SDK authors, so the `--runner-option` flag would be an
>> easy and explicit way to specify additional Runner options.
>>
>> The format I prefer would be: --runner_option=key1=val1
>> --runner_option=key2=val2
>>
>> Now, from the perspective of end users, I think it is neither convenient
>> nor reasonable to require the use of the `--runner-option` flag. To the
>> user it seems nebulous why some pipeline options live in the top-level
>> option namespace while others need to be nested within an option. This
>> is amplified by there being two Runners the user needs to be aware of,
>> i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark..).
>>
>> I feel like we would eventually replicate all options in the SDK because
>> otherwise users have to use the `--runner-option`, but at least we can
>> specify options which have not been replicated yet.
>>
>> -Max
>>
>> On 16.10.18 10:27, Robert Bradshaw wrote:
>> > Yes, we don't know how to parse and/or validate it.
>> >
>> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lcwik@google.com
>> > <ma...@google.com>> wrote:
>> >
>> >     I see, is the issue that we currently are using a JSON
>> >     representation for options when being serialized and when we get
>> >     some unknown option, we don't know how to convert it into its JSON
>> form?
>> >
>> >     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <
>> robertwb@google.com
>> >     <ma...@google.com>> wrote:
>> >
>> >         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lcwik@google.com
>> >         <ma...@google.com>> wrote:
>> >          >
>> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw
>> >         <robertwb@google.com <ma...@google.com>> wrote:
>> >          >>
>> >          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik
>> >         <lcwik@google.com <ma...@google.com>> wrote:
>> >          >> >
>> >          >> > I agree with the sentiment for better error checking.
>> >          >> >
>> >          >> > We can try to make it such that the SDK can "fetch" the
>> >         set of options that the runner supports by making a call to the
>> >         Job API. The API could return a list of option names
>> >         (descriptions for --help purposes and also potentially the
>> >         expected format) which would remove the worry around "unknown"
>> >         options. Yes I understand to be able to make the Job API call,
>> >         we may need to parse some options from the args parameters first
>> >         and then parse the unknown options after they are fetched.
>> >          >>
>> >          >> This is an interesting idea, but seems it could get quite
>> >         complicated.
>> >          >> E.g. for delegating runners, one would first read the
>> options to
>> >          >> determine which runner to fetch the options from, which
>> >         would then
>> >          >> return a set of options that possibly depends on the values
>> >         of some of
>> >          >> its options...
>> >          >>
>> >          >> > Alternatively, we can choose an explicit format upfront.
>> >          >> > To expand on the exact format for --runner_option=...,
>> >         here are some different ideas:
>> >          >> > 1) Specified multiple times, each one is an explicit flag
>> >          >> > --runner_option=--blah=bar --runner_option=--foo=baz1
>> >         --runner_option=--foo=baz2
>> >          >>
>> >          >> I'm -1 on this format. We should move away from the idea
>> >         that options
>> >          >> == flags (as that doesn't compose well with other libraries
>> >         that do
>> >          >> their own flags parsing). The ability to parse a set of
>> >         flags into
>> >          >> options is just a convenience that an author may (or may
>> >         not) choose
>> >          >> to use (e.g. when running pipelines in a long-lived process
>> like a
>> >          >> service or a notebook, the command line flags are almost
>> >         certainly not
>> >          >> the right interface).
>> >          >>
>> >          >> > 2) specified multiple times, we drop the explicit flag
>> >          >> > --runner_option=blah=bar --runner_option=foo=baz1
>> >         --runner_option=foo=baz2
>> >          >>
>> >          >> This or (4) is my preference.
>> >          >>
>> >          >> > 3) we use a string which the runner can choose to
>> >         interpret however they want (JSON/XML shown below)
>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>> >          >> >
>> >
>>  --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>> >          >>
>> >          >> This would make validation hard. Also, I think it makes
>> >         sense for some
>> >          >> runner options to be "shared" ("parallelism") by convention,
>> >         so letting
>> >          >> it be a free-form string wouldn't allow different runners to
>> >         inspect
>> >          >> different bits.
>> >          >>
>> >          >> We should consider if we should use urns for namespacing,
>> and
>> >          >> assigning semantic meaning to strings, here.
>> >          >>
>> >          >> > 4) we use a string which must be a specific format such as
>> >         JSON (allows the SDK to do simple validation):
>> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>> >          >>
>> >          >> I like this in that at least some validation can be
>> >         performed, and
>> >          >> expectations of how to format richer types. On the other
>> >         hand it gets
>> >          >> a bit verbose, given that most (I'd imagine) options will be
>> >         simple.
>> >          >> As with normal options,
>> >          >>
>> >          >>     --option1=value1 --option2=value2
>> >          >>
>> >          >> is shorthand for {"option1": value1, "option2": value2}.
>> >          >>
>> >          > I lean to 4 the most. With 2, you run into issues of what
>> >         does --runner_option=foo=["a", "b"] --runner_option=foo=["c",
>> >         "d"] mean?
>> >          > Is it an error or list of lists or concatenated. Similar
>> >         issues for map types represented via JSON object {...}
>> >
>> >         We can err to be on the safe side unless/until an argument can
>> >         be made
>> >         that merging is more natural. I just think this will be
>> excessively
>> >         verbose to use.
>> >
>> >          >> > I would strongly suggest that we go with the "fetch"
>> >         approach, since this makes the set of options discoverable and
>> >         helps users find errors much earlier in their pipeline.
>> >          >>
>> >          >> This seems like an advanced feature that SDKs may want to
>> >         support, but
>> >          >> I wouldn't want to require this complexity for bootstrapping
>> >         an SDK.
>> >          >>
>> >          > SDKs that are starting off wouldn't need to "fetch" options,
>> >         they could choose to not support runner options or they could
>> >         choose to pass all options through to the runner blindly.
>> >         Fetching the options only provides the SDK the ability to
>> >         provide error checking upfront and useful error/help messages.
>> >
>> >         But how to even pass all options through blindly is exactly the
>> >         difficulty we're running into here.
>> >
>> >          >> Regarding always keeping runner options separate, +1, though
>> >         I'm not
>> >          >> sure the line is always clear.
>> >          >>
>> >          >>
>> >          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw
>> >         <robertwb@google.com <ma...@google.com>> wrote:
>> >          >> >>
>> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels
>> >         <mxm@apache.org <ma...@apache.org>> wrote:
>> >          >> >> >
>> >          >> >> > I agree that the current approach breaks the pipeline
>> >         options contract
>> >          >> >> > because "unknown" options get parsed in the same way as
>> >         options which
>> >          >> >> > have been defined by the user.
>> >          >> >>
>> >          >> >> FWIW, I think we're already breaking this "contract."
>> >         Unknown options
>> >          >> >> are silently ignored; with this change we just change how
>> >         we record
>> >          >> >> them. It still feels a bit hacky though.
>> >          >> >>
>> >          >> >> > I'm not sure the `experiments` flag works for us. AFAIK
>> >         it only allows
>> >          >> >> > true/false flags. We want to pass all types of pipeline
>> >         options to the
>> >          >> >> > Runner.
>> >          >> >>
>> >          >> >> Experiments is an arbitrary set of strings, which can be
>> >         of the form
>> >          >> >> "param=value" if that's useful. (Dataflow does this.)
>> >         There is, again,
>> >          >> >> no namespacing on the param names, but we could use URNs
>> >         or impose
>> >          >> >> some other structure here.
>> >          >> >>
>> >          >> >> > How to solve this?
>> >          >> >> >
>> >          >> >> > 1) Add all options of all Runners to each SDK
>> >          >> >> > We added some of the FlinkRunner options to the Python
>> >         SDK but realized
>> >          >> >> > syncing is rather cumbersome in the long term. However,
>> >         we want the most
>> >          >> >> > important options to be validated on the client side.
>> >          >> >>
>> >          >> >> I don't think this is sustainable in the long run.
>> >         However, thinking
>> >          >> >> about this, in the worst case validation happens after
>> >         construction
>> >          >> >> but before execution (as with much of our other
>> >         validation) so it
>> >          >> >> isn't that bad.
>> >          >> >>
>> >          >> >> > 2) Pass "unknown" options via a separate list in the
>> >         Proto which can
>> >          >> >> > only be accessed internally by the Runners. This still
>> >         allows passing
>> >          >> >> > arbitrary options but we wouldn't leak unknown options
>> >         and display them
>> >          >> >> > as top-level options.
>> >          >> >>
>> >          >> >> I think there needs to be a way for the user to
>> >         communicate values
>> >          >> >> directly to the runner regardless of the SDK. My
>> >         preference would be
>> >          >> >> to make this explicit, e.g. (repeated)
>> >         --runner_option=..., rather
>> >          >> >> than scooping up all unknown flags at command line
>> >         parsing time.
>> >          >> >> Perhaps an SDK that is aware of some runners could choose
>> >         to lift
>> >          >> >> these as top-level options, but still pass them as runner
>> >         options.
>> >          >> >>
>> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
>> >          >> >> > > The current release branch
>> >          >> >> > >
>> >         (https://github.com/apache/beam/commits/release-2.8.0) was cut
>> >         after the
>> >          >> >> > > revert went in.  Sent out
>> >         https://github.com/apache/beam/pull/6683 as a
>> >          >> >> > > revert of the revert.  Regarding your comment above,
>> >         I can help out with
>> >          >> >> > > the design / PR reviews for common Python code as you
>> >         suggest.
>> >          >> >> > >
>> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise
>> >         <thw@apache.org <ma...@apache.org>
>> >          >> >> > > <mailto:thw@apache.org <ma...@apache.org>>>
>> wrote:
>> >          >> >> > >
>> >          >> >> > >     Thanks, will tag you and looking forward to
>> >         feedback so we can
>> >          >> >> > >     ensure that changes work for everyone.
>> >          >> >> > >
>> >          >> >> > >     Looking at the PR, I see agreement from Max to
>> >         revert the change on
>> >          >> >> > >     the release branch, but not in master. Would you
>> >         mind restoring it
>> >          >> >> > >     in master?
>> >          >> >> > >
>> >          >> >> > >     Thanks
>> >          >> >> > >
>> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay
>> >         <altay@google.com <ma...@google.com>
>> >          >> >> > >     <mailto:altay@google.com
>> >         <ma...@google.com>>> wrote:
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles
>> >         Chen <ccy@google.com <ma...@google.com>
>> >          >> >> > >         <mailto:ccy@google.com
>> >         <ma...@google.com>>> wrote:
>> >          >> >> > >
>> >          >> >> > >             What I mean is that a user may find that
>> >         it works for them
>> >          >> >> > >             to pass "--myarg blah" and access it as
>> >         "options.myarg"
>> >          >> >> > >             without explicitly defining a "my_arg"
>> >         flag due to the added
>> >          >> >> > >             logic.  This is not the intended behavior
>> >         and we may want to
>> >          >> >> > >             change this implementation detail in the
>> >         future.  However,
>> >          >> >> > >             having this logic in a released version
>> >         makes it hard to
>> >          >> >> > >             change this behavior since users may
>> >         erroneously depend on
>> >          >> >> > >             this undocumented behavior.  Instead, we
>> >         should namespace /
>> >          >> >> > >             scope this so that it is obvious that
>> >         this is meant for
>> >          >> >> > >             runner (and not Beam user) consumption.
>> >          >> >> > >
>> >          >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas
>> Weise
>> >          >> >> > >             <thw@apache.org <ma...@apache.org>
>> >         <mailto:thw@apache.org <ma...@apache.org>>> wrote:
>> >          >> >> > >
>> >          >> >> > >                 Can you please elaborate more what
>> >         practical problems
>> >          >> >> > >                 this introduces for users?
>> >          >> >> > >
>> >          >> >> > >                 I can see that this change allows a
>> >         user to specify a
>> >          >> >> > >                 runner specific option, which in the
>> >         future may change
>> >          >> >> > >                 because we decide to scope
>> >         differently. If this only
>> >          >> >> > >                 affects users of the portable Flink
>> >         runner (like us),
>> >          >> >> > >                 then no need to revert, because at
>> >         this early stage we
>> >          >> >> > >                 prefer something that works over
>> >         being blocked.
>> >          >> >> > >
>> >          >> >> > >                 It would also be really great if some
>> >         of the core Python
>> >          >> >> > >                 SDK developers could help out with
>> >         the design aspects
>> >          >> >> > >                 and PR reviews of changes that affect
>> >         common Python
>> >          >> >> > >                 code. Anyone who specifically wants
>> >         to be tagged on
>> >          >> >> > >                 relevant JIRAs and PRs?
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >         I would be happy to be tagged, and I can also
>> >         help with
>> >          >> >> > >         including other relevant folks whenever
>> >         possible. In general I
>> >          >> >> > >         think Robert, Charles, myself are good
>> >         candidates.
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >                 Thanks
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM
>> >         Ahmet Altay
>> >          >> >> > >                 <altay@google.com
>> >         <ma...@google.com> <mailto:altay@google.com
>> >         <ma...@google.com>>> wrote:
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM,
>> >         Charles Chen
>> >          >> >> > >                     <ccy@google.com
>> >         <ma...@google.com> <mailto:ccy@google.com
>> >         <ma...@google.com>>> wrote:
>> >          >> >> > >
>> >          >> >> > >                         For context, I made comments
>> on
>> >          >> >> > > https://github.com/apache/beam/pull/6600 noting
>> >          >> >> > >                         that the changes being made
>> >         were not good for
>> >          >> >> > >                         Beam
>> >         backwards-compatibility.  The change as is
>> >          >> >> > >                         allows users to use pipeline
>> >         options without
>> >          >> >> > >                         explicitly defining them,
>> >         which is not the type
>> >          >> >> > >                         of usage we would like to
>> >         encourage since we
>> >          >> >> > >                         prefer to be explicit
>> >         whenever possible.  If
>> >          >> >> > >                         users write pipelines with
>> >         this sort of pattern,
>> >          >> >> > >                         they will potentially
>> >         encounter pain when
>> >          >> >> > >                         upgrading to a later version
>> >         since this is an
>> >          >> >> > >                         implementation detail and not
>> >         an officially
>> >          >> >> > >                         supported pattern.  I agree
>> >         with the comments
>> >          >> >> > >                         above that this is ultimately
>> >         a scoping issue.
>> >          >> >> > >                         I would not have a problem
>> >         with these changes if
>> >          >> >> > >                         they were explicitly scoped
>> >         under either a
>> >          >> >> > >                         runner or unparsed options
>> >         namespace.
>> >          >> >> > >
>> >          >> >> > >                         As a second note, since the
>> >         2.8.0 release is
>> >          >> >> > >                         being cut right now, because
>> >         of these
>> >          >> >> > >                         backwards-compatibility
>> >         concerns, I would
>> >          >> >> > >                         suggest reverting these
>> >         changes, at least until
>> >          >> >> > >                         2.8.0 is cut, so we can have
>> >         a discussion here
>> >          >> >> > >                         before committing to and
>> >         releasing any API-level
>> >          >> >> > >                         changes.
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >                     +1 I would like to revert the
>> >         changes in order not to
>> >          >> >> > >                     rush this into the release. Once
>> >         this discussion
>> >          >> >> > >                     results in an agreement changes
>> >         can be brought back.
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >                         On Fri, Oct 12, 2018 at 9:26
>> >         AM Henning Rohde
>> >          >> >> > >                         <herohde@google.com
>> >         <ma...@google.com> <mailto:herohde@google.com
>> >         <ma...@google.com>>>
>> >          >> >> > >                         wrote:
>> >          >> >> > >
>> >          >> >> > >                             Agree that pipeline
>> >         options lack some
>> >          >> >> > >                             mechanism for scoping. It
>> >         is also not always
>> >          >> >> > >                             possible to distinguish
>> >         options meant to be
>> >          >> >> > >                             consumed at pipeline
>> >         construction time, by
>> >          >> >> > >                             the runner, by the SDK
>> >         harness, by the user
>> >          >> >> > >                             code or any combination
>> >         -- and this causes
>> >          >> >> > >                             confusion every now and
>> then.
>> >          >> >> > >
>> >          >> >> > >                             For Dataflow, we have
>> >         been using
>> >          >> >> > >                             "experiments" for
>> >         arbitrary runner-specific
>> >          >> >> > >                             options. It's simply a
>> >         string list pipeline
>> >          >> >> > >                             option that all SDKs
>> >         support and, for Go at
>> >          >> >> > >                             least, is sent to
>> >         portable runners. Flink
>> >          >> >> > >                             can do the same in the
>> >         short term to move
>> >          >> >> > >                             forward.
>> >          >> >> > >
>> >          >> >> > >                             Henning
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >                             On Fri, Oct 12, 2018 at
>> >         8:50 AM Thomas Weise
>> >          >> >> > >                             <thw@apache.org
>> >         <ma...@apache.org> <mailto:thw@apache.org
>> >         <ma...@apache.org>>> wrote:
>> >          >> >> > >
>> >          >> >> > >                                 [moving to the list]
>> >          >> >> > >
>> >          >> >> > >                                 The requirement
>> >         driving this part of the
>> >          >> >> > >                                 change was to allow a
>> >         user to specify
>> >          >> >> > >                                 pipeline options that
>> >         a runner supports
>> >          >> >> > >                                 without having to
>> >         declare those in each
>> >          >> >> > >                                 language SDK.
>> >          >> >> > >
>> >          >> >> > >                                 In the specific
>> >         scenario, we have
>> >          >> >> > >                                 options that the
>> >         Flink runner supports
>> >          >> >> > >                                 (and can validate),
>> >         that are not
>> >          >> >> > >                                 enumerated in the
>> >         Python SDK.
>> >          >> >> > >
>> >          >> >> > >                                 I think we have a
>> >         bigger problem scoping
>> >          >> >> > >                                 pipeline options. For
>> >         example, the
>> >          >> >> > >                                 runner options are
>> >         dumped into the SDK
>> >          >> >> > >                                 worker. There is also
>> >         a possibility of
>> >          >> >> > >                                 name collisions. So I
>> >         think this would
>> >          >> >> > >                                 benefit from broader
>> >         feedback.
>> >          >> >> > >
>> >          >> >> > >                                 Thanks,
>> >          >> >> > >                                 Thomas
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >                                 ---------- Forwarded
>> >         message ---------
>> >          >> >> > >                                 From: *Charles Chen*
>> >          >> >> > >
>> >           <notifications@github.com <ma...@github.com>
>> >          >> >> > >
>> >           <mailto:notifications@github.com
>> >         <ma...@github.com>>>
>> >          >> >> > >                                 Date: Fri, Oct 12,
>> >         2018 at 8:36 AM
>> >          >> >> > >                                 Subject: Re:
>> >         [apache/beam] [BEAM-5442]
>> >          >> >> > >                                 Store duplicate
>> >         unknown options in a
>> >          >> >> > >                                 list argument (#6600)
>> >          >> >> > >                                 To: apache/beam
>> >         <beam@noreply.github.com <ma...@noreply.github.com>
>> >          >> >> > >
>> >           <mailto:beam@noreply.github.com <mailto:
>> beam@noreply.github.com>>>
>> >          >> >> > >                                 Cc: Thomas Weise
>> >         <thomas.weise@gmail.com <ma...@gmail.com>
>> >          >> >> > >
>> >           <mailto:thomas.weise@gmail.com <mailto:thomas.weise@gmail.com
>> >>>,
>> >          >> >> > >                                 Mention
>> >         <mention@noreply.github.com <ma...@noreply.github.com>
>> >          >> >> > >
>> >           <mailto:mention@noreply.github.com
>> >         <ma...@noreply.github.com>>>
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >                                 CC: @tweise
>> >         <https://github.com/tweise>
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >
>> >          >> >> > >
>> >
>>
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Thomas Weise <th...@apache.org>.

Discovering options from the job server seems preferable over replicating
runner options in SDKs.

Runners evolve on their own, and with portability the SDK does not need to
know anything about the runner.

Regarding --runner-option: it is true that this looks less user-friendly.
On the other hand, it eliminates the possibility of name collisions.

But if options are discovered, the SDK can perform full validation. It
would only be necessary to use explicit scoping when there is ambiguity.
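
A rough sketch of how that could look in Python, assuming the SDK has already
obtained the set of option names the runner supports (for example from the
job server; no such call is shown or implied here). The function
split_options, its parameters, and the option names used with it are all
hypothetical, and treating a colliding top-level name as an error is only one
possible policy.

```python
def split_options(top_level, explicit_runner, sdk_names, runner_names):
    """Split user options into (sdk_options, runner_options).

    top_level:       options given as ordinary top-level flags.
    explicit_runner: options given with explicit runner scoping.
    sdk_names:       option names the SDK itself defines.
    runner_names:    option names discovered from the runner (assumed input).
    """
    for name in explicit_runner:
        if name not in runner_names:
            raise ValueError("runner does not support option %r" % name)
    sdk_opts = {}
    runner_opts = dict(explicit_runner)
    for name, value in top_level.items():
        in_sdk = name in sdk_names
        in_runner = name in runner_names
        if in_sdk and in_runner:
            # Name collision: only here is explicit scoping required.
            raise ValueError(
                "option %r is ambiguous; scope it explicitly" % name)
        if in_sdk:
            sdk_opts[name] = value
        elif in_runner:
            if name in runner_opts:
                raise ValueError("option %r was given twice" % name)
            runner_opts[name] = value
        else:
            raise ValueError("unknown option %r" % name)
    return sdk_opts, runner_opts
```

Under these assumptions a runner-only name such as parallelism can be given at
the top level and is validated up front, while a name defined by both the SDK
and the runner has to be passed through the explicitly scoped form.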

Thomas


On Tue, Oct 16, 2018 at 3:48 AM Maximilian Michels <mx...@apache.org> wrote:

> Fetching options directly from the Runner's JobServer seems like the
> ideal solution. I agree with Robert that it creates additional
> complexity for SDK authors, so the `--runner-option` flag would be an
> easy and explicit way to specify additional Runner options.
>
> The format I prefer would be: --runner_option=key1=val1
> --runner_option=key2=val2
>
> Now, from the perspective of end users, I think it is neither convenient
> nor reasonable to require the use of the `--runner-option` flag. To the
> user it seems nebulous why some pipeline options live in the top-level
> option namespace while others need to be nested within an option. This
> is amplified by there being two Runners the user needs to be aware of,
> i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark..).
>
> I feel like we would eventually replicate all options in the SDK because
> otherwise users have to use the `--runner-option`, but at least we can
> specify options which have not been replicated yet.
>
> -Max
>
> On 16.10.18 10:27, Robert Bradshaw wrote:
> > Yes, we don't know how to parse and/or validate it.
> >
> > On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lcwik@google.com
> > <ma...@google.com>> wrote:
> >
> >     I see, is the issue that we currently are using a JSON
> >     representation for options when being serialized and when we get
> >     some unknown option, we don't know how to convert it into its JSON form?
> >
> >     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <robertwb@google.com
> >     <ma...@google.com>> wrote:
> >
> >         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lcwik@google.com
> >         <ma...@google.com>> wrote:
> >          >
> >          > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw
> >         <robertwb@google.com <ma...@google.com>> wrote:
> >          >>
> >          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik
> >         <lcwik@google.com <ma...@google.com>> wrote:
> >          >> >
> >          >> > I agree with the sentiment for better error checking.
> >          >> >
> >          >> > We can try to make it such that the SDK can "fetch" the
> >         set of options that the runner supports by making a call to the
> >         Job API. The API could return a list of option names
> >         (descriptions for --help purposes and also potentially the
> >         expected format) which would remove the worry around "unknown"
> >         options. Yes I understand to be able to make the Job API call,
> >         we may need to parse some options from the args parameters first
> >         and then parse the unknown options after they are fetched.
> >          >>
> >          >> This is an interesting idea, but seems it could get quite
> >         complicated.
> >          >> E.g. for delegating runners, one would first read the options to
> >          >> determine which runner to fetch the options from, which
> >         would then
> >          >> return a set of options that possibly depends on the values
> >         of some of
> >          >> its options...
> >          >>
> >          >> > Alternatively, we can choose an explicit format upfront.
> >          >> > To expand on the exact format for --runner_option=...,
> >         here are some different ideas:
> >          >> > 1) Specified multiple times, each one is an explicit flag
> >          >> > --runner_option=--blah=bar --runner_option=--foo=baz1
> >         --runner_option=--foo=baz2
> >          >>
> >          >> I'm -1 on this format. We should move away from the idea
> >         that options
> >          >> == flags (as that doesn't compose well with other libraries
> >         that do
> >          >> their own flags parsing). The ability to parse a set of
> >         flags into
> >          >> options is just a convenience that an author may (or may
> >         not) choose
> >          >> to use (e.g. when running pipelines in a long-lived process like a
> >          >> service or a notebook, the command line flags are almost
> >         certainly not
> >          >> the right interface).
> >          >>
> >          >> > 2) specified multiple times, we drop the explicit flag
> >          >> > --runner_option=blah=bar --runner_option=foo=baz1
> >         --runner_option=foo=baz2
> >          >>
> >          >> This or (4) is my preference.
> >          >>
> >          >> > 3) we use a string which the runner can choose to
> >         interpret however they want (JSON/XML shown below)
> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
> >          >> >
> >          >> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
> >          >>
> >          >> This would make validation hard. Also, I think it makes
> >         sense for some
> >          >> runner options to be "shared" ("parallelism") by convention,
> >         so letting
> >          >> it be a free-form string wouldn't allow different runners to
> >         inspect
> >          >> different bits.
> >          >>
> >          >> We should consider if we should use urns for namespacing, and
> >          >> assigning semantic meaning to strings, here.
> >          >>
> >          >> > 4) we use a string which must be a specific format such as
> >         JSON (allows the SDK to do simple validation):
> >          >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
> >          >>
> >          >> I like this in that at least some validation can be
> >         performed, and
> >          >> expectations of how to format richer types. On the other
> >         hand it gets
> >          >> a bit verbose, given that most (I'd imagine) options will be
> >         simple.
> >          >> As with normal options,
> >          >>
> >          >>     --option1=value1 --option2=value2
> >          >>
> >          >> is shorthand for {"option1": value1, "option2": value2}.
> >          >>
> >          > I lean to 4 the most. With 2, you run into issues of what
> >         does --runner_option=foo=["a", "b"] --runner_option=foo=["c",
> >         "d"] mean?
> >          > Is it an error or list of lists or concatenated. Similar
> >         issues for map types represented via JSON object {...}
> >
> >         We can err to be on the safe side unless/until an argument can
> >         be made
> >         that merging is more natural. I just think this will be excessively
> >         verbose to use.
> >
> >          >> > I would strongly suggest that we go with the "fetch"
> >         approach, since this makes the set of options discoverable and
> >         helps users find errors much earlier in their pipeline.
> >          >>
> >          >> This seems like an advanced feature that SDKs may want to
> >         support, but
> >          >> I wouldn't want to require this complexity for bootstrapping
> >         an SDK.
> >          >>
> >          > SDKs that are starting off wouldn't need to "fetch" options,
> >         they could choose to not support runner options or they could
> >         choose to pass all options through to the runner blindly.
> >         Fetching the options only provides the SDK the ability to
> >         provide error checking upfront and useful error/help messages.
> >
> >         But how to even pass all options through blindly is exactly the
> >         difficulty we're running into here.
> >
> >          >> Regarding always keeping runner options separate, +1, though
> >         I'm not
> >          >> sure the line is always clear.
> >          >>
> >          >>
> >          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw
> >         <robertwb@google.com <ma...@google.com>> wrote:
> >          >> >>
> >          >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels
> >         <mxm@apache.org <ma...@apache.org>> wrote:
> >          >> >> >
> >          >> >> > I agree that the current approach breaks the pipeline
> >         options contract
> >          >> >> > because "unknown" options get parsed in the same way as
> >         options which
> >          >> >> > have been defined by the user.
> >          >> >>
> >          >> >> FWIW, I think we're already breaking this "contract."
> >         Unknown options
> >          >> >> are silently ignored; with this change we just change how
> >         we record
> >          >> >> them. It still feels a bit hacky though.
> >          >> >>
> >          >> >> > I'm not sure the `experiments` flag works for us. AFAIK
> >         it only allows
> >          >> >> > true/false flags. We want to pass all types of pipeline
> >         options to the
> >          >> >> > Runner.
> >          >> >>
> >          >> >> Experiments is an arbitrary set of strings, which can be
> >         of the form
> >          >> >> "param=value" if that's useful. (Dataflow does this.)
> >         There is, again,
> >          >> >> no namespacing on the param names, but we could use urns
> >         or impose
> >          >> >> some other structure here.
> >          >> >>
> >          >> >> > How to solve this?
> >          >> >> >
> >          >> >> > 1) Add all options of all Runners to each SDK
> >          >> >> > We added some of the FlinkRunner options to the Python
> >         SDK but realized
> >          >> >> > syncing is rather cumbersome in the long term. However,
> >         we want the most
> >          >> >> > important options to be validated on the client side.
> >          >> >>
> >          >> >> I don't think this is sustainable in the long run.
> >         However, thinking
> >          >> >> about this, in the worst case validation happens after
> >         construction
> >          >> >> but before execution (as with much of our other
> >         validation) so it
> >          >> >> isn't that bad.
> >          >> >>
> >          >> >> > 2) Pass "unknown" options via a separate list in the
> >         Proto which can
> >          >> >> > only be accessed internally by the Runners. This still
> >         allows passing
> >          >> >> > arbitrary options but we wouldn't leak unknown options
> >         and display them
> >          >> >> > as top-level options.
> >          >> >>
> >          >> >> I think there needs to be a way for the user to
> >         communicate values
> >          >> >> directly to the runner regardless of the SDK. My
> >         preference would be
> >          >> >> to make this explicit, e.g. (repeated)
> >         --runner_option=..., rather
> >          >> >> than scooping up all unknown flags at command line
> >         parsing time.
> >          >> >> Perhaps an SDK that is aware of some runners could choose
> >         to lift
> >          >> >> these as top-level options, but still pass them as runner
> >         options.
> >          >> >>
> >          >> >> > On 13.10.18 02:34, Charles Chen wrote:
> >          >> >> > > The current release branch
> >          >> >> > >
> >         (https://github.com/apache/beam/commits/release-2.8.0) was cut
> >         after the
> >          >> >> > > revert went in.  Sent out
> >         https://github.com/apache/beam/pull/6683 as a
> >          >> >> > > revert of the revert.  Regarding your comment above,
> >         I can help out with
> >          >> >> > > the design / PR reviews for common Python code as you
> >         suggest.
> >          >> >> > >
> >          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise
> >         <thw@apache.org <ma...@apache.org>
> >          >> >> > > <mailto:thw@apache.org <ma...@apache.org>>> wrote:
> >          >> >> > >
> >          >> >> > >     Thanks, will tag you and looking forward to
> >         feedback so we can
> >          >> >> > >     ensure that changes work for everyone.
> >          >> >> > >
> >          >> >> > >     Looking at the PR, I see agreement from Max to
> >         revert the change on
> >          >> >> > >     the release branch, but not in master. Would you
> >         mind restoring it
> >          >> >> > >     in master?
> >          >> >> > >
> >          >> >> > >     Thanks
> >          >> >> > >
> >          >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay
> >         <altay@google.com <ma...@google.com>
> >          >> >> > >     <mailto:altay@google.com
> >         <ma...@google.com>>> wrote:
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles
> >         Chen <ccy@google.com <ma...@google.com>
> >          >> >> > >         <mailto:ccy@google.com
> >         <ma...@google.com>>> wrote:
> >          >> >> > >
> >          >> >> > >             What I mean is that a user may find that
> >         it works for them
> >          >> >> > >             to pass "--myarg blah" and access it as
> >         "options.myarg"
> >          >> >> > >             without explicitly defining a "myarg"
> >         flag due to the added
> >          >> >> > >             logic.  This is not the intended behavior
> >         and we may want to
> >          >> >> > >             change this implementation detail in the
> >         future.  However,
> >          >> >> > >             having this logic in a released version
> >         makes it hard to
> >          >> >> > >             change this behavior since users may
> >         erroneously depend on
> >          >> >> > >             this undocumented behavior.  Instead, we
> >         should namespace /
> >          >> >> > >             scope this so that it is obvious that
> >         this is meant for
> >          >> >> > >             runner (and not Beam user) consumption.
> >          >> >> > >
> >          >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
> >          >> >> > >             <thw@apache.org <ma...@apache.org>
> >         <mailto:thw@apache.org <ma...@apache.org>>> wrote:
> >          >> >> > >
> >          >> >> > >                 Can you please elaborate more on what
> >         practical problems
> >          >> >> > >                 this introduces for users?
> >          >> >> > >
> >          >> >> > >                 I can see that this change allows a
> >         user to specify a
> >          >> >> > >                 runner specific option, which in the
> >         future may change
> >          >> >> > >                 because we decide to scope
> >         differently. If this only
> >          >> >> > >                 affects users of the portable Flink
> >         runner (like us),
> >          >> >> > >                 then no need to revert, because at
> >         this early stage we
> >          >> >> > >                 prefer something that works over
> >         being blocked.
> >          >> >> > >
> >          >> >> > >                 It would also be really great if some
> >         of the core Python
> >          >> >> > >                 SDK developers could help out with
> >         the design aspects
> >          >> >> > >                 and PR reviews of changes that affect
> >         common Python
> >          >> >> > >                 code. Anyone who specifically wants
> >         to be tagged on
> >          >> >> > >                 relevant JIRAs and PRs?
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >         I would be happy to be tagged, and I can also
> >         help with
> >          >> >> > >         including other relevant folks whenever
> >         possible. In general I
> >         think Robert, Charles, and myself are good
> >         candidates.
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >                 Thanks
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM
> >         Ahmet Altay
> >          >> >> > >                 <altay@google.com
> >         <ma...@google.com> <mailto:altay@google.com
> >         <ma...@google.com>>> wrote:
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM,
> >         Charles Chen
> >          >> >> > >                     <ccy@google.com
> >         <ma...@google.com> <mailto:ccy@google.com
> >         <ma...@google.com>>> wrote:
> >          >> >> > >
> >          >> >> > >                         For context, I made comments on
> >          >> >> > > https://github.com/apache/beam/pull/6600 noting
> >          >> >> > >                         that the changes being made
> >         were not good for
> >          >> >> > >                         Beam
> >         backwards-compatibility.  The change as is
> >          >> >> > >                         allows users to use pipeline
> >         options without
> >          >> >> > >                         explicitly defining them,
> >         which is not the type
> >          >> >> > >                         of usage we would like to
> >         encourage since we
> >          >> >> > >                         prefer to be explicit
> >         whenever possible.  If
> >          >> >> > >                         users write pipelines with
> >         this sort of pattern,
> >          >> >> > >                         they will potentially
> >         encounter pain when
> >          >> >> > >                         upgrading to a later version
> >         since this is an
> >          >> >> > >                         implementation detail and not
> >         an officially
> >          >> >> > >                         supported pattern.  I agree
> >         with the comments
> >          >> >> > >                         above that this is ultimately
> >         a scoping issue.
> >          >> >> > >                         I would not have a problem
> >         with these changes if
> >          >> >> > >                         they were explicitly scoped
> >         under either a
> >          >> >> > >                         runner or unparsed options
> >         namespace.
> >          >> >> > >
> >          >> >> > >                         As a second note, since the
> >         2.8.0 release is
> >          >> >> > >                         being cut right now, because
> >         of these
> >          >> >> > >                         backwards-compatibility
> >         concerns, I would
> >          >> >> > >                         suggest reverting these
> >         changes, at least until
> >          >> >> > >                         2.8.0 is cut, so we can have
> >         a discussion here
> >          >> >> > >                         before committing to and
> >         releasing any API-level
> >          >> >> > >                         changes.
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >                     +1 I would like to revert the
> >         changes in order not to
> >          >> >> > >                     rush this into the release. Once
> >         this discussion
> >          >> >> > >                     results in an agreement changes
> >         can be brought back.
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >                         On Fri, Oct 12, 2018 at 9:26
> >         AM Henning Rohde
> >          >> >> > >                         <herohde@google.com
> >         <ma...@google.com> <mailto:herohde@google.com
> >         <ma...@google.com>>>
> >          >> >> > >                         wrote:
> >          >> >> > >
> >          >> >> > >                             Agree that pipeline
> >         options lack some
> >          >> >> > >                             mechanism for scoping. It
> >         is also not always
> >          >> >> > >                             possible to distinguish
> >         options meant to be
> >          >> >> > >                             consumed at pipeline
> >         construction time, by
> >          >> >> > >                             the runner, by the SDK
> >         harness, by the user
> >          >> >> > >                             code or any combination
> >         -- and this causes
> >          >> >> > >                             confusion every now and then.
> >          >> >> > >
> >          >> >> > >                             For Dataflow, we have
> >         been using
> >          >> >> > >                             "experiments" for
> >         arbitrary runner-specific
> >          >> >> > >                             options. It's simply a
> >         string list pipeline
> >          >> >> > >                             option that all SDKs
> >         support and, for Go at
> >          >> >> > >                             least, is sent to
> >         portable runners. Flink
> >          >> >> > >                             can do the same in the
> >         short term to move
> >          >> >> > >                             forward.
> >          >> >> > >
> >          >> >> > >                             Henning
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >                             On Fri, Oct 12, 2018 at
> >         8:50 AM Thomas Weise
> >          >> >> > >                             <thw@apache.org
> >         <ma...@apache.org> <mailto:thw@apache.org
> >         <ma...@apache.org>>> wrote:
> >          >> >> > >
> >          >> >> > >                                 [moving to the list]
> >          >> >> > >
> >          >> >> > >                                 The requirement
> >         driving this part of the
> >          >> >> > >                                 change was to allow a
> >         user to specify
> >          >> >> > >                                 pipeline options that
> >         a runner supports
> >          >> >> > >                                 without having to
> >         declare those in each
> >          >> >> > >                                 language SDK.
> >          >> >> > >
> >          >> >> > >                                 In the specific
> >         scenario, we have
> >          >> >> > >                                 options that the
> >         Flink runner supports
> >          >> >> > >                                 (and can validate),
> >         that are not
> >          >> >> > >                                 enumerated in the
> >         Python SDK.
> >          >> >> > >
> >          >> >> > >                                 I think we have a
> >         bigger problem scoping
> >          >> >> > >                                 pipeline options. For
> >         example, the
> >          >> >> > >                                 runner options are
> >         dumped into the SDK
> >          >> >> > >                                 worker. There is also
> >         a possibility of
> >          >> >> > >                                 name collisions. So I
> >         think this would
> >          >> >> > >                                 benefit from broader
> >         feedback.
> >          >> >> > >
> >          >> >> > >                                 Thanks,
> >          >> >> > >                                 Thomas
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >                                 ---------- Forwarded
> >         message ---------
> >          >> >> > >                                 From: *Charles Chen*
> >          >> >> > >
> >           <notifications@github.com <ma...@github.com>
> >          >> >> > >
> >           <mailto:notifications@github.com
> >         <ma...@github.com>>>
> >          >> >> > >                                 Date: Fri, Oct 12,
> >         2018 at 8:36 AM
> >          >> >> > >                                 Subject: Re:
> >         [apache/beam] [BEAM-5442]
> >          >> >> > >                                 Store duplicate
> >         unknown options in a
> >          >> >> > >                                 list argument (#6600)
> >          >> >> > >                                 To: apache/beam
> >         <beam@noreply.github.com <ma...@noreply.github.com>
> >          >> >> > >
> >           <mailto:beam@noreply.github.com <mailto:
> beam@noreply.github.com>>>
> >          >> >> > >                                 Cc: Thomas Weise
> >         <thomas.weise@gmail.com <ma...@gmail.com>
> >          >> >> > >
> >           <mailto:thomas.weise@gmail.com <mailto:thomas.weise@gmail.com
> >>>,
> >          >> >> > >                                 Mention
> >         <mention@noreply.github.com <ma...@noreply.github.com>
> >          >> >> > >
> >           <mailto:mention@noreply.github.com
> >         <ma...@noreply.github.com>>>
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >                                 CC: @tweise
> >         <https://github.com/tweise>
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >
> >          >> >> > >
> >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Maximilian Michels <mx...@apache.org>.

Fetching options directly from the Runner's JobServer seems like the 
ideal solution. I agree with Robert that it creates additional 
complexity for SDK authors, so the `--runner-option` flag would be an 
easy and explicit way to specify additional Runner options.

The format I prefer would be: --runner_option=key1=val1 
--runner_option=key2=val2
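
As a rough illustration of this format, here is a sketch using Python's
standard `argparse`, not the Beam SDK's own option parsing. The function
name `parse_runner_options` is hypothetical. Repeated keys are collected
into a list, and values are kept as plain strings because the SDK would not
know their types:

```python
import argparse


def parse_runner_options(argv):
    """Collect repeated --runner_option=key=value flags into a dict of lists.

    Returns (runner_options, remaining_args); everything that is not a
    --runner_option flag is passed back untouched.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(
        "--runner_option", action="append", default=[], metavar="KEY=VALUE"
    )
    known, remaining = parser.parse_known_args(argv)

    runner_options = {}
    for item in known.runner_option:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got: {item!r}")
        runner_options.setdefault(key, []).append(value)
    return runner_options, remaining
```

Whether a repeated key should become a list, as here, or be rejected as an
error is one of the open questions in this thread; the sketch simply picks
the list behaviour.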

Now, from the perspective of end users, I think it is neither convenient 
nor reasonable to require the use of the `--runner-option` flag. To the 
user it seems nebulous why some pipeline options live in the top-level 
option namespace while others need to be nested within an option. This 
is amplified by there being two Runners the user needs to be aware of, 
i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark...).

I feel like we would eventually replicate all options in the SDK because 
otherwise users have to use the `--runner-option` flag, but at least we can
specify options which have not been replicated yet.

-Max

On 16.10.18 10:27, Robert Bradshaw wrote:
> Yes, we don't know how to parse and/or validate it.
> 
> On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lcwik@google.com 
> <ma...@google.com>> wrote:
> 
>     I see, is the issue that we currently are using a JSON
>     representation for options when being serialized and when we get
>     some unknown option, we don't know how to convert it into its JSON form?
> 
>     On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <robertwb@google.com
>     <ma...@google.com>> wrote:
> 
>         On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lcwik@google.com
>         <ma...@google.com>> wrote:
>          >
>          > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw
>         <robertwb@google.com <ma...@google.com>> wrote:
>          >>
>          >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik
>         <lcwik@google.com <ma...@google.com>> wrote:
>          >> >
>          >> > I agree with the sentiment for better error checking.
>          >> >
>          >> > We can try to make it such that the SDK can "fetch" the
>         set of options that the runner supports by making a call to the
>         Job API. The API could return a list of option names
>         (descriptions for --help purposes and also potentially the
>         expected format) which would remove the worry around "unknown"
>         options. Yes I understand to be able to make the Job API call,
>         we may need to parse some options from the args parameters first
>         and then parse the unknown options after they are fetched.
>          >>
>          >> This is an interesting idea, but seems it could get quite
>         complicated.
>          >> E.g. for delegating runners, one would first read the options to
>          >> determine which runner to fetch the options from, which
>         would then
>          >> return a set of options that possibly depends on the values
>         of some of
>          >> its options...
>          >>
>          >> > Alternatively, we can choose an explicit format upfront.
>          >> > To expand on the exact format for --runner_option=...,
>         here are some different ideas:
>          >> > 1) Specified multiple times, each one is an explicit flag
>          >> > --runner_option=--blah=bar --runner_option=--foo=baz1
>         --runner_option=--foo=baz2
>          >>
>          >> I'm -1 on this format. We should move away from the idea
>         that options
>          >> == flags (as that doesn't compose well with other libraries
>         that do
>          >> their own flags parsing). The ability to parse a set of
>         flags into
>          >> options is just a convenience that an author may (or may
>         not) choose
>          >> to use (e.g. when running pipelines in a long-lived process like a
>          >> service or a notebook, the command line flags are almost
>         certainly not
>          >> the right interface).
>          >>
>          >> > 2) specified multiple times, we drop the explicit flag
>          >> > --runner_option=blah=bar --runner_option=foo=baz1
>         --runner_option=foo=baz2
>          >>
>          >> This or (4) is my preference.
>          >>
>          >> > 3) we use a string which the runner can choose to
>         interpret however they want (JSON/XML shown below)
>          >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>          >> >
>         --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>          >>
>          >> This would make validation hard. Also, I think it makes
>         sense for some
>          >> runner options to be "shared" ("parallelism") by convention,
>         so letting
>          >> it be a free-form string wouldn't allow different runners to
>         inspect
>          >> different bits.
>          >>
>          >> We should consider if we should use urns for namespacing, and
>          >> assigning semantic meaning to strings, here.
>          >>
>          >> > 4) we use a string which must be a specific format such as
>         JSON (allows the SDK to do simple validation):
>          >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>          >>
>          >> I like this in that at least some validation can be
>         performed, and
>          >> expectations of how to format richer types. On the other
>         hand it gets
>          >> a bit verbose, given that most (I'd imagine) options will be
>         simple.
>          >> As with normal options,
>          >>
>          >>     --option1=value1 --option2=value2
>          >>
>          >> is shorthand for {"option1": value1, "option2": value2}.
>          >>
>          > I lean to 4 the most. With 2, you run into issues of what
>         does --runner_option=foo=["a", "b"] --runner_option=foo=["c",
>         "d"] mean?
>          > Is it an error or list of lists or concatenated. Similar
>         issues for map types represented via JSON object {...}
> 
>         We can err to be on the safe side unless/until an argument can
>         be made
>         that merging is more natural. I just think this will be excessively
>         verbose to use.
> 
>          >> > I would strongly suggest that we go with the "fetch"
>         approach, since this makes the set of options discoverable and
>         helps users find errors much earlier in their pipeline.
>          >>
>          >> This seems like an advanced feature that SDKs may want to
>         support, but
>          >> I wouldn't want to require this complexity for bootstrapping
>         an SDK.
>          >>
>          > SDKs that are starting off wouldn't need to "fetch" options,
>         they could choose to not support runner options or they could
>         choose to pass all options through to the runner blindly.
>         Fetching the options only provides the SDK the ability to
>         provide error checking upfront and useful error/help messages.
> 
>         But how to even pass all options through blindly is exactly the
>         difficulty we're running into here.
> 
>          >> Regarding always keeping runner options separate, +1, though
>         I'm not
>          >> sure the line is always clear.
>          >>
>          >>
>          >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw
>         <robertwb@google.com <ma...@google.com>> wrote:
>          >> >>
>          >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels
>         <mxm@apache.org <ma...@apache.org>> wrote:
>          >> >> >
>          >> >> > I agree that the current approach breaks the pipeline
>         options contract
>          >> >> > because "unknown" options get parsed in the same way as
>         options which
>          >> >> > have been defined by the user.
>          >> >>
>          >> >> FWIW, I think we're already breaking this "contract."
>         Unknown options
>          >> >> are silently ignored; with this change we just change how
>         we record
>          >> >> them. It still feels a bit hacky though.
>          >> >>
>          >> >> > I'm not sure the `experiments` flag works for us. AFAIK
>         it only allows
>          >> >> > true/false flags. We want to pass all types of pipeline
>         options to the
>          >> >> > Runner.
>          >> >>
>          >> >> Experiments is an arbitrary set of strings, which can be
>         of the form
>          >> >> "param=value" if that's useful. (Dataflow does this.)
>         There is, again,
>          >> >> no namespacing on the param names, but we could use URNs
>         or impose
>          >> >> some other structure here.
>          >> >>
>          >> >> > How to solve this?
>          >> >> >
>          >> >> > 1) Add all options of all Runners to each SDK
>          >> >> > We added some of the FlinkRunner options to the Python
>         SDK but realized
>          >> >> > syncing is rather cumbersome in the long term. However,
>         we want the most
>          >> >> > important options to be validated on the client side.
>          >> >>
>          >> >> I don't think this is sustainable in the long run.
>         However, thinking
>          >> >> about this, in the worst case validation happens after
>         construction
>          >> >> but before execution (as with much of our other
>         validation) so it
>          >> >> isn't that bad.
>          >> >>
>          >> >> > 2) Pass "unknown" options via a separate list in the
>         Proto which can
>          >> >> > only be accessed internally by the Runners. This still
>         allows passing
>          >> >> > arbitrary options but we wouldn't leak unknown options
>         and display them
>          >> >> > as top-level options.
>          >> >>
>          >> >> I think there needs to be a way for the user to
>         communicate values
>          >> >> directly to the runner regardless of the SDK. My
>         preference would be
>          >> >> to make this explicit, e.g. (repeated)
>         --runner_option=..., rather
>          >> >> than scooping up all unknown flags at command line
>         parsing time.
>          >> >> Perhaps an SDK that is aware of some runners could choose
>         to lift
>          >> >> these as top-level options, but still pass them as runner
>         options.
>          >> >>
>          >> >> > On 13.10.18 02:34, Charles Chen wrote:
>          >> >> > > The current release branch
>          >> >> > >
>         (https://github.com/apache/beam/commits/release-2.8.0) was cut
>         after the
>          >> >> > > revert went in.  Sent out
>         https://github.com/apache/beam/pull/6683 as a
>          >> >> > > revert of the revert.  Regarding your comment above,
>         I can help out with
>          >> >> > > the design / PR reviews for common Python code as you
>         suggest.
>          >> >> > >
>          >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise
>         <thw@apache.org <ma...@apache.org>
>          >> >> > > <mailto:thw@apache.org <ma...@apache.org>>> wrote:
>          >> >> > >
>          >> >> > >     Thanks, will tag you and looking forward to
>         feedback so we can
>          >> >> > >     ensure that changes work for everyone.
>          >> >> > >
>          >> >> > >     Looking at the PR, I see agreement from Max to
>         revert the change on
>          >> >> > >     the release branch, but not in master. Would you
>         mind restoring it
>          >> >> > >     in master?
>          >> >> > >
>          >> >> > >     Thanks
>          >> >> > >
>          >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay
>         <altay@google.com <ma...@google.com>
>          >> >> > >     <mailto:altay@google.com
>         <ma...@google.com>>> wrote:
>          >> >> > >
>          >> >> > >
>          >> >> > >
>          >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles
>         Chen <ccy@google.com <ma...@google.com>
>          >> >> > >         <mailto:ccy@google.com
>         <ma...@google.com>>> wrote:
>          >> >> > >
>          >> >> > >             What I mean is that a user may find that
>         it works for them
>          >> >> > >             to pass "--myarg blah" and access it as
>         "options.myarg"
>          >> >> > >             without explicitly defining a "my_arg"
>         flag due to the added
>          >> >> > >             logic.  This is not the intended behavior
>         and we may want to
>          >> >> > >             change this implementation detail in the
>         future.  However,
>          >> >> > >             having this logic in a released version
>         makes it hard to
>          >> >> > >             change this behavior since users may
>         erroneously depend on
>          >> >> > >             this undocumented behavior.  Instead, we
>         should namespace /
>          >> >> > >             scope this so that it is obvious that
>         this is meant for
>          >> >> > >             runner (and not Beam user) consumption.
>          >> >> > >
>          >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
>          >> >> > >             <thw@apache.org <ma...@apache.org>
>         <mailto:thw@apache.org <ma...@apache.org>>> wrote:
>          >> >> > >
>          >> >> > >                 Can you please elaborate more what
>         practical problems
>          >> >> > >                 this introduces for users?
>          >> >> > >
>          >> >> > >                 I can see that this change allows a
>         user to specify a
>          >> >> > >                 runner specific option, which in the
>         future may change
>          >> >> > >                 because we decide to scope
>         differently. If this only
>          >> >> > >                 affects users of the portable Flink
>         runner (like us),
>          >> >> > >                 then no need to revert, because at
>         this early stage we
>          >> >> > >                 prefer something that works over
>         being blocked.
>          >> >> > >
>          >> >> > >                 It would also be really great if some
>         of the core Python
>          >> >> > >                 SDK developers could help out with
>         the design aspects
>          >> >> > >                 and PR reviews of changes that affect
>         common Python
>          >> >> > >                 code. Anyone who specifically wants
>         to be tagged on
>          >> >> > >                 relevant JIRAs and PRs?
>          >> >> > >
>          >> >> > >
>          >> >> > >         I would be happy to be tagged, and I can also
>         help with
>          >> >> > >         including other relevant folks whenever
>         possible. In general I
>          >> >> > >         think Robert, Charles, myself are good
>         candidates.
>          >> >> > >
>          >> >> > >
>          >> >> > >                 Thanks
>          >> >> > >
>          >> >> > >
>          >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM
>         Ahmet Altay
>          >> >> > >                 <altay@google.com
>         <ma...@google.com> <mailto:altay@google.com
>         <ma...@google.com>>> wrote:
>          >> >> > >
>          >> >> > >
>          >> >> > >
>          >> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM,
>         Charles Chen
>          >> >> > >                     <ccy@google.com
>         <ma...@google.com> <mailto:ccy@google.com
>         <ma...@google.com>>> wrote:
>          >> >> > >
>          >> >> > >                         For context, I made comments on
>          >> >> > > https://github.com/apache/beam/pull/6600 noting
>          >> >> > >                         that the changes being made
>         were not good for
>          >> >> > >                         Beam
>         backwards-compatibility.  The change as is
>          >> >> > >                         allows users to use pipeline
>         options without
>          >> >> > >                         explicitly defining them,
>         which is not the type
>          >> >> > >                         of usage we would like to
>         encourage since we
>          >> >> > >                         prefer to be explicit
>         whenever possible.  If
>          >> >> > >                         users write pipelines with
>         this sort of pattern,
>          >> >> > >                         they will potentially
>         encounter pain when
>          >> >> > >                         upgrading to a later version
>         since this is an
>          >> >> > >                         implementation detail and not
>         an officially
>          >> >> > >                         supported pattern.  I agree
>         with the comments
>          >> >> > >                         above that this is ultimately
>         a scoping issue.
>          >> >> > >                         I would not have a problem
>         with these changes if
>          >> >> > >                         they were explicitly scoped
>         under either a
>          >> >> > >                         runner or unparsed options
>         namespace.
>          >> >> > >
>          >> >> > >                         As a second note, since the
>         2.8.0 release is
>          >> >> > >                         being cut right now, because
>         of these
>          >> >> > >                         backwards-compatibility
>         concerns, I would
>          >> >> > >                         suggest reverting these
>         changes, at least until
>          >> >> > >                         2.8.0 is cut, so we can have
>         a discussion here
>          >> >> > >                         before committing to and
>         releasing any API-level
>          >> >> > >                         changes.
>          >> >> > >
>          >> >> > >
>          >> >> > >                     +1 I would like to revert the
>         changes in order not to
>          >> >> > >                     rush this into the release. Once
>         this discussion
>          >> >> > >                     results in an agreement changes
>         can be brought back.
>          >> >> > >
>          >> >> > >
>          >> >> > >                         On Fri, Oct 12, 2018 at 9:26
>         AM Henning Rohde
>          >> >> > >                         <herohde@google.com
>         <ma...@google.com> <mailto:herohde@google.com
>         <ma...@google.com>>>
>          >> >> > >                         wrote:
>          >> >> > >
>          >> >> > >                             Agree that pipeline
>         options lack some
>          >> >> > >                             mechanism for scoping. It
>         is also not always
>          >> >> > >                             possible to distinguish
>         options meant to be
>          >> >> > >                             consumed at pipeline
>         construction time, by
>          >> >> > >                             the runner, by the SDK
>         harness, by the user
>          >> >> > >                             code or any combination
>         -- and this causes
>          >> >> > >                             confusion every now and then.
>          >> >> > >
>          >> >> > >                             For Dataflow, we have
>         been using
>          >> >> > >                             "experiments" for
>         arbitrary runner-specific
>          >> >> > >                             options. It's simply a
>         string list pipeline
>          >> >> > >                             option that all SDKs
>         support and, for Go at
>          >> >> > >                             least, is sent to
>         portable runners. Flink
>          >> >> > >                             can do the same in the
>         short term to move
>          >> >> > >                             forward.
>          >> >> > >
>          >> >> > >                             Henning
>          >> >> > >
>          >> >> > >
>          >> >> > >                             On Fri, Oct 12, 2018 at
>         8:50 AM Thomas Weise
>          >> >> > >                             <thw@apache.org
>         <ma...@apache.org> <mailto:thw@apache.org
>         <ma...@apache.org>>> wrote:
>          >> >> > >
>          >> >> > >                                 [moving to the list]
>          >> >> > >
>          >> >> > >                                 The requirement
>         driving this part of the
>          >> >> > >                                 change was to allow a
>         user to specify
>          >> >> > >                                 pipeline options that
>         a runner supports
>          >> >> > >                                 without having to
>         declare those in each
>          >> >> > >                                 language SDK.
>          >> >> > >
>          >> >> > >                                 In the specific
>         scenario, we have
>          >> >> > >                                 options that the
>         Flink runner supports
>          >> >> > >                                 (and can validate),
>         that are not
>          >> >> > >                                 enumerated in the
>         Python SDK.
>          >> >> > >
>          >> >> > >                                 I think we have a
>         bigger problem scoping
>          >> >> > >                                 pipeline options. For
>         example, the
>          >> >> > >                                 runner options are
>         dumped into the SDK
>          >> >> > >                                 worker. There is also
>         a possibility of
>          >> >> > >                                 name collisions. So I
>         think this would
>          >> >> > >                                 benefit from broader
>         feedback.
>          >> >> > >
>          >> >> > >                                 Thanks,
>          >> >> > >                                 Thomas
>          >> >> > >
>          >> >> > >
>          >> >> > >                                 ---------- Forwarded
>         message ---------
>          >> >> > >                                 From: *Charles Chen*
>          >> >> > >                               
>           <notifications@github.com <ma...@github.com>
>          >> >> > >                               
>           <mailto:notifications@github.com
>         <ma...@github.com>>>
>          >> >> > >                                 Date: Fri, Oct 12,
>         2018 at 8:36 AM
>          >> >> > >                                 Subject: Re:
>         [apache/beam] [BEAM-5442]
>          >> >> > >                                 Store duplicate
>         unknown options in a
>          >> >> > >                                 list argument (#6600)
>          >> >> > >                                 To: apache/beam
>         <beam@noreply.github.com <ma...@noreply.github.com>
>          >> >> > >                               
>           <mailto:beam@noreply.github.com <ma...@noreply.github.com>>>
>          >> >> > >                                 Cc: Thomas Weise
>         <thomas.weise@gmail.com <ma...@gmail.com>
>          >> >> > >                               
>           <mailto:thomas.weise@gmail.com <ma...@gmail.com>>>,
>          >> >> > >                                 Mention
>         <mention@noreply.github.com <ma...@noreply.github.com>
>          >> >> > >                               
>           <mailto:mention@noreply.github.com
>         <ma...@noreply.github.com>>>
>          >> >> > >
>          >> >> > >
>          >> >> > >                                 CC: @tweise
>         <https://github.com/tweise>
>          >> >> > >
>          >> >> > >                                 —
>          >> >> > >                                 You are receiving
>         this because you were
>          >> >> > >                                 mentioned.
>          >> >> > >                                 Reply to this email
>         directly, view it on
>          >> >> > >                                 GitHub
>          >> >> > >                               
>           <https://github.com/apache/beam/pull/6600#issuecomment-429367754>,
>          >> >> > >                                 or mute the thread
>          >> >> > >                               
>           <https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T>.
>          >> >> > >
>          >> >> > >
>          >> >> > >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Robert Bradshaw <ro...@google.com>.

Yes, we don't know how to parse and/or validate it.
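
To make the difficulty concrete, here is a minimal Python sketch using only the standard argparse module. It is an illustration under stated assumptions, not the Beam implementation: the split_options helper is hypothetical, --runner_option is the flag proposed in this thread rather than an existing one, and the option names blah and foo are the placeholders used in the quoted discussion below.

```python
import argparse


def split_options(argv):
    """Separate declared options, explicit runner options, and leftovers."""
    parser = argparse.ArgumentParser()
    # Declared option: the type is known, so the value can be validated
    # and later serialized (e.g. to JSON) as a number.
    parser.add_argument('--parallelism', type=int, default=1)
    # Hypothetical explicit pass-through flag from this thread; every
    # occurrence is kept as a raw "key=value" string.
    parser.add_argument('--runner_option', action='append', default=[])
    known, unknown = parser.parse_known_args(argv)

    runner_options = {}
    for item in known.runner_option:
        key, sep, value = item.partition('=')
        if not sep:
            raise ValueError('expected key=value, got %r' % item)
        if key in runner_options:
            # Err on the safe side: reject repeats instead of guessing
            # whether they should merge into a list.
            raise ValueError('duplicate runner option %r' % key)
        # The value stays a string; only the runner knows its real type.
        runner_options[key] = value

    declared = {'parallelism': known.parallelism}
    return declared, runner_options, unknown
```

Calling split_options(['--parallelism=4', '--runner_option=blah=bar', '--foo=1']) parses the declared flag as an integer and keeps blah=bar as a string pair for the runner, while the undeclared '--foo=1' comes back as an uninterpreted token. Nothing on the SDK side says whether that 1 is a number, a string, or a boolean, which is why it cannot be validated or given a faithful JSON form.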

On Tue, Oct 16, 2018 at 1:14 AM Lukasz Cwik <lc...@google.com> wrote:

> I see, is the issue that we currently are using a JSON representation for
> options when being serialized and when we get some unknown option, we don't
> know how to convert it into its JSON form?
>
> On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lc...@google.com> wrote:
>> >
>> > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >>
>> >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lc...@google.com> wrote:
>> >> >
>> >> > I agree with the sentiment for better error checking.
>> >> >
>> >> > We can try to make it such that the SDK can "fetch" the set of
>> options that the runner supports by making a call to the Job API. The API
>> could return a list of option names (descriptions for --help purposes and
>> also potentially the expected format) which would remove the worry around
>> "unknown" options. Yes I understand to be able to make the Job API call, we
>> may need to parse some options from the args parameters first and then
>> parse the unknown options after they are fetched.
>> >>
>> >> This is an interesting idea, but seems it could get quite complicated.
>> >> E.g. for delegating runners, one would first read the options to
>> >> determine which runner to fetch the options from, which would then
>> >> return a set of options that possibly depends on the values of some of
>> >> its options...
>> >>
>> >> > Alternatively, we can choose an explicit format upfront.
>> >> > To expand on the exact format for --runner_option=..., here are some
>> different ideas:
>> >> > 1) Specified multiple times, each one is an explicit flag
>> >> > --runner_option=--blah=bar --runner_option=--foo=baz1
>> --runner_option=--foo=baz2
>> >>
>> >> I'm -1 on this format. We should move away from the idea that options
>> >> == flags (as that doesn't compose well with other libraries that do
>> >> their own flags parsing). The ability to parse a set of flags into
>> >> options is just a convenience that an author may (or may not) choose
>> >> to use (e.g. when running pipelines in a long-lived process like a
>> >> service or a notebook, the command line flags are almost certainly not
>> >> the right interface).
>> >>
>> >> > 2) specified multiple times, we drop the explicit flag
>> >> > --runner_option=blah=bar --runner_option=foo=baz1
>> --runner_option=foo=baz2
>> >>
>> >> This or (4) is my preference.
>> >>
>> >> > 3) we use a string which the runner can choose to interpret however
>> they want (JSON/XML shown below)
>> >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>> >> >
>> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>> >>
>> >> This would make validation hard. Also, I think it makes sense for some
>> >> runner options to be "shared" ("parallelism") by convention, so letting
>> >> it be a free-form string wouldn't allow different runners to inspect
>> >> different bits.
>> >>
>> >> We should consider if we should use urns for namespacing, and
>> >> assigning semantic meaning to strings, here.
>> >>
>> >> > 4) we use a string which must be a specific format such as JSON
>> (allows the SDK to do simple validation):
>> >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>> >>
>> >> I like this in that at least some validation can be performed, and
>> >> expectations of how to format richer types. On the other hand it gets
>> >> a bit verbose, given that most (I'd imagine) options will be simple.
>> >> As with normal options,
>> >>
>> >>     --option1=value1 --option2=value2
>> >>
>> >> is shorthand for {"option1": value1, "option2": value2}.
>> >>
>> > I lean to 4 the most. With 2, you run into issues of what does
>> --runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] mean?
>> > Is it an error, a list of lists, or a concatenation? Similar issues
>> arise for map types represented via JSON object {...}
>>
>> We can err to be on the safe side unless/until an argument can be made
>> that merging is more natural. I just think this will be excessively
>> verbose to use.
>>
>> >> > I would strongly suggest that we go with the "fetch" approach, since
>> this makes the set of options discoverable and helps users find errors much
>> earlier in their pipeline.
>> >>
>> >> This seems like an advanced feature that SDKs may want to support, but
>> >> I wouldn't want to require this complexity for bootstrapping an SDK.
>> >>
>> > SDKs that are starting off wouldn't need to "fetch" options, they could
>> choose to not support runner options or they could choose to pass all
>> options through to the runner blindly. Fetching the options only provides
>> the SDK the ability to provide error checking upfront and useful error/help
>> messages.
>>
>> But how to even pass all options through blindly is exactly the
>> difficulty we're running into here.
>>
>> >> Regarding always keeping runner options separate, +1, though I'm not
>> >> sure the line is always clear.
>> >>
>> >>
>> >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >> >>
>> >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mx...@apache.org>
>> wrote:
>> >> >> >
>> >> >> > I agree that the current approach breaks the pipeline options
>> contract
>> >> >> > because "unknown" options get parsed in the same way as options
>> which
>> >> >> > have been defined by the user.
>> >> >>
>> >> >> FWIW, I think we're already breaking this "contract." Unknown
>> options
>> >> >> are silently ignored; with this change we just change how we record
>> >> >> them. It still feels a bit hacky though.
>> >> >>
>> >> >> > I'm not sure the `experiments` flag works for us. AFAIK it only
>> allows
>> >> >> > true/false flags. We want to pass all types of pipeline options
>> to the
>> >> >> > Runner.
>> >> >>
>> >> >> Experiments is an arbitrary set of strings, which can be of the form
>> >> >> "param=value" if that's useful. (Dataflow does this.) There is,
>> again,
>> >> >> no namespacing on the param names, but we could use URNs or impose
>> >> >> some other structure here.
>> >> >>
>> >> >> > How to solve this?
>> >> >> >
>> >> >> > 1) Add all options of all Runners to each SDK
>> >> >> > We added some of the FlinkRunner options to the Python SDK but
>> realized
>> >> >> > syncing is rather cumbersome in the long term. However, we want
>> the most
>> >> >> > important options to be validated on the client side.
>> >> >>
>> >> >> I don't think this is sustainable in the long run. However, thinking
>> >> >> about this, in the worst case validation happens after construction
>> >> >> but before execution (as with much of our other validation) so it
>> >> >> isn't that bad.
>> >> >>
>> >> >> > 2) Pass "unknown" options via a separate list in the Proto which
>> can
>> >> >> > only be accessed internally by the Runners. This still allows
>> passing
>> >> >> > arbitrary options but we wouldn't leak unknown options and
>> display them
>> >> >> > as top-level options.
>> >> >>
>> >> >> I think there needs to be a way for the user to communicate values
>> >> >> directly to the runner regardless of the SDK. My preference would be
>> >> >> to make this explicit, e.g. (repeated) --runner_option=..., rather
>> >> >> than scooping up all unknown flags at command line parsing time.
>> >> >> Perhaps an SDK that is aware of some runners could choose to lift
>> >> >> these as top-level options, but still pass them as runner options.
>> >> >>
>> >> >> > On 13.10.18 02:34, Charles Chen wrote:
>> >> >> > > The current release branch
>> >> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut
>> after the
>> >> >> > > revert went in.  Sent out
>> https://github.com/apache/beam/pull/6683 as a
>> >> >> > > revert of the revert.  Regarding your comment above, I can help
>> out with
>> >> >> > > the design / PR reviews for common Python code as you suggest.
>> >> >> > >
>> >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org
>> >> >> > > <ma...@apache.org>> wrote:
>> >> >> > >
>> >> >> > >     Thanks, will tag you and looking forward to feedback so we
>> can
>> >> >> > >     ensure that changes work for everyone.
>> >> >> > >
>> >> >> > >     Looking at the PR, I see agreement from Max to revert the
>> change on
>> >> >> > >     the release branch, but not in master. Would you mind
>> restoring it
>> >> >> > >     in master?
>> >> >> > >
>> >> >> > >     Thanks
>> >> >> > >
>> >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <
>> altay@google.com
>> >> >> > >     <ma...@google.com>> wrote:
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <
>> ccy@google.com
>> >> >> > >         <ma...@google.com>> wrote:
>> >> >> > >
>> >> >> > >             What I mean is that a user may find that it works
>> for them
>> >> >> > >             to pass "--myarg blah" and access it as
>> "options.myarg"
>> >> >> > >             without explicitly defining a "my_arg" flag due to
>> the added
>> >> >> > >             logic.  This is not the intended behavior and we
>> may want to
>> >> >> > >             change this implementation detail in the future.
>> However,
>> >> >> > >             having this logic in a released version makes it
>> hard to
>> >> >> > >             change this behavior since users may erroneously
>> depend on
>> >> >> > >             this undocumented behavior.  Instead, we should
>> namespace /
>> >> >> > >             scope this so that it is obvious that this is meant
>> for
>> >> >> > >             runner (and not Beam user) consumption.
>> >> >> > >
>> >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
>> >> >> > >             <thw@apache.org <ma...@apache.org>> wrote:
>> >> >> > >
>> >> >> > >                 Can you please elaborate more what practical
>> problems
>> >> >> > >                 this introduces for users?
>> >> >> > >
>> >> >> > >                 I can see that this change allows a user to
>> specify a
>> >> >> > >                 runner specific option, which in the future may
>> change
>> >> >> > >                 because we decide to scope differently. If this
>> only
>> >> >> > >                 affects users of the portable Flink runner
>> (like us),
>> >> >> > >                 then no need to revert, because at this early
>> stage we
>> >> >> > >                 prefer something that works over being blocked.
>> >> >> > >
>> >> >> > >                 It would also be really great if some of the
>> core Python
>> >> >> > >                 SDK developers could help out with the design
>> aspects
>> >> >> > >                 and PR reviews of changes that affect common
>> Python
>> >> >> > >                 code. Anyone who specifically wants to be
>> tagged on
>> >> >> > >                 relevant JIRAs and PRs?
>> >> >> > >
>> >> >> > >
>> >> >> > >         I would be happy to be tagged, and I can also help with
>> >> >> > >         including other relevant folks whenever possible. In
>> general I
>> >> >> > >         think Robert, Charles, myself are good candidates.
>> >> >> > >
>> >> >> > >
>> >> >> > >                 Thanks
>> >> >> > >
>> >> >> > >
>> >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
>> >> >> > >                 <altay@google.com <ma...@google.com>>
>> wrote:
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles
>> Chen
>> >> >> > >                     <ccy@google.com <ma...@google.com>>
>> wrote:
>> >> >> > >
>> >> >> > >                         For context, I made comments on
>> >> >> > >
>> https://github.com/apache/beam/pull/6600 noting
>> >> >> > >                         that the changes being made were not
>> good for
>> >> >> > >                         Beam backwards-compatibility.  The
>> change as is
>> >> >> > >                         allows users to use pipeline options
>> without
>> >> >> > >                         explicitly defining them, which is not
>> the type
>> >> >> > >                         of usage we would like to encourage
>> since we
>> >> >> > >                         prefer to be explicit whenever
>> possible.  If
>> >> >> > >                         users write pipelines with this sort of
>> pattern,
>> >> >> > >                         they will potentially encounter pain
>> when
>> >> >> > >                         upgrading to a later version since this
>> is an
>> >> >> > >                         implementation detail and not an
>> officially
>> >> >> > >                         supported pattern.  I agree with the
>> comments
>> >> >> > >                         above that this is ultimately a scoping
>> issue.
>> >> >> > >                         I would not have a problem with these
>> changes if
>> >> >> > >                         they were explicitly scoped under
>> either a
>> >> >> > >                         runner or unparsed options namespace.
>> >> >> > >
>> >> >> > >                         As a second note, since the 2.8.0
>> release is
>> >> >> > >                         being cut right now, because of these
>> >> >> > >                         backwards-compatibility concerns, I
>> would
>> >> >> > >                         suggest reverting these changes, at
>> least until
>> >> >> > >                         2.8.0 is cut, so we can have a
>> discussion here
>> >> >> > >                         before committing to and releasing any
>> API-level
>> >> >> > >                         changes.
>> >> >> > >
>> >> >> > >
>> >> >> > >                     +1 I would like to revert the changes in
>> order not to
>> >> >> > >                     rush this into the release. Once this
>> discussion
>> >> >> > >                     results in an agreement changes can be
>> brought back.
>> >> >> > >
>> >> >> > >
>> >> >> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning
>> Rohde
>> >> >> > >                         <herohde@google.com <mailto:
>> herohde@google.com>>
>> >> >> > >                         wrote:
>> >> >> > >
>> >> >> > >                             Agree that pipeline options lack
>> some
>> >> >> > >                             mechanism for scoping. It is also
>> not always
>> >> >> > >                             possible to distinguish options meant
>> to be
>> >> >> > >                             consumed at pipeline construction
>> time, by
>> >> >> > >                             the runner, by the SDK harness, by
>> the user
>> >> >> > >                             code or any combination -- and this
>> causes
>> >> >> > >                             confusion every now and then.
>> >> >> > >
>> >> >> > >                             For Dataflow, we have been using
>> >> >> > >                             "experiments" for arbitrary
>> runner-specific
>> >> >> > >                             options. It's simply a string list
>> pipeline
>> >> >> > >                             option that all SDKs support and,
>> for Go at
>> >> >> > >                             least, is sent to portable runners.
>> Flink
>> >> >> > >                             can do the same in the short term
>> to move
>> >> >> > >                             forward.
>> >> >> > >
>> >> >> > >                             Henning
>> >> >> > >
>> >> >> > >
>> >> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM
>> Thomas Weise
>> >> >> > >                             <thw@apache.org <mailto:
>> thw@apache.org>> wrote:
>> >> >> > >
>> >> >> > >                                 [moving to the list]
>> >> >> > >
>> >> >> > >                                 The requirement driving this
>> part of the
>> >> >> > >                                 change was to allow a user to
>> specify
>> >> >> > >                                 pipeline options that a runner
>> supports
>> >> >> > >                                 without having to declare those
>> in each
>> >> >> > >                                 language SDK.
>> >> >> > >
>> >> >> > >                                 In the specific scenario, we
>> have
>> >> >> > >                                 options that the Flink runner
>> supports
>> >> >> > >                                 (and can validate), that are not
>> >> >> > >                                 enumerated in the Python SDK.
>> >> >> > >
>> >> >> > >                                 I think we have a bigger
>> problem scoping
>> >> >> > >                                 pipeline options. For example,
>> the
>> >> >> > >                                 runner options are dumped into
>> the SDK
>> >> >> > >                                 worker. There is also a
>> possibility of
>> >> >> > >                                 name collisions. So I think
>> this would
>> >> >> > >                                 benefit from broader feedback.
>> >> >> > >
>> >> >> > >                                 Thanks,
>> >> >> > >                                 Thomas
>> >> >> > >
>> >> >> > >
>> >> >> > >                                 ---------- Forwarded message
>> ---------
>> >> >> > >                                 From: *Charles Chen*
>> >> >> > >                                 <notifications@github.com
>> >> >> > >                                 <mailto:
>> notifications@github.com>>
>> >> >> > >                                 Date: Fri, Oct 12, 2018 at 8:36
>> AM
>> >> >> > >                                 Subject: Re: [apache/beam]
>> [BEAM-5442]
>> >> >> > >                                 Store duplicate unknown options
>> in a
>> >> >> > >                                 list argument (#6600)
>> >> >> > >                                 To: apache/beam <
>> beam@noreply.github.com
>> >> >> > >                                 <mailto:beam@noreply.github.com
>> >>
>> >> >> > >                                 Cc: Thomas Weise <
>> thomas.weise@gmail.com
>> >> >> > >                                 <mailto:thomas.weise@gmail.com
>> >>,
>> >> >> > >                                 Mention <
>> mention@noreply.github.com
>> >> >> > >                                 <mailto:
>> mention@noreply.github.com>>
>> >> >> > >
>> >> >> > >
>> >> >> > >                                 CC: @tweise <
>> https://github.com/tweise>
>> >> >> > >
>> >> >> > >
>> >> >> > >
>> >> >> > >
>>
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Lukasz Cwik <lc...@google.com>.

I see. Is the issue that we currently use a JSON representation for options
when they are serialized, and when we get an unknown option we don't know
how to convert it into its JSON form?
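
To make the question concrete, here is a minimal sketch. It assumes a
hypothetical table of declared option types (DECLARED_TYPES), a hypothetical
helper (options_to_json) and a made-up undeclared option name
(checkpoint_interval); none of this is Beam's actual serialization code. A
declared option can be converted to a typed JSON value, whereas an unknown
option can only be kept as the raw string:

```python
import json

# Hypothetical: options the SDK has declared, with their types.
DECLARED_TYPES = {'parallelism': int, 'streaming': bool}


def options_to_json(raw_options):
    """Serialize a dict of raw string option values to a JSON string."""
    converted = {}
    for name, raw in raw_options.items():
        declared = DECLARED_TYPES.get(name)
        if declared is bool:
            converted[name] = raw.lower() == 'true'
        elif declared is not None:
            converted[name] = declared(raw)
        else:
            # Unknown option: no declared type, so the string is kept as-is.
            converted[name] = raw
    return json.dumps(converted, sort_keys=True)


print(options_to_json({'parallelism': '4', 'checkpoint_interval': '4'}))
```

In this sketch parallelism is emitted as the number 4, while the undeclared
checkpoint_interval stays the string "4" because nothing tells the SDK what
type the runner expects.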

On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw <ro...@google.com> wrote:

> On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lc...@google.com> wrote:
> >
> > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >>
> >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lc...@google.com> wrote:
> >> >
> >> > I agree with the sentiment for better error checking.
> >> >
> >> > We can try to make it such that the SDK can "fetch" the set of options that the runner supports by making a call to the Job API. The API could return a list of option names (with descriptions for --help purposes and also potentially the expected format), which would remove the worry around "unknown" options. Yes, I understand that to be able to make the Job API call, we may need to parse some options from the args parameters first and then parse the unknown options after they are fetched.
> >>
> >> This is an interesting idea, but seems it could get quite complicated.
> >> E.g. for delegating runners, one would first read the options to
> >> determine which runner to fetch the options from, which would then
> >> return a set of options that possibly depends on the values of some of
> >> its options...
> >>
> >> > Alternatively, we can choose an explicit format upfront.
> >> > To expand on the exact format for --runner_option=..., here are some different ideas:
> >> > 1) Specified multiple times, each one is an explicit flag
> >> > --runner_option=--blah=bar --runner_option=--foo=baz1 --runner_option=--foo=baz2
> >>
> >> I'm -1 on this format. We should move away from the idea that options
> >> == flags (as that doesn't compose well with other libraries that do
> >> their own flags parsing). The ability to parse a set of flags into
> >> options is just a convenience that an author may (or may not) choose
> >> to use (e.g. when running pipelines in a long-lived process like a
> >> service or a notebook, the command line flags are almost certainly not
> >> the right interface).
> >>
> >> > 2) specified multiple times, we drop the explicit flag
> >> > --runner_option=blah=bar --runner_option=foo=baz1 --runner_option=foo=baz2
> >>
> >> This or (4) is my preference.
> >>
> >> > 3) we use a string which the runner can choose to interpret however they want (JSON/XML shown below)
> >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
> >> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
> >>
> >> This would make validation hard. Also, I think it makes sense for some
> >> runner options to be "shared" ("parallelism") by convention, so letting
> >> it be a free-form string wouldn't allow different runners to inspect
> >> different bits.
> >>
> >> We should consider if we should use urns for namespacing, and
> >> assigning semantic meaning to strings, here.
> >>
> >> > 4) we use a string which must be a specific format such as JSON (allows the SDK to do simple validation):
> >> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
> >>
> >> I like this in that at least some validation can be performed, and
> >> expectations of how to format richer types. On the other hand it gets
> >> a bit verbose, given that most (I'd imagine) options will be simple.
> >> As with normal options,
> >>
> >>     --option1=value1 --option2=value2
> >>
> >> is shorthand for {"option1": value1, "option2": value2}.
> >>
> > I lean toward 4 the most. With 2, you run into the question of what --runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] means:
> > is it an error, a list of lists, or a concatenation? Similar issues arise for map types represented via a JSON object {...}
>
> We can err to be on the safe side unless/until an argument can be made
> that merging is more natural. I just think this will be excessively
> verbose to use.
>
> >> > I would strongly suggest that we go with the "fetch" approach, since this makes the set of options discoverable and helps users find errors much earlier in their pipeline.
> >>
> >> This seems like an advanced feature that SDKs may want to support, but
> >> I wouldn't want to require this complexity for bootstrapping an SDK.
> >>
> > SDKs that are starting off wouldn't need to "fetch" options; they could choose to not support runner options or they could choose to pass all options through to the runner blindly. Fetching the options only gives the SDK the ability to provide error checking upfront and useful error/help messages.
>
> But how to even pass all options through blindly is exactly the
> difficulty we're running into here.
>
> >> Regarding always keeping runner options separate, +1, though I'm not
> >> sure the line is always clear.
> >>
> >>
> >> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >> >>
> >> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mx...@apache.org>
> wrote:
> >> >> >
> >> >> > I agree that the current approach breaks the pipeline options
> contract
> >> >> > because "unknown" options get parsed in the same way as options
> which
> >> >> > have been defined by the user.
> >> >>
> >> >> FWIW, I think we're already breaking this "contract." Unknown options
> >> >> are silently ignored; with this change we just change how we record
> >> >> them. It still feels a bit hacky though.
> >> >>
> >> >> > I'm not sure the `experiments` flag works for us. AFAIK it only
> allows
> >> >> > true/false flags. We want to pass all types of pipeline options to
> the
> >> >> > Runner.
> >> >>
> >> >> Experiments is an arbitrary set of strings, which can be of the form
> >> >> "param=value" if that's useful. (Dataflow does this.) There is,
> again,
> >> >> no namespacing on the param names, but we could use urns or impose
> >> >> some other structure here.
> >> >>
> >> >> > How to solve this?
> >> >> >
> >> >> > 1) Add all options of all Runners to each SDK
> >> >> > We added some of the FlinkRunner options to the Python SDK but
> realized
> >> >> > syncing is rather cumbersome in the long term. However, we want
> the most
> >> >> > important options to be validated on the client side.
> >> >>
> >> >> I don't think this is sustainable in the long run. However, thinking
> >> >> about this, in the worst case validation happens after construction
> >> >> but before execution (as with much of our other validation) so it
> >> >> isn't that bad.
> >> >>
> >> >> > 2) Pass "unknown" options via a separate list in the Proto which
> can
> >> >> > only be accessed internally by the Runners. This still allows
> passing
> >> >> > arbitrary options but we wouldn't leak unknown options and display
> them
> >> >> > as top-level options.
> >> >>
> >> >> I think there needs to be a way for the user to communicate values
> >> >> directly to the runner regardless of the SDK. My preference would be
> >> >> to make this explicit, e.g. (repeated) --runner_option=..., rather
> >> >> than scooping up all unknown flags at command line parsing time.
> >> >> Perhaps an SDK that is aware of some runners could choose to lift
> >> >> these as top-level options, but still pass them as runner options.
> >> >>
> >> >> > On 13.10.18 02:34, Charles Chen wrote:
> >> >> > > The current release branch
> >> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut
> after the
> >> >> > > revert went in.  Sent out
> https://github.com/apache/beam/pull/6683 as a
> >> >> > > revert of the revert.  Regarding your comment above, I can help
> out with
> >> >> > > the design / PR reviews for common Python code as you suggest.
> >> >> > >
> >> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org
> >> >> > > <ma...@apache.org>> wrote:
> >> >> > >
> >> >> > >     Thanks, will tag you and looking forward to feedback so we
> can
> >> >> > >     ensure that changes work for everyone.
> >> >> > >
> >> >> > >     Looking at the PR, I see agreement from Max to revert the change on
> >> >> > >     the release branch, but not in master. Would you mind restoring it
> >> >> > >     in master?
> >> >> > >
> >> >> > >     Thanks
> >> >> > >
> >> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <
> altay@google.com
> >> >> > >     <ma...@google.com>> wrote:
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <
> ccy@google.com
> >> >> > >         <ma...@google.com>> wrote:
> >> >> > >
> >> >> > >             What I mean is that a user may find that it works
> for them
> >> >> > >             to pass "--myarg blah" and access it as
> "options.myarg"
> >> >> > >             without explicitly defining a "my_arg" flag due to
> the added
> >> >> > >             logic.  This is not the intended behavior and we may
> want to
> >> >> > >             change this implementation detail in the future.
> However,
> >> >> > >             having this logic in a released version makes it
> hard to
> >> >> > >             change this behavior since users may erroneously
> depend on
> >> >> > >             this undocumented behavior.  Instead, we should
> namespace /
> >> >> > >             scope this so that it is obvious that this is meant
> for
> >> >> > >             runner (and not Beam user) consumption.
> >> >> > >
> >> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
> >> >> > >             <thw@apache.org <ma...@apache.org>> wrote:
> >> >> > >
> >> >> > >                 Can you please elaborate more what practical
> problems
> >> >> > >                 this introduces for users?
> >> >> > >
> >> >> > >                 I can see that this change allows a user to
> specify a
> >> >> > >                 runner specific option, which in the future may
> change
> >> >> > >                 because we decide to scope differently. If this
> only
> >> >> > >                 affects users of the portable Flink runner (like
> us),
> >> >> > >                 then no need to revert, because at this early
> stage we
> >> >> > >                 prefer something that works over being blocked.
> >> >> > >
> >> >> > >                 It would also be really great if some of the
> core Python
> >> >> > >                 SDK developers could help out with the design
> aspects
> >> >> > >                 and PR reviews of changes that affect common
> Python
> >> >> > >                 code. Anyone who specifically wants to be tagged
> on
> >> >> > >                 relevant JIRAs and PRs?
> >> >> > >
> >> >> > >
> >> >> > >         I would be happy to be tagged, and I can also help with
> >> >> > >         including other relevant folks whenever possible. In
> general I
> >> >> > >         think Robert, Charles, myself are good candidates.
> >> >> > >
> >> >> > >
> >> >> > >                 Thanks
> >> >> > >
> >> >> > >
> >> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
> >> >> > >                 <altay@google.com <ma...@google.com>>
> wrote:
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles
> Chen
> >> >> > >                     <ccy@google.com <ma...@google.com>>
> wrote:
> >> >> > >
> >> >> > >                         For context, I made comments on
> >> >> > >                         https://github.com/apache/beam/pull/6600
> noting
> >> >> > >                         that the changes being made were not
> good for
> >> >> > >                         Beam backwards-compatibility.  The
> change as is
> >> >> > >                         allows users to use pipeline options
> without
> >> >> > >                         explicitly defining them, which is not
> the type
> >> >> > >                         of usage we would like to encourage
> since we
> >> >> > >                         prefer to be explicit whenever
> possible.  If
> >> >> > >                         users write pipelines with this sort of
> pattern,
> >> >> > >                         they will potentially encounter pain when
> >> >> > >                         upgrading to a later version since this
> is an
> >> >> > >                         implementation detail and not an
> officially
> >> >> > >                         supported pattern.  I agree with the
> comments
> >> >> > >                         above that this is ultimately a scoping
> issue.
> >> >> > >                         I would not have a problem with these
> changes if
> >> >> > >                         they were explicitly scoped under either
> a
> >> >> > >                         runner or unparsed options namespace.
> >> >> > >
> >> >> > >                         As a second note, since the 2.8.0
> release is
> >> >> > >                         being cut right now, because of these
> >> >> > >                         backwards-compatibility concerns, I would
> >> >> > >                         suggest reverting these changes, at
> least until
> >> >> > >                         2.8.0 is cut, so we can have a
> discussion here
> >> >> > >                         before committing to and releasing any
> API-level
> >> >> > >                         changes.
> >> >> > >
> >> >> > >
> >> >> > >                     +1 I would like to revert the changes in order
> >> >> > >                     not to rush this into the release. Once this
> >> >> > >                     discussion results in an agreement, the changes
> >> >> > >                     can be brought back.
> >> >> > >
> >> >> > >
> >> >> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning
> Rohde
> >> >> > >                         <herohde@google.com <mailto:
> herohde@google.com>>
> >> >> > >                         wrote:
> >> >> > >
> >> >> > >                             Agree that pipeline options lack some
> >> >> > >                             mechanism for scoping. It is also not
> >> >> > >                             always possible to distinguish options
> >> >> > >                             meant to be consumed at pipeline
> >> >> > >                             construction time, by the runner, by the
> >> >> > >                             SDK harness, by the user code or any
> >> >> > >                             combination -- and this causes confusion
> >> >> > >                             every now and then.
> >> >> > >
> >> >> > >                             For Dataflow, we have been using
> >> >> > >                             "experiments" for arbitrary
> runner-specific
> >> >> > >                             options. It's simply a string list
> pipeline
> >> >> > >                             option that all SDKs support and,
> for Go at
> >> >> > >                             least, is sent to portable runners.
> Flink
> >> >> > >                             can do the same in the short term to
> move
> >> >> > >                             forward.
> >> >> > >
> >> >> > >                             Henning
> >> >> > >
> >> >> > >
> >> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM
> Thomas Weise
> >> >> > >                             <thw@apache.org <mailto:
> thw@apache.org>> wrote:
> >> >> > >
> >> >> > >                                 [moving to the list]
> >> >> > >
> >> >> > >                                 The requirement driving this
> part of the
> >> >> > >                                 change was to allow a user to
> specify
> >> >> > >                                 pipeline options that a runner
> supports
> >> >> > >                                 without having to declare those
> in each
> >> >> > >                                 language SDK.
> >> >> > >
> >> >> > >                                 In the specific scenario, we have
> >> >> > >                                 options that the Flink runner
> supports
> >> >> > >                                 (and can validate), that are not
> >> >> > >                                 enumerated in the Python SDK.
> >> >> > >
> >> >> > >                                 I think we have a bigger problem
> scoping
> >> >> > >                                 pipeline options. For example,
> the
> >> >> > >                                 runner options are dumped into
> the SDK
> >> >> > >                                 worker. There is also a
> possibility of
> >> >> > >                                 name collisions. So I think this
> would
> >> >> > >                                 benefit from broader feedback.
> >> >> > >
> >> >> > >                                 Thanks,
> >> >> > >                                 Thomas
> >> >> > >
> >> >> > >
> >> >> > >                                 ---------- Forwarded message
> ---------
> >> >> > >                                 From: *Charles Chen*
> >> >> > >                                 <notifications@github.com
> >> >> > >                                 <mailto:notifications@github.com
> >>
> >> >> > >                                 Date: Fri, Oct 12, 2018 at 8:36
> AM
> >> >> > >                                 Subject: Re: [apache/beam]
> [BEAM-5442]
> >> >> > >                                 Store duplicate unknown options
> in a
> >> >> > >                                 list argument (#6600)
> >> >> > >                                 To: apache/beam <
> beam@noreply.github.com
> >> >> > >                                 <mailto:beam@noreply.github.com
> >>
> >> >> > >                                 Cc: Thomas Weise <
> thomas.weise@gmail.com
> >> >> > >                                 <mailto:thomas.weise@gmail.com
> >>,
> >> >> > >                                 Mention <
> mention@noreply.github.com
> >> >> > >                                 <mailto:
> mention@noreply.github.com>>
> >> >> > >
> >> >> > >
> >> >> > >                                 CC: @tweise <
> https://github.com/tweise>
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Robert Bradshaw <ro...@google.com>.

On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik <lc...@google.com> wrote:
>
> On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <ro...@google.com> wrote:
>>
>> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lc...@google.com> wrote:
>> >
>> > I agree with the sentiment for better error checking.
>> >
>> > We can try to make it such that the SDK can "fetch" the set of options that the runner supports by making a call to the Job API. The API could return a list of option names (with descriptions for --help purposes and also potentially the expected format), which would remove the worry around "unknown" options. Yes, I understand that to be able to make the Job API call, we may need to parse some options from the args parameters first and then parse the unknown options after they are fetched.
>>
>> This is an interesting idea, but seems it could get quite complicated.
>> E.g. for delegating runners, one would first read the options to
>> determine which runner to fetch the options from, which would then
>> return a set of options that possibly depends on the values of some of
>> its options...
>>
>> > Alternatively, we can choose an explicit format upfront.
>> > To expand on the exact format for --runner_option=..., here are some different ideas:
>> > 1) Specified multiple times, each one is an explicit flag
>> > --runner_option=--blah=bar --runner_option=--foo=baz1 --runner_option=--foo=baz2
>>
>> I'm -1 on this format. We should move away from the idea that options
>> == flags (as that doesn't compose well with other libraries that do
>> their own flags parsing). The ability to parse a set of flags into
>> options is just a convenience that an author may (or may not) choose
>> to use (e.g. when running pipelines in a long-lived process like a
>> service or a notebook, the command line flags are almost certainly not
>> the right interface).
>>
>> > 2) specified multiple times, we drop the explicit flag
>> > --runner_option=blah=bar --runner_option=foo=baz1 --runner_option=foo=baz2
>>
>> This or (4) is my preference.
>>
>> > 3) we use a string which the runner can choose to interpret however they want (JSON/XML shown below)
>> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>>
>> This would make validation hard. Also, I think it makes sense for some
>> runner options to be "shared" ("parallelism") by convention, so letting
>> it be a free-form string wouldn't allow different runners to inspect
>> different bits.
>>
>> We should consider if we should use urns for namespacing, and
>> assigning semantic meaning to strings, here.
>>
>> > 4) we use a string which must be a specific format such as JSON (allows the SDK to do simple validation):
>> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>>
>> I like this in that at least some validation can be performed, and
>> expectations of how to format richer types. On the other hand it gets
>> a bit verbose, given that most (I'd imagine) options will be simple.
>> As with normal options,
>>
>>     --option1=value1 --option2=value2
>>
>> is shorthand for {"option1": value1, "option2": value2}.
>>
> I lean toward 4 the most. With 2, you run into the question of what --runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] means:
> is it an error, a list of lists, or a concatenation? Similar issues arise for map types represented via a JSON object {...}

We can make it an error, to be on the safe side, unless/until an argument
can be made that merging is more natural. I just think (4) will be
excessively verbose to use.
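
As a rough sketch of what (2) with that conservative rule could look like
(parse_runner_options is a hypothetical helper, not an existing Beam API, and
the policy that repeated scalar values collect into a list is my assumption):

```python
import argparse
import json


def parse_runner_options(argv):
    """Collect repeated --runner_option=key=value flags into a dict.

    Hypothetical helper for illustration; not an existing Beam API.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--runner_option', action='append', default=[])
    known, _ = parser.parse_known_args(argv)

    collected = {}
    for item in known.runner_option:
        key, sep, raw = item.partition('=')
        if not key or not sep:
            raise ValueError('expected key=value, got %r' % item)
        try:
            value = json.loads(raw)  # e.g. 4, true, ["a", "b"]
        except ValueError:
            value = raw  # not valid JSON: treat as a plain string
        collected.setdefault(key, []).append(value)

    result = {}
    for key, values in collected.items():
        if len(values) == 1:
            result[key] = values[0]
        elif any(isinstance(v, (list, dict)) for v in values):
            # Assumed policy: refuse to guess between merging and nesting.
            raise ValueError('option %r repeated with structured values' % key)
        else:
            result[key] = values  # repeated scalars become a list
    return result
```

With this, --runner_option=foo=baz1 --runner_option=foo=baz2 yields a list for
foo, while repeating foo with two JSON lists is rejected instead of being
silently concatenated or nested.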

>> > I would strongly suggest that we go with the "fetch" approach, since this makes the set of options discoverable and helps users find errors much earlier in their pipeline.
>>
>> This seems like an advanced feature that SDKs may want to support, but
>> I wouldn't want to require this complexity for bootstrapping an SDK.
>>
> SDKs that are starting off wouldn't need to "fetch" options; they could choose to not support runner options or they could choose to pass all options through to the runner blindly. Fetching the options only gives the SDK the ability to provide error checking upfront and useful error/help messages.

But how to even pass all options through blindly is exactly the
difficulty we're running into here.
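
A small sketch of why, using plain argparse rather than Beam's own option
parsing (split_known_unknown is a hypothetical helper, and --parallelism and
--verbose are just example flags the SDK has not declared):

```python
import argparse


def split_known_unknown(argv):
    """Return (known options as a dict, leftover tokens as a list)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--runner')  # the only option this sketch declares
    known, unknown = parser.parse_known_args(argv)
    return vars(known), unknown


known, unknown = split_known_unknown(
    ['--runner=FlinkRunner', '--parallelism', '4', '--verbose'])
print(known)
print(unknown)
```

The leftover tokens come back as a flat list of strings. Nothing in that list
says whether 4 is the value of --parallelism or a separate positional
argument, whether --verbose takes a value, or whether 4 should be sent to the
runner as a number or a string.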

>> Regarding always keeping runner options separate, +1, though I'm not
>> sure the line is always clear.
>>
>>
>> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <ro...@google.com> wrote:
>> >>
>> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mx...@apache.org> wrote:
>> >> >
>> >> > I agree that the current approach breaks the pipeline options contract
>> >> > because "unknown" options get parsed in the same way as options which
>> >> > have been defined by the user.
>> >>
>> >> FWIW, I think we're already breaking this "contract." Unknown options
>> >> are silently ignored; with this change we just change how we record
>> >> them. It still feels a bit hacky though.
>> >>
>> >> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
>> >> > true/false flags. We want to pass all types of pipeline options to the
>> >> > Runner.
>> >>
>> >> Experiments is an arbitrary set of strings, which can be of the form
>> >> "param=value" if that's useful. (Dataflow does this.) There is, again,
>> >> no namespacing on the param names, but we could use urns or impose
>> >> some other structure here.
>> >>
>> >> > How to solve this?
>> >> >
>> >> > 1) Add all options of all Runners to each SDK
>> >> > We added some of the FlinkRunner options to the Python SDK but realized
>> >> > syncing is rather cumbersome in the long term. However, we want the most
>> >> > important options to be validated on the client side.
>> >>
>> >> I don't think this is sustainable in the long run. However, thinking
>> >> about this, in the worst case validation happens after construction
>> >> but before execution (as with much of our other validation) so it
>> >> isn't that bad.
>> >>
>> >> > 2) Pass "unknown" options via a separate list in the Proto which can
>> >> > only be accessed internally by the Runners. This still allows passing
>> >> > arbitrary options but we wouldn't leak unknown options and display them
>> >> > as top-level options.
>> >>
>> >> I think there needs to be a way for the user to communicate values
>> >> directly to the runner regardless of the SDK. My preference would be
>> >> to make this explicit, e.g. (repeated) --runner_option=..., rather
>> >> than scooping up all unknown flags at command line parsing time.
>> >> Perhaps an SDK that is aware of some runners could choose to lift
>> >> these as top-level options, but still pass them as runner options.
>> >>
>> >> > On 13.10.18 02:34, Charles Chen wrote:
>> >> > > The current release branch
>> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after the
>> >> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a
>> >> > > revert of the revert.  Regarding your comment above, I can help out with
>> >> > > the design / PR reviews for common Python code as you suggest.
>> >> > >
>> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org
>> >> > > <ma...@apache.org>> wrote:
>> >> > >
>> >> > >     Thanks, will tag you and looking forward to feedback so we can
>> >> > >     ensure that changes work for everyone.
>> >> > >
>> >> > >     Looking at the PR, I see agreement from Max to revert the change on
>> >> > >     the release branch, but not in master. Would you mind restoring it
>> >> > >     in master?
>> >> > >
>> >> > >     Thanks
>> >> > >
>> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com
>> >> > >     <ma...@google.com>> wrote:
>> >> > >
>> >> > >
>> >> > >
>> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@google.com
>> >> > >         <ma...@google.com>> wrote:
>> >> > >
>> >> > >             What I mean is that a user may find that it works for them
>> >> > >             to pass "--myarg blah" and access it as "options.myarg"
>> >> > >             without explicitly defining a "myarg" flag due to the added
>> >> > >             logic.  This is not the intended behavior and we may want to
>> >> > >             change this implementation detail in the future.  However,
>> >> > >             having this logic in a released version makes it hard to
>> >> > >             change this behavior since users may erroneously depend on
>> >> > >             this undocumented behavior.  Instead, we should namespace /
>> >> > >             scope this so that it is obvious that this is meant for
>> >> > >             runner (and not Beam user) consumption.
>> >> > >
>> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
>> >> > >             <thw@apache.org <ma...@apache.org>> wrote:
>> >> > >
>> >> > >                 Can you please elaborate on what practical problems
>> >> > >                 this introduces for users?
>> >> > >
>> >> > >                 I can see that this change allows a user to specify a
>> >> > >                 runner specific option, which in the future may change
>> >> > >                 because we decide to scope differently. If this only
>> >> > >                 affects users of the portable Flink runner (like us),
>> >> > >                 then no need to revert, because at this early stage we
>> >> > >                 prefer something that works over being blocked.
>> >> > >
>> >> > >                 It would also be really great if some of the core Python
>> >> > >                 SDK developers could help out with the design aspects
>> >> > >                 and PR reviews of changes that affect common Python
>> >> > >                 code. Anyone who specifically wants to be tagged on
>> >> > >                 relevant JIRAs and PRs?
>> >> > >
>> >> > >
>> >> > >         I would be happy to be tagged, and I can also help with
>> >> > >         including other relevant folks whenever possible. In general I
>> >> > >         think Robert, Charles, and I are good candidates.
>> >> > >
>> >> > >
>> >> > >                 Thanks
>> >> > >
>> >> > >
>> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
>> >> > >                 <altay@google.com <ma...@google.com>> wrote:
>> >> > >
>> >> > >
>> >> > >
>> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
>> >> > >                     <ccy@google.com <ma...@google.com>> wrote:
>> >> > >
>> >> > >                         For context, I made comments on
>> >> > >                         https://github.com/apache/beam/pull/6600 noting
>> >> > >                         that the changes being made were not good for
>> >> > >                         Beam backwards-compatibility.  The change as is
>> >> > >                         allows users to use pipeline options without
>> >> > >                         explicitly defining them, which is not the type
>> >> > >                         of usage we would like to encourage since we
>> >> > >                         prefer to be explicit whenever possible.  If
>> >> > >                         users write pipelines with this sort of pattern,
>> >> > >                         they will potentially encounter pain when
>> >> > >                         upgrading to a later version since this is an
>> >> > >                         implementation detail and not an officially
>> >> > >                         supported pattern.  I agree with the comments
>> >> > >                         above that this is ultimately a scoping issue.
>> >> > >                         I would not have a problem with these changes if
>> >> > >                         they were explicitly scoped under either a
>> >> > >                         runner or unparsed options namespace.
>> >> > >
>> >> > >                         As a second note, since the 2.8.0 release is
>> >> > >                         being cut right now, because of these
>> >> > >                         backwards-compatibility concerns, I would
>> >> > >                         suggest reverting these changes, at least until
>> >> > >                         2.8.0 is cut, so we can have a discussion here
>> >> > >                         before committing to and releasing any API-level
>> >> > >                         changes.
>> >> > >
>> >> > >
>> >> > >                     +1 I would like to revert the changes in order not
>> >> > >                     to rush this into the release. Once this discussion
>> >> > >                     results in an agreement, changes can be brought back.
>> >> > >
>> >> > >
>> >> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
>> >> > >                         <herohde@google.com <ma...@google.com>>
>> >> > >                         wrote:
>> >> > >
>> >> > >                             Agree that pipeline options lack some
>> >> > >                             mechanism for scoping. It is also not always
>> >> > >                             possible to distinguish options meant to be
>> >> > >                             consumed at pipeline construction time, by
>> >> > >                             the runner, by the SDK harness, by the user
>> >> > >                             code or any combination -- and this causes
>> >> > >                             confusion every now and then.
>> >> > >
>> >> > >                             For Dataflow, we have been using
>> >> > >                             "experiments" for arbitrary runner-specific
>> >> > >                             options. It's simply a string list pipeline
>> >> > >                             option that all SDKs support and, for Go at
>> >> > >                             least, is sent to portable runners. Flink
>> >> > >                             can do the same in the short term to move
>> >> > >                             forward.
>> >> > >
>> >> > >                             Henning
>> >> > >
>> >> > >
>> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
>> >> > >                             <thw@apache.org <ma...@apache.org>> wrote:
>> >> > >
>> >> > >                                 [moving to the list]
>> >> > >
>> >> > >                                 The requirement driving this part of the
>> >> > >                                 change was to allow a user to specify
>> >> > >                                 pipeline options that a runner supports
>> >> > >                                 without having to declare those in each
>> >> > >                                 language SDK.
>> >> > >
>> >> > >                                 In the specific scenario, we have
>> >> > >                                 options that the Flink runner supports
>> >> > >                                 (and can validate), that are not
>> >> > >                                 enumerated in the Python SDK.
>> >> > >
>> >> > >                                 I think we have a bigger problem scoping
>> >> > >                                 pipeline options. For example, the
>> >> > >                                 runner options are dumped into the SDK
>> >> > >                                 worker. There is also a possibility of
>> >> > >                                 name collisions. So I think this would
>> >> > >                                 benefit from broader feedback.
>> >> > >
>> >> > >                                 Thanks,
>> >> > >                                 Thomas
>> >> > >
>> >> > >
>> >> > >                                 ---------- Forwarded message ---------
>> >> > >                                 From: *Charles Chen*
>> >> > >                                 <notifications@github.com
>> >> > >                                 <ma...@github.com>>
>> >> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
>> >> > >                                 Subject: Re: [apache/beam] [BEAM-5442]
>> >> > >                                 Store duplicate unknown options in a
>> >> > >                                 list argument (#6600)
>> >> > >                                 To: apache/beam <beam@noreply.github.com
>> >> > >                                 <ma...@noreply.github.com>>
>> >> > >                                 Cc: Thomas Weise <thomas.weise@gmail.com
>> >> > >                                 <ma...@gmail.com>>,
>> >> > >                                 Mention <mention@noreply.github.com
>> >> > >                                 <ma...@noreply.github.com>>
>> >> > >
>> >> > >
>> >> > >                                 CC: @tweise <https://github.com/tweise>
>> >> > >
>> >> > >
>> >> > >
>> >> > >

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Lukasz Cwik <lc...@google.com>.

On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw <ro...@google.com> wrote:

> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lc...@google.com> wrote:
> >
> > I agree with the sentiment for better error checking.
> >
> > We can try to make it such that the SDK can "fetch" the set of options
> > that the runner supports by making a call to the Job API. The API could
> > return a list of option names (descriptions for --help purposes and also
> > potentially the expected format) which would remove the worry around
> > "unknown" options. Yes I understand to be able to make the Job API call, we
> > may need to parse some options from the args parameters first and then
> > parse the unknown options after they are fetched.
>
> This is an interesting idea, but seems it could get quite complicated.
> E.g. for delegating runners, one would first read the options to
> determine which runner to fetch the options from, which would then
> return a set of options that possibly depends on the values of some of
> its options...
>
> > Alternatively, we can choose an explicit format upfront.
> > To expand on the exact format for --runner_option=..., here are some
> > different ideas:
> > 1) Specified multiple times, each one is an explicit flag
> > --runner_option=--blah=bar --runner_option=--foo=baz1
> > --runner_option=--foo=baz2
>
> I'm -1 on this format. We should move away from the idea that options
> == flags (as that doesn't compose well with other libraries that do
> their own flags parsing). The ability to parse a set of flags into
> options is just a convenience that an author may (or may not) choose
> to use (e.g. when running pipelines in a long-lived process like a
> service or a notebook, the command line flags are almost certainly not
> the right interface).
>
> > 2) specified multiple times, we drop the explicit flag
> > --runner_option=blah=bar --runner_option=foo=baz1
> > --runner_option=foo=baz2
>
> This or (4) is my preference.
>
> > 3) we use a string which the runner can choose to interpret however they
> > want (JSON/XML shown below)
> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
> > --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>
> This would make validation hard. Also, I think it makes sense for some
> runner options to be "shared" (e.g. "parallelism") by convention, so letting
> it be a free-form string wouldn't allow different runners to inspect
> different bits.
>
> We should consider if we should use urns for namespacing, and
> assigning semantic meaning to strings, here.
>
> > 4) we use a string which must be a specific format such as JSON (allows
> > the SDK to do simple validation):
> > --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>
> I like this in that at least some validation can be performed, and
> expectations of how to format richer types. On the other hand it gets
> a bit verbose, given that most (I'd imagine) options will be simple.
> As with normal options,
>
>     --option1=value1 --option2=value2
>
> is shorthand for {"option1": value1, "option2": value2}.
>
I lean towards (4) the most. With (2), you run into the question of what
--runner_option=foo=["a", "b"] --runner_option=foo=["c", "d"] means:
is it an error, a list of lists, or a concatenated list? Similar issues
arise for map types represented as a JSON object {...}.
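
To make the ambiguity concrete, here is a minimal Python sketch of the three
readings. The function name merge_repeated and the policy names "error",
"nest" and "concat" are hypothetical, chosen only for illustration; nothing
here is an existing Beam API or an agreed design.

```python
import json


def merge_repeated(values, policy):
    """Combine the JSON-encoded values given for one repeated option key.

    `values` holds the raw strings in command-line order; `policy` is a
    hypothetical switch selecting one of the three possible readings.
    """
    decoded = [json.loads(v) for v in values]
    if policy == 'error':
        # Reading 1: specifying the same key twice is rejected outright.
        if len(decoded) > 1:
            raise ValueError('option specified more than once')
        return decoded[0]
    if policy == 'nest':
        # Reading 2: each occurrence becomes one element (a list of lists).
        return decoded
    if policy == 'concat':
        # Reading 3: list values are flattened into one list.
        merged = []
        for item in decoded:
            merged.extend(item if isinstance(item, list) else [item])
        return merged
    raise ValueError('unknown policy: %r' % policy)
```

With a single JSON object per invocation, as in (4), this choice never comes
up because JSON already defines how lists and maps nest.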


> > I would strongly suggest that we go with the "fetch" approach, since
> > this makes the set of options discoverable and helps users find errors much
> > earlier in their pipeline.
>
> This seems like an advanced feature that SDKs may want to support, but
> I wouldn't want to require this complexity for bootstrapping an SDK.
>
SDKs that are starting off wouldn't need to "fetch" options; they could
choose not to support runner options, or they could pass all options
through to the runner blindly. Fetching the options only gives the SDK
the ability to provide upfront error checking and useful error/help
messages.
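
As a rough sketch of what that upfront checking could look like on the SDK
side: the helper validate_runner_options below is hypothetical, and the
`supported` argument merely stands in for whatever list of option names a
Job API call might return. No such call is assumed to exist today.

```python
import difflib


def validate_runner_options(options, supported):
    """Return error messages for option names missing from `supported`.

    `options` maps option names to values as given by the user;
    `supported` is the (assumed) list of names reported by the runner.
    """
    supported = sorted(supported)
    errors = []
    for name in options:
        if name in supported:
            continue
        message = 'unknown runner option %r' % name
        # Suggest the closest supported name, if any is similar enough.
        hint = difflib.get_close_matches(name, supported, n=1)
        if hint:
            message += ' (did you mean %r?)' % hint[0]
        errors.append(message)
    return errors
```

An SDK that skips the fetch would simply not call such a check and would
forward the options unchanged.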


> Regarding always keeping runner options separate, +1, though I'm not
> sure the line is always clear.
>
>
> > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <ro...@google.com> wrote:
> >>
> >> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mx...@apache.org> wrote:
> >> >
> >> > I agree that the current approach breaks the pipeline options contract
> >> > because "unknown" options get parsed in the same way as options which
> >> > have been defined by the user.
> >>
> >> FWIW, I think we're already breaking this "contract." Unknown options
> >> are silently ignored; with this change we just change how we record
> >> them. It still feels a bit hacky though.
> >>
> >> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
> >> > true/false flags. We want to pass all types of pipeline options to the
> >> > Runner.
> >>
> >> Experiments is an arbitrary set of strings, which can be of the form
> >> "param=value" if that's useful. (Dataflow does this.) There is, again,
> >> no namespacing on the param names, but we could use urns or impose
> >> some other structure here.
> >>
> >> > How to solve this?
> >> >
> >> > 1) Add all options of all Runners to each SDK
> >> > We added some of the FlinkRunner options to the Python SDK but realized
> >> > syncing is rather cumbersome in the long term. However, we want the most
> >> > important options to be validated on the client side.
> >>
> >> I don't think this is sustainable in the long run. However, thinking
> >> about this, in the worst case validation happens after construction
> >> but before execution (as with much of our other validation) so it
> >> isn't that bad.
> >>
> >> > 2) Pass "unknown" options via a separate list in the Proto which can
> >> > only be accessed internally by the Runners. This still allows passing
> >> > arbitrary options but we wouldn't leak unknown options and display them
> >> > as top-level options.
> >>
> >> I think there needs to be a way for the user to communicate values
> >> directly to the runner regardless of the SDK. My preference would be
> >> to make this explicit, e.g. (repeated) --runner_option=..., rather
> >> than scooping up all unknown flags at command line parsing time.
> >> Perhaps an SDK that is aware of some runners could choose to lift
> >> these as top-level options, but still pass them as runner options.
> >>
> >> > On 13.10.18 02:34, Charles Chen wrote:
> >> > > The current release branch
> >> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut
> after the
> >> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683
> as a
> >> > > revert of the revert.  Regarding your comment above, I can help out
> with
> >> > > the design / PR reviews for common Python code as you suggest.
> >> > >
> >> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org
> >> > > <ma...@apache.org>> wrote:
> >> > >
> >> > >     Thanks, will tag you and looking forward to feedback so we can
> >> > >     ensure that changes work for everyone.
> >> > >
> >> > >     Looking at the PR, I see agreement from Max to revert the
> change on
> >> > >     the release branch, but not in master. Would you mind to
> restore it
> >> > >     in master?
> >> > >
> >> > >     Thanks
> >> > >
> >> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com
> >> > >     <ma...@google.com>> wrote:
> >> > >
> >> > >
> >> > >
> >> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <
> ccy@google.com
> >> > >         <ma...@google.com>> wrote:
> >> > >
> >> > >             What I mean is that a user may find that it works for
> them
> >> > >             to pass "--myarg blah" and access it as "options.myarg"
> >> > >             without explicitly defining a "my_arg" flag due to the
> added
> >> > >             logic.  This is not the intended behavior and we may
> want to
> >> > >             change this implementation detail in the future.
> However,
> >> > >             having this logic in a released version makes it hard to
> >> > >             change this behavior since users may erroneously depend
> on
> >> > >             this undocumented behavior.  Instead, we should
> namespace /
> >> > >             scope this so that it is obvious that this is meant for
> >> > >             runner (and not Beam user) consumption.
> >> > >
> >> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
> >> > >             <thw@apache.org <ma...@apache.org>> wrote:
> >> > >
> >> > >                 Can you please elaborate more what practical
> problems
> >> > >                 this introduces for users?
> >> > >
> >> > >                 I can see that this change allows a user to specify
> a
> >> > >                 runner specific option, which in the future may
> change
> >> > >                 because we decide to scope differently. If this only
> >> > >                 affects users of the portable Flink runner (like
> us),
> >> > >                 then no need to revert, because at this early stage
> we
> >> > >                 prefer something that works over being blocked.
> >> > >
> >> > >                 It would also be really great if some of the core
> Python
> >> > >                 SDK developers could help out with the design
> aspects
> >> > >                 and PR reviews of changes that affect common Python
> >> > >                 code. Anyone who specifically wants to be tagged on
> >> > >                 relevant JIRAs and PRs?
> >> > >
> >> > >
> >> > >         I would be happy to be tagged, and I can also help with
> >> > >         including other relevant folks whenever possible. In
> general I
> >> > >         think Robert, Charles, myself are good candidates.
> >> > >
> >> > >
> >> > >                 Thanks
> >> > >
> >> > >
> >> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
> >> > >                 <altay@google.com <ma...@google.com>> wrote:
> >> > >
> >> > >
> >> > >
> >> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
> >> > >                     <ccy@google.com <ma...@google.com>> wrote:
> >> > >
> >> > >                         For context, I made comments on
> >> > >                         https://github.com/apache/beam/pull/6600
> noting
> >> > >                         that the changes being made were not good
> for
> >> > >                         Beam backwards-compatibility.  The change
> as is
> >> > >                         allows users to use pipeline options without
> >> > >                         explicitly defining them, which is not the
> type
> >> > >                         of usage we would like to encourage since we
> >> > >                         prefer to be explicit whenever possible.  If
> >> > >                         users write pipelines with this sort of
> pattern,
> >> > >                         they will potentially encounter pain when
> >> > >                         upgrading to a later version since this is
> an
> >> > >                         implementation detail and not an officially
> >> > >                         supported pattern.  I agree with the
> comments
> >> > >                         above that this is ultimately a scoping
> issue.
> >> > >                         I would not have a problem with these
> changes if
> >> > >                         they were explicitly scoped under either a
> >> > >                         runner or unparsed options namespace.
> >> > >
> >> > >                         As a second note, since the 2.8.0 release is
> >> > >                         being cut right now, because of these
> >> > >                         backwards-compatibility concerns, I would
> >> > >                         suggest reverting these changes, at least
> until
> >> > >                         2.8.0 is cut, so we can have a discussion
> here
> >> > >                         before committing to and releasing any
> API-level
> >> > >                         changes.
> >> > >
> >> > >
> >> > >                     +1 I would like to revert the changes in order
> not
> >> > >                     rush this into the release. Once this discussion
> >> > >                     results in an agreement changes can be brought
> back.
> >> > >
> >> > >
> >> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning
> Rohde
> >> > >                         <herohde@google.com <mailto:
> herohde@google.com>>
> >> > >                         wrote:
> >> > >
> >> > >                             Agree that pipeline options lack some
> >> > >                             mechanism for scoping. It is also not
> always
> >> > >                             possible distinguish options meant to be
> >> > >                             consumed at pipeline construction time,
> by
> >> > >                             the runner, by the SDK harness, by the
> user
> >> > >                             code or any combination -- and this
> causes
> >> > >                             confusion every now and then.
> >> > >
> >> > >                             For Dataflow, we have been using
> >> > >                             "experiments" for arbitrary
> runner-specific
> >> > >                             options. It's simply a string list
> pipeline
> >> > >                             option that all SDKs support and, for
> Go at
> >> > >                             least, is sent to portable runners.
> Flink
> >> > >                             can do the same in the short term to
> move
> >> > >                             forward.
> >> > >
> >> > >                             Henning
> >> > >
> >> > >
> >> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas
> Weise
> >> > >                             <thw@apache.org <ma...@apache.org>>
> wrote:
> >> > >
> >> > >                                 [moving to the list]
> >> > >
> >> > >                                 The requirement driving this part
> of the
> >> > >                                 change was to allow a user to
> specify
> >> > >                                 pipeline options that a runner
> supports
> >> > >                                 without having to declare those in
> each
> >> > >                                 language SDK.
> >> > >
> >> > >                                 In the specific scenario, we have
> >> > >                                 options that the Flink runner
> supports
> >> > >                                 (and can validate), that are not
> >> > >                                 enumerated in the Python SDK.
> >> > >
> >> > >                                 I think we have a bigger problem
> scoping
> >> > >                                 pipeline options. For example, the
> >> > >                                 runner options are dumped into the
> SDK
> >> > >                                 worker. There is also a possibility
> of
> >> > >                                 name collisions. So I think this
> would
> >> > >                                 benefit from broader feedback.
> >> > >
> >> > >                                 Thanks,
> >> > >                                 Thomas
> >> > >
> >> > >
> >> > >                                 ---------- Forwarded message
> ---------
> >> > >                                 From: *Charles Chen*
> >> > >                                 <notifications@github.com
> >> > >                                 <ma...@github.com>>
> >> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
> >> > >                                 Subject: Re: [apache/beam]
> [BEAM-5442]
> >> > >                                 Store duplicate unknown options in a
> >> > >                                 list argument (#6600)
> >> > >                                 To: apache/beam <
> beam@noreply.github.com
> >> > >                                 <ma...@noreply.github.com>>
> >> > >                                 Cc: Thomas Weise <
> thomas.weise@gmail.com
> >> > >                                 <ma...@gmail.com>>,
> >> > >                                 Mention <mention@noreply.github.com
> >> > >                                 <mailto:mention@noreply.github.com
> >>
> >> > >
> >> > >
> >> > >                                 CC: @tweise <
> https://github.com/tweise>
> >> > >
> >> > >
> >> > >
> >> > >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Robert Bradshaw <ro...@google.com>.

On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik <lc...@google.com> wrote:
>
> I agree with the sentiment for better error checking.
>
> We can try to make it such that the SDK can "fetch" the set of options that the runner supports by making a call to the Job API. The API could return a list of option names (descriptions for --help purposes and also potentially the expected format) which would remove the worry around "unknown" options. Yes I understand to be able to make the Job API call, we may need to parse some options from the args parameters first and then parse the unknown options after they are fetched.

This is an interesting idea, but seems it could get quite complicated.
E.g. for delegating runners, one would first read the options to
determine which runner to fetch the options from, which would then
return a set of options that possibly depends on the values of some of
its options...

> Alternatively, we can choose an explicit format upfront.
> To expand on the exact format for --runner_option=..., here are some different ideas:
> 1) Specified multiple times, each one is an explicit flag
> --runner_option=--blah=bar --runner_option=--foo=baz1 --runner_option=--foo=baz2

I'm -1 on this format. We should move away from the idea that options
== flags (as that doesn't compose well with other libraries that do
their own flags parsing). The ability to parse a set of flags into
options is just a convenience that an author may (or may not) choose
to use (e.g. when running pipelines in a long-lived process like a
service or a notebook, the command line flags are almost certainly not
the right interface).

> 2) specified multiple times, we drop the explicit flag
> --runner_option=blah=bar --runner_option=foo=baz1 --runner_option=foo=baz2

This or (4) is my preference.

> 3) we use a string which the runner can choose to interpret however they want (JSON/XML shown below)
> --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'

This would make validation hard. Also, I think it makes sense for some
runner options to be "shared" (e.g. "parallelism") by convention, so letting
it be a free-form string wouldn't allow different runners to inspect
different bits.

We should consider if we should use urns for namespacing, and
assigning semantic meaning to strings, here.

> 4) we use a string which must be a specific format such as JSON (allows the SDK to do simple validation):
> --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'

I like this in that at least some validation can be performed, and
expectations of how to format richer types. On the other hand it gets
a bit verbose, given that most (I'd imagine) options will be simple.
As with normal options,

    --option1=value1 --option2=value2

is shorthand for {"option1": value1, "option2": value2}.
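
As a rough illustration of formats (2) and (4), the sketch below collects
repeated --runner_option values into a dict using Python's argparse. The
helper name parse_runner_options and the policy of gathering repeated keys
into a list are assumptions made for this example, not an agreed design or
an existing Beam API.

```python
import argparse
import json


def parse_runner_options(argv):
    """Collect repeated --runner_option values into a dict of lists.

    Each value is either "key=value" (format 2) or a JSON object
    (format 4). Flags other than --runner_option are returned untouched.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('--runner_option', action='append', default=[])
    known, remaining = parser.parse_known_args(argv)

    options = {}
    for raw in known.runner_option:
        if raw.lstrip().startswith('{'):
            # Format 4: one JSON object carrying several options.
            pairs = list(json.loads(raw).items())
        else:
            # Format 2: a single key=value pair.
            key, sep, value = raw.partition('=')
            if not sep or not key:
                raise ValueError('expected key=value, got %r' % raw)
            pairs = [(key, value)]
        for key, value in pairs:
            values = value if isinstance(value, list) else [value]
            options.setdefault(key, []).extend(values)
    return options, remaining
```

Because only the explicit --runner_option flag is consumed, other libraries
that do their own flag parsing still see the remaining arguments unchanged.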

> I would strongly suggest that we go with the "fetch" approach, since this makes the set of options discoverable and helps users find errors much earlier in their pipeline.

This seems like an advanced feature that SDKs may want to support, but
I wouldn't want to require this complexity for bootstrapping an SDK.

Regarding always keeping runner options separate, +1, though I'm not
sure the line is always clear.


> On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <ro...@google.com> wrote:
>>
>> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mx...@apache.org> wrote:
>> >
>> > I agree that the current approach breaks the pipeline options contract
>> > because "unknown" options get parsed in the same way as options which
>> > have been defined by the user.
>>
>> FWIW, I think we're already breaking this "contract." Unknown options
>> are silently ignored; with this change we just change how we record
>> them. It still feels a bit hacky though.
>>
>> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
>> > true/false flags. We want to pass all types of pipeline options to the
>> > Runner.
>>
>> Experiments is an arbitrary set of strings, which can be of the form
>> "param=value" if that's useful. (Dataflow does this.) There is, again,
>> no namespacing on the param names, but we could use urns or impose
>> some other structure here.
>>
>> > How to solve this?
>> >
>> > 1) Add all options of all Runners to each SDK
>> > We added some of the FlinkRunner options to the Python SDK but realized
>> > syncing is rather cumbersome in the long term. However, we want the most
>> > important options to be validated on the client side.
>>
>> I don't think this is sustainable in the long run. However, thinking
>> about this, in the worst case validation happens after construction
>> but before execution (as with much of our other validation) so it
>> isn't that bad.
>>
>> > 2) Pass "unknown" options via a separate list in the Proto which can
>> > only be accessed internally by the Runners. This still allows passing
>> > arbitrary options but we wouldn't leak unknown options and display them
>> > as top-level options.
>>
>> I think there needs to be a way for the user to communicate values
>> directly to the runner regardless of the SDK. My preference would be
>> to make this explicit, e.g. (repeated) --runner_option=..., rather
>> than scooping up all unknown flags at command line parsing time.
>> Perhaps an SDK that is aware of some runners could choose to lift
>> these as top-level options, but still pass them as runner options.
>>
>> > On 13.10.18 02:34, Charles Chen wrote:
>> > > The current release branch
>> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after the
>> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a
>> > > revert of the revert.  Regarding your comment above, I can help out with
>> > > the design / PR reviews for common Python code as you suggest.
>> > >
>> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org
>> > > <ma...@apache.org>> wrote:
>> > >
>> > >     Thanks, will tag you and looking forward to feedback so we can
>> > >     ensure that changes work for everyone.
>> > >
>> > >     Looking at the PR, I see agreement from Max to revert the change on
>> > >     the release branch, but not in master. Would you mind restoring it
>> > >     in master?
>> > >
>> > >     Thanks
>> > >
>> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com
>> > >     <ma...@google.com>> wrote:
>> > >
>> > >
>> > >
>> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@google.com
>> > >         <ma...@google.com>> wrote:
>> > >
>> > >             What I mean is that a user may find that it works for them
>> > >             to pass "--myarg blah" and access it as "options.myarg"
>> > >             without explicitly defining a "my_arg" flag due to the added
>> > >             logic.  This is not the intended behavior and we may want to
>> > >             change this implementation detail in the future.  However,
>> > >             having this logic in a released version makes it hard to
>> > >             change this behavior since users may erroneously depend on
>> > >             this undocumented behavior.  Instead, we should namespace /
>> > >             scope this so that it is obvious that this is meant for
>> > >             runner (and not Beam user) consumption.
>> > >
>> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
>> > >             <thw@apache.org <ma...@apache.org>> wrote:
>> > >
>> > >                 Can you please elaborate on what practical problems
>> > >                 this introduces for users?
>> > >
>> > >                 I can see that this change allows a user to specify a
>> > >                 runner specific option, which in the future may change
>> > >                 because we decide to scope differently. If this only
>> > >                 affects users of the portable Flink runner (like us),
>> > >                 then no need to revert, because at this early stage we
>> > >                 prefer something that works over being blocked.
>> > >
>> > >                 It would also be really great if some of the core Python
>> > >                 SDK developers could help out with the design aspects
>> > >                 and PR reviews of changes that affect common Python
>> > >                 code. Anyone who specifically wants to be tagged on
>> > >                 relevant JIRAs and PRs?
>> > >
>> > >
>> > >         I would be happy to be tagged, and I can also help with
>> > >         including other relevant folks whenever possible. In general I
>> > >         think Robert, Charles, and I are good candidates.
>> > >
>> > >
>> > >                 Thanks
>> > >
>> > >
>> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
>> > >                 <altay@google.com <ma...@google.com>> wrote:
>> > >
>> > >
>> > >
>> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
>> > >                     <ccy@google.com <ma...@google.com>> wrote:
>> > >
>> > >                         For context, I made comments on
>> > >                         https://github.com/apache/beam/pull/6600 noting
>> > >                         that the changes being made were not good for
>> > >                         Beam backwards-compatibility.  The change as is
>> > >                         allows users to use pipeline options without
>> > >                         explicitly defining them, which is not the type
>> > >                         of usage we would like to encourage since we
>> > >                         prefer to be explicit whenever possible.  If
>> > >                         users write pipelines with this sort of pattern,
>> > >                         they will potentially encounter pain when
>> > >                         upgrading to a later version since this is an
>> > >                         implementation detail and not an officially
>> > >                         supported pattern.  I agree with the comments
>> > >                         above that this is ultimately a scoping issue.
>> > >                         I would not have a problem with these changes if
>> > >                         they were explicitly scoped under either a
>> > >                         runner or unparsed options namespace.
>> > >
>> > >                         As a second note, since the 2.8.0 release is
>> > >                         being cut right now, because of these
>> > >                         backwards-compatibility concerns, I would
>> > >                         suggest reverting these changes, at least until
>> > >                         2.8.0 is cut, so we can have a discussion here
>> > >                         before committing to and releasing any API-level
>> > >                         changes.
>> > >
>> > >
>> > >                     +1 I would like to revert the changes in order not
>> > >                     to rush this into the release. Once this discussion
>> > >                     results in an agreement changes can be brought back.
>> > >
>> > >
>> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
>> > >                         <herohde@google.com <ma...@google.com>>
>> > >                         wrote:
>> > >
>> > >                             Agree that pipeline options lack some
>> > >                             mechanism for scoping. It is also not always
>> > >                             possible to distinguish options meant to be
>> > >                             consumed at pipeline construction time, by
>> > >                             the runner, by the SDK harness, by the user
>> > >                             code or any combination -- and this causes
>> > >                             confusion every now and then.
>> > >
>> > >                             For Dataflow, we have been using
>> > >                             "experiments" for arbitrary runner-specific
>> > >                             options. It's simply a string list pipeline
>> > >                             option that all SDKs support and, for Go at
>> > >                             least, is sent to portable runners. Flink
>> > >                             can do the same in the short term to move
>> > >                             forward.
>> > >
>> > >                             Henning
>> > >
>> > >
>> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
>> > >                             <thw@apache.org <ma...@apache.org>> wrote:
>> > >
>> > >                                 [moving to the list]
>> > >
>> > >                                 The requirement driving this part of the
>> > >                                 change was to allow a user to specify
>> > >                                 pipeline options that a runner supports
>> > >                                 without having to declare those in each
>> > >                                 language SDK.
>> > >
>> > >                                 In the specific scenario, we have
>> > >                                 options that the Flink runner supports
>> > >                                 (and can validate), that are not
>> > >                                 enumerated in the Python SDK.
>> > >
>> > >                                 I think we have a bigger problem scoping
>> > >                                 pipeline options. For example, the
>> > >                                 runner options are dumped into the SDK
>> > >                                 worker. There is also a possibility of
>> > >                                 name collisions. So I think this would
>> > >                                 benefit from broader feedback.
>> > >
>> > >                                 Thanks,
>> > >                                 Thomas
>> > >
>> > >
>> > >                                 ---------- Forwarded message ---------
>> > >                                 From: *Charles Chen*
>> > >                                 <notifications@github.com
>> > >                                 <ma...@github.com>>
>> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
>> > >                                 Subject: Re: [apache/beam] [BEAM-5442]
>> > >                                 Store duplicate unknown options in a
>> > >                                 list argument (#6600)
>> > >                                 To: apache/beam <beam@noreply.github.com
>> > >                                 <ma...@noreply.github.com>>
>> > >                                 Cc: Thomas Weise <thomas.weise@gmail.com
>> > >                                 <ma...@gmail.com>>,
>> > >                                 Mention <mention@noreply.github.com
>> > >                                 <ma...@noreply.github.com>>
>> > >
>> > >
>> > >                                 CC: @tweise <https://github.com/tweise>
>> > >
>> > >
>> > >
>> > >

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Thomas Weise <th...@apache.org>.

Would it be better to generally separate the runner options (whether they
are unknown or not) from other pipeline options?
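
A minimal sketch of what such a separation could look like on the Python
side, assuming the repeated --runner_option=key=value form proposed earlier
in the thread. The function name split_runner_options is hypothetical and is
not part of the Beam SDK; only the standard argparse module is used.

```python
import argparse
from collections import defaultdict


def split_runner_options(argv):
    """Separate explicit runner options from the remaining pipeline args.

    Hypothetical helper for illustration only; not part of the Beam SDK.
    Returns (runner_options, remaining_args), where runner_options maps each
    key to the list of values given for it, in command-line order.
    """
    # allow_abbrev=False keeps a flag such as --runner=... from being
    # mistaken for an abbreviation of --runner_option.
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--runner_option", action="append", default=[])
    known, remaining = parser.parse_known_args(argv)

    runner_options = defaultdict(list)
    for item in known.runner_option:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % item)
        # Repeated keys are kept as a list rather than overwritten.
        runner_options[key].append(value)
    return dict(runner_options), remaining
```

With this shape nothing is scooped up implicitly: any flag that is not
spelled --runner_option stays in the remaining args for the SDK to accept or
reject as it does today.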


On Mon, Oct 15, 2018 at 10:55 AM Lukasz Cwik <lc...@google.com> wrote:

> Note that, thinking ahead to cross-language pipelines, we'll want
> something like "options" discovery as well. So reusing this concept for
> runners makes sense to me.
>
> On Mon, Oct 15, 2018 at 10:50 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> I agree with the sentiment for better error checking.
>>
>> We can try to make it such that the SDK can "fetch" the set of options
>> that the runner supports by making a call to the Job API. The API could
>> return a list of option names (descriptions for --help purposes and also
>> potentially the expected format) which would remove the worry around
>> "unknown" options. Yes, I understand that to make the Job API call, we
>> may need to parse some options from the args parameters first and then
>> parse the unknown options after they are fetched.
>>
>> Alternatively, we can choose an explicit format upfront.
>> To expand on the exact format for --runner_option=..., here are some
>> different ideas:
>> 1) Specified multiple times, each one is an explicit flag
>> --runner_option=--blah=bar --runner_option=--foo=baz1
>> --runner_option=--foo=baz2
>>
>> 2) specified multiple times, we drop the explicit flag
>> --runner_option=blah=bar --runner_option=foo=baz1
>> --runner_option=foo=baz2
>>
>> 3) we use a string which the runner can choose to interpret however they
>> want (JSON/XML shown below)
>> --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>>
>> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>>
>> 4) we use a string which must be a specific format such as JSON (allows
>> the SDK to do simple validation):
>> --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>>
>> I would strongly suggest that we go with the "fetch" approach, since this
>> makes the set of options discoverable and helps users find errors much
>> earlier in their pipeline.
>>
>>
>>
>> On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mx...@apache.org>
>>> wrote:
>>> >
>>> > I agree that the current approach breaks the pipeline options contract
>>> > because "unknown" options get parsed in the same way as options which
>>> > have been defined by the user.
>>>
>>> FWIW, I think we're already breaking this "contract." Unknown options
>>> are silently ignored; with this change we just change how we record
>>> them. It still feels a bit hacky though.
>>>
>>> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
>>> > true/false flags. We want to pass all types of pipeline options to the
>>> > Runner.
>>>
>>> Experiments is an arbitrary set of strings, which can be of the form
>>> "param=value" if that's useful. (Dataflow does this.) There is, again,
>>> no namespacing on the param names, but we could use URNs or impose
>>> some other structure here.
>>>
>>> > How to solve this?
>>> >
>>> > 1) Add all options of all Runners to each SDK
>>> > We added some of the FlinkRunner options to the Python SDK but realized
>>> > syncing is rather cumbersome in the long term. However, we want the most
>>> > important options to be validated on the client side.
>>>
>>> I don't think this is sustainable in the long run. However, thinking
>>> about this, in the worst case validation happens after construction
>>> but before execution (as with much of our other validation) so it
>>> isn't that bad.
>>>
>>> > 2) Pass "unknown" options via a separate list in the Proto which can
>>> > only be accessed internally by the Runners. This still allows passing
>>> > arbitrary options but we wouldn't leak unknown options and display them
>>> > as top-level options.
>>>
>>> I think there needs to be a way for the user to communicate values
>>> directly to the runner regardless of the SDK. My preference would be
>>> to make this explicit, e.g. (repeated) --runner_option=..., rather
>>> than scooping up all unknown flags at command line parsing time.
>>> Perhaps an SDK that is aware of some runners could choose to lift
>>> these as top-level options, but still pass them as runner options.
>>>
>>> > On 13.10.18 02:34, Charles Chen wrote:
>>> > > The current release branch
>>> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after the
>>> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a
>>> > > revert of the revert.  Regarding your comment above, I can help out with
>>> > > the design / PR reviews for common Python code as you suggest.
>>> > >
>>> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org
>>> > > <ma...@apache.org>> wrote:
>>> > >
>>> > >     Thanks, will tag you and looking forward to feedback so we can
>>> > >     ensure that changes work for everyone.
>>> > >
>>> > >     Looking at the PR, I see agreement from Max to revert the change on
>>> > >     the release branch, but not in master. Would you mind restoring it
>>> > >     in master?
>>> > >
>>> > >     Thanks
>>> > >
>>> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com
>>> > >     <ma...@google.com>> wrote:
>>> > >
>>> > >
>>> > >
>>> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@google.com
>>> > >         <ma...@google.com>> wrote:
>>> > >
>>> > >             What I mean is that a user may find that it works for them
>>> > >             to pass "--myarg blah" and access it as "options.myarg"
>>> > >             without explicitly defining a "my_arg" flag due to the added
>>> > >             logic.  This is not the intended behavior and we may want to
>>> > >             change this implementation detail in the future.  However,
>>> > >             having this logic in a released version makes it hard to
>>> > >             change this behavior since users may erroneously depend on
>>> > >             this undocumented behavior.  Instead, we should namespace /
>>> > >             scope this so that it is obvious that this is meant for
>>> > >             runner (and not Beam user) consumption.
>>> > >
>>> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
>>> > >             <thw@apache.org <ma...@apache.org>> wrote:
>>> > >
>>> > >                 Can you please elaborate on what practical problems
>>> > >                 this introduces for users?
>>> > >
>>> > >                 I can see that this change allows a user to specify a
>>> > >                 runner specific option, which in the future may change
>>> > >                 because we decide to scope differently. If this only
>>> > >                 affects users of the portable Flink runner (like us),
>>> > >                 then no need to revert, because at this early stage we
>>> > >                 prefer something that works over being blocked.
>>> > >
>>> > >                 It would also be really great if some of the core Python
>>> > >                 SDK developers could help out with the design aspects
>>> > >                 and PR reviews of changes that affect common Python
>>> > >                 code. Anyone who specifically wants to be tagged on
>>> > >                 relevant JIRAs and PRs?
>>> > >
>>> > >
>>> > >         I would be happy to be tagged, and I can also help with
>>> > >         including other relevant folks whenever possible. In general I
>>> > >         think Robert, Charles, and I are good candidates.
>>> > >
>>> > >
>>> > >                 Thanks
>>> > >
>>> > >
>>> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
>>> > >                 <altay@google.com <ma...@google.com>> wrote:
>>> > >
>>> > >
>>> > >
>>> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
>>> > >                     <ccy@google.com <ma...@google.com>> wrote:
>>> > >
>>> > >                         For context, I made comments on
>>> > >                         https://github.com/apache/beam/pull/6600 noting
>>> > >                         that the changes being made were not good for
>>> > >                         Beam backwards-compatibility.  The change as is
>>> > >                         allows users to use pipeline options without
>>> > >                         explicitly defining them, which is not the type
>>> > >                         of usage we would like to encourage since we
>>> > >                         prefer to be explicit whenever possible.  If
>>> > >                         users write pipelines with this sort of pattern,
>>> > >                         they will potentially encounter pain when
>>> > >                         upgrading to a later version since this is an
>>> > >                         implementation detail and not an officially
>>> > >                         supported pattern.  I agree with the comments
>>> > >                         above that this is ultimately a scoping issue.
>>> > >                         I would not have a problem with these changes if
>>> > >                         they were explicitly scoped under either a
>>> > >                         runner or unparsed options namespace.
>>> > >
>>> > >                         As a second note, since the 2.8.0 release is
>>> > >                         being cut right now, because of these
>>> > >                         backwards-compatibility concerns, I would
>>> > >                         suggest reverting these changes, at least until
>>> > >                         2.8.0 is cut, so we can have a discussion here
>>> > >                         before committing to and releasing any API-level
>>> > >                         changes.
>>> > >
>>> > >
>>> > >                     +1 I would like to revert the changes in order not
>>> > >                     to rush this into the release. Once this discussion
>>> > >                     results in an agreement changes can be brought back.
>>> > >
>>> > >
>>> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
>>> > >                         <herohde@google.com <mailto:herohde@google.com>>
>>> > >                         wrote:
>>> > >
>>> > >                             Agree that pipeline options lack some
>>> > >                             mechanism for scoping. It is also not always
>>> > >                             possible to distinguish options meant to be
>>> > >                             consumed at pipeline construction time, by
>>> > >                             the runner, by the SDK harness, by the user
>>> > >                             code or any combination -- and this causes
>>> > >                             confusion every now and then.
>>> > >
>>> > >                             For Dataflow, we have been using
>>> > >                             "experiments" for arbitrary runner-specific
>>> > >                             options. It's simply a string list pipeline
>>> > >                             option that all SDKs support and, for Go at
>>> > >                             least, is sent to portable runners. Flink
>>> > >                             can do the same in the short term to move
>>> > >                             forward.
>>> > >
>>> > >                             Henning
>>> > >
>>> > >
>>> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
>>> > >                             <thw@apache.org <ma...@apache.org>> wrote:
>>> > >
>>> > >                                 [moving to the list]
>>> > >
>>> > >                                 The requirement driving this part of the
>>> > >                                 change was to allow a user to specify
>>> > >                                 pipeline options that a runner supports
>>> > >                                 without having to declare those in each
>>> > >                                 language SDK.
>>> > >
>>> > >                                 In the specific scenario, we have
>>> > >                                 options that the Flink runner supports
>>> > >                                 (and can validate), that are not
>>> > >                                 enumerated in the Python SDK.
>>> > >
>>> > >                                 I think we have a bigger problem scoping
>>> > >                                 pipeline options. For example, the
>>> > >                                 runner options are dumped into the SDK
>>> > >                                 worker. There is also a possibility of
>>> > >                                 name collisions. So I think this would
>>> > >                                 benefit from broader feedback.
>>> > >
>>> > >                                 Thanks,
>>> > >                                 Thomas
>>> > >
>>> > >
>>> > >                                 ---------- Forwarded message ---------
>>> > >                                 From: *Charles Chen*
>>> > >                                 <notifications@github.com
>>> > >                                 <ma...@github.com>>
>>> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
>>> > >                                 Subject: Re: [apache/beam] [BEAM-5442]
>>> > >                                 Store duplicate unknown options in a
>>> > >                                 list argument (#6600)
>>> > >                                 To: apache/beam <beam@noreply.github.com
>>> > >                                 <ma...@noreply.github.com>>
>>> > >                                 Cc: Thomas Weise <thomas.weise@gmail.com
>>> > >                                 <ma...@gmail.com>>,
>>> > >                                 Mention <mention@noreply.github.com
>>> > >                                 <ma...@noreply.github.com>>
>>> > >
>>> > >
>>> > >                                 CC: @tweise <https://github.com/tweise>
>>> > >
>>> > >
>>> > >
>>> > >
>>>
>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Lukasz Cwik <lc...@google.com>.

Note that, thinking ahead to cross-language pipelines, we'll want something
like "options" discovery as well. So reusing this concept for runners makes
sense to me.
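
A rough sketch of the two-phase parse that the "fetch" idea implies, in
Python. Everything runner-side here is an assumption: fetch_runner_options
is a hypothetical stand-in for a Job API call that does not exist today, and
the option names parallelism and checkpoint_dir are made up for illustration.

```python
import argparse


def fetch_runner_options(job_endpoint):
    """Hypothetical stand-in for asking the runner which options it supports.

    A real implementation would call the Job API at job_endpoint; no such
    call exists today, so a fixed, made-up answer is returned here.
    """
    return {
        "parallelism": {"type": int, "help": "illustrative integer option"},
        "checkpoint_dir": {"type": str, "help": "illustrative string option"},
    }


def parse_with_discovery(argv):
    """Parse argv in two phases: bootstrap options first, then fetched ones."""
    # Phase 1: parse only what is needed to reach the runner.
    bootstrap = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    bootstrap.add_argument("--job_endpoint", required=True)
    known, remaining = bootstrap.parse_known_args(argv)

    # Phase 2: build a parser from the fetched descriptions and apply it to
    # the arguments that were unknown in phase 1.
    runner_parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    for name, spec in fetch_runner_options(known.job_endpoint).items():
        runner_parser.add_argument(
            "--" + name, type=spec["type"], help=spec["help"])
    runner_opts, unknown = runner_parser.parse_known_args(remaining)
    if unknown:
        # Nothing is silently ignored: leftovers are reported before
        # any pipeline is submitted.
        raise ValueError("options not supported by the runner: %s" % unknown)

    supplied = {k: v for k, v in vars(runner_opts).items() if v is not None}
    return known.job_endpoint, supplied
```

In a real SDK the options the SDK itself declares would be consumed before
the final check, so only flags that neither the SDK nor the runner recognizes
would be rejected.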

On Mon, Oct 15, 2018 at 10:50 AM Lukasz Cwik <lc...@google.com> wrote:

> I agree with the sentiment for better error checking.
>
> We can try to make it such that the SDK can "fetch" the set of options
> that the runner supports by making a call to the Job API. The API could
> return a list of option names (descriptions for --help purposes and also
> potentially the expected format) which would remove the worry around
> "unknown" options. Yes, I understand that to make the Job API call, we
> may need to parse some options from the args parameters first and then
> parse the unknown options after they are fetched.
>
> Alternatively, we can choose an explicit format upfront.
> To expand on the exact format for --runner_option=..., here are some
> different ideas:
> 1) Specified multiple times, each one is an explicit flag
> --runner_option=--blah=bar --runner_option=--foo=baz1
> --runner_option=--foo=baz2
>
> 2) specified multiple times, we drop the explicit flag
> --runner_option=blah=bar --runner_option=foo=baz1 --runner_option=foo=baz2
>
> 3) we use a string which the runner can choose to interpret however they
> want (JSON/XML shown below)
> --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>
> --runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'
>
> 4) we use a string which must be a specific format such as JSON (allows
> the SDK to do simple validation):
> --runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
>
> I would strongly suggest that we go with the "fetch" approach, since this
> makes the set of options discoverable and helps users find errors much
> earlier in their pipeline.
>
>
>
> On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mx...@apache.org>
>> wrote:
>> >
>> > I agree that the current approach breaks the pipeline options contract
>> > because "unknown" options get parsed in the same way as options which
>> > have been defined by the user.
>>
>> FWIW, I think we're already breaking this "contract." Unknown options
>> are silently ignored; with this change we just change how we record
>> them. It still feels a bit hacky though.
>>
>> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
>> > true/false flags. We want to pass all types of pipeline options to the
>> > Runner.
>>
>> Experiments is an arbitrary set of strings, which can be of the form
>> "param=value" if that's useful. (Dataflow does this.) There is, again,
>> no namespacing on the param names, but we could use URNs or impose
>> some other structure here.
>>
>> > How to solve this?
>> >
>> > 1) Add all options of all Runners to each SDK
>> > We added some of the FlinkRunner options to the Python SDK but realized
>> > syncing is rather cumbersome in the long term. However, we want the most
>> > important options to be validated on the client side.
>>
>> I don't think this is sustainable in the long run. However, thinking
>> about this, in the worst case validation happens after construction
>> but before execution (as with much of our other validation) so it
>> isn't that bad.
>>
>> > 2) Pass "unknown" options via a separate list in the Proto which can
>> > only be accessed internally by the Runners. This still allows passing
>> > arbitrary options but we wouldn't leak unknown options and display them
>> > as top-level options.
>>
>> I think there needs to be a way for the user to communicate values
>> directly to the runner regardless of the SDK. My preference would be
>> to make this explicit, e.g. (repeated) --runner_option=..., rather
>> than scooping up all unknown flags at command line parsing time.
>> Perhaps an SDK that is aware of some runners could choose to lift
>> these as top-level options, but still pass them as runner options.
>>
>> > On 13.10.18 02:34, Charles Chen wrote:
>> > > The current release branch
>> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after the
>> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a
>> > > revert of the revert.  Regarding your comment above, I can help out with
>> > > the design / PR reviews for common Python code as you suggest.
>> > >
>> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org
>> > > <ma...@apache.org>> wrote:
>> > >
>> > >     Thanks, will tag you and looking forward to feedback so we can
>> > >     ensure that changes work for everyone.
>> > >
>> > >     Looking at the PR, I see agreement from Max to revert the change on
>> > >     the release branch, but not in master. Would you mind restoring it
>> > >     in master?
>> > >
>> > >     Thanks
>> > >
>> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com
>> > >     <ma...@google.com>> wrote:
>> > >
>> > >
>> > >
>> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@google.com
>> > >         <ma...@google.com>> wrote:
>> > >
>> > >             What I mean is that a user may find that it works for them
>> > >             to pass "--myarg blah" and access it as "options.myarg"
>> > >             without explicitly defining a "my_arg" flag due to the added
>> > >             logic.  This is not the intended behavior and we may want to
>> > >             change this implementation detail in the future.  However,
>> > >             having this logic in a released version makes it hard to
>> > >             change this behavior since users may erroneously depend on
>> > >             this undocumented behavior.  Instead, we should namespace /
>> > >             scope this so that it is obvious that this is meant for
>> > >             runner (and not Beam user) consumption.
>> > >
>> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
>> > >             <thw@apache.org <ma...@apache.org>> wrote:
>> > >
>> > >                 Can you please elaborate on what practical problems
>> > >                 this introduces for users?
>> > >
>> > >                 I can see that this change allows a user to specify a
>> > >                 runner specific option, which in the future may change
>> > >                 because we decide to scope differently. If this only
>> > >                 affects users of the portable Flink runner (like us),
>> > >                 then no need to revert, because at this early stage we
>> > >                 prefer something that works over being blocked.
>> > >
>> > >                 It would also be really great if some of the core Python
>> > >                 SDK developers could help out with the design aspects
>> > >                 and PR reviews of changes that affect common Python
>> > >                 code. Anyone who specifically wants to be tagged on
>> > >                 relevant JIRAs and PRs?
>> > >
>> > >
>> > >         I would be happy to be tagged, and I can also help with
>> > >         including other relevant folks whenever possible. In general I
>> > >         think Robert, Charles, and I are good candidates.
>> > >
>> > >
>> > >                 Thanks
>> > >
>> > >
>> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
>> > >                 <altay@google.com <ma...@google.com>> wrote:
>> > >
>> > >
>> > >
>> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
>> > >                     <ccy@google.com <ma...@google.com>> wrote:
>> > >
>> > >                         For context, I made comments on
>> > >                         https://github.com/apache/beam/pull/6600
>> noting
>> > >                         that the changes being made were not good for
>> > >                         Beam backwards-compatibility.  The change as
>> is
>> > >                         allows users to use pipeline options without
>> > >                         explicitly defining them, which is not the
>> type
>> > >                         of usage we would like to encourage since we
>> > >                         prefer to be explicit whenever possible.  If
>> > >                         users write pipelines with this sort of
>> pattern,
>> > >                         they will potentially encounter pain when
>> > >                         upgrading to a later version since this is an
>> > >                         implementation detail and not an officially
>> > >                         supported pattern.  I agree with the comments
>> > >                         above that this is ultimately a scoping issue.
>> > >                         I would not have a problem with these changes
>> if
>> > >                         they were explicitly scoped under either a
>> > >                         runner or unparsed options namespace.
>> > >
>> > >                         As a second note, since the 2.8.0 release is
>> > >                         being cut right now, because of these
>> > >                         backwards-compatibility concerns, I would
>> > >                         suggest reverting these changes, at least
>> until
>> > >                         2.8.0 is cut, so we can have a discussion here
>> > >                         before committing to and releasing any
>> API-level
>> > >                         changes.
>> > >
>> > >
>> > >                     +1 I would like to revert the changes in order not to
>> > >                     rush this into the release. Once this discussion
>> > >                     results in an agreement, changes can be brought back.
>> > >
>> > >
>> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
>> > >                         <herohde@google.com <mailto:
>> herohde@google.com>>
>> > >                         wrote:
>> > >
>> > >                             Agree that pipeline options lack some
>> > >                             mechanism for scoping. It is also not
>> always
>> > >                             possible to distinguish options meant to be
>> > >                             consumed at pipeline construction time, by
>> > >                             the runner, by the SDK harness, by the
>> user
>> > >                             code or any combination -- and this causes
>> > >                             confusion every now and then.
>> > >
>> > >                             For Dataflow, we have been using
>> > >                             "experiments" for arbitrary
>> runner-specific
>> > >                             options. It's simply a string list
>> pipeline
>> > >                             option that all SDKs support and, for Go
>> at
>> > >                             least, is sent to portable runners. Flink
>> > >                             can do the same in the short term to move
>> > >                             forward.
>> > >
>> > >                             Henning
>> > >
>> > >
>> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas
>> Weise
>> > >                             <thw@apache.org <ma...@apache.org>>
>> wrote:
>> > >
>> > >                                 [moving to the list]
>> > >
>> > >                                 The requirement driving this part of
>> the
>> > >                                 change was to allow a user to specify
>> > >                                 pipeline options that a runner
>> supports
>> > >                                 without having to declare those in
>> each
>> > >                                 language SDK.
>> > >
>> > >                                 In the specific scenario, we have
>> > >                                 options that the Flink runner supports
>> > >                                 (and can validate), that are not
>> > >                                 enumerated in the Python SDK.
>> > >
>> > >                                 I think we have a bigger problem
>> scoping
>> > >                                 pipeline options. For example, the
>> > >                                 runner options are dumped into the SDK
>> > >                                 worker. There is also a possibility of
>> > >                                 name collisions. So I think this would
>> > >                                 benefit from broader feedback.
>> > >
>> > >                                 Thanks,
>> > >                                 Thomas
>> > >
>> > >
>> > >                                 ---------- Forwarded message ---------
>> > >                                 From: *Charles Chen*
>> > >                                 <notifications@github.com
>> > >                                 <ma...@github.com>>
>> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
>> > >                                 Subject: Re: [apache/beam] [BEAM-5442]
>> > >                                 Store duplicate unknown options in a
>> > >                                 list argument (#6600)
>> > >                                 To: apache/beam <
>> beam@noreply.github.com
>> > >                                 <ma...@noreply.github.com>>
>> > >                                 Cc: Thomas Weise <
>> thomas.weise@gmail.com
>> > >                                 <ma...@gmail.com>>,
>> > >                                 Mention <mention@noreply.github.com
>> > >                                 <ma...@noreply.github.com>>
>> > >
>> > >
>> > >                                 CC: @tweise <
>> https://github.com/tweise>
>> > >
>> > >
>> > >
>> > >
>>
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Lukasz Cwik <lc...@google.com>.

I agree with the sentiment for better error checking.

We can try to make it such that the SDK can "fetch" the set of options that
the runner supports by making a call to the Job API. The API could return a
list of option names (with descriptions for --help purposes, and potentially
the expected format), which would remove the worry around "unknown" options.
I understand that, to be able to make the Job API call, we may need to parse
some options from the args first and then parse the unknown options once the
runner's list has been fetched.
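
A minimal sketch of that two-phase parse, in Python. Everything in it is an
assumption made for illustration: fetch_runner_options stands in for a Job
API call that does not exist in this form, and the option names and types
it returns are invented.

```python
import argparse


def fetch_runner_options(job_endpoint):
    """Hypothetical stand-in for a Job API call.

    A real implementation would ask the runner at job_endpoint which
    options it supports; here the answer is hard-coded.
    """
    return {
        "parallelism": (int, "Illustrative integer option."),
        "state_backend": (str, "Illustrative string option."),
    }


def parse_pipeline_args(argv):
    # Phase 1: parse only what is needed to reach the runner.
    bootstrap = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    bootstrap.add_argument("--job_endpoint", default="localhost:8099")
    known, remaining = bootstrap.parse_known_args(argv)

    # Phase 2: declare the fetched options, then parse what is left.
    runner_parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    fetched = fetch_runner_options(known.job_endpoint)
    for name, (type_, help_text) in fetched.items():
        runner_parser.add_argument("--" + name, type=type_, help=help_text)
    runner_options, unknown = runner_parser.parse_known_args(remaining)

    # Anything still unknown is a likely typo and can be reported early.
    return known, vars(runner_options), unknown
```

With this shape, a misspelled flag such as --paralelism=8 ends up in the
third return value, which is where early error reporting could hook in.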

Alternatively, we can choose an explicit format upfront.
To expand on the exact format for --runner_option=..., here are some
different ideas:
1) Specified multiple times, each one is an explicit flag
--runner_option=--blah=bar --runner_option=--foo=baz1
--runner_option=--foo=baz2

2) Specified multiple times, with the leading "--" dropped from each option
--runner_option=blah=bar --runner_option=foo=baz1 --runner_option=foo=baz2

3) We use a string which the runner can choose to interpret however it
wants (JSON/XML shown below)
--runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
--runner_option='<options><blah>bar</blah><foo>baz1</foo><foo>baz2</foo></options>'

4) We use a string which must be in a specific format such as JSON (allows
the SDK to do simple validation):
--runner_option='{"blah": "bar", "foo": ["baz1", "baz2"]}'
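
To make the difference between formats 2 and 4 concrete, here is a rough
Python sketch of how an SDK might parse each. The helper names are made up
for illustration and are not part of any SDK; only argparse and json from
the standard library are used.

```python
import argparse
import json
from collections import defaultdict


def parse_key_value_options(argv):
    """Format 2: repeated --runner_option=key=value flags."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--runner_option", action="append", default=[])
    args, _ = parser.parse_known_args(argv)

    options = defaultdict(list)
    for item in args.runner_option:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % item)
        # Repeated keys accumulate, so foo=baz1 and foo=baz2 both survive.
        options[key].append(value)
    return dict(options)


def parse_json_option(text):
    """Format 4: a single JSON object the SDK can sanity-check."""
    parsed = json.loads(text)  # raises ValueError on malformed JSON
    if not isinstance(parsed, dict):
        raise ValueError("runner_option must be a JSON object")
    return parsed
```

Format 2 keeps every value as a string and leaves typing to the runner,
while format 4 lets the SDK reject malformed input before submission.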

I would strongly suggest that we go with the "fetch" approach, since this
makes the set of options discoverable and helps users find errors much
earlier in their pipeline.



On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <ro...@google.com> wrote:

> On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mx...@apache.org> wrote:
> >
> > I agree that the current approach breaks the pipeline options contract
> > because "unknown" options get parsed in the same way as options which
> > have been defined by the user.
>
> FWIW, I think we're already breaking this "contract." Unknown options
> are silently ignored; with this change we just change how we record
> them. It still feels a bit hacky though.
>
> > I'm not sure the `experiments` flag works for us. AFAIK it only allows
> > true/false flags. We want to pass all types of pipeline options to the
> > Runner.
>
> Experiments is an arbitrary set of strings, which can be of the form
> "param=value" if that's useful. (Dataflow does this.) There is, again,
> no namespacing on the param names, but we could use URNs or impose
> some other structure here.
>
> > How to solve this?
> >
> > 1) Add all options of all Runners to each SDK
> > We added some of the FlinkRunner options to the Python SDK but realized
> > syncing is rather cumbersome in the long term. However, we want the most
> > important options to be validated on the client side.
>
> I don't think this is sustainable in the long run. However, thinking
> about this, in the worst case validation happens after construction
> but before execution (as with much of our other validation) so it
> isn't that bad.
>
> > 2) Pass "unknown" options via a separate list in the Proto which can
> > only be accessed internally by the Runners. This still allows passing
> > arbitrary options but we wouldn't leak unknown options and display them
> > as top-level options.
>
> I think there needs to be a way for the user to communicate values
> directly to the runner regardless of the SDK. My preference would be
> to make this explicit, e.g. (repeated) --runner_option=..., rather
> than scooping up all unknown flags at command line parsing time.
> Perhaps an SDK that is aware of some runners could choose to lift
> these as top-level options, but still pass them as runner options.
>
> > On 13.10.18 02:34, Charles Chen wrote:
> > > The current release branch
> > > (https://github.com/apache/beam/commits/release-2.8.0) was cut after
> the
> > > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as
> a
> > > revert of the revert.  Regarding your comment above, I can help out
> with
> > > the design / PR reviews for common Python code as you suggest.
> > >
> > > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org
> > > <ma...@apache.org>> wrote:
> > >
> > >     Thanks, will tag you and looking forward to feedback so we can
> > >     ensure that changes work for everyone.
> > >
> > >     Looking at the PR, I see agreement from Max to revert the change on
> > >     the release branch, but not in master. Would you mind to restore it
> > >     in master?
> > >
> > >     Thanks
> > >
> > >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com
> > >     <ma...@google.com>> wrote:
> > >
> > >
> > >
> > >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@google.com
> > >         <ma...@google.com>> wrote:
> > >
> > >             What I mean is that a user may find that it works for them
> > >             to pass "--myarg blah" and access it as "options.myarg"
> > >             without explicitly defining a "my_arg" flag due to the
> added
> > >             logic.  This is not the intended behavior and we may want
> to
> > >             change this implementation detail in the future.  However,
> > >             having this logic in a released version makes it hard to
> > >             change this behavior since users may erroneously depend on
> > >             this undocumented behavior.  Instead, we should namespace /
> > >             scope this so that it is obvious that this is meant for
> > >             runner (and not Beam user) consumption.
> > >
> > >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
> > >             <thw@apache.org <ma...@apache.org>> wrote:
> > >
> > >                 Can you please elaborate more what practical problems
> > >                 this introduces for users?
> > >
> > >                 I can see that this change allows a user to specify a
> > >                 runner specific option, which in the future may change
> > >                 because we decide to scope differently. If this only
> > >                 affects users of the portable Flink runner (like us),
> > >                 then no need to revert, because at this early stage we
> > >                 prefer something that works over being blocked.
> > >
> > >                 It would also be really great if some of the core
> Python
> > >                 SDK developers could help out with the design aspects
> > >                 and PR reviews of changes that affect common Python
> > >                 code. Anyone who specifically wants to be tagged on
> > >                 relevant JIRAs and PRs?
> > >
> > >
> > >         I would be happy to be tagged, and I can also help with
> > >         including other relevant folks whenever possible. In general I
> > >         think Robert, Charles, myself are good candidates.
> > >
> > >
> > >                 Thanks
> > >
> > >
> > >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
> > >                 <altay@google.com <ma...@google.com>> wrote:
> > >
> > >
> > >
> > >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
> > >                     <ccy@google.com <ma...@google.com>> wrote:
> > >
> > >                         For context, I made comments on
> > >                         https://github.com/apache/beam/pull/6600
> noting
> > >                         that the changes being made were not good for
> > >                         Beam backwards-compatibility.  The change as is
> > >                         allows users to use pipeline options without
> > >                         explicitly defining them, which is not the type
> > >                         of usage we would like to encourage since we
> > >                         prefer to be explicit whenever possible.  If
> > >                         users write pipelines with this sort of
> pattern,
> > >                         they will potentially encounter pain when
> > >                         upgrading to a later version since this is an
> > >                         implementation detail and not an officially
> > >                         supported pattern.  I agree with the comments
> > >                         above that this is ultimately a scoping issue.
> > >                         I would not have a problem with these changes
> if
> > >                         they were explicitly scoped under either a
> > >                         runner or unparsed options namespace.
> > >
> > >                         As a second note, since the 2.8.0 release is
> > >                         being cut right now, because of these
> > >                         backwards-compatibility concerns, I would
> > >                         suggest reverting these changes, at least until
> > >                         2.8.0 is cut, so we can have a discussion here
> > >                         before committing to and releasing any
> API-level
> > >                         changes.
> > >
> > >
> > >                     +1 I would like to revert the changes in order not to
> > >                     rush this into the release. Once this discussion
> > >                     results in an agreement, changes can be brought back.
> > >
> > >
> > >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
> > >                         <herohde@google.com <mailto:herohde@google.com
> >>
> > >                         wrote:
> > >
> > >                             Agree that pipeline options lack some
> > >                             mechanism for scoping. It is also not
> always
> > >                             possible to distinguish options meant to be
> > >                             consumed at pipeline construction time, by
> > >                             the runner, by the SDK harness, by the user
> > >                             code or any combination -- and this causes
> > >                             confusion every now and then.
> > >
> > >                             For Dataflow, we have been using
> > >                             "experiments" for arbitrary runner-specific
> > >                             options. It's simply a string list pipeline
> > >                             option that all SDKs support and, for Go at
> > >                             least, is sent to portable runners. Flink
> > >                             can do the same in the short term to move
> > >                             forward.
> > >
> > >                             Henning
> > >
> > >
> > >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas
> Weise
> > >                             <thw@apache.org <ma...@apache.org>>
> wrote:
> > >
> > >                                 [moving to the list]
> > >
> > >                                 The requirement driving this part of
> the
> > >                                 change was to allow a user to specify
> > >                                 pipeline options that a runner supports
> > >                                 without having to declare those in each
> > >                                 language SDK.
> > >
> > >                                 In the specific scenario, we have
> > >                                 options that the Flink runner supports
> > >                                 (and can validate), that are not
> > >                                 enumerated in the Python SDK.
> > >
> > >                                 I think we have a bigger problem
> scoping
> > >                                 pipeline options. For example, the
> > >                                 runner options are dumped into the SDK
> > >                                 worker. There is also a possibility of
> > >                                 name collisions. So I think this would
> > >                                 benefit from broader feedback.
> > >
> > >                                 Thanks,
> > >                                 Thomas
> > >
> > >
> > >                                 ---------- Forwarded message ---------
> > >                                 From: *Charles Chen*
> > >                                 <notifications@github.com
> > >                                 <ma...@github.com>>
> > >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
> > >                                 Subject: Re: [apache/beam] [BEAM-5442]
> > >                                 Store duplicate unknown options in a
> > >                                 list argument (#6600)
> > >                                 To: apache/beam <
> beam@noreply.github.com
> > >                                 <ma...@noreply.github.com>>
> > >                                 Cc: Thomas Weise <
> thomas.weise@gmail.com
> > >                                 <ma...@gmail.com>>,
> > >                                 Mention <mention@noreply.github.com
> > >                                 <ma...@noreply.github.com>>
> > >
> > >
> > >                                 CC: @tweise <https://github.com/tweise
> >
> > >
> > >
> > >
> > >
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Robert Bradshaw <ro...@google.com>.

On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mx...@apache.org> wrote:
>
> I agree that the current approach breaks the pipeline options contract
> because "unknown" options get parsed in the same way as options which
> have been defined by the user.

FWIW, I think we're already breaking this "contract." Unknown options
are silently ignored; with this change we just change how we record
them. It still feels a bit hacky though.

> I'm not sure the `experiments` flag works for us. AFAIK it only allows
> true/false flags. We want to pass all types of pipeline options to the
> Runner.

Experiments is an arbitrary set of strings, which can be of the form
"param=value" if that's useful. (Dataflow does this.) There is, again,
no namespacing on the param names, but we could use URNs or impose
some other structure here.
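
As a small illustration of bare flags and "param=value" entries sharing one
string list, a Python sketch follows. The function and the experiment names
in the usage note are hypothetical; this is not how any runner is known to
read the list.

```python
def split_experiments(experiments):
    """Separate bare flags from param=value entries in a string list."""
    flags = set()
    params = {}
    for entry in experiments:
        key, sep, value = entry.partition("=")
        if sep:
            # "param=value" form; the value stays a string.
            params[key] = value
        else:
            # Plain on/off experiment.
            flags.add(key)
    return flags, params
```

For example, split_experiments(["some_flag", "some_param=8"]) yields the
flag set {"some_flag"} and the mapping {"some_param": "8"}.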

> How to solve this?
>
> 1) Add all options of all Runners to each SDK
> We added some of the FlinkRunner options to the Python SDK but realized
> syncing is rather cumbersome in the long term. However, we want the most
> important options to be validated on the client side.

I don't think this is sustainable in the long run. However, thinking
about this, in the worst case validation happens after construction
but before execution (as with much of our other validation) so it
isn't that bad.

> 2) Pass "unknown" options via a separate list in the Proto which can
> only be accessed internally by the Runners. This still allows passing
> arbitrary options but we wouldn't leak unknown options and display them
> as top-level options.

I think there needs to be a way for the user to communicate values
directly to the runner regardless of the SDK. My preference would be
to make this explicit, e.g. (repeated) --runner_option=..., rather
than scooping up all unknown flags at command line parsing time.
Perhaps an SDK that is aware of some runners could choose to lift
these as top-level options, but still pass them as runner options.

> On 13.10.18 02:34, Charles Chen wrote:
> > The current release branch
> > (https://github.com/apache/beam/commits/release-2.8.0) was cut after the
> > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a
> > revert of the revert.  Regarding your comment above, I can help out with
> > the design / PR reviews for common Python code as you suggest.
> >
> > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org
> > <ma...@apache.org>> wrote:
> >
> >     Thanks, will tag you and looking forward to feedback so we can
> >     ensure that changes work for everyone.
> >
> >     Looking at the PR, I see agreement from Max to revert the change on
> >     the release branch, but not in master. Would you mind to restore it
> >     in master?
> >
> >     Thanks
> >
> >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com
> >     <ma...@google.com>> wrote:
> >
> >
> >
> >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@google.com
> >         <ma...@google.com>> wrote:
> >
> >             What I mean is that a user may find that it works for them
> >             to pass "--myarg blah" and access it as "options.myarg"
> >             without explicitly defining a "my_arg" flag due to the added
> >             logic.  This is not the intended behavior and we may want to
> >             change this implementation detail in the future.  However,
> >             having this logic in a released version makes it hard to
> >             change this behavior since users may erroneously depend on
> >             this undocumented behavior.  Instead, we should namespace /
> >             scope this so that it is obvious that this is meant for
> >             runner (and not Beam user) consumption.
> >
> >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
> >             <thw@apache.org <ma...@apache.org>> wrote:
> >
> >                 Can you please elaborate more what practical problems
> >                 this introduces for users?
> >
> >                 I can see that this change allows a user to specify a
> >                 runner specific option, which in the future may change
> >                 because we decide to scope differently. If this only
> >                 affects users of the portable Flink runner (like us),
> >                 then no need to revert, because at this early stage we
> >                 prefer something that works over being blocked.
> >
> >                 It would also be really great if some of the core Python
> >                 SDK developers could help out with the design aspects
> >                 and PR reviews of changes that affect common Python
> >                 code. Anyone who specifically wants to be tagged on
> >                 relevant JIRAs and PRs?
> >
> >
> >         I would be happy to be tagged, and I can also help with
> >         including other relevant folks whenever possible. In general I
> >         think Robert, Charles, myself are good candidates.
> >
> >
> >                 Thanks
> >
> >
> >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
> >                 <altay@google.com <ma...@google.com>> wrote:
> >
> >
> >
> >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
> >                     <ccy@google.com <ma...@google.com>> wrote:
> >
> >                         For context, I made comments on
> >                         https://github.com/apache/beam/pull/6600 noting
> >                         that the changes being made were not good for
> >                         Beam backwards-compatibility.  The change as is
> >                         allows users to use pipeline options without
> >                         explicitly defining them, which is not the type
> >                         of usage we would like to encourage since we
> >                         prefer to be explicit whenever possible.  If
> >                         users write pipelines with this sort of pattern,
> >                         they will potentially encounter pain when
> >                         upgrading to a later version since this is an
> >                         implementation detail and not an officially
> >                         supported pattern.  I agree with the comments
> >                         above that this is ultimately a scoping issue.
> >                         I would not have a problem with these changes if
> >                         they were explicitly scoped under either a
> >                         runner or unparsed options namespace.
> >
> >                         As a second note, since the 2.8.0 release is
> >                         being cut right now, because of these
> >                         backwards-compatibility concerns, I would
> >                         suggest reverting these changes, at least until
> >                         2.8.0 is cut, so we can have a discussion here
> >                         before committing to and releasing any API-level
> >                         changes.
> >
> >
> >                     +1 I would like to revert the changes in order not to
> >                     rush this into the release. Once this discussion
> >                     results in an agreement, changes can be brought back.
> >
> >
> >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
> >                         <herohde@google.com <ma...@google.com>>
> >                         wrote:
> >
> >                             Agree that pipeline options lack some
> >                             mechanism for scoping. It is also not always
> >                             possible to distinguish options meant to be
> >                             consumed at pipeline construction time, by
> >                             the runner, by the SDK harness, by the user
> >                             code or any combination -- and this causes
> >                             confusion every now and then.
> >
> >                             For Dataflow, we have been using
> >                             "experiments" for arbitrary runner-specific
> >                             options. It's simply a string list pipeline
> >                             option that all SDKs support and, for Go at
> >                             least, is sent to portable runners. Flink
> >                             can do the same in the short term to move
> >                             forward.
> >
> >                             Henning
> >
> >
> >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
> >                             <thw@apache.org <ma...@apache.org>> wrote:
> >
> >                                 [moving to the list]
> >
> >                                 The requirement driving this part of the
> >                                 change was to allow a user to specify
> >                                 pipeline options that a runner supports
> >                                 without having to declare those in each
> >                                 language SDK.
> >
> >                                 In the specific scenario, we have
> >                                 options that the Flink runner supports
> >                                 (and can validate), that are not
> >                                 enumerated in the Python SDK.
> >
> >                                 I think we have a bigger problem scoping
> >                                 pipeline options. For example, the
> >                                 runner options are dumped into the SDK
> >                                 worker. There is also a possibility of
> >                                 name collisions. So I think this would
> >                                 benefit from broader feedback.
> >
> >                                 Thanks,
> >                                 Thomas
> >
> >
> >                                 ---------- Forwarded message ---------
> >                                 From: *Charles Chen*
> >                                 <notifications@github.com
> >                                 <ma...@github.com>>
> >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
> >                                 Subject: Re: [apache/beam] [BEAM-5442]
> >                                 Store duplicate unknown options in a
> >                                 list argument (#6600)
> >                                 To: apache/beam <beam@noreply.github.com
> >                                 <ma...@noreply.github.com>>
> >                                 Cc: Thomas Weise <thomas.weise@gmail.com
> >                                 <ma...@gmail.com>>,
> >                                 Mention <mention@noreply.github.com
> >                                 <ma...@noreply.github.com>>
> >
> >
> >                                 CC: @tweise <https://github.com/tweise>
> >
> >
> >
> >

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Maximilian Michels <mx...@apache.org>.

I agree that the current approach breaks the pipeline options contract 
because "unknown" options get parsed in the same way as options which 
have been defined by the user.

I'm not sure the `experiments` flag works for us. AFAIK it only allows 
true/false flags. We want to pass all types of pipeline options to the 
Runner.
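For comparison, Henning's message quoted below describes `experiments` as a
string list option. If that holds, non-boolean values could in principle be
encoded as "key=value" strings. A minimal sketch, assuming a repeatable
`--experiments` flag; the entry names `use_fast_path` and `parallelism` are
made up for illustration and this is not the Beam implementation:

```python
import argparse


def parse_experiments(argv):
    """Split a repeatable string-list flag into bare flags and key=value settings."""
    parser = argparse.ArgumentParser()
    # Each occurrence of --experiments appends one string to the list.
    parser.add_argument("--experiments", action="append", default=[])
    args, _ = parser.parse_known_args(argv)

    flags, settings = set(), {}
    for entry in args.experiments:
        key, sep, value = entry.partition("=")
        if sep:
            # "key=value" entry; the value stays a string and the runner
            # would have to convert and validate it.
            settings[key] = value
        else:
            # Bare entry, treated as a true/false style flag.
            flags.add(key)
    return flags, settings


flags, settings = parse_experiments(
    ["--experiments", "use_fast_path", "--experiments", "parallelism=4"])
```

The drawback is that every value arrives untyped, so validation has to happen
on the runner side.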

How to solve this?

1) Add all options of all Runners to each SDK
We added some of the FlinkRunner options to the Python SDK but realized 
syncing is rather cumbersome in the long term. However, we want the most 
important options to be validated on the client side.

2) Pass "unknown" options via a separate list in the Proto which can 
only be accessed internally by the Runners. This still allows passing 
arbitrary options but we wouldn't leak unknown options and display them 
as top-level options.
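
Option 2 could look roughly like the sketch below on the SDK side. It uses
plain argparse rather than the actual Beam PipelineOptions code, and the
`--parallelism` flag only stands in for an option the SDK does not declare:

```python
import argparse


def split_options(argv):
    """Return (declared options, list of undeclared option tokens)."""
    parser = argparse.ArgumentParser()
    # Options the SDK declares and can validate on the client side.
    parser.add_argument("--runner", default="DirectRunner")
    parser.add_argument("--job_name", default=None)
    # parse_known_args returns the leftovers instead of failing on them.
    known, unknown = parser.parse_known_args(argv)
    # The unknown tokens stay in their own list and are never promoted to
    # top-level attributes on `known`.
    return known, unknown


known, runner_only = split_options(
    ["--runner", "FlinkRunner", "--parallelism", "4", "--job_name", "wordcount"])
```

Here `runner_only` is what would travel in the separate Proto list, while
`known` carries only the declared options.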

-Max

On 13.10.18 02:34, Charles Chen wrote:
> The current release branch 
> (https://github.com/apache/beam/commits/release-2.8.0) was cut after the 
> revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a 
> revert of the revert.  Regarding your comment above, I can help out with 
> the design / PR reviews for common Python code as you suggest.
> 
> On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@apache.org 
> <ma...@apache.org>> wrote:
> 
>     Thanks, will tag you and looking forward to feedback so we can
>     ensure that changes work for everyone.
> 
>     Looking at the PR, I see agreement from Max to revert the change on
>     the release branch, but not in master. Would you mind to restore it
>     in master?
> 
>     Thanks
> 
>     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@google.com
>     <ma...@google.com>> wrote:
> 
> 
> 
>         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@google.com
>         <ma...@google.com>> wrote:
> 
>             What I mean is that a user may find that it works for them
>             to pass "--myarg blah" and access it as "options.myarg"
>             without explicitly defining a "my_arg" flag due to the added
>             logic.  This is not the intended behavior and we may want to
>             change this implementation detail in the future.  However,
>             having this logic in a released version makes it hard to
>             change this behavior since users may erroneously depend on
>             this undocumented behavior.  Instead, we should namespace /
>             scope this so that it is obvious that this is meant for
>             runner (and not Beam user) consumption.
> 
>             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
>             <thw@apache.org <ma...@apache.org>> wrote:
> 
>                 Can you please elaborate more what practical problems
>                 this introduces for users?
> 
>                 I can see that this change allows a user to specify a
>                 runner specific option, which in the future may change
>                 because we decide to scope differently. If this only
>                 affects users of the portable Flink runner (like us),
>                 then no need to revert, because at this early stage we
>                 prefer something that works over being blocked.
> 
>                 It would also be really great if some of the core Python
>                 SDK developers could help out with the design aspects
>                 and PR reviews of changes that affect common Python
>                 code. Anyone who specifically wants to be tagged on
>                 relevant JIRAs and PRs?
> 
> 
>         I would be happy to be tagged, and I can also help with
>         including other relevant folks whenever possible. In general I
>         think Robert, Charles, myself are good candidates.
> 
> 
>                 Thanks
> 
> 
>                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
>                 <altay@google.com <ma...@google.com>> wrote:
> 
> 
> 
>                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
>                     <ccy@google.com <ma...@google.com>> wrote:
> 
>                         For context, I made comments on
>                         https://github.com/apache/beam/pull/6600 noting
>                         that the changes being made were not good for
>                         Beam backwards-compatibility.  The change as is
>                         allows users to use pipeline options without
>                         explicitly defining them, which is not the type
>                         of usage we would like to encourage since we
>                         prefer to be explicit whenever possible.  If
>                         users write pipelines with this sort of pattern,
>                         they will potentially encounter pain when
>                         upgrading to a later version since this is an
>                         implementation detail and not an officially
>                         supported pattern.  I agree with the comments
>                         above that this is ultimately a scoping issue. 
>                         I would not have a problem with these changes if
>                         they were explicitly scoped under either a
>                         runner or unparsed options namespace.
> 
>                         As a second note, since the 2.8.0 release is
>                         being cut right now, because of these
>                         backwards-compatibility concerns, I would
>                         suggest reverting these changes, at least until
>                         2.8.0 is cut, so we can have a discussion here
>                         before committing to and releasing any API-level
>                         changes.
> 
> 
>                     +1 I would like to revert the changes in order not
>                     rush this into the release. Once this discussion
>                     results in an agreement changes can be brought back.
> 
> 
>                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
>                         <herohde@google.com <ma...@google.com>>
>                         wrote:
> 
>                             Agree that pipeline options lack some
>                             mechanism for scoping. It is also not always
>                             possible distinguish options meant to be
>                             consumed at pipeline construction time, by
>                             the runner, by the SDK harness, by the user
>                             code or any combination -- and this causes
>                             confusion every now and then.
> 
>                             For Dataflow, we have been using
>                             "experiments" for arbitrary runner-specific
>                             options. It's simply a string list pipeline
>                             option that all SDKs support and, for Go at
>                             least, is sent to portable runners. Flink
>                             can do the same in the short term to move
>                             forward.
> 
>                             Henning
> 
> 
>                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
>                             <thw@apache.org <ma...@apache.org>> wrote:
> 
>                                 [moving to the list]
> 
>                                 The requirement driving this part of the
>                                 change was to allow a user to specify
>                                 pipeline options that a runner supports
>                                 without having to declare those in each
>                                 language SDK.
> 
>                                 In the specific scenario, we have
>                                 options that the Flink runner supports
>                                 (and can validate), that are not
>                                 enumerated in the Python SDK.
> 
>                                 I think we have a bigger problem scoping
>                                 pipeline options. For example, the
>                                 runner options are dumped into the SDK
>                                 worker. There is also a possibility of
>                                 name collisions. So I think this would
>                                 benefit from broader feedback.
> 
>                                 Thanks,
>                                 Thomas
> 
> 
>                                 ---------- Forwarded message ---------
>                                 From: *Charles Chen*
>                                 <notifications@github.com
>                                 <ma...@github.com>>
>                                 Date: Fri, Oct 12, 2018 at 8:36 AM
>                                 Subject: Re: [apache/beam] [BEAM-5442]
>                                 Store duplicate unknown options in a
>                                 list argument (#6600)
>                                 To: apache/beam <beam@noreply.github.com
>                                 <ma...@noreply.github.com>>
>                                 Cc: Thomas Weise <thomas.weise@gmail.com
>                                 <ma...@gmail.com>>,
>                                 Mention <mention@noreply.github.com
>                                 <ma...@noreply.github.com>>
> 
> 
>                                 CC: @tweise <https://github.com/tweise>
> 
> 
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Charles Chen <cc...@google.com>.

The current release branch (
https://github.com/apache/beam/commits/release-2.8.0) was cut after the
revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a
revert of the revert.  Regarding your comment above, I can help out with
the design / PR reviews for common Python code as you suggest.

On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <th...@apache.org> wrote:

> Thanks, will tag you and looking forward to feedback so we can ensure that
> changes work for everyone.
>
> Looking at the PR, I see agreement from Max to revert the change on the
> release branch, but not in master. Would you mind to restore it in master?
>
> Thanks
>
> On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <al...@google.com> wrote:
>
>>
>>
>> On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <cc...@google.com> wrote:
>>
>>> What I mean is that a user may find that it works for them to pass
>>> "--myarg blah" and access it as "options.myarg" without explicitly defining
>>> a "my_arg" flag due to the added logic.  This is not the intended behavior
>>> and we may want to change this implementation detail in the future.
>>> However, having this logic in a released version makes it hard to change
>>> this behavior since users may erroneously depend on this undocumented
>>> behavior.  Instead, we should namespace / scope this so that it is obvious
>>> that this is meant for runner (and not Beam user) consumption.
>>>
>>> On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise <th...@apache.org> wrote:
>>>
>>>> Can you please elaborate more what practical problems this introduces
>>>> for users?
>>>>
>>>> I can see that this change allows a user to specify a runner specific
>>>> option, which in the future may change because we decide to scope
>>>> differently. If this only affects users of the portable Flink runner (like
>>>> us), then no need to revert, because at this early stage we prefer
>>>> something that works over being blocked.
>>>>
>>>> It would also be really great if some of the core Python SDK developers
>>>> could help out with the design aspects and PR reviews of changes that
>>>> affect common Python code. Anyone who specifically wants to be tagged on
>>>> relevant JIRAs and PRs?
>>>>
>>>
>> I would be happy to be tagged, and I can also help with including other
>> relevant folks whenever possible. In general I think Robert, Charles,
>> myself are good candidates.
>>
>>
>>
>>>
>>>> Thanks
>>>>
>>>>
>>>> On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen <cc...@google.com> wrote:
>>>>>
>>>>>> For context, I made comments on
>>>>>> https://github.com/apache/beam/pull/6600 noting that the changes
>>>>>> being made were not good for Beam backwards-compatibility.  The change as
>>>>>> is allows users to use pipeline options without explicitly defining them,
>>>>>> which is not the type of usage we would like to encourage since we prefer
>>>>>> to be explicit whenever possible.  If users write pipelines with this sort
>>>>>> of pattern, they will potentially encounter pain when upgrading to a later
>>>>>> version since this is an implementation detail and not an officially
>>>>>> supported pattern.  I agree with the comments above that this is ultimately
>>>>>> a scoping issue.  I would not have a problem with these changes if they
>>>>>> were explicitly scoped under either a runner or unparsed options namespace.
>>>>>>
>>>>>> As a second note, since the 2.8.0 release is being cut right now,
>>>>>> because of these backwards-compatibility concerns, I would suggest
>>>>>> reverting these changes, at least until 2.8.0 is cut, so we can have a
>>>>>> discussion here before committing to and releasing any API-level changes.
>>>>>>
>>>>>
>>>>> +1 I would like to revert the changes in order not rush this into the
>>>>> release. Once this discussion results in an agreement changes can be
>>>>> brought back.
>>>>>
>>>>>
>>>>>>
>>>>>> On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde <he...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Agree that pipeline options lack some mechanism for scoping. It is
>>>>>>> also not always possible distinguish options meant to be consumed at
>>>>>>> pipeline construction time, by the runner, by the SDK harness, by the user
>>>>>>> code or any combination -- and this causes confusion every now and then.
>>>>>>>
>>>>>>> For Dataflow, we have been using "experiments" for arbitrary
>>>>>>> runner-specific options. It's simply a string list pipeline option that all
>>>>>>> SDKs support and, for Go at least, is sent to portable runners. Flink can
>>>>>>> do the same in the short term to move forward.
>>>>>>>
>>>>>>> Henning
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <th...@apache.org> wrote:
>>>>>>>
>>>>>>>> [moving to the list]
>>>>>>>>
>>>>>>>> The requirement driving this part of the change was to allow a user
>>>>>>>> to specify pipeline options that a runner supports without having to
>>>>>>>> declare those in each language SDK.
>>>>>>>>
>>>>>>>> In the specific scenario, we have options that the Flink runner
>>>>>>>> supports (and can validate), that are not enumerated in the Python SDK.
>>>>>>>>
>>>>>>>> I think we have a bigger problem scoping pipeline options. For
>>>>>>>> example, the runner options are dumped into the SDK worker. There is also a
>>>>>>>> possibility of name collisions. So I think this would benefit from broader
>>>>>>>> feedback.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------- Forwarded message ---------
>>>>>>>> From: Charles Chen <no...@github.com>
>>>>>>>> Date: Fri, Oct 12, 2018 at 8:36 AM
>>>>>>>> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown
>>>>>>>> options in a list argument (#6600)
>>>>>>>> To: apache/beam <be...@noreply.github.com>
>>>>>>>> Cc: Thomas Weise <th...@gmail.com>, Mention <
>>>>>>>> mention@noreply.github.com>
>>>>>>>>
>>>>>>>>
>>>>>>>> CC: @tweise <https://github.com/tweise>
>>>>>>>>
>>>>>>>
>>>>>
>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Thomas Weise <th...@apache.org>.

Thanks, I will tag you. Looking forward to feedback so we can ensure that
changes work for everyone.

Looking at the PR, I see agreement from Max to revert the change on the
release branch, but not in master. Would you mind restoring it in master?

Thanks

On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <al...@google.com> wrote:

>
>
> On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <cc...@google.com> wrote:
>
>> What I mean is that a user may find that it works for them to pass
>> "--myarg blah" and access it as "options.myarg" without explicitly defining
>> a "my_arg" flag due to the added logic.  This is not the intended behavior
>> and we may want to change this implementation detail in the future.
>> However, having this logic in a released version makes it hard to change
>> this behavior since users may erroneously depend on this undocumented
>> behavior.  Instead, we should namespace / scope this so that it is obvious
>> that this is meant for runner (and not Beam user) consumption.
>>
>> On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise <th...@apache.org> wrote:
>>
>>> Can you please elaborate more what practical problems this introduces
>>> for users?
>>>
>>> I can see that this change allows a user to specify a runner specific
>>> option, which in the future may change because we decide to scope
>>> differently. If this only affects users of the portable Flink runner (like
>>> us), then no need to revert, because at this early stage we prefer
>>> something that works over being blocked.
>>>
>>> It would also be really great if some of the core Python SDK developers
>>> could help out with the design aspects and PR reviews of changes that
>>> affect common Python code. Anyone who specifically wants to be tagged on
>>> relevant JIRAs and PRs?
>>>
>>
> I would be happy to be tagged, and I can also help with including other
> relevant folks whenever possible. In general I think Robert, Charles,
> myself are good candidates.
>
>
>
>>
>>> Thanks
>>>
>>>
>>> On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay <al...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen <cc...@google.com> wrote:
>>>>
>>>>> For context, I made comments on
>>>>> https://github.com/apache/beam/pull/6600 noting that the changes
>>>>> being made were not good for Beam backwards-compatibility.  The change as
>>>>> is allows users to use pipeline options without explicitly defining them,
>>>>> which is not the type of usage we would like to encourage since we prefer
>>>>> to be explicit whenever possible.  If users write pipelines with this sort
>>>>> of pattern, they will potentially encounter pain when upgrading to a later
>>>>> version since this is an implementation detail and not an officially
>>>>> supported pattern.  I agree with the comments above that this is ultimately
>>>>> a scoping issue.  I would not have a problem with these changes if they
>>>>> were explicitly scoped under either a runner or unparsed options namespace.
>>>>>
>>>>> As a second note, since the 2.8.0 release is being cut right now,
>>>>> because of these backwards-compatibility concerns, I would suggest
>>>>> reverting these changes, at least until 2.8.0 is cut, so we can have a
>>>>> discussion here before committing to and releasing any API-level changes.
>>>>>
>>>>
>>>> +1 I would like to revert the changes in order not rush this into the
>>>> release. Once this discussion results in an agreement changes can be
>>>> brought back.
>>>>
>>>>
>>>>>
>>>>> On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde <he...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Agree that pipeline options lack some mechanism for scoping. It is
>>>>>> also not always possible distinguish options meant to be consumed at
>>>>>> pipeline construction time, by the runner, by the SDK harness, by the user
>>>>>> code or any combination -- and this causes confusion every now and then.
>>>>>>
>>>>>> For Dataflow, we have been using "experiments" for arbitrary
>>>>>> runner-specific options. It's simply a string list pipeline option that all
>>>>>> SDKs support and, for Go at least, is sent to portable runners. Flink can
>>>>>> do the same in the short term to move forward.
>>>>>>
>>>>>> Henning
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <th...@apache.org> wrote:
>>>>>>
>>>>>>> [moving to the list]
>>>>>>>
>>>>>>> The requirement driving this part of the change was to allow a user
>>>>>>> to specify pipeline options that a runner supports without having to
>>>>>>> declare those in each language SDK.
>>>>>>>
>>>>>>> In the specific scenario, we have options that the Flink runner
>>>>>>> supports (and can validate), that are not enumerated in the Python SDK.
>>>>>>>
>>>>>>> I think we have a bigger problem scoping pipeline options. For
>>>>>>> example, the runner options are dumped into the SDK worker. There is also a
>>>>>>> possibility of name collisions. So I think this would benefit from broader
>>>>>>> feedback.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thomas
>>>>>>>
>>>>>>>
>>>>>>> ---------- Forwarded message ---------
>>>>>>> From: Charles Chen <no...@github.com>
>>>>>>> Date: Fri, Oct 12, 2018 at 8:36 AM
>>>>>>> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown
>>>>>>> options in a list argument (#6600)
>>>>>>> To: apache/beam <be...@noreply.github.com>
>>>>>>> Cc: Thomas Weise <th...@gmail.com>, Mention <
>>>>>>> mention@noreply.github.com>
>>>>>>>
>>>>>>>
>>>>>>> CC: @tweise <https://github.com/tweise>
>>>>>>>
>>>>>>
>>>>
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Ahmet Altay <al...@google.com>.

On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <cc...@google.com> wrote:

> What I mean is that a user may find that it works for them to pass
> "--myarg blah" and access it as "options.myarg" without explicitly defining
> a "my_arg" flag due to the added logic.  This is not the intended behavior
> and we may want to change this implementation detail in the future.
> However, having this logic in a released version makes it hard to change
> this behavior since users may erroneously depend on this undocumented
> behavior.  Instead, we should namespace / scope this so that it is obvious
> that this is meant for runner (and not Beam user) consumption.
>
> On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise <th...@apache.org> wrote:
>
>> Can you please elaborate more what practical problems this introduces for
>> users?
>>
>> I can see that this change allows a user to specify a runner specific
>> option, which in the future may change because we decide to scope
>> differently. If this only affects users of the portable Flink runner (like
>> us), then no need to revert, because at this early stage we prefer
>> something that works over being blocked.
>>
>> It would also be really great if some of the core Python SDK developers
>> could help out with the design aspects and PR reviews of changes that
>> affect common Python code. Anyone who specifically wants to be tagged on
>> relevant JIRAs and PRs?
>>
>
I would be happy to be tagged, and I can also help with including other
relevant folks whenever possible. In general I think Robert, Charles, and
I are good candidates.



>
>> Thanks
>>
>>
>> On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay <al...@google.com> wrote:
>>
>>>
>>>
>>> On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen <cc...@google.com> wrote:
>>>
>>>> For context, I made comments on
>>>> https://github.com/apache/beam/pull/6600 noting that the changes being
>>>> made were not good for
>>>> Beam backwards-compatibility.  The change as is allows users to use
>>>> pipeline options without explicitly defining them, which is not the type of
>>>> usage we would like to encourage since we prefer to be explicit whenever
>>>> possible.  If users write pipelines with this sort of pattern, they will
>>>> potentially encounter pain when upgrading to a later version since this is
>>>> an implementation detail and not an officially supported pattern.  I agree
>>>> with the comments above that this is ultimately a scoping issue.  I would
>>>> not have a problem with these changes if they were explicitly scoped under
>>>> either a runner or unparsed options namespace.
>>>>
>>>> As a second note, since the 2.8.0 release is being cut right now,
>>>> because of these backwards-compatibility concerns, I would suggest
>>>> reverting these changes, at least until 2.8.0 is cut, so we can have a
>>>> discussion here before committing to and releasing any API-level changes.
>>>>
>>>
>>> +1 I would like to revert the changes in order not rush this into the
>>> release. Once this discussion results in an agreement changes can be
>>> brought back.
>>>
>>>
>>>>
>>>> On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde <he...@google.com>
>>>> wrote:
>>>>
>>>>> Agree that pipeline options lack some mechanism for scoping. It is
>>>>> also not always possible distinguish options meant to be consumed at
>>>>> pipeline construction time, by the runner, by the SDK harness, by the user
>>>>> code or any combination -- and this causes confusion every now and then.
>>>>>
>>>>> For Dataflow, we have been using "experiments" for arbitrary
>>>>> runner-specific options. It's simply a string list pipeline option that all
>>>>> SDKs support and, for Go at least, is sent to portable runners. Flink can
>>>>> do the same in the short term to move forward.
>>>>>
>>>>> Henning
>>>>>
>>>>>
>>>>> On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <th...@apache.org> wrote:
>>>>>
>>>>>> [moving to the list]
>>>>>>
>>>>>> The requirement driving this part of the change was to allow a user
>>>>>> to specify pipeline options that a runner supports without having to
>>>>>> declare those in each language SDK.
>>>>>>
>>>>>> In the specific scenario, we have options that the Flink runner
>>>>>> supports (and can validate), that are not enumerated in the Python SDK.
>>>>>>
>>>>>> I think we have a bigger problem scoping pipeline options. For
>>>>>> example, the runner options are dumped into the SDK worker. There is also a
>>>>>> possibility of name collisions. So I think this would benefit from broader
>>>>>> feedback.
>>>>>>
>>>>>> Thanks,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>> ---------- Forwarded message ---------
>>>>>> From: Charles Chen <no...@github.com>
>>>>>> Date: Fri, Oct 12, 2018 at 8:36 AM
>>>>>> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown
>>>>>> options in a list argument (#6600)
>>>>>> To: apache/beam <be...@noreply.github.com>
>>>>>> Cc: Thomas Weise <th...@gmail.com>, Mention <
>>>>>> mention@noreply.github.com>
>>>>>>
>>>>>>
>>>>>> CC: @tweise <https://github.com/tweise>
>>>>>>
>>>>>
>>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Charles Chen <cc...@google.com>.

What I mean is that a user may find that it works for them to pass "--myarg
blah" and access it as "options.myarg" without explicitly defining a
"myarg" flag due to the added logic.  This is not the intended behavior
and we may want to change this implementation detail in the future.
However, having this logic in a released version makes it hard to change
this behavior since users may erroneously depend on this undocumented
behavior.  Instead, we should namespace / scope this so that it is obvious
that this is meant for runner (and not Beam user) consumption.
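
To make the concern concrete, here is a small sketch of the kind of behavior I
mean. It is a hypothetical mimic written with plain argparse, not the code from
the PR, and `parse_leaky` is a made-up name:

```python
import argparse


def parse_leaky(argv):
    """Parse argv so that undeclared flags end up as top-level attributes."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--runner", default="DirectRunner")
    known, unknown = parser.parse_known_args(argv)

    # Declare every leftover "--flag" on the fly and parse the leftovers
    # into the same namespace as the declared options.
    extra = argparse.ArgumentParser()
    seen = set()
    for token in unknown:
        name = token.split("=", 1)[0]
        if name.startswith("--") and name not in seen:
            seen.add(name)
            extra.add_argument(name)
    extra.parse_args(unknown, namespace=known)
    return known


options = parse_leaky(["--myarg", "blah"])
# options.myarg now exists although no such flag was ever declared.
```

Once user code reads `options.myarg`, moving the unknown options into a scoped
list later would break that code.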

On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise <th...@apache.org> wrote:

> Can you please elaborate more what practical problems this introduces for
> users?
>
> I can see that this change allows a user to specify a runner specific
> option, which in the future may change because we decide to scope
> differently. If this only affects users of the portable Flink runner (like
> us), then no need to revert, because at this early stage we prefer
> something that works over being blocked.
>
> It would also be really great if some of the core Python SDK developers
> could help out with the design aspects and PR reviews of changes that
> affect common Python code. Anyone who specifically wants to be tagged on
> relevant JIRAs and PRs?
>
> Thanks
>
>
> On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay <al...@google.com> wrote:
>
>>
>>
>> On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen <cc...@google.com> wrote:
>>
>>> For context, I made comments on https://github.com/apache/beam/pull/6600
>>> noting that the changes being made were not good for Beam
>>> backwards-compatibility.  The change as is allows users to use pipeline
>>> options without explicitly defining them, which is not the type of usage we
>>> would like to encourage since we prefer to be explicit whenever possible.
>>> If users write pipelines with this sort of pattern, they will potentially
>>> encounter pain when upgrading to a later version since this is an
>>> implementation detail and not an officially supported pattern.  I agree
>>> with the comments above that this is ultimately a scoping issue.  I would
>>> not have a problem with these changes if they were explicitly scoped under
>>> either a runner or unparsed options namespace.
>>>
>>> As a second note, since the 2.8.0 release is being cut right now,
>>> because of these backwards-compatibility concerns, I would suggest
>>> reverting these changes, at least until 2.8.0 is cut, so we can have a
>>> discussion here before committing to and releasing any API-level changes.
>>>
>>
>> +1 I would like to revert the changes in order not rush this into the
>> release. Once this discussion results in an agreement changes can be
>> brought back.
>>
>>
>>>
>>> On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde <he...@google.com>
>>> wrote:
>>>
>>>> Agree that pipeline options lack some mechanism for scoping. It is also
>>>> not always possible distinguish options meant to be consumed at pipeline
>>>> construction time, by the runner, by the SDK harness, by the user code or
>>>> any combination -- and this causes confusion every now and then.
>>>>
>>>> For Dataflow, we have been using "experiments" for arbitrary
>>>> runner-specific options. It's simply a string list pipeline option that all
>>>> SDKs support and, for Go at least, is sent to portable runners. Flink can
>>>> do the same in the short term to move forward.
>>>>
>>>> Henning
>>>>
>>>>
>>>> On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <th...@apache.org> wrote:
>>>>
>>>>> [moving to the list]
>>>>>
>>>>> The requirement driving this part of the change was to allow a user to
>>>>> specify pipeline options that a runner supports without having to declare
>>>>> those in each language SDK.
>>>>>
>>>>> In the specific scenario, we have options that the Flink runner
>>>>> supports (and can validate), that are not enumerated in the Python SDK.
>>>>>
>>>>> I think we have a bigger problem scoping pipeline options. For
>>>>> example, the runner options are dumped into the SDK worker. There is also a
>>>>> possibility of name collisions. So I think this would benefit from broader
>>>>> feedback.
>>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>>
>>>>> ---------- Forwarded message ---------
>>>>> From: Charles Chen <no...@github.com>
>>>>> Date: Fri, Oct 12, 2018 at 8:36 AM
>>>>> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown options
>>>>> in a list argument (#6600)
>>>>> To: apache/beam <be...@noreply.github.com>
>>>>> Cc: Thomas Weise <th...@gmail.com>, Mention <
>>>>> mention@noreply.github.com>
>>>>>
>>>>>
>>>>> CC: @tweise <https://github.com/tweise>
>>>>>
>>>>>
>>>>
>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Thomas Weise <th...@apache.org>.

Can you please elaborate on what practical problems this introduces for
users?

I can see that this change allows a user to specify a runner-specific
option, which in the future may change because we decide to scope
differently. If this only affects users of the portable Flink runner (like
us), then there is no need to revert, because at this early stage we prefer
something that works over being blocked.

It would also be really great if some of the core Python SDK developers
could help out with the design aspects and PR reviews of changes that
affect common Python code. Anyone who specifically wants to be tagged on
relevant JIRAs and PRs?

Thanks


On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay <al...@google.com> wrote:

>
>
> On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen <cc...@google.com> wrote:
>
>> For context, I made comments on https://github.com/apache/beam/pull/6600
>> noting that the changes being made were not good for Beam
>> backwards-compatibility.  The change as is allows users to use pipeline
>> options without explicitly defining them, which is not the type of usage we
>> would like to encourage since we prefer to be explicit whenever possible.
>> If users write pipelines with this sort of pattern, they will potentially
>> encounter pain when upgrading to a later version since this is an
>> implementation detail and not an officially supported pattern.  I agree
>> with the comments above that this is ultimately a scoping issue.  I would
>> not have a problem with these changes if they were explicitly scoped under
>> either a runner or unparsed options namespace.
>>
>> As a second note, since the 2.8.0 release is being cut right now, because
>> of these backwards-compatibility concerns, I would suggest reverting these
>> changes, at least until 2.8.0 is cut, so we can have a discussion here
>> before committing to and releasing any API-level changes.
>>
>
> +1 I would like to revert the changes in order not to rush this into the
> release. Once this discussion results in an agreement, the changes can be
> brought back.
>
>
>>
>> On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde <he...@google.com> wrote:
>>
>>> Agree that pipeline options lack some mechanism for scoping. It is also
>>> not always possible to distinguish options meant to be consumed at pipeline
>>> construction time, by the runner, by the SDK harness, by the user code or
>>> any combination -- and this causes confusion every now and then.
>>>
>>> For Dataflow, we have been using "experiments" for arbitrary
>>> runner-specific options. It's simply a string list pipeline option that all
>>> SDKs support and, for Go at least, is sent to portable runners. Flink can
>>> do the same in the short term to move forward.
>>>
>>> Henning
>>>
>>>
>>> On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <th...@apache.org> wrote:
>>>
>>>> [moving to the list]
>>>>
>>>> The requirement driving this part of the change was to allow a user to
>>>> specify pipeline options that a runner supports without having to declare
>>>> those in each language SDK.
>>>>
>>>> In the specific scenario, we have options that the Flink runner
>>>> supports (and can validate), that are not enumerated in the Python SDK.
>>>>
>>>> I think we have a bigger problem scoping pipeline options. For example,
>>>> the runner options are dumped into the SDK worker. There is also a
>>>> possibility of name collisions. So I think this would benefit from broader
>>>> feedback.
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> ---------- Forwarded message ---------
>>>> From: Charles Chen <no...@github.com>
>>>> Date: Fri, Oct 12, 2018 at 8:36 AM
>>>> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown options
>>>> in a list argument (#6600)
>>>> To: apache/beam <be...@noreply.github.com>
>>>> Cc: Thomas Weise <th...@gmail.com>, Mention <
>>>> mention@noreply.github.com>
>>>>
>>>>
>>>> CC: @tweise <https://github.com/tweise>
>>>>
>>>>
>>>
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Ahmet Altay <al...@google.com>.

On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen <cc...@google.com> wrote:

> For context, I made comments on https://github.com/apache/beam/pull/6600
> noting that the changes being made were not good for Beam
> backwards-compatibility.  The change as is allows users to use pipeline
> options without explicitly defining them, which is not the type of usage we
> would like to encourage since we prefer to be explicit whenever possible.
> If users write pipelines with this sort of pattern, they will potentially
> encounter pain when upgrading to a later version since this is an
> implementation detail and not an officially supported pattern.  I agree
> with the comments above that this is ultimately a scoping issue.  I would
> not have a problem with these changes if they were explicitly scoped under
> either a runner or unparsed options namespace.
>
> As a second note, since the 2.8.0 release is being cut right now, because
> of these backwards-compatibility concerns, I would suggest reverting these
> changes, at least until 2.8.0 is cut, so we can have a discussion here
> before committing to and releasing any API-level changes.
>

+1 I would like to revert the changes in order not to rush this into the
release. Once this discussion results in an agreement, the changes can be
brought back.


>
> On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde <he...@google.com> wrote:
>
>> Agree that pipeline options lack some mechanism for scoping. It is also
>> not always possible to distinguish options meant to be consumed at pipeline
>> construction time, by the runner, by the SDK harness, by the user code or
>> any combination -- and this causes confusion every now and then.
>>
>> For Dataflow, we have been using "experiments" for arbitrary
>> runner-specific options. It's simply a string list pipeline option that all
>> SDKs support and, for Go at least, is sent to portable runners. Flink can
>> do the same in the short term to move forward.
>>
>> Henning
>>
>>
>> On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <th...@apache.org> wrote:
>>
>>> [moving to the list]
>>>
>>> The requirement driving this part of the change was to allow a user to
>>> specify pipeline options that a runner supports without having to declare
>>> those in each language SDK.
>>>
>>> In the specific scenario, we have options that the Flink runner supports
>>> (and can validate), that are not enumerated in the Python SDK.
>>>
>>> I think we have a bigger problem scoping pipeline options. For example,
>>> the runner options are dumped into the SDK worker. There is also a
>>> possibility of name collisions. So I think this would benefit from broader
>>> feedback.
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> ---------- Forwarded message ---------
>>> From: Charles Chen <no...@github.com>
>>> Date: Fri, Oct 12, 2018 at 8:36 AM
>>> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown options
>>> in a list argument (#6600)
>>> To: apache/beam <be...@noreply.github.com>
>>> Cc: Thomas Weise <th...@gmail.com>, Mention <
>>> mention@noreply.github.com>
>>>
>>>
>>> CC: @tweise <https://github.com/tweise>
>>>
>>>
>>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Charles Chen <cc...@google.com>.

For context, I made comments on https://github.com/apache/beam/pull/6600
noting that the changes being made were not good for Beam
backwards-compatibility.  The change as is allows users to use pipeline
options without explicitly defining them, which is not the type of usage we
would like to encourage since we prefer to be explicit whenever possible.
If users write pipelines with this sort of pattern, they will potentially
encounter pain when upgrading to a later version since this is an
implementation detail and not an officially supported pattern.  I agree
with the comments above that this is ultimately a scoping issue.  I would
not have a problem with these changes if they were explicitly scoped under
either a runner or unparsed options namespace.
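
To make the scoping idea concrete, here is a minimal sketch of what an
explicit runner namespace could look like. It is not the code in the PR and
not an existing Beam API: the function name parse_scoped_options, the
--runner_option flag and the option keys in the usage note are hypothetical
and used only for illustration. Runner-specific settings are collected under
one explicit flag, while anything undeclared is returned separately instead
of silently becoming a pipeline option.

```python
import argparse


def parse_scoped_options(argv):
    """Split argv into (runner, runner_options, unknown).

    Hypothetical sketch: '--runner_option' is an illustrative flag name,
    not an existing Beam pipeline option.
    """
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument('--runner', default=None)
    # Each occurrence carries one key=value pair scoped to the runner.
    parser.add_argument('--runner_option', action='append', default=None)
    known, unknown = parser.parse_known_args(argv)

    runner_options = {}
    for item in known.runner_option or []:
        key, sep, value = item.partition('=')
        if not key or not sep:
            raise ValueError('expected key=value, got %r' % item)
        runner_options[key] = value

    # Undeclared flags stay in 'unknown'; the caller decides whether to
    # reject them or pass them along.
    return known.runner, runner_options, unknown
```

With this shape, an invocation such as
--runner=FlinkRunner --runner_option=parallelism=2 --foo=bar yields the
runner name, the dict {'parallelism': '2'} and the leftover ['--foo=bar'],
so it stays visible where each option is going.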

As a second note, since the 2.8.0 release is being cut right now, because
of these backwards-compatibility concerns, I would suggest reverting these
changes, at least until 2.8.0 is cut, so we can have a discussion here
before committing to and releasing any API-level changes.

On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde <he...@google.com> wrote:

> Agree that pipeline options lack some mechanism for scoping. It is also
> not always possible to distinguish options meant to be consumed at pipeline
> construction time, by the runner, by the SDK harness, by the user code or
> any combination -- and this causes confusion every now and then.
>
> For Dataflow, we have been using "experiments" for arbitrary
> runner-specific options. It's simply a string list pipeline option that all
> SDKs support and, for Go at least, is sent to portable runners. Flink can
> do the same in the short term to move forward.
>
> Henning
>
>
> On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <th...@apache.org> wrote:
>
>> [moving to the list]
>>
>> The requirement driving this part of the change was to allow a user to
>> specify pipeline options that a runner supports without having to declare
>> those in each language SDK.
>>
>> In the specific scenario, we have options that the Flink runner supports
>> (and can validate), that are not enumerated in the Python SDK.
>>
>> I think we have a bigger problem scoping pipeline options. For example,
>> the runner options are dumped into the SDK worker. There is also a
>> possibility of name collisions. So I think this would benefit from broader
>> feedback.
>>
>> Thanks,
>> Thomas
>>
>>
>> ---------- Forwarded message ---------
>> From: Charles Chen <no...@github.com>
>> Date: Fri, Oct 12, 2018 at 8:36 AM
>> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown options in
>> a list argument (#6600)
>> To: apache/beam <be...@noreply.github.com>
>> Cc: Thomas Weise <th...@gmail.com>, Mention <
>> mention@noreply.github.com>
>>
>>
>> CC: @tweise <https://github.com/tweise>
>>
>>
>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

Posted by Henning Rohde <he...@google.com>.

Agree that pipeline options lack some mechanism for scoping. It is also not
always possible to distinguish options meant to be consumed at pipeline
construction time, by the runner, by the SDK harness, by the user code or
any combination -- and this causes confusion every now and then.

For Dataflow, we have been using "experiments" for arbitrary
runner-specific options. It's simply a string list pipeline option that all
SDKs support and, for Go at least, is sent to portable runners. Flink can
do the same in the short term to move forward.
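
As a rough sketch of the "string list" approach (an illustration under
assumptions, not the Dataflow or Flink implementation): a repeatable
--experiments flag is collected into a list of strings, and the runner later
looks up the entries it understands. The helper names parse_experiments and
experiment_value, and the entry names in the usage note, are made up for
this example.

```python
import argparse


def parse_experiments(argv):
    """Collect every --experiments occurrence into a list of strings."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument('--experiments', action='append', default=None)
    # Other flags are ignored here; only the experiments list is returned.
    known, _unknown = parser.parse_known_args(argv)
    return known.experiments or []


def experiment_value(experiments, name):
    """Look up one entry: 'name=value' gives value, bare 'name' gives True."""
    for entry in experiments:
        key, sep, value = entry.partition('=')
        if key == name:
            return value if sep else True
    return None
```

For example, --experiments=some_setting=4 --experiments=some_flag produces
['some_setting=4', 'some_flag']; the SDK only needs to know about the single
list option, and interpretation of the entries is left to the runner.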

Henning

On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise <th...@apache.org> wrote:

> [moving to the list]
>
> The requirement driving this part of the change was to allow a user to
> specify pipeline options that a runner supports without having to declare
> those in each language SDK.
>
> In the specific scenario, we have options that the Flink runner supports
> (and can validate), that are not enumerated in the Python SDK.
>
> I think we have a bigger problem scoping pipeline options. For example,
> the runner options are dumped into the SDK worker. There is also a
> possibility of name collisions. So I think this would benefit from broader
> feedback.
>
> Thanks,
> Thomas
>
>
> ---------- Forwarded message ---------
> From: Charles Chen <no...@github.com>
> Date: Fri, Oct 12, 2018 at 8:36 AM
> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown options in
> a list argument (#6600)
> To: apache/beam <be...@noreply.github.com>
> Cc: Thomas Weise <th...@gmail.com>, Mention <
> mention@noreply.github.com>
>
>
> CC: @tweise <https://github.com/tweise>
>
>