Posted to dev@beam.apache.org by Romain Manni-Bucau <rm...@gmail.com> on 2018/03/20 16:25:27 UTC

PipelineOptions strict mode broken?

Hi guys,

PipelineOptionsFactory has a nice strict mode validating the options you
pass.

Concretely, if you pass --sudoMakeItWork you will likely see:

java.lang.IllegalArgumentException: Class interface
org.apache.beam.sdk.options.PipelineOptions missing a property named '
sudoMakeItWork'.
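A minimal standalone sketch of this kind of strict check (illustrative only,
not Beam's actual implementation; the property names are made up):

```java
import java.util.Map;
import java.util.Set;

// Simplified model of the strict mode: any flag that does not match a
// registered property fails fast with an IllegalArgumentException,
// similar in spirit to what PipelineOptionsFactory reports.
class StrictOptionsCheck {
    // Properties "registered" at parse time (Beam derives these from the
    // getters of the PipelineOptions interfaces it knows about).
    static final Set<String> KNOWN = Set.of("runner", "jobName", "tempLocation");

    static void validate(Map<String, String> parsedArgs) {
        for (String name : parsedArgs.keySet()) {
            if (!KNOWN.contains(name)) {
                throw new IllegalArgumentException(
                    "missing a property named '" + name + "'");
            }
        }
    }
}
```

The key point is that KNOWN is fixed at registration time, which is exactly
where the classloader and cross-option issues below come from.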

This is good in principle; however, the way it is implemented just doesn't
work, and its design is flawed:

1. the pipeline options factory relies on a cache, so only the options
available when the class is instantiated can be used. For example, in a
container (web container, OSGi, or anything else without a flat classpath)
you will not be able to load the IO options if the IOs are not in the same
classloader as the SDK core, which is quite a pitfall.
2. the validation leaks between options. The validation is not "the options
are valid" but "some registered option matches your parameter"

A case which is broken but "green" today is (not using exact names for the
example):

--runner=DirectRunner --sparkMaster=spark://localhost

this will pass if I have both runners in the classpath, but it should
actually fail because one option is invalid for the selected runner.

To fix that I see 3 options:

1. relax the strict mode to be off by default and lazily evaluate the
options
2. don't allow lazily casting the options; instead enforce declaring them
when the instance is created, which means a user must know all the
PipelineOptions children they rely on at creation time
(PipelineOptionsFactory.fromArgs(args).create(DirectOptions.class,
S3Options.class, MyIOOptions.class))
3. use a namespace/prefix for nested options, so the previous example
would become:

--runner=DirectRunner --spark.master=spark://localhost

in this case the PipelineOptions instantiation would validate the empty
prefix "" and, if SparkOptions are requested, they would validate the
spark.* prefix.
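A hypothetical sketch of proposal 3 (the parsing and validation code here
is made up for illustration, not an actual Beam API): options are grouped
by prefix, the root "" namespace is validated eagerly, and a namespaced
group like spark.* is only validated when the matching options are
requested.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative model of namespaced pipeline options.
class NamespacedOptions {
    final Map<String, Map<String, String>> byNamespace = new LinkedHashMap<>();

    static NamespacedOptions parse(String... args) {
        NamespacedOptions opts = new NamespacedOptions();
        for (String arg : args) {
            String[] kv = arg.substring(2).split("=", 2); // strip leading "--"
            String key = kv[0];
            int dot = key.indexOf('.');
            String ns = dot < 0 ? "" : key.substring(0, dot);
            String name = dot < 0 ? key : key.substring(dot + 1);
            opts.byNamespace
                .computeIfAbsent(ns, k -> new LinkedHashMap<>())
                .put(name, kv[1]);
        }
        return opts;
    }

    // Validate one namespace against the properties its owner declares;
    // called with "" at creation time and with e.g. "spark" only when
    // SparkOptions are actually requested.
    void validate(String ns, Set<String> knownProperties) {
        for (String name : byNamespace.getOrDefault(ns, Map.of()).keySet()) {
            if (!knownProperties.contains(name)) {
                throw new IllegalArgumentException(
                    "namespace '" + ns + "' has no property named '" + name + "'");
            }
        }
    }
}
```

With this split, --sparkMaster could no longer silently pass just because
SparkOptions happen to be on the classpath: spark.* is only checked in the
spark namespace, against Spark's own properties.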

This is not perfect, but the current state, with a single eager registry
for validation, is very hard to use as soon as you try to industrialize
Beam in anything other than a plain main() :(.

Does anyone have an idea that doesn't break the API?


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

Re: PipelineOptions strict mode broken?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
2018-03-20 17:53 GMT+01:00 Lukasz Cwik <lc...@google.com>:

> The only current validator is the @Required validator; there were some
> ideas to integrate another system to perform validation on options, like >=0
> for numbers. I'm not sure how much use this has gotten from users; I would
> be for leaving it as is (if users get value out of it) or removing it and
> marking it deprecated for removal in the next major version.
>
> Have you taken a look at PipelineOptionsFactory.register(...)?
> It allows you to register all PipelineOptions interfaces for
> parsing/validation reasons similar to your suggestion for
> PipelineOptionsFactory.fromArgs(args).create(DirectOptions.class,
> S3Options.class, MyIOOptions.class). Note that the user can also have their
> PipelineOptions interface extend all interfaces that they want to ensure
> are visible.
>

Sure, this is linked to the discussion we had about the cache in the
PipelineOptionsFactory PR. If it moved to something no longer cached it
would work out of the box, but this requires deeper work since the cache is
assumed at the moment.

Also note that it would require one lookup per classloader, and the result
is potentially not shared between IOs (which is fine, but not as
straightforward as reading these lines might suggest).
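A sketch of what "one lookup per classloader" could look like (hypothetical
design, not Beam's implementation): registrations are scoped to the
classloader that performs them, and lookups walk the parent chain so
options registered by the SDK core stay visible to child classloaders.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative per-classloader registry instead of a single static one.
class PerClassLoaderRegistry {
    // WeakHashMap so an undeployed container classloader can be collected.
    static final Map<ClassLoader, Set<String>> REGISTRY =
        Collections.synchronizedMap(new WeakHashMap<>());

    static void register(String optionsInterfaceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        REGISTRY.computeIfAbsent(cl, k -> ConcurrentHashMap.newKeySet())
                .add(optionsInterfaceName);
    }

    static Set<String> visibleOptions() {
        // Walk up the classloader chain: a child (e.g. a webapp or an IO
        // bundle) sees its own options plus everything its parents registered,
        // but siblings do not see each other's registrations.
        Set<String> result = ConcurrentHashMap.newKeySet();
        for (ClassLoader cl = Thread.currentThread().getContextClassLoader();
             cl != null; cl = cl.getParent()) {
            result.addAll(REGISTRY.getOrDefault(cl, Set.of()));
        }
        return result;
    }
}
```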


>
> Note that PipelineOptions was created to be a flat bag mapping strings to
> type-safe accessors/serializers/deserializers and nothing more. There has
> been little discussion about making it hierarchical and what that would mean.
>
> Also, wouldn't making PipelineOptionsFactory not a static singleton (as I
> suggested on your PR) address most of the classloader issues that you
> speak of?
>

Partially: the "find it" case would work, but the wrongly-passing
validation case would still be silent and not obvious if we don't split the
configs into namespaces, which are easier to understand. That said, it
would no longer be a blocker; you are right.



Re: PipelineOptions strict mode broken?

Posted by Lukasz Cwik <lc...@google.com>.
The only current validator is the @Required validator; there were some
ideas to integrate another system to perform validation on options, like >=0
for numbers. I'm not sure how much use this has gotten from users; I would
be for leaving it as is (if users get value out of it) or removing it and
marking it deprecated for removal in the next major version.

Have you taken a look at PipelineOptionsFactory.register(...)?
It allows you to register all PipelineOptions interfaces for
parsing/validation purposes, similar to your suggestion of
PipelineOptionsFactory.fromArgs(args).create(DirectOptions.class,
S3Options.class, MyIOOptions.class). Note that the user can also have their
PipelineOptions interface extend all the interfaces that they want to ensure
are visible.
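The "extend all the interfaces you need" approach works because the factory
discovers properties from the getters of the requested interface, and Java
reflection sees inherited getters too. A standalone sketch of that
mechanism (the interfaces below are stand-ins for e.g. DirectOptions or
S3Options, not real Beam types):

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

// One combined interface makes every parent interface's options visible.
class CombinedOptionsDemo {
    interface DirectishOptions { String getRunner(); }
    interface S3ishOptions { String getAwsRegion(); }
    interface MyOptions extends DirectishOptions, S3ishOptions {
        String getInputFile();
    }

    // Collect property names from the getters, including inherited ones,
    // roughly how an options factory would enumerate known properties.
    static Set<String> properties(Class<?> optionsInterface) {
        Set<String> names = new TreeSet<>();
        for (Method m : optionsInterface.getMethods()) { // includes superinterfaces
            if (m.getName().startsWith("get")) {
                names.add(m.getName().substring(3));
            }
        }
        return names;
    }
}
```

Requesting MyOptions therefore exposes Runner, AwsRegion, and InputFile at
once, which is what makes the single combined interface a workaround for
the registration problem.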

Note that PipelineOptions was created to be a flat bag mapping strings to
type-safe accessors/serializers/deserializers and nothing more. There has
been little discussion about making it hierarchical and what that would mean.

Also, wouldn't making PipelineOptionsFactory not a static singleton (as I
suggested on your PR) address most of the classloader issues that you
speak of?

