You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Anand Inguva <an...@google.com> on 2022/02/08 19:46:12 UTC

Inconsistent behavior when parsing boolean flags across different APIs in Python SDK

Hi all,

For the Python SDK, there is some inconsistency across different APIs when
it comes to parsing the boolean flags. The issue is briefly defined in the
doc[1].
*I would like to have a discussion* on this since it is causing *confusion*
among users when it comes to using boolean flags across different APIs.

Please go through it and provide feedback/comments on the issue.

Thanks,
Anand

[1]
https://docs.google.com/document/d/13n8IkonXvugBnDWocuaY-obQs4_oR70oZ4WVxewPuWI/edit?usp=sharing&resourcekey=0-tILu89gr-zXa_JcwHxiPYQ

Re: Inconsistent behavior when parsing boolean flags across different APIs in Python SDK

Posted by Anand Inguva <an...@google.com>.
+1 to @Valentyn Tymofieiev <va...@google.com> suggestions

To resolve the confusion around this issue, a better doc string needs to be
written for [1][2].

   - [1] accepts options_name + value
   - [2] accepts flag_name + value

If it encounters an invalid flag/option, a warning could be displayed on
how to use the respective APIs in a correct way.

For the flags like *use_public_ips*, which accepts three values
{unspecified, enabled, disabled}, I am + 1 to follow the approach mentioned
by @Valentyn Tymofieiev <va...@google.com>.

[1]
https://github.com/apache/beam/blob/1c6d43b2f2ea0638298d77ee6c9f510eb922bfb6/sdks/python/apache_beam/options/pipeline_options.py#L168
[2]
https://github.com/apache/beam/blob/1c6d43b2f2ea0638298d77ee6c9f510eb922bfb6/sdks/python/apache_beam/options/pipeline_options.py#L229

On Fri, Feb 11, 2022 at 3:56 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> Yes, users discovered the from_dictionary() API. I am seeing several
> instances in the internal codebase, so taking it away would be a breaking
> change for some users.
>
> Looked some more. I think you are right that the origin of this API was to
> have it for internal purposes, possibly as a counterpart for this code[1].
> So it is actually the current implementation of the from_dictionary that
> makes it work correctly only if passed flag names + values. The intent
> seems to be to take option names. No wonder the docstring does not mention
> that flag names are expected.
>
> That means, the current usage in [2] is susceptible to the same bug,
> specifically command_line -> parse_known_args() -> get_all_options() ->
> from_dictionary() -> parse_known_args()  will lose command_line options for
> which option name does not have a matching flag name, or needs a different
> flag, as is the case for Boolean options that allow 3 values. We just never
> had an option previously for which this difference mattered.
>
> That means that we should fix the from_dictionary implementation to accept
> option names as keys for affected flags. I think use_public_ips currently
> is the only option we have for which this would matter, so we could make a
> special treatment just for this option and add a test asserting that any
> new options that exhibit the same behavior, receive the special treatment.
>
> [1]
> https://github.com/apache/beam/blob/af5be6070b379ad875e430324f84bd9104999ecf/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py#L335-L338
> [2]
> https://github.com/apache/beam/blob/3ad05523f4cdf5122fc319276fcb461f768af39d/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L78
>
> On Fri, Feb 11, 2022 at 9:45 AM Brian Hulette <bh...@google.com> wrote:
>
>> Thanks for helping elucidate this Valentyn. I'm curious if users actually
>> use the from_dictionary() API directly though? I thought it really just
>> existed for getting pipeline options onto the worker [1]. Maybe we should
>> consider it an internal API, and have users prefer calls like options.as(MyClass).my_option
>> = 'foo' if they want to set pipeline options programmatically?
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/3ad05523f4cdf5122fc319276fcb461f768af39d/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L78
>>
>> On Thu, Feb 10, 2022 at 11:09 PM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> Thanks, Anand. I took a look at the doc and comments, and took
>>> another look at the apis.
>>>
>>> We have two separate concepts associated with each pipeline option:
>>> -  the option name, which SDK uses to retrieve the value of the option,
>>> and which we specify as 'dest' in the parser.
>>> -  the flag name(s), through which users set values of the option on the
>>> command line.
>>>
>>> Command line takes *flag names*. The PipelineOptions constructor takes
>>> a dictionary(as kwargs) of *option names + *values (in addition to the
>>> list of flags).
>>> The *from_dictionary() *method takes a dictionary of *flag names *+
>>> values. The expected inputs for programmatic apis may not be sufficiently
>>> documented  in the docstrings, which can confuse users when option names
>>> and flag names don't match. This is the case for Boolean flags, which by
>>> nature of argparse need to have at least 2 flag names if one needs to have
>>> 3 behaviors (unspecified, enabled, disabled).
>>>
>>> Reconciling the two apis to use the same input at this point, would
>>> create more confusion in addition to breaking changes, especially if we do
>>> it only for boolean options - since, the mismatch is there for all options
>>> that have option name different from flag name.
>>>
>>> I think, what we should do instead is, to make it clear to the user
>>> which API expects what keys through better docstring documentation, and if
>>> a user accidentally passes an option name  instead of a flag name  (or vice
>>> versa), instead of silently dropping the flag, we could print a message
>>> that explain that they likely used the API in a wrong way and give a
>>> suggestion.
>>>
>>> For example, if a user passes: use_public_ips=False to
>>> from_dictionary(), we can print smth like: "*Skipping Boolean flag
>>> use_public_ips=False. **Did you mean to pass no_use_public_ips=True
>>> instead ?*
>>>
>>> Likewise, if a user passes no_use_public_ips=True to PipelineOptions(),
>>> we can print a warning: "*Skipping unrecognized option *`no_use_public_ips`,
>>> did you mean to set `*use_public_ips`* option instead"?
>>>
>>> Hopefully, this would help clear the user confusion without making
>>> breaking changes and also clarify the api documentation.
>>>
>>> On Wed, Feb 9, 2022 at 12:20 PM Andy Ye <ye...@google.com> wrote:
>>>
>>>> Left some comments in the doc. Thanks for the thorough investigation
>>>> Anand!
>>>>
>>>> Best,
>>>> Andy
>>>>
>>>>
>>>>
>>>> On Wed, Feb 9, 2022 at 10:13 AM Kenneth Knowles <ke...@apache.org>
>>>> wrote:
>>>>
>>>>> Thanks for writing this up. This is an instant classic sort of
>>>>> bug/inconsistency.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Tue, Feb 8, 2022 at 11:46 AM Anand Inguva <an...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> For the Python SDK, there is some inconsistency across different APIs
>>>>>> when it comes to parsing the boolean flags. The issue is briefly defined in
>>>>>> the doc[1].
>>>>>> *I would like to have a discussion* on this since it is causing
>>>>>> *confusion* among users when it comes to using boolean flags across
>>>>>> different APIs.
>>>>>>
>>>>>> Please go through it and provide feedback/comments on the issue.
>>>>>>
>>>>>> Thanks,
>>>>>> Anand
>>>>>>
>>>>>> [1]
>>>>>> https://docs.google.com/document/d/13n8IkonXvugBnDWocuaY-obQs4_oR70oZ4WVxewPuWI/edit?usp=sharing&resourcekey=0-tILu89gr-zXa_JcwHxiPYQ
>>>>>>
>>>>>

Re: Inconsistent behavior when parsing boolean flags across different APIs in Python SDK

Posted by Valentyn Tymofieiev <va...@google.com>.
Yes, users discovered the from_dictionary() API. I am seeing several
instances in the internal codebase, so taking it away would be a breaking
change for some users.

Looked some more. I think you are right that the origin of this API was to
have it for internal purposes, possibly as a counterpart for this code[1].
So it is actually the current implementation of the from_dictionary that
makes it work correctly only if passed flag names + values. The intent
seems to be to take option names. No wonder the docstring does not mention
that flag names are expected.

That means, the current usage in [2] is susceptible to the same bug,
specifically command_line -> parse_known_args() -> get_all_options() ->
from_dictionary() -> parse_known_args()  will lose command_line options for
which option name does not have a matching flag name, or needs a different
flag, as is the case for Boolean options that allow 3 values. We just never
had an option previously for which this difference mattered.

That means that we should fix the from_dictionary implementation to accept
option names as keys for affected flags. I think use_public_ips currently
is the only option we have for which this would matter, so we could make a
special treatment just for this option and add a test asserting that any
new options that exhibit the same behavior, receive the special treatment.

[1]
https://github.com/apache/beam/blob/af5be6070b379ad875e430324f84bd9104999ecf/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py#L335-L338
[2]
https://github.com/apache/beam/blob/3ad05523f4cdf5122fc319276fcb461f768af39d/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L78

On Fri, Feb 11, 2022 at 9:45 AM Brian Hulette <bh...@google.com> wrote:

> Thanks for helping elucidate this Valentyn. I'm curious if users actually
> use the from_dictionary() API directly though? I thought it really just
> existed for getting pipeline options onto the worker [1]. Maybe we should
> consider it an internal API, and have users prefer calls like options.as(MyClass).my_option
> = 'foo' if they want to set pipeline options programmatically?
>
> Brian
>
> [1]
> https://github.com/apache/beam/blob/3ad05523f4cdf5122fc319276fcb461f768af39d/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L78
>
> On Thu, Feb 10, 2022 at 11:09 PM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> Thanks, Anand. I took a look at the doc and comments, and took
>> another look at the apis.
>>
>> We have two separate concepts associated with each pipeline option:
>> -  the option name, which SDK uses to retrieve the value of the option,
>> and which we specify as 'dest' in the parser.
>> -  the flag name(s), through which users set values of the option on the
>> command line.
>>
>> Command line takes *flag names*. The PipelineOptions constructor takes a
>> dictionary(as kwargs) of *option names + *values (in addition to the
>> list of flags).
>> The *from_dictionary() *method takes a dictionary of *flag names *+
>> values. The expected inputs for programmatic apis may not be sufficiently
>> documented  in the docstrings, which can confuse users when option names
>> and flag names don't match. This is the case for Boolean flags, which by
>> nature of argparse need to have at least 2 flag names if one needs to have
>> 3 behaviors (unspecified, enabled, disabled).
>>
>> Reconciling the two apis to use the same input at this point, would
>> create more confusion in addition to breaking changes, especially if we do
>> it only for boolean options - since, the mismatch is there for all options
>> that have option name different from flag name.
>>
>> I think, what we should do instead is, to make it clear to the user which
>> API expects what keys through better docstring documentation, and if a user
>> accidentally passes an option name  instead of a flag name  (or vice
>> versa), instead of silently dropping the flag, we could print a message
>> that explain that they likely used the API in a wrong way and give a
>> suggestion.
>>
>> For example, if a user passes: use_public_ips=False to from_dictionary(),
>> we can print smth like: "*Skipping Boolean flag use_public_ips=False. **Did
>> you mean to pass no_use_public_ips=True instead ?*
>>
>> Likewise, if a user passes no_use_public_ips=True to PipelineOptions(),
>> we can print a warning: "*Skipping unrecognized option *`no_use_public_ips`,
>> did you mean to set `*use_public_ips`* option instead"?
>>
>> Hopefully, this would help clear the user confusion without making
>> breaking changes and also clarify the api documentation.
>>
>> On Wed, Feb 9, 2022 at 12:20 PM Andy Ye <ye...@google.com> wrote:
>>
>>> Left some comments in the doc. Thanks for the thorough investigation
>>> Anand!
>>>
>>> Best,
>>> Andy
>>>
>>>
>>>
>>> On Wed, Feb 9, 2022 at 10:13 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>
>>>> Thanks for writing this up. This is an instant classic sort of
>>>> bug/inconsistency.
>>>>
>>>> Kenn
>>>>
>>>> On Tue, Feb 8, 2022 at 11:46 AM Anand Inguva <an...@google.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> For the Python SDK, there is some inconsistency across different APIs
>>>>> when it comes to parsing the boolean flags. The issue is briefly defined in
>>>>> the doc[1].
>>>>> *I would like to have a discussion* on this since it is causing
>>>>> *confusion* among users when it comes to using boolean flags across
>>>>> different APIs.
>>>>>
>>>>> Please go through it and provide feedback/comments on the issue.
>>>>>
>>>>> Thanks,
>>>>> Anand
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/document/d/13n8IkonXvugBnDWocuaY-obQs4_oR70oZ4WVxewPuWI/edit?usp=sharing&resourcekey=0-tILu89gr-zXa_JcwHxiPYQ
>>>>>
>>>>

Re: Inconsistent behavior when parsing boolean flags across different APIs in Python SDK

Posted by Brian Hulette <bh...@google.com>.
Thanks for helping elucidate this Valentyn. I'm curious if users actually
use the from_dictionary() API directly though? I thought it really just
existed for getting pipeline options onto the worker [1]. Maybe we should
consider it an internal API, and have users prefer calls like
options.as(MyClass).my_option
= 'foo' if they want to set pipeline options programmatically?

Brian

[1]
https://github.com/apache/beam/blob/3ad05523f4cdf5122fc319276fcb461f768af39d/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L78

On Thu, Feb 10, 2022 at 11:09 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> Thanks, Anand. I took a look at the doc and comments, and took
> another look at the apis.
>
> We have two separate concepts associated with each pipeline option:
> -  the option name, which SDK uses to retrieve the value of the option,
> and which we specify as 'dest' in the parser.
> -  the flag name(s), through which users set values of the option on the
> command line.
>
> Command line takes *flag names*. The PipelineOptions constructor takes a
> dictionary(as kwargs) of *option names + *values (in addition to the list
> of flags).
> The *from_dictionary() *method takes a dictionary of *flag names *+
> values. The expected inputs for programmatic apis may not be sufficiently
> documented  in the docstrings, which can confuse users when option names
> and flag names don't match. This is the case for Boolean flags, which by
> nature of argparse need to have at least 2 flag names if one needs to have
> 3 behaviors (unspecified, enabled, disabled).
>
> Reconciling the two apis to use the same input at this point, would create
> more confusion in addition to breaking changes, especially if we do it only
> for boolean options - since, the mismatch is there for all options that
> have option name different from flag name.
>
> I think, what we should do instead is, to make it clear to the user which
> API expects what keys through better docstring documentation, and if a user
> accidentally passes an option name  instead of a flag name  (or vice
> versa), instead of silently dropping the flag, we could print a message
> that explain that they likely used the API in a wrong way and give a
> suggestion.
>
> For example, if a user passes: use_public_ips=False to from_dictionary(),
> we can print smth like: "*Skipping Boolean flag use_public_ips=False. **Did
> you mean to pass no_use_public_ips=True instead ?*
>
> Likewise, if a user passes no_use_public_ips=True to PipelineOptions(), we
> can print a warning: "*Skipping unrecognized option *`no_use_public_ips`,
> did you mean to set `*use_public_ips`* option instead"?
>
> Hopefully, this would help clear the user confusion without making
> breaking changes and also clarify the api documentation.
>
> On Wed, Feb 9, 2022 at 12:20 PM Andy Ye <ye...@google.com> wrote:
>
>> Left some comments in the doc. Thanks for the thorough investigation
>> Anand!
>>
>> Best,
>> Andy
>>
>>
>>
>> On Wed, Feb 9, 2022 at 10:13 AM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>> Thanks for writing this up. This is an instant classic sort of
>>> bug/inconsistency.
>>>
>>> Kenn
>>>
>>> On Tue, Feb 8, 2022 at 11:46 AM Anand Inguva <an...@google.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> For the Python SDK, there is some inconsistency across different APIs
>>>> when it comes to parsing the boolean flags. The issue is briefly defined in
>>>> the doc[1].
>>>> *I would like to have a discussion* on this since it is causing
>>>> *confusion* among users when it comes to using boolean flags across
>>>> different APIs.
>>>>
>>>> Please go through it and provide feedback/comments on the issue.
>>>>
>>>> Thanks,
>>>> Anand
>>>>
>>>> [1]
>>>> https://docs.google.com/document/d/13n8IkonXvugBnDWocuaY-obQs4_oR70oZ4WVxewPuWI/edit?usp=sharing&resourcekey=0-tILu89gr-zXa_JcwHxiPYQ
>>>>
>>>

Re: Inconsistent behavior when parsing boolean flags across different APIs in Python SDK

Posted by Valentyn Tymofieiev <va...@google.com>.
Thanks, Anand. I took a look at the doc and comments, and took another look
at the apis.

We have two separate concepts associated with each pipeline option:
-  the option name, which SDK uses to retrieve the value of the option, and
which we specify as 'dest' in the parser.
-  the flag name(s), through which users set values of the option on the
command line.

Command line takes *flag names*. The PipelineOptions constructor takes a
dictionary(as kwargs) of *option names + *values (in addition to the list
of flags).
The *from_dictionary() *method takes a dictionary of *flag names *+ values.
The expected inputs for programmatic apis may not be sufficiently
documented  in the docstrings, which can confuse users when option names
and flag names don't match. This is the case for Boolean flags, which by
nature of argparse need to have at least 2 flag names if one needs to have
3 behaviors (unspecified, enabled, disabled).

Reconciling the two apis to use the same input at this point, would create
more confusion in addition to breaking changes, especially if we do it only
for boolean options - since, the mismatch is there for all options that
have option name different from flag name.

I think, what we should do instead is, to make it clear to the user which
API expects what keys through better docstring documentation, and if a user
accidentally passes an option name  instead of a flag name  (or vice
versa), instead of silently dropping the flag, we could print a message
that explain that they likely used the API in a wrong way and give a
suggestion.

For example, if a user passes: use_public_ips=False to from_dictionary(),
we can print smth like: "*Skipping Boolean flag use_public_ips=False. **Did
you mean to pass no_use_public_ips=True instead ?*

Likewise, if a user passes no_use_public_ips=True to PipelineOptions(), we
can print a warning: "*Skipping unrecognized option *`no_use_public_ips`,
did you mean to set `*use_public_ips`* option instead"?

Hopefully, this would help clear the user confusion without making breaking
changes and also clarify the api documentation.

On Wed, Feb 9, 2022 at 12:20 PM Andy Ye <ye...@google.com> wrote:

> Left some comments in the doc. Thanks for the thorough investigation Anand!
>
> Best,
> Andy
>
>
>
> On Wed, Feb 9, 2022 at 10:13 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>> Thanks for writing this up. This is an instant classic sort of
>> bug/inconsistency.
>>
>> Kenn
>>
>> On Tue, Feb 8, 2022 at 11:46 AM Anand Inguva <an...@google.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> For the Python SDK, there is some inconsistency across different APIs
>>> when it comes to parsing the boolean flags. The issue is briefly defined in
>>> the doc[1].
>>> *I would like to have a discussion* on this since it is causing
>>> *confusion* among users when it comes to using boolean flags across
>>> different APIs.
>>>
>>> Please go through it and provide feedback/comments on the issue.
>>>
>>> Thanks,
>>> Anand
>>>
>>> [1]
>>> https://docs.google.com/document/d/13n8IkonXvugBnDWocuaY-obQs4_oR70oZ4WVxewPuWI/edit?usp=sharing&resourcekey=0-tILu89gr-zXa_JcwHxiPYQ
>>>
>>

Re: Inconsistent behavior when parsing boolean flags across different APIs in Python SDK

Posted by Andy Ye <ye...@google.com>.
Left some comments in the doc. Thanks for the thorough investigation Anand!

Best,
Andy



On Wed, Feb 9, 2022 at 10:13 AM Kenneth Knowles <ke...@apache.org> wrote:

> Thanks for writing this up. This is an instant classic sort of
> bug/inconsistency.
>
> Kenn
>
> On Tue, Feb 8, 2022 at 11:46 AM Anand Inguva <an...@google.com>
> wrote:
>
>> Hi all,
>>
>> For the Python SDK, there is some inconsistency across different APIs
>> when it comes to parsing the boolean flags. The issue is briefly defined in
>> the doc[1].
>> *I would like to have a discussion* on this since it is causing
>> *confusion* among users when it comes to using boolean flags across
>> different APIs.
>>
>> Please go through it and provide feedback/comments on the issue.
>>
>> Thanks,
>> Anand
>>
>> [1]
>> https://docs.google.com/document/d/13n8IkonXvugBnDWocuaY-obQs4_oR70oZ4WVxewPuWI/edit?usp=sharing&resourcekey=0-tILu89gr-zXa_JcwHxiPYQ
>>
>

Re: Inconsistent behavior when parsing boolean flags across different APIs in Python SDK

Posted by Kenneth Knowles <ke...@apache.org>.
Thanks for writing this up. This is an instant classic sort of
bug/inconsistency.

Kenn

On Tue, Feb 8, 2022 at 11:46 AM Anand Inguva <an...@google.com> wrote:

> Hi all,
>
> For the Python SDK, there is some inconsistency across different APIs when
> it comes to parsing the boolean flags. The issue is briefly defined in the
> doc[1].
> *I would like to have a discussion* on this since it is causing
> *confusion* among users when it comes to using boolean flags across
> different APIs.
>
> Please go through it and provide feedback/comments on the issue.
>
> Thanks,
> Anand
>
> [1]
> https://docs.google.com/document/d/13n8IkonXvugBnDWocuaY-obQs4_oR70oZ4WVxewPuWI/edit?usp=sharing&resourcekey=0-tILu89gr-zXa_JcwHxiPYQ
>