Posted to dev@beam.apache.org by Sam Rohde <sr...@google.com> on 2020/03/24 20:07:07 UTC

[BEAM-9322] Python SDK discussion on correct output tag names

Hi All,

*Problem*
I would like to discuss BEAM-9322
<https://issues.apache.org/jira/projects/BEAM/issues/BEAM-9322> and the
correct way to set the output tags of a transform with nested PCollections,
e.g. a dict of PCollections or a tuple of dicts of PCollections. Before the
fix for BEAM-1833 <https://issues.apache.org/jira/browse/BEAM-1833>, the
Python SDK, when applying a PTransform, would auto-generate the output tags
for the output PCollections even if they were manually set by the user:

class MyComposite(beam.PTransform):
  def expand(self, pcoll):
    a = PCollection.from_(pcoll)
    a.tag = 'a'

    b = PCollection.from_(pcoll)
    b.tag = 'b'
    return (a, b)

would yield a PTransform with two output PCollections whose output tags
were 'None' and '0' instead of 'a' and 'b'. This was corrected for simple
cases like this one. However, the fix fails when the PCollections share the
same output tag, which can happen like so:

class MyComposite(beam.PTransform):
  def expand(self, pcoll):
    partition_1 = beam.Partition(pcoll, ...)
    partition_2 = beam.Partition(pcoll, ...)
    return (partition_1[0], partition_2[0])

With the new code, this leads to an error because both output PCollections
have an output tag of '0'.
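
For concreteness, here is a hedged sketch of that second case using the usual
pcoll | beam.Partition(fn, n) idiom; the partition functions and step labels
are illustrative, not taken from the original report:

import apache_beam as beam

class MyComposite(beam.PTransform):
  def expand(self, pcoll):
    # Each Partition application yields an indexable DoOutputsTuple whose
    # sub-collections carry the tags '0', '1', ...
    by_parity = pcoll | 'ByParity' >> beam.Partition(lambda x, _: x % 2, 2)
    by_sign = pcoll | 'BySign' >> beam.Partition(
        lambda x, _: 0 if x >= 0 else 1, 2)
    # Both selected PCollections carry the tag '0', so naming the composite's
    # outputs from those tags collides.
    return (by_parity[0], by_sign[0])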

*Proposal*
When applying PTransforms to a pipeline (pipeline.py:550) we name the
PCollections according to their position in the tree concatenated with the
PCollection tag and a delimiter. From the first example, the output
PCollections of the applied transform will be: '0.a' and '1.b' because it
is a tuple of PCollections. In the second example, the outputs should be:
'0.0' and '1.0'. In the case of a dict of PCollections, it should simply be
the keys of the dict.
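
As a concrete reading of the proposal for the flat cases above, here is a
hedged sketch (the helper name is invented and this is not the actual
pipeline.py change):

def name_outputs(result):
  # Proposed rule: dict keys directly; otherwise tuple position + '.' + tag.
  if isinstance(result, dict):
    return dict(result)
  if isinstance(result, (tuple, list)):
    # '0.a' and '1.b' for the first example, '0.0' and '1.0' for the second.
    return {'%d.%s' % (i, pc.tag): pc for i, pc in enumerate(result)}
  # A single PCollection keeps its own tag.
  return {str(result.tag): result}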

What do you think? Am I missing edge cases? Will this be unexpected to
users? Will this break people who rely on the generated PCollection output
tags?

Regards,
Sam

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Robert Bradshaw <ro...@google.com>.
On Tue, Mar 24, 2020 at 1:07 PM Sam Rohde <sr...@google.com> wrote:
>
> Hi All,
>
> Problem
> I would like to discuss BEAM-9322 and the correct way to set the output tags of a transform with nested PCollections, e.g. a dict of PCollections, a tuple of dicts of PCollections. Before the fixing of BEAM-1833, the Python SDK when applying a PTransform would auto-generate the output tags for the output PCollections even if they are manually set by the user:
>
> class MyComposite(beam.PTransform):
>   def expand(self, pcoll):
>     a = PCollection.from_(pcoll)
>     a.tag = 'a'
>
>     b = PCollection.from_(pcoll)
>     b.tag = 'b'
>     return (a, b)
>
> would yield a PTransform with two output PCollection and output tags with 'None' and '0' instead of 'a' and 'b'. This was corrected for simple cases like this.

I don't even know what that composite PTransform does.
PCollection.from_ should not be a user-facing method; it looks like
it's a helper used for defining primitives.

> However, this fails when the PCollections share the same output tag (of course). This can happen like so:
>
> class MyComposite(beam.PTransform):
>   def expand(self, pcoll):
>     partition_1 = beam.Partition(pcoll, ...)
>     partition_2 = beam.Partition(pcoll, ...)
>     return (partition_1[0], partition_2[0])
>
> With the new code, this leads to an error because both output PCollections have an output tag of '0'.

We should not be using pcollection tags to name the outputs of
composite PCollections. PCollection.tag is only relevant to the
primitive transform producing that PCollection. Here we have three
PTransforms

   Partition1.outputs = {0: partition_1, ...}
   Partition2.outputs = {0: partition_2, ...}
   MyComposite.outputs = {0: partition_1, 1: partition_2}

I'm not seeing what the issue is. (That being said, I think there *is*
a lot to clean up here.)

> Proposal
> When applying PTransforms to a pipeline (pipeline.py:550) we name the PCollections according to their position in the tree concatenated with the PCollection tag and a delimiter. From the first example, the output PCollections of the applied transform will be: '0.a' and '1.b' because it is a tuple of PCollections. In the second example, the outputs should be: '0.0' and '1.0'. In the case of a dict of PCollections, it should simply be the keys of the dict.
>
> What do you think? Am I missing edge cases? Will this be unexpected to users? Will this break people who rely on the generated PCollection output tags?

The user currently has no control of the tags used for anything but
multi-output ParDo operations.
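
For reference, that one existing mechanism looks roughly like this hedged
sketch (the DoFn and tag names are illustrative):

import apache_beam as beam

class TagEvenOdd(beam.DoFn):
  def process(self, element):
    tag = 'even' if element % 2 == 0 else 'odd'
    yield beam.pvalue.TaggedOutput(tag, element)

# results = pcoll | beam.ParDo(TagEvenOdd()).with_outputs('even', 'odd')
# results.even and results.odd are the explicitly tagged output PCollections.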

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Sam Rohde <sr...@google.com>.
On Wed, Apr 1, 2020 at 3:34 PM Robert Bradshaw <ro...@google.com> wrote:

> E.g. something like https://github.com/apache/beam/pull/11283
>

That looks good. Though it may be better to do the change under the
"passthrough_pcollection_output_ids" experiment flag above it (I'll comment
on this in the PR too).
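
A hedged sketch of gating the change behind that experiment via the standard
--experiments pipeline option (the surrounding logic is invented):

from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions

options = PipelineOptions(['--experiments=passthrough_pcollection_output_ids'])
experiments = options.view_as(DebugOptions).experiments or []
if 'passthrough_pcollection_output_ids' in experiments:
  pass  # keep the tags returned by expand() instead of regenerating them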


> On Wed, Apr 1, 2020 at 2:57 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Wed, Apr 1, 2020 at 1:48 PM Sam Rohde <sr...@google.com> wrote:
>>
>>> To restate the original issue it is that the current method of setting
>>> the output tags on PCollections from composites drops the tag information
>>> of the returned PCollections.
>>>
>>
>> Composite PTransforms should *not* be setting output tags on
>> returned PCollections; this will break producing outputs from the actual
>> primitive that produces them.
>>
>>
>>> So an expand returning a dict will have its outputs labeled as None, 1,
>>> ..., len(outputs). This is broken because embedded payloads in composites
>>> won't be able to reference the outputs if they differ from the returned
>>> value.
>>>
>>
>> Yes, we need a way for composites to declare their output tags. Currently
>> this is only supported for the multi-output ParDo primitive.
>>
>
>>
>>> In this case, having the restriction of no nesting decreases technical
>>> complexity substantially and always giving the tag unambiguously informs
>>> the SDK what to name the outputs. We can also allow for arbitrary nesting,
>>> we'll just have to figure out an unambiguous naming scheme for the
>>> PCollections.
>>>
>>
>> How about this: if the returned PValue is a dictionary of string ->
>> PCollection, we use the keys as tags. We can extend this naturally to
>> tuples, named tuples, nesting, etc. (though I don't know if there are any
>> hidden assumptions left about having an output labeled None if we want to
>> push this through to completion).
>>
>>
Sorry for my imperfect vocabulary; this is roughly what I was trying (and
failing) to propose.



>
>>>
>>>
>>> On Wed, Apr 1, 2020 at 12:44 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> I'm -1 on this, it adds additional restrictions and I don't see what
>>>> this buys us. (In particular, it doesn't address the original issue.)
>>>>
>>>> On Wed, Apr 1, 2020 at 12:41 PM Sam Rohde <sr...@google.com> wrote:
>>>>
>>>>> So then how does the proposal sound?
>>>>>
>>>>> Writing again here:
>>>>> PTransform.expand: (...) -> Union[PValue, NamedTuple[str,
>>>>> PCollection], Tuple[str, PCollection], Dict[str, PCollection],
>>>>> DoOutputsTuple]
>>>>>
>>>>> i.e. no arbitrary nesting when outputting from an expand
>>>>>
>>>>> On Tue, Mar 31, 2020 at 5:15 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> On Tue, Mar 31, 2020 at 4:13 PM Luke Cwik <lc...@google.com> wrote:
>>>>>> >
>>>>>> > It is important that composites know how things are named so that
>>>>>> any embedded payloads in the composite PTransform can reference the outputs
>>>>>> appropriately.
>>>>>>
>>>>>> Very good point. This is part of the cleanup to treat inputs and
>>>>>> outputs of PCollections as maps rather than lists generally across the
>>>>>> Python representations (which also relates to some of the ugliness
>>>>>> that Cham has been running into with cross-language).
>>>>>>
>>>>>> > On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw <
>>>>>> robertwb@google.com> wrote:
>>>>>> >>
>>>>>> >> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <sr...@google.com>
>>>>>> wrote:
>>>>>> >> >>>
>>>>>> >> >>> * Don't allow arbitrary nestings returned during expansion,
>>>>>> force composite transforms to always provide an unambiguous name (either a
>>>>>> tuple with PCollections with unique tags or a dictionary with untagged
>>>>>> PCollections or a singular PCollection (Java and Go SDKs do this)).
>>>>>> >> >>
>>>>>> >> >> I believe that aligning with Java and Go would be the right way
>>>>>> to go here. I don't know if this would limit expressiveness.
>>>>>> >> >
>>>>>> >> > Yeah this sounds like a much more elegant way of handling this
>>>>>> situation. I would lean towards this limiting expressiveness because there
>>>>>> would be a limit to nesting, but I think that the trade-off with reducing
>>>>>> complexity is worth it.
>>>>>> >> >
>>>>>> >> > So in summary it could be:
>>>>>> >> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str,
>>>>>> PCollection], Tuple[str, PCollection], Dict[str, PCollection],
>>>>>> DoOutputsTuple]
>>>>>> >> >
>>>>>> >> > With the expectation that (pseudo-code):
>>>>>> >> > a_transform = ATransform()
>>>>>> >> >
>>>>>> ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() ==
>>>>>> a_transform.outputs.keys()
>>>>>> >> >
>>>>>> >> > Since this changes the Python SDK composite transform API, what
>>>>>> would be the next steps for the community to come to a consensus on this?
>>>>>> >>
>>>>>> >> It seems here we're conflating the nesting of PValue results with
>>>>>> the
>>>>>> >> nesting of composite operations.
>>>>>> >>
>>>>>> >> Both examples in the original post have PTransform nesting (a
>>>>>> >> composite) returning a flat tuple. This is completely orthogonal to
>>>>>> >> the idea of a PTransform returning a nested result (such as (pc1,
>>>>>> >> (pc2, pc3))) and forbidding the latter won't solve the former.
>>>>>> >>
>>>>>> >> Currently, with the exception of explicit names given for
>>>>>> multi-output
>>>>>> >> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3,
>>>>>> ...
>>>>>> >> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no
>>>>>> matter
>>>>>> >> the nesting. We could do better, e.g. for the example above, label
>>>>>> >> them "0", "1.0", "1.1", or use the keys in the returned dict, but
>>>>>> this
>>>>>> >> is separate from the idea of trying to relate the output tags of
>>>>>> >> composites to the output tags of their inner transforms.
>>>>>> >>
>>>>>> >> - Robert
>>>>>>
>>>>>

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Robert Bradshaw <ro...@google.com>.
E.g. something like https://github.com/apache/beam/pull/11283

On Wed, Apr 1, 2020 at 2:57 PM Robert Bradshaw <ro...@google.com> wrote:

> On Wed, Apr 1, 2020 at 1:48 PM Sam Rohde <sr...@google.com> wrote:
>
>> To restate the original issue it is that the current method of setting
>> the output tags on PCollections from composites drops the tag information
>> of the returned PCollections.
>>
>
> Composite PTransforms should *not* be setting output tags on
> returned PCollections; this will break producing outputs from the actual
> primitive that produces them.
>
>
>> So an expand returning a dict will have its outputs labeled as None, 1,
>> ..., len(outputs). This is broken because embedded payloads in composites
>> won't be able to reference the outputs if they differ from the returned
>> value.
>>
>
> Yes, we need a way for composites to declare their output tags. Currently
> this is only supported for the multi-output ParDo primitive.
>
>
>> In this case, having the restriction of no nesting decreases technical
>> complexity substantially and always giving the tag unambiguously informs
>> the SDK what to name the outputs. We can also allow for arbitrary nesting,
>> we'll just have to figure out an unambiguous naming scheme for the
>> PCollections.
>>
>
> How about this: if the returned PValue is a dictionary of string ->
> PCollection, we use the keys as tags. We can extend this naturally to
> tuples, named tuples, nesting, etc. (though I don't know if there are any
> hidden assumptions left about having an output labeled None if we want to
> push this through to completion).
>
>
>>
>>
>>
>> On Wed, Apr 1, 2020 at 12:44 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> I'm -1 on this, it adds additional restrictions and I don't see what
>>> this buys us. (In particular, it doesn't address the original issue.)
>>>
>>> On Wed, Apr 1, 2020 at 12:41 PM Sam Rohde <sr...@google.com> wrote:
>>>
>>>> So then how does the proposal sound?
>>>>
>>>> Writing again here:
>>>> PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection],
>>>> Tuple[str, PCollection], Dict[str, PCollection], DoOutputsTuple]
>>>>
>>>> i.e. no arbitrary nesting when outputting from an expand
>>>>
>>>> On Tue, Mar 31, 2020 at 5:15 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> On Tue, Mar 31, 2020 at 4:13 PM Luke Cwik <lc...@google.com> wrote:
>>>>> >
>>>>> > It is important that composites know how things are named so that
>>>>> any embedded payloads in the composite PTransform can reference the outputs
>>>>> appropriately.
>>>>>
>>>>> Very good point. This is part of the cleanup to treat inputs and
>>>>> outputs of PCollections as maps rather than lists generally across the
>>>>> Python representations (which also relates to some of the ugliness
>>>>> that Cham has been running into with cross-language).
>>>>>
>>>>> > On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>> >>
>>>>> >> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <sr...@google.com>
>>>>> wrote:
>>>>> >> >>>
>>>>> >> >>> * Don't allow arbitrary nestings returned during expansion,
>>>>> force composite transforms to always provide an unambiguous name (either a
>>>>> tuple with PCollections with unique tags or a dictionary with untagged
>>>>> PCollections or a singular PCollection (Java and Go SDKs do this)).
>>>>> >> >>
>>>>> >> >> I believe that aligning with Java and Go would be the right way
>>>>> to go here. I don't know if this would limit expressiveness.
>>>>> >> >
>>>>> >> > Yeah this sounds like a much more elegant way of handling this
>>>>> situation. I would lean towards this limiting expressiveness because there
>>>>> would be a limit to nesting, but I think that the trade-off with reducing
>>>>> complexity is worth it.
>>>>> >> >
>>>>> >> > So in summary it could be:
>>>>> >> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str,
>>>>> PCollection], Tuple[str, PCollection], Dict[str, PCollection],
>>>>> DoOutputsTuple]
>>>>> >> >
>>>>> >> > With the expectation that (pseudo-code):
>>>>> >> > a_transform = ATransform()
>>>>> >> >
>>>>> ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() ==
>>>>> a_transform.outputs.keys()
>>>>> >> >
>>>>> >> > Since this changes the Python SDK composite transform API, what
>>>>> would be the next steps for the community to come to a consensus on this?
>>>>> >>
>>>>> >> It seems here we're conflating the nesting of PValue results with
>>>>> the
>>>>> >> nesting of composite operations.
>>>>> >>
>>>>> >> Both examples in the original post have PTransform nesting (a
>>>>> >> composite) returning a flat tuple. This is completely orthogonal to
>>>>> >> the idea of a PTransform returning a nested result (such as (pc1,
>>>>> >> (pc2, pc3))) and forbidding the latter won't solve the former.
>>>>> >>
>>>>> >> Currently, with the exception of explicit names given for
>>>>> multi-output
>>>>> >> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3,
>>>>> ...
>>>>> >> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no
>>>>> matter
>>>>> >> the nesting. We could do better, e.g. for the example above, label
>>>>> >> them "0", "1.0", "1.1", or use the keys in the returned dict, but
>>>>> this
>>>>> >> is separate from the idea of trying to relate the output tags of
>>>>> >> composites to the output tags of their inner transforms.
>>>>> >>
>>>>> >> - Robert
>>>>>
>>>>

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Robert Bradshaw <ro...@google.com>.
On Wed, Apr 1, 2020 at 1:48 PM Sam Rohde <sr...@google.com> wrote:

> To restate the original issue it is that the current method of setting the
> output tags on PCollections from composites drops the tag information of
> the returned PCollections.
>

Composite PTransforms should *not* be setting output tags on
returned PCollections; this will break producing outputs from the actual
primitive that produces them.


> So an expand returning a dict will have its outputs labeled as None, 1,
> ..., len(outputs). This is broken because embedded payloads in composites
> won't be able to reference the outputs if they differ from the returned
> value.
>

Yes, we need a way for composites to declare their output tags. Currently
this is only supported for the multi-output ParDo primitive.


> In this case, having the restriction of no nesting decreases technical
> complexity substantially and always giving the tag unambiguously informs
> the SDK what to name the outputs. We can also allow for arbitrary nesting,
> we'll just have to figure out an unambiguous naming scheme for the
> PCollections.
>

How about this: if the returned PValue is a dictionary of string ->
PCollection, we use the keys as tags. We can extend this naturally to
tuples, named tuples, nesting, etc. (though I don't know if there are any
hidden assumptions left about having an output labeled None if we want to
push this through to completion).
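
A hedged sketch of that idea, recursing through dicts, NamedTuples, and
tuples; the helper name and the '.' separator are my choices, not a settled
design:

def tags_for(result, prefix=''):
  if isinstance(result, dict):
    items = result.items()
  elif hasattr(result, '_fields'):  # a NamedTuple instance
    items = zip(result._fields, result)
  elif isinstance(result, (tuple, list)):
    items = ((str(i), v) for i, v in enumerate(result))
  else:
    # A lone PCollection; 'None' mirrors the historical main-output label.
    return {prefix.rstrip('.') or 'None': result}
  names = {}
  for key, value in items:
    names.update(tags_for(value, '%s%s.' % (prefix, key)))
  return names

# tags_for({'a': pc1, 'rest': (pc2, pc3)})
#   -> {'a': pc1, 'rest.0': pc2, 'rest.1': pc3}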


>
>
>
> On Wed, Apr 1, 2020 at 12:44 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> I'm -1 on this, it adds additional restrictions and I don't see what this
>> buys us. (In particular, it doesn't address the original issue.)
>>
>> On Wed, Apr 1, 2020 at 12:41 PM Sam Rohde <sr...@google.com> wrote:
>>
>>> So then how does the proposal sound?
>>>
>>> Writing again here:
>>> PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection],
>>> Tuple[str, PCollection], Dict[str, PCollection], DoOutputsTuple]
>>>
>>> i.e. no arbitrary nesting when outputting from an expand
>>>
>>> On Tue, Mar 31, 2020 at 5:15 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> On Tue, Mar 31, 2020 at 4:13 PM Luke Cwik <lc...@google.com> wrote:
>>>> >
>>>> > It is important that composites know how things are named so that any
>>>> embedded payloads in the composite PTransform can reference the outputs
>>>> appropriately.
>>>>
>>>> Very good point. This is part of the cleanup to treat inputs and
>>>> outputs of PCollections as maps rather than lists generally across the
>>>> Python representations (which also relates to some of the ugliness
>>>> that Cham has been running into with cross-language).
>>>>
>>>> > On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>> >>
>>>> >> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <sr...@google.com> wrote:
>>>> >> >>>
>>>> >> >>> * Don't allow arbitrary nestings returned during expansion,
>>>> force composite transforms to always provide an unambiguous name (either a
>>>> tuple with PCollections with unique tags or a dictionary with untagged
>>>> PCollections or a singular PCollection (Java and Go SDKs do this)).
>>>> >> >>
>>>> >> >> I believe that aligning with Java and Go would be the right way
>>>> to go here. I don't know if this would limit expressiveness.
>>>> >> >
>>>> >> > Yeah this sounds like a much more elegant way of handling this
>>>> situation. I would lean towards this limiting expressiveness because there
>>>> would be a limit to nesting, but I think that the trade-off with reducing
>>>> complexity is worth it.
>>>> >> >
>>>> >> > So in summary it could be:
>>>> >> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str,
>>>> PCollection], Tuple[str, PCollection], Dict[str, PCollection],
>>>> DoOutputsTuple]
>>>> >> >
>>>> >> > With the expectation that (pseudo-code):
>>>> >> > a_transform = ATransform()
>>>> >> >
>>>> ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() ==
>>>> a_transform.outputs.keys()
>>>> >> >
>>>> >> > Since this changes the Python SDK composite transform API, what
>>>> would be the next steps for the community to come to a consensus on this?
>>>> >>
>>>> >> It seems here we're conflating the nesting of PValue results with the
>>>> >> nesting of composite operations.
>>>> >>
>>>> >> Both examples in the original post have PTransform nesting (a
>>>> >> composite) returning a flat tuple. This is completely orthogonal to
>>>> >> the idea of a PTransform returning a nested result (such as (pc1,
>>>> >> (pc2, pc3))) and forbidding the latter won't solve the former.
>>>> >>
>>>> >> Currently, with the exception of explicit names given for
>>>> multi-output
>>>> >> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3, ...
>>>> >> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no
>>>> matter
>>>> >> the nesting. We could do better, e.g. for the example above, label
>>>> >> them "0", "1.0", "1.1", or use the keys in the returned dict, but
>>>> this
>>>> >> is separate from the idea of trying to relate the output tags of
>>>> >> composites to the output tags of their inner transforms.
>>>> >>
>>>> >> - Robert
>>>>
>>>

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Sam Rohde <sr...@google.com>.
To restate the original issue: the current method of setting the
output tags on PCollections from composites drops the tag information of
the returned PCollections. So an expand returning a dict will have its
outputs labeled as None, 1, ..., len(outputs). This is broken because
embedded payloads in composites won't be able to reference the outputs if
they differ from the returned value.

In this case, having the restriction of no nesting decreases technical
complexity substantially and always giving the tag unambiguously informs
the SDK what to name the outputs. We can also allow for arbitrary nesting,
we'll just have to figure out an unambiguous naming scheme for the
PCollections.



On Wed, Apr 1, 2020 at 12:44 PM Robert Bradshaw <ro...@google.com> wrote:

> I'm -1 on this, it adds additional restrictions and I don't see what this
> buys us. (In particular, it doesn't address the original issue.)
>
> On Wed, Apr 1, 2020 at 12:41 PM Sam Rohde <sr...@google.com> wrote:
>
>> So then how does the proposal sound?
>>
>> Writing again here:
>> PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection],
>> Tuple[str, PCollection], Dict[str, PCollection], DoOutputsTuple]
>>
>> i.e. no arbitrary nesting when outputting from an expand
>>
>> On Tue, Mar 31, 2020 at 5:15 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> On Tue, Mar 31, 2020 at 4:13 PM Luke Cwik <lc...@google.com> wrote:
>>> >
>>> > It is important that composites know how things are named so that any
>>> embedded payloads in the composite PTransform can reference the outputs
>>> appropriately.
>>>
>>> Very good point. This is part of the cleanup to treat inputs and
>>> outputs of PCollections as maps rather than lists generally across the
>>> Python representations (which also relates to some of the ugliness
>>> that Cham has been running into with cross-language).
>>>
>>> > On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>> >>
>>> >> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <sr...@google.com> wrote:
>>> >> >>>
>>> >> >>> * Don't allow arbitrary nestings returned during expansion, force
>>> composite transforms to always provide an unambiguous name (either a tuple
>>> with PCollections with unique tags or a dictionary with untagged
>>> PCollections or a singular PCollection (Java and Go SDKs do this)).
>>> >> >>
>>> >> >> I believe that aligning with Java and Go would be the right way to
>>> go here. I don't know if this would limit expressiveness.
>>> >> >
>>> >> > Yeah this sounds like a much more elegant way of handling this
>>> situation. I would lean towards this limiting expressiveness because there
>>> would be a limit to nesting, but I think that the trade-off with reducing
>>> complexity is worth it.
>>> >> >
>>> >> > So in summary it could be:
>>> >> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str,
>>> PCollection], Tuple[str, PCollection], Dict[str, PCollection],
>>> DoOutputsTuple]
>>> >> >
>>> >> > With the expectation that (pseudo-code):
>>> >> > a_transform = ATransform()
>>> >> >
>>> ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() ==
>>> a_transform.outputs.keys()
>>> >> >
>>> >> > Since this changes the Python SDK composite transform API, what
>>> would be the next steps for the community to come to a consensus on this?
>>> >>
>>> >> It seems here we're conflating the nesting of PValue results with the
>>> >> nesting of composite operations.
>>> >>
>>> >> Both examples in the original post have PTransform nesting (a
>>> >> composite) returning a flat tuple. This is completely orthogonal to
>>> >> the idea of a PTransform returning a nested result (such as (pc1,
>>> >> (pc2, pc3))) and forbidding the latter won't solve the former.
>>> >>
>>> >> Currently, with the exception of explicit names given for multi-output
>>> >> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3, ...
>>> >> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no matter
>>> >> the nesting. We could do better, e.g. for the example above, label
>>> >> them "0", "1.0", "1.1", or use the keys in the returned dict, but this
>>> >> is separate from the idea of trying to relate the output tags of
>>> >> composites to the output tags of their inner transforms.
>>> >>
>>> >> - Robert
>>>
>>

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Robert Bradshaw <ro...@google.com>.
I'm -1 on this, it adds additional restrictions and I don't see what this
buys us. (In particular, it doesn't address the original issue.)

On Wed, Apr 1, 2020 at 12:41 PM Sam Rohde <sr...@google.com> wrote:

> So then how does the proposal sound?
>
> Writing again here:
> PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection],
> Tuple[str, PCollection], Dict[str, PCollection], DoOutputsTuple]
>
> i.e. no arbitrary nesting when outputting from an expand
>
> On Tue, Mar 31, 2020 at 5:15 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Tue, Mar 31, 2020 at 4:13 PM Luke Cwik <lc...@google.com> wrote:
>> >
>> > It is important that composites know how things are named so that any
>> embedded payloads in the composite PTransform can reference the outputs
>> appropriately.
>>
>> Very good point. This is part of the cleanup to treat inputs and
>> outputs of PCollections as maps rather than lists generally across the
>> Python representations (which also relates to some of the ugliness
>> that Cham has been running into with cross-language).
>>
>> > On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >>
>> >> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <sr...@google.com> wrote:
>> >> >>>
>> >> >>> * Don't allow arbitrary nestings returned during expansion, force
>> composite transforms to always provide an unambiguous name (either a tuple
>> with PCollections with unique tags or a dictionary with untagged
>> PCollections or a singular PCollection (Java and Go SDKs do this)).
>> >> >>
>> >> >> I believe that aligning with Java and Go would be the right way to
>> go here. I don't know if this would limit expressiveness.
>> >> >
>> >> > Yeah this sounds like a much more elegant way of handling this
>> situation. I would lean towards this limiting expressiveness because there
>> would be a limit to nesting, but I think that the trade-off with reducing
>> complexity is worth it.
>> >> >
>> >> > So in summary it could be:
>> >> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str,
>> PCollection], Tuple[str, PCollection], Dict[str, PCollection],
>> DoOutputsTuple]
>> >> >
>> >> > With the expectation that (pseudo-code):
>> >> > a_transform = ATransform()
>> >> >
>> ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() ==
>> a_transform.outputs.keys()
>> >> >
>> >> > Since this changes the Python SDK composite transform API, what
>> would be the next steps for the community to come to a consensus on this?
>> >>
>> >> It seems here we're conflating the nesting of PValue results with the
>> >> nesting of composite operations.
>> >>
>> >> Both examples in the original post have PTransform nesting (a
>> >> composite) returning a flat tuple. This is completely orthogonal to
>> >> the idea of a PTransform returning a nested result (such as (pc1,
>> >> (pc2, pc3))) and forbidding the latter won't solve the former.
>> >>
>> >> Currently, with the exception of explicit names given for multi-output
>> >> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3, ...
>> >> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no matter
>> >> the nesting. We could do better, e.g. for the example above, label
>> >> them "0", "1.0", "1.1", or use the keys in the returned dict, but this
>> >> is separate from the idea of trying to relate the output tags of
>> >> composites to the output tags of their inner transforms.
>> >>
>> >> - Robert
>>
>

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Sam Rohde <sr...@google.com>.
So then how does the proposal sound?

Writing again here:
PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection],
Tuple[str, PCollection], Dict[str, PCollection], DoOutputsTuple]

i.e. no arbitrary nesting when outputting from an expand
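
As a hedged illustration of what the restricted signature would permit: a
composite whose expand returns a flat Dict[str, PCollection], with the keys
becoming its output tags (the transform and key names are illustrative):

import apache_beam as beam

class SplitEvenOdd(beam.PTransform):
  def expand(self, pcoll):
    parts = pcoll | beam.Partition(lambda x, _: x % 2, 2)
    # Flat dict, no nesting, so the output tags are unambiguous.
    return {'even': parts[0], 'odd': parts[1]}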

On Tue, Mar 31, 2020 at 5:15 PM Robert Bradshaw <ro...@google.com> wrote:

> On Tue, Mar 31, 2020 at 4:13 PM Luke Cwik <lc...@google.com> wrote:
> >
> > It is important that composites know how things are named so that any
> embedded payloads in the composite PTransform can reference the outputs
> appropriately.
>
> Very good point. This is part of the cleanup to treat inputs and
> outputs of PCollections as maps rather than lists generally across the
> Python representations (which also relates to some of the ugliness
> that Cham has been running into with cross-language).
>
> > On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >>
> >> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <sr...@google.com> wrote:
> >> >>>
> >> >>> * Don't allow arbitrary nestings returned during expansion, force
> composite transforms to always provide an unambiguous name (either a tuple
> with PCollections with unique tags or a dictionary with untagged
> PCollections or a singular PCollection (Java and Go SDKs do this)).
> >> >>
> >> >> I believe that aligning with Java and Go would be the right way to
> go here. I don't know if this would limit expressiveness.
> >> >
> >> > Yeah this sounds like a much more elegant way of handling this
> situation. I would lean towards this limiting expressiveness because there
> would be a limit to nesting, but I think that the trade-off with reducing
> complexity is worth it.
> >> >
> >> > So in summary it could be:
> >> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str,
> PCollection], Tuple[str, PCollection], Dict[str, PCollection],
> DoOutputsTuple]
> >> >
> >> > With the expectation that (pseudo-code):
> >> > a_transform = ATransform()
> >> >
> ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() ==
> a_transform.outputs.keys()
> >> >
> >> > Since this changes the Python SDK composite transform API, what would
> be the next steps for the community to come to a consensus on this?
> >>
> >> It seems here we're conflating the nesting of PValue results with the
> >> nesting of composite operations.
> >>
> >> Both examples in the original post have PTransform nesting (a
> >> composite) returning a flat tuple. This is completely orthogonal to
> >> the idea of a PTransform returning a nested result (such as (pc1,
> >> (pc2, pc3))) and forbidding the latter won't solve the former.
> >>
> >> Currently, with the exception of explicit names given for multi-output
> >> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3, ...
> >> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no matter
> >> the nesting. We could do better, e.g. for the example above, label
> >> them "0", "1.0", "1.1", or use the keys in the returned dict, but this
> >> is separate from the idea of trying to relate the output tags of
> >> composites to the output tags of their inner transforms.
> >>
> >> - Robert
>

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Robert Bradshaw <ro...@google.com>.
On Tue, Mar 31, 2020 at 4:13 PM Luke Cwik <lc...@google.com> wrote:
>
> It is important that composites know how things are named so that any embedded payloads in the composite PTransform can reference the outputs appropriately.

Very good point. This is part of the cleanup to treat inputs and
outputs of PCollections as maps rather than lists generally across the
Python representations (which also relates to some of the ugliness
that Cham has been running into with cross-language).

> On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw <ro...@google.com> wrote:
>>
>> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <sr...@google.com> wrote:
>> >>>
>> >>> * Don't allow arbitrary nestings returned during expansion, force composite transforms to always provide an unambiguous name (either a tuple with PCollections with unique tags or a dictionary with untagged PCollections or a singular PCollection (Java and Go SDKs do this)).
>> >>
>> >> I believe that aligning with Java and Go would be the right way to go here. I don't know if this would limit expressiveness.
>> >
>> > Yeah this sounds like a much more elegant way of handling this situation. I would lean towards this limiting expressiveness because there would be a limit to nesting, but I think that the trade-off with reducing complexity is worth it.
>> >
>> > So in summary it could be:
>> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection], Tuple[str, PCollection], Dict[str, PCollection], DoOutputsTuple]
>> >
>> > With the expectation that (pseudo-code):
>> > a_transform = ATransform()
>> > ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() == a_transform.outputs.keys()
>> >
>> > Since this changes the Python SDK composite transform API, what would be the next steps for the community to come to a consensus on this?
>>
>> It seems here we're conflating the nesting of PValue results with the
>> nesting of composite operations.
>>
>> Both examples in the original post have PTransform nesting (a
>> composite) returning a flat tuple. This is completely orthogonal to
>> the idea of a PTransform returning a nested result (such as (pc1,
>> (pc2, pc3))) and forbidding the latter won't solve the former.
>>
>> Currently, with the exception of explicit names given for multi-output
>> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3, ...
>> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no matter
>> the nesting. We could do better, e.g. for the example above, label
>> them "0", "1.0", "1.1", or use the keys in the returned dict, but this
>> is separate from the idea of trying to relate the output tags of
>> composites to the output tags of their inner transforms.
>>
>> - Robert

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Luke Cwik <lc...@google.com>.
It is important that composites know how things are named so that any
embedded payloads in the composite PTransform can reference the outputs
appropriately.

On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw <ro...@google.com> wrote:

> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <sr...@google.com> wrote:
> >>>
> >>> * Don't allow arbitrary nestings returned during expansion, force
> composite transforms to always provide an unambiguous name (either a tuple
> with PCollections with unique tags or a dictionary with untagged
> PCollections or a singular PCollection (Java and Go SDKs do this)).
> >>
> >> I believe that aligning with Java and Go would be the right way to go
> here. I don't know if this would limit expressiveness.
> >
> > Yeah this sounds like a much more elegant way of handling this
> situation. I would lean towards this limiting expressiveness because there
> would be a limit to nesting, but I think that the trade-off with reducing
> complexity is worth it.
> >
> > So in summary it could be:
> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection],
> Tuple[str, PCollection], Dict[str, PCollection], DoOutputsTuple]
> >
> > With the expectation that (pseudo-code):
> > a_transform = ATransform()
> > ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys()
> == a_transform.outputs.keys()
> >
> > Since this changes the Python SDK composite transform API, what would be
> the next steps for the community to come to a consensus on this?
>
> It seems here we're conflating the nesting of PValue results with the
> nesting of composite operations.
>
> Both examples in the original post have PTransform nesting (a
> composite) returning a flat tuple. This is completely orthogonal to
> the idea of a PTransform returning a nested result (such as (pc1,
> (pc2, pc3))) and forbidding the latter won't solve the former.
>
> Currently, with the exception of explicit names given for multi-output
> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3, ...
> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no matter
> the nesting. We could do better, e.g. for the example above, label
> them "0", "1.0", "1.1", or use the keys in the returned dict, but this
> is separate from the idea of trying to relate the output tags of
> composites to the output tags of their inner transforms.
>
> - Robert
>

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Robert Bradshaw <ro...@google.com>.
On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <sr...@google.com> wrote:
>>>
>>> * Don't allow arbitrary nestings returned during expansion, force composite transforms to always provide an unambiguous name (either a tuple with PCollections with unique tags or a dictionary with untagged PCollections or a singular PCollection (Java and Go SDKs do this)).
>>
>> I believe that aligning with Java and Go would be the right way to go here. I don't know if this would limit expressiveness.
>
> Yeah this sounds like a much more elegant way of handling this situation. I would lean towards this limiting expressiveness because there would be a limit to nesting, but I think that the trade-off with reducing complexity is worth it.
>
> So in summary it could be:
> PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection], Tuple[str, PCollection], Dict[str, PCollection], DoOutputsTuple]
>
> With the expectation that (pseudo-code):
> a_transform = ATransform()
> ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() == a_transform.outputs.keys()
>
> Since this changes the Python SDK composite transform API, what would be the next steps for the community to come to a consensus on this?

It seems here we're conflating the nesting of PValue results with the
nesting of composite operations.

Both examples in the original post have PTransform nesting (a
composite) returning a flat tuple. This is completely orthogonal to
the idea of a PTransform returning a nested result (such as (pc1,
(pc2, pc3))) and forbidding the latter won't solve the former.

Currently, with the exception of explicit names given for multi-output
ParDos, we simply label the outputs sequentially with 0, 1, 2, 3, ...
(Actually, for historical reasons, it's None, 1, 2, 3, ...), no matter
the nesting. We could do better, e.g. for the example above, label
them "0", "1.0", "1.1", or use the keys in the returned dict, but this
is separate from the idea of trying to relate the output tags of
composites to the output tags of their inner transforms.

- Robert
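
A hedged sketch of that positional labeling for a nested result such as
(pc1, (pc2, pc3)); the helper is illustrative only:

def positional_labels(result, prefix=''):
  if isinstance(result, (tuple, list)):
    labels = {}
    for i, value in enumerate(result):
      labels.update(positional_labels(value, '%s%d.' % (prefix, i)))
    return labels
  return {prefix[:-1]: result}  # drop the trailing '.'

# positional_labels((pc1, (pc2, pc3))) -> {'0': pc1, '1.0': pc2, '1.1': pc3}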

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Sam Rohde <sr...@google.com>.
>
> * Don't allow arbitrary nestings returned during expansion, force
>> composite transforms to always provide an unambiguous name (either a tuple
>> with PCollections with unique tags or a dictionary with untagged
>> PCollections or a singular PCollection (Java and Go SDKs do this)).
>>
>
> I believe that aligning with Java and Go would be the right way to go
> here. I don't know if this would limit expressiveness.
>
Yeah, this sounds like a much more elegant way of handling this situation. I
lean towards thinking this would limit expressiveness, since there would be
a limit to nesting, but I think the trade-off of reduced complexity is worth
it.

So in summary it could be:
PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection],
Tuple[str, PCollection], Dict[str, PCollection], DoOutputsTuple]

With the expectation that (pseudo-code):
a_transform = ATransform()
ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() ==
a_transform.outputs.keys()

Since this changes the Python SDK composite transform API, what would be
the next steps for the community to come to a consensus on this?

-Sam


On Thu, Mar 26, 2020 at 12:52 PM Udi Meiri <eh...@google.com> wrote:

>
>
> On Thu, Mar 26, 2020 at 10:13 AM Luke Cwik <lc...@google.com> wrote:
>
>> The issue seems to be that a PCollection can have a "tag" associated with
>> it and PTransform expansion can return an arbitrary nested dictionary/tuple
>> yet we need to figure out what the user wanted as the local name for the
>> PCollection from all this information.
>>
>> Will this break people who rely on the generated PCollection output tags?
>> One big question is whether a composite transform cares about the name
>> that is used. For primitive transforms such as ParDo, this is very much a
>> yes because the pickled code likely references that name in some way. Some
>> composites could have the same need where the payload that is stored as
>> part of the composite references these local names and hence we have to
>> tell people how to instruct the SDK during transform expansion about what
>> name will be used unambiguously (as long as we document and have tests
>> around this we can choose from many options). Finally, in the XLang world,
>> we need to preserve the names that were provided to us and not change them;
>> which is more about making the Python SDK handle XLang transform expansion
>> carefully.
>>
>> Am I missing edge cases?
>> Concatenation of strings leads to collisions if the delimiter character
>> is used within the tags or map keys. You could use an escaping encoding to
>> guarantee that the concatenation always generates unique names.
>>
>> Some alternatives I thought about were:
>> * Don't allow arbitrary nestings returned during expansion, force
>> composite transforms to always provide an unambiguous name (either a tuple
>> with PCollections with unique tags or a dictionary with untagged
>> PCollections or a singular PCollection (Java and Go SDKs do this)).
>>
>
> I believe that aligning with Java and Go would be the right way to go
> here. I don't know if this would limit expressiveness.
>
>
>> * Have a "best" effort naming system (note the example I give can have
>> many of the "rules" re-ordered) e.g. if all the PCollection tags are unique
>> then use only them, followed by if a flat dictionary is returned then use
>> only the keys as names, followed by if a flat tuple is returned then use
>> indices, and finally fallback to the hierarchical naming scheme.
>>
>>
>> On Tue, Mar 24, 2020 at 1:07 PM Sam Rohde <sr...@google.com> wrote:
>>
>>> Hi All,
>>>
>>> *Problem*
>>> I would like to discuss BEAM-9322
>>> <https://issues.apache.org/jira/projects/BEAM/issues/BEAM-9322> and the
>>> correct way to set the output tags of a transform with nested PCollections,
>>> e.g. a dict of PCollections, a tuple of dicts of PCollections. Before the
>>> fixing of BEAM-1833 <https://issues.apache.org/jira/browse/BEAM-1833>,
>>> the Python SDK when applying a PTransform would auto-generate the output
>>> tags for the output PCollections even if they are manually set by the user:
>>>
>>> class MyComposite(beam.PTransform):
>>>   def expand(self, pcoll):
>>>     a = PCollection.from_(pcoll)
>>>     a.tag = 'a'
>>>
>>>     b = PCollection.from_(pcoll)
>>>     b.tag = 'b'
>>>     return (a, b)
>>>
>>> would yield a PTransform with two output PCollection and output tags
>>> with 'None' and '0' instead of 'a' and 'b'. This was corrected for simple
>>> cases like this. However, this fails when the PCollections share the same
>>> output tag (of course). This can happen like so:
>>>
>>> class MyComposite(beam.PTransform):
>>>   def expand(self, pcoll):
>>>     partition_1 = beam.Partition(pcoll, ...)
>>>     partition_2 = beam.Partition(pcoll, ...)
>>>     return (partition_1[0], partition_2[0])
>>>
>>> With the new code, this leads to an error because both output
>>> PCollections have an output tag of '0'.
>>>
>>> *Proposal*
>>> When applying PTransforms to a pipeline (pipeline.py:550) we name the
>>> PCollections according to their position in the tree concatenated with the
>>> PCollection tag and a delimiter. From the first example, the output
>>> PCollections of the applied transform will be: '0.a' and '1.b' because it
>>> is a tuple of PCollections. In the second example, the outputs should be:
>>> '0.0' and '1.0'. In the case of a dict of PCollections, it should simply be
>>> the keys of the dict.
>>>
>>> What do you think? Am I missing edge cases? Will this be unexpected to
>>> users? Will this break people who rely on the generated PCollection output
>>> tags?
>>>
>>> Regards,
>>> Sam
>>>
>>

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Udi Meiri <eh...@google.com>.
On Thu, Mar 26, 2020 at 10:13 AM Luke Cwik <lc...@google.com> wrote:

> The issue seems to be that a PCollection can have a "tag" associated with
> it and PTransform expansion can return an arbitrary nested dictionary/tuple
> yet we need to figure out what the user wanted as the local name for the
> PCollection from all this information.
>
> Will this break people who rely on the generated PCollection output tags?
> One big question is whether a composite transform cares about the name
> that is used. For primitive transforms such as ParDo, this is very much a
> yes because the pickled code likely references that name in some way. Some
> composites could have the same need where the payload that is stored as
> part of the composite references these local names and hence we have to
> tell people how to instruct the SDK during transform expansion about what
> name will be used unambiguously (as long as we document and have tests
> around this we can choose from many options). Finally, in the XLang world,
> we need to preserve the names that were provided to us and not change them;
> which is more about making the Python SDK handle XLang transform expansion
> carefully.
>
> Am I missing edge cases?
> Concatenation of strings leads to collisions if the delimiter character is
> used within the tags or map keys. You could use an escaping encoding to
> guarantee that the concatenation always generates unique names.
>
> Some alternatives I thought about were:
> * Don't allow arbitrary nestings returned during expansion, force
> composite transforms to always provide an unambiguous name (either a tuple
> with PCollections with unique tags or a dictionary with untagged
> PCollections or a singular PCollection (Java and Go SDKs do this)).
>

I believe that aligning with Java and Go would be the right way to go here.
I don't know if this would limit expressiveness.


> * Have a "best" effort naming system (note the example I give can have
> many of the "rules" re-ordered) e.g. if all the PCollection tags are unique
> then use only them, followed by if a flat dictionary is returned then use
> only the keys as names, followed by if a flat tuple is returned then use
> indices, and finally fallback to the hierarchical naming scheme.
>
>
> On Tue, Mar 24, 2020 at 1:07 PM Sam Rohde <sr...@google.com> wrote:
>
>> Hi All,
>>
>> *Problem*
>> I would like to discuss BEAM-9322
>> <https://issues.apache.org/jira/projects/BEAM/issues/BEAM-9322> and the
>> correct way to set the output tags of a transform with nested PCollections,
>> e.g. a dict of PCollections, a tuple of dicts of PCollections. Before the
>> fixing of BEAM-1833 <https://issues.apache.org/jira/browse/BEAM-1833>,
>> the Python SDK when applying a PTransform would auto-generate the output
>> tags for the output PCollections even if they are manually set by the user:
>>
>> class MyComposite(beam.PTransform):
>>   def expand(self, pcoll):
>>     a = PCollection.from_(pcoll)
>>     a.tag = 'a'
>>
>>     b = PCollection.from_(pcoll)
>>     b.tag = 'b'
>>     return (a, b)
>>
>> would yield a PTransform with two output PCollection and output tags with
>> 'None' and '0' instead of 'a' and 'b'. This was corrected for simple cases
>> like this. However, this fails when the PCollections share the same output
>> tag (of course). This can happen like so:
>>
>> class MyComposite(beam.PTransform):
>>   def expand(self, pcoll):
>>     partition_1 = beam.Partition(pcoll, ...)
>>     partition_2 = beam.Partition(pcoll, ...)
>>     return (partition_1[0], partition_2[0])
>>
>> With the new code, this leads to an error because both output
>> PCollections have an output tag of '0'.
>>
>> *Proposal*
>> When applying PTransforms to a pipeline (pipeline.py:550) we name the
>> PCollections according to their position in the tree concatenated with the
>> PCollection tag and a delimiter. From the first example, the output
>> PCollections of the applied transform will be: '0.a' and '1.b' because it
>> is a tuple of PCollections. In the second example, the outputs should be:
>> '0.0' and '1.0'. In the case of a dict of PCollections, it should simply be
>> the keys of the dict.
>>
>> What do you think? Am I missing edge cases? Will this be unexpected to
>> users? Will this break people who rely on the generated PCollection output
>> tags?
>>
>> Regards,
>> Sam
>>
>

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Posted by Luke Cwik <lc...@google.com>.
The issue seems to be that a PCollection can have a "tag" associated with
it and PTransform expansion can return an arbitrary nested dictionary/tuple
yet we need to figure out what the user wanted as the local name for the
PCollection from all this information.

Will this break people who rely on the generated PCollection output tags?
One big question is whether a composite transform cares about the name that
is used. For primitive transforms such as ParDo, this is very much a yes
because the pickled code likely references that name in some way. Some
composites could have the same need where the payload that is stored as
part of the composite references these local names and hence we have to
tell people how to instruct the SDK during transform expansion about what
name will be used unambiguously (as long as we document and have tests
around this we can choose from many options). Finally, in the XLang world,
we need to preserve the names that were provided to us and not change them,
which is more about making the Python SDK handle XLang transform expansion
carefully.

Am I missing edge cases?
Concatenation of strings leads to collisions if the delimiter character is
used within the tags or map keys. You could use an escaping encoding to
guarantee that the concatenation always generates unique names.
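
A hedged sketch of one such escaping scheme, percent-encoding the delimiter
and the escape character before joining; the exact encoding is my choice,
not something proposed in this thread:

def join_name(parts, delimiter='.'):
  def escape(part):
    return part.replace('%', '%25').replace(delimiter, '%2E')
  return delimiter.join(escape(p) for p in parts)

# join_name(['1.0', 'a']) == '1%2E0.a'  vs  join_name(['1', '0.a']) == '1.0%2Ea'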

Some alternatives I thought about were:
* Don't allow arbitrary nestings returned during expansion, force composite
transforms to always provide an unambiguous name (either a tuple with
PCollections with unique tags or a dictionary with untagged PCollections or
a singular PCollection (Java and Go SDKs do this)).
* Have a "best" effort naming system (note the example I give can have many
of the "rules" re-ordered) e.g. if all the PCollection tags are unique then
use only them, followed by if a flat dictionary is returned then use only
the keys as names, followed by if a flat tuple is returned then use
indices, and finally fallback to the hierarchical naming scheme.
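
A hedged sketch of that best-effort cascade, in one of the possible rule
orders; the hierarchical fallback helper is hypothetical:

def best_effort_names(result):
  def is_flat(values):
    return not any(isinstance(v, (dict, tuple, list)) for v in values)
  if isinstance(result, (tuple, list)) and is_flat(result):
    tags = [getattr(pc, 'tag', None) for pc in result]
    if all(t is not None for t in tags) and len(set(tags)) == len(tags):
      return {str(t): pc for t, pc in zip(tags, result)}  # unique tags
    return {str(i): pc for i, pc in enumerate(result)}    # flat tuple: indices
  if isinstance(result, dict) and is_flat(result.values()):
    return dict(result)                                   # flat dict: keys
  return hierarchical_names(result)  # hypothetical hierarchical fallback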


On Tue, Mar 24, 2020 at 1:07 PM Sam Rohde <sr...@google.com> wrote:

> Hi All,
>
> *Problem*
> I would like to discuss BEAM-9322
> <https://issues.apache.org/jira/projects/BEAM/issues/BEAM-9322> and the
> correct way to set the output tags of a transform with nested PCollections,
> e.g. a dict of PCollections, a tuple of dicts of PCollections. Before the
> fixing of BEAM-1833 <https://issues.apache.org/jira/browse/BEAM-1833>,
> the Python SDK when applying a PTransform would auto-generate the output
> tags for the output PCollections even if they are manually set by the user:
>
> class MyComposite(beam.PTransform):
>   def expand(self, pcoll):
>     a = PCollection.from_(pcoll)
>     a.tag = 'a'
>
>     b = PCollection.from_(pcoll)
>     b.tag = 'b'
>     return (a, b)
>
> would yield a PTransform with two output PCollection and output tags with
> 'None' and '0' instead of 'a' and 'b'. This was corrected for simple cases
> like this. However, this fails when the PCollections share the same output
> tag (of course). This can happen like so:
>
> class MyComposite(beam.PTransform):
>   def expand(self, pcoll):
>     partition_1 = beam.Partition(pcoll, ...)
>     partition_2 = beam.Partition(pcoll, ...)
>     return (partition_1[0], partition_2[0])
>
> With the new code, this leads to an error because both output PCollections
> have an output tag of '0'.
>
> *Proposal*
> When applying PTransforms to a pipeline (pipeline.py:550) we name the
> PCollections according to their position in the tree concatenated with the
> PCollection tag and a delimiter. From the first example, the output
> PCollections of the applied transform will be: '0.a' and '1.b' because it
> is a tuple of PCollections. In the second example, the outputs should be:
> '0.0' and '1.0'. In the case of a dict of PCollections, it should simply be
> the keys of the dict.
>
> What do you think? Am I missing edge cases? Will this be unexpected to
> users? Will this break people who rely on the generated PCollection output
> tags?
>
> Regards,
> Sam
>