You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@beam.apache.org by Reuven Lax via user <us...@beam.apache.org> on 2023/10/02 00:12:11 UTC

Re: [QUESTION] Why no auto labels?

Are you talking about transform names? The main reason was because for
runners that support updating pipelines in place, the only way to do so
safely is if the runner can perfectly identify which transforms in the new
graph match the ones in the old graph. There's no good way to auto generate
names that will stay stable across updates - even small changes to the
pipeline might change the order of nodes in the graph, which could result
in a corrupted update.

However, if you don't care about update, Beam can auto generate these names
for you! When you call PCollection.apply (if using BeamJava), simply omit
the name parameter and Beam will auto generate a unique name for you.

Reuven

On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <jo...@schrodinger.com>
wrote:

> After writing a few pipelines now, I keep getting tripped up from
> accidentally have duplicate labels from using multiple of the same
> transforms without labeling them. I figure this must be a common complaint,
> so I was just curious, what the rationale behind this design was? My naive
> thought off the top of my head is that it'd be more user friendly to just
> auto increment duplicate transforms, but I figure I must be missing
> something
>
> Cheers,
> Joey
>

Re: [QUESTION] Why no auto labels?

Posted by Byron Ellis via dev <de...@beam.apache.org>.

This feels like something that maybe should be more explicit? Overloading
the transform name to provide a unique stable id feels like perhaps too
much magic... also maybe feels like this is leaking specific runner
behavior? I get that it's convenient

On Wed, Oct 4, 2023 at 9:16 AM Robert Bradshaw via user <
user@beam.apache.org> wrote:

> BeamJava and BeamPython have the exact same behavior: transform names
> within must be distinct [1]. This is because we do not necessarily know at
> pipeline construction time if the pipeline will be streaming or batch, or
> if it will be updated in the future, so the decision was made to impose
> this restriction up front. Both will auto-generate a name for you if one is
> not given, but will do so deterministically (not depending on some global
> context) to avoid potential update problems.
>
> [1] Note that this applies to the fully qualified transform name, so the
> naming only has to be distinct within a composite transform (or at the top
> level--the pipeline itself is isomorphic to a single composite transform).
>
> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> Cross posting this thread to dev@ to see if this is intentional behavior
>> or if it's something worth changing for the python SDK
>>
>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>> wrote:
>>
>>> That suggests the default label is created as that, which indeed causes
>>> the duplication error.
>>>
>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Not sure what that suggests
>>>>
>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>> wrote:
>>>>
>>>>> Looks like this is the current behaviour. If you have `t =
>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>> `Filter(identity_filter)`.
>>>>>
>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>
>>>>>> ```
>>>>>>         | beam.Filter(identity_filter)
>>>>>>         | beam.Filter(identity_filter)
>>>>>> ```
>>>>>>
>>>>>> Here's an example on playground that shows the error message you get
>>>>>> [1]. I marked every line I added with a "# ++".
>>>>>>
>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>> of filter on each of them.
>>>>>>
>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>> [2] testing transform. In this case, I think often users will really have
>>>>>> no need for a disambiguating label as they're often just writing unit tests
>>>>>> that test a few different properties of their workflow.
>>>>>>
>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>> [2]
>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>
>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>> user@beam.apache.org> wrote:
>>>>>>
>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>> those names.
>>>>>>>
>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>> stable / deterministic graph.
>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>
>>>>>>> See an example here
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>>> Can you elaborate?
>>>>>>>>
>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>
>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> Are you talking about transform names? The main reason was because
>>>>>>>>> for runners that support updating pipelines in place, the only way to do so
>>>>>>>>> safely is if the runner can perfectly identify which transforms in the new
>>>>>>>>> graph match the ones in the old graph. There's no good way to auto generate
>>>>>>>>> names that will stay stable across updates - even small changes to the
>>>>>>>>> pipeline might change the order of nodes in the graph, which could result
>>>>>>>>> in a corrupted update.
>>>>>>>>>
>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>>> for you.
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>>>>>>> accidentally have duplicate labels from using multiple of the same
>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>> something
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Joey
>>>>>>>>>>
>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Byron Ellis via user <us...@beam.apache.org>.

This feels like something that maybe should be more explicit? Overloading
the transform name to provide a unique stable id feels like perhaps too
much magic... also maybe feels like this is leaking specific runner
behavior? I get that it's convenient

On Wed, Oct 4, 2023 at 9:16 AM Robert Bradshaw via user <
user@beam.apache.org> wrote:

> BeamJava and BeamPython have the exact same behavior: transform names
> within must be distinct [1]. This is because we do not necessarily know at
> pipeline construction time if the pipeline will be streaming or batch, or
> if it will be updated in the future, so the decision was made to impose
> this restriction up front. Both will auto-generate a name for you if one is
> not given, but will do so deterministically (not depending on some global
> context) to avoid potential update problems.
>
> [1] Note that this applies to the fully qualified transform name, so the
> naming only has to be distinct within a composite transform (or at the top
> level--the pipeline itself is isomorphic to a single composite transform).
>
> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> Cross posting this thread to dev@ to see if this is intentional behavior
>> or if it's something worth changing for the python SDK
>>
>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>> wrote:
>>
>>> That suggests the default label is created as that, which indeed causes
>>> the duplication error.
>>>
>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Not sure what that suggests
>>>>
>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>> wrote:
>>>>
>>>>> Looks like this is the current behaviour. If you have `t =
>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>> `Filter(identity_filter)`.
>>>>>
>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>
>>>>>> ```
>>>>>>         | beam.Filter(identity_filter)
>>>>>>         | beam.Filter(identity_filter)
>>>>>> ```
>>>>>>
>>>>>> Here's an example on playground that shows the error message you get
>>>>>> [1]. I marked every line I added with a "# ++".
>>>>>>
>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>> of filter on each of them.
>>>>>>
>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>> [2] testing transform. In this case, I think often users will really have
>>>>>> no need for a disambiguating label as they're often just writing unit tests
>>>>>> that test a few different properties of their workflow.
>>>>>>
>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>> [2]
>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>
>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>> user@beam.apache.org> wrote:
>>>>>>
>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>> those names.
>>>>>>>
>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>> stable / deterministic graph.
>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>
>>>>>>> See an example here
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>>> Can you elaborate?
>>>>>>>>
>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>
>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> Are you talking about transform names? The main reason was because
>>>>>>>>> for runners that support updating pipelines in place, the only way to do so
>>>>>>>>> safely is if the runner can perfectly identify which transforms in the new
>>>>>>>>> graph match the ones in the old graph. There's no good way to auto generate
>>>>>>>>> names that will stay stable across updates - even small changes to the
>>>>>>>>> pipeline might change the order of nodes in the graph, which could result
>>>>>>>>> in a corrupted update.
>>>>>>>>>
>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>>> for you.
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>>>>>>> accidentally have duplicate labels from using multiple of the same
>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>> something
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Joey
>>>>>>>>>>
>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via dev <de...@beam.apache.org>.

I'll take another look at the PR. My inclination is still to use uuids
to uniquify. I think that's worth the cost to the readability hit (I'm
OK reducing this down to 6-8 hex digits which will still give very low
chances of collisions, though it doesn't solve the first one). If
someone cares about names more than this they can set them manually.

On Fri, Oct 20, 2023 at 9:30 AM Joey Tran <jo...@schrodinger.com> wrote:
>
> Just want to bump this discussion again. I'm introducing Beam to other developers at my Schrodinger now and the first (of hopefully many!) developer has started migrating our internal workflows to Beam. As I suspected though, he's complained about the iteration cycles spent from using the same transform without specifying a label and from using multiple assert_that's without unique labels.
>
> From the java code[1], it looks like the same naming scheme is used (I think, it's been a decade since I've read java so apologies if I'm misreading) as the PR I posted
>
> [1] https://github.com/apache/beam/blob/e7a6405800a83dd16437b8b1b372e020e010a042/sdks/java/core/src/main/java/org/apache/beam/sdk/Pipeline.java#L630
>
> On Fri, Oct 13, 2023 at 1:32 PM Joey Tran <jo...@schrodinger.com> wrote:
>>
>>
>>
>> On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw <ro...@google.com> wrote:
>>>
>>> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>
>>>> Are there places on the SDK side that expect unique labels? Or in non-updateable runners?
>>>
>>>
>>> That's a good question. The label eventually ends up here: https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142 which is technically a violation of the spec if these are not unique (though I guess Java already does this?). More importantly, though, the id of the transform (the proto contains a map of strings -> transforms) must be unique.
>>>
>>
>> Doesn't that suggest not raising an exception won't be a sufficient enough solution? But I suppose if the Java SDK does this already it's fine..?
>>
>>>
>>> Another option is to make the suffix a uuid rather than a single counter. (This would still have issues with the first application possibly getting mixed up with a "different" first application unless it was always appended.)
>>>
>>
>> Is the marginal benefit (only the first application may be confused instead of possibly other applications) valuable enough to justify the marginal cost of the readability hit of adding uuids to the labels?
>>
>>
>>
>>
>>>> On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <ro...@google.com> wrote:
>>>>>
>>>>>
>>>>> Thanks for the PR.
>>>>>
>>>>> I think we should follow Java and allow non-unique labels, but not provide automatic uniquification, In particular, the danger of using a counter is that one can get accidental (and potentially hard to check) off-by-one collisions. As a concrete example, imagine one partitions a dataset into two collections, each followed by a similarly-named transform.
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A
>>>>>  \
>>>>>    --> B
>>>>>
>>>>> Uniquification would give something like
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A
>>>>>  \
>>>>>    --> B_2
>>>>>
>>>>> Suppose one then realizes there's a third case to handle, giving
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A --> B
>>>>>  \
>>>>>    --> B
>>>>>
>>>>> But this would be uniquified to
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A --> B_2
>>>>>  \
>>>>>    --> B_3
>>>>>
>>>>> where the old B_2 got renamed to B_3 and a new B_2 got put in its place. This is bad because an updating runner would then attribute old B_2's state to the new B_2 (and also possibly mis-direct any inflight messages). At least with the old, intersecting names we can detect this problem rather than silently give corrupt data.
>>>>>
>>>>>
>>>>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>
>>>>>> For posterity: https://github.com/apache/beam/pull/28984
>>>>>>
>>>>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>>
>>>>>>> I would definitely support a PR making this an option. Changing the default would be a rather big change that would require more thought.
>>>>>>>
>>>>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>
>>>>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt Apache Beam at my company and I'm trying to foresee parts of the API they might find inconvenient.
>>>>>>>>
>>>>>>>> If there's a conclusion to make the behavior similar to java, I'm happy to put up a PR
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd be a very sticky toggle since it'd be easy for PTransforms to accidentally rely on it.
>>>>>>>>>
>>>>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Huh. This used to be a hard error in Java, but I guess it's togglable with an option now. We should probably add the option to toggle Python too. (Unclear what the default should be, but this probably ties into re-thinking how pipeline update should work.)
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Makes sense that the requirement is the same, but is the label auto-generation behavior the same? I modified the BeamJava wordcount example[1] to do the regex filter twice in a row, and unlike the BeamPython example I posted before, it just warns instead of throwing an exception.
>>>>>>>>>>>
>>>>>>>>>>> Tangentially, is it expected that the Beam playground examples don't have a way to see the outputs of a run example? I have a vague memory that there used to be a way to navigate to an output file after it's generated but not sure if I just dreamt that up. Playing with the examples, I wasn't positive if my runs were actually succeeding or not based on the stdout alone.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <us...@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> BeamJava and BeamPython have the exact same behavior: transform names within must be distinct [1]. This is because we do not necessarily know at pipeline construction time if the pipeline will be streaming or batch, or if it will be updated in the future, so the decision was made to impose this restriction up front. Both will auto-generate a name for you if one is not given, but will do so deterministically (not depending on some global context) to avoid potential update problems.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] Note that this applies to the fully qualified transform name, so the naming only has to be distinct within a composite transform (or at the top level--the pipeline itself is isomorphic to a single composite transform).
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional behavior or if it's something worth changing for the python SDK
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That suggests the default label is created as that, which indeed causes the duplication error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t = beam.Filter(identity_filter)`, `t.label` is defined as `Filter(identity_filter)`.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You don't have to specify the names if the callable you pass in is /different/ for two `beam.Map`s, but  if the callable is the same you must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here's an example on playground that shows the error message you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the same pipeline level probably comes up often, at least in my inexperience. For example, you. might have a pipeline that partitions a pcoll into three different pcolls, runs some transforms on them, and then runs the same type of filter on each of them.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The case that happens most often for me is using the `assert_that` [2] testing transform. In this case, I think often users will really have no need for a disambiguating label as they're often just writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>>>>> [2] https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <us...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to specify those names.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a stable / deterministic graph.
>>>>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> See an example here https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place". Can you elaborate?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context of a python SDK user, and afaict, there doesn't seem to be an obvious way to autogenerate names/labels. Hearing that the java SDK supports it makes me wonder if the python SDK could support it as well though... (If so, I'd be happy to do implement it). Currently, it's fairly tedious to have to name every instance of a transform that you might reuse in a pipeline, e.g. when reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <us...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was because for runners that support updating pipelines in place, the only way to do so safely is if the runner can perfectly identify which transforms in the new graph match the ones in the old graph. There's no good way to auto generate names that will stay stable across updates - even small changes to the pipeline might change the order of nodes in the graph, which could result in a corrupted update.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto generate these names for you! When you call PCollection.apply (if using BeamJava), simply omit the name parameter and Beam will auto generate a unique name for you.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up from accidentally have duplicate labels from using multiple of the same transforms without labeling them. I figure this must be a common complaint, so I was just curious, what the rationale behind this design was? My naive thought off the top of my head is that it'd be more user friendly to just auto increment duplicate transforms, but I figure I must be missing something
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Joey

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via user <us...@beam.apache.org>.

I'll take another look at the PR. My inclination is still to use uuids
to uniquify. I think that's worth the cost to the readability hit (I'm
OK reducing this down to 6-8 hex digits which will still give very low
chances of collisions, though it doesn't solve the first one). If
someone cares about names more than this they can set them manually.

On Fri, Oct 20, 2023 at 9:30 AM Joey Tran <jo...@schrodinger.com> wrote:
>
> Just want to bump this discussion again. I'm introducing Beam to other developers at my Schrodinger now and the first (of hopefully many!) developer has started migrating our internal workflows to Beam. As I suspected though, he's complained about the iteration cycles spent from using the same transform without specifying a label and from using multiple assert_that's without unique labels.
>
> From the java code[1], it looks like the same naming scheme is used (I think, it's been a decade since I've read java so apologies if I'm misreading) as the PR I posted
>
> [1] https://github.com/apache/beam/blob/e7a6405800a83dd16437b8b1b372e020e010a042/sdks/java/core/src/main/java/org/apache/beam/sdk/Pipeline.java#L630
>
> On Fri, Oct 13, 2023 at 1:32 PM Joey Tran <jo...@schrodinger.com> wrote:
>>
>>
>>
>> On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw <ro...@google.com> wrote:
>>>
>>> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>
>>>> Are there places on the SDK side that expect unique labels? Or in non-updateable runners?
>>>
>>>
>>> That's a good question. The label eventually ends up here: https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142 which is technically a violation of the spec if these are not unique (though I guess Java already does this?). More importantly, though, the id of the transform (the proto contains a map of strings -> transforms) must be unique.
>>>
>>
>> Doesn't that suggest not raising an exception won't be a sufficient enough solution? But I suppose if the Java SDK does this already it's fine..?
>>
>>>
>>> Another option is to make the suffix a uuid rather than a single counter. (This would still have issues with the first application possibly getting mixed up with a "different" first application unless it was always appended.)
>>>
>>
>> Is the marginal benefit (only the first application may be confused instead of possibly other applications) valuable enough to justify the marginal cost of the readability hit of adding uuids to the labels?
>>
>>
>>
>>
>>>> On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <ro...@google.com> wrote:
>>>>>
>>>>>
>>>>> Thanks for the PR.
>>>>>
>>>>> I think we should follow Java and allow non-unique labels, but not provide automatic uniquification, In particular, the danger of using a counter is that one can get accidental (and potentially hard to check) off-by-one collisions. As a concrete example, imagine one partitions a dataset into two collections, each followed by a similarly-named transform.
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A
>>>>>  \
>>>>>    --> B
>>>>>
>>>>> Uniquification would give something like
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A
>>>>>  \
>>>>>    --> B_2
>>>>>
>>>>> Suppose one then realizes there's a third case to handle, giving
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A --> B
>>>>>  \
>>>>>    --> B
>>>>>
>>>>> But this would be uniquified to
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A --> B_2
>>>>>  \
>>>>>    --> B_3
>>>>>
>>>>> where the old B_2 got renamed to B_3 and a new B_2 got put in its place. This is bad because an updating runner would then attribute old B_2's state to the new B_2 (and also possibly mis-direct any inflight messages). At least with the old, intersecting names we can detect this problem rather than silently give corrupt data.
>>>>>
>>>>>
>>>>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>
>>>>>> For posterity: https://github.com/apache/beam/pull/28984
>>>>>>
>>>>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>>
>>>>>>> I would definitely support a PR making this an option. Changing the default would be a rather big change that would require more thought.
>>>>>>>
>>>>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>
>>>>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt Apache Beam at my company and I'm trying to foresee parts of the API they might find inconvenient.
>>>>>>>>
>>>>>>>> If there's a conclusion to make the behavior similar to java, I'm happy to put up a PR
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd be a very sticky toggle since it'd be easy for PTransforms to accidentally rely on it.
>>>>>>>>>
>>>>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Huh. This used to be a hard error in Java, but I guess it's togglable with an option now. We should probably add the option to toggle Python too. (Unclear what the default should be, but this probably ties into re-thinking how pipeline update should work.)
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Makes sense that the requirement is the same, but is the label auto-generation behavior the same? I modified the BeamJava wordcount example[1] to do the regex filter twice in a row, and unlike the BeamPython example I posted before, it just warns instead of throwing an exception.
>>>>>>>>>>>
>>>>>>>>>>> Tangentially, is it expected that the Beam playground examples don't have a way to see the outputs of a run example? I have a vague memory that there used to be a way to navigate to an output file after it's generated but not sure if I just dreamt that up. Playing with the examples, I wasn't positive if my runs were actually succeeding or not based on the stdout alone.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <us...@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> BeamJava and BeamPython have the exact same behavior: transform names within must be distinct [1]. This is because we do not necessarily know at pipeline construction time if the pipeline will be streaming or batch, or if it will be updated in the future, so the decision was made to impose this restriction up front. Both will auto-generate a name for you if one is not given, but will do so deterministically (not depending on some global context) to avoid potential update problems.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] Note that this applies to the fully qualified transform name, so the naming only has to be distinct within a composite transform (or at the top level--the pipeline itself is isomorphic to a single composite transform).
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional behavior or if it's something worth changing for the python SDK
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That suggests the default label is created as that, which indeed causes the duplication error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t = beam.Filter(identity_filter)`, `t.label` is defined as `Filter(identity_filter)`.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You don't have to specify the names if the callable you pass in is /different/ for two `beam.Map`s, but  if the callable is the same you must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here's an example on playground that shows the error message you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the same pipeline level probably comes up often, at least in my inexperience. For example, you. might have a pipeline that partitions a pcoll into three different pcolls, runs some transforms on them, and then runs the same type of filter on each of them.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The case that happens most often for me is using the `assert_that` [2] testing transform. In this case, I think often users will really have no need for a disambiguating label as they're often just writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>>>>> [2] https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <us...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to specify those names.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a stable / deterministic graph.
>>>>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> See an example here https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place". Can you elaborate?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context of a python SDK user, and afaict, there doesn't seem to be an obvious way to autogenerate names/labels. Hearing that the java SDK supports it makes me wonder if the python SDK could support it as well though... (If so, I'd be happy to do implement it). Currently, it's fairly tedious to have to name every instance of a transform that you might reuse in a pipeline, e.g. when reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <us...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was because for runners that support updating pipelines in place, the only way to do so safely is if the runner can perfectly identify which transforms in the new graph match the ones in the old graph. There's no good way to auto generate names that will stay stable across updates - even small changes to the pipeline might change the order of nodes in the graph, which could result in a corrupted update.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto generate these names for you! When you call PCollection.apply (if using BeamJava), simply omit the name parameter and Beam will auto generate a unique name for you.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <jo...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up from accidentally have duplicate labels from using multiple of the same transforms without labeling them. I figure this must be a common complaint, so I was just curious, what the rationale behind this design was? My naive thought off the top of my head is that it'd be more user friendly to just auto increment duplicate transforms, but I figure I must be missing something
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Joey

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Just want to bump this discussion again. I'm introducing Beam to other
developers at my Schrodinger now and the first (of hopefully many!)
developer has started migrating our internal workflows to Beam. As I
suspected though, he's complained about the iteration cycles spent from
using the same transform without specifying a label and from using multiple
assert_that's without unique labels.

From the java code[1], it looks like the same naming scheme is used (I
think, it's been a decade since I've read java so apologies if I'm
misreading) as the PR I posted

[1]
https://github.com/apache/beam/blob/e7a6405800a83dd16437b8b1b372e020e010a042/sdks/java/core/src/main/java/org/apache/beam/sdk/Pipeline.java#L630

On Fri, Oct 13, 2023 at 1:32 PM Joey Tran <jo...@schrodinger.com> wrote:

>
>
> On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
> Are there places on the SDK side that expect unique labels? Or in
>>> non-updateable runners?
>>>
>>
>> That's a good question. The label eventually ends up here:
>> https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142
>> which is technically a violation of the spec if these are not unique
>> (though I guess Java already does this?). More importantly, though, the id
>> of the transform (the proto contains a map of strings -> transforms) must
>> be unique.
>>
>>
> Doesn't that suggest not raising an exception won't be a sufficient enough
> solution? But I suppose if the Java SDK does this already it's fine..?
>
>
>> Another option is to make the suffix a uuid rather than a single counter.
>> (This would still have issues with the first application possibly getting
>> mixed up with a "different" first application unless it was always
>> appended.)
>>
>>
> Is the marginal benefit (*only* the first application may be confused
> instead of possibly other applications) valuable enough to justify the
> marginal cost of the readability hit of adding uuids to the labels?
>
>
>
>
> On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>>
>>>> Thanks for the PR.
>>>>
>>>> I think we should follow Java and allow non-unique labels, but not
>>>> provide automatic uniquification, In particular, the danger of using a
>>>> counter is that one can get accidental (and potentially hard to check)
>>>> off-by-one collisions. As a concrete example, imagine one partitions a
>>>> dataset into two collections, each followed by a similarly-named transform.
>>>>
>>>>     --> B
>>>>   /
>>>> A
>>>>  \
>>>>    --> B
>>>>
>>>> Uniquification would give something like
>>>>
>>>>     --> B
>>>>   /
>>>> A
>>>>  \
>>>>    --> B_2
>>>>
>>>> Suppose one then realizes there's a third case to handle, giving
>>>>
>>>>     --> B
>>>>   /
>>>> A --> B
>>>>  \
>>>>    --> B
>>>>
>>>> But this would be uniquified to
>>>>
>>>>     --> B
>>>>   /
>>>> A --> B_2
>>>>  \
>>>>    --> B_3
>>>>
>>>> where the old B_2 got renamed to B_3 and a new B_2 got put in its
>>>> place. This is bad because an updating runner would then attribute old
>>>> B_2's state to the new B_2 (and also possibly mis-direct any inflight
>>>> messages). At least with the old, intersecting names we can detect this
>>>> problem rather than silently give corrupt data.
>>>>
>>>>
>>>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> For posterity: https://github.com/apache/beam/pull/28984
>>>>>
>>>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> I would definitely support a PR making this an option. Changing the
>>>>>> default would be a rather big change that would require more thought.
>>>>>>
>>>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to
>>>>>>> adopt Apache Beam at my company and I'm trying to foresee parts of the API
>>>>>>> they might find inconvenient.
>>>>>>>
>>>>>>> If there's a conclusion to make the behavior similar to java, I'm
>>>>>>> happy to put up a PR
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle
>>>>>>>> it'd be a very sticky toggle since it'd be easy for PTransforms to
>>>>>>>> accidentally rely on it.
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <
>>>>>>>> robertwb@google.com> wrote:
>>>>>>>>
>>>>>>>>> Huh. This used to be a hard error in Java, but I guess it's
>>>>>>>>> togglable with an option now. We should probably add the option to toggle
>>>>>>>>> Python too. (Unclear what the default should be, but this probably ties
>>>>>>>>> into re-thinking how pipeline update should work.)
>>>>>>>>>
>>>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>>>>>> exception.
>>>>>>>>>>
>>>>>>>>>> Tangentially, is it expected that the Beam playground examples
>>>>>>>>>> don't have a way to see the outputs of a run example? I have a vague memory
>>>>>>>>>> that there used to be a way to navigate to an output file after it's
>>>>>>>>>> generated but not sure if I just dreamt that up. Playing with the examples,
>>>>>>>>>> I wasn't positive if my runs were actually succeeding or not based on the
>>>>>>>>>> stdout alone.
>>>>>>>>>>
>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>>>>>> streaming or batch, or if it will be updated in the future, so the decision
>>>>>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>>>>
>>>>>>>>>>> [1] Note that this applies to the fully qualified transform
>>>>>>>>>>> name, so the naming only has to be distinct within a composite transform
>>>>>>>>>>> (or at the top level--the pipeline itself is isomorphic to a single
>>>>>>>>>>> composite transform).
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <
>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Cross posting this thread to dev@ to see if this is
>>>>>>>>>>>> intentional behavior or if it's something worth changing for the python SDK
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <
>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> That suggests the default label is created as that, which
>>>>>>>>>>>>> indeed causes the duplication error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You don't have to specify the names if the callable you
>>>>>>>>>>>>>>>> pass in is /different/ for two `beam.Map`s, but  if the callable is the
>>>>>>>>>>>>>>>> same you must specify a label. For example, the below will raise an
>>>>>>>>>>>>>>>> exception:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here's an example on playground that shows the error
>>>>>>>>>>>>>>>> message you get [1]. I marked every line I added with a "#
>>>>>>>>>>>>>>>> ++".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my inexperience.
>>>>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you
>>>>>>>>>>>>>>>>> have a stable / deterministic graph.
>>>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context
>>>>>>>>>>>>>>>>>> of a python SDK user, and afaict, there doesn't seem to be an obvious way
>>>>>>>>>>>>>>>>>> to autogenerate names/labels. Hearing that the java SDK supports it makes
>>>>>>>>>>>>>>>>>> me wonder if the python SDK could support it as well though... (If so, I'd
>>>>>>>>>>>>>>>>>> be happy to do implement it). Currently, it's fairly tedious to have to
>>>>>>>>>>>>>>>>>> name every instance of a transform that you might reuse in a pipeline, e.g.
>>>>>>>>>>>>>>>>>> when reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason
>>>>>>>>>>>>>>>>>>> was because for runners that support updating pipelines in place, the only
>>>>>>>>>>>>>>>>>>> way to do so safely is if the runner can perfectly identify which
>>>>>>>>>>>>>>>>>>> transforms in the new graph match the ones in the old graph. There's no
>>>>>>>>>>>>>>>>>>> good way to auto generate names that will stay stable across updates - even
>>>>>>>>>>>>>>>>>>> small changes to the pipeline might change the order of nodes in the graph,
>>>>>>>>>>>>>>>>>>> which could result in a corrupted update.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting
>>>>>>>>>>>>>>>>>>>> tripped up from accidentally have duplicate labels from using multiple of
>>>>>>>>>>>>>>>>>>>> the same transforms without labeling them. I figure this must be a common
>>>>>>>>>>>>>>>>>>>> complaint, so I was just curious, what the rationale behind this design
>>>>>>>>>>>>>>>>>>>> was? My naive thought off the top of my head is that it'd be more user
>>>>>>>>>>>>>>>>>>>> friendly to just auto increment duplicate transforms, but I figure I must
>>>>>>>>>>>>>>>>>>>> be missing something
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Just want to bump this discussion again. I'm introducing Beam to other
developers at my Schrodinger now and the first (of hopefully many!)
developer has started migrating our internal workflows to Beam. As I
suspected though, he's complained about the iteration cycles spent from
using the same transform without specifying a label and from using multiple
assert_that's without unique labels.

From the java code[1], it looks like the same naming scheme is used (I
think, it's been a decade since I've read java so apologies if I'm
misreading) as the PR I posted

[1]
https://github.com/apache/beam/blob/e7a6405800a83dd16437b8b1b372e020e010a042/sdks/java/core/src/main/java/org/apache/beam/sdk/Pipeline.java#L630

On Fri, Oct 13, 2023 at 1:32 PM Joey Tran <jo...@schrodinger.com> wrote:

>
>
> On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
> Are there places on the SDK side that expect unique labels? Or in
>>> non-updateable runners?
>>>
>>
>> That's a good question. The label eventually ends up here:
>> https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142
>> which is technically a violation of the spec if these are not unique
>> (though I guess Java already does this?). More importantly, though, the id
>> of the transform (the proto contains a map of strings -> transforms) must
>> be unique.
>>
>>
> Doesn't that suggest not raising an exception won't be a sufficient enough
> solution? But I suppose if the Java SDK does this already it's fine..?
>
>
>> Another option is to make the suffix a uuid rather than a single counter.
>> (This would still have issues with the first application possibly getting
>> mixed up with a "different" first application unless it was always
>> appended.)
>>
>>
> Is the marginal benefit (*only* the first application may be confused
> instead of possibly other applications) valuable enough to justify the
> marginal cost of the readability hit of adding uuids to the labels?
>
>
>
>
> On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>>
>>>> Thanks for the PR.
>>>>
>>>> I think we should follow Java and allow non-unique labels, but not
>>>> provide automatic uniquification, In particular, the danger of using a
>>>> counter is that one can get accidental (and potentially hard to check)
>>>> off-by-one collisions. As a concrete example, imagine one partitions a
>>>> dataset into two collections, each followed by a similarly-named transform.
>>>>
>>>>     --> B
>>>>   /
>>>> A
>>>>  \
>>>>    --> B
>>>>
>>>> Uniquification would give something like
>>>>
>>>>     --> B
>>>>   /
>>>> A
>>>>  \
>>>>    --> B_2
>>>>
>>>> Suppose one then realizes there's a third case to handle, giving
>>>>
>>>>     --> B
>>>>   /
>>>> A --> B
>>>>  \
>>>>    --> B
>>>>
>>>> But this would be uniquified to
>>>>
>>>>     --> B
>>>>   /
>>>> A --> B_2
>>>>  \
>>>>    --> B_3
>>>>
>>>> where the old B_2 got renamed to B_3 and a new B_2 got put in its
>>>> place. This is bad because an updating runner would then attribute old
>>>> B_2's state to the new B_2 (and also possibly mis-direct any inflight
>>>> messages). At least with the old, intersecting names we can detect this
>>>> problem rather than silently give corrupt data.
>>>>
>>>>
>>>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> For posterity: https://github.com/apache/beam/pull/28984
>>>>>
>>>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> I would definitely support a PR making this an option. Changing the
>>>>>> default would be a rather big change that would require more thought.
>>>>>>
>>>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to
>>>>>>> adopt Apache Beam at my company and I'm trying to foresee parts of the API
>>>>>>> they might find inconvenient.
>>>>>>>
>>>>>>> If there's a conclusion to make the behavior similar to java, I'm
>>>>>>> happy to put up a PR
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle
>>>>>>>> it'd be a very sticky toggle since it'd be easy for PTransforms to
>>>>>>>> accidentally rely on it.
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <
>>>>>>>> robertwb@google.com> wrote:
>>>>>>>>
>>>>>>>>> Huh. This used to be a hard error in Java, but I guess it's
>>>>>>>>> togglable with an option now. We should probably add the option to toggle
>>>>>>>>> Python too. (Unclear what the default should be, but this probably ties
>>>>>>>>> into re-thinking how pipeline update should work.)
>>>>>>>>>
>>>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>>>>>> exception.
>>>>>>>>>>
>>>>>>>>>> Tangentially, is it expected that the Beam playground examples
>>>>>>>>>> don't have a way to see the outputs of a run example? I have a vague memory
>>>>>>>>>> that there used to be a way to navigate to an output file after it's
>>>>>>>>>> generated but not sure if I just dreamt that up. Playing with the examples,
>>>>>>>>>> I wasn't positive if my runs were actually succeeding or not based on the
>>>>>>>>>> stdout alone.
>>>>>>>>>>
>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>>>>>> streaming or batch, or if it will be updated in the future, so the decision
>>>>>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>>>>
>>>>>>>>>>> [1] Note that this applies to the fully qualified transform
>>>>>>>>>>> name, so the naming only has to be distinct within a composite transform
>>>>>>>>>>> (or at the top level--the pipeline itself is isomorphic to a single
>>>>>>>>>>> composite transform).
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <
>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Cross posting this thread to dev@ to see if this is
>>>>>>>>>>>> intentional behavior or if it's something worth changing for the python SDK
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <
>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> That suggests the default label is created as that, which
>>>>>>>>>>>>> indeed causes the duplication error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You don't have to specify the names if the callable you
>>>>>>>>>>>>>>>> pass in is /different/ for two `beam.Map`s, but  if the callable is the
>>>>>>>>>>>>>>>> same you must specify a label. For example, the below will raise an
>>>>>>>>>>>>>>>> exception:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here's an example on playground that shows the error
>>>>>>>>>>>>>>>> message you get [1]. I marked every line I added with a "#
>>>>>>>>>>>>>>>> ++".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my inexperience.
>>>>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you
>>>>>>>>>>>>>>>>> have a stable / deterministic graph.
>>>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context
>>>>>>>>>>>>>>>>>> of a python SDK user, and afaict, there doesn't seem to be an obvious way
>>>>>>>>>>>>>>>>>> to autogenerate names/labels. Hearing that the java SDK supports it makes
>>>>>>>>>>>>>>>>>> me wonder if the python SDK could support it as well though... (If so, I'd
>>>>>>>>>>>>>>>>>> be happy to do implement it). Currently, it's fairly tedious to have to
>>>>>>>>>>>>>>>>>> name every instance of a transform that you might reuse in a pipeline, e.g.
>>>>>>>>>>>>>>>>>> when reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason
>>>>>>>>>>>>>>>>>>> was because for runners that support updating pipelines in place, the only
>>>>>>>>>>>>>>>>>>> way to do so safely is if the runner can perfectly identify which
>>>>>>>>>>>>>>>>>>> transforms in the new graph match the ones in the old graph. There's no
>>>>>>>>>>>>>>>>>>> good way to auto generate names that will stay stable across updates - even
>>>>>>>>>>>>>>>>>>> small changes to the pipeline might change the order of nodes in the graph,
>>>>>>>>>>>>>>>>>>> which could result in a corrupted update.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting
>>>>>>>>>>>>>>>>>>>> tripped up from accidentally have duplicate labels from using multiple of
>>>>>>>>>>>>>>>>>>>> the same transforms without labeling them. I figure this must be a common
>>>>>>>>>>>>>>>>>>>> complaint, so I was just curious, what the rationale behind this design
>>>>>>>>>>>>>>>>>>>> was? My naive thought off the top of my head is that it'd be more user
>>>>>>>>>>>>>>>>>>>> friendly to just auto increment duplicate transforms, but I figure I must
>>>>>>>>>>>>>>>>>>>> be missing something
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw <ro...@google.com> wrote:

> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
Are there places on the SDK side that expect unique labels? Or in
>> non-updateable runners?
>>
>
> That's a good question. The label eventually ends up here:
> https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142
> which is technically a violation of the spec if these are not unique
> (though I guess Java already does this?). More importantly, though, the id
> of the transform (the proto contains a map of strings -> transforms) must
> be unique.
>
>
Doesn't that suggest not raising an exception won't be a sufficient enough
solution? But I suppose if the Java SDK does this already it's fine..?


> Another option is to make the suffix a uuid rather than a single counter.
> (This would still have issues with the first application possibly getting
> mixed up with a "different" first application unless it was always
> appended.)
>
>
Is the marginal benefit (*only* the first application may be confused
instead of possibly other applications) valuable enough to justify the
marginal cost of the readability hit of adding uuids to the labels?




On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>>
>>> Thanks for the PR.
>>>
>>> I think we should follow Java and allow non-unique labels, but not
>>> provide automatic uniquification, In particular, the danger of using a
>>> counter is that one can get accidental (and potentially hard to check)
>>> off-by-one collisions. As a concrete example, imagine one partitions a
>>> dataset into two collections, each followed by a similarly-named transform.
>>>
>>>     --> B
>>>   /
>>> A
>>>  \
>>>    --> B
>>>
>>> Uniquification would give something like
>>>
>>>     --> B
>>>   /
>>> A
>>>  \
>>>    --> B_2
>>>
>>> Suppose one then realizes there's a third case to handle, giving
>>>
>>>     --> B
>>>   /
>>> A --> B
>>>  \
>>>    --> B
>>>
>>> But this would be uniquified to
>>>
>>>     --> B
>>>   /
>>> A --> B_2
>>>  \
>>>    --> B_3
>>>
>>> where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
>>> This is bad because an updating runner would then attribute old B_2's state
>>> to the new B_2 (and also possibly mis-direct any inflight messages). At
>>> least with the old, intersecting names we can detect this problem
>>> rather than silently give corrupt data.
>>>
>>>
>>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> For posterity: https://github.com/apache/beam/pull/28984
>>>>
>>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> I would definitely support a PR making this an option. Changing the
>>>>> default would be a rather big change that would require more thought.
>>>>>
>>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to
>>>>>> adopt Apache Beam at my company and I'm trying to foresee parts of the API
>>>>>> they might find inconvenient.
>>>>>>
>>>>>> If there's a conclusion to make the behavior similar to java, I'm
>>>>>> happy to put up a PR
>>>>>>
>>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle
>>>>>>> it'd be a very sticky toggle since it'd be easy for PTransforms to
>>>>>>> accidentally rely on it.
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Huh. This used to be a hard error in Java, but I guess it's
>>>>>>>> togglable with an option now. We should probably add the option to toggle
>>>>>>>> Python too. (Unclear what the default should be, but this probably ties
>>>>>>>> into re-thinking how pipeline update should work.)
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>>>>> exception.
>>>>>>>>>
>>>>>>>>> Tangentially, is it expected that the Beam playground examples
>>>>>>>>> don't have a way to see the outputs of a run example? I have a vague memory
>>>>>>>>> that there used to be a way to navigate to an output file after it's
>>>>>>>>> generated but not sure if I just dreamt that up. Playing with the examples,
>>>>>>>>> I wasn't positive if my runs were actually succeeding or not based on the
>>>>>>>>> stdout alone.
>>>>>>>>>
>>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>
>>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>>>>> streaming or batch, or if it will be updated in the future, so the decision
>>>>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>>>
>>>>>>>>>> [1] Note that this applies to the fully qualified transform name,
>>>>>>>>>> so the naming only has to be distinct within a composite transform (or at
>>>>>>>>>> the top level--the pipeline itself is isomorphic to a single composite
>>>>>>>>>> transform).
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <
>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <
>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> That suggests the default label is created as that, which
>>>>>>>>>>>> indeed causes the duplication error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You don't have to specify the names if the callable you pass
>>>>>>>>>>>>>>> in is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my inexperience.
>>>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you
>>>>>>>>>>>>>>>> have a stable / deterministic graph.
>>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context
>>>>>>>>>>>>>>>>> of a python SDK user, and afaict, there doesn't seem to be an obvious way
>>>>>>>>>>>>>>>>> to autogenerate names/labels. Hearing that the java SDK supports it makes
>>>>>>>>>>>>>>>>> me wonder if the python SDK could support it as well though... (If so, I'd
>>>>>>>>>>>>>>>>> be happy to do implement it). Currently, it's fairly tedious to have to
>>>>>>>>>>>>>>>>> name every instance of a transform that you might reuse in a pipeline, e.g.
>>>>>>>>>>>>>>>>> when reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason
>>>>>>>>>>>>>>>>>> was because for runners that support updating pipelines in place, the only
>>>>>>>>>>>>>>>>>> way to do so safely is if the runner can perfectly identify which
>>>>>>>>>>>>>>>>>> transforms in the new graph match the ones in the old graph. There's no
>>>>>>>>>>>>>>>>>> good way to auto generate names that will stay stable across updates - even
>>>>>>>>>>>>>>>>>> small changes to the pipeline might change the order of nodes in the graph,
>>>>>>>>>>>>>>>>>> which could result in a corrupted update.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting
>>>>>>>>>>>>>>>>>>> tripped up from accidentally have duplicate labels from using multiple of
>>>>>>>>>>>>>>>>>>> the same transforms without labeling them. I figure this must be a common
>>>>>>>>>>>>>>>>>>> complaint, so I was just curious, what the rationale behind this design
>>>>>>>>>>>>>>>>>>> was? My naive thought off the top of my head is that it'd be more user
>>>>>>>>>>>>>>>>>>> friendly to just auto increment duplicate transforms, but I figure I must
>>>>>>>>>>>>>>>>>>> be missing something
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw <ro...@google.com> wrote:

> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
Are there places on the SDK side that expect unique labels? Or in
>> non-updateable runners?
>>
>
> That's a good question. The label eventually ends up here:
> https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142
> which is technically a violation of the spec if these are not unique
> (though I guess Java already does this?). More importantly, though, the id
> of the transform (the proto contains a map of strings -> transforms) must
> be unique.
>
>
Doesn't that suggest not raising an exception won't be a sufficient enough
solution? But I suppose if the Java SDK does this already it's fine..?


> Another option is to make the suffix a uuid rather than a single counter.
> (This would still have issues with the first application possibly getting
> mixed up with a "different" first application unless it was always
> appended.)
>
>
Is the marginal benefit (*only* the first application may be confused
instead of possibly other applications) valuable enough to justify the
marginal cost of the readability hit of adding uuids to the labels?




On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>>
>>> Thanks for the PR.
>>>
>>> I think we should follow Java and allow non-unique labels, but not
>>> provide automatic uniquification, In particular, the danger of using a
>>> counter is that one can get accidental (and potentially hard to check)
>>> off-by-one collisions. As a concrete example, imagine one partitions a
>>> dataset into two collections, each followed by a similarly-named transform.
>>>
>>>     --> B
>>>   /
>>> A
>>>  \
>>>    --> B
>>>
>>> Uniquification would give something like
>>>
>>>     --> B
>>>   /
>>> A
>>>  \
>>>    --> B_2
>>>
>>> Suppose one then realizes there's a third case to handle, giving
>>>
>>>     --> B
>>>   /
>>> A --> B
>>>  \
>>>    --> B
>>>
>>> But this would be uniquified to
>>>
>>>     --> B
>>>   /
>>> A --> B_2
>>>  \
>>>    --> B_3
>>>
>>> where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
>>> This is bad because an updating runner would then attribute old B_2's state
>>> to the new B_2 (and also possibly mis-direct any inflight messages). At
>>> least with the old, intersecting names we can detect this problem
>>> rather than silently give corrupt data.
>>>
>>>
>>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> For posterity: https://github.com/apache/beam/pull/28984
>>>>
>>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> I would definitely support a PR making this an option. Changing the
>>>>> default would be a rather big change that would require more thought.
>>>>>
>>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to
>>>>>> adopt Apache Beam at my company and I'm trying to foresee parts of the API
>>>>>> they might find inconvenient.
>>>>>>
>>>>>> If there's a conclusion to make the behavior similar to java, I'm
>>>>>> happy to put up a PR
>>>>>>
>>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle
>>>>>>> it'd be a very sticky toggle since it'd be easy for PTransforms to
>>>>>>> accidentally rely on it.
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Huh. This used to be a hard error in Java, but I guess it's
>>>>>>>> togglable with an option now. We should probably add the option to toggle
>>>>>>>> Python too. (Unclear what the default should be, but this probably ties
>>>>>>>> into re-thinking how pipeline update should work.)
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>>>>> exception.
>>>>>>>>>
>>>>>>>>> Tangentially, is it expected that the Beam playground examples
>>>>>>>>> don't have a way to see the outputs of a run example? I have a vague memory
>>>>>>>>> that there used to be a way to navigate to an output file after it's
>>>>>>>>> generated but not sure if I just dreamt that up. Playing with the examples,
>>>>>>>>> I wasn't positive if my runs were actually succeeding or not based on the
>>>>>>>>> stdout alone.
>>>>>>>>>
>>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>
>>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>>>>> streaming or batch, or if it will be updated in the future, so the decision
>>>>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>>>
>>>>>>>>>> [1] Note that this applies to the fully qualified transform name,
>>>>>>>>>> so the naming only has to be distinct within a composite transform (or at
>>>>>>>>>> the top level--the pipeline itself is isomorphic to a single composite
>>>>>>>>>> transform).
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <
>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <
>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> That suggests the default label is created as that, which
>>>>>>>>>>>> indeed causes the duplication error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You don't have to specify the names if the callable you pass
>>>>>>>>>>>>>>> in is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my inexperience.
>>>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you
>>>>>>>>>>>>>>>> have a stable / deterministic graph.
>>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context
>>>>>>>>>>>>>>>>> of a python SDK user, and afaict, there doesn't seem to be an obvious way
>>>>>>>>>>>>>>>>> to autogenerate names/labels. Hearing that the java SDK supports it makes
>>>>>>>>>>>>>>>>> me wonder if the python SDK could support it as well though... (If so, I'd
>>>>>>>>>>>>>>>>> be happy to do implement it). Currently, it's fairly tedious to have to
>>>>>>>>>>>>>>>>> name every instance of a transform that you might reuse in a pipeline, e.g.
>>>>>>>>>>>>>>>>> when reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason
>>>>>>>>>>>>>>>>>> was because for runners that support updating pipelines in place, the only
>>>>>>>>>>>>>>>>>> way to do so safely is if the runner can perfectly identify which
>>>>>>>>>>>>>>>>>> transforms in the new graph match the ones in the old graph. There's no
>>>>>>>>>>>>>>>>>> good way to auto generate names that will stay stable across updates - even
>>>>>>>>>>>>>>>>>> small changes to the pipeline might change the order of nodes in the graph,
>>>>>>>>>>>>>>>>>> which could result in a corrupted update.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting
>>>>>>>>>>>>>>>>>>> tripped up from accidentally have duplicate labels from using multiple of
>>>>>>>>>>>>>>>>>>> the same transforms without labeling them. I figure this must be a common
>>>>>>>>>>>>>>>>>>> complaint, so I was just curious, what the rationale behind this design
>>>>>>>>>>>>>>>>>>> was? My naive thought off the top of my head is that it'd be more user
>>>>>>>>>>>>>>>>>>> friendly to just auto increment duplicate transforms, but I figure I must
>>>>>>>>>>>>>>>>>>> be missing something
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via dev <de...@beam.apache.org>.

On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <jo...@schrodinger.com>
wrote:

> That makes sense. Would you suggest the new option simply suppress the
> RuntimeError and use the non-unique label?
>

Yes. (Or, rather, not raise it.)


> Are there places on the SDK side that expect unique labels? Or in
> non-updateable runners?
>

That's a good question. The label eventually ends up here:
https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142
which is technically a violation of the spec if these are not unique
(though I guess Java already does this?). More importantly, though, the id
of the transform (the proto contains a map of strings -> transforms) must
be unique.


> Would making `--auto_unique_labels` a mutually exclusive option with
> streaming be a reasonable option? Not sure if stream/batch-specific options
> are a good idea or not... but if there's precedent, then it'd be an easy
> option
>

It's not always possible to even tell at pipeline construction whether the
pipeline will be run in streaming mode. (I suppose one could check later?)
It's generally something we try to avoid though.

Another option is to make the suffix a uuid rather than a single counter.
(This would still have issues with the first application possibly getting
mixed up with a "different" first application unless it was always
appended.)


> On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>>
>> Thanks for the PR.
>>
>> I think we should follow Java and allow non-unique labels, but not
>> provide automatic uniquification, In particular, the danger of using a
>> counter is that one can get accidental (and potentially hard to check)
>> off-by-one collisions. As a concrete example, imagine one partitions a
>> dataset into two collections, each followed by a similarly-named transform.
>>
>>     --> B
>>   /
>> A
>>  \
>>    --> B
>>
>> Uniquification would give something like
>>
>>     --> B
>>   /
>> A
>>  \
>>    --> B_2
>>
>> Suppose one then realizes there's a third case to handle, giving
>>
>>     --> B
>>   /
>> A --> B
>>  \
>>    --> B
>>
>> But this would be uniquified to
>>
>>     --> B
>>   /
>> A --> B_2
>>  \
>>    --> B_3
>>
>> where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
>> This is bad because an updating runner would then attribute old B_2's state
>> to the new B_2 (and also possibly mis-direct any inflight messages). At
>> least with the old, intersecting names we can detect this problem
>> rather than silently give corrupt data.
>>
>>
>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> For posterity: https://github.com/apache/beam/pull/28984
>>>
>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> I would definitely support a PR making this an option. Changing the
>>>> default would be a rather big change that would require more thought.
>>>>
>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>>>>> Apache Beam at my company and I'm trying to foresee parts of the API they
>>>>> might find inconvenient.
>>>>>
>>>>> If there's a conclusion to make the behavior similar to java, I'm
>>>>> happy to put up a PR
>>>>>
>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd
>>>>>> be a very sticky toggle since it'd be easy for PTransforms to accidentally
>>>>>> rely on it.
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Huh. This used to be a hard error in Java, but I guess it's
>>>>>>> togglable with an option now. We should probably add the option to toggle
>>>>>>> Python too. (Unclear what the default should be, but this probably ties
>>>>>>> into re-thinking how pipeline update should work.)
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>>>> exception.
>>>>>>>>
>>>>>>>> Tangentially, is it expected that the Beam playground examples
>>>>>>>> don't have a way to see the outputs of a run example? I have a vague memory
>>>>>>>> that there used to be a way to navigate to an output file after it's
>>>>>>>> generated but not sure if I just dreamt that up. Playing with the examples,
>>>>>>>> I wasn't positive if my runs were actually succeeding or not based on the
>>>>>>>> stdout alone.
>>>>>>>>
>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>
>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>>>> streaming or batch, or if it will be updated in the future, so the decision
>>>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>>
>>>>>>>>> [1] Note that this applies to the fully qualified transform name,
>>>>>>>>> so the naming only has to be distinct within a composite transform (or at
>>>>>>>>> the top level--the pipeline itself is isomorphic to a single composite
>>>>>>>>> transform).
>>>>>>>>>
>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <
>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>>>>> causes the duplication error.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You don't have to specify the names if the callable you pass
>>>>>>>>>>>>>> in is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my inexperience.
>>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you
>>>>>>>>>>>>>>> have a stable / deterministic graph.
>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context
>>>>>>>>>>>>>>>> of a python SDK user, and afaict, there doesn't seem to be an obvious way
>>>>>>>>>>>>>>>> to autogenerate names/labels. Hearing that the java SDK supports it makes
>>>>>>>>>>>>>>>> me wonder if the python SDK could support it as well though... (If so, I'd
>>>>>>>>>>>>>>>> be happy to do implement it). Currently, it's fairly tedious to have to
>>>>>>>>>>>>>>>> name every instance of a transform that you might reuse in a pipeline, e.g.
>>>>>>>>>>>>>>>> when reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped
>>>>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via user <us...@beam.apache.org>.

On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <jo...@schrodinger.com>
wrote:

> That makes sense. Would you suggest the new option simply suppress the
> RuntimeError and use the non-unique label?
>

Yes. (Or, rather, not raise it.)


> Are there places on the SDK side that expect unique labels? Or in
> non-updateable runners?
>

That's a good question. The label eventually ends up here:
https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142
which is technically a violation of the spec if these are not unique
(though I guess Java already does this?). More importantly, though, the id
of the transform (the proto contains a map of strings -> transforms) must
be unique.


> Would making `--auto_unique_labels` a mutually exclusive option with
> streaming be a reasonable option? Not sure if stream/batch-specific options
> are a good idea or not... but if there's precedent, then it'd be an easy
> option
>

It's not always possible to even tell at pipeline construction whether the
pipeline will be run in streaming mode. (I suppose one could check later?)
It's generally something we try to avoid though.

Another option is to make the suffix a uuid rather than a single counter.
(This would still have issues with the first application possibly getting
mixed up with a "different" first application unless it was always
appended.)


> On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>>
>> Thanks for the PR.
>>
>> I think we should follow Java and allow non-unique labels, but not
>> provide automatic uniquification, In particular, the danger of using a
>> counter is that one can get accidental (and potentially hard to check)
>> off-by-one collisions. As a concrete example, imagine one partitions a
>> dataset into two collections, each followed by a similarly-named transform.
>>
>>     --> B
>>   /
>> A
>>  \
>>    --> B
>>
>> Uniquification would give something like
>>
>>     --> B
>>   /
>> A
>>  \
>>    --> B_2
>>
>> Suppose one then realizes there's a third case to handle, giving
>>
>>     --> B
>>   /
>> A --> B
>>  \
>>    --> B
>>
>> But this would be uniquified to
>>
>>     --> B
>>   /
>> A --> B_2
>>  \
>>    --> B_3
>>
>> where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
>> This is bad because an updating runner would then attribute old B_2's state
>> to the new B_2 (and also possibly mis-direct any inflight messages). At
>> least with the old, intersecting names we can detect this problem
>> rather than silently give corrupt data.
>>
>>
>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> For posterity: https://github.com/apache/beam/pull/28984
>>>
>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> I would definitely support a PR making this an option. Changing the
>>>> default would be a rather big change that would require more thought.
>>>>
>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>>>>> Apache Beam at my company and I'm trying to foresee parts of the API they
>>>>> might find inconvenient.
>>>>>
>>>>> If there's a conclusion to make the behavior similar to java, I'm
>>>>> happy to put up a PR
>>>>>
>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd
>>>>>> be a very sticky toggle since it'd be easy for PTransforms to accidentally
>>>>>> rely on it.
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Huh. This used to be a hard error in Java, but I guess it's
>>>>>>> togglable with an option now. We should probably add the option to toggle
>>>>>>> Python too. (Unclear what the default should be, but this probably ties
>>>>>>> into re-thinking how pipeline update should work.)
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>>>> exception.
>>>>>>>>
>>>>>>>> Tangentially, is it expected that the Beam playground examples
>>>>>>>> don't have a way to see the outputs of a run example? I have a vague memory
>>>>>>>> that there used to be a way to navigate to an output file after it's
>>>>>>>> generated but not sure if I just dreamt that up. Playing with the examples,
>>>>>>>> I wasn't positive if my runs were actually succeeding or not based on the
>>>>>>>> stdout alone.
>>>>>>>>
>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>
>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>>>> streaming or batch, or if it will be updated in the future, so the decision
>>>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>>
>>>>>>>>> [1] Note that this applies to the fully qualified transform name,
>>>>>>>>> so the naming only has to be distinct within a composite transform (or at
>>>>>>>>> the top level--the pipeline itself is isomorphic to a single composite
>>>>>>>>> transform).
>>>>>>>>>
>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <
>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>>>>> causes the duplication error.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You don't have to specify the names if the callable you pass
>>>>>>>>>>>>>> in is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my inexperience.
>>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you
>>>>>>>>>>>>>>> have a stable / deterministic graph.
>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context
>>>>>>>>>>>>>>>> of a python SDK user, and afaict, there doesn't seem to be an obvious way
>>>>>>>>>>>>>>>> to autogenerate names/labels. Hearing that the java SDK supports it makes
>>>>>>>>>>>>>>>> me wonder if the python SDK could support it as well though... (If so, I'd
>>>>>>>>>>>>>>>> be happy to do implement it). Currently, it's fairly tedious to have to
>>>>>>>>>>>>>>>> name every instance of a transform that you might reuse in a pipeline, e.g.
>>>>>>>>>>>>>>>> when reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped
>>>>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

That makes sense. Would you suggest the new option simply suppress the
RuntimeError and use the non-unique label? Are there places on the SDK side
that expect unique labels? Or in non-updateable runners?

Would making `--auto_unique_labels` a mutually exclusive option with
streaming be a reasonable option? Not sure if stream/batch-specific options
are a good idea or not... but if there's precedent, then it'd be an easy
option

On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <ro...@google.com>
wrote:

>
> Thanks for the PR.
>
> I think we should follow Java and allow non-unique labels, but not provide
> automatic uniquification, In particular, the danger of using a counter is
> that one can get accidental (and potentially hard to check) off-by-one
> collisions. As a concrete example, imagine one partitions a dataset into
> two collections, each followed by a similarly-named transform.
>
>     --> B
>   /
> A
>  \
>    --> B
>
> Uniquification would give something like
>
>     --> B
>   /
> A
>  \
>    --> B_2
>
> Suppose one then realizes there's a third case to handle, giving
>
>     --> B
>   /
> A --> B
>  \
>    --> B
>
> But this would be uniquified to
>
>     --> B
>   /
> A --> B_2
>  \
>    --> B_3
>
> where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
> This is bad because an updating runner would then attribute old B_2's state
> to the new B_2 (and also possibly mis-direct any inflight messages). At
> least with the old, intersecting names we can detect this problem
> rather than silently give corrupt data.
>
>
> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> For posterity: https://github.com/apache/beam/pull/28984
>>
>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> I would definitely support a PR making this an option. Changing the
>>> default would be a rather big change that would require more thought.
>>>
>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>>>> Apache Beam at my company and I'm trying to foresee parts of the API they
>>>> might find inconvenient.
>>>>
>>>> If there's a conclusion to make the behavior similar to java, I'm happy
>>>> to put up a PR
>>>>
>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd
>>>>> be a very sticky toggle since it'd be easy for PTransforms to accidentally
>>>>> rely on it.
>>>>>
>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Huh. This used to be a hard error in Java, but I guess it's togglable
>>>>>> with an option now. We should probably add the option to toggle Python too.
>>>>>> (Unclear what the default should be, but this probably ties into
>>>>>> re-thinking how pipeline update should work.)
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>>> exception.
>>>>>>>
>>>>>>> Tangentially, is it expected that the Beam playground examples don't
>>>>>>> have a way to see the outputs of a run example? I have a vague memory that
>>>>>>> there used to be a way to navigate to an output file after it's generated
>>>>>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>>>>>> positive if my runs were actually succeeding or not based on the stdout
>>>>>>> alone.
>>>>>>>
>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>
>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>> user@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>>> streaming or batch, or if it will be updated in the future, so the decision
>>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>
>>>>>>>> [1] Note that this applies to the fully qualified transform name,
>>>>>>>> so the naming only has to be distinct within a composite transform (or at
>>>>>>>> the top level--the pipeline itself is isomorphic to a single composite
>>>>>>>> transform).
>>>>>>>>
>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>>>> causes the duplication error.
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> You don't have to specify the names if the callable you pass
>>>>>>>>>>>>> in is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my inexperience.
>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have
>>>>>>>>>>>>>> a stable / deterministic graph.
>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context of
>>>>>>>>>>>>>>> a python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped
>>>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

That makes sense. Would you suggest the new option simply suppress the
RuntimeError and use the non-unique label? Are there places on the SDK side
that expect unique labels? Or in non-updateable runners?

Would making `--auto_unique_labels` a mutually exclusive option with
streaming be a reasonable option? Not sure if stream/batch-specific options
are a good idea or not... but if there's precedent, then it'd be an easy
option

On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <ro...@google.com>
wrote:

>
> Thanks for the PR.
>
> I think we should follow Java and allow non-unique labels, but not provide
> automatic uniquification, In particular, the danger of using a counter is
> that one can get accidental (and potentially hard to check) off-by-one
> collisions. As a concrete example, imagine one partitions a dataset into
> two collections, each followed by a similarly-named transform.
>
>     --> B
>   /
> A
>  \
>    --> B
>
> Uniquification would give something like
>
>     --> B
>   /
> A
>  \
>    --> B_2
>
> Suppose one then realizes there's a third case to handle, giving
>
>     --> B
>   /
> A --> B
>  \
>    --> B
>
> But this would be uniquified to
>
>     --> B
>   /
> A --> B_2
>  \
>    --> B_3
>
> where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
> This is bad because an updating runner would then attribute old B_2's state
> to the new B_2 (and also possibly mis-direct any inflight messages). At
> least with the old, intersecting names we can detect this problem
> rather than silently give corrupt data.
>
>
> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> For posterity: https://github.com/apache/beam/pull/28984
>>
>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> I would definitely support a PR making this an option. Changing the
>>> default would be a rather big change that would require more thought.
>>>
>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>>>> Apache Beam at my company and I'm trying to foresee parts of the API they
>>>> might find inconvenient.
>>>>
>>>> If there's a conclusion to make the behavior similar to java, I'm happy
>>>> to put up a PR
>>>>
>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd
>>>>> be a very sticky toggle since it'd be easy for PTransforms to accidentally
>>>>> rely on it.
>>>>>
>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Huh. This used to be a hard error in Java, but I guess it's togglable
>>>>>> with an option now. We should probably add the option to toggle Python too.
>>>>>> (Unclear what the default should be, but this probably ties into
>>>>>> re-thinking how pipeline update should work.)
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>>> exception.
>>>>>>>
>>>>>>> Tangentially, is it expected that the Beam playground examples don't
>>>>>>> have a way to see the outputs of a run example? I have a vague memory that
>>>>>>> there used to be a way to navigate to an output file after it's generated
>>>>>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>>>>>> positive if my runs were actually succeeding or not based on the stdout
>>>>>>> alone.
>>>>>>>
>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>
>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>> user@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>>> streaming or batch, or if it will be updated in the future, so the decision
>>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>
>>>>>>>> [1] Note that this applies to the fully qualified transform name,
>>>>>>>> so the naming only has to be distinct within a composite transform (or at
>>>>>>>> the top level--the pipeline itself is isomorphic to a single composite
>>>>>>>> transform).
>>>>>>>>
>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>>>> causes the duplication error.
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> You don't have to specify the names if the callable you pass
>>>>>>>>>>>>> in is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my inexperience.
>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have
>>>>>>>>>>>>>> a stable / deterministic graph.
>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context of
>>>>>>>>>>>>>>> a python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped
>>>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via user <us...@beam.apache.org>.

Thanks for the PR.

I think we should follow Java and allow non-unique labels, but not provide
automatic uniquification, In particular, the danger of using a counter is
that one can get accidental (and potentially hard to check) off-by-one
collisions. As a concrete example, imagine one partitions a dataset into
two collections, each followed by a similarly-named transform.

    --> B
  /
A
 \
   --> B

Uniquification would give something like

    --> B
  /
A
 \
   --> B_2

Suppose one then realizes there's a third case to handle, giving

    --> B
  /
A --> B
 \
   --> B

But this would be uniquified to

    --> B
  /
A --> B_2
 \
   --> B_3

where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
This is bad because an updating runner would then attribute old B_2's state
to the new B_2 (and also possibly mis-direct any inflight messages). At
least with the old, intersecting names we can detect this problem
rather than silently give corrupt data.


On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com> wrote:

> For posterity: https://github.com/apache/beam/pull/28984
>
> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> I would definitely support a PR making this an option. Changing the
>> default would be a rather big change that would require more thought.
>>
>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>>> Apache Beam at my company and I'm trying to foresee parts of the API they
>>> might find inconvenient.
>>>
>>> If there's a conclusion to make the behavior similar to java, I'm happy
>>> to put up a PR
>>>
>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd
>>>> be a very sticky toggle since it'd be easy for PTransforms to accidentally
>>>> rely on it.
>>>>
>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> Huh. This used to be a hard error in Java, but I guess it's togglable
>>>>> with an option now. We should probably add the option to toggle Python too.
>>>>> (Unclear what the default should be, but this probably ties into
>>>>> re-thinking how pipeline update should work.)
>>>>>
>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>> exception.
>>>>>>
>>>>>> Tangentially, is it expected that the Beam playground examples don't
>>>>>> have a way to see the outputs of a run example? I have a vague memory that
>>>>>> there used to be a way to navigate to an output file after it's generated
>>>>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>>>>> positive if my runs were actually succeeding or not based on the stdout
>>>>>> alone.
>>>>>>
>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>
>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>> user@beam.apache.org> wrote:
>>>>>>
>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>> streaming or batch, or if it will be updated in the future, so the decision
>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>
>>>>>>> [1] Note that this applies to the fully qualified transform name, so
>>>>>>> the naming only has to be distinct within a composite transform (or at the
>>>>>>> top level--the pipeline itself is isomorphic to a single composite
>>>>>>> transform).
>>>>>>>
>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>
>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>>> causes the duplication error.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You don't have to specify the names if the callable you pass in
>>>>>>>>>>>> is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>>
>>>>>>>>>>>> ```
>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>> ```
>>>>>>>>>>>>
>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>
>>>>>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>
>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have
>>>>>>>>>>>>> a stable / deterministic graph.
>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>
>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I forgot to mention my question is posed from the context of
>>>>>>>>>>>>>> a python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped
>>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via dev <de...@beam.apache.org>.

Thanks for the PR.

I think we should follow Java and allow non-unique labels, but not provide
automatic uniquification, In particular, the danger of using a counter is
that one can get accidental (and potentially hard to check) off-by-one
collisions. As a concrete example, imagine one partitions a dataset into
two collections, each followed by a similarly-named transform.

    --> B
  /
A
 \
   --> B

Uniquification would give something like

    --> B
  /
A
 \
   --> B_2

Suppose one then realizes there's a third case to handle, giving

    --> B
  /
A --> B
 \
   --> B

But this would be uniquified to

    --> B
  /
A --> B_2
 \
   --> B_3

where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
This is bad because an updating runner would then attribute old B_2's state
to the new B_2 (and also possibly mis-direct any inflight messages). At
least with the old, intersecting names we can detect this problem
rather than silently give corrupt data.


On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <jo...@schrodinger.com> wrote:

> For posterity: https://github.com/apache/beam/pull/28984
>
> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> I would definitely support a PR making this an option. Changing the
>> default would be a rather big change that would require more thought.
>>
>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>>> Apache Beam at my company and I'm trying to foresee parts of the API they
>>> might find inconvenient.
>>>
>>> If there's a conclusion to make the behavior similar to java, I'm happy
>>> to put up a PR
>>>
>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd
>>>> be a very sticky toggle since it'd be easy for PTransforms to accidentally
>>>> rely on it.
>>>>
>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> Huh. This used to be a hard error in Java, but I guess it's togglable
>>>>> with an option now. We should probably add the option to toggle Python too.
>>>>> (Unclear what the default should be, but this probably ties into
>>>>> re-thinking how pipeline update should work.)
>>>>>
>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>> exception.
>>>>>>
>>>>>> Tangentially, is it expected that the Beam playground examples don't
>>>>>> have a way to see the outputs of a run example? I have a vague memory that
>>>>>> there used to be a way to navigate to an output file after it's generated
>>>>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>>>>> positive if my runs were actually succeeding or not based on the stdout
>>>>>> alone.
>>>>>>
>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>
>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>> user@beam.apache.org> wrote:
>>>>>>
>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>> streaming or batch, or if it will be updated in the future, so the decision
>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>
>>>>>>> [1] Note that this applies to the fully qualified transform name, so
>>>>>>> the naming only has to be distinct within a composite transform (or at the
>>>>>>> top level--the pipeline itself is isomorphic to a single composite
>>>>>>> transform).
>>>>>>>
>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>
>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>>> causes the duplication error.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You don't have to specify the names if the callable you pass in
>>>>>>>>>>>> is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>>
>>>>>>>>>>>> ```
>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>> ```
>>>>>>>>>>>>
>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>
>>>>>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>
>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have
>>>>>>>>>>>>> a stable / deterministic graph.
>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>
>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I forgot to mention my question is posed from the context of
>>>>>>>>>>>>>> a python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped
>>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

For posterity: https://github.com/apache/beam/pull/28984

On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com> wrote:

> I would definitely support a PR making this an option. Changing the
> default would be a rather big change that would require more thought.
>
> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>> Apache Beam at my company and I'm trying to foresee parts of the API they
>> might find inconvenient.
>>
>> If there's a conclusion to make the behavior similar to java, I'm happy
>> to put up a PR
>>
>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd be
>>> a very sticky toggle since it'd be easy for PTransforms to accidentally
>>> rely on it.
>>>
>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> Huh. This used to be a hard error in Java, but I guess it's togglable
>>>> with an option now. We should probably add the option to toggle Python too.
>>>> (Unclear what the default should be, but this probably ties into
>>>> re-thinking how pipeline update should work.)
>>>>
>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Makes sense that the requirement is the same, but is the label
>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>> exception.
>>>>>
>>>>> Tangentially, is it expected that the Beam playground examples don't
>>>>> have a way to see the outputs of a run example? I have a vague memory that
>>>>> there used to be a way to navigate to an output file after it's generated
>>>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>>>> positive if my runs were actually succeeding or not based on the stdout
>>>>> alone.
>>>>>
>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>
>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>> user@beam.apache.org> wrote:
>>>>>
>>>>>> BeamJava and BeamPython have the exact same behavior: transform names
>>>>>> within must be distinct [1]. This is because we do not necessarily know at
>>>>>> pipeline construction time if the pipeline will be streaming or batch, or
>>>>>> if it will be updated in the future, so the decision was made to impose
>>>>>> this restriction up front. Both will auto-generate a name for you if one is
>>>>>> not given, but will do so deterministically (not depending on some global
>>>>>> context) to avoid potential update problems.
>>>>>>
>>>>>> [1] Note that this applies to the fully qualified transform name, so
>>>>>> the naming only has to be distinct within a composite transform (or at the
>>>>>> top level--the pipeline itself is isomorphic to a single composite
>>>>>> transform).
>>>>>>
>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>
>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>> causes the duplication error.
>>>>>>>>
>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Not sure what that suggests
>>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> You don't have to specify the names if the callable you pass in
>>>>>>>>>>> is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> Here's an example on playground that shows the error message you
>>>>>>>>>>> get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>
>>>>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>
>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>> [2]
>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>
>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>>>>>> stable / deterministic graph.
>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>
>>>>>>>>>>>> See an example here
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

For posterity: https://github.com/apache/beam/pull/28984

On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <ro...@google.com> wrote:

> I would definitely support a PR making this an option. Changing the
> default would be a rather big change that would require more thought.
>
> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>> Apache Beam at my company and I'm trying to foresee parts of the API they
>> might find inconvenient.
>>
>> If there's a conclusion to make the behavior similar to java, I'm happy
>> to put up a PR
>>
>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd be
>>> a very sticky toggle since it'd be easy for PTransforms to accidentally
>>> rely on it.
>>>
>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> Huh. This used to be a hard error in Java, but I guess it's togglable
>>>> with an option now. We should probably add the option to toggle Python too.
>>>> (Unclear what the default should be, but this probably ties into
>>>> re-thinking how pipeline update should work.)
>>>>
>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Makes sense that the requirement is the same, but is the label
>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>> exception.
>>>>>
>>>>> Tangentially, is it expected that the Beam playground examples don't
>>>>> have a way to see the outputs of a run example? I have a vague memory that
>>>>> there used to be a way to navigate to an output file after it's generated
>>>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>>>> positive if my runs were actually succeeding or not based on the stdout
>>>>> alone.
>>>>>
>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>
>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>> user@beam.apache.org> wrote:
>>>>>
>>>>>> BeamJava and BeamPython have the exact same behavior: transform names
>>>>>> within must be distinct [1]. This is because we do not necessarily know at
>>>>>> pipeline construction time if the pipeline will be streaming or batch, or
>>>>>> if it will be updated in the future, so the decision was made to impose
>>>>>> this restriction up front. Both will auto-generate a name for you if one is
>>>>>> not given, but will do so deterministically (not depending on some global
>>>>>> context) to avoid potential update problems.
>>>>>>
>>>>>> [1] Note that this applies to the fully qualified transform name, so
>>>>>> the naming only has to be distinct within a composite transform (or at the
>>>>>> top level--the pipeline itself is isomorphic to a single composite
>>>>>> transform).
>>>>>>
>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>
>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>> causes the duplication error.
>>>>>>>>
>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Not sure what that suggests
>>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> You don't have to specify the names if the callable you pass in
>>>>>>>>>>> is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> Here's an example on playground that shows the error message you
>>>>>>>>>>> get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>
>>>>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>
>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>> [2]
>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>
>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>>>>>> stable / deterministic graph.
>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>
>>>>>>>>>>>> See an example here
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via user <us...@beam.apache.org>.

I would definitely support a PR making this an option. Changing the default
would be a rather big change that would require more thought.

On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com> wrote:

> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
> Apache Beam at my company and I'm trying to foresee parts of the API they
> might find inconvenient.
>
> If there's a conclusion to make the behavior similar to java, I'm happy to
> put up a PR
>
> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com> wrote:
>
>> Is it really toggleable in Java? I imagine that if it's a toggle it'd be
>> a very sticky toggle since it'd be easy for PTransforms to accidentally
>> rely on it.
>>
>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> Huh. This used to be a hard error in Java, but I guess it's togglable
>>> with an option now. We should probably add the option to toggle Python too.
>>> (Unclear what the default should be, but this probably ties into
>>> re-thinking how pipeline update should work.)
>>>
>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Makes sense that the requirement is the same, but is the label
>>>> auto-generation behavior the same? I modified the BeamJava
>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>> exception.
>>>>
>>>> Tangentially, is it expected that the Beam playground examples don't
>>>> have a way to see the outputs of a run example? I have a vague memory that
>>>> there used to be a way to navigate to an output file after it's generated
>>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>>> positive if my runs were actually succeeding or not based on the stdout
>>>> alone.
>>>>
>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>
>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>> user@beam.apache.org> wrote:
>>>>
>>>>> BeamJava and BeamPython have the exact same behavior: transform names
>>>>> within must be distinct [1]. This is because we do not necessarily know at
>>>>> pipeline construction time if the pipeline will be streaming or batch, or
>>>>> if it will be updated in the future, so the decision was made to impose
>>>>> this restriction up front. Both will auto-generate a name for you if one is
>>>>> not given, but will do so deterministically (not depending on some global
>>>>> context) to avoid potential update problems.
>>>>>
>>>>> [1] Note that this applies to the fully qualified transform name, so
>>>>> the naming only has to be distinct within a composite transform (or at the
>>>>> top level--the pipeline itself is isomorphic to a single composite
>>>>> transform).
>>>>>
>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>
>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>> causes the duplication error.
>>>>>>>
>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Not sure what that suggests
>>>>>>>>
>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> You don't have to specify the names if the callable you pass in
>>>>>>>>>> is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> Here's an example on playground that shows the error message you
>>>>>>>>>> get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>
>>>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>> of filter on each of them.
>>>>>>>>>>
>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>
>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>> [2]
>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>> specify those names.
>>>>>>>>>>>
>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>>>>> stable / deterministic graph.
>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>
>>>>>>>>>>> See an example here
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>
>>>>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via dev <de...@beam.apache.org>.

I would definitely support a PR making this an option. Changing the default
would be a rather big change that would require more thought.

On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <jo...@schrodinger.com> wrote:

> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
> Apache Beam at my company and I'm trying to foresee parts of the API they
> might find inconvenient.
>
> If there's a conclusion to make the behavior similar to java, I'm happy to
> put up a PR
>
> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com> wrote:
>
>> Is it really toggleable in Java? I imagine that if it's a toggle it'd be
>> a very sticky toggle since it'd be easy for PTransforms to accidentally
>> rely on it.
>>
>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> Huh. This used to be a hard error in Java, but I guess it's togglable
>>> with an option now. We should probably add the option to toggle Python too.
>>> (Unclear what the default should be, but this probably ties into
>>> re-thinking how pipeline update should work.)
>>>
>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Makes sense that the requirement is the same, but is the label
>>>> auto-generation behavior the same? I modified the BeamJava
>>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>> exception.
>>>>
>>>> Tangentially, is it expected that the Beam playground examples don't
>>>> have a way to see the outputs of a run example? I have a vague memory that
>>>> there used to be a way to navigate to an output file after it's generated
>>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>>> positive if my runs were actually succeeding or not based on the stdout
>>>> alone.
>>>>
>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>
>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>> user@beam.apache.org> wrote:
>>>>
>>>>> BeamJava and BeamPython have the exact same behavior: transform names
>>>>> within must be distinct [1]. This is because we do not necessarily know at
>>>>> pipeline construction time if the pipeline will be streaming or batch, or
>>>>> if it will be updated in the future, so the decision was made to impose
>>>>> this restriction up front. Both will auto-generate a name for you if one is
>>>>> not given, but will do so deterministically (not depending on some global
>>>>> context) to avoid potential update problems.
>>>>>
>>>>> [1] Note that this applies to the fully qualified transform name, so
>>>>> the naming only has to be distinct within a composite transform (or at the
>>>>> top level--the pipeline itself is isomorphic to a single composite
>>>>> transform).
>>>>>
>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>
>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>> causes the duplication error.
>>>>>>>
>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Not sure what that suggests
>>>>>>>>
>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> You don't have to specify the names if the callable you pass in
>>>>>>>>>> is /different/ for two `beam.Map`s, but  if the callable is the same you
>>>>>>>>>> must specify a label. For example, the below will raise an exception:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> Here's an example on playground that shows the error message you
>>>>>>>>>> get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>
>>>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>>> of filter on each of them.
>>>>>>>>>>
>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often users will
>>>>>>>>>> really have no need for a disambiguating label as they're often just
>>>>>>>>>> writing unit tests that test a few different properties of their workflow.
>>>>>>>>>>
>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>> [2]
>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>> specify those names.
>>>>>>>>>>>
>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>>>>> stable / deterministic graph.
>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>
>>>>>>>>>>> See an example here
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>
>>>>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply (if using
>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto generate a
>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
Apache Beam at my company and I'm trying to foresee parts of the API they
might find inconvenient.

If there's a conclusion to make the behavior similar to java, I'm happy to
put up a PR

On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com> wrote:

> Is it really toggleable in Java? I imagine that if it's a toggle it'd be a
> very sticky toggle since it'd be easy for PTransforms to accidentally rely
> on it.
>
> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Huh. This used to be a hard error in Java, but I guess it's togglable
>> with an option now. We should probably add the option to toggle Python too.
>> (Unclear what the default should be, but this probably ties into
>> re-thinking how pipeline update should work.)
>>
>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Makes sense that the requirement is the same, but is the label
>>> auto-generation behavior the same? I modified the BeamJava
>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>> BeamPython example I posted before, it just warns instead of throwing an
>>> exception.
>>>
>>> Tangentially, is it expected that the Beam playground examples don't
>>> have a way to see the outputs of a run example? I have a vague memory that
>>> there used to be a way to navigate to an output file after it's generated
>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>> positive if my runs were actually succeeding or not based on the stdout
>>> alone.
>>>
>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>
>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>> user@beam.apache.org> wrote:
>>>
>>>> BeamJava and BeamPython have the exact same behavior: transform names
>>>> within must be distinct [1]. This is because we do not necessarily know at
>>>> pipeline construction time if the pipeline will be streaming or batch, or
>>>> if it will be updated in the future, so the decision was made to impose
>>>> this restriction up front. Both will auto-generate a name for you if one is
>>>> not given, but will do so deterministically (not depending on some global
>>>> context) to avoid potential update problems.
>>>>
>>>> [1] Note that this applies to the fully qualified transform name, so
>>>> the naming only has to be distinct within a composite transform (or at the
>>>> top level--the pipeline itself is isomorphic to a single composite
>>>> transform).
>>>>
>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>> behavior or if it's something worth changing for the python SDK
>>>>>
>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>>> wrote:
>>>>>
>>>>>> That suggests the default label is created as that, which indeed
>>>>>> causes the duplication error.
>>>>>>
>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Not sure what that suggests
>>>>>>>
>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>
>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> Here's an example on playground that shows the error message you
>>>>>>>>> get [1]. I marked every line I added with a "# ++".
>>>>>>>>>
>>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>> of filter on each of them.
>>>>>>>>>
>>>>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>>>>> [2] testing transform. In this case, I think often users will really have
>>>>>>>>> no need for a disambiguating label as they're often just writing unit tests
>>>>>>>>> that test a few different properties of their workflow.
>>>>>>>>>
>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>> [2]
>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>
>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>>>>> those names.
>>>>>>>>>>
>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>>>> stable / deterministic graph.
>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>
>>>>>>>>>> See an example here
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>
>>>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>
>>>>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>>>>>> for you.
>>>>>>>>>>>>
>>>>>>>>>>>> Reuven
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>> something
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>
>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
Apache Beam at my company and I'm trying to foresee parts of the API they
might find inconvenient.

If there's a conclusion to make the behavior similar to java, I'm happy to
put up a PR

On Thu, Oct 5, 2023, 12:49 PM Joey Tran <jo...@schrodinger.com> wrote:

> Is it really toggleable in Java? I imagine that if it's a toggle it'd be a
> very sticky toggle since it'd be easy for PTransforms to accidentally rely
> on it.
>
> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Huh. This used to be a hard error in Java, but I guess it's togglable
>> with an option now. We should probably add the option to toggle Python too.
>> (Unclear what the default should be, but this probably ties into
>> re-thinking how pipeline update should work.)
>>
>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Makes sense that the requirement is the same, but is the label
>>> auto-generation behavior the same? I modified the BeamJava
>>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>>> BeamPython example I posted before, it just warns instead of throwing an
>>> exception.
>>>
>>> Tangentially, is it expected that the Beam playground examples don't
>>> have a way to see the outputs of a run example? I have a vague memory that
>>> there used to be a way to navigate to an output file after it's generated
>>> but not sure if I just dreamt that up. Playing with the examples, I wasn't
>>> positive if my runs were actually succeeding or not based on the stdout
>>> alone.
>>>
>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>
>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>> user@beam.apache.org> wrote:
>>>
>>>> BeamJava and BeamPython have the exact same behavior: transform names
>>>> within must be distinct [1]. This is because we do not necessarily know at
>>>> pipeline construction time if the pipeline will be streaming or batch, or
>>>> if it will be updated in the future, so the decision was made to impose
>>>> this restriction up front. Both will auto-generate a name for you if one is
>>>> not given, but will do so deterministically (not depending on some global
>>>> context) to avoid potential update problems.
>>>>
>>>> [1] Note that this applies to the fully qualified transform name, so
>>>> the naming only has to be distinct within a composite transform (or at the
>>>> top level--the pipeline itself is isomorphic to a single composite
>>>> transform).
>>>>
>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>> behavior or if it's something worth changing for the python SDK
>>>>>
>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>>> wrote:
>>>>>
>>>>>> That suggests the default label is created as that, which indeed
>>>>>> causes the duplication error.
>>>>>>
>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Not sure what that suggests
>>>>>>>
>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>
>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> Here's an example on playground that shows the error message you
>>>>>>>>> get [1]. I marked every line I added with a "# ++".
>>>>>>>>>
>>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>>> of filter on each of them.
>>>>>>>>>
>>>>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>>>>> [2] testing transform. In this case, I think often users will really have
>>>>>>>>> no need for a disambiguating label as they're often just writing unit tests
>>>>>>>>> that test a few different properties of their workflow.
>>>>>>>>>
>>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>> [2]
>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>
>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>>>>> those names.
>>>>>>>>>>
>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>>>> stable / deterministic graph.
>>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>
>>>>>>>>>> See an example here
>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>
>>>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>
>>>>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>>>>>> for you.
>>>>>>>>>>>>
>>>>>>>>>>>> Reuven
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>>> something
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>
>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Is it really toggleable in Java? I imagine that if it's a toggle it'd be a
very sticky toggle since it'd be easy for PTransforms to accidentally rely
on it.

On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com> wrote:

> Huh. This used to be a hard error in Java, but I guess it's togglable
> with an option now. We should probably add the option to toggle Python too.
> (Unclear what the default should be, but this probably ties into
> re-thinking how pipeline update should work.)
>
> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> Makes sense that the requirement is the same, but is the label
>> auto-generation behavior the same? I modified the BeamJava
>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>> BeamPython example I posted before, it just warns instead of throwing an
>> exception.
>>
>> Tangentially, is it expected that the Beam playground examples don't have
>> a way to see the outputs of a run example? I have a vague memory that there
>> used to be a way to navigate to an output file after it's generated but not
>> sure if I just dreamt that up. Playing with the examples, I wasn't positive
>> if my runs were actually succeeding or not based on the stdout alone.
>>
>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>
>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>> user@beam.apache.org> wrote:
>>
>>> BeamJava and BeamPython have the exact same behavior: transform names
>>> within must be distinct [1]. This is because we do not necessarily know at
>>> pipeline construction time if the pipeline will be streaming or batch, or
>>> if it will be updated in the future, so the decision was made to impose
>>> this restriction up front. Both will auto-generate a name for you if one is
>>> not given, but will do so deterministically (not depending on some global
>>> context) to avoid potential update problems.
>>>
>>> [1] Note that this applies to the fully qualified transform name, so the
>>> naming only has to be distinct within a composite transform (or at the top
>>> level--the pipeline itself is isomorphic to a single composite transform).
>>>
>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Cross posting this thread to dev@ to see if this is intentional
>>>> behavior or if it's something worth changing for the python SDK
>>>>
>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>> wrote:
>>>>
>>>>> That suggests the default label is created as that, which indeed
>>>>> causes the duplication error.
>>>>>
>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Not sure what that suggests
>>>>>>
>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>> `Filter(identity_filter)`.
>>>>>>>
>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>>>
>>>>>>>> ```
>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Here's an example on playground that shows the error message you
>>>>>>>> get [1]. I marked every line I added with a "# ++".
>>>>>>>>
>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>> of filter on each of them.
>>>>>>>>
>>>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>>>> [2] testing transform. In this case, I think often users will really have
>>>>>>>> no need for a disambiguating label as they're often just writing unit tests
>>>>>>>> that test a few different properties of their workflow.
>>>>>>>>
>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>> [2]
>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>
>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>>>> those names.
>>>>>>>>>
>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>>> stable / deterministic graph.
>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>
>>>>>>>>> See an example here
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>>>>> Can you elaborate?
>>>>>>>>>>
>>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>
>>>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>>>>> for you.
>>>>>>>>>>>
>>>>>>>>>>> Reuven
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>> something
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Joey
>>>>>>>>>>>>
>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Is it really toggleable in Java? I imagine that if it's a toggle it'd be a
very sticky toggle since it'd be easy for PTransforms to accidentally rely
on it.

On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <ro...@google.com> wrote:

> Huh. This used to be a hard error in Java, but I guess it's togglable
> with an option now. We should probably add the option to toggle Python too.
> (Unclear what the default should be, but this probably ties into
> re-thinking how pipeline update should work.)
>
> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> Makes sense that the requirement is the same, but is the label
>> auto-generation behavior the same? I modified the BeamJava
>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>> BeamPython example I posted before, it just warns instead of throwing an
>> exception.
>>
>> Tangentially, is it expected that the Beam playground examples don't have
>> a way to see the outputs of a run example? I have a vague memory that there
>> used to be a way to navigate to an output file after it's generated but not
>> sure if I just dreamt that up. Playing with the examples, I wasn't positive
>> if my runs were actually succeeding or not based on the stdout alone.
>>
>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>
>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>> user@beam.apache.org> wrote:
>>
>>> BeamJava and BeamPython have the exact same behavior: transform names
>>> within must be distinct [1]. This is because we do not necessarily know at
>>> pipeline construction time if the pipeline will be streaming or batch, or
>>> if it will be updated in the future, so the decision was made to impose
>>> this restriction up front. Both will auto-generate a name for you if one is
>>> not given, but will do so deterministically (not depending on some global
>>> context) to avoid potential update problems.
>>>
>>> [1] Note that this applies to the fully qualified transform name, so the
>>> naming only has to be distinct within a composite transform (or at the top
>>> level--the pipeline itself is isomorphic to a single composite transform).
>>>
>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Cross posting this thread to dev@ to see if this is intentional
>>>> behavior or if it's something worth changing for the python SDK
>>>>
>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>>> wrote:
>>>>
>>>>> That suggests the default label is created as that, which indeed
>>>>> causes the duplication error.
>>>>>
>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Not sure what that suggests
>>>>>>
>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>> `Filter(identity_filter)`.
>>>>>>>
>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>>>
>>>>>>>> ```
>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Here's an example on playground that shows the error message you
>>>>>>>> get [1]. I marked every line I added with a "# ++".
>>>>>>>>
>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>>> of filter on each of them.
>>>>>>>>
>>>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>>>> [2] testing transform. In this case, I think often users will really have
>>>>>>>> no need for a disambiguating label as they're often just writing unit tests
>>>>>>>> that test a few different properties of their workflow.
>>>>>>>>
>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>> [2]
>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>
>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>>>> those names.
>>>>>>>>>
>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>>> stable / deterministic graph.
>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>
>>>>>>>>> See an example here
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>>>>> Can you elaborate?
>>>>>>>>>>
>>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>
>>>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>>>>> for you.
>>>>>>>>>>>
>>>>>>>>>>> Reuven
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>> something
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Joey
>>>>>>>>>>>>
>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via dev <de...@beam.apache.org>.

Huh. This used to be a hard error in Java, but I guess it's togglable
with an option now. We should probably add the option to toggle Python too.
(Unclear what the default should be, but this probably ties into
re-thinking how pipeline update should work.)

On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com> wrote:

> Makes sense that the requirement is the same, but is the label
> auto-generation behavior the same? I modified the BeamJava
> wordcount example[1] to do the regex filter twice in a row, and unlike the
> BeamPython example I posted before, it just warns instead of throwing an
> exception.
>
> Tangentially, is it expected that the Beam playground examples don't have
> a way to see the outputs of a run example? I have a vague memory that there
> used to be a way to navigate to an output file after it's generated but not
> sure if I just dreamt that up. Playing with the examples, I wasn't positive
> if my runs were actually succeeding or not based on the stdout alone.
>
> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>
> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
> user@beam.apache.org> wrote:
>
>> BeamJava and BeamPython have the exact same behavior: transform names
>> within must be distinct [1]. This is because we do not necessarily know at
>> pipeline construction time if the pipeline will be streaming or batch, or
>> if it will be updated in the future, so the decision was made to impose
>> this restriction up front. Both will auto-generate a name for you if one is
>> not given, but will do so deterministically (not depending on some global
>> context) to avoid potential update problems.
>>
>> [1] Note that this applies to the fully qualified transform name, so the
>> naming only has to be distinct within a composite transform (or at the top
>> level--the pipeline itself is isomorphic to a single composite transform).
>>
>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Cross posting this thread to dev@ to see if this is intentional
>>> behavior or if it's something worth changing for the python SDK
>>>
>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>> wrote:
>>>
>>>> That suggests the default label is created as that, which indeed causes
>>>> the duplication error.
>>>>
>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Not sure what that suggests
>>>>>
>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>> wrote:
>>>>>
>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>> `Filter(identity_filter)`.
>>>>>>
>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>>
>>>>>>> ```
>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>         | beam.Filter(identity_filter)
>>>>>>> ```
>>>>>>>
>>>>>>> Here's an example on playground that shows the error message you get
>>>>>>> [1]. I marked every line I added with a "# ++".
>>>>>>>
>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>> of filter on each of them.
>>>>>>>
>>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>>> [2] testing transform. In this case, I think often users will really have
>>>>>>> no need for a disambiguating label as they're often just writing unit tests
>>>>>>> that test a few different properties of their workflow.
>>>>>>>
>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>> [2]
>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>
>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>> user@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>>> those names.
>>>>>>>>
>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>> stable / deterministic graph.
>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>
>>>>>>>> See an example here
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>>>> Can you elaborate?
>>>>>>>>>
>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>
>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>
>>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>>>> for you.
>>>>>>>>>>
>>>>>>>>>> Reuven
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>> something
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Joey
>>>>>>>>>>>
>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via user <us...@beam.apache.org>.

Huh. This used to be a hard error in Java, but I guess it's togglable
with an option now. We should probably add the option to toggle Python too.
(Unclear what the default should be, but this probably ties into
re-thinking how pipeline update should work.)

On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <jo...@schrodinger.com> wrote:

> Makes sense that the requirement is the same, but is the label
> auto-generation behavior the same? I modified the BeamJava
> wordcount example[1] to do the regex filter twice in a row, and unlike the
> BeamPython example I posted before, it just warns instead of throwing an
> exception.
>
> Tangentially, is it expected that the Beam playground examples don't have
> a way to see the outputs of a run example? I have a vague memory that there
> used to be a way to navigate to an output file after it's generated but not
> sure if I just dreamt that up. Playing with the examples, I wasn't positive
> if my runs were actually succeeding or not based on the stdout alone.
>
> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>
> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
> user@beam.apache.org> wrote:
>
>> BeamJava and BeamPython have the exact same behavior: transform names
>> within must be distinct [1]. This is because we do not necessarily know at
>> pipeline construction time if the pipeline will be streaming or batch, or
>> if it will be updated in the future, so the decision was made to impose
>> this restriction up front. Both will auto-generate a name for you if one is
>> not given, but will do so deterministically (not depending on some global
>> context) to avoid potential update problems.
>>
>> [1] Note that this applies to the fully qualified transform name, so the
>> naming only has to be distinct within a composite transform (or at the top
>> level--the pipeline itself is isomorphic to a single composite transform).
>>
>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Cross posting this thread to dev@ to see if this is intentional
>>> behavior or if it's something worth changing for the python SDK
>>>
>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>>> wrote:
>>>
>>>> That suggests the default label is created as that, which indeed causes
>>>> the duplication error.
>>>>
>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Not sure what that suggests
>>>>>
>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>>> wrote:
>>>>>
>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>> `Filter(identity_filter)`.
>>>>>>
>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>>
>>>>>>> ```
>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>         | beam.Filter(identity_filter)
>>>>>>> ```
>>>>>>>
>>>>>>> Here's an example on playground that shows the error message you get
>>>>>>> [1]. I marked every line I added with a "# ++".
>>>>>>>
>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>>> of filter on each of them.
>>>>>>>
>>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>>> [2] testing transform. In this case, I think often users will really have
>>>>>>> no need for a disambiguating label as they're often just writing unit tests
>>>>>>> that test a few different properties of their workflow.
>>>>>>>
>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>> [2]
>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>
>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>> user@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>>> those names.
>>>>>>>>
>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>> stable / deterministic graph.
>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>
>>>>>>>> See an example here
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>>>> Can you elaborate?
>>>>>>>>>
>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>
>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>> because for runners that support updating pipelines in place, the only way
>>>>>>>>>> to do so safely is if the runner can perfectly identify which transforms in
>>>>>>>>>> the new graph match the ones in the old graph. There's no good way to auto
>>>>>>>>>> generate names that will stay stable across updates - even small changes to
>>>>>>>>>> the pipeline might change the order of nodes in the graph, which could
>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>
>>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>>>> for you.
>>>>>>>>>>
>>>>>>>>>> Reuven
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the same
>>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>> something
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Joey
>>>>>>>>>>>
>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Makes sense that the requirement is the same, but is the label
auto-generation behavior the same? I modified the BeamJava
wordcount example[1] to do the regex filter twice in a row, and unlike the
BeamPython example I posted before, it just warns instead of throwing an
exception.

Tangentially, is it expected that the Beam playground examples don't have a
way to see the outputs of a run example? I have a vague memory that there
used to be a way to navigate to an output file after it's generated but not
sure if I just dreamt that up. Playing with the examples, I wasn't positive
if my runs were actually succeeding or not based on the stdout alone.

[1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
<https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
[2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW

On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
user@beam.apache.org> wrote:

> BeamJava and BeamPython have the exact same behavior: transform names
> within must be distinct [1]. This is because we do not necessarily know at
> pipeline construction time if the pipeline will be streaming or batch, or
> if it will be updated in the future, so the decision was made to impose
> this restriction up front. Both will auto-generate a name for you if one is
> not given, but will do so deterministically (not depending on some global
> context) to avoid potential update problems.
>
> [1] Note that this applies to the fully qualified transform name, so the
> naming only has to be distinct within a composite transform (or at the top
> level--the pipeline itself is isomorphic to a single composite transform).
>
> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> Cross posting this thread to dev@ to see if this is intentional behavior
>> or if it's something worth changing for the python SDK
>>
>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>> wrote:
>>
>>> That suggests the default label is created as that, which indeed causes
>>> the duplication error.
>>>
>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Not sure what that suggests
>>>>
>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>> wrote:
>>>>
>>>>> Looks like this is the current behaviour. If you have `t =
>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>> `Filter(identity_filter)`.
>>>>>
>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>
>>>>>> ```
>>>>>>         | beam.Filter(identity_filter)
>>>>>>         | beam.Filter(identity_filter)
>>>>>> ```
>>>>>>
>>>>>> Here's an example on playground that shows the error message you get
>>>>>> [1]. I marked every line I added with a "# ++".
>>>>>>
>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>> of filter on each of them.
>>>>>>
>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>> [2] testing transform. In this case, I think often users will really have
>>>>>> no need for a disambiguating label as they're often just writing unit tests
>>>>>> that test a few different properties of their workflow.
>>>>>>
>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>> [2]
>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>
>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>> user@beam.apache.org> wrote:
>>>>>>
>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>> those names.
>>>>>>>
>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>> stable / deterministic graph.
>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>
>>>>>>> See an example here
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>>> Can you elaborate?
>>>>>>>>
>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>
>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> Are you talking about transform names? The main reason was because
>>>>>>>>> for runners that support updating pipelines in place, the only way to do so
>>>>>>>>> safely is if the runner can perfectly identify which transforms in the new
>>>>>>>>> graph match the ones in the old graph. There's no good way to auto generate
>>>>>>>>> names that will stay stable across updates - even small changes to the
>>>>>>>>> pipeline might change the order of nodes in the graph, which could result
>>>>>>>>> in a corrupted update.
>>>>>>>>>
>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>>> for you.
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>>>>>>> accidentally have duplicate labels from using multiple of the same
>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>> something
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Joey
>>>>>>>>>>
>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Makes sense that the requirement is the same, but is the label
auto-generation behavior the same? I modified the BeamJava
wordcount example[1] to do the regex filter twice in a row, and unlike the
BeamPython example I posted before, it just warns instead of throwing an
exception.

Tangentially, is it expected that the Beam playground examples don't have a
way to see the outputs of a run example? I have a vague memory that there
used to be a way to navigate to an output file after it's generated but not
sure if I just dreamt that up. Playing with the examples, I wasn't positive
if my runs were actually succeeding or not based on the stdout alone.

[1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
<https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
[2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW

On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
user@beam.apache.org> wrote:

> BeamJava and BeamPython have the exact same behavior: transform names
> within must be distinct [1]. This is because we do not necessarily know at
> pipeline construction time if the pipeline will be streaming or batch, or
> if it will be updated in the future, so the decision was made to impose
> this restriction up front. Both will auto-generate a name for you if one is
> not given, but will do so deterministically (not depending on some global
> context) to avoid potential update problems.
>
> [1] Note that this applies to the fully qualified transform name, so the
> naming only has to be distinct within a composite transform (or at the top
> level--the pipeline itself is isomorphic to a single composite transform).
>
> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> Cross posting this thread to dev@ to see if this is intentional behavior
>> or if it's something worth changing for the python SDK
>>
>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org>
>> wrote:
>>
>>> That suggests the default label is created as that, which indeed causes
>>> the duplication error.
>>>
>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Not sure what that suggests
>>>>
>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>>> wrote:
>>>>
>>>>> Looks like this is the current behaviour. If you have `t =
>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>> `Filter(identity_filter)`.
>>>>>
>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>
>>>>>> ```
>>>>>>         | beam.Filter(identity_filter)
>>>>>>         | beam.Filter(identity_filter)
>>>>>> ```
>>>>>>
>>>>>> Here's an example on playground that shows the error message you get
>>>>>> [1]. I marked every line I added with a "# ++".
>>>>>>
>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>>> of filter on each of them.
>>>>>>
>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>> [2] testing transform. In this case, I think often users will really have
>>>>>> no need for a disambiguating label as they're often just writing unit tests
>>>>>> that test a few different properties of their workflow.
>>>>>>
>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>> [2]
>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>
>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>> user@beam.apache.org> wrote:
>>>>>>
>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>> those names.
>>>>>>>
>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>> stable / deterministic graph.
>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>
>>>>>>> See an example here
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>>> Can you elaborate?
>>>>>>>>
>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>
>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> Are you talking about transform names? The main reason was because
>>>>>>>>> for runners that support updating pipelines in place, the only way to do so
>>>>>>>>> safely is if the runner can perfectly identify which transforms in the new
>>>>>>>>> graph match the ones in the old graph. There's no good way to auto generate
>>>>>>>>> names that will stay stable across updates - even small changes to the
>>>>>>>>> pipeline might change the order of nodes in the graph, which could result
>>>>>>>>> in a corrupted update.
>>>>>>>>>
>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>>> for you.
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>>
>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>>>>>>> accidentally have duplicate labels from using multiple of the same
>>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>> something
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Joey
>>>>>>>>>>
>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via dev <de...@beam.apache.org>.

BeamJava and BeamPython have the exact same behavior: transform names
within must be distinct [1]. This is because we do not necessarily know at
pipeline construction time if the pipeline will be streaming or batch, or
if it will be updated in the future, so the decision was made to impose
this restriction up front. Both will auto-generate a name for you if one is
not given, but will do so deterministically (not depending on some global
context) to avoid potential update problems.

[1] Note that this applies to the fully qualified transform name, so the
naming only has to be distinct within a composite transform (or at the top
level--the pipeline itself is isomorphic to a single composite transform).

On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com> wrote:

> Cross posting this thread to dev@ to see if this is intentional behavior
> or if it's something worth changing for the python SDK
>
> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org> wrote:
>
>> That suggests the default label is created as that, which indeed causes
>> the duplication error.
>>
>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Not sure what that suggests
>>>
>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>> wrote:
>>>
>>>> Looks like this is the current behaviour. If you have `t =
>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>> `Filter(identity_filter)`.
>>>>
>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> You don't have to specify the names if the callable you pass in is
>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>> specify a label. For example, the below will raise an exception:
>>>>>
>>>>> ```
>>>>>         | beam.Filter(identity_filter)
>>>>>         | beam.Filter(identity_filter)
>>>>> ```
>>>>>
>>>>> Here's an example on playground that shows the error message you get
>>>>> [1]. I marked every line I added with a "# ++".
>>>>>
>>>>> It's a contrived example, but using a map or filter at the same
>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>> of filter on each of them.
>>>>>
>>>>> The case that happens most often for me is using the `assert_that` [2]
>>>>> testing transform. In this case, I think often users will really have no
>>>>> need for a disambiguating label as they're often just writing unit tests
>>>>> that test a few different properties of their workflow.
>>>>>
>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>> [2]
>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>
>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>> user@beam.apache.org> wrote:
>>>>>
>>>>>> If I understand the question correctly, you don't have to specify
>>>>>> those names.
>>>>>>
>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>> stable / deterministic graph.
>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>
>>>>>> See an example here
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>
>>>>>>
>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>> Can you elaborate?
>>>>>>>
>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>
>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>> user@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> Are you talking about transform names? The main reason was because
>>>>>>>> for runners that support updating pipelines in place, the only way to do so
>>>>>>>> safely is if the runner can perfectly identify which transforms in the new
>>>>>>>> graph match the ones in the old graph. There's no good way to auto generate
>>>>>>>> names that will stay stable across updates - even small changes to the
>>>>>>>> pipeline might change the order of nodes in the graph, which could result
>>>>>>>> in a corrupted update.
>>>>>>>>
>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>> for you.
>>>>>>>>
>>>>>>>> Reuven
>>>>>>>>
>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>
>>>>>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>>>>>> accidentally have duplicate labels from using multiple of the same
>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>> something
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Joey
>>>>>>>>>
>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Robert Bradshaw via user <us...@beam.apache.org>.

BeamJava and BeamPython have the exact same behavior: transform names
within must be distinct [1]. This is because we do not necessarily know at
pipeline construction time if the pipeline will be streaming or batch, or
if it will be updated in the future, so the decision was made to impose
this restriction up front. Both will auto-generate a name for you if one is
not given, but will do so deterministically (not depending on some global
context) to avoid potential update problems.

[1] Note that this applies to the fully qualified transform name, so the
naming only has to be distinct within a composite transform (or at the top
level--the pipeline itself is isomorphic to a single composite transform).

On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <jo...@schrodinger.com> wrote:

> Cross posting this thread to dev@ to see if this is intentional behavior
> or if it's something worth changing for the python SDK
>
> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org> wrote:
>
>> That suggests the default label is created as that, which indeed causes
>> the duplication error.
>>
>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Not sure what that suggests
>>>
>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org>
>>> wrote:
>>>
>>>> Looks like this is the current behaviour. If you have `t =
>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>> `Filter(identity_filter)`.
>>>>
>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> You don't have to specify the names if the callable you pass in is
>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>>> specify a label. For example, the below will raise an exception:
>>>>>
>>>>> ```
>>>>>         | beam.Filter(identity_filter)
>>>>>         | beam.Filter(identity_filter)
>>>>> ```
>>>>>
>>>>> Here's an example on playground that shows the error message you get
>>>>> [1]. I marked every line I added with a "# ++".
>>>>>
>>>>> It's a contrived example, but using a map or filter at the same
>>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>>> of filter on each of them.
>>>>>
>>>>> The case that happens most often for me is using the `assert_that` [2]
>>>>> testing transform. In this case, I think often users will really have no
>>>>> need for a disambiguating label as they're often just writing unit tests
>>>>> that test a few different properties of their workflow.
>>>>>
>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>> [2]
>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>
>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>> user@beam.apache.org> wrote:
>>>>>
>>>>>> If I understand the question correctly, you don't have to specify
>>>>>> those names.
>>>>>>
>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>> stable / deterministic graph.
>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>
>>>>>> See an example here
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>
>>>>>>
>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>> Can you elaborate?
>>>>>>>
>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>
>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>> user@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> Are you talking about transform names? The main reason was because
>>>>>>>> for runners that support updating pipelines in place, the only way to do so
>>>>>>>> safely is if the runner can perfectly identify which transforms in the new
>>>>>>>> graph match the ones in the old graph. There's no good way to auto generate
>>>>>>>> names that will stay stable across updates - even small changes to the
>>>>>>>> pipeline might change the order of nodes in the graph, which could result
>>>>>>>> in a corrupted update.
>>>>>>>>
>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>>> for you.
>>>>>>>>
>>>>>>>> Reuven
>>>>>>>>
>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>>
>>>>>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>>>>>> accidentally have duplicate labels from using multiple of the same
>>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>> something
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Joey
>>>>>>>>>
>>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Cross posting this thread to dev@ to see if this is intentional behavior or
if it's something worth changing for the python SDK

On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <us...@beam.apache.org> wrote:

> That suggests the default label is created as that, which indeed causes
> the duplication error.
>
> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> Not sure what that suggests
>>
>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org> wrote:
>>
>>> Looks like this is the current behaviour. If you have `t =
>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>> `Filter(identity_filter)`.
>>>
>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> You don't have to specify the names if the callable you pass in is
>>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>>> specify a label. For example, the below will raise an exception:
>>>>
>>>> ```
>>>>         | beam.Filter(identity_filter)
>>>>         | beam.Filter(identity_filter)
>>>> ```
>>>>
>>>> Here's an example on playground that shows the error message you get
>>>> [1]. I marked every line I added with a "# ++".
>>>>
>>>> It's a contrived example, but using a map or filter at the same
>>>> pipeline level probably comes up often, at least in my inexperience. For
>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>> different pcolls, runs some transforms on them, and then runs the same type
>>>> of filter on each of them.
>>>>
>>>> The case that happens most often for me is using the `assert_that` [2]
>>>> testing transform. In this case, I think often users will really have no
>>>> need for a disambiguating label as they're often just writing unit tests
>>>> that test a few different properties of their workflow.
>>>>
>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>> [2]
>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>
>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>> user@beam.apache.org> wrote:
>>>>
>>>>> If I understand the question correctly, you don't have to specify
>>>>> those names.
>>>>>
>>>>> As Reuven pointed out, it is probably a good idea so you have a stable
>>>>> / deterministic graph.
>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>
>>>>> See an example here
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>
>>>>>
>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place". Can
>>>>>> you elaborate?
>>>>>>
>>>>>> I forgot to mention my question is posed from the context of a python
>>>>>> SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>>> reapplying the same Map on different pcollections.
>>>>>>
>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>> user@beam.apache.org> wrote:
>>>>>>
>>>>>>> Are you talking about transform names? The main reason was because
>>>>>>> for runners that support updating pipelines in place, the only way to do so
>>>>>>> safely is if the runner can perfectly identify which transforms in the new
>>>>>>> graph match the ones in the old graph. There's no good way to auto generate
>>>>>>> names that will stay stable across updates - even small changes to the
>>>>>>> pipeline might change the order of nodes in the graph, which could result
>>>>>>> in a corrupted update.
>>>>>>>
>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>> these names for you! When you call PCollection.apply (if using BeamJava),
>>>>>>> simply omit the name parameter and Beam will auto generate a unique name
>>>>>>> for you.
>>>>>>>
>>>>>>> Reuven
>>>>>>>
>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>> joey.tran@schrodinger.com> wrote:
>>>>>>>
>>>>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>>>>> accidentally have duplicate labels from using multiple of the same
>>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>> something
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Joey
>>>>>>>>
>>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by XQ Hu via user <us...@beam.apache.org>.

That suggests the default label is created as that, which indeed causes the
duplication error.

On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <jo...@schrodinger.com> wrote:

> Not sure what that suggests
>
> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org> wrote:
>
>> Looks like this is the current behaviour. If you have `t =
>> beam.Filter(identity_filter)`, `t.label` is defined as
>> `Filter(identity_filter)`.
>>
>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> You don't have to specify the names if the callable you pass in is
>>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>>> specify a label. For example, the below will raise an exception:
>>>
>>> ```
>>>         | beam.Filter(identity_filter)
>>>         | beam.Filter(identity_filter)
>>> ```
>>>
>>> Here's an example on playground that shows the error message you get
>>> [1]. I marked every line I added with a "# ++".
>>>
>>> It's a contrived example, but using a map or filter at the same pipeline
>>> level probably comes up often, at least in my inexperience. For example,
>>> you. might have a pipeline that partitions a pcoll into three different
>>> pcolls, runs some transforms on them, and then runs the same type of filter
>>> on each of them.
>>>
>>> The case that happens most often for me is using the `assert_that` [2]
>>> testing transform. In this case, I think often users will really have no
>>> need for a disambiguating label as they're often just writing unit tests
>>> that test a few different properties of their workflow.
>>>
>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>> [2]
>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>
>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>> user@beam.apache.org> wrote:
>>>
>>>> If I understand the question correctly, you don't have to specify those
>>>> names.
>>>>
>>>> As Reuven pointed out, it is probably a good idea so you have a stable
>>>> / deterministic graph.
>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>> instead of pcollection | 'Map' >> map_fn.
>>>>
>>>> See an example here
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>
>>>>
>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place". Can
>>>>> you elaborate?
>>>>>
>>>>> I forgot to mention my question is posed from the context of a python
>>>>> SDK user, and afaict, there doesn't seem to be an obvious way to
>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>>> reapplying the same Map on different pcollections.
>>>>>
>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>> user@beam.apache.org> wrote:
>>>>>
>>>>>> Are you talking about transform names? The main reason was because
>>>>>> for runners that support updating pipelines in place, the only way to do so
>>>>>> safely is if the runner can perfectly identify which transforms in the new
>>>>>> graph match the ones in the old graph. There's no good way to auto generate
>>>>>> names that will stay stable across updates - even small changes to the
>>>>>> pipeline might change the order of nodes in the graph, which could result
>>>>>> in a corrupted update.
>>>>>>
>>>>>> However, if you don't care about update, Beam can auto generate these
>>>>>> names for you! When you call PCollection.apply (if using BeamJava), simply
>>>>>> omit the name parameter and Beam will auto generate a unique name for you.
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <jo...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>>>> accidentally have duplicate labels from using multiple of the same
>>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>> something
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Joey
>>>>>>>
>>>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Not sure what that suggests

On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <us...@beam.apache.org> wrote:

> Looks like this is the current behaviour. If you have `t =
> beam.Filter(identity_filter)`, `t.label` is defined as
> `Filter(identity_filter)`.
>
> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> You don't have to specify the names if the callable you pass in is
>> /different/ for two `beam.Map`s, but  if the callable is the same you must
>> specify a label. For example, the below will raise an exception:
>>
>> ```
>>         | beam.Filter(identity_filter)
>>         | beam.Filter(identity_filter)
>> ```
>>
>> Here's an example on playground that shows the error message you get [1].
>> I marked every line I added with a "# ++".
>>
>> It's a contrived example, but using a map or filter at the same pipeline
>> level probably comes up often, at least in my inexperience. For example,
>> you. might have a pipeline that partitions a pcoll into three different
>> pcolls, runs some transforms on them, and then runs the same type of filter
>> on each of them.
>>
>> The case that happens most often for me is using the `assert_that` [2]
>> testing transform. In this case, I think often users will really have no
>> need for a disambiguating label as they're often just writing unit tests
>> that test a few different properties of their workflow.
>>
>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>> [2]
>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>
>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>> user@beam.apache.org> wrote:
>>
>>> If I understand the question correctly, you don't have to specify those
>>> names.
>>>
>>> As Reuven pointed out, it is probably a good idea so you have a stable /
>>> deterministic graph.
>>> But in the Python SDK, you can simply use pcollection | map_fn, instead
>>> of pcollection | 'Map' >> map_fn.
>>>
>>> See an example here
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>
>>>
>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> Hmm, I'm not sure what you mean by "updating pipelines in place". Can
>>>> you elaborate?
>>>>
>>>> I forgot to mention my question is posed from the context of a python
>>>> SDK user, and afaict, there doesn't seem to be an obvious way to
>>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>>> reapplying the same Map on different pcollections.
>>>>
>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>> user@beam.apache.org> wrote:
>>>>
>>>>> Are you talking about transform names? The main reason was because for
>>>>> runners that support updating pipelines in place, the only way to do so
>>>>> safely is if the runner can perfectly identify which transforms in the new
>>>>> graph match the ones in the old graph. There's no good way to auto generate
>>>>> names that will stay stable across updates - even small changes to the
>>>>> pipeline might change the order of nodes in the graph, which could result
>>>>> in a corrupted update.
>>>>>
>>>>> However, if you don't care about update, Beam can auto generate these
>>>>> names for you! When you call PCollection.apply (if using BeamJava), simply
>>>>> omit the name parameter and Beam will auto generate a unique name for you.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <jo...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>>> accidentally have duplicate labels from using multiple of the same
>>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>> something
>>>>>>
>>>>>> Cheers,
>>>>>> Joey
>>>>>>
>>>>>

Re: [QUESTION] Why no auto labels?

Posted by XQ Hu via user <us...@beam.apache.org>.

Looks like this is the current behaviour. If you have `t =
beam.Filter(identity_filter)`, `t.label` is defined as
`Filter(identity_filter)`.

On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <jo...@schrodinger.com> wrote:

> You don't have to specify the names if the callable you pass in is
> /different/ for two `beam.Map`s, but  if the callable is the same you must
> specify a label. For example, the below will raise an exception:
>
> ```
>         | beam.Filter(identity_filter)
>         | beam.Filter(identity_filter)
> ```
>
> Here's an example on playground that shows the error message you get [1].
> I marked every line I added with a "# ++".
>
> It's a contrived example, but using a map or filter at the same pipeline
> level probably comes up often, at least in my inexperience. For example,
> you. might have a pipeline that partitions a pcoll into three different
> pcolls, runs some transforms on them, and then runs the same type of filter
> on each of them.
>
> The case that happens most often for me is using the `assert_that` [2]
> testing transform. In this case, I think often users will really have no
> need for a disambiguating label as they're often just writing unit tests
> that test a few different properties of their workflow.
>
> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
> [2]
> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>
> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
> user@beam.apache.org> wrote:
>
>> If I understand the question correctly, you don't have to specify those
>> names.
>>
>> As Reuven pointed out, it is probably a good idea so you have a stable /
>> deterministic graph.
>> But in the Python SDK, you can simply use pcollection | map_fn, instead
>> of pcollection | 'Map' >> map_fn.
>>
>> See an example here
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>
>>
>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> Hmm, I'm not sure what you mean by "updating pipelines in place". Can
>>> you elaborate?
>>>
>>> I forgot to mention my question is posed from the context of a python
>>> SDK user, and afaict, there doesn't seem to be an obvious way to
>>> autogenerate names/labels. Hearing that the java SDK supports it makes me
>>> wonder if the python SDK could support it as well though... (If so, I'd be
>>> happy to do implement it). Currently, it's fairly tedious to have to name
>>> every instance of a transform that you might reuse in a pipeline, e.g. when
>>> reapplying the same Map on different pcollections.
>>>
>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <us...@beam.apache.org>
>>> wrote:
>>>
>>>> Are you talking about transform names? The main reason was because for
>>>> runners that support updating pipelines in place, the only way to do so
>>>> safely is if the runner can perfectly identify which transforms in the new
>>>> graph match the ones in the old graph. There's no good way to auto generate
>>>> names that will stay stable across updates - even small changes to the
>>>> pipeline might change the order of nodes in the graph, which could result
>>>> in a corrupted update.
>>>>
>>>> However, if you don't care about update, Beam can auto generate these
>>>> names for you! When you call PCollection.apply (if using BeamJava), simply
>>>> omit the name parameter and Beam will auto generate a unique name for you.
>>>>
>>>> Reuven
>>>>
>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <jo...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> After writing a few pipelines now, I keep getting tripped up from
>>>>> accidentally have duplicate labels from using multiple of the same
>>>>> transforms without labeling them. I figure this must be a common complaint,
>>>>> so I was just curious, what the rationale behind this design was? My naive
>>>>> thought off the top of my head is that it'd be more user friendly to just
>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>> something
>>>>>
>>>>> Cheers,
>>>>> Joey
>>>>>
>>>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

You don't have to specify the names if the callable you pass in is
/different/ for two `beam.Map`s, but  if the callable is the same you must
specify a label. For example, the below will raise an exception:

```
        | beam.Filter(identity_filter)
        | beam.Filter(identity_filter)
```

Here's an example on playground that shows the error message you get [1]. I
marked every line I added with a "# ++".

It's a contrived example, but using a map or filter at the same pipeline
level probably comes up often, at least in my inexperience. For example,
you. might have a pipeline that partitions a pcoll into three different
pcolls, runs some transforms on them, and then runs the same type of filter
on each of them.

The case that happens most often for me is using the `assert_that` [2]
testing transform. In this case, I think often users will really have no
need for a disambiguating label as they're often just writing unit tests
that test a few different properties of their workflow.

[1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
[2]
https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that

On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <us...@beam.apache.org>
wrote:

> If I understand the question correctly, you don't have to specify those
> names.
>
> As Reuven pointed out, it is probably a good idea so you have a stable /
> deterministic graph.
> But in the Python SDK, you can simply use pcollection | map_fn, instead
> of pcollection | 'Map' >> map_fn.
>
> See an example here
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>
>
> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> Hmm, I'm not sure what you mean by "updating pipelines in place". Can you
>> elaborate?
>>
>> I forgot to mention my question is posed from the context of a python SDK
>> user, and afaict, there doesn't seem to be an obvious way to autogenerate
>> names/labels. Hearing that the java SDK supports it makes me wonder if the
>> python SDK could support it as well though... (If so, I'd be happy to do
>> implement it). Currently, it's fairly tedious to have to name every
>> instance of a transform that you might reuse in a pipeline, e.g. when
>> reapplying the same Map on different pcollections.
>>
>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <us...@beam.apache.org>
>> wrote:
>>
>>> Are you talking about transform names? The main reason was because for
>>> runners that support updating pipelines in place, the only way to do so
>>> safely is if the runner can perfectly identify which transforms in the new
>>> graph match the ones in the old graph. There's no good way to auto generate
>>> names that will stay stable across updates - even small changes to the
>>> pipeline might change the order of nodes in the graph, which could result
>>> in a corrupted update.
>>>
>>> However, if you don't care about update, Beam can auto generate these
>>> names for you! When you call PCollection.apply (if using BeamJava), simply
>>> omit the name parameter and Beam will auto generate a unique name for you.
>>>
>>> Reuven
>>>
>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <jo...@schrodinger.com>
>>> wrote:
>>>
>>>> After writing a few pipelines now, I keep getting tripped up from
>>>> accidentally have duplicate labels from using multiple of the same
>>>> transforms without labeling them. I figure this must be a common complaint,
>>>> so I was just curious, what the rationale behind this design was? My naive
>>>> thought off the top of my head is that it'd be more user friendly to just
>>>> auto increment duplicate transforms, but I figure I must be missing
>>>> something
>>>>
>>>> Cheers,
>>>> Joey
>>>>
>>>

Re: [QUESTION] Why no auto labels?

Posted by Bruno Volpato via user <us...@beam.apache.org>.

If I understand the question correctly, you don't have to specify those
names.

As Reuven pointed out, it is probably a good idea so you have a stable /
deterministic graph.
But in the Python SDK, you can simply use pcollection | map_fn,
instead of pcollection
| 'Map' >> map_fn.

See an example here
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116


On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <jo...@schrodinger.com> wrote:

> Hmm, I'm not sure what you mean by "updating pipelines in place". Can you
> elaborate?
>
> I forgot to mention my question is posed from the context of a python SDK
> user, and afaict, there doesn't seem to be an obvious way to autogenerate
> names/labels. Hearing that the java SDK supports it makes me wonder if the
> python SDK could support it as well though... (If so, I'd be happy to do
> implement it). Currently, it's fairly tedious to have to name every
> instance of a transform that you might reuse in a pipeline, e.g. when
> reapplying the same Map on different pcollections.
>
> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <us...@beam.apache.org>
> wrote:
>
>> Are you talking about transform names? The main reason was because for
>> runners that support updating pipelines in place, the only way to do so
>> safely is if the runner can perfectly identify which transforms in the new
>> graph match the ones in the old graph. There's no good way to auto generate
>> names that will stay stable across updates - even small changes to the
>> pipeline might change the order of nodes in the graph, which could result
>> in a corrupted update.
>>
>> However, if you don't care about update, Beam can auto generate these
>> names for you! When you call PCollection.apply (if using BeamJava), simply
>> omit the name parameter and Beam will auto generate a unique name for you.
>>
>> Reuven
>>
>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <jo...@schrodinger.com>
>> wrote:
>>
>>> After writing a few pipelines now, I keep getting tripped up from
>>> accidentally have duplicate labels from using multiple of the same
>>> transforms without labeling them. I figure this must be a common complaint,
>>> so I was just curious, what the rationale behind this design was? My naive
>>> thought off the top of my head is that it'd be more user friendly to just
>>> auto increment duplicate transforms, but I figure I must be missing
>>> something
>>>
>>> Cheers,
>>> Joey
>>>
>>

Re: [QUESTION] Why no auto labels?

Posted by Joey Tran <jo...@schrodinger.com>.

Hmm, I'm not sure what you mean by "updating pipelines in place". Can you
elaborate?

I forgot to mention my question is posed from the context of a python SDK
user, and afaict, there doesn't seem to be an obvious way to autogenerate
names/labels. Hearing that the java SDK supports it makes me wonder if the
python SDK could support it as well though... (If so, I'd be happy to do
implement it). Currently, it's fairly tedious to have to name every
instance of a transform that you might reuse in a pipeline, e.g. when
reapplying the same Map on different pcollections.

On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <us...@beam.apache.org>
wrote:

> Are you talking about transform names? The main reason was because for
> runners that support updating pipelines in place, the only way to do so
> safely is if the runner can perfectly identify which transforms in the new
> graph match the ones in the old graph. There's no good way to auto generate
> names that will stay stable across updates - even small changes to the
> pipeline might change the order of nodes in the graph, which could result
> in a corrupted update.
>
> However, if you don't care about update, Beam can auto generate these
> names for you! When you call PCollection.apply (if using BeamJava), simply
> omit the name parameter and Beam will auto generate a unique name for you.
>
> Reuven
>
> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <jo...@schrodinger.com>
> wrote:
>
>> After writing a few pipelines now, I keep getting tripped up from
>> accidentally have duplicate labels from using multiple of the same
>> transforms without labeling them. I figure this must be a common complaint,
>> so I was just curious, what the rationale behind this design was? My naive
>> thought off the top of my head is that it'd be more user friendly to just
>> auto increment duplicate transforms, but I figure I must be missing
>> something
>>
>> Cheers,
>> Joey
>>
>