Posted to user@beam.apache.org by Evan Galpin <eg...@apache.org> on 2022/07/05 15:03:02 UTC

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

+dev@

Reviving this thread as it has hit me again on Dataflow.  I am trying to
upgrade an active streaming pipeline from 2.36.0 to 2.40.0.  Originally, I
received an error that the step "Flatten.pCollections" was missing from the
new job graph.  I knew from the code that that wasn't true, so I dumped the
job file via "--dataflowJobFile" for both the running pipeline and for the
new version I'm attempting to update to.  Both job files showed identical
data for the Flatten.pCollections step, which raises the question of why
that would have been reported as missing.
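
In case it helps anyone reproduce the comparison, the diff I did amounts
to comparing the step names in the two dumped job files, roughly like
this sketch (Jackson for parsing; the file paths and the
"steps"/"properties"/"user_name" field names are assumptions based on the
v1b3 job JSON layout):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.Set;
import java.util.TreeSet;

public class JobFileDiff {
  // Collect the human-readable step names from a dumped job file.
  static Set<String> stepNames(String path) throws Exception {
    JsonNode job = new ObjectMapper().readTree(new File(path));
    Set<String> names = new TreeSet<>();
    for (JsonNode step : job.path("steps")) {
      names.add(step.path("properties").path("user_name").asText());
    }
    return names;
  }

  public static void main(String[] args) throws Exception {
    Set<String> oldSteps = stepNames("/tmp/job-2.36.json");
    Set<String> newSteps = stepNames("/tmp/job-2.40.json");
    Set<String> removed = new TreeSet<>(oldSteps);
    removed.removeAll(newSteps);
    Set<String> added = new TreeSet<>(newSteps);
    added.removeAll(oldSteps);
    System.out.println("Only in old job: " + removed);
    System.out.println("Only in new job: " + added);
  }
}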

Out of curiosity I then tried mapping the step to the same name, which
changed the error to:  "The Coder or type for step
Flatten.pCollections/Unzipped-2/FlattenReplace has changed."  Again, the
job files show identical coders for the Flatten step (though
"Unzipped-2/FlattenReplace" is not present in the job file, maybe an
internal Dataflow thing?), so I'm confident that the coder hasn't actually
changed.
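
For completeness, the mapping attempt was supplied through the runner's
update options, along the lines of this sketch (my best rendering of the
DataflowPipelineOptions setters; the job name is a placeholder):

import java.util.Collections;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateAttempt {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setJobName("my-streaming-job");  // must match the running job's name
    options.setUpdate(true);                 // request an in-place streaming update
    // Map the old step name to the (identical) new one.
    options.setTransformNameMapping(
        Collections.singletonMap("Flatten.pCollections", "Flatten.pCollections"));
    Pipeline p = Pipeline.create(options);
    // ...construct the same pipeline as the running job here...
    p.run();
  }
}

The equivalent command-line flags (--update, --jobName, and
--transformNameMapping) should behave the same way.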

I'm not sure how to proceed in updating the running pipeline, and I'd
really prefer not to drain.  Any ideas?

Thanks,
Evan


On Fri, Oct 22, 2021 at 3:36 PM Evan Galpin <ev...@gmail.com> wrote:

> Thanks for the ideas Luke. I checked out the json graphs as per your
> recommendation (thanks for that, was previously unaware), and the
> "output_info" was identical for both the running pipeline and the pipeline
> I was hoping to update it with.  I ended up opting to just drain and submit
> the updated pipeline as a new job.  Thanks for the tips!
>
> Thanks,
> Evan
>
> On Thu, Oct 21, 2021 at 7:02 PM Luke Cwik <lc...@google.com> wrote:
>
>> I would suggest dumping the JSON representation (with the
>> --dataflowJobFile=/path/to/output.json) of the pipeline before and after
>> and looking to see what is being submitted to Dataflow. Dataflow's JSON
>> graph representation is a bipartite graph where there are transform nodes
>> with inputs and outputs and PCollection nodes with no inputs or outputs.
>> The PCollection nodes typically end with the suffix ".out". This could help
>> find steps that have been added/removed/renamed.
>>
>> The PipelineDotRenderer[1] might be of use as well.
>>
>> 1:
>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PipelineDotRenderer.java
>>
>> On Thu, Oct 21, 2021 at 11:54 AM Evan Galpin <ev...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm looking for any help regarding updating streaming jobs which are
>>> already running on Dataflow.  Specifically I'm seeking guidance for
>>> situations where Fusion is involved, and trying to decipher which old steps
>>> should be mapped to which new steps.
>>>
>>> I have a case where I updated the steps which come after the step in
>>> question, but when I attempt to update there is an error that "<old step>
>>> no longer produces data to the steps <downstream step>". I believe that
>>> <old step> is only changed as a result of fusion, and in reality it does in
>>> fact produce data to <downstream step> (confirmed when deployed as a new
>>> job for testing purposes).
>>>
>>> Is there a guide for how to deal with updates and fusion?
>>>
>>> Thanks,
>>> Evan
>>>
>>
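
For reference, the PipelineDotRenderer approach quoted above can be used
roughly as in the sketch below (assuming the core-construction-java
artifact is on the classpath and that toDotString accepts the SDK
Pipeline):

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.beam.runners.core.construction.renderer.PipelineDotRenderer;
import org.apache.beam.sdk.Pipeline;

public class DotDump {
  // Render the pipeline graph as Graphviz DOT text for visual diffing.
  static void dumpDot(Pipeline pipeline) throws Exception {
    String dot = PipelineDotRenderer.toDotString(pipeline);
    Files.write(Paths.get("/tmp/pipeline.dot"), dot.getBytes());
    // Render the output with: dot -Tpng /tmp/pipeline.dot -o pipeline.png
  }
}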

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

Posted by Evan Galpin <eg...@apache.org>.
Ya fair enough, makes sense. I’ll reach out to GCP. Thanks Luke!

- Evan

On Fri, Jul 8, 2022 at 11:24 Luke Cwik <lc...@google.com> wrote:

> I was suggesting GCP support mainly because I don't think you want to
> share the 2.36 and 2.40 versions of your job file publicly, and someone
> familiar with the layout and format may spot a meaningful difference.
>
> Also, if it turns out that there is no meaningful difference between the
> two, then the internal mechanics of how Dataflow modifies the graph are
> not surfaced back to you in enough depth to debug further.
>
>
>
> On Fri, Jul 8, 2022 at 6:12 AM Evan Galpin <eg...@apache.org> wrote:
>
>> Thanks for your response Luke :-)
>>
>> Updating in 2.36.0 works as expected, but as you alluded to I'm
>> attempting to update to the latest SDK; in this case there are no code
>> changes in the user code, only the SDK version.  Is GCP support the only
>> tool when it comes to deciphering the steps added by Dataflow?  I would
>> love to be able to inspect the complete graph with those extra steps like
>> "Unzipped-2/FlattenReplace" that aren't in the job file.
>>
>> Thanks,
>> Evan
>>
>> On Wed, Jul 6, 2022 at 4:21 PM Luke Cwik via user <us...@beam.apache.org>
>> wrote:
>>
>>> Does doing a pipeline update in 2.36 work or do you want to do an update
>>> to get the latest version?
>>>
>>> Feel free to share the job files with GCP support. It could be something
>>> internal but the coders for ephemeral steps that Dataflow adds are based
>>> upon existing coders within the graph.
>>>
>>> On Tue, Jul 5, 2022 at 8:03 AM Evan Galpin <eg...@apache.org> wrote:
>>>
>>>> +dev@
>>>>
>>>> Reviving this thread as it has hit me again on Dataflow.  I am trying
>>>> to upgrade an active streaming pipeline from 2.36.0 to 2.40.0.  Originally,
>>>> I received an error that the step "Flatten.pCollections" was missing from
>>>> the new job graph.  I knew from the code that that wasn't true, so I dumped
>>>> the job file via "--dataflowJobFile" for both the running pipeline and for
>>>> the new version I'm attempting to update to.  Both job files showed
>>>> identical data for the Flatten.pCollections step, which raises the question
>>>> of why that would have been reported as missing.
>>>>
>>>> Out of curiosity I then tried mapping the step to the same name, which
>>>> changed the error to:  "The Coder or type for step
>>>> Flatten.pCollections/Unzipped-2/FlattenReplace has changed."  Again, the
>>>> job files show identical coders for the Flatten step (though
>>>> "Unzipped-2/FlattenReplace" is not present in the job file, maybe an
>>>> internal Dataflow thing?), so I'm confident that the coder hasn't actually
>>>> changed.
>>>>
>>>> I'm not sure how to proceed in updating the running pipeline, and I'd
>>>> really prefer not to drain.  Any ideas?
>>>>
>>>> Thanks,
>>>> Evan
>>>>
>>>>
>>>> On Fri, Oct 22, 2021 at 3:36 PM Evan Galpin <ev...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the ideas Luke. I checked out the json graphs as per your
>>>>> recommendation (thanks for that, was previously unaware), and the
>>>>> "output_info" was identical for both the running pipeline and the pipeline
>>>>> I was hoping to update it with.  I ended up opting to just drain and submit
>>>>> the updated pipeline as a new job.  Thanks for the tips!
>>>>>
>>>>> Thanks,
>>>>> Evan
>>>>>
>>>>> On Thu, Oct 21, 2021 at 7:02 PM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> I would suggest dumping the JSON representation (with the
>>>>>> --dataflowJobFile=/path/to/output.json) of the pipeline before and after
>>>>>> and looking to see what is being submitted to Dataflow. Dataflow's JSON
>>>>>> graph representation is a bipartite graph where there are transform nodes
>>>>>> with inputs and outputs and PCollection nodes with no inputs or outputs.
>>>>>> The PCollection nodes typically end with the suffix ".out". This could help
>>>>>> find steps that have been added/removed/renamed.
>>>>>>
>>>>>> The PipelineDotRenderer[1] might be of use as well.
>>>>>>
>>>>>> 1:
>>>>>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PipelineDotRenderer.java
>>>>>>
>>>>>> On Thu, Oct 21, 2021 at 11:54 AM Evan Galpin <ev...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm looking for any help regarding updating streaming jobs which are
>>>>>>> already running on Dataflow.  Specifically I'm seeking guidance for
>>>>>>> situations where Fusion is involved, and trying to decipher which old steps
>>>>>>> should be mapped to which new steps.
>>>>>>>
>>>>>>> I have a case where I updated the steps which come after the step in
>>>>>>> question, but when I attempt to update there is an error that "<old step>
>>>>>>> no longer produces data to the steps <downstream step>". I believe that
>>>>>>> <old step> is only changed as a result of fusion, and in reality it does in
>>>>>>> fact produce data to <downstream step> (confirmed when deployed as a new
>>>>>>> job for testing purposes).
>>>>>>>
>>>>>>> Is there a guide for how to deal with updates and fusion?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Evan
>>>>>>>
>>>>>>

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

Posted by Luke Cwik via user <us...@beam.apache.org>.
I was suggesting GCP support mainly because I don't think you want to share
the 2.36 and 2.40 versions of your job file publicly, and someone familiar
with the layout and format may spot a meaningful difference.

Also, if it turns out that there is no meaningful difference between the
two, then the internal mechanics of how Dataflow modifies the graph are
not surfaced back to you in enough depth to debug further.



On Fri, Jul 8, 2022 at 6:12 AM Evan Galpin <eg...@apache.org> wrote:

> Thanks for your response Luke :-)
>
> Updating in 2.36.0 works as expected, but as you alluded to I'm attempting
> to update to the latest SDK; in this case there are no code changes in the
> user code, only the SDK version.  Is GCP support the only tool when it
> comes to deciphering the steps added by Dataflow?  I would love to be able
> to inspect the complete graph with those extra steps like
> "Unzipped-2/FlattenReplace" that aren't in the job file.
>
> Thanks,
> Evan
>
> On Wed, Jul 6, 2022 at 4:21 PM Luke Cwik via user <us...@beam.apache.org>
> wrote:
>
>> Does doing a pipeline update in 2.36 work or do you want to do an update
>> to get the latest version?
>>
>> Feel free to share the job files with GCP support. It could be something
>> internal but the coders for ephemeral steps that Dataflow adds are based
>> upon existing coders within the graph.
>>
>> On Tue, Jul 5, 2022 at 8:03 AM Evan Galpin <eg...@apache.org> wrote:
>>
>>> +dev@
>>>
>>> Reviving this thread as it has hit me again on Dataflow.  I am trying to
>>> upgrade an active streaming pipeline from 2.36.0 to 2.40.0.  Originally, I
>>> received an error that the step "Flatten.pCollections" was missing from the
>>> new job graph.  I knew from the code that that wasn't true, so I dumped the
>>> job file via "--dataflowJobFile" for both the running pipeline and for the
>>> new version I'm attempting to update to.  Both job files showed identical
>>> data for the Flatten.pCollections step, which raises the question of why
>>> that would have been reported as missing.
>>>
>>> Out of curiosity I then tried mapping the step to the same name, which
>>> changed the error to:  "The Coder or type for step
>>> Flatten.pCollections/Unzipped-2/FlattenReplace has changed."  Again, the
>>> job files show identical coders for the Flatten step (though
>>> "Unzipped-2/FlattenReplace" is not present in the job file, maybe an
>>> internal Dataflow thing?), so I'm confident that the coder hasn't actually
>>> changed.
>>>
>>> I'm not sure how to proceed in updating the running pipeline, and I'd
>>> really prefer not to drain.  Any ideas?
>>>
>>> Thanks,
>>> Evan
>>>
>>>
>>> On Fri, Oct 22, 2021 at 3:36 PM Evan Galpin <ev...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the ideas Luke. I checked out the json graphs as per your
>>>> recommendation (thanks for that, was previously unaware), and the
>>>> "output_info" was identical for both the running pipeline and the pipeline
>>>> I was hoping to update it with.  I ended up opting to just drain and submit
>>>> the updated pipeline as a new job.  Thanks for the tips!
>>>>
>>>> Thanks,
>>>> Evan
>>>>
>>>> On Thu, Oct 21, 2021 at 7:02 PM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> I would suggest dumping the JSON representation (with the
>>>>> --dataflowJobFile=/path/to/output.json) of the pipeline before and after
>>>>> and looking to see what is being submitted to Dataflow. Dataflow's JSON
>>>>> graph representation is a bipartite graph where there are transform nodes
>>>>> with inputs and outputs and PCollection nodes with no inputs or outputs.
>>>>> The PCollection nodes typically end with the suffix ".out". This could help
>>>>> find steps that have been added/removed/renamed.
>>>>>
>>>>> The PipelineDotRenderer[1] might be of use as well.
>>>>>
>>>>> 1:
>>>>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PipelineDotRenderer.java
>>>>>
>>>>> On Thu, Oct 21, 2021 at 11:54 AM Evan Galpin <ev...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm looking for any help regarding updating streaming jobs which are
>>>>>> already running on Dataflow.  Specifically I'm seeking guidance for
>>>>>> situations where Fusion is involved, and trying to decipher which old steps
>>>>>> should be mapped to which new steps.
>>>>>>
>>>>>> I have a case where I updated the steps which come after the step in
>>>>>> question, but when I attempt to update there is an error that "<old step>
>>>>>> no longer produces data to the steps <downstream step>". I believe that
>>>>>> <old step> is only changed as a result of fusion, and in reality it does in
>>>>>> fact produce data to <downstream step> (confirmed when deployed as a new
>>>>>> job for testing purposes).
>>>>>>
>>>>>> Is there a guide for how to deal with updates and fusion?
>>>>>>
>>>>>> Thanks,
>>>>>> Evan
>>>>>>
>>>>>

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

Posted by Evan Galpin <eg...@apache.org>.
Thanks for your response Luke :-)

Updating in 2.36.0 works as expected, but as you alluded to I'm attempting
to update to the latest SDK; in this case there are no code changes in the
user code, only the SDK version.  Is GCP support the only tool when it
comes to deciphering the steps added by Dataflow?  I would love to be able
to inspect the complete graph with those extra steps like
"Unzipped-2/FlattenReplace" that aren't in the job file.

Thanks,
Evan

On Wed, Jul 6, 2022 at 4:21 PM Luke Cwik via user <us...@beam.apache.org>
wrote:

> Does doing a pipeline update in 2.36 work or do you want to do an update
> to get the latest version?
>
> Feel free to share the job files with GCP support. It could be something
> internal but the coders for ephemeral steps that Dataflow adds are based
> upon existing coders within the graph.
>
> On Tue, Jul 5, 2022 at 8:03 AM Evan Galpin <eg...@apache.org> wrote:
>
>> +dev@
>>
>> Reviving this thread as it has hit me again on Dataflow.  I am trying to
>> upgrade an active streaming pipeline from 2.36.0 to 2.40.0.  Originally, I
>> received an error that the step "Flatten.pCollections" was missing from the
>> new job graph.  I knew from the code that that wasn't true, so I dumped the
>> job file via "--dataflowJobFile" for both the running pipeline and for the
>> new version I'm attempting to update to.  Both job files showed identical
>> data for the Flatten.pCollections step, which raises the question of why
>> that would have been reported as missing.
>>
>> Out of curiosity I then tried mapping the step to the same name, which
>> changed the error to:  "The Coder or type for step
>> Flatten.pCollections/Unzipped-2/FlattenReplace has changed."  Again, the
>> job files show identical coders for the Flatten step (though
>> "Unzipped-2/FlattenReplace" is not present in the job file, maybe an
>> internal Dataflow thing?), so I'm confident that the coder hasn't actually
>> changed.
>>
>> I'm not sure how to proceed in updating the running pipeline, and I'd
>> really prefer not to drain.  Any ideas?
>>
>> Thanks,
>> Evan
>>
>>
>> On Fri, Oct 22, 2021 at 3:36 PM Evan Galpin <ev...@gmail.com>
>> wrote:
>>
>>> Thanks for the ideas Luke. I checked out the json graphs as per your
>>> recommendation (thanks for that, was previously unaware), and the
>>> "output_info" was identical for both the running pipeline and the pipeline
>>> I was hoping to update it with.  I ended up opting to just drain and submit
>>> the updated pipeline as a new job.  Thanks for the tips!
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Thu, Oct 21, 2021 at 7:02 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> I would suggest dumping the JSON representation (with the
>>>> --dataflowJobFile=/path/to/output.json) of the pipeline before and after
>>>> and looking to see what is being submitted to Dataflow. Dataflow's JSON
>>>> graph representation is a bipartite graph where there are transform nodes
>>>> with inputs and outputs and PCollection nodes with no inputs or outputs.
>>>> The PCollection nodes typically end with the suffix ".out". This could help
>>>> find steps that have been added/removed/renamed.
>>>>
>>>> The PipelineDotRenderer[1] might be of use as well.
>>>>
>>>> 1:
>>>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PipelineDotRenderer.java
>>>>
>>>> On Thu, Oct 21, 2021 at 11:54 AM Evan Galpin <ev...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm looking for any help regarding updating streaming jobs which are
>>>>> already running on Dataflow.  Specifically I'm seeking guidance for
>>>>> situations where Fusion is involved, and trying to decipher which old steps
>>>>> should be mapped to which new steps.
>>>>>
>>>>> I have a case where I updated the steps which come after the step in
>>>>> question, but when I attempt to update there is an error that "<old step>
>>>>> no longer produces data to the steps <downstream step>". I believe that
>>>>> <old step> is only changed as a result of fusion, and in reality it does in
>>>>> fact produce data to <downstream step> (confirmed when deployed as a new
>>>>> job for testing purposes).
>>>>>
>>>>> Is there a guide for how to deal with updates and fusion?
>>>>>
>>>>> Thanks,
>>>>> Evan
>>>>>
>>>>

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

Posted by Luke Cwik via user <us...@beam.apache.org>.
Does doing a pipeline update in 2.36 work or do you want to do an update to
get the latest version?

Feel free to share the job files with GCP support. It could be something
internal but the coders for ephemeral steps that Dataflow adds are based
upon existing coders within the graph.
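
To make that concrete: each step's output coder appears in the dumped job
file under its properties, so a quick scan looks something like the
sketch below (the "output_info" and "encoding" keys reflect my
recollection of the job file layout, so treat them as assumptions):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class PrintStepCoders {
  public static void main(String[] args) throws Exception {
    JsonNode job = new ObjectMapper().readTree(new File(args[0]));
    for (JsonNode step : job.path("steps")) {
      JsonNode props = step.path("properties");
      System.out.println(props.path("user_name").asText());
      // Each output records its coder spec under "encoding".
      for (JsonNode output : props.path("output_info")) {
        System.out.println("  coder: " + output.path("encoding"));
      }
    }
  }
}

Comparing that output for the 2.36 and 2.40 job files would show whether
any coder actually differs at the step level.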

On Tue, Jul 5, 2022 at 8:03 AM Evan Galpin <eg...@apache.org> wrote:

> +dev@
>
> Reviving this thread as it has hit me again on Dataflow.  I am trying to
> upgrade an active streaming pipeline from 2.36.0 to 2.40.0.  Originally, I
> received an error that the step "Flatten.pCollections" was missing from the
> new job graph.  I knew from the code that that wasn't true, so I dumped the
> job file via "--dataflowJobFile" for both the running pipeline and for the
> new version I'm attempting to update to.  Both job files showed identical
> data for the Flatten.pCollections step, which raises the question of why
> that would have been reported as missing.
>
> Out of curiosity I then tried mapping the step to the same name, which
> changed the error to:  "The Coder or type for step
> Flatten.pCollections/Unzipped-2/FlattenReplace has changed."  Again, the
> job files show identical coders for the Flatten step (though
> "Unzipped-2/FlattenReplace" is not present in the job file, maybe an
> internal Dataflow thing?), so I'm confident that the coder hasn't actually
> changed.
>
> I'm not sure how to proceed in updating the running pipeline, and I'd
> really prefer not to drain.  Any ideas?
>
> Thanks,
> Evan
>
>
> On Fri, Oct 22, 2021 at 3:36 PM Evan Galpin <ev...@gmail.com> wrote:
>
>> Thanks for the ideas Luke. I checked out the json graphs as per your
>> recommendation (thanks for that, was previously unaware), and the
>> "output_info" was identical for both the running pipeline and the pipeline
>> I was hoping to update it with.  I ended up opting to just drain and submit
>> the updated pipeline as a new job.  Thanks for the tips!
>>
>> Thanks,
>> Evan
>>
>> On Thu, Oct 21, 2021 at 7:02 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> I would suggest dumping the JSON representation (with the
>>> --dataflowJobFile=/path/to/output.json) of the pipeline before and after
>>> and looking to see what is being submitted to Dataflow. Dataflow's JSON
>>> graph representation is a bipartite graph where there are transform nodes
>>> with inputs and outputs and PCollection nodes with no inputs or outputs.
>>> The PCollection nodes typically end with the suffix ".out". This could help
>>> find steps that have been added/removed/renamed.
>>>
>>> The PipelineDotRenderer[1] might be of use as well.
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PipelineDotRenderer.java
>>>
>>> On Thu, Oct 21, 2021 at 11:54 AM Evan Galpin <ev...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm looking for any help regarding updating streaming jobs which are
>>>> already running on Dataflow.  Specifically I'm seeking guidance for
>>>> situations where Fusion is involved, and trying to decipher which old steps
>>>> should be mapped to which new steps.
>>>>
>>>> I have a case where I updated the steps which come after the step in
>>>> question, but when I attempt to update there is an error that "<old step>
>>>> no longer produces data to the steps <downstream step>". I believe that
>>>> <old step> is only changed as a result of fusion, and in reality it does in
>>>> fact produce data to <downstream step> (confirmed when deployed as a new
>>>> job for testing purposes).
>>>>
>>>> Is there a guide for how to deal with updates and fusion?
>>>>
>>>> Thanks,
>>>> Evan
>>>>
>>>
