Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/07/03 06:32:02 UTC

[GitHub] [beam] ihji commented on pull request #12086: [BEAM-10322] allow only single assignment to producing stages by pcol…

ihji commented on pull request #12086:
URL: https://github.com/apache/beam/pull/12086#issuecomment-653377848


   Here's a concrete example of a problem that this PR fixes:
   ```
   Stage A:
     (Input PCollection i - PTransform 1 - Output PCollection j)
   Stage B:
     (Input PCollection j - PTransform 2 - Output PCollection k)
   Stage C:
     (Input PCollection j for side input - PTransform 3)
   ```
   We want `Stage A` to be found as the stage that emits the side input for `Stage C`. However, some synthetic PTransforms are inserted during the pipeline optimization phase:
   ```
   Stage A:
     (Input PCollection i - PTransform 1 - Output PCollection j)
     (Input PCollection j - Synthetic Write - Data Sink)
   Stage B:
     (Data source - Synthetic Read - Output PCollection j)
     (Input PCollection j - PTransform 2 - Output PCollection k)
   Stage C:
     (Input PCollection j for side input - PTransform 3)
   ```
   If we allow multiple assignments to `producing_stages_by_pcoll`, the entry for PCollection j is overwritten by `Stage B`, which comes after `Stage A` topologically, so `Stage B` ends up emitting the side input for `Stage C`. Since `Stage B` and `Stage C` have no dependency on each other, their execution order is not fixed: the pipeline succeeds when `Stage B` is executed first and fails when `Stage C` is executed before `Stage B`.
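
The difference between the two assignment policies can be sketched in a few lines of Python. This is a simplified illustration, not the runner's actual code: the `Transform` and `Stage` tuples and the `map_producers` helper are hypothetical stand-ins, and only the name `producing_stages_by_pcoll` comes from the PR.

```python
from collections import namedtuple

# Hypothetical, simplified stand-ins for the runner's stage objects.
Transform = namedtuple("Transform", ["name", "outputs"])
Stage = namedtuple("Stage", ["name", "transforms"])

# The optimized pipeline from the example above, in topological order.
stages = [
    Stage("A", [Transform("PTransform 1", ["j"]),
                Transform("Synthetic Write", [])]),
    Stage("B", [Transform("Synthetic Read", ["j"]),
                Transform("PTransform 2", ["k"])]),
    Stage("C", [Transform("PTransform 3", [])]),
]


def map_producers(stages, single_assignment):
    """Map each PCollection to the name of the stage recorded as its producer."""
    producing_stages_by_pcoll = {}
    for stage in stages:
        for transform in stage.transforms:
            for pcoll in transform.outputs:
                if single_assignment and pcoll in producing_stages_by_pcoll:
                    # Keep the first (real) producer; ignore later re-emitters.
                    continue
                producing_stages_by_pcoll[pcoll] = stage.name
    return producing_stages_by_pcoll


overwriting = map_producers(stages, single_assignment=False)
first_wins = map_producers(stages, single_assignment=True)

print(overwriting["j"])  # Stage B's synthetic read overwrote Stage A
print(first_wins["j"])   # Stage A is kept as the producer
```

With overwriting, the side input for `Stage C` is attributed to `Stage B`, which `Stage C` does not depend on. With single assignment, it stays attributed to `Stage A`, the stage that actually computes PCollection j.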

