Posted to github@beam.apache.org by "mosche (via GitHub)" <gi...@apache.org> on 2023/02/01 15:03:30 UTC

[GitHub] [beam] mosche commented on a diff in pull request #25187: [Spark Dataset runner] Break lineage of dataset to reduce Spark planning overhead in case of large query plans

mosche commented on code in PR #25187:
URL: https://github.com/apache/beam/pull/25187#discussion_r1093346371


##########
runners/spark/3/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/PipelineTranslator.java:
##########
@@ -129,12 +138,22 @@ public EvaluationContext translate(
    */
   private static final class TranslationResult<T> implements EvaluationContext.NamedDataset<T> {
     private final String name;
+    private final float complexityFactor;
+    private float planComplexity = 0;
+
     private @MonotonicNonNull Dataset<WindowedValue<T>> dataset = null;
     private @MonotonicNonNull Broadcast<SideInputValues<T>> sideInputBroadcast = null;
+
+    // dependent downstream transforms (if empty this is a leaf)
     private final Set<PTransform<?, ?>> dependentTransforms = new HashSet<>();
+    // upstream dependencies (requires inputs)

Review Comment:
   These are the upstream dependencies in terms of data flow, i.e. the data this result depends on. The set above holds the downstream dependencies, i.e. the transforms that use this result as input.
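
   To make the two directions concrete, here is a minimal, hypothetical sketch (not the actual Beam `TranslationResult` code) of a node tracking both its downstream consumers and its upstream inputs; all names below are illustrative assumptions:

   ```java
   import java.util.HashSet;
   import java.util.Set;

   /** Hypothetical sketch of a translation node tracking both dependency directions. */
   final class NodeSketch {
     final String name;
     // Downstream: transforms that consume this node's output (if empty, this is a leaf).
     final Set<NodeSketch> dependentTransforms = new HashSet<>();
     // Upstream: nodes whose output this node requires as input.
     final Set<NodeSketch> requiredInputs = new HashSet<>();

     NodeSketch(String name) {
       this.name = name;
     }

     /** Wire one data-flow edge: upstream produces data consumed by downstream. */
     static void connect(NodeSketch upstream, NodeSketch downstream) {
       upstream.dependentTransforms.add(downstream);
       downstream.requiredInputs.add(upstream);
     }

     boolean isLeaf() {
       return dependentTransforms.isEmpty();
     }
   }
   ```

   In this sketch, `connect(read, map)` records `map` as a downstream dependent of `read` and, symmetrically, `read` as an upstream input of `map`.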



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org