Posted to commits@beam.apache.org by "Amit Sela (JIRA)" <ji...@apache.org> on 2017/01/06 22:01:59 UTC

[jira] [Created] (BEAM-1250) Remove leaf when materializing PCollection to avoid re-evaluation.

Amit Sela created BEAM-1250:
-------------------------------

             Summary: Remove leaf when materializing PCollection to avoid re-evaluation.
                 Key: BEAM-1250
                 URL: https://issues.apache.org/jira/browse/BEAM-1250
             Project: Beam
          Issue Type: Bug
          Components: runner-spark
            Reporter: Amit Sela
            Assignee: Amit Sela


When the runner materializes a {{PCollection}} (implemented as an {{RDD}}), for example to create a {{PCollectionView}}, it should remove the materialized {{RDD}} from the "leaves" set.
The runner keeps track of leaves left unhandled in the DAG so that it can force an action on them. {{Write}}, for one, is implemented as a sequence of {{ParDo}}s, which the runner implements via {{mapPartitions}}, a lazy transformation, so an action has to be forced.
Materializing an {{RDD}} is done via the action {{collect()}}, so there is no reason to keep it in the "leaves" set.
Currently, it remains in the "leaves" set, so an action is forced on it a second time and its lineage is re-evaluated; unless the {{RDD}} happens to be cached, the lineage executes twice.
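
To illustrate the double evaluation, here is a toy model in plain Python. It is only a sketch and not the runner's code: {{LazyDataset}}, {{EvaluationContext}}, {{materialize}}, {{force_leaves}} and the {{remove_leaf_on_materialize}} flag are hypothetical names made up for this example, and {{LazyDataset}} merely counts how often its lineage would be evaluated instead of using Spark.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical): counts lineage evaluations."""

    def __init__(self, name, elements):
        self.name = name
        self._elements = list(elements)
        self.evaluations = 0

    def collect(self):
        # An action: evaluates the whole lineage and returns the elements.
        self.evaluations += 1
        return list(self._elements)


class EvaluationContext:
    """Hypothetical context that tracks un-handled leaves of the DAG."""

    def __init__(self, remove_leaf_on_materialize):
        self.leaves = []
        self.remove_leaf_on_materialize = remove_leaf_on_materialize

    def register(self, dataset):
        # Every new dataset starts out as a leaf.
        self.leaves.append(dataset)
        return dataset

    def materialize(self, dataset):
        # Materializing already runs an action on the dataset ...
        values = dataset.collect()
        # ... so with the proposed behaviour it stops being a leaf.
        if self.remove_leaf_on_materialize and dataset in self.leaves:
            self.leaves.remove(dataset)
        return values

    def force_leaves(self):
        # End of pipeline translation: force an action on remaining leaves.
        for dataset in self.leaves:
            dataset.collect()
        self.leaves.clear()


def run(remove_leaf_on_materialize):
    """Return how many times the dataset's lineage was evaluated."""
    ctx = EvaluationContext(remove_leaf_on_materialize)
    dataset = ctx.register(LazyDataset("view-input", [1, 2, 3]))
    ctx.materialize(dataset)   # e.g. to build a view
    ctx.force_leaves()
    return dataset.evaluations


if __name__ == "__main__":
    print("leaf kept:   ", run(remove_leaf_on_materialize=False))
    print("leaf removed:", run(remove_leaf_on_materialize=True))
```

In this model, keeping the materialized dataset in the leaves list makes {{force_leaves}} evaluate it a second time, while removing it on materialization leaves a single evaluation.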



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)