Posted to commits@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/01/06 22:04:59 UTC

[jira] [Commented] (BEAM-1250) Remove leaf when materializing PCollection to avoid re-evaluation.

    [ https://issues.apache.org/jira/browse/BEAM-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805923#comment-15805923 ] 

ASF GitHub Bot commented on BEAM-1250:
--------------------------------------

GitHub user amitsela opened a pull request:

    https://github.com/apache/beam/pull/1747

    [BEAM-1250] Remove leaf when materializing PCollection to avoid re-evaluation.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/amitsela/beam remove-leaf-getvalues

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/1747.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1747
    
----
commit 7e7715035c870c28f4294fe52a0cc7c5d838aee2
Author: Sela <an...@paypal.com>
Date:   2017-01-06T22:03:34Z

    [BEAM-1250] Remove leaf when materializing PCollection to avoid re-evaluation.

----


> Remove leaf when materializing PCollection to avoid re-evaluation.
> ------------------------------------------------------------------
>
>                 Key: BEAM-1250
>                 URL: https://issues.apache.org/jira/browse/BEAM-1250
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>
> When materializing a {{PCollection}} (implemented as an {{RDD}}), for example to create a {{PCollectionView}}, the runner should remove the materialized {{RDD}} from the "leaves" set.
> The runner keeps track of leaves left unhandled in the DAG so that it can force an action on them. {{Write}}, for one, is implemented via a sequence of ParDos, which the runner implements via {{mapPartitions}} (a lazy transformation), so an action has to be forced.
> Materializing an {{RDD}} is done via the action {{collect()}}, so there is no reason to keep it in the "leaves" set.
> Currently, it remains in the "leaves" set, so an action is forced on it a second time and its lineage is re-evaluated. Unless the {{RDD}} is cached, the lineage is therefore executed twice.
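
The behaviour described above can be illustrated with a small simulation. This is only a sketch and not the Spark runner's code: LazyDataset, EvaluationContext, materialize, compute_outputs and the remove_leaf_on_materialize flag are hypothetical names made up for the example, and no Spark API is used. It counts how often a lineage runs when the materialized dataset stays in the "leaves" set versus when it is removed.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: work runs only when an action is called."""

    def __init__(self, compute):
        self._compute = compute
        self.evaluations = 0  # number of times the lineage has been executed

    def collect(self):
        # An "action": executes the whole lineage and returns the elements.
        self.evaluations += 1
        return list(self._compute())


class EvaluationContext:
    """Hypothetical runner context that tracks unhandled leaves of the DAG."""

    def __init__(self, remove_leaf_on_materialize):
        self.leaves = set()
        self.remove_leaf_on_materialize = remove_leaf_on_materialize

    def put(self, dataset):
        # Every newly produced dataset starts out as an unhandled leaf.
        self.leaves.add(dataset)
        return dataset

    def materialize(self, dataset):
        # Materializing already runs an action, so the dataset is handled.
        values = dataset.collect()
        if self.remove_leaf_on_materialize:
            self.leaves.discard(dataset)
        return values

    def compute_outputs(self):
        # Force an action on every leaf that is still considered unhandled.
        for leaf in list(self.leaves):
            leaf.collect()
        self.leaves.clear()


def run(remove_leaf_on_materialize):
    """Return how many times the lineage was executed (no caching assumed)."""
    ctx = EvaluationContext(remove_leaf_on_materialize)
    dataset = ctx.put(LazyDataset(lambda: (x * x for x in range(3))))
    ctx.materialize(dataset)  # e.g. to build a side-input view
    ctx.compute_outputs()     # end-of-pipeline forcing of remaining leaves
    return dataset.evaluations
```

Under these assumptions, run(False) models the current behaviour (the lineage runs once for the materialization and once more when leaves are forced), while run(True) models the proposed fix, in which the lineage runs only once.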



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)