
Posted to dev@crunch.apache.org by "Josh Wills (JIRA)" <ji...@apache.org> on 2013/01/17 15:56:13 UTC

[jira] [Commented] (CRUNCH-144) Ability to re-use PCollections after a write without having to recompute them

    [ https://issues.apache.org/jira/browse/CRUNCH-144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556238#comment-13556238 ] 

Josh Wills commented on CRUNCH-144:
-----------------------------------

An update on this: the change caused some issues with the AvroPipelineIT test. The optimizer doesn't realize that it can't read the Avro objects back from the text file that it writes out during the first pipeline run. I need to add stricter rules for indicating when it's possible to consume text output as input, but it looks sort of ugly right now.
                
> Ability to re-use PCollections after a write without having to recompute them
> -----------------------------------------------------------------------------
>
>                 Key: CRUNCH-144
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-144
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>         Attachments: CRUNCH-144.patch
>
>
> I have a pipeline that consists of several stages to process and filter a dataset. I would like to persist this dataset to HDFS and then perform further computation on it. 
> Example:
> 1) Load text data A and convert to Avro -> A'
> 2) Load text data B and convert to Avro -> B'
> 3) Union A' and B' -> C
> 4) Filter C -> D
> 5) Write D to HDFS
> 6a) Use DoFn to extract strings from D -> E
> 6b) Aggregate E (count strings) -> F
> 6c) Convert F to HBase puts -> G
> 6d) Write G to HBase
> Running this pipeline code generates two MapReduce jobs which run in parallel:
> job A) runs steps 1, 2, 3, 4, 5
> job B) runs steps 1, 2, 3, 4, 6a-6d
> If a "pipeline.run()" call is included after step 5, the same two jobs are run but sequentially. 
> What I would like is to be able to hold on to the PCollection reference to "D", so that steps 6* can be run without going back to the start and re-doing all the work needed to generate it.
> -- 
> Ref to original discussion on crunch-user: http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201301.mbox/%3CCAH29n6MORejkxD%2ByRycRw40vxf4GruJ8m46AMjx_RGd6DvDUQA%40mail.gmail.com%3E 
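
The recompute-versus-reuse behaviour described in the issue can be illustrated with a toy model in plain Java. This is only a sketch and does not use Crunch's API: the class ReuseSketch, the methods computeD and memoize, and the sample strings are all hypothetical names made up for illustration. computeD stands in for steps 1 to 4, and a counter records how often that upstream work runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Toy model only: plain Java, no Crunch classes involved.
public class ReuseSketch {
    // Counts how many times the upstream stages (steps 1-4) are executed.
    public static int upstreamRuns = 0;

    // Hypothetical stand-in for steps 1-4: load A and B, union them, filter.
    public static List<String> computeD() {
        upstreamRuns++;
        List<String> c = new ArrayList<>(Arrays.asList("a1", "a2")); // A'
        c.addAll(Arrays.asList("b1", "skip"));                       // union with B' -> C
        return c.stream()
                .filter(s -> !s.equals("skip"))                      // filter C -> D
                .collect(Collectors.toList());
    }

    // Wraps a supplier so its value is computed at most once and then reused.
    public static <T> Supplier<T> memoize(Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean done;

            @Override
            public T get() {
                if (!done) {
                    value = source.get();
                    done = true;
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        // Behaviour reported in the issue: each consumer re-derives D from scratch.
        Supplier<List<String>> lazyD = ReuseSketch::computeD;
        lazyD.get(); // consumer for step 5
        lazyD.get(); // consumer for steps 6a-6d
        System.out.println(upstreamRuns); // prints 2

        // Behaviour requested: hold on to D so later steps reuse it.
        upstreamRuns = 0;
        Supplier<List<String>> heldD = memoize(ReuseSketch::computeD);
        heldD.get(); // consumer for step 5
        heldD.get(); // consumer for steps 6a-6d
        System.out.println(upstreamRuns); // prints 1
    }
}
```

In a real pipeline the "held" value would be the dataset already written to HDFS in step 5, read back as input for steps 6a-6d. That only works when the written format can be read back into the same objects, which is the constraint the AvroPipelineIT failure above points at.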
