You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@crunch.apache.org by "Gabriel Reid (JIRA)" <ji...@apache.org> on 2013/06/13 07:54:23 UTC

[jira] [Commented] (CRUNCH-218) Add new Target.WriteMode to skip the write and continue pipeline if an output target exists

    [ https://issues.apache.org/jira/browse/CRUNCH-218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681934#comment-13681934 ] 

Gabriel Reid commented on CRUNCH-218:
-------------------------------------

I get the feeling that what Dave is describing here is different from what the patch does, although (obviously) I might be misinterpreting things.

What I get from the description is that the following semi-pseudocode, processedWords should only be calculated once, and the creation of counted should start from the existing processedWords:

    PCollection<String> words = ...;
    PCollection<String> processedWords = words.parallelDo(new MyExpensiveOperation());
    pipeline.write(processedWords, "processed", WriteMode.SKIP_IF_EXISTS);
    PCollection<Pair<String,Long>> counted = processedWords.count();
    pipeline.write(counted, "counted");

As things are now (and even with this patch I think), MyExpensiveOperation will be used twice on all the input data.

I also like what the patch is doing (making it so that re-running the above pipeline won't recreate "processed", but just read from the existing data), although I'm also wondering if we need to do some extra checking to prevent people from shooting themselves in the foot (like maybe checking if the input at the start of the pipeline is newer than the data that was written out in SKIP_IF_EXISTS mode.
                
> Add new Target.WriteMode to skip the write and continue pipeline if an output target exists
> -------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-218
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-218
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: CRUNCH-218b.patch, CRUNCH-218.patch
>
>
> Quite often I write pipelines which persist data to the filesystem midway through the process, and then carry on doing further work. 
> If this intermediate data is already present, I think it would be good if I could set a write mode which skips over this first half of processing. This way I'd avoid running jobs unnecessarily and wasting cluster resources regenerating data I already have. 
> Example:
> PCollection<B> inter = pipeline.read(source).parallelDo(something).parallelDo(somethingElse);
> inter.write(At.sequenceFile('output'), WriteMode.SKIP_IF_EXISTS);
> PCollection<C> final = inter.parallelDo(moreWork);
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira