You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@crunch.apache.org by "Josh Wills (JIRA)" <ji...@apache.org> on 2012/09/13 18:22:08 UTC
[jira] [Resolved] (CRUNCH-50) Map-only parallelDos with multiple outputs should be fused

     [ https://issues.apache.org/jira/browse/CRUNCH-50?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills resolved CRUNCH-50.
------------------------------

    Resolution: Fixed

Committed the new integration test after the MSCRPlanner refactoring was in. Thanks Ryan!
                
> Map-only parallelDos with multiple outputs should be fused
> ----------------------------------------------------------
>
>                 Key: CRUNCH-50
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-50
>             Project: Crunch
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Ryan Brush
>             Fix For: 0.4.0
>
>         Attachments: MULTIOUTPUT_FUSE_TEST.patch
>
>
> I'm not sure if this is a bug or just an optimization yet to be implemented, but sibling parallelDo operations that have separate outputs are not being fused into a single Map operation.  (The original FlumeJava paper suggests they should be, and it would be a nice optimization to have here as well.)  Instead, each parallelDo results in a separate Map-only job that (redundantly) scans the input source of data.
> This can be seen in the current MultipleOutputIT integration test.  Notice the logs below from running one of those tests scans the same input in multiple jobs.
> 8414 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Running job "org.apache.crunch.MultipleOutputIT: Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+even+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/even)"
> 8415 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Job status available at: http://localhost:8080/
> 8417 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 8497 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 8532 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Running job "org.apache.crunch.MultipleOutputIT: Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+odd+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/odd)"
> 8532 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Job status available at: http://localhost:8080/
> I was going to take a stab at a patch for this, but noticed some major refactoring in this space is on deck as part of CRUNCH-34...so it might be best to address this after CRUNCH-34 lands.
> As an aside, it wasn't clear how to write a good integration test to expose this functionality.  Would simply counting the stage results and ensuring we have the expected number for a simple job be the best way?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira