Posted to dev@crunch.apache.org by "Ron (JIRA)" <ji...@apache.org> on 2013/11/26 09:40:35 UTC

[jira] [Commented] (CRUNCH-305) Multiuse between parellelDos which sharing the same input

    [ https://issues.apache.org/jira/browse/CRUNCH-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832407#comment-13832407 ] 

Ron commented on CRUNCH-305:
----------------------------

After carefully reading the Crunch future-work page at http://crunch.apache.org/future-work.html, I found that this is already listed there as planned work: combining related groupByKey operations into a single MR job, as FlumeJava does.

> Multiuse between parellelDos which sharing the same input
> ---------------------------------------------------------
>
>                 Key: CRUNCH-305
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-305
>             Project: Crunch
>          Issue Type: Wish
>            Reporter: Ron
>
>   When I started using Crunch, many of my jobs followed this pattern: I have five different parallelDo functions, all of which work on the same input. Currently, I read the input first with "pipeline.readTextFile()" and then apply each parallelDo function to the resulting PCollection. However, I find that Crunch breaks my plan into five separate MR jobs, each of which reads the input and runs its own MapReduce pass, so the input is read five times. Based on the FlumeJava paper, from which Crunch originates, I suggest an optimization in which the input is read only once and the five parallelDo functions are then applied to it. Since the input is large and I/O is expensive, this optimization could help a lot for Crunch jobs with patterns similar to mine.
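
To make the request in the description concrete, here is a minimal sketch in plain Java. It does not use the Crunch API and is not how the Crunch planner works; CountingSource, runSeparately, runFused and fiveFunctions are hypothetical names made up for this illustration. It only contrasts "one scan of the input per function" with "one shared scan feeding all five functions".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SharedScanSketch {

    /** Hypothetical stand-in for a large input; counts how many full scans are made. */
    static final class CountingSource {
        private final List<String> lines;
        private int scans = 0;

        CountingSource(List<String> lines) {
            this.lines = lines;
        }

        List<String> scan() {
            scans++;
            return lines;
        }

        int scans() {
            return scans;
        }
    }

    /** Five arbitrary per-record functions, standing in for the five parallelDo functions. */
    static List<Function<String, String>> fiveFunctions() {
        List<Function<String, String>> fns = new ArrayList<>();
        fns.add(String::toUpperCase);
        fns.add(String::toLowerCase);
        fns.add(String::trim);
        fns.add(s -> s + "!");
        fns.add(s -> Integer.toString(s.length()));
        return fns;
    }

    /** One full scan per function: the behaviour reported in the issue. */
    static List<List<String>> runSeparately(CountingSource src, List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (Function<String, String> fn : fns) {
            List<String> out = new ArrayList<>();
            for (String line : src.scan()) {
                out.add(fn.apply(line));
            }
            outputs.add(out);
        }
        return outputs;
    }

    /** A single scan; each record is handed to every function in turn. */
    static List<List<String>> runFused(CountingSource src, List<Function<String, String>> fns) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < fns.size(); i++) {
            outputs.add(new ArrayList<>());
        }
        for (String line : src.scan()) {
            for (int i = 0; i < fns.size(); i++) {
                outputs.get(i).add(fns.get(i).apply(line));
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(" Foo ", "bar", "Baz qux");
        CountingSource separate = new CountingSource(input);
        CountingSource fused = new CountingSource(input);

        List<List<String>> a = runSeparately(separate, fiveFunctions());
        List<List<String>> b = runFused(fused, fiveFunctions());

        System.out.println("same outputs: " + a.equals(b));
        System.out.println("scans (separate): " + separate.scans());
        System.out.println("scans (fused): " + fused.scans());
    }
}
```

Both variants produce the same five outputs; the only difference is how many times the source is scanned, which is the cost the description is concerned about when the input is large.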



--
This message was sent by Atlassian JIRA
(v6.1#6144)