You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@crunch.apache.org by "Joseph Adler (JIRA)" <ji...@apache.org> on 2013/06/17 19:31:20 UTC

[jira] [Created] (CRUNCH-222) Planner should choose to run DoFns on Map or Reduce side depending on data size

Joseph Adler created CRUNCH-222:
-----------------------------------

             Summary: Planner should choose to run DoFns on Map or Reduce side depending on data size
                 Key: CRUNCH-222
                 URL: https://issues.apache.org/jira/browse/CRUNCH-222
             Project: Crunch
          Issue Type: New Feature
    Affects Versions: 0.7.0
            Reporter: Joseph Adler


Hi guys,

I was using Crunch to run a large data pipeline, and came across a problem. In one stage (between two group functions), I have a DoFn that increases the output data size by a factor of 5. You can picture the flow like this:

  GroupByKeyA -> MyDoFn -> GroupByKeyB

On Hadoop, this translates to something like this:

  -----------------------------         -----------------------------
  | map1  -> reduce1 |   ->   | map2  -> reduce2 |
  -----------------------------         -----------------------------

Logically, you can either run MyDoFn within reduce1 or map2 and get the same results. (The same data will be created as an input to reduce2.)

In the current implementation, MyDoFn always runs in Reduce1. That means that the first map/reduce job will write out 5x more data than it would if MyDoFn ran in Map2. For my job, this is a big deal: that's 5x more data written to HDFS and read from HDFS; that's a lot of extra work on HDFS and a lot of extra network traffic. Clearly, it would be more efficient to run MyDoFn in map2.

I'd like to propose that we change the Crunch Planner to take into account the output size of a DoFn (or set of DoFns) when deciding where it should be run. When the scaleFactor() is <= 1 (and reduces the data size), it should run on the reduce side. When the scaleFactor is > 1 (and increases the data size), it should run on the map side. (For a chain of n DoFns, this implies that Crunch will need to inspect up to n+1 different scales when deciding where to run each DoFn.)

-- Joe



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira