Posted to dev@pig.apache.org by "Jie Li (JIRA)" <ji...@apache.org> on 2012/07/02 20:47:23 UTC

[jira] [Commented] (PIG-2779) Refactoring the code for setting number of reducers

    [ https://issues.apache.org/jira/browse/PIG-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405233#comment-13405233 ] 

Jie Li commented on PIG-2779:
-----------------------------

Maybe we can refactor like this: 

*Merge default parallel* 
First, let's merge PigContext.default_parallel into LogicalRelationalOperator's requestedParallelism after the logical plan is built, so we don't need to worry about the default parallel later in the physical and MR phases.
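
A minimal sketch of what that merge could look like. LogicalOp and DefaultParallelMerger are hypothetical stand-ins rather than Pig's actual classes, and I'm assuming -1 means "not specified" for both the default parallel and requestedParallelism:

```java
// Hypothetical stand-in for a logical operator; not Pig's real class.
final class LogicalOp {
    // Assumption: -1 means no PARALLEL clause was given for this operator.
    int requestedParallelism = -1;
}

final class DefaultParallelMerger {
    // Fold the script-wide default into every operator that has no explicit
    // value, so later phases only ever look at requestedParallelism.
    static void merge(Iterable<LogicalOp> plan, int defaultParallel) {
        if (defaultParallel <= 0) {
            return; // no default configured, nothing to merge
        }
        for (LogicalOp op : plan) {
            if (op.requestedParallelism <= 0) {
                op.requestedParallelism = defaultParallel;
            }
        }
    }
}
```

An explicit PARALLEL on an operator is left untouched; only operators without one pick up the default.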

*Assignable only once*
Then, let's make all requestedParallelism fields in Logical/Physical/MR operators *assignable only once*, and add other fields where necessary (e.g. runtimeParallel, introduced below), so that we don't mix different semantics in one field. If we detect a reassignment, we can throw a runtime exception.
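
One possible way to enforce this is a small holder that rejects a second assignment. AssignOnceInt is a hypothetical helper for illustration, not an existing Pig class:

```java
// Hypothetical helper: an int that can be assigned exactly once.
final class AssignOnceInt {
    private final String name; // used only for error messages
    private int value;
    private boolean assigned;

    AssignOnceInt(String name) {
        this.name = name;
    }

    // The first call stores the value; any later call is treated as a bug.
    void set(int newValue) {
        if (assigned) {
            throw new IllegalStateException(name + " is already set to " + value
                    + "; refusing to reassign it to " + newValue);
        }
        value = newValue;
        assigned = true;
    }

    boolean isSet() {
        return assigned;
    }

    int get() {
        if (!assigned) {
            throw new IllegalStateException(name + " has not been set");
        }
        return value;
    }
}
```

Each operator would hold one such field per meaning (requested vs. runtime), so a second write fails fast instead of silently changing the semantics.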

*Synchronize sample's #partitions and order-by's #reducers*
For the order-by, it's a little tricky: the sample job's requestedParallelism is set to 1 as it only needs one reducer, and it also needs the #partitions (which is the runtime #reducers of the order-by) to generate the partition file. 

Right now we set the sample's #partitions twice (once at compile time and once at runtime), and we want to set it only once, at runtime.

To prevent the sample job from using a different #partitions than the order-by's runtime #reducers, we can introduce another field, *runtimeParallel*. The sample job will estimate #partitions and store it in the order-by's runtimeParallel, which won't be recalculated later.
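
Roughly, the hand-off could look like the sketch below. OrderByJob and SampleJob are hypothetical classes, and the rule "an explicit parallel wins, otherwise use the estimate" is my assumption, not something settled in this issue:

```java
// Hypothetical stand-in for the order-by MR job.
final class OrderByJob {
    final int requestedParallelism; // from the PARALLEL clause, -1 if absent
    private int runtimeParallel = -1; // written once by the sample job

    OrderByJob(int requestedParallelism) {
        this.requestedParallelism = requestedParallelism;
    }

    void setRuntimeParallel(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("runtimeParallel must be positive: " + n);
        }
        if (runtimeParallel > 0) {
            throw new IllegalStateException("runtimeParallel is already set to " + runtimeParallel);
        }
        runtimeParallel = n;
    }

    // The order-by never re-estimates; it launches what the sample job chose.
    int reducersToLaunch() {
        if (runtimeParallel <= 0) {
            throw new IllegalStateException("sample job has not set runtimeParallel yet");
        }
        return runtimeParallel;
    }
}

// Hypothetical stand-in for the sample job that precedes the order-by.
final class SampleJob {
    static final int REDUCERS = 1; // the sample job itself needs only one reducer

    // Decide #partitions once at runtime and record it on the order-by.
    // Returns the number of partitions to write into the partition file.
    static int run(OrderByJob orderBy, int estimatedReducers) {
        int partitions = orderBy.requestedParallelism > 0
                ? orderBy.requestedParallelism // assumption: explicit value wins
                : estimatedReducers;
        orderBy.setRuntimeParallel(partitions);
        return partitions;
    }
}
```

Since the partition file and the order-by's reducer count both come from the same single write, they cannot diverge.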

Any comments are appreciated.
                
> Refactoring the code for setting number of reducers
> ---------------------------------------------------
>
>                 Key: PIG-2779
>                 URL: https://issues.apache.org/jira/browse/PIG-2779
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Jie Li
>             Fix For: 0.11
>
>         Attachments: TestNumberOfReducers.java, TestNumberOfReducers.java
>
>
> As PIG-2652 observed, the code for setting the number of reducers is currently a little messy. MapReduceOper.requestedParallelism seems to be misused in some places, and we now support runtime estimation of #reducers, which further complicates the problem.
> For example, if we specify parallel 1 for the order-by, the estimated #reducer will be used. If we specify parallel 2 while it estimates 4, order-by will fail due to "Illegal partition for Null". If we specify parallel 4 while it estimates 2, then some reducers will have nothing to do. 
