Posted to dev@pig.apache.org by "Jie Li (JIRA)" <ji...@apache.org> on 2012/07/02 20:47:23 UTC
[jira] [Commented] (PIG-2779) Refactoring the code for setting number of reducers
[ https://issues.apache.org/jira/browse/PIG-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405233#comment-13405233 ]
Jie Li commented on PIG-2779:
-----------------------------
Maybe we can refactor like this:
*Merge default parallel*
First, let's merge PigContext.default_parallel into LogicalRelationalOperator's requestedParallelism right after the logical plan is built, so that the physical and MR phases no longer need to consider the default parallel at all.
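A minimal sketch of that merge step, assuming -1 marks "no explicit parallelism" for both the default and the per-operator value. LogicalOpSketch and DefaultParallelMerger are hypothetical stand-ins for illustration, not Pig's actual classes:

```java
import java.util.List;

// Hypothetical stand-in for a logical operator; not Pig's real class.
class LogicalOpSketch {
    final String alias;
    // Assumed convention: -1 means no PARALLEL clause was given.
    int requestedParallelism;

    LogicalOpSketch(String alias, int requestedParallelism) {
        this.alias = alias;
        this.requestedParallelism = requestedParallelism;
    }
}

class DefaultParallelMerger {
    // Copy the script-wide default into every operator without an explicit value.
    static void merge(List<LogicalOpSketch> plan, int defaultParallel) {
        if (defaultParallel <= 0) {
            return; // no default configured; leave the operators untouched
        }
        for (LogicalOpSketch op : plan) {
            if (op.requestedParallelism <= 0) {
                op.requestedParallelism = defaultParallel;
            }
        }
    }
}
```

After this pass, an operator with an explicit PARALLEL keeps its value and every other operator carries the default, so later phases only ever read requestedParallelism.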
*Assignable only once*
Then, let's make all requestedParallelism fields in Logical/Physical/MR operators *assignable only once*, and add other fields where necessary (e.g. runtimeParallel, introduced below), as we don't want to mix different semantics in a single field. If we detect a reassignment, we can throw a runtime exception.
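One way the assign-once rule could look, as a sketch only. AssignOnceParallelism is a hypothetical helper, and the choice of IllegalStateException is an assumption; any runtime exception would satisfy the proposal:

```java
// Hypothetical holder for a parallelism value that may be assigned exactly once.
final class AssignOnceParallelism {
    private static final int UNSET = -1;
    private int value = UNSET;

    // Assign the value; a second call is treated as a programming error.
    void set(int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive: " + parallelism);
        }
        if (value != UNSET) {
            throw new IllegalStateException(
                "parallelism already set to " + value + ", refusing to reassign to " + parallelism);
        }
        value = parallelism;
    }

    boolean isSet() {
        return value != UNSET;
    }

    // Returns the assigned value, or -1 if nothing has been assigned yet.
    int get() {
        return value;
    }
}
```

Failing fast on the second assignment makes it obvious which phase is overwriting a value that another phase already decided.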
*Synchronize sample's #partitions and order-by's #reducers*
The order-by is a little tricky: the sample job's requestedParallelism is set to 1, as it needs only one reducer, but it also needs to know the #partitions (which is the runtime #reducers of the order-by) in order to generate the partition file.
Right now we set the sample job's #partitions twice (once at compile time and once at runtime); we want to set it only once, at runtime.
To prevent the sample job from using a #partitions that differs from the order-by's runtime #reducers, we can introduce another field, *runtimeParallel*. The sample job will estimate the #partitions and store it in the order-by's runtimeParallel, which won't be recalculated later.
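The hand-off could be sketched as below. OrderBySketch and SampleJobSketch are hypothetical names, and the policy that an explicit PARALLEL on the order-by takes precedence over the estimate is an assumption of this sketch, not something the proposal specifies:

```java
// Hypothetical stand-in for the order-by job.
class OrderBySketch {
    final int requestedParallelism;   // from the PARALLEL clause, -1 if absent (assumed convention)
    private int runtimeParallel = -1; // decided once, by the sample job

    OrderBySketch(int requestedParallelism) {
        this.requestedParallelism = requestedParallelism;
    }

    void setRuntimeParallel(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("runtimeParallel must be positive: " + n);
        }
        if (runtimeParallel > 0) {
            throw new IllegalStateException("runtimeParallel already set to " + runtimeParallel);
        }
        runtimeParallel = n;
    }

    int getRuntimeParallel() {
        if (runtimeParallel <= 0) {
            throw new IllegalStateException("runtimeParallel has not been decided yet");
        }
        return runtimeParallel;
    }
}

// Hypothetical stand-in for the sample job, which itself runs with one reducer.
class SampleJobSketch {
    static final int REQUESTED_PARALLELISM = 1;

    // Decide #partitions once and record it on the order-by.
    // Assumed policy: an explicit PARALLEL wins, otherwise use the estimate.
    static int decidePartitions(OrderBySketch orderBy, int estimatedReducers) {
        int partitions = orderBy.requestedParallelism > 0
                ? orderBy.requestedParallelism
                : estimatedReducers;
        orderBy.setRuntimeParallel(partitions);
        return partitions;
    }
}
```

Because the order-by reads its reducer count from runtimeParallel and never recomputes it, the partition file and the reducer count come from the same decision.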
Any comments are appreciated.
> Refactoring the code for setting number of reducers
> ---------------------------------------------------
>
> Key: PIG-2779
> URL: https://issues.apache.org/jira/browse/PIG-2779
> Project: Pig
> Issue Type: Bug
> Reporter: Jie Li
> Fix For: 0.11
>
> Attachments: TestNumberOfReducers.java, TestNumberOfReducers.java
>
>
> As PIG-2652 observed, the code for setting the number of reducers is currently a little messy. MapReduceOper.requestedParallelism seems to be misused in some places, and we now support runtime estimation of #reducers, which further complicates the problem.
> For example, if we specify parallel 1 for the order-by, the estimated #reducer will be used. If we specify parallel 2 while it estimates 4, order-by will fail due to "Illegal partition for Null". If we specify parallel 4 while it estimates 2, then some reducers will have nothing to do.
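To illustrate why the two mismatches in the description behave differently, here is a small simulation. It assumes a simple range partitioner over integer keys with hand-picked split points; RangePartitionSketch is hypothetical and is not Pig's partitioner:

```java
import java.util.Arrays;

// Hypothetical range partitioner over int keys, for illustration only.
class RangePartitionSketch {
    // boundaries holds (partitions - 1) sorted split points taken from the sample.
    static int partitionFor(int key, int[] boundaries) {
        int idx = Arrays.binarySearch(boundaries, key);
        // A negative result encodes the insertion point as -(point) - 1.
        return idx >= 0 ? idx : -(idx + 1);
    }

    // A partition index is only usable if a reducer with that index exists.
    static boolean isLegal(int partition, int numReducers) {
        return partition >= 0 && partition < numReducers;
    }
}
```

With split points {10, 20, 30} (four partitions), key 25 maps to partition 2, which does not exist when the job runs with only 2 reducers; that is the failing case. With a single split point {20} (two partitions) and 4 reducers, keys only ever map to partitions 0 and 1, so reducers 2 and 3 receive no data.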