Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2008/12/06 00:15:44 UTC

[jira] Commented: (HIVE-105) estimate number of required reducers and other map-reduce parameters automatically

    [ https://issues.apache.org/jira/browse/HIVE-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653970#action_12653970 ] 

Namit Jain commented on HIVE-105:
---------------------------------

Based on the discussion between Ashish, Joy, Raghu, Zheng and Namit, here is what was decided for estimating the number of reducers:

1. The number of reducers will be a function of the input size only.
2. The user will be able to specify the size per reducer - a new parameter will be added for this. The default is 1G, i.e., if the input size is 10G, 10 reducers will be used.
    However, the number of reducers estimated this way cannot exceed the value of the configuration parameter mapred.reduce.tasks.
3. The user can run an explain plan and inspect it. Assuming there are 'n' stages:
    the parameters mapred.reduce.tasks.stage'r' can be used to override the number of reducers at stage 'r'. These parameters will be cleared whenever a query is executed.
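The sizing rule in points 1 and 2 above can be sketched as follows. This is a minimal illustration of the proposed logic, not Hive code; the function name and defaults are hypothetical:

```python
def estimate_reducers(input_bytes, bytes_per_reducer=1 << 30, max_reducers=999):
    """Estimate the reducer count from total input size.

    bytes_per_reducer mirrors the proposed "size per reducer" setting
    (default 1G); max_reducers plays the role of mapred.reduce.tasks,
    which caps the estimate.
    """
    if input_bytes <= 0:
        return 1
    # Round up so a partial chunk still gets its own reducer.
    estimate = -(-input_bytes // bytes_per_reducer)
    return min(estimate, max_reducers)
```

With the defaults, a 10G input yields 10 reducers, and 10G plus one byte yields 11, unless the mapred.reduce.tasks cap kicks in first.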


> estimate number of required reducers and other map-reduce parameters automatically
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-105
>                 URL: https://issues.apache.org/jira/browse/HIVE-105
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>
> Currently users have to specify the number of reducers. In a multi-user environment, we generally ask users to be prudent in selecting the number of reducers (since reducers are long-running and block other users). Also, a large number of reducers produces a large number of output files, which puts pressure on namenode resources.
> There are other map-reduce parameters - for example, the min split size and the proposed use of CombineFileInputFormat - that are also fairly tricky for the user to determine (since they depend on map-side selectivity and cluster size). This will become critical once there is integration with BI tools, since there will be no opportunity to hand-tune job settings and there will be a wide variety of jobs.
> This jira calls for automating the selection of such parameters - possibly by making a best effort to estimate map-side selectivity/output size using sampling and deriving the parameters from there.
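The sampling idea in the issue description could look roughly like the sketch below: run the map function over a small random sample, measure the output-to-input byte ratio, and scale it to the full input. All names here are illustrative assumptions, not Hive's actual estimator:

```python
import random

def estimate_output_size(records, map_fn, sample_size=1000):
    """Best-effort map-side selectivity estimate via sampling.

    records: the full input as a list of strings (stand-in for input splits).
    map_fn:  a function mapping one record to a list of output records.
    Returns an estimated total output size in bytes.
    """
    sample = random.sample(records, min(sample_size, len(records)))
    in_bytes = sum(len(r) for r in sample)
    out_bytes = sum(len(o) for r in sample for o in map_fn(r))
    if in_bytes == 0:
        return 0
    # Scale the observed selectivity (output/input ratio) to the full input.
    selectivity = out_bytes / in_bytes
    total_input_bytes = sum(len(r) for r in records)
    return int(selectivity * total_input_bytes)
```

An estimate like this could then feed into the reducer-count formula above in place of the raw input size, at the cost of an extra sampling pass.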

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.