Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2010/02/01 23:04:29 UTC

[jira] Updated: (HIVE-1118) Hive merge map files should have different bytes/mapper setting

     [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-1118:
-----------------------------

    Attachment: HIVE-1118.1.patch

Actually the option is already there. I just changed the defaults to: condition = 16MB, merged file size = 32MB.
I think these are good defaults.

I also added the missing conf variable to hive-default.xml.
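For reference, the two settings involved can be overridden per site or per session. A minimal hive-site.xml sketch matching the defaults described above, assuming the property names hive.merge.smallfiles.avgsize (the condition threshold) and hive.merge.size.per.task (the target merged file size):

```xml
<!-- Trigger the conditional merge job when the average output
     file size is below 16MB (the "condition"). -->
<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>16000000</value>
</property>

<!-- Aim for roughly 32MB per merged output file. -->
<property>
  <name>hive.merge.size.per.task</name>
  <value>32000000</value>
</property>
```

The same values can be set for a single session with `SET hive.merge.smallfiles.avgsize=16000000;` from the Hive CLI.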


> Hive merge map files should have different bytes/mapper setting
> ---------------------------------------------------------------
>
>                 Key: HIVE-1118
>                 URL: https://issues.apache.org/jira/browse/HIVE-1118
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Zheng Shao
>         Attachments: HIVE-1118.1.patch
>
>
> Currently, by default, we get one reducer for each 1GB of input data.
> It's also true for the conditional merge job that will run if the average file size is smaller than a threshold.
> This actually makes those jobs very slow, because each reducer needs to consume 1GB of data.
> Alternatively, we can just use that threshold to determine the number of reducers per job (or introduce a new parameter).
> Let's say the threshold is 1MB; then we only start the merge job if the average file size is less than 1MB, and the resulting file size will be around 1MB (or another small number).
> This will remove the extreme cases where we have thousands of empty files, but still make normal jobs fast enough.
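To illustrate the sizing argument in the quoted description (a hypothetical sketch of the arithmetic, not Hive code — the function name is made up for illustration):

```python
def num_reducers(total_input_bytes, bytes_per_reducer):
    """Reducer count from a bytes-per-reducer target: ceil(total / per-reducer),
    with a minimum of one reducer."""
    return max(1, -(-total_input_bytes // bytes_per_reducer))

# With a 1GB-per-reducer setting, merging 10GB of small files uses 10 reducers,
# each consuming a full 1GB just to concatenate files -- the slow case above.
slow = num_reducers(10 * 2**30, 1 * 2**30)

# Targeting a ~32MB merged file size instead spreads the same input across
# many more, much shorter-running tasks.
fast = num_reducers(10 * 2**30, 32 * 2**20)
```

The point of the proposal is that the merge-job target size should be decoupled from the general bytes-per-reducer setting, since merge jobs do trivial per-record work.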

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.