Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2010/01/29 06:33:36 UTC

[jira] Created: (HIVE-1118) Hive merge map files should have different bytes/mapper setting

Hive merge map files should have different bytes/mapper setting
---------------------------------------------------------------

                 Key: HIVE-1118
                 URL: https://issues.apache.org/jira/browse/HIVE-1118
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: Zheng Shao


Currently, by default, we get one reducer for each 1GB of input data.
It's also true for the conditional merge job that will run if the average file size is smaller than a threshold.

This actually makes those jobs very slow, because each reducer needs to consume 1GB of data.

Alternatively, we can just use that threshold to determine the number of reducers per job (or introduce a new parameter).
Let's say the threshold is 1MB: then we only start the merge job if the average file size is less than 1MB, and the eventual result file size will be around 1MB (or another small number).

This will remove the extreme cases where we have thousands of empty files, but still make normal jobs fast enough.
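
The behavior described above can be sketched in a few lines of Python (names and structure are purely illustrative, not Hive's actual code): the merge job fires when the average file size falls below a threshold, but the reducer count is still derived from 1GB of input per reducer.

```python
# Illustrative sketch of the conditional-merge decision described above.
# Function and parameter names are hypothetical, not Hive's actual API.

def plan_merge_job(file_sizes, avg_size_threshold, bytes_per_reducer=1 << 30):
    """Return the number of merge reducers, or 0 if no merge job is needed."""
    if not file_sizes:
        return 0
    total = sum(file_sizes)
    avg = total / len(file_sizes)
    if avg >= avg_size_threshold:
        return 0  # files are already large enough; skip the merge
    # Current behavior: one reducer per 1GB of input, so each merge
    # reducer may have to consume up to a full 1GB of data.
    return max(1, -(-total // bytes_per_reducer))  # ceiling division

# 10,000 files of 1KB each: the merge triggers, but all 10MB land on 1 reducer.
print(plan_merge_job([1024] * 10_000, avg_size_threshold=1 << 20))  # 1
```

With large total input split across many small files, the same formula yields few heavily loaded reducers, which is the slowness the issue describes.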


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1118) Hive merge map files should have different bytes/mapper setting

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806435#action_12806435 ] 

Zheng Shao commented on HIVE-1118:
----------------------------------

That's still much better than NOT running the merge task.
With the current setting, almost nobody will enable this by default. As a result, we are seeing a lot of 1KB files in HDFS.

If we make this change, we can enable it by default.


I agree 1MB is not a good default. We can set it to 32MB or 64MB (and on by default).

If that's not good enough, let's introduce another parameter so we can say (32MB, 64MB), which will start the merge job if the average file size is smaller than 32MB, and we will end up with files of around 64MB.

Thoughts?
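
The proposed (trigger, target) pair can be sketched as follows (again a hypothetical illustration, not Hive code): start the merge only when the average file size falls below the trigger, and derive the reducer count from the target size so each reducer writes roughly one target-sized file.

```python
# Illustrative sketch of the proposed (trigger, target) scheme, e.g. (32MB, 64MB).
# Names are hypothetical; not Hive's actual implementation.

def merge_reducers(file_sizes, trigger_size=32 << 20, target_size=64 << 20):
    """Return the number of merge reducers under the (trigger, target) proposal."""
    if not file_sizes:
        return 0
    total = sum(file_sizes)
    avg = total / len(file_sizes)
    if avg >= trigger_size:
        return 0  # average file size is large enough; no merge needed
    # One reducer per target_size bytes, so output files land near 64MB
    # instead of each reducer consuming a full 1GB.
    return max(1, -(-total // target_size))  # ceiling division

# 1,000 files of 1MB each (1GB total): avg 1MB < 32MB, so merge into
# roughly 16 files of ~64MB each.
print(merge_reducers([1 << 20] * 1000))  # 16
```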



[jira] Updated: (HIVE-1118) Hive merge map files should have different bytes/mapper setting

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-1118:
-----------------------------

    Attachment: HIVE-1118.3.patch

After some more thought, I think the current setting of 256MB is good for most pipeline jobs.

Interactive users will have to set the options differently (either disable hive.merge.mapfiles, or make hive.merge.size.per.task a smaller number).

This patch now simply adds the missing conf variable and corrects the comment.
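
The conf variables named in this thread would appear in hive-default.xml roughly as below. This is a sketch using the 256MB default mentioned above; the descriptions are paraphrased, not the committed wording.

```xml
<!-- Sketch of the relevant hive-default.xml entries; wording is paraphrased. -->
<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
  <description>Merge small files at the end of a map-only job.</description>
</property>
<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value>
  <description>Size of merged files at the end of the job (~256MB).</description>
</property>
```

Interactive users would override these per-session, e.g. by disabling hive.merge.mapfiles or lowering hive.merge.size.per.task, as the comment above suggests.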



[jira] Updated: (HIVE-1118) Hive merge map files should have different bytes/mapper setting

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-1118:
-----------------------------

    Attachment: HIVE-1118.2.patch

The change is trivial, but hopefully it helps our users use this feature without worrying that it will make the job run much longer.




[jira] Updated: (HIVE-1118) Add hive.merge.size.per.task to HiveConf

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-1118:
-----------------------------

    Summary: Add hive.merge.size.per.task to HiveConf  (was: Hive merge map files should have different bytes/mapper setting)



[jira] Commented: (HIVE-1118) Add hive.merge.size.per.task to HiveConf

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828307#action_12828307 ] 

Namit Jain commented on HIVE-1118:
----------------------------------

+1

looks good



[jira] Updated: (HIVE-1118) Hive merge map files should have different bytes/mapper setting

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-1118:
-----------------------------

    Attachment: HIVE-1118.1.patch

Actually the option is already there. I just modified the default to be: condition = 16MB, merged file size = 32MB.
I think this setting is a good default.

I also added the missing conf variable to hive-default.xml.




[jira] Updated: (HIVE-1118) Add hive.merge.size.per.task to HiveConf

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1118:
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.6.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Committed. Thanks Zheng



[jira] Commented: (HIVE-1118) Hive merge map files should have different bytes/mapper setting

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806408#action_12806408 ] 

Namit Jain commented on HIVE-1118:
----------------------------------

Won't it lead to a lot of small files (~1MB each), assuming the reducer output is about the same size as its input data?



[jira] Updated: (HIVE-1118) Hive merge map files should have different bytes/mapper setting

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-1118:
-----------------------------

    Assignee: Zheng Shao
      Status: Patch Available  (was: Open)
