Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2010/02/12 00:44:27 UTC

[jira] Created: (HIVE-1158) Introducing a new parameter for Map-side join bucket size

Introducing a new parameter for Map-side join bucket size
---------------------------------------------------------

                 Key: HIVE-1158
                 URL: https://issues.apache.org/jira/browse/HIVE-1158
             Project: Hadoop Hive
          Issue Type: Improvement
    Affects Versions: 0.5.0, 0.6.0
            Reporter: Ning Zhang
            Assignee: Ning Zhang


Map-side join caches the small table in memory and joins it with the split of the large table on the mapper side. If the small table is too large, it uses RowContainer to cache a number of rows given by the parameter hive.join.cache.size, whose default value is 25000. The same parameter is also used by regular reduce-side joins to cache all input tables except the streaming table. This default is too large for the map-side join bucket size and sometimes results in OOM exceptions. We should define a separate parameter so that the two cache sizes can be set independently.
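
The proposal can be illustrated with a minimal Java sketch. It is not Hive's actual code: the class names, the key example.mapjoin.bucket.cache.size, and its default of 100 are hypothetical and chosen only for illustration. Only hive.join.cache.size and its default of 25000 come from the description above. The toy cache keeps a bounded number of rows in memory and counts the remainder as spilled, which is roughly the role the description assigns to RowContainer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class JoinCacheSketch {
    // Existing parameter and default quoted in the issue description.
    static final String JOIN_CACHE_KEY = "hive.join.cache.size";
    static final int JOIN_CACHE_DEFAULT = 25000;

    // HYPOTHETICAL key and default, assumed here for illustration only.
    static final String MAPJOIN_CACHE_KEY = "example.mapjoin.bucket.cache.size";
    static final int MAPJOIN_CACHE_DEFAULT = 100;

    /** Picks the cache size from the parameter that matches the join type. */
    static int cacheSize(Properties conf, boolean mapSideJoin) {
        String key = mapSideJoin ? MAPJOIN_CACHE_KEY : JOIN_CACHE_KEY;
        int fallback = mapSideJoin ? MAPJOIN_CACHE_DEFAULT : JOIN_CACHE_DEFAULT;
        return Integer.parseInt(conf.getProperty(key, String.valueOf(fallback)));
    }

    /** Toy stand-in for a row container: bounded in-memory rows, the rest counted as spilled. */
    static final class BoundedRowCache {
        private final int maxInMemory;
        private final List<String> inMemory = new ArrayList<>();
        private long spilled = 0;

        BoundedRowCache(int maxInMemory) {
            this.maxInMemory = maxInMemory;
        }

        void add(String row) {
            if (inMemory.size() < maxInMemory) {
                inMemory.add(row);   // still within the memory budget
            } else {
                spilled++;           // a real container would write the row to disk
            }
        }

        int inMemoryCount() {
            return inMemory.size();
        }

        long spilledCount() {
            return spilled;
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // no overrides: both defaults apply
        BoundedRowCache mapJoin = new BoundedRowCache(cacheSize(conf, true));
        BoundedRowCache reduceJoin = new BoundedRowCache(cacheSize(conf, false));
        for (int i = 0; i < 1000; i++) {
            mapJoin.add("row" + i);
            reduceJoin.add("row" + i);
        }
        System.out.println("map-side rows in memory: " + mapJoin.inMemoryCount()
                + ", spilled: " + mapJoin.spilledCount());
        System.out.println("reduce-side rows in memory: " + reduceJoin.inMemoryCount()
                + ", spilled: " + reduceJoin.spilledCount());
    }
}
```

With two keys, lowering the map-side budget no longer shrinks the cache used by reduce-side joins, which is the separation the issue asks for.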

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1158) Introducing a new parameter for Map-side join bucket size

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-1158:
-----------------------------

      Resolution: Fixed
    Release Note: HIVE-1158. Introducing a new parameter for Map-side join bucket size. (Ning Zhang via zshao)
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed. Thanks Ning!

[jira] Updated: (HIVE-1158) Introducing a new parameter for Map-side join bucket size

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1158:
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.5.0
           Status: Resolved  (was: Patch Available)

Committed in 0.5 also. Thanks Ning

[jira] Commented: (HIVE-1158) Introducing a new parameter for Map-side join bucket size

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834784#action_12834784 ] 

Namit Jain commented on HIVE-1158:
----------------------------------

+1

0.5 patch looks good - will commit if the tests pass

[jira] Updated: (HIVE-1158) Introducing a new parameter for Map-side join bucket size

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1158:
-----------------------------

    Status: Patch Available  (was: Open)

[jira] Commented: (HIVE-1158) Introducing a new parameter for Map-side join bucket size

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834097#action_12834097 ] 

Zheng Shao commented on HIVE-1158:
----------------------------------

+1. Will test and commit.

[jira] Updated: (HIVE-1158) Introducing a new parameter for Map-side join bucket size

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1158:
-----------------------------

    Attachment: HIVE-1158_branch_0_5.patch

Uploading HIVE-1158_branch_0_5.patch for branch 0.5. This patch includes changes pulled from other patches in trunk to make the backport possible.

Still running unit tests, but it seems all relevant tests have passed. I will update the test results once they are done.

[jira] Reopened: (HIVE-1158) Introducing a new parameter for Map-side join bucket size

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao reopened HIVE-1158:
------------------------------


Need a patch for branch 0.5

[jira] Updated: (HIVE-1158) Introducing a new parameter for Map-side join bucket size

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1158:
-----------------------------

    Status: Patch Available  (was: Reopened)

All unit tests passed.

[jira] Updated: (HIVE-1158) Introducing a new parameter for Map-side join bucket size

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1158:
-----------------------------

    Attachment: HIVE-1158.patch
