
Posted to dev@hive.apache.org by "He Yongqiang (JIRA)" <ji...@apache.org> on 2010/01/25 05:44:34 UTC

[jira] Created: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Add a "skew join map join size" variable to control the input size of skew join's following map join job.
---------------------------------------------------------------------------------------------------------

                 Key: HIVE-1093
                 URL: https://issues.apache.org/jira/browse/HIVE-1093
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: He Yongqiang


In a test, many skewed join keys were each larger than 250M. The follow-up map join job then takes several hours to process those large skewed keys.
This can be improved by using a smaller map input size for the follow-up map join job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805603#action_12805603 ] 

Namit Jain commented on HIVE-1093:
----------------------------------

The changes look good. Does it also work for combinehiveinputsplit?


[jira] Updated: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1093:
-------------------------------

    Attachment: hive-1093.2.patch


[jira] Commented: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805629#action_12805629 ] 

He Yongqiang commented on HIVE-1093:
------------------------------------

>>does it work for combinehiveinputsplit also ?
No. We should not use the combine input format for this: CombineFileInputFormat uses the block size as the minimum split size. The second job needs to be explicitly configured to use HiveInputFormat. I will update the patch to do that.
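
To illustrate why the minimum split size matters here, the following is a minimal sketch modeled on the split-size rule of Hadoop's old-API FileInputFormat, max(minSize, min(goalSize, blockSize)). It is not code from the patch. The 128M block size, the 32M minimum split, and the map task hint of 10000 are assumed values chosen for illustration, and the split count ignores Hadoop's slop factor for the last split.

```python
import math

MB = 1024 * 1024


def compute_split_size(goal_size, min_size, block_size):
    """Split-size rule modeled on the old-API FileInputFormat."""
    return max(min_size, min(goal_size, block_size))


def estimate_num_splits(total_size, num_map_tasks_hint, min_size, block_size):
    """Approximate number of map tasks for one input of total_size bytes."""
    goal_size = total_size // max(1, num_map_tasks_hint)
    split_size = compute_split_size(goal_size, min_size, block_size)
    return math.ceil(total_size / split_size)


SKEW_KEY_INPUT = 256 * MB   # size of one skewed key, as in the test described in this issue
BLOCK_SIZE = 128 * MB       # assumed block size, for illustration only

# A small minimum split (assumed 32M) lets the job use splits below the block size.
small_split_job = estimate_num_splits(SKEW_KEY_INPUT, 10000, 32 * MB, BLOCK_SIZE)

# If the block size acts as the minimum split size, splits cannot shrink below it.
block_floor_job = estimate_num_splits(SKEW_KEY_INPUT, 10000, BLOCK_SIZE, BLOCK_SIZE)

print(small_split_job, block_floor_job)
```

Under these assumptions the same 256M input is divided among more mappers when the minimum split size is allowed to drop below the block size, which is the effect the follow-up map join job needs.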


[jira] Commented: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804395#action_12804395 ] 

Namit Jain commented on HIVE-1093:
----------------------------------

Do you have performance numbers for the test case? A smaller map input size will lead to more mappers, each of which reads the other tables.


[jira] Commented: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804438#action_12804438 ] 

He Yongqiang commented on HIVE-1093:
------------------------------------

>>Do you have performance numbers for the testcase ?
Yes. In my test case, a 256M split joined with 100K currently takes more than 5 hours. (The join values can be ignored, so 256M and 100K are roughly the pure key sizes.)
The 'map join size' should not be determined by the big side alone (e.g. 256M). The size of the small side matters more in this case.

The point is that KEY1 ("256M join 100K") should use a much smaller split size than KEY2 ("256M join 1K"). The problem is that we currently process KEY1 and KEY2 in the same job, so a split size chosen for KEY1 may be a bit small for KEY2.

If we choose to use a bucket join for the follow-up map join job, we will be able to choose the split size independently for each key (because the keys would then be processed in different jobs).
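
The KEY1 versus KEY2 point can be sketched with a toy cost model. It assumes that, for a single skewed key, the work of one mapper grows with (big-side split size) x (small-side size), since every big-side row matches every small-side row. The function name and the work budget below are hypothetical and are not taken from the patch.

```python
KB = 1024
MB = 1024 * KB


def split_size_for_key(small_side_bytes, work_budget):
    """Largest big-side split whose pairing with the small side fits the budget."""
    return max(1, work_budget // small_side_bytes)


# Hypothetical per-mapper budget: a 1M big-side split joined against 100K.
WORK_BUDGET = (1 * MB) * (100 * KB)

key1_split = split_size_for_key(100 * KB, WORK_BUDGET)  # KEY1: "256M join 100K"
key2_split = split_size_for_key(1 * KB, WORK_BUDGET)    # KEY2: "256M join 1K"

# When both keys run in the same job, one split size must suit the costlier key.
shared_split = min(key1_split, key2_split)

print(key1_split, key2_split, shared_split)
```

In this model KEY2 could tolerate a split 100 times larger than KEY1, but a shared job has to fall back to KEY1's split size. Running each key in its own job would remove that constraint.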


[jira] Updated: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1093:
-----------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed. Thanks, Yongqiang.


[jira] Assigned: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain reassigned HIVE-1093:
--------------------------------

    Assignee: He Yongqiang


[jira] Updated: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1093:
-------------------------------

    Attachment: hive-1093.patch


[jira] Commented: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805683#action_12805683 ] 

Namit Jain commented on HIVE-1093:
----------------------------------

+1

looks good


[jira] Updated: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1093:
-------------------------------

    Fix Version/s: 0.6.0
           Status: Patch Available  (was: Open)
