Posted to dev@hive.apache.org by "Nadeem Moidu (JIRA)" <ji...@apache.org> on 2012/06/05 04:05:23 UTC

[jira] [Created] (HIVE-3086) Skewed Join Optimization

Nadeem Moidu created HIVE-3086:
----------------------------------

             Summary: Skewed Join Optimization
                 Key: HIVE-3086
                 URL: https://issues.apache.org/jira/browse/HIVE-3086
             Project: Hive
          Issue Type: New Feature
            Reporter: Nadeem Moidu
            Assignee: Nadeem Moidu


During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
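To make the bottleneck concrete, here is a minimal Python sketch (not Hive code) of how hash partitioning on the join key sends every row of a hot key to the same reducer. The `crc32`-based partitioner, the key names, and the row counts are all assumptions chosen for illustration.

```python
from collections import Counter
import zlib


def reducer_for(key: str, num_reducers: int) -> int:
    # Deterministic stand-in for a hash partitioner (an assumption, not Hive's actual hash).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_load(join_keys, num_reducers):
    """Count how many rows each reducer receives for the given join keys."""
    load = Counter()
    for key in join_keys:
        load[reducer_for(key, num_reducers)] += 1
    return load


# Hypothetical data: one hot key with 9,000 rows plus 1,000 distinct keys with one row each.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]
load = reducer_load(keys, num_reducers=10)
busiest, rows = load.most_common(1)[0]
print(f"reducer {busiest} handles {rows} of {len(keys)} rows")
```

Whichever reducer the hot key hashes to receives at least 9,000 of the 10,000 rows, so the job finishes only when that one reducer does.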


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Kevin Wilfong (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457188#comment-13457188 ] 

Kevin Wilfong commented on HIVE-3086:
-------------------------------------

+1 This looks good to me now.
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3086:
-----------------------------

    Status: Patch Available  (was: Open)
    
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Nadeem Moidu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401530#comment-13401530 ] 

Nadeem Moidu commented on HIVE-3086:
------------------------------------

@Alex, I'm sorry but your question is not very clear. Can you please give the exact schema, query and the skewed keys that you have in mind? Here are some comments based on what I understood from your question:
1. The bottleneck mentioned is only when the join key is skewed, so only that case is handled.
2. If a table is small, we have map-join to handle that.
3. We are not doing any pre-partitioning.
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Nadeem Moidu
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


        

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458005#comment-13458005 ] 

Namit Jain commented on HIVE-3086:
----------------------------------

[~yhuai], right now both the input and output tables will be scanned twice.
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>             Fix For: 0.10.0
>
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3086:
-----------------------------

    Status: Patch Available  (was: Open)

addressed comments from Carl and Nadeem
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442512#comment-13442512 ] 

Namit Jain commented on HIVE-3086:
----------------------------------

yes
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3086:
-----------------------------

    Status: Open  (was: Patch Available)

comments from Kevin
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3086:
-----------------------------

    Attachment: hive.3086.2.patch
    
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3086:
-----------------------------

    Status: Patch Available  (was: Open)

addressed comments
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457754#comment-13457754 ] 

Hudson commented on HIVE-3086:
------------------------------

Integrated in Hive-trunk-h0.21 #1679 (See [https://builds.apache.org/job/Hive-trunk-h0.21/1679/])
    HIVE-3086. Skewed Join Optimization. njain via kevinwilfong (Revision 1386996)

     Result = FAILURE
kevinwilfong : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1386996
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SkewJoinOptimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeColumnDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeConstantDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeFieldDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc.java
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt1.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt10.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt11.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt12.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt13.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt14.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt15.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt16.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt17.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt18.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt19.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt2.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt20.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt3.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt4.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt5.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt6.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt7.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt8.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoinopt9.q
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt11.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt12.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt13.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt14.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt15.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt16.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt17.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt18.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt19.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt20.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt5.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt6.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt7.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt8.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoinopt9.q.out

                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>             Fix For: 0.10.0
>
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444687#comment-13444687 ] 

Namit Jain commented on HIVE-3086:
----------------------------------

https://reviews.facebook.net/D5043
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3086:
-----------------------------

    Attachment: hive.3086.6.patch
    
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3086:
-----------------------------

    Attachment: hive.3086.5.patch
    
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "alex gemini (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401945#comment-13401945 ] 

alex gemini commented on HIVE-3086:
-----------------------------------

Maybe what we want is to dynamically change our partition key based on hints. For example:
select /**+ partitions logs(userid,timestamps),users(id) */ count(userid),to_date(timestamps,'YYYYMMDD'),age from logs where timestamps > 2011-12-01 and timestamps < 2011-12-31 and age<25 and age>18.
This time we will partition logs by userid and timestamps, so the records for 2011-12-24 will hash to six reducers instead of one, and each reducer will process the same amount of records.
Another query example:
select /**+ partitions logs(userid),users(id,age) */ count(userid),to_date(timestamps,'YYYYMMDD'),age from logs where timestamps > 2011-01-01 and timestamps < 2011-12-31 and age<25 and age>18.
This time timestamps is not the primary skewed key, so we change our partition key to age.
In the "ListBucketing" design,
create table T (c1 string, c2 string, c3 string) skewed by (c1, c2) on (('x1', 'x2'), ('y1', 'y2'));
we need to assume we know that tables are skewed by some column, but data is always skewed and we can't list every skewed value combination.

                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Nadeem Moidu
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


        

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "alex gemini (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428568#comment-13428568 ] 

alex gemini commented on HIVE-3086:
-----------------------------------

@Yongqiang We don't need a hint here; the above example is just for clarification. The main point is that if some key is skewed, just mix this key with
another low selectivity key like the primary key. Use this composite key as input for the hash partition.
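The composite-key idea can be sketched in a few lines of Python. This is only an illustration of the load-spreading effect, not anything Hive implements: the `crc32` partitioner, the six reducers, and the sample rows are assumptions.

```python
import zlib


def partition(parts, num_reducers):
    # Hash the concatenated key parts; crc32 is a stand-in partitioner (an assumption).
    joined = "|".join(str(p) for p in parts)
    return zlib.crc32(joined.encode("utf-8")) % num_reducers


# Hypothetical log rows: every row shares the skewed date, but userid differs per row.
rows = [("2011-12-24", user_id) for user_id in range(6000)]

# Partitioning on the skewed key alone: all rows land on one reducer.
by_date = {partition((date,), 6) for date, _ in rows}

# Partitioning on (date, userid): rows spread over several reducers.
by_composite = {partition((date, user_id), 6) for date, user_id in rows}

print(len(by_date), len(by_composite))
```

Note that the sketch shows only how one table's rows spread out. In a reduce-side join, matching rows from both tables must reach the same reducer, and the sketch does not address how the other table's rows would be routed under a composite key.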
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Nadeem Moidu
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


        

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442372#comment-13442372 ] 

Carl Steinbach commented on HIVE-3086:
--------------------------------------

@Namit: Is this a work in progress?
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427239#comment-13427239 ] 

Namit Jain commented on HIVE-3086:
----------------------------------

@Yongqiang, the current skew join does the optimization after most of the damage has already been done.
The reducer detects that a particular key is skewed, and then processes that key in a separate MR job.

However, in this approach, we are planning to know about the skewed keys beforehand (stored in the metastore),
and then use them to do a map-join for the skewed keys and a normal join for the other keys. This does require
some change from the user (the user needs to store the skewed keys in the metastore). However, this approach can
be very good for repetitive workloads - similar queries running every day for similar data. Most probably, the skew
does not change every day. The skew can be calculated periodically.
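As a rough illustration of the approach described above, the following Python sketch splits a join into a skewed-key part and a remainder and concatenates the two results. It is not the patch's implementation: `hash_join`, `skew_aware_join`, and the sample rows are hypothetical, and a single in-memory hash join stands in for both the map-join and the normal join.

```python
from collections import defaultdict


def hash_join(left, right):
    """Plain in-memory hash join on the first field of each row."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lval, rval) for key, lval in left for rval in index.get(key, [])]


def skew_aware_join(left, right, skewed_keys):
    """Join skewed and non-skewed keys separately, then concatenate the results."""
    def is_skewed(row):
        return row[0] in skewed_keys

    # Skewed keys: this part stands in for the map-join branch.
    skewed_part = hash_join([r for r in left if is_skewed(r)],
                            [r for r in right if is_skewed(r)])
    # Remaining keys: this part stands in for the normal join branch.
    rest_part = hash_join([r for r in left if not is_skewed(r)],
                          [r for r in right if not is_skewed(r)])
    return skewed_part + rest_part


# Hypothetical inputs; "a" plays the role of a skewed key known in advance.
left = [("a", 1), ("a", 2), ("b", 3), ("c", 4)]
right = [("a", "x"), ("b", "y"), ("d", "z")]
result = skew_aware_join(left, right, skewed_keys={"a"})
print(sorted(result))
```

Because each row is either skewed or not, the two branches cover disjoint key sets and together produce the same rows as a single join. Each branch filters both inputs, which is consistent with the remark elsewhere in this thread that the tables are scanned twice.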
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Nadeem Moidu
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


        

[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Nadeem Moidu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nadeem Moidu updated HIVE-3086:
-------------------------------

    Assignee:     (was: Nadeem Moidu)
    
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


        

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457939#comment-13457939 ] 

Yin Huai commented on HIVE-3086:
--------------------------------

@Nadeem: Thanks! Just found another question. It seems that the large table (which has the skewed keys) will be scanned twice. Is my understanding correct?
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>             Fix For: 0.10.0
>
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization


[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3086:
-----------------------------

    Attachment: hive.3086.1.patch
    
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450442#comment-13450442 ] 

Namit Jain commented on HIVE-3086:
----------------------------------

addressed Nadeem's comments
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445823#comment-13445823 ] 

Namit Jain commented on HIVE-3086:
----------------------------------

I have not run the existing tests yet; I have just started them.
I have verified the outputs of the new tests that were added as part of this patch.
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Nadeem Moidu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457947#comment-13457947 ] 

Nadeem Moidu commented on HIVE-3086:
------------------------------------

Yes, in the current implementation, both the tables will be scanned twice. This can be avoided if the table scan operator is not replicated and has multiple children instead, but this optimization has not been done in this patch.
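
The double scan described above can be illustrated with a minimal sketch. It assumes the join is split into one branch for the skewed keys and one for the remaining keys, with the two results unioned; the function names and the in-memory lists standing in for tables are hypothetical and are not taken from the patch.

```python
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str]  # (join key, payload)


def hash_join(left: List[Row], right: List[Row]) -> List[Tuple[str, str, str]]:
    """Plain equi-join on the first field of each row."""
    index: Dict[str, List[str]] = {}
    for key, payload in right:
        index.setdefault(key, []).append(payload)
    return [(k, lp, rp) for k, lp in left for rp in index.get(k, [])]


def split_join(left: List[Row], right: List[Row], skewed: Set[str]):
    """Join skewed and non-skewed keys in separate branches, then union them.

    Returns the joined rows and the number of table scans performed.
    """
    scans = 0
    branches = []
    for want_skewed in (True, False):
        # Each branch filters both inputs, so each table is read once per branch.
        l_part = [r for r in left if (r[0] in skewed) == want_skewed]
        r_part = [r for r in right if (r[0] in skewed) == want_skewed]
        scans += 2
        branches.append(hash_join(l_part, r_part))
    return branches[0] + branches[1], scans
```

With two branches the sketch reads each input twice (four scans in total), which is the cost a shared table scan operator with multiple children would avoid.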
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>             Fix For: 0.10.0
>
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457885#comment-13457885 ] 

Yin Huai commented on HIVE-3086:
--------------------------------

A quick question: can you let me know where the join operator for skewed keys is converted to a map join operator? Thanks!
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>             Fix For: 0.10.0
>
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "alex gemini (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401262#comment-13401262 ] 

alex gemini commented on HIVE-3086:
-----------------------------------

The design is very complicated, IMO. What if we have a big table logs and a small table users, and users has a column 'age'? If we issue a query that is skewed by age, we can't pre-partition the big table for it. This design doesn't handle that case, right? I guess what we want is custom partitioning at runtime. For the above example, we need a custom partitioner (or some hint) to tell the query plan that we want to partition the users table on the 'userid,age' columns and the logs table on the 'userid' column. The partition number for the same userid must be the same in both tables so that they can be joined afterwards.
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Nadeem Moidu
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401960#comment-13401960 ] 

He Yongqiang commented on HIVE-3086:
------------------------------------

Hints supplied by the user have proven not very useful. Automatically detecting skewed keys, as the current skew join processor does, will make this more powerful and useful.

@Nadeem, can you add more details to the wiki about the differences between the existing optimization and the one you are working on? The current one cannot handle the case where the same join key is skewed in more than one table. Are you targeting those cases? There are also some problems with the existing skew join optimization; can you try to fix those as part of your project?
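
As a rough illustration of automatic detection, the sketch below flags any join key whose row count reaches a threshold and then checks which keys are skewed in more than one table. The function names and the `threshold` parameter are hypothetical; this is not how the existing skew join processor is implemented.

```python
from collections import Counter
from typing import Iterable, List, Set


def detect_skewed_keys(keys: Iterable[str], threshold: int) -> List[str]:
    """Return the keys that occur at least `threshold` times, in sorted order."""
    counts = Counter(keys)
    return sorted(k for k, n in counts.items() if n >= threshold)


def skewed_in_both(left_keys: Iterable[str], right_keys: Iterable[str],
                   threshold: int) -> Set[str]:
    """Keys that are skewed on both sides of a join (the harder case)."""
    left = set(detect_skewed_keys(left_keys, threshold))
    right = set(detect_skewed_keys(right_keys, threshold))
    return left & right
```

A key returned by `skewed_in_both` is the situation raised in the comment above: the same join key is heavy in more than one table.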
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Nadeem Moidu
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428864#comment-13428864 ] 

Namit Jain commented on HIVE-3086:
----------------------------------

@Alex, The problem that you mentioned can be handled by 
https://issues.apache.org/jira/browse/HIVE-3286.

Navis is working on that. These are independent strategies, and both can be applied.
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Nadeem Moidu
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Nadeem Moidu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457891#comment-13457891 ] 

Nadeem Moidu commented on HIVE-3086:
------------------------------------

@Yin Huai: The join operator for skewed keys is automatically converted to map join when the map join optimization is performed. That is not included in this patch.
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>             Fix For: 0.10.0
>
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451760#comment-13451760 ] 

Namit Jain commented on HIVE-3086:
----------------------------------

addressed Kevin's comments
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-3086:
---------------------------------

    Component/s: Query Processor
    
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-3086:
---------------------------------

    Status: Open  (was: Patch Available)

Comments on phabricator. Thanks.
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3086:
-----------------------------

    Attachment: hive.3086.4.patch
    
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Kevin Wilfong (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-3086:
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.10.0
           Status: Resolved  (was: Patch Available)

Committed.  Thanks Namit.
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>             Fix For: 0.10.0
>
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch, hive.3086.4.patch, hive.3086.5.patch, hive.3086.6.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Updated] (HIVE-3086) Skewed Join Optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3086:
-----------------------------

    Attachment: hive.3086.3.patch
    
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Namit Jain
>         Attachments: hive.3086.1.patch, hive.3086.2.patch, hive.3086.3.patch
>
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

[jira] [Commented] (HIVE-3086) Skewed Join Optimization

Posted by "alex gemini (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401939#comment-13401939 ] 

alex gemini commented on HIVE-3086:
-----------------------------------

Consider a big table logs(userid,region,timestamps,url) with more than 10 billion records and a medium-sized table users(userid,age) with 10 million records, and the following query:
 select count(userid) from logs a ,users b where a.userid=b.userid group by b.age.
Let's say ages 18-25 account for more than 50% of the total records, ages 40-60 for only 5%, and ages 25-50 for the rest.
What counts as skewed is always defined by the query. In this case the skewed key is age, so we can't always assume that the two tables are skewed on the join key, right?
Another example: select count(userid),to_date(timestamps,'YYYYMMDD'),age from logs where timestamps > 2011-12-01 and timestamps < 2011-12-31 and age<25 and age>18.
Because of Christmas, the days from 2011-12-25 to 2011-12-31 may have more records than the other days of the month (for the purpose of this discussion, assume that age is not skewed in this query).
Since Hive uses hash partitioning, with, say, 6 reducers, 2011-12-24 and 2011-12-30 will go to the same reducer, which causes one reducer to process many more records than the others.
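
The reducer collision in the last example can be sketched as follows. The sketch assumes the key is the integer day of the month and that it hashes to its own value before the modulo; the real key in the query would be a date value with a different hash. The row volumes are made-up units, not measurements.

```python
from collections import Counter
from typing import Dict

NUM_REDUCERS = 6  # reducer count taken from the comment above


def reducer_for(day_of_month: int, num_reducers: int = NUM_REDUCERS) -> int:
    """Hash-partition a key: clear the sign bit, then take the modulo."""
    return (day_of_month & 0x7FFFFFFF) % num_reducers


def reducer_loads(rows_per_day: Dict[int, int],
                  num_reducers: int = NUM_REDUCERS) -> Counter:
    """Sum the rows each reducer receives under hash partitioning."""
    loads: Counter = Counter()
    for day, rows in rows_per_day.items():
        loads[reducer_for(day, num_reducers)] += rows
    return loads


# Hypothetical volumes: 1 unit per ordinary day, 5 units for 25-31 December.
ROWS_PER_DAY = {day: (5 if day >= 25 else 1) for day in range(1, 32)}
```

Under these assumptions days 24 and 30 map to the same reducer, and `reducer_loads(ROWS_PER_DAY)` shows that the reducer receiving both day 25 and day 31 carries more rows than any other, even though no single key dominates.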
                
> Skewed Join Optimization
> ------------------------
>
>                 Key: HIVE-3086
>                 URL: https://issues.apache.org/jira/browse/HIVE-3086
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Nadeem Moidu
>            Assignee: Nadeem Moidu
>
> During a join operation, if one of the columns has a skewed key, it can cause that particular reducer to become the bottleneck. The following feature will address it:
> https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
