Posted to dev@hive.apache.org by "He Yongqiang (JIRA)" <ji...@apache.org> on 2011/06/08 08:39:58 UTC

[jira] [Created] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

add a new optimizer for query correlation discovery and optimization
--------------------------------------------------------------------

                 Key: HIVE-2206
                 URL: https://issues.apache.org/jira/browse/HIVE-2206
             Project: Hive
          Issue Type: New Feature
            Reporter: He Yongqiang


reference:

http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Release Note: This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job. The patch is based on hive-trunk at revision 1171917.
          Status: Patch Available  (was: In Progress)

In the unit tests, there are four failures in TestParse (groupby1, groupby2, groupby3 and groupby5). These four failures are caused by changes I made to the method "genGroupByPlanReduceSinkOperator" in the class "SemanticAnalyzer". The expected results for these tests need to be updated, but I am not sure how to regenerate them. Suggestions are welcome. Thanks.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Work stopped] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HIVE-2206 stopped by Yin Huai.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456511#comment-13456511 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

Opened a new review request at https://reviews.apache.org/r/7126/, since I have been working on hive-git. 
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-2206:
---------------------------------

    Status: Open  (was: Patch Available)

Please explain what is preventing us from enabling this feature by default, e.g. in which cases is it expected not to work, and what are the failure scenarios? 

Based on the current test coverage (not much) I can't tell if it's actually possible to use this feature in its current state.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490681#comment-13490681 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

[~namit]
Sure. I created the umbrella jira (HIVE-3667) for all work related to the correlation optimizer and also created several follow-up jiras as sub-tasks. You can also add other sub-tasks to that jira.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called Correlation Optimizer, which merges correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a separate MapReduce job for every operation that may need to shuffle data (e.g. join and aggregation operations). However, such operations may be correlated in the ways explained below, in which case they can be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator which has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers.
> Several things can be done in the future to improve this optimizer. Here are three examples.
> # Support queries that only involve TC;
> # Support queries in which the input tables of correlated MR jobs include intermediate tables; and
> # Optimize queries involving self join.
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc
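
The three correlation types defined in the description can be read as simple predicates over pairs of jobs. The following is only a minimal sketch under stated assumptions, not the code in the patch: the `MRJob` class, its fields, and the function names are hypothetical, and partition keys are compared as plain column-name tuples, whereas a real optimizer would have to resolve aliases and join-key equivalences in the operator tree.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class MRJob:
    """Hypothetical stand-in for one MapReduce job in a query plan."""
    name: str
    inputs: FrozenSet[str]          # names of the relations this job reads
    partition_key: Tuple[str, ...]  # columns the shuffle partitions on


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    """IC: the input relation sets of the two jobs are not disjoint."""
    return not a.inputs.isdisjoint(b.inputs)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    """TC: input correlation and, in addition, the same partition key."""
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, adjacent: MRJob) -> bool:
    """JFC: a job and a job directly connected to it in the plan share a partition key."""
    return job.partition_key == adjacent.partition_key


# Illustrative jobs only; the table and column names are made up.
join_job = MRJob("join", frozenset({"L", "R"}), ("k",))
agg_job = MRJob("agg", frozenset({"R"}), ("k",))
top_job = MRJob("top_join", frozenset({"join", "agg"}), ("k",))
other_job = MRJob("other", frozenset({"S"}), ("j",))

print(has_input_correlation(join_job, agg_job))
print(has_transit_correlation(join_job, agg_job))
print(has_job_flow_correlation(top_job, join_job))
print(has_input_correlation(join_job, other_job))
```

In this toy plan, `join_job` and `agg_job` both read `R` and shuffle on the same key, so they have IC and TC, and `top_job` has JFC with both of the jobs feeding it. That is the shape of query the first condition in the description asks for.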


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195805#comment-13195805 ] 

jiraposter@reviews.apache.org commented on HIVE-2206:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/
-----------------------------------------------------------

(Updated 2012-01-29 17:56:48.704757)


Review request for hive.


Changes
-------

make the patch compatible with latest trunk (revision 1237253).


Summary
-------

This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.


This addresses bug HIVE-2206.
    https://issues.apache.org/jira/browse/HIVE-2206


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1237326 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1237326 

Diff: https://reviews.apache.org/r/2001/diff


Testing
-------


Thanks,

Yin


                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500499#comment-13500499 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

[~cwsteinbach]
If the optimizer is enabled by default, then based on my last tests only auto_join26.q is expected to fail. It fails because the correlation optimizer rewrites its query plan, so the plan no longer matches the expected output; the query result itself is still correct. Once I finish HIVE-3671 (I am working on it right now), the failure of auto_join26.q should go away.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Affects Version/s: 0.10.0
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.15-r1392491.patch.txt
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-2206:
---------------------------------

    Status: Open  (was: Patch Available)

@Yin: Please see my comments on reviewboard. Thanks.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: testQueries.1.q
                HIVE-2206.7.patch.txt

A new version of the patch and the test queries are available. I also updated the diff in the review request (link: https://reviews.apache.org/r/2001/).
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.1.q, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496658#comment-13496658 ] 

Carl Steinbach commented on HIVE-2206:
--------------------------------------

@Yongqiang: Please hold off on committing this for a day. Thanks.
                

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276409#comment-13276409 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

@anders,
I will try to make this patch more concise and easier to read than the current version. If I have any thoughts on the optimization framework, I will comment under HIVE-3027.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101952#comment-13101952 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

I have almost finished the patch. I ran a preliminary test based on TPC-H Q17 and Q18. My machine has a quad-core Intel Xeon X3220 processor (2.4 GHz), 4GB of RAM, a 500GB hard disk and Ubuntu 11.04. With scale factor 10, the execution time of Q17 is 1216.94s without the patch versus 713.581s with the patch (roughly 1.7x faster), and that of Q18 is 1737.18s without the patch versus 867.334s with the patch (roughly 2.0x faster). 

I am facing an issue that I have not found a good way to solve. Suppose that we have the query "SELECT * FROM (SELECT L.c1 as c11, R.c2 as c12 FROM L JOIN R ON L.c1=R.c2) t1 JOIN (SELECT R.c1 as c21, count(distinct R.c2) as c22 FROM R GROUP BY R.c1) t2 ON t1.c11=t2.c21". In this query, only one MapReduce job is necessary. However, because Hive uses R.c1 and R.c2 as the key columns of the original ReduceSinkOperator for the sub-query involving the distinct count function, it is impossible to merge the MapReduce jobs of the two sub-queries into one. To optimize this kind of query, I wrote a new UDF, count_distinct(...), e.g. count_distinct(R.c2). This count_distinct function uses a HashSet to get the number of distinct records. Is there a better solution for optimizing this kind of query? Thanks.
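
For illustration, the HashSet-based approach described above could look roughly like the sketch below. This is a hypothetical, standalone class, not the count_distinct function from the patch and not written against Hive's UDF/UDAF interfaces; the method names iterate, terminate and reset are assumptions chosen for readability. The point it shows is that the distinct column no longer has to be part of the shuffle key, at the cost of holding every distinct value of a group in reducer memory.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration only; not the count_distinct function from the patch. */
public class CountDistinctSketch {
    // Distinct non-null values seen so far for the current group.
    private final Set<Object> seen = new HashSet<>();

    /** Feeds one value of the aggregated column; NULLs are ignored, as COUNT(DISTINCT ...) does. */
    public void iterate(Object value) {
        if (value != null) {
            seen.add(value);
        }
    }

    /** Returns the number of distinct non-null values fed so far. */
    public long terminate() {
        return seen.size();
    }

    /** Clears the state before the next group starts. */
    public void reset() {
        seen.clear();
    }

    public static void main(String[] args) {
        CountDistinctSketch agg = new CountDistinctSketch();
        for (Object v : new Object[] {"a", "b", "a", null, "c", "b"}) {
            agg.iterate(v);
        }
        System.out.println(agg.terminate());
    }
}
```

Because the set grows with the number of distinct values per group, this kind of function trades shuffle-key flexibility for memory, which may matter for groups with very high cardinality.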

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107530#comment-13107530 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

Found a bug in HIVE-2206.1.patch.txt. I will upload an updated version later. 

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496701#comment-13496701 ] 

Carl Steinbach commented on HIVE-2206:
--------------------------------------

Thanks!
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a MapReduce job for every operation that may need to shuffle the data (e.g. join and aggregation operations). However, those operations may involve the correlations explained below and can thus be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and 
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations. I think that is better than adding individual optimizers. 
> Several pieces of work can be done in the future to improve this optimizer. Here are three examples.
> # Support queries that only involve TC;
> # Support queries in which the input tables of correlated MR jobs involve intermediate tables; and 
> # Optimize queries involving self join. 
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc
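
The three correlation definitions in the description above can be sketched as simple predicates. This is only an illustration under stated assumptions: each MR job is summarized by the set of base tables it reads and by a list of partition key columns, and two keys count as "the same" when the lists are equal. The Job class and the method names are hypothetical and are not taken from the patch; matching keys in a real plan would also have to account for column equivalences introduced by join conditions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CorrelationSketch {

    /** Hypothetical summary of one MR job; not a class from the patch. */
    public static final class Job {
        final Set<String> inputTables;   // base tables the job reads
        final List<String> partitionKey; // columns used to shuffle the map output

        public Job(Set<String> inputTables, List<String> partitionKey) {
            this.inputTables = inputTables;
            this.partitionKey = partitionKey;
        }
    }

    /** IC: the input relation sets of the two jobs are not disjoint. */
    public static boolean hasInputCorrelation(Job a, Job b) {
        return !Collections.disjoint(a.inputTables, b.inputTables);
    }

    /** TC: input correlation plus the same partition key. */
    public static boolean hasTransitCorrelation(Job a, Job b) {
        return hasInputCorrelation(a, b) && a.partitionKey.equals(b.partitionKey);
    }

    /** JFC: a job has the same partition key as one of its child nodes. */
    public static boolean hasJobFlowCorrelation(Job job, Job child) {
        return job.partitionKey.equals(child.partitionKey);
    }

    public static void main(String[] args) {
        // A join over L and R and an aggregation over R, both shuffled on column "k".
        Job join = new Job(new HashSet<>(Arrays.asList("L", "R")), Arrays.asList("k"));
        Job agg = new Job(new HashSet<>(Arrays.asList("R")), Arrays.asList("k"));
        // A final join that consumes the outputs of the two jobs above, also on "k".
        Job finalJoin = new Job(new HashSet<String>(), Arrays.asList("k"));

        System.out.println("IC(join, agg): " + hasInputCorrelation(join, agg));
        System.out.println("TC(join, agg): " + hasTransitCorrelation(join, agg));
        System.out.println("JFC(finalJoin, join): " + hasJobFlowCorrelation(finalJoin, join));
        System.out.println("JFC(finalJoin, agg): " + hasJobFlowCorrelation(finalJoin, agg));
    }
}
```

In this toy plan all four checks hold, which is the situation in which the three jobs are candidates for being merged into a single MR job.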

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment:     (was: HIVE-2206.12-r1386996.patch.txt)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment:     (was: testQueries.q)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.1.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Kevin Wilfong (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190151#comment-13190151 ] 

Kevin Wilfong commented on HIVE-2206:
-------------------------------------

Never mind, sorry, it was the DISTRIBUTE BY followed by SORT BY.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Description: 
This issue proposes a new optimizer called Correlation Optimizer, which is used to merge correlated operations into a single MapReduce job. In the current implementation, the correlation optimizer only tries to merge join and aggregation operators. More will be added later.

The idea is based on YSmart, an SQL-to-MapReduce translator (http://ysmart.cse.ohio-state.edu/). The paper and slides are also linked below. 

References:
Paper and presentation of YSmart.
Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

  was:
References:
Paper and presentation of YSmart.
Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Presentation: http://sdrv.ms/UpwJJc

    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new optimizer called Correlation Optimizer, which is used to merge correlated operations into a single MapReduce job. In the current implementation, the correlation optimizer only tries to merge join and aggregation operators. More will be added later.
> The idea is based on YSmart, an SQL-to-MapReduce translator (http://ysmart.cse.ohio-state.edu/). The paper and slides are also linked below. 
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162963#comment-13162963 ] 

jiraposter@reviews.apache.org commented on HIVE-2206:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/
-----------------------------------------------------------

(Updated 2011-12-05 19:12:23.087778)


Review request for hive.


Changes
-------

CorrelationReduceSinkOperator has been merged into ReduceSinkOperator. Detailed comments have been added to the new operator.


Summary
-------

This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.


This addresses bug HIVE-2206.
    https://issues.apache.org/jira/browse/HIVE-2206


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationFakeReduceSinkOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationManualForwardOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationFakeReduceSinkDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationManualForwardDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1210283 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1210283 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1210283 
  trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1210283 
  trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1210283 
  trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1210283 
  trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1210283 

Diff: https://reviews.apache.org/r/2001/diff


Testing (updated)
-------

The previous version of the diff passed all unit tests. Since the latest trunk (r1209696) cannot finish all of the unit tests, the latest version of the diff has not been tested.


Thanks,

Yin


                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466582#comment-13466582 ] 

Carl Steinbach commented on HIVE-2206:
--------------------------------------

@Yongqiang: Sorry, but that's not the way it works. You vote +1 first, wait 24 hours, and then commit the patch. This is all covered in the project bylaws. Please revert this patch. Thanks.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500469#comment-13500469 ] 

Carl Steinbach commented on HIVE-2206:
--------------------------------------

@Yin: The correlation optimizer is only enabled for a small set of new CliDriver tests. If I enable the correlation optimizer by default, which of the existing CliDriver tests are expected to fail?
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a MapReduce job for every operation that may need to shuffle the data (e.g. join and aggregation operations). However, those operations may involve the correlations explained below and can thus be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and 
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations. I think that is better than adding individual optimizers. 
> Several pieces of work can be done in the future to improve this optimizer. Here are three examples.
> # Support queries that only involve TC;
> # Support queries in which the input tables of correlated MR jobs involve intermediate tables; and 
> # Optimize queries involving self join. 
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.11-r1385084.patch.txt

New patch for trunk (revision 1385084). Disabled the optimizer by default and updated test results.

[~he yongqiang]:
Can you help me test a few cases that the trunk cannot pass on my machine?
Those are TestHBaseMinimrCliDriver, TestHBaseCliDriver, TestHBaseNegativeCliDriver, testSynchronized in TestEmbeddedHiveMetaStore, testSynchronized in TestRemoteHiveMetaStore, testSynchronized in TestSetUGIOnBothClientServer, testSynchronized in TestSetUGIOnOnlyClient, testSynchronized in TestSetUGIOnOnlyServer, and testNegativeCliDriver_local_mapred_error_cache in TestNegativeCliDriver. Thanks!
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Status: Patch Available  (was: Reopened)

@Namit:
You can review the latest patch. I removed the first phase and other unnecessary content. 
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment:     (was: testQueries.2.q)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107900#comment-13107900 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

@Ashutosh, Thanks

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Release Note: This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.  (was: This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job. The patch is generated based on hive-trunk with revision 1171917. )
          Status: Patch Available  (was: Open)

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496529#comment-13496529 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

@Carl, keep in mind that you have already had months to comment. So addressing your comments in new JIRAs may make more sense.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called the Correlation Optimizer, which merges correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The YSmart paper and slides are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a MapReduce job for every operation that may need to shuffle data (e.g. join and aggregation operations). However, such operations may have the correlations explained below, in which case they can be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers.
> Several things can be done in the future to improve this optimizer. Here are three examples.
> # Support queries that only involve TC;
> # Support queries in which the input tables of correlated MR jobs include intermediate tables; and
> # Optimize queries involving self join.
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc
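The three correlation definitions and the first optimization condition quoted above can be illustrated with a minimal sketch. It is not taken from the patch: the `MRJob` class, its fields, and the function names are hypothetical. Partition keys are compared by plain column-name equality, which is a simplification, and the "child nodes" / "parent MR jobs" of the description are modelled as a single `upstream` relation (the jobs whose output a job consumes), which is an assumption. Conditions 2 and 3 (original input tables only, no self join) are not modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class MRJob:
    """Hypothetical stand-in for one MapReduce job in a query plan."""
    name: str
    input_tables: FrozenSet[str]        # base tables scanned by this job
    partition_key: Tuple[str, ...]      # columns used to shuffle map output
    upstream: Tuple["MRJob", ...] = ()  # jobs whose output this job consumes


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    # IC: the input relation sets are not disjoint.
    return not a.input_tables.isdisjoint(b.input_tables)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    # TC: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, other: MRJob) -> bool:
    # JFC: `other` feeds `job` and both shuffle on the same partition key.
    return other in job.upstream and job.partition_key == other.partition_key


def is_merge_candidate(job: MRJob) -> bool:
    # Condition 1 only: the job has JFC with every job that feeds it.
    return bool(job.upstream) and all(
        has_job_flow_correlation(job, u) for u in job.upstream
    )


# Two aggregations over the same (made-up) table, shuffled on the same key,
# feeding a join that is also shuffled on that key.
agg_a = MRJob("agg_a", frozenset({"t1"}), ("key",))
agg_b = MRJob("agg_b", frozenset({"t1"}), ("key",))
agg_c = MRJob("agg_c", frozenset({"t1"}), ("other_key",))
join = MRJob("join", frozenset(), ("key",), upstream=(agg_a, agg_b))
```

In this sketch `agg_a` and `agg_b` have both IC and TC, `agg_a` and `agg_c` have only IC, and `join` has JFC with both of its upstream jobs, so all three of `agg_a`, `agg_b` and `join` would be candidates for execution in one MR job.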


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.8.r1224646.patch.txt

The diff on Review Board has also been updated (https://reviews.apache.org/r/2001/).
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.1.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.17-r1404933.patch.txt

Uploaded a new patch that applies to r1404933. Also added a description to this issue.

However, I do not have permission to add a page to the wiki. Where should I request it?
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496528#comment-13496528 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

@Carl, you can go ahead and comment; Huai will address the comments in a separate diff.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107866#comment-13107866 ] 

Ashutosh Chauhan commented on HIVE-2206:
----------------------------------------

@Yin,

To overwrite current results you can do the following:
{code}
ant test -Dtestcase=TestCliDriver -Dqfile=groupby1.q -Doverwrite=true
{code}

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.5-1.patch.txt

This is a work-in-progress patch. There are two issues to be addressed before the next version of the patch. First, I will check whether the FakeReduceSinkOperator operator can be removed. Second, I will check whether the correlation optimizer can be a one-phase optimizer instead of a two-phase one. The current implementation calls the correlation optimizer twice (at the beginning and at the end of optimization).
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: testQueries.2.q

testQueries.2.q contains three test queries.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Status: Patch Available  (was: In Progress)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105641#comment-13105641 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

OK. I will start cleaning up my code and upload an updated patch soon.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109753#comment-13109753 ] 

jiraposter@reviews.apache.org commented on HIVE-2206:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/
-----------------------------------------------------------

Review request for hive.


Summary
-------

This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.


This addresses bug HIVE-2206.
    https://issues.apache.org/jira/browse/HIVE-2206


Diffs
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1173271 
  trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationDispatchOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationFakeReduceSinkOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationManualForwardOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReduceSinkOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationGenMRRedSink1.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationDispatchDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationFakeReduceSinkDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationManualForwardDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReduceSinkDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1173271 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCountDistinct.java PRE-CREATION 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1173271 
  trunk/ql/src/test/results/clientpositive/show_functions.q.out 1173271 
  trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1173271 
  trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1173271 
  trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1173271 
  trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1173271 

Diff: https://reviews.apache.org/r/2001/diff


Testing
-------

Ran all unit tests


Thanks,

Yin



> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Reopened] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach reopened HIVE-2206:
----------------------------------

    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.12-r1386996.patch.txt
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.16-r1399936.patch.txt
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment:     (was: testQueries.1.q)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Status: In Progress  (was: Patch Available)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "alex gemini (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480878#comment-13480878 ] 

alex gemini commented on HIVE-2206:
-----------------------------------

Does this JIRA have a short description? I know that a join followed by a group-by is optimized into a pipeline; what else might we want to add to the wiki?
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105379#comment-13105379 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

Yongqiang: I will change the name of the optimizer. However, I'd prefer a name like "query correlation detector" or "multi-query optimizer", because I think the name "cooperative scan" limits the scope of this optimizer. Besides shared scans, if the ReduceSinkOperators of two chained Hive-generated MapReduce jobs share the same key(s), this optimizer can merge the second job into the reduce phase of the first job.
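
To illustrate the kind of merge described above, here is a minimal Python sketch. It is not Hive code: the names shuffle and merged_reduce and the "L"/"R" tags are hypothetical. It assumes a join followed by an aggregation on the same key, so that a single shuffle can serve both operations.

```python
from collections import defaultdict


def shuffle(tagged_rows):
    """Group (key, tag, value) rows by key, as a single shuffle phase would."""
    groups = defaultdict(list)
    for key, tag, value in tagged_rows:
        groups[key].append((tag, value))
    return groups


def merged_reduce(groups):
    """Join the two inputs per key, then aggregate the joined rows in the same reduce call."""
    result = {}
    for key, rows in groups.items():
        left = [v for tag, v in rows if tag == "L"]
        right = [v for tag, v in rows if tag == "R"]
        # Work of the first job: reduce-side join on the shuffle key.
        joined = [(l, r) for l in left for r in right]
        # Work of the second job: aggregation on the same key, with no second shuffle.
        if joined:
            result[key] = sum(l * r for l, r in joined)
    return result


rows = [("a", "L", 2), ("a", "R", 3), ("a", "R", 4), ("b", "L", 5), ("c", "R", 7)]
print(merged_reduce(shuffle(rows)))
```

Because both operations partition on the same key, every row needed by the aggregation is already on the reducer that performed the join, which is why the second job can run inside the first job's reduce phase.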

I will upload a patch by this Sunday.   

 
  

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466576#comment-13466576 ] 

Carl Steinbach commented on HIVE-2206:
--------------------------------------

@Yongqiang: I don't see a +1 vote in this JIRA. According to the project bylaws (https://cwiki.apache.org/confluence/display/Hive/Bylaws) this patch should not have been committed. Please back this patch out. Thanks.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190276#comment-13190276 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

@Kevin,
I will take a look at it. 
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: Queries

Two queries (TPC-H Q17 and TPC-H Q18) can be used for testing this optimizer. Q17 is the same as the query provided in https://issues.apache.org/jira/browse/HIVE-600; Q18 is modified to expose the correlation. With this optimizer, Q17 and Q18 need 2 and 4 MapReduce jobs, respectively. Without this optimizer, they need 4 and 8 MapReduce jobs, respectively.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105619#comment-13105619 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

OK. How about just "correlation"?
Also, can you check whether it is possible to do the optimization as part of the physical optimizer? We need a lot of code cleanup in the current patch.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Description: 
This issue proposes a new logical optimizer called Correlation Optimizer, which merges correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The YSmart paper and slides are linked at the bottom.

Hive translates queries in a sentence-by-sentence fashion: for every operation that may need to shuffle data (e.g. join and aggregation operations), Hive generates a separate MapReduce job. However, such operations may be correlated in the ways explained below, in which case they can be executed in a single MR job.
# Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
# Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
# Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
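
The three definitions above can be sketched in Python as follows. This is only an illustration, not the optimizer's code: the MRJob class, its fields, and the example jobs are hypothetical, and the sketch assumes that a child node is a job whose output the other job reads and that partition keys can be compared by plain equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MRJob:
    name: str
    inputs: frozenset     # names of the tables or upstream jobs this job reads
    partition_key: tuple  # columns used to shuffle rows to reducers


def has_input_correlation(a, b):
    # IC: the input relation sets are not disjoint.
    return not a.inputs.isdisjoint(b.inputs)


def has_transit_correlation(a, b):
    # TC: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job, child):
    # JFC: `child` feeds `job` (assumption) and both use the same partition key.
    return child.name in job.inputs and job.partition_key == child.partition_key


# Hypothetical plan: an aggregation and a join both scan lineitem on the same key,
# and a final join consumes the output of both.
agg = MRJob("agg", frozenset({"lineitem"}), ("partkey",))
join1 = MRJob("join1", frozenset({"lineitem", "part"}), ("partkey",))
join2 = MRJob("join2", frozenset({"agg", "join1"}), ("partkey",))

print(has_input_correlation(agg, join1))
print(has_transit_correlation(agg, join1))
print(has_job_flow_correlation(join2, agg), has_job_flow_correlation(join2, join1))
```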

The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
# There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
# All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
# No self join is involved in those correlated MR jobs.
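
As a rough illustration of the three conditions above, the following Python sketch checks them on a toy plan. It is an assumption-laden model rather than the patch's logic: the Job class, the can_optimize function, and the base_tables parameter are hypothetical, "parent MR jobs" is read as the jobs that feed the job being checked, and a self join is modeled as a table name that appears twice in one job's table list.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    key: tuple    # partition (shuffle) key
    tables: list  # tables scanned; a repeated name models a self join
    upstream: list = field(default_factory=list)  # jobs whose output this job consumes


def can_optimize(job, base_tables):
    correlated = job.upstream
    if not correlated:
        return False
    # Condition 1: the job has JFC with every job that feeds it.
    if any(u.key != job.key for u in correlated):
        return False
    # Condition 2: the correlated jobs read only original input tables.
    if any(t not in base_tables for u in correlated for t in u.tables):
        return False
    # Condition 3: no self join among the correlated jobs.
    if any(len(set(u.tables)) != len(u.tables) for u in correlated):
        return False
    return True


base = {"lineitem", "part"}
agg = Job("agg", ("partkey",), ["lineitem"])
join1 = Job("join1", ("partkey",), ["lineitem", "part"])
top = Job("top", ("partkey",), [], [agg, join1])
print(can_optimize(top, base))
```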

The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.

The current implementation can serve as a framework for correlation-related optimizations. I think this is better than adding individual optimizers.

Several things can be done in the future to improve this optimizer. Here are three examples.
# Support queries that involve only TC;
# Support queries in which the input tables of correlated MR jobs include intermediate tables; and
# Optimize queries involving self joins.

References:
Paper and presentation of YSmart.
Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

  was:
This issue proposes a new optimizer called Correlation Optimizer, which is used to merge correlated operations into a single MapReduce job. In current implementation, correlation optimizer will only try to merge join and aggregation operators. Will add more later...

The idea is based on YSmart, an SQL-to-MapReduce translator (http://ysmart.cse.ohio-state.edu/). The paper and slides are also linked below. 

References:
Paper and presentation of YSmart.
Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.5.patch.txt

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment:     (was: Queries)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.1.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Status: Patch Available  (was: Open)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.14-r1389704.patch.txt
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497756#comment-13497756 ] 

Namit Jain commented on HIVE-2206:
----------------------------------

It would be a good idea to get HIVE-3671 in this patch.
With HIVE-3671, the functionality will be much more useful to the whole community.
[~yhuai], can you investigate including HIVE-3671 in this patch, and see how much
work it would be? Based on that, we can proceed.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Status: In Progress  (was: Patch Available)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Component/s: Query Processor
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500626#comment-13500626 ] 

Carl Steinbach commented on HIVE-2206:
--------------------------------------

I'm surprised that auto_join26 is the only test that fails due to different EXPLAIN output. Is that because this optimization doesn't affect the queries in most tests, or because we don't consistently call EXPLAIN in the tests?

What is preventing us from enabling this by default right now?
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a MapReduce job for every operation that may need to shuffle the data (e.g. join and aggregation operations). However, such operations may involve the correlations explained below and can then be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and 
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers. 
> Several things can be done in the future to improve this optimizer. Here are three examples.
> # Support queries that involve only TC;
> # Support queries in which input tables of correlated MR jobs involve intermediate tables; and 
> # Optimize queries involving self join. 
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc
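
The three correlation types defined in the description can be illustrated with a small sketch. It is not taken from the patch: the names MrJob, CorrelationSketch and their methods are hypothetical, and each MR job is modeled only by its set of input relations and its partition key.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of an MR job: only its inputs and its partition key. */
final class MrJob {
    final Set<String> inputs;
    final String partitionKey;

    MrJob(String partitionKey, String... inputs) {
        this.partitionKey = partitionKey;
        this.inputs = new HashSet<>(Arrays.asList(inputs));
    }
}

final class CorrelationSketch {
    /** IC: the input relation sets of the two jobs are not disjoint. */
    static boolean hasInputCorrelation(MrJob a, MrJob b) {
        return !Collections.disjoint(a.inputs, b.inputs);
    }

    /** TC: input correlation plus the same partition key. */
    static boolean hasTransitCorrelation(MrJob a, MrJob b) {
        return hasInputCorrelation(a, b) && a.partitionKey.equals(b.partitionKey);
    }

    /** JFC: a job has the same partition key as one of its child nodes. */
    static boolean hasJobFlowCorrelation(MrJob job, MrJob child) {
        return job.partitionKey.equals(child.partitionKey);
    }
}
```

For example, two jobs that both read table t2 and both partition on the same key would have IC and TC under this model, while two jobs that share an input but partition on different keys would have IC only.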


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457484#comment-13457484 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

The current patch looks ok. 
@Carl, please give more specific comments. 

We should agree that big new features should not be enabled by default. That's too risky. 
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.10-r1384442.patch.txt

The patch has been ported to the latest trunk (revision 1384442). I tested it with the CorrelationOptimizer enabled (hive.optimize.correlation=true). During testing I fixed several bugs, and all tests should pass except those explained below. 

In TestParse, 42 queries fail because I made several minor changes in SemanticAnalyzer. It seems the expected results should be updated. 

In TestCliDriver, auto_join26.q fails because it is optimized by the optimizer. Since I will make the optimizer disabled by default, I will not change this query or its result. 

In TestCliDriver, create_view.q and udaf_percentile_approx.q are two odd cases. With hive.map.aggr=false, the original trunk also fails, so there seems to be a bug in trunk. I have sent an email to the dev mailing list regarding create_view.q. For udaf_percentile_approx.q, I have got time to look at it in detail.

In TestCliDriver, join31.q fails. For this case, the query should be updated to include "set hive.optimize.correlation=true". But since the optimizer is disabled by default, I will not update this query. 

Also, some queries do not pass on trunk itself. These are cascade_dbdrop_hadoop20.q, hbase_binary_external_table_queries.q, hbase_binary_map_queries.q, hbase_binary_storage_queries.q, hbase_joins.q, hbase_ppd_key_range.q, hbase_pushdown.q, hbase_queries.q, local_mapred_error_cache.q, and the test case TestHBaseMinimrCliDriver. 

I will run all tests again and fix any bugs related to the patch. 

                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499858#comment-13499858 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

[~namit]
Sure. I just took a look at the code. It seems that once I have the content summaries of all input tables, I can guess whether the join auto resolver will apply to the join operators on those tables. As far as I know, the existing util functions for retrieving content summaries (called after logical optimization) cannot be used directly here, so I need to write some util functions to get the sizes of input tables. I will start working on this asap. Also, although HIVE-3671 does not seem hard to do, it is not a quick fix. I suggest we track this work in a separate jira.

[~cwsteinbach]
Have you had time to look at the current patch? Any comments?
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.3.patch.txt

I ran "ant clean package test tar -logfile ant.log" to test all cases again, and all the unknown errors are gone. Four failures remain (groupby1, groupby2, groupby3 and groupby5). They are caused by changes I made to the method "genGroupByPlanReduceSinkOperator" in the class "SemanticAnalyzer", so the expected results should be updated. But I am not sure how to update them. Does overwriting the current results with the new results work, or is there something else I should do?

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205654#comment-13205654 ] 

jiraposter@reviews.apache.org commented on HIVE-2206:
-----------------------------------------------------



bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.  > I've started reviewing this, here's my comments so far.  I'll continue to look over it.

I will update this patch soon. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.  > trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 453
bq.  > <https://reviews.apache.org/r/2001/diff/4/?file=71297#file71297line453>
bq.  >
bq.  >     Does this have to default to false, does anything break if it's true?
bq.  >     
bq.  >     Similarly, have you tried running the tests with this set to true?

I have not tried running the tests with this set to true. I will do so once I find a revision that passes all unit tests (btw, any suggestion on which revision I should use?). In my opinion, since this optimizer is fairly complicated and still under development, it is safer to default it to false and let users decide when to enable it than to default it to true.


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java, line 101
bq.  > <https://reviews.apache.org/r/2001/diff/4/?file=71299#file71299line101>
bq.  >
bq.  >     It's not clear to me why we need both setRowNumber and processOp.

Since a CorrelationCompositeOperator may have multiple parents, I use a buffer to store the output of its parents (see the processOp method). The TableScanOperator triggers the setRowNumber method; the CorrelationCompositeOperator then decides the operationPathTags of the buffered row based on the contents of the buffer and forwards that row to its child. So setRowNumber is used here to evaluate the operationPathTags of the row in the buffer before the CorrelationCompositeOperator receives the next row. 
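
The interplay between processOp and setRowNumber described in this reply can be sketched roughly as follows. This is a simplified, hypothetical model and not the code in the patch: the class name, the string representation of the tags, and the close method are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of the buffering described above. */
final class CompositeBufferSketch {
    private final boolean[] seenFromParent; // which parents emitted the buffered row
    private Object bufferedRow;             // null while nothing is buffered
    final List<String> forwarded = new ArrayList<>();

    CompositeBufferSketch(int numParents) {
        this.seenFromParent = new boolean[numParents];
    }

    /** Called by a parent that lets the current row through. */
    void processOp(Object row, int parentTag) {
        bufferedRow = row;
        seenFromParent[parentTag] = true;
    }

    /** Called when the table scan moves on to a new input row. */
    void setRowNumber(long rowNumber) {
        flush();
    }

    /** Called at end of input so the last buffered row is not lost. */
    void close() {
        flush();
    }

    /** Shared helper: tag the buffered row, forward it, reset the buffer. */
    private void flush() {
        if (bufferedRow == null) {
            return;
        }
        StringBuilder tags = new StringBuilder();
        for (int i = 0; i < seenFromParent.length; i++) {
            if (seenFromParent[i]) {
                tags.append(i);
            }
            seenFromParent[i] = false;
        }
        forwarded.add(bufferedRow + "@" + tags);
        bufferedRow = null;
    }
}
```

In this model the tagging logic lives in a single private helper that both setRowNumber and close call, which is one way to avoid keeping the same code in two places.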


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java, lines 150-177
bq.  > <https://reviews.apache.org/r/2001/diff/4/?file=71299#file71299line150>
bq.  >
bq.  >     Putting this code in a helper method would be better than having it both here and in setRowNumber.

I will do it. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java, line 274
bq.  > <https://reviews.apache.org/r/2001/diff/4/?file=71300#file71300line274>
bq.  >
bq.  >     Does this commented out code need to be kept?

This commented out code is not needed. I will delete it. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java, line 1337
bq.  > <https://reviews.apache.org/r/2001/diff/4/?file=71303#file71303line1337>
bq.  >
bq.  >     I couldn't find a CorrelationFakeReduceSinkOperator class.

CorrelationLocalSimulativeReduceSinkOperator was previously named CorrelationFakeReduceSinkOperator. I will use the right name in the comment. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java, line 273
bq.  > <https://reviews.apache.org/r/2001/diff/4/?file=71305#file71305line273>
bq.  >
bq.  >     Tabs are bad, could you change them to spaces, at least in the new code you're introducing.

I will change the format of my code. Thanks for letting me know.


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, line 239
bq.  > <https://reviews.apache.org/r/2001/diff/4/?file=71308#file71308line239>
bq.  >
bq.  >     I take it from this line it's a requirement that in order for this correlation optimization to be attempted every reduce sink has to be followed only by children with a single child.
bq.  >     
bq.  >     Could this be relaxed?  Could the optimization simply not be applied if there is an operator between two ReduceSinks that has more than one child?
bq.  >     
bq.  >     Also, if there is a ReduceSink which is not followed by another ReduceSink, but is followed by an operator with more than one child, this prevents the optimization from being used, even though it shouldn't have an effect.
bq.  >     
bq.  >     Also, regarding checking if the size <=1, if the size <1 the next line will throw an exception.

Only "assert op.getChildOperators().size() > 0;" is needed here. Thank you for letting me know. 


bq.  On 2012-02-10 17:38:09, Kevin Wilfong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, line 335
bq.  > <https://reviews.apache.org/r/2001/diff/4/?file=71308#file71308line335>
bq.  >
bq.  >     findNextChildReduceSinkOperator can return null, do you need to check for this?

findNextChildReduceSinkOperator will not return null since its input will not be the last ReduceSinkOperator before the FileSinkOperator. For example, suppose that we have a plan tree like (some operators)->RS1->(some operators)->RS2->(some operators)->FS. The input of findNextChildReduceSinkOperator will not be RS2. I will add an assertion and a comment after this line. 
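
To make the argument concrete, here is a minimal sketch of a "find the next ReduceSink below this operator" walk over a plan shaped like the example above. It is an illustration under assumptions, not the method from the patch: PlanOp and PlanWalkSketch are hypothetical names, and the walk assumes every operator between two ReduceSinks has exactly one child.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, minimal stand-in for a plan operator. */
final class PlanOp {
    final String name;
    final boolean isReduceSink;
    final List<PlanOp> children = new ArrayList<>();

    PlanOp(String name, boolean isReduceSink) {
        this.name = name;
        this.isReduceSink = isReduceSink;
    }

    /** Appends a child and returns it, so chains can be built fluently. */
    PlanOp then(PlanOp child) {
        children.add(child);
        return child;
    }
}

final class PlanWalkSketch {
    /**
     * Follows the single-child chain below 'start' and returns the first
     * ReduceSink found, or null if the chain ends (e.g. at a FileSink) first.
     */
    static PlanOp findNextChildReduceSink(PlanOp start) {
        PlanOp current = start;
        while (current.children.size() == 1) {
            current = current.children.get(0);
            if (current.isReduceSink) {
                return current;
            }
        }
        return null;
    }
}
```

With the plan RS1->(operators)->RS2->(operators)->FS, calling the walk on RS1 yields RS2, while calling it on RS2 yields null. That null case is the one the reply says never occurs in practice, and it is the case an assertion would guard.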


- Yin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/#review4912
-----------------------------------------------------------


On 2012-01-29 17:56:48, Yin Huai wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2001/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-01-29 17:56:48)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.
bq.  
bq.  
bq.  This addresses bug HIVE-2206.
bq.      https://issues.apache.org/jira/browse/HIVE-2206
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1237326 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1237326 
bq.    trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1237326 
bq.    trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1237326 
bq.    trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1237326 
bq.    trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1237326 
bq.  
bq.  Diff: https://reviews.apache.org/r/2001/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Yin
bq.  
bq.


                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466627#comment-13466627 ] 

Namit Jain commented on HIVE-2206:
----------------------------------

Sorry for jumping in late on this. This is a pretty big feature - can you give me some time to review this as well?
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.19-r1410581.patch.txt

I just integrated HIVE-3671 into this patch. At its start, the correlation optimizer predicts whether a join operator will be converted by CommonJoinResolver; if so, it annotates the join operator and ignores it in later optimization. The prediction can only be made for join operators whose input tables are not intermediate tables. The prediction method is ported from CommonJoinResolver. Also, a test is added in correlationoptimizer1.q.
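
As a rough illustration of such a prediction, the sketch below assumes a simple size-threshold rule: a join is predicted to be convertible if some input can act as the big table while the combined size of all other inputs stays within a threshold. The actual rule in CommonJoinResolver is not reproduced in this thread, so the rule, the class name, the method name and the threshold parameter are all assumptions.

```java
final class JoinConversionSketch {
    /**
     * Returns true if some input can act as the big table while the
     * combined size of all other inputs is at most the threshold.
     */
    static boolean mayBecomeMapJoin(long[] inputSizesInBytes, long smallTableThreshold) {
        long total = 0;
        for (long size : inputSizesInBytes) {
            total += size;
        }
        for (long bigTableSize : inputSizesInBytes) {
            if (total - bigTableSize <= smallTableThreshold) {
                return true;
            }
        }
        return false;
    }
}
```

Under this assumed rule, a join of a very large table with one small table would be skipped by the correlation optimizer, while a join of two large tables would remain a candidate for merging.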

[~namit]
Please take a look at this patch. Let me know if you have any comments.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called the Correlation Optimizer, which merges correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a separate MapReduce job for every operation that may need to shuffle data (e.g. join and aggregation operations). However, such operations may be correlated in the ways explained below, and correlated operations can be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations. I think that this is better than adding individual optimizers.
> There are several pieces of work that can be done in the future to improve this optimizer. Here are three examples.
> # Support queries that only involve TC;
> # Support queries in which the input tables of correlated MR jobs include intermediate tables; and
> # Optimize queries involving self joins.
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc
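
The three correlation definitions in the description above can be read as simple predicates over a job graph. The following is only an illustrative sketch and is not taken from the patch: the MRJob class, its fields, and the three function names are hypothetical, and "child nodes" is assumed to mean the jobs whose output a job consumes.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class MRJob:
    """Hypothetical model of one MapReduce job in a query plan (not a Hive class)."""
    name: str
    inputs: FrozenSet[str]   # names of the relations the job reads
    partition_key: str       # key used to shuffle data to the reducers
    # Assumption: children are the jobs whose output this job consumes.
    children: List["MRJob"] = field(default_factory=list)


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    # IC: the input relation sets are not disjoint.
    return not a.inputs.isdisjoint(b.inputs)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    # TC: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, child: MRJob) -> bool:
    # JFC: the job has the same partition key as one of its child nodes.
    return child in job.children and job.partition_key == child.partition_key


# Two aggregations over the same table, joined on the same key.
agg1 = MRJob("agg1", frozenset({"src"}), "key")
agg2 = MRJob("agg2", frozenset({"src"}), "key")
join = MRJob("join", frozenset({"agg1_out", "agg2_out"}), "key", [agg1, agg2])
```

In this toy plan, agg1 and agg2 have both IC and TC, and the join has JFC with each of them, which is the shape of query the description says the optimizer targets.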


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: testQueries.q
                HIVE-2206.1.patch.txt

The patch is ready. The queries used in testing are included in the file testQueries.q.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466625#comment-13466625 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

@Carl, I just reverted. I will commit again tomorrow.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.18-r1407720.patch.txt
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480881#comment-13480881 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

[~gemini5201314]
I do not have a short description right now. Let me write one and create a wiki page.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Description: 
References:
Paper and presentation of YSmart.
Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Presentation: http://sdrv.ms/UpwJJc

  was:
reference:

http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Presentation: http://sdrv.ms/UpwJJc


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466581#comment-13466581 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

@Carl, btw, I did mention a few times in the comments that I am planning to commit this one.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466631#comment-13466631 ] 

Carl Steinbach commented on HIVE-2206:
--------------------------------------

bq. I did not see a 24 hours waiting on the bylaw page?

This is specified in the "minimum length" column of the table that appears in the "Actions" section of the bylaws document. We could definitely make this easier to understand, but all of the other committers already follow the convention that you +1 a patch before committing it, and allow some time to elapse between those two actions in order to give other people a chance to weigh in.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205742#comment-13205742 ] 

jiraposter@reviews.apache.org commented on HIVE-2206:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/
-----------------------------------------------------------

(Updated 2012-02-10 20:49:01.177796)


Review request for hive.


Changes
-------

Updated the patch to revision 1237253. Will generate a patch based on the latest trunk later.


Summary
-------

This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.


This addresses bug HIVE-2206.
    https://issues.apache.org/jira/browse/HIVE-2206


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1237326 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1237326 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1237326 
  trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1237326 

Diff: https://reviews.apache.org/r/2001/diff


Testing
-------


Thanks,

Yin


                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.8-r1237253.patch.txt

@Kevin,
I wrongly assumed that all output names of the ReduceSinkOperator have the structure "KEY/VALUE.internalName". I have fixed this issue.

However, the current optimizer cannot handle the case where a table is directly connected to a post-computation operator (in this case, table b connects directly to the join operator). I am planning to solve this issue after this patch. As a workaround, you can use:
SET hive.optimize.reducededuplication=false;
SET hive.optimize.correlation=true;
SELECT * FROM (SELECT * FROM src DISTRIBUTE BY key SORT BY key) a JOIN (SELECT * FROM src DISTRIBUTE BY key SORT BY key) b ON a.key = b.key;
This query will be optimized and executed in a single MapReduce job.

Also, I have updated the patch, and it is compatible with revision 1237253.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Kevin Wilfong (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190147#comment-13190147 ] 

Kevin Wilfong commented on HIVE-2206:
-------------------------------------

I tried running 

explain select * from (select * from src distribute by key sort by key) a join src b  on a.key = b.key;

using HIVE-2206.8.r1224646.patch.txt and I get the following exception:

FAILED: Hive Internal Error: java.lang.ClassCastException(org.apache.hadoop.hive.ql.exec.SelectOperator cannot be cast to org.apache.hadoop.hive.ql.exec.ReduceSinkOperator)
java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.SelectOperator cannot be cast to org.apache.hadoop.hive.ql.exec.ReduceSinkOperator
	at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer$CorrelationNodeProc.findPeerReduceSinkOperators(CorrelationOptimizer.java:256)
	at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer$CorrelationNodeProc.process(CorrelationOptimizer.java:503)
	at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
	at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
	at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer.transform(CorrelationOptimizer.java:193)
	at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:100)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7384)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
	at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:50)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Status: Patch Available  (was: In Progress)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496399#comment-13496399 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

+1. I will commit after the tests pass.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called the Correlation Optimizer, which merges correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a separate MapReduce job for every operation that may need to shuffle data (e.g. join and aggregation operations). However, such operations may be related by the correlations explained below, in which case they can be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator that has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers.
> Several things can be done in the future to improve this optimizer. Here are three examples.
> # Support queries that involve only TC;
> # Support queries in which the input tables of correlated MR jobs include intermediate tables; and
> # Optimize queries involving self joins.
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc
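
The three correlation types defined in the description above can be sketched as simple predicates. This is only an illustration under stated assumptions, not code from the patch: the `Job` class, its fields, and the method names are hypothetical, and a partition key is modeled as a plain list of column names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CorrelationSketch {

    /** Hypothetical model of one MR job: its input tables and its partition (shuffle) key. */
    public static final class Job {
        final String name;
        final Set<String> inputTables;
        final List<String> partitionKey;

        public Job(String name, Set<String> inputTables, List<String> partitionKey) {
            this.name = name;
            this.inputTables = inputTables;
            this.partitionKey = partitionKey;
        }
    }

    /** Input correlation (IC): the input relation sets are not disjoint. */
    public static boolean hasInputCorrelation(Job a, Job b) {
        return !Collections.disjoint(a.inputTables, b.inputTables);
    }

    /** Transit correlation (TC): input correlation plus the same partition key. */
    public static boolean hasTransitCorrelation(Job a, Job b) {
        return hasInputCorrelation(a, b) && a.partitionKey.equals(b.partitionKey);
    }

    /** Job flow correlation (JFC): a job has the same partition key as one of its child nodes. */
    public static boolean hasJobFlowCorrelation(Job job, Job child) {
        return job.partitionKey.equals(child.partitionKey);
    }

    /** Small helper to build a set of table names. */
    public static Set<String> tables(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        // Hypothetical jobs; the table and column names are made up for illustration.
        Job agg = new Job("agg", tables("t1"), Arrays.asList("k"));
        Job join = new Job("join", tables("t1", "t2"), Arrays.asList("k"));
        Job scan = new Job("scan", tables("t3"), Arrays.asList("other"));

        System.out.println("IC(agg, join)   = " + hasInputCorrelation(agg, join));
        System.out.println("TC(agg, join)   = " + hasTransitCorrelation(agg, join));
        System.out.println("JFC(join, agg)  = " + hasJobFlowCorrelation(join, agg));
        System.out.println("JFC(join, scan) = " + hasJobFlowCorrelation(join, scan));
    }
}
```

In this simplified model, two jobs are merge candidates when the job flow predicate holds between a job and each job feeding it; the actual optimizer works on Hive operator trees rather than on a job abstraction like this one.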


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.2.patch.txt

There are some failures in TestCliDriver. In some of them, the output appears to be missing lines when the "explain" keyword is used. Other failures are related to queries with indexes, e.g. index_quth.q. When I ran index_quth.q manually, there were two errors: "java.lang.ClassNotFoundException: org.apache.derby.jdbc.EmbeddedDriver" and "java.lang.NoClassDefFoundError: javaewah/EWAHCompressedBitmap".

These errors seem unrelated to the changes I made, but something must be wrong...

@Yongqiang: Can you have a look at my patch and give me some suggestions on how to fix it? Thanks.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466584#comment-13466584 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

I did not see a 24-hour waiting period on the bylaws page?
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505495#comment-13505495 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

[~cwsteinbach] I am not sure whether the unit tests in Hive are comprehensive enough. If not, it might be better to turn this optimizer on by default in the future, after we can test it with more queries.

I just ran all unit tests with the correlation optimizer enabled. When map-side aggregation is on, the correlation optimizer also requires the regular reduce-side aggregation to be generated, so if "cube" or "rollup" is used in the query, error message 10209 (org.apache.hadoop.hive.ql.ErrorMsg.HIVE_GROUPING_SETS_AGGR_NOMAPAGGR) is thrown. It seems HIVE-3508 can solve this issue. Apart from this issue, a few query plans need to be regenerated because the operator ids have changed.

This jira has taken a long time. Can we wrap it up so that I can start working on the follow-up jiras?
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Status: Patch Available  (was: Open)
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.6.patch.txt
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490432#comment-13490432 ] 

Namit Jain commented on HIVE-2206:
----------------------------------

[~yhuai], can you file follow-up jiras for the cases that don't work with this optimization?
It would be good to link them to this jira. Adding them to the wiki would also be useful for tracking.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Status: Open  (was: Patch Available)

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463855#comment-13463855 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

I corrected my local HBase-related configuration and checked out HIVE-3507; now all tests pass.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105098#comment-13105098 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

Cool! Yin, please let us know when you are mostly done. One small thing: in the Hive code, let's call the new optimizer "cooperative scan" instead of YSmart. But we can add the paper reference in a comment.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107307#comment-13107307 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

I tested three queries: TPC-H Q17, Q18, and the left-outer-join sub-tree in Q21. You can check the query plan trees in the paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

Here are the results (times in seconds).

Query          disabled (s)   enabled (s)
Q17            1288.917       655.07
Q18            1731.734       911.761
Q21 subtree    1865.597       658.58
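
The relative improvement implied by these timings can be worked out as the ratio of the disabled time to the enabled time. The following is a minimal sketch of that arithmetic; the class and method names are hypothetical, and the numbers are copied from the results above. By this measure the speedup is roughly 1.97x for Q17, 1.90x for Q18, and 2.83x for the Q21 subtree.

```java
public class SpeedupSketch {

    /** Speedup = elapsed time with the optimizer disabled / elapsed time with it enabled. */
    public static double speedup(double disabledSeconds, double enabledSeconds) {
        return disabledSeconds / enabledSeconds;
    }

    public static void main(String[] args) {
        // Timings in seconds, taken from the results reported in this comment.
        String[] queries = {"Q17", "Q18", "Q21 subtree"};
        double[] disabled = {1288.917, 1731.734, 1865.597};
        double[] enabled = {655.07, 911.761, 658.58};

        for (int i = 0; i < queries.length; i++) {
            System.out.printf("%-12s %.2fx%n", queries[i], speedup(disabled[i], enabled[i]));
        }
    }
}
```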


> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466552#comment-13466552 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

All tests passed for me.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177318#comment-13177318 ] 

jiraposter@reviews.apache.org commented on HIVE-2206:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/
-----------------------------------------------------------

(Updated 2011-12-29 18:50:12.277210)


Review request for hive.


Summary
-------

This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.


This addresses bug HIVE-2206.
    https://issues.apache.org/jira/browse/HIVE-2206


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1224666 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1224666 
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1224666 
  trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1224666 
  trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1224666 
  trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1224666 
  trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1224666 

Diff: https://reviews.apache.org/r/2001/diff


Testing (updated)
-------


Thanks,

Yin


                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.1.q, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.4.patch.txt

All unit tests pass except TestParse_groupby1, TestParse_groupby2, TestParse_groupby3 and TestParse_groupby5. Because I changed the method "genGroupByPlanReduceSinkOperator" in the class "SemanticAnalyzer", the expected results of these four cases should be updated. However, I found that when these four cases are run individually, the results differ from those produced when they are run together with all the other cases (using "ant clean package test tar -logfile ant.log"). The hive-trunk I checked out from svn also has this issue. Is it a bug, or did I miss something?

This patch does not update the expected results of TestParse_groupby1, TestParse_groupby2, TestParse_groupby3 and TestParse_groupby5.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466580#comment-13466580 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

I commented that all tests passed.

ok, +1.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496697#comment-13496697 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

Okay, I will aim to commit it this weekend or early next week.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called the Correlation Optimizer, which merges correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it generates a MapReduce job for every operation that may need to shuffle data (e.g. join and aggregation operations). However, such operations may be correlated in the ways explained below, and can then be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side aggregation operator which has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and 
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers. 
> Several things can be done in the future to improve this optimizer. Here are three examples.
> # Support queries that only involve TC;
> # Support queries in which the input tables of correlated MR jobs include intermediate tables; and 
> # Optimize queries involving self join. 
> References:
> Paper and presentation of YSmart.
> Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc
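
The three correlation definitions in the description above can be read as simple predicates. The following is only a minimal sketch in Python, not code from the patch: the MRJob class, its fields and the has_ic / has_tc / has_jfc helpers are hypothetical names, and "children" is assumed to mean the jobs whose output a job consumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MRJob:
    """Hypothetical model of one MapReduce job (not a Hive class)."""
    name: str
    inputs: frozenset        # names of the input relations
    partition_key: tuple     # columns the shuffle partitions on
    children: tuple = ()     # assumed: jobs whose output this job consumes


def has_ic(a, b):
    """Input correlation: the input relation sets are not disjoint."""
    return bool(a.inputs & b.inputs)


def has_tc(a, b):
    """Transit correlation: input correlation plus the same partition key."""
    return has_ic(a, b) and a.partition_key == b.partition_key


def has_jfc(job, child):
    """Job flow correlation: same partition key as one of its child nodes."""
    return child in job.children and job.partition_key == child.partition_key


# Toy plan: two jobs over a shared table feed a third job on the same key.
j1 = MRJob("J1", frozenset({"lineitem"}), ("l_partkey",))
j2 = MRJob("J2", frozenset({"part", "lineitem"}), ("l_partkey",))
j3 = MRJob("J3", frozenset({"J1.out", "J2.out"}), ("l_partkey",), children=(j1, j2))

print(has_ic(j1, j2), has_tc(j1, j2), all(has_jfc(j3, c) for c in j3.children))
```

In this toy plan J3 has JFC with both of the jobs feeding it, which corresponds to the first condition listed in the description for a query to be optimized.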


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424315#comment-13424315 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

For the last few months (almost one year), Yin has been actively maintaining this patch, and I think it is in a very good state to check into trunk. 

So I will do a final review and hope to commit it sometime next month. Please feel free to jump in to review the patch and post any comments here before the commit.

In the final review, I will make sure this patch does not make big changes to the existing execution path, so that it can simply be disabled like other optimizations in Hive. Yin will continue to actively maintain this patch (helping fix bugs, etc.) after the commit. 
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: testQueries.2.q
    
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467067#comment-13467067 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

I just found that I can remove the first phase of this optimizer. Apparently there were changes in the trunk, so I do not need to save the original ColumnExprMap and OpParseCtx. I have removed the unnecessary code and am running tests. I will update the patch later.

                

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496521#comment-13496521 ] 

Carl Steinbach commented on HIVE-2206:
--------------------------------------

@Yongqiang: Can you please hold off on committing while I take another look? Thanks.
                

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054816#comment-13054816 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

The current optimizer can identify correlations in query plan trees structured like that of TPC-H Q17 (in the attached file Queries). Using Q17 as an example, the sub-query (denoted sub-Q1 and originally executed by MapReduce job J1) "SELECT l_partkey as t_partkey, 0.2 * avg(l_quantity) AS t_avg_quantity FROM lineitem GROUP BY l_partkey" is correlated with the sub-query (denoted sub-Q2 and originally executed by MapReduce job J2) "SELECT l_quantity, l_partkey, l_extendedprice FROM part p JOIN lineitem l ON p.p_partkey = l.l_partkey AND p.p_brand = 'Brand#52' AND p.p_container = 'JUMBO CAN'", because (1) sub-Q1 and sub-Q2 share the same input 'lineitem'; and (2) the ReduceSinkOperators in J1 and J2 share the same 'key', which is l_partkey (p_partkey). Also, because the intermediate tables generated by sub-Q1 and sub-Q2 are joined by a MapReduce job J3 whose ReduceSinkOperator 'key' is 'l_partkey', J3 is correlated with J1 and J2. Thus, J1, J2 and J3 can be merged into one MapReduce job J'.

In the map function of J', a composite operator executes the FilterOperators (if any) for sub-Q1 and sub-Q2. Then, in the reduce function of J', a dispatch operator dispatches reduce-input records to the GroupByOperator from J1 and the JoinOperator from J2. Finally, the results of the GroupByOperator and the JoinOperator are fed to the JoinOperator from J3. 
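
To make the merged job J' concrete, here is a toy single-process simulation in Python. It is only an illustrative sketch under stated assumptions, not the operator code in the patch: merged_job and the tuple layouts are hypothetical, the "shuffle" is a plain dictionary keyed on partkey, and the final join is assumed to match rows on that same key.

```python
from collections import defaultdict


def merged_job(lineitem, part):
    """Toy simulation of one merged job for a Q17-style plan.

    lineitem: list of (l_partkey, l_quantity, l_extendedprice)
    part:     list of (p_partkey, p_brand, p_container)
    """
    # "Map" side: a single shuffle keyed on partkey; lineitem is read once
    # and the filter belonging to sub-Q2 is applied before the shuffle.
    shuffled = defaultdict(lambda: {"lineitem": [], "part": []})
    for row in lineitem:
        shuffled[row[0]]["lineitem"].append(row)
    for row in part:
        if row[1] == "Brand#52" and row[2] == "JUMBO CAN":
            shuffled[row[0]]["part"].append(row)

    results = []
    for key, group in shuffled.items():
        items = group["lineitem"]
        # Dispatch 1: the group-by of sub-Q1 (0.2 * avg(l_quantity)).
        t_avg = 0.2 * sum(q for _, q, _ in items) / len(items) if items else None
        # Dispatch 2: the join of sub-Q2 (part JOIN lineitem on partkey).
        joined = [(k, q, price) for _ in group["part"] for k, q, price in items]
        # Final join: both inputs are already co-located on the same key.
        if t_avg is not None:
            results.extend((k, q, price, t_avg) for k, q, price in joined)
    return results


rows = merged_job(
    [(1, 10, 100.0), (1, 30, 300.0), (2, 5, 50.0)],
    [(1, "Brand#52", "JUMBO CAN"), (2, "Brand#11", "SM BOX")],
)
print(rows)
```

The point of the sketch is that one shuffle on partkey is enough for all three steps, which is what allows three jobs to collapse into one.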

This optimizer has several open issues. 

1: In the MapReduce job that executes the correlated jobs, intermediate key/value pairs are consumed by multiple operators, so map-side aggregation is disabled. 
2: In the MapReduce job that executes the correlated jobs, if the execution paths in the reduce function do not all have the same depth (for example "SELECT * FROM lineitem l1 JOIN (SELECT l_partkey FROM part p JOIN lineitem l ON p.p_partkey = l.l_partkey) tmp ON l1.l_partkey = tmp.l_partkey"), one or more YSmartForwardOperators should be used. I have not completely solved this issue.
3: For two independent MapReduce jobs J1 and J2, the current correlation identifier only looks for ReduceSinkOperators with the same 'key(s)'. In fact, if the set of 'key(s)' of the ReduceSinkOperator in J1 is a subset of that in J2, these two MapReduce jobs are also correlated. (Sub-queries that use the distinct keyword with a Group By clause also fall under this issue, since distinct is handled by using all columns as 'keys' in the corresponding ReduceSinkOperator.)
4: The current correlation identifier cannot identify correlations represented by columns involving "max(<column name>)" or "min(<column name>)".

I will start working on this optimizer in August and will first solve issues 2-4 mentioned above.  
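
Issue 3 can be illustrated with a small key-set comparison. This is a hedged sketch in Python rather than the identifier's actual logic: reduce_keys_correlated and its allow_subset flag are hypothetical, and accepting the subset relation in either direction is an assumption made for the example.

```python
def reduce_keys_correlated(keys_a, keys_b, allow_subset=True):
    """Return True if two reduce-sink key sets are treated as correlated.

    With allow_subset=False only identical key sets match (the behaviour
    described as current). With allow_subset=True a key set that is
    strictly contained in the other also matches (the proposed extension).
    """
    a, b = frozenset(keys_a), frozenset(keys_b)
    if not a or not b:
        return False              # no shuffle key, nothing to correlate
    if a == b:
        return True
    return allow_subset and (a < b or b < a)


print(reduce_keys_correlated(["l_partkey"], ["l_partkey", "l_suppkey"]))
print(reduce_keys_correlated(["l_partkey"], ["l_partkey", "l_suppkey"], allow_subset=False))
```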

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


       

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2206:
-------------------------------

    Attachment: YSmartPatchForHive.patch

a draft patch

will submit revised version later.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>         Attachments: YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466937#comment-13466937 ] 

He Yongqiang commented on HIVE-2206:
------------------------------------

I will be on vacation this whole week. Given that this is a very big diff, I will keep this open for another week or two for more comments. 

                

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2206:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed. Thanks for the hard work, Yin Huai!
                

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205570#comment-13205570 ] 

jiraposter@reviews.apache.org commented on HIVE-2206:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2001/#review4912
-----------------------------------------------------------


I've started reviewing this; here are my comments so far.  I'll continue to look over it.


trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/2001/#comment11010>

    Does this have to default to false? Does anything break if it's true?
    
    Similarly, have you tried running the tests with this set to true?



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
<https://reviews.apache.org/r/2001/#comment10818>

    It's not clear to me why we need both setRowNumber and processOp.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
<https://reviews.apache.org/r/2001/#comment10817>

    Putting this code in a helper method would be better than having it both here and in setRowNumber.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/2001/#comment10819>

    Does this commented out code need to be kept?



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
<https://reviews.apache.org/r/2001/#comment10820>

    I couldn't find a CorrelationFakeReduceSinkOperator class.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
<https://reviews.apache.org/r/2001/#comment10821>

    Tabs are bad; could you change them to spaces, at least in the new code you're introducing?



trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
<https://reviews.apache.org/r/2001/#comment10850>

    I take it from this line that, for this correlation optimization to be attempted, every reduce sink has to be followed only by children with a single child.
    
    Could this be relaxed?  Could the optimization simply not be applied if there is an operator between two ReduceSinks that has more than one child?
    
    Also, if there is a ReduceSink which is not followed by another ReduceSink, but is followed by an operator with more than one child, this prevents the optimization from being used, even though it shouldn't have an effect.
    
    Also, regarding the check that the size is <= 1: if the size is < 1, the next line will throw an exception.



trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
<https://reviews.apache.org/r/2001/#comment10851>

    findNextChildReduceSinkOperator can return null; do you need to check for this?
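
The two CorrelationOptimizer.java comments above (indexing the first child when there may be none, and a search that can return nothing) can be illustrated with a toy operator walk. This is a Python stand-in for the Java code under review, written as a sketch only: the Op class, its fields and can_try_correlation are hypothetical, and only the function name mirrors findNextChildReduceSinkOperator.

```python
class Op:
    """Hypothetical stand-in for an operator node (not Hive's Operator)."""

    def __init__(self, name, is_reduce_sink=False, children=()):
        self.name = name
        self.is_reduce_sink = is_reduce_sink
        self.children = list(children)


def find_next_child_reduce_sink(op):
    """Follow single-child links until a reduce sink is found.

    Returns None when the chain ends (no children) or branches
    (more than one child), so callers must handle None.
    """
    current = op
    while len(current.children) == 1:   # exactly one child: index 0 is safe
        current = current.children[0]
        if current.is_reduce_sink:
            return current
    return None


def can_try_correlation(reduce_sink):
    """Skip the optimization instead of failing when no next sink exists."""
    return find_next_child_reduce_sink(reduce_sink) is not None


rs2 = Op("RS2", is_reduce_sink=True)
rs1 = Op("RS1", is_reduce_sink=True, children=[Op("SEL", children=[rs2])])
leaf = Op("RS3", is_reduce_sink=True)
branch = Op("RS4", is_reduce_sink=True, children=[Op("A"), Op("B")])
print(can_try_correlation(rs1), can_try_correlation(leaf), can_try_correlation(branch))
```

Testing for exactly one child before indexing covers the empty case, and returning None lets the caller decline the optimization rather than throw.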


- Kevin


On 2012-01-29 17:56:48, Yin Huai wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2001/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-01-29 17:56:48)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job.
bq.  
bq.  
bq.  This addresses bug HIVE-2206.
bq.      https://issues.apache.org/jira/browse/HIVE-2206
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java 1237326 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1237326 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 1237326 
bq.    trunk/ql/src/test/results/compiler/plan/groupby1.q.xml 1237326 
bq.    trunk/ql/src/test/results/compiler/plan/groupby2.q.xml 1237326 
bq.    trunk/ql/src/test/results/compiler/plan/groupby3.q.xml 1237326 
bq.    trunk/ql/src/test/results/compiler/plan/groupby5.q.xml 1237326 
bq.  
bq.  Diff: https://reviews.apache.org/r/2001/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Yin
bq.  
bq.


                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8-r1237253.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457114#comment-13457114 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

Carl:
The main reason Yongqiang and I decided to disable this feature by default for now is that we have not yet had a chance to test this optimizer heavily.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462865#comment-13462865 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

Updated the patch on Review Board.

@Carl: Please also see my replies under your comments. Thanks.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Work started] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HIVE-2206 started by Yin Huai.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Assigned] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang reassigned HIVE-2206:
----------------------------------

    Assignee: Yin Huai

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: Queries, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Work started] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HIVE-2206 started by Yin Huai.

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.12-r1386996.patch.txt

Patch updated: bug fix + 3 test cases.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
---------------------------

    Attachment: HIVE-2206.13-r1389072.patch.txt

Two new tests + bug fix. This patch is ready for review. Diff r4 at https://reviews.apache.org/r/7126/ is the latest patch.
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Kevin Wilfong (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190161#comment-13190161 ] 

Kevin Wilfong commented on HIVE-2206:
-------------------------------------

The above bug is a pre-existing issue with reduce sink deduplication.

The following new exception is produced by the query:

set hive.optimize.reducededuplication=false;
explain select * from (select * from src distribute by key sort by key) a join src b on a.key = b.key;

FAILED: Hive Internal Error: java.lang.ArrayIndexOutOfBoundsException(1)
java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizerUtils.createCorrelationCompositeReducesinkOperaotr(CorrelationOptimizerUtils.java:599)
	at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizerUtils.applyCorrelation(CorrelationOptimizerUtils.java:365)
	at org.apache.hadoop.hive.ql.optimizer.CorrelationOptimizer.transform(CorrelationOptimizer.java:198)
	at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:100)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7384)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
	at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:50)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, YSmartPatchForHive.patch, testQueries.2.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


        

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466638#comment-13466638 ] 

Hudson commented on HIVE-2206:
------------------------------

Integrated in Hive-trunk-h0.21 #1711 (See [https://builds.apache.org/job/Hive-trunk-h0.21/1711/])
    HIVE-2206:add a new optimizer for query correlation discovery and optimization (Yin Huai via He Yongqiang) (Revision 1392105)

     Result = FAILURE
heyongqiang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1392105
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby1.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby5.q.xml

                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.10.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf


[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Posted by "Yin Huai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109755#comment-13109755 ] 

Yin Huai commented on HIVE-2206:
--------------------------------

Submitted a review request. The link is [https://reviews.apache.org/r/2001/].

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5.patch.txt, Queries, YSmartPatchForHive.patch, testQueries.q
>
>
> reference:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
