You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Ashutosh Chauhan <ha...@apache.org> on 2013/01/10 03:24:04 UTC

Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7126/#review15124
-----------------------------------------------------------



ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
<https://reviews.apache.org/r/7126/#comment32714>

    I don't see any modifications to ql/if/queryplan.thrift Please mod that file appropriately and add the generated code  in the patch



ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32715>

    Unused import.



ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32716>

    Please add javadocs, explaining this class as well as the fact that there are currently two classes which extends this. Also, add difference in behavior of those two classes which necessitates the need for this base class.



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
<https://reviews.apache.org/r/7126/#comment32811>

    Unused import.



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
<https://reviews.apache.org/r/7126/#comment32812>

    Unused import.



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java
<https://reviews.apache.org/r/7126/#comment32813>

    please do return "CorrelationComposite" here instead of "CCO"



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32722>

    It seems like you are serializing and then immediately deserializing keys and values here, which I think is required for ReduceSinkOperator since keys and values are transferred from mapper process to reducer process. This is redundant in CLSReduceSinkOp since its all running inline in one operator pipeline in same memory. So, its looks like this could be avoided. 
    I guess doing this keeps implementation easier, but if this is true, we should take this up in follow-up jira as performance improvement.



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32810>

    Does this imply CLSRSOperator cannot have more than one child operator in any case. If so, can you please add comments stating that along with small description why is that?



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32718>

    Please add comments here about why its overridden for empty implementation and how startGroup() is taken care of in processOp()



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32719>

    Please add comments here about why its overridden for empty implementation and how endGroup() is dealt with in processOp()



ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
<https://reviews.apache.org/r/7126/#comment32720>

    Please add comments here about why its overridden for empty implementation and how this is taken care of in processOp()



ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
<https://reviews.apache.org/r/7126/#comment32814>

    I don't get the reason of introducing rowNumber in Operator. It doesn't look its required for optimization to take place correctly. Can you elaborate the need of this variable and different associated methods which are introduced along with it?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
<https://reviews.apache.org/r/7126/#comment32815>

    Can you add comments clarifying whats the reason for this? 



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
<https://reviews.apache.org/r/7126/#comment32816>

    Seems like the logic of auto conversion of join to map-join is same as the one used in CommonJoinResolver. If thats the case, instead of replicating the logic here, it will be good to refactor that logic out of CommonJoinResolver in some util function in some util class and then use that function here. That logic will likely change over course of time and risk getting diverged if we duplicate instead of reuse. 



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
<https://reviews.apache.org/r/7126/#comment32817>

    If I am reading this correctly than, map-side groupbys are converted to reduce-side groupbys. I don't see any fundamental reason why correlation optimizer cannot work with map-side groupbys. Is this just the limitation of current implementation or is there any fundamental reason for it. Please add comments whatever the case is.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java
<https://reviews.apache.org/r/7126/#comment32828>

    Instead of changing the plan first and than undo the changes on the plan later on (in case optimizer can not  be applied) I am wondering a safer choice could have been to clone the plan and than do your changes on the cloned plan. If we find that plan can be optimized, only than change the original plan. This will be a bit more safer. Thoughts?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
<https://reviews.apache.org/r/7126/#comment32837>

    Can you add comments why it is not compatible with skew-join and groupby-skew optimizations?



ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
<https://reviews.apache.org/r/7126/#comment32838>

    Explain annotation should have worked, no ? Having this working will be very useful for debugging.


- Ashutosh Chauhan


On Nov. 19, 2012, 7:51 p.m., Yin Huai wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/7126/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2012, 7:51 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one jobs. Open a new request since I have been working on hive-git.
> 
> 
> This addresses bug HIVE-2206.
>     https://issues.apache.org/jira/browse/HIVE-2206
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 9fa9525 
>   conf/hive-default.xml.template f332f3a 
>   ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java 7c4c413 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 18a9bd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 46daeb2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 68302f8 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 0c22141 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 919a140 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1469325 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java edde378 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java d1555e2 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 2bf284d 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 330aa52 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 5a9f064 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java b33d616 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 9a95efd 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 6f8bc47 
>   ql/src/test/queries/clientpositive/correlationoptimizer1.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer2.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer3.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer4.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer5.q PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer1.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer2.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer3.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer4.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer5.q.out PRE-CREATION 
>   ql/src/test/results/compiler/plan/groupby1.q.xml cd0d6e4 
>   ql/src/test/results/compiler/plan/groupby2.q.xml 7b07f02 
>   ql/src/test/results/compiler/plan/groupby3.q.xml a6a1986 
>   ql/src/test/results/compiler/plan/groupby5.q.xml 25e3583 
> 
> Diff: https://reviews.apache.org/r/7126/diff/
> 
> 
> Testing
> -------
> 
> All tests pass.
> 
> 
> Thanks,
> 
> Yin Huai
> 
>


Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization

Posted by Yin Huai <hu...@cse.ohio-state.edu>.

> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java, line 32
> > <https://reviews.apache.org/r/7126/diff/11/?file=221824#file221824line32>
> >
> >     I don't see any modifications to ql/if/queryplan.thrift Please mod that file appropriately and add the generated code  in the patch

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java, line 36
> > <https://reviews.apache.org/r/7126/diff/11/?file=221825#file221825line36>
> >
> >     Unused import.

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java, line 47
> > <https://reviews.apache.org/r/7126/diff/11/?file=221825#file221825line47>
> >
> >     Please add javadocs, explaining this class as well as the fact that there are currently two classes which extends this. Also, add difference in behavior of those two classes which necessitates the need for this base class.

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java, line 25
> > <https://reviews.apache.org/r/7126/diff/11/?file=221826#file221826line25>
> >
> >     Unused import.

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java, line 26
> > <https://reviews.apache.org/r/7126/diff/11/?file=221826#file221826line26>
> >
> >     Unused import.

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java, line 155
> > <https://reviews.apache.org/r/7126/diff/11/?file=221826#file221826line155>
> >
> >     please do return "CorrelationComposite" here instead of "CCO"

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java, line 224
> > <https://reviews.apache.org/r/7126/diff/11/?file=221827#file221827line224>
> >
> >     It seems like you are serializing and then immediately deserializing keys and values here, which I think is required for ReduceSinkOperator since keys and values are transferred from mapper process to reducer process. This is redundant in CLSReduceSinkOp since its all running inline in one operator pipeline in same memory. So, its looks like this could be avoided. 
> >     I guess doing this keeps implementation easier, but if this is true, we should take this up in follow-up jira as performance improvement.

yes, it was for easier implementation. I will add a comment indicating it will be addressed in a follow-up jira.


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java, line 275
> > <https://reviews.apache.org/r/7126/diff/11/?file=221827#file221827line275>
> >
> >     Does this imply CLSRSOperator cannot have more than one child operator in any case. If so, can you please add comments stating that along with small description why is that?

I thought that, in the original plan, a ReduceSinkOperator can only have 1 child. Because a CLSRSOperator replaces a ReduceSinkOperator, it also has a single child. Is my understanding correct?


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java, line 285
> > <https://reviews.apache.org/r/7126/diff/11/?file=221827#file221827line285>
> >
> >     Please add comments here about why its overridden for empty implementation and how startGroup() is taken care of in processOp()

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java, line 290
> > <https://reviews.apache.org/r/7126/diff/11/?file=221827#file221827line290>
> >
> >     Please add comments here about why its overridden for empty implementation and how endGroup() is dealt with in processOp()

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java, line 294
> > <https://reviews.apache.org/r/7126/diff/11/?file=221827#file221827line294>
> >
> >     Please add comments here about why its overridden for empty implementation and how this is taken care of in processOp()

added a comment to explain this method.


- Yin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7126/#review15124
-----------------------------------------------------------


On Nov. 19, 2012, 7:51 p.m., Yin Huai wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/7126/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2012, 7:51 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one jobs. Open a new request since I have been working on hive-git.
> 
> 
> This addresses bug HIVE-2206.
>     https://issues.apache.org/jira/browse/HIVE-2206
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 9fa9525 
>   conf/hive-default.xml.template f332f3a 
>   ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java 7c4c413 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 18a9bd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 46daeb2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 68302f8 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 0c22141 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 919a140 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1469325 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java edde378 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java d1555e2 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 2bf284d 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 330aa52 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 5a9f064 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java b33d616 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 9a95efd 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 6f8bc47 
>   ql/src/test/queries/clientpositive/correlationoptimizer1.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer2.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer3.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer4.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer5.q PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer1.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer2.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer3.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer4.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer5.q.out PRE-CREATION 
>   ql/src/test/results/compiler/plan/groupby1.q.xml cd0d6e4 
>   ql/src/test/results/compiler/plan/groupby2.q.xml 7b07f02 
>   ql/src/test/results/compiler/plan/groupby3.q.xml a6a1986 
>   ql/src/test/results/compiler/plan/groupby5.q.xml 25e3583 
> 
> Diff: https://reviews.apache.org/r/7126/diff/
> 
> 
> Testing
> -------
> 
> All tests pass.
> 
> 
> Thanks,
> 
> Yin Huai
> 
>


Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization

Posted by Yin Huai <hu...@cse.ohio-state.edu>.

> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java, line 275
> > <https://reviews.apache.org/r/7126/diff/11/?file=221827#file221827line275>
> >
> >     Does this imply CLSRSOperator cannot have more than one child operator in any case. If so, can you please add comments stating that along with small description why is that?
> 
> Yin Huai wrote:
>     I thought that, in the original plan, a ReduceSinkOperator can only have 1 child. Because a CLSRSOperator replaces a ReduceSinkOperator, it also has a single child. Is my understanding correct?
> 
> Ashutosh Chauhan wrote:
>     I think you are correct. Since its a terminal operator, RSOperator can only have 1 child. It will be good to add this as comment there.

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java, line 1451
> > <https://reviews.apache.org/r/7126/diff/11/?file=221831#file221831line1451>
> >
> >     I don't get the reason of introducing rowNumber in Operator. It doesn't look its required for optimization to take place correctly. Can you elaborate the need of this variable and different associated methods which are introduced along with it?

If a query plan is optimized by correlation optimizer, multiple operation paths can be merged to a single Map phase. CorrelationCompositeOperator is used to evaluate results from different paths and forward a single row to ReduceSinkOperator. CorrelationCompositeOperator will first buffer rows forwarded to it from different paths. rowNumber is introduced to let CorrelationCompositeOperator know when a row has been processed by all paths. When a new rowNumber (a new row will come) is set through setRowNumber, CorrelationCompositeOperator will evaluate its current buffer and tag which operation paths this current row belongs to. I will add comment to explain it.


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, line 134
> > <https://reviews.apache.org/r/7126/diff/11/?file=221835#file221835line134>
> >
> >     Can you add comments clarifying whats the reason for this?

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, line 270
> > <https://reviews.apache.org/r/7126/diff/11/?file=221835#file221835line270>
> >
> >     If I am reading this correctly than, map-side groupbys are converted to reduce-side groupbys. I don't see any fundamental reason why correlation optimizer cannot work with map-side groupbys. Is this just the limitation of current implementation or is there any fundamental reason for it. Please add comments whatever the case is.

For example, if an MR job for an aggregation operator shares the input table with an MR job for a join operator, with Reduce-side aggregation (RS-GBY pattern), the ReduceSinkOperator for the merged Map phase only needs to emit a single row for both the aggregation operator and the join operator. Thus, I decide to convert map-side groupbys to reduce-side groupbys.


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java, line 94
> > <https://reviews.apache.org/r/7126/diff/11/?file=221838#file221838line94>
> >
> >     Can you add comments why it is not compatible with skew-join and groupby-skew optimizations?

seems skew optimizations will split a single MR job to 2 jobs. I have not carefully thought how to apply the correlation optimization on those plans optimized by skew optimizations. So, I'd suggest we evaluate this issue in a follow-up jira.


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java, line 393
> > <https://reviews.apache.org/r/7126/diff/11/?file=221845#file221845line393>
> >
> >     Explain annotation should have worked, no ? Having this working will be very useful for debugging.

Enabled. We need to re-generate some test results (probably all results involving "EXPLAIN"?).


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, line 175
> > <https://reviews.apache.org/r/7126/diff/11/?file=221835#file221835line175>
> >
> >     Seems like the logic of auto conversion of join to map-join is same as the one used in CommonJoinResolver. If thats the case, instead of replicating the logic here, it will be good to refactor that logic out of CommonJoinResolver in some util function in some util class and then use that function here. That logic will likely change over course of time and risk getting diverged if we duplicate instead of reuse.

i am working on it.


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, line 342
> > <https://reviews.apache.org/r/7126/diff/11/?file=221835#file221835line342>
> >
> >     Instead of changing the plan first and than undo the changes on the plan later on (in case optimizer can not  be applied) I am wondering a safer choice could have been to clone the plan and than do your changes on the cloned plan. If we find that plan can be optimized, only than change the original plan. This will be a bit more safer. Thoughts?

Yes, I agree. One question. Is there any convenient way to clone a ParseContext or an entire plan tree?


- Yin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7126/#review15124
-----------------------------------------------------------


On Nov. 19, 2012, 7:51 p.m., Yin Huai wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/7126/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2012, 7:51 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one jobs. Open a new request since I have been working on hive-git.
> 
> 
> This addresses bug HIVE-2206.
>     https://issues.apache.org/jira/browse/HIVE-2206
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 9fa9525 
>   conf/hive-default.xml.template f332f3a 
>   ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java 7c4c413 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 18a9bd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 46daeb2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 68302f8 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 0c22141 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 919a140 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1469325 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java edde378 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java d1555e2 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 2bf284d 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 330aa52 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 5a9f064 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java b33d616 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 9a95efd 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 6f8bc47 
>   ql/src/test/queries/clientpositive/correlationoptimizer1.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer2.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer3.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer4.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer5.q PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer1.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer2.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer3.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer4.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer5.q.out PRE-CREATION 
>   ql/src/test/results/compiler/plan/groupby1.q.xml cd0d6e4 
>   ql/src/test/results/compiler/plan/groupby2.q.xml 7b07f02 
>   ql/src/test/results/compiler/plan/groupby3.q.xml a6a1986 
>   ql/src/test/results/compiler/plan/groupby5.q.xml 25e3583 
> 
> Diff: https://reviews.apache.org/r/7126/diff/
> 
> 
> Testing
> -------
> 
> All tests pass.
> 
> 
> Thanks,
> 
> Yin Huai
> 
>


Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization

Posted by Ashutosh Chauhan <ha...@apache.org>.

> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java, line 275
> > <https://reviews.apache.org/r/7126/diff/11/?file=221827#file221827line275>
> >
> >     Does this imply CLSRSOperator cannot have more than one child operator in any case. If so, can you please add comments stating that along with small description why is that?
> 
> Yin Huai wrote:
>     I thought that, in the original plan, a ReduceSinkOperator can only have 1 child. Because a CLSRSOperator replaces a ReduceSinkOperator, it also has a single child. Is my understanding correct?

I think you are correct. Since its a terminal operator, RSOperator can only have 1 child. It will be good to add this as comment there. 


- Ashutosh


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7126/#review15124
-----------------------------------------------------------


On Nov. 19, 2012, 7:51 p.m., Yin Huai wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/7126/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2012, 7:51 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one jobs. Open a new request since I have been working on hive-git.
> 
> 
> This addresses bug HIVE-2206.
>     https://issues.apache.org/jira/browse/HIVE-2206
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 9fa9525 
>   conf/hive-default.xml.template f332f3a 
>   ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java 7c4c413 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 18a9bd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 46daeb2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 68302f8 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 0c22141 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 919a140 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1469325 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java edde378 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java d1555e2 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 2bf284d 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 330aa52 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 5a9f064 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java b33d616 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 9a95efd 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 6f8bc47 
>   ql/src/test/queries/clientpositive/correlationoptimizer1.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer2.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer3.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer4.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer5.q PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer1.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer2.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer3.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer4.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer5.q.out PRE-CREATION 
>   ql/src/test/results/compiler/plan/groupby1.q.xml cd0d6e4 
>   ql/src/test/results/compiler/plan/groupby2.q.xml 7b07f02 
>   ql/src/test/results/compiler/plan/groupby3.q.xml a6a1986 
>   ql/src/test/results/compiler/plan/groupby5.q.xml 25e3583 
> 
> Diff: https://reviews.apache.org/r/7126/diff/
> 
> 
> Testing
> -------
> 
> All tests pass.
> 
> 
> Thanks,
> 
> Yin Huai
> 
>