You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by pengcheng xiong <px...@hortonworks.com> on 2017/05/22 22:31:52 UTC

Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/
-----------------------------------------------------------

Review request for hive, Ashutosh Chauhan and Gopal V.


Repository: hive-git


Description
-------

HIVE-16654


Diffs
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 7dedd23591 
  itests/src/test/resources/testconfiguration.properties e23ef6317f 
  ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GroupByOptimizer.java 3233157d8d 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
  ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
  ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
  ql/src/test/results/clientpositive/count_dist_rewrite.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
  ql/src/test/results/clientpositive/groupby_sort_8.q.out 4faa0757cc 
  ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/llap/metadataonly1.q.out 27218cf599 
  ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
  ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
  ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
  ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
  ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
  ql/src/test/results/clientpositive/spark/nullgroup4.q.out 24f0291dec 
  ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
  ql/src/test/results/clientpositive/vector_empty_where.q.out b2dec6d7f6 


Diff: https://reviews.apache.org/r/59468/diff/1/


Testing
-------


Thanks,

pengcheng xiong

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Posted by pengcheng xiong <px...@hortonworks.com>.


> On May 23, 2017, 12:36 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
> > Line 206 (original), 210 (patched)
> > <https://reviews.apache.org/r/59468/diff/1/?file=1727328#file1727328line210>
> >
> >     I think we should call this optimizer here. This way a) its called after ReduceSinkDedup so that there is no chance that extra RS introduced by this optimization does not get removed and b) It won't violate any assumptions made by GroupbyOptimizer.

ColumnPruner will prune some of the columns, which violates the assumption that we make in the rule. Thus, we need to put it before Column Pruner.


> On May 23, 2017, 12:36 a.m., Ashutosh Chauhan wrote:
> > ql/src/test/results/clientpositive/llap/metadataonly1.q.out
> > Line 256 (original), 256-257 (patched)
> > <https://reviews.apache.org/r/59468/diff/1/?file=1727335#file1727335line256>
> >
> >     Earlier plan was better. When metadataonly optimizer can kick in that should.

Yes, but i think the performance gain is minor as the table is empty. In NullScanTaskDispatcher, it expects that the GBY is distinct like. However, after our patch, we remove the distinct.


- pengcheng


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/#review175736
-----------------------------------------------------------


On May 22, 2017, 10:31 p.m., pengcheng xiong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59468/
> -----------------------------------------------------------
> 
> (Updated May 22, 2017, 10:31 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Gopal V.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16654
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 7dedd23591 
>   itests/src/test/resources/testconfiguration.properties e23ef6317f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GroupByOptimizer.java 3233157d8d 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
>   ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
>   ql/src/test/results/clientpositive/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
>   ql/src/test/results/clientpositive/groupby_sort_8.q.out 4faa0757cc 
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/llap/metadataonly1.q.out 27218cf599 
>   ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
>   ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
>   ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
>   ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
>   ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
>   ql/src/test/results/clientpositive/spark/nullgroup4.q.out 24f0291dec 
>   ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
>   ql/src/test/results/clientpositive/vector_empty_where.q.out b2dec6d7f6 
> 
> 
> Diff: https://reviews.apache.org/r/59468/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> pengcheng xiong
> 
>

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Posted by Ashutosh Chauhan <ha...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/#review175736
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
Line 206 (original), 210 (patched)
<https://reviews.apache.org/r/59468/#comment249092>

    I think we should call this optimizer here. This way a) its called after ReduceSinkDedup so that there is no chance that extra RS introduced by this optimization does not get removed and b) It won't violate any assumptions made by GroupbyOptimizer.



ql/src/test/queries/clientpositive/count_dist_rewrite.q
Lines 24 (patched)
<https://reviews.apache.org/r/59468/#comment249094>

    Can you add few tests when group by is on some key e.g. select max(key), count(distinct key), min(key), avg(key) from src group by value;



ql/src/test/results/clientpositive/llap/metadataonly1.q.out
Line 256 (original), 256-257 (patched)
<https://reviews.apache.org/r/59468/#comment249095>

    Earlier plan was better. When metadataonly optimizer can kick in that should.


- Ashutosh Chauhan


On May 22, 2017, 10:31 p.m., pengcheng xiong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59468/
> -----------------------------------------------------------
> 
> (Updated May 22, 2017, 10:31 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Gopal V.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16654
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 7dedd23591 
>   itests/src/test/resources/testconfiguration.properties e23ef6317f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GroupByOptimizer.java 3233157d8d 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
>   ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
>   ql/src/test/results/clientpositive/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
>   ql/src/test/results/clientpositive/groupby_sort_8.q.out 4faa0757cc 
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/llap/metadataonly1.q.out 27218cf599 
>   ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
>   ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
>   ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
>   ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
>   ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
>   ql/src/test/results/clientpositive/spark/nullgroup4.q.out 24f0291dec 
>   ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
>   ql/src/test/results/clientpositive/vector_empty_where.q.out b2dec6d7f6 
> 
> 
> Diff: https://reviews.apache.org/r/59468/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> pengcheng xiong
> 
>

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Posted by pengcheng xiong <px...@hortonworks.com>.


> On May 23, 2017, 5 p.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
> > Lines 61 (patched)
> > <https://reviews.apache.org/r/59468/diff/1/?file=1727326#file1727326line61>
> >
> >     Comment: Queries of form : select max(c), count(distinct c) from T; generates a plan of form TS->mGBy->RS->rGBy->FS 
> >     This plan suffers from a problem that vertex containing rGBy->FS necessarily need to have 1 task. This limitation results in slow execution because that task gets all the data. 
> >     This optimization if successful will rewrite above plan to TS->mGby->RS->mGby2->RS->rGBy->FS This introduces extra vertex of mGby2->RS Note this vertex can have multiple tasks and since we are doing aggregation, output of this must necessarily be smaller than its input, which results in much less data going in to rGby->FS vertex, which continues to have single task.
> >     Also note on calcite tree we have HiveExpandDistinctAggregatesRule rule which does similiar plan transformation but has different conditions which needs to be satisified.
> >     Additionally, we don't do any costing here but this is possibly that this transformation may slow down query a bit since if data is small enough to fit in a single task of last reducer, injecting additional vertex in pipeline may make query slower.

Thanks for the detailed comments.


> On May 23, 2017, 5 p.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
> > Lines 313 (patched)
> > <https://reviews.apache.org/r/59468/diff/1/?file=1727326#file1727326line313>
> >
> >     This should be PARTIAL2 mode as well, since GBy operator is running in Partial2 mode.

partial2 is expecting integer as input. However, here we are counting key_col0, which is a string. Thus, hash is more appropriate.


- pengcheng


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/#review175801
-----------------------------------------------------------


On May 25, 2017, 4:03 a.m., pengcheng xiong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59468/
> -----------------------------------------------------------
> 
> (Updated May 25, 2017, 4:03 a.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Gopal V.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16654
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 2dfc8b6f89 
>   itests/src/test/resources/testconfiguration.properties 47a13c93b9 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
>   ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
>   ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
>   ql/src/test/results/clientpositive/groupby_sort_8.q.out 4faa0757cc 
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
>   ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
>   ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
>   ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
>   ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
>   ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
>   ql/src/test/results/clientpositive/vector_empty_where.q.out b2dec6d7f6 
> 
> 
> Diff: https://reviews.apache.org/r/59468/diff/2/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> pengcheng xiong
> 
>

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Posted by Ashutosh Chauhan <ha...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/#review175801
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 61 (patched)
<https://reviews.apache.org/r/59468/#comment249127>

    Comment: Queries of form : select max(c), count(distinct c) from T; generates a plan of form TS->mGBy->RS->rGBy->FS 
    This plan suffers from a problem that vertex containing rGBy->FS necessarily need to have 1 task. This limitation results in slow execution because that task gets all the data. 
    This optimization if successful will rewrite above plan to TS->mGby->RS->mGby2->RS->rGBy->FS This introduces extra vertex of mGby2->RS Note this vertex can have multiple tasks and since we are doing aggregation, output of this must necessarily be smaller than its input, which results in much less data going in to rGby->FS vertex, which continues to have single task.
    Also note on calcite tree we have HiveExpandDistinctAggregatesRule rule which does similiar plan transformation but has different conditions which needs to be satisified.
    Additionally, we don't do any costing here but this is possibly that this transformation may slow down query a bit since if data is small enough to fit in a single task of last reducer, injecting additional vertex in pipeline may make query slower.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 121 (patched)
<https://reviews.apache.org/r/59468/#comment249136>

    Unused field.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 122 (patched)
<https://reviews.apache.org/r/59468/#comment249123>

    Comment : Position of distinct column in aggregator list of map Gby before rewrite.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 135 (patched)
<https://reviews.apache.org/r/59468/#comment249134>

    Should this be cntDist > 1 ?



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 156-158 (patched)
<https://reviews.apache.org/r/59468/#comment249130>

    This seems redundant since we already checked for cntDist != in loop.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 159 (patched)
<https://reviews.apache.org/r/59468/#comment249133>

    Some extra safety checks:
    1) mGby is in hash mode.
    2) rGby is in mergepartial mode.
    3) RS.getKeys().size() =1 
    4) RS partition column size = 1
    5) RS sort col size = 1.
    6) mGby has no grouping sets.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 197 (patched)
<https://reviews.apache.org/r/59468/#comment249135>

    Comment : distinct is at lost position.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 313 (patched)
<https://reviews.apache.org/r/59468/#comment249137>

    This should be PARTIAL2 mode as well, since GBy operator is running in Partial2 mode.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 405 (patched)
<https://reviews.apache.org/r/59468/#comment249122>

    throw new SemanticException(e);



ql/src/java/org/apache/hadoop/hive/ql/optimizer/GroupByOptimizer.java
Line 538 (original), 538-546 (patched)
<https://reviews.apache.org/r/59468/#comment249138>

    This change may not be needed if we run Count distinct optimization after this has alreday run.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
Lines 75 (patched)
<https://reviews.apache.org/r/59468/#comment249141>

    Also, lets call this optimizaiton only for Tez.


- Ashutosh Chauhan


On May 22, 2017, 10:31 p.m., pengcheng xiong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59468/
> -----------------------------------------------------------
> 
> (Updated May 22, 2017, 10:31 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Gopal V.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16654
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 7dedd23591 
>   itests/src/test/resources/testconfiguration.properties e23ef6317f 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GroupByOptimizer.java 3233157d8d 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
>   ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
>   ql/src/test/results/clientpositive/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
>   ql/src/test/results/clientpositive/groupby_sort_8.q.out 4faa0757cc 
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/llap/metadataonly1.q.out 27218cf599 
>   ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
>   ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
>   ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
>   ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
>   ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
>   ql/src/test/results/clientpositive/spark/nullgroup4.q.out 24f0291dec 
>   ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
>   ql/src/test/results/clientpositive/vector_empty_where.q.out b2dec6d7f6 
> 
> 
> Diff: https://reviews.apache.org/r/59468/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> pengcheng xiong
> 
>

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Posted by Ashutosh Chauhan <ha...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/#review176086
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 337 (patched)
<https://reviews.apache.org/r/59468/#comment249410>

    Comment : All aggregations for this GBy run in PARTIAL2 mode.



ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java
Lines 372 (patched)
<https://reviews.apache.org/r/59468/#comment249411>

    Although this is semantically correct, to achieve data reduction we need to use a constant key since we are now doing aggregations across keys in this GBy.



ql/src/test/queries/clientpositive/count_dist_rewrite.q
Lines 59-61 (patched)
<https://reviews.apache.org/r/59468/#comment249414>

    Optimization will not fire for these cases. No advantage for adding this variant. Lets remove them. Sorry I suggested them earlier.



ql/src/test/results/clientpositive/groupby_sort_11.q.out
Lines 28-29 (original), 28-30 (patched)
<https://reviews.apache.org/r/59468/#comment249415>

    In this case original plan was better, since table T1 is bucketed and sorted, all values for keys are together in Mapper, so there is no need to distinct computation. Note the absence of distinct index in rGBy of original plan. This transformation was done by GroupbyOptimizer.
    We should check if count(DISTINCT KEY._col0:0._col0) is present in rGby too in new optimizer and fire it only if its present.



ql/src/test/results/clientpositive/groupby_sort_8.q.out
Line 40 (original), 40-41 (patched)
<https://reviews.apache.org/r/59468/#comment249418>

    Previous plan was better because of reason described earlier.



ql/src/test/results/clientpositive/nullgroup4.q.out
Line 41 (original), 64 (patched)
<https://reviews.apache.org/r/59468/#comment249419>

    Mode: partials? Expected?



ql/src/test/results/clientpositive/udf_count.q.out
Lines 77 (patched)
<https://reviews.apache.org/r/59468/#comment249420>

    No mGby2 in plan. Probably removed by ReduceSinkDedup since keys are same for 2 Gbys. As I suggested earlier second Gby should have constant key and then it should show up here.


- Ashutosh Chauhan


On May 25, 2017, 4:03 a.m., pengcheng xiong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59468/
> -----------------------------------------------------------
> 
> (Updated May 25, 2017, 4:03 a.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Gopal V.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16654
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 2dfc8b6f89 
>   itests/src/test/resources/testconfiguration.properties 47a13c93b9 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
>   ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
>   ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
>   ql/src/test/results/clientpositive/groupby_sort_8.q.out 4faa0757cc 
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
>   ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
>   ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
>   ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
>   ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
>   ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
>   ql/src/test/results/clientpositive/vector_empty_where.q.out b2dec6d7f6 
> 
> 
> Diff: https://reviews.apache.org/r/59468/diff/2/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> pengcheng xiong
> 
>

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Posted by Ashutosh Chauhan <ha...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/#review176486
-----------------------------------------------------------




ql/src/test/results/clientpositive/perf/query16.q.out
Lines 3-5 (original), 3-5 (patched)
<https://reviews.apache.org/r/59468/#comment249839>

    yes.. you are correct. My bad.


- Ashutosh Chauhan


On May 27, 2017, 2:20 a.m., pengcheng xiong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59468/
> -----------------------------------------------------------
> 
> (Updated May 27, 2017, 2:20 a.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Gopal V.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16654
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 2dfc8b6f89 
>   itests/src/test/resources/testconfiguration.properties 47a13c93b9 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
>   ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
>   ql/src/test/results/clientpositive/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
>   ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
>   ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
>   ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
>   ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
>   ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
> 
> 
> Diff: https://reviews.apache.org/r/59468/diff/3/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> pengcheng xiong
> 
>

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Posted by pengcheng xiong <px...@hortonworks.com>.


> On May 27, 2017, 4:41 p.m., Ashutosh Chauhan wrote:
> > ql/src/test/queries/clientpositive/count_dist_rewrite.q
> > Lines 63-65 (patched)
> > <https://reviews.apache.org/r/59468/diff/3/?file=1733999#file1733999line63>
> >
> >     As mentioned previously, lets delete these tests.

I assume that previoulsy you want some negative test for this? no?


> On May 27, 2017, 4:41 p.m., Ashutosh Chauhan wrote:
> > ql/src/test/results/clientpositive/perf/query16.q.out
> > Lines 3-5 (original), 3-5 (patched)
> > <https://reviews.apache.org/r/59468/diff/3/?file=1734004#file1734004line3>
> >
> >     Optimization shouldn't have fired in this case. Aggregations are on different columns.

IMHO, i think it should fire in this case. In this case and the following ones, there will be a single reducer producing a single row with constant group by key, i.e., everything should go to the same group. After the patch, in the first stage, we just introduce the partial result with group by the distinct column. Then in the second stage, we aggregate all of the partial results together. I think this is exactly what you want previously, i.e., use extra stage to reduce the result step by step. Please correct me if my understanding is wrong... FYI, we also have test cases in count_dist_rewrite.q to cover this.


- pengcheng


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/#review176243
-----------------------------------------------------------


On May 27, 2017, 2:20 a.m., pengcheng xiong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59468/
> -----------------------------------------------------------
> 
> (Updated May 27, 2017, 2:20 a.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Gopal V.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16654
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 2dfc8b6f89 
>   itests/src/test/resources/testconfiguration.properties 47a13c93b9 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
>   ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
>   ql/src/test/results/clientpositive/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
>   ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
>   ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
>   ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
>   ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
>   ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
> 
> 
> Diff: https://reviews.apache.org/r/59468/diff/3/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> pengcheng xiong
> 
>

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Posted by Ashutosh Chauhan <ha...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/#review176243
-----------------------------------------------------------




ql/src/test/queries/clientpositive/count_dist_rewrite.q
Lines 63-65 (patched)
<https://reviews.apache.org/r/59468/#comment249600>

    As mentioned previously, lets delete these tests.



ql/src/test/results/clientpositive/perf/query16.q.out
Lines 3-5 (original), 3-5 (patched)
<https://reviews.apache.org/r/59468/#comment249601>

    Optimization shouldn't have fired in this case. Aggregations are on different columns.



ql/src/test/results/clientpositive/perf/query16.q.out
Line 93 (original), 100 (patched)
<https://reviews.apache.org/r/59468/#comment249602>

    Optimization shouldn't have fired in this case. Aggregations are on different columns.



ql/src/test/results/clientpositive/perf/query94.q.out
Line 1 (original), 1 (patched)
<https://reviews.apache.org/r/59468/#comment249603>

    Optimization shouldn't have fired in this case. Aggregations are on columns different than keys.



ql/src/test/results/clientpositive/perf/query94.q.out
Line 29 (original), 36 (patched)
<https://reviews.apache.org/r/59468/#comment249604>

    Optimization shouldn't have fired in this case. Aggregations are on columns different than keys.



ql/src/test/results/clientpositive/perf/query95.q.out
Line 1 (original), 1 (patched)
<https://reviews.apache.org/r/59468/#comment249605>

    Optimization shouldn't have fired in this case. Aggregations are on columns different than keys.


- Ashutosh Chauhan


On May 27, 2017, 2:20 a.m., pengcheng xiong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59468/
> -----------------------------------------------------------
> 
> (Updated May 27, 2017, 2:20 a.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Gopal V.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16654
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 2dfc8b6f89 
>   itests/src/test/resources/testconfiguration.properties 47a13c93b9 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
>   ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
>   ql/src/test/results/clientpositive/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
>   ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
>   ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
>   ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
>   ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
>   ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
>   ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
> 
> 
> Diff: https://reviews.apache.org/r/59468/diff/3/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> pengcheng xiong
> 
>

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Posted by pengcheng xiong <px...@hortonworks.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/
-----------------------------------------------------------

(Updated May 27, 2017, 2:20 a.m.)


Review request for hive, Ashutosh Chauhan and Gopal V.


Repository: hive-git


Description
-------

HIVE-16654


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 2dfc8b6f89 
  itests/src/test/resources/testconfiguration.properties 47a13c93b9 
  ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
  ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
  ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
  ql/src/test/results/clientpositive/count_dist_rewrite.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
  ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
  ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
  ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
  ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
  ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
  ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 


Diff: https://reviews.apache.org/r/59468/diff/3/

Changes: https://reviews.apache.org/r/59468/diff/2-3/


Testing
-------


Thanks,

pengcheng xiong

Re: Review Request 59468: Optimize a combination of avg(), sum(), count(distinct) etc

Posted by pengcheng xiong <px...@hortonworks.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59468/
-----------------------------------------------------------

(Updated May 25, 2017, 4:03 a.m.)


Review request for hive, Ashutosh Chauhan and Gopal V.


Repository: hive-git


Description
-------

HIVE-16654


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 2dfc8b6f89 
  itests/src/test/resources/testconfiguration.properties 47a13c93b9 
  ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 8b04cd44fa 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java 7dace9076f 
  ql/src/java/org/apache/hadoop/hive/ql/plan/GroupByDesc.java 38a9ef2af1 
  ql/src/test/queries/clientpositive/count_dist_rewrite.q PRE-CREATION 
  ql/src/test/results/clientpositive/groupby_sort_11.q.out 2b3bf4a07a 
  ql/src/test/results/clientpositive/groupby_sort_8.q.out 4faa0757cc 
  ql/src/test/results/clientpositive/llap/count_dist_rewrite.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/nullgroup4.q.out e5a8eeee14 
  ql/src/test/results/clientpositive/perf/query16.q.out cf90c0c162 
  ql/src/test/results/clientpositive/perf/query28.q.out 78129cf68b 
  ql/src/test/results/clientpositive/perf/query94.q.out 836b16bf9f 
  ql/src/test/results/clientpositive/perf/query95.q.out fa94d0842b 
  ql/src/test/results/clientpositive/udf_count.q.out f60ad0485e 
  ql/src/test/results/clientpositive/vector_empty_where.q.out b2dec6d7f6 


Diff: https://reviews.apache.org/r/59468/diff/2/

Changes: https://reviews.apache.org/r/59468/diff/1-2/


Testing
-------


Thanks,

pengcheng xiong