Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/08/27 15:58:35 UTC

[GitHub] [hive] kgyrtkirk opened a new pull request #1439: HIVE-24084 cost aggr

kgyrtkirk opened a new pull request #1439:
URL: https://github.com/apache/hive/pull/1439


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   
   * changes to cost estimations - fixed a few bugs
   * changed the Aggregate estimation to a different one in HiveOnTezCostModel
   * in HiveAggregateJoinTransposeRule, relaxed the comparison with the new plan to use <= instead of < (see the sketch below)
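
   A restated sketch of that relaxation (illustrative guard only; `isLt`/`isLe` are Calcite's `RelOptCost` comparisons, and the review below ends up pairing this with a "force" flag instead):

      // before: only strictly cheaper rewritten plans were accepted
      boolean accept = afterCost.isLt(beforeCost);
      // relaxed: equal-cost plans are accepted too, which matters when the
      // simple cost model computes identical row counts for both plan shapes
      boolean relaxedAccept = afterCost.isLe(beforeCost);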
   
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducible example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   




[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r484419776



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       this method recursively checks whether the above condition is satisfied - that's why it needs to call itself
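
   A self-contained sketch of the recursion, using a hypothetical plan model rather than Calcite types (the real method additionally intersects the grouping with the join keys and splits it per join side):

      import java.util.List;
      import java.util.Set;

      // Hypothetical model: a node either reports unique keys directly, or is
      // a join whose grouping can be proven unique by recursing into an input.
      final class UniquenessSketch {
        interface Node {
          Set<Set<String>> uniqueKeys();
          List<Node> joinInputs();   // empty for non-join nodes
        }

        static boolean isGroupingUnique(Node node, Set<String> groups) {
          if (groups.isEmpty()) {
            return false;
          }
          for (Set<String> key : node.uniqueKeys()) {
            if (groups.containsAll(key)) {
              return true;           // the grouping covers a known unique key
            }
          }
          for (Node input : node.joinInputs()) {
            if (isGroupingUnique(input, groups)) {
              return true;           // walk down through joins - the recursion above
            }
          }
          return false;
        }
      }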






[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r485001912



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);

Review comment:
       yes; I've explored using `areColumnsUnique` because it matches the use case here - however, for some tests it emitted NPEs, so I've gone back to the `getUniqueKeys` approach
   I'll file a JIRA for `areColumnsUnique` when I know what's wrong with it.
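
   For reference, the explored `areColumnsUnique` variant would look roughly like this (a sketch, not the merged code; the merged patch kept the `getUniqueKeys` loop because of the NPEs mentioned above):

      import org.apache.calcite.rel.RelNode;
      import org.apache.calcite.rel.metadata.RelMetadataQuery;
      import org.apache.calcite.util.ImmutableBitSet;

      // areColumnsUnique returns Boolean and may be null when metadata is
      // unavailable, so null must be treated as "unknown", not as "unique".
      private boolean isGroupingUniqueViaMetadata(RelNode input, ImmutableBitSet groups) {
        RelMetadataQuery mq = input.getCluster().getMetadataQuery();
        return Boolean.TRUE.equals(mq.areColumnsUnique(input, groups));
      }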






[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r487096103



##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       I've added both constraints - it only removed the IS NOT NULL filter
   it seems to me that one of the sum() calls is used as an output and the other is used to filter by > 300 - so both of them are being "used"






[GitHub] [hive] jcamachor commented on a change in pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r487130820



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       OK. If it turns out there are many changes and it may need some time to be fixed, feel free to defer it to a follow-up JIRA and let's merge this one.
   






[GitHub] [hive] jcamachor commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482301409



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##########
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) {
     } else {
       final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
       // 1. Sum of input cardinalities
-      final Double rCount = mq.getRowCount(aggregate.getInput());
-      if (rCount == null) {
+      final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+      final Double rowCount = mq.getRowCount(aggregate);
+      if (inputRowCount == null || rowCount == null) {
         return null;
       }
       // 2. CPU cost = sorting cost
-      final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+      final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost();

Review comment:
       I think the problem is that we are also trying to encapsulate the algorithm selection here: the fact that we are grouping in each node before sorting the data (I think this is also somehow reflected in the `isLe` discussion above). However, that is not represented with precision by the current model, since the output row count is supposed to be the output of the final step in the aggregation.
   With regard to reads, there is also the IO part of the cost; I am trying to understand whether some of the cost representation that you are talking about is IO.
   There is some more info about the original formulas that were used to compute this here: https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive
   Can we split this into two patches and have the changes to the cost model on their own? This should also help to discuss this in more detail.
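
   Restating the change in the quoted diff, using the names from the patch (the reading of the extra term is hedged - it appears to model a per-row grouping pass before the sort):

      // before: cost of sorting the whole input
      double cpuBefore = algoUtils.computeSortCPUCost(inputRowCount);
      // after: cost of sorting only the aggregate's output, plus one CPU unit
      // per input row for the pre-sort grouping pass
      double cpuAfter = algoUtils.computeSortCPUCost(rowCount)
          + inputRowCount * algoUtils.getCpuUnitCost();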






[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482092255



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) {
       RelNode r = relBuilder.build();
       RelOptCost afterCost = mq.getCumulativeCost(r);
       RelOptCost beforeCost = mq.getCumulativeCost(aggregate);
-      if (afterCost.isLt(beforeCost)) {

Review comment:
       yes; if we use `isLe`, the current cost model - which only takes row count into account - will prefer pushing aggregates further down
   
   there is another alternative to the `force`-based approach: the rule can be configured to use the more advanced cost model - so that it takes CPU/IO cost into account






[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r484434865



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##########
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) {
     } else {
       final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
       // 1. Sum of input cardinalities
-      final Double rCount = mq.getRowCount(aggregate.getInput());
-      if (rCount == null) {
+      final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+      final Double rowCount = mq.getRowCount(aggregate);
+      if (inputRowCount == null || rowCount == null) {
         return null;
       }
       // 2. CPU cost = sorting cost
-      final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+      final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost();

Review comment:
       sure; I'll open a separate ticket for the cost model changes






[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482123540



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##########
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) {
     } else {
       final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
       // 1. Sum of input cardinalities
-      final Double rCount = mq.getRowCount(aggregate.getInput());
-      if (rCount == null) {
+      final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+      final Double rowCount = mq.getRowCount(aggregate);
+      if (inputRowCount == null || rowCount == null) {
         return null;
       }
       // 2. CPU cost = sorting cost
-      final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+      final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost();
       // 3. IO cost = cost of writing intermediary results to local FS +
       //              cost of reading from local FS for transferring to GBy +
       //              cost of transferring map outputs to GBy operator
       final Double rAverageSize = mq.getAverageRowSize(aggregate.getInput());
       if (rAverageSize == null) {
         return null;
       }
-      final double ioCost = algoUtils.computeSortIOCost(new Pair<Double,Double>(rCount,rAverageSize));
+      final double ioCost = algoUtils.computeSortIOCost(new Pair<Double, Double>(rowCount, rAverageSize));

Review comment:
       if we will be doing a 2-phase group-by, every mapper will do some grouping before it starts emitting; in case `iRC >> oRC` the mappers could eliminate a lot of rows, and they will most likely utilize `O(oRC)` IO
   
   this is an underestimation; I wanted to multiply it by the number of mappers - but I don't think that's known at this point. I can add a config key for a fixed multiplier.
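
   A rough model of the underestimation described above (`numMappers` and `ioUnitCost` are illustrative names, not from the patch; the other names follow the quoted diff):

      // Each of the M mappers can emit at most one row per distinct group, so
      // the shuffled row count is bounded by min(iRC, M * oRC) rather than by
      // oRC alone - which is why charging O(oRC) IO underestimates, and why a
      // fixed multiplier (a config key) is proposed as a stand-in for M.
      double shuffledRows = Math.min(inputRowCount, numMappers * rowCount);
      double ioCost = shuffledRows * rAverageSize * ioUnitCost;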






[GitHub] [hive] kgyrtkirk merged pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
kgyrtkirk merged pull request #1439:
URL: https://github.com/apache/hive/pull/1439


   




[GitHub] [hive] jcamachor commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r479474959



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);

Review comment:
       We could rely on `mq.areColumnsUnique`.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##########
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) {
     } else {
       final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
       // 1. Sum of input cardinalities
-      final Double rCount = mq.getRowCount(aggregate.getInput());
-      if (rCount == null) {
+      final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+      final Double rowCount = mq.getRowCount(aggregate);
+      if (inputRowCount == null || rowCount == null) {
         return null;
       }
       // 2. CPU cost = sorting cost
-      final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+      final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost();

Review comment:
       Not sure about this change. If the algorithm is sort-based, you will still sort the complete input, right?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       This could call `mq.areColumnsUnique` instead of making the recursive call.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) {
       RelNode r = relBuilder.build();
       RelOptCost afterCost = mq.getCumulativeCost(r);
       RelOptCost beforeCost = mq.getCumulativeCost(aggregate);
-      if (afterCost.isLt(beforeCost)) {

Review comment:
       I think you suggested changing this... Maybe `isLe` if we do not introduce an additional aggregate on top?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) {
       RelNode r = relBuilder.build();
       RelOptCost afterCost = mq.getCumulativeCost(r);
       RelOptCost beforeCost = mq.getCumulativeCost(aggregate);
-      if (afterCost.isLt(beforeCost)) {
+      boolean shouldForceTransform = isGroupingUnique(join, aggregate.getGroupSet());
+      if (shouldForceTransform || afterCost.isLt(beforeCost)) {

Review comment:
       Additionally, when we force triggering the transform, does it make sense to verify that we are not creating an aggregate on top (i.e., that we do not end up with an aggregate both before and after the join)?
   That may narrow it down even further, to a case where the pushdown should always be beneficial.
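
   A hypothetical sketch of that narrowing (names follow the quoted diff; `r` is the rewritten plan built by `relBuilder`):

      // Only force the transform when the rewritten plan no longer carries an
      // Aggregate on top, i.e. the pushdown fully replaced the original one.
      boolean aggOnTop = r instanceof org.apache.calcite.rel.core.Aggregate;
      boolean shouldForceTransform =
          isGroupingUnique(join, aggregate.getGroupSet()) && !aggOnTop;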

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) {
       RelNode r = relBuilder.build();
       RelOptCost afterCost = mq.getCumulativeCost(r);
       RelOptCost beforeCost = mq.getCumulativeCost(aggregate);
-      if (afterCost.isLt(beforeCost)) {
+      boolean shouldForceTransform = isGroupingUnique(join, aggregate.getGroupSet());

Review comment:
       The rule is enabled via config, so we will only reach here if it is enabled.
   We should be able to force the transform even if the cost-based variant is disabled.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {
+          return true;
+        }
+        if (isGroupingUnique(r, groupR)) {

Review comment:
       Same as above.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {
+          return true;
+        }
+        if (isGroupingUnique(r, groupR)) {
+          return true;
+        }
+      }
+    }
+    return false;
+  }
+
+  static class SimpleConditionInfo {

Review comment:
       Can you use `JoinInfo.of` for this? It seems it does something very similar.
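
   A sketch of the suggested `JoinInfo` usage (the bit-set assembly is illustrative; `JoinInfo.of` extracts the equi-join keys from the condition, which is what `SimpleConditionInfo` computes for simple equality conditions):

      JoinInfo joinInfo = JoinInfo.of(join.getLeft(), join.getRight(), join.getCondition());
      int lFieldCount = join.getLeft().getRowType().getFieldCount();
      // Right-side keys are shifted past the left field count so the bit set
      // addresses the join's combined row type.
      ImmutableBitSet condFields = ImmutableBitSet.of(joinInfo.leftKeys)
          .union(ImmutableBitSet.of(joinInfo.rightKeys).shift(lFieldCount));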






[GitHub] [hive] jcamachor commented on a change in pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r487134213



##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       Thanks @kgyrtkirk .
   
   This seems to need further exploration; we thought https://issues.apache.org/jira/browse/HIVE-24087 was going to help here. @vineetgarg02, could you take a look at this once this patch is merged? Maybe the shape of the plan is slightly different from the one we anticipated.






[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r487096103



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       That could be done - and I'm sure it was true in this case - but this logic will work better if it can walk down through as many joins as it can: we might have an aggregate on top and a bunch of joins under it, so I feel it will be beneficial to retain the recursion.
   I felt tempted to write a RelMd handler - however I don't think I could just introduce a new one easily.
   RelShuttle doesn't look like a good match either - I'll leave it as a set of `instanceof` calls for now.
   
   I'll upload a new patch to see if digging deeper in the tree can do more or not.

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       I've added both constraints - it only removed the IS NOT NULL filter
   it seems to me that 1 of the sum() is used as an output and the other is being used to filter by >300 - so both of them is being "used"

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       That could be done; and I'm sure it was true in this - but this logic will work better if it could walk down as many joins as it could - we might have an aggregate on top in the meantime a bunch of joins under it...so I feel that it will be beneficial to retain it.
   I feeled tempted to write a RelMd handler - however I don't think I could just introduce a new one easily.
   RelShuttle doesn't look like a good match - I'll leave it as a set of `instanceof` calls for now.
   
   I'll upload a new patch to see if digging deeper in the tree could do more or not.

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       I've added both constraints - it only removed the IS NOT NULL filter
   it seems to me that 1 of the sum() is used as an output and the other is being used to filter by >300 - so both of them is being "used"

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       That could be done; and I'm sure it was true in this - but this logic will work better if it could walk down as many joins as it could - we might have an aggregate on top in the meantime a bunch of joins under it...so I feel that it will be beneficial to retain it.
   I feeled tempted to write a RelMd handler - however I don't think I could just introduce a new one easily.
   RelShuttle doesn't look like a good match - I'll leave it as a set of `instanceof` calls for now.
   
   I'll upload a new patch to see if digging deeper in the tree could do more or not.

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       I've added both constraints - it only removed the IS NOT NULL filter
   it seems to me that 1 of the sum() is used as an output and the other is being used to filter by >300 - so both of them is being "used"

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       That could be done; and I'm sure it was true in this - but this logic will work better if it could walk down as many joins as it could - we might have an aggregate on top in the meantime a bunch of joins under it...so I feel that it will be beneficial to retain it.
   I feeled tempted to write a RelMd handler - however I don't think I could just introduce a new one easily.
   RelShuttle doesn't look like a good match - I'll leave it as a set of `instanceof` calls for now.
   
   I'll upload a new patch to see if digging deeper in the tree could do more or not.

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       I've added both constraints - it only removed the IS NOT NULL filter
   it seems to me that 1 of the sum() is used as an output and the other is being used to filter by >300 - so both of them is being "used"

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       That could be done; and I'm sure it was true in this - but this logic will work better if it could walk down as many joins as it could - we might have an aggregate on top in the meantime a bunch of joins under it...so I feel that it will be beneficial to retain it.
   I feeled tempted to write a RelMd handler - however I don't think I could just introduce a new one easily.
   RelShuttle doesn't look like a good match - I'll leave it as a set of `instanceof` calls for now.
   
   I'll upload a new patch to see if digging deeper in the tree could do more or not.

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       I've added both constraints - it only removed the IS NOT NULL filter
   it seems to me that 1 of the sum() is used as an output and the other is being used to filter by >300 - so both of them is being "used"





----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482119043



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##########
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) {
     } else {
       final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
       // 1. Sum of input cardinalities
-      final Double rCount = mq.getRowCount(aggregate.getInput());
-      if (rCount == null) {
+      final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+      final Double rowCount = mq.getRowCount(aggregate);
+      if (inputRowCount == null || rowCount == null) {
         return null;
       }
       // 2. CPU cost = sorting cost
-      final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+      final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost();

Review comment:
       maybe...I'm trying to catch the case when `inputRowCount >> outputRowCount`; we are also grouping - so it will not be a full sort at all; I was using the above to achieve:
   ```
   log(outputRowCount)*outputRowCount + inputRowCount*COST
   ```
   the rationale behind this is that it really needs to sort `oRC` rows and read `iRC` rows - this could be an underestimation...but `log(iRC)*iRC` was highly overestimating the cost
   
   one alternative for the above could be to use:
   ```
   log(outputRowCount) * inputRowCount
   ```
   
   the rationale behind this:
   we will need to find the place for every input row; but we also know that the output will be at most `outputRowCount` - so it shouldn't take more time to find the place for the actual row than `log(outputRowCount)`
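   
   a minimal, self-contained sketch contrasting the two estimates above (not the patch itself - the row counts and the per-row cost are made-up stand-ins for what `RelMetadataQuery` and the algorithm utils would supply; natural log is used just for illustration):
   ```
   public class AggregateCostSketch {
     public static void main(String[] args) {
       double cpuUnitCost = 1.0;     // assumed per-row CPU cost
       double iRC = 1_000_000d;      // inputRowCount
       double oRC = 1_000d;          // outputRowCount
   
       // variant in the patch: sort only the (small) output, read every input row once
       double patchCost = Math.log(oRC) * oRC + iRC * cpuUnitCost;
   
       // alternative: placing each input row among at most oRC groups costs log(oRC) per row
       double alternativeCost = Math.log(oRC) * iRC;
   
       System.out.println(patchCost + " vs " + alternativeCost);
     }
   }
   ```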
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482243376



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, but whose results are later aggregated again.
+   * This method determines if there are sufficient columns in the grouping which were previously present as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       Could you execute `areColumnsUnique` on the join input then? Wouldn't that simplify this logic?
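   
   (for reference, a sketch of that suggestion - `groupL`/`groupR` are assumed to be computed exactly as in the patch above; `areColumnsUnique` is the standard `RelMetadataQuery` call and may return null when the metadata provider cannot decide:)
   ```
   import org.apache.calcite.rel.core.Join;
   import org.apache.calcite.rel.metadata.RelMetadataQuery;
   import org.apache.calcite.util.ImmutableBitSet;
   
   final class GroupingUniquenessSketch {
     static boolean uniqueOnEitherInput(Join join, ImmutableBitSet groupL, ImmutableBitSet groupR) {
       RelMetadataQuery mq = join.getCluster().getMetadataQuery();
       // TRUE on a side means an aggregate pushed there would regroup already-unique rows
       Boolean left = mq.areColumnsUnique(join.getLeft(), groupL);
       Boolean right = mq.areColumnsUnique(join.getRight(), groupR);
       return Boolean.TRUE.equals(left) || Boolean.TRUE.equals(right);
     }
   }
   ```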




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] vineetgarg02 commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
vineetgarg02 commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r479508618



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##########
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) {
     } else {
       final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
       // 1. Sum of input cardinalities
-      final Double rCount = mq.getRowCount(aggregate.getInput());
-      if (rCount == null) {
+      final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+      final Double rowCount = mq.getRowCount(aggregate);

Review comment:
       Can we change `rowCount` to `outputRowCount`? This will make the change more readable.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##########
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) {
     } else {
       final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
       // 1. Sum of input cardinalities
-      final Double rCount = mq.getRowCount(aggregate.getInput());
-      if (rCount == null) {
+      final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+      final Double rowCount = mq.getRowCount(aggregate);
+      if (inputRowCount == null || rowCount == null) {
         return null;
       }
       // 2. CPU cost = sorting cost
-      final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+      final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + inputRowCount * algoUtils.getCpuUnitCost();
       // 3. IO cost = cost of writing intermediary results to local FS +
       //              cost of reading from local FS for transferring to GBy +
       //              cost of transferring map outputs to GBy operator
       final Double rAverageSize = mq.getAverageRowSize(aggregate.getInput());
       if (rAverageSize == null) {
         return null;
       }
-      final double ioCost = algoUtils.computeSortIOCost(new Pair<Double,Double>(rCount,rAverageSize));
+      final double ioCost = algoUtils.computeSortIOCost(new Pair<Double, Double>(rowCount, rAverageSize));

Review comment:
       `rAverageSize` is based on input row count but `rowCount` is output row count. Is this intended or should average row size be computed based on output row count?
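   
   (a sketch of the variant being asked about, assuming the standard `RelMetadataQuery` calls on the `mq` from the snippet above:)
   ```
   // as in the patch: average row size of the aggregate's input
   final Double inputAvgSize = mq.getAverageRowSize(aggregate.getInput());
   // output-based alternative: average row size of the aggregate itself
   final Double outputAvgSize = mq.getAverageRowSize(aggregate);
   ```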

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, but whose results are later aggregated again.
+   * This method determines if there are sufficient columns in the grouping which were previously present as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);

Review comment:
       If the purpose of this method is to determine whether a given set of columns is unique, you can use `areColumnsUnique` as @jcamachor suggested.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r487130820



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, but whose results are later aggregated again.
+   * This method determines if there are sufficient columns in the grouping which were previously present as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       OK. If it turns out there are many changes and it may need some time to fix, feel free to defer it to a follow-up JIRA and let's merge this one.
   

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       Thanks @kgyrtkirk.
   
   This seems to need further exploration; we thought https://issues.apache.org/jira/browse/HIVE-24087 was going to help here. @vineetgarg02, could you take a look at this once this patch is merged? Maybe the shape of the plan is slightly different from the one we anticipated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r487096103



##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       I've added both constraints - it only removed the IS NOT NULL filter.
   It seems to me that one of the sum() calls is used as an output and the other is used to filter by >300 - so both of them are being "used".

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, but whose results are later aggregated again.
+   * This method determines if there are sufficient columns in the grouping which were previously present as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       That could be done; and I'm sure it was true in this case - but this logic will work better if it can walk down through as many joins as possible - we might have an aggregate on top with a bunch of joins under it...so I feel it will be beneficial to retain it.
   I felt tempted to write a RelMd handler - however I don't think I could easily introduce a new one.
   RelShuttle doesn't look like a good match - I'll leave it as a set of `instanceof` calls for now.
   
   I'll upload a new patch to see whether digging deeper in the tree can do more or not.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r485325338



##########
File path: ql/src/test/results/clientpositive/llap/constraints_optimization.q.out
##########
@@ -2631,13 +2631,12 @@ POSTHOOK: Input: default@customer
 POSTHOOK: Input: default@store_sales
 #### A masked pattern was here ####
 CBO PLAN:
-HiveAggregate(group=[{0}])
-  HiveJoin(condition=[=($0, $8)], joinType=[inner], algorithm=[none], cost=[not available])
-    HiveProject(c_customer_sk=[$0], c_customer_id=[$1], c_first_name=[$8], c_last_name=[$9], c_preferred_cust_flag=[$10], c_birth_country=[$14], c_login=[$15], c_email_address=[$16])
-      HiveTableScan(table=[[default, customer]], table:alias=[customer])
-    HiveProject(ss_customer_sk=[$3])
-      HiveFilter(condition=[IS NOT NULL($3)])
-        HiveTableScan(table=[[default, store_sales]], table:alias=[store_sales])
+HiveSemiJoin(condition=[=($0, $1)], joinType=[semi])

Review comment:
       This is quite neat.

##########
File path: ql/src/test/results/clientpositive/perf/tez/constraints/cbo_query23.q.out
##########
@@ -155,18 +155,19 @@ HiveAggregate(group=[{}], agg#0=[sum($0)])
                                     HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
           HiveProject($f1=[$0])
             HiveFilter(condition=[>($2, 4)])
-              HiveProject(i_item_sk=[$1], d_date=[$0], $f2=[$2])
-                HiveAggregate(group=[{3, 4}], agg#0=[count()])
-                  HiveJoin(condition=[=($1, $4)], joinType=[inner], algorithm=[none], cost=[not available])
-                    HiveJoin(condition=[=($0, $2)], joinType=[inner], algorithm=[none], cost=[not available])
-                      HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2])
-                        HiveFilter(condition=[IS NOT NULL($0)])
-                          HiveTableScan(table=[[default, store_sales]], table:alias=[store_sales])
-                      HiveProject(d_date_sk=[$0], d_date=[$2])
-                        HiveFilter(condition=[IN($6, 1999, 2000, 2001, 2002)])
-                          HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
-                    HiveProject(i_item_sk=[$0], substr=[substr($4, 1, 30)])
-                      HiveTableScan(table=[[default, item]], table:alias=[item])
+              HiveProject(i_item_sk=[$3], d_date=[$1], $f2=[$2])
+                HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[not available])
+                  HiveProject(ss_item_sk=[$0], d_date=[$1], $f2=[$2])
+                    HiveAggregate(group=[{1, 3}], agg#0=[count()])

Review comment:
       Cool!

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, but whose results are later aggregated again.
+   * This method determines if there are sufficient columns in the grouping which were previously present as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       As you go down recursively, you may start finding `HepRelVertex` as the rel node. I think you hit the `instanceof Project` below only for the first project because you create it using the builder.
   That is why I think once you have gone through the first join, you will not hit another join; you could simply call `areColumnsUnique` in these `if` clauses, which could potentially uncover new cases.
   
   I may be wrong though; I just wanted to leave you a final note to make sure what I meant was clear.

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       Can we declare `(L_ORDERKEY, L_LINENUMBER)` as the primary key? That way we will be able to verify whether HIVE-24087 is kicking in and removing the unnecessary join.
   We can also try the variant with a `L_ORDERKEY NOT NULL` constraint.
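   
   (a sketch of that DDL, following the pattern of the constraints above - the constraint name `pk_l` is made up:)
   ```
   alter table lineitem add constraint pk_l primary key (l_orderkey, l_linenumber) disable novalidate rely;
   ```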




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r482090289



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, but whose results are later aggregated again.
+   * This method determines if there are sufficient columns in the grouping which were previously present as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       this method does a slightly different thing - honestly I felt like I was in trouble when I gave it this name :)
   
   this method checks whether the given columns contain a unique column somewhere in the covered joins; (this still sounds fuzzy) so let's take an example
   
   consider:
   ```
   select c_id, sum(i_prize) from customer c join item i on(i.c_id=c.c_id)
   ```
   
   * do an aggregate grouping by the column C_ID; and sum up something
   * below is a join which joins by C_ID
   * asking whether C_ID is a unique column on top of the join is false; but there is a subtree in which C_ID is unique => so if we push the aggregate onto that branch the aggregation will be a no-op
   
   I think this case is not handled by `areColumnsUnique`
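   
   (a sketch of the asymmetry described above, using the standard metadata calls - `join` stands for the join in the example, with bit 0 being customer's C_ID; illustration only:)
   ```
   // on top of the join C_ID repeats once per matching item row, so this is FALSE (or null)
   Boolean onJoin = mq.areColumnsUnique(join, ImmutableBitSet.of(0));
   // on the customer branch C_ID is a key, so this is TRUE -
   // pushing the aggregate onto that branch makes the regrouping a no-op
   Boolean onBranch = mq.areColumnsUnique(join.getLeft(), ImmutableBitSet.of(0));
   ```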




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r487130820



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       OK. If it turns out there are many changes and it may need some time to be fixed, feel free to defer to follow-up JIRA and let's merge this one.
   

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       Thanks @kgyrtkirk .
   
   This seems to need further exploration, we thought https://issues.apache.org/jira/browse/HIVE-24087 was going to help here. @vineetgarg02 , could you take a look at this once this patch is merged? Maybe the shape of the plan is slightly different to the one we anticipated.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       OK. If it turns out there are many changes and it may need some time to be fixed, feel free to defer to follow-up JIRA and let's merge this one.
   

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       Thanks @kgyrtkirk .
   
   This seems to need further exploration, we thought https://issues.apache.org/jira/browse/HIVE-24087 was going to help here. @vineetgarg02 , could you take a look at this once this patch is merged? Maybe the shape of the plan is slightly different to the one we anticipated.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines weather the give grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows; but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which have been present previously as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       OK. If it turns out there are many changes and it may need some time to be fixed, feel free to defer to follow-up JIRA and let's merge this one.
   

##########
File path: ql/src/test/queries/clientpositive/tpch18.q
##########
@@ -0,0 +1,133 @@
+--! qt:dataset:tpch_0_001.customer
+--! qt:dataset:tpch_0_001.lineitem
+--! qt:dataset:tpch_0_001.nation
+--! qt:dataset:tpch_0_001.orders
+--! qt:dataset:tpch_0_001.part
+--! qt:dataset:tpch_0_001.partsupp
+--! qt:dataset:tpch_0_001.region
+--! qt:dataset:tpch_0_001.supplier
+
+
+use tpch_0_001;
+
+set hive.transpose.aggr.join=true;
+set hive.transpose.aggr.join.unique=true;
+set hive.mapred.mode=nonstrict;
+
+create view q18_tmp_cached as
+select
+	l_orderkey,
+	sum(l_quantity) as t_sum_quantity
+from
+	lineitem
+where
+	l_orderkey is not null
+group by
+	l_orderkey;
+
+
+
+explain cbo select
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice,
+sum(l_quantity)
+from
+	customer,
+	orders,
+	q18_tmp_cached t,
+	lineitem l
+where
+c_custkey = o_custkey
+and o_orderkey = t.l_orderkey
+and o_orderkey is not null
+and t.t_sum_quantity > 300
+and o_orderkey = l.l_orderkey
+and l.l_orderkey is not null
+group by
+c_name,
+c_custkey,
+o_orderkey,
+o_orderdate,
+o_totalprice
+order by
+o_totalprice desc,
+o_orderdate
+limit 100;
+
+
+
+select 'add constraints';
+
+alter table orders add constraint pk_o primary key (o_orderkey) disable novalidate rely;
+alter table customer add constraint pk_c primary key (c_custkey) disable novalidate rely;
+

Review comment:
       Thanks @kgyrtkirk .
   
   This seems to need further exploration, we thought https://issues.apache.org/jira/browse/HIVE-24087 was going to help here. @vineetgarg02 , could you take a look at this once this patch is merged? Maybe the shape of the plan is slightly different to the one we anticipated.





----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 cost aggr

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r484439075



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) {
       RelNode r = relBuilder.build();
       RelOptCost afterCost = mq.getCumulativeCost(r);
       RelOptCost beforeCost = mq.getCumulativeCost(aggregate);
-      if (afterCost.isLt(beforeCost)) {
+      boolean shouldForceTransform = isGroupingUnique(join, aggregate.getGroupSet());

Review comment:
       I've added a config, `hive.transpose.aggr.join.unique`, to enable/disable this feature.
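   (For reference, the new tpch18.q test above exercises the flag via `set hive.transpose.aggr.join.unique=true;`.)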




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #1439: HIVE-24084 Push Aggregates thru joins in case it re-groups previously unique columns

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r487114620



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##########
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
     }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, the results of which are later aggregated again.
+   * This method determines whether the grouping contains enough columns which were previously present as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+    if (groups.isEmpty()) {
+      return false;
+    }
+    RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+    Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+    for (ImmutableBitSet u : uKeys) {
+      if (groups.contains(u)) {
+        return true;
+      }
+    }
+    if (input instanceof Join) {
+      Join join = (Join) input;
+      RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+      SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), rexBuilder);
+
+      if (cond.valid) {
+        ImmutableBitSet newGroup = groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+        RelNode l = join.getLeft();
+        RelNode r = join.getRight();
+
+        int joinFieldCount = join.getRowType().getFieldCount();
+        int lFieldCount = l.getRowType().getFieldCount();
+
+        ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+        ImmutableBitSet groupR = newGroup.get(lFieldCount, joinFieldCount).shift(-lFieldCount);
+
+        if (isGroupingUnique(l, groupL)) {

Review comment:
       That could be done; and I'm sure it was true in this case - but this logic will work better if it can walk down through as many joins as possible - we might have an aggregate on top and a whole bunch of joins under it...so I feel that it will be beneficial to retain the recursion.
   I felt tempted to write a RelMd handler - however I don't think I could just introduce a new one easily.
   RelShuttle doesn't look like a good match either - I'll leave it as a set of `instanceof` checks for now.
   
   I'll upload a new patch to see whether digging deeper in the tree can do more or not.
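   
   As an illustration of what "digging deeper" past non-join operators might look like, here is a purely hypothetical sketch (not part of this patch) of handling a `Project` whose grouping columns are plain input references; it assumes the usual Calcite classes `Project`, `RexNode`, `RexInputRef`, and `ImmutableBitSet` already imported in this file's style:
   
   ```java
   // Hypothetical sketch, not from HIVE-24084: translate the grouping bits
   // through a Project's simple column references, then recurse on its input.
   if (input instanceof Project) {
     Project project = (Project) input;
     ImmutableBitSet.Builder translated = ImmutableBitSet.builder();
     for (int g : groups) {
       RexNode expr = project.getProjects().get(g);
       if (!(expr instanceof RexInputRef)) {
         // a computed column cannot be mapped back; give up conservatively
         return false;
       }
       translated.set(((RexInputRef) expr).getIndex());
     }
     return isGroupingUnique(project.getInput(), translated.build());
   }
   ```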




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org