Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/04/21 16:46:00 UTC

[GitHub] [hive] kgyrtkirk opened a new pull request #988: HIVE-23031 rewrite distinct

kgyrtkirk opened a new pull request #988:
URL: https://github.com/apache/hive/pull/988


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org

[GitHub] [hive] kgyrtkirk commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r418107419



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,19 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_BI_ENABLED("hive.optimize.bi.enabled", false,
+        "Enables query rewrites based on approximate functions(sketches)."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.bi.rewrite.countdistinct.enabled",
+        true,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNT_DISTINCT_SKETCH(
+        "hive.optimize.bi.rewrite.countdistinct.sketch", "hll",
+        new StringSet("hll", "cpc", "theta"),

Review comment:
       I've tried it out - I didn't see any exceptions; the MV match for a plain `count(distinct id)` didn't happen...
   When I changed the default algo, no exceptions happened, but matches were made incorrectly - so there could be dragons...
   
   I've removed cpc/theta for now... we can add them back later





[GitHub] [hive] jcamachor commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r414662832



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,12 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.sketches.rewrite.countdistintct.enabled", false,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS("hive.optimize.sketches.rewrite.countdistintct.sketchclass", "hll",

Review comment:
       What about simply `sketch`? Or `sketch type` I guess?
   Family may be confusing because in their documentation they associate families with how they are commonly used, so it seems wider indeed.
   http://datasketches.apache.org/docs/Architecture/SketchFeaturesMatrix.html.





[GitHub] [hive] kgyrtkirk commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r415903603



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DataSketchesFunctions.java
##########
@@ -128,14 +141,26 @@ private void buildCalciteFns() {
           OperandTypes.family(),
           unionFn);
 
+
       unionSFD.setCalciteFunction(unionFn);
       sketchSFD.setCalciteFunction(sketchFn);
+      if (estimateSFD != null) {
+        SqlFunction estimateFn = new HiveSqlFunction(estimateSFD.name,
+            SqlKind.OTHER_FUNCTION,
+            ReturnTypes.explicit(SqlTypeName.DOUBLE),

Review comment:
       it's a little bit more complicated than what we have at other places:
   
   * this should be run without anything being initialized - which means to get a "RexBuilder" I will need to construct that from the ground up....
   * after that I could start registering; but to do that I would need a FunctionInfo, which is created only after the function is "registered"
   * and lastly: I have no info about what argument types the function is expecting (and how many) - when we construct these kinds of things for other functions we do have that...
   
   I somehow feel like I'm locked into a specific thinking right now
   
   





[GitHub] [hive] jcamachor commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r415912339



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,12 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.sketches.rewrite.countdistintct.enabled", false,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS("hive.optimize.sketches.rewrite.countdistintct.sketchclass", "hll",

Review comment:
       Yes, agree, that's better.
   I also like the shorter version for the main config more: `hive.optimize.bi.enabled` 





[GitHub] [hive] jcamachor commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r418110338



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,19 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_BI_ENABLED("hive.optimize.bi.enabled", false,
+        "Enables query rewrites based on approximate functions(sketches)."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.bi.rewrite.countdistinct.enabled",
+        true,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNT_DISTINCT_SKETCH(
+        "hive.optimize.bi.rewrite.countdistinct.sketch", "hll",
+        new StringSet("hll", "cpc", "theta"),

Review comment:
       You are right, failure can still happen when the sketch is stored and the mode changes.
   
   Thanks for making the changes in any case. Let's check in this patch and give priority to the overlay issue, it should not be too difficult to address and will fix all these issues.





[GitHub] [hive] kgyrtkirk commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r415909888



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,12 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.sketches.rewrite.countdistintct.enabled", false,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS("hive.optimize.sketches.rewrite.countdistintct.sketchclass", "hll",

Review comment:
       I've changed it to `sketch type` for now.
   
   note that I'm reluctant to use the word "sketch" because we already have that in the conf's name...
   I think we might consider removing the "sketches" keyword from the conf keys; that way 'sketch' could be used
   
   * `hive.optimize.bi.rewrite.enabled`
     * or..simply: `hive.optimize.bi.enabled`  ?
   * `hive.optimize.bi.rewrite.countdistintct.enabled`
   * `hive.optimize.bi.rewrite.countdistintct.sketch`
   
   now that I've written these down; I kinda like them better than the one with the "sketches" keyword in them - what do you think?





[GitHub] [hive] jcamachor commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r418081401



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,19 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_BI_ENABLED("hive.optimize.bi.enabled", false,
+        "Enables query rewrites based on approximate functions(sketches)."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.bi.rewrite.countdistinct.enabled",
+        true,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNT_DISTINCT_SKETCH(
+        "hive.optimize.bi.rewrite.countdistinct.sketch", "hll",
+        new StringSet("hll", "cpc", "theta"),

Review comment:
       I understand for a single algorithm it will work. However, consider the following scenario:
   - A user enables BI mode and algorithm `hll`.
   - The user creates a MV with count distinct. The MV has stored the count distinct field using `hll`. The SQL statement still has count distinct.
   - We change default algorithm to `cpc` and restart HS2. Thus, when the MV is loaded by HS2, the count distinct is transformed to `cpc`.
   - The user runs a query with count distinct, which transforms to `cpc`, matches the MV... but fails at deserialization time because the sketch stored for the MV is `hll`.
   
   That is why I suggested we could limit the options for algorithms till we have proper support. The risk I see if we do not do that now is that if anyone creates MVs using the different default algorithms, we will not have any way to distinguish between them anymore.
   
   From the two choices that you mention above, I was suggesting the second option, since the main goal of the whole effort is to be able to use these algorithms seamlessly with the MVs. I agree it can be outside of the scope of this change, but let's limit the algorithm choices till then?
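
A toy model of the scenario above, written as a minimal sketch and not as code from the patch. `StoredSketch` and `read` are hypothetical names; the assumption is that a stored sketch could carry the name of the algorithm that produced it, so that a changed default is detected up front instead of failing later at deserialization time.

```java
import java.util.Locale;

/** Toy model of the scenario; none of these types exist in Hive. */
final class StoredSketchMismatch {

  /** A serialized sketch as a materialized view might hold it, tagged here with its algorithm. */
  static final class StoredSketch {
    final String algorithm;
    final byte[] payload;

    StoredSketch(String algorithm, byte[] payload) {
      this.algorithm = algorithm;
      this.payload = payload;
    }
  }

  /** Refuses to read a sketch written by a different algorithm than the one configured. */
  static byte[] read(StoredSketch stored, String configuredAlgorithm) {
    if (!stored.algorithm.equals(configuredAlgorithm.toLowerCase(Locale.ROOT))) {
      throw new IllegalStateException("MV holds a '" + stored.algorithm
          + "' sketch but the rewrite expects '" + configuredAlgorithm + "'");
    }
    return stored.payload;
  }

  public static void main(String[] args) {
    // The MV was built while the default algorithm was hll.
    StoredSketch mvColumn = new StoredSketch("hll", new byte[] {1, 2, 3});
    System.out.println(read(mvColumn, "hll").length);
    try {
      // The default was switched to cpc after the MV was built.
      read(mvColumn, "cpc");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Without such a tag the stored bytes look the same to the planner whichever algorithm wrote them, which is the "no way to distinguish between them" risk raised in the comment.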





[GitHub] [hive] kgyrtkirk commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r416708106



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DataSketchesFunctions.java
##########
@@ -96,19 +97,31 @@ private DataSketchesFunctions() {
     return descriptors;
   }
 
+  public SketchFunctionDescriptor getSketchFunction(String className, String function) {

Review comment:
       Yes it would be great - but I would also like to change the function to be an enum as well - I would like to postpone this to a later patch. I've opened: HIVE-23313





[GitHub] [hive] kgyrtkirk commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r414350369



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,12 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.sketches.rewrite.countdistintct.enabled", false,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS("hive.optimize.sketches.rewrite.countdistintct.sketchclass", "hll",

Review comment:
       actually I'm not 100% satisfied with the "sketch class" name; but it was the best I've come up with so far. Do you have any suggestion - or is it good enough?
   
   some alternatives I was considering:
   * sketch family - I think this is too wide
   * sketch framework
   * sketch class
   * sketch type
   * ?
   
   





[GitHub] [hive] jcamachor commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r414266001



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,12 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.sketches.rewrite.countdistintct.enabled", false,

Review comment:
       Let's prefix all of them with `hive.optimize.bi`.
   
   Additionally, let's create a general toggle for all of them (`hive.optimize.bi.sketches.rewrite.enabled`?) that is `false` by default. Then individual ones such as `hive.optimize.bi.sketches.rewrite.countdistintct.enabled` are by default `true`.
   The idea is that users can enable the feature with a single change in their property values, and can selectively disable some of the transformations in case there are bugs, they want to test something else, etc.
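
The two-level toggle proposed above could be sketched roughly as follows. This is an illustration under stated assumptions, not Hive's `HiveConf` API: `BiRewriteToggles` is a hypothetical stand-in for a config lookup, and the property names are copied as written in the comment.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a config lookup; not Hive's HiveConf API. */
final class BiRewriteToggles {
  // Property names as quoted in the review comment.
  static final String MASTER = "hive.optimize.bi.sketches.rewrite.enabled";
  static final String COUNT_DISTINCT = "hive.optimize.bi.sketches.rewrite.countdistintct.enabled";

  private final Map<String, Boolean> overrides = new HashMap<>();

  void set(String key, boolean value) {
    overrides.put(key, value);
  }

  boolean get(String key, boolean defaultValue) {
    return overrides.getOrDefault(key, defaultValue);
  }

  /** A rewrite applies only when the general toggle and its own toggle are both on. */
  boolean countDistinctRewriteActive() {
    // Defaults follow the proposal: general toggle false, individual toggle true.
    return get(MASTER, false) && get(COUNT_DISTINCT, true);
  }

  public static void main(String[] args) {
    BiRewriteToggles conf = new BiRewriteToggles();
    System.out.println(conf.countDistinctRewriteActive()); // false: general toggle is off by default
    conf.set(MASTER, true);
    System.out.println(conf.countDistinctRewriteActive()); // true: one property change enables it
    conf.set(COUNT_DISTINCT, false);
    System.out.println(conf.countDistinctRewriteActive()); // false: this rewrite is disabled selectively
  }
}
```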

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,12 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.sketches.rewrite.countdistintct.enabled", false,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS("hive.optimize.sketches.rewrite.countdistintct.sketchclass", "hll",

Review comment:
       Let's limit the sketch class options with `StringSet` to those that are valid.
   
   Additionally, can we add a comment in the description about what a 'sketch class' means?
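
To illustrate the idea of restricting a string-valued option to a fixed set, here is a minimal sketch in plain Java. `AllowedValues` is a hypothetical class that only mimics the behaviour; it is not Hive's `StringSet`, and the three values are the ones that appear in a later revision of this patch.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical validator; it only mimics the idea of a restricted string set. */
final class AllowedValues {
  private final Set<String> allowed;

  AllowedValues(String... values) {
    this.allowed = new LinkedHashSet<>(Arrays.asList(values));
  }

  /** Returns null when the value is accepted, otherwise an error message. */
  String validate(String value) {
    if (value != null && allowed.contains(value.toLowerCase(Locale.ROOT))) {
      return null;
    }
    return "Invalid value '" + value + "', expected one of " + allowed;
  }

  public static void main(String[] args) {
    AllowedValues sketchClasses = new AllowedValues("hll", "cpc", "theta");
    System.out.println(sketchClasses.validate("hll"));   // accepted, prints null
    System.out.println(sketchClasses.validate("bloom")); // rejected, prints an error message
  }
}
```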

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DataSketchesFunctions.java
##########
@@ -128,14 +141,26 @@ private void buildCalciteFns() {
           OperandTypes.family(),
           unionFn);
 
+
       unionSFD.setCalciteFunction(unionFn);
       sketchSFD.setCalciteFunction(sketchFn);
+      if (estimateSFD != null) {
+        SqlFunction estimateFn = new HiveSqlFunction(estimateSFD.name,
+            SqlKind.OTHER_FUNCTION,
+            ReturnTypes.explicit(SqlTypeName.DOUBLE),

Review comment:
       If this is a UDF, we should probably dynamically generate the return type from it as we do for other UDFs?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRewriteCountDistinctToDataSketches.java
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.rel.RelCollation;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Aggregate;
+import org.apache.calcite.rel.core.AggregateCall;
+import org.apache.calcite.rel.core.RelFactories.AggregateFactory;
+import org.apache.calcite.rel.core.RelFactories.ProjectFactory;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlAggFunction;
+import org.apache.calcite.sql.SqlOperator;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.conf.HiveConf.ConfVars;
+import org.apache.hadoop.hive.ql.exec.DataSketchesFunctions;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate;
+import org.apache.hive.plugin.api.HiveUDFPlugin.UDFDescriptor;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.common.collect.ImmutableList;
+
+/**
+ * This rule could rewrite {@code count(distinct(x))} calls to be calculated using sketch based functions.
+ */
+public final class HiveRewriteCountDistinctToDataSketches extends RelOptRule {
+
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveRewriteCountDistinctToDataSketches.class);
+  private String sketchClass;
+
+  public HiveRewriteCountDistinctToDataSketches(HiveConf conf) {
+    super(operand(HiveAggregate.class, any()));
+    sketchClass = conf.getVar(ConfVars.HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS);
+  }
+
+  @Override
+  public void onMatch(RelOptRuleCall call) {
+    final Aggregate aggregate = call.rel(0);
+
+    if (aggregate.getGroupSets().size() != 1) {
+      // not yet supported
+      return;
+    }
+
+    List<AggregateCall> newAggCalls = new ArrayList<AggregateCall>();
+
+    AggregateFactory f = HiveRelFactories.HIVE_AGGREGATE_FACTORY;

Review comment:
       I guess you are not passing the builder because it would incur a penalty on every rule instantiation?
   That is perfect but maybe we can set these factories in the constructor so the rest remains generic? Additionally, we could pass the `sketchClass` enum directly instead of the full HiveConf.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DataSketchesFunctions.java
##########
@@ -96,19 +97,31 @@ private DataSketchesFunctions() {
     return descriptors;
   }
 
+  public SketchFunctionDescriptor getSketchFunction(String className, String function) {

Review comment:
       Let's make className an `enum`, it will be neat.
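
One possible shape for such an enum, as a sketch only. `SketchType`, `fromConf` and `functionName` are hypothetical names, and the `ds_<type>_<function>` naming pattern is an assumption made for illustration; the constants mirror the sketch names mentioned in this thread.

```java
import java.util.Locale;

/** Hypothetical enum; the constant names mirror the sketch names that appear in this thread. */
enum SketchType {
  HLL, CPC, THETA;

  /** Parses a configuration string such as "hll"; throws IllegalArgumentException on unknown names. */
  static SketchType fromConf(String name) {
    return valueOf(name.trim().toUpperCase(Locale.ROOT));
  }

  /** Builds a function name; the "ds_<type>_<function>" pattern is assumed for illustration. */
  String functionName(String function) {
    return "ds_" + name().toLowerCase(Locale.ROOT) + "_" + function;
  }

  public static void main(String[] args) {
    SketchType type = SketchType.fromConf("hll");
    System.out.println(type.functionName("sketch"));
  }
}
```

With an enum, an unknown name fails at the point where the configuration is parsed, and the compiler can check every caller that used to pass a free-form string.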

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRewriteCountDistinctToDataSketches.java
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.rel.RelCollation;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Aggregate;
+import org.apache.calcite.rel.core.AggregateCall;
+import org.apache.calcite.rel.core.RelFactories.AggregateFactory;
+import org.apache.calcite.rel.core.RelFactories.ProjectFactory;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlAggFunction;
+import org.apache.calcite.sql.SqlOperator;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.conf.HiveConf.ConfVars;
+import org.apache.hadoop.hive.ql.exec.DataSketchesFunctions;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate;
+import org.apache.hive.plugin.api.HiveUDFPlugin.UDFDescriptor;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.common.collect.ImmutableList;
+
+/**
+ * This rule could rewrite {@code count(distinct(x))} calls to be calculated using sketch based functions.
+ */
+public final class HiveRewriteCountDistinctToDataSketches extends RelOptRule {
+
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveRewriteCountDistinctToDataSketches.class);
+  private String sketchClass;
+
+  public HiveRewriteCountDistinctToDataSketches(HiveConf conf) {
+    super(operand(HiveAggregate.class, any()));
+    sketchClass = conf.getVar(ConfVars.HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS);
+  }
+
+  @Override
+  public void onMatch(RelOptRuleCall call) {
+    final Aggregate aggregate = call.rel(0);
+
+    if (aggregate.getGroupSets().size() != 1) {
+      // not yet supported
+      return;
+    }
+
+    List<AggregateCall> newAggCalls = new ArrayList<AggregateCall>();
+
+    AggregateFactory f = HiveRelFactories.HIVE_AGGREGATE_FACTORY;

Review comment:
       It seems this factory is never used?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRewriteCountDistinctToDataSketches.java
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.rel.RelCollation;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Aggregate;
+import org.apache.calcite.rel.core.AggregateCall;
+import org.apache.calcite.rel.core.RelFactories.AggregateFactory;
+import org.apache.calcite.rel.core.RelFactories.ProjectFactory;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlAggFunction;
+import org.apache.calcite.sql.SqlOperator;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.conf.HiveConf.ConfVars;
+import org.apache.hadoop.hive.ql.exec.DataSketchesFunctions;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate;
+import org.apache.hive.plugin.api.HiveUDFPlugin.UDFDescriptor;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.common.collect.ImmutableList;
+
+/**
+ * This rule could rewrite {@code count(distinct(x))} calls to be calculated using sketch based functions.
+ */
+public final class HiveRewriteCountDistinctToDataSketches extends RelOptRule {
+
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveRewriteCountDistinctToDataSketches.class);
+  private String sketchClass;
+
+  public HiveRewriteCountDistinctToDataSketches(HiveConf conf) {
+    super(operand(HiveAggregate.class, any()));
+    sketchClass = conf.getVar(ConfVars.HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS);
+  }
+
+  @Override
+  public void onMatch(RelOptRuleCall call) {
+    final Aggregate aggregate = call.rel(0);
+
+    if (aggregate.getGroupSets().size() != 1) {
+      // not yet supported
+      return;
+    }
+
+    List<AggregateCall> newAggCalls = new ArrayList<AggregateCall>();
+
+    AggregateFactory f = HiveRelFactories.HIVE_AGGREGATE_FACTORY;
+
+    VBuilder vb = new VBuilder(aggregate);
+
+    ProjectFactory projectFactory = HiveRelFactories.HIVE_PROJECT_FACTORY;
+
+    if (aggregate.getAggCallList().equals(vb.newAggCalls)) {
+      // rule didn't made any changes
+      return;
+    }
+
+    newAggCalls = vb.newAggCalls;
+    // FIXME HiveAggregate?
+    RelNode newAgg = aggregate.copy(aggregate.getTraitSet(), aggregate.getInput(), aggregate.getGroupSet(),
+        aggregate.getGroupSets(), newAggCalls);
+
+    RelNode newProject = projectFactory.createProject(newAgg, vb.newProjects, aggregate.getRowType().getFieldNames());
+
+    call.transformTo(newProject);
+    return;
+  }
+
+  /**
+   * Helper class to help in building a new Aggregate and Project.
+   */
+  // NOTE: methods in this class are not re-entrant; drop-to-frame to constructor during debugging
+  class VBuilder {
+
+    private Aggregate aggregate;
+    private List<AggregateCall> newAggCalls;
+    private List<RexNode> newProjects;
+    private final RexBuilder rexBuilder;
+
+    public VBuilder(Aggregate aggregate) {
+      this.aggregate = aggregate;
+      newAggCalls = new ArrayList<AggregateCall>();
+      newProjects = new ArrayList<RexNode>();
+      rexBuilder = aggregate.getCluster().getRexBuilder();
+
+      // add non-aggregated fields - as identity projections
+      addGroupFields();
+
+      for (AggregateCall aggCall : aggregate.getAggCallList()) {
+        processAggCall(aggCall);
+      }
+    }
+
+    private void addGroupFields() {
+      for (int i = 0; i < aggregate.getGroupCount(); i++) {
+        newProjects.add(rexBuilder.makeInputRef(aggregate, 0));
+      }
+    }
+
+    private void processAggCall(AggregateCall aggCall) {
+      if (isSimpleCountDistinct(aggCall)) {
+        rewriteCountDistinct(aggCall);
+        return;
+      }
+      appendAggCall(aggCall, null);
+    }
+
+    private void appendAggCall(AggregateCall aggCall, SqlOperator projectOperator) {
+      RelDataType origType = aggregate.getRowType().getFieldList().get(newProjects.size()).getType();
+      RexNode projRex = rexBuilder.makeInputRef(aggCall.getType(), newProjects.size());
+      if (projectOperator != null) {
+        projRex = rexBuilder.makeCall(projectOperator, ImmutableList.of(projRex));
+        projRex = rexBuilder.makeCast(origType, projRex);
+      }
+      newAggCalls.add(aggCall);
+      newProjects.add(projRex);
+    }
+
+    private boolean isSimpleCountDistinct(AggregateCall aggCall) {
+      return aggCall.isDistinct() && aggCall.getArgList().size() == 1
+          && aggCall.getAggregation().getName().equalsIgnoreCase("count") && !aggCall.hasFilter();
+    }
+
+    private void rewriteCountDistinct(AggregateCall aggCall) {
+

Review comment:
       nit. newline

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
##########
@@ -1967,6 +1968,13 @@ private RelNode applyPreJoinOrderingTransforms(RelNode basePlan, RelMetadataProv
       generatePartialProgram(program, false, HepMatchOrder.DEPTH_FIRST,
           HiveExceptRewriteRule.INSTANCE);
 
+      // ?

Review comment:
       Can we add a comment here?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
##########
@@ -1967,6 +1968,13 @@ private RelNode applyPreJoinOrderingTransforms(RelNode basePlan, RelMetadataProv
       generatePartialProgram(program, false, HepMatchOrder.DEPTH_FIRST,
           HiveExceptRewriteRule.INSTANCE);
 
+      // ?
+      if (conf.getBoolVar(ConfVars.HIVE_OPTIMIZE_REWRITE_COUNTDISTINCT_ENABLED)) {
+        generatePartialProgram(program, true, HepMatchOrder.TOP_DOWN,
+            new HiveRewriteCountDistinctToDataSketches(conf));
+      }
+
+

Review comment:
       nit. 2 newlines

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
##########
@@ -1967,6 +1968,13 @@ private RelNode applyPreJoinOrderingTransforms(RelNode basePlan, RelMetadataProv
       generatePartialProgram(program, false, HepMatchOrder.DEPTH_FIRST,
           HiveExceptRewriteRule.INSTANCE);
 
+      // ?
+      if (conf.getBoolVar(ConfVars.HIVE_OPTIMIZE_REWRITE_COUNTDISTINCT_ENABLED)) {
+        generatePartialProgram(program, true, HepMatchOrder.TOP_DOWN,
+            new HiveRewriteCountDistinctToDataSketches(conf));

Review comment:
       As mentioned above, 1) let's use the general flag + specific one, and 2) let's not pass the full conf object.
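
   A minimal sketch of what this could look like, in plain Java without the Calcite or Hive classes. `SketchRewriteRuleSketch` and `RewriteGating` are hypothetical names used only for illustration; the two booleans stand for `hive.optimize.bi.enabled` and `hive.optimize.bi.rewrite.countdistinct.enabled`, and the string stands for `hive.optimize.bi.rewrite.countdistinct.sketch`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the rule: it receives only the sketch type, not the whole configuration. */
final class SketchRewriteRuleSketch {
  private final String sketchType;

  SketchRewriteRuleSketch(String sketchType) {
    this.sketchType = sketchType;
  }

  String sketchType() {
    return sketchType;
  }
}

/** Hypothetical planner-side gating: the rule is registered only when both flags are on. */
final class RewriteGating {
  static List<SketchRewriteRuleSketch> rulesToApply(
      boolean biEnabled, boolean countDistinctRewriteEnabled, String sketchType) {
    List<SketchRewriteRuleSketch> rules = new ArrayList<>();
    // The general flag acts as a master switch; the specific flag selects this rewrite.
    if (biEnabled && countDistinctRewriteEnabled) {
      rules.add(new SketchRewriteRuleSketch(sketchType));
    }
    return rules;
  }
}
```

   With this shape the planner reads the configuration once and the rule only sees the value it needs, which should also make the rule easier to instantiate in tests.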

##########
File path: ql/src/test/results/clientpositive/llap/sketches_rewrite.q.out
##########
@@ -0,0 +1,110 @@
+PREHOOK: query: create table sketch_input (id int, category char(1))
+STORED AS ORC
+TBLPROPERTIES ('transactional'='true')
+PREHOOK: type: CREATETABLE
+PREHOOK: Output: database:default
+PREHOOK: Output: default@sketch_input
+POSTHOOK: query: create table sketch_input (id int, category char(1))
+STORED AS ORC
+TBLPROPERTIES ('transactional'='true')
+POSTHOOK: type: CREATETABLE
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@sketch_input
+PREHOOK: query: insert into table sketch_input values
+  (1,'a'),(1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'), (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
+  (6,'b'),(6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'), (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b')
+PREHOOK: type: QUERY
+PREHOOK: Input: _dummy_database@_dummy_table
+PREHOOK: Output: default@sketch_input
+POSTHOOK: query: insert into table sketch_input values
+  (1,'a'),(1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'), (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
+  (6,'b'),(6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'), (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b')
+POSTHOOK: type: QUERY
+POSTHOOK: Input: _dummy_database@_dummy_table
+POSTHOOK: Output: default@sketch_input
+POSTHOOK: Lineage: sketch_input.category SCRIPT []
+POSTHOOK: Lineage: sketch_input.id SCRIPT []
+PREHOOK: query: explain
+select category, count(distinct id) from sketch_input group by category
+PREHOOK: type: QUERY
+PREHOOK: Input: default@sketch_input
+#### A masked pattern was here ####
+POSTHOOK: query: explain
+select category, count(distinct id) from sketch_input group by category
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@sketch_input
+#### A masked pattern was here ####
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+    Tez
+#### A masked pattern was here ####
+      Edges:
+        Reducer 2 <- Map 1 (SIMPLE_EDGE)
+#### A masked pattern was here ####
+      Vertices:
+        Map 1 
+            Map Operator Tree:
+                TableScan
+                  alias: sketch_input
+                  Statistics: Num rows: 22 Data size: 1958 Basic stats: COMPLETE Column stats: COMPLETE
+                  Select Operator
+                    expressions: id (type: int), category (type: char(1))
+                    outputColumnNames: id, category
+                    Statistics: Num rows: 22 Data size: 1958 Basic stats: COMPLETE Column stats: COMPLETE
+                    Group By Operator
+                      aggregations: ds_hll_sketch(id)
+                      keys: category (type: char(1))
+                      minReductionHashAggr: 0.9090909
+                      mode: hash
+                      outputColumnNames: _col0, _col1
+                      Statistics: Num rows: 2 Data size: 946 Basic stats: COMPLETE Column stats: COMPLETE
+                      Reduce Output Operator
+                        key expressions: _col0 (type: char(1))
+                        null sort order: z
+                        sort order: +
+                        Map-reduce partition columns: _col0 (type: char(1))
+                        Statistics: Num rows: 2 Data size: 946 Basic stats: COMPLETE Column stats: COMPLETE
+                        value expressions: _col1 (type: struct<lgk:int,type:string,sketch:binary>)
+            Execution mode: llap
+            LLAP IO: may be used (ACID table)
+        Reducer 2 
+            Execution mode: llap
+            Reduce Operator Tree:
+              Group By Operator
+                aggregations: ds_hll_sketch(VALUE._col0)
+                keys: KEY._col0 (type: char(1))
+                mode: mergepartial
+                outputColumnNames: _col0, _col1
+                Statistics: Num rows: 2 Data size: 458 Basic stats: COMPLETE Column stats: COMPLETE
+                Select Operator
+                  expressions: _col0 (type: char(1)), UDFToLong(ds_hll_estimate(_col1)) (type: bigint)
+                  outputColumnNames: _col0, _col1
+                  Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: COMPLETE
+                  File Output Operator
+                    compressed: false
+                    Statistics: Num rows: 2 Data size: 186 Basic stats: COMPLETE Column stats: COMPLETE
+                    table:
+                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
+                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
+                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+  Stage: Stage-0
+    Fetch Operator
+      limit: -1
+      Processor Tree:
+        ListSink
+
+PREHOOK: query: select category, count(distinct id) from sketch_input group by category
+PREHOOK: type: QUERY
+PREHOOK: Input: default@sketch_input
+#### A masked pattern was here ####
+POSTHOOK: query: select category, count(distinct id) from sketch_input group by category
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@sketch_input
+#### A masked pattern was here ####
+a	10
+b	10

Review comment:
       Let's add a test to `sketches_materialized_view_rollup.q` (or to a new file) with the new `bi` flag on and the following steps:
   1) We create a MV that contains a count distinct over a source table.
   2) Then we explain cbo/execute a query that contains count distinct and should be rewritten to use the MV.
   3) Then we can disable `bi` acceleration and check that the query is not rewritten to use the MV (we enable it again after that).
   4) Then insert new data into the source table.
   5) Trigger a MV rebuild (we need explain to verify that it is incremental).
   6) Trigger the query in 2) again and we should hit the MV again.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRewriteCountDistinctToDataSketches.java
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.rel.RelCollation;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Aggregate;
+import org.apache.calcite.rel.core.AggregateCall;
+import org.apache.calcite.rel.core.RelFactories.AggregateFactory;
+import org.apache.calcite.rel.core.RelFactories.ProjectFactory;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlAggFunction;
+import org.apache.calcite.sql.SqlOperator;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.conf.HiveConf.ConfVars;
+import org.apache.hadoop.hive.ql.exec.DataSketchesFunctions;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate;
+import org.apache.hive.plugin.api.HiveUDFPlugin.UDFDescriptor;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.common.collect.ImmutableList;
+
+/**
+ * This rule could rewrite {@code count(distinct(x))} calls to be calculated using sketch based functions.

Review comment:
       Can we describe the source expr and target expr in the rewriting here? These are complex rewritings with several function calls so it is good to show exactly the transformation that is being executed.
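
   For example, going by the plan in `sketches_rewrite.q.out`, the source and target could be documented roughly as below. This is only a sketch: `CountDistinctRewriteDoc` is a hypothetical helper that renders the target as a string, the `CAST ... AS BIGINT` spelling is my reading of the `UDFToLong` call in the plan, and the `ds_<type>_*` naming is only confirmed for `hll` there.

```java
/**
 * Illustrative description of the rewrite (not the rule itself).
 *
 * <pre>
 *   source: SELECT category, COUNT(DISTINCT id) FROM sketch_input GROUP BY category
 *   target: SELECT category, CAST(ds_hll_estimate(ds_hll_sketch(id)) AS BIGINT)
 *           FROM sketch_input GROUP BY category
 * </pre>
 */
final class CountDistinctRewriteDoc {

  /** Renders the target expression for COUNT(DISTINCT column) and the given sketch type. */
  static String targetExpression(String sketchType, String column) {
    // Aggregate step: build one sketch per group.
    String sketch = "ds_" + sketchType + "_sketch(" + column + ")";
    // Projection step: turn the sketch into an estimate.
    String estimate = "ds_" + sketchType + "_estimate(" + sketch + ")";
    // The estimate is cast back to the type of the original COUNT.
    return "CAST(" + estimate + " AS BIGINT)";
  }
}
```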

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRewriteCountDistinctToDataSketches.java
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.rel.RelCollation;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Aggregate;
+import org.apache.calcite.rel.core.AggregateCall;
+import org.apache.calcite.rel.core.RelFactories.AggregateFactory;
+import org.apache.calcite.rel.core.RelFactories.ProjectFactory;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlAggFunction;
+import org.apache.calcite.sql.SqlOperator;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.conf.HiveConf.ConfVars;
+import org.apache.hadoop.hive.ql.exec.DataSketchesFunctions;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate;
+import org.apache.hive.plugin.api.HiveUDFPlugin.UDFDescriptor;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.common.collect.ImmutableList;
+
+/**
+ * This rule could rewrite {@code count(distinct(x))} calls to be calculated using sketch based functions.
+ */
+public final class HiveRewriteCountDistinctToDataSketches extends RelOptRule {
+
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveRewriteCountDistinctToDataSketches.class);
+  private String sketchClass;
+
+  public HiveRewriteCountDistinctToDataSketches(HiveConf conf) {
+    super(operand(HiveAggregate.class, any()));
+    sketchClass = conf.getVar(ConfVars.HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS);
+  }
+
+  @Override
+  public void onMatch(RelOptRuleCall call) {
+    final Aggregate aggregate = call.rel(0);
+
+    if (aggregate.getGroupSets().size() != 1) {
+      // not yet supported
+      return;
+    }
+
+    List<AggregateCall> newAggCalls = new ArrayList<AggregateCall>();
+
+    AggregateFactory f = HiveRelFactories.HIVE_AGGREGATE_FACTORY;
+
+    VBuilder vb = new VBuilder(aggregate);
+
+    ProjectFactory projectFactory = HiveRelFactories.HIVE_PROJECT_FACTORY;
+
+    if (aggregate.getAggCallList().equals(vb.newAggCalls)) {
+      // rule didn't made any changes
+      return;
+    }
+
+    newAggCalls = vb.newAggCalls;
+    // FIXME HiveAggregate?

Review comment:
       ?

##########
File path: ql/src/test/results/clientpositive/llap/sketches_rewrite.q.out
##########
@@ -0,0 +1,110 @@
+PREHOOK: query: create table sketch_input (id int, category char(1))
+STORED AS ORC
+TBLPROPERTIES ('transactional'='true')
+PREHOOK: type: CREATETABLE
+PREHOOK: Output: database:default
+PREHOOK: Output: default@sketch_input
+POSTHOOK: query: create table sketch_input (id int, category char(1))
+STORED AS ORC
+TBLPROPERTIES ('transactional'='true')
+POSTHOOK: type: CREATETABLE
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@sketch_input
+PREHOOK: query: insert into table sketch_input values
+  (1,'a'),(1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'), (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
+  (6,'b'),(6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'), (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b')
+PREHOOK: type: QUERY
+PREHOOK: Input: _dummy_database@_dummy_table
+PREHOOK: Output: default@sketch_input
+POSTHOOK: query: insert into table sketch_input values
+  (1,'a'),(1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'), (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
+  (6,'b'),(6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'), (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b')
+POSTHOOK: type: QUERY
+POSTHOOK: Input: _dummy_database@_dummy_table
+POSTHOOK: Output: default@sketch_input
+POSTHOOK: Lineage: sketch_input.category SCRIPT []
+POSTHOOK: Lineage: sketch_input.id SCRIPT []
+PREHOOK: query: explain
+select category, count(distinct id) from sketch_input group by category
+PREHOOK: type: QUERY
+PREHOOK: Input: default@sketch_input
+#### A masked pattern was here ####
+POSTHOOK: query: explain
+select category, count(distinct id) from sketch_input group by category
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@sketch_input
+#### A masked pattern was here ####
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+    Tez
+#### A masked pattern was here ####
+      Edges:
+        Reducer 2 <- Map 1 (SIMPLE_EDGE)
+#### A masked pattern was here ####
+      Vertices:
+        Map 1 
+            Map Operator Tree:
+                TableScan
+                  alias: sketch_input
+                  Statistics: Num rows: 22 Data size: 1958 Basic stats: COMPLETE Column stats: COMPLETE
+                  Select Operator
+                    expressions: id (type: int), category (type: char(1))
+                    outputColumnNames: id, category
+                    Statistics: Num rows: 22 Data size: 1958 Basic stats: COMPLETE Column stats: COMPLETE
+                    Group By Operator
+                      aggregations: ds_hll_sketch(id)
+                      keys: category (type: char(1))
+                      minReductionHashAggr: 0.9090909
+                      mode: hash
+                      outputColumnNames: _col0, _col1
+                      Statistics: Num rows: 2 Data size: 946 Basic stats: COMPLETE Column stats: COMPLETE
+                      Reduce Output Operator
+                        key expressions: _col0 (type: char(1))
+                        null sort order: z
+                        sort order: +
+                        Map-reduce partition columns: _col0 (type: char(1))
+                        Statistics: Num rows: 2 Data size: 946 Basic stats: COMPLETE Column stats: COMPLETE
+                        value expressions: _col1 (type: struct<lgk:int,type:string,sketch:binary>)
+            Execution mode: llap
+            LLAP IO: may be used (ACID table)
+        Reducer 2 
+            Execution mode: llap
+            Reduce Operator Tree:
+              Group By Operator
+                aggregations: ds_hll_sketch(VALUE._col0)
+                keys: KEY._col0 (type: char(1))
+                mode: mergepartial
+                outputColumnNames: _col0, _col1
+                Statistics: Num rows: 2 Data size: 458 Basic stats: COMPLETE Column stats: COMPLETE
+                Select Operator
+                  expressions: _col0 (type: char(1)), UDFToLong(ds_hll_estimate(_col1)) (type: bigint)

Review comment:
       nice :)





[GitHub] [hive] jcamachor commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r417606413



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,19 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_BI_ENABLED("hive.optimize.bi.enabled", false,
+        "Enables query rewrites based on approximate functions(sketches)."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.bi.rewrite.countdistinct.enabled",
+        true,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNT_DISTINCT_SKETCH(
+        "hive.optimize.bi.rewrite.countdistinct.sketch", "hll",
+        new StringSet("hll", "cpc", "theta"),

Review comment:
       Can we limit the algorithm choices to a single one for the time being?
   I am asking because this will not work with materialized views: the SQL view definition does not store the algorithm that was used to generate the column, so changing the property value would lead to errors.
   Multi-algorithm support needs a bit more work. One option would be to store this information in the MV table properties so we know how to interpret the views when HS2 needs to load (and thus parse) them. What do you think?
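
   A rough sketch of the table-properties idea, in plain Java with a `Map` standing in for the MV table properties. The class `MvSketchProperty` and the property key are hypothetical; no such property exists in Hive today as far as this thread shows.

```java
import java.util.Map;

final class MvSketchProperty {
  // Hypothetical property key, used only for illustration.
  static final String KEY = "sketch.countdistinct.algorithm";

  /** Records the sketch type used when the MV was created or rebuilt. */
  static void record(Map<String, String> tblProps, String sketchType) {
    tblProps.put(KEY, sketchType);
  }

  /**
   * Returns true only if the MV was built with the sketch type that is currently configured.
   * An MV without the property is treated as incompatible, since its algorithm is unknown.
   */
  static boolean isCompatible(Map<String, String> tblProps, String currentSketchType) {
    String stored = tblProps.get(KEY);
    return stored != null && stored.equals(currentSketchType);
  }
}
```

   Until something like this exists, restricting the setting to a single algorithm avoids the mismatch altogether.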





[GitHub] [hive] kgyrtkirk commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r416683591



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DataSketchesFunctions.java
##########
@@ -128,14 +141,26 @@ private void buildCalciteFns() {
           OperandTypes.family(),
           unionFn);
 
+
       unionSFD.setCalciteFunction(unionFn);
       sketchSFD.setCalciteFunction(sketchFn);
+      if (estimateSFD != null) {
+        SqlFunction estimateFn = new HiveSqlFunction(estimateSFD.name,
+            SqlKind.OTHER_FUNCTION,
+            ReturnTypes.explicit(SqlTypeName.DOUBLE),

Review comment:
       I've approached this for now by identifying the return type from the UDF class; we might need to replace that later, but it works for the time being.
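
   One way such a lookup could work, sketched with plain reflection. This is an assumption about the approach, not the code in the PR: `FakeEstimateUdf` is a hypothetical stand-in for an estimate UDF, and the mapping to type names covers only the two cases needed here.

```java
import java.lang.reflect.Method;

final class ReturnTypeFromUdf {

  /** Hypothetical stand-in for an estimate UDF; no real Hive class is referenced. */
  static final class FakeEstimateUdf {
    public Double evaluate(byte[] serializedSketch) {
      return 0.0;
    }
  }

  /** Returns the return type of the first public method named "evaluate", or null if there is none. */
  static Class<?> evaluateReturnType(Class<?> udfClass) {
    for (Method m : udfClass.getMethods()) {
      if (m.getName().equals("evaluate")) {
        return m.getReturnType();
      }
    }
    return null;
  }

  /** Maps a Java return type to a SQL type name; only the cases used in this sketch are handled. */
  static String sqlTypeName(Class<?> javaType) {
    if (javaType == Double.class || javaType == double.class) {
      return "DOUBLE";
    }
    if (javaType == Long.class || javaType == long.class) {
      return "BIGINT";
    }
    throw new IllegalArgumentException("unsupported return type: " + javaType);
  }
}
```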





[GitHub] [hive] jcamachor commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r414658473



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRewriteCountDistinctToDataSketches.java
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.rel.RelCollation;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Aggregate;
+import org.apache.calcite.rel.core.AggregateCall;
+import org.apache.calcite.rel.core.RelFactories.AggregateFactory;
+import org.apache.calcite.rel.core.RelFactories.ProjectFactory;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlAggFunction;
+import org.apache.calcite.sql.SqlOperator;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.conf.HiveConf.ConfVars;
+import org.apache.hadoop.hive.ql.exec.DataSketchesFunctions;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate;
+import org.apache.hive.plugin.api.HiveUDFPlugin.UDFDescriptor;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.common.collect.ImmutableList;
+
+/**
+ * This rule could rewrite {@code count(distinct(x))} calls to be calculated using sketch based functions.
+ */
+public final class HiveRewriteCountDistinctToDataSketches extends RelOptRule {
+
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveRewriteCountDistinctToDataSketches.class);
+  private String sketchClass;
+
+  public HiveRewriteCountDistinctToDataSketches(HiveConf conf) {
+    super(operand(HiveAggregate.class, any()));
+    sketchClass = conf.getVar(ConfVars.HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS);
+  }
+
+  @Override
+  public void onMatch(RelOptRuleCall call) {
+    final Aggregate aggregate = call.rel(0);
+
+    if (aggregate.getGroupSets().size() != 1) {
+      // not yet supported
+      return;
+    }
+
+    List<AggregateCall> newAggCalls = new ArrayList<AggregateCall>();
+
+    AggregateFactory f = HiveRelFactories.HIVE_AGGREGATE_FACTORY;

Review comment:
       About `HiveConf`, I suggested this because passing a well-defined parameter makes rules easier to instantiate.
   
   About the factories, one could make a case that 1) if the rule is a final static instance, you can pass `HIVE_REL_BUILDER` to the `RelOptRule` constructor, since you will not incur any additional instantiation cost per query, and thus use `call.builder`, and 2) if the rule is parameterized (as is the case for this one), you have to instantiate it for every query compilation, so your initial implementation using the factory directly would perform better (instantiating a builder based on `HIVE_REL_BUILDER` adds some overhead). If factories are not needed at all, I guess `copy` may be an option too.





[GitHub] [hive] kgyrtkirk commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r414363412



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRewriteCountDistinctToDataSketches.java
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to you under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.rel.RelCollation;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Aggregate;
+import org.apache.calcite.rel.core.AggregateCall;
+import org.apache.calcite.rel.core.RelFactories.AggregateFactory;
+import org.apache.calcite.rel.core.RelFactories.ProjectFactory;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlAggFunction;
+import org.apache.calcite.sql.SqlOperator;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.conf.HiveConf.ConfVars;
+import org.apache.hadoop.hive.ql.exec.DataSketchesFunctions;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate;
+import org.apache.hive.plugin.api.HiveUDFPlugin.UDFDescriptor;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.common.collect.ImmutableList;
+
+/**
+ * This rule could rewrite {@code count(distinct(x))} calls to be calculated using sketch based functions.
+ */
+public final class HiveRewriteCountDistinctToDataSketches extends RelOptRule {
+
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveRewriteCountDistinctToDataSketches.class);
+  private String sketchClass;
+
+  public HiveRewriteCountDistinctToDataSketches(HiveConf conf) {
+    super(operand(HiveAggregate.class, any()));
+    sketchClass = conf.getVar(ConfVars.HIVE_OPTIMIZE_REWRITE_COUNT_DISTINCT_SKETCHCLASS);
+  }
+
+  @Override
+  public void onMatch(RelOptRuleCall call) {
+    final Aggregate aggregate = call.rel(0);
+
+    if (aggregate.getGroupSets().size() != 1) {
+      // not yet supported
+      return;
+    }
+
+    List<AggregateCall> newAggCalls = new ArrayList<AggregateCall>();
+
+    AggregateFactory f = HiveRelFactories.HIVE_AGGREGATE_FACTORY;

Review comment:
       I just followed the "usual practice" by passing the `HiveConf` :D 
   but since there is nothing else needed - I've removed it...
   
   About the factory stuff: I've moved it to the constructor/etc
   but this is not entirely clear to me:
   * we have the `HiveRelFactories` to construct things
   * in the meantime there is also `call.builder()`, which could probably be used to do the same
   
   Other rules seemed to utilize `HiveRelFactories` so I've followed that - but I feel that it would be better to use the builder - and fix the issues we might encounter along the way...what do you think?





[GitHub] [hive] kgyrtkirk commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r418098701



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,19 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_BI_ENABLED("hive.optimize.bi.enabled", false,
+        "Enables query rewrites based on approximate functions(sketches)."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.bi.rewrite.countdistinct.enabled",
+        true,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNT_DISTINCT_SKETCH(
+        "hive.optimize.bi.rewrite.countdistinct.sketch", "hll",
+        new StringSet("hll", "cpc", "theta"),

Review comment:
       I was not thinking about restarting HS2
   
   sure...we can limit it to one - but if this incorrect behaviour does exist - then I think it could also be triggered with the main bi mode switch:
   * in one case there will be a sketch there
   * in the other some integer value
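
To make the two cases concrete, here is a minimal sketch in plain Java (no Hive or Calcite classes) of which aggregate expression would end up in the plan depending on the switches. The class and method names are hypothetical, and the `ds_<type>_sketch` / `ds_<type>_estimate` naming pattern is an assumption about how the sketch functions are registered; the actual rule may produce a different expression.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the rule's real implementation.
class CountDistinctRewriteSketch {
  // Allowed values mirror the StringSet("hll", "cpc", "theta") from the diff.
  static final Set<String> SKETCH_TYPES = new HashSet<>(Arrays.asList("hll", "cpc", "theta"));

  /** Returns the aggregate expression text for COUNT(DISTINCT column) under the given settings. */
  static String rewrite(String column, boolean biEnabled, String sketchType) {
    if (!biEnabled) {
      // Main switch off: the exact count stays, so the result is an integer value.
      return "count(distinct " + column + ")";
    }
    if (!SKETCH_TYPES.contains(sketchType)) {
      throw new IllegalArgumentException("unsupported sketch type: " + sketchType);
    }
    // Assumed naming pattern: ds_<type>_sketch builds the sketch, ds_<type>_estimate reads it.
    return "ds_" + sketchType + "_estimate(ds_" + sketchType + "_sketch(" + column + "))";
  }

  public static void main(String[] args) {
    System.out.println(rewrite("x", false, "hll"));
    System.out.println(rewrite("x", true, "hll"));
    System.out.println(rewrite("x", true, "theta"));
  }
}
```

Under these assumptions, toggling the main switch changes the stored expression just as much as changing the sketch type does, which is the point of the comparison above.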
   





[GitHub] [hive] kgyrtkirk commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r418060431



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,19 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_BI_ENABLED("hive.optimize.bi.enabled", false,
+        "Enables query rewrites based on approximate functions(sketches)."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.bi.rewrite.countdistinct.enabled",
+        true,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNT_DISTINCT_SKETCH(
+        "hive.optimize.bi.rewrite.countdistinct.sketch", "hll",
+        new StringSet("hll", "cpc", "theta"),

Review comment:
       I don't think that would be necessary
   * it works correctly - even if we have 1 algo the interesting behaviour is still there - which is: if the rewrite is enabled the created MV will be a rewritten one
   * it won't get applied for different modes/etc so it doesn't lead to errors at all...
   
   I've added a test (sketches_materialized_view_sketchtype.q)
   which shows how it works when there is an MV for HLL; in case the mode is not HLL, the MV is ignored and the result is computed directly
   
   I think the real meaning of the MV should not change(I think we agree on this); we have 2 choices here:
   * ignore all rewriting during MV construction/rebuild - so this rewrite would not happen for an MV - and users have to use the expanded form of the sketch functions to create an MV for that purpose
   * save a conf overlay alongside the MV
   I think addressing this is outside of the scope of this change
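
The behaviour described for the test can be sketched roughly as follows, in plain Java. This is a simplified model under stated assumptions: real MV matching works on Calcite plans, not on strings, and the class name, method names, and the `ds_<type>_*` function naming pattern are hypothetical stand-ins.

```java
// Hypothetical model: an MV applies only when the query, rewritten under the
// session's sketch type, lines up with the expression stored in the MV.
class MvSketchTypeMatchSketch {
  // Assumed naming pattern; stands in for whatever expression the rewrite produces.
  static String rewrite(String column, String sketchType) {
    return "ds_" + sketchType + "_estimate(ds_" + sketchType + "_sketch(" + column + "))";
  }

  /** True when a query rewritten under sessionSketchType matches the stored MV expression. */
  static boolean mvApplies(String mvExpression, String column, String sessionSketchType) {
    return mvExpression.equals(rewrite(column, sessionSketchType));
  }

  public static void main(String[] args) {
    // The MV was created while the sketch type was "hll".
    String mvExpression = rewrite("x", "hll");
    System.out.println("hll session uses MV: " + mvApplies(mvExpression, "x", "hll"));
    System.out.println("cpc session uses MV: " + mvApplies(mvExpression, "x", "cpc"));
  }
}
```

In this model a mismatch never produces a wrong answer; the MV is simply skipped and the query is computed directly.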





[GitHub] [hive] kgyrtkirk commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

kgyrtkirk commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r414374199



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
##########
@@ -1967,6 +1968,13 @@ private RelNode applyPreJoinOrderingTransforms(RelNode basePlan, RelMetadataProv
       generatePartialProgram(program, false, HepMatchOrder.DEPTH_FIRST,
           HiveExceptRewriteRule.INSTANCE);
 
+      // ?

Review comment:
       actually...I was wondering if this is the right place to add this rule or not... :)





[GitHub] [hive] jcamachor commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r417595617



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DataSketchesFunctions.java
##########
@@ -128,14 +141,26 @@ private void buildCalciteFns() {
           OperandTypes.family(),
           unionFn);
 
+
       unionSFD.setCalciteFunction(unionFn);
       sketchSFD.setCalciteFunction(sketchFn);
+      if (estimateSFD != null) {
+        SqlFunction estimateFn = new HiveSqlFunction(estimateSFD.name,
+            SqlKind.OTHER_FUNCTION,
+            ReturnTypes.explicit(SqlTypeName.DOUBLE),

Review comment:
       Do we have a follow-up? Would you mind creating one so we can keep track?





[GitHub] [hive] jcamachor commented on a change in pull request #988: HIVE-23031 rewrite distinct

Posted by GitBox <gi...@apache.org>.

jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r414669324



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
##########
@@ -1967,6 +1968,13 @@ private RelNode applyPreJoinOrderingTransforms(RelNode basePlan, RelMetadataProv
       generatePartialProgram(program, false, HepMatchOrder.DEPTH_FIRST,
           HiveExceptRewriteRule.INSTANCE);
 
+      // ?

Review comment:
       Good point.
   I am not sure whether the decorrelation logic or any rule executed before this one would introduce a `count distinct` or any of the other functions that we will be targeting.
   However, if that were the case at some point, we probably would not want to be meddling with them in our new rules.
   Thus, executing it as the very first step, even before decorrelation, would make sense.
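
A toy illustration of the ordering argument, in plain Java with aggregate calls modelled as strings. Everything here is hypothetical: `INTRODUCES_DISTINCT` stands in for some earlier rule that might add its own `count(distinct ...)`, and the `ds_hll_*` expression is an assumed shape for the rewrite output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical model of rule ordering; not how CalcitePlanner is structured.
class RuleOrderingSketch {
  // Stand-in for the new rule: rewrites every count(distinct ...) it sees.
  static final UnaryOperator<List<String>> SKETCH_REWRITE = calls -> {
    String prefix = "count(distinct ";
    List<String> out = new ArrayList<>();
    for (String c : calls) {
      if (c.startsWith(prefix) && c.endsWith(")")) {
        String col = c.substring(prefix.length(), c.length() - 1);
        out.add("ds_hll_estimate(ds_hll_sketch(" + col + "))");
      } else {
        out.add(c);
      }
    }
    return out;
  };

  // Stand-in for a rule that introduces a count(distinct ...) of its own.
  static final UnaryOperator<List<String>> INTRODUCES_DISTINCT = calls -> {
    List<String> out = new ArrayList<>(calls);
    out.add("count(distinct internal_key)");
    return out;
  };

  /** Applies the stages in the given order and returns the resulting calls. */
  @SafeVarargs
  static List<String> run(List<String> plan, UnaryOperator<List<String>>... stages) {
    List<String> current = plan;
    for (UnaryOperator<List<String>> stage : stages) {
      current = stage.apply(current);
    }
    return current;
  }

  public static void main(String[] args) {
    List<String> userQuery = Arrays.asList("count(distinct x)");
    // Sketch rewrite first: only the user-written call is approximated.
    System.out.println(run(userQuery, SKETCH_REWRITE, INTRODUCES_DISTINCT));
    // Sketch rewrite last: the internally introduced call is approximated too.
    System.out.println(run(userQuery, INTRODUCES_DISTINCT, SKETCH_REWRITE));
  }
}
```

Running the rewrite first keeps it limited to what the user actually wrote, which is the property argued for above.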



