You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/08/28 19:47:29 UTC
[GitHub] [hive] jcamachor commented on a change in pull request #1437: HIVE-24081: Enable pre-materializing CTEs referenced in scalar subqueries

jcamachor commented on a change in pull request #1437:
URL: https://github.com/apache/hive/pull/1437#discussion_r479498296



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2591,6 +2591,10 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVE_CTE_MATERIALIZE_THRESHOLD("hive.optimize.cte.materialize.threshold", -1,

Review comment:
       Should we set the default to `2` now that it is only triggered in very specific cases?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java
##########
@@ -677,6 +685,33 @@ public void setNoScanAnalyzeCommand(boolean isNoScanAnalyzeCommand) {
   public boolean hasInsertTables() {
     return this.insertIntoTables.size() > 0 || this.insertOverwriteTables.size() > 0;
   }
+
+  public boolean isFullyAggregate() throws SemanticException {

Review comment:
       Although the method is evident, could we add a comment?
   
   Should this be static since it is a utility method that can be used beyond this scope?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java
##########
@@ -457,4 +469,17 @@ public boolean hasTableDefined() {
     return !(aliases.size() == 1 && aliases.get(0).equals(SemanticAnalyzer.DUMMY_TABLE));
   }
 
+  public void addSubqExprAlias(ASTNode expressionTree, SemanticAnalyzer semanticAnalyzer) throws SemanticException {
+    String alias = "__subexpr" + subQueryExpressionAliasCounter++;
+
+    // Recursively do the first phase of semantic analysis for the subquery
+    QBExpr qbexpr = new QBExpr(alias);
+
+    ASTNode subqref = (ASTNode) expressionTree.getChild(1);
+    semanticAnalyzer.doPhase1QBExpr(subqref, qbexpr, getId(), alias, isInsideView());

Review comment:
       Trying to understand this step. Does this lead to parsing the same subquery multiple times?

##########
File path: ql/src/test/queries/clientpositive/cte_mat_6.q
##########
@@ -0,0 +1,81 @@
+set hive.optimize.cte.materialize.threshold=1;
+
+create table t0(col0 int);
+
+insert into t0(col0) values
+(1),(2),
+(100),(100),(100),
+(200),(200);
+
+-- CTE is referenced from scalar subquery in the select clause
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+-- disable cte materialization
+set hive.optimize.cte.materialize.threshold=-1;
+
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+
+-- enable cte materialization
+set hive.optimize.cte.materialize.threshold=1;
+
+-- CTE is referenced from scalar subquery in the where clause
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0
+from t0
+where t0.col0 > (select small_count from cte)
+order by t0.col0;
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0
+from t0
+where t0.col0 > (select small_count from cte)
+order by t0.col0;
+
+-- CTE is referenced from scalar subquery in the having clause
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, count(*)
+from t0
+group by col0
+having count(*) > (select small_count from cte)
+order by t0.col0;
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, count(*)
+from t0
+group by col0
+having count(*) > (select small_count from cte)
+order by t0.col0;
+
+-- mix full aggregate and non-full aggregate ctes
+explain
+with cte1 as (select col0 as k1 from t0 where col0 = '5'),
+     cte2 as (select count(*) as all_count from t0),
+     cte3 as (select col0 as k3, col0 + col0 as k3_2x, count(*) as key_count from t0 group by col0)
+select t0.col0, count(*)
+from t0
+join cte1 on t0.col0 = cte1.k1
+join cte3 on t0.col0 = cte3.k3
+group by col0
+having count(*) > (select all_count from cte2)

Review comment:
       Could we add tests to make sure the optimization is only triggered for SELECT queries? For instance, I am thinking about CTAS and CMV statements, the optimization should not triggered in those cases (I guess it could lead some kind of side effect, at least for CMV).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org