Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/01/04 11:02:38 UTC

[GitHub] [hive] okumin opened a new pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

okumin opened a new pull request #1531:
URL: https://github.com/apache/hive/pull/1531


   ### What changes were proposed in this pull request?
   
   Estimate statistics of LATERAL VIEW correctly.
   
   `StatsRulesProcFactory` has no rule for the join performed by LATERAL VIEW.
   This can cause underestimation when the UDTF in a LATERAL VIEW generates multiple rows.
   
   ### Why are the changes needed?
   
   Significant underestimation can happen when LATERAL VIEW greatly increases the number of records and the source table is large.
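
As a rough illustration of the estimate this PR aims for (a minimal sketch, not the patch itself): the join's row count is taken from the UDTF branch, and the data size is the select branch's size scaled by T(right) / T(left) plus the UDTF branch's size. The class and method names below are hypothetical, both row counts are assumed to be positive, and plain arithmetic stands in for Hive's overflow-safe `StatsUtils.safeMult` and `StatsUtils.safeAdd`.

```java
// Hypothetical helper for illustration only; not part of Hive.
final class LateralViewSizeSketch {
  /**
   * Scales the left (select) branch by T(right) / T(left) and adds the
   * right (UDTF) branch. Assumes both row counts are positive and that
   * the intermediate values do not overflow a long.
   */
  static long estimateDataSize(long selectRows, long selectDataSize,
                               long udtfRows, long udtfDataSize) {
    final double factor = (double) udtfRows / (double) selectRows;
    // Truncating cast; the real patch goes through StatsUtils.safeMult instead.
    final long scaledSelectDataSize = (long) (selectDataSize * factor);
    return scaledSelectDataSize + udtfDataSize;
  }
}
```

For example, with 1,000 select rows totalling 8,000 bytes and 10,000 UDTF rows totalling 40,000 bytes, the factor is 10 and the estimate is 80,000 + 40,000 = 120,000 bytes.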
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Added one test case.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501078148



##########
File path: ql/src/test/results/clientpositive/llap/annotate_stats_lateral_view_join.q.out
##########
@@ -503,14 +503,14 @@ STAGE PLANS:
                             Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: COMPLETE
                             Lateral View Join Operator
                               outputColumnNames: _col0, _col1, _col5, _col6
-                              Statistics: Num rows: 0 Data size: 24 Basic stats: PARTIAL Column stats: NONE
+                              Statistics: Num rows: 0 Data size: 24 Basic stats: PARTIAL Column stats: COMPLETE

Review comment:
       This is an edge case since `HIVE_STATS_UDTF_FACTOR` is greater than or equal to 1. Anyway, I created a ticket.
   https://issues.apache.org/jira/browse/HIVE-24240






[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501691897



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();

Review comment:
       Added steps to check both row counts and to ensure the stats always contain at least one record.
   https://github.com/apache/hive/pull/1531/commits/50396346eaed5d6bab4ff87dd079918a769a7ebd






[GitHub] [hive] kgyrtkirk merged pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
kgyrtkirk merged pull request #1531:
URL: https://github.com/apache/hive/pull/1531


   




[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501689451



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }

Review comment:
       I also agree and I did that.
   https://github.com/apache/hive/pull/1531/commits/d333d5d70184a1cf1f0c0f239e9229965e486202






[GitHub] [hive] okumin commented on pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on pull request #1531:
URL: https://github.com/apache/hive/pull/1531#issuecomment-708179596


   @kgyrtkirk I have updated a few places so that the number of rows will never be 0.
   Could you please have a look when you have a chance?




[GitHub] [hive] kgyrtkirk commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501203341



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }

Review comment:
       I don't know what's the point of these `[0]`/`[1]` markers; from one of the historical commits it seems to me like these are some kind of "log message indexes" inside the method ....
   I think we could stop doing that...






[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r504422554



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2961,10 +2961,11 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
       final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
       final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
 
-      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long udtfNumRows = Math.max(udtfStats.getNumRows(), 1);
+      final double factor = (double) udtfNumRows / (double) Math.max(selectStats.getNumRows(), 1);

Review comment:
       `factor` will always be greater than 0.0, so it can never be 0 or infinity.
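
To make the edge cases concrete, here is a small sketch comparing the unguarded ratio with the clamped one from the diff above. The class and method names are hypothetical; only the `Math.max(..., 1)` clamping is taken from the change under review.

```java
// Hypothetical helper for illustration only; not part of Hive.
final class FactorClampSketch {
  /** Ratio of UDTF rows to select rows without any guard. */
  static double rawFactor(long udtfNumRows, long selectNumRows) {
    // Double division never throws: 0/0 is NaN and n/0 is Infinity.
    return (double) udtfNumRows / (double) selectNumRows;
  }

  /** Same ratio after clamping both row counts to at least 1. */
  static double clampedFactor(long udtfNumRows, long selectNumRows) {
    return (double) Math.max(udtfNumRows, 1) / (double) Math.max(selectNumRows, 1);
  }
}
```

With the clamp, both operands are at least 1, so the ratio is always a positive, finite number.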






[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r500006396



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics().clone();

Review comment:
       As for `udtfStats`, we can totally avoid the clone.
   As for `selectStats`, its column stats will be updated. However, it looks like `StatsUtils.getColStatisticsFromExprMap` clones them?
   Anyway, I think we can remove the clones if CI passes. I will try it.






[GitHub] [hive] github-actions[bot] closed pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #1531:
URL: https://github.com/apache/hive/pull/1531


   




[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r497526561



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,77 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator joins the output of select with the output of UDTF.

Review comment:
       @zabetak Thanks for taking a look!
   I added a description. Please feel free to ask me if something doesn't make sense.






[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r496441090



##########
File path: ql/src/test/queries/clientpositive/annotate_stats_lateral_view_join.q
##########
@@ -0,0 +1,38 @@
+set hive.fetch.task.conversion=none;

Review comment:
       This makes EXPLAIN show Statistics. I'm thinking of creating another ticket to add this line to the other `annotate_stats_*.q` files.
   
   e.g. https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/llap/annotate_stats_select.q.out






[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501697579



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }
+      } else {
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[1] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }
+      }
+      return null;
+    }
+
+    private List<ColStatistics> multiplyColStats(List<ColStatistics> colStatistics, double factor) {
+      for (ColStatistics colStats : colStatistics) {
+        colStats.setNumFalses(StatsUtils.safeMult(colStats.getNumFalses(), factor));
+        colStats.setNumTrues(StatsUtils.safeMult(colStats.getNumTrues(), factor));
+        colStats.setNumNulls(StatsUtils.safeMult(colStats.getNumNulls(), factor));
+        // When factor > 1, the same records are duplicated and countDistinct never changes.
+        if (factor < 1.0) {
+          colStats.setCountDistint(StatsUtils.safeMult(colStats.getCountDistint(), factor));

Review comment:
       Ceiled. I moved this method since I'd like to reuse it for HIVE-24240.
   https://github.com/apache/hive/commit/50396346eaed5d6bab4ff87dd079918a769a7ebd
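
A minimal sketch of why rounding up matters here, assuming "ceiled" refers to the scaled `countDistinct` value in the diff above. The class and method names are hypothetical and this is not the code from the linked commit.

```java
// Hypothetical helper for illustration only; not part of Hive.
final class CountDistinctSketch {
  /** Scales a distinct-value count by factor, rounding up when shrinking. */
  static long scaleCountDistinct(long countDistinct, double factor) {
    if (factor >= 1.0) {
      // Duplicating rows adds no new distinct values.
      return countDistinct;
    }
    // Math.ceil keeps a small non-zero count from truncating to 0.
    return (long) Math.ceil(countDistinct * factor);
  }
}
```

With a plain truncating cast, 3 distinct values scaled by 0.1 would become 0; rounding up yields 1 instead.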






[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r500420813



##########
File path: ql/src/test/results/clientpositive/llap/annotate_stats_lateral_view_join.q.out
##########
@@ -503,14 +503,14 @@ STAGE PLANS:
                             Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: COMPLETE
                             Lateral View Join Operator
                               outputColumnNames: _col0, _col1, _col5, _col6
-                              Statistics: Num rows: 0 Data size: 24 Basic stats: PARTIAL Column stats: NONE
+                              Statistics: Num rows: 0 Data size: 24 Basic stats: PARTIAL Column stats: COMPLETE

Review comment:
       With clone, the following condition is not satisfied since the basic stats of parent operators are PARTIAL.
   https://github.com/apache/hive/blob/91e492de239427fc1e38e5e4350cfdce409ebb70/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2969

##########
File path: ql/src/test/results/clientpositive/llap/annotate_stats_lateral_view_join.q.out
##########
@@ -503,14 +503,14 @@ STAGE PLANS:
                             Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: COMPLETE
                             Lateral View Join Operator
                               outputColumnNames: _col0, _col1, _col5, _col6
-                              Statistics: Num rows: 0 Data size: 24 Basic stats: PARTIAL Column stats: NONE
+                              Statistics: Num rows: 0 Data size: 24 Basic stats: PARTIAL Column stats: COMPLETE

Review comment:
       BTW, it would be better for the UDTF rule to set num rows to one when it would otherwise become zero.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics().clone();

Review comment:
       Nothing was unexpectedly broken. CI failed, but the failure does not seem to be related to this PR.
   - https://github.com/apache/hive/pull/1531/commits/91e492de239427fc1e38e5e4350cfdce409ebb70






[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501101794



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }

Review comment:
       I wonder if we should switch between `[0]` and `[1]` based on a condition. Some rules seem to use a different marker depending on whether column stats exist.






[GitHub] [hive] github-actions[bot] commented on pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #1531:
URL: https://github.com/apache/hive/pull/1531#issuecomment-744104263


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.




[GitHub] [hive] zabetak commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r497059455



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,77 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator joins the output of select with the output of UDTF.

Review comment:
       Could you provide a bit more detail about what the rule does? Most of the other rules in this class give a general overview of the cost model they implement.






[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501110922



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }
+      } else {
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[1] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }
+      }
+      return null;
+    }
+
+    private List<ColStatistics> multiplyColStats(List<ColStatistics> colStatistics, double factor) {
+      for (ColStatistics colStats : colStatistics) {
+        colStats.setNumFalses(StatsUtils.safeMult(colStats.getNumFalses(), factor));
+        colStats.setNumTrues(StatsUtils.safeMult(colStats.getNumTrues(), factor));
+        colStats.setNumNulls(StatsUtils.safeMult(colStats.getNumNulls(), factor));
+        // When factor > 1, the same records are duplicated and countDistinct never changes.
+        if (factor < 1.0) {
+          colStats.setCountDistint(StatsUtils.safeMult(colStats.getCountDistint(), factor));

Review comment:
       This method may include additional logging and logic for optimizing JOIN, such as `cs.setFilterColumn`. It would be better to implement a simple, separate utility.






[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501689451



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }

Review comment:
       I agree, and I have made that change.
   https://github.com/apache/hive/pull/1531/commits/d333d5d70184a1cf1f0c0f239e9229965e486202

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();

Review comment:
       Added steps to check both numbers and ensure that the stats contain at least one record.
   https://github.com/apache/hive/pull/1531/commits/50396346eaed5d6bab4ff87dd079918a769a7ebd
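
   As an illustration only (this is not the code from the linked commit), the guard could look roughly like the sketch below. It assumes that both row counts are clamped to at least one before the ratio T(right) / T(left) is taken, so the factor is always finite. The class and method names are hypothetical.

```java
public final class LateralViewFactorSketch {
  private LateralViewFactorSketch() {
  }

  // Hypothetical helper: treats an estimate of zero (or less) as one row.
  public static long atLeastOne(long numRows) {
    return Math.max(1L, numRows);
  }

  // T(right) / T(left) with both sides clamped, so the division can never
  // produce Infinity or NaN.
  public static double factor(long selectNumRows, long udtfNumRows) {
    return (double) atLeastOne(udtfNumRows) / (double) atLeastOne(selectNumRows);
  }

  public static void main(String[] args) {
    System.out.println(factor(0L, 0L));   // prints 1.0
    System.out.println(factor(10L, 50L)); // prints 5.0
    System.out.println(factor(0L, 24L));  // prints 24.0
  }
}
```

   Without the clamp, a left branch with zero estimated rows would yield a factor of Infinity (or NaN when both sides are zero), which would then propagate into the data size.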

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }
+      } else {
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[1] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }
+      }
+      return null;
+    }
+
+    private List<ColStatistics> multiplyColStats(List<ColStatistics> colStatistics, double factor) {
+      for (ColStatistics colStats : colStatistics) {
+        colStats.setNumFalses(StatsUtils.safeMult(colStats.getNumFalses(), factor));
+        colStats.setNumTrues(StatsUtils.safeMult(colStats.getNumTrues(), factor));
+        colStats.setNumNulls(StatsUtils.safeMult(colStats.getNumNulls(), factor));
+        // When factor > 1, the same records are duplicated and countDistinct never changes.
+        if (factor < 1.0) {
+          colStats.setCountDistint(StatsUtils.safeMult(colStats.getCountDistint(), factor));

Review comment:
       Now rounded up (ceiled). I moved this method since I'd like to reuse it for HIVE-24240.
   https://github.com/apache/hive/commit/50396346eaed5d6bab4ff87dd079918a769a7ebd
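
   A minimal sketch of why rounding up matters here, assuming the scaled value would otherwise be truncated when converted to a long. The class and method names are hypothetical and this is not the code from the linked commit.

```java
public final class ScaleCountDistinctSketch {
  private ScaleCountDistinctSketch() {
  }

  // Scales countDistinct only when rows are dropped (factor < 1.0) and rounds
  // up, so a column that had distinct values never ends up with zero.
  public static long scaleCountDistinct(long countDistinct, double factor) {
    if (factor >= 1.0) {
      // Duplicating rows does not create new distinct values.
      return countDistinct;
    }
    return (long) Math.ceil(countDistinct * factor);
  }

  public static void main(String[] args) {
    System.out.println(scaleCountDistinct(3L, 0.1));  // prints 1 (truncation would give 0)
    System.out.println(scaleCountDistinct(10L, 0.5)); // prints 5
    System.out.println(scaleCountDistinct(7L, 2.0));  // prints 7
  }
}
```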






[GitHub] [hive] okumin commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501096999



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }
+      } else {
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[1] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }
+      }
+      return null;
+    }
+
+    private List<ColStatistics> multiplyColStats(List<ColStatistics> colStatistics, double factor) {
+      for (ColStatistics colStats : colStatistics) {
+        colStats.setNumFalses(StatsUtils.safeMult(colStats.getNumFalses(), factor));
+        colStats.setNumTrues(StatsUtils.safeMult(colStats.getNumTrues(), factor));
+        colStats.setNumNulls(StatsUtils.safeMult(colStats.getNumNulls(), factor));
+        // When factor > 1, the same records are duplicated and countDistinct never changes.
+        if (factor < 1.0) {
+          colStats.setCountDistint(StatsUtils.safeMult(colStats.getCountDistint(), factor));

Review comment:
       I now think this method could be used for this purpose if we also make it update num trues and num falses.
   https://github.com/apache/hive/blob/c082a724648b6bfbdd4b0ff72d7c41c29257beba/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L2050-L2100
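
   To make the idea concrete, here is a rough sketch of a scaling helper that also updates num trues and num falses. It uses a simplified stand-in class rather than Hive's `ColStatistics`, and every name in it is hypothetical; it does not reproduce the linked `StatsUtils` method.

```java
public final class ColStatsScalingSketch {
  private ColStatsScalingSketch() {
  }

  // Simplified stand-in for a column statistics holder (not Hive's ColStatistics).
  public static final class SimpleColStats {
    public long numNulls;
    public long numTrues;
    public long numFalses;

    public SimpleColStats(long numNulls, long numTrues, long numFalses) {
      this.numNulls = numNulls;
      this.numTrues = numTrues;
      this.numFalses = numFalses;
    }
  }

  // Multiplies a count by a ratio, saturating at Long.MAX_VALUE on overflow.
  public static long scale(long value, double ratio) {
    final double scaled = value * ratio;
    return scaled >= (double) Long.MAX_VALUE ? Long.MAX_VALUE : (long) scaled;
  }

  // Scales every row-count-like field, including num trues and num falses.
  public static void scaleInPlace(SimpleColStats cs, double ratio) {
    cs.numNulls = scale(cs.numNulls, ratio);
    cs.numTrues = scale(cs.numTrues, ratio);
    cs.numFalses = scale(cs.numFalses, ratio);
  }

  public static void main(String[] args) {
    final SimpleColStats cs = new SimpleColStats(4L, 10L, 6L);
    scaleInPlace(cs, 2.5);
    System.out.println(cs.numNulls + " " + cs.numTrues + " " + cs.numFalses); // prints 10 25 15
  }
}
```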






[GitHub] [hive] kgyrtkirk commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r500976202



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();

Review comment:
       I know `selectStats.getNumRows()` should not be zero, but just in case, could you guard against a zero denominator here? Could you also move the resulting logic into `StatsUtils` or something like that?
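
One way to read this request, as a minimal sketch only: in Java, dividing a double by zero does not throw; it yields Infinity, or NaN for 0/0, and either value would then flow into the multiplications that follow. The class name `RowRatioSketch`, the method `safeRatio`, and the choice of a caller-supplied fallback are all assumptions made for illustration; none of them is an existing Hive or `StatsUtils` API.

```java
// Hypothetical helper: the class name, method name, and fallback policy are
// assumptions for illustration, not part of Hive's StatsUtils.
final class RowRatioSketch {
  private RowRatioSketch() {}

  /**
   * Returns numerator / denominator as a double, or the given fallback
   * when the denominator is not positive.
   */
  static double safeRatio(long numerator, long denominator, double fallback) {
    if (denominator <= 0) {
      // Avoids Infinity (n / 0) and NaN (0 / 0) leaking into later estimates.
      return fallback;
    }
    return (double) numerator / (double) denominator;
  }
}
```

With such a helper, the factor could be computed as `safeRatio(udtfStats.getNumRows(), selectStats.getNumRows(), 1.0)`; whether 1.0 is the right fallback is a design choice left to the patch.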

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }

Review comment:
       This seems to be a common expression in both branches of the `if`. Could you move it outside?
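
A rough sketch of the shape this refactoring could take, using stand-in strings rather than Hive's real types. The class `HoistSketch` and the step labels are hypothetical, and the sketch assumes the `[0]`/`[1]` debug tags can be dropped or derived from the condition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the rule body: each string represents one step.
final class HoistSketch {
  private HoistSketch() {}

  /** Returns the ordered steps performed for one operator. */
  static List<String> annotate(boolean hasColumnStats) {
    List<String> steps = new ArrayList<>();
    if (hasColumnStats) {
      // Only the column-level merge stays inside the conditional.
      steps.add("merge column stats");
    }
    // Shared tail, written once instead of once per branch.
    steps.add("apply runtime stats");
    steps.add("set statistics");
    steps.add("debug log");
    return steps;
  }
}
```

The `else` branch disappears entirely because it contained nothing but the shared tail.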

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows();
+      final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }
+      } else {
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[1] STATS-" + lop.toString() + ": " + joinedStats.extendedToString());
+        }
+      }
+      return null;
+    }
+
+    private List<ColStatistics> multiplyColStats(List<ColStatistics> colStatistics, double factor) {
+      for (ColStatistics colStats : colStatistics) {
+        colStats.setNumFalses(StatsUtils.safeMult(colStats.getNumFalses(), factor));
+        colStats.setNumTrues(StatsUtils.safeMult(colStats.getNumTrues(), factor));
+        colStats.setNumNulls(StatsUtils.safeMult(colStats.getNumNulls(), factor));
+        // When factor > 1, the same records are duplicated and countDistinct never changes.
+        if (factor < 1.0) {
+          colStats.setCountDistint(StatsUtils.safeMult(colStats.getCountDistint(), factor));

Review comment:
       I think we should make sure that the NDV is at least 1 whenever the number of rows is greater than 0.
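
A minimal sketch of the clamp being asked for, under stated assumptions: the class `NdvScalingSketch` and the method `scaleNdv` are hypothetical names, and plain truncating arithmetic stands in for `StatsUtils.safeMult`, whose exact rounding behavior is not shown in this thread.

```java
// Hypothetical helper; not Hive's API. Uses plain truncation instead of
// StatsUtils.safeMult, which is an assumption of this sketch.
final class NdvScalingSketch {
  private NdvScalingSketch() {}

  /**
   * Scales the number of distinct values (NDV) down when factor < 1 and
   * keeps it at least 1 whenever the output is expected to have rows.
   */
  static long scaleNdv(long ndv, double factor, long newNumRows) {
    long scaled = ndv;
    if (factor < 1.0) {
      // Duplicating rows (factor >= 1) never adds distinct values,
      // so only the shrinking case rescales.
      scaled = (long) (ndv * factor);
    }
    if (newNumRows > 0) {
      // A non-empty result must contain at least one distinct value.
      scaled = Math.max(1L, scaled);
    }
    return scaled;
  }
}
```

For example, an NDV of 10 scaled by 0.05 truncates to 0, and the clamp lifts it back to 1 as long as the estimated row count is positive.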

##########
File path: ql/src/test/results/clientpositive/llap/annotate_stats_lateral_view_join.q.out
##########
@@ -503,14 +503,14 @@ STAGE PLANS:
                             Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: COMPLETE
                             Lateral View Join Operator
                               outputColumnNames: _col0, _col1, _col5, _col6
-                              Statistics: Num rows: 0 Data size: 24 Basic stats: PARTIAL Column stats: NONE
+                              Statistics: Num rows: 0 Data size: 24 Basic stats: PARTIAL Column stats: COMPLETE

Review comment:
       Definitely. I don't think it will be `0` in reality!






[GitHub] [hive] jcamachor commented on a change in pull request #1531: HIVE-24203: Implement stats annotation rule for the LateralViewJoinOperator

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r497866947



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics().clone();

Review comment:
       Do you need to clone them? Are you modifying them? (Same for next line)

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.

Review comment:
       Just leaving a note. I took a quick look at the UDTF logic and it seems the selectivity is hardcoded via config. It seems the outer flag is not taken into account either, which could be a straightforward improvement for the estimates, i.e., with the outer flag set, the UDTF will produce at least as many rows as it receives.
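
The suggested improvement can be sketched as follows. This is an illustration only: `UdtfRowEstimateSketch`, `estimateRows`, and the parameter names are hypothetical, and `configuredFactor` merely stands in for whatever selectivity the existing UDTF rule reads from configuration.

```java
// Hypothetical sketch; not the existing UDTF stats rule in Hive.
final class UdtfRowEstimateSketch {
  private UdtfRowEstimateSketch() {}

  /**
   * Estimates UDTF output rows from a configured factor. With the outer
   * flag set, every input row yields at least one output row, so the
   * estimate is floored at the input row count.
   */
  static long estimateRows(long inputRows, double configuredFactor, boolean outer) {
    // A double-to-long cast saturates at Long.MAX_VALUE for huge values.
    long estimated = (long) (inputRows * configuredFactor);
    return outer ? Math.max(estimated, inputRows) : estimated;
  }
}
```

With a factor of 0.5 and 100 input rows, the non-outer estimate is 50 while the outer estimate stays at 100.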
   



