Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/10/06 03:15:00 UTC

[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator

     [ https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=495700&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495700 ]

ASF GitHub Bot logged work on HIVE-24203:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Oct/20 03:14
            Start Date: 06/Oct/20 03:14
    Worklog Time Spent: 10m 
      Work Description: jcamachor commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r497866947



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics().clone();

Review comment:
       Do you need to clone them? Are you modifying them? (Same for next line)
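
       For reference, the arithmetic described in the Javadoc above (scale the left branch by T(right) / T(left), then sum both sides) could be sketched roughly as below. This is not the code in the pull request: the class, method, and field names are hypothetical, plain long values stand in for Hive's Statistics objects, and treating an empty left branch as a factor of 0 is an assumption made only for this illustration.

```java
// Hypothetical, simplified model of the rule described in the Javadoc.
// It does not use Hive's Statistics API.
final class LateralViewJoinEstimate {
    final long numRows;
    final long dataSize;

    LateralViewJoinEstimate(long numRows, long dataSize) {
        this.numRows = numRows;
        this.dataSize = dataSize;
    }

    /**
     * leftRows/leftSize   : stats of the plain SELECT branch
     * rightRows/rightSize : stats of the UDTF branch
     */
    static LateralViewJoinEstimate estimate(long leftRows, long leftSize,
                                            long rightRows, long rightSize) {
        // Assumption: with no rows on the left there is nothing to scale.
        double factor = leftRows <= 0 ? 0.0 : (double) rightRows / leftRows;

        // Each left row is repeated once per row the UDTF emits for it,
        // so the left branch's data size grows by T(right) / T(left).
        long scaledLeftSize = Math.round(leftSize * factor);

        // The joined row carries columns from both branches, so sizes add up.
        // The output row count follows the UDTF branch.
        return new LateralViewJoinEstimate(rightRows, scaledLeftSize + rightSize);
    }
}
```

       The sketch only reads its inputs and builds a new result, so it would not need a defensive copy. Whether the same holds in the pull request depends on whether the rule mutates the parents' Statistics.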

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right branch.
+   * The join has one-to-many relationship since UDTF can generate multiple rows.

Review comment:
       Just leaving a note. I took a quick look at the UDTF logic and it seems the selectivity is hardcoded via config. The outer flag does not seem to be taken into account either. Handling it could be a straightforward improvement for the estimates, i.e., with the outer flag set, the UDTF will produce at least as many rows as it receives.
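
       A minimal sketch of that improvement, assuming the UDTF row estimate is the input row count multiplied by a configured factor. The class, method, and parameter names are hypothetical and do not correspond to Hive's actual code; the sketch only shows the lower bound on the estimate.

```java
// Hypothetical illustration: not Hive's UDTF stats rule.
final class UdtfRowEstimate {
    private UdtfRowEstimate() {}

    /**
     * inputRows : rows entering the UDTF
     * factor    : configured multiplier (assumed to come from configuration)
     * outer     : whether the lateral view is an OUTER one
     */
    static long estimateRows(long inputRows, double factor, boolean outer) {
        long estimated = Math.round(inputRows * factor);
        // With the outer flag, every input row yields at least one output row,
        // so the estimate should never drop below the input row count.
        return outer ? Math.max(estimated, inputRows) : estimated;
    }
}
```

       For example, with 1000 input rows and a factor of 0.5, the estimate stays at 1000 rows when the outer flag is set and is 500 rows otherwise.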
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 495700)
    Time Spent: 50m  (was: 40m)

> Implement stats annotation rule for the LateralViewJoinOperator
> ---------------------------------------------------------------
>
>                 Key: HIVE-24203
>                 URL: https://issues.apache.org/jira/browse/HIVE-24203
>             Project: Hive
>          Issue Type: Improvement
>          Components: Physical Optimizer
>    Affects Versions: 4.0.0, 3.1.2, 2.3.7
>            Reporter: okumin
>            Assignee: okumin
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> StatsRulesProcFactory doesn't have any rules to handle a JOIN by LATERAL VIEW.
> This can cause an underestimation in case that UDTF in LATERAL VIEW generates multiple rows.
> HIVE-20262 has already added the rule for UDTF.
> This issue would add the rule for LateralViewJoinOperator.


