You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/06/18 14:31:00 UTC

[jira] [Work logged] (HIVE-23598) Add option to rewrite NTILE and RANK to sketch functions

     [ https://issues.apache.org/jira/browse/HIVE-23598?focusedWorklogId=447849&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447849 ]

ASF GitHub Bot logged work on HIVE-23598:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Jun/20 14:30
            Start Date: 18/Jun/20 14:30
    Worklog Time Spent: 10m 
      Work Description: jcamachor commented on a change in pull request #1126:
URL: https://github.com/apache/hive/pull/1126#discussion_r442266978



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRewriteToDataSketchesRules.java
##########
@@ -483,46 +489,44 @@ protected final SqlOperator getSqlOperator(String fnName) {
   }
 
   /**
-   * Rewrites {@code cume_dist() over (order by id)}.
+   * Provides a generic way to rewrite function into using an estimation based on CDF.
+   *
+   *  There are a few methods which could be supported this way: NTILE, CUME_DIST, RANK
    *
+   *  For example:
    *  <pre>
    *   SELECT id, CUME_DIST() OVER (ORDER BY id) FROM sketch_input;
-   *     ⇒ SELECT id, 1.0-ds_kll_cdf(ds, CAST(-id AS FLOAT) )[0]
+   *     ⇒ SELECT id, ds_kll_cdf(ds, CAST(id AS FLOAT) )[0]
    *       FROM sketch_input JOIN (
-   *         SELECT ds_kll_sketch(CAST(-id AS FLOAT)) AS ds FROM sketch_input
+   *         SELECT ds_kll_sketch(CAST(id AS FLOAT)) AS ds FROM sketch_input
    *       ) q;
    *  </pre>
    */
-  public static class CumeDistRewrite extends WindowingToProjectAggregateJoinProject {
+  public static abstract class AbstractRankBasedRewriteRule extends WindowingToProjectAggregateJoinProject {
 
-    public CumeDistRewrite(String sketchType) {
+    public AbstractRankBasedRewriteRule(String sketchType) {
       super(sketchType);
     }
 
-    @Override
-    protected VbuilderPAP buildProcessor(RelOptRuleCall call) {
-      return new VB(sketchType, call.builder());
-    }
+    protected static abstract class AbstractRankBasedRewriteBuilder extends VbuilderPAP {
 
-    private static class VB extends VbuilderPAP {
-
-      protected VB(String sketchClass, RelBuilder relBuilder) {
+      protected AbstractRankBasedRewriteBuilder(String sketchClass, RelBuilder relBuilder) {
         super(sketchClass, relBuilder);
       }
 
       @Override
-      boolean isApplicable(RexOver over) {
-        SqlAggFunction aggOp = over.getAggOperator();
+      final boolean isApplicable(RexOver over) {
         RexWindow window = over.getWindow();
-        if (aggOp.getName().equalsIgnoreCase("cume_dist") && window.orderKeys.size() == 1
-            && window.getLowerBound().isUnbounded() && window.getUpperBound().isUnbounded()) {
+        if (window.orderKeys.size() == 1
+            && window.getLowerBound().isUnbounded() && window.getUpperBound().isUnbounded()

Review comment:
       Fwiw, Hive does a complete mess about this. If there is no window frame, following the SQL standard:
   - No order by spec -> rows, unbounded preceding, unbounded following
   - Order by spec -> range, unbounded preceding, current row
   This also aligns with most RDBMSs implementation.
   
   We may need to revisit the logic in this `WindowingSpec` block at some point:
   https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/WindowingSpec.java#L214
   But I suspect there is compensation logic for that in execution code path and it may require some work.

##########
File path: ql/src/test/results/clientpositive/llap/sketches_rewrite_cume_dist_partition_by.q.out
##########
@@ -26,76 +26,42 @@ POSTHOOK: Input: _dummy_database@_dummy_table
 POSTHOOK: Output: default@sketch_input
 POSTHOOK: Lineage: sketch_input.category SCRIPT []
 POSTHOOK: Lineage: sketch_input.id SCRIPT []
-PREHOOK: query: select id,category,cume_dist() over (partition by category order by id) from sketch_input order by category,id
-PREHOOK: type: QUERY
-PREHOOK: Input: default@sketch_input
-#### A masked pattern was here ####
-POSTHOOK: query: select id,category,cume_dist() over (partition by category order by id) from sketch_input order by category,id
-POSTHOOK: type: QUERY
-POSTHOOK: Input: default@sketch_input
-#### A masked pattern was here ####
-1	a	0.18181818181818182
-1	a	0.18181818181818182
-2	a	0.2727272727272727
-3	a	0.36363636363636365
-4	a	0.45454545454545453
-5	a	0.5454545454545454
-6	a	0.6363636363636364
-7	a	0.7272727272727273
-8	a	0.8181818181818182
-9	a	0.9090909090909091
-10	a	1.0
-6	b	0.18181818181818182
-6	b	0.18181818181818182
-7	b	0.2727272727272727
-8	b	0.36363636363636365
-9	b	0.45454545454545453
-10	b	0.5454545454545454
-11	b	0.6363636363636364
-12	b	0.7272727272727273
-13	b	0.8181818181818182
-14	b	0.9090909090909091
-15	b	1.0
-1	NULL	0.25
-2	NULL	0.5
-10	NULL	0.75
-13	NULL	1.0
-PREHOOK: query: select id,category,cume_dist() over (partition by category order by id),1.0-ds_kll_cdf(ds, CAST(-id AS FLOAT))[0]
+PREHOOK: query: select id,category,cume_dist() over (partition by category order by id),ds_kll_cdf(ds, CAST(id AS FLOAT))[0]
 from sketch_input
-join ( select category as c,ds_kll_sketch(cast(-id as float)) as ds from sketch_input group by category) q on (q.c=category)
+join ( select category as c,ds_kll_sketch(cast(id as float)) as ds from sketch_input group by category) q on (q.c=category)
 order by category,id
 PREHOOK: type: QUERY
 PREHOOK: Input: default@sketch_input
 #### A masked pattern was here ####
-POSTHOOK: query: select id,category,cume_dist() over (partition by category order by id),1.0-ds_kll_cdf(ds, CAST(-id AS FLOAT))[0]
+POSTHOOK: query: select id,category,cume_dist() over (partition by category order by id),ds_kll_cdf(ds, CAST(id AS FLOAT))[0]
 from sketch_input
-join ( select category as c,ds_kll_sketch(cast(-id as float)) as ds from sketch_input group by category) q on (q.c=category)
+join ( select category as c,ds_kll_sketch(cast(id as float)) as ds from sketch_input group by category) q on (q.c=category)
 order by category,id
 POSTHOOK: type: QUERY
 POSTHOOK: Input: default@sketch_input
 #### A masked pattern was here ####
-1	a	0.18181818181818182	0.18181818181818177
-1	a	0.18181818181818182	0.18181818181818177
-2	a	0.2727272727272727	0.2727272727272727
-3	a	0.36363636363636365	0.36363636363636365
-4	a	0.45454545454545453	0.4545454545454546
-5	a	0.5454545454545454	0.5454545454545454
-6	a	0.6363636363636364	0.6363636363636364
-7	a	0.7272727272727273	0.7272727272727273
-8	a	0.8181818181818182	0.8181818181818181
-9	a	0.9090909090909091	0.9090909090909091
-10	a	1.0	1.0
-6	b	0.18181818181818182	0.18181818181818177
-6	b	0.18181818181818182	0.18181818181818177
-7	b	0.2727272727272727	0.2727272727272727
-8	b	0.36363636363636365	0.36363636363636365
-9	b	0.45454545454545453	0.4545454545454546
-10	b	0.5454545454545454	0.5454545454545454
-11	b	0.6363636363636364	0.6363636363636364
-12	b	0.7272727272727273	0.7272727272727273
-13	b	0.8181818181818182	0.8181818181818181
-14	b	0.9090909090909091	0.9090909090909091
-15	b	1.0	1.0
+1	a	0.18181818181818182	0.0

Review comment:
       Results are not the same and there seems to be incorrect?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 447849)
    Time Spent: 20m  (was: 10m)

> Add option to rewrite NTILE and RANK to sketch functions
> --------------------------------------------------------
>
>                 Key: HIVE-23598
>                 URL: https://issues.apache.org/jira/browse/HIVE-23598
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)