You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/03/22 11:40:05 UTC

[GitHub] [hive] abstractdog opened a new pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

abstractdog opened a new pull request #2099:
URL: https://github.com/apache/hive/pull/2099


   Change-Id: I318d9825b1a6306af0fbe23d7da4ab0a75f3ebbc
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducable example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r616599645



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       I agree that the current solution is not really clean by having only the first column put into VectorExpression
   a couple of notes here, which needs to be discussed before proceeding with this huge refactor (which I'm happy to do once we're 100% certain about the "perfect" solution):
   
   1. unary, binary is not enough, unfortunately, we have even expressions involving even more cols, this is not a problem, we have the language support for that :) tertiary, quaternary...
   
   2. what's confusing is, how to show with simple class names that unary/binary/... is only a story about the input columns? an expression can have constants too, e.g. in IfExprScalarScalar.txt:
   ```
        this.arg1Column = arg1Column;
        this.arg2Scalar = arg2Scalar;
        this.arg3Scalar = arg3Scalar;
   ```
   in our terminology here, this is a unary expression because of arg1Column + scalars, but in reality, it's obviously not a unary function...
   
   3. with subclasses, we'll have to implement a general VectorExpression.setInputColumnNum(int i, int j, int k, ...vararg), otherwise, we won't be able to change the input column numbers (which is important, this was the intention of this huge vector expression refactor), I think this will work by simply overriding vararg method in subclasses




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r629269903



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       a) regarding input columns, a simple solution is could be https://github.com/apache/hive/pull/2099/commits/5db4c60aec34df291a1d1d85be319c865adcd1ad
   created convenience constructors, subclasses should be refactored accordingly now with care
   
   b) I was also wondering about having an array of inputcolumns as:
   ```
   public int[] inputColumnNums = new int[] { -1, -1, -1 };
   ```
   this could be more general, more easily extendable, and needs a bit more refactoring (every existing occurrence of inputColumn should become inputColumnNums[0])
   
   so regarding the class naming, we won't change anything (to unary, binary, ...) just use the proper constructor in subclasses




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] ramesh0201 commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
ramesh0201 commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r634911824



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       @abstractdog Option a also looks good to me. If you think that will be better then we can go with that. 
   
   Otherwise if we choose option b we can choose 2) new int[] { arg1Column, arg3Column, -1}; This way it is always easy to know only first n columns are used, rather than confused with the interleaving part




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog merged pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog merged pull request #2099:
URL: https://github.com/apache/hive/pull/2099


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] ramesh0201 commented on pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
ramesh0201 commented on pull request #2099:
URL: https://github.com/apache/hive/pull/2099#issuecomment-847144409


   +1 looks good to me


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r616599645



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       I agree that the current solution is not really clean by having only the first column put into VectorExpression
   a couple of notes here, which needs to be discussed before proceeding with this huge refactor (which I'm happy to do once we're 100% certain about the "perfect" solution):
   
   1. unary, binary is not enough, unfortunately, we have even expressions involving even more cols, this is not a problem, we have the language support for that :) tertiary, quaternary...
   
   2. what's confusing is, how to show with simple class names that unary/binary/... is only a story about the input columns? an expression can have constants too, e.g. in IfExprScalarScalar.txt:
   ```
        this.arg1Column = arg1Column;
        this.arg2Scalar = arg2Scalar;
        this.arg3Scalar = arg3Scalar;
   ```
   in our terminology here, this is a unary expression because of arg1Column + scalars, but in reality, it's obviously not a unary function...
   
   3. with subclasses, we'll have to implement a general VectorExpression.setInputColumnNum(int i, int j, int k, ...vararg), otherwise, we won't be able to change the input column numbers (which is important, this was the intention of this huge vector expression refactor), I think this will simply work by simply overriding vararg method in subclasses




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r633453402



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       what do you think about this @ramesh0201?
   I think a general input col array would be nice (option b) )
   
   however, there some rare cases where it's not obvious which position should be used, but it's up to agreement e.g.:
   IfExprScalarColumn.txt
   ```
   protected final int arg1Column;
   protected final <OperandType2> arg2Scalar;
   protected final int arg3Column;
   ```
   this is tricky because there is a scalar interleaved into the columns, input col array might look like:
   1. new int[] { arg1Column, -1, arg3Column};
   to emphasize that that the second argument is a scalar, so we'll refactor as:
   ```
   arg3Column => inputColumnNums[2]
   ```
   
   2. new int[] { arg1Column, arg3Column, -1};
   to ignore the fact that there is an interleaved scalar input, so we'll refactor as:
   ```
   arg3Column => inputColumnNums[1]
   ```
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on pull request #2099:
URL: https://github.com/apache/hive/pull/2099#issuecomment-847854102


   thanks for the review @ramesh0201, I'm rebasing on master and will merge if everything looks good


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r629269903



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       a) regarding input columns, a simple solution is could be https://github.com/apache/hive/pull/2099/commits/5db4c60aec34df291a1d1d85be319c865adcd1ad
   created convenience constructors, subclasses should be refactored accordingly now with care
   
   b) I was also wondering about having an array of inputcolumns as:
   ```
   public int[] inputColumnNums = new int[] { -1, -1, -1 };
   ```
   this could be more general, more easily extendable, and needs a bit more refactoring (every existing occurrence of inputColumn should becode inputColumnNums[0])
   
   so regarding the class naming, we won't change anything (to unary, binary, ...) just use the proper constructor in subclasses




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on pull request #2099:
URL: https://github.com/apache/hive/pull/2099#issuecomment-848295810


   merged to master, thanks for the review @ramesh0201 !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r629269903



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       a) regarding input columns, a simple solution is could be https://github.com/apache/hive/pull/2099/commits/5db4c60aec34df291a1d1d85be319c865adcd1ad
   so createdall needed input column fields in the base class + convenience constructors, subclasses should be refactored accordingly now with care
   
   b) I was also wondering about having an array of inputcolumns as:
   ```
   public int[] inputColumnNums = new int[] { -1, -1, -1 };
   ```
   this could be more general, more easily extendable, and needs a bit more refactoring (every existing occurrence of inputColumn should become inputColumnNums[0])
   
   so regarding the class naming, we won't change anything (to unary, binary, ...) just use the proper constructor in subclasses




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r616521343



##########
File path: ql/src/test/results/clientpositive/llap/windowing_udaf.q.out
##########
@@ -503,7 +503,7 @@ alice brown	25.258749999999996
 alice brown	25.529374999999998
 alice brown	25.63012987012987
 alice brown	26.472439024390237
-alice brown	27.100638297872322
+alice brown	27.27881720430106

Review comment:
       sure, I confirmed this manually, and I found that the new, vectorized average is correct (27.27881720430106)
   here is how I checked:
   1. table and original query
   ```
   create table over10k_n4(
   t tinyint,
   si smallint,
   i int,
   b bigint,
   f float,
   d double,
   bo boolean,
   s string,
   ts timestamp, 
   `dec` decimal,  
   bin binary)
   row format delimited
   fields terminated by '|';
   
   load data local inpath '../../data/files/over10k' into table over10k_n4;
   
   select t, f, d, avg(d) over (partition by t order by f) a from over10k_n4 order by s, a limit 100;
   ```
   
   2.  the original query is problematic to represent the problem, doesn't contain all the important rows due to limit 100, but here is a cleaner scenario
   ```
   select t, f, d, avg(d) over (partition by t order by f) a from over10k_n4 where t = 114;
   ```
   result:
   ```
   
   | 114  | 95.01  | 13.77  | 27.31472527472526   |
   | 114  | 95.09  | 45.37  | 27.510978260869546  |
   | 114  | 97.94  | 5.92   | 27.27881720430106   | <--------- this is the changed value
   | 114  | 97.94  | 10.53  | 27.100638297872322  |
   +------+--------+--------+---------------------+
   
   so we can see the avg in the row before the last row is 27.2788, how can we check this?
   let's calculate the sum and count for this row to get the average:
   1. sum (sum of all rows, the subtract the last one)
   ```
   select sum(d) - 10.53 from over10k_n4 where t = 114;
   2536.9299999999994
   ```
   
   2. count (count all, we can subtract 1 while calculating the average in the next step)
   ```
   select count(d) from over10k_n4 where t = 114;
   94
   ```
   
   3. average:
   ```
     2536.9299999999994 / 93 = 27.2788172043
   ```
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] ramesh0201 commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
ramesh0201 commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r616365379



##########
File path: ql/src/test/results/clientpositive/llap/windowing_udaf.q.out
##########
@@ -503,7 +503,7 @@ alice brown	25.258749999999996
 alice brown	25.529374999999998
 alice brown	25.63012987012987
 alice brown	26.472439024390237
-alice brown	27.100638297872322
+alice brown	27.27881720430106

Review comment:
       Can we have a comment in the PR review of why this is changed after this patch




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r616566788



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
##########
@@ -3005,6 +3003,19 @@ private boolean validatePTFOperator(PTFOperator op, VectorizationContext vContex
           }
         }
       }
+      if (vectorPTFDesc.getOrderExprNodeDescs().length > 1) {
+        /*
+         * Currently, we need to rule out here all cases where a range boundary scanner can run,
+         * basically: 1. bounded start 2. bounded end which is not current row
+         */
+        if (windowFrameDef.getWindowType() == WindowType.RANGE
+            && (!windowFrameDef.isStartUnbounded() || !windowFrameDef.getEnd().isCurrentRow())) {

Review comment:
       good catch, I'm going to fix the condition to enable unbounded end
   FYI, the connection between this part and boundary scanner is the unimplemented method:
   {code}
     @Override
     public boolean isDistanceGreater(Object v1, Object v2, int amt) {
       throw new UnsupportedOperationException("Only unbounded ranges supported");
     }
   {code}

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
##########
@@ -3005,6 +3003,19 @@ private boolean validatePTFOperator(PTFOperator op, VectorizationContext vContex
           }
         }
       }
+      if (vectorPTFDesc.getOrderExprNodeDescs().length > 1) {
+        /*
+         * Currently, we need to rule out here all cases where a range boundary scanner can run,
+         * basically: 1. bounded start 2. bounded end which is not current row
+         */
+        if (windowFrameDef.getWindowType() == WindowType.RANGE
+            && (!windowFrameDef.isStartUnbounded() || !windowFrameDef.getEnd().isCurrentRow())) {

Review comment:
       good catch! I'm going to fix the condition to enable unbounded end
   FYI, the connection between this part and boundary scanner is the unimplemented method:
   ```
     @Override
     public boolean isDistanceGreater(Object v1, Object v2, int amt) {
       throw new UnsupportedOperationException("Only unbounded ranges supported");
     }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r633453402



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       what do you think about this @ramesh0201?
   I think a general input col array would be nice (option b) )
   
   however, there some rare cases where it's not obvious which position should be used, but it's up to agreement e.g.:
   IfExprScalarColumn.txt
   ```
   protected final int arg1Column;
   protected final <OperandType2> arg2Scalar;
   protected final int arg3Column;
   ```
   this is tricky because there is a scalar interleaved into the columns, input col array might look like:
   1. new int[] { arg1Column, -1, arg3Column};
   to emphasize that that the second argument is a scalar, so we'll refactor as:
   ```
   arg3Column => inputColumnNums[2]
   ```
   
   2. new int[] { arg1Column, arg3Column, -1};
   to ignore the fact that there is an interleaved scalar input, so we'll refactor as:
   ```
   arg3Column => inputColumnNums[1]
   ```
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] ramesh0201 commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
ramesh0201 commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r616334377



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
##########
@@ -3005,6 +3003,19 @@ private boolean validatePTFOperator(PTFOperator op, VectorizationContext vContex
           }
         }
       }
+      if (vectorPTFDesc.getOrderExprNodeDescs().length > 1) {
+        /*
+         * Currently, we need to rule out here all cases where a range boundary scanner can run,
+         * basically: 1. bounded start 2. bounded end which is not current row
+         */
+        if (windowFrameDef.getWindowType() == WindowType.RANGE
+            && (!windowFrameDef.isStartUnbounded() || !windowFrameDef.getEnd().isCurrentRow())) {

Review comment:
       I am not much aware of the range boundary scanner, but do we have any issue vectorizing UNBOUNDED FOLLOWING? Can you help me understand how does range boundary scanner affects the vectorization




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] ramesh0201 commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
ramesh0201 commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r616364689



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       It will be better to have the two input columns together in the same place. But it is a good idea to move this to parent class and share the assigning code for all subclasses. Can we introduce two subclasses VectorUnaryExpression and VectorBinaryExpression( or similar name) for VectorExpression and then extend other classes from either one of these classes. That way we can also avoid passing -1 from classes that do not use the inputcolumn




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] ramesh0201 commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
ramesh0201 commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r616364689



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       It will be better to have the two input columns together in the same place. But it is a good idea to move this to parent class and share the assigning code for all subclasses. Can we introduce two subclasses VectorUnaryExpression and VectorBinaryExpression( or similar name) for VectorExpression and then extend other classes from either one of these classes. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r616521343



##########
File path: ql/src/test/results/clientpositive/llap/windowing_udaf.q.out
##########
@@ -503,7 +503,7 @@ alice brown	25.258749999999996
 alice brown	25.529374999999998
 alice brown	25.63012987012987
 alice brown	26.472439024390237
-alice brown	27.100638297872322
+alice brown	27.27881720430106

Review comment:
       sure, I confirmed this manually, and I found that the new, vectorized average is correct (27.27881720430106)
   here is how I checked:
   1. table and original query
   ```
   create table over10k_n4(
   t tinyint,
   si smallint,
   i int,
   b bigint,
   f float,
   d double,
   bo boolean,
   s string,
   ts timestamp, 
   `dec` decimal,  
   bin binary)
   row format delimited
   fields terminated by '|';
   
   load data local inpath '../../data/files/over10k' into table over10k_n4;
   
   select t, f, d, avg(d) over (partition by t order by f) a from over10k_n4 order by s, a limit 100;
   ```
   
   2.  the original query is problematic to represent the problem, doesn't contain all the important rows due to limit 100, but here is a cleaner scenario
   ```
   select t, f, d, avg(d) over (partition by t order by f) a from over10k_n4 where t = 114;
   ```
   result:
   ```
   
   | 114  | 95.01  | 13.77  | 27.31472527472526   |
   | 114  | 95.09  | 45.37  | 27.510978260869546  |
   | 114  | 97.94  | 5.92   | 27.27881720430106   | <--------- this is the changed value
   | 114  | 97.94  | 10.53  | 27.100638297872322  |
   +------+--------+--------+---------------------+
   ```
   so we can see the avg in the row before the last row is 27.2788, how can we check this?
   let's calculate the sum and count for this row to get the average:
   1. sum (sum of all rows, the subtract the last one)
   ```
   select sum(d) - 10.53 from over10k_n4 where t = 114;
   2536.9299999999994
   ```
   
   2. count (count all, we can subtract 1 while calculating the average in the next step)
   ```
   select count(d) from over10k_n4 where t = 114;
   94
   ```
   
   3. average:
   ```
     2536.9299999999994 / 93 = 27.2788172043
   ```
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] ramesh0201 commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
ramesh0201 commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r634911888



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
##########
@@ -3005,6 +3003,19 @@ private boolean validatePTFOperator(PTFOperator op, VectorizationContext vContex
           }
         }
       }
+      if (vectorPTFDesc.getOrderExprNodeDescs().length > 1) {
+        /*
+         * Currently, we need to rule out here all cases where a range boundary scanner can run,
+         * basically: 1. bounded start 2. bounded end which is not current row
+         */
+        if (windowFrameDef.getWindowType() == WindowType.RANGE
+            && (!windowFrameDef.isStartUnbounded() || !windowFrameDef.getEnd().isCurrentRow())) {

Review comment:
       Thanks it looks good




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r629266309



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java
##########
@@ -3005,6 +3003,19 @@ private boolean validatePTFOperator(PTFOperator op, VectorizationContext vContex
           }
         }
       }
+      if (vectorPTFDesc.getOrderExprNodeDescs().length > 1) {
+        /*
+         * Currently, we need to rule out here all cases where a range boundary scanner can run,
+         * basically: 1. bounded start 2. bounded end which is not current row
+         */
+        if (windowFrameDef.getWindowType() == WindowType.RANGE
+            && (!windowFrameDef.isStartUnbounded() || !windowFrameDef.getEnd().isCurrentRow())) {

Review comment:
       this is fixed in https://github.com/apache/hive/pull/2099/commits/f6b5ea27501ccef175da961b49be9e0587a1ce00 and will be squashed later




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r635952821



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       okay, I'm trying to proceed with 2) with your advice, because an array is nicer than having fields, not to mention if we're adding convenience methods for processing input columns (simple loop vs. mentioning all the fields)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] abstractdog commented on a change in pull request #2099: HIVE-24761: Support vectorization for bounded windows in PTF

Posted by GitBox <gi...@apache.org>.
abstractdog commented on a change in pull request #2099:
URL: https://github.com/apache/hive/pull/2099#discussion_r616599645



##########
File path: ql/src/gen/vectorization/ExpressionTemplates/ColumnArithmeticColumn.txt
##########
@@ -34,20 +34,17 @@ public class <ClassName> extends VectorExpression {
 
   private static final long serialVersionUID = 1L;
 
-  private final int colNum1;
   private final int colNum2;

Review comment:
       I agree that the current solution is not really clean by having only the first column put into VectorExpression
   a couple of notes here, which needs to be discussed before proceeding with this huge refactor (which I'm happy to do once we 100% certain about the "perfect" solution):
   
   1. unary, binary is not enough, unfortunately, we have even expressions involving even more cols, this is not a problem, we have the language support for that :) tertiary, quaternary...
   
   2. what's confusing is, how to show with simple class names that unary/binary/... is only a story about the input columns? an expression can have constants too, e.g. in IfExprScalarScalar.txt:
   ```
        this.arg1Column = arg1Column;
        this.arg2Scalar = arg2Scalar;
        this.arg3Scalar = arg3Scalar;
   ```
   in our terminology here, this is a unary expression because of arg1Column + scalars, but in reality, it's obviously not a unary function...
   
   3. with subclasses, we'll have to implement a general VectorExpression.setInputColumnNum(int i, int j, int k, ...vararg), otherwise, we won't be able to change the input column numbers (which is important, this was the intention of this huge vector expression refactor), I think this will simply work by simply overriding vararg method in subclasses




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org