You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/02/01 09:50:14 UTC

[GitHub] [hive] zabetak opened a new pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

zabetak opened a new pull request #1544:
URL: https://github.com/apache/hive/pull/1544


   
   ### What changes were proposed in this pull request?
   
   Use hash(hash(hash(a,b),c),d) instead of hash(a,b,c,d) when constructing
   the multi-col semijoin reducer.
   
   ### Why are the changes needed?
   In order to use fully vectorized execution on multi-col semijoin reducers.
   
   ### Does this PR introduce _any_ user-facing change?
   Only changes in EXPLAIN plans.
   
   ### How was this patch tested?
   `mvn test -Dtest=TestTezPerfCliDriver -Dqfile="query50.q"`
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak commented on pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
zabetak commented on pull request #1544:
URL: https://github.com/apache/hive/pull/1544#issuecomment-751886331


   > @zabetak , can this be rebased / merged?
   It's easy to rebase but don't know if its worth merging. I didn't do any tests about the quality of computing the overall hash by nesting the hash functions. 
   
   **Advantage**: Computation becomes vectorized
   **Disadvantage**: Quality of hash is not measured
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak commented on a change in pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #1544:
URL: https://github.com/apache/hive/pull/1544#discussion_r504638460



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java
##########
@@ -233,6 +235,23 @@ public static ExprNodeGenericFuncDesc and(List<ExprNodeDesc> exps) {
     return new ExprNodeGenericFuncDesc(TypeInfoFactory.booleanTypeInfo, new GenericUDFOPAnd(), "and", flatExps);
   }
 
+  /**
+   * Create an expression for computing a hash by recursively hashing given expressions by two:
+   * <pre>
+   * Input: HASH(A, B, C, D)
+   * Output: HASH(HASH(HASH(A,B),C),D)
+   * </pre>
+   */
+  public static ExprNodeGenericFuncDesc hash(List<ExprNodeDesc> exps) {
+    assert exps.size() >= 2;
+    ExprNodeDesc hashExp = exps.get(0);
+    for (int i = 1; i < exps.size(); i++) {
+      List<ExprNodeDesc> hArgs = Arrays.asList(hashExp, exps.get(i));
+      hashExp = new ExprNodeGenericFuncDesc(TypeInfoFactory.intTypeInfo, new GenericUDFMurmurHash(), "hash", hArgs);

Review comment:
       Good catch @kgyrtkirk ! I've never noticed that we have two different UDFs for hashing. Indeed having the same annotation can create quite some confusion and difficult to debug problems. I guess your suggestion is to change the annotation of GenericUDFMurmurHash to murmur_hash right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak commented on pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
zabetak commented on pull request #1544:
URL: https://github.com/apache/hive/pull/1544#issuecomment-773175647


   Hey @kgyrtkirk , we finally got a green run on this one :) Can you please help merging this back.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1544:
URL: https://github.com/apache/hive/pull/1544#discussion_r504482010



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java
##########
@@ -233,6 +235,23 @@ public static ExprNodeGenericFuncDesc and(List<ExprNodeDesc> exps) {
     return new ExprNodeGenericFuncDesc(TypeInfoFactory.booleanTypeInfo, new GenericUDFOPAnd(), "and", flatExps);
   }
 
+  /**
+   * Create an expression for computing a hash by recursively hashing given expressions by two:
+   * <pre>
+   * Input: HASH(A, B, C, D)
+   * Output: HASH(HASH(HASH(A,B),C),D)
+   * </pre>
+   */
+  public static ExprNodeGenericFuncDesc hash(List<ExprNodeDesc> exps) {
+    assert exps.size() >= 2;
+    ExprNodeDesc hashExp = exps.get(0);
+    for (int i = 1; i < exps.size(); i++) {
+      List<ExprNodeDesc> hArgs = Arrays.asList(hashExp, exps.get(i));
+      hashExp = new ExprNodeGenericFuncDesc(TypeInfoFactory.intTypeInfo, new GenericUDFMurmurHash(), "hash", hArgs);

Review comment:
       it seems like we have some inconsistency in `GenericUDFMurmurHash` which is registered as `murmur_hash` in the `FunctionRegistry` ; however in the UDF's annotation it only has `hash` - and here as well we use simply "hash".
   a change like this will most likely cause a lot of q.out changes - could you file a follow-up ticket?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
jcamachor commented on pull request #1544:
URL: https://github.com/apache/hive/pull/1544#issuecomment-745013346


   @zabetak , can this be rebased / merged?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak commented on a change in pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #1544:
URL: https://github.com/apache/hive/pull/1544#discussion_r549513959



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java
##########
@@ -233,6 +235,23 @@ public static ExprNodeGenericFuncDesc and(List<ExprNodeDesc> exps) {
     return new ExprNodeGenericFuncDesc(TypeInfoFactory.booleanTypeInfo, new GenericUDFOPAnd(), "and", flatExps);
   }
 
+  /**
+   * Create an expression for computing a hash by recursively hashing given expressions by two:
+   * <pre>
+   * Input: HASH(A, B, C, D)
+   * Output: HASH(HASH(HASH(A,B),C),D)
+   * </pre>
+   */
+  public static ExprNodeGenericFuncDesc hash(List<ExprNodeDesc> exps) {
+    assert exps.size() >= 2;
+    ExprNodeDesc hashExp = exps.get(0);
+    for (int i = 1; i < exps.size(); i++) {
+      List<ExprNodeDesc> hArgs = Arrays.asList(hashExp, exps.get(i));
+      hashExp = new ExprNodeGenericFuncDesc(TypeInfoFactory.intTypeInfo, new GenericUDFMurmurHash(), "hash", hArgs);

Review comment:
       Logged https://issues.apache.org/jira/browse/HIVE-24572 for this purpose.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk merged pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
kgyrtkirk merged pull request #1544:
URL: https://github.com/apache/hive/pull/1544


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak commented on pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
zabetak commented on pull request #1544:
URL: https://github.com/apache/hive/pull/1544#issuecomment-773175647


   Hey @kgyrtkirk , we finally got a green run on this one :) Can you please help merging this back.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak commented on pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
zabetak commented on pull request #1544:
URL: https://github.com/apache/hive/pull/1544#issuecomment-770725925


   Close/Reopen to trigger tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk merged pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
kgyrtkirk merged pull request #1544:
URL: https://github.com/apache/hive/pull/1544


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] github-actions[bot] commented on pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #1544:
URL: https://github.com/apache/hive/pull/1544#issuecomment-744104254


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #1544:
URL: https://github.com/apache/hive/pull/1544#discussion_r542113182



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDescUtils.java
##########
@@ -233,6 +235,23 @@ public static ExprNodeGenericFuncDesc and(List<ExprNodeDesc> exps) {
     return new ExprNodeGenericFuncDesc(TypeInfoFactory.booleanTypeInfo, new GenericUDFOPAnd(), "and", flatExps);
   }
 
+  /**
+   * Create an expression for computing a hash by recursively hashing given expressions by two:
+   * <pre>
+   * Input: HASH(A, B, C, D)
+   * Output: HASH(HASH(HASH(A,B),C),D)
+   * </pre>
+   */
+  public static ExprNodeGenericFuncDesc hash(List<ExprNodeDesc> exps) {
+    assert exps.size() >= 2;
+    ExprNodeDesc hashExp = exps.get(0);
+    for (int i = 1; i < exps.size(); i++) {
+      List<ExprNodeDesc> hArgs = Arrays.asList(hashExp, exps.get(i));
+      hashExp = new ExprNodeGenericFuncDesc(TypeInfoFactory.intTypeInfo, new GenericUDFMurmurHash(), "hash", hArgs);

Review comment:
       yes; excatly




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak closed pull request #1544: HIVE-24221: Use vectorizable expression to combine multiple columns in semijoin bloom filters

Posted by GitBox <gi...@apache.org>.
zabetak closed pull request #1544:
URL: https://github.com/apache/hive/pull/1544


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org