You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/07/30 18:09:03 UTC

[GitHub] [hive] okumin opened a new pull request #2551: HIVE-25410: CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

okumin opened a new pull request #2551:
URL: https://github.com/apache/hive/pull/2551


   ### What changes were proposed in this pull request?
   
   As commented in the following ticket, the current implementation can't perform JOIN on a variant-sized ARRAY.
   This PR will let CommonMergeJoinOperator handle any ARRAY objects as JOIN keys.
   https://issues.apache.org/jira/browse/HIVE-25410
   
   ### Why are the changes needed?
   
   We have no reasons to allow only fixed-length ARRAYs as JOIN keys. We can regard this issue as a kind of bug.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This case should have been covered in HIVE-24883.
   
   ### How was this patch tested?
   
   - I've modified two test cases
   - I've tested the query to reproduce the issue in the ticket


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] okumin commented on a change in pull request #2551: HIVE-25410: CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #2551:
URL: https://github.com/apache/hive/pull/2551#discussion_r680140617



##########
File path: ql/src/test/queries/clientpositive/smb_mapjoin_complex_type.q
##########
@@ -10,10 +10,10 @@ set hive.merge.mapfiles=false;
 set hive.merge.mapredfiles=false; 
 
 CREATE TABLE test_list1 (key INT, value array<int>, col_1 STRING) CLUSTERED BY (value) SORTED BY (value) INTO 2 BUCKETS;
-INSERT INTO test_list1 VALUES (99, array(0,0), 'Alice'), (99, array(2,2), 'Mat'), (100, array(0,0), 'Bob'), (101, array(2,2), 'Car');
+INSERT INTO test_list1 VALUES (99, array(0,0), 'Alice'), (99, array(2,2), 'Mat'), (100, array(0,0), 'Bob'), (101, array(2,2), 'Car'), (102, array(1, 2, 3, 4), 'Mallory');
 
 CREATE TABLE test_list2 (key INT, value array<int>, col_2 STRING) CLUSTERED BY (value) SORTED BY (value) INTO 2 BUCKETS;
-INSERT INTO test_list2 VALUES (102, array(2,2), 'Del'), (103, array(2,2), 'Ema'), (104, array(3,3), 'Fli');
+INSERT INTO test_list2 VALUES (102, array(2,2), 'Del'), (103, array(2,2), 'Ema'), (104, array(3,3), 'Fli'), (105, array(1, 2, 3, 4), 'Victor');

Review comment:
       FYI: Actually, SMB Map JOIN is not affected by this issue because WritableComarator is not reused, unlike CommonMergeJoinOperator.
   https://github.com/maheshk114/hive/blob/9a6e6cafdcac47f57c04947fefcaa35815243d77/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java#L472-L473




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] okumin commented on pull request #2551: HIVE-25410: CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

Posted by GitBox <gi...@apache.org>.
okumin commented on pull request #2551:
URL: https://github.com/apache/hive/pull/2551#issuecomment-895936902


   Thanks for your quick review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak closed pull request #2551: HIVE-25410: CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

Posted by GitBox <gi...@apache.org>.
zabetak closed pull request #2551:
URL: https://github.com/apache/hive/pull/2551


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] okumin commented on a change in pull request #2551: HIVE-25410: CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #2551:
URL: https://github.com/apache/hive/pull/2551#discussion_r685915956



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/HiveStructComparator.java
##########
@@ -45,16 +48,14 @@ public int compare(Object key1, Object key2) {
         if (a1.size() == 0) {
             return 0;
         }
-        if (comparator == null) {
-            comparator = new WritableComparator[a1.size()];
-            // For struct all elements may not be of same type, so create comparator for each entry.
-            for (int i = 0; i < a1.size(); i++) {
-                comparator[i] = WritableComparatorFactory.get(a1.get(i), nullSafe, nullOrdering);
-            }
+        // For array, the length may not be fixed, so extend comparators on demand
+        for (int i = comparators.size(); i < a1.size(); i++) {
+            // For struct, all elements may not be of same type, so create comparator for each entry.
+            comparators.add(i, WritableComparatorFactory.get(a1.get(i), nullSafe, nullOrdering));
         }
         result = 0;
         for (int i = 0; i < a1.size(); i++) {
-            result = comparator[i].compare(a1.get(i), a2.get(i));
+            result = comparators.get(i).compare(a1.get(i), a2.get(i));

Review comment:
       @zabetak Maybe, I have the same feeling. Basically, STRUCT and ARRAY are different data structures and we can have different approach. That would also make implementation straightforward.
   My idea is let WritableComparatorFactory identify more precise types so that it can distinct STRUCT, ARRAY, and so on. It will require some effort since WritableComparatorFactory has to accept ObjectInspector or type information. But it sounds more robust than inferring data types from `Object`.
   Anyway, I agree to create a follow-up ticket and I will do that if you have nothing more to discuss here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] okumin commented on pull request #2551: HIVE-25410: CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

Posted by GitBox <gi...@apache.org>.
okumin commented on pull request #2551:
URL: https://github.com/apache/hive/pull/2551#issuecomment-895835220


   @maheshk114 @zabetak Could you please take a look when you have a chance? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] okumin commented on a change in pull request #2551: HIVE-25410: CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #2551:
URL: https://github.com/apache/hive/pull/2551#discussion_r680138295



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/WritableComparatorFactory.java
##########
@@ -19,15 +19,14 @@
 
 import org.apache.hadoop.hive.ql.util.NullOrdering;
 import org.apache.hadoop.hive.serde2.objectinspector.StandardUnionObjectInspector.StandardUnion;
-import org.apache.hadoop.io.WritableComparable;
 import org.apache.hadoop.io.WritableComparator;
 import java.util.List;
 import java.util.Map;
 
 public final class WritableComparatorFactory {
     public static WritableComparator get(Object key, boolean nullSafe, NullOrdering nullOrdering) {
         if (key instanceof List) {
-            // For array type struct is used as we do not know if all elements of array are of same type.
+            // STRUCT or ARRAY are expressed as java.util.List
             return new HiveStructComparator(nullSafe, nullOrdering);

Review comment:
       In my understanding, all elements of ARRAY should have the same type.
   - https://github.com/apache/hive/blob/eef2a5dda0470525d0d89bd9820e761fe44dc3a8/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/StandardListObjectInspector.java#L34
   - https://github.com/apache/hive/blob/eef2a5dda0470525d0d89bd9820e761fe44dc3a8/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/ListTypeInfo.java#L39
   
   Though UNION can have multiple types, it's expressed as a special type. Currently, it is not supported as below.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] okumin commented on a change in pull request #2551: HIVE-25410: CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

Posted by GitBox <gi...@apache.org>.
okumin commented on a change in pull request #2551:
URL: https://github.com/apache/hive/pull/2551#discussion_r680135993



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/HiveStructComparator.java
##########
@@ -45,16 +48,14 @@ public int compare(Object key1, Object key2) {
         if (a1.size() == 0) {
             return 0;
         }
-        if (comparator == null) {

Review comment:
       The root cause is that the num of comparators is fixed to the size of the first List. This will work for STRUCT since the number of elements is consistent across records. However, in the case of ARRAY, it varies.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak commented on a change in pull request #2551: HIVE-25410: CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #2551:
URL: https://github.com/apache/hive/pull/2551#discussion_r685832922



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/HiveStructComparator.java
##########
@@ -45,16 +48,14 @@ public int compare(Object key1, Object key2) {
         if (a1.size() == 0) {
             return 0;
         }
-        if (comparator == null) {
-            comparator = new WritableComparator[a1.size()];
-            // For struct all elements may not be of same type, so create comparator for each entry.
-            for (int i = 0; i < a1.size(); i++) {
-                comparator[i] = WritableComparatorFactory.get(a1.get(i), nullSafe, nullOrdering);
-            }
+        // For array, the length may not be fixed, so extend comparators on demand
+        for (int i = comparators.size(); i < a1.size(); i++) {
+            // For struct, all elements may not be of same type, so create comparator for each entry.
+            comparators.add(i, WritableComparatorFactory.get(a1.get(i), nullSafe, nullOrdering));
         }
         result = 0;
         for (int i = 0; i < a1.size(); i++) {
-            result = comparator[i].compare(a1.get(i), a2.get(i));
+            result = comparators.get(i).compare(a1.get(i), a2.get(i));

Review comment:
       For `ARRAY` comparisons we could even initialize and use only one comparator instead of creating the same one multiple times. If the size of the ARRAYs is small as in the example it will not make any real difference but for bigger ARRAYs we may feel both in terms of memory and CPU. Since this might need a bit of refactoring we can tackle it in a follow up JIRA, what do you think @okumin ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org