You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/12/18 19:32:25 UTC

[GitHub] [hive] kgyrtkirk opened a new pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

kgyrtkirk opened a new pull request #2891:
URL: https://github.com/apache/hive/pull/2891


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducable example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2891:
URL: https://github.com/apache/hive/pull/2891#discussion_r772151792



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -1836,6 +1836,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVEALIAS("hive.alias", "", ""),
     HIVEMAPSIDEAGGREGATE("hive.map.aggr", true, "Whether to use map-side aggregation in Hive Group By queries"),
     HIVEGROUPBYSKEW("hive.groupby.skewindata", false, "Whether there is skew in data to optimize group by queries"),
+    HIVE_JOIN_SHORTCUT_EXTENSIONROWS("hive.join.shortcut.extensionrows", true,
+        "Enables to shortcut processing of known filtered rows in merge joins."),

Review comment:
       idea was to add this toggle to make it possible to check if this feature is causing any perf regression - and make it possible to validate that quickly




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2891:
URL: https://github.com/apache/hive/pull/2891#discussion_r772314289



##########
File path: ql/src/test/results/clientpositive/llap/join_1to1.q.out
##########
@@ -597,22 +592,10 @@ NULL	NULL	NULL	10	10010	66
 NULL	NULL	NULL	25	10025	66
 NULL	NULL	NULL	30	10030	88
 NULL	NULL	NULL	35	10035	88
-NULL	NULL	NULL	40	10040	66
 NULL	NULL	NULL	40	10040	88
-NULL	NULL	NULL	40	10040	88
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	88
 NULL	NULL	NULL	50	10050	88
-NULL	NULL	NULL	50	10050	88
-NULL	NULL	NULL	70	10040	88
 NULL	NULL	NULL	70	10040	88
 NULL	NULL	NULL	70	10040	88

Review comment:
       I've just compared all the results in this file against postgres - and the results are now good




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] amansinha100 commented on a change in pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

Posted by GitBox <gi...@apache.org>.
amansinha100 commented on a change in pull request #2891:
URL: https://github.com/apache/hive/pull/2891#discussion_r772471567



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -1836,6 +1836,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVEALIAS("hive.alias", "", ""),
     HIVEMAPSIDEAGGREGATE("hive.map.aggr", true, "Whether to use map-side aggregation in Hive Group By queries"),
     HIVEGROUPBYSKEW("hive.groupby.skewindata", false, "Whether there is skew in data to optimize group by queries"),
+    HIVE_JOIN_SHORTCUT_EXTENSIONROWS("hive.join.shortcut.extensionrows", true,
+        "Enables to shortcut processing of known filtered rows in merge joins."),

Review comment:
       Agree that this config is needed for performance testing.  At the same time, agree that it should be marked for internal use.  I see there's precedent ..there are a few config options with 'Internal use only' in the description. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2891:
URL: https://github.com/apache/hive/pull/2891#discussion_r772151007



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonMergeJoinOperator.java
##########
@@ -302,7 +322,47 @@ public void process(Object row, int tag) throws HiveException {
 
     assert !nextKeyGroup;
     candidateStorage[tag].addRow(value);
+  }
+
+  private void emitExtensionRow(int tag, List<Object> value) throws HiveException {
+    extensionStorage[tag].addRow(value);
+    for (byte i = 0; i < order.length; i++) {
+      if (i == tag) {
+        storage[i] = extensionStorage[i];
+      } else {
+        putDummyOrEmpty(i);
+      }
+    }
+    checkAndGenObject();
+    extensionStorage[tag].clearRows();
+  }
 
+  /**
+   * Decides if the actual row must be an extension row.
+   *
+   * Extension rows are those which are not part of the inner-join.
+   * May not correctly identify all extension rows - but will remove trivially filtered ones.

Review comment:
       > May not identify all extension rows
   
   it could not identify non-trivial ones like that no rows could satisfy the `l.id=r.id and l.id+2*r.id=1` condition...
   guess what...running:
   ```
   select * from t_xy l full outer join t_y r on (l.id=r.id and l.s='y' and l.id+2*r.id=1);
   ```
   resulted in incorrect results as well...
   
   this will leave me no choice but to remove that `skip` crap and replace it with something decent....




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk merged pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

Posted by GitBox <gi...@apache.org>.
kgyrtkirk merged pull request #2891:
URL: https://github.com/apache/hive/pull/2891


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2891:
URL: https://github.com/apache/hive/pull/2891#discussion_r773030708



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonMergeJoinOperator.java
##########
@@ -240,6 +249,17 @@ public void process(Object row, int tag) throws HiveException {
 
     byte alias = (byte) tag;
     List<Object> value = getFilteredValue(alias, row);
+
+    if (shortcutExtensionRows && isOuterJoinExtensionRow(tag, value)) {
+      int type = condn[0].getType();
+      if (tag == 0 && (type == JoinDesc.LEFT_OUTER_JOIN || type == JoinDesc.FULL_OUTER_JOIN)) {
+        emitExtensionRow(tag, value);

Review comment:
       it have stopped for me for other outer joins ; however I had to disable auto.join.converion to have that - without that the join was converted into a mapjoin
   I will make some further tests for different types of joins to see if we don't have any issues




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2891:
URL: https://github.com/apache/hive/pull/2891#discussion_r772145536



##########
File path: ql/src/test/results/clientpositive/llap/join_1to1.q.out
##########
@@ -373,9 +373,6 @@ NULL	NULL	NULL	25	10025	66
 NULL	NULL	NULL	30	10030	88
 NULL	NULL	NULL	35	10035	88
 NULL	NULL	NULL	40	10040	88
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	88

Review comment:
       yes; overwrite without checking tends to happen - that's why I see no value in the tests which use the `src` table - which content is not defined at all - one way around these problems is to use `assert_true` ...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] amansinha100 commented on a change in pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

Posted by GitBox <gi...@apache.org>.
amansinha100 commented on a change in pull request #2891:
URL: https://github.com/apache/hive/pull/2891#discussion_r772487246



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonMergeJoinOperator.java
##########
@@ -240,6 +249,17 @@ public void process(Object row, int tag) throws HiveException {
 
     byte alias = (byte) tag;
     List<Object> value = getFilteredValue(alias, row);
+
+    if (shortcutExtensionRows && isOuterJoinExtensionRow(tag, value)) {
+      int type = condn[0].getType();
+      if (tag == 0 && (type == JoinDesc.LEFT_OUTER_JOIN || type == JoinDesc.FULL_OUTER_JOIN)) {
+        emitExtensionRow(tag, value);

Review comment:
       I applied this patch and added a log message to emitExtensionRow() and ran queries with full , left and right outer join (with the extra single table predicate in ON clause) but only see the log msg for full outer join, not for left/right. It is unclear which situations would lead to extension rows for left/right outer join.  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak commented on a change in pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #2891:
URL: https://github.com/apache/hive/pull/2891#discussion_r772195126



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonMergeJoinOperator.java
##########
@@ -92,6 +94,7 @@
 
   // A field because we cannot multi-inherit.
   transient InterruptibleProcessing interruptChecker;
+  transient boolean shortcutExtensionRows;

Review comment:
       Minor: should we use `unmatched` everywhere we use `extension`?
   
   "Unmatched rows of an outer join" sounds more natural than "Extension rows of an outer join".

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -1836,6 +1836,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVEALIAS("hive.alias", "", ""),
     HIVEMAPSIDEAGGREGATE("hive.map.aggr", true, "Whether to use map-side aggregation in Hive Group By queries"),
     HIVEGROUPBYSKEW("hive.groupby.skewindata", false, "Whether there is skew in data to optimize group by queries"),
+    HIVE_JOIN_SHORTCUT_EXTENSIONROWS("hive.join.shortcut.extensionrows", true,
+        "Enables to shortcut processing of known filtered rows in merge joins."),

Review comment:
       I see your point but I still think that it would be better to avoid toggles when it comes to correctness issues. I guess it is not the first time we are doing this so I am not gonna push back on this but it shouldn't be the norm. I am leaving the final judgement up to you.
   
   If it goes in I would add some extra information that it is for "internal use only" and changing the value can lead to incorrect results.

##########
File path: ql/src/test/results/clientpositive/llap/join_1to1.q.out
##########
@@ -597,22 +592,10 @@ NULL	NULL	NULL	10	10010	66
 NULL	NULL	NULL	25	10025	66
 NULL	NULL	NULL	30	10030	88
 NULL	NULL	NULL	35	10035	88
-NULL	NULL	NULL	40	10040	66
 NULL	NULL	NULL	40	10040	88
-NULL	NULL	NULL	40	10040	88
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	88
 NULL	NULL	NULL	50	10050	88
-NULL	NULL	NULL	50	10050	88
-NULL	NULL	NULL	70	10040	88
 NULL	NULL	NULL	70	10040	88
 NULL	NULL	NULL	70	10040	88

Review comment:
       I didn't verify that all the changes in this file are correct. Should I do it or you went over them already?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonMergeJoinOperator.java
##########
@@ -302,7 +322,47 @@ public void process(Object row, int tag) throws HiveException {
 
     assert !nextKeyGroup;
     candidateStorage[tag].addRow(value);
+  }
+
+  private void emitExtensionRow(int tag, List<Object> value) throws HiveException {
+    extensionStorage[tag].addRow(value);
+    for (byte i = 0; i < order.length; i++) {
+      if (i == tag) {
+        storage[i] = extensionStorage[i];
+      } else {
+        putDummyOrEmpty(i);
+      }
+    }
+    checkAndGenObject();
+    extensionStorage[tag].clearRows();
+  }
 
+  /**
+   * Decides if the actual row must be an extension row.
+   *
+   * Extension rows are those which are not part of the inner-join.
+   * May not correctly identify all extension rows - but will remove trivially filtered ones.

Review comment:
       Since you have examples that do not work, maybe it would be better to write "The current implementation, does not correctly identify all..." and put a reference to a follow up JIRA highlighting that there is an implementation limitation.

##########
File path: ql/src/test/results/clientpositive/llap/join_1to1.q.out
##########
@@ -373,9 +373,6 @@ NULL	NULL	NULL	25	10025	66
 NULL	NULL	NULL	30	10030	88
 NULL	NULL	NULL	35	10035	88
 NULL	NULL	NULL	40	10040	88
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	88

Review comment:
       I am not a big fan of the the `src` and similar tables but if people validated the results against another DBMS I can leave with it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] zabetak commented on a change in pull request #2891: HIVE-25822: Unexpected result rows in case of outer join contains conditions only affecting one side

Posted by GitBox <gi...@apache.org>.
zabetak commented on a change in pull request #2891:
URL: https://github.com/apache/hive/pull/2891#discussion_r771938082



##########
File path: ql/src/test/results/clientpositive/llap/join_1to1.q.out
##########
@@ -373,9 +373,6 @@ NULL	NULL	NULL	25	10025	66
 NULL	NULL	NULL	30	10030	88
 NULL	NULL	NULL	35	10035	88
 NULL	NULL	NULL	40	10040	88
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	66
-NULL	NULL	NULL	50	10050	88

Review comment:
       Indeed these rows shouldn't be there. It's a pity that people update q.out files without paying attention if the resulting changes are correct. We could have found this bug much earlier if people verified the results against another DBMS as well.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonMergeJoinOperator.java
##########
@@ -302,7 +322,47 @@ public void process(Object row, int tag) throws HiveException {
 
     assert !nextKeyGroup;
     candidateStorage[tag].addRow(value);
+  }
+
+  private void emitExtensionRow(int tag, List<Object> value) throws HiveException {
+    extensionStorage[tag].addRow(value);
+    for (byte i = 0; i < order.length; i++) {
+      if (i == tag) {
+        storage[i] = extensionStorage[i];
+      } else {
+        putDummyOrEmpty(i);
+      }
+    }
+    checkAndGenObject();
+    extensionStorage[tag].clearRows();
+  }
 
+  /**
+   * Decides if the actual row must be an extension row.
+   *
+   * Extension rows are those which are not part of the inner-join.
+   * May not correctly identify all extension rows - but will remove trivially filtered ones.

Review comment:
       `May not correctly...` It is not clear what this means. Possibly an example could help clarify this better.
   What happens in the case where it does not identify an extension row? Wrong results?

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -1836,6 +1836,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVEALIAS("hive.alias", "", ""),
     HIVEMAPSIDEAGGREGATE("hive.map.aggr", true, "Whether to use map-side aggregation in Hive Group By queries"),
     HIVEGROUPBYSKEW("hive.groupby.skewindata", false, "Whether there is skew in data to optimize group by queries"),
+    HIVE_JOIN_SHORTCUT_EXTENSIONROWS("hive.join.shortcut.extensionrows", true,
+        "Enables to shortcut processing of known filtered rows in merge joins."),

Review comment:
       If my understanding is correct if this is set to false the outer join produces incorrect results. I don't think it is a good idea to have a property which determines the correctness of the results.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org