You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/02/27 06:19:54 UTC

[GitHub] [hive] scarlin-cloudera opened a new pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

scarlin-cloudera opened a new pull request #2027:
URL: https://github.com/apache/hive/pull/2027


   …ercion
   
   When the query has a where clause that has an integer column checking against
   being "not in" a decimal column, the decimal column is being changed to null,
   causing incorrect results.
   
   This is a sample query of a failure:
   select count from my_tbl where int_col not in (355.8);
   
   Since the int_col can never be 355.8, one would expect all the rows to be
   returned, but it is changing the 355.8 into a null value causing no rows
   to be returned.
   
   The fix is that when the value is attempted to be interpreted and it fails,
   the original type should be used rather than null.
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducable example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor merged pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
jcamachor merged pull request #2027:
URL: https://github.com/apache/hive/pull/2027


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r590387847



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       yes they could be in Calcite some day...we can live without them
   
   that if statement is easier to be viewed without all these comments:
   
   https://github.com/apache/hive/blob/20137b5894fc7ec6f4acaef4304dfdc2ce93275a/ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java#L1012-L1015




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] scarlin-cloudera commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
scarlin-cloudera commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r591853701



##########
File path: ql/src/test/queries/clientpositive/in_coercion.q
##########
@@ -0,0 +1,14 @@
+DROP TABLE src_table;
+CREATE TABLE src_table (key int);
+LOAD DATA LOCAL INPATH '../../data/files/kv6.txt' INTO TABLE src_table;
+
+-- verify table has data
+select count(*) from src_table; 

Review comment:
       Fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] scarlin-cloudera commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
scarlin-cloudera commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r589822395



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       So I'm not sure how to address your comments, I"m gonna address all of them here.  Lemme take a step back and tell you what I've already discussed with Jesus and the direction we were going.
   
   So we know we're losing some optimizations as you've noted.  Jesus felt that they weren't that big of a deal.  For instance, we'd lose the optimization of "tinyint_col in (2500000)".  Previously, we saw this changing to false since a tinyint col can never be that value, but that check won't be optimized out now.  I think that's the main one I saw with the tests.
   
   So for now, we'd like to avoid a bigger rewrite.  He also noted that this optimization should perhaps be more in the Calcite framework, which makes sense to me.
   
   You did have one other comment about an "if" statement not being hit.  I'm not sure I understand the bug that you're referring to.  It's a bit complicated to understand, but it seems ok to me?  Can you explain this further?
   Thanks again!
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] dengzhhu653 commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
dengzhhu653 commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r600985864



##########
File path: ql/src/test/results/clientpositive/llap/compare_cols_null.q.out
##########
@@ -34,8 +34,9 @@ STAGE PLANS:
       Processor Tree:
         TableScan
           alias: ccn_table
+          filterExpr: (key > '123a') (type: boolean)
           Filter Operator
-            predicate: false (type: boolean)
+            predicate: (key > '123a') (type: boolean)

Review comment:
       A little question: would we do a full table scan when the predicate is  ` (key > '123a')` here, or cloud we also do something optimization for these cases? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] scarlin-cloudera commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
scarlin-cloudera commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r591854070



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {
               TypeInfo targetType = exprFactory.getTypeInfo(valueDesc);
               if (!expressions.containsKey(targetType)) {

Review comment:
       Addressed in comment below




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] dengzhhu653 commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
dengzhhu653 commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r592091193



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       sorry,  not(null) seems do not work as expected....




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] dengzhhu653 commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
dengzhhu653 commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r592087377



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       I mean the way we deal with the constant that cannot be interpreted, treat it as a null or false expression.
   
    For example:
   ```col not in (355.8)```  translates to ```not(null)```;
   ```col not in (355.8, 123)```  translates to ```not(null or col in (123))```;
   
    




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] dengzhhu653 commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
dengzhhu653 commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r592039373



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       Cloud we repace null with Boolean.False here in this case?  to expression like  false or col in (int1, int2,..)  or col in (...)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] scarlin-cloudera commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
scarlin-cloudera commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r592373912



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       Yeah, not in (null) is a very weird construct.  Equivalent to why you need to say "x IS NULL" and you can't say "x = NULL"




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] scarlin-cloudera commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
scarlin-cloudera commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r592041244



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       I'm sorry...which statement are you referring to where you want to replace null with "false"?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] scarlin-cloudera commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
scarlin-cloudera commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r590750469



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       Oh, I see.  It is a multimap though, so I think it retains all values? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r601016092



##########
File path: ql/src/test/results/clientpositive/llap/compare_cols_null.q.out
##########
@@ -34,8 +34,9 @@ STAGE PLANS:
       Processor Tree:
         TableScan
           alias: ccn_table
+          filterExpr: (key > '123a') (type: boolean)
           Filter Operator
-            predicate: false (type: boolean)
+            predicate: (key > '123a') (type: boolean)

Review comment:
       @dengzhhu653 , that's a great point. The path moving forward should be to fold/optimize this type of expressions in the optimizer (Calcite) rather than in `TypeCheckProcFactory.java`. It seems we have a gap in the optimization logic, that's why it's not getting folded. @scarlin-cloudera , could you create a follow-up JIRA to fix this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] kgyrtkirk commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Posted by GitBox <gi...@apache.org>.
kgyrtkirk commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r585391616



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       I was going thru here and there and I think there might be another way around this problem which could retain this optimization as well:
   * introduce a new `NOT` operator: which can be controlled to return true/false in case of null values
   * in case of filter expressions start using the new not operator; and switch mode below every `NOT` operator
   
   but this feels like a more complicated change - we should only do it if we loose important optimizations

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {
               TypeInfo targetType = exprFactory.getTypeInfo(valueDesc);
               if (!expressions.containsKey(targetType)) {

Review comment:
       this if statement has no effect - the map value will be overwritten anyway ; I wonder if we have a bug here

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));

Review comment:
       for `IN` the original logic is valid as long as it's in `UnknownAs.FALSE` mode...but for `NOT IN` the correct interpretation would be `UnknownAs.TRUE`.
   
   I think we might be better off not coping with the `UnknownAs` devils here - and retain the original expressions as in the current proposed patch; I'm not sure how much optimization opportunities/performance we will loose that way.
   
   

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       other idea could be to retain and do these optimizations but only if we are not below a NOT operator

##########
File path: ql/src/test/queries/clientpositive/in_coercion.q
##########
@@ -0,0 +1,14 @@
+DROP TABLE src_table;
+CREATE TABLE src_table (key int);
+LOAD DATA LOCAL INPATH '../../data/files/kv6.txt' INTO TABLE src_table;
+
+-- verify table has data
+select count(*) from src_table; 

Review comment:
       note: we may use the `assert_true` udf here (so that no one will be able to just silently overwrite the q.out)
   ```
   select assert_true(count(*) = 100) from src_table
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org