You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/15 11:58:00 UTC

[GitHub] [iceberg] ConeyLiu opened a new pull request, #5049: Spark: Fixes NPE when convert NOT IN filter

ConeyLiu opened a new pull request, #5049:
URL: https://github.com/apache/iceberg/pull/5049

   This pr fixes the following problems:
   ``` java
   Cannot create expression literal from null
   java.lang.NullPointerException: Cannot create expression literal from null
   	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:907)
   	at org.apache.iceberg.expressions.Literals.from(Literals.java:62)
   	at org.apache.iceberg.relocated.com.google.common.collect.Iterators$6.transform(Iterators.java:826)
   	at org.apache.iceberg.relocated.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:52)
   	at org.apache.iceberg.relocated.com.google.common.collect.Iterators.addAll(Iterators.java:367)
   	at org.apache.iceberg.relocated.com.google.common.collect.Lists.newArrayList(Lists.java:147)
   	at org.apache.iceberg.relocated.com.google.common.collect.Lists.newArrayList(Lists.java:133)
   	at org.apache.iceberg.expressions.UnboundPredicate.<init>(UnboundPredicate.java:55)
   	at org.apache.iceberg.expressions.Expressions.predicate(Expressions.java:265)
   	at org.apache.iceberg.expressions.Expressions.predicate(Expressions.java:254)
   	at org.apache.iceberg.expressions.Expressions.notIn(Expressions.java:234)
   	at org.apache.iceberg.spark.SparkFilters.convert(SparkFilters.java:179)
   ```
   
   This happens when we convert `some_col NOT IN values` filter from spark to Iceberg and the `values` contains `null`.  For example: `DELETE FROM table WHERE id IN (1, null)`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#discussion_r898259719


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkFilters.java:
##########
@@ -178,6 +178,7 @@ public static Expression convert(Filter filter) {
             In childInFilter = (In) childFilter;
             Expression notIn = notIn(unquote(childInFilter.attribute()),
                 Stream.of(childInFilter.values())
+                    .filter(Objects::nonNull)

Review Comment:
   > should we support NULL literal in Iceberg?
   
   No. `null` is purposely not allowed as a literal because that brings in differences between SQL 3-valued boolean semantics and normal semantics. Engines are responsible for interpreting and rewriting filters. Similarly, in cases like this we actually want to reject null values in IN predicates. The user or the engine needs to rewrite before passing to Iceberg.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#issuecomment-1160754188

   @ConeyLiu, sure. If you want to backport that, I think it would be useful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#issuecomment-1156764355

   I don't think this should be merged. This NPE is intended. This code can catch the NPE and throw a more friendly error, but I think it is also reasonable to keep the exception message as it is right now. The problem is that `null` should not be passed as a literal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#issuecomment-1157551908

   Thanks, @rdblue @kbendick for the detailed explanation. And this doc: https://modern-sql.com/concept/three-valued-logic is really helpful. I submit this PR because the user could do this with Spark 3.1 while failing in Spark 3.2. The current logic should be enough. Thanks again.
   
   Another question. Should we port the https://github.com/apache/iceberg/pull/3613 to other spark versions? Currently, the other Spark versions have different behavior again Spark 3.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#discussion_r897893379


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkFilters.java:
##########
@@ -178,6 +178,7 @@ public static Expression convert(Filter filter) {
             In childInFilter = (In) childFilter;
             Expression notIn = notIn(unquote(childInFilter.attribute()),
                 Stream.of(childInFilter.values())
+                    .filter(Objects::nonNull)

Review Comment:
   This refers to the process in L166. I think this is because we don't support NULL literal in Iceberg. However,  I think here have changed the original filter meanings. @rdblue, should we support NULL literal in Iceberg?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a diff in pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
kbendick commented on code in PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#discussion_r898195347


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkFilters.java:
##########
@@ -178,6 +178,7 @@ public static Expression convert(Filter filter) {
             In childInFilter = (In) childFilter;
             Expression notIn = notIn(unquote(childInFilter.attribute()),
                 Stream.of(childInFilter.values())
+                    .filter(Objects::nonNull)

Review Comment:
   I think this should potentially be handled elsewhere.
   
   As you said, this changes the original filter's meaning (removes null from the list of values), which is arguably a potentially common use case such as `select * from tbl where id not in (null, 'N/A', 'unknown')` (or some other representation of not available that people might be using upstream from the data before cleaning.
   
   Does the NPE go away if you update the actual test that the `notIn` predicate winds up using (I think this is it): https://github.com/apache/iceberg/blob/98dc9fe8724375e8b932e7e8afdb925c897b4695/api/src/main/java/org/apache/iceberg/expressions/BoundSetPredicate.java#L56-L66



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkFilters.java:
##########
@@ -178,6 +178,7 @@ public static Expression convert(Filter filter) {
             In childInFilter = (In) childFilter;
             Expression notIn = notIn(unquote(childInFilter.attribute()),
                 Stream.of(childInFilter.values())
+                    .filter(Objects::nonNull)

Review Comment:
   I think this should potentially be handled elsewhere.
   
   As you said, this changes the original filter's meaning (removes null from the list of values), which is arguably a potentially common use case such as `select * from tbl where id not in (null, 'N/A', 'unknown')`, or some other representation of not available that people might be using upstream from the data before cleaning.
   
   Does the NPE go away if you update the actual test that the `notIn` predicate winds up using (I think this is it): https://github.com/apache/iceberg/blob/98dc9fe8724375e8b932e7e8afdb925c897b4695/api/src/main/java/org/apache/iceberg/expressions/BoundSetPredicate.java#L56-L66



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
rdblue closed pull request #5049: Spark: Fixes NPE when convert NOT IN filter
URL: https://github.com/apache/iceberg/pull/5049


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
kbendick commented on PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#issuecomment-1156945817

   I recall now when working on the `NOT_STARTS_WITH` that we wanted to leave it up to the engine to handle 3 value boolean logic.
   
   The example query, `DELETE FROM table WHERE id NOT IN (1, null)`, should probably be written as `DELETE FROM table WHERE id is not null and id != 1`. Spark will add the not null-check itself often times.
   
   With 3-valued boolean logic, for comparison purposes every null is a different null and nothing equals null. For those who might want a refresher (like I often need), I like this website's explanation of 3-valued boolean logic involving nulls: https://modern-sql.com/concept/three-valued-logic
   
   To avoid the headache that can incur, we explicitly avoid 3-valued boolean logic. There would be better filter pushdown and more advantage taken of Iceberg's statistics by writing the query with explicit `AND is [not] null` checking too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a diff in pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
kbendick commented on code in PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#discussion_r898195347


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkFilters.java:
##########
@@ -178,6 +178,7 @@ public static Expression convert(Filter filter) {
             In childInFilter = (In) childFilter;
             Expression notIn = notIn(unquote(childInFilter.attribute()),
                 Stream.of(childInFilter.values())
+                    .filter(Objects::nonNull)

Review Comment:
   I think this should potentially be handled elsewhere.
   
   As you said, this changes the original filter's meaning (removes null from the list of values), which is arguably a potentially common use case such as `select * from tbl where id not in (null, 'N/A', 'unknown')` (or some other representation of null that people might be using upstream).
   
   Does the NPE go away if you update the actual test that the `notIn` predicate winds up using (I think this is it): https://github.com/apache/iceberg/blob/98dc9fe8724375e8b932e7e8afdb925c897b4695/api/src/main/java/org/apache/iceberg/expressions/BoundSetPredicate.java#L56-L66



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#discussion_r897893379


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkFilters.java:
##########
@@ -178,6 +178,7 @@ public static Expression convert(Filter filter) {
             In childInFilter = (In) childFilter;
             Expression notIn = notIn(unquote(childInFilter.attribute()),
                 Stream.of(childInFilter.values())
+                    .filter(Objects::nonNull)

Review Comment:
   This refers to the process in L166. I think this process here is because we don't support NULL literal in Iceberg. However,  I think here have changed the original filter meanings. @rdblue, should we support NULL literal in Iceberg?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#issuecomment-1168136550

   I'm going to close this since I think the behavior is not controversial.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #5049: Spark: Fixes NPE when convert NOT IN filter

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #5049:
URL: https://github.com/apache/iceberg/pull/5049#issuecomment-1156385716

   Hi, @rdblue @aokolnychyi @chenjunjiedada @jackye1995, could you help to review this? Thanks in advance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org