Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/05 04:54:21 UTC

[GitHub] [iceberg] kbendick edited a comment on pull request #3645: fix startsWith expression NullPointerException error caused by null v…

kbendick edited a comment on pull request #3645:
URL: https://github.com/apache/iceberg/pull/3645#issuecomment-986166519


   > Hi @kbendick! I get this error when I'm trying to rewrite an Iceberg table in Scala Spark code with a partition filter like this: `SparkActions.get().rewriteDataFiles(table).filter(Expressions.startsWith("imp_date", "20211202")).execute()`. `imp_date` is a time partition field; it contains null values in some abnormal rows.
   
   Ohhh, that would explain why Spark isn't injecting an implicit `IS NOT NULL` check on the filter. We parse the text of the `WHERE` clause from the SQL ourselves and convert the resulting Spark expression directly to an Iceberg filter, not via the LogicalPlan that Spark would generate.
   
   This means that potentially every input type that would normally get an implicit null check (string inputs at the very least) is likely to hit this same issue.
   
   Where we parse the `WHERE` clause and convert: https://github.com/apache/iceberg/blob/b6554fccfac7a0c0ba35ebbcbff60d5f7eb0826d/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/RewriteDataFilesProcedure.java#L120-L131
   
   I'm not sure whether we should handle each case individually, or whether we should try to make that code use the parsed LogicalPlan via `sqlParser.parsePlan(where)` instead of the current `sqlParser.parseExpression(where)`.
   
   Using `parsePlan` would provide a `LogicalPlan`, which has an `expressions` attribute of type `Seq[Expression]` that I'm guessing would include the null checks Spark normally adds.
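   
   For reference, the two parser entry points look roughly like this when called from Java (an untested sketch; the table name `t` and the predicate text are placeholders):
   
   ```java
   import org.apache.spark.sql.SparkSession;
   import org.apache.spark.sql.catalyst.expressions.Expression;
   import org.apache.spark.sql.catalyst.parser.ParseException;
   import org.apache.spark.sql.catalyst.parser.ParserInterface;
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;
   
   class ParserSketch {
     static void demo(SparkSession spark) throws ParseException {
       ParserInterface parser = spark.sessionState().sqlParser();
   
       // What the procedure does today: parse only the predicate text. This
       // yields a raw, unresolved Expression -- no analysis runs, so Spark
       // never gets a chance to inject the implicit IS NOT NULL.
       Expression expr = parser.parseExpression("imp_date LIKE '20211202%'");
   
       // The parsePlan alternative: parse a full statement into a LogicalPlan,
       // whose `expressions` attribute exposes Seq[Expression]. If I understand
       // Spark right, the implicit null checks come from analysis/optimization
       // rather than from the parser itself, so the plan would still need to be
       // resolved before they show up.
       LogicalPlan plan = parser.parsePlan("SELECT * FROM t WHERE imp_date LIKE '20211202%'");
     }
   }
   ```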
   
   But it might just be easier to add the null check ourselves instead of updating that logic.
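   
   Roughly something like this (a hypothetical helper, just to illustrate; figuring out the referenced column(s) from the parsed filter is the fiddly part):
   
   ```java
   import org.apache.iceberg.expressions.Expression;
   import org.apache.iceberg.expressions.Expressions;
   
   class NullGuard {
     // Hypothetical helper, not existing Iceberg code: AND an explicit notNull
     // for the referenced column onto the converted Iceberg filter.
     static Expression withNullCheck(String column, Expression converted) {
       return Expressions.and(Expressions.notNull(column), converted);
     }
   }
   ```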
   
   cc @RussellSpitzer @karuppayya @flyrain who might have some input on this.
   
   I believe your approach will work @hbgstc123, but there might be a more robust way, so that normal Spark planning would add the `imp_date IS NOT NULL AND imp_date LIKE '20211202%'` check for us.
   
   If I'm correct, then probably a number of these things need to be updated to handle `null` input (only for this particular code path though).
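   
   In the meantime, a caller-side workaround for the action API might look like this (an untested sketch using the `imp_date` example above):
   
   ```java
   import org.apache.iceberg.Table;
   import org.apache.iceberg.expressions.Expressions;
   import org.apache.iceberg.spark.actions.SparkActions;
   
   class NullGuardedRewrite {
     // Untested sketch: add the null guard explicitly on the caller side,
     // mirroring the IS NOT NULL that Spark's planner would have injected.
     static void rewrite(Table table) {
       SparkActions.get()
           .rewriteDataFiles(table)
           .filter(Expressions.and(
               Expressions.notNull("imp_date"),
               Expressions.startsWith("imp_date", "20211202")))
           .execute();
     }
   }
   ```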


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


