You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/14 23:50:08 UTC

[GitHub] [spark] xkrogen opened a new pull request, #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

xkrogen opened a new pull request, #38660:
URL: https://github.com/apache/spark/pull/38660

### What changes were proposed in this pull request?

This modifies various operators that can accept _untrusted_ input (e.g. DSv2 sources, UDFs) so that they can perform null checks on their input, guaranteeing that if an input is _annotated_ as non-null, it _actually_ contains no NULL values. Currently all operators in Spark assume that the schema/nullability annotations are correct, which makes sense for operators consuming the output of other internal operators, but doesn't necessarily make sense for operators handling the output of user defined code.

For UDFs/UDAFs such as `ScalaUDF`, this simply adds a null check in all cases when the output is non-null, since a UDF will always be untrusted. For other sources of introducing untrusted values, such as DSv2, the values are accepted into Spark by way of `BoundReference`. To handle this situation, `Expression` nodes are enhanced with the concept of "trusting" their input, which is set to false in the case of user-provided inputs. If the input is both non-null _and_ untrusted, the extra null-check is generated. There are also minor enhancements to allow for passing the SQL-friendly name down to the `BoundReference`, so that a user-friendly name can be generated in the error message.

An example error message, when projecting the string field "s" from the struct field "nest", is like:

```
java.lang.RuntimeException: The value at nest.`s` cannot be null, but a NULL was found. This is typically caused by the presence of a NULL value when the schema indicates the value should be non-null. Check that the input data matches the schema and/or that UDFs which can return null have a nullable return schema.
```

This is in contrast to the approach originally proposed in #37634; see the discussion there for some more context on why this approach was chosen.

### Why are the changes needed?
This is needed to help users decipher the error message; currently a `NullPointerException` without any message is thrown, which provides the user no guidance on what they've done wrong, and typically leads them to believe there is a bug in Spark. See the Jira for a specific example of how this behavior can be triggered and what the exception looks like currently.

### Does this PR introduce _any_ user-facing change?
Yes, in the case that a user has a data-schema mismatch, they will not get a much more helpful error message. In other cases, no change.

### How was this patch tested?
See tests in `DataFrameSuite` and `DataSourceV2Suite`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] commented on pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #38660:
URL: https://github.com/apache/spark/pull/38660#issuecomment-1510007113

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org