Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/14 23:50:08 UTC

[GitHub] [spark] xkrogen opened a new pull request, #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

xkrogen opened a new pull request, #38660:
URL: https://github.com/apache/spark/pull/38660

   ### What changes were proposed in this pull request?
   
   This modifies various operators that can accept _untrusted_ input (e.g. DSv2 sources, UDFs) so that they can perform null checks on their input, guaranteeing that if an input is _annotated_ as non-null, it _actually_ contains no NULL values. Currently, all operators in Spark assume that the schema/nullability annotations are correct, which makes sense for operators consuming the output of other internal operators, but doesn't necessarily make sense for operators handling the output of user-defined code.
   
   For UDFs/UDAFs such as `ScalaUDF`, this simply adds a null check in all cases when the output is non-null, since a UDF will always be untrusted. For other sources of introducing untrusted values, such as DSv2, the values are accepted into Spark by way of `BoundReference`. To handle this situation, `Expression` nodes are enhanced with the concept of "trusting" their input, which is set to false in the case of user-provided inputs. If the input is both non-null _and_ untrusted, the extra null-check is generated. There are also minor enhancements to allow for passing the SQL-friendly name down to the `BoundReference`, so that a user-friendly name can be generated in the error message.
   
   An example error message, when projecting the string field "s" from the struct field "nest", looks like this:
   
   ```
   java.lang.RuntimeException: The value at nest.`s` cannot be null, but a NULL was found. This is typically caused by the presence of a NULL value when the schema indicates the value should be non-null. Check that the input data matches the schema and/or that UDFs which can return null have a nullable return schema.
   ```
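   The trust-based null-check decision described above can be sketched as follows. This is a hypothetical illustration, not Spark's actual `Expression` API: the names `InternalRef`, `UntrustedRef`, and `NullCheck` are invented for the example.

   ```scala
   // Hypothetical sketch: an extra null check is generated only when the
   // schema claims non-null AND the value comes from an untrusted producer.
   sealed trait Expr {
     def nullable: Boolean
     // Internal operators trust their declared nullability by default;
     // untrusted inputs (DSv2 reads, UDF results) override this to false.
     def trustNullability: Boolean = true
   }

   case class InternalRef(nullable: Boolean) extends Expr

   case class UntrustedRef(nullable: Boolean, locationDesc: String) extends Expr {
     override def trustNullability: Boolean = false
   }

   object NullCheck {
     // A check is needed only for inputs that are both non-null and untrusted.
     def needed(e: Expr): Boolean = !e.nullable && !e.trustNullability

     // Runtime validation mirroring the generated null check.
     def validate(e: Expr, value: Any): Any = {
       if (value == null && needed(e)) {
         val desc = e match {
           case UntrustedRef(_, d) => d
           case _                  => "<unknown>"
         }
         throw new RuntimeException(
           s"The value at $desc cannot be null, but a NULL was found.")
       }
       value
     }
   }
   ```

   Note how an internal non-null reference produces no check at all, while the same annotation on an untrusted reference turns into a runtime validation.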
   
   This is in contrast to the approach originally proposed in #37634; see the discussion there for some more context on why this approach was chosen.
   
   ### Why are the changes needed?
   This is needed to help users decipher the error message; currently a `NullPointerException` without any message is thrown, which provides the user no guidance on what they've done wrong, and typically leads them to believe there is a bug in Spark. See the Jira for a specific example of how this behavior can be triggered and what the exception looks like currently.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, in the case that a user has a data-schema mismatch, they will now get a much more helpful error message. In other cases, no change.
   
   ### How was this patch tested?
   See tests in `DataFrameSuite` and `DataSourceV2Suite`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38660:
URL: https://github.com/apache/spark/pull/38660#issuecomment-1510007113

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] github-actions[bot] closed pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields
URL: https://github.com/apache/spark/pull/38660




[GitHub] [spark] xkrogen commented on pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

Posted by GitBox <gi...@apache.org>.
xkrogen commented on PR #38660:
URL: https://github.com/apache/spark/pull/38660#issuecomment-1314567023

   One point that I'd be interested in discussing is handling of untrusted input data sources (not UDFs). For some context, in our environment this situation mostly arises because we have a DSv2 source which tracks the schema for a table in a catalog (including nullability information) as well as a pointer to an HDFS location. At times, due to erroneous pipelines, the schema can indicate non-null even though the underlying files contain null values. Currently, diagnosing such issues and determining where the mismatched input lives is very challenging.
   
   But, I suspect that treating _all_ DSv2 sources as untrusted doesn't make sense either. One option I was considering is to add a list of "trusted" (or "untrusted") entities, essentially an include/exclude list, denoting which DSv2 sources and/or UDFs are considered trusted or not.
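   
   An include/exclude list like the one suggested could look roughly like this. This is a hedged sketch under my own assumptions; the type and default behavior are illustrative, not an actual Spark config:
   
   ```scala
   // Hypothetical trust policy for DSv2 sources / UDFs: an explicit
   // "untrusted" entry wins over "trusted"; otherwise fall back to a
   // configurable default.
   case class TrustPolicy(
       trusted: Set[String],
       untrusted: Set[String],
       trustByDefault: Boolean) {
     def isTrusted(name: String): Boolean =
       if (untrusted.contains(name)) false
       else if (trusted.contains(name)) true
       else trustByDefault
   }
   ```
   
   Sources in neither list would fall back to the default, so operators could stay trusted-by-default while a problematic catalog source is explicitly opted in to null checking.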




[GitHub] [spark] allisonwang-db commented on a diff in pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

Posted by GitBox <gi...@apache.org>.
allisonwang-db commented on code in PR #38660:
URL: https://github.com/apache/spark/pull/38660#discussion_r1038557594


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala:
##########
@@ -117,6 +117,7 @@ abstract class Expression extends TreeNode[Expression] {
   lazy val deterministic: Boolean = children.forall(_.deterministic)
 
   def nullable: Boolean
+  def trustNullability: Boolean = true

Review Comment:
   I might be missing some context but why do we need a new field `trustNullability`? Can't we simply use `nullable` and maybe add a flag to enforce the nullability of UDFs?





[GitHub] [spark] xkrogen commented on pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

Posted by GitBox <gi...@apache.org>.
xkrogen commented on PR #38660:
URL: https://github.com/apache/spark/pull/38660#issuecomment-1371607231

   Merged into latest master to resolve conflicts. @allisonwang-db or @cloud-fan , any thoughts/comments on the latest diff? Thanks!




[GitHub] [spark] xkrogen commented on a diff in pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

Posted by GitBox <gi...@apache.org>.
xkrogen commented on code in PR #38660:
URL: https://github.com/apache/spark/pull/38660#discussion_r1043895649


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala:
##########
@@ -117,6 +117,7 @@ abstract class Expression extends TreeNode[Expression] {
   lazy val deterministic: Boolean = children.forall(_.deterministic)
 
   def nullable: Boolean
+  def trustNullability: Boolean = true

Review Comment:
   `nullable` <- this is the nullability according to the schema
   `trustNullability` <- whether or not we should _trust_ that schema
   
   For internal operators, we want to "trust" the nullability of their output, i.e., we completely eliminate null checks for those outputs. However, for "untrusted" sources like UDFs, we can't trust the reported nullability, so we still generate a null check and throw an exception if a null appears despite the schema reporting non-null. If designing something from scratch, I might make nullable have three possible values: `NULLABLE`, `NONNULL`, `NONNULL_UNTRUSTED` to indicate the three states, but it's not feasible in the current framework.
   
   I am also not that excited about having to add this new field for all `Expression`s, but I couldn't find another way to propagate it. The problem is that the null check typically happens in the operator _consuming_ the untrusted output. So changing the codegen for e.g. the UDF isn't sufficient; it requires updating the codegen for `BoundReference`, `GetStructField`, etc. And these consumer operators have to adjust their codegen based on the "trustworthiness" of their input operators. The input operators are generic `Expression` instances, so to propagate the trustworthiness, the only way I found was to add this new property on `Expression`. But I am very open to any suggestions on a more clean way to propagate the required information! (Also open to suggestions on a better name if we do keep the property ... I'm not thrilled with the current name.)
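   
   The three-state design mentioned above could be sketched like this. Purely illustrative: Spark's `Expression` keeps a Boolean `nullable`, so these names are hypothetical.
   
   ```scala
   // Hypothetical three-state nullability, folding `nullable` and
   // `trustNullability` into a single property.
   object Nullability extends Enumeration {
     val Nullable, NonNull, NonNullUntrusted = Value
   
     // Only the "claimed non-null, but untrusted" state needs a runtime check;
     // Nullable handles nulls normally and NonNull skips checks entirely.
     def needsNullCheck(n: Value): Boolean = n == NonNullUntrusted
   }
   ```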





[GitHub] [spark] xkrogen commented on pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

Posted by GitBox <gi...@apache.org>.
xkrogen commented on PR #38660:
URL: https://github.com/apache/spark/pull/38660#issuecomment-1358140631

   @allisonwang-db or @MaxGekk, any thoughts on the current diff?




[GitHub] [spark] MaxGekk commented on a diff in pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #38660:
URL: https://github.com/apache/spark/pull/38660#discussion_r1037898056


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -541,6 +541,13 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
       messageParameters = Map("fieldCannotBeNullMsg" -> fieldCannotBeNullMsg(index, fieldName)))
   }
 
+  def valueCannotBeNullError(locationDesc: String): RuntimeException = {
+    new RuntimeException(s"The value at $locationDesc cannot be null, but a NULL was found. " +
+      "This is typically caused by the presence of a NULL value when the schema indicates the " +
+      "value should be non-null. Check that the input data matches the schema and/or that UDFs " +
+      "which can return null have a nullable return schema.")

Review Comment:
   Could you create an error class in `error-classes.json` and move the error text there?
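   
   A sketch of what such an entry might look like (the error class name is illustrative, and the real entry should follow the existing conventions in `error-classes.json`, where parameters appear as `<name>` placeholders):
   
   ```json
   "VALUE_CANNOT_BE_NULL" : {
     "message" : [
       "The value at <location> cannot be null, but a NULL was found. Check that the input data matches the schema and/or that UDFs which can return null have a nullable return schema."
     ]
   }
   ```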



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -541,6 +541,13 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
       messageParameters = Map("fieldCannotBeNullMsg" -> fieldCannotBeNullMsg(index, fieldName)))
   }
 
+  def valueCannotBeNullError(locationDesc: String): RuntimeException = {
+    new RuntimeException(s"The value at $locationDesc cannot be null, but a NULL was found. " +

Review Comment:
   Please use one of Spark's exceptions, like `SparkRuntimeException`





[GitHub] [spark] MaxGekk commented on pull request #38660: [SPARK-40199][SQL][WIP] Provide useful error when encountering null values in non-null fields

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on PR #38660:
URL: https://github.com/apache/spark/pull/38660#issuecomment-1334908174

   @xkrogen Please resolve conflicts and rebase onto the latest master. Thanks.

