Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/24 00:21:50 UTC

[GitHub] [spark] xkrogen opened a new pull request, #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

xkrogen opened a new pull request, #37634:
URL: https://github.com/apache/spark/pull/37634

   ### What changes were proposed in this pull request?
   This modifies `GenerateUnsafeProjection` to wrap projections of non-null fields in try-catch blocks which swallow any `NullPointerException` that is thrown, and instead throw a helpful error message indicating that a null value was encountered where the schema indicated non-null. This new error is added in `QueryExecutionErrors`.
   
   Small modifications are made to a few methods in `GenerateUnsafeProjection` to allow for passing down the path to the projection in question, to give the user a helpful indication of what they need to change. Getting the name of the top-level column seems to require substantial changes since the name is thrown away when the `BoundReference` is created (in favor of an ordinal), so for the top-level only an ordinal is given; for nested fields the name is provided. An example error message looks like:
   
   ```
   java.lang.RuntimeException: The value at <POS_0>.`middle`.`bottom` cannot be null, but a NULL was found. This is typically caused by the presence of a NULL value when the schema indicates the value should be non-null. Check that the input data matches the schema and/or that UDFs which can return null have a nullable return schema.
   ```
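   The codegen wrapping described above can be sketched in a few lines. This is an illustrative, self-contained model, not the PR's verbatim code: the real generator delegates error construction to `QueryExecutionErrors.valueCannotBeNullError` rather than building the message inline.

   ```java
   // Illustrative sketch: build the generated-Java snippet that wraps the
   // projection of a non-nullable field in a try-catch, turning a bare
   // NullPointerException into an error naming the offending field path.
   import java.util.List;

   public class NpeWrapSketch {
       static String wrapWithNpeHandling(String inputExpr, List<String> descPath) {
           String path = String.join(".", descPath);
           return "try {\n"
               + "  " + inputExpr.trim() + "\n"
               + "} catch (NullPointerException npe) {\n"
               + "  throw new RuntimeException(\"The value at " + path
               + " cannot be null, but a NULL was found.\");\n"
               + "}\n";
       }

       public static void main(String[] args) {
           System.out.println(wrapWithNpeHandling("int v = i.getInt(0);",
               List.of("<POS_0>", "middle", "bottom")));
       }
   }
   ```

   Note that any user-controlled name in `descPath` must be escaped before being spliced into generated Java source, a point raised in the review below.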
   
   ### Why are the changes needed?
   This is needed to help users decipher the error message; currently a `NullPointerException` without any message is thrown, which provides the user no guidance on what they've done wrong, and typically leads them to believe there is a bug in Spark. See the Jira for a specific example of how this behavior can be triggered and what the exception looks like currently.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, in the case that a user has a data-schema mismatch, they will now get a much more helpful error message. In other cases, no change.
   
   ### How was this patch tested?
   See tests in `DataFrameSuite` and `GeneratedProjectionSuite`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r955234541


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:
##########
@@ -252,28 +264,43 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
      """.stripMargin
   }
 
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+    s"""
+       |try {
+       |  ${inputExpr.trim}
+       |} catch (NullPointerException npe) {
+       |  throw QueryExecutionErrors.valueCannotBeNullError("${descPath.mkString(".")}");

Review Comment:
   Good catch! I can't believe the ridiculous stuff Spark will accept as a valid column name. Fixed and added a test for this.





[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r974499166


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:
##########
@@ -252,28 +267,44 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
      """.stripMargin
   }
 
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+    s"""
+       |try {
+       |  ${inputExpr.trim}

Review Comment:
   I prefer exception-catching as it handles this issue with zero overhead. Adding a null-check here essentially falls back to the logic for a nullable schema:
   https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L119-L133
   From the benchmark results, we can see that there is nontrivial overhead for the null-check; for the simple case of a projection of a primitive, the overhead is almost 50%:
   https://github.com/apache/spark/blob/2a1f9767213c321bd52e7714fa3b5bfc4973ba40/sql/catalyst/benchmarks/UnsafeProjectionBenchmark-jdk17-results.txt#L9-L10
   
   You call out the situation of a null silently being replaced with a default value; this is a good point. I'm not sure how we can handle that without additional overhead of an explicit check. It seems that the default value replacement logic is coming from [Scala's own unboxing logic](https://github.com/scala/scala/blob/986dcc160aab85298f6cab0bf8dd0345497cdc01/src/library/scala/runtime/BoxesRunTime.java#L102).
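   The linked `BoxesRunTime` behavior boils down to a null check that substitutes the primitive default. A minimal Java model of that unboxing logic (a sketch for illustration, not Scala's actual source):

   ```java
   // Minimal model of scala.runtime.BoxesRunTime.unboxToInt: a null box is
   // silently mapped to 0 instead of throwing, which is how a NULL can turn
   // into a default value rather than a NullPointerException.
   public class UnboxModel {
       static int unboxToInt(Object i) {
           return i == null ? 0 : ((Integer) i).intValue();
       }

       public static void main(String[] args) {
           System.out.println(unboxToInt(null)); // 0
           System.out.println(unboxToInt(7));    // 7
       }
   }
   ```

   Because the null never reaches a dereference, there is no exception to catch, which is the gap in the try-catch approach for primitive types.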





[GitHub] [spark] robreeves commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
robreeves commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r959739223


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:
##########
@@ -252,28 +266,44 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
      """.stripMargin
   }
 
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+    s"""
+       |try {
+       |  ${inputExpr.trim}
+       |} catch (NullPointerException npe) {
+       |  throw QueryExecutionErrors.valueCannotBeNullError(

Review Comment:
   Would it be possible to optionally write the problematic row data in the error message too? This will help save users a step of manually having to find the problematic row. This should be optional and set to false by default in case the data is sensitive. 





[GitHub] [spark] HyukjinKwon commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1250488822

   Makes sense to me ... @MaxGekk @gengliangwang or @cloud-fan 




[GitHub] [spark] github-actions[bot] closed pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
URL: https://github.com/apache/spark/pull/37634




[GitHub] [spark] mridulm commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1250159106

   +CC @ueshin, as you reviewed the surrounding code before.
   Also, CC @HyukjinKwon.
   




[GitHub] [spark] cloud-fan commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1257692263

   I made `SparkSession.internalCreateDataFrame` public to easily test the nullability mismatch bug. You can also use a data source to reproduce it.
   ```
   val rdd = sc.makeRDD(Seq(InternalRow(null)))
   val df = spark.internalCreateDataFrame(rdd, new StructType().add("i", "int", false))
   df.show
   +---+
   |  i|
   +---+
   |  0|
   +---+
   ```
   
   If you use the public APIs like `SparkSession.createDataFrame`, they use `RowEncoder` which does runtime null check via expression `GetExternalRowField`. An example
   ```
   val rdd = sc.makeRDD(Seq(Row(null)))
   val df = spark.createDataFrame(rdd, new StructType().add("i", "int", false))
   df.show
   java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: The 0th field 'i' of input row cannot be null.
   ```
   
   I think this is the right direction to go: for untrusted data, add extra validation even if it has perf overhead. If perf is very critical and we don't want any overhead, we can add a flag to skip this check and trust the data. try-catch NPE seems a bit hacky and can't cover all the cases.
   
   




[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r955182512


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -447,6 +447,13 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
     new RuntimeException(fieldCannotBeNullMsg(index, fieldName))
   }
 
+  def valueCannotBeNullError(locationDesc: String): RuntimeException = {
+    new RuntimeException(s"The value at $locationDesc cannot be null, but a NULL was found. " +
+      "This is typically caused by the presence of a NULL value when the schema indicates the " +
+      "value should be non-null. Check that the input data matches the schema and/or that UDFs " +

Review Comment:
   Yeah, it handles this situation. Marking a UDF as non-nullable just adjusts the schema, then the output row will contain a null value -- so the situation is identical to what is already tested in `GeneratedProjectionSuite`. But I can add this to `DataFrameSuite` to explicitly demonstrate that it is covered.





[GitHub] [spark] cloud-fan commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1250772044

   Spark trusts data nullability in many places (expressions, projection generators, optimizer rules, etc.). It's a lot of effort to improve error messages for all of these places when data does not match the nullability. We'd better pick a clear scope here.
   
   AFAIK a common source of mismatch is data sources and UDFs. We can focus on these 2 cases only.
   
   For data sources, we can add a Filter node above the data source relation to apply null check, using the existing `AssertNotNull` expression. For UDF, we can wrap the UDF expression with `AssertNotNull` to do the null check as well.
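   A toy model of that suggestion, with an assumed shape rather than Spark's actual `AssertNotNull` class: an eval-time guard that fails fast with a descriptive message when a value is null where the schema claims non-null.

   ```java
   // Toy stand-in for Spark's AssertNotNull expression (illustrative only):
   // evaluate a child value and throw a descriptive error if it is null
   // where the schema indicates non-null.
   import java.util.List;
   import java.util.function.Supplier;

   public class AssertNotNullSketch {
       static Object evalNonNull(Supplier<Object> child, List<String> walkedTypePath) {
           Object result = child.get();
           if (result == null) {
               throw new NullPointerException(
                   "Null value appeared in non-nullable field: "
                   + String.join(" -> ", walkedTypePath));
           }
           return result;
       }

       public static void main(String[] args) {
           System.out.println(evalNonNull(() -> 42, List.of("i"))); // 42
       }
   }
   ```

   The explicit check costs a branch per value, which is the overhead-versus-safety trade-off debated in this thread.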




[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1251319065

   Thanks for the suggestion @cloud-fan ! Good point about there being many places where Spark trusts nullability. Here I am trying to target places where _user code_ could introduce a null. This includes data sources (including in-memory dataset creation like `sparkContext.parallelize()` or `DatasetHolder.toDS()`) and UDFs (including lambdas in calls to DF/DS APIs). User code is inherently untrustworthy, so more checks are warranted. I think any of these places where user code supplies values would be covered by this PR since they all go through projection before being accessed by other codepaths, but LMK if you disagree. I guess one situation could be if the optimizer completely removes some operator because of the nullability, so the data is never even accessed--are you thinking about situations like this?
   
   Unfortunately we cannot use `AssertNotNull` for this; the optimizer will remove it if the input schema indicates that the schema is non-null:
   https://github.com/apache/spark/blob/96831bbb6749910c8f9497c048ba05e6e317649f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L826
   And that optimizer rule is there for a good reason; there is nontrivial overhead associated with the null check as discussed further in [my other comment](https://github.com/apache/spark/pull/37634#discussion_r974499166).




[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1239693536

   Ping again to @MaxGekk @karenfeng @gengliangwang -- any thoughts?




[GitHub] [spark] cloud-fan commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r974014046


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:
##########
@@ -252,28 +267,44 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
      """.stripMargin
   }
 
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+    s"""
+       |try {
+       |  ${inputExpr.trim}

Review Comment:
   I think it's better to explicitly check the null value, instead of catching `NullPointerException`. The consequence of wrong nullability is not always a `NullPointerException`; it can also be a wrong answer, e.g. a null integer becomes -1 after projection.





[GitHub] [spark] shardulm94 commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r954381765


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:
##########
@@ -252,28 +264,43 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
      """.stripMargin
   }
 
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+    s"""
+       |try {
+       |  ${inputExpr.trim}
+       |} catch (NullPointerException npe) {
+       |  throw QueryExecutionErrors.valueCannotBeNullError("${descPath.mkString(".")}");

Review Comment:
   A column with `"` as part of the column name will break the syntax in this generated code. We should escape quotes here. I tried to look if other errors escaped quotes in user controlled strings, but wasn't able to find any. e.g https://github.com/apache/spark/blob/295e98d29b34e2b472c375608b8782c3b9189444/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L1165
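   A minimal illustration of the kind of escaping needed here. This is a hypothetical helper for demonstration; Spark would use its own string utilities rather than this exact function.

   ```java
   // Hypothetical helper: escape the characters that would break a Java
   // string literal when a user-controlled column name is spliced into
   // generated source code (backslashes, quotes, newlines).
   public class CodegenEscape {
       static String escapeForJavaString(String s) {
           StringBuilder sb = new StringBuilder();
           for (char c : s.toCharArray()) {
               switch (c) {
                   case '\\': sb.append("\\\\"); break;
                   case '"':  sb.append("\\\""); break;
                   case '\n': sb.append("\\n");  break;
                   default:   sb.append(c);
               }
           }
           return sb.toString();
       }

       public static void main(String[] args) {
           System.out.println(escapeForJavaString("weird\"name"));
       }
   }
   ```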



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -447,6 +447,13 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
     new RuntimeException(fieldCannotBeNullMsg(index, fieldName))
   }
 
+  def valueCannotBeNullError(locationDesc: String): RuntimeException = {
+    new RuntimeException(s"The value at $locationDesc cannot be null, but a NULL was found. " +
+      "This is typically caused by the presence of a NULL value when the schema indicates the " +
+      "value should be non-null. Check that the input data matches the schema and/or that UDFs " +

Review Comment:
   Do the changes in `GenerateUnsafeProjection` also take care of the throwing useful error messages for non-nullable UDFs returning null values? If yes, can we also add a test case for the same?





[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1225021291

   cc @shardulm94 (thanks for the push towards constructing the path into a single message instead of using chained exceptions, the additional code changes were minimal and the resulting message is much nicer)
   and a few folks who have been working on error reporting @MaxGekk @karenfeng @gengliangwang




[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1233486676

   Pushed up new commits rebasing on latest changes.
   
   cc @cloud-fan @dongjoon-hyun in case either of you are interested.




[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r960070035


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:
##########
@@ -252,28 +266,44 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
      """.stripMargin
   }
 
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+    s"""
+       |try {
+       |  ${inputExpr.trim}
+       |} catch (NullPointerException npe) {
+       |  throw QueryExecutionErrors.valueCannotBeNullError(

Review Comment:
   Printing the single datum won't be helpful since it's always NULL, and it would be challenging to access the whole input row from this location. We create the projection recursively, so at this point while recursing, we don't even have a reference to the fully created projection to grab the other fields. Note also that this is a failure to project the data, and we would also need to project the data to print it, so we'd have to selectively skip this field.
   
   Open to suggestions, but I don't see a clear path forward.





[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by GitBox <gi...@apache.org>.
xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1314561335

   Thanks for your comments @cloud-fan. If it were possible to handle all data types using the exception-catching approach, I would still advocate for it. But you raised a good point that it's not possible to handle primitives in this manner, and this is actually an even more severe issue, because it becomes a correctness issue (NULL silently replaced with 0) instead of just a confusing error message.
   
   My initial next attempt, following along your suggestion regarding `AssertNotNull`, was going to be along these lines:
   - Create a new unary expression `UntrustedNullGuard`. Behaviorally, it is identical to `AssertNotNull`, except for maybe a different error message.
   - In contrast to `AssertNotNull` which gets optimized away in cases of non-nullable input, `UntrustedNullGuard` gets optimized away _in cases of nullable input_.
   - We look for untrusted sources (input data sources, UDFs, etc.) and wrap them in `UntrustedNullGuard` without concern for whether the data source returns a null or non-null type--if it's nullable, the check will be optimized away, and if it's not, then we get the guard as expected.
   - We can have a flag like `spark.sql.optimizer.guardUntrustedNullability` which defaults to true, but can be set to false by those wishing to indicate to Spark that they really do trust their data sources and don't want the runtime checks added. (The optimizer rule won't apply in this case.)
   
   However this approach has some drawbacks. The one I found to be a fatal flaw is that there are situations where the operator itself does some unpacking of the result value and will already replace a NULL value with a default, e.g. here:
   https://github.com/apache/spark/blob/3f97cd620a61bc2685bf35215185365e63bedc0a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L1176-L1180
   
   I don't think there's any way to catch this situation except to modify the codegen for the operators themselves. I've put up an alternate approach demonstrating this in PR #38660; please take a look.




[GitHub] [spark] github-actions[bot] commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1441040591

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

