Posted to issues@spark.apache.org by "Ivan Sadikov (Jira)" <ji...@apache.org> on 2022/10/14 18:43:00 UTC
[jira] [Comment Edited] (SPARK-40541) NullPointerException with UTF8String.getBaseObject() when UDF

    [ https://issues.apache.org/jira/browse/SPARK-40541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617914#comment-17617914 ] 

Ivan Sadikov edited comment on SPARK-40541 at 10/14/22 6:42 PM:
----------------------------------------------------------------

I was asking about the actual problem. It is not clear what you are reporting in the ticket: it reads as a statement, with no expected behaviour, root cause analysis, or steps to reproduce. As it is, it is not a very good bug report.

Spark is not a database. Any assertions in the generated code, especially row-based ones, could affect performance. The NullPointerException has a fairly descriptive message: {{Cannot invoke "org.apache.spark.unsafe.types.UTF8String.getBaseObject()" because "input" is null}}. IMHO, in this case the mitigation should be the right way to fix it for you, as it is partially a user error: the UDF returns nulls for a non-null column.

I can take a look to see how to improve the error message. Meanwhile, you can try disabling whole-stage codegen and add to your bug report what error you get and whether that one is descriptive enough.
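
The mitigation mentioned above (making sure the UDF never hands back null for a field declared non-nullable) could be sketched roughly as follows. This is plain Java with no Spark dependency: {{NonNullFields}}, {{requireField}} and {{checked}} are hypothetical helper names, not Spark APIs, and {{computeBar}} is only a stand-in for the elided UDF body in the report.

```java
import java.util.function.Function;

/** Hypothetical helper, not part of Spark: checks values destined for non-nullable fields. */
final class NonNullFields {

    private NonNullFields() {}

    /** Returns the value unchanged, or fails with a message that names the offending field. */
    static <T> T requireField(String fieldName, T value) {
        if (value == null) {
            throw new IllegalStateException(
                "UDF produced null for field '" + fieldName + "', which is declared non-nullable");
        }
        return value;
    }

    /** Wraps the function that computes a field so every result is checked before it is returned. */
    static <A, R> Function<A, R> checked(String fieldName, Function<A, R> compute) {
        return input -> requireField(fieldName, compute.apply(input));
    }

    public static void main(String[] args) {
        // Stand-in for the real UDF logic (assumption): yields null for blank input.
        Function<String, String> computeBar = foo -> foo.isBlank() ? null : foo.trim();
        Function<String, String> safeBar = checked("bar", computeBar);

        System.out.println(safeBar.apply(" a "));
        try {
            safeBar.apply("  ");
        } catch (IllegalStateException e) {
            // The failure now names the field instead of surfacing as an NPE in generated code.
            System.out.println(e.getMessage());
        }
    }
}
```

Inside the UDF, the value for {{bar}} would pass through such a check before the row is built. If null is actually a legitimate result, the other option is to declare the field nullable, that is, pass {{true}} as the third argument of {{createStructField}}.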



> NullPointerException with UTF8String.getBaseObject() when UDF
> -------------------------------------------------------------
>
>                 Key: SPARK-40541
>                 URL: https://issues.apache.org/jira/browse/SPARK-40541
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Garret Wilson
>            Priority: Major
>
> I'm using Spark 3.3.0 on Windows with Java 17. I have a UDF that returns several columns using:
> {code}
> StructType schema = createStructType(List.of(… createStructField("bar", StringType, false)));
> UserDefinedFunction foobarUdf = udf((String foo) -> {
>   …
> }, schema).asNondeterministic();
> {code}
> Note that I specify {{false}} for {{bar}}'s nullability. It turns out that {{foobarUdf}} actually returns {{null}} for {{bar}} sometimes. In the relational database world, I would expect that if my integrity constraint wasn't met, the database would say, "you put {{null}} in {{bar}}, but {{bar}} is not nullable".
> What I did _not_ expect is what Spark does: it hits a {{NullPointerException}} and has a nervous breakdown:
> {noformat}
> [ERROR] Exception in task 0.0 in stage 8.0 (TID 5)
> java.lang.NullPointerException: Cannot invoke "org.apache.spark.unsafe.types.UTF8String.getBaseObject()" because "input" is null
>         at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>         at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_1$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>         at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>         at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:136)
>         at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.base/java.lang.Thread.run(Thread.java:833)
> [WARN] Lost task 0.0 in stage 8.0 (TID 5) (xps-13-9310 executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.spark.unsafe.types.UTF8String.getBaseObject()" because "input" is null
>         at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>         at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_1$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>         at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>         at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:136)
>         at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.base/java.lang.Thread.run(Thread.java:833)
> {noformat}
> It finally fails with:
> {noformat}
> [ERROR] Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 5) (xps-13-9310 executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.spark.unsafe.types.UTF8String.getBaseObject()" because "input" is null
>         at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>         at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_1$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>         at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>         at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:136)
>         at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.base/java.lang.Thread.run(Thread.java:833)
> {noformat}
> See also [Spark dies with NullPointerException UTF8String.getBaseObject() "input" is null|https://stackoverflow.com/q/73815800].
> The irony is that when I actually intend to mark a column as non-nullable when reading in data, Spark ignores me. See [Spark 3.3.0 not honoring my schema nullability in Java|https://stackoverflow.com/q/73476202].


