Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/08/24 00:23:00 UTC

[jira] [Assigned] (SPARK-40199) Spark throws NPE without useful message when NULL value appears in non-null schema

     [ https://issues.apache.org/jira/browse/SPARK-40199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40199:
------------------------------------

    Assignee: Apache Spark

> Spark throws NPE without useful message when NULL value appears in non-null schema
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-40199
>                 URL: https://issues.apache.org/jira/browse/SPARK-40199
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: Erik Krogen
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently, in some cases, when Spark encounters a NULL value where the schema declares the column/field non-null, it throws a {{NullPointerException}} with no message and thus no way to debug further. This can happen, for example, if you have a UDF which is erroneously marked non-nullable via {{asNonNullable()}}, or if you read input data whose actual values don't match the schema (which could happen e.g. with Avro if the reader provides a schema declaring non-null although the data was written with null values).
> As an example of how to reproduce:
> {code:scala}
> // Runs as-is in spark-shell; in a standalone app, `import spark.implicits._`
> // is needed for toDF and the $"..." column syntax.
> val badUDF = spark.udf.register[String, Int]("bad_udf", in => null).asNonNullable()
> Seq(1, 2).toDF("c1").select(badUDF($"c1")).collect()
> {code}
> This throws an exception like:
> {code}
> Driver stacktrace:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1) (xxxxxxxxxx executor driver): java.lang.NullPointerException
> 	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
> 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
> 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:139)
> 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
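> The Avro flavor mentioned above can be sketched similarly (a hedged sketch, not a verified reproduction: it assumes the spark-avro package is on the classpath, and the path and reader schema string are illustrative; "topLevelRecord" is spark-avro's default record name). It may surface the same unhelpful NPE:
> {code:scala}
> // Write data containing NULLs with the default (nullable) schema...
> Seq(Some("a"), None).toDF("f1").write.format("avro").save("/tmp/avro_npe_demo")
> // ...then read it back with a reader schema that declares f1 non-null.
> val readerSchema =
>   """{"type":"record","name":"topLevelRecord","fields":[{"name":"f1","type":"string"}]}"""
> spark.read.format("avro")
>   .option("avroSchema", readerSchema)
>   .load("/tmp/avro_npe_demo")
>   .collect()
> {code}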
> To a user this is very confusing -- it looks like a bug in Spark itself. We have had many users report such problems, and though we can guide them toward the schema-data mismatch, there is no indication of which field contains the bad values, so a laborious data exploration process is required to find and remedy it.
> We should provide a better error message in such cases.
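> Until a better message exists, one user-side way to narrow down the culprit is to re-read the data with a permissive (all-nullable) schema and check it against the nullability the strict schema claims. A rough sketch; {{findNullViolations}} is a hypothetical helper, not part of any Spark API:
> {code:scala}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.StructType
> // For each field the strict schema declares non-nullable, count actual NULLs
> // in a DataFrame that was read with a permissive (all-nullable) schema.
> def findNullViolations(df: DataFrame, strict: StructType): Unit = {
>   strict.fields.filterNot(_.nullable).foreach { f =>
>     val nulls = df.filter(col(f.name).isNull).count()
>     if (nulls > 0) println(s"${f.name}: declared non-nullable, found $nulls NULL value(s)")
>   }
> }
> {code}
> This only covers top-level columns (nested struct fields would need recursive handling), but it at least avoids a manual column-by-column search.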


