Posted to issues@spark.apache.org by "Erik Krogen (Jira)" <ji...@apache.org> on 2022/08/24 00:10:00 UTC

[jira] [Created] (SPARK-40199) Spark throws NPE without useful message when NULL value appears in non-null schema

Erik Krogen created SPARK-40199:
-----------------------------------

             Summary: Spark throws NPE without useful message when NULL value appears in non-null schema
                 Key: SPARK-40199
                 URL: https://issues.apache.org/jira/browse/SPARK-40199
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.2
            Reporter: Erik Krogen


Currently, when Spark encounters a NULL value where the schema indicates that the column/field is non-null, it throws a {{NullPointerException}} with no message, leaving no way to debug further. This can happen, for example, if a UDF is erroneously marked non-nullable via {{asNonNullable()}}, or if the input data doesn't match the declared schema (e.g. with Avro, if the reader supplies a schema declaring a field non-null although the data was written with null values; a sketch of that variant appears after the stack trace below).

As an example of how to reproduce:
{code:scala}
// The UDF always returns null but is declared non-nullable.
import spark.implicits._  // for toDF and $ (already in scope in spark-shell)
val badUDF = spark.udf.register[String, Int]("bad_udf", in => null).asNonNullable()
Seq(1, 2).toDF("c1").select(badUDF($"c1")).collect()
{code}

This throws an exception like:
{code}
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1) (xxxxxxxxxx executor driver): java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}
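
The Avro scenario mentioned above can be reproduced along the same lines. Below is a rough, untested sketch (the output path is illustrative, and it assumes the spark-avro module is on the classpath); it should hit the same message-less NPE when the null row is materialized:
{code:scala}
// Write data containing nulls with the default (nullable) writer schema.
Seq(Some("a"), None).toDF("c1").write.format("avro").save("/tmp/avro_npe_repro")

// Read it back with a reader schema that (incorrectly) declares c1 non-null.
val nonNullSchema =
  """{"type":"record","name":"topLevelRecord",
    |"fields":[{"name":"c1","type":"string"}]}""".stripMargin
spark.read.format("avro").option("avroSchema", nonNullSchema)
  .load("/tmp/avro_npe_repro").collect()
{code}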

To a user this is very confusing: it looks like a bug in Spark itself. We have had many users report such failures, and though we can guide them to the underlying schema-data mismatch, the exception gives no indication of which field contains the bad values, so a laborious data-exploration process is required to find and remedy it.

We should provide a better error message in such cases.
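
One possible shape for the fix, as a minimal sketch only (the helper name and message wording are illustrative, not Spark's actual API): have the generated write path for a non-nullable field raise an exception naming the offending field, rather than dereferencing the null directly:
{code:scala}
// Illustrative guard, not Spark's actual code path: the generated writer
// could call this for any field declared non-nullable.
def checkNonNull(value: Any, fieldName: String): Any = {
  if (value == null) {
    throw new NullPointerException(
      s"NULL value appeared in non-nullable field '$fieldName'. " +
        "Check the nullability declared for the UDF or the input schema.")
  }
  value
}
{code}
Even just naming the field would let users skip the data-exploration step and go straight to the offending column.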



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org