Posted to issues@spark.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/11/03 00:18:00 UTC

[jira] [Updated] (SPARK-44517) first operator should respect the nullability of child expression as well as ignoreNulls option

     [ https://issues.apache.org/jira/browse/SPARK-44517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-44517:
-----------------------------------
    Labels: pull-request-available  (was: )

> first operator should respect the nullability of child expression as well as ignoreNulls option
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-44517
>                 URL: https://issues.apache.org/jira/browse/SPARK-44517
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>            Reporter: Nan Zhu
>            Priority: Major
>              Labels: pull-request-available
>
> I found the following problem when using Spark recently:
>  
> {code:java}
> import org.apache.spark.sql.types._
> import spark.implicits._
>
> val s = Seq((1.2, "s", 2.2)).toDF("v1", "v2", "v3")
> val schema = StructType(Seq(
>   StructField("v1", DoubleType, nullable = false),
>   StructField("v2", StringType, nullable = true),
>   StructField("v3", DoubleType, nullable = false)))
> val df = spark.createDataFrame(s.rdd, schema)
> val inputDF = df.dropDuplicates("v3")
> spark.sql("CREATE TABLE local.db.table (\n v1 DOUBLE NOT NULL,\n v2 STRING, v3 DOUBLE NOT NULL)")
> inputDF.write.mode("overwrite").format("iceberg").save("local.db.table") {code}
>  
>  
> When I use the above code to write to Iceberg (I guess Delta Lake will have the same problem), I get a very confusing exception:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Cannot write incompatible dataset to table with schema:
> table {
>   1: v1: required double
>   2: v2: optional string
>   3: v3: required double
> }
> Provided schema:
> table {
>   1: v1: optional double
>   2: v2: optional string
>   3: v3: required double
> } {code}
> Basically, it complains that v1 is a nullable column in our `inputDF`, which is not allowed since we created the table with v1 as NOT NULL. The confusion comes from the fact that, if we check the schema of inputDF with printSchema(), v1 is not nullable:
> {noformat}
> root 
> |-- v1: double (nullable = false) 
> |-- v2: string (nullable = true) 
> |-- v3: double (nullable = false){noformat}
> Clearly, something changed v1's nullability unexpectedly!
>  
> After some debugging I found that the key is the dropDuplicates("v3") call. In the optimization phase, the ReplaceDeduplicateWithAggregate rule replaces the Deduplicate node with an Aggregate that groups by v3 and runs first() over all the other columns. However, the first() operator hardcodes its nullable flag to always be "true", which is the source of v1's changed nullability; the sketch below shows how to observe this directly.
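> A minimal sketch of how to observe the flip without involving Iceberg at all, reusing the inputDF defined above (queryExecution and optimizedPlan are Spark developer APIs; the exact plan shape and output may vary by version, but on the affected versions listed above v1 comes back nullable):
> {code:java}
> // The analyzed schema still reports the original nullability ...
> inputDF.printSchema()  // v1: double (nullable = false)
>
> // ... but once ReplaceDeduplicateWithAggregate has rewritten the plan into
> // Aggregate(groupBy = v3, first(v1), first(v2)), the optimized plan's output
> // marks v1 as nullable, because First.nullable is hardcoded to true.
> inputDF.queryExecution.optimizedPlan.output
>   .foreach(a => println(s"${a.name}: nullable = ${a.nullable}"))
> // v1: nullable = true   <-- flipped!
> // v2: nullable = true
> // v3: nullable = false
> {code}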
>  
> This is very confusing behavior in Spark, and probably nobody noticed it before because nullability did not matter much until the new table formats like Delta Lake and Iceberg, which enforce nullability checks correctly, came along. Nowadays users adopt these formats more and more, so the problem is surfacing. Until the operator is fixed, one possible workaround is sketched below.
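> A hedged workaround sketch (reusing the schema value defined above): rebuilding the DataFrame from the deduplicated RDD reasserts the declared nullability before the write, at the cost of an extra RDD round trip.
> {code:java}
> // Reapply the intended schema to the deduplicated data so the schema seen by
> // the writer matches what the table expects; the Iceberg write then no
> // longer sees v1 as optional.
> val fixedDF = spark.createDataFrame(inputDF.rdd, schema)
> fixedDF.write.mode("overwrite").format("iceberg").save("local.db.table")
> {code}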
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
