You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "CodingCat (via GitHub)" <gi...@apache.org> on 2023/07/24 02:17:29 UTC

[GitHub] [spark] CodingCat opened a new pull request, #42117: [SPARK-44517][SQL]respect ignorenulls and child's nullability in first

CodingCat opened a new pull request, #42117:
URL: https://github.com/apache/spark/pull/42117

   
   ### What changes were proposed in this pull request?
   
   this PR changed the nullable of first() operator by respecting the ignoreNulls and child's nullability
   
   
   ### Why are the changes needed?
   
   copied from JIRA
   
   I found the following problem when using Spark recently:
   
    
   
   ```scala
   import spark.implicits._
   
   val s = Seq((1.2, "s", 2.2)).toDF("v1", "v2", "v3")
   
   val schema = StructType(Seq(StructField("v1", DoubleType, nullable = false),StructField("v2", StringType, nullable = true),StructField("v3", DoubleType, nullable = false)))
   
   val df = spark.createDataFrame(s.rdd, schema)val inputDF = 
   
   val inputDF = df.dropDuplicates("v3")
   
   spark.sql("CREATE TABLE local.db.table (\n v1 DOUBLE NOT NULL,\n v2 STRING, v3 DOUBLE NOT NULL)")
   
   inputDF.write.mode("overwrite").format("iceberg").save("local.db.table") 
   ```
   
    
   
   when I use the above code to write to iceberg (i guess Delta Lake will have the same problem) , I got very confusing exception
   
   Exception in thread "main" java.lang.IllegalArgumentException: Cannot write incompatible dataset to table with schema:
   
   ```
   table 
   
   {  1: v1: required double  2: v2: optional string  3: v3: required double}
   
   Provided schema:
   
   table {  1: v1: optional double  2: v2: optional string  3: v3: required double} 
   
   ```
   
   basically it complains that we have v1 as the nullable column in our `inputDF` above which is not allowed since we created table with the v1 as not nullable. The confusion comes from that,  if we check the schema with printSchema() of inputDF, v1 is not nullable
   
   ```
   root 
   |-- v1: double (nullable = false) 
   |-- v2: string (nullable = true) 
   |-- v3: double (nullable = false)
   ```
   
   Clearly, something changed the v1's nullability unexpectedly!
   
    
   
   After some debugging I found that the key is that `dropDuplicates("v3")`. In optimization phase, we have `ReplaceDeduplicateWithAggregate` to replace the Deduplicate with aggregate on v3 and run first() over all other columns. However, first() operator has hard coded nullable as always "true" which is the source of changed nullability of v1
   
   
   
   this is a very confusing behavior of Spark, and probably no one really noticed as we do not care too much without the new table formats like delta lake and iceberg which can make nullability check correctly. Nowadays, we users adopt them more and more, this is surfaced up
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   
   unit test and integration test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] CodingCat commented on a diff in pull request #42117: [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first

Posted by "CodingCat (via GitHub)" <gi...@apache.org>.

CodingCat commented on code in PR #42117:
URL: https://github.com/apache/spark/pull/42117#discussion_r1272475261


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala:
##########
@@ -60,7 +60,13 @@ case class First(child: Expression, ignoreNulls: Boolean)
     this(child, FirstLast.validateIgnoreNullExpr(ignoreNullsExpr, "first"))
   }
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = {
+    if (ignoreNulls) {
+      false

Review Comment:
   ah, good point!
   
   I agree that taking ignoreNulls into consideration can be problematic in the case of all null
   
   for the empty relation case, does it matter? because in the end we will have an empty table but with certain column as non-null, I personally think there should be a difference between null value and the row doesn't exist at all 
   
   maybe we should just set nullable as child.nullable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed pull request #42117: [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first
URL: https://github.com/apache/spark/pull/42117


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] CodingCat commented on a diff in pull request #42117: [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first

Posted by "CodingCat (via GitHub)" <gi...@apache.org>.

CodingCat commented on code in PR #42117:
URL: https://github.com/apache/spark/pull/42117#discussion_r1274167131


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala:
##########
@@ -60,7 +60,13 @@ case class First(child: Expression, ignoreNulls: Boolean)
     this(child, FirstLast.validateIgnoreNullExpr(ignoreNullsExpr, "first"))
   }
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = {
+    if (ignoreNulls) {
+      false

Review Comment:
   I see what you mean now
   
   yes, I reproduced it and it will even throw NPE in shuffle write stage
   
   is there any disadvantage for us to change https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L549-L551 to `optimizedPlan.schema` ?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala:
##########
@@ -60,7 +60,13 @@ case class First(child: Expression, ignoreNulls: Boolean)
     this(child, FirstLast.validateIgnoreNullExpr(ignoreNullsExpr, "first"))
   }
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = {
+    if (ignoreNulls) {
+      false

Review Comment:
   I see what you mean now
   
   yes, I reproduced it and it will even throw NPE in shuffle write stage
   
   is there any disadvantage for us to change https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L549-L551 to `optimizedPlan.schema` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] CodingCat commented on a diff in pull request #42117: [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first

Posted by "CodingCat (via GitHub)" <gi...@apache.org>.

CodingCat commented on code in PR #42117:
URL: https://github.com/apache/spark/pull/42117#discussion_r1274164870


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala:
##########
@@ -60,7 +60,13 @@ case class First(child: Expression, ignoreNulls: Boolean)
     this(child, FirstLast.validateIgnoreNullExpr(ignoreNullsExpr, "first"))
   }
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = {
+    if (ignoreNulls) {
+      false

Review Comment:
   I see what you mean now
   
   yes, I reproduced it and it will even throw NPE in shuffle write stage
   
   is there any disadvantage for us to change https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L549-L551 to `optimizedPlan.schema` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] CodingCat commented on a diff in pull request #42117: [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first

Posted by "CodingCat (via GitHub)" <gi...@apache.org>.

CodingCat commented on code in PR #42117:
URL: https://github.com/apache/spark/pull/42117#discussion_r1273654334


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala:
##########
@@ -60,7 +60,13 @@ case class First(child: Expression, ignoreNulls: Boolean)
     this(child, FirstLast.validateIgnoreNullExpr(ignoreNullsExpr, "first"))
   }
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = {
+    if (ignoreNulls) {
+      false

Review Comment:
   @beliefer , I agree, as I added above, probably we just don't consider ignoreNulls and set first.nullable = child.nullable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a diff in pull request #42117: [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #42117:
URL: https://github.com/apache/spark/pull/42117#discussion_r1273379892


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala:
##########
@@ -60,7 +60,13 @@ case class First(child: Expression, ignoreNulls: Boolean)
     this(child, FirstLast.validateIgnoreNullExpr(ignoreNullsExpr, "first"))
   }
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = {
+    if (ignoreNulls) {
+      false

Review Comment:
   All the input values are null is possible, e.g. `SELECT first(col) FILTER (col is null)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] CodingCat commented on a diff in pull request #42117: [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first

Posted by "CodingCat (via GitHub)" <gi...@apache.org>.

CodingCat commented on code in PR #42117:
URL: https://github.com/apache/spark/pull/42117#discussion_r1272475261


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala:
##########
@@ -60,7 +60,13 @@ case class First(child: Expression, ignoreNulls: Boolean)
     this(child, FirstLast.validateIgnoreNullExpr(ignoreNullsExpr, "first"))
   }
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = {
+    if (ignoreNulls) {
+      false

Review Comment:
   ah, good point!
   
   I agree that taking ignoreNulls into consideration can be problematic in the case of all null
   
   for the empty relation case, does it matter? because in the end we will have an empty table but with certain column as non-null defined in schema, I personally think there should be a difference between null value and the row doesn't exist at all 
   
   maybe we should just set nullable as child.nullable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a diff in pull request #42117: [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first

Posted by "viirya (via GitHub)" <gi...@apache.org>.

viirya commented on code in PR #42117:
URL: https://github.com/apache/spark/pull/42117#discussion_r1273799102


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala:
##########
@@ -60,7 +60,13 @@ case class First(child: Expression, ignoreNulls: Boolean)
     this(child, FirstLast.validateIgnoreNullExpr(ignoreNullsExpr, "first"))
   }
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = {
+    if (ignoreNulls) {
+      false

Review Comment:
   > for the empty relation case, does it matter? because in the end we will have an empty table but with certain column as non-null defined in schema, I personally think there should be a difference between null value and the row doesn't exist at all
   
   If the column is non-nullable in the empty relation, `First` will return null value but its `nullable` is false (because you set it to `child.nullable`). This produces inconsistent schema and data.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a diff in pull request #42117: [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first

Posted by "viirya (via GitHub)" <gi...@apache.org>.

viirya commented on code in PR #42117:
URL: https://github.com/apache/spark/pull/42117#discussion_r1271772407


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala:
##########
@@ -60,7 +60,13 @@ case class First(child: Expression, ignoreNulls: Boolean)
     this(child, FirstLast.validateIgnoreNullExpr(ignoreNullsExpr, "first"))
   }
 
-  override def nullable: Boolean = true
+  override def nullable: Boolean = {
+    if (ignoreNulls) {
+      false

Review Comment:
   Hmm, even `ignoreNulls` is true to ignore null values, because `first`'s initial value is null, the evaluated value of `First` is still possibly null if,
   
   1. No inputs, i.e. empty relation
   2. All input values are null
   
   I think this maybe why `nullable` is set to true always.
   
   Although we can use `child.nullable` to check if input value could be null (2), but we cannot know if inputs are empty or not.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #42117:
URL: https://github.com/apache/spark/pull/42117#issuecomment-1791732388

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org