Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/03 00:47:08 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request, #36435: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

HyukjinKwon opened a new pull request, #36435:
URL: https://github.com/apache/spark/pull/36435

   ### What changes were proposed in this pull request?
   
   This PR is a follow-up of https://github.com/apache/spark/pull/33436 that adds a legacy configuration. It was found that the change can break a valid use case (https://github.com/apache/spark/pull/33436/files#r863271189):
   
   ```scala
   import org.apache.spark.sql.types._
   val ds = Seq("a,", "a,b").toDS
   spark.read.schema(
     StructType(
       StructField("f1", StringType, nullable = false) ::
       StructField("f2", StringType, nullable = false) :: Nil)
     ).option("mode", "FAILFAST").csv(ds).show()
   ```
   
   **Before:**
   
   ```
   +---+---+
   | f1| f2|
   +---+---+
   |  a|  b|
   +---+---+
   ```
   
   **After:**
   
   ```
   +---+----+
   | f1|  f2|
   +---+----+
   |  a|null|
   |  a|   b|
   +---+----+
   ```
   
   ### Why are the changes needed?
   
   To avoid breaking valid use cases.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) to respect the nullability in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)` when the user-specified schema is provided.
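   For users who depended on the previous behavior, the legacy flag can be enabled on the running session before reading. A minimal sketch (assuming the flag is set at runtime via `spark.conf.set`; the exact output depends on the data and parse mode):
   
   ```scala
   import org.apache.spark.sql.types._
   
   // Opt back into respecting the user-specified nullability (the pre-change behavior).
   spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
   
   val ds = Seq("a,", "a,b").toDS
   spark.read.schema(
     StructType(
       StructField("f1", StringType, nullable = false) ::
       StructField("f2", StringType, nullable = false) :: Nil)
     ).option("mode", "DROPMALFORMED").csv(ds).show()
   ```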
   
   ### How was this patch tested?
   
   Unit tests were added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36435: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36435:
URL: https://github.com/apache/spark/pull/36435#issuecomment-1118248039

   Merged to master and branch-3.3.




[GitHub] [spark] HyukjinKwon closed pull request #36435: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #36435: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)
URL: https://github.com/apache/spark/pull/36435




[GitHub] [spark] MaxGekk commented on a diff in pull request #36435: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #36435:
URL: https://github.com/apache/spark/pull/36435#discussion_r865208393


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3025,6 +3025,18 @@ object SQLConf {
     .intConf
     .createOptional
 
+  val LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION =
+  buildConf("spark.sql.legacy.respectNullabilityInTextDatasetConversion")

Review Comment:
   Looking at the surrounding configs, this should be shifted 2 spaces to the left?



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -2693,6 +2693,17 @@ abstract class CSVSuite
       assert(df.schema == expected)
       checkAnswer(df, Row(1, null) :: Nil)
     }
+
+    withSQLConf(
+        SQLConf.LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION.key -> true.toString) {

Review Comment:
   nit:
   ```suggestion
    withSQLConf(SQLConf.LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION.key -> "true") {
   
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3025,6 +3025,18 @@ object SQLConf {
     .intConf
     .createOptional
 
+  val LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION =
+  buildConf("spark.sql.legacy.respectNullabilityInTextDatasetConversion")
+    .internal()
+    .doc(
+      "When true, the nullability in the user-specified schema for " +
+        "`DataFrameReader.schema(schema).json(jsonDataset)` and " +
+        "`DataFrameReader.schema(schema).csv(csvDataset)` is respected. Otherwise, they are " +
+        "turned to a nullable schema forcely.")

Review Comment:
   unindent by 2 spaces



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3025,6 +3025,18 @@ object SQLConf {
     .intConf
     .createOptional
 
+  val LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION =
+  buildConf("spark.sql.legacy.respectNullabilityInTextDatasetConversion")
+    .internal()
+    .doc(
+      "When true, the nullability in the user-specified schema for " +
+        "`DataFrameReader.schema(schema).json(jsonDataset)` and " +
+        "`DataFrameReader.schema(schema).csv(csvDataset)` is respected. Otherwise, they are " +
+        "turned to a nullable schema forcely.")

Review Comment:
   forcely -> forcibly?





[GitHub] [spark] HyukjinKwon commented on pull request #36435: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36435:
URL: https://github.com/apache/spark/pull/36435#issuecomment-1115507707

   cc @cloud-fan and @cfmcgrady




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36435: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36435:
URL: https://github.com/apache/spark/pull/36435#discussion_r865531628


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -2693,6 +2693,16 @@ abstract class CSVSuite
       assert(df.schema == expected)
       checkAnswer(df, Row(1, null) :: Nil)
     }
+
+    withSQLConf(SQLConf.LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION.key -> "true") {
+      checkAnswer(
+        spark.read.schema(
+          StructType(
+            StructField("f1", StringType, nullable = false) ::
+            StructField("f2", StringType, nullable = false) :: Nil)
+        ).option("mode", "DROPMALFORMED").csv(Seq("a,", "a,b").toDS),

Review Comment:
   One test should be good enough to check whether nullability is respected, since that is what this PR adds as a configuration.





[GitHub] [spark] cfmcgrady commented on a diff in pull request #36435: [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)

Posted by GitBox <gi...@apache.org>.
cfmcgrady commented on code in PR #36435:
URL: https://github.com/apache/spark/pull/36435#discussion_r865524477


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -2693,6 +2693,16 @@ abstract class CSVSuite
       assert(df.schema == expected)
       checkAnswer(df, Row(1, null) :: Nil)
     }
+
+    withSQLConf(SQLConf.LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION.key -> "true") {
+      checkAnswer(
+        spark.read.schema(
+          StructType(
+            StructField("f1", StringType, nullable = false) ::
+            StructField("f2", StringType, nullable = false) :: Nil)
+        ).option("mode", "DROPMALFORMED").csv(Seq("a,", "a,b").toDS),

Review Comment:
   Shall we also add a test case for `FAILFAST`/`PERMISSIVE` modes? The behavior is different with the legacy conversion.
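   One possible shape for such a test, assuming the legacy behavior makes `FAILFAST` raise a `SparkException` on the row that violates the non-nullable field (a sketch, not necessarily the test that was added):
   
   ```scala
   withSQLConf(SQLConf.LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION.key -> "true") {
     val schema = StructType(
       StructField("f1", StringType, nullable = false) ::
       StructField("f2", StringType, nullable = false) :: Nil)
     // With nullability respected, "a," produces a null for the non-nullable f2,
     // so FAILFAST is expected to fail the read instead of returning a row.
     intercept[SparkException] {
       spark.read.schema(schema)
         .option("mode", "FAILFAST").csv(Seq("a,", "a,b").toDS).collect()
     }
   }
   ```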
   


