Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/29 21:26:07 UTC

[GitHub] [spark] robreeves commented on a change in pull request #35969: [SPARK-38651][SQL] Add configuration to support writing out empty schemas in supported file-based data sources

robreeves commented on a change in pull request #35969:
URL: https://github.com/apache/spark/pull/35969#discussion_r837919464



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2714,6 +2714,14 @@ object SQLConf {
     .stringConf
     .createWithDefault("avro,csv,json,kafka,orc,parquet,text")
 
+  val ALLOW_EMPTY_SCHEMAS_FOR_WRITES =
+    buildConf("spark.sql.sources.file.allowEmptySchemaWrite")

Review comment:
       Other configurations use `spark.sql.sources.files` rather than the singular `file`. Can you update this to be consistent with them?
   
   On a related note, why does the name include `sources` at all? At a glance, `spark.sql.files.allowEmptySchemaWrite` fully identifies the configuration, and it would then live with the rest of the `spark.sql.files.` configs. Let me know if there is a defined naming convention I'm not aware of.
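   
   For illustration, the renamed definition might look roughly like this (just a sketch; the doc text and default value are my guesses, not taken from this PR):
   
   ```scala
   // Hypothetical sketch: the config renamed under the spark.sql.files. prefix.
   val ALLOW_EMPTY_SCHEMAS_FOR_WRITES =
     buildConf("spark.sql.files.allowEmptySchemaWrite")
       .doc("When true, file-based data sources that support it can write out " +
         "a DataFrame with an empty (zero-column) schema.") // doc text is a guess
       .booleanConf
       .createWithDefault(false)                            // default is a guess
   ```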

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
##########
@@ -154,6 +156,19 @@ class FileBasedDataSourceSuite extends QueryTest
     }
   }
 
+  val emptySchemaSupportedDataSources = Seq("orc", "csv", "json")
+  emptySchemaSupportedDataSources.foreach { format =>
+    val emptySchemaValidationConf = SQLConf.ALLOW_EMPTY_SCHEMAS_FOR_WRITES.key
+    test(s"SPARK-38651 allow writing empty schema files " +
+      s"using $format when ${emptySchemaValidationConf} is enabled") {
+      withSQLConf(emptySchemaValidationConf -> "true") {
+        withTempPath { outputPath =>
+          spark.emptyDataFrame.write.format(format).save(outputPath.toString)

Review comment:
       Is it possible to validate the file content for this test case?
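   
   For example, something like this might work (untested sketch; it assumes the format can read its own empty output back):
   
   ```scala
   // Hypothetical content check, not part of the PR: read the output back
   // and assert that it round-trips as a zero-column, zero-row DataFrame.
   val readBack = spark.read.format(format).load(outputPath.toString)
   assert(readBack.schema.isEmpty)
   assert(readBack.count() == 0)
   ```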

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##########
@@ -553,8 +553,12 @@ case class DataSource(
         disallowWritingIntervals(data.schema.map(_.dataType), forbidAnsiIntervals = true)
         SaveIntoDataSourceCommand(data, dataSource, caseInsensitiveOptions, mode)
       case format: FileFormat =>
+        val shouldAllowEmptySchema =
+          sparkSession.sessionState.conf.getConf(SQLConf.ALLOW_EMPTY_SCHEMAS_FOR_WRITES)
         disallowWritingIntervals(data.schema.map(_.dataType), forbidAnsiIntervals = false)
-        DataSource.validateSchema(data.schema)
+        if (!shouldAllowEmptySchema) {

Review comment:
       Should we honor this config only for formats that actually support empty schemas? I'm leaning towards keeping it simple and not checking that here, because a per-format check is harder to maintain as format support changes, but I wanted to call it out for discussion.
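   
   For the record, a per-format gate might look something like this (hypothetical sketch only; the allowlist and the name lookup via `DataSourceRegister` are my assumptions):
   
   ```scala
   // Hypothetical per-format gating, not in the PR. Formats register a short
   // name via DataSourceRegister; fall back to the class name otherwise.
   val formatsSupportingEmptySchema = Set("orc", "csv", "json")
   val formatName = format match {
     case r: DataSourceRegister => r.shortName()
     case _ => format.getClass.getSimpleName
   }
   if (!(shouldAllowEmptySchema && formatsSupportingEmptySchema.contains(formatName))) {
     DataSource.validateSchema(data.schema)
   }
   ```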




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


