You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/08 00:09:02 UTC

[GitHub] [spark] dtenedor opened a new pull request, #37431: [SPARK-40001][SQL] Add config to make DEFAULT values in JSON tables mutually exclusive with SQLConf.JSON_GENERATOR_IGNORE_NULL_FIELDS

dtenedor opened a new pull request, #37431:
URL: https://github.com/apache/spark/pull/37431

   ### What changes were proposed in this pull request?
   
   Add config to make DEFAULT values in JSON tables mutually exclusive with SQLConf.JSON_GENERATOR_IGNORE_NULL_FIELDS.
   
   When this new config is true, allow DEFAULT column values with JSON tables when thr JSON_GENERATOR_IGNORE_NULL_FIELDS conf is enabled. Otherwise, the two become mutually exclusive. This can be useful to enforce that inserted NULL values are present in storage to differentiate from missing data.
   
   ### Why are the changes needed?
   
   This can help guard correctness of query results.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, please see above.
   
   ### How was this patch tested?
   
   This PR adds new unit test coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on a diff in pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #37431:
URL: https://github.com/apache/spark/pull/37431#discussion_r944785318


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2924,6 +2924,18 @@ object SQLConf {
       .stringConf
       .createWithDefault("csv,json,orc,parquet")
 
+  val DEFAULT_COLUMN_JSON_GENERATOR_FORCE_NULL_FIELDS =
+    buildConf("spark.sql.defaultColumn.jsonGeneratorForceNullFields")

Review Comment:
   Sounds good, done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37431:
URL: https://github.com/apache/spark/pull/37431#discussion_r944860509


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4523,6 +4535,9 @@ class SQLConf extends Serializable with Logging {
 
   def defaultColumnAllowedProviders: String = getConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS)
 
+  def defaultColumnJsonWriteNullIfWithDefaultValue: Boolean =

Review Comment:
   ```suggestion
     def JsonWriteNullIfWithDefaultValue: Boolean =
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37431:
URL: https://github.com/apache/spark/pull/37431#discussion_r944861189


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala:
##########
@@ -84,6 +84,12 @@ private[sql] class JSONOptions(
   val ignoreNullFields = parameters.get("ignoreNullFields").map(_.toBoolean)
     .getOrElse(SQLConf.get.jsonGeneratorIgnoreNullFields)
 
+  // If this is true, when writing NULL values to columns of JSON tables with explicit DEFAULT
+  // values, never skip writing the NULL values to storage, overriding 'ignoreNullFields' above.
+  // This can be useful to enforce that inserted NULL values are present in storage to differentiate
+  // from missing data.
+  val defaultColumnsForceNullFields = SQLConf.get.defaultColumnJsonWriteNullIfWithDefaultValue

Review Comment:
   ```suggestion
     val writeNullIfWithDefaultValue = SQLConf.get.jsonWriteNullIfWithDefaultValue
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37431:
URL: https://github.com/apache/spark/pull/37431#discussion_r944860366


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2924,6 +2924,18 @@ object SQLConf {
       .stringConf
       .createWithDefault("csv,json,orc,parquet")
 
+  val DEFAULT_COLUMN_JSON_GENERATOR_FORCE_NULL_FIELDS =

Review Comment:
   ```suggestion
     val JSON_GENERATOR_WRITE_NULL_IF_WITH_DEFAULT_VALUE =
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on a diff in pull request #37431: [SPARK-40001][SQL] Add config to make DEFAULT values in JSON tables mutually exclusive with SQLConf.JSON_GENERATOR_IGNORE_NULL_FIELDS

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #37431:
URL: https://github.com/apache/spark/pull/37431#discussion_r943921356


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala:
##########
@@ -1657,6 +1656,28 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {
     }
   }
 
+  test("SPARK-40001 JSON DEFAULT columns require JSON_GENERATOR_IGNORE_NULL_FIELDS off") {
+    val error = "DEFAULT values are not supported for JSON tables"
+    withTable("t") {
+      assert(intercept[AnalysisException] {
+        sql("create table t (a int default 42) using json")

Review Comment:
   Thanks for the useful comment! This made me realize that other writers may create JSON with missing NULL values in the storage as well. I fixed this by:
   
   1) Reverting the analyzer changes in this PR
   2) Changed the new config to `DEFAULT_COLUMN_JSON_GENERATOR_FORCE_NULL_FIELDS` which overrides any other settings for target columns with DEFAULT values to always write explicit NULLs to storage.
   3) Updated the `DEFAULT_COLUMN_ALLOWED_PROVIDERS` config to ban `ALTER TABLE ADD COLUMN` commands with `DEFAULT` values for JSON tables, with a descriptive error message.
   
   This ensures correctness with JSON `DEFAULT` columns by always ensuring that new rows with NULL values get explicit NULLs written to the JSON storage, so that subsequent scans can tell the difference between those NULLs and any written `DEFAULT` values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37431:
URL: https://github.com/apache/spark/pull/37431#discussion_r944860509


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4523,6 +4535,9 @@ class SQLConf extends Serializable with Logging {
 
   def defaultColumnAllowedProviders: String = getConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS)
 
+  def defaultColumnJsonWriteNullIfWithDefaultValue: Boolean =

Review Comment:
   ```suggestion
     def jsonWriteNullIfWithDefaultValue: Boolean =
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37431:
URL: https://github.com/apache/spark/pull/37431#discussion_r949741781


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2924,6 +2924,18 @@ object SQLConf {
       .stringConf
       .createWithDefault("csv,json,orc,parquet")
 
+  val JSON_GENERATOR_WRITE_NULL_IF_WITH_DEFAULT_VALUE =
+    buildConf("spark.sql.jsonGenerator.writeNullIfWithDefaultValue")
+      .internal()
+      .doc("When true, when writing NULL values to columns of JSON tables with explicit DEFAULT " +
+        "values using INSERT, UPDATE, or MERGE commands, never skip writing the NULL values to " +
+        "storage, overriding spark.sql.jsonGenerator.ignoreNullFields or the ignoreNullFields " +

Review Comment:
   nit but we could use `JSON_GENERATOR_IGNORE_NULL_FIELDS.key` instead of `spark.sql.jsonGenerator.ignoreNullFields`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #37431: [SPARK-40001][SQL] Add config to make DEFAULT values in JSON tables mutually exclusive with SQLConf.JSON_GENERATOR_IGNORE_NULL_FIELDS

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37431:
URL: https://github.com/apache/spark/pull/37431#issuecomment-1207537404

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37431: [SPARK-40001][SQL] Add config to make DEFAULT values in JSON tables mutually exclusive with SQLConf.JSON_GENERATOR_IGNORE_NULL_FIELDS

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37431:
URL: https://github.com/apache/spark/pull/37431#discussion_r939745224


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala:
##########
@@ -1657,6 +1656,28 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {
     }
   }
 
+  test("SPARK-40001 JSON DEFAULT columns require JSON_GENERATOR_IGNORE_NULL_FIELDS off") {
+    val error = "DEFAULT values are not supported for JSON tables"
+    withTable("t") {
+      assert(intercept[AnalysisException] {
+        sql("create table t (a int default 42) using json")

Review Comment:
   Hm, should we take `ignoreNullFields` option into account too? e.g.( create table t ... using json options (ignoreNullFields ... )



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37431:
URL: https://github.com/apache/spark/pull/37431#issuecomment-1213456293

   Hi @gengliangwang responded to comments, this is ready for another round when ready :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on pull request #37431: [SPARK-40001][SQL] Add config to make DEFAULT values in JSON tables mutually exclusive with SQLConf.JSON_GENERATOR_IGNORE_NULL_FIELDS

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37431:
URL: https://github.com/apache/spark/pull/37431#issuecomment-1212471765

   Hi @HyukjinKwon, thanks for your review, please take another look when ready.
   @gengliangwang FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #37431:
URL: https://github.com/apache/spark/pull/37431#issuecomment-1213361133

   +1 on having such a new configuration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #37431:
URL: https://github.com/apache/spark/pull/37431#issuecomment-1214197449

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on pull request #37431: [SPARK-40001][SQL] Add config to make DEFAULT values in JSON tables mutually exclusive with SQLConf.JSON_GENERATOR_IGNORE_NULL_FIELDS

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37431:
URL: https://github.com/apache/spark/pull/37431#issuecomment-1207520184

   Hi @gengliangwang this PR double-checks correctness for column DEFAULT values with another corner case (JSON tables).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang closed pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
gengliangwang closed pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage
URL: https://github.com/apache/spark/pull/37431


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #37431: [SPARK-40001][SQL] Make NULL writes to JSON DEFAULT columns write ‘null’ to storage

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37431:
URL: https://github.com/apache/spark/pull/37431#discussion_r944674301


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2924,6 +2924,18 @@ object SQLConf {
       .stringConf
       .createWithDefault("csv,json,orc,parquet")
 
+  val DEFAULT_COLUMN_JSON_GENERATOR_FORCE_NULL_FIELDS =
+    buildConf("spark.sql.defaultColumn.jsonGeneratorForceNullFields")

Review Comment:
   Since we have a conf `spark.sql.jsonGenerator.ignoreNullFields` already, I suggest making this one a JSON data source configuration as well. How about `spark.sql.jsonGenerator.writeNullIfWithDefaultValue`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org