You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/17 17:05:37 UTC

[GitHub] [spark] dtenedor opened a new pull request, #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

dtenedor opened a new pull request, #36583:
URL: https://github.com/apache/spark/pull/36583

   ### What changes were proposed in this pull request?
   
   Support JSON scans when the table schema has associated DEFAULT column values.
   
   Example:
   
   ```
   create table t(i int) using json;
   insert into t values(42);
   alter table t add column s string default concat('abc', def');
   select * from t;
   > 42, 'abcdef'
   ```
   
   Interesting note: JSON does not distinguish between NULL values and the absence of values. Therefore inserting NULL and then selecting back the same column yields the default value (if any), since the insert did not change any storage.
   
   ### Why are the changes needed?
   
   This change makes it easier to build, query, and maintain tables backed by JSON data.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   
   ### How was this patch tested?
   
   This PR includes new test coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #36583:
URL: https://github.com/apache/spark/pull/36583#issuecomment-1129351769

   > Is this [[[SPARK-38067](https://issues.apache.org/jira/browse/SPARK-38067)][PYTHON] Preserve None values when saved to JSON](https://github.com/apache/spark/commit/66b9087233bc0cb90cf3af07ec34ba74b9c32d5b) the same case?
   
   @bjornjorgensen thanks for pointing that out! It looks like there is a code path to store null values in the JSON result. I'll add a code path and test cases to cover both ways and update this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #36583:
URL: https://github.com/apache/spark/pull/36583#issuecomment-1130268912

   @gengliangwang @HyukjinKwon this is ready for review at your convenience


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on a diff in pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #36583:
URL: https://github.com/apache/spark/pull/36583#discussion_r876484352


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -419,7 +419,10 @@ class JacksonParser(
     val row = new GenericInternalRow(schema.length)
     var badRecordException: Option[Throwable] = None
     var skipRow = false
-
+    // Apply default values from the column metadata to the initial row, if any.
+    for ((value: Any, i: Int) <- schema.existenceDefaultValues.zipWithIndex) {

Review Comment:
   SG, done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bjornjorgensen commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on PR #36583:
URL: https://github.com/apache/spark/pull/36583#issuecomment-1129337164

   
   > Interesting note: JSON does not distinguish between NULL values and the absence of values. Therefore inserting NULL and then selecting back the same column yields the default value (if any), since the insert did not change any storage.
   > 
   
   Is this [[[SPARK-38067](https://issues.apache.org/jira/browse/SPARK-38067)][PYTHON] Preserve None values when saved to JSON](https://github.com/apache/spark/commit/66b9087233bc0cb90cf3af07ec34ba74b9c32d5b) the same case? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on a diff in pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #36583:
URL: https://github.com/apache/spark/pull/36583#discussion_r878419700


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -419,7 +419,13 @@ class JacksonParser(
     val row = new GenericInternalRow(schema.length)
     var badRecordException: Option[Throwable] = None
     var skipRow = false
-
+    // Apply default values from the column metadata to the initial row, if any.
+    if (schema.hasExistenceDefaultValues) {
+      for ((value: Any, i: Int) <- schema.existenceDefaultValues.zipWithIndex) {

Review Comment:
   Thanks for pointing this out, I implemented a suggested from Gengliang to keep a boolean array to track which columns we need to assign default values to. This way it's fast, and we only update columns with explicit default values that never got values assigned during the scan.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36583:
URL: https://github.com/apache/spark/pull/36583#discussion_r876460895


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -419,7 +419,10 @@ class JacksonParser(
     val row = new GenericInternalRow(schema.length)
     var badRecordException: Option[Throwable] = None
     var skipRow = false
-
+    // Apply default values from the column metadata to the initial row, if any.
+    for ((value: Any, i: Int) <- schema.existenceDefaultValues.zipWithIndex) {

Review Comment:
   No biggie but maybe we could add an if-else to skip this assignment if there are no default values (since arguably most cases won't have default values, and this is performance sensitive code path)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #36583:
URL: https://github.com/apache/spark/pull/36583#issuecomment-1130827706

   @HyukjinKwon Thanks for your review, I also implemented your suggestions from the previous PR for CSV files (https://github.com/apache/spark/pull/36501) here as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36583:
URL: https://github.com/apache/spark/pull/36583#discussion_r876501196


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -419,7 +419,13 @@ class JacksonParser(
     val row = new GenericInternalRow(schema.length)
     var badRecordException: Option[Throwable] = None
     var skipRow = false
-
+    // Apply default values from the column metadata to the initial row, if any.
+    if (schema.hasExistenceDefaultValues) {
+      for ((value: Any, i: Int) <- schema.existenceDefaultValues.zipWithIndex) {

Review Comment:
   Oh yeah. `zipWithIndex` is actually pretty slow too IIRC.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #36583:
URL: https://github.com/apache/spark/pull/36583#issuecomment-1134938848

   @gengliangwang @HyukjinKwon this is ready for another look when you have a moment :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36583:
URL: https://github.com/apache/spark/pull/36583#issuecomment-1129652285

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang closed pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
gengliangwang closed pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values
URL: https://github.com/apache/spark/pull/36583


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36583:
URL: https://github.com/apache/spark/pull/36583#issuecomment-1135409978

   I will defer to @gengliangwang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36583:
URL: https://github.com/apache/spark/pull/36583#issuecomment-1130758683

   Adding @MaxGekk too FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #36583:
URL: https://github.com/apache/spark/pull/36583#issuecomment-1129125976

   Note: this PR is based on https://github.com/apache/spark/pull/36501. The additional changes comprise about 15 lines of code, in this commit: 
   
   https://github.com/apache/spark/pull/36583/commits/c4fab6090d2a3f17cfa0ae65c484fd0d08f9b545


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #36583:
URL: https://github.com/apache/spark/pull/36583#discussion_r876496238


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala:
##########
@@ -419,7 +419,13 @@ class JacksonParser(
     val row = new GenericInternalRow(schema.length)
     var badRecordException: Option[Throwable] = None
     var skipRow = false
-
+    // Apply default values from the column metadata to the initial row, if any.
+    if (schema.hasExistenceDefaultValues) {
+      for ((value: Any, i: Int) <- schema.existenceDefaultValues.zipWithIndex) {

Review Comment:
   Will this be a perf regression?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #36583: [SPARK-39211][SQL] Support JSON scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #36583:
URL: https://github.com/apache/spark/pull/36583#issuecomment-1135927938

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org