You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "BramBoog (via GitHub)" <gi...@apache.org> on 2023/05/02 15:31:08 UTC

[GitHub] [spark] BramBoog opened a new pull request, #41016: [SPARK-43341-][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column

BramBoog opened a new pull request, #41016:
URL: https://github.com/apache/spark/pull/41016

   ### What changes were proposed in this pull request?
   When converting a StructType instance containing a nested StructType column which in turn contains a column for which `nullable = false` to a DDL string using `.toDDL`, the resulting DDL string does not include this non-nullability. For example:
   ```
   val testSchema = StructType(List(
     StructField("key", IntegerType, false),
     StructField("value", StringType, true),
     StructField("nestedCols", StructType(List(
       StructField("nestedKey", IntegerType, false),
       StructField("nestedValue", StringType, true)
     )), false)
   ))
   
   println(testSchema.toDDL)
   println(StructType.fromDDL(testschema.toDDL))
   ```
   gives
   ```
   key INT NOT NULL,value STRING,nestedCols STRUCT<nestedKey: INT, nestedValue: STRING> NOT NULL
   
   StructType(
     StructField(key,IntegerType,false),
     StructField(value,StringType,true),
     StructField(nestedCols,StructType(
       StructField(nestedKey,IntegerType,true),
       StructField(nestedValue,StringType,true)
     ),false)
   )
   ```
   
   This is due to the fact that `StructType.toDDL` calls `StructField.toDDL` for its fields, which in turn calls `.sql` for its `dataType`. If `dataType` is a StructType, the call to `.sql` in turn calls `.sql` for all the nested fields, and this last method does not include the nullability of the field in its output. The proposed solution is therefore to have `StructField.toDDL` call `dataType.toDDL` for a StructType, since this will include information about nullability of nested columns.
   
   To work around the different DDL formats of nested and non-nested structs (the former is wrapped in `"STRUCT ...>"` and uses `colName: dataType` for its fields instead of `colName dataType`), package-private nested-specific versions of `.toDDL` have been added for StructType and StructField.
   
   ### Why are the changes needed?
   Currently, converting a StructType schema to a DDL string does not pass information about nullability of nested columns. This leads to a loss of information, and means converting to DDL and then back could alter the StructType schema. 
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, given the example above, the output will become:
   ```
   key INT NOT NULL,value STRING,nestedCols STRUCT<nestedKey: INT NOT NULL, nestedValue: STRING> NOT NULL
   
   StructType(
     StructField(key,IntegerType,false),
     StructField(value,StringType,true),
     StructField(nestedCols,StructType(
       StructField(nestedKey,IntegerType,false),
       StructField(nestedValue,StringType,true)
     ),false)
   )
   ```
   
   ### How was this patch tested?
   In `StructTypeSuite`, the `nestedStruct` testing value has been modified to include a non-nullable nested column. The relevant unit tests have been changed accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #41016: [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41016:
URL: https://github.com/apache/spark/pull/41016#issuecomment-1618024859

   I think something went wrong during the rebase ..


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] BramBoog commented on pull request #41016: [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column

Posted by "BramBoog (via GitHub)" <gi...@apache.org>.
BramBoog commented on PR #41016:
URL: https://github.com/apache/spark/pull/41016#issuecomment-1725386935

   @zaza Yeah the way I see it the PR is ready, I've just been waiting for a review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #41016: [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41016:
URL: https://github.com/apache/spark/pull/41016#issuecomment-1537627334

   cc @MaxGekk FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] BramBoog commented on pull request #41016: [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column

Posted by "BramBoog (via GitHub)" <gi...@apache.org>.
BramBoog commented on PR #41016:
URL: https://github.com/apache/spark/pull/41016#issuecomment-1618227111

   Hmm yeah that was strange, I should have noticed that immediately :/. Seems GitHub had added a whole list of commits which already were on master to the diff. Changing the base to a different branch and back to master has fixed it though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #41016: [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column
URL: https://github.com/apache/spark/pull/41016


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] BramBoog commented on pull request #41016: [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column

Posted by "BramBoog (via GitHub)" <gi...@apache.org>.
BramBoog commented on PR #41016:
URL: https://github.com/apache/spark/pull/41016#issuecomment-1612526727

   @MaxGekk Done!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column [spark]

Posted by "BramBoog (via GitHub)" <gi...@apache.org>.
BramBoog commented on PR #41016:
URL: https://github.com/apache/spark/pull/41016#issuecomment-1801399586

   @HyukjinKwon Apologies, it took a while for me to find the time to get back to this. Resolved the conflicts, PR is up to date again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #41016: [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #41016:
URL: https://github.com/apache/spark/pull/41016#issuecomment-1727125449

   @BramBoog it has a conflicts against the lastest master branch. You would need to resolve the conflicts by git fetch upstream & git rebase upstream/master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zaza commented on pull request #41016: [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column

Posted by "zaza (via GitHub)" <gi...@apache.org>.
zaza commented on PR #41016:
URL: https://github.com/apache/spark/pull/41016#issuecomment-1725193535

   Hey guys, any chance to have the PR merged? Although not critical it would simplify our schema tests a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org