Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/30 09:10:30 UTC

[GitHub] [spark] AngersZhuuuu opened a new pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

AngersZhuuuu opened a new pull request #30202:
URL: https://github.com/apache/spark/pull/30202


   ### What changes were proposed in this pull request?
   Address comments:
   https://github.com/apache/spark/pull/30156#discussion_r514894596
   https://github.com/apache/spark/pull/30156#discussion_r514894838
   
   
   ### Why are the changes needed?
   Makes the change more readable.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Existing UTs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719439854


   **[Test build #130450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130450/testReport)** for PR 30202 at commit [`c401465`](https://github.com/apache/spark/commit/c401465e2e5dcf849b52593addf3ebb18f803354).




[GitHub] [spark] SparkQA commented on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719476723


   **[Test build #130453 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130453/testReport)** for PR 30202 at commit [`6af4e86`](https://github.com/apache/spark/commit/6af4e867e00451257a3ed979bb9f0e7781135fee).




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Update migration guide to make clear what behavior changed and make variable names and configuration name more clearer

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #30202:
URL: https://github.com/apache/spark/pull/30202#discussion_r515933291



##########
File path: docs/sql-migration-guide.md
##########
@@ -52,7 +52,7 @@ license: |
   
   - In Spark 3.1, the `schema_of_json` and `schema_of_csv` functions return the schema in the SQL format in which field names are quoted. In Spark 3.0, the function returns a catalog string without field quoting and in lower case. 
  
-  - In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when script transformation's output value size less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`). If false, Spark will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
+  - In Spark 3.1, when script transformation output's value size is less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`), Spark will pad NUll value to supplement data. In Spark 3.0 or earlier, Spark will do nothing and throw `ArrayIndexOutOfBoundsException`. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.transformationNotPadNullToSupplementData.enabled` to `true`.

Review comment:
       So, what should I do next? Personally, I think we need to let users know about this change.

##########
File path: docs/sql-migration-guide.md
##########
@@ -52,7 +52,7 @@ license: |
   
   - In Spark 3.1, the `schema_of_json` and `schema_of_csv` functions return the schema in the SQL format in which field names are quoted. In Spark 3.0, the function returns a catalog string without field quoting and in lower case. 
  
-  - In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when script transformation's output value size less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`). If false, Spark will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
+  - In Spark 3.1, when script transformation output's value size is less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`), Spark will pad NUll value to supplement data. In Spark 3.0 or earlier, Spark will do nothing and throw `ArrayIndexOutOfBoundsException`. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.transformationNotPadNullToSupplementData.enabled` to `true`.

Review comment:
       All right, we can just revert the last PR.
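The padding behavior debated in the quoted migration-guide text can be illustrated with a small standalone sketch. This is not Spark source code: the function name and signature below are hypothetical, chosen only to mimic what the guide describes for default-serde script transformation (`ROW FORMAT DELIMITED`) — Spark 3.1 pads missing trailing columns with NULL, while Spark 3.0 and earlier threw `ArrayIndexOutOfBoundsException` when a row had fewer values than the schema.

```python
def parse_transform_row(fields, schema_size, pad_null=True):
    """Return a row of exactly schema_size values (hypothetical sketch).

    fields      -- values the transformation script emitted for one row
    schema_size -- number of columns declared in the output schema
    pad_null    -- True mimics the Spark 3.1 behavior (pad with None);
                   False mimics the legacy behavior (fail on short rows,
                   analogous to ArrayIndexOutOfBoundsException on the JVM)
    """
    if len(fields) < schema_size:
        if not pad_null:
            raise IndexError(
                "row has %d values but schema expects %d"
                % (len(fields), schema_size)
            )
        # Pad the missing trailing columns with NULL.
        return fields + [None] * (schema_size - len(fields))
    # Values beyond the schema size are dropped.
    return fields[:schema_size]
```

For example, a script emitting `["a", "b"]` against a four-column schema yields `["a", "b", None, None]` with padding enabled, and raises with it disabled.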






[GitHub] [spark] SparkQA removed a comment on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719439854


   **[Test build #130450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130450/testReport)** for PR 30202 at commit [`c401465`](https://github.com/apache/spark/commit/c401465e2e5dcf849b52593addf3ebb18f803354).




[GitHub] [spark] AmplabJenkins commented on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719472350








[GitHub] [spark] srowen commented on a change in pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Update migration guide to make clear what behavior changed and make variable names and configuration name more clearer

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #30202:
URL: https://github.com/apache/spark/pull/30202#discussion_r515970048



##########
File path: docs/sql-migration-guide.md
##########
@@ -52,7 +52,7 @@ license: |
   
   - In Spark 3.1, the `schema_of_json` and `schema_of_csv` functions return the schema in the SQL format in which field names are quoted. In Spark 3.0, the function returns a catalog string without field quoting and in lower case. 
  
-  - In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when script transformation's output value size less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`). If false, Spark will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
+  - In Spark 3.1, when script transformation output's value size is less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`), Spark will pad NUll value to supplement data. In Spark 3.0 or earlier, Spark will do nothing and throw `ArrayIndexOutOfBoundsException`. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.transformationNotPadNullToSupplementData.enabled` to `true`.

Review comment:
       Nit: NUll -> NULL
   Yeah, why would anyone ever set this flag to restore the original behavior, i.e. to fail? This is just a bug fix. Nobody can reasonably be relying on the old behavior.






[GitHub] [spark] AngersZhuuuu commented on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719436437


   FYI @cloud-fan @HyukjinKwon 




[GitHub] [spark] SparkQA commented on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719472119


   **[Test build #130450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130450/testReport)** for PR 30202 at commit [`c401465`](https://github.com/apache/spark/commit/c401465e2e5dcf849b52593addf3ebb18f803354).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AngersZhuuuu closed pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Update migration guide to make clear what behavior changed and make variable names and configuration name more clearer

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu closed pull request #30202:
URL: https://github.com/apache/spark/pull/30202


   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719472454


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130450/
   Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719472328


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35056/
   




[GitHub] [spark] SparkQA commented on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719443724


   **[Test build #130451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130451/testReport)** for PR 30202 at commit [`c3437b1`](https://github.com/apache/spark/commit/c3437b1705309856d4db6e3eb60a62eba2a8566c).




[GitHub] [spark] SparkQA commented on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719461369


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35056/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Update migration guide to make clear what behavior changed and make variable names and configuration name more clearer

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #30202:
URL: https://github.com/apache/spark/pull/30202#discussion_r515753866



##########
File path: docs/sql-migration-guide.md
##########
@@ -52,7 +52,7 @@ license: |
   
   - In Spark 3.1, the `schema_of_json` and `schema_of_csv` functions return the schema in the SQL format in which field names are quoted. In Spark 3.0, the function returns a catalog string without field quoting and in lower case. 
  
-  - In Spark 3.1, when `spark.sql.legacy.transformationPadNullWhenValueLessThenSchema` is true, Spark will pad NULL value when script transformation's output value size less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`). If false, Spark will keep original behavior to throw `ArrayIndexOutOfBoundsException`.
+  - In Spark 3.1, when script transformation output's value size is less then schema size in default-serde mode(script transformation with row format of `ROW FORMAT DELIMITED`), Spark will pad NUll value to supplement data. In Spark 3.0 or earlier, Spark will do nothing and throw `ArrayIndexOutOfBoundsException`. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.transformationNotPadNullToSupplementData.enabled` to `true`.

Review comment:
       Yea, I'm just saying that we usually don't add a migration guide entry when we make something that used to fail start working, as it's not a breaking change and users don't need to do anything to migrate.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #30202: [SPARK-33248][SQL][FOLLOWUP] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30202:
URL: https://github.com/apache/spark/pull/30202#issuecomment-719472350





