Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/06 23:03:03 UTC

[GitHub] [spark] attilapiros opened a new pull request #31501: [SPARK-34370] Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"

attilapiros opened a new pull request #31501:
URL: https://github.com/apache/spark/pull/31501


   ### What changes were proposed in this pull request?
   
   With https://github.com/apache/spark/pull/31133, Avro schema evolution was introduced for partitioned Hive tables where the schema is given by `avro.schema.literal`.
   This PR extends that functionality to support schema evolution where the schema is defined via `avro.schema.url`.
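   
   As a rough illustration only (not the code in this PR), the idea can be sketched as a property merge in which a table-level Avro schema property takes precedence over the partition-level one. This assumes the fix works by preferring the table-level value; `AvroSchemaPropsSketch` and `effectiveProperties` are hypothetical names, and `scala.jdk.CollectionConverters` assumes Scala 2.13 or later.
   
   ```scala
   import java.util.Properties
   import scala.jdk.CollectionConverters._
   
   // Hypothetical sketch; names are illustrative and not taken from Spark.
   object AvroSchemaPropsSketch {
     // Property names as quoted in the PR description.
     val avroSchemaProperties: Seq[String] =
       Seq("avro.schema.literal", "avro.schema.url")
   
     // Start from the table properties, then apply partition properties on top,
     // skipping any Avro schema property that the table already defines.
     def effectiveProperties(tableProps: Properties, partProps: Properties): Properties = {
       val merged = new Properties()
       tableProps.stringPropertyNames().asScala
         .foreach(k => merged.setProperty(k, tableProps.getProperty(k)))
       partProps.stringPropertyNames().asScala
         .filterNot(k => avroSchemaProperties.contains(k) && tableProps.containsKey(k))
         .foreach(k => merged.setProperty(k, partProps.getProperty(k)))
       merged
     }
   }
   ```
   
   With this merge, a partition that still carries an old `avro.schema.url` would be read with the table's current schema URL, while its other partition-level properties are kept.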
   
   ### Why are the changes needed?
   
   Without this PR, the problem described in https://github.com/apache/spark/pull/31133 can be reproduced with tables where `avro.schema.url` is used, because in this case the `avro.schema.url` value given at the partition level is always used.
   
   For example, when a new column (with a default value) is added to the table, one of the following problems occurs:
   -  when the new field is added after the last one, the cell values will be null instead of the default value
   -  when the schema is extended somewhere before the last field, values will be listed at the wrong column positions
   
   A similar error occurs when one of the fields is removed from the schema.
   
   For details please check the attached unit tests where both cases are checked.
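   
   The failure modes above can be modelled without Avro at all. The following is a minimal sketch that contrasts pairing stored values with the new schema by position against resolving them by field name with defaults. `Field`, `readByPosition` and `readByName` are hypothetical helpers for illustration; they are neither Avro nor Spark APIs.
   
   ```scala
   // Hypothetical model of the problem; not Avro's or Spark's implementation.
   object SchemaResolutionSketch {
     final case class Field(name: String, default: Option[Any] = None)
   
     // Problematic behaviour: pair the reader (table) fields with the stored
     // values purely by position; missing positions become null.
     def readByPosition(reader: Seq[Field], values: Seq[Any]): Map[String, Any] =
       reader.zipWithIndex.map { case (f, i) => f.name -> values.lift(i).orNull }.toMap
   
     // Desired behaviour: match values to fields by name using the writer
     // schema, and fall back to the reader's default for fields not written.
     def readByName(writer: Seq[Field], reader: Seq[Field], values: Seq[Any]): Map[String, Any] = {
       val written = writer.map(_.name).zip(values).toMap
       reader.map(f => f.name -> written.getOrElse(f.name, f.default.orNull)).toMap
     }
   }
   ```
   
   For a row written as `(id, name)` and a table schema of `(id, age, name)` where `age` has a default, `readByPosition` places the `name` value under `age` and leaves `name` null, whereas `readByName` keeps `name` in place and fills `age` with its default.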
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. It fixes the potentially wrong cell values described above.
   
   ### How was this patch tested?
   
   The existing unit tests for schema evolution are generalized and reused.
   New tests:
   - SPARK-34370: support Avro schema evolution (add column with avro.schema.url)
   - SPARK-34370: support Avro schema evolution (remove column with avro.schema.url)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31501:
URL: https://github.com/apache/spark/pull/31501#discussion_r571502874



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
##########
@@ -240,6 +240,8 @@ class HadoopTableReader(
       fillPartitionKeys(partValues, mutableRow)
 
       val tableProperties = tableDesc.getProperties
+      val avroSchemaEvolutionProperties = Seq(AvroTableProperties.SCHEMA_LITERAL,

Review comment:
       `avroSchemaEvolutionProperties` -> `avroSchemaProperties`, because the schema is identical in most cases. `SCHEMA_LITERAL` and `SCHEMA_URL` are just about the `schema`; they don't imply evolution. Evolution only happens when users set a different value for the schema.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774570681


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134962/
   




[GitHub] [spark] dongjoon-hyun closed pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #31501:
URL: https://github.com/apache/spark/pull/31501


   




[GitHub] [spark] SparkQA commented on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774565745


   **[Test build #134965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134965/testReport)** for PR 31501 at commit [`e3bd8e2`](https://github.com/apache/spark/commit/e3bd8e26b727f7f75417c4d14c1f0bbf6302ef0d).




[GitHub] [spark] SparkQA commented on pull request #31501: [SPARK-34370] Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774557570


   **[Test build #134962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134962/testReport)** for PR 31501 at commit [`f29439c`](https://github.com/apache/spark/commit/f29439c7d1c04c596b03588a1fa5f846ca55d979).




[GitHub] [spark] SparkQA commented on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774578037


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39546/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774577042


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39548/
   




[GitHub] [spark] SparkQA commented on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774570384


   **[Test build #134962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134962/testReport)** for PR 31501 at commit [`f29439c`](https://github.com/apache/spark/commit/f29439c7d1c04c596b03588a1fa5f846ca55d979).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31501:
URL: https://github.com/apache/spark/pull/31501#discussion_r571503242



##########
File path: sql/hive/src/test/resources/schemaEvolution/schemaWithOneField.avsc
##########
@@ -0,0 +1,12 @@
+{

Review comment:
       Please don't create `resources/schemaEvolution` directory here. Just put these files into `resources` directory.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774583168








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31501:
URL: https://github.com/apache/spark/pull/31501#discussion_r571502874



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
##########
@@ -240,6 +240,8 @@ class HadoopTableReader(
       fillPartitionKeys(partValues, mutableRow)
 
       val tableProperties = tableDesc.getProperties
+      val avroSchemaEvolutionProperties = Seq(AvroTableProperties.SCHEMA_LITERAL,

Review comment:
       `avroSchemaEvolutionProperties` -> `avroSchemaProperties`, because the schema is identical in most cases.






[GitHub] [spark] SparkQA removed a comment on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774557570


   **[Test build #134962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134962/testReport)** for PR 31501 at commit [`f29439c`](https://github.com/apache/spark/commit/f29439c7d1c04c596b03588a1fa5f846ca55d979).




[GitHub] [spark] SparkQA commented on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774578985


   **[Test build #134965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134965/testReport)** for PR 31501 at commit [`e3bd8e2`](https://github.com/apache/spark/commit/e3bd8e26b727f7f75417c4d14c1f0bbf6302ef0d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774565745


   **[Test build #134965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134965/testReport)** for PR 31501 at commit [`e3bd8e2`](https://github.com/apache/spark/commit/e3bd8e26b727f7f75417c4d14c1f0bbf6302ef0d).




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31501:
URL: https://github.com/apache/spark/pull/31501#discussion_r571503242



##########
File path: sql/hive/src/test/resources/schemaEvolution/schemaWithOneField.avsc
##########
@@ -0,0 +1,12 @@
+{

Review comment:
       Please don't create `schemaEvolution` directory here. Just put these files into `resources` directory.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774577042


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39548/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774570681


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134962/
   




[GitHub] [spark] dongjoon-hyun commented on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774572419


   Merged to master for Apache Spark 3.2.0.




[GitHub] [spark] AmplabJenkins commented on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774583169








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31501:
URL: https://github.com/apache/spark/pull/31501#discussion_r571502874



##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
##########
@@ -240,6 +240,8 @@ class HadoopTableReader(
       fillPartitionKeys(partValues, mutableRow)
 
       val tableProperties = tableDesc.getProperties
+      val avroSchemaEvolutionProperties = Seq(AvroTableProperties.SCHEMA_LITERAL,

Review comment:
       `avroSchemaEvolutionProperties` -> `avroSchemaProperties`, because the schema is identical in most cases. `SCHEMA_LITERAL` and `SCHEMA_URL` are just about the `schema`; they don't imply evolution.






[GitHub] [spark] SparkQA commented on pull request #31501: [SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url"

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31501:
URL: https://github.com/apache/spark/pull/31501#issuecomment-774567595


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39546/
   

