You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/20 00:45:04 UTC

[GitHub] [spark] sadikovi opened a new pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

sadikovi opened a new pull request #34044:
URL: https://github.com/apache/spark/pull/34044


   ### What changes were proposed in this pull request?
   
   This PR fixes an issue when reading of a Parquet file written with legacy mode would fail due to incorrect Parquet LIST to ArrayType conversion.
   
   The issue arises when using schema evolution and utilising the parquet-mr reader. 2-level LIST annotated types could be parsed incorrectly as 3-level LIST annotated types because their underlying type does not match the full inferred schema.
   
   ### Why are the changes needed?
   
   It appears to be a long-standing issue with the legacy mode due to the imprecise check in ParquetRowConverter that was trying to determine Parquet backward compatibility using Catalyst schema: `DataType.equalsIgnoreCompatibleNullability(guessedElementType, elementType)` in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L606.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   Added a new test case in ParquetInteroperabilitySuite.scala.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922714789


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47960/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sadikovi commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922829821


   Re-triggered the build


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922574139


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922827686


   @sadikovi mind retriggering https://github.com/sadikovi/spark/runs/3649026062 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r711862180



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -266,7 +266,7 @@ class ParquetToSparkSchemaConverter(
   // Here we implement Parquet LIST backwards-compatibility rules.
   // See: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
   // scalastyle:on
-  private def isElementType(repeatedType: Type, parentName: String): Boolean = {
+  private[parquet] def isElementType(repeatedType: Type, parentName: String): Boolean = {

Review comment:
       Do we need this change? If you don't mind, shall we keep this file unchanged?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922865728


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143451/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924541652


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48005/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924498979


   **[Test build #143494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143494/testReport)** for PR 34044 at commit [`8a21e05`](https://github.com/apache/spark/commit/8a21e051130f1c360ea4432a9c1ffda9cb27865c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924611643


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143494/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922574139


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on a change in pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sadikovi commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r711884836



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala
##########
@@ -96,6 +97,58 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
     }
   }
 
+  test("parquet files with legacy mode and schema evolution") {
+    // This test case writes arrays in Parquet legacy mode and schema evolution and verifies that
+    // the data can be correctly read back.
+
+    Seq(false, true).foreach { legacyMode =>
+      withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> legacyMode.toString) {

Review comment:
       `spark.sql.parquet.writeLegacyFormat` is only applicable for writes, it is irrelevant when reading a Parquet file.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r711863256



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala
##########
@@ -96,6 +97,58 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
     }
   }
 
+  test("parquet files with legacy mode and schema evolution") {
+    // This test case writes arrays in Parquet legacy mode and schema evolution and verifies that
+    // the data can be correctly read back.
+
+    Seq(false, true).foreach { legacyMode =>
+      withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> legacyMode.toString) {

Review comment:
       Thank you for adding this test case, @sadikovi . What happens if we write with `spark.sql.parquet.writeLegacyFormat=true` and try to read `spark.sql.parquet.writeLegacyFormat=false` and schema evolution? Do we have a test coverage for that?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924541652


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48005/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922864085


   **[Test build #143451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143451/testReport)** for PR 34044 at commit [`33fa362`](https://github.com/apache/spark/commit/33fa362553e85e5362c32854d8eb0f2904d48db6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922589479


   ok to test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922593099


   **[Test build #143441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143441/testReport)** for PR 34044 at commit [`d3d47f7`](https://github.com/apache/spark/commit/d3d47f736286af906e66c3faf3dae7995eb28c9c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sadikovi commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922816061


   @cloud-fan Would you be able to help me figure out why CI fails in my PR. I checked the logs but no failed tests were reported and I got the following log:
   ```
   2021-09-20T08:48:08.2253254Z [info] Tests: succeeded 7729, failed 0, canceled 0, ignored 26, pending 0
   2021-09-20T08:48:08.2254305Z [info] All tests passed.
   2021-09-20T08:48:08.2379778Z [error] Error: Total 0, Failed 0, Errors 0, Passed 0
   2021-09-20T08:48:08.2503788Z [error] Error during tests:
   2021-09-20T08:48:08.7685175Z [error] 	Running java with options 
   ...
   sbt.ForkMain 36107 failed with exit code 134
   ```
   
   The newly added test passes: `2021-09-20T08:42:40.7932026Z [info] - SPARK-36803: parquet files with legacy mode and schema evolution (846 milliseconds)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922605842


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47949/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922675983


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143441/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922667447


   **[Test build #143441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143441/testReport)** for PR 34044 at commit [`d3d47f7`](https://github.com/apache/spark/commit/d3d47f736286af906e66c3faf3dae7995eb28c9c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922605842


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47949/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922714789


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/47960/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on a change in pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sadikovi commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r711884939



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##########
@@ -602,8 +602,10 @@ private[parquet] class ParquetRowConverter(
       // matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
       // it's case 2.
       val guessedElementType = schemaConverter.convertField(repeatedType)
+      // We also need to check if the list element follows the backward compatible pattern.

Review comment:
       Yes, we can. I will update, thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on a change in pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sadikovi commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r711885517



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -266,7 +266,7 @@ class ParquetToSparkSchemaConverter(
   // Here we implement Parquet LIST backwards-compatibility rules.
   // See: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
   // scalastyle:on
-  private def isElementType(repeatedType: Type, parentName: String): Boolean = {
+  private[parquet] def isElementType(repeatedType: Type, parentName: String): Boolean = {

Review comment:
       I am open to suggestions around how I can avoid changes in this file. 
   
   I considered alternative approaches but they would require duplicating the code including the option of creating `isLegacyParquetList` method. This approach was the least intrusive, IMHO.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r711857132



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##########
@@ -602,8 +602,10 @@ private[parquet] class ParquetRowConverter(
       // matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
       // it's case 2.
       val guessedElementType = schemaConverter.convertField(repeatedType)
+      // We also need to check if the list element follows the backward compatible pattern.

Review comment:
       shall we also update the long code comment above?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan closed pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

cloud-fan closed pull request #34044:
URL: https://github.com/apache/spark/pull/34044


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r713517224



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##########
@@ -598,12 +598,20 @@ private[parquet] class ParquetRowConverter(
       //
       //      ARRAY<STRUCT<element: STRUCT<element: INT>>>
       //
+      //
       // Here we try to convert field `list` into a Catalyst type to see whether the converted type
-      // matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
-      // it's case 2.
+      // matches the Catalyst array element type.
+      //
+      // If the guessed element type from the above does not match the Catalyst type (for example,
+      // in case of schema evolution), we need to check if the repeated type matches one of the
+      // backward-compatibility rules for legacy LIST types (see the link above).
+      //
+      // If the element type does not match the Catalyst type and the underlying repeated type
+      // does not belong to the legacy LIST type, then it is case 1; otherwise, it is case 2.
       val guessedElementType = schemaConverter.convertField(repeatedType)
+      val isLegacy = schemaConverter.isElementType(repeatedType, parquetSchema.getName())

Review comment:
       Got it. Yea it's just a nit from me and this looks OK to me too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sadikovi commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924788397


   Thank you @cloud-fan @HyukjinKwon @sunchao @dongjoon-hyun for the reviews!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922714756


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47960/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on a change in pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sadikovi commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r713508125



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##########
@@ -598,12 +598,20 @@ private[parquet] class ParquetRowConverter(
       //
       //      ARRAY<STRUCT<element: STRUCT<element: INT>>>
       //
+      //
       // Here we try to convert field `list` into a Catalyst type to see whether the converted type
-      // matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
-      // it's case 2.
+      // matches the Catalyst array element type.
+      //
+      // If the guessed element type from the above does not match the Catalyst type (for example,
+      // in case of schema evolution), we need to check if the repeated type matches one of the
+      // backward-compatibility rules for legacy LIST types (see the link above).
+      //
+      // If the element type does not match the Catalyst type and the underlying repeated type
+      // does not belong to the legacy LIST type, then it is case 1; otherwise, it is case 2.
       val guessedElementType = schemaConverter.convertField(repeatedType)
+      val isLegacy = schemaConverter.isElementType(repeatedType, parquetSchema.getName())

Review comment:
       Yes, that is correct, legacy format would still be read by Spark, it was schema evolution of a list element that could trigger this issue. If all of the files have the same schema, everything should work just fine.
   
   I considered having something like "contains" instead of "equals" but I had a concern that this might introduce issues when the schema "contains" but it should still be treated as a 3-level LIST. Also, I could not find "contains" method for DataType in the codebase. IMHO, it is better to check parquet compatibility issues using parquet schema rather Catalyst schema which was meant to reconcile those types anyway. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924514646


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48005/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922604752


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47949/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922708525


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47960/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924536363


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48005/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922690264


   **[Test build #143451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143451/testReport)** for PR 34044 at commit [`33fa362`](https://github.com/apache/spark/commit/33fa362553e85e5362c32854d8eb0f2904d48db6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on a change in pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sadikovi commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r713509111



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##########
@@ -598,12 +598,20 @@ private[parquet] class ParquetRowConverter(
       //
       //      ARRAY<STRUCT<element: STRUCT<element: INT>>>
       //
+      //
       // Here we try to convert field `list` into a Catalyst type to see whether the converted type
-      // matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
-      // it's case 2.
+      // matches the Catalyst array element type.
+      //
+      // If the guessed element type from the above does not match the Catalyst type (for example,
+      // in case of schema evolution), we need to check if the repeated type matches one of the
+      // backward-compatibility rules for legacy LIST types (see the link above).
+      //
+      // If the element type does not match the Catalyst type and the underlying repeated type
+      // does not belong to the legacy LIST type, then it is case 1; otherwise, it is case 2.
       val guessedElementType = schemaConverter.convertField(repeatedType)
+      val isLegacy = schemaConverter.isElementType(repeatedType, parquetSchema.getName())

Review comment:
       I know, making method non-private is not ideal but neither is adding a new DataType."contains" method or duplicating code for another function. Let me know what you think could be a better (or the least intrusive) approach. Thanks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924611643


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143494/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922690264


   **[Test build #143451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143451/testReport)** for PR 34044 at commit [`33fa362`](https://github.com/apache/spark/commit/33fa362553e85e5362c32854d8eb0f2904d48db6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922618354


   cc @sunchao since this is Parquet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924498979


   **[Test build #143494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143494/testReport)** for PR 34044 at commit [`8a21e05`](https://github.com/apache/spark/commit/8a21e051130f1c360ea4432a9c1ffda9cb27865c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922865728


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143451/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922675983


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143441/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r713258330



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##########
@@ -598,12 +598,20 @@ private[parquet] class ParquetRowConverter(
       //
       //      ARRAY<STRUCT<element: STRUCT<element: INT>>>
       //
+      //
       // Here we try to convert field `list` into a Catalyst type to see whether the converted type
-      // matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
-      // it's case 2.
+      // matches the Catalyst array element type.
+      //
+      // If the guessed element type from the above does not match the Catalyst type (for example,
+      // in case of schema evolution), we need to check if the repeated type matches one of the
+      // backward-compatibility rules for legacy LIST types (see the link above).
+      //
+      // If the element type does not match the Catalyst type and the underlying repeated type
+      // does not belong to the legacy LIST type, then it is case 1; otherwise, it is case 2.
       val guessedElementType = schemaConverter.convertField(repeatedType)
+      val isLegacy = schemaConverter.isElementType(repeatedType, parquetSchema.getName())

Review comment:
       I see, the existing `schemaConverter.convertField(repeatedType)` already covered the legacy format lists but this particular issue is about schema evolution with added new struct fields. I wonder whether it's better to just expand `equalsIgnoreCompatibleNullability` and allow `element` to _contain_ `guessedElementType`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924602232


   **[Test build #143494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143494/testReport)** for PR 34044 at commit [`8a21e05`](https://github.com/apache/spark/commit/8a21e051130f1c360ea4432a9c1ffda9cb27865c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on a change in pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sadikovi commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r711884670



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -266,7 +266,7 @@ class ParquetToSparkSchemaConverter(
   // Here we implement Parquet LIST backwards-compatibility rules.
   // See: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
   // scalastyle:on
-  private def isElementType(repeatedType: Type, parentName: String): Boolean = {
+  private[parquet] def isElementType(repeatedType: Type, parentName: String): Boolean = {

Review comment:
       Yes, we do unless I duplicate the code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r713204930



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -266,7 +266,7 @@ class ParquetToSparkSchemaConverter(
   // Here we implement Parquet LIST backwards-compatibility rules.
   // See: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
   // scalastyle:on
-  private def isElementType(repeatedType: Type, parentName: String): Boolean = {
+  private[parquet] def isElementType(repeatedType: Type, parentName: String): Boolean = {

Review comment:
       Got it. +1 for your decision, @sadikovi .




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922593099


   **[Test build #143441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143441/testReport)** for PR 34044 at commit [`d3d47f7`](https://github.com/apache/spark/commit/d3d47f736286af906e66c3faf3dae7995eb28c9c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r711839848



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala
##########
@@ -96,6 +97,58 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
     }
   }
 
+  test("parquet files with legacy mode and schema evolution") {

Review comment:
       ```suggestion
     test("SPARK-36803: parquet files with legacy mode and schema evolution") {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-922603459


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47949/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on a change in pull request #34044: [SPARK-36803] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sadikovi commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r711885517



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -266,7 +266,7 @@ class ParquetToSparkSchemaConverter(
   // Here we implement Parquet LIST backwards-compatibility rules.
   // See: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
   // scalastyle:on
-  private def isElementType(repeatedType: Type, parentName: String): Boolean = {
+  private[parquet] def isElementType(repeatedType: Type, parentName: String): Boolean = {

Review comment:
       I am open to suggestions around how I can avoid changes in this file.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #34044:
URL: https://github.com/apache/spark/pull/34044#discussion_r713213267



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##########
@@ -598,12 +598,20 @@ private[parquet] class ParquetRowConverter(
       //
       //      ARRAY<STRUCT<element: STRUCT<element: INT>>>
       //
+      //
       // Here we try to convert field `list` into a Catalyst type to see whether the converted type
-      // matches the Catalyst array element type. If it doesn't match, then it's case 1; otherwise,
-      // it's case 2.
+      // matches the Catalyst array element type.
+      //
+      // If the guessed element type from the above does not match the Catalyst type (for example,
+      // in case of schema evolution), we need to check if the repeated type matches one of the
+      // backward-compatibility rules for legacy LIST types (see the link above).
+      //
+      // If the element type does not match the Catalyst type and the underlying repeated type
+      // does not belong to the legacy LIST type, then it is case 1; otherwise, it is case 2.
       val guessedElementType = schemaConverter.convertField(repeatedType)
+      val isLegacy = schemaConverter.isElementType(repeatedType, parquetSchema.getName())

Review comment:
       interesting - does it mean in the parquet-mr read path Spark were not able to handle legacy list format? also do we need to do something similar to legacy map format?
   
   BTW: you can remove `()` in `parquetSchema.getName()` since this is an accessor method.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #34044: [SPARK-36803][SQL] Fix ArrayType conversion when reading Parquet files written in legacy mode

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #34044:
URL: https://github.com/apache/spark/pull/34044#issuecomment-924761408


   thanks, merging to master/3.2/3.1/3.0!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org