You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/01 18:38:33 UTC

[GitHub] [spark] jackierwzhang opened a new pull request #35700: [SPARK-38094] Quick followup on enabling id matching for Parquet reader

jackierwzhang opened a new pull request #35700:
URL: https://github.com/apache/spark/pull/35700


   ### What changes were proposed in this pull request?
   Minor follow ups on https://github.com/apache/spark/pull/35385:
   1. Add a nested schema test
   2. Fixed an error message.
   
   ### Why are the changes needed?
   Better observability. 
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Existing test
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35700: [SPARK-38094][SQL][FOLLOWUP] Fix exception message and add a test case

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #35700:
URL: https://github.com/apache/spark/pull/35700#discussion_r817046385



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
##########
@@ -140,7 +140,7 @@ object ParquetReadSupport extends Logging {
         "Spark read schema expects field Ids, " +
           "but Parquet file schema doesn't contain any field Ids.\n" +
         "Please remove the field ids from Spark schema or ignore missing ids by " +
-          "setting `spark.sql.parquet.fieldId.ignoreMissing = true`\n" +
+          "setting `spark.sql.parquet.fieldId.read.ignoreMissing = true`\n" +

Review comment:
       To prevent this kind of issue, we prefer `SQLConf.IGNORE_MISSING_PARQUET_FIELD_ID.key` style. Shall we revise?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #35700: [SPARK-38094][SQL][FOLLOWUP] Fix exception message and add a test case

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #35700:
URL: https://github.com/apache/spark/pull/35700#issuecomment-1056059437


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun closed pull request #35700: [SPARK-38094][SQL][FOLLOWUP] Fix exception message and add a test case

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun closed pull request #35700:
URL: https://github.com/apache/spark/pull/35700


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on pull request #35700: [SPARK-38094][SQL][FOLLOWUP] Fix exception message and add a test case

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #35700:
URL: https://github.com/apache/spark/pull/35700#issuecomment-1056134846


   Merged to master. Thank you, @jackierwzhang .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35700: [SPARK-38094][SQL][FOLLOWUP] Fix exception message and add a test case

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #35700:
URL: https://github.com/apache/spark/pull/35700#discussion_r817314506



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
##########
@@ -139,8 +139,8 @@ object ParquetReadSupport extends Logging {
       throw new RuntimeException(
         "Spark read schema expects field Ids, " +
           "but Parquet file schema doesn't contain any field Ids.\n" +
-        "Please remove the field ids from Spark schema or ignore missing ids by " +
-          "setting `spark.sql.parquet.fieldId.ignoreMissing = true`\n" +
+        s"Please remove the field ids from Spark schema or ignore missing ids by " +

Review comment:
       `s"` -> `"`? We don't need `s` for plain strings.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35700: [SPARK-38094][SQL][FOLLOWUP] Fix exception message and add a test case

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #35700:
URL: https://github.com/apache/spark/pull/35700#discussion_r817047075



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFieldIdIOSuite.scala
##########
@@ -107,6 +107,35 @@ class ParquetFieldIdIOSuite extends QueryTest with ParquetTest with SharedSparkS
     }
   }
 
+  test("absence of field ids: reading nested schema") {

Review comment:
       Although it's not a mandatory in this case, if you don't mind, please add `SPARK-38094: `.
   ```scala
   - test("absence of field ids: reading nested schema") {
   + test("SPARK-38094: absence of field ids: reading nested schema") {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35700: [SPARK-38094][SQL][FOLLOWUP] Fix exception message and add a test case

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #35700:
URL: https://github.com/apache/spark/pull/35700#discussion_r817047531



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFieldIdIOSuite.scala
##########
@@ -107,6 +107,35 @@ class ParquetFieldIdIOSuite extends QueryTest with ParquetTest with SharedSparkS
     }
   }
 
+  test("absence of field ids: reading nested schema") {
+    withTempDir { dir =>
+      // now with nested schema/complex type
+      val readSchema =
+        new StructType()
+          .add("a", IntegerType, true, withId(1))
+          .add("b", ArrayType(StringType), true, withId(2))
+          .add("c", new StructType().add("xx", IntegerType, true, withId(6)), true, withId(3))

Review comment:
       Please use more meaningful name instead of `xx`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on a change in pull request #35700: [SPARK-38094][SQL][FOLLOWUP] Fix exception message and add a test case

Posted by GitBox <gi...@apache.org>.

sadikovi commented on a change in pull request #35700:
URL: https://github.com/apache/spark/pull/35700#discussion_r817159109



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
##########
@@ -139,8 +139,8 @@ object ParquetReadSupport extends Logging {
       throw new RuntimeException(
         "Spark read schema expects field Ids, " +
           "but Parquet file schema doesn't contain any field Ids.\n" +
-        "Please remove the field ids from Spark schema or ignore missing ids by " +
-          "setting `spark.sql.parquet.fieldId.ignoreMissing = true`\n" +
+        s"Please remove the field ids from Spark schema or ignore missing ids by " +
+          s"setting ${SQLConf.IGNORE_MISSING_PARQUET_FIELD_ID.key}` = true`\n" +

Review comment:
       Do we need \` character at the end of `${SQLConf.IGNORE_MISSING_PARQUET_FIELD_ID.key}`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org