Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/05 05:05:39 UTC

[GitHub] [spark] sadikovi opened a new pull request, #37419: [SPARK-39833][SQL] Fix a Parquet incorrect count issue when requiredSchema is empty and column index is enabled

sadikovi opened a new pull request, #37419:
URL: https://github.com/apache/spark/pull/37419

   
   ### What changes were proposed in this pull request?
   
   This PR patches an issue in the Parquet data source where counting records in a table returns an incorrect result when the file schema columns overlap with partition columns and a filter references those columns.
   
   ```
   root/
     col0=0/
       part-0001.parquet (schema: COL0)
   ```
   
   When the projection overlaps with the partition columns, the output schema (`requiredSchema`) becomes empty. In Parquet, when a predicate is provided and the column index is enabled, we try to filter row ranges to figure out what the count should be. Unfortunately, if the projection is empty, any check on the columns fails and 0 rows are returned (`RowRanges.EMPTY`), even though there is data matching the filter.
   
   This case is rare: it only occurs when a count on a DataFrame results in an empty projection (most cases would include the filtered column, which works correctly), but it is still worth fixing.
   
   This is a quick fix; the actual fix needs to go into Parquet-MR: https://issues.apache.org/jira/browse/PARQUET-2170. Once that is fixed and the dependency is updated in Spark, we can remove this change.
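   For reference, the failure mode can be sketched as follows. This is an illustrative reproduction, not code from the PR: the path and column names are hypothetical, and it assumes a live `SparkSession` named `spark`:
   
   ```scala
   // Hypothetical repro sketch (path and column names are illustrative).
   // Write a Parquet file under a partition directory whose partition column
   // also exists in the file's data schema.
   val root = "/tmp/spark-39833-repro"
   spark.range(10).selectExpr("0L AS col0")
     .write.mode("overwrite").parquet(s"$root/col0=0")
   
   // Counting with a filter on the overlapping column leaves requiredSchema
   // empty; with the column index enabled, parquet-mr filters row ranges down
   // to RowRanges.EMPTY, so per this PR's description the count came back as 0
   // instead of the true row count before the fix.
   val count = spark.read.parquet(root).filter("col0 = 0").count()
   ```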
   
   ### Why are the changes needed?
   
   Fixes a rare correctness issue when running `count` on a filtered Parquet DataFrame.
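   Until the parquet-mr fix lands, affected users can sidestep the bug by disabling column index filtering through the Hadoop configuration. A sketch, assuming a live `SparkSession` named `spark`; `parquet.filter.columnindex.enabled` is the parquet-mr property behind `ParquetInputFormat.COLUMN_INDEX_FILTERING_ENABLED`:
   
   ```scala
   // Workaround sketch: disable parquet-mr's column index filtering so that
   // counts with pushed filters fall back to the non-indexed read path.
   spark.sparkContext.hadoopConfiguration
     .setBoolean("parquet.filter.columnindex.enabled", false)
   ```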
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   I added a unit test that reproduces this behaviour. The test fails without the fix and passes with the fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1212769273

   Oh, this was originally implemented in PARQUET-1201, which shipped in parquet-mr 1.11.0. Again, this is a rare case, so maybe we don't do anything about it, or merge the fix in master only.
    
   I don't know if there is a better way to fix the problem other than fixing it in parquet-mr.
   
   




[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1220130819

   @sunchao Can you help me to find a workaround in Spark for this? Thanks.




[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1207719726

   @cloud-fan Can you review? This change modifies a test that was added in SPARK-22356 that you authored. Thanks.




[GitHub] [spark] sunchao commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r944890706


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   >  ... I would have to change the code to check filters that are pushed down against the requested schema instead of a full one but then it might introduce a performance issue.
   
   Yes, I think this would be the approach, but why might it introduce performance issues?
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940896438


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   @sadikovi do you know how is it caused? It's unclear from the code changes in this PR.





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1212757388

   Not exactly: the filter actually references columns that exist in the file. Apparently, it is the projection that matters in the code.
   
   Here is what they have in the javadoc:
   ```
      * @param paths
      *          the paths of the columns used in the actual projection; a column not being part of the projection will be
      *          handled as containing {@code null} values only even if the column has values written in the file
   ```
   https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80
   
   I am not very familiar with the implementation, but I think the library should return all rows instead of an empty row range.




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940976880


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   @cloud-fan My initial fix was this:
   ```scala
   // See PARQUET-2170.
   // Disable column index optimisation when required schema is empty so we get the correct
   // row count from parquet-mr.
   if (requiredSchema.isEmpty) {
     hadoopConf.setBoolean(ParquetInputFormat.COLUMN_INDEX_FILTERING_ENABLED, false)
   }
   ```
   
   I can revert to this code if you prefer.





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1220021193

   I will take a look at how to fix it in Spark; I have not had enough time to work on this problem yet.
   Since this is a correctness issue, the fix should be merged into 3.3, 3.4, and potentially earlier releases where the bug occurs. It is also unclear at this point whether the bug is related to partition columns; it could potentially be reproduced with a simple predicate pushdown and projection.




[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1219137464

   I decided to disable the column index altogether until I have a better fix or the parquet-mr bug is fixed. I also moved the tests to ParquetQueryV1, as one of the tests fails in DSv2 due to another bug in projection.
   
   @cloud-fan @sunchao Can you review this PR? 
   I just think that adding a check on the required schema and column filters could be error-prone, especially when nested fields are involved. It seems to me it is easier to disable the column index by default; users can still enable it manually.
   
   I am also open to other suggestions.




[GitHub] [spark] sunchao commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1206745959

   @sadikovi in the example you gave:
   ```
   root/
     col0=0/
       part-0001.parquet (schema: COL0)
   ```
   what's the content of `part-0001.parquet`? I wonder why we need to push down partition filters to Parquet, given that we will not materialize the partition values in the Parquet files. What are the filters pushed to Parquet in this example?
   
   




[GitHub] [spark] cloud-fan commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1212764705

   @sadikovi which Spark version did this bug first appear in?




[GitHub] [spark] cloud-fan commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r941001713


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   I'd prefer the surgical fix. Making it consistent with file source v2 does not justify a breaking change.





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1221635922

   Thank you for merging the PR. I have opened the follow-up ticket https://issues.apache.org/jira/browse/SPARK-40169 to fix this properly. I will sync with @sunchao separately; I am sure we will come up with a proper way to fix it!




[GitHub] [spark] HyukjinKwon closed pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns
URL: https://github.com/apache/spark/pull/37419




[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1219077388

   Apologies, I did not have time to debug this yet. I will do that tomorrow.




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r944790118


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   Yes, the provided test fails. It appears the problem is widespread due to this parquet-mr bug. I would have to change the code to check pushed-down filters against the requested schema instead of the full one, but that might introduce a performance issue.
   
   I think the safest option for now is to disable column indexes altogether. I will take a look at what can be done.
   We can also merge this PR and I can follow up with the fix later.





[GitHub] [spark] sunchao commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r938458487


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##########
@@ -228,6 +228,13 @@ class ParquetFileFormat
       SQLConf.PARQUET_TIMESTAMP_NTZ_ENABLED.key,
       sparkSession.sessionState.conf.parquetTimestampNTZEnabled)
 
+    // See PARQUET-2170.
+    // Disable column index optimisation when required schema is empty so we get the correct
+    // row count from parquet-mr.
+    if (requiredSchema.isEmpty) {

Review Comment:
   Do we need a similar fix for DSv2?





[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r938492838


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##########
@@ -228,6 +228,13 @@ class ParquetFileFormat
       SQLConf.PARQUET_TIMESTAMP_NTZ_ENABLED.key,
       sparkSession.sessionState.conf.parquetTimestampNTZEnabled)
 
+    // See PARQUET-2170.
+    // Disable column index optimisation when required schema is empty so we get the correct
+    // row count from parquet-mr.
+    if (requiredSchema.isEmpty) {

Review Comment:
   No, this is not required for DSv2.
   
   The test works in DSv2 due to another inconsistency - Parquet DSv2 does not consider the full file schema when creating pushdown filters. There is a check in FileScanBuilder to ignore partition columns so in this case, the schema is empty so no filters will be pushed down, returning the correct number of records. It is rather a performance inefficiency in DSv2 as the entire file will be scanned. However, the result will be correct.
   
   I thought about fixing it the same way DSv2 fixed the issue but it is a much bigger change as it would affect not just this case but others as well. I hope my explanation makes sense. Let me know your thoughts.
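   The behaviour described above can be sketched roughly like this (a hypothetical simplification for illustration only - `Filter` and `pushableFilters` are made-up names, not the actual FileScanBuilder code):
   ```scala
   object PushdownSketch {
     // Hypothetical model: a filter is pushed down only if every column it
     // references survives in the read schema after partition columns are
     // removed, mirroring the FileScanBuilder check described above.
     case class Filter(column: String)

     def pushableFilters(
         dataSchema: Seq[String],
         partitionColumns: Seq[String],
         filters: Seq[Filter]): Seq[Filter] = {
       val readDataSchema = dataSchema.filterNot(partitionColumns.contains)
       filters.filter(f => readDataSchema.contains(f.column))
     }

     def main(args: Array[String]): Unit = {
       // Overlapping column "col": the read schema becomes empty, nothing is
       // pushed, so the whole file is scanned and the count comes out correct.
       assert(pushableFilters(Seq("col"), Seq("col"), Seq(Filter("col"))).isEmpty)
       // Without overlap, the filter on "col" is pushed down as usual.
       assert(pushableFilters(Seq("col", "b"), Seq("p"), Seq(Filter("col"))).nonEmpty)
     }
   }
   ```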





[GitHub] [spark] cloud-fan commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940896030


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   this is a behavior change (query schema change) that is hard to accept.





[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r948714062


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   I think we may need to disable column index altogether until we figure out the proper fix. I verified that `hadoopConf.setBoolean(ParquetInputFormat.COLUMN_INDEX_FILTERING_ENABLED, false)` makes both tests work.








[GitHub] [spark] sunchao commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1219724467

   I think it's fine to disable it temporarily, but I'd prefer a fix in Spark itself so that we can backport it to 3.3 without relying on a Parquet release and bumping the version there. I can also take a look at the approach of checking filters against the required schema.
   
   Could you open a JIRA tracking the permanent fix in Spark, and mark it as blocker for 3.4.0 release?




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940976880


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   @cloud-fan My initial fix was this:
   ```scala
   // See PARQUET-2170.
   // Disable column index optimisation when required schema is empty so we get the correct
   // row count from parquet-mr.
   if (requiredSchema.isEmpty) {
     hadoopConf.setBoolean(ParquetInputFormat.COLUMN_INDEX_FILTERING_ENABLED, false)
   }
   ```
   
   I can revert to this code if you prefer.





[GitHub] [spark] sunchao commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r944155079


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   What if we change this test to:
   ```scala
     test("SPARK-39833-2: count() with pushed filters from Parquet files") {
       withTempPath { path =>
         val p = s"${path.getCanonicalPath}${File.separator}col=0${File.separator}"
         Seq((0, "a")).toDF("COL", "b").coalesce(1).write.save(p)
         val df = spark.read.parquet(path.getCanonicalPath)
         checkAnswer(df.filter("col = 0"), Seq(Row(0, "a")))
         assert(df.filter("col = 0").select('b).collect().toSeq == Row("a") :: Nil)
       }
     }
   ```
   
   it seems checking whether `requestedSchema` is empty is not sufficient, since it can be non-empty while a pushed filter references a column that does not exist in it.
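   The stronger check hinted at here could look roughly like this (an assumed sketch, not actual Spark code - `Filter` and `columnIndexSafe` are hypothetical names):
   ```scala
   object ColumnIndexCheck {
     case class Filter(column: String)

     // Hypothetical: keep column index filtering enabled only when every
     // pushed filter references a column present in the requested schema.
     def columnIndexSafe(requestedSchema: Seq[String], pushed: Seq[Filter]): Boolean =
       pushed.forall(f => requestedSchema.contains(f.column))

     def main(args: Array[String]): Unit = {
       // Non-empty requested schema, but the filter references the partition
       // column "col" that is absent from it: still unsafe.
       assert(!columnIndexSafe(Seq("b"), Seq(Filter("col"))))
       // Filter column present in the requested schema: safe.
       assert(columnIndexSafe(Seq("col", "b"), Seq(Filter("col"))))
     }
   }
   ```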





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1211595931

   @sunchao @cloud-fan Would you be able to take another look?
   
   I have kept the original "patch". It is essentially a band aid until the Parquet ticket is fixed. I cannot think of a better and less intrusive way to fix the problem. Let me know if you have any questions about it, I will be happy to clarify. Thanks.




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940976365


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   Partition columns are always appended to the schema. In the case of overlapping columns, we now remove all of the partition columns from the schema and append them afterwards. This does not change the result but changes the column output.
   
   Essentially:
   data schema: `i, p, j`, partition schema: `p`. We will remove `p` and append partition column: `i, j, p`.
   
   Previously we would keep the partition column as part of the data schema and insert partition values into it, which IMHO is a bit confusing. This change also makes DSv1 consistent with DSv2, which already works this way.
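   A minimal sketch of the reordering described above (an illustration only, not the actual schema-merging code):
   ```scala
   object SchemaOrderSketch {
     // Data schema: i, p, j; partition schema: p. Overlapping partition
     // columns are dropped from the data schema and appended at the end,
     // yielding i, j, p.
     val dataSchema = Seq("i", "p", "j")
     val partitionSchema = Seq("p")
     val tableSchema: Seq[String] =
       dataSchema.filterNot(partitionSchema.contains) ++ partitionSchema

     def main(args: Array[String]): Unit = {
       assert(tableSchema == Seq("i", "j", "p"))
     }
   }
   ```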
   
   
   





[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r944905559


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   There will be no predicate pushdown in this case, so reads could potentially be slower.





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1219139453

   I suggest we merge the PR with this fix and I will follow up on a more permanent resolution, maybe fix it in Parquet-mr. I am also thinking that we may need to backport it to 3.3 although this would be up to committers. 




[GitHub] [spark] sunchao commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1221083165

   @sadikovi yes, I can also take a look at this next week. I'm fine either way: what do you think @cloud-fan @HyukjinKwon , should we merge this PR as it is (via disabling column index) first, and work on a fix separately? 




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r943126399


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   Yes, makes perfect sense. Let me update the PR.





[GitHub] [spark] HyukjinKwon commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1221512878

   Merged to master, branch-3.3, and branch-3.2.




[GitHub] [spark] HyukjinKwon commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1221512803

   Yeah, let's just get this in first.




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r948723119


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   Actually, there seems to be another bug in DSv2 which I will fix later: 
   ```
   checkAnswer(df.filter("col = 0"), Seq(Row(0, "a")))
   ```
   fails in DSv2 due to a different column order.
   
   Let's do one problem at a time.





[GitHub] [spark] AmplabJenkins commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1207088877

   Can one of the admins verify this patch?




[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1207493705

   The content is just one column with one value. You would still want to filter out row groups that don't match the predicate, otherwise it could be a performance regression. The filters are evaluated in the example, see https://issues.apache.org/jira/browse/PARQUET-2170.




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled in DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r938492838


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##########
@@ -228,6 +228,13 @@ class ParquetFileFormat
       SQLConf.PARQUET_TIMESTAMP_NTZ_ENABLED.key,
       sparkSession.sessionState.conf.parquetTimestampNTZEnabled)
 
+    // See PARQUET-2170.
+    // Disable column index optimisation when required schema is empty so we get the correct
+    // row count from parquet-mr.
+    if (requiredSchema.isEmpty) {

Review Comment:
   No, this is not required for DSv2.
   
   The test works in DSv2 due to another inconsistency - Parquet DSv2 filters out the column in the `readDataSchema()` method because the partition column and the data column match in case-insensitive mode. The final schema becomes empty, resulting in an empty list of filters and thus the correct number of records. It is rather a performance inefficiency in DSv2 as the entire file will be scanned. However, the result will be correct.





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1207510962

   To be honest, I am still thinking about the best way to mitigate the problem at the moment.

