Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/05 05:05:39 UTC

[GitHub] [spark] sadikovi opened a new pull request, #37419: [SPARK-39833][SQL] Fix a Parquet incorrect count issue when requiredSchema is empty and column index is enabled

sadikovi opened a new pull request, #37419:
URL: https://github.com/apache/spark/pull/37419

   
   ### What changes were proposed in this pull request?
   
   This PR patches an issue in the Parquet data source where counting records in a table returns an incorrect result when the file schema columns overlap with partition columns and a filter references those columns.
   
   ```
   root/
     col0=0/
       part-0001.parquet (schema: COL0)
   ```
   
   When the projection overlaps with the partition columns, the output schema (`requiredSchema`) becomes empty. In Parquet, when a predicate is provided and the column index is enabled, we try to filter row ranges to figure out what the count should be. Unfortunately, if the projection is empty, any check on the columns fails and 0 rows are returned (`RowRanges.EMPTY`), even though there is data matching the filter.
   
   This case is rare: it only occurs when a count on a DataFrame results in an empty projection (most cases would include the filtered column, which works correctly), but it is still worth fixing.
   
   This is a quick fix; the actual fix needs to go into Parquet-MR: https://issues.apache.org/jira/browse/PARQUET-2170. Once that is fixed and the dependency is updated in Spark, we can remove this change.
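   For reference, the failure mode can be sketched as follows. This is an illustrative reproduction, not code from the PR: the path and column names are hypothetical, and it assumes a live `SparkSession` named `spark`:
   
   ```scala
   // Hypothetical repro sketch (path and column names are illustrative).
   // Write a Parquet file under a partition directory whose partition column
   // also exists in the file's data schema.
   val root = "/tmp/spark-39833-repro"
   spark.range(10).selectExpr("0L AS col0")
     .write.mode("overwrite").parquet(s"$root/col0=0")
   
   // Counting with a filter on the overlapping column leaves requiredSchema
   // empty; with the column index enabled, parquet-mr filters row ranges down
   // to RowRanges.EMPTY, so per this PR's description the count came back as 0
   // instead of the true row count before the fix.
   val count = spark.read.parquet(root).filter("col0 = 0").count()
   ```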
   
   ### Why are the changes needed?
   
   Fixes a rare correctness issue when running `count` on a filtered Parquet DataFrame.
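   Until the parquet-mr fix lands, affected users can sidestep the bug by disabling column index filtering through the Hadoop configuration. A sketch, assuming a live `SparkSession` named `spark`; `parquet.filter.columnindex.enabled` is the parquet-mr property behind `ParquetInputFormat.COLUMN_INDEX_FILTERING_ENABLED`:
   
   ```scala
   // Workaround sketch: disable parquet-mr's column index filtering so that
   // counts with pushed filters fall back to the non-indexed read path.
   spark.sparkContext.hadoopConfiguration
     .setBoolean("parquet.filter.columnindex.enabled", false)
   ```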
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   I added a unit test that reproduces this behaviour. The test fails without the fix and passes with the fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1212769273

   Oh, this was originally implemented in PARQUET-1201, which shipped in parquet-mr 1.11.0. Again, this is a rare case, so maybe we don't do anything about it, or merge the fix in master only.
    
   I don't know if there is a better way to fix the problem other than fixing it in parquet-mr.
   
   




[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1220130819

   @sunchao Can you help me to find a workaround in Spark for this? Thanks.




[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1207719726

   @cloud-fan Can you review? This change modifies a test that was added in SPARK-22356 that you authored. Thanks.




[GitHub] [spark] sunchao commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r944890706


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   >  ... I would have to change the code to check filters that are pushed down against the requested schema instead of a full one but then it might introduce a performance issue.
   
   Yes, I think this would be the approach, but why might it introduce performance issues?
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940896438


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   @sadikovi do you know how is it caused? It's unclear from the code changes in this PR.





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1212757388

   Not exactly: the filter actually references columns that exist in the file. Apparently, it is the projection that matters in the code.
   
   Here is what they have in the javadoc:
   ```
      * @param paths
      *          the paths of the columns used in the actual projection; a column not being part of the projection will be
      *          handled as containing {@code null} values only even if the column has values written in the file
   ```
   https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80
   
   I am not very familiar with the implementation, but I think the library should return all rows instead of an empty row range.




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940976880


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   @cloud-fan My initial fix was this:
   ```scala
   // See PARQUET-2170.
   // Disable column index optimisation when required schema is empty so we get the correct
   // row count from parquet-mr.
   if (requiredSchema.isEmpty) {
     hadoopConf.setBoolean(ParquetInputFormat.COLUMN_INDEX_FILTERING_ENABLED, false)
   }
   ```
   
   I can revert to this code if you prefer.





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1220021193

   I will take a look at how to fix it in Spark; I have not had enough time to work on this problem yet.
   Since this is a correctness issue, the fix should be merged into 3.3, 3.4, and potentially earlier releases where the bug occurs. It is also unclear at this point whether the bug is related to partition columns; it could potentially be reproduced with a simple predicate pushdown and projection.




[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1219137464

   I decided to disable the column index altogether until I have a better fix or the parquet-mr bug is fixed. I also moved the tests to ParquetQueryV1, as one of the tests fails in DSv2 due to another bug in projection.
   
   @cloud-fan @sunchao Can you review this PR? 
   I just think that adding a check on the required schema and column filters could be error-prone, especially when nested fields are involved. It seems to me it is easier to disable the column index by default; users can still enable it manually.
   
   I am also open to other suggestions.




[GitHub] [spark] sunchao commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1206745959

   @sadikovi in the example you gave:
   ```
   root/
     col0=0/
       part-0001.parquet (schema: COL0)
   ```
   what's the content of `part-0001.parquet`? I wonder why we need to push down partition filters to Parquet, given that we will not materialize the partition values in the Parquet files. What are the filters pushed to Parquet in this example?
   
   




[GitHub] [spark] cloud-fan commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1212764705

   @sadikovi which Spark version did this bug first appear in?




[GitHub] [spark] cloud-fan commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r941001713


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   I'd prefer the surgical fix. Making it consistent with file source v2 does not justify a breaking change.





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1221635922

   Thank you for merging the PR. I have opened the follow-up ticket https://issues.apache.org/jira/browse/SPARK-40169 to fix this properly. I will sync with @sunchao separately; I am sure we will come up with a proper way to fix it!




[GitHub] [spark] HyukjinKwon closed pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns
URL: https://github.com/apache/spark/pull/37419




[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1219077388

   Apologies, I did not have time to debug this yet. I will do that tomorrow.




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r944790118


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   Yes, the provided test fails. It appears the problem is widespread due to this parquet-mr bug. I would have to change the code to check pushed-down filters against the requested schema instead of the full one, but that might introduce a performance issue.
   
   I think the safest option for now is to disable column indexes altogether. I will take a look at what can be done.
   We can also merge this PR and I can follow up with the fix later.





[GitHub] [spark] sunchao commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r938458487


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##########
@@ -228,6 +228,13 @@ class ParquetFileFormat
       SQLConf.PARQUET_TIMESTAMP_NTZ_ENABLED.key,
       sparkSession.sessionState.conf.parquetTimestampNTZEnabled)
 
+    // See PARQUET-2170.
+    // Disable column index optimisation when required schema is empty so we get the correct
+    // row count from parquet-mr.
+    if (requiredSchema.isEmpty) {

Review Comment:
   Do we need a similar fix for DSv2?





[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r938492838


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##########
@@ -228,6 +228,13 @@ class ParquetFileFormat
       SQLConf.PARQUET_TIMESTAMP_NTZ_ENABLED.key,
       sparkSession.sessionState.conf.parquetTimestampNTZEnabled)
 
+    // See PARQUET-2170.
+    // Disable column index optimisation when required schema is empty so we get the correct
+    // row count from parquet-mr.
+    if (requiredSchema.isEmpty) {

Review Comment:
   No, this is not required for DSv2.
   
   The test works in DSv2 due to another inconsistency - Parquet DSv2 does not consider the full file schema when creating pushdown filters. There is a check in FileScanBuilder to ignore partition columns so in this case, the schema is empty so no filters will be pushed down, returning the correct number of records. It is rather a performance inefficiency in DSv2 as the entire file will be scanned. However, the result will be correct.
   
   I thought about fixing it the same way DSv2 fixed the issue but it is a much bigger change as it would affect not just this case but others as well. I hope my explanation makes sense. Let me know your thoughts.
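   The behaviour described above can be sketched roughly like this (a hypothetical simplification for illustration only - `Filter` and `pushableFilters` are made-up names, not the actual FileScanBuilder code):
   ```scala
   object PushdownSketch {
     // Hypothetical model: a filter is pushed down only if every column it
     // references survives in the read schema after partition columns are
     // removed, mirroring the FileScanBuilder check described above.
     case class Filter(column: String)

     def pushableFilters(
         dataSchema: Seq[String],
         partitionColumns: Seq[String],
         filters: Seq[Filter]): Seq[Filter] = {
       val readDataSchema = dataSchema.filterNot(partitionColumns.contains)
       filters.filter(f => readDataSchema.contains(f.column))
     }

     def main(args: Array[String]): Unit = {
       // Overlapping column "col": the read schema becomes empty, nothing is
       // pushed, so the whole file is scanned and the count comes out correct.
       assert(pushableFilters(Seq("col"), Seq("col"), Seq(Filter("col"))).isEmpty)
       // Without overlap, the filter on "col" is pushed down as usual.
       assert(pushableFilters(Seq("col", "b"), Seq("p"), Seq(Filter("col"))).nonEmpty)
     }
   }
   ```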





[GitHub] [spark] cloud-fan commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940896030


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   this is a behavior change (query schema change) that is hard to accept.





[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r948714062


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   I think we may need to disable column index altogether until we figure out the proper fix. I verified that `hadoopConf.setBoolean(ParquetInputFormat.COLUMN_INDEX_FILTERING_ENABLED, false)` makes both tests work.








[GitHub] [spark] sunchao commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1219724467

   I think it's fine to disable it temporarily, but I'd prefer a fix in Spark itself so that we can backport it to 3.3 without relying on a Parquet release and bumping the version there. I can also take a look at the approach of checking filters against the required schema.
   
   Could you open a JIRA tracking the permanent fix in Spark, and mark it as blocker for 3.4.0 release?




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940976880


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   @cloud-fan My initial fix was this:
   ```scala
   // See PARQUET-2170.
   // Disable column index optimisation when required schema is empty so we get the correct
   // row count from parquet-mr.
   if (requiredSchema.isEmpty) {
     hadoopConf.setBoolean(ParquetInputFormat.COLUMN_INDEX_FILTERING_ENABLED, false)
   }
   ```
   
   I can revert to this code if you prefer.





[GitHub] [spark] sunchao commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r944155079


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   What if we change this test to:
   ```scala
     test("SPARK-39833-2: count() with pushed filters from Parquet files") {
       withTempPath { path =>
         val p = s"${path.getCanonicalPath}${File.separator}col=0${File.separator}"
         Seq((0, "a")).toDF("COL", "b").coalesce(1).write.save(p)
         val df = spark.read.parquet(path.getCanonicalPath)
         checkAnswer(df.filter("col = 0"), Seq(Row(0, "a")))
         assert(df.filter("col = 0").select('b).collect().toSeq == Row("a") :: Nil)
       }
     }
   ```
   
   it seems checking whether `requestedSchema` is empty is not sufficient, since it can be non-empty while a pushed filter references a column that does not exist in it.
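   The stronger check hinted at here could look roughly like this (an assumed sketch, not actual Spark code - `Filter` and `columnIndexSafe` are hypothetical names):
   ```scala
   object ColumnIndexCheck {
     case class Filter(column: String)

     // Hypothetical: keep column index filtering enabled only when every
     // pushed filter references a column present in the requested schema.
     def columnIndexSafe(requestedSchema: Seq[String], pushed: Seq[Filter]): Boolean =
       pushed.forall(f => requestedSchema.contains(f.column))

     def main(args: Array[String]): Unit = {
       // Non-empty requested schema, but the filter references the partition
       // column "col" that is absent from it: still unsafe.
       assert(!columnIndexSafe(Seq("b"), Seq(Filter("col"))))
       // Filter column present in the requested schema: safe.
       assert(columnIndexSafe(Seq("col", "b"), Seq(Filter("col"))))
     }
   }
   ```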





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1211595931

   @sunchao @cloud-fan Would you be able to take another look?
   
   I have kept the original "patch". It is essentially a band aid until the Parquet ticket is fixed. I cannot think of a better and less intrusive way to fix the problem. Let me know if you have any questions about it, I will be happy to clarify. Thanks.




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940976365


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   Partition columns are always appended to the schema. In the case of overlapping columns, we now remove all of the partition columns from the schema and append them afterwards. This does not change the result but changes the column output.
   
   Essentially:
   data schema: `i, p, j`, partition schema: `p`. We will remove `p` and append partition column: `i, j, p`.
   
   Previously we would keep the partition column as part of the data schema and insert partition values into it, which IMHO is a bit confusing. This change also makes DSv1 consistent with DSv2, which already works this way.
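   A minimal sketch of the reordering described above (an illustration only, not the actual schema-merging code):
   ```scala
   object SchemaOrderSketch {
     // Data schema: i, p, j; partition schema: p. Overlapping partition
     // columns are dropped from the data schema and appended at the end,
     // yielding i, j, p.
     val dataSchema = Seq("i", "p", "j")
     val partitionSchema = Seq("p")
     val tableSchema: Seq[String] =
       dataSchema.filterNot(partitionSchema.contains) ++ partitionSchema

     def main(args: Array[String]): Unit = {
       assert(tableSchema == Seq("i", "j", "p"))
     }
   }
   ```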
   
   
   





[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r944905559


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   There will be no predicate pushdown in this case, so reads could potentially be slower.





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1219139453

   I suggest we merge the PR with this fix and I will follow up on a more permanent resolution, maybe fix it in Parquet-mr. I am also thinking that we may need to backport it to 3.3 although this would be up to committers. 




[GitHub] [spark] sunchao commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1221083165

   @sadikovi yes, I can also take a look at this next week. I'm fine either way: what do you think @cloud-fan @HyukjinKwon , should we merge this PR as it is (via disabling column index) first, and work on a fix separately? 




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r943126399


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   Yes, makes perfect sense. Let me update the PR.





[GitHub] [spark] HyukjinKwon commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1221512878

   Merged to master, branch-3.3, and branch-3.2.




[GitHub] [spark] HyukjinKwon commented on pull request #37419: [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1221512803

   Yeah, let's just get this in first.




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r948723119


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala:
##########
@@ -1108,6 +1108,17 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
       checkAnswer(sql("select * from tbl"), expected)
     }
   }
+
+  test("SPARK-39833: count() with pushed filters from Parquet files") {

Review Comment:
   Actually, there seems to be another bug in DSv2 which I will fix later: 
   ```
   checkAnswer(df.filter("col = 0"), Seq(Row(0, "a")))
   ```
   fails in DSv2 due to a different column order.
   
   Let's do one problem at a time.





[GitHub] [spark] AmplabJenkins commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1207088877

   Can one of the admins verify this patch?




[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1207493705

   The content is just one column with one value. You would still want to filter out row groups that don't match the predicate, otherwise it could be a performance regression. The filters are evaluated in the example, see https://issues.apache.org/jira/browse/PARQUET-2170.




[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled in DSv1

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r938492838


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:
##########
@@ -228,6 +228,13 @@ class ParquetFileFormat
       SQLConf.PARQUET_TIMESTAMP_NTZ_ENABLED.key,
       sparkSession.sessionState.conf.parquetTimestampNTZEnabled)
 
+    // See PARQUET-2170.
+    // Disable column index optimisation when required schema is empty so we get the correct
+    // row count from parquet-mr.
+    if (requiredSchema.isEmpty) {

Review Comment:
   No, this is not required for DSv2.
   
   The test works in DSv2 due to another inconsistency - Parquet DSv2 filters out the column in the `readDataSchema()` method because the partition column and the data column match in case-insensitive mode. The final schema becomes empty, resulting in an empty list of filters and thus the correct number of records. It is rather a performance inefficiency in DSv2 as the entire file will be scanned. However, the result will be correct.





[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix Parquet incorrect count issue when requiredSchema is empty and column index is enabled

Posted by GitBox <gi...@apache.org>.
sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1207510962

   To be honest, I am still thinking about the best way to mitigate the problem at the moment.

