You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "zhongyujiang (via GitHub)" <gi...@apache.org> on 2023/02/25 15:40:21 UTC
[GitHub] [iceberg] zhongyujiang commented on pull request #6935: Parquet: Add page filter using page indexes

zhongyujiang commented on PR #6935:
URL: https://github.com/apache/iceberg/pull/6935#issuecomment-1445144640

   Thanks for your sharing!
   >  The filter is applied at the row group level, but it looks like that's not how RowRanges works in Parquet, so this probably needs to be refactored to run the filter, produce RowRanges, and set them on the ParquetFileReader just once.
   
   I think `RowRanges` is also at the row [group level](https://github.com/apache/parquet-mr/blob/c9cfe821448a2f99797fda7f46c70a16cc1250a9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java#L33), Parquet-mr will [generate](https://github.com/apache/parquet-mr/blob/c9cfe821448a2f99797fda7f46c70a16cc1250a9/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1142）) a `RowRanges` for each row-group when running the column index filter. So I think the `allRows` is correct because the row index stored in the page index is also its position in the row-group.
   
   I have a convern is that  currently we can't read filtered row-group even after we set `RowRanges` in ParquetFileReader.  The `readFilteredRowGroup` method provided by Parquet will [detect](https://github.com/apache/parquet-mr/blob/c9cfe821448a2f99797fda7f46c70a16cc1250a9/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1004) whether there is a filter pushed down,
   and only return the filtered row-group when there is a push-down filter.
   So I think we maybe can add a method in ParquetFileReader that allows users to specify RowRanges by themselves, after all Iceberg will calculate these RowRanges by itself:
   
   ```
   public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, RowRanges rowRanges) throws IOException {
   ```
   @rdblue WDYT? In addition, I think we can also let Parquet expose some methods to make it easier for Iceberg to implement column index filter. We will have to wait for the next version to use, but Parquet-mr may have a release in a month, see this [comment](https://github.com/apache/parquet-mr/pull/982#issuecomment-1376750703) , we should be able to catch up with this release. If you agree, I can open a PR in the Parquet-mr repo. I think we could use these [changes](https://github.com/zhongyujiang/parquet-mr/commit/25ad270f733ea8797c2bb7af24b58cd13d76a748) in Parquet. Here is my [implementation](https://github.com/zhongyujiang/iceberg/commit/b8c13c500d4437cd3f718f29df489f9fe6119e2d#diff-1c6b5e906f6299257c0d58fef74ee99569bfa24a9f3a6a257a39744d07e2a63b) based on these Parquet APIs, if you are willing to review it :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org