Posted to dev@parquet.apache.org by "fatemah (Jira)" <ji...@apache.org> on 2022/10/31 16:41:00 UTC
[jira] [Created] (PARQUET-2210) Add FilteredPageReader to filter rows based on page statistics
fatemah created PARQUET-2210:
--------------------------------
Summary: Add FilteredPageReader to filter rows based on page statistics
Key: PARQUET-2210
URL: https://issues.apache.org/jira/browse/PARQUET-2210
Project: Parquet
Issue Type: New Feature
Reporter: fatemah
Currently, we do not use the statistics stored in the page headers to prune the rows that we read. Row group pruning is very coarse-grained, and in many cases it cannot eliminate the row group at all. I propose adding a FilteredPageReader that accepts a filter and, based on page statistics, does not return the pages that cannot match the filter.
The initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
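To make the proposed filters concrete, here is a minimal sketch of how each one could be evaluated against page statistics. It is illustrative only: PageStats and page_may_match are hypothetical names, not part of any Parquet API, and the sketch assumes integer values and that the page header carries min, max, and null count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStats:
    """Hypothetical stand-in for the statistics in a data page header."""
    min_value: Optional[int]  # None when the writer did not record a minimum
    max_value: Optional[int]  # None when the writer did not record a maximum
    null_count: int           # number of null values in the page
    num_values: int           # total values in the page, nulls included


def page_may_match(stats: PageStats, op: str, value: Optional[int] = None) -> bool:
    """Return False only when the statistics prove that no row can match."""
    if op == "IS NULL":
        return stats.null_count > 0
    if op == "IS NOT NULL":
        return stats.null_count < stats.num_values
    if op == "EQUALS":
        if stats.null_count == stats.num_values:
            return False  # all-null page: nothing equals a non-null value
        if stats.min_value is None or stats.max_value is None:
            return True   # min/max missing: cannot prune, so keep the page
        return stats.min_value <= value <= stats.max_value
    raise ValueError(f"unsupported filter: {op}")
```

The check is deliberately conservative: a True result means only that the page might contain matching rows, so the rows that are returned still need to be filtered individually.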
Also, the FilteredPageReader will keep track of which row ranges matched and which did not. We could use this to skip reading the non-matching rows from the rest of the columns. Note that the SkipRecords API was recently added to the Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188).
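The row-range bookkeeping could look roughly like the sketch below. It is not the proposed implementation: matching_row_ranges and skip_read_plan are hypothetical helpers, and the sketch assumes that every page holds a whole number of rows (for example, a flat column). The resulting plan is the kind of input that a skip-style call on the other columns could consume.

```python
from typing import List, Tuple


def matching_row_ranges(page_row_counts: List[int],
                        page_matches: List[bool]) -> List[Tuple[int, int]]:
    """Merge matching pages into half-open [start, end) row ranges."""
    ranges: List[Tuple[int, int]] = []
    start = 0
    for count, matched in zip(page_row_counts, page_matches):
        end = start + count
        if matched:
            if ranges and ranges[-1][1] == start:
                # Adjacent to the previous matching page: extend that range.
                ranges[-1] = (ranges[-1][0], end)
            else:
                ranges.append((start, end))
        start = end
    return ranges


def skip_read_plan(ranges: List[Tuple[int, int]],
                   total_rows: int) -> List[Tuple[str, int]]:
    """Turn matching ranges into ("skip", n) / ("read", n) steps."""
    plan: List[Tuple[str, int]] = []
    pos = 0
    for start, end in ranges:
        if start > pos:
            plan.append(("skip", start - pos))
        plan.append(("read", end - start))
        pos = end
    if pos < total_rows:
        plan.append(("skip", total_rows - pos))
    return plan
```

For example, with four pages of 100, 50, 50, and 200 rows where only the two middle pages match, the ranges collapse to a single range (100, 200), and the plan for the other columns is: skip 100 rows, read 100 rows, skip 200 rows.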
--
This message was sent by Atlassian Jira
(v8.20.10#820010)