
Posted to dev@parquet.apache.org by "fatemah (Jira)" <ji...@apache.org> on 2022/10/31 16:41:00 UTC

[jira] [Created] (PARQUET-2210) Add FilteredPageReader to filter rows based on page statistics

fatemah created PARQUET-2210:
--------------------------------

             Summary: Add FilteredPageReader to filter rows based on page statistics
                 Key: PARQUET-2210
                 URL: https://issues.apache.org/jira/browse/PARQUET-2210
             Project: Parquet
          Issue Type: New Feature
            Reporter: fatemah


Currently, we do not use the statistics that are stored in the page headers to prune the rows that we read. Row group pruning is very coarse-grained and in many cases does not eliminate the row group. I propose adding a FilteredPageReader that accepts a filter and, based on page statistics, does not return pages that cannot match the filter.

The initial set of filters can be: EQUALS, IS NULL, IS NOT NULL.
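
To make the pruning rule concrete, here is a minimal sketch in Python of how each of the three filters could be evaluated against page statistics. It is not the proposed implementation: PageStats, Op and page_may_match are hypothetical names used only for illustration, and the fields are assumed to mirror the null count and min/max values carried in a data page header. The check is conservative, so a page is dropped only when the statistics prove that no row in it can match.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class Op(Enum):
    EQUALS = "EQUALS"
    IS_NULL = "IS NULL"
    IS_NOT_NULL = "IS NOT NULL"


@dataclass
class PageStats:
    """Hypothetical stand-in for the statistics found in a data page header."""
    num_values: int                    # values in the page, nulls included
    null_count: Optional[int] = None   # None: the writer did not record it
    min_value: Any = None              # None: the writer did not record it
    max_value: Any = None


def page_may_match(stats: Optional[PageStats], op: Op, value: Any = None) -> bool:
    """Return False only when the statistics prove that no row can match."""
    if stats is None:
        return True  # no statistics at all: the page cannot be pruned
    if op is Op.IS_NULL:
        return stats.null_count is None or stats.null_count > 0
    if op is Op.IS_NOT_NULL:
        return stats.null_count is None or stats.null_count < stats.num_values
    if op is Op.EQUALS:
        if stats.null_count is not None and stats.null_count == stats.num_values:
            return False  # an all-null page cannot equal a non-null value
        if stats.min_value is None or stats.max_value is None:
            return True   # min/max missing: keep the page
        return stats.min_value <= value <= stats.max_value
    raise ValueError(f"unsupported filter: {op}")
```

For example, a page with min 10 and max 20 would be kept for EQUALS 15 and dropped for EQUALS 25, while a page without statistics would always be kept.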

Also, the FilteredPageReader will keep track of which row ranges matched and which did not. We could use this to skip reading non-matching rows from the rest of the columns. Note that the SkipRecords API was recently added to the Parquet reader (https://issues.apache.org/jira/browse/PARQUET-2188).
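
The row-range bookkeeping could look roughly like the Python sketch below. It assumes a flat (non-repeated) column where the number of rows in each page is known, and it takes the per-page match decisions as input. The names matched_row_ranges and skip_read_plan are hypothetical, and the "skip"/"read" steps only illustrate how the ranges might drive a skip-style API on the other columns; they are not calls into the actual reader.

```python
from typing import List, Sequence, Tuple


def matched_row_ranges(page_row_counts: Sequence[int],
                       page_matches: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return merged half-open [start, end) row ranges covered by matching pages."""
    ranges: List[Tuple[int, int]] = []
    start = 0
    for count, matched in zip(page_row_counts, page_matches):
        end = start + count
        if matched and count > 0:
            if ranges and ranges[-1][1] == start:
                # Adjacent to the previous matching page: extend that range.
                ranges[-1] = (ranges[-1][0], end)
            else:
                ranges.append((start, end))
        start = end
    return ranges


def skip_read_plan(ranges: Sequence[Tuple[int, int]],
                   total_rows: int) -> List[Tuple[str, int]]:
    """Turn matched ranges into alternating ("skip", n) / ("read", n) steps."""
    plan: List[Tuple[str, int]] = []
    pos = 0
    for start, end in ranges:
        if start > pos:
            plan.append(("skip", start - pos))
        plan.append(("read", end - start))
        pos = end
    if pos < total_rows:
        plan.append(("skip", total_rows - pos))
    return plan


# Four pages of the filtered column; only the two middle pages may match.
counts = [100, 50, 50, 200]
matches = [False, True, True, False]
ranges = matched_row_ranges(counts, matches)
plan = skip_read_plan(ranges, sum(counts))
```

Here ranges is [(100, 200)] and plan is [("skip", 100), ("read", 100), ("skip", 200)], so a reader for another column would skip 100 rows, read 100, and skip the remaining 200.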



--
This message was sent by Atlassian Jira
(v8.20.10#820010)