Posted to dev@parquet.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/02/04 09:22:00 UTC
[jira] [Commented] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly
[ https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684133#comment-17684133 ]
ASF GitHub Bot commented on PARQUET-2237:
-----------------------------------------
yabola opened a new pull request, #1023:
URL: https://github.com/apache/parquet-mr/pull/1023
A Bloom filter has to be loaded from the filesystem, which may cost time and space. If other filters can determine exactly whether the value exists, we can skip the Bloom filter and improve performance.
When the min and max values in StatisticsFilter are the same, we can determine exactly whether the value exists.
When we have page dictionaries, we can also determine exactly whether the value exists.
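
The idea above can be sketched roughly as follows. This is only an illustration for an equality predicate on a long column, not the code in the pull request and not the parquet-mr API: the class ExactMatchSketch, the Verdict enum, and every method name are hypothetical. It also assumes the statistics are trustworthy, the column chunk holds at least one non-null value, and the dictionary (when present) covers every page of the chunk.

```java
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; none of these names come from parquet-mr.
final class ExactMatchSketch {

    // Three-valued answer: two exact outcomes plus "cannot tell".
    enum Verdict { DEFINITELY_PRESENT, DEFINITELY_ABSENT, MIGHT_MATCH }

    // Statistics check for "column == value".
    // Assumes min/max are valid and at least one non-null value exists.
    static Verdict fromStatistics(long min, long max, long value) {
        if (value < min || value > max) {
            return Verdict.DEFINITELY_ABSENT;   // outside the range
        }
        if (min == max) {
            return Verdict.DEFINITELY_PRESENT;  // value == min == max
        }
        return Verdict.MIGHT_MATCH;             // inside a wider range
    }

    // Dictionary check. Assumes the dictionary lists every distinct
    // value in the chunk (all pages are dictionary-encoded).
    static Verdict fromDictionary(Set<Long> dictionary, long value) {
        return dictionary.contains(value)
                ? Verdict.DEFINITELY_PRESENT
                : Verdict.DEFINITELY_ABSENT;
    }

    // Returns true when the row group can be skipped. The Bloom filter
    // is passed as a supplier so it is only loaded when neither the
    // statistics nor the dictionary gave an exact answer.
    static boolean canDrop(long min, long max, Set<Long> dictionaryOrNull,
                           long value, BooleanSupplier bloomMightContain) {
        Verdict verdict = fromStatistics(min, max, value);
        if (verdict == Verdict.MIGHT_MATCH && dictionaryOrNull != null) {
            verdict = fromDictionary(dictionaryOrNull, value);
        }
        if (verdict == Verdict.DEFINITELY_ABSENT) {
            return true;
        }
        if (verdict == Verdict.DEFINITELY_PRESENT) {
            return false;
        }
        return !bloomMightContain.getAsBoolean();
    }
}
```

Passing the Bloom filter as a BooleanSupplier is a design choice of this sketch: the expensive load sits behind the supplier, so an exact answer from the cheaper filters means it is never triggered.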
> Improve performance when filters in RowGroupFilter can match exactly
> --------------------------------------------------------------------
>
> Key: PARQUET-2237
> URL: https://issues.apache.org/jira/browse/PARQUET-2237
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Priority: Major
>
> A Bloom filter has to be loaded from the filesystem, which may cost time and space. If other filters can determine exactly whether the value exists, we can skip the Bloom filter and improve performance.
>
> When the min and max values in StatisticsFilter are the same, we can determine exactly whether the value exists.
> When we have page dictionaries, we can also determine exactly whether the value exists.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)