Posted to dev@parquet.apache.org by "Jim Apple (JIRA)" <ji...@apache.org> on 2018/06/15 14:29:00 UTC

[jira] [Comment Edited] (PARQUET-41) Add bloom filters to parquet statistics

    [ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513883#comment-16513883 ] 

Jim Apple edited comment on PARQUET-41 at 6/15/18 2:28 PM:
-----------------------------------------------------------

I took a look at that benchmark, and I now believe that when the number of distinct values is the same as the number of values, or close to it, a bloom filter can provide some performance benefit.
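
As a rough illustration of the high-cardinality case, here is a minimal Python sketch in which every value is distinct and the row groups have overlapping min/max ranges, so only a per-row-group bloom filter can rule a row group out. This is not the parquet-mr implementation: the BloomFilter class, the row_groups_to_read helper, the SHA-256 double-hashing scheme, and the sizing (10 bits per value, 7 hash functions) are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter; hashing and sizing are illustrative assumptions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive two 64-bit hashes from one digest and combine them
        # (double hashing) to get num_hashes bit positions.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def row_groups_to_read(filters, needle):
    """Return indexes of row groups whose filter cannot exclude needle."""
    return [i for i, f in enumerate(filters) if f.might_contain(needle)]


# Four hypothetical row groups of 1000 distinct values each. The values are
# interleaved, so every row group spans almost the full 0..3999 range and
# min/max statistics cannot exclude any of them.
row_groups = [list(range(i, 4000, 4)) for i in range(4)]

filters = []
for values in row_groups:
    bloom = BloomFilter(num_bits=10 * len(values), num_hashes=7)
    for v in values:
        bloom.add(v)
    filters.append(bloom)

# 2001 lives only in row group 1; other row groups are read only on a
# false positive.
candidates = row_groups_to_read(filters, 2001)
print("row groups to read for 2001:", candidates)
```

With this sizing, a false positive is expected for roughly one percent of lookups per row group, so most point lookups would touch only the row group that actually holds the value. When a column has few distinct values, a dictionary can already answer the same question, which is why the benefit shows up mainly when nearly every value is distinct.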

Junjie also shared with me the following resources:

More benchmark results: https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit#gid=0

Fork with BF enabled: https://github.com/cjjnjust/parquet-mr/tree/parquet-41-base-1.8.x

Data generator: https://github.com/cjjnjust/SQLDataGen




> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Junjie Chen
>            Priority: Major
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215


