Posted to dev@parquet.apache.org by "Ferdinand Xu (JIRA)" <ji...@apache.org> on 2016/02/16 06:56:18 UTC

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

    [ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148130#comment-15148130 ] 

Ferdinand Xu commented on PARQUET-41:
-------------------------------------

Hi [~rdblue],
I have a basic idea for estimating the expected number of entries the bloom filter requires.
AFAIK we can’t get the row count for a row group until all of its data has been flushed to disk, so we can estimate the size in the following way.
We don’t create bloom filter statistics for the first row group at the beginning. Once the first row group has been flushed, we have a general idea of the row count per row group. For the remaining row groups, we use this row count to size the bloom filter bit set.
We can make a small improvement to the strategy above. We know the size of the whole row group, so we can calculate the expected number of entries from the average size of the first 100 or 1000 rows. Because a bloom filter bit set has to be sized before any value is added to it, we need to keep the values from these first rows in a temporary buffer. Once the bit set is created, we add the buffered values to it and then drop the buffer.
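
A minimal sketch of the buffering strategy described above, written in Python for brevity rather than in parquet-mr's Java. Every name here (estimate_row_count, optimal_num_bits, DeferredBloomWriter) is hypothetical, and the SHA-256 double-hashing scheme is only an illustration, not what Parquet uses. The sizing formula m = -n * ln(p) / (ln 2)^2 is the standard one for a bloom filter with n entries and target false-positive probability p.

```python
import hashlib
import math


def estimate_row_count(row_group_bytes, sample_row_sizes):
    """Estimate the rows in a row group from the average size of sampled rows."""
    avg_row_size = sum(sample_row_sizes) / len(sample_row_sizes)
    return max(1, math.ceil(row_group_bytes / avg_row_size))


def optimal_num_bits(expected_entries, fpp):
    """Standard bloom filter sizing: m = -n * ln(p) / (ln 2)^2."""
    return max(1, math.ceil(-expected_entries * math.log(fpp) / (math.log(2) ** 2)))


class DeferredBloomWriter:
    """Buffers the first `sample_size` values, then sizes the bit set and replays them."""

    def __init__(self, row_group_bytes, sample_size=100, fpp=0.01):
        self.row_group_bytes = row_group_bytes  # assumed known up front (target row group size)
        self.sample_size = sample_size
        self.fpp = fpp
        self.pending = []       # values held until the bit set exists
        self.row_sizes = []     # sizes of the sampled rows, in bytes
        self.bits = None
        self.num_bits = 0
        self.num_hashes = 0

    def add(self, value, row_size_bytes):
        if self.bits is not None:
            self._set(value)
            return
        self.pending.append(value)
        self.row_sizes.append(row_size_bytes)
        if len(self.pending) >= self.sample_size:
            self._allocate()

    def finish(self):
        """Call at the end of the row group in case the sample never filled up."""
        if self.bits is None and self.pending:
            self._allocate()

    def might_contain(self, value):
        if self.bits is None:
            return value in self.pending
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(value))

    def _allocate(self):
        expected = estimate_row_count(self.row_group_bytes, self.row_sizes)
        self.num_bits = optimal_num_bits(expected, self.fpp)
        self.num_hashes = max(1, round(self.num_bits / expected * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)
        for value in self.pending:   # flush the buffered values into the bit set
            self._set(value)
        self.pending.clear()         # then drop the temporary buffer
        self.row_sizes.clear()

    def _positions(self, value):
        # Illustrative double hashing derived from one SHA-256 digest.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def _set(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)
```

A writer would call add() for every value in the column chunk and finish() when the row group closes; only the first sample_size values are ever held in memory outside the bit set.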
One thing I want to highlight is that we don’t need to know the *exact* row count; an estimated value is enough.
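
To illustrate why an estimate suffices, the following sketch sizes a filter for an expected entry count and then evaluates the usual approximation p ≈ (1 - e^(-k*n/m))^k for the false-positive probability when the actual count n differs from the estimate. The function names and the sample numbers are assumptions chosen for illustration only.

```python
import math


def optimal_params(expected_entries, target_fpp):
    """Return (num_bits, num_hashes) for the standard bloom filter sizing."""
    num_bits = math.ceil(-expected_entries * math.log(target_fpp) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / expected_entries * math.log(2)))
    return num_bits, num_hashes


def actual_fpp(num_bits, num_hashes, actual_entries):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * actual_entries / num_bits)) ** num_hashes


# Size the filter for an estimated 100,000 entries at a 1% target,
# then see what happens if the estimate was 20% too high or too low.
m, k = optimal_params(expected_entries=100_000, target_fpp=0.01)
for actual in (80_000, 100_000, 120_000):
    print(actual, actual_fpp(m, k, actual))
```

Under this approximation the false-positive probability grows smoothly as the actual count exceeds the estimate, so a moderately wrong estimate costs some filtering precision but does not break correctness.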
Any thoughts on this idea?


> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)