Posted to dev@parquet.apache.org by "Ferdinand Xu (JIRA)" <ji...@apache.org> on 2015/07/07 08:27:04 UTC

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

    [ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616251#comment-14616251 ] 

Ferdinand Xu commented on PARQUET-41:
-------------------------------------

Hi [~rdblue], I have some thoughts on the space efficiency of the bloom filter.
First, I think we should define the level at which the bloom filter takes effect. The bloom filter is a complement to the dictionary. At the page level, we already have the dictionary page, which helps us filter data pages. At the level above, we could use a bloom filter to filter out a column chunk without parsing the dictionary page. To serve this purpose, we could make some changes to the current implementation. Currently, the bloom filter statistics are part of the statistics stored in the data page header. This is not a good design, since it uses more space than expected. So I am thinking about making the bloom filter statistics part of the ColumnChunk instead. One extra benefit is that we can postpone construction of the bloom filter to the flush method. At that stage, we have a better understanding of what the data looks like (how many unique values there are). Any suggestions on this? We could have several rounds of discussion and do the POC work once the discussion is completed.
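
To make the flush-time idea concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions, not the parquet-mr implementation: the class ChunkBloomFilterSketch and its methods add, flush, mightContain and sizeInBits are hypothetical names, and the FNV-1a hash with double hashing is just a convenient stand-in for whatever hash we would actually choose. The sketch buffers the distinct value hashes while the column chunk is being written, then sizes the filter in flush from the observed distinct count, using the standard formulas m = -n ln(p) / (ln 2)^2 for the number of bits and k = (m / n) ln 2 for the number of hash functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names exist in parquet-mr.
class ChunkBloomFilterSketch {
    // Distinct value hashes buffered until the column chunk is flushed.
    private final Set<Long> distinctHashes = new HashSet<>();
    private BitSet bits;   // null until flush() has been called
    private int numBits;
    private int numHashes;

    // Called for every value written to the column chunk.
    void add(String value) {
        if (bits != null) {
            throw new IllegalStateException("already flushed");
        }
        distinctHashes.add(hash64(value));
    }

    // Called once at flush time: size the filter from the observed distinct count.
    void flush(double falsePositiveRate) {
        int n = Math.max(1, distinctHashes.size());
        double ln2 = Math.log(2);
        // m = -n * ln(p) / (ln 2)^2, with an arbitrary 64-bit floor for tiny chunks
        numBits = Math.max(64, (int) Math.ceil(-n * Math.log(falsePositiveRate) / (ln2 * ln2)));
        // k = (m / n) * ln 2
        numHashes = Math.max(1, (int) Math.round((double) numBits / n * ln2));
        bits = new BitSet(numBits);
        for (long h : distinctHashes) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(h, i));
            }
        }
        distinctHashes.clear(); // the buffer is no longer needed
    }

    // False means the value is definitely absent from the chunk; true means "maybe".
    boolean mightContain(String value) {
        if (bits == null) {
            throw new IllegalStateException("not flushed yet");
        }
        long h = hash64(value);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h, i))) {
                return false;
            }
        }
        return true;
    }

    int sizeInBits() {
        return numBits;
    }

    // Double hashing: derive the i-th bit index from the two 32-bit halves of one hash.
    private int index(long h, int i) {
        long h1 = (int) h;
        long h2 = (int) (h >>> 32);
        return (int) Math.floorMod(h1 + i * h2, (long) numBits);
    }

    // 64-bit FNV-1a over the UTF-8 bytes (illustrative choice of hash).
    private static long hash64(String value) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

With this shape, a chunk with few distinct values gets a small filter and a chunk with many gets a larger one, which is the space saving I am after. The cost is the memory for the buffered hashes until flush, so a real implementation would probably need a cap on that buffer.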

Regards,
Ferd

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)