Posted to dev@parquet.apache.org by 俊杰陈 <cj...@gmail.com> on 2017/06/29 08:09:15 UTC

Add Bloom Filter to Parquet

Hi

PARQUET-41 <https://issues.apache.org/jira/browse/PARQUET-41> was created
to add a Bloom filter feature. A Bloom filter can dramatically improve the
performance of exact-match queries with very little overhead, especially
on big tables. I have drafted a document that presents the design and
related technology, but some points still need brainstorming and
finalizing, such as which hash algorithm to use and where to store the
Bloom filter data.
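
To make the idea concrete, here is a minimal Python sketch of a Bloom
filter. It is only an illustration, not the design in the document: the
names SimpleBloomFilter and optimal_params, the default sizes, and the use
of SHA-256 with double hashing are all my assumptions for this example.
The sizing helper uses the standard formulas m = -n ln(p) / (ln 2)^2 and
k = (m / n) ln 2.

```python
import hashlib
import math


def optimal_params(num_items, false_positive_rate):
    """Return (num_bits, num_hashes) from the standard Bloom filter formulas."""
    num_bits = math.ceil(
        -num_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    num_hashes = max(1, round(num_bits / num_items * math.log(2)))
    return num_bits, num_hashes


class SimpleBloomFilter:
    """Toy Bloom filter; sizes and hash choice are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive all bit positions from two 64-bit halves
        # of a single digest. SHA-256 is a stand-in, not a proposal.
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )
```

A reader would call might_contain() for the value in an equality
predicate and skip the column chunk when it returns False. A True result
can be a false positive, so the data still has to be read and checked.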

About the hash: Guava uses Murmur3 for its Bloom filter, but Murmur3 does
not produce a consistent 128-bit hash across x86 and x64 platforms. I have
listed some candidate hash algorithms in the doc and need help here.
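
To illustrate why consistency matters: a file written on one platform
must be readable on another, so the same value has to map to the same
bits everywhere. A rough sketch of one way to get that is to fix the byte
encoding of the value before hashing. The function name portable_hash64
is hypothetical, and BLAKE2b is only a stand-in here, not one of the
candidates from the doc.

```python
import hashlib
import struct


def portable_hash64(value):
    """Hash a signed 64-bit integer identically on every platform."""
    # Encode with an explicit little-endian layout so the input bytes do
    # not depend on the native byte order or word size of the machine.
    encoded = struct.pack("<q", value)
    digest = hashlib.blake2b(encoded, digest_size=8).digest()
    # Decode the digest with an explicit byte order as well.
    return int.from_bytes(digest, "little")
```

Whatever algorithm is chosen, the input encoding and the way the digest
is turned into bit positions would need to be pinned down in the format
specification so that all implementations agree.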

For the Bloom filter location, two proposals have been presented: storing
it in the statistics in the column chunk metadata, or in a new page type
in the column chunk. Storing the Bloom filter data in the statistics
already has a PR <https://github.com/apache/parquet-mr/pull/215>, but there
are some concerns: 1) statistics currently support only some types;
decimal, timestamp, etc. are not supported yet; 2) there is a trend to
keep the footer metadata small; 3) statistics are used at both the column
chunk and page level, which may cause confusion. My alternative proposal
is to store the Bloom filter in a new page type and store the page offset
in the column chunk metadata, which is more cleanly separated, but I am
not sure whether there are other concerns. This also needs discussion to
find the best location.
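
The second proposal (separate page plus an offset in the metadata) can be
sketched roughly as follows. This is not the real Parquet Thrift
definition: ColumnChunkMeta, bloom_filter_offset, and the length-prefixed
encoding are hypothetical names and choices made up for this example.

```python
import io
import struct
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnChunkMeta:
    """Hypothetical stand-in for column chunk metadata in the footer."""
    data_offset: int
    bloom_filter_offset: Optional[int] = None  # None: no filter written


def write_chunk(out, data_pages, bloom_bits=None):
    """Write data pages, then an optional Bloom filter 'page' after them."""
    meta = ColumnChunkMeta(data_offset=out.tell())
    out.write(data_pages)
    if bloom_bits is not None:
        # Only the offset goes into the metadata; the filter bytes stay
        # out of the footer, which keeps the footer small.
        meta.bloom_filter_offset = out.tell()
        out.write(struct.pack("<I", len(bloom_bits)))
        out.write(bloom_bits)
    return meta


def read_bloom_filter(inp, meta):
    """Seek to the filter only when a query actually needs it."""
    if meta.bloom_filter_offset is None:
        return None
    inp.seek(meta.bloom_filter_offset)
    (length,) = struct.unpack("<I", inp.read(4))
    return inp.read(length)
```

With this layout a reader that does not use Bloom filters never touches
the filter bytes, and the footer grows by only one offset per column
chunk.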

Any feedback is welcome!

Thanks & Best Regards