You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@parquet.apache.org by Jim Apple <jb...@apache.org> on 2019/03/04 17:09:32 UTC

Re: [Discussion] How to build bloom filter in parquet

> A bit of brainstorming (just some ideas that may or may not be useful): One
> more thing to consider is whether some smart encoding of the bit vector
> would help saving space. I expect the entropy of a nearly empty or nearly
> full bloom filter to be relatively low, because they consist mostly of
> zeroes/ones (respectively). For example, RLE encoding could take advantage
> of such a pattern (but should only be used when it saves space, because
> under regular circumstances the entropy will be high and RLE will only
> increase the data size).

You can read more about this in "Compressed Bloom Filters", by
Michael Mitzenmacher: http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf

I think we should consider this as an extension. The configuration allows for additional formats to be specified later.

> Alternatively, multiple bloom filters may be built at the same time until
> it becomes obvious which one matches the data characteristics best. The
> downside of this the increased number of hash calculations. This downside
> could be worked around by only building a large bloom filter and "folding
> the bit vector onto itself multiple times, cutting at the desired size
> boundary".

Yes, I think this folding mechanism is the way to go, and it is compatible with the current format specification, so can be an add-on to any library writing parquet.


>For example, if we build a bloom filter of 512 bits but in the
> end we see that 64 bits would have been enough, we can split the bit vector
> into 8 equal chunks and XOR them together.

One nit: by ORing them together.

> The resulting bit vector can
> still function as a bloom filter by applying a modulo 64 operation to the
> hashes during lookup. Its efficiency may be worse though than if we used
> hash functions that directly map onto to 64 bits.

Right now the specification uses a subset of the bits already, so no additional operation is required.

Best,
Jim