You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Claude Warren <cl...@xenei.com> on 2019/10/18 15:49:34 UTC

[collections] Bloom filter - Discussion of Shape

 I think the other discussion is getting a bit long so I thought we could
start this discussion here and see if we can close out the other discussion
with agreement on the remaining topics.

The “Shape” of a bloom filter (excluding the hash algo) is defined
mathematically by

Number of Items (AKA: n) n = ceil(m / (-k / log(1 - exp(log(p) / k))))

Probability of Collision (AKA: p) p = (1 – exp(-kn/m))^k

Number of Bits (AKA: m) m = ceil((n * log(p)) / log(1 / pow(2, log(2))))

Number of Functions (AKA: k) k = round((m / n) * log(2))

Since the probability (p) is actually calculated directly from the other 3
the shape is actually concretely defined by n, m, and k.

This is all in the BloomFilterConfiguration class.

The BloomFilterConfigurationClass has 4 constrcutors (n,p), (p,m,k), (n,m),
and (n,m,k). in all cases where is an "p" it is assumed to be a desired
value and values of the other missing properties are calculated. The actual
value for p is then recalculated from n, m and k.

Now it makes sense that the hash algo should also be specified I am happy
to assume an enum here for the moment and also assume a mechanism for
resolving the enum to a concrete hash implementation.

I think that you will find that BloomFilterConfiguration fits the bill but
perhaps should be renamed to BloomFilterShape.



There is one other component that we have not addressed lately: The
ProtoBloomFilter.

The ProtoBloomFilter addresses an issue that arises when there are filters
with different “shape” but same Hashing Algo.

Since hashing is the expensive operation, we want to try to cache the hash
values and reuse them rather than recalculate them.

The ProtoBloomFilter contains the results of the hashing in a format that
is easily used to create filters of the specified shape.

The current implementation uses a circular method whereby a murmur128 hash
is generated for a parameter to a with() call (e.g. with(“foo”) generates a
murmur128 hash of “foo”)
the hash  is split into 2 longs.

Now when the shape has a “k” of:
  1 we return the first long.
  2 we add the second long to the first long and return that.
  3 we add the second long the the result above and return that.

This has been shown to be as randomly distributed as calling
  murmur128( “foo”, (seed1) )
  murmur128( “foo”, (seed2) )
  murmur128( “foo”, (seed3) )

and much faster.

We can do this with any implementation that produces 128 bit hashes.

Once we have the long value for the specific k we take the mod() of that
and the number of bits in the Bloom filter (m). For Bloom filters that have
small enough (m) we could use smaller integers and it would not have a
significant impact on the distribution of the bits.

There is an issue of whether the long value is considered signed or
unsigned as this significantly impacts the mod(). This is why I suggested a
string like Murmur128-UC or Murmur128-SC (unsigned or signed).

If instead the ProtoBloomFilter called the hash function directly then it
would be what I called an incremented call. So calling

  murmur128( “foo”, (seed1) )
  murmur128( “foo”, (seed2) )
  murmur128( “foo”, (seed3) )

inside the ProtoBloomFilter would yield a Murmur128-UI or Murmur128-SI hash
depending on whether the values were assumed to be signed or unsigned for
the mod() call.

Anyway, we can rename the ProtoBloomFilter to something else, but it
effectively generates the required number of functions (k) for each with(x)
parameter.

So you take the BloomFilterConfiguration (Shape) and the ProtoBloomFilter
(k provider) and you can build the BloomFilter.

There are some problems here. Almost none of the nomenclature is common. So
we can define as we go. I think there is a need for a proper hashing naming
were we can determine how the hash was created from the name (like the with
encryption algo names and similar to the RandomSource code you pointed me
to), but I have found no standards here.

Claude

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filter - Discussion of Shape

Posted by Gilles Sadowski <gi...@gmail.com>.
Hi.

Le ven. 18 oct. 2019 à 17:55, Claude Warren <cl...@xenei.com> a écrit :
>
>  I think the other discussion is getting a bit long so I thought we could
> start this discussion here and see if we can close out the other discussion
> with agreement on the remaining topics.
>
> The “Shape” of a bloom filter (excluding the hash algo) is defined
> mathematically by
>
> Number of Items (AKA: n) n = ceil(m / (-k / log(1 - exp(log(p) / k))))
>
> Probability of Collision (AKA: p) p = (1 – exp(-kn/m))^k
>
> Number of Bits (AKA: m) m = ceil((n * log(p)) / log(1 / pow(2, log(2))))
>
> Number of Functions (AKA: k) k = round((m / n) * log(2))
>
> Since the probability (p) is actually calculated directly from the other 3
> the shape is actually concretely defined by n, m, and k.
>
> This is all in the BloomFilterConfiguration class.
>
> The BloomFilterConfigurationClass has 4 constrcutors (n,p), (p,m,k), (n,m),
> and (n,m,k). in all cases where is an "p" it is assumed to be a desired
> value and values of the other missing properties are calculated. The actual
> value for p is then recalculated from n, m and k.
>
> Now it makes sense that the hash algo should also be specified I am happy
> to assume an enum here for the moment and also assume a mechanism for
> resolving the enum to a concrete hash implementation.
>
> I think that you will find that BloomFilterConfiguration fits the bill but
> perhaps should be renamed to BloomFilterShape.
>
>
>
> There is one other component that we have not addressed lately: The
> ProtoBloomFilter.
>
> The ProtoBloomFilter addresses an issue that arises when there are filters
> with different “shape” but same Hashing Algo.
>
> Since hashing is the expensive operation, we want to try to cache the hash
> values and reuse them rather than recalculate them.
>
> The ProtoBloomFilter contains the results of the hashing in a format that
> is easily used to create filters of the specified shape.
>
> The current implementation uses a circular method whereby a murmur128 hash
> is generated for a parameter to a with() call (e.g. with(“foo”) generates a
> murmur128 hash of “foo”)
> the hash  is split into 2 longs.
>
> Now when the shape has a “k” of:
>   1 we return the first long.
>   2 we add the second long to the first long and return that.
>   3 we add the second long the the result above and return that.
>
> This has been shown to be as randomly distributed as calling
>   murmur128( “foo”, (seed1) )
>   murmur128( “foo”, (seed2) )
>   murmur128( “foo”, (seed3) )
>
> and much faster.
>
> We can do this with any implementation that produces 128 bit hashes.
>
> Once we have the long value for the specific k we take the mod() of that
> and the number of bits in the Bloom filter (m). For Bloom filters that have
> small enough (m) we could use smaller integers and it would not have a
> significant impact on the distribution of the bits.
>
> There is an issue of whether the long value is considered signed or
> unsigned as this significantly impacts the mod(). This is why I suggested a
> string like Murmur128-UC or Murmur128-SC (unsigned or signed).
>
> If instead the ProtoBloomFilter called the hash function directly then it
> would be what I called an incremented call. So calling
>
>   murmur128( “foo”, (seed1) )
>   murmur128( “foo”, (seed2) )
>   murmur128( “foo”, (seed3) )
>
> inside the ProtoBloomFilter would yield a Murmur128-UI or Murmur128-SI hash
> depending on whether the values were assumed to be signed or unsigned for
> the mod() call.
>
> Anyway, we can rename the ProtoBloomFilter to something else, but it
> effectively generates the required number of functions (k) for each with(x)
> parameter.
>
> So you take the BloomFilterConfiguration (Shape) and the ProtoBloomFilter
> (k provider) and you can build the BloomFilter.
>
> There are some problems here. Almost none of the nomenclature is common. So
> we can define as we go. I think there is a need for a proper hashing naming
> were we can determine how the hash was created from the name (like the with
> encryption algo names and similar to the RandomSource code you pointed me
> to), but I have found no standards here.

At the issue page
    https://issues.apache.org/jira/browse/COLLECTIONS-728
I've attached a design suggestion from what I gathered from this discussion.
[It's incomplete and not tested, some names were changed (TBD), and the
commented out part ("HashSource") will not compile until more code is added.]

Prominent feature is the explicit namespace that groups in a single file all the
boiler-plate functionality needed by implementations of the
"BloomFilter" interface.
[This is also in line with the layout of the top-level package of
[Collections] that
contains various interfaces, each with their implementations located
in a specific
sub-package.]

Next step would be to create a
---CUT---
public class BitSetBloomFilter implements BloomFilter {
    private final BitSet state;
    // ...
}
---CUT---

WDYT?

Regards,
Gilles

>
> Claude
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org