You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Claude Warren <cl...@xenei.com> on 2019/10/07 17:31:18 UTC

[collections] BloomFilter or BitSet functions?

As noted earlier I am preparing a contribution of Bloom Filter classes to
the collections module.  As part of this submission there are several
methods that operate on BitSets that are used as part  of Bloom Filter
manipulation and analysis.  My question is, should these be contributed as
Bloom Filter specific methods or would it be better to submit a BitSet
function library.

The methods in question are:
hammingDistance() = the cardinality (A xor B)
jaccardDistance()  = the 1 - jaccardSimilarity()
jaccardSimilarity() = cardinality(A xor B) / cardinality (A or B)
cosineDistance() = 1 - cosineSimilarity()
cosineSimilarity() = cardinality( A and B ) / (Sqrt( cardinality( A ) ) *
Sqrt( cardinality( B )))
estimatedLog = estimated log2 of the BitSet if considered a large unsigned
int.

Opinions requested.

Claude
--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] BloomFilter or BitSet functions?

Posted by Gilles Sadowski <gi...@gmail.com>.
Hello.

Le dim. 13 oct. 2019 à 21:49, Claude Warren <cl...@xenei.com> a écrit :
>
> I did some work on the BloomFiltersFunction idea and found that moving the
> hamming distance and approximate log methods out of the BloomFilter
> interface made implementation much easier and cleaner.  In fact, all the
> calls in my code that is utilizing the proposed contribution got simpler.
> I now firmly believe that this is the proper way forward.  The change is on
> my commons-collection fork under a "funcs" branch.

I'm not so sure (sorry!).
It suspiciously looks like a utility class, and moreover, most of the
method actually require a "BitSet" which the "BloomFilter" interface
mandates.
The fact that it seems to make other codes cleaner is perhaps a
sign that something is wrong with the design.
IMHO, it starts with a question asked before:  Is there going to
be other implementations of the "BloomFilter" interface?  If so, how
would they differ from "StandardBloomFilter"?  And would they use
those utility functions?

Other questions:
* Why, in "BloomNestedCollection", is the "BucketFactory" (nested)
interface is implemented by classes nested in that same class?
* Why do "BloomCollection" and "BloomNestedCollection" provide
the "BloomFilterGated" API?  Isn't the latter an "implementation
detail"?  IOW, what is the usage pattern?

Perhaps those remarks reflect my lack of knowledge of the proposed
functionality, but the size of your contribution makes it difficult to tell
apart what needs to be "public" (from a user perspective) from what
should be "private" (or "protected") as pertaining to the details of the
implementation.  E.g. it looks to me that the functions currently in
"BloomFilterFunctions" should not be part of the public API (?).

Regards,
Gilles

> My plan is to complete the documentation for BloomFilterFunctions, verify
> that I have not added any issues to any of the build reports and then merge
> it back into my contribution WIP.
>
> If anyone objects or thinks that this change is not proper please let me
> know soonest.
>
> Thanks,
> Claude
>
> On Sun, Oct 13, 2019 at 7:07 AM Claude Warren <cl...@xenei.com> wrote:
>
> > I believe the functions should be in a separate class as it increases the
> > separation of concerns.
> >
> > The methods are used in Bloom Filter manipulation and analysis but are not
> > specific to bloom filters.  They are in fact specific to bit vectors.
> >
> > My original thought was to create a function class (Name to be determined
> > but lets call it Func for now) that would have static methods like:
> >
> > int Func.hammingDistance( BloomFilter one, BloomFilter two)
> >
> > I think this makes sense as it emphasizes the fact that we are dealing
> > with BloomFilters in this package.  However, the implementation is probably
> > going to be along the lines of
> >
> > BitSet bsOne = one.getBitSet();
> > BitSet bsTwo = two.getBitSet();
> > // manipulate the bitsets here
> >
> > so the question was should there be a BitSet contribution that would be
> >
> > int Func.hammingDistance( BitSet one, BitSet two)
> >
> > The more I think about it the more I think that the BitSet functionality
> > should wait until someone wants it.  Right now I can proceed without it and
> > provide the semantically sound methods for the BloomFilters.
> >
> > Claude
> >
> > On Fri, Oct 11, 2019 at 12:36 AM Gilles Sadowski <gi...@gmail.com>
> > wrote:
> >
> >> Hello.
> >>
> >> Le lun. 7 oct. 2019 à 19:42, Claude Warren <cl...@xenei.com> a écrit :
> >> >
> >> > As noted earlier I am preparing a contribution of Bloom Filter classes
> >> to
> >> > the collections module.  As part of this submission there are several
> >> > methods that operate on BitSets that are used as part  of Bloom Filter
> >> > manipulation and analysis.  My question is, should these be contributed
> >> as
> >> > Bloom Filter specific methods or would it be better to submit a BitSet
> >> > function library.
> >>
> >> What do you mean?
> >> What would be the alternative?  How would usage change (from a
> >> user perspective)?  Would it improve the design (e.g. be increasing
> >> the "separation of concerns")?
> >>
> >> Thanks,
> >> Gilles
> >>
> >> >
> >> > The methods in question are:
> >> > hammingDistance() = the cardinality (A xor B)
> >> > jaccardDistance()  = the 1 - jaccardSimilarity()
> >> > jaccardSimilarity() = cardinality(A xor B) / cardinality (A or B)
> >> > cosineDistance() = 1 - cosineSimilarity()
> >> > cosineSimilarity() = cardinality( A and B ) / (Sqrt( cardinality( A ) )
> >> *
> >> > Sqrt( cardinality( B )))
> >> > estimatedLog = estimated log2 of the BitSet if considered a large
> >> unsigned
> >> > int.
> >> >
> >> > Opinions requested.
> >> >
> >> > Claude
> >> > --
> >> > I like: Like Like - The likeliest place on the web
> >> > <http://like-like.xenei.com>
> >> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >> For additional commands, e-mail: dev-help@commons.apache.org
> >>
> >>
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [collections] BloomFilter or BitSet functions?

Posted by Claude Warren <cl...@xenei.com>.
I did some work on the BloomFiltersFunction idea and found that moving the
hamming distance and approximate log methods out of the BloomFilter
interface made implementation much easier and cleaner.  In fact, all the
calls in my code that is utilizing the proposed contribution got simpler.
I now firmly believe that this is the proper way forward.  The change is on
my commons-collection fork under a "funcs" branch.

My plan is to complete the documentation for BloomFilterFunctions, verify
that I have not added any issues to any of the build reports and then merge
it back into my contribution WIP.

If anyone objects or thinks that this change is not proper please let me
know soonest.

Thanks,
Claude

On Sun, Oct 13, 2019 at 7:07 AM Claude Warren <cl...@xenei.com> wrote:

> I believe the functions should be in a separate class as it increases the
> separation of concerns.
>
> The methods are used in Bloom Filter manipulation and analysis but are not
> specific to bloom filters.  They are in fact specific to bit vectors.
>
> My original thought was to create a function class (Name to be determined
> but lets call it Func for now) that would have static methods like:
>
> int Func.hammingDistance( BloomFilter one, BloomFilter two)
>
> I think this makes sense as it emphasizes the fact that we are dealing
> with BloomFilters in this package.  However, the implementation is probably
> going to be along the lines of
>
> BitSet bsOne = one.getBitSet();
> BitSet bsTwo = two.getBitSet();
> // manipulate the bitsets here
>
> so the question was should there be a BitSet contribution that would be
>
> int Func.hammingDistance( BitSet one, BitSet two)
>
> The more I think about it the more I think that the BitSet functionality
> should wait until someone wants it.  Right now I can proceed without it and
> provide the semantically sound methods for the BloomFilters.
>
> Claude
>
> On Fri, Oct 11, 2019 at 12:36 AM Gilles Sadowski <gi...@gmail.com>
> wrote:
>
>> Hello.
>>
>> Le lun. 7 oct. 2019 à 19:42, Claude Warren <cl...@xenei.com> a écrit :
>> >
>> > As noted earlier I am preparing a contribution of Bloom Filter classes
>> to
>> > the collections module.  As part of this submission there are several
>> > methods that operate on BitSets that are used as part  of Bloom Filter
>> > manipulation and analysis.  My question is, should these be contributed
>> as
>> > Bloom Filter specific methods or would it be better to submit a BitSet
>> > function library.
>>
>> What do you mean?
>> What would be the alternative?  How would usage change (from a
>> user perspective)?  Would it improve the design (e.g. be increasing
>> the "separation of concerns")?
>>
>> Thanks,
>> Gilles
>>
>> >
>> > The methods in question are:
>> > hammingDistance() = the cardinality (A xor B)
>> > jaccardDistance()  = the 1 - jaccardSimilarity()
>> > jaccardSimilarity() = cardinality(A xor B) / cardinality (A or B)
>> > cosineDistance() = 1 - cosineSimilarity()
>> > cosineSimilarity() = cardinality( A and B ) / (Sqrt( cardinality( A ) )
>> *
>> > Sqrt( cardinality( B )))
>> > estimatedLog = estimated log2 of the BitSet if considered a large
>> unsigned
>> > int.
>> >
>> > Opinions requested.
>> >
>> > Claude
>> > --
>> > I like: Like Like - The likeliest place on the web
>> > <http://like-like.xenei.com>
>> > LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] BloomFilter or BitSet functions?

Posted by Claude Warren <cl...@xenei.com>.
I believe the functions should be in a separate class as it increases the
separation of concerns.

The methods are used in Bloom Filter manipulation and analysis but are not
specific to bloom filters.  They are in fact specific to bit vectors.

My original thought was to create a function class (Name to be determined
but lets call it Func for now) that would have static methods like:

int Func.hammingDistance( BloomFilter one, BloomFilter two)

I think this makes sense as it emphasizes the fact that we are dealing with
BloomFilters in this package.  However, the implementation is probably
going to be along the lines of

BitSet bsOne = one.getBitSet();
BitSet bsTwo = two.getBitSet();
// manipulate the bitsets here

so the question was should there be a BitSet contribution that would be

int Func.hammingDistance( BitSet one, BitSet two)

The more I think about it the more I think that the BitSet functionality
should wait until someone wants it.  Right now I can proceed without it and
provide the semantically sound methods for the BloomFilters.

Claude

On Fri, Oct 11, 2019 at 12:36 AM Gilles Sadowski <gi...@gmail.com>
wrote:

> Hello.
>
> Le lun. 7 oct. 2019 à 19:42, Claude Warren <cl...@xenei.com> a écrit :
> >
> > As noted earlier I am preparing a contribution of Bloom Filter classes to
> > the collections module.  As part of this submission there are several
> > methods that operate on BitSets that are used as part  of Bloom Filter
> > manipulation and analysis.  My question is, should these be contributed
> as
> > Bloom Filter specific methods or would it be better to submit a BitSet
> > function library.
>
> What do you mean?
> What would be the alternative?  How would usage change (from a
> user perspective)?  Would it improve the design (e.g. be increasing
> the "separation of concerns")?
>
> Thanks,
> Gilles
>
> >
> > The methods in question are:
> > hammingDistance() = the cardinality (A xor B)
> > jaccardDistance()  = the 1 - jaccardSimilarity()
> > jaccardSimilarity() = cardinality(A xor B) / cardinality (A or B)
> > cosineDistance() = 1 - cosineSimilarity()
> > cosineSimilarity() = cardinality( A and B ) / (Sqrt( cardinality( A ) ) *
> > Sqrt( cardinality( B )))
> > estimatedLog = estimated log2 of the BitSet if considered a large
> unsigned
> > int.
> >
> > Opinions requested.
> >
> > Claude
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] BloomFilter or BitSet functions?

Posted by Gilles Sadowski <gi...@gmail.com>.
Hello.

Le lun. 7 oct. 2019 à 19:42, Claude Warren <cl...@xenei.com> a écrit :
>
> As noted earlier I am preparing a contribution of Bloom Filter classes to
> the collections module.  As part of this submission there are several
> methods that operate on BitSets that are used as part  of Bloom Filter
> manipulation and analysis.  My question is, should these be contributed as
> Bloom Filter specific methods or would it be better to submit a BitSet
> function library.

What do you mean?
What would be the alternative?  How would usage change (from a
user perspective)?  Would it improve the design (e.g. be increasing
the "separation of concerns")?

Thanks,
Gilles

>
> The methods in question are:
> hammingDistance() = the cardinality (A xor B)
> jaccardDistance()  = the 1 - jaccardSimilarity()
> jaccardSimilarity() = cardinality(A xor B) / cardinality (A or B)
> cosineDistance() = 1 - cosineSimilarity()
> cosineSimilarity() = cardinality( A and B ) / (Sqrt( cardinality( A ) ) *
> Sqrt( cardinality( B )))
> estimatedLog = estimated log2 of the BitSet if considered a large unsigned
> int.
>
> Opinions requested.
>
> Claude
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org