You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Gary Gregory <ga...@gmail.com> on 2020/01/08 14:15:27 UTC

Re: [collections] bloom filters comments

I think it is time to bring this PR in and make any adjustments within
master beyond that. This will be quicker and simpler than going round and
round for simple things like Javadoc tweaks and small non-functional
changes (formatting, variable names, and so on.) I'll proceed with that
tonight.

Gary

On Sun, Dec 29, 2019 at 11:56 AM Claude Warren <cl...@xenei.com> wrote:

> It is currently a sub-class.  There was a suggestion to move it outside of
> the BloomFilter interface.
>
> On Sun, Dec 29, 2019 at 3:47 PM Gilles Sadowski <gi...@gmail.com>
> wrote:
>
> > Le dim. 29 déc. 2019 à 15:30, Claude Warren <cl...@xenei.com> a écrit :
> > >
> > > If the Shape class (BloomFilter.Shape) is extracted from the
> BloomFilter
> > > interface and made a stand-alone class would the name change or is the
> > fact
> > > that it is in the o.a.c.c.bloomfilter package enough?
> > >
> > > I prefer the name Shape to BloomFilterShape.
> >
> > If "Shape" is only used for "BloomFilter" (as the suggestion above
> > seems to indicate, why not declare it as a sub-interface:
> > ---CUT---
> > public interface BloomFilter {
> >     // ...
> >
> >     public interface Shape {
> >         // ...
> >     }
> > }
> > ---CUT---
> > ?
> >
> > Regards,
> > Gilles
> >
> > >
> > > Claude
> > >
> > > On Sat, Dec 28, 2019 at 9:06 AM Claude Warren <cl...@xenei.com>
> wrote:
> > >
> > > > Once the interface is extracted and reduced to the minimum necessary
> > the
> > > > following methods are removed:
> > > >
> > > > orCardinality() -- we have andCardinality() and xorCardinality() this
> > was
> > > > included for completeness.
> > > >
> > > > isFull() -- true if all the bits in the vector are on. A convenience
> > > > method used to short circuit some logic.
> > > >
> > > > I think these 2 methods should go into the BloomFilter interface.
> > > >
> > > >
> > > > Set operations:
> > > >
> > > > jaccardSimilarity -- a measure of similarity in sets with range [0,1]
> > > >
> > > > jaccardDistance -- defined as 1 - jaccardSimilarity
> > > >
> > > > cosineSimilarity -- a measure of similarity with range  [0,1]
> > > >
> > > > cosineDistance -- defined as 1 - cosineSimilarity
> > > >
> > > > estimateSize -- Set operation estimate of the number of items that
> were
> > > > placed in the filter. (Requires Shape)
> > > >
> > > > estimateUnionSize -- Set operation estimate of the number of items
> that
> > > > would be represented by the merging of two filters. (Requires Shape)
> > > >
> > > > estimateIntersectionSize -- Set operations estimate of the number of
> > items
> > > > in the intersection. (Requires Shape)
> > > >
> > > > Perhaps it makes sense to move the Set operations into their own
> class
> > > > with static methods?  The set operations all operate on 2 Bloom
> > filters.
> > > > Moving them would clarify the AbstractBloomFilter class.
> > > >
> > > > Claude
> > > >
> > > >
> > > > On Sat, Dec 28, 2019 at 2:01 AM Gary Gregory <garydgregory@gmail.com
> >
> > > > wrote:
> > > >
> > > >> On Fri, Dec 27, 2019 at 11:36 AM Claude Warren <cl...@xenei.com>
> > wrote:
> > > >>
> > > >> > On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory <
> > garydgregory@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Hi Claude and all:
> > > >> > >
> > > >> > > Here are a couple of comments on the bloom filter PR.
> > > >> > >
> > > >> > > 10,100 ft level comment we do not have to worry about today:
> > Before
> > > >> the
> > > >> > > release, we might want to split Commons Collection into a
> > multi-module
> > > >> > > project and have the BF code in it own module. The _main_ reason
> > for
> > > >> this
> > > >> > > is that it will allow the dependency on Commons Codecis be a
> > > >> non-optional
> > > >> > > dependency. Optional dependency are a a bit of a pain IMO,
> > especially
> > > >> > when
> > > >> > > you deal with large stacks. You end up sucking in everything
> that
> > is
> > > >> > > optional when you deliver an app because you do not know for
> > certain
> > > >> > what's
> > > >> > > really required at runtime.
> > > >> > >
> > > >> > > Closer to the ground: I would make BloomFilter an interface and
> > rename
> > > >> > > the current BloomFilter class AbstractBloomFilter implements
> > > >> BloomFilter.
> > > >> > > This will allow the user and maintainer to see what is truly
> > public.
> > > >> > >
> > > >> >
> > > >> > I have done this (not checked in) but the net effect is that all
> the
> > > >> public
> > > >> > methods in the AbstractBloomFilter class are reflected in the
> > > >> BloomFilter
> > > >> > interface and the Shape class has moved to the Interface as well.
> > > >> >
> > > >>
> > > >> OK, I think I like BloomFilter as an interface especially since it
> is
> > used
> > > >> as an argument in methods. I'll leave it to you as to whether Shape
> > needs
> > > >> to be folded in. I did notice that Shape is an argument in a few
> > places.
> > > >> Might we loose some focus and abstraction by folding Shape into the
> > > >> BloomFilter interface?
> > > >>
> > > >>
> > > >> > I can remove several methods from BloomFilter that are not
> strictly
> > > >> > necessary for this code to function.  I am and have been reticent
> > to do
> > > >> so
> > > >> > since there are a number of home-grown libraries used by various
> > > >> > researchers that provide these functions.  But if that is what it
> > takes
> > > >> to
> > > >> > get this out the door it will be done.
> > > >> >
> > > >>
> > > >> Once you have BloomFilter as an interface with an implementing
> class,
> > it
> > > >> might make it much clearer as to what really belongs in the
> > interface. The
> > > >> handy methods can stay in the abstract class of course.
> > > >>
> > > >>
> > > >> >
> > > >> > > Since this is a first cut for Commons Collection, I would
> consider
> > > >> only
> > > >> > > making APIs public that must be public. Once released, we MUST
> > > >> maintain
> > > >> > > binary compatibility within minor releases. Major releases allow
> > us to
> > > >> > > break this compatibility but these do not happen very often. I
> do
> > not
> > > >> > know
> > > >> > > what this means yet for BF but it's a thought to consider. This
> > kind
> > > >> of
> > > >> > > work is made harder due to the current packaging of the BF code.
> > > >> > >
> > > >> > > We could consider putting it all in one package if that helps
> > reduce
> > > >> the
> > > >> > > public API footprint.
> > > >> > >
> > > >> >
> > > >> > I think that putting all the pieces into a single package will
> > muddy the
> > > >> > waters a bit.  The classes in o.a.c.c.bloomfilter are high level
> > classes
> > > >> > and used wherever the Bloom filter code is used.  Objects in the
> > > >> > o.a.c.c.bloomfilter.hasher classes are Hashers and are really only
> > > >> selected
> > > >> > when the application developer determines which hasher is best
> > suited
> > > >> for
> > > >> > the environment.  Finally o.a.c.c.bloomfilter.hasher.function are
> > > >> > HasherFunction implementations.  These can be used or special
> cases
> > > >> built
> > > >> > by the implementer as necessary.  Perhaps I am being a bit too
> > pedantic
> > > >> but
> > > >> > I do not think it makes sense to merge them into a single package.
> > > >> >
> > > >>
> > > >> Let's leave it as is then.
> > > >>
> > > >> Side note: I'll need to release Commons Codec 1.14 before we can
> > bring in
> > > >> this PR. I hope to do this after Commons VFS has gone through its
> own
> > > >> release cycle (which should be done in 48 hours or so if all goes
> > well.)
> > > >> Well, we could bring in the PR but depending on a SNAPSHOT build is
> > > >> usually
> > > >> not a good idea unless it is truly short term.
> > > >>
> > > >> Thank you,
> > > >> Gary
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> > >
> > > >> > > Q: All the code in main's
> > > >> > > org.apache.commons.collections4.bloomfilter.hasher.function is
> > only
> > > >> > called
> > > >> > > from test code. Why is that?
> > > >> > >
> > > >> >
> > > >> > They are pluggable modules.  When a developer uses them they an
> > > >> instance is
> > > >> > created and it is passed to the BloomFilter.Shape() constructor as
> > well
> > > >> as
> > > >> > to the various Hasher constructors.  For example:
> > > >> >
> > > >> >     HashFunction hashFunction = new Murmur128x86Cyclic();
> > > >> >     Shape bloomFilterConfig = new Shape( hashFunction, 3, 1.0 /
> > 100000);
> > > >> >
> > > >> >
> > > >> > >
> > > >> > > Nit: Some odd formatting like spaces after open parens:
> > 'toUpperCase(
> > > >> > > Locale.ROOT)' should be  'toUpperCase(Locale.ROOT)'
> > > >> > >
> > > >> > > Gary
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > I like: Like Like - The likeliest place on the web
> > > >> > <http://like-like.xenei.com>
> > > >> > LinkedIn: http://www.linkedin.com/in/claudewarren
> > > >> >
> > > >>
> > > >
> > > >
> > > > --
> > > > I like: Like Like - The likeliest place on the web
> > > > <http://like-like.xenei.com>
> > > > LinkedIn: http://www.linkedin.com/in/claudewarren
> > > >
> > >
> > >
> > > --
> > > I like: Like Like - The likeliest place on the web
> > > <http://like-like.xenei.com>
> > > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > For additional commands, e-mail: dev-help@commons.apache.org
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>

Re: [collections] bloom filters comments

Posted by Claude Warren <cl...@xenei.com>.
Gary suggested merging the pull request, Gilles points out there were
issues raised on the mailing list, I posted what I think was an accurate
summary.

What needs to be done to resolve this so that a final decision on the pull
request can be made?

Claude

On Thu, Jan 9, 2020 at 2:57 AM Bruno P. Kinoshita
<br...@yahoo.com.br.invalid> wrote:

>  Sorry, I'd read Gary's reply as if there was no PR yet. Reviewed it a bit
> now, lots of tests! Will play with the code and read the comments and
> finish the review by the end of the week.
>
> Thanks Claude
>
>     On Thursday, 9 January 2020, 11:20:40 am NZDT, Claude Warren <
> claude@xenei.com> wrote:
>
>  There is a pull request open:
> https://github.com/apache/commons-collections/pull/83
>
> On Wed, Jan 8, 2020 at 9:32 PM Bruno P. Kinoshita
> <br...@yahoo.com.br.invalid> wrote:
>
> >  Thanks for the summary Claude. I read some of the messages, but got lost
> > in the middle of the discussion, and tend to understand problems better
> if
> > there's some code to read along. And I am used to GitHub/GitLab diff
> > interface.
> > So I agree with Gary that this could be a good time for a PR (maybe a
> > draft one).
> > Bruno
> >
> >
> >    On Thursday, 9 January 2020, 6:32:09 am NZDT, Claude Warren <
> > claude@xenei.com> wrote:
> >
> >  I believe the issue (I think history is at
> >
> >
> https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17003600
> > )
> > is about the identification of hash implementations.
> >
> > Currently there are a couple of classes involved:
> >
> > Hasher interface, has a method that returns a HashFunctionIdentity and a
> > method that returns an iterator of enabled bits.  There are a couple of
> > implementations of Hasher:  the DynamicHasher contains buffers that are
> > passed to the hash function several times, the StaticHasher contains a
> list
> > of bits enabled by a hasher for a specific Shape.
> >
> > HashFunction interface: extends HashFunctionIdentity and adds a method
> that
> > calls the actual hash function.
> >
> > HashFunctionIdentity: contains the name of the hash function, the name of
> > the provider, the processType (cyclic or iterative), Signedness and a
> > signature.
> >
> > There are places in the code where the actual function is not required
> and
> > is some use cases would make the implementation difficult or fragile.
> > These code places are where the Bloom filter has been built and the
> system
> > is verifying that two filters used the same hash function.  In these
> cases
> > the comparison is the hashName, processType and Signedness.  In cases
> where
> > the bloom filters are stored in a database retrieval would mean some sort
> > of serialization/deserialization of the hash function or ensure that the
> > hash function is otherwise available.  This is problematic.
> >
> > The provider was added in a nod to a future factory that would follow the
> > JCA pattern and allow implementations of multiple providers.
> >
> > The signature was added to support a requested quick check.  The
> signature
> > is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
> > getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).
> >
> > There were suggestions to create an enum of HashFunctions controlled  by
> > the Collections.  I think that this adds a layer of coordination and
> > management on the Collections team that as a team we may not want to take
> > on.  In addition, it makes it almost impossible for 3rd party users to
> > create new hash functions and test them with the library.
> >
> > I believe the current implementation provides the minimal information
> > necessary to determine if two functions are supposed to produce the same
> > result.  In my mind the signature and provider methods are extra and not
> > necessary but desirable.
> >
> > I think this is a summary of the open discussion.
> >
> >
> > On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski <gi...@gmail.com>
> > wrote:
> >
> > > Le mer. 8 janv. 2020 à 15:15, Gary Gregory <ga...@gmail.com> a
> > > écrit :
> > > >
> > > > I think it is time to bring this PR in and make any adjustments
> within
> > > > master beyond that. This will be quicker and simpler than going round
> > and
> > > > round for simple things like Javadoc tweaks and small non-functional
> > > > changes (formatting, variable names, and so on.) I'll proceed with
> that
> > > > tonight.
> > >
> > > Design issues were raised on the ML: With no agreement and no opinions
> > > other than Claude's and mine, things stayed where they were.
> > >
> > > Gilles
> > >
> > > >> [...]
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > > For additional commands, e-mail: dev-help@commons.apache.org
> > >
> > >
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
>
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren



-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] bloom filters comments

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br.INVALID>.
 Sorry, I'd read Gary's reply as if there was no PR yet. Reviewed it a bit now, lots of tests! Will play with the code and read the comments and finish the review by the end of the week.

Thanks Claude

    On Thursday, 9 January 2020, 11:20:40 am NZDT, Claude Warren <cl...@xenei.com> wrote:  
 
 There is a pull request open:
https://github.com/apache/commons-collections/pull/83

On Wed, Jan 8, 2020 at 9:32 PM Bruno P. Kinoshita
<br...@yahoo.com.br.invalid> wrote:

>  Thanks for the summary Claude. I read some of the messages, but got lost
> in the middle of the discussion, and tend to understand problems better if
> there's some code to read along. And I am used to GitHub/GitLab diff
> interface.
> So I agree with Gary that this could be a good time for a PR (maybe a
> draft one).
> Bruno
>
>
>    On Thursday, 9 January 2020, 6:32:09 am NZDT, Claude Warren <
> claude@xenei.com> wrote:
>
>  I believe the issue (I think history is at
>
> https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17003600
> )
> is about the identification of hash implementations.
>
> Currently there are a couple of classes involved:
>
> Hasher interface, has a method that returns a HashFunctionIdentity and a
> method that returns an iterator of enabled bits.  There are a couple of
> implementations of Hasher:  the DynamicHasher contains buffers that are
> passed to the hash function several times, the StaticHasher contains a list
> of bits enabled by a hasher for a specific Shape.
>
> HashFunction interface: extends HashFunctionIdentity and adds a method that
> calls the actual hash function.
>
> HashFunctionIdentity: contains the name of the hash function, the name of
> the provider, the processType (cyclic or iterative), Signedness and a
> signature.
>
> There are places in the code where the actual function is not required and
> is some use cases would make the implementation difficult or fragile.
> These code places are where the Bloom filter has been built and the system
> is verifying that two filters used the same hash function.  In these cases
> the comparison is the hashName, processType and Signedness.  In cases where
> the bloom filters are stored in a database retrieval would mean some sort
> of serialization/deserialization of the hash function or ensure that the
> hash function is otherwise available.  This is problematic.
>
> The provider was added in a nod to a future factory that would follow the
> JCA pattern and allow implementations of multiple providers.
>
> The signature was added to support a requested quick check.  The signature
> is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
> getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).
>
> There were suggestions to create an enum of HashFunctions controlled  by
> the Collections.  I think that this adds a layer of coordination and
> management on the Collections team that as a team we may not want to take
> on.  In addition, it makes it almost impossible for 3rd party users to
> create new hash functions and test them with the library.
>
> I believe the current implementation provides the minimal information
> necessary to determine if two functions are supposed to produce the same
> result.  In my mind the signature and provider methods are extra and not
> necessary but desirable.
>
> I think this is a summary of the open discussion.
>
>
> On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski <gi...@gmail.com>
> wrote:
>
> > Le mer. 8 janv. 2020 à 15:15, Gary Gregory <ga...@gmail.com> a
> > écrit :
> > >
> > > I think it is time to bring this PR in and make any adjustments within
> > > master beyond that. This will be quicker and simpler than going round
> and
> > > round for simple things like Javadoc tweaks and small non-functional
> > > changes (formatting, variable names, and so on.) I'll proceed with that
> > > tonight.
> >
> > Design issues were raised on the ML: With no agreement and no opinions
> > other than Claude's and mine, things stayed where they were.
> >
> > Gilles
> >
> > >> [...]
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > For additional commands, e-mail: dev-help@commons.apache.org
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren



-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren  

Re: [collections] bloom filters comments

Posted by Claude Warren <cl...@xenei.com>.
There is a pull request open:
https://github.com/apache/commons-collections/pull/83

On Wed, Jan 8, 2020 at 9:32 PM Bruno P. Kinoshita
<br...@yahoo.com.br.invalid> wrote:

>  Thanks for the summary Claude. I read some of the messages, but got lost
> in the middle of the discussion, and tend to understand problems better if
> there's some code to read along. And I am used to GitHub/GitLab diff
> interface.
> So I agree with Gary that this could be a good time for a PR (maybe a
> draft one).
> Bruno
>
>
>     On Thursday, 9 January 2020, 6:32:09 am NZDT, Claude Warren <
> claude@xenei.com> wrote:
>
>  I believe the issue (I think history is at
>
> https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17003600
> )
> is about the identification of hash implementations.
>
> Currently there are a couple of classes involved:
>
> Hasher interface, has a method that returns a HashFunctionIdentity and a
> method that returns an iterator of enabled bits.  There are a couple of
> implementations of Hasher:  the DynamicHasher contains buffers that are
> passed to the hash function several times, the StaticHasher contains a list
> of bits enabled by a hasher for a specific Shape.
>
> HashFunction interface: extends HashFunctionIdentity and adds a method that
> calls the actual hash function.
>
> HashFunctionIdentity: contains the name of the hash function, the name of
> the provider, the processType (cyclic or iterative), Signedness and a
> signature.
>
> There are places in the code where the actual function is not required and
> is some use cases would make the implementation difficult or fragile.
> These code places are where the Bloom filter has been built and the system
> is verifying that two filters used the same hash function.  In these cases
> the comparison is the hashName, processType and Signedness.  In cases where
> the bloom filters are stored in a database retrieval would mean some sort
> of serialization/deserialization of the hash function or ensure that the
> hash function is otherwise available.  This is problematic.
>
> The provider was added in a nod to a future factory that would follow the
> JCA pattern and allow implementations of multiple providers.
>
> The signature was added to support a requested quick check.  The signature
> is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
> getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).
>
> There were suggestions to create an enum of HashFunctions controlled  by
> the Collections.  I think that this adds a layer of coordination and
> management on the Collections team that as a team we may not want to take
> on.  In addition, it makes it almost impossible for 3rd party users to
> create new hash functions and test them with the library.
>
> I believe the current implementation provides the minimal information
> necessary to determine if two functions are supposed to produce the same
> result.  In my mind the signature and provider methods are extra and not
> necessary but desirable.
>
> I think this is a summary of the open discussion.
>
>
> On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski <gi...@gmail.com>
> wrote:
>
> > Le mer. 8 janv. 2020 à 15:15, Gary Gregory <ga...@gmail.com> a
> > écrit :
> > >
> > > I think it is time to bring this PR in and make any adjustments within
> > > master beyond that. This will be quicker and simpler than going round
> and
> > > round for simple things like Javadoc tweaks and small non-functional
> > > changes (formatting, variable names, and so on.) I'll proceed with that
> > > tonight.
> >
> > Design issues were raised on the ML: With no agreement and no opinions
> > other than Claude's and mine, things stayed where they were.
> >
> > Gilles
> >
> > >> [...]
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > For additional commands, e-mail: dev-help@commons.apache.org
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren



-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] bloom filters comments

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br.INVALID>.
 Thanks for the summary Claude. I read some of the messages, but got lost in the middle of the discussion, and tend to understand problems better if there's some code to read along. And I am used to GitHub/GitLab diff interface.
So I agree with Gary that this could be a good time for a PR (maybe a draft one).
Bruno


    On Thursday, 9 January 2020, 6:32:09 am NZDT, Claude Warren <cl...@xenei.com> wrote:  
 
 I believe the issue (I think history is at
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17003600)
is about the identification of hash implementations.

Currently there are a couple of classes involved:

Hasher interface, has a method that returns a HashFunctionIdentity and a
method that returns an iterator of enabled bits.  There are a couple of
implementations of Hasher:  the DynamicHasher contains buffers that are
passed to the hash function several times, the StaticHasher contains a list
of bits enabled by a hasher for a specific Shape.

HashFunction interface: extends HashFunctionIdentity and adds a method that
calls the actual hash function.

HashFunctionIdentity: contains the name of the hash function, the name of
the provider, the processType (cyclic or iterative), Signedness and a
signature.

There are places in the code where the actual function is not required and
is some use cases would make the implementation difficult or fragile.
These code places are where the Bloom filter has been built and the system
is verifying that two filters used the same hash function.  In these cases
the comparison is the hashName, processType and Signedness.  In cases where
the bloom filters are stored in a database retrieval would mean some sort
of serialization/deserialization of the hash function or ensure that the
hash function is otherwise available.  This is problematic.

The provider was added in a nod to a future factory that would follow the
JCA pattern and allow implementations of multiple providers.

The signature was added to support a requested quick check.  The signature
is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).

There were suggestions to create an enum of HashFunctions controlled  by
the Collections.  I think that this adds a layer of coordination and
management on the Collections team that as a team we may not want to take
on.  In addition, it makes it almost impossible for 3rd party users to
create new hash functions and test them with the library.

I believe the current implementation provides the minimal information
necessary to determine if two functions are supposed to produce the same
result.  In my mind the signature and provider methods are extra and not
necessary but desirable.

I think this is a summary of the open discussion.


On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski <gi...@gmail.com> wrote:

> Le mer. 8 janv. 2020 à 15:15, Gary Gregory <ga...@gmail.com> a
> écrit :
> >
> > I think it is time to bring this PR in and make any adjustments within
> > master beyond that. This will be quicker and simpler than going round and
> > round for simple things like Javadoc tweaks and small non-functional
> > changes (formatting, variable names, and so on.) I'll proceed with that
> > tonight.
>
> Design issues were raised on the ML: With no agreement and no opinions
> other than Claude's and mine, things stayed where they were.
>
> Gilles
>
> >> [...]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren  

Re: [collections] bloom filters comments

Posted by Claude Warren <cl...@xenei.com>.
I believe the issue (I think history is at
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17003600)
is about the identification of hash implementations.

Currently there are a couple of classes involved:

Hasher interface, has a method that returns a HashFunctionIdentity and a
method that returns an iterator of enabled bits.  There are a couple of
implementations of Hasher:  the DynamicHasher contains buffers that are
passed to the hash function several times, the StaticHasher contains a list
of bits enabled by a hasher for a specific Shape.

HashFunction interface: extends HashFunctionIdentity and adds a method that
calls the actual hash function.

HashFunctionIdentity: contains the name of the hash function, the name of
the provider, the processType (cyclic or iterative), Signedness and a
signature.

There are places in the code where the actual function is not required and
is some use cases would make the implementation difficult or fragile.
These code places are where the Bloom filter has been built and the system
is verifying that two filters used the same hash function.  In these cases
the comparison is the hashName, processType and Signedness.  In cases where
the bloom filters are stored in a database retrieval would mean some sort
of serialization/deserialization of the hash function or ensure that the
hash function is otherwise available.  This is problematic.

The provider was added in a nod to a future factory that would follow the
JCA pattern and allow implementations of multiple providers.

The signature was added to support a requested quick check.  The signature
is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).

There were suggestions to create an enum of HashFunctions controlled  by
the Collections.  I think that this adds a layer of coordination and
management on the Collections team that as a team we may not want to take
on.  In addition, it makes it almost impossible for 3rd party users to
create new hash functions and test them with the library.

I believe the current implementation provides the minimal information
necessary to determine if two functions are supposed to produce the same
result.  In my mind the signature and provider methods are extra and not
necessary but desirable.

I think this is a summary of the open discussion.


On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski <gi...@gmail.com> wrote:

> Le mer. 8 janv. 2020 à 15:15, Gary Gregory <ga...@gmail.com> a
> écrit :
> >
> > I think it is time to bring this PR in and make any adjustments within
> > master beyond that. This will be quicker and simpler than going round and
> > round for simple things like Javadoc tweaks and small non-functional
> > changes (formatting, variable names, and so on.) I'll proceed with that
> > tonight.
>
> Design issues were raised on the ML: With no agreement and no opinions
> other than Claude's and mine, things stayed where they were.
>
> Gilles
>
> >> [...]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] bloom filters comments

Posted by Gilles Sadowski <gi...@gmail.com>.
Le mer. 8 janv. 2020 à 15:15, Gary Gregory <ga...@gmail.com> a écrit :
>
> I think it is time to bring this PR in and make any adjustments within
> master beyond that. This will be quicker and simpler than going round and
> round for simple things like Javadoc tweaks and small non-functional
> changes (formatting, variable names, and so on.) I'll proceed with that
> tonight.

Design issues were raised on the ML: With no agreement and no opinions
other than Claude's and mine, things stayed where they were.

Gilles

>> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org