You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2010/04/02 12:57:59 UTC

[collections] and what about 'identity'?

There's a Trove feature I didn't address in the 0.3 version of collections:
control of the hash function for Object types. In Trove, you construct over
a 'strategy' object if you want something like the JDK IdentityHashMap. COLT
didn't do that. A user can get this currently via subclassing, though they
have to subclass the primitive side of the house one-class-at-a-time. Some
choices include:

1) Include Identity flavors ourself.
2) Adopt the strategy pattern, which is an extensibility pattern that allows
users to implement one object and use it with all of the Object hashed
containers.
3) something else I haven't thought of?

Re: [collections] and what about 'identity'?

Posted by Benson Margulies <bi...@gmail.com>.
I'm going to code subclasses in my code to get something done before we get
to the next Mahout release, and that will give me some practical experience
with the whole business.

Meanwhile, consider the case of IdentityHashMap. If nothing else, the
strategy approach means adding One class to the library, not 6.

On Fri, Apr 2, 2010 at 7:35 AM, Sean Owen <sr...@gmail.com> wrote:

> I'm also thinking of the added complexity in the code, and slight
> possible performance hit to injecting this level of flexibility. My
> gut says it's perhaps questionable, but I base that on nothing but
> gut.
>
> I'm thinking of this versus, in special cases like you mention, just
> proceeding with a wrapper or subclass? how bad is that?
>
> On Fri, Apr 2, 2010 at 12:27 PM, Drew Farris <dr...@gmail.com>
> wrote:
> > On Fri, Apr 2, 2010 at 7:01 AM, Sean Owen <sr...@gmail.com> wrote:
> >> What's the use case for needing to vary the hash function?
> >
> > I was doing something funky with string prefixes the other day and
> > could have used something like this baked into collections already. I
> > vote for the strategy pattern.
> >
>

Re: [collections] and what about 'identity'?

Posted by Sean Owen <sr...@gmail.com>.
I'm also thinking of the added complexity in the code, and slight
possible performance hit to injecting this level of flexibility. My
gut says it's perhaps questionable, but I base that on nothing but
gut.

I'm thinking of this versus, in special cases like you mention, just
proceeding with a wrapper or subclass? how bad is that?

On Fri, Apr 2, 2010 at 12:27 PM, Drew Farris <dr...@gmail.com> wrote:
> On Fri, Apr 2, 2010 at 7:01 AM, Sean Owen <sr...@gmail.com> wrote:
>> What's the use case for needing to vary the hash function?
>
> I was doing something funky with string prefixes the other day and
> could have used something like this baked into collections already. I
> vote for the strategy pattern.
>

Re: [collections] and what about 'identity'?

Posted by Drew Farris <dr...@gmail.com>.
On Fri, Apr 2, 2010 at 7:01 AM, Sean Owen <sr...@gmail.com> wrote:
> What's the use case for needing to vary the hash function?

I was doing something funky with string prefixes the other day and
could have used something like this baked into collections already. I
vote for the strategy pattern.

Re: [collections] and what about 'identity'?

Posted by Dawid Weiss <da...@gmail.com>.
The source code to HPPC is public and accessible, so you are more then
welcome to peek/ contribute/ take whatever you want, Benson.

Dawid

On Fri, Apr 2, 2010 at 10:45 PM, Benson Margulies <bi...@gmail.com> wrote:
> Dawid,
>
> Now I recall why I stopped working on features of Mahout collections :-)
> HPPC.
>
> We'll see who gets where first.
>
> --benson
>
>
> On Fri, Apr 2, 2010 at 10:06 AM, Dawid Weiss <da...@gmail.com> wrote:
>
>> > What's the use case for needing to vary the hash function? It's one of
>> > those things where I assume there are incorrect ways to do it, and
>> > correct ways, and among the correct ways fairly clear arguments about
>> > which function will be better -- i.e. the object should provide the
>> > best function.
>>
>> Unfortunately this is not true -- just recently I've hit a use case
>> where the keys stored were Long values and their distribution had a
>> very low variance in the lower bits. HPPC implemented open hashing
>> using 2^n arrays and hashes were modulo bitmask... this caused really,
>> really long conflict chains for values that were actually very
>> different. I looked at how JDK's HashMap solves this problem -- they
>> do a simple rehashing scheme internally (so it's object hash and then
>> remixing hash in a cascade). I've finally decided to allow external
>> hash functions AND changed the _default_ hash function used for
>> "remixing" to be murmur hash. Performance benchmarks show this yields
>> virtually no degradation in execution time (the CPUs seem to spend
>> most of their time waiting on cache misses anyway, so internal
>> rehashing is not an issue).
>>
>> I must also apologize for a bit of inactivity with HPPC... Like I
>> said, we have released it internally on our "labs" Web site here:
>>
>> http://labs.carrotsearch.com/hppc.html
>>
>> It doesn't mean we turn our backs on contributing HPPC to Mahout --
>> the opposite, we would love to do it. But contrary to what I
>> originally thought (to push HPPC to Mahout as soon as possible) I kind
>> of grew reluctant because so many things are missing (equals/hashcode,
>> java collections adapters) or can be improved (documentation, faster
>> iterators).
>>
>> So... I'm still going to experiment with HPPC in our labs, especially
>> API-wise, release one or two versions in between and then kindly ask
>> you to peek at the final (?) result and consider moving the code under
>> Mahout umbrella. Sounds good?
>>
>> Dawid
>>
>

Re: [collections] and what about 'identity'?

Posted by Benson Margulies <bi...@gmail.com>.
Dawid,

Now I recall why I stopped working on features of Mahout collections :-)
HPPC.

We'll see who gets where first.

--benson


On Fri, Apr 2, 2010 at 10:06 AM, Dawid Weiss <da...@gmail.com> wrote:

> > What's the use case for needing to vary the hash function? It's one of
> > those things where I assume there are incorrect ways to do it, and
> > correct ways, and among the correct ways fairly clear arguments about
> > which function will be better -- i.e. the object should provide the
> > best function.
>
> Unfortunately this is not true -- just recently I've hit a use case
> where the keys stored were Long values and their distribution had a
> very low variance in the lower bits. HPPC implemented open hashing
> using 2^n arrays and hashes were modulo bitmask... this caused really,
> really long conflict chains for values that were actually very
> different. I looked at how JDK's HashMap solves this problem -- they
> do a simple rehashing scheme internally (so it's object hash and then
> remixing hash in a cascade). I've finally decided to allow external
> hash functions AND changed the _default_ hash function used for
> "remixing" to be murmur hash. Performance benchmarks show this yields
> virtually no degradation in execution time (the CPUs seem to spend
> most of their time waiting on cache misses anyway, so internal
> rehashing is not an issue).
>
> I must also apologize for a bit of inactivity with HPPC... Like I
> said, we have released it internally on our "labs" Web site here:
>
> http://labs.carrotsearch.com/hppc.html
>
> It doesn't mean we turn our backs on contributing HPPC to Mahout --
> the opposite, we would love to do it. But contrary to what I
> originally thought (to push HPPC to Mahout as soon as possible) I kind
> of grew reluctant because so many things are missing (equals/hashcode,
> java collections adapters) or can be improved (documentation, faster
> iterators).
>
> So... I'm still going to experiment with HPPC in our labs, especially
> API-wise, release one or two versions in between and then kindly ask
> you to peek at the final (?) result and consider moving the code under
> Mahout umbrella. Sounds good?
>
> Dawid
>

Re: [collections] and what about 'identity'?

Posted by Dawid Weiss <da...@gmail.com>.
> What's the use case for needing to vary the hash function? It's one of
> those things where I assume there are incorrect ways to do it, and
> correct ways, and among the correct ways fairly clear arguments about
> which function will be better -- i.e. the object should provide the
> best function.

Unfortunately this is not true -- just recently I've hit a use case
where the keys stored were Long values and their distribution had a
very low variance in the lower bits. HPPC implemented open hashing
using 2^n arrays and hashes were modulo bitmask... this caused really,
really long conflict chains for values that were actually very
different. I looked at how JDK's HashMap solves this problem -- they
do a simple rehashing scheme internally (so it's object hash and then
remixing hash in a cascade). I've finally decided to allow external
hash functions AND changed the _default_ hash function used for
"remixing" to be murmur hash. Performance benchmarks show this yields
virtually no degradation in execution time (the CPUs seem to spend
most of their time waiting on cache misses anyway, so internal
rehashing is not an issue).

I must also apologize for a bit of inactivity with HPPC... Like I
said, we have released it internally on our "labs" Web site here:

http://labs.carrotsearch.com/hppc.html

It doesn't mean we turn our backs on contributing HPPC to Mahout --
the opposite, we would love to do it. But contrary to what I
originally thought (to push HPPC to Mahout as soon as possible) I kind
of grew reluctant because so many things are missing (equals/hashcode,
java collections adapters) or can be improved (documentation, faster
iterators).

So... I'm still going to experiment with HPPC in our labs, especially
API-wise, release one or two versions in between and then kindly ask
you to peek at the final (?) result and consider moving the code under
Mahout umbrella. Sounds good?

Dawid

Re: [collections] and what about 'identity'?

Posted by Benson Margulies <bi...@gmail.com>.
On Fri, Apr 2, 2010 at 7:01 AM, Sean Owen <sr...@gmail.com> wrote:

> What's the use case for needing to vary the hash function? It's one of
> those things where I assume there are incorrect ways to do it, and
> correct ways, and among the correct ways fairly clear arguments about
> which function will be better -- i.e. the object should provide the
> best function.
>
> I don't think subclassing or wrapping is a big deal in the rare case
> you need this. But perhaps I am missing the use case.
>

I'm not the inventor of the strategy pattern here so I'm not a hard
advocate. What seemed relevant to me was how irritating it would be to have
to resubclass OpenObjectXXXHashTable<T> for each of the XXX primitives if
you needed more than one.

Re: [collections] and what about 'identity'?

Posted by Sean Owen <sr...@gmail.com>.
What's the use case for needing to vary the hash function? It's one of
those things where I assume there are incorrect ways to do it, and
correct ways, and among the correct ways fairly clear arguments about
which function will be better -- i.e. the object should provide the
best function.

I don't think subclassing or wrapping is a big deal in the rare case
you need this. But perhaps I am missing the use case.