You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2007/10/22 20:44:57 UTC

HBase Bloom filters hash

Hi,

I'm curious why the hashing function that these filters use is based on 
SHA-1 (which is relatively slow to compute) instead of a bunch of fast 
and simple non-cryptographic functions such as Jenkins' hash (see 
http://bretm.home.comcast.net/hash/7.html for the evaluation of Jenkins 
hash).

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: HBase Bloom filters hash

Posted by Andrzej Bialecki <ab...@getopt.org>.

Jim Kellerman wrote:
>> -----Original Message-----
>> From: Andrzej Bialecki [mailto:ab@getopt.org]
>> Sent: Monday, October 22, 2007 11:45 AM
>> To: hadoop-dev@lucene.apache.org
>> Subject: HBase Bloom filters hash
>>
>> Hi,
>>
>> I'm curious why the hashing function that these filters use
>> is based on SHA-1 (which is relatively slow to compute)
>> instead of a bunch of fast and simple non-cryptographic
>> functions such as Jenkins' hash (see
>> http://bretm.home.comcast.net/hash/7.html
>> for the evaluation of Jenkins hash).
> 
> The reason for SHA-1 is that it was what came with the open
> source bloom filter implementation we used. We've been focused
> on just getting things to work and not on performance, yet.
> 
> If you'd like to open a Jira, it will be on the list of things
> to do - sometime. If this is really important to you, how about
> submitting a patch? There's mostly just the two of us, +
> contributors, so we have to put our priorities on bug fixing
> and making HBase robust before we get around to adding more
> features or doing performance analysis.
> 
> We'd really like to get more contributions... the project would
> mature much more rapidly.

That's perfectly understandable. I'm still in the discovery phase, and 
it was something that caught my eye. I agree, it's a relatively low 
priotity - I happen to have a Jenkins hash impl. in Java, if I find some 
free time I'll prepare a patch.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

RE: HBase Bloom filters hash

Posted by Jim Kellerman <ji...@powerset.com>.

> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org]
> Sent: Monday, October 22, 2007 11:45 AM
> To: hadoop-dev@lucene.apache.org
> Subject: HBase Bloom filters hash
>
> Hi,
>
> I'm curious why the hashing function that these filters use
> is based on SHA-1 (which is relatively slow to compute)
> instead of a bunch of fast and simple non-cryptographic
> functions such as Jenkins' hash (see
> http://bretm.home.comcast.net/hash/7.html
> for the evaluation of Jenkins hash).

The reason for SHA-1 is that it was what came with the open
source bloom filter implementation we used. We've been focused
on just getting things to work and not on performance, yet.

If you'd like to open a Jira, it will be on the list of things
to do - sometime. If this is really important to you, how about
submitting a patch? There's mostly just the two of us, +
contributors, so we have to put our priorities on bug fixing
and making HBase robust before we get around to adding more
features or doing performance analysis.

We'd really like to get more contributions... the project would
mature much more rapidly.

Hope this helps.

---
Jim Kellerman, Senior Engineer; Powerset
jim@powerset.com