Posted to dev@hbase.apache.org by "Jonathan Gray (JIRA)" <ji...@apache.org> on 2009/03/06 21:21:56 UTC

[jira] Updated: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

     [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1246:
---------------------------------

    Attachment: ByteBloomFilter.java

This is a custom Bloom filter implementation we have used internally.  I have modified it to back the bit vector with a byte[] rather than a BitSet.
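A byte[]-backed bit vector packs 8 bits per byte using shifts and masks. A minimal sketch of the idea (hypothetical class and method names, not the attached ByteBloomFilter.java itself):

```java
// Sketch of a byte[]-backed bit vector: 8 bits per byte,
// addressed with shift/mask bit ops.
public class ByteBitVector {
    private final byte[] bits;

    public ByteBitVector(int numBits) {
        // round up to a whole number of bytes
        this.bits = new byte[(numBits + 7) / 8];
    }

    public void set(int idx) {
        // idx >> 3 is idx / 8 (byte index); idx & 7 is idx % 8 (bit in byte)
        bits[idx >> 3] |= (byte) (1 << (idx & 7));
    }

    public boolean get(int idx) {
        return (bits[idx >> 3] & (1 << (idx & 7))) != 0;
    }

    public int sizeInBytes() {
        return bits.length;
    }
}
```

With the numbers from the test output below, 9585 bits round up to 1199 bytes, versus roughly one byte (or more) per bit for the old representations.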

It also has an optimized, efficient Writable implementation (plus the static serialize/deserialize methods that we use).
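The compact serialization follows from the byte[] representation: write the filter parameters and then the raw bytes. A hedged sketch of that shape (hypothetical class and field names; a real Hadoop Writable would put the same logic in write()/readFields()):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch: serialize a small header plus the raw byte[]
// bit vector, so the on-disk size stays close to the vector itself.
public class ByteBloomSerde {
    int hashCount;
    byte[] bits;

    void write(DataOutput out) throws IOException {
        out.writeInt(hashCount);
        out.writeInt(bits.length);
        out.write(bits);            // raw bytes, no per-bit expansion
    }

    void readFields(DataInput in) throws IOException {
        hashCount = in.readInt();
        bits = new byte[in.readInt()];
        in.readFully(bits);
    }
}
```

This is why the serialized size reported below (1230 bytes) is only slightly larger than the 1199-byte bit vector: the overhead is just the fixed-size header.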

The way I'm actually doing the hashing is probably non-optimal.  Our use of it always took a long as the unit, but we need (byte[], offset, len) as the unit.  This implementation generates an MD5 hash of the key, packs the digest into a long, instantiates a Random with that long as the seed, and then uses Random.nextInt() for each hash function.  We should take this and pair it with Jenkins hash, or whatever is being used up in Hadoop.
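For concreteness, the scheme described above can be sketched like this (hypothetical class and method names; the attached file may differ in detail):

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

// Sketch of the described scheme: MD5 the key, fold the first 8 digest
// bytes into a long, seed a Random with it, and draw one nextInt() per
// hash function. Deterministic per key, but heavier than Jenkins hash.
public class Md5SeededHash {
    public static int[] hashes(byte[] key, int off, int len,
                               int hashCount, int vectorSize) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(key, off, len);
            // take the first 8 of the 16 digest bytes as the seed
            long seed = ByteBuffer.wrap(md5.digest()).getLong();
            Random rng = new Random(seed);
            int[] result = new int[hashCount];
            for (int i = 0; i < hashCount; i++) {
                result[i] = rng.nextInt(vectorSize);  // one bit index per hash
            }
            return result;
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);  // MD5 is always available in the JDK
        }
    }
}
```

The same key always seeds the same Random, so membership tests see the same bit indices as inserts, which is all a Bloom filter needs from its hash family.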

main() contains some basic unit tests.  Everything does appear to work and is WAY more efficient:

{code}
Bloom filter initialized, 9585 bits, 1199 bytes, 7 hashes, 0.01 error rate, up to 1000 elements
Bloom bit vector contains 9585 bits
Serialized as 1230 bytes
{code}

The in-memory size will be similarly efficient with the byte[] in place of the BitSet (or boolean[]) used before.

> BloomFilter's use of BitSet is too inefficient
> ----------------------------------------------
>
>                 Key: HBASE-1246
>                 URL: https://issues.apache.org/jira/browse/HBASE-1246
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>         Environment: Java 1.6, OSX 64 bit
>            Reporter: ryan rawson
>            Assignee: ryan rawson
>             Fix For: 0.20.0
>
>         Attachments: ByteBloomFilter.java
>
>
> From the logfile run of TestBloomFilter with special SizeOf agent jar:
> Writing bloom filter for: hdfs://localhost:64003/user/ryan/testComputedParameters/1278366260/contents/6159869037185296839 for size: 100
> 2009-03-06 01:54:25,491 DEBUG [RegionServer:0.cacheFlusher] regionserver.StoreFile$StoreFileWriter(319): New bloom filter: vectorSize: 1175 hash_count: 5 numKeys: 100
> Serialized bloomfilter size: 160
> In memory bf size: 1248
> As we can see, the bit vector is 1175 bits, and the serialized size is fairly compact - 160 bytes.
> But the in-memory size is nearly 10x bigger than it has to be.  Looking in BloomFilter we see:
>   BitSet bits;
> is the only field.
> Clearly the BitSet is using 1 byte per bit.  That is an 8x expansion over where we should be.
> Considering every HFile could potentially have a bloom filter, and bloom filters are more likely to have bit vector sizes of 10,000-100,000, we should do something about this.  That is: write our own bit set that uses byte[] and bit ops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.