Posted to dev@hbase.apache.org by "ryan rawson (JIRA)" <ji...@apache.org> on 2009/03/06 11:05:57 UTC

[jira] Created: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

BloomFilter's use of BitSet is too inefficient
----------------------------------------------

                 Key: HBASE-1246
                 URL: https://issues.apache.org/jira/browse/HBASE-1246
             Project: Hadoop HBase
          Issue Type: Bug
    Affects Versions: 0.20.0
         Environment: Java 1.6, OSX 64 bit
            Reporter: ryan rawson
             Fix For: 0.20.0


From the logfile run of TestBloomFilter with special SizeOf agent jar:

Writing bloom filter for: hdfs://localhost:64003/user/ryan/testComputedParameters/1278366260/contents/6159869037185296839 for size: 100
2009-03-06 01:54:25,491 DEBUG [RegionServer:0.cacheFlusher] regionserver.StoreFile$StoreFileWriter(319): New bloom filter: vectorSize: 1175 hash_count: 5 numKeys: 100
Serialized bloomfilter size: 160
In memory bf size: 1248


As we can see, the bit vector is 1175 bits, and the serialized size is fairly compact: 160 bytes.

But the in-memory size is nearly 10x bigger than it needs to be.  Looking in BloomFilter we see:

  BitSet bits;

is the only field.

The BitSet seems to be using 1 byte per bit, which is an 8x expansion over where we should be.

Considering every HFile could potentially have a bloom filter, and bloom filters are more likely to have bit vector sizes of 10,000-100,000 bits, we should do something about this.  That is: write our own bit set backed by a byte[] and bit ops.
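
For illustration, a minimal sketch of such a byte[]-backed bit set might look like the following (hypothetical class, not existing HBase code):

{code}
// Minimal byte[]-backed bit set: n bits cost roughly n/8 bytes of heap.
public class ByteBitSet {
  private final byte[] bits;

  public ByteBitSet(int nbits) {
    this.bits = new byte[(nbits + 7) / 8];     // 8 bits per byte
  }

  public void set(int idx) {
    bits[idx >> 3] |= (byte) (1 << (idx & 7)); // idx/8 selects the byte, idx%8 the bit
  }

  public boolean get(int idx) {
    return (bits[idx >> 3] & (1 << (idx & 7))) != 0;
  }
}
{code}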

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806627#action_12806627 ] 

stack commented on HBASE-1246:
------------------------------

We'll be reviving bloomfilters in the near future.  Let's keep this issue around; it'll remind us to verify how much space a bit actually occupies in our implementation.



[jira] Commented: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806619#action_12806619 ] 

Jean-Daniel Cryans commented on HBASE-1246:
-------------------------------------------

We don't use Bloom Filters at the moment; can we close this?



[jira] Resolved: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-1246.
--------------------------

    Resolution: Fixed

Resolving



[jira] Updated: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1246:
---------------------------------

    Attachment: ByteBloomFilter.java

This is a custom bloom implementation we have used internally.  I have modified it to use a byte[] to back the bit vector rather than a BitSet.

It also has an optimized/efficient Writable implementation (and static serialize/deserialize that we use).

The way I'm actually doing the hashing is probably non-optimal.  Our use of it always took a long as the unit, whereas here we need (byte[], offset, len) as the unit.  This implementation generates an MD5 hash of the key, folds that into a long, instantiates a Random with the MD5-derived long as the seed, and then uses Random.nextInt() for each hash.  We should probably take this and pair it with Jenkins hashing or whatever is being used up in Hadoop.
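
A rough sketch of that hashing scheme (hypothetical names; the attached ByteBloomFilter.java may differ in detail):

{code}
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.Random;

public class Md5RandomHash {
  /** Returns one bit index per hash function for the given key slice. */
  public static int[] hashIndexes(byte[] key, int offset, int len,
                                  int hashCount, int vectorSize) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    md5.update(key, offset, len);
    long seed = ByteBuffer.wrap(md5.digest()).getLong(); // fold first 8 digest bytes into a long
    Random rnd = new Random(seed);                       // seed Random with the MD5-derived long
    int[] indexes = new int[hashCount];
    for (int i = 0; i < hashCount; i++) {
      indexes[i] = rnd.nextInt(vectorSize);              // one nextInt() per hash function
    }
    return indexes;
  }
}
{code}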

main() contains some basic unit testing.  Everything does appear to work and is WAY more efficient.

{code}
Bloom filter initialized, 9585 bits, 1199 bytes, 7 hashes, 0.01 error rate, up to 1000 elements
Bloom bit vector contains 9585 bits
Serialized as 1230 bytes
{code}

In-memory size will be similarly efficient with the byte[] instead of the BitSet or boolean[] as before.
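
For reference, the bit and hash counts in the output above line up with the standard Bloom filter sizing formulas; the following is an illustrative computation, not necessarily the exact rounding the attachment uses:

{code}
public class BloomSizing {
  public static void main(String[] args) {
    int n = 1000;     // expected number of keys
    double p = 0.01;  // target false-positive rate
    // m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions
    long m = Math.round(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    int k = (int) Math.ceil((double) m / n * Math.log(2));
    System.out.println(m + " bits, " + ((m + 7) / 8) + " bytes, " + k + " hashes");
    // prints: 9585 bits, 1199 bytes, 7 hashes
  }
}
{code}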



[jira] Commented: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679650#action_12679650 ] 

stack commented on HBASE-1246:
------------------------------

I thought this had been fixed?  See https://issues.apache.org/jira/browse/HBASE-427.  Maybe we lost Ian's fix down the line?

Also, up on IRC, Johano suggests:

{code}
09:59 < dj_ryan> hmm
09:59 < dj_ryan> we are going to have to write our own BitSet
09:59 < dj_ryan> the java one is too inefficient
09:59 < dj_ryan> it doesnt seem to actually use bits
09:59 < dj_ryan> it uses 1000 bytes to store 1000 bits
09:59 < dj_ryan> :-(
09:59 < dj_ryan> serialized the bitset is 160 bytes, but in memory its 10x bigger.
10:57 < johano> http://code.google.com/p/compressedbitset/
10:57 < johano> http://code.google.com/p/javaewah/
10:57 < johano> might be of use
{code}



[jira] Updated: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1246:
---------------------------------

    Attachment:     (was: ByteBloomFilter.java)



[jira] Assigned: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson reassigned HBASE-1246:
----------------------------------

    Assignee: ryan rawson



[jira] Commented: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679660#action_12679660 ] 

Jonathan Gray commented on HBASE-1246:
--------------------------------------

HBASE-427 fix is actually moving from a boolean[] to BitSet.  BitSet has the same space issues as the boolean[] so it didn't actually solve that.

compressedbitset is a no-go because it's GPL.  I looked at javaewah but it's not clear it's really what we want.  It's not so much that we want to _compress_ the bit array; we just want to store it efficiently.  Compressing a bit array generally keeps and/or operations efficient but can be very poor at get() / containment checks.

Streamy has a custom, stripped-down implementation of blooms in java.  We're currently using BitSet, but now that I've learned about this space inefficiency I'd like to move it to byte[]'s.  Would that be worthwhile doing?  Ryan, let me know what you're working on so we can combine efforts.



[jira] Updated: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1246:
---------------------------------

    Attachment: ByteBloomFilter.java

This version contains a bitvals lookup array for computation efficiency.  If I have a static final in a class, is it the case that we don't pay the memory overhead for each object instantiation?

This also now implements HeapSize; the reported heap size currently does not include the lookup array.

{code}
Bloom bit vector contains 9585 bits
Serialized as 1230 bytes
HeapSize is 1255
{code}
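
(On the question above: a static final field is stored once with the class, not per instance, so the lookup array should not add per-object overhead.)  For illustration, the bitvals idea might look roughly like this (hypothetical names, not the attachment's exact code):

{code}
// Single-bit masks shared by all instances via a static final lookup table,
// so get/set index into the table instead of recomputing (1 << (idx & 7)).
public class ByteBits {
  private static final byte[] bitvals = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, (byte) 0x80
  };

  private final byte[] bits;

  public ByteBits(int nbits) {
    this.bits = new byte[(nbits + 7) / 8];
  }

  public void set(int idx) {
    bits[idx >> 3] |= bitvals[idx & 7];
  }

  public boolean get(int idx) {
    return (bits[idx >> 3] & bitvals[idx & 7]) != 0;
  }
}
{code}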



[jira] Commented: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679713#action_12679713 ] 

Jonathan Gray commented on HBASE-1246:
--------------------------------------

Please note that there are a number of optimizations still to be made in this code w.r.t. the hashing, and also for get/set, by using static arrays instead of doing shifts every time.



[jira] Updated: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1246:
---------------------------------

    Fix Version/s:     (was: 0.20.0)
                   0.21.0



[jira] Commented: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679651#action_12679651 ] 

stack commented on HBASE-1246:
------------------------------

Oh, the other thing is to make the bloomfilter interface take byte[] with offset and length if you can.  That'll work for the current HSK and for the coming ByteBuffer.
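
A sketch of that interface shape (hypothetical names): membership operations take a (byte[], offset, length) slice so callers can pass a piece of an existing buffer without copying.

{code}
public interface BloomFilter {
  void add(byte[] buf, int offset, int len);
  boolean contains(byte[] buf, int offset, int len);
}
{code}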



[jira] Commented: (HBASE-1246) BloomFilter's use of BitSet is too inefficient

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694160#action_12694160 ] 

Jonathan Ellis commented on HBASE-1246:
---------------------------------------

> BitSet has the same space issues as the boolean[] so it didn't actually solve that. 

That is not correct in the Sun SDK.  The version I have (1.6.10) definitely uses one bit per... bit and has the comment

 * @version 1.67, 04/07/06

so I imagine this is the case at least for all versions of 1.6.
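
A quick way to check this claim on a given JDK (illustrative only): java.util.BitSet is backed by a long[] of 64-bit words, so size() reports the allocation rounded up to a multiple of 64 bits rather than one byte per bit.

{code}
import java.util.BitSet;

public class BitSetSize {
  public static void main(String[] args) {
    BitSet bs = new BitSet(1000);
    // size() returns the bits of internal storage actually allocated:
    // 16 long words * 64 bits = 1024, i.e. about 128 bytes for 1000 bits.
    System.out.println(bs.size());
  }
}
{code}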
