Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2009/02/13 19:38:59 UTC

[jira] Created: (HBASE-1200) Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20

Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20
----------------------------------------------------------------------------------------------------

                 Key: HBASE-1200
                 URL: https://issues.apache.org/jira/browse/HBASE-1200
             Project: Hadoop HBase
          Issue Type: Task
            Reporter: stack
            Assignee: stack
             Fix For: 0.20.0


Add bloomfiltering to hfile.  Should it be optional or always on?  Currently, we bloom filter rows only, not the column + ts component, which seems a good place to start.  But we size the bloomfilter with the number of entries we are about to flush, and since a row usually spans several entries, we'd usually be making a filter that is too big.  How do we figure out how many rows are in the flush?  We should use the DynamicBloomFilter as Andrzej does up in Hadoop's BloomMapFile: start small and let it resize as entries are added.
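
To make "start small and let it resize" concrete, here is a minimal standalone sketch of a growable Bloom filter. It does not use Hadoop's DynamicBloomFilter; it only illustrates the general idea of chaining fixed-size filters and opening a new one once the current one has taken its quota of keys. The class name GrowableBloomSketch, the hash functions, and the sizes are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch: a Bloom filter that grows by adding fixed-size rows. */
class GrowableBloomSketch {
    private final int bitsPerRow;   // bits in each fixed-size filter
    private final int hashCount;    // hash functions applied per key
    private final int keysPerRow;   // keys a row accepts before a new row is opened
    private final List<BitSet> rows = new ArrayList<>();
    private int keysInCurrentRow = 0;

    GrowableBloomSketch(int bitsPerRow, int hashCount, int keysPerRow) {
        this.bitsPerRow = bitsPerRow;
        this.hashCount = hashCount;
        this.keysPerRow = keysPerRow;
        rows.add(new BitSet(bitsPerRow));
    }

    /** Adds a key to the newest row, opening a fresh row once the current one is full. */
    void add(byte[] key) {
        if (keysInCurrentRow >= keysPerRow) {
            rows.add(new BitSet(bitsPerRow));
            keysInCurrentRow = 0;
        }
        BitSet current = rows.get(rows.size() - 1);
        for (int i = 0; i < hashCount; i++) {
            current.set(position(key, i));
        }
        keysInCurrentRow++;
    }

    /** Returns false only if the key was definitely never added. */
    boolean mightContain(byte[] key) {
        for (BitSet row : rows) {
            boolean allSet = true;
            for (int i = 0; i < hashCount && allSet; i++) {
                allSet = row.get(position(key, i));
            }
            if (allSet) {
                return true;
            }
        }
        return false;
    }

    int rowCount() {
        return rows.size();
    }

    /** i-th bit position for a key, using double hashing over two simple hashes. */
    private int position(byte[] key, int i) {
        int h1 = 17;
        int h2 = 0x811C9DC5;
        for (byte b : key) {
            h1 = 31 * h1 + b;
            h2 = (h2 ^ (b & 0xff)) * 16777619;
        }
        return Math.floorMod(h1 + i * h2, bitsPerRow);
    }

    public static void main(String[] args) {
        GrowableBloomSketch filter = new GrowableBloomSketch(1024, 3, 100);
        for (int i = 0; i < 1000; i++) {
            filter.add(("row-" + i).getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("rows allocated: " + filter.rowCount());
        System.out.println("row-42 present? "
            + filter.mightContain("row-42".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The trade-off in this kind of design is that a lookup has to probe every row, and the last row is usually only partly used, so some overhead remains compared with a filter sized from an exact count.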

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Assigned: (HBASE-1200) Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20

Posted by Ryan Rawson <ry...@gmail.com>.
I'm going to give this a shot tomorrow.

Plan is to use row:cf:colqual as the bloom-filter 'key'.  That way we can
test if any specific row/col is in any specific file.  I might also add
'row' only as another bloom filter to test.

Note that in general this would only be useful once we know that a specific
row/column exists and want to optimize how many files we have to seek/read.
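
A rough sketch of that plan, under stated assumptions: the key is built by joining row, family and qualifier with ':' (a hypothetical encoding that ignores separator collisions), each store file carries one small filter, and a read only touches files whose filter cannot rule the key out. The class BloomKeySketch, the filter size and the hash functions are illustrative and are not HBase code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: row:cf:colqual Bloom keys used to skip store files. */
class BloomKeySketch {
    static final int BITS = 1 << 12;  // bits per file filter (illustrative)
    static final int HASHES = 3;      // hash functions per key (illustrative)

    /** Builds the row:cf:colqual key; the ':' separator is an assumption. */
    static byte[] bloomKey(String row, String family, String qualifier) {
        return (row + ":" + family + ":" + qualifier).getBytes(StandardCharsets.UTF_8);
    }

    /** i-th bit position for a key, using double hashing over two simple hashes. */
    static int position(byte[] key, int i) {
        int h1 = 17;
        int h2 = 0x811C9DC5;
        for (byte b : key) {
            h1 = 31 * h1 + b;
            h2 = (h2 ^ (b & 0xff)) * 16777619;
        }
        return Math.floorMod(h1 + i * h2, BITS);
    }

    static void add(BitSet filter, byte[] key) {
        for (int i = 0; i < HASHES; i++) {
            filter.set(position(key, i));
        }
    }

    /** Returns false only if the key was definitely never added to this filter. */
    static boolean mightContain(BitSet filter, byte[] key) {
        for (int i = 0; i < HASHES; i++) {
            if (!filter.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** Names of the files whose filter cannot rule the key out; only these need a seek. */
    static List<String> filesToRead(Map<String, BitSet> filtersByFile, byte[] key) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, BitSet> entry : filtersByFile.entrySet()) {
            if (mightContain(entry.getValue(), key)) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Map<String, BitSet> filtersByFile = new LinkedHashMap<>();
        BitSet fileA = new BitSet(BITS);
        add(fileA, bloomKey("row1", "info", "name"));
        BitSet fileB = new BitSet(BITS);
        add(fileB, bloomKey("row2", "info", "name"));
        filtersByFile.put("fileA", fileA);
        filtersByFile.put("fileB", fileB);
        System.out.println(filesToRead(filtersByFile, bloomKey("row1", "info", "name")));
    }
}
```

A second filter keyed on the row alone would follow the same pattern, with only the row bytes passed to add and mightContain.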

-ryan

On Thu, Mar 5, 2009 at 1:27 AM, ryan rawson (JIRA) <ji...@apache.org> wrote:

>
>     [ https://issues.apache.org/jira/browse/HBASE-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> ryan rawson reassigned HBASE-1200:
> ----------------------------------
>
>    Assignee: ryan rawson  (was: stack)
>
> > Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20
> > ----------------------------------------------------------------------------------------------------
> >
> >                 Key: HBASE-1200
> >                 URL: https://issues.apache.org/jira/browse/HBASE-1200
> >             Project: Hadoop HBase
> >          Issue Type: Task
> >            Reporter: stack
> >            Assignee: ryan rawson
> >             Fix For: 0.20.0

[jira] Commented: (HBASE-1200) Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20

Posted by "Erik Holstad (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674308#action_12674308 ] 

Erik Holstad commented on HBASE-1200:
-------------------------------------

I think the user should have the option to turn bloom filters off, even though I can't really see
why you wouldn't want them. I also think that we should move towards row+column, like Bigtable.
Using the dynamic bloom filter seems like a reasonable way to go; the only drawback I can see is
that we will still have some overhead, even though it is smaller than now. So, if possible, wait
until we know the exact number of entries and then create the filter. I'm not sure how much time
the flush would lose doing it this way, but that could be tested.
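
As a sketch of the "wait until we know the exact number" option: because the flushed entries are sorted, the distinct rows can be counted in one pass by comparing neighbours, and the filter can then be sized from that count. The sizing below uses the textbook formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2, not HBase's or Hadoop's actual code; the class ExactSizingSketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: size a Bloom filter from an exact distinct-row count. */
class ExactSizingSketch {

    /** Counts distinct rows in one pass; relies on the input being sorted by row. */
    static int countDistinctRows(List<String> sortedRows) {
        int count = 0;
        String previous = null;
        for (String row : sortedRows) {
            if (!row.equals(previous)) {
                count++;
                previous = row;
            }
        }
        return count;
    }

    /** Textbook bit count for n keys at false-positive rate p: m = -n ln p / (ln 2)^2. */
    static long optimalBits(int n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Matching number of hash functions: k = (m / n) ln 2, never less than 1. */
    static int optimalHashCount(long bits, int n) {
        return Math.max(1, (int) Math.round((double) bits / n * Math.log(2)));
    }

    public static void main(String[] args) {
        // Six entries in the flush, but only three distinct rows.
        List<String> rowOfEachEntry = Arrays.asList("a", "a", "b", "c", "c", "c");
        int rows = countDistinctRows(rowOfEachEntry);
        long bitsForRows = optimalBits(rows, 0.01);
        long bitsForEntries = optimalBits(rowOfEachEntry.size(), 0.01);
        System.out.println("distinct rows: " + rows);
        System.out.println("bits sized by rows: " + bitsForRows
            + ", hash functions: " + optimalHashCount(bitsForRows, rows));
        System.out.println("bits sized by entries: " + bitsForEntries);
    }
}
```

The cost is an extra counting pass (or a counter kept up to date while writing) before the filter can be allocated, which is the flush-time penalty that would need to be measured.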

> Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1200
>                 URL: https://issues.apache.org/jira/browse/HBASE-1200
>             Project: Hadoop HBase
>          Issue Type: Task
>            Reporter: stack
>            Assignee: stack
>             Fix For: 0.20.0


[jira] Assigned: (HBASE-1200) Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson reassigned HBASE-1200:
----------------------------------

    Assignee: ryan rawson  (was: stack)

> Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1200
>                 URL: https://issues.apache.org/jira/browse/HBASE-1200
>             Project: Hadoop HBase
>          Issue Type: Task
>            Reporter: stack
>            Assignee: ryan rawson
>             Fix For: 0.20.0


[jira] Updated: (HBASE-1200) Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1200:
---------------------------------

    Fix Version/s:     (was: 0.20.0)
                   0.21.0

> Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1200
>                 URL: https://issues.apache.org/jira/browse/HBASE-1200
>             Project: Hadoop HBase
>          Issue Type: Task
>            Reporter: stack
>            Assignee: ryan rawson
>             Fix For: 0.21.0


[jira] Commented: (HBASE-1200) Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679251#action_12679251 ] 

stack commented on HBASE-1200:
------------------------------

Let me know if you want me to put the hadoop 0.20.0 jars in TRUNK, Ryan.

> Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1200
>                 URL: https://issues.apache.org/jira/browse/HBASE-1200
>             Project: Hadoop HBase
>          Issue Type: Task
>            Reporter: stack
>            Assignee: ryan rawson
>             Fix For: 0.20.0


[jira] Updated: (HBASE-1200) Add bloomfilters; use dynamicbloomfilter instead of base bloomfilter

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1200:
-------------------------

    Attachment: ryan_bloomfilter.patch

Latest state of RR's bloomfilter work copied from a patch posted to HBASE-1466.

> Add bloomfilters; use dynamicbloomfilter instead of base bloomfilter
> --------------------------------------------------------------------
>
>                 Key: HBASE-1200
>                 URL: https://issues.apache.org/jira/browse/HBASE-1200
>             Project: Hadoop HBase
>          Issue Type: Task
>            Reporter: stack
>            Assignee: ryan rawson
>             Fix For: 0.21.0
>
>         Attachments: ryan_bloomfilter.patch


[jira] Updated: (HBASE-1200) Add bloomfilters; use dynamicbloomfilter instead of base bloomfilter

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1200:
-------------------------

    Summary: Add bloomfilters; use dynamicbloomfilter instead of base bloomfilter  (was: Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20)

Changed subject to be more general, more about adding bloomfilters, rather than prescriptive on how to do it.

> Add bloomfilters; use dynamicbloomfilter instead of base bloomfilter
> --------------------------------------------------------------------
>
>                 Key: HBASE-1200
>                 URL: https://issues.apache.org/jira/browse/HBASE-1200
>             Project: Hadoop HBase
>          Issue Type: Task
>            Reporter: stack
>            Assignee: ryan rawson
>             Fix For: 0.21.0


[jira] Commented: (HBASE-1200) Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673349#action_12673349 ] 

stack commented on HBASE-1200:
------------------------------

The thing to do would be to run with them on for a while and then make a call before release.

Here is the relevant code from Hadoop's BloomMapFile:

{code}
    private synchronized void initBloomFilter(Configuration conf) {
      numKeys = conf.getInt("io.mapfile.bloom.size", 1024 * 1024);
      // vector size should be <code>-kn / (ln(1 - c^(1/k)))</code> bits for
      // single key, where <code>k</code> is the number of hash functions,
      // <code>n</code> is the number of keys and <code>c</code> is the desired
      // max. error rate.
      // Our desired error rate is by default 0.005, i.e. 0.5%
      float errorRate = conf.getFloat("io.mapfile.bloom.error.rate", 0.005f);
      vectorSize = (int)Math.ceil((double)(-HASH_COUNT * numKeys) /
          Math.log(1.0 - Math.pow(errorRate, 1.0/HASH_COUNT)));
      bloomFilter = new DynamicBloomFilter(vectorSize, HASH_COUNT,
          Hash.getHashType(conf), numKeys);
    }
{code}
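
To get a feel for the numbers, the vector-size formula from the excerpt above, -kn / ln(1 - c^(1/k)), can be evaluated on its own. This is a standalone sketch, not Hadoop code: the class VectorSizeSketch is hypothetical, and the hash count of 5 is an assumed value for HASH_COUNT, which the excerpt does not show. The multiplication is done in double here to sidestep int overflow for very large key counts.

```java
/** Hypothetical sketch: evaluates the vector-size formula from the excerpt. */
class VectorSizeSketch {

    /**
     * Bits needed for numKeys keys with hashCount hash functions and a maximum
     * error rate of errorRate: -k * n / ln(1 - c^(1/k)).
     */
    static int vectorSize(int hashCount, int numKeys, double errorRate) {
        double numerator = -(double) hashCount * numKeys;
        double denominator = Math.log(1.0 - Math.pow(errorRate, 1.0 / hashCount));
        return (int) Math.ceil(numerator / denominator);
    }

    public static void main(String[] args) {
        int assumedHashCount = 5;        // assumption, not taken from the excerpt
        int defaultKeys = 1024 * 1024;   // default of io.mapfile.bloom.size in the excerpt
        double defaultErrorRate = 0.005; // default of io.mapfile.bloom.error.rate in the excerpt

        int bits = vectorSize(assumedHashCount, defaultKeys, defaultErrorRate);
        System.out.println("bits: " + bits);
        System.out.println("approx. megabytes: " + bits / 8.0 / (1024 * 1024));
        System.out.println("bits per key: " + (double) bits / defaultKeys);
    }
}
```

With the defaults in the excerpt the filter is sized for about a million keys up front, which is why a flush holding far fewer rows would carry a filter much larger than it needs unless the key count is set from the actual data.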

> Add bloomfilters to hfile; use dynamicbloomfilter instead of base bloomfilter; depend on hadoop 0.20
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1200
>                 URL: https://issues.apache.org/jira/browse/HBASE-1200
>             Project: Hadoop HBase
>          Issue Type: Task
>            Reporter: stack
>            Assignee: stack
>             Fix For: 0.20.0