Posted to dev@hbase.apache.org by Bruce Williams <wi...@gmail.com> on 2008/11/23 21:31:31 UTC

Bloom filters

I just read that Bigtable's use of Bloom filters for lookup resulted
in considerable performance improvements. Does HBase employ any form
of Bloom filter?

If not, I am willing to look at it.

Bruce Williams

-- 

"Discovering...discovering...we will never cease discovering...
and the end of all our discovering will be
to return to the place where we began
and to know it for the first time."
-T.S. Eliot
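
For readers new to the idea behind the question above, here is a minimal
sketch of a Bloom filter in Java. It is an illustration only, not HBase or
Hadoop code: the class name BloomSketch, the FNV-1a stand-in hash, and the
sizing parameters are all assumptions made for this example. What it tries to
show is that a negative answer is definite, so a lookup can be skipped, while
a positive answer only means "maybe".

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Illustrative Bloom filter; every name here is hypothetical, not an HBase API.
final class BloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a, used only as a simple stand-in hash function.
    private static int fnv1a(byte[] data, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // i-th bit position, derived from two base hashes (double hashing).
    private int position(byte[] key, int i) {
        int h1 = fnv1a(key, 0);
        int h2 = fnv1a(key, h1);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String key) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(k, i));
        }
    }

    // false: the key was definitely never added. true: it may have been added.
    boolean mightContain(String key) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(k, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch(1024, 3);
        filter.add("row-1");
        System.out.println(filter.mightContain("row-1")); // true once added
        System.out.println(filter.mightContain("row-2")); // usually false; false positives are possible
    }
}
```

A "false" from mightContain means the read can be skipped; a "true" still
needs the real lookup, because false positives are possible.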

RE: Bloom filters

Posted by "Jim Kellerman (POWERSET)" <Ji...@microsoft.com>.
I don't think that there are any TFile issues specifically.

I just thought that the bloomfilter map file was postponed because
it was expected that TFile would replace it.

---
Jim Kellerman, Powerset (Live Search, Microsoft Corporation)


> -----Original Message-----
> From: Bruce Williams [mailto:williams.bruce@gmail.com]
> Sent: Tuesday, December 02, 2008 2:44 PM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: Bloom filters
>
> Ok, Andrzej Bialecki is finishing the patch in the next few days,
> including moving the MurmurHash code to Hadoop.
>
> https://issues.apache.org/jira/browse/HADOOP-3063
>
> Are there any "TFile" issues he should be aware of? :-)
>
> Bruce
>
> On Tue, Dec 2, 2008 at 2:35 PM, Bruce Williams <wi...@gmail.com> wrote:
> > Jim,
> >
> > Could you comment on "derailed by TFile"?
> >
> > Thanks,
> >
> > Bruce
> >
> > On Tue, Dec 2, 2008 at 8:20 AM, Jim Kellerman (POWERSET)
> > <Ji...@microsoft.com> wrote:
> >> That patch was never committed to the Hadoop code base, (I
> >> believe it got derailed by TFile). Until the code is in
> >> Hadoop, we cannot remove it from HBase.
> >>
> >> ---
> >> Jim Kellerman, Powerset (Live Search, Microsoft Corporation)
> >>
> >>> -----Original Message-----
> >>> From: Bruce Williams [mailto:williams.bruce@gmail.com]
> >>> Sent: Tuesday, December 02, 2008 4:02 AM
> >>> To: hbase-dev@hadoop.apache.org
> >>> Subject: Re: Bloom filters
> >>>
> >>> On Mon, Dec 1, 2008 at 12:36 PM, stack <st...@duboce.net> wrote:
> >>> > Bruce Williams wrote:
> >>> >>
> > >>> >> My understanding, which may be faulty, is the option works until a
> > >>> >> column is modified and then it fails in a difficult-to-fix manner. The
> > >>> >> jira hints that the issue could impact how clients, such as ZooKeeper,
> > >>> >> function as well as HBase.
> >>> >>
> >>> >
> > >>> > HBase shouldn't NPE.  If bloomfilters are enabled on a table where before
> > >>> > there were none, the table should just evolve gracefully adding the filters
> > >>> > as it runs (Same should happen when they are disabled; any filters should be
> > >>> > gradually disposed-of).
> >>> >
> >>> >> I will continue to dig, but coming up to speed on the HBase
> >>> >> implementation will take my time short term, can someone comment on
> >>> >> the "client issues"?
> >>> >>
> >>> >
> >>> > Keep asking questions if it'll maximize the time you have for hbase.
> >>> >
> >>> > Please expand on what you mean by 'client' issues in the above.
> >>> >
> >>> > Thanks,
> >>> > St.Ack
> >>> >
> >>>
> >>> Thanks, St. Ack
> >>>
> >>> Bloom Filter Code has been moved from HBase to Hadoop Core?
> >>>
> >>> https://issues.apache.org/jira/browse/HADOOP-3063
> >>>
> >>> Andrzej Bialecki - 29/Mar/08 02:50 PM Updated patch. This patch
> >>> imports the Bloom filter classes into org.apache.hadoop.util.bloom,
> >>> and adds a notice to LICENSE.txt.
> >>>
> >>> Doug Cutting and Owen O'Malley think we should remove the code from
> >>> HBase and use the Hadoop code.
> >>>
> >>> We have https://issues.apache.org/jira/browse/HBASE-553 to remove the
> >>> code from HBase.
> >>>
> >>> Comment?
> >>>
> >>>
> >>> Bruce
> >>>
> >>> --
> >>>
> >>> "Discovering...discovering...we will never cease discovering...
> >>> and the end of all our discovering will be
> >>> to return to the place where we began
> >>> and to know it for the first time."
> >>> -T.S. Eliot
> >>
> >>
> >
> >
> >
> > --
> >
> > "Discovering...discovering...we will never cease discovering...
> > and the end of all our discovering will be
> > to return to the place where we began
> > and to know it for the first time."
> > -T.S. Eliot
> >
>
>
>
> --
>
> "Discovering...discovering...we will never cease discovering...
> and the end of all our discovering will be
> to return to the place where we began
> and to know it for the first time."
> -T.S. Eliot


Re: Bloom filters

Posted by Bruce Williams <wi...@gmail.com>.
Ok, Andrzej Bialecki is finishing the patch in the next few days,
including moving the MurmurHash code to Hadoop.

https://issues.apache.org/jira/browse/HADOOP-3063

Are there any "TFile" issues he should be aware of? :-)

Bruce

On Tue, Dec 2, 2008 at 2:35 PM, Bruce Williams <wi...@gmail.com> wrote:
> Jim,
>
> Could you comment on "derailed by TFile"?
>
> Thanks,
>
> Bruce
>
> On Tue, Dec 2, 2008 at 8:20 AM, Jim Kellerman (POWERSET)
> <Ji...@microsoft.com> wrote:
>> That patch was never committed to the Hadoop code base, (I
>> believe it got derailed by TFile). Until the code is in
>> Hadoop, we cannot remove it from HBase.
>>
>> ---
>> Jim Kellerman, Powerset (Live Search, Microsoft Corporation)
>>
>>> -----Original Message-----
>>> From: Bruce Williams [mailto:williams.bruce@gmail.com]
>>> Sent: Tuesday, December 02, 2008 4:02 AM
>>> To: hbase-dev@hadoop.apache.org
>>> Subject: Re: Bloom filters
>>>
>>> On Mon, Dec 1, 2008 at 12:36 PM, stack <st...@duboce.net> wrote:
>>> > Bruce Williams wrote:
>>> >>
>>> >> My understanding, which may be faulty, is the option works until a
>>> >> column is modified and then it fails in a difficult-to-fix manner. The
>>> >> jira hints that the issue could impact how clients, such as ZooKeeper,
>>> >> function as well as HBase.
>>> >>
>>> >
>>> > HBase shouldn't NPE.  If bloomfilters are enabled on a table where before
>>> > there were none, the table should just evolve gracefully adding the filters
>>> > as it runs (Same should happen when they are disabled; any filters should be
>>> > gradually disposed-of).
>>> >
>>> >> I will continue to dig, but coming up to speed on the HBase
>>> >> implementation will take my time short term, can someone comment on
>>> >> the "client issues"?
>>> >>
>>> >
>>> > Keep asking questions if it'll maximize the time you have for hbase.
>>> >
>>> > Please expand on what you mean by 'client' issues in the above.
>>> >
>>> > Thanks,
>>> > St.Ack
>>> >
>>>
>>> Thanks, St. Ack
>>>
>>> Bloom Filter Code has been moved from HBase to Hadoop Core?
>>>
>>> https://issues.apache.org/jira/browse/HADOOP-3063
>>>
>>> Andrzej Bialecki - 29/Mar/08 02:50 PM Updated patch. This patch
>>> imports the Bloom filter classes into org.apache.hadoop.util.bloom,
>>> and adds a notice to LICENSE.txt.
>>>
>>> Doug Cutting and Owen O'Malley think we should remove the code from
>>> HBase and use the Hadoop code.
>>>
>>> We have https://issues.apache.org/jira/browse/HBASE-553 to remove the
>>> code from HBase.
>>>
>>> Comment?
>>>
>>>
>>> Bruce

Re: Bloom filters

Posted by Bruce Williams <wi...@gmail.com>.
Jim,

Could you comment on "derailed by TFile"?

Thanks,

Bruce

On Tue, Dec 2, 2008 at 8:20 AM, Jim Kellerman (POWERSET)
<Ji...@microsoft.com> wrote:
> That patch was never committed to the Hadoop code base, (I
> believe it got derailed by TFile). Until the code is in
> Hadoop, we cannot remove it from HBase.
>
> ---
> Jim Kellerman, Powerset (Live Search, Microsoft Corporation)
>
>> -----Original Message-----
>> From: Bruce Williams [mailto:williams.bruce@gmail.com]
>> Sent: Tuesday, December 02, 2008 4:02 AM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Re: Bloom filters
>>
>> On Mon, Dec 1, 2008 at 12:36 PM, stack <st...@duboce.net> wrote:
>> > Bruce Williams wrote:
>> >>
>> >> My understanding, which may be faulty, is the option works until a
>> >> column is modified and then it fails in a difficult-to-fix manner. The
>> >> jira hints that the issue could impact how clients, such as ZooKeeper,
>> >> function as well as HBase.
>> >>
>> >
>> > HBase shouldn't NPE.  If bloomfilters are enabled on a table where before
>> > there were none, the table should just evolve gracefully adding the filters
>> > as it runs (Same should happen when they are disabled; any filters should be
>> > gradually disposed-of).
>> >
>> >> I will continue to dig, but coming up to speed on the HBase
>> >> implementation will take my time short term, can someone comment on
>> >> the "client issues"?
>> >>
>> >
>> > Keep asking questions if it'll maximize the time you have for hbase.
>> >
>> > Please expand on what you mean by 'client' issues in the above.
>> >
>> > Thanks,
>> > St.Ack
>> >
>>
>> Thanks, St. Ack
>>
>> Bloom Filter Code has been moved from HBase to Hadoop Core?
>>
>> https://issues.apache.org/jira/browse/HADOOP-3063
>>
>> Andrzej Bialecki - 29/Mar/08 02:50 PM Updated patch. This patch
>> imports the Bloom filter classes into org.apache.hadoop.util.bloom,
>> and adds a notice to LICENSE.txt.
>>
>> Doug Cutting and Owen O'Malley think we should remove the code from
>> HBase and use the Hadoop code.
>>
>> We have https://issues.apache.org/jira/browse/HBASE-553 to remove the
>> code from HBase.
>>
>> Comment?
>>
>>
>> Bruce

RE: Bloom filters

Posted by "Jim Kellerman (POWERSET)" <Ji...@microsoft.com>.
That patch was never committed to the Hadoop code base (I
believe it got derailed by TFile). Until the code is in
Hadoop, we cannot remove it from HBase.

---
Jim Kellerman, Powerset (Live Search, Microsoft Corporation)

> -----Original Message-----
> From: Bruce Williams [mailto:williams.bruce@gmail.com]
> Sent: Tuesday, December 02, 2008 4:02 AM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: Bloom filters
>
> On Mon, Dec 1, 2008 at 12:36 PM, stack <st...@duboce.net> wrote:
> > Bruce Williams wrote:
> >>
> >> My understanding, which may be faulty, is the option works until a
> >> column is modified and then it fails in a difficult-to-fix manner. The
> >> jira hints that the issue could impact how clients, such as ZooKeeper,
> >> function as well as HBase.
> >>
> >
> > HBase shouldn't NPE.  If bloomfilters are enabled on a table where before
> > there were none, the table should just evolve gracefully adding the filters
> > as it runs (Same should happen when they are disabled; any filters should be
> > gradually disposed-of).
> >
> >> I will continue to dig, but coming up to speed on the HBase
> >> implementation will take my time short term, can someone comment on
> >> the "client issues"?
> >>
> >
> > Keep asking questions if it'll maximize the time you have for hbase.
> >
> > Please expand on what you mean by 'client' issues in the above.
> >
> > Thanks,
> > St.Ack
> >
>
> Thanks, St. Ack
>
> Bloom Filter Code has been moved from HBase to Hadoop Core?
>
> https://issues.apache.org/jira/browse/HADOOP-3063
>
> Andrzej Bialecki - 29/Mar/08 02:50 PM Updated patch. This patch
> imports the Bloom filter classes into org.apache.hadoop.util.bloom,
> and adds a notice to LICENSE.txt.
>
> Doug Cutting and Owen O'Malley think we should remove the code from
> HBase and use the Hadoop code.
>
> We have https://issues.apache.org/jira/browse/HBASE-553 to remove the
> code from HBase.
>
> Comment?
>
>
> Bruce


Re: Bloom filters

Posted by stack <st...@duboce.net>.
Thanks for the pointers to hadoop issues.

I think finishing the move of bloom filters up into hadoop would be a
fine project.  Looks like there is a will behind getting the patches
committed, and refactoring the onelab stuff into a hadoop
util.bloomfilters would clean up some awkward code tangles in hbase.
Would suggest writing Andrzej or, probably better, commenting in the
issue, asking about its state and if it's ok if you take it on; it looks
like it was just a matter of some javadoc fixes and some findbugs warnings.

Since the patch was made, things have gotten a little more convoluted.  
There is a new hashing mechanism that was added to hbase by Andrzej, 
MurmurHash, that should also be moved back up into hadoop.
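
For context on the hash mentioned above, here is a hedged sketch of the
published 32-bit MurmurHash2 algorithm in Java. It is not claimed to match the
HBase or Hadoop implementation byte for byte; the class name Murmur2Sketch and
the choice of seed are assumptions for illustration. Hashes like this are
cheap to compute and spread keys evenly, which is what a Bloom filter needs
when picking bit positions.

```java
// Sketch of 32-bit MurmurHash2; the class name is hypothetical.
final class Murmur2Sketch {
    static int hash(byte[] data, int seed) {
        final int m = 0x5bd1e995; // mixing constant from the published algorithm
        final int r = 24;
        int len = data.length;
        int h = seed ^ len;
        int i = 0;

        // Mix four bytes at a time, read as a little-endian int.
        while (len - i >= 4) {
            int k = (data[i] & 0xff)
                  | ((data[i + 1] & 0xff) << 8)
                  | ((data[i + 2] & 0xff) << 16)
                  | ((data[i + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
            i += 4;
        }

        // Fold in the remaining 1 to 3 bytes (cases fall through on purpose).
        switch (len - i) {
            case 3: h ^= (data[i + 2] & 0xff) << 16;
            case 2: h ^= (data[i + 1] & 0xff) << 8;
            case 1: h ^= (data[i] & 0xff);
                    h *= m;
            default: break;
        }

        // Final avalanche so the last bytes affect all bits.
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    public static void main(String[] args) {
        byte[] key = "row-1".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(hash(key, 0)));
    }
}
```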

St.Ack



Bruce Williams wrote:
> On Mon, Dec 1, 2008 at 12:36 PM, stack <st...@duboce.net> wrote:
>   
>> Bruce Williams wrote:
>>     
>>> My understanding, which may be faulty, is the option works until a
>>> column is modified and then it fails in a difficult-to-fix manner. The
>>> jira hints that the issue could impact how clients, such as ZooKeeper,
>>> function as well as HBase.
>>>
>>>       
>> HBase shouldn't NPE.  If bloomfilters are enabled on a table where before
>> there were none, the table should just evolve gracefully adding the filters
>> as it runs (Same should happen when they are disabled; any filters should be
>> gradually disposed-of).
>>
>>     
>>> I will continue to dig, but coming up to speed on the HBase
>>> implementation will take my time short term, can someone comment on
>>> the "client issues"?
>>>
>>>       
>> Keep asking questions if it'll maximize the time you have for hbase.
>>
>> Please expand on what you mean by 'client' issues in the above.
>>
>> Thanks,
>> St.Ack
>>
>>     
>
> Thanks, St. Ack
>
> Bloom Filter Code has been moved from HBase to Hadoop Core?
>
> https://issues.apache.org/jira/browse/HADOOP-3063
>
> Andrzej Bialecki - 29/Mar/08 02:50 PM Updated patch. This patch
> imports the Bloom filter classes into org.apache.hadoop.util.bloom,
> and adds a notice to LICENSE.txt.
>
> Doug Cutting and Owen O'Malley think we should remove the code from
> HBase and use the Hadoop code.
>
> We have https://issues.apache.org/jira/browse/HBASE-553 to remove the
> code from HBase.
>
> Comment?
>
>
> Bruce
>
>   


Re: Bloom filters

Posted by Bruce Williams <wi...@gmail.com>.
On Mon, Dec 1, 2008 at 12:36 PM, stack <st...@duboce.net> wrote:
> Bruce Williams wrote:
>>
>> My understanding, which may be faulty, is the option works until a
>> column is modified and then it fails in a difficult-to-fix manner. The
>> jira hints that the issue could impact how clients, such as ZooKeeper,
>> function as well as HBase.
>>
>
> HBase shouldn't NPE.  If bloomfilters are enabled on a table where before
> there were none, the table should just evolve gracefully adding the filters
> as it runs (Same should happen when they are disabled; any filters should be
> gradually disposed-of).
>
>> I will continue to dig, but coming up to speed on the HBase
>> implementation will take my time short term, can someone comment on
>> the "client issues"?
>>
>
> Keep asking questions if it'll maximize the time you have for hbase.
>
> Please expand on what you mean by 'client' issues in the above.
>
> Thanks,
> St.Ack
>

Thanks, St. Ack

Bloom Filter Code has been moved from HBase to Hadoop Core?

https://issues.apache.org/jira/browse/HADOOP-3063

Andrzej Bialecki - 29/Mar/08 02:50 PM Updated patch. This patch
imports the Bloom filter classes into org.apache.hadoop.util.bloom,
and adds a notice to LICENSE.txt.

Doug Cutting and Owen O'Malley think we should remove the code from
HBase and use the Hadoop code.

We have https://issues.apache.org/jira/browse/HBASE-553 to remove the
code from HBase.

Comment?


Bruce


Re: Bloom filters

Posted by stack <st...@duboce.net>.
Bruce Williams wrote:
> My understanding, which may be faulty, is the option works until a
> column is modified and then it fails in a difficult-to-fix manner. The
> jira hints that the issue could impact how clients, such as ZooKeeper,
> function as well as HBase.
>   

HBase shouldn't NPE.  If bloomfilters are enabled on a table where 
before there were none, the table should just evolve gracefully adding 
the filters as it runs (Same should happen when they are disabled; any 
filters should be gradually disposed-of).
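
The behaviour described above (files written before filters were enabled
simply have none, and reads must cope with that) could be sketched roughly as
follows. This is an assumption-laden illustration, not HBase code: the class
StoreFileSketch is hypothetical, and a plain HashSet stands in for a real
Bloom filter so the example stays short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a store file that may or may not carry a filter.
final class StoreFileSketch {
    private final Map<String, String> cells = new HashMap<>();
    // null means the file was written while filters were disabled.
    private final Set<String> filter;
    int diskReads = 0; // counts lookups that had to touch the "file"

    StoreFileSketch(boolean withFilter) {
        this.filter = withFilter ? new HashSet<>() : null;
    }

    void put(String row, String value) {
        cells.put(row, value);
        if (filter != null) {
            filter.add(row);
        }
    }

    String get(String row) {
        // Only consult the filter when one exists; never dereference null.
        if (filter != null && !filter.contains(row)) {
            return null; // filter says "definitely absent": skip the read
        }
        diskReads++;
        return cells.get(row);
    }

    public static void main(String[] args) {
        StoreFileSketch oldFile = new StoreFileSketch(false);
        StoreFileSketch newFile = new StoreFileSketch(true);
        oldFile.put("r1", "v1");
        newFile.put("r1", "v1");
        oldFile.get("missing"); // no filter: falls back to a real read
        newFile.get("missing"); // filter present: read is skipped
        System.out.println(oldFile.diskReads + " " + newFile.diskReads);
    }
}
```

Under this scheme older files keep working unchanged, and they would only
gain a filter when they are next rewritten.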

> I will continue to dig, but coming up to speed on the HBase
> implementation will take my time short term, can someone comment on
> the "client issues"?
>   
Keep asking questions if it'll maximize the time you have for hbase.

Please expand on what you mean by 'client' issues in the above.

Thanks,
St.Ack

Re: Bloom filters

Posted by Bruce Williams <wi...@gmail.com>.
My understanding, which may be faulty, is the option works until a
column is modified and then it fails in a difficult-to-fix manner. The
jira hints that the issue could impact how clients, such as ZooKeeper,
function as well as HBase.

If this is true, then rather than a "default on" in the next rev,
would encouraging a development/production distinction in practice be
more productive? Would clear documentation on how to get the
considerable performance improvements in production systems be the
best short-term solution?

I will continue to dig, but coming up to speed on the HBase
implementation will take up my time in the short term. Can someone
comment on the "client issues"?


Bruce Williams

On 11/23/08, Michael Stack <st...@duboce.net> wrote:
> It's currently an option in hbase but little exercised (to the best of my
> knowledge), and there is currently one issue regarding their operation
> (https://issues.apache.org/jira/browse/HBASE-922).
>
> Maybe you can figure out the issue?  Otherwise, chatting with others, we're
> thinking that in the next rev. of our store file format, they should just be
> on by default.  See here for some notes on the new file format:
> http://wiki.apache.org/hadoop/Hbase/NewFileFormat.
>
> St.Ack
>
>
> Bruce Williams wrote:
> > I just read that Bigtable's use of Bloom Filters for lookup resulted
> > in considerable performance improvements. Does Hbase employ any form
> > of Bloom filter?
> >
> > If not, I am willing to look at it.
> >
> > Bruce Williams
> >
> >
> >
>
>



Re: Bloom filters

Posted by Michael Stack <st...@duboce.net>.
It's currently an option in hbase but little exercised (to the best of
my knowledge), and there is currently one issue regarding their operation
(https://issues.apache.org/jira/browse/HBASE-922).

Maybe you can figure out the issue?  Otherwise, chatting with others, we're
thinking that in the next rev. of our store file format, they should just be
on by default.  See here for some notes on the new file format:
http://wiki.apache.org/hadoop/Hbase/NewFileFormat.

St.Ack

Bruce Williams wrote:
> I just read that Bigtable's use of Bloom Filters for lookup resulted
> in considerable performance improvements. Does Hbase employ any form
> of Bloom filter?
>
> If not, I am willing to look at it.
>
> Bruce Williams
>
>