Posted to user@hbase.apache.org by Lin Ma <li...@gmail.com> on 2012/08/18 09:12:09 UTC

Using HBase serving to replace memcached

Hello guys,

In your experience, is it practical to use HBase directly for serving?
That is, to have it handle user traffic directly (on the scale of tens of
thousands of QPS) behind Apache, replacing the role of memcached? Are there
any known pitfalls in replacing memcached with HBase? One issue I can
think of is that, for a specific row range, only one active region server
can handle the requests, whereas with memcached we can set up several
memcached instances with duplicate content (all of them active) behind a
VIP to serve the same purpose, which could achieve better performance and
scalability.

Any advice or reference documents are appreciated. Thanks.

regards,
Lin

Re: Using HBase serving to replace memcached

Posted by anil gupta <an...@gmail.com>.

Nice explanation, Anoop. This deserves to be part of the HBase wiki.

-- 
Thanks & Regards,
Anil Gupta

Re: Using HBase serving to replace memcached

Posted by Stack <st...@duboce.net>.

On Wed, Aug 22, 2012 at 6:28 AM, Lin Ma <li...@gmail.com> wrote:
> Thanks Anoop,
>
> My question is answered. Are you writing related part of code in HBase?
> From your detailed and knowledgeable description, you seems to be the
> author. :-)
>

Anoop did not write that particular piece of code. He has, though,
made many other high-calibre contributions to the HBase code base.
St.Ack

Re: Using HBase serving to replace memcached

Posted by Lin Ma <li...@gmail.com>.

Thanks Anoop,

My question is answered. Did you write the related part of the HBase code?
From your detailed and knowledgeable description, you seem to be the
author. :-)

regards,
Lin

Re: Using HBase serving to replace memcached

Posted by "Pamecha, Abhishek" <ap...@x.com>.

Thanks all..

i Sent from my iPad with iMstakes 

Re: Using HBase serving to replace memcached

Posted by J Mohamed Zahoor <jm...@gmail.com>.

If you need to search by both row and column qualifier, you can pick a row+col bloom to help you skip blocks.

./Zahoor@iPad

RE: Using HBase serving to replace memcached

Posted by Anoop Sam John <an...@huawei.com>.

> Are there any optimizations for column qualifier search too or that one just needs to load all blocks that match the rowkey criteria and then scan each one of them from start to end?
With blooms, there is an optimization for searching for a particular column qualifier as well. A bloom can be of type ROW or ROWCOL. When it is of type ROWCOL, what is added to the bloom is the presence of a particular column qualifier in a row, rather than just the row id.
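
To make the ROW versus ROWCOL distinction concrete, here is a minimal sketch in Python. It is a toy model, not HBase code: the Bloom class, the bloom_key and build_bloom helpers, and the way the row and qualifier are concatenated are all assumptions made for illustration. It shows only that the two types differ in what gets added to the bloom.

```python
import hashlib


class Bloom:
    """Tiny Bloom filter: answers "definitely absent" or "possibly present"."""

    def __init__(self, num_bits=2048, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: bytes):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_key(bloom_type, row, qualifier):
    """Key stored in the bloom: the row alone (ROW) or row plus qualifier (ROWCOL)."""
    row_bytes = row.encode()
    if bloom_type == "ROW":
        return row_bytes
    if bloom_type == "ROWCOL":
        # Length-prefix the row so ("ab", "c") and ("a", "bc") give different keys.
        return len(row_bytes).to_bytes(2, "big") + row_bytes + qualifier.encode()
    raise ValueError(f"unknown bloom type: {bloom_type}")


def build_bloom(bloom_type, cells):
    """Build a bloom over (row, qualifier) pairs written to one hypothetical file."""
    bloom = Bloom()
    for row, qualifier in cells:
        bloom.add(bloom_key(bloom_type, row, qualifier))
    return bloom


cells = [("row1", "name"), ("row1", "email"), ("row2", "name")]
row_bloom = build_bloom("ROW", cells)
rowcol_bloom = build_bloom("ROWCOL", cells)
```

With a ROW bloom, a lookup for ("row1", "phone") cannot be ruled out, because the bloom only knows that row1 exists in the file. With a ROWCOL bloom, the same lookup is very likely to be answered "not present", so the file can be skipped without loading any block.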

-Anoop-

RE: Using HBase serving to replace memcached

Posted by "Pamecha, Abhishek" <ap...@x.com>.

Great explanation. This may be diverging from the thread's original question, but could you also explain the difference, if any, between searching for a rowkey [as you described below] and searching for a specific column qualifier? Are there any optimizations for column qualifier search too, or does one just need to load all blocks that match the rowkey criteria and then scan each of them from start to end?

Thanks,
Abhishek

RE: Using HBase serving to replace memcached

Posted by Anoop Sam John <an...@huawei.com>.

>> I could be wrong. I think HFile index block (which is located at the end
>> of HFile) is a binary search tree containing all row-key values (of the
>> HFile) in the binary search tree. Searching a specific row-key in the
>> binary search tree could easily find whether a row-key exists (some node in
>> the tree has the same row-key value) or not. Why we need load every block
>> to find if the row exists?

I think there is some confusion here regarding the blooms and the block index. I will try to clarify this point.
The block index is present in every HFile. Within an HFile, the data is written as multiple blocks, and HBase reads data from the HDFS layer only block by block. The block index contains information about the blocks within that HFile: the start and end rowkeys that reside in each block, plus block details such as the block's offset and its length. Now, when a request comes in to get rowkey 'x', all the HFiles within that region need to be checked [a KV can be present in any of the HFiles]. To find out which block within an HFile may contain this row, the block index is used. This block index is always in memory. The lookup tells us only the possible block in which the row may be present. HBase then loads that block and reads through it to get the row we are interested in.
The bloom, in contrast, has information about each and every row added to that HFile [the block index won't have info about each and every row]. This bloom information is also always in memory. So when a read request for row 'x' comes to an HFile, the bloom is checked first to see whether this row is in the file or not. If, as per the bloom, it is not there, no block at all is fetched. But if the bloom is not enabled, we might find one block whose row range is such that 'x' falls in between, and HBase will load that block. So the use of blooms can avoid this IO. Hope this is clear for you now.
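
The read path described above can be sketched as a small Python toy model. Everything here is hypothetical and simplified for illustration: ToyHFile and BloomFilter are not HBase classes, the sketch indexes each block by its first row key only, and "loading a block" is simulated by a counter. It is meant only to show why a bloom can save the block read that the block index alone cannot avoid.

```python
import bisect
import hashlib


class BloomFilter:
    """Tiny Bloom filter: answers "definitely absent" or "possibly present"."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class ToyHFile:
    """Sorted rows split into blocks, with a block index and an optional bloom."""

    def __init__(self, rows, rows_per_block=2, use_bloom=True):
        keys = sorted(rows)
        self.blocks = [
            {k: rows[k] for k in keys[i:i + rows_per_block]}
            for i in range(0, len(keys), rows_per_block)
        ]
        # Block index (assumption of this sketch): first row key of each block.
        self.first_keys = [min(block) for block in self.blocks]
        self.bloom = None
        if use_bloom:
            self.bloom = BloomFilter()
            for k in keys:
                self.bloom.add(k)
        self.blocks_loaded = 0  # counts simulated block reads

    def get(self, row):
        if self.bloom is not None and not self.bloom.might_contain(row):
            return None  # bloom says "not in this file": no block is fetched
        # Block index lookup: the last block whose first key is <= row.
        idx = bisect.bisect_right(self.first_keys, row) - 1
        if idx < 0:
            return None  # row sorts before the first block
        self.blocks_loaded += 1  # simulate loading that one candidate block
        return self.blocks[idx].get(row)


rows = {"a": 1, "c": 2, "e": 3, "g": 4}
with_bloom = ToyHFile(rows, use_bloom=True)
without_bloom = ToyHFile(rows, use_bloom=False)
```

Looking up the missing row "d" in without_bloom still loads the block covering the range that "d" falls into, only to find nothing there. In with_bloom, the same lookup is almost always answered by the bloom alone (a Bloom filter can give false positives, never false negatives), so no block is read.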

-Anoop-

Re: Using HBase serving to replace memcached

Posted by Lin Ma <li...@gmail.com>.

Thanks Zahoor,

I read through the document you referred to, but I am confused about what
leaf-level index, intermediate-level index and root-level index mean. I
would appreciate it if you could give more details on what they are, or
point me to the related documents.

BTW: the document you pointed me to is very good; however, I am missing some
basic background on the three terms I mentioned above. :-)

regards,
Lin

On Wed, Aug 22, 2012 at 12:51 PM, J Mohamed Zahoor <jm...@gmail.com> wrote:

> I could be wrong. I think HFile index block (which is located at the end
>> of HFile) is a binary search tree containing all row-key values (of the
>> HFile) in the binary search tree. Searching a specific row-key in the
>> binary search tree could easily find whether a row-key exists (some node in
>> the tree has the same row-key value) or not. Why we need load every block
>> to find if the row exists?
>>
>>
> Hmm...
> It is a multilevel index. Only the root Index's (Data, Meta etc) are
> loaded when a region is opened. The rest of the tree (intermediate and leaf
> index's) are present in each block level.
> I am assuming a HFile v2 here for the discussion.
> Read this for more clarity http://hbase.apache.org/book/apes03.html
>
> Nice discussion. You made me read lot of things. :-)
> Now i will dig in to the code and check this out.
>
> ./Zahoor
>

Re: Using HBase serving to replace memcached

Posted by J Mohamed Zahoor <jm...@gmail.com>.

>
> I could be wrong. I think HFile index block (which is located at the end
> of HFile) is a binary search tree containing all row-key values (of the
> HFile) in the binary search tree. Searching a specific row-key in the
> binary search tree could easily find whether a row-key exists (some node in
> the tree has the same row-key value) or not. Why we need load every block
> to find if the row exists?
>
>
Hmm...
It is a multilevel index. Only the root indexes (data, meta, etc.) are loaded
when a region is opened. The rest of the tree (intermediate and leaf
indexes) is stored at the block level.
I am assuming HFile v2 here for the discussion.
Read this for more clarity: http://hbase.apache.org/book/apes03.html

Nice discussion. You made me read a lot of things. :-)
Now I will dig into the code and check this out.

./Zahoor

Re: Using HBase serving to replace memcached

Posted by Lin Ma <li...@gmail.com>.

Thanks Zahoor,

> If there is no bloom... you have to load every block and scan to find if
the row exists..

I could be wrong. I think the HFile index block (which is located at the end
of the HFile) is a binary search tree containing all row-key values of the
HFile. Searching for a specific row-key in the binary search tree could
easily determine whether the row-key exists (some node in the tree has the
same row-key value) or not. Why do we need to load every block to find out
whether the row exists?

regards,
Lin

On Tue, Aug 21, 2012 at 11:56 PM, jmozah <jm...@gmail.com> wrote:

> >
> >
> > 1. After reading the materials you sent to me, I am confused how Bloom
> Filter could save I/O during random read. Supposing I am not using Bloom
> Filter, in order to find whether a row (or row-key) exists, we need to scan
> the index block which is at the end part of an HFile, the scan is in memory
> (I think index block is always in memory, please feel free to correct me if
> I am wrong) using binary search -- it should be pretty fast. With Bloom
> Filter, we could be a bit faster by looking up Bloom Filter bit vector in
> memory. Since both index block binary search and Bloom Filter bit vector
> search are doing in memory (no I/O is involved), what kinds of I/O is
> saved? :-)
> >
>
> If bloom says the Row *may* be present.. the block is loaded otherwise
> not...
> If there is no bloom... you have to load every block and scan to find if
> the row exists..
>
> This may incur more IO
>
>
> > 2.
> >
> > > One Hadoop job doing random reads is perfectly fine.  but , since you
> said "Handling directly user traffic"... i assumed you wanted to
> > > expose HBase independently to every client request, thereby having as
> many connections as the number of simultaneous req..
> >
> > Sorry I need to confirm again on this point. I think you mean
> establishing a new connection for each request is not good, using
> connection pool or asynchronous I/O is preferred?
> >
>
>
> Yes.

Re: Using HBase serving to replace memcached

Posted by jmozah <jm...@gmail.com>.

> 
> 
> 1. After reading the materials you sent to me, I am confused how Bloom Filter could save I/O during random read. Supposing I am not using Bloom Filter, in order to find whether a row (or row-key) exists, we need to scan the index block which is at the end part of an HFile, the scan is in memory (I think index block is always in memory, please feel free to correct me if I am wrong) using binary search -- it should be pretty fast. With Bloom Filter, we could be a bit faster by looking up Bloom Filter bit vector in memory. Since both index block binary search and Bloom Filter bit vector search are doing in memory (no I/O is involved), what kinds of I/O is saved? :-)
> 

If the Bloom filter says the row *may* be present, the block is loaded; otherwise it is not.
If there is no Bloom filter, you have to load every block and scan it to find out whether the row exists.

This may incur more I/O.


> 2. 
> 
> > One Hadoop job doing random reads is perfectly fine.  but , since you said "Handling directly user traffic"... i assumed you wanted to
> > expose HBase independently to every client request, thereby having as many connections as the number of simultaneous req..
> 
> Sorry I need to confirm again on this point. I think you mean establishing a new connection for each request is not good, using connection pool or asynchronous I/O is preferred?
> 


Yes.
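
A minimal sketch of the pooling idea in Python follows. FakeConnection, ConnectionPool, and handle_request are hypothetical names, and no real HBase client is involved. The point is only that a fixed set of connections is created once and borrowed per request, instead of one connection being opened per request.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical stand-in for an expensive client connection."""

    opened = 0  # how many connections have been created in total

    def __init__(self):
        FakeConnection.opened += 1

    def get(self, row):
        return f"value-for-{row}"


class ConnectionPool:
    """Fixed-size pool: connections are created up front and reused."""

    def __init__(self, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(FakeConnection())

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks while all are busy
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


pool = ConnectionPool(size=4)


def handle_request(row):
    # Each user request borrows a connection instead of opening a new one.
    with pool.connection() as conn:
        return conn.get(row)
```

With this shape, the number of open connections is bounded by the pool size no matter how many requests arrive. Requests beyond that simply wait for a free connection.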

Re: Using HBase serving to replace memcached

Posted by Lin Ma <li...@gmail.com>.

Thank you Zahoor,

Two more comments,

1. After reading the materials you sent me, I am confused about how a Bloom
filter could save I/O during random reads. Supposing I am not using a Bloom
filter, in order to find whether a row (or row-key) exists, we need to scan
the index block, which is at the end of the HFile. The scan is in memory
(I think the index block is always in memory, please feel free to correct me
if I am wrong) and uses binary search, so it should be pretty fast. With a
Bloom filter, we could be a bit faster by looking up the Bloom filter bit
vector in memory. Since both the index block binary search and the Bloom
filter bit vector lookup are done in memory (no I/O is involved), what kind
of I/O is saved? :-)

2.

> One Hadoop job doing random reads is perfectly fine.  but , since you
said "Handling directly user traffic"... i assumed you wanted to
> expose HBase independently to every client request, thereby having as
many connections as the number of simultaneous req..

Sorry, I need to confirm this point again. I think you mean that establishing
a new connection for each request is not good, and that using a connection
pool or asynchronous I/O is preferred?

regards,
Lin

On Tue, Aug 21, 2012 at 10:45 PM, jmozah <jm...@gmail.com> wrote:

> >
> >
> >
> > 1. I know very basics of Bloom filters, which is used for detect whether
> an item is in a set. How to use Bloom filters in HBase to improve random
> read performance? Could you show me an example? Thanks.
>
> This will help omit loading the blocks (thereby saving IO and cache churn)
> which does not have the given row.
> For more on bloom, see
> 1 -
> https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf
> 2 - http://www.quora.com/How-are-bloom-filters-used-in-HBase
>
>
> > 2. "Also more client connections is one more issue that might infest
> you" -- supposing I am doing random read from a Hadoop job to access HBase,
> do you mean using multiple client connections from the Hadoop job is good
> or not good? Sorry I am a bit lost. :-)
>
> One Hadoop job doing random reads is perfectly fine.  but , since you said
> "Handling directly user traffic"... i assumed you wanted to expose HBase
> independently to every client request, thereby having as many connections
> as the number of simultaneous req..
>
>
> > 3. "asynchbase will help you" -- does HBase support asynchronous API?
> Sorry I cannot find it out. Appreciate if you could point me the APIs you
> are referring to.
>
>
> Not the default HTable API.  asynchbase is another client for Hbase. read
> more about asynchbase here (https://github.com/stumbleupon/asynchbase)
>
>

Re: Using HBase serving to replace memcached

Posted by jmozah <jm...@gmail.com>.

> 
> 
> 
> 1. I know very basics of Bloom filters, which is used for detect whether an item is in a set. How to use Bloom filters in HBase to improve random read performance? Could you show me an example? Thanks.

This helps skip loading the blocks that do not contain the given row (thereby saving I/O and cache churn).
For more on Bloom filters, see
1 - https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf
2 - http://www.quora.com/How-are-bloom-filters-used-in-HBase


> 2. "Also more client connections is one more issue that might infest you" -- supposing I am doing random read from a Hadoop job to access HBase, do you mean using multiple client connections from the Hadoop job is good or not good? Sorry I am a bit lost. :-)

One Hadoop job doing random reads is perfectly fine. But since you said "Handling directly user traffic", I assumed you wanted to expose HBase independently to every client request, thereby having as many connections as the number of simultaneous requests.


> 3. "asynchbase will help you" -- does HBase support asynchronous API? Sorry I cannot find it out. Appreciate if you could point me the APIs you are referring to.


Not in the default HTable API. asynchbase is another client for HBase. Read more about asynchbase here (https://github.com/stumbleupon/asynchbase)

Re: Using HBase serving to replace memcached

Posted by Lin Ma <li...@gmail.com>.

Thanks for the reply, Zahoor.

Some more comments,

1. I know only the very basics of Bloom filters, which are used to detect
whether an item is in a set. How are Bloom filters used in HBase to improve
random read performance? Could you show me an example? Thanks.
2. "Also more client connections is one more issue that might infest you"
-- supposing I am doing random reads from a Hadoop job to access HBase, do
you mean using multiple client connections from the Hadoop job is good or
not good? Sorry, I am a bit lost. :-)
3. "asynchbase will help you" -- does HBase support an asynchronous API?
Sorry, I cannot find it. I would appreciate it if you could point me to the
APIs you are referring to.

regards,
Lin

On Tue, Aug 21, 2012 at 6:55 PM, J Mohamed Zahoor <jm...@gmail.com> wrote:

> Again. if your data is so huge that it is much larger than the available
> RAM, you might want to rethink.
> There are some configs in HBase that will help you in random read
> scenarios... like Bloom filters etc.
> Also more client connections is one more issue that might infest you...
> where connection pooling or asynchbase will help you.
>
> ./Zahoor
>
>
> On Tue, Aug 21, 2012 at 12:56 AM, Asif Ali <az...@gmail.com> wrote:
>
> > I've used memcached heavily in such scenarios and all such data is always
> > in Memory.
> >
> > Memcached definitely is a great solution for this and scales very well.
> But
> > keep in mind - it is not consistent. Which means there are some requests
> > which will be handled incorrectly.
> >
> > Memcached is great but also look at Guava cache for similar use cases.
> >
> > Asif Ali
> >
> >
> > On Mon, Aug 20, 2012 at 9:09 AM, Lin Ma <li...@gmail.com> wrote:
> >
> > > Thank you Drew. I like your reply, especially blocking cache nature
> > > provided by HBase. A quick question, for traditional memcached, all of
> > the
> > > items are in memory, no disk is used, correct?
> > >
> > > regards,
> > > Lin
> > >
> > > On Mon, Aug 20, 2012 at 9:26 PM, Drew Dahlke <dr...@bronto.com>
> > > wrote:
> > >
> > > > I'd say if the memcached model is working for you, stick with it.
> > > > HBase (currently) caches whole blocks. With cache blocks enabled you
> > > > can achieve 10s of thousands of reqs/sec with a pretty small cluster.
> > > > However there's a catch. Once you reach the point where your tables
> > > > are so large they can't all sit in memory at the same time you'll see
> > > > a behavior change. User traffic tends to be very random access which,
> > > > with block caching, can cause a lot of thrashing with frequent cache
> > > > evictions. We've seen this bring our cluster to it's knees.
> > > >
> > > > IMHO a better model is persist things in HBase and then cache things
> > > > with memcached just as you would with any other data store. If you're
> > > > looking for a spiffy memcached replacement I'd recommend checking out
> > > > Redis.
> > > >
> > > >
> > > > On Sat, Aug 18, 2012 at 3:12 AM, Lin Ma <li...@gmail.com> wrote:
> > > > > Hello guys,
> > > > >
> > > > > In your experience, is it practical to use HBase directly for
> > serving?
> > > > > Saying handle directly user traffic (tens of thousands QPS scale)
> > > behind
> > > > > Apache, and replace the role of memcached? I am not sure whether
> > there
> > > > are
> > > > > any known panic to replace memcached by using HBase? One issue I
> > could
> > > > > think about is for a specific row range, only one active region
> > server
> > > > > could handle the request, but in memcached, we can setup several
> > > > memcached
> > > > > instance with duplicate content (all of them are active) to serve
> the
> > > > same
> > > > > purpose under a VIP which could achieve better performance and
> > > > scalability.
> > > > >
> > > > > Any advice or reference documents are appreciated. Thanks.
> > > > >
> > > > > regards,
> > > > > Lin
> > > >
> > >
> >
>

Re: Using HBase serving to replace memcached

Posted by J Mohamed Zahoor <jm...@gmail.com>.

Again, if your data is so huge that it is much larger than the available
RAM, you might want to rethink.
There are some configs in HBase that will help you in random read
scenarios... like Bloom filters etc.
Also more client connections is one more issue that might infest you...
where connection pooling or asynchbase will help you.

./Zahoor


On Tue, Aug 21, 2012 at 12:56 AM, Asif Ali <az...@gmail.com> wrote:

> I've used memcached heavily in such scenarios and all such data is always
> in Memory.
>
> Memcached definitely is a great solution for this and scales very well. But
> keep in mind - it is not consistent. Which means there are some requests
> which will be handled incorrectly.
>
> Memcached is great but also look at Guava cache for similar use cases.
>
> Asif Ali
>
>
> On Mon, Aug 20, 2012 at 9:09 AM, Lin Ma <li...@gmail.com> wrote:
>
> > Thank you Drew. I like your reply, especially blocking cache nature
> > provided by HBase. A quick question, for traditional memcached, all of
> the
> > items are in memory, no disk is used, correct?
> >
> > regards,
> > Lin
> >
> > On Mon, Aug 20, 2012 at 9:26 PM, Drew Dahlke <dr...@bronto.com>
> > wrote:
> >
> > > I'd say if the memcached model is working for you, stick with it.
> > > HBase (currently) caches whole blocks. With cache blocks enabled you
> > > can achieve 10s of thousands of reqs/sec with a pretty small cluster.
> > > However there's a catch. Once you reach the point where your tables
> > > are so large they can't all sit in memory at the same time you'll see
> > > a behavior change. User traffic tends to be very random access which,
> > > with block caching, can cause a lot of thrashing with frequent cache
> > > evictions. We've seen this bring our cluster to it's knees.
> > >
> > > IMHO a better model is persist things in HBase and then cache things
> > > with memcached just as you would with any other data store. If you're
> > > looking for a spiffy memcached replacement I'd recommend checking out
> > > Redis.
> > >
> > >
> > > On Sat, Aug 18, 2012 at 3:12 AM, Lin Ma <li...@gmail.com> wrote:
> > > > Hello guys,
> > > >
> > > > In your experience, is it practical to use HBase directly for
> serving?
> > > > Saying handle directly user traffic (tens of thousands QPS scale)
> > behind
> > > > Apache, and replace the role of memcached? I am not sure whether
> there
> > > are
> > > > any known panic to replace memcached by using HBase? One issue I
> could
> > > > think about is for a specific row range, only one active region
> server
> > > > could handle the request, but in memcached, we can setup several
> > > memcached
> > > > instance with duplicate content (all of them are active) to serve the
> > > same
> > > > purpose under a VIP which could achieve better performance and
> > > scalability.
> > > >
> > > > Any advice or reference documents are appreciated. Thanks.
> > > >
> > > > regards,
> > > > Lin
> > >
> >
>

Re: Using HBase serving to replace memcached

Posted by Lin Ma <li...@gmail.com>.

Thanks Asif,

Regarding your comment, "Which means there are some requests which will be
handled incorrectly.", could you show me an example of what you mean by
"handled incorrectly"?

regards,
Lin

On Tue, Aug 21, 2012 at 3:26 AM, Asif Ali <az...@gmail.com> wrote:

> I've used memcached heavily in such scenarios and all such data is always
> in Memory.
>
> Memcached definitely is a great solution for this and scales very well. But
> keep in mind - it is not consistent. Which means there are some requests
> which will be handled incorrectly.
>
> Memcached is great but also look at Guava cache for similar use cases.
>
> Asif Ali
>
>
> On Mon, Aug 20, 2012 at 9:09 AM, Lin Ma <li...@gmail.com> wrote:
>
> > Thank you Drew. I like your reply, especially blocking cache nature
> > provided by HBase. A quick question, for traditional memcached, all of
> the
> > items are in memory, no disk is used, correct?
> >
> > regards,
> > Lin
> >
> > On Mon, Aug 20, 2012 at 9:26 PM, Drew Dahlke <dr...@bronto.com>
> > wrote:
> >
> > > I'd say if the memcached model is working for you, stick with it.
> > > HBase (currently) caches whole blocks. With cache blocks enabled you
> > > can achieve 10s of thousands of reqs/sec with a pretty small cluster.
> > > However there's a catch. Once you reach the point where your tables
> > > are so large they can't all sit in memory at the same time you'll see
> > > a behavior change. User traffic tends to be very random access which,
> > > with block caching, can cause a lot of thrashing with frequent cache
> > > evictions. We've seen this bring our cluster to it's knees.
> > >
> > > IMHO a better model is persist things in HBase and then cache things
> > > with memcached just as you would with any other data store. If you're
> > > looking for a spiffy memcached replacement I'd recommend checking out
> > > Redis.
> > >
> > >
> > > On Sat, Aug 18, 2012 at 3:12 AM, Lin Ma <li...@gmail.com> wrote:
> > > > Hello guys,
> > > >
> > > > In your experience, is it practical to use HBase directly for
> serving?
> > > > Saying handle directly user traffic (tens of thousands QPS scale)
> > behind
> > > > Apache, and replace the role of memcached? I am not sure whether
> there
> > > are
> > > > any known panic to replace memcached by using HBase? One issue I
> could
> > > > think about is for a specific row range, only one active region
> server
> > > > could handle the request, but in memcached, we can setup several
> > > memcached
> > > > instance with duplicate content (all of them are active) to serve the
> > > same
> > > > purpose under a VIP which could achieve better performance and
> > > scalability.
> > > >
> > > > Any advice or reference documents are appreciated. Thanks.
> > > >
> > > > regards,
> > > > Lin
> > >
> >
>

Re: Using HBase serving to replace memcached

Posted by Asif Ali <az...@gmail.com>.

I've used memcached heavily in such scenarios and all such data is always
in memory.

Memcached definitely is a great solution for this and scales very well. But
keep in mind - it is not consistent. Which means there are some requests
which will be handled incorrectly.

Memcached is great but also look at Guava cache for similar use cases.

Asif Ali


On Mon, Aug 20, 2012 at 9:09 AM, Lin Ma <li...@gmail.com> wrote:

> Thank you Drew. I like your reply, especially blocking cache nature
> provided by HBase. A quick question, for traditional memcached, all of the
> items are in memory, no disk is used, correct?
>
> regards,
> Lin
>
> On Mon, Aug 20, 2012 at 9:26 PM, Drew Dahlke <dr...@bronto.com>
> wrote:
>
> > I'd say if the memcached model is working for you, stick with it.
> > HBase (currently) caches whole blocks. With cache blocks enabled you
> > can achieve 10s of thousands of reqs/sec with a pretty small cluster.
> > However there's a catch. Once you reach the point where your tables
> > are so large they can't all sit in memory at the same time you'll see
> > a behavior change. User traffic tends to be very random access which,
> > with block caching, can cause a lot of thrashing with frequent cache
> > evictions. We've seen this bring our cluster to it's knees.
> >
> > IMHO a better model is persist things in HBase and then cache things
> > with memcached just as you would with any other data store. If you're
> > looking for a spiffy memcached replacement I'd recommend checking out
> > Redis.
> >
> >
> > On Sat, Aug 18, 2012 at 3:12 AM, Lin Ma <li...@gmail.com> wrote:
> > > Hello guys,
> > >
> > > In your experience, is it practical to use HBase directly for serving?
> > > Saying handle directly user traffic (tens of thousands QPS scale)
> behind
> > > Apache, and replace the role of memcached? I am not sure whether there
> > are
> > > any known panic to replace memcached by using HBase? One issue I could
> > > think about is for a specific row range, only one active region server
> > > could handle the request, but in memcached, we can setup several
> > memcached
> > > instance with duplicate content (all of them are active) to serve the
> > same
> > > purpose under a VIP which could achieve better performance and
> > scalability.
> > >
> > > Any advice or reference documents are appreciated. Thanks.
> > >
> > > regards,
> > > Lin
> >
>

Re: Using HBase serving to replace memcached

Posted by Lin Ma <li...@gmail.com>.

Thank you Drew. I like your reply, especially the point about the
block-caching nature of HBase. A quick question: for traditional memcached,
all of the items are in memory and no disk is used, correct?

regards,
Lin

On Mon, Aug 20, 2012 at 9:26 PM, Drew Dahlke <dr...@bronto.com> wrote:

> I'd say if the memcached model is working for you, stick with it.
> HBase (currently) caches whole blocks. With cache blocks enabled you
> can achieve 10s of thousands of reqs/sec with a pretty small cluster.
> However there's a catch. Once you reach the point where your tables
> are so large they can't all sit in memory at the same time you'll see
> a behavior change. User traffic tends to be very random access which,
> with block caching, can cause a lot of thrashing with frequent cache
> evictions. We've seen this bring our cluster to it's knees.
>
> IMHO a better model is persist things in HBase and then cache things
> with memcached just as you would with any other data store. If you're
> looking for a spiffy memcached replacement I'd recommend checking out
> Redis.
>
>
> On Sat, Aug 18, 2012 at 3:12 AM, Lin Ma <li...@gmail.com> wrote:
> > Hello guys,
> >
> > In your experience, is it practical to use HBase directly for serving?
> > Saying handle directly user traffic (tens of thousands QPS scale) behind
> > Apache, and replace the role of memcached? I am not sure whether there
> are
> > any known panic to replace memcached by using HBase? One issue I could
> > think about is for a specific row range, only one active region server
> > could handle the request, but in memcached, we can setup several
> memcached
> > instance with duplicate content (all of them are active) to serve the
> same
> > purpose under a VIP which could achieve better performance and
> scalability.
> >
> > Any advice or reference documents are appreciated. Thanks.
> >
> > regards,
> > Lin
>

Re: Using HBase serving to replace memcached

Posted by Drew Dahlke <dr...@bronto.com>.

I'd say if the memcached model is working for you, stick with it.
HBase (currently) caches whole blocks. With cache blocks enabled you
can achieve 10s of thousands of reqs/sec with a pretty small cluster.
However there's a catch. Once you reach the point where your tables
are so large they can't all sit in memory at the same time you'll see
a behavior change. User traffic tends to be very random access which,
with block caching, can cause a lot of thrashing with frequent cache
evictions. We've seen this bring our cluster to its knees.
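
The behavior change can be sketched with a toy simulation in Python. A plain LRU cache stands in for the block cache here, and the class name, block counts, and uniform random access pattern are assumptions chosen only to show the effect, not measurements from a real cluster.

```python
import random
from collections import OrderedDict


class LruBlockCache:
    """Toy LRU cache keyed by block id; counts hits, misses and evictions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # simulated read from disk
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
            self.evictions += 1


def hit_ratio(num_blocks, capacity, reads=20000, seed=1):
    """Fraction of cache hits for uniformly random block reads."""
    rng = random.Random(seed)
    cache = LruBlockCache(capacity)
    for _ in range(reads):
        cache.read(rng.randrange(num_blocks))
    return cache.hits / reads


# Table fits in the cache versus a table 100 times larger than the cache.
print(hit_ratio(num_blocks=100, capacity=100))
print(hit_ratio(num_blocks=10000, capacity=100))
```

When every block fits, only the first read of each block misses. When the table is far larger than the cache and access is random, almost every read misses and evicts another block, which is the thrashing described above.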

IMHO a better model is to persist things in HBase and then cache things
with memcached just as you would with any other data store. If you're
looking for a spiffy memcached replacement I'd recommend checking out
Redis.
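
As a rough sketch of that model in Python: plain dicts stand in for both HBase and memcached, and the CacheAside class and its method names are hypothetical. Reads try the cache first and fall back to the store, and writes go to the store and invalidate the cached copy.

```python
class CacheAside:
    """Cache-aside reads and writes; dicts stand in for HBase and memcached."""

    def __init__(self, store, cache):
        self.store = store  # stand-in for the persistent store (HBase)
        self.cache = cache  # stand-in for the cache tier (memcached)
        self.store_reads = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # served from memory
        self.store_reads += 1  # cache miss: go to the store
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for later reads
        return value

    def put(self, key, value):
        self.store[key] = value  # the store stays the source of truth
        self.cache.pop(key, None)  # drop any stale cached copy


serving = CacheAside(store={"user:1": "alice"}, cache={})
serving.get("user:1")  # miss, reads the store and fills the cache
serving.get("user:1")  # hit, the store is not touched
print(serving.store_reads)
```

Invalidating on write keeps the sketch simple. In a real deployment the cache and the store can still briefly disagree, so how much staleness is acceptable is a design decision for the application.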

On Sat, Aug 18, 2012 at 3:12 AM, Lin Ma <li...@gmail.com> wrote:
> Hello guys,
>
> In your experience, is it practical to use HBase directly for serving?
> Saying handle directly user traffic (tens of thousands QPS scale) behind
> Apache, and replace the role of memcached? I am not sure whether there are
> any known panic to replace memcached by using HBase? One issue I could
> think about is for a specific row range, only one active region server
> could handle the request, but in memcached, we can setup several memcached
> instance with duplicate content (all of them are active) to serve the same
> purpose under a VIP which could achieve better performance and scalability.
>
> Any advice or reference documents are appreciated. Thanks.
>
> regards,
> Lin