Posted to user@hbase.apache.org by Demian Berjman <db...@despegar.com> on 2013/07/30 22:37:12 UTC

help on key design

Hi,

I would like to explain our use case of HBase, our row key design, and the
problems we are having, so that anyone can help us:

The first thing we noticed is that our data set is quite small compared to
other cases we read about on the list and in forums. We have a table containing 20
million keys, split automatically by HBase into 4 regions and balanced across 3
region servers. We have designed our key to keep together the set of keys
requested by our app. That is, when we request a set of keys we expect them
to be grouped together to improve data locality and block cache efficiency.

The second thing we noticed, compared to other cases, is that we retrieve a
bunch of keys per request (approximately 500). Thus, during our peaks (3k requests per
minute), we have a lot of requests going to a particular region server and
asking for a lot of keys. That results in poor response times (on the order of
seconds). Currently we are using multi gets.

We think an improvement would be to spread the keys (by introducing a
randomized component into them) across more region servers, so each region server will have to
handle fewer keys and probably fewer requests. That way the multi gets
will be spread over the region servers.

Our questions:

1. Is this design of asking for so many keys on each request correct (if
you need high performance)?
2. What about splitting across more region servers? Is it a good idea? How
could we accomplish this? We thought of applying some hashing...

Thanks in advance!

Re: help on key design

Posted by Ted Yu <yu...@gmail.com>.
Was in a meeting ...

In 0.94, if you look at HConnectionManager#processBatchCallback(), you
would see:

            MultiAction<R> actions = actionsByServer.get(loc);
            if (actions == null) {
              actions = new MultiAction<R>();
              actionsByServer.put(loc, actions);
            }

where:
        Map<HRegionLocation, MultiAction<R>> actionsByServer =
          new HashMap<HRegionLocation, MultiAction<R>>();

And HRegionLocation#hashCode() is defined as:

  public int hashCode() {
    return this.serverName.hashCode();
  }

So the grouping happens at region server level.
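
For context, a minimal client-side sketch (table and row key names are
hypothetical): the caller hands HTable the whole list of Gets, and the 0.94
client does the per-server grouping above under the hood.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MultiGetSketch {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "mytable");
        List<Get> gets = new ArrayList<Get>();
        for (int i = 0; i < 500; i++) {
          // hypothetical row keys; in the OP's case these are the ~500 grouped keys
          gets.add(new Get(Bytes.toBytes("row-" + i)));
        }
        // The client groups these Gets into one MultiAction per region server
        // (HConnectionManager#processBatchCallback), so a single hot server
        // still receives its whole share of the batch in one request.
        Result[] results = table.get(gets);
        System.out.println("fetched " + results.length + " rows");
        table.close();
      }
    }
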
Cheers

On Wed, Jul 31, 2013 at 11:00 AM, Pablo Medina <pa...@gmail.com>wrote:

> Isn't that a job by the multiGet at the client side?. I mean, when you
> provide a list a of gets the client groups them in regions and region
> servers and them submits a job to its executor in order to call the region
> servers in parallel. Is that what you mean, right?.
>
>
>
> 2013/7/31 Ted Yu <yu...@gmail.com>
>
> > From the information Demian provided in the first email:
> >
> > bq. a table containing 20 million keys splitted automatically by HBase
> in 4
> > regions and balanced in 3 region servers
> >
> > I think the number of regions should be increased through (manual)
> > splitting so that the data is spread more evenly across servers.
> >
> > If the Get's are scattered across whole key space, there is some
> > optimization the client can do. Namely group the Get's by region boundary
> > and issue multi get per region.
> >
> > Please also refer to http://hbase.apache.org/book.html#rowkey.design,
> > especially 6.3.2.
> >
> > Cheers
> >
> > On Wed, Jul 31, 2013 at 10:14 AM, Dhaval Shah
> > <pr...@yahoo.co.in>wrote:
> >
> > > Looking at https://issues.apache.org/jira/browse/HBASE-6136 it seems
> > like
> > > the 500 Gets are executed sequentially on the region server.
> > >
> > > Also 3k requests per minute = 50 requests per second. Assuming your
> > > requests take 1 sec (which seems really long but who knows) then you
> need
> > > atleast 50 threads/region server handlers to handle these. Defaults for
> > > that number on some older versions of hbase is 10 which means you are
> > > running out of threads. Which brings up the following questions -
> > > What version of HBase are you running?
> > > How many region server handlers do you have?
> > >
> > > Regards,
> > > Dhaval
> > >
> > >
> > > ----- Original Message -----
> > > From: Demian Berjman <db...@despegar.com>
> > > To: user@hbase.apache.org
> > > Cc:
> > > Sent: Wednesday, 31 July 2013 11:12 AM
> > > Subject: Re: help on key design
> > >
> > > Thanks for the responses!
> > >
> > > >  why don't you use a scan
> > > I'll try that and compare it.
> > >
> > > > How much memory do you have for your region servers? Have you enabled
> > > > block caching? Is your CPU spiking on your region servers?
> > > Block caching is enabled. Cpu and memory dont seem to be a problem.
> > >
> > > We think we are saturating a region because the quantity of keys
> > requested.
> > > In that case my question will be if asking 500+ keys per request is a
> > > normal scenario?
> > >
> > > Cheers,
> > >
> > >
> > > On Wed, Jul 31, 2013 at 11:24 AM, Pablo Medina <
> pablomedina85@gmail.com
> > > >wrote:
> > >
> > > > The scan can be an option if the cost of scanning undesired cells and
> > > > discarding them trough filters is better than accessing those keys
> > > > individually. I would say that as the number of 'undesired' cells
> > > decreases
> > > > the scan overall performance/efficiency gets increased. It all
> depends
> > on
> > > > how the keys are designed to be grouped together.
> > > >
> > > > 2013/7/30 Ted Yu <yu...@gmail.com>
> > > >
> > > > > Please also go over http://hbase.apache.org/book.html#perf.reading
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Tue, Jul 30, 2013 at 3:40 PM, Dhaval Shah <
> > > > prince_mithibai@yahoo.co.in
> > > > > >wrote:
> > > > >
> > > > > > If all your keys are grouped together, why don't you use a scan
> > with
> > > > > > start/end key specified? A sequential scan can theoretically be
> > > faster
> > > > > than
> > > > > > MultiGet lookups (assuming your grouping is tight, you can also
> use
> > > > > filters
> > > > > > with the scan to give better performance)
> > > > > >
> > > > > > How much memory do you have for your region servers? Have you
> > enabled
> > > > > > block caching? Is your CPU spiking on your region servers?
> > > > > >
> > > > > > If you are saturating the resources on your *hot* region server
> > then
> > > > yes
> > > > > > having more region servers will help. If no, then something else
> is
> > > the
> > > > > > bottleneck and you probably need to dig further
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Dhaval
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: Demian Berjman <db...@despegar.com>
> > > > > > To: user@hbase.apache.org
> > > > > > Sent: Tuesday, 30 July 2013 4:37 PM
> > > > > > Subject: help on key design
> > > > > >
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to explain our use case of HBase, the row key design
> > and
> > > > the
> > > > > > problems we are having so anyone can give us a help:
> > > > > >
> > > > > > The first thing we noticed is that our data set is too small
> > compared
> > > > to
> > > > > > other cases we read in the list and forums. We have a table
> > > containing
> > > > 20
> > > > > > million keys splitted automatically by HBase in 4 regions and
> > > balanced
> > > > > in 3
> > > > > > region servers. We have designed our key to keep together the set
> > of
> > > > keys
> > > > > > requested by our app. That is, when we request a set of keys we
> > > expect
> > > > > them
> > > > > > to be grouped together to improve data locality and block cache
> > > > > efficiency.
> > > > > >
> > > > > > The second thing we noticed, compared to other cases, is that we
> > > > > retrieve a
> > > > > > bunch keys per request (500 aprox). Thus, during our peaks (3k
> > > requests
> > > > > per
> > > > > > minute), we have a lot of requests going to a particular region
> > > servers
> > > > > and
> > > > > > asking a lot of keys. That results in poor response times (in the
> > > order
> > > > > of
> > > > > > seconds). Currently we are using multi gets.
> > > > > >
> > > > > > We think an improvement would be to spread the keys (introducing
> a
> > > > > > randomized component on it) in more region servers, so each rs
> will
> > > > have
> > > > > to
> > > > > > handle less keys and probably less requests. Doing that way the
> > multi
> > > > > gets
> > > > > > will be spread over the region servers.
> > > > > >
> > > > > > Our questions:
> > > > > >
> > > > > > 1. Is it correct this design of asking so many keys on each
> > request?
> > > > (if
> > > > > > you need high performance)
> > > > > > 2. What about splitting in more region servers? It's a good idea?
> > How
> > > > we
> > > > > > could accomplish this? We thought in apply some hashing...
> > > > > >
> > > > > > Thanks in advance!
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
>

Re: help on key design

Posted by Pablo Medina <pa...@gmail.com>.
Isn't that a job for the multiGet on the client side? I mean, when you
provide a list of gets, the client groups them by regions and region
servers and then submits a job to its executor in order to call the region
servers in parallel. That is what you mean, right?



2013/7/31 Ted Yu <yu...@gmail.com>

> From the information Demian provided in the first email:
>
> bq. a table containing 20 million keys splitted automatically by HBase in 4
> regions and balanced in 3 region servers
>
> I think the number of regions should be increased through (manual)
> splitting so that the data is spread more evenly across servers.
>
> If the Get's are scattered across whole key space, there is some
> optimization the client can do. Namely group the Get's by region boundary
> and issue multi get per region.
>
> Please also refer to http://hbase.apache.org/book.html#rowkey.design,
> especially 6.3.2.
>
> Cheers
>
> On Wed, Jul 31, 2013 at 10:14 AM, Dhaval Shah
> <pr...@yahoo.co.in>wrote:
>
> > Looking at https://issues.apache.org/jira/browse/HBASE-6136 it seems
> like
> > the 500 Gets are executed sequentially on the region server.
> >
> > Also 3k requests per minute = 50 requests per second. Assuming your
> > requests take 1 sec (which seems really long but who knows) then you need
> > atleast 50 threads/region server handlers to handle these. Defaults for
> > that number on some older versions of hbase is 10 which means you are
> > running out of threads. Which brings up the following questions -
> > What version of HBase are you running?
> > How many region server handlers do you have?
> >
> > Regards,
> > Dhaval
> >
> >
> > ----- Original Message -----
> > From: Demian Berjman <db...@despegar.com>
> > To: user@hbase.apache.org
> > Cc:
> > Sent: Wednesday, 31 July 2013 11:12 AM
> > Subject: Re: help on key design
> >
> > Thanks for the responses!
> >
> > >  why don't you use a scan
> > I'll try that and compare it.
> >
> > > How much memory do you have for your region servers? Have you enabled
> > > block caching? Is your CPU spiking on your region servers?
> > Block caching is enabled. Cpu and memory dont seem to be a problem.
> >
> > We think we are saturating a region because the quantity of keys
> requested.
> > In that case my question will be if asking 500+ keys per request is a
> > normal scenario?
> >
> > Cheers,
> >
> >
> > On Wed, Jul 31, 2013 at 11:24 AM, Pablo Medina <pablomedina85@gmail.com
> > >wrote:
> >
> > > The scan can be an option if the cost of scanning undesired cells and
> > > discarding them trough filters is better than accessing those keys
> > > individually. I would say that as the number of 'undesired' cells
> > decreases
> > > the scan overall performance/efficiency gets increased. It all depends
> on
> > > how the keys are designed to be grouped together.
> > >
> > > 2013/7/30 Ted Yu <yu...@gmail.com>
> > >
> > > > Please also go over http://hbase.apache.org/book.html#perf.reading
> > > >
> > > > Cheers
> > > >
> > > > On Tue, Jul 30, 2013 at 3:40 PM, Dhaval Shah <
> > > prince_mithibai@yahoo.co.in
> > > > >wrote:
> > > >
> > > > > If all your keys are grouped together, why don't you use a scan
> with
> > > > > start/end key specified? A sequential scan can theoretically be
> > faster
> > > > than
> > > > > MultiGet lookups (assuming your grouping is tight, you can also use
> > > > filters
> > > > > with the scan to give better performance)
> > > > >
> > > > > How much memory do you have for your region servers? Have you
> enabled
> > > > > block caching? Is your CPU spiking on your region servers?
> > > > >
> > > > > If you are saturating the resources on your *hot* region server
> then
> > > yes
> > > > > having more region servers will help. If no, then something else is
> > the
> > > > > bottleneck and you probably need to dig further
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > > Dhaval
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: Demian Berjman <db...@despegar.com>
> > > > > To: user@hbase.apache.org
> > > > > Sent: Tuesday, 30 July 2013 4:37 PM
> > > > > Subject: help on key design
> > > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > I would like to explain our use case of HBase, the row key design
> and
> > > the
> > > > > problems we are having so anyone can give us a help:
> > > > >
> > > > > The first thing we noticed is that our data set is too small
> compared
> > > to
> > > > > other cases we read in the list and forums. We have a table
> > containing
> > > 20
> > > > > million keys splitted automatically by HBase in 4 regions and
> > balanced
> > > > in 3
> > > > > region servers. We have designed our key to keep together the set
> of
> > > keys
> > > > > requested by our app. That is, when we request a set of keys we
> > expect
> > > > them
> > > > > to be grouped together to improve data locality and block cache
> > > > efficiency.
> > > > >
> > > > > The second thing we noticed, compared to other cases, is that we
> > > > retrieve a
> > > > > bunch keys per request (500 aprox). Thus, during our peaks (3k
> > requests
> > > > per
> > > > > minute), we have a lot of requests going to a particular region
> > servers
> > > > and
> > > > > asking a lot of keys. That results in poor response times (in the
> > order
> > > > of
> > > > > seconds). Currently we are using multi gets.
> > > > >
> > > > > We think an improvement would be to spread the keys (introducing a
> > > > > randomized component on it) in more region servers, so each rs will
> > > have
> > > > to
> > > > > handle less keys and probably less requests. Doing that way the
> multi
> > > > gets
> > > > > will be spread over the region servers.
> > > > >
> > > > > Our questions:
> > > > >
> > > > > 1. Is it correct this design of asking so many keys on each
> request?
> > > (if
> > > > > you need high performance)
> > > > > 2. What about splitting in more region servers? It's a good idea?
> How
> > > we
> > > > > could accomplish this? We thought in apply some hashing...
> > > > >
> > > > > Thanks in advance!
> > > > >
> > > >
> > >
> >
> >
>

Re: help on key design

Posted by Pablo Medina <pa...@gmail.com>.
Right. I was assuming the scenario where the region is split into two
balanced regions, balanced in terms of requested keys. As you said,
introducing randomness can give you more control over that.


2013/7/31 Michael Segel <ms...@segel.com>

> Really?
>
> You split the region that is hot. What's to stop all of the keys that the
> OP wants are still within the same region?  Not to mention... how do you
> control which region is on which region server?
>
> Just food for thought.
>
> If the OP is doing get()s, then he may want to consider taking the hash,
> truncating it to 4 bytes and prepending it to his key.  This should give
> him some randomness.
>
>
>
> On Jul 31, 2013, at 1:57 PM, Pablo Medina <pa...@gmail.com> wrote:
>
> > If you split that one hot region and then move a half to another region
> > server then you will move the half of the load of that hot region server.
> > The set of hot keys then will be spread over 2 region servers instead of
> > one.
> >
> >
> > 2013/7/31 Michael Segel <ms...@segel.com>
> >
> >> 4 regions on 3 servers?
> >> I'd say that they were already balanced.
> >>
> >> The issue is that when they do their get(s) they are hitting one region.
> >> So more splits isn't the answer.
> >>
> >>
> >> On Jul 31, 2013, at 12:49 PM, Ted Yu <yu...@gmail.com> wrote:
> >>
> >>> From the information Demian provided in the first email:
> >>>
> >>> bq. a table containing 20 million keys splitted automatically by HBase
> >> in 4
> >>> regions and balanced in 3 region servers
> >>>
> >>> I think the number of regions should be increased through (manual)
> >>> splitting so that the data is spread more evenly across servers.
> >>>
> >>> If the Get's are scattered across whole key space, there is some
> >>> optimization the client can do. Namely group the Get's by region
> boundary
> >>> and issue multi get per region.
> >>>
> >>> Please also refer to http://hbase.apache.org/book.html#rowkey.design,
> >>> especially 6.3.2.
> >>>
> >>> Cheers
> >>>
> >>> On Wed, Jul 31, 2013 at 10:14 AM, Dhaval Shah
> >>> <pr...@yahoo.co.in>wrote:
> >>>
> >>>> Looking at https://issues.apache.org/jira/browse/HBASE-6136 it seems
> >> like
> >>>> the 500 Gets are executed sequentially on the region server.
> >>>>
> >>>> Also 3k requests per minute = 50 requests per second. Assuming your
> >>>> requests take 1 sec (which seems really long but who knows) then you
> >> need
> >>>> atleast 50 threads/region server handlers to handle these. Defaults
> for
> >>>> that number on some older versions of hbase is 10 which means you are
> >>>> running out of threads. Which brings up the following questions -
> >>>> What version of HBase are you running?
> >>>> How many region server handlers do you have?
> >>>>
> >>>> Regards,
> >>>> Dhaval
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> From: Demian Berjman <db...@despegar.com>
> >>>> To: user@hbase.apache.org
> >>>> Cc:
> >>>> Sent: Wednesday, 31 July 2013 11:12 AM
> >>>> Subject: Re: help on key design
> >>>>
> >>>> Thanks for the responses!
> >>>>
> >>>>> why don't you use a scan
> >>>> I'll try that and compare it.
> >>>>
> >>>>> How much memory do you have for your region servers? Have you enabled
> >>>>> block caching? Is your CPU spiking on your region servers?
> >>>> Block caching is enabled. Cpu and memory dont seem to be a problem.
> >>>>
> >>>> We think we are saturating a region because the quantity of keys
> >> requested.
> >>>> In that case my question will be if asking 500+ keys per request is a
> >>>> normal scenario?
> >>>>
> >>>> Cheers,
> >>>>
> >>>>
> >>>> On Wed, Jul 31, 2013 at 11:24 AM, Pablo Medina <
> pablomedina85@gmail.com
> >>>>> wrote:
> >>>>
> >>>>> The scan can be an option if the cost of scanning undesired cells and
> >>>>> discarding them trough filters is better than accessing those keys
> >>>>> individually. I would say that as the number of 'undesired' cells
> >>>> decreases
> >>>>> the scan overall performance/efficiency gets increased. It all
> depends
> >> on
> >>>>> how the keys are designed to be grouped together.
> >>>>>
> >>>>> 2013/7/30 Ted Yu <yu...@gmail.com>
> >>>>>
> >>>>>> Please also go over http://hbase.apache.org/book.html#perf.reading
> >>>>>>
> >>>>>> Cheers
> >>>>>>
> >>>>>> On Tue, Jul 30, 2013 at 3:40 PM, Dhaval Shah <
> >>>>> prince_mithibai@yahoo.co.in
> >>>>>>> wrote:
> >>>>>>
> >>>>>>> If all your keys are grouped together, why don't you use a scan
> with
> >>>>>>> start/end key specified? A sequential scan can theoretically be
> >>>> faster
> >>>>>> than
> >>>>>>> MultiGet lookups (assuming your grouping is tight, you can also use
> >>>>>> filters
> >>>>>>> with the scan to give better performance)
> >>>>>>>
> >>>>>>> How much memory do you have for your region servers? Have you
> enabled
> >>>>>>> block caching? Is your CPU spiking on your region servers?
> >>>>>>>
> >>>>>>> If you are saturating the resources on your *hot* region server
> then
> >>>>> yes
> >>>>>>> having more region servers will help. If no, then something else is
> >>>> the
> >>>>>>> bottleneck and you probably need to dig further
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Dhaval
> >>>>>>>
> >>>>>>>
> >>>>>>> ________________________________
> >>>>>>> From: Demian Berjman <db...@despegar.com>
> >>>>>>> To: user@hbase.apache.org
> >>>>>>> Sent: Tuesday, 30 July 2013 4:37 PM
> >>>>>>> Subject: help on key design
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I would like to explain our use case of HBase, the row key design
> and
> >>>>> the
> >>>>>>> problems we are having so anyone can give us a help:
> >>>>>>>
> >>>>>>> The first thing we noticed is that our data set is too small
> compared
> >>>>> to
> >>>>>>> other cases we read in the list and forums. We have a table
> >>>> containing
> >>>>> 20
> >>>>>>> million keys splitted automatically by HBase in 4 regions and
> >>>> balanced
> >>>>>> in 3
> >>>>>>> region servers. We have designed our key to keep together the set
> of
> >>>>> keys
> >>>>>>> requested by our app. That is, when we request a set of keys we
> >>>> expect
> >>>>>> them
> >>>>>>> to be grouped together to improve data locality and block cache
> >>>>>> efficiency.
> >>>>>>>
> >>>>>>> The second thing we noticed, compared to other cases, is that we
> >>>>>> retrieve a
> >>>>>>> bunch keys per request (500 aprox). Thus, during our peaks (3k
> >>>> requests
> >>>>>> per
> >>>>>>> minute), we have a lot of requests going to a particular region
> >>>> servers
> >>>>>> and
> >>>>>>> asking a lot of keys. That results in poor response times (in the
> >>>> order
> >>>>>> of
> >>>>>>> seconds). Currently we are using multi gets.
> >>>>>>>
> >>>>>>> We think an improvement would be to spread the keys (introducing a
> >>>>>>> randomized component on it) in more region servers, so each rs will
> >>>>> have
> >>>>>> to
> >>>>>>> handle less keys and probably less requests. Doing that way the
> multi
> >>>>>> gets
> >>>>>>> will be spread over the region servers.
> >>>>>>>
> >>>>>>> Our questions:
> >>>>>>>
> >>>>>>> 1. Is it correct this design of asking so many keys on each
> request?
> >>>>> (if
> >>>>>>> you need high performance)
> >>>>>>> 2. What about splitting in more region servers? It's a good idea?
> How
> >>>>> we
> >>>>>>> could accomplish this? We thought in apply some hashing...
> >>>>>>>
> >>>>>>> Thanks in advance!
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: help on key design

Posted by Michael Segel <ms...@segel.com>.
Really? 

You split the region that is hot. What's to stop all of the keys that the OP wants from still being within the same region?  Not to mention... how do you control which region is on which region server?

Just food for thought. 

If the OP is doing get()s, then he may want to consider taking the hash, truncating it to 4 bytes and prepending it to his key.  This should give him some randomness. 
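
A minimal sketch of that idea (the helper name and the 4-byte prefix length are
just illustrative; note that reads must recompute the same prefix, and range
scans over the original key order are lost):

    import java.security.MessageDigest;

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKey {
      // Prepend the first 4 bytes of an MD5 hash of the original key so that
      // otherwise-adjacent keys land in different regions.
      public static byte[] saltedRowKey(byte[] originalKey) throws Exception {
        byte[] hash = MessageDigest.getInstance("MD5").digest(originalKey);
        byte[] prefix = new byte[4];
        System.arraycopy(hash, 0, prefix, 0, 4);
        return Bytes.add(prefix, originalKey);
      }
    }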



On Jul 31, 2013, at 1:57 PM, Pablo Medina <pa...@gmail.com> wrote:

> If you split that one hot region and then move a half to another region
> server then you will move the half of the load of that hot region server.
> The set of hot keys then will be spread over 2 region servers instead of
> one.
> 
> 
> 2013/7/31 Michael Segel <ms...@segel.com>
> 
>> 4 regions on 3 servers?
>> I'd say that they were already balanced.
>> 
>> The issue is that when they do their get(s) they are hitting one region.
>> So more splits isn't the answer.
>> 
>> 
>> On Jul 31, 2013, at 12:49 PM, Ted Yu <yu...@gmail.com> wrote:
>> 
>>> From the information Demian provided in the first email:
>>> 
>>> bq. a table containing 20 million keys splitted automatically by HBase
>> in 4
>>> regions and balanced in 3 region servers
>>> 
>>> I think the number of regions should be increased through (manual)
>>> splitting so that the data is spread more evenly across servers.
>>> 
>>> If the Get's are scattered across whole key space, there is some
>>> optimization the client can do. Namely group the Get's by region boundary
>>> and issue multi get per region.
>>> 
>>> Please also refer to http://hbase.apache.org/book.html#rowkey.design,
>>> especially 6.3.2.
>>> 
>>> Cheers
>>> 
>>> On Wed, Jul 31, 2013 at 10:14 AM, Dhaval Shah
>>> <pr...@yahoo.co.in>wrote:
>>> 
>>>> Looking at https://issues.apache.org/jira/browse/HBASE-6136 it seems
>> like
>>>> the 500 Gets are executed sequentially on the region server.
>>>> 
>>>> Also 3k requests per minute = 50 requests per second. Assuming your
>>>> requests take 1 sec (which seems really long but who knows) then you
>> need
>>>> atleast 50 threads/region server handlers to handle these. Defaults for
>>>> that number on some older versions of hbase is 10 which means you are
>>>> running out of threads. Which brings up the following questions -
>>>> What version of HBase are you running?
>>>> How many region server handlers do you have?
>>>> 
>>>> Regards,
>>>> Dhaval
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>> From: Demian Berjman <db...@despegar.com>
>>>> To: user@hbase.apache.org
>>>> Cc:
>>>> Sent: Wednesday, 31 July 2013 11:12 AM
>>>> Subject: Re: help on key design
>>>> 
>>>> Thanks for the responses!
>>>> 
>>>>> why don't you use a scan
>>>> I'll try that and compare it.
>>>> 
>>>>> How much memory do you have for your region servers? Have you enabled
>>>>> block caching? Is your CPU spiking on your region servers?
>>>> Block caching is enabled. Cpu and memory dont seem to be a problem.
>>>> 
>>>> We think we are saturating a region because the quantity of keys
>> requested.
>>>> In that case my question will be if asking 500+ keys per request is a
>>>> normal scenario?
>>>> 
>>>> Cheers,
>>>> 
>>>> 
>>>> On Wed, Jul 31, 2013 at 11:24 AM, Pablo Medina <pablomedina85@gmail.com
>>>>> wrote:
>>>> 
>>>>> The scan can be an option if the cost of scanning undesired cells and
>>>>> discarding them trough filters is better than accessing those keys
>>>>> individually. I would say that as the number of 'undesired' cells
>>>> decreases
>>>>> the scan overall performance/efficiency gets increased. It all depends
>> on
>>>>> how the keys are designed to be grouped together.
>>>>> 
>>>>> 2013/7/30 Ted Yu <yu...@gmail.com>
>>>>> 
>>>>>> Please also go over http://hbase.apache.org/book.html#perf.reading
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> On Tue, Jul 30, 2013 at 3:40 PM, Dhaval Shah <
>>>>> prince_mithibai@yahoo.co.in
>>>>>>> wrote:
>>>>>> 
>>>>>>> If all your keys are grouped together, why don't you use a scan with
>>>>>>> start/end key specified? A sequential scan can theoretically be
>>>> faster
>>>>>> than
>>>>>>> MultiGet lookups (assuming your grouping is tight, you can also use
>>>>>> filters
>>>>>>> with the scan to give better performance)
>>>>>>> 
>>>>>>> How much memory do you have for your region servers? Have you enabled
>>>>>>> block caching? Is your CPU spiking on your region servers?
>>>>>>> 
>>>>>>> If you are saturating the resources on your *hot* region server then
>>>>> yes
>>>>>>> having more region servers will help. If no, then something else is
>>>> the
>>>>>>> bottleneck and you probably need to dig further
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Dhaval
>>>>>>> 
>>>>>>> 
>>>>>>> ________________________________
>>>>>>> From: Demian Berjman <db...@despegar.com>
>>>>>>> To: user@hbase.apache.org
>>>>>>> Sent: Tuesday, 30 July 2013 4:37 PM
>>>>>>> Subject: help on key design
>>>>>>> 
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I would like to explain our use case of HBase, the row key design and
>>>>> the
>>>>>>> problems we are having so anyone can give us a help:
>>>>>>> 
>>>>>>> The first thing we noticed is that our data set is too small compared
>>>>> to
>>>>>>> other cases we read in the list and forums. We have a table
>>>> containing
>>>>> 20
>>>>>>> million keys splitted automatically by HBase in 4 regions and
>>>> balanced
>>>>>> in 3
>>>>>>> region servers. We have designed our key to keep together the set of
>>>>> keys
>>>>>>> requested by our app. That is, when we request a set of keys we
>>>> expect
>>>>>> them
>>>>>>> to be grouped together to improve data locality and block cache
>>>>>> efficiency.
>>>>>>> 
>>>>>>> The second thing we noticed, compared to other cases, is that we
>>>>>> retrieve a
>>>>>>> bunch keys per request (500 aprox). Thus, during our peaks (3k
>>>> requests
>>>>>> per
>>>>>>> minute), we have a lot of requests going to a particular region
>>>> servers
>>>>>> and
>>>>>>> asking a lot of keys. That results in poor response times (in the
>>>> order
>>>>>> of
>>>>>>> seconds). Currently we are using multi gets.
>>>>>>> 
>>>>>>> We think an improvement would be to spread the keys (introducing a
>>>>>>> randomized component on it) in more region servers, so each rs will
>>>>> have
>>>>>> to
>>>>>>> handle less keys and probably less requests. Doing that way the multi
>>>>>> gets
>>>>>>> will be spread over the region servers.
>>>>>>> 
>>>>>>> Our questions:
>>>>>>> 
>>>>>>> 1. Is it correct this design of asking so many keys on each request?
>>>>> (if
>>>>>>> you need high performance)
>>>>>>> 2. What about splitting in more region servers? It's a good idea? How
>>>>> we
>>>>>>> could accomplish this? We thought in apply some hashing...
>>>>>>> 
>>>>>>> Thanks in advance!
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>> 


Re: help on key design

Posted by Pablo Medina <pa...@gmail.com>.
If you split that one hot region and then move one half to another region
server, then you will move half of the load off that hot region server.
The set of hot keys will then be spread over 2 region servers instead of
one.


2013/7/31 Michael Segel <ms...@segel.com>

> 4 regions on 3 servers?
> I'd say that they were already balanced.
>
> The issue is that when they do their get(s) they are hitting one region.
> So more splits isn't the answer.
>
>
> On Jul 31, 2013, at 12:49 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > From the information Demian provided in the first email:
> >
> > bq. a table containing 20 million keys splitted automatically by HBase
> in 4
> > regions and balanced in 3 region servers
> >
> > I think the number of regions should be increased through (manual)
> > splitting so that the data is spread more evenly across servers.
> >
> > If the Get's are scattered across whole key space, there is some
> > optimization the client can do. Namely group the Get's by region boundary
> > and issue multi get per region.
> >
> > Please also refer to http://hbase.apache.org/book.html#rowkey.design,
> > especially 6.3.2.
> >
> > Cheers
> >
> > On Wed, Jul 31, 2013 at 10:14 AM, Dhaval Shah
> > <pr...@yahoo.co.in>wrote:
> >
> >> Looking at https://issues.apache.org/jira/browse/HBASE-6136 it seems
> like
> >> the 500 Gets are executed sequentially on the region server.
> >>
> >> Also 3k requests per minute = 50 requests per second. Assuming your
> >> requests take 1 sec (which seems really long but who knows) then you
> need
> >> atleast 50 threads/region server handlers to handle these. Defaults for
> >> that number on some older versions of hbase is 10 which means you are
> >> running out of threads. Which brings up the following questions -
> >> What version of HBase are you running?
> >> How many region server handlers do you have?
> >>
> >> Regards,
> >> Dhaval
> >>
> >>
> >> ----- Original Message -----
> >> From: Demian Berjman <db...@despegar.com>
> >> To: user@hbase.apache.org
> >> Cc:
> >> Sent: Wednesday, 31 July 2013 11:12 AM
> >> Subject: Re: help on key design
> >>
> >> Thanks for the responses!
> >>
> >>> why don't you use a scan
> >> I'll try that and compare it.
> >>
> >>> How much memory do you have for your region servers? Have you enabled
> >>> block caching? Is your CPU spiking on your region servers?
> >> Block caching is enabled. Cpu and memory dont seem to be a problem.
> >>
> >> We think we are saturating a region because the quantity of keys
> requested.
> >> In that case my question will be if asking 500+ keys per request is a
> >> normal scenario?
> >>
> >> Cheers,
> >>
> >>
> >> On Wed, Jul 31, 2013 at 11:24 AM, Pablo Medina <pablomedina85@gmail.com
> >>> wrote:
> >>
> >>> The scan can be an option if the cost of scanning undesired cells and
> >>> discarding them trough filters is better than accessing those keys
> >>> individually. I would say that as the number of 'undesired' cells
> >> decreases
> >>> the scan overall performance/efficiency gets increased. It all depends
> on
> >>> how the keys are designed to be grouped together.
> >>>
> >>> 2013/7/30 Ted Yu <yu...@gmail.com>
> >>>
> >>>> Please also go over http://hbase.apache.org/book.html#perf.reading
> >>>>
> >>>> Cheers
> >>>>
> >>>> On Tue, Jul 30, 2013 at 3:40 PM, Dhaval Shah <
> >>> prince_mithibai@yahoo.co.in
> >>>>> wrote:
> >>>>
> >>>>> If all your keys are grouped together, why don't you use a scan with
> >>>>> start/end key specified? A sequential scan can theoretically be
> >> faster
> >>>> than
> >>>>> MultiGet lookups (assuming your grouping is tight, you can also use
> >>>> filters
> >>>>> with the scan to give better performance)
> >>>>>
> >>>>> How much memory do you have for your region servers? Have you enabled
> >>>>> block caching? Is your CPU spiking on your region servers?
> >>>>>
> >>>>> If you are saturating the resources on your *hot* region server then
> >>> yes
> >>>>> having more region servers will help. If no, then something else is
> >> the
> >>>>> bottleneck and you probably need to dig further
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>> Dhaval
> >>>>>
> >>>>>
> >>>>> ________________________________
> >>>>> From: Demian Berjman <db...@despegar.com>
> >>>>> To: user@hbase.apache.org
> >>>>> Sent: Tuesday, 30 July 2013 4:37 PM
> >>>>> Subject: help on key design
> >>>>>
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I would like to explain our use case of HBase, the row key design and
> >>> the
> >>>>> problems we are having so anyone can give us a help:
> >>>>>
> >>>>> The first thing we noticed is that our data set is too small compared
> >>> to
> >>>>> other cases we read in the list and forums. We have a table
> >> containing
> >>> 20
> >>>>> million keys splitted automatically by HBase in 4 regions and
> >> balanced
> >>>> in 3
> >>>>> region servers. We have designed our key to keep together the set of
> >>> keys
> >>>>> requested by our app. That is, when we request a set of keys we
> >> expect
> >>>> them
> >>>>> to be grouped together to improve data locality and block cache
> >>>> efficiency.
> >>>>>
> >>>>> The second thing we noticed, compared to other cases, is that we
> >>>> retrieve a
> >>>>> bunch keys per request (500 aprox). Thus, during our peaks (3k
> >> requests
> >>>> per
> >>>>> minute), we have a lot of requests going to a particular region
> >> servers
> >>>> and
> >>>>> asking a lot of keys. That results in poor response times (in the
> >> order
> >>>> of
> >>>>> seconds). Currently we are using multi gets.
> >>>>>
> >>>>> We think an improvement would be to spread the keys (introducing a
> >>>>> randomized component on it) in more region servers, so each rs will
> >>> have
> >>>> to
> >>>>> handle less keys and probably less requests. Doing that way the multi
> >>>> gets
> >>>>> will be spread over the region servers.
> >>>>>
> >>>>> Our questions:
> >>>>>
> >>>>> 1. Is it correct this design of asking so many keys on each request?
> >>> (if
> >>>>> you need high performance)
> >>>>> 2. What about splitting in more region servers? It's a good idea? How
> >>> we
> >>>>> could accomplish this? We thought in apply some hashing...
> >>>>>
> >>>>> Thanks in advance!
> >>>>>
> >>>>
> >>>
> >>
> >>
>
>

Re: help on key design

Posted by Michael Segel <ms...@segel.com>.
4 regions on 3 servers? 
I'd say that they were already balanced.

The issue is that when they do their get(s) they are hitting one region. So more splits isn't the answer. 


On Jul 31, 2013, at 12:49 PM, Ted Yu <yu...@gmail.com> wrote:

> From the information Demian provided in the first email:
> 
> bq. a table containing 20 million keys splitted automatically by HBase in 4
> regions and balanced in 3 region servers
> 
> I think the number of regions should be increased through (manual)
> splitting so that the data is spread more evenly across servers.
> 
> If the Get's are scattered across whole key space, there is some
> optimization the client can do. Namely group the Get's by region boundary
> and issue multi get per region.
> 
> Please also refer to http://hbase.apache.org/book.html#rowkey.design,
> especially 6.3.2.
> 
> Cheers
> 
> On Wed, Jul 31, 2013 at 10:14 AM, Dhaval Shah
> <pr...@yahoo.co.in>wrote:
> 
>> Looking at https://issues.apache.org/jira/browse/HBASE-6136 it seems like
>> the 500 Gets are executed sequentially on the region server.
>> 
>> Also 3k requests per minute = 50 requests per second. Assuming your
>> requests take 1 sec (which seems really long but who knows) then you need
>> atleast 50 threads/region server handlers to handle these. Defaults for
>> that number on some older versions of hbase is 10 which means you are
>> running out of threads. Which brings up the following questions -
>> What version of HBase are you running?
>> How many region server handlers do you have?
>> 
>> Regards,
>> Dhaval
>> 
>> 
>> ----- Original Message -----
>> From: Demian Berjman <db...@despegar.com>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Wednesday, 31 July 2013 11:12 AM
>> Subject: Re: help on key design
>> 
>> Thanks for the responses!
>> 
>>> why don't you use a scan
>> I'll try that and compare it.
>> 
>>> How much memory do you have for your region servers? Have you enabled
>>> block caching? Is your CPU spiking on your region servers?
>> Block caching is enabled. Cpu and memory dont seem to be a problem.
>> 
>> We think we are saturating a region because the quantity of keys requested.
>> In that case my question will be if asking 500+ keys per request is a
>> normal scenario?
>> 
>> Cheers,
>> 
>> 
>> On Wed, Jul 31, 2013 at 11:24 AM, Pablo Medina <pablomedina85@gmail.com
>>> wrote:
>> 
>>> The scan can be an option if the cost of scanning undesired cells and
>>> discarding them trough filters is better than accessing those keys
>>> individually. I would say that as the number of 'undesired' cells
>> decreases
>>> the scan overall performance/efficiency gets increased. It all depends on
>>> how the keys are designed to be grouped together.
>>> 
>>> 2013/7/30 Ted Yu <yu...@gmail.com>
>>> 
>>>> Please also go over http://hbase.apache.org/book.html#perf.reading
>>>> 
>>>> Cheers
>>>> 
>>>> On Tue, Jul 30, 2013 at 3:40 PM, Dhaval Shah <
>>> prince_mithibai@yahoo.co.in
>>>>> wrote:
>>>> 
>>>>> If all your keys are grouped together, why don't you use a scan with
>>>>> start/end key specified? A sequential scan can theoretically be
>> faster
>>>> than
>>>>> MultiGet lookups (assuming your grouping is tight, you can also use
>>>> filters
>>>>> with the scan to give better performance)
>>>>> 
>>>>> How much memory do you have for your region servers? Have you enabled
>>>>> block caching? Is your CPU spiking on your region servers?
>>>>> 
>>>>> If you are saturating the resources on your *hot* region server then
>>> yes
>>>>> having more region servers will help. If no, then something else is
>> the
>>>>> bottleneck and you probably need to dig further
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Dhaval
>>>>> 
>>>>> 
>>>>> ________________________________
>>>>> From: Demian Berjman <db...@despegar.com>
>>>>> To: user@hbase.apache.org
>>>>> Sent: Tuesday, 30 July 2013 4:37 PM
>>>>> Subject: help on key design
>>>>> 
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I would like to explain our use case of HBase, the row key design and
>>> the
>>>>> problems we are having so anyone can give us a help:
>>>>> 
>>>>> The first thing we noticed is that our data set is too small compared
>>> to
>>>>> other cases we read in the list and forums. We have a table
>> containing
>>> 20
>>>>> million keys splitted automatically by HBase in 4 regions and
>> balanced
>>>> in 3
>>>>> region servers. We have designed our key to keep together the set of
>>> keys
>>>>> requested by our app. That is, when we request a set of keys we
>> expect
>>>> them
>>>>> to be grouped together to improve data locality and block cache
>>>> efficiency.
>>>>> 
>>>>> The second thing we noticed, compared to other cases, is that we
>>>> retrieve a
>>>>> bunch keys per request (500 aprox). Thus, during our peaks (3k
>> requests
>>>> per
>>>>> minute), we have a lot of requests going to a particular region
>> servers
>>>> and
>>>>> asking a lot of keys. That results in poor response times (in the
>> order
>>>> of
>>>>> seconds). Currently we are using multi gets.
>>>>> 
>>>>> We think an improvement would be to spread the keys (introducing a
>>>>> randomized component on it) in more region servers, so each rs will
>>> have
>>>> to
>>>>> handle less keys and probably less requests. Doing that way the multi
>>>> gets
>>>>> will be spread over the region servers.
>>>>> 
>>>>> Our questions:
>>>>> 
>>>>> 1. Is it correct this design of asking so many keys on each request?
>>> (if
>>>>> you need high performance)
>>>>> 2. What about splitting in more region servers? It's a good idea? How
>>> we
>>>>> could accomplish this? We thought in apply some hashing...
>>>>> 
>>>>> Thanks in advance!
>>>>> 
>>>> 
>>> 
>> 
>> 


Re: help on key design

Posted by Ted Yu <yu...@gmail.com>.
From the information Demian provided in the first email:

bq. a table containing 20 million keys splitted automatically by HBase in 4
regions and balanced in 3 region servers

I think the number of regions should be increased through (manual)
splitting so that the data is spread more evenly across servers.

If the Get's are scattered across the whole key space, there is some
optimization the client can do: namely, group the Get's by region boundary
and issue a multi get per region.
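
A rough sketch of that grouping, assuming 0.94's HTable#getRegionLocation (the
class name and the choice of HRegionInfo as the map key are only illustrative):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;

    public class GroupGetsByRegion {
      // Bucket the Gets by the region holding each row, so that a separate
      // multi get can be issued per region (or per region server).
      public static Map<HRegionInfo, List<Get>> group(HTable table, List<Get> gets)
          throws Exception {
        Map<HRegionInfo, List<Get>> byRegion = new HashMap<HRegionInfo, List<Get>>();
        for (Get get : gets) {
          HRegionInfo region = table.getRegionLocation(get.getRow()).getRegionInfo();
          List<Get> bucket = byRegion.get(region);
          if (bucket == null) {
            bucket = new ArrayList<Get>();
            byRegion.put(region, bucket);
          }
          bucket.add(get);
        }
        return byRegion;
      }
    }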

Please also refer to http://hbase.apache.org/book.html#rowkey.design,
especially 6.3.2.

Cheers

On Wed, Jul 31, 2013 at 10:14 AM, Dhaval Shah
<pr...@yahoo.co.in>wrote:

> Looking at https://issues.apache.org/jira/browse/HBASE-6136 it seems like
> the 500 Gets are executed sequentially on the region server.
>
> Also 3k requests per minute = 50 requests per second. Assuming your
> requests take 1 sec (which seems really long but who knows) then you need
> atleast 50 threads/region server handlers to handle these. Defaults for
> that number on some older versions of hbase is 10 which means you are
> running out of threads. Which brings up the following questions -
> What version of HBase are you running?
> How many region server handlers do you have?
>
> Regards,
> Dhaval
>
>
> ----- Original Message -----
> From: Demian Berjman <db...@despegar.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Wednesday, 31 July 2013 11:12 AM
> Subject: Re: help on key design
>
> Thanks for the responses!
>
> >  why don't you use a scan
> I'll try that and compare it.
>
> > How much memory do you have for your region servers? Have you enabled
> > block caching? Is your CPU spiking on your region servers?
> Block caching is enabled. Cpu and memory dont seem to be a problem.
>
> We think we are saturating a region because the quantity of keys requested.
> In that case my question will be if asking 500+ keys per request is a
> normal scenario?
>
> Cheers,
>
>
> On Wed, Jul 31, 2013 at 11:24 AM, Pablo Medina <pablomedina85@gmail.com
> >wrote:
>
> > The scan can be an option if the cost of scanning undesired cells and
> > discarding them trough filters is better than accessing those keys
> > individually. I would say that as the number of 'undesired' cells
> decreases
> > the scan overall performance/efficiency gets increased. It all depends on
> > how the keys are designed to be grouped together.
> >
> > 2013/7/30 Ted Yu <yu...@gmail.com>
> >
> > > Please also go over http://hbase.apache.org/book.html#perf.reading
> > >
> > > Cheers
> > >
> > > On Tue, Jul 30, 2013 at 3:40 PM, Dhaval Shah <
> > prince_mithibai@yahoo.co.in
> > > >wrote:
> > >
> > > > If all your keys are grouped together, why don't you use a scan with
> > > > start/end key specified? A sequential scan can theoretically be
> faster
> > > than
> > > > MultiGet lookups (assuming your grouping is tight, you can also use
> > > filters
> > > > with the scan to give better performance)
> > > >
> > > > How much memory do you have for your region servers? Have you enabled
> > > > block caching? Is your CPU spiking on your region servers?
> > > >
> > > > If you are saturating the resources on your *hot* region server then
> > yes
> > > > having more region servers will help. If no, then something else is
> the
> > > > bottleneck and you probably need to dig further
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > > Dhaval
> > > >
> > > >
> > > > ________________________________
> > > > From: Demian Berjman <db...@despegar.com>
> > > > To: user@hbase.apache.org
> > > > Sent: Tuesday, 30 July 2013 4:37 PM
> > > > Subject: help on key design
> > > >
> > > >
> > > > Hi,
> > > >
> > > > I would like to explain our use case of HBase, the row key design and
> > the
> > > > problems we are having so anyone can give us a help:
> > > >
> > > > The first thing we noticed is that our data set is too small compared
> > to
> > > > other cases we read in the list and forums. We have a table
> containing
> > 20
> > > > million keys splitted automatically by HBase in 4 regions and
> balanced
> > > in 3
> > > > region servers. We have designed our key to keep together the set of
> > keys
> > > > requested by our app. That is, when we request a set of keys we
> expect
> > > them
> > > > to be grouped together to improve data locality and block cache
> > > efficiency.
> > > >
> > > > The second thing we noticed, compared to other cases, is that we
> > > retrieve a
> > > > bunch keys per request (500 aprox). Thus, during our peaks (3k
> requests
> > > per
> > > > minute), we have a lot of requests going to a particular region
> servers
> > > and
> > > > asking a lot of keys. That results in poor response times (in the
> order
> > > of
> > > > seconds). Currently we are using multi gets.
> > > >
> > > > We think an improvement would be to spread the keys (introducing a
> > > > randomized component on it) in more region servers, so each rs will
> > have
> > > to
> > > > handle less keys and probably less requests. Doing that way the multi
> > > gets
> > > > will be spread over the region servers.
> > > >
> > > > Our questions:
> > > >
> > > > 1. Is it correct this design of asking so many keys on each request?
> > (if
> > > > you need high performance)
> > > > 2. What about splitting in more region servers? It's a good idea? How
> > we
> > > > could accomplish this? We thought in apply some hashing...
> > > >
> > > > Thanks in advance!
> > > >
> > >
> >
>
>

Re: help on key design

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Yup, that issue definitely seems relevant. Unfortunately you might have to wait until you can upgrade or patch your version. In the meantime, depending on how well your rows are grouped (and if you are using Bloom filters), the scan might give you a short-term solution.
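
For example, a minimal sketch of the scan variant, with hypothetical table and
family names and an assumed common key prefix used for the start/stop rows:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanSketch {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "mytable");
        // One bounded scan in place of ~500 individual Gets, for a tightly
        // grouped set of keys sharing the hypothetical "group-A|" prefix.
        Scan scan = new Scan(Bytes.toBytes("group-A|"), Bytes.toBytes("group-A}"));
        scan.addFamily(Bytes.toBytes("cf"));
        scan.setCaching(500); // fetch rows in large batches to cut RPC round trips
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            // process r
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }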
 
Regards,
Dhaval


----- Original Message -----
From: Demian Berjman <db...@despegar.com>
To: user@hbase.apache.org; Dhaval Shah <pr...@yahoo.co.in>
Cc: 
Sent: Wednesday, 31 July 2013 2:41 PM
Subject: Re: help on key design

Dhaval,

> What version of HBase are you running?
0.94.7

> How many region server handlers do you have?
100

We are following this issue:
https://issues.apache.org/jira/browse/HBASE-9087

Ted, we think too that splitting may incur in a better performance. But
like you said, it must be done manually.

Thanks!


On Wed, Jul 31, 2013 at 2:14 PM, Dhaval Shah <pr...@yahoo.co.in>wrote:

> Looking at https://issues.apache.org/jira/browse/HBASE-6136 it seems like
> the 500 Gets are executed sequentially on the region server.
>
> Also 3k requests per minute = 50 requests per second. Assuming your
> requests take 1 sec (which seems really long but who knows) then you need
> atleast 50 threads/region server handlers to handle these. Defaults for
> that number on some older versions of hbase is 10 which means you are
> running out of threads. Which brings up the following questions -
> What version of HBase are you running?
> How many region server handlers do you have?
>
> Regards,
> Dhaval
>
>
> ----- Original Message -----
> From: Demian Berjman <db...@despegar.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Wednesday, 31 July 2013 11:12 AM
> Subject: Re: help on key design
>
> Thanks for the responses!
>
> >  why don't you use a scan
> I'll try that and compare it.
>
> > How much memory do you have for your region servers? Have you enabled
> > block caching? Is your CPU spiking on your region servers?
> Block caching is enabled. Cpu and memory dont seem to be a problem.
>
> We think we are saturating a region because the quantity of keys requested.
> In that case my question will be if asking 500+ keys per request is a
> normal scenario?
>
> Cheers,
>
>
> On Wed, Jul 31, 2013 at 11:24 AM, Pablo Medina <pablomedina85@gmail.com
> >wrote:
>
> > The scan can be an option if the cost of scanning undesired cells and
> > discarding them trough filters is better than accessing those keys
> > individually. I would say that as the number of 'undesired' cells
> decreases
> > the scan overall performance/efficiency gets increased. It all depends on
> > how the keys are designed to be grouped together.
> >
> > 2013/7/30 Ted Yu <yu...@gmail.com>
> >
> > > Please also go over http://hbase.apache.org/book.html#perf.reading
> > >
> > > Cheers
> > >
> > > On Tue, Jul 30, 2013 at 3:40 PM, Dhaval Shah <
> > prince_mithibai@yahoo.co.in
> > > >wrote:
> > >
> > > > If all your keys are grouped together, why don't you use a scan with
> > > > start/end key specified? A sequential scan can theoretically be
> faster
> > > than
> > > > MultiGet lookups (assuming your grouping is tight, you can also use
> > > filters
> > > > with the scan to give better performance)
> > > >
> > > > How much memory do you have for your region servers? Have you enabled
> > > > block caching? Is your CPU spiking on your region servers?
> > > >
> > > > If you are saturating the resources on your *hot* region server then
> > yes
> > > > having more region servers will help. If no, then something else is
> the
> > > > bottleneck and you probably need to dig further
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > > Dhaval
> > > >
> > > >
> > > > ________________________________
> > > > From: Demian Berjman <db...@despegar.com>
> > > > To: user@hbase.apache.org
> > > > Sent: Tuesday, 30 July 2013 4:37 PM
> > > > Subject: help on key design
> > > >
> > > >
> > > > Hi,
> > > >
> > > > I would like to explain our use case of HBase, the row key design and
> > the
> > > > problems we are having so anyone can give us a help:
> > > >
> > > > The first thing we noticed is that our data set is too small compared
> > to
> > > > other cases we read in the list and forums. We have a table
> containing
> > 20
> > > > million keys splitted automatically by HBase in 4 regions and
> balanced
> > > in 3
> > > > region servers. We have designed our key to keep together the set of
> > keys
> > > > requested by our app. That is, when we request a set of keys we
> expect
> > > them
> > > > to be grouped together to improve data locality and block cache
> > > efficiency.
> > > >
> > > > The second thing we noticed, compared to other cases, is that we
> > > retrieve a
> > > > bunch keys per request (500 aprox). Thus, during our peaks (3k
> requests
> > > per
> > > > minute), we have a lot of requests going to a particular region
> servers
> > > and
> > > > asking a lot of keys. That results in poor response times (in the
> order
> > > of
> > > > seconds). Currently we are using multi gets.
> > > >
> > > > We think an improvement would be to spread the keys (introducing a
> > > > randomized component on it) in more region servers, so each rs will
> > have
> > > to
> > > > handle less keys and probably less requests. Doing that way the multi
> > > gets
> > > > will be spread over the region servers.
> > > >
> > > > Our questions:
> > > >
> > > > 1. Is it correct this design of asking so many keys on each request?
> > (if
> > > > you need high performance)
> > > > 2. What about splitting in more region servers? It's a good idea? How
> > we
> > > > could accomplish this? We thought in apply some hashing...
> > > >
> > > > Thanks in advance!
> > > >
> > >
> >
>
>


Re: help on key design

Posted by Demian Berjman <db...@despegar.com>.
Dhaval,

> What version of HBase are you running?
0.94.7

> How many region server handlers do you have?
100

We are following this issue:
https://issues.apache.org/jira/browse/HBASE-9087

Ted, we also think that splitting may improve performance. But,
like you said, it must be done manually.
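
For reference, a rough sketch of driving the splits from HBaseAdmin, with
made-up table names and split points (pre-splitting a new table at creation
time, or requesting a split of an existing one at a chosen key):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ManualSplitSketch {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

        // Pre-split a new table into 16 regions on a 1-byte key prefix.
        HTableDescriptor desc = new HTableDescriptor("mytable_presplit");
        desc.addFamily(new HColumnDescriptor("cf"));
        byte[][] splitKeys = new byte[15][];
        for (int i = 1; i <= 15; i++) {
          splitKeys[i - 1] = new byte[] { (byte) (i * 16) };
        }
        admin.createTable(desc, splitKeys);

        // Or request a split of an existing table at an explicit key.
        admin.split(Bytes.toBytes("mytable"), Bytes.toBytes("somekey"));

        admin.close();
      }
    }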

Thanks!


On Wed, Jul 31, 2013 at 2:14 PM, Dhaval Shah <pr...@yahoo.co.in>wrote:

> Looking at https://issues.apache.org/jira/browse/HBASE-6136 it seems like
> the 500 Gets are executed sequentially on the region server.
>
> Also 3k requests per minute = 50 requests per second. Assuming your
> requests take 1 sec (which seems really long but who knows) then you need
> atleast 50 threads/region server handlers to handle these. Defaults for
> that number on some older versions of hbase is 10 which means you are
> running out of threads. Which brings up the following questions -
> What version of HBase are you running?
> How many region server handlers do you have?
>
> Regards,
> Dhaval

Re: help on key design

Posted by Dhaval Shah <pr...@yahoo.co.in>.
Looking at https://issues.apache.org/jira/browse/HBASE-6136, it seems like the 500 Gets are executed sequentially on the region server.

Also, 3k requests per minute = 50 requests per second. Assuming your requests take 1 sec (which seems really long, but who knows), you need at least 50 threads/region server handlers to handle these. The default for that number on some older versions of HBase is 10, which means you are running out of threads. Which brings up the following questions -
What version of HBase are you running?
How many region server handlers do you have?
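
For reference, that number is hbase.regionserver.handler.count in
hbase-site.xml on the region servers; something like the snippet below
(100 is only an illustrative value, tune it to your workload and memory),
followed by a region server restart:

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>100</value>
</property>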
 
Regards,
Dhaval




Re: help on key design

Posted by Demian Berjman <db...@despegar.com>.
Thanks for the responses!

>  why don't you use a scan
I'll try that and compare it.

> How much memory do you have for your region servers? Have you enabled
> block caching? Is your CPU spiking on your region servers?
Block caching is enabled. CPU and memory don't seem to be a problem.

We think we are saturating a region because of the quantity of keys
requested. In that case, my question would be: is asking for 500+ keys per
request a normal scenario?
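
For context, this is roughly how one of our requests looks today
(simplified sketch, not our real code; the table name, key format and
column family are made up):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");        // table name is made up

    // roughly 500 row keys per application request (hard-coded here)
    List<Get> gets = new ArrayList<Get>();
    for (int i = 0; i < 500; i++) {
      Get get = new Get(Bytes.toBytes("group-123|item-" + i));
      get.addFamily(Bytes.toBytes("cf"));               // hypothetical column family
      gets.add(get);
    }

    // one multi get for the whole batch; the client groups the Gets
    // per region server and sends one batch to each of them
    Result[] results = table.get(gets);

    System.out.println("fetched " + results.length + " rows");
    table.close();
  }
}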

Cheers,



Re: help on key design

Posted by Pablo Medina <pa...@gmail.com>.
A scan can be an option if the cost of scanning undesired cells and
discarding them through filters is lower than the cost of accessing those
keys individually. I would say that as the number of 'undesired' cells
decreases, the scan's overall performance/efficiency increases. It all
depends on how the keys are designed to be grouped together.
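
Just to illustrate, a rough sketch of the scan-plus-filter variant
(0.94-style client API; the table name and the key prefix are invented for
the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanWithFilterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");      // table name is made up
    byte[] prefix = Bytes.toBytes("group-123|");      // hypothetical common key prefix

    Scan scan = new Scan(prefix);                      // start at the first key of the group
    scan.setFilter(new PrefixFilter(prefix));          // rows outside the group are discarded on the RS
    scan.setCaching(500);                              // return many rows per RPC

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // only rows whose key starts with the prefix reach the client
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}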


Re: help on key design

Posted by Ted Yu <yu...@gmail.com>.
Please also go over http://hbase.apache.org/book.html#perf.reading

Cheers


Re: help on key design

Posted by Dhaval Shah <pr...@yahoo.co.in>.
If all your keys are grouped together, why don't you use a scan with the start/end key specified? A sequential scan can theoretically be faster than MultiGet lookups (assuming your grouping is tight; you can also use filters with the scan for better performance).
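
Roughly something like the sketch below (0.94-style client API; the table
name and key range are made up, and it assumes the keys of one request fall
inside a single contiguous range):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");             // table name is made up

    Scan scan = new Scan(Bytes.toBytes("group-123|"),         // start row, inclusive
                         Bytes.toBytes("group-124"));         // stop row, exclusive (first key after the group)
    scan.setCaching(500);                                      // fetch many rows per RPC round trip

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // process one row of the requested range
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}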

How much memory do you have for your region servers? Have you enabled block caching? Is your CPU spiking on your region servers?

If you are saturating the resources on your *hot* region server, then yes, having more region servers will help. If not, then something else is the bottleneck and you probably need to dig further.




Regards,
Dhaval

