Posted to user@hbase.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/08/30 00:56:51 UTC

md5 hash key and splits

If I use an md5 hash + timestamp row key, will HBase automatically detect the
difference in ranges and perform splits? How does splitting work in such cases,
or is it still advisable to split the regions manually?

Re: md5 hash key and splits

Posted by Stack <st...@duboce.net>.
On Fri, Aug 31, 2012 at 6:09 AM, Doug Meil
<do...@explorysmedical.com> wrote:
>
> Stack, re:  "Where did you read that?", I think he might also be referring
> to this...
>
> http://hbase.apache.org/book.html#important_configurations
>

I'd say we need to revisit that paragraph.  It gives a 'wrong'
impression.  It starts out w/ a blanket statement that users should do
manual splitting.  I filed
https://issues.apache.org/jira/browse/HBASE-6701.

St.Ack

Re: md5 hash key and splits

Posted by Stack <st...@duboce.net>.
On Fri, Aug 31, 2012 at 7:55 AM, Mohit Anchlia <mo...@gmail.com> wrote:
> My data is time series, and to get a random distribution while still keeping
> the keys for a user in the same region, I am thinking of using
> md5(userid)+reversetimestamp as a row key. But with this type of key, how
> can one do pre-splits? I have 30 nodes.
>

If you don't know the key spread ahead of time, let HBase do the
splitting for you?
St.Ack
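
A minimal sketch of the key layout discussed above and of one way to
pre-split it, in plain Python with no HBase client involved. It assumes md5
output is close to uniform, so evenly spaced boundaries over the first two
key bytes give roughly equal key ranges; whether the write load is equally
even depends on how active each user is, which is the part that cannot be
known ahead of time. The names row_key and uniform_split_points, the 2-byte
prefix, and the use of 2**63 - 1 as the reversal constant are illustrative
assumptions, not anything taken from the thread.

```python
import hashlib
import struct

LONG_MAX = 2**63 - 1  # assumed reversal constant (largest signed 64-bit value)


def row_key(user_id, ts_millis):
    """md5(userid) + reverse timestamp: 16 hash bytes, then 8 big-endian bytes."""
    prefix = hashlib.md5(user_id.encode("utf-8")).digest()
    reverse_ts = LONG_MAX - ts_millis  # newer rows sort first within a user
    return prefix + struct.pack(">q", reverse_ts)


def uniform_split_points(num_regions, prefix_bytes=2):
    """Evenly spaced boundaries over the first prefix_bytes of the key space."""
    space = 256 ** prefix_bytes
    return [
        (i * space // num_regions).to_bytes(prefix_bytes, "big")
        for i in range(1, num_regions)
    ]


# One region per node for a 30-node cluster gives 29 boundaries.
splits = uniform_split_points(30)
```

The boundaries would be supplied as split keys when the table is created.
If the real distribution turns out to be skewed, automatic splitting can
still even things out later, unless it has been disabled.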

Re: md5 hash key and splits

Posted by Mohit Anchlia <mo...@gmail.com>.
On Thu, Aug 30, 2012 at 11:52 PM, Stack <st...@duboce.net> wrote:

> On Thu, Aug 30, 2012 at 5:04 PM, Mohit Anchlia <mo...@gmail.com>
> wrote:
> > In general isn't it better to split the regions so that the load can be
> > spread across the cluster to avoid hotspots?
> >
>
> Time series data is a particular case [1] and the sematextians have
> tools to help w/ that particular loading pattern.  Is time series your
> loading pattern?  If so, yes, you need to employ some smarts (tsdb
> schema and write tricks or hbasewd tool) to avoid hotspotting.  But
> hotspotting is an issue apart from splits; you can split all you want
> and if your row keys are time series, splitting won't undo the hotspotting.
>
My data is time series, and to get a random distribution while still keeping
the keys for a user in the same region, I am thinking of using
md5(userid)+reversetimestamp as a row key. But with this type of key, how
can one do pre-splits? I have 30 nodes.


> You would split to distribute load over the cluster and HBase should
> be doing this for you w/o need of human intervention (caveat the
> reasons you might want to manually split as listed above by AK and
> Ian).
>
> St.Ack
> 1. http://hbase.apache.org/book.html#rowkey.design
>

Re: md5 hash key and splits

Posted by Stack <st...@duboce.net>.
On Thu, Aug 30, 2012 at 5:04 PM, Mohit Anchlia <mo...@gmail.com> wrote:
> In general isn't it better to split the regions so that the load can be
> spread across the cluster to avoid hotspots?
>

Time series data is a particular case [1] and the sematextians have
tools to help w/ that particular loading pattern.  Is time series your
loading pattern?  If so, yes, you need to employ some smarts (tsdb
schema and write tricks or hbasewd tool) to avoid hotspotting.  But
hotspotting is an issue apart from splits; you can split all you want
and if your row keys are time series, splitting won't undo the hotspotting.

You would split to distribute load over the cluster and HBase should
be doing this for you w/o need of human intervention (caveat the
reasons you might want to manually split as listed above by AK and
Ian).

St.Ack
1. http://hbase.apache.org/book.html#rowkey.design
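
To illustrate why splitting alone does not cure time-series hotspotting,
here is a toy model in plain Python (no HBase involved). It assumes regions
are contiguous, sorted key ranges and that a write goes to the region whose
range contains the row key. The helper names, the zero-padded timestamp
format, and the chosen split points are all hypothetical.

```python
import bisect
import hashlib
from collections import Counter


def region_for(key, split_points):
    """Index of the region whose [start, end) range contains key."""
    return bisect.bisect_right(split_points, key)


def write_distribution(keys, split_points):
    """Count how many writes each region index receives."""
    return Counter(region_for(k, split_points) for k in keys)


# Time-series keys: the table was split on older data, but every new
# timestamp sorts after the last boundary.
old_splits = [b"%013d" % t for t in (1000, 2000, 3000)]
new_ts_keys = [b"%013d" % t for t in range(5000, 5100)]
ts_hits = write_distribution(new_ts_keys, old_splits)

# Hash-prefixed keys: four regions split evenly on the first key byte.
hash_splits = [bytes([b]) for b in (64, 128, 192)]
hashed_keys = [hashlib.md5(k).digest() + k for k in new_ts_keys]
hash_hits = write_distribution(hashed_keys, hash_splits)
```

In this model every new time-series write lands in the last region no matter
how many earlier boundaries exist, while the hash-prefixed writes are
scattered across the regions.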

Re: md5 hash key and splits

Posted by Doug Meil <do...@explorysmedical.com>.
Stack, re:  "Where did you read that?", I think he might also be referring
to this...

http://hbase.apache.org/book.html#important_configurations

On 8/30/12 8:04 PM, "Mohit Anchlia" <mo...@gmail.com> wrote:

>In general isn't it better to split the regions so that the load can be
>spread across the cluster to avoid hotspots?
>
>I read about pre-splitting here:
>
>http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>
>On Thu, Aug 30, 2012 at 4:30 PM, Amandeep Khurana <am...@gmail.com>
>wrote:
>
>> Also, you might have read that an initial loading of data can be better
>> distributed across the cluster if the table is pre-split rather than
>> starting with a single region and splitting (possibly aggressively,
>> depending on the throughput) as the data loads in. Once you are in a
>> stable state with regions distributed across the cluster, there is really
>> no benefit in terms of spreading load by managing splitting manually vs.
>> letting HBase do it for you. At that point it's about what Ian mentioned:
>> predictability of latencies by avoiding splits happening at a busy time.
>>
>> On Thu, Aug 30, 2012 at 4:26 PM, Ian Varley <iv...@salesforce.com>
>> wrote:
>>
>> > The Facebook devs have mentioned in public talks that they pre-split
>> > their tables and don't use automated region splitting. But as far as I
>> > remember, the reason for that isn't predictability of spreading load, so
>> > much as predictability of uptime & latency (they don't want an automated
>> > split to happen at a random busy time). Maybe that's what you mean, Mohit?
>> >
>> > Ian
>> >
>> > On Aug 30, 2012, at 5:45 PM, Stack wrote:
>> >
>> > On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia <mohitanchlia@gmail.com>
>> > wrote:
>> > > From what I've read it's advisable to do manual splits since you are
>> > > able to spread the load in a more predictable way. If I am missing
>> > > something please let me know.
>> >
>> >
>> > Where did you read that?
>> > St.Ack
>> >
>> >
>>



Re: md5 hash key and splits

Posted by Mohit Anchlia <mo...@gmail.com>.
In general isn't it better to split the regions so that the load can be
spread across the cluster to avoid hotspots?

I read about pre-splitting here:

http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

On Thu, Aug 30, 2012 at 4:30 PM, Amandeep Khurana <am...@gmail.com> wrote:

> Also, you might have read that an initial loading of data can be better
> distributed across the cluster if the table is pre-split rather than
> starting with a single region and splitting (possibly aggressively,
> depending on the throughput) as the data loads in. Once you are in a stable
> state with regions distributed across the cluster, there is really no
> benefit in terms of spreading load by managing splitting manually v/s
> letting HBase do it for you. At that point it's about what Ian mentioned -
> predictability of latencies by avoiding splits happening at a busy time.
>
> On Thu, Aug 30, 2012 at 4:26 PM, Ian Varley <iv...@salesforce.com>
> wrote:
>
> > The Facebook devs have mentioned in public talks that they pre-split
> > their tables and don't use automated region splitting. But as far as I
> > remember, the reason for that isn't predictability of spreading load, so
> > much as predictability of uptime & latency (they don't want an automated
> > split to happen at a random busy time). Maybe that's what you mean, Mohit?
> >
> > Ian
> >
> > On Aug 30, 2012, at 5:45 PM, Stack wrote:
> >
> > On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia <mohitanchlia@gmail.com>
> > wrote:
> > > From what I've read it's advisable to do manual splits since you are
> > > able to spread the load in a more predictable way. If I am missing
> > > something please let me know.
> >
> >
> > Where did you read that?
> > St.Ack
> >
> >
>

Re: md5 hash key and splits

Posted by Amandeep Khurana <am...@gmail.com>.
Also, you might have read that an initial loading of data can be better
distributed across the cluster if the table is pre-split rather than
starting with a single region and splitting (possibly aggressively,
depending on the throughput) as the data loads in. Once you are in a stable
state with regions distributed across the cluster, there is really no
benefit in terms of spreading load by managing splitting manually v/s
letting HBase do it for you. At that point it's about what Ian mentioned -
predictability of latencies by avoiding splits happening at a busy time.

On Thu, Aug 30, 2012 at 4:26 PM, Ian Varley <iv...@salesforce.com> wrote:

> The Facebook devs have mentioned in public talks that they pre-split their
> tables and don't use automated region splitting. But as far as I remember,
> the reason for that isn't predictability of spreading load, so much as
> predictability of uptime & latency (they don't want an automated split to
> happen at a random busy time). Maybe that's what you mean, Mohit?
>
> Ian
>
> On Aug 30, 2012, at 5:45 PM, Stack wrote:
>
> On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia <mohitanchlia@gmail.com>
> wrote:
> > From what I've read it's advisable to do manual splits since you are able
> > to spread the load in a more predictable way. If I am missing something
> > please let me know.
>
>
> Where did you read that?
> St.Ack
>
>

Re: md5 hash key and splits

Posted by Ian Varley <iv...@salesforce.com>.
The Facebook devs have mentioned in public talks that they pre-split their tables and don't use automated region splitting. But as far as I remember, the reason for that isn't predictability of spreading load, so much as predictability of uptime & latency (they don't want an automated split to happen at a random busy time). Maybe that's what you mean, Mohit?

Ian

On Aug 30, 2012, at 5:45 PM, Stack wrote:

On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia <mo...@gmail.com> wrote:
> From what I've read it's advisable to do manual splits since you are able
> to spread the load in a more predictable way. If I am missing something
> please let me know.


Where did you read that?
St.Ack


Re: md5 hash key and splits

Posted by Stack <st...@duboce.net>.
On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia <mo...@gmail.com> wrote:
> From what I've read it's advisable to do manual splits since you are able
> to spread the load in a more predictable way. If I am missing something
> please let me know.
>

Where did you read that?
St.Ack

Re: md5 hash key and splits

Posted by Mohit Anchlia <mo...@gmail.com>.
On Wed, Aug 29, 2012 at 10:50 PM, Stack <st...@duboce.net> wrote:

> On Wed, Aug 29, 2012 at 9:38 PM, Mohit Anchlia <mo...@gmail.com>
> wrote:
> > On Wed, Aug 29, 2012 at 9:19 PM, Stack <st...@duboce.net> wrote:
> >
> >> On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia <mohitanchlia@gmail.com>
> >> wrote:
> >> > If I use md5 hash + timestamp rowkey would hbase automatically detect
> >> > the difference in ranges and perform a split? How does split work in
> >> > such cases or is it still advisable to manually split the regions.
> >>
> >
> > What logic would you recommend to split the table into multiple regions
> > when using md5 hash?
> >
>
> It's hard to know how well your inserts will spread over the md5
> namespace ahead of time.  You could try sampling or just let HBase
> take care of the splits for you (Is there a problem w/ your letting
> HBase do the splits?)
>

From what I've read it's advisable to do manual splits since you are able
to spread the load in a more predictable way. If I am missing something
please let me know.


> St.Ack
>

Re: md5 hash key and splits

Posted by Stack <st...@duboce.net>.
On Wed, Aug 29, 2012 at 9:38 PM, Mohit Anchlia <mo...@gmail.com> wrote:
> On Wed, Aug 29, 2012 at 9:19 PM, Stack <st...@duboce.net> wrote:
>
>>  On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia <mo...@gmail.com>
>> wrote:
>> > If I use md5 hash + timestamp rowkey would hbase automatically detect the
>> > difference in ranges and perform a split? How does split work in such cases
>> > or is it still advisable to manually split the regions.
>>
>
> What logic would you recommend to split the table into multiple regions
> when using md5 hash?
>

It's hard to know how well your inserts will spread over the md5
namespace ahead of time.  You could try sampling or just let HBase
take care of the splits for you (Is there a problem w/ your letting
HBase do the splits?)

St.Ack
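
One way to read the "sampling" suggestion, sketched in plain Python: hash a
sample of real row keys, sort them, and take boundaries at even quantiles so
each region starts with about the same number of sampled rows. This is an
assumption about what sampling could look like, not a description of any
HBase tool; split_points_from_sample and the synthetic "user-N" identifiers
are hypothetical.

```python
import hashlib


def split_points_from_sample(sample_keys, num_regions):
    """Pick num_regions - 1 boundaries at even quantiles of the sorted sample."""
    ordered = sorted(set(sample_keys))
    if len(ordered) < num_regions:
        raise ValueError("sample too small for the requested region count")
    step = len(ordered) / num_regions
    return [ordered[int(i * step)] for i in range(1, num_regions)]


# Synthetic stand-in for a sample of real user IDs.
sample = [hashlib.md5(b"user-%d" % i).digest() for i in range(1000)]
points = split_points_from_sample(sample, 30)
```

Unlike evenly spaced boundaries, quantile boundaries follow whatever skew the
sample has, so they are only as good as the sample is representative.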

Re: md5 hash key and splits

Posted by Mohit Anchlia <mo...@gmail.com>.
On Wed, Aug 29, 2012 at 9:19 PM, Stack <st...@duboce.net> wrote:

>  On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia <mo...@gmail.com>
> wrote:
> > If I use md5 hash + timestamp rowkey would hbase automatically detect the
> > difference in ranges and perform a split? How does split work in such cases
> > or is it still advisable to manually split the regions.
>

What logic would you recommend to split the table into multiple regions
when using md5 hash?


> Yes.
>
> On how split works, when a region hits the maximum configured size, it
> splits in two.
>
> Manual splitting can be useful when you know your key distribution and
> want to save HBase the work of splitting for you.  It can speed up bulk
> loads, for instance.
>
> St.Ack
>

Re: md5 hash key and splits

Posted by Stack <st...@duboce.net>.
On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia <mo...@gmail.com> wrote:
> If I use md5 hash + timestamp rowkey would hbase automatically detect the
> difference in ranges and perform a split? How does split work in such cases
> or is it still advisable to manually split the regions.

Yes.

On how split works, when a region hits the maximum configured size, it
splits in two.

Manual splitting can be useful when you know your key distribution and
want to save HBase the work of splitting for you.  It can speed up bulk
loads, for instance.

St.Ack
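
The size-triggered behaviour described above can be pictured with a toy
model in plain Python. It is only a sketch under stated assumptions: row
keys are unique, a region's size is the sum of its row sizes, and a region
that reaches the limit is cut at its median row. How HBase actually accounts
for size and chooses the split point is more involved; the Table class and
its fields are hypothetical.

```python
import bisect


class Table:
    """Toy table: sorted regions that split in two at a size limit."""

    def __init__(self, max_region_size):
        self.max_region_size = max_region_size
        self.start_keys = [b""]  # first region starts at the empty key
        self.regions = [[]]      # each region: sorted list of (key, size)

    def put(self, key, size):
        # Find the region whose start key is the largest one <= key.
        idx = bisect.bisect_right(self.start_keys, key) - 1
        region = self.regions[idx]
        bisect.insort(region, (key, size))
        total = sum(s for _, s in region)
        if total >= self.max_region_size and len(region) > 1:
            self._split(idx)

    def _split(self, idx):
        # Cut the region at its median row; the right half gets a new start key.
        region = self.regions[idx]
        mid = len(region) // 2
        left, right = region[:mid], region[mid:]
        self.regions[idx:idx + 1] = [left, right]
        self.start_keys.insert(idx + 1, right[0][0])


table = Table(max_region_size=100)
for i in range(40):
    table.put(b"%04d" % i, 10)
```

With sequential keys as in this loop, only the last region ever grows, so
each split leaves a half-full region behind; hashed keys would instead fill
the regions more evenly before they split.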