Posted to user@hbase.apache.org by "Liu, Ming (HPIT-GADSC)" <mi...@hp.com> on 2014/08/06 08:28:53 UTC

Why hbase need manual split?

Hi, all,

As I understand it, HBase automatically splits a region when the region grows too big.
So in what scenarios does a user need to do a manual split? Could someone kindly give me some examples where a user needs to split a region explicitly via the HBase shell or the Java API?

Thanks very much.

Regards,
Ming

RE: Why hbase need manual split?

Posted by "Liu, Ming (HPIT-GADSC)" <mi...@hp.com>.
Thanks John,

This is a very good answer. Now I understand why you use manual splits, thanks.
And I had a typo in my previous post:
C is very close to A, not to the midpoint A+(B-A)/2. So every split in the middle of the key range will produce one big region and one small region, which is very bad.
So does HBase only auto split in the middle of the key range, or are there other algorithms? Any help would be much appreciated!

Best Regards,
Ming

Re: Why hbase need manual split?

Posted by john guthrie <gr...@gmail.com>.
to be honest, we were doing manual splits mainly because we wanted to make
sure they happened on our schedule.

but it also occurred to me that automatic splits, at least by default,
split the region in half. normally the idea is that both new halves
continue to grow, but with a sequentially increasing key that won't be
true. so if you're splitting in half, you want your region split size to be
twice your desired region size, so that when a split does occur the "older"
half of the region is the size you want. manual splitting lets you split
at the end of the region instead.
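
The effect described above can be seen in a toy simulation. This is plain Python, not HBase code: the row counts, the function name, and the older_fraction parameter are all made up for illustration. Every new row lands in the last region (sequential keys), and that region is split whenever it reaches a threshold.

```python
def simulate_sequential_writes(total_rows, split_threshold, older_fraction=0.5):
    """Toy model of sequential-key writes; returns the row count per region.

    older_fraction is the share of rows kept by the older daughter region:
    0.5 mimics a split in half, 1.0 mimics a manual split at the very end.
    """
    regions = [0]  # row counts, ordered by key range
    for _ in range(total_rows):
        regions[-1] += 1  # sequential keys always hit the last region
        if regions[-1] >= split_threshold:
            full = regions.pop()
            older = int(full * older_fraction)
            # The older daughter never receives writes again.
            regions.extend([older, full - older])
    return regions


# Desired region size: 100 rows.
print(simulate_sequential_writes(1000, 100))       # 20 regions of 50 rows
print(simulate_sequential_writes(1000, 200))       # 10 regions of 100 rows
print(simulate_sequential_writes(1000, 100, 1.0))  # 10 regions of 100 rows, plus an empty tail
```

Under these assumptions, splitting in half at the desired size leaves every closed region half full, while doubling the threshold or splitting at the end both give closed regions of the desired size.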

hope this helps, and hope i'm not wrong,
john



RE: Why hbase need manual split?

Posted by "Liu, Ming (HPIT-GADSC)" <mi...@hp.com>.
Thanks Arun and John,

Both of your scenarios make a lot of sense to me. But I am still confused about the "sequence-based key" case. It is like an append-only workload, so new data is always written to the same region, but that region will eventually reach hbase.hregion.max.filesize and be split automatically. Why is a manual split still needed? If we set hbase.hregion.max.filesize to a "not too big" value, a region will never grow too big, right?

And I think I first need to understand how HBase does the auto split internally (I am very new to HBase). Given a region with start key A and end key B, how does HBase split it internally? In the middle of the key range?
That is, is the original region [A, B] split into [A, A+(B-A)/2] and [A+(B-A)/2+1, B]?
Then if most of the row keys are in a small range [A, C], where C is very close to A+(B-A)/2, I can see a problem with auto split.

Is this true? Can HBase split in other ways?
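
To make the concern concrete, here is a small Python sketch. It is an illustration only, not HBase code, and the key values and function names are invented. It contrasts splitting at the arithmetic middle of the key range with splitting at the middle of the stored data. As far as I know, HBase's default split point is taken from the mid key of the region's largest store file, so it behaves more like the second function, but treat that as an assumption to check against your HBase version.

```python
def split_by_key_range(start, end, keys):
    """Split [start, end) at the arithmetic middle of the key range."""
    mid = start + (end - start) // 2
    left = [k for k in keys if k < mid]
    right = [k for k in keys if k >= mid]
    return mid, len(left), len(right)


def split_by_data(keys):
    """Split at the median stored key, so both halves hold similar data."""
    ordered = sorted(keys)
    mid = ordered[len(ordered) // 2]
    left = [k for k in ordered if k < mid]
    right = [k for k in ordered if k >= mid]
    return mid, len(left), len(right)


# Region [A, B) = [0, 1000); most rows sit in a small range near A.
keys = list(range(0, 90)) + [500, 900]
print(split_by_key_range(0, 1000, keys))  # (500, 90, 2)
print(split_by_data(keys))                # (46, 46, 46)
```

With skewed keys, the key-range midpoint yields one big and one small region, while a data-based midpoint balances the two daughters regardless of where the keys cluster.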

Thanks,
Ming

Re: Why hbase need manual split?

Posted by john guthrie <gr...@gmail.com>.
i had a customer with a sequence-based key (yes, he knew all the downsides
of that). being able to split manually meant he could split a region that
got too big at the end instead of right down the middle. with a sequentially
increasing key, splitting the region in half left one region at half the
desired size and unlikely to ever be written to again.


Re: Why hbase need manual split?

Posted by Arun Allamsetty <ar...@gmail.com>.
Hi Ming,

The reason we have it is that the user can decide where each key goes. I
can think of multiple scenarios off the top of my head where it would be
useful, and others can correct me if I am wrong.

1. Cases where your row keys are not evenly distributed lexically, leading
to unequal load on the regions. In such cases, we can assign key ranges to
different regions so that we get a more even distribution.

2. The second scenario I am thinking of may be wrong, and if it is, it'll
clear up my misconceptions. Suppose you cannot denormalize your data and
you have to perform joins on a certain range of row keys which are
lexically similar. We split so that they are assigned to the same region
server (right?) and the join can be performed locally.
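
Scenario 1 can be sketched as follows: take a sample of the row keys you expect, then pick split points at evenly spaced positions in the sorted sample so that each region receives a similar share. This is a plain-Python illustration with made-up keys and a hypothetical helper name, not an HBase API.

```python
def choose_split_points(sample_keys, num_regions):
    """Return up to num_regions - 1 split keys that divide the sample evenly."""
    if num_regions < 2 or not sample_keys:
        return []
    ordered = sorted(set(sample_keys))
    step = len(ordered) / num_regions
    points = []
    for i in range(1, num_regions):
        key = ordered[int(i * step)]
        # Skip repeats, which can occur when the sample is very small.
        if not points or key != points[-1]:
            points.append(key)
    return points


# Skewed sample: most keys start with "user", few with other prefixes.
sample = (
    [f"user{i:03d}" for i in range(90)]
    + [f"admin{i}" for i in range(5)]
    + [f"zeta{i}" for i in range(5)]
)
print(choose_split_points(sample, 4))  # ['user020', 'user045', 'user070']
```

The resulting keys could then be supplied as explicit split points, for example through the SPLITS option of the HBase shell's create command.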

Cheers,
Arun

RE: Why hbase need manual split?

Posted by "Rendon, Carlos (KBB)" <CR...@kbb.com>.
One scenario: you are just starting up a service and want the load spread across multiple region servers from the start, instead of waiting for automatic splitting to kick in.

Say you have 5 region servers; one way to create your table via the HBase shell is like this:

create 'tablename', 'f', {NUMREGIONS => 5, SPLITALGO => 'UniformSplit'}

Of course, the exact way you do the split depends on the data you plan to store.
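
For intuition about what evenly pre-splitting a key space means, the sketch below computes the N-1 boundary keys that cut a fixed-width byte key space into N equal ranges. It is a plain-Python approximation of the idea behind a uniform split, not the actual HBase implementation; the 8-byte key width and the function name are assumptions for illustration.

```python
def uniform_split_keys(num_regions, key_width=8):
    """Return num_regions - 1 boundary keys over a key_width-byte key space."""
    space = 256 ** key_width  # number of distinct keys of this width
    boundaries = []
    for i in range(1, num_regions):
        boundary = space * i // num_regions
        boundaries.append(boundary.to_bytes(key_width, "big"))
    return boundaries


# Boundaries for 5 regions, printed as hex.
for key in uniform_split_keys(5):
    print(key.hex())
```

Boundaries like these only balance load when row keys are spread roughly uniformly over the byte space (hashed keys, for example), which is why the right split depends on the data.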

Cheers,
Carlos
