Posted to user@hbase.apache.org by Rob Roland <ro...@simplymeasured.com> on 2012/07/13 22:31:53 UTC

Too many regions

Hi all,

The HBase instance I'm managing has grown to the point that it has way too
many regions per server - 5 region servers with 1010 regions each on HBase
0.90.4-cdh3u2. I want to bring this region count under control. The
cluster is currently running with the default region size of 256 MB, and
the data is spread across 17 tables. I've turned on compression for all
the column families, which is great, as my region count is growing much
more slowly now. I've looked through HDFS at the individual regions, and
they seem rather small - 40-50 MB - which is not surprising given the
major compactions after enabling compression. My total hbase folder size
in HDFS (hadoop fs -dus /hbase) is 926,939,501,499 bytes (about 863 GiB).

My question is - what's the best strategy for handling this?

What I assume from reading the docs:

1. Increase the hbase.hregion.max.filesize to something more reasonable,
like 2 GB.
2. Bring the cluster offline and merge regions.
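
For concreteness, here's roughly what I'm picturing - the table and region
names below are placeholders, so correct me if the tooling differs on
0.90.4-cdh3u2:

  # 1. Raise the max region size, either cluster-wide via
  #    hbase.hregion.max.filesize in hbase-site.xml (2147483648 for 2 GB,
  #    followed by a restart), or per table from the shell; on 0.90 the
  #    table has to be disabled first:
  hbase shell
  > disable 'my_table'
  > alter 'my_table', METHOD => 'table_att', MAX_FILESIZE => '2147483648'
  > enable 'my_table'

  # 2. With the cluster shut down, merge adjacent regions two at a time
  #    using the offline Merge tool (full region names as listed in .META.):
  hbase org.apache.hadoop.hbase.util.Merge my_table REGION_NAME_1 REGION_NAME_2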

Is there a good way to determine the actual region sizes, other than
manually, so that I can do the merges to end up with the most efficiently
sized regions?
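
The best I've come up with so far is walking HDFS directly, e.g. (table
name is a placeholder; 0.90 lays each table out as /hbase/<table>/<region>):

  # per-region directory sizes for one table, smallest first:
  hadoop fs -du /hbase/my_table | sort -n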

At what point is it a good idea to turn off automatic region splits and
manually manage them?

Thanks,

Rob Roland
Senior Software Engineer
Simply Measured, Inc.

Re: Too many regions

Posted by Adrien Mogenet <ad...@gmail.com>.
Everyone will tell you that handling fewer regions is always better.
Depending on your setup, data size, and number of records, I would say
that 1 to 5 regions per table per server is acceptable. In some setups
(one big table, for example) you can see up to 100-200 regions per
server, which is the kind of maximum you should keep in mind (the
Reference Guide talks about "a few hundred", as far as I remember).


-- 
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me

Re: Too many regions

Posted by Bryan Beaudreault <bb...@hubspot.com>.
Tables are a loose organizational structure that lets you apply more granular per-table configurations, or just keep a logical separation of data.  There aren't any best practices with regard to regions per table.  What matters more is regions per region server and regions per unit of queryable data.

The former is obvious, in that you don't want more than a few hundred (100-300) regions per region server.  What I mean by the latter is that you generally want "just enough" regions for the data you are trying to query.  If you have too few, you won't benefit from the distributed nature of HBase.  But if you have too many, you will go over the recommended few hundred regions per region server.

Again, it depends on your use case (how you are loading your data, how much data, etc.), but I would generally err on the higher side.  It's easier to split large regions than it is to merge too-small regions.
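
If you do end up with a few oversized regions later, splitting is a shell command away.  A sketch - the table name is a placeholder, and I'd double-check the exact syntax on a 0.90 shell with help 'split':

  hbase shell
  > split 'my_table'      # request splits for the table's regions
  > split 'REGION_NAME'   # or target one specific region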

--  
Bryan Beaudreault



Re: Too many regions

Posted by Rob Roland <ro...@simplymeasured.com>.
In almost every table, the rowkey is either a SHA hash, or a SHA hash and a
timestamp, so we have a fairly even distribution of rowkeys now.

Is there a best practice for the number of regions per table per server?
Meaning, with 5 region servers and 10 regions per table, so 170 regions per
region server - would that be good?

Thanks for the feedback,

Rob


Re: Too many regions

Posted by Adrien Mogenet <ad...@gmail.com>.
It can be reasonable to turn off automatic region splitting if you know
your rowkey distribution well and you're able to ensure good parallelism
among your regionservers "easily" (i.e., manually or through the HBase
API). Sometimes it's even the best solution to ensure the minimum number
of regions (many companies do this). There is an example of pre-splitting
regions in the Reference Guide.

As for your region size, increasing it to 2 GB or even more will help
reduce the number of regions and store files.
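
For example, a rough sketch of pre-splitting at create time, assuming your
SHA rowkeys are stored as hex strings (the split points are illustrative,
and if your 0.90 shell doesn't support SPLITS, the
HBaseAdmin.createTable(desc, splitKeys) Java API does the same thing):

  hbase shell
  > # 15 split points = 16 regions, one per leading hex digit:
  > create 'my_table', 'cf', SPLITS => ['1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']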


-- 
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me