Posted to user@hbase.apache.org by "Kleegrewe, Christian" <ch...@siemens.com> on 2011/05/24 17:03:56 UTC

Region split behavior

Dear all,

We have a small test cluster with 5 nodes: 1 master and 4 datanodes. The nodes run Ubuntu Desktop 10.10, Hadoop 0.20.2-CDH3B4 and HBase 0.90.1-CDH3B4. The HBase database is well balanced and contains one table (TAB_1) with 270,000,000 data records. The table consists of 84 regions, each with 1 to 3 storefiles and a region size between 100 MB and 216 MB. The rowkey is a monotonically increasing timestamp, which I know is bad for parallelization, but we are only testing some map features so far.

When I create TAB_1 it distributes very well over the 4 region servers, so that each server holds 20 - 22 regions after creation. When I create a second table (TAB_2) with the same rowkey and the same data, this table does not distribute over the servers but is stored on only one of the region servers (R1). The other nodes (R2, R3, R4) are not used for storage. The cluster still remains balanced, but I can see regions of TAB_1 drifting away from R1, which is used for storing TAB_2. After a while there are no regions of TAB_1 left on R1, and only then does the load balancer start moving regions of TAB_2 to R2 .. R4. The active region that is being written to remains on R1.

How can this behaviour be explained? I would normally expect TAB_2 to distribute over all 4 region servers at creation time, rather than being stored on a single server and having the load balancer shift the data in the background afterwards.
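(One way to verify this assignment, sketched against a 0.90-era shell: region locations are kept in the '.META.' catalog table under the 'info:server' column, so scanning it shows which server hosts each region of TAB_2. The row-key bounds below follow from the table name in this thread; exact option names may differ in other versions.)

    # Show the assigned server for every region of TAB_2. Meta row keys
    # look roughly like '<table>,<startkey>,<timestamp>.<encoded>.', so
    # they all start with 'TAB_2,'. Using 'TAB_2-' as the stop row bounds
    # the scan, since '-' sorts immediately after ','.
    hbase> scan '.META.', {STARTROW => 'TAB_2,', STOPROW => 'TAB_2-', COLUMNS => ['info:server']}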

Is this normal HBase behaviour, or is there some misconfiguration in my cluster?

Thanks in advance

Christian

---------------8<--------------------------------

Siemens AG
Corporate Technology
Corporate Research and Technologies
CT T DE IT3
Otto-Hahn-Ring 6
81739 München, Deutschland
Tel.: +49 (89) 636-42722
Fax: +49 (89) 636-41423
mailto:christian.kleegrewe@siemens.com

Siemens Aktiengesellschaft: Vorsitzender des Aufsichtsrats: Gerhard Cromme; Vorstand: Peter Löscher, Vorsitzender; Wolfgang Dehen, Brigitte Ederer, Joe Kaeser, Barbara Kux, Hermann Requardt, Siegfried Russwurm, Peter Y. Solmssen; Sitz der Gesellschaft: Berlin und München, Deutschland; Registergericht: Berlin Charlottenburg, HRB 12300, München, HRB 6684; WEEE-Reg.-Nr. DE 23691322




Re: Region split behavior

Posted by Stack <st...@duboce.net>.
Upgrade your HBase to 0.90.3 (and your CDH to the released version). The issue 'HBASE-3586 Improve the selection of regions to balance' should help. It does a more random assignment, which should help undo some of the table clumping you are seeing. That said, others have observed that the balancer needs to take tables into consideration when balancing to make a better distribution of tables over the cluster. We'll work on that. As a last resort, try moving regions manually. See the 'move' command in the shell.
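(A rough sketch of what that looks like in a 0.90-era shell; the encoded region name and the target server name below are placeholders, and 'balance_switch' is assumed to be the shell command for toggling the balancer in this version:)

    # Disable the automatic balancer first so it does not undo the
    # manual moves (the command returns the previous state).
    hbase> balance_switch false

    # Move a region by its encoded name (the hash suffix of the region
    # name, visible in the master web UI) to a target server given as
    # 'hostname,port,startcode'. Both arguments here are placeholders.
    hbase> move 'b0f5e4b17e8b2a7c6f3d9e1a2b3c4d5e', 'r2.example.com,60020,1306248000000'

    # Re-enable the balancer when done.
    hbase> balance_switch true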

St.Ack

On Tue, May 24, 2011 at 8:03 AM, Kleegrewe, Christian
<ch...@siemens.com> wrote:
> [snip: original message quoted in full above]