Posted to issues@hbase.apache.org by "Jieshan Bean (JIRA)" <ji...@apache.org> on 2011/06/25 02:54:47 UTC

[jira] [Created] (HBASE-4031) An imbalance result calculated by LoadBalancer

An imbalance result calculated by LoadBalancer
----------------------------------------------

                 Key: HBASE-4031
                 URL: https://issues.apache.org/jira/browse/HBASE-4031
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.90.3
            Reporter: Jieshan Bean
             Fix For: 0.90.4


  I found this problem when the cluster could not balance itself (around 2011-05-24 11:28). One node held about twice as many regions as each of the other nodes, and the balancer no longer moved any regions:
   Address              Start Code     Load
   158-1-101-202:20030  1306205409671  requests=0, regions=2593, usedHeap=114, maxHeap=8165
   158-1-101-222:20030  1306205940117  requests=0, regions=5841, usedHeap=80, maxHeap=8165
   158-1-101-52:20030   1306205417261  requests=0, regions=2622, usedHeap=76, maxHeap=8165
   158-1-101-82:20030   1306205415714  requests=0, regions=2633, usedHeap=69, maxHeap=8165
   Total:  servers: 4   requests=0, regions=13689


  While analyzing this problem I found HBASE-3985 ("Same Region could be picked out twice in LoadBalancer"),
  but I'm afraid that is not the main cause.

  Our cluster has one active master, one standby master, and four regionservers.

>>10:57:41, the standby HMaster on 222 became the active one.
2011-05-24 10:57:41,314 INFO org.apache.hadoop.hbase.master.HMaster: Master startup proceeding: master failover

>>The 4 regionservers registered with 222 one by one. Only one regionserver registered noticeably late.
2011-05-24 10:57:37,533 INFO : Registering server=158-1-101-82,20020,1306205415714, regionCount=3388, userLoad=true
2011-05-24 10:57:37,537 INFO : Registering server=158-1-101-202,20020,1306205409671, regionCount=3453, userLoad=true
2011-05-24 10:57:37,598 INFO : Registering server=158-1-101-52,20020,1306205417261, regionCount=3411, userLoad=true
2011-05-24 10:59:00,408 INFO : Registering server=158-1-101-222,20020,1306205940117, regionCount=0, userLoad=false

>>13134 regions needed to move after rebuildUserRegions (the cluster had 13689 regions at the time).
2011-05-24 10:58:47,534 INFO org.apache.hadoop.hbase.master.AssignmentManager: Failed-over master needs to process 13134 regions in transition

>>All 13134 regions were opened. Regions opened on each server:
158-1-101-222,20020,1306205940117    Count: 834
158-1-101-82,20020,1306205415714    Count: 4093
158-1-101-202,20020,1306205409671    Count: 4118
158-1-101-52,20020,1306205417261    Count: 4089

>>The nearest balancer calculation result:
2011-05-24 11:12:11,076 INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 19ms. Moving 5012 regions off of 3 overloaded servers onto 1 less loaded servers

"5012" is an implausible number here, because it is larger than the average number of regions per server, "3424.5".

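As a rough sanity check on the balancer log line above, the following sketch estimates how many moves would be needed to level the opened-region counts listed earlier. It assumes a simplified model in which every server should end up between the floor and the ceiling of the average. That model and the function name `expected_moves` are illustrative assumptions, not the actual 0.90 LoadBalancer code.

```python
import math


def expected_moves(region_counts):
    """Estimate the moves needed to level per-server region counts.

    Simplified model (an assumption, not HBase's real algorithm): every
    server should end up holding between floor(avg) and ceil(avg) regions.
    """
    total = sum(region_counts)
    servers = len(region_counts)
    floor_avg = total // servers
    ceil_avg = math.ceil(total / servers)
    # Regions the overloaded servers must shed to get down to the ceiling.
    to_shed = sum(c - ceil_avg for c in region_counts if c > ceil_avg)
    # Regions the underloaded servers must gain to get up to the floor.
    to_fill = sum(floor_avg - c for c in region_counts if c < floor_avg)
    return max(to_shed, to_fill)


# Opened-region counts per server, taken from the report above.
opened = {
    "158-1-101-222": 834,
    "158-1-101-82": 4093,
    "158-1-101-202": 4118,
    "158-1-101-52": 4089,
}
print(expected_moves(list(opened.values())))
```

Under this model the four counts call for 2449 moves, roughly half of the 5012 that the balancer reported.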

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055271#comment-13055271 ] 

Ted Yu commented on HBASE-4031:
-------------------------------

Looks like namenode failed over as well:
{code}
2011-05-24 10:59:10,584 INFO org.apache.hadoop.hdfs.DFSClient: DFSClient has connected to the active namenode: /158.1.101.222:9000
{code}
If so, I think the backup master should run on a node where no region server is running. In case of master failover, the node wouldn't be able to handle the namenode, the failed-over master and an RS at the same time.

>         Attachments: HMaster222.rar, HRegionServer222.rar

        

[jira] [Commented] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058555#comment-13058555 ] 

Ted Yu commented on HBASE-4031:
-------------------------------

I think HBASE-4053 reveals the cause for this issue.


        

[jira] [Commented] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060993#comment-13060993 ] 

Jieshan Bean commented on HBASE-4031:
-------------------------------------

Yes, this is the same issue as HBASE-4053, though the scenario is different.
So I think we can mark this issue as invalid.


        

[jira] [Commented] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055119#comment-13055119 ] 

Ted Yu commented on HBASE-4031:
-------------------------------

Registration of the 3 RS (including 158-1-101-82:20020) wasn't in HMaster222.log
I noticed:
{code}
2011-05-24 10:59:01,213 DEBUG org.apache.hadoop.hbase.master.ServerManager: New connection to 158-1-101-222,20020,1306205940117
2011-05-24 10:59:11,067 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: RemoteException connecting to RS
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.ipc.ServerNotRunningException: Server is not running yet
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1038)
...
2011-05-24 10:59:11,070 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for hello,200040,1305944346902.2ce947cccbfe15b7210dd21d0cc2c515. so generated a random one; hri=hello,200040,1305944346902.2ce947cccbfe15b7210dd21d0cc2c515., src=, dest=158-1-101-82,20020,1306205415714; 4 (online=4, exclude=serverName=158-1-101-222,20020,1306205940117, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)) available servers
{code}
There is only 1 line from LoadBalancer in master log. If this scenario can be reproduced, please add more DEBUG log before line 251 in balanceCluster()

I guess some regions were doubly counted on 158-1-101-222,20020


        

[jira] [Commented] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054797#comment-13054797 ] 

Jieshan Bean commented on HBASE-4031:
-------------------------------------

Because the original log files are too big, I have attached only some fragments of the full logs.


        

[jira] [Commented] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055259#comment-13055259 ] 

Jieshan Bean commented on HBASE-4031:
-------------------------------------

This scenario can't be reproduced easily. I have not been able to reproduce it again.
As for the registration logs, I didn't put them into the attachment.
<noformat>
2011-05-24 10:57:37,533 INFO : Registering server=158-1-101-82,20020,1306205415714, regionCount=3388, userLoad=true
2011-05-24 10:57:37,537 INFO : Registering server=158-1-101-202,20020,1306205409671, regionCount=3453, userLoad=true
2011-05-24 10:57:37,598 INFO : Registering server=158-1-101-52,20020,1306205417261, regionCount=3411, userLoad=true
2011-05-24 10:59:00,408 INFO : Registering server=158-1-101-222,20020,1306205940117, regionCount=0, userLoad=false
<noformat>

The registration happened before the first line of the logs: "2011-05-24 10:57:41,314 INFO org.apache.hadoop.hbase.master.HMaster: Master startup proceeding: master failover".

Some regions were indeed doubly counted, as mentioned in the related issue HBASE-3985, but only around 500 regions.

Sorry for the confusion, and thanks for digging into this issue.
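
To illustrate the kind of double counting discussed in this comment, here is a small sketch that looks for regions appearing in more than one planned move. The plan tuples, server names, and the helper `regions_planned_twice` are hypothetical and only stand in for whatever the balancer actually produces. This is not the HBASE-3985 fix.

```python
from collections import Counter


def regions_planned_twice(plans):
    """Return, sorted, the region names that occur in more than one plan."""
    counts = Counter(region for region, _src, _dest in plans)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical move plans: (region, source server, destination server).
plans = [
    ("region-a", "rs-82", "rs-222"),
    ("region-b", "rs-202", "rs-222"),
    ("region-a", "rs-82", "rs-222"),  # same region picked a second time
]
print(regions_planned_twice(plans))
```

If each duplicate plan is counted as a separate move, the reported total is inflated by one for every repeated region.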




        

[jira] [Commented] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055267#comment-13055267 ] 

Ted Yu commented on HBASE-4031:
-------------------------------

RS 222 failed over to new master @ 2011-05-24 10:59:10

I traced region 'test4,011980,1306152907050.af05fe4e3f56932027c929917af664aa.' in RS log. The region was opened, closed and opened again.



        

[jira] [Updated] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4031:
--------------------------------

    Attachment: HMaster222.rar

> An imbalance result calculated by LoadBalancer
> ----------------------------------------------
>
>                 Key: HBASE-4031
>                 URL: https://issues.apache.org/jira/browse/HBASE-4031
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HMaster222.rar
>
>
>   I found the problem while the cluster couldn't balance(Around time of 2011-05-24 11:28).One node's regions count is the double of the other nodes. And it didn't move regions anymore:
>    Address Start Code Load
> 158-1-101-202:20030 1306205409671 requests=0, regions=2593, usedHeap=114, maxHeap=8165 158-1-101-222:20030 1306205940117 requests=0, regions=5841, usedHeap=80, maxHeap=8165 158-1-101-52:20030 1306205417261 requests=0, regions=2622, usedHeap=76, maxHeap=8165 158-1-101-82:20030 1306205415714 requests=0, regions=2633, usedHeap=69, maxHeap=8165 
> Total:  servers: 4   requests=0, regions=13689
>   HBASE-3985-"Same Region could be picked out twice in LoadBalancer" was found by my analysis on this problem.
>   But I'm afraid it's not the main cause of the problem.
>   There's one active master, one standby master, four regionservers in our cluster.
> >>At 10:57:41, the standby HMaster 222 became the active one.
> 2011-05-24 10:57:41,314 INFO org.apache.hadoop.hbase.master.HMaster: Master startup proceeding: master failover
> >>The 4 regionservers registered with 222 one by one. Only one regionserver registered noticeably later than the others.
> 2011-05-24 10:57:37,533 INFO : Registering server=158-1-101-82,20020,1306205415714, regionCount=3388, userLoad=true
> 2011-05-24 10:57:37,537 INFO : Registering server=158-1-101-202,20020,1306205409671, regionCount=3453, userLoad=true
> 2011-05-24 10:57:37,598 INFO : Registering server=158-1-101-52,20020,1306205417261, regionCount=3411, userLoad=true
> 2011-05-24 10:59:00,408 INFO : Registering server=158-1-101-222,20020,1306205940117, regionCount=0, userLoad=false
> >>13134 regions had to be moved after rebuildUserRegions (there were 13689 regions in the cluster at the time).
> 2011-05-24 10:58:47,534 INFO org.apache.hadoop.hbase.master.AssignmentManager: Failed-over master needs to process 13134 regions in transition
> >>All 13134 regions were opened. Number of regions opened on each server:
> 158-1-101-222,20020,1306205940117    Count: 834
> 158-1-101-82,20020,1306205415714    Count: 4093
> 158-1-101-202,20020,1306205409671    Count: 4118
> 158-1-101-52,20020,1306205417261    Count: 4089
> >>The nearest balancer calculation result:
> 2011-05-24 11:12:11,076 INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 19ms. Moving 5012 regions off of 3 overloaded servers onto 1 less loaded servers
> "5012" is an implausible number here, because it is larger than the average number of regions per server, "3424.5".
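
As a rough sanity check on the figures quoted above, the sketch below computes how many regions a naive plan would move to even out the four servers. It is not the HBase LoadBalancer algorithm. The function name moves_to_even_out and the short server labels are made up for illustration, and the input is the per-server opened counts from the report, which may differ from the master's own view at the time it ran the balancer.

```python
import math

# Regions opened on each server after the failover, copied from the report.
# The keys are shortened server names used only for this illustration.
opened = {
    "158-1-101-222": 834,
    "158-1-101-82": 4093,
    "158-1-101-202": 4118,
    "158-1-101-52": 4089,
}


def moves_to_even_out(counts):
    """Return (average, moves) for a naive 'shed everything above ceil(avg)' plan.

    Hypothetical helper, not HBase code: every server above the rounded-up
    average gives away its surplus, and nothing else is moved.
    """
    total = sum(counts.values())
    average = total / len(counts)
    ceiling = math.ceil(average)
    moves = sum(max(0, n - ceiling) for n in counts.values())
    return average, moves


average, moves = moves_to_even_out(opened)
print(f"total={sum(opened.values())} average={average} naive_moves={moves}")
```

Under these assumptions the naive plan moves 2448 regions (the average is 3283.5 per server), so a plan that moves 5012 regions is roughly double what evening out these counts would require.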


[jira] [Resolved] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean resolved HBASE-4031.
---------------------------------

    Resolution: Duplicate


[jira] [Commented] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055282#comment-13055282 ] 

Jieshan Bean commented on HBASE-4031:
-------------------------------------

The master failover was caused by manually killing the original HMaster. During that time, RS222 had aborted (it restarted later).


[jira] [Updated] (HBASE-4031) An imbalance result calculated by LoadBalancer

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4031:
--------------------------------

    Attachment: HRegionServer222.rar
