You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by Matthias Hofschen <ho...@gmail.com> on 2011/11/16 11:40:02 UTC

balancer stopped working

Hi,
we had a case today where the loadbalancer stopped working.
(cloudera-cdh3-u1, 52nodes). Basically we had a hot region that we moved to
another node. Shortly thereafter the regionserver of that region was
stopped. In the master logs we see that master is trying to contact this
regionserver to move the region. At the same time we had about 1000 regions
missing because of the stopped regionserver. These where not assigned to
another regionserver. I suspect that the the load balancer on the master
machine was blocked by trying to move the one region which was not
possible. We then restarted the master which solved the problem.

Is this a known problem, should I log an issue? I do have a stacktrace from
the master machine.

Cheers Matthias

Re: balancer stopped working

Posted by Matthias Hofschen <ho...@gmail.com>.
Hi
this is the stacktrace of the master process till restart:

2011-11-16 09:28:20,007 INFO org.apache.hadoop.hbase.master.LoadBalancer:
Skipping load balancing.  servers=51 regions=76793 average=1505.7451
mostloaded=1506 leastloaded=1506
2011-11-16 09:29:51,276 INFO org.apache.hadoop.hbase.master.ServerManager:
Registering server=hdp1156,60020,1321432190747, regionCount=0,
userLoad=false
2011-11-16 09:34:35,646 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321432287936
2011-11-16 09:34:35,646 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 09:34:35,647 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned java.net.ConnectException:
Connection refused for 1895561433
2011-11-16 09:37:35,648 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321432475647
2011-11-16 09:43:35,651 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 09:43:35,652 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned java.net.ConnectException:
Connection refused for 1895561433
2011-11-16 09:46:35,653 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321433015652
2011-11-16 09:49:35,656 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 09:49:35,657 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned java.net.ConnectException:
Connection refused for 1895561433
2011-11-16 09:49:36,305 INFO
org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
ephemeral node deleted, processing expiration [hdp1105,60020,1321217721376]
2011-11-16 09:49:36,305 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
logs for hdp1105,60020,1321217721376
2011-11-16 09:49:36,309 ERROR
org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while
processing event M_SERVER_SHUTDOWN
java.lang.NullPointerException
        at
org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1781)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-11-16 09:52:35,657 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321433375656
2011-11-16 09:52:35,657 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 09:52:35,658 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned java.net.ConnectException:
Connection refused for 1895561433
2011-11-16 09:54:58,681 INFO org.apache.hadoop.hbase.master.HMaster:
Balance=true
2011-11-16 09:55:35,659 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321433555658
2011-11-16 09:58:35,661 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 09:58:35,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
2011-11-16 10:01:35,663 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321433915662
2011-11-16 10:01:35,663 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 10:01:35,668 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
2011-11-16 10:02:35,647 INFO org.apache.hadoop.hbase.master.HMaster:
Balance=true
2011-11-16 10:04:35,665 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321434095664
2011-11-16 10:04:35,665 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 10:04:35,671 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
2011-11-16 10:04:38,420 INFO org.apache.zookeeper.ZooKeeper: Initiating
client connection, connectString=hdz1072:2281,hdz1071:2281,hdz1070:2281
sessionTimeout=60000 watcher=hconnection
2011-11-16 10:04:38,421 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hdz1071/172.17.0.71:2281
2011-11-16 10:04:38,423 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hdz1071/172.17.0.71:2281, initiating session
2011-11-16 10:04:38,427 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server hdz1071/172.17.0.71:2281, sessionid =
0x132deb2ddbb9560, negotiated timeout = 40000
2011-11-16 10:04:45,243 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closed zookeeper sessionid=0x132deb2ddbb9560
2011-11-16 10:04:45,244 INFO org.apache.zookeeper.ZooKeeper: Session:
0x132deb2ddbb9560 closed
2011-11-16 10:04:45,244 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2011-11-16 10:05:44,032 INFO
org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
ephemeral node deleted, processing expiration [hdp1156,60020,1321432190747]
2011-11-16 10:05:44,033 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
logs for hdp1156,60020,1321432190747
2011-11-16 10:05:44,034 ERROR
org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while
processing event M_SERVER_SHUTDOWN
java.lang.NullPointerException
        at
org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1781)
        at
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-11-16 10:05:49,678 INFO org.apache.hadoop.hbase.master.ServerManager:
Registering server=hdp1156,60020,1321434349153, regionCount=0,
userLoad=false
2011-11-16 10:07:35,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321434275666
2011-11-16 10:07:35,667 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 10:07:35,674 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
2011-11-16 10:09:30,671 INFO org.apache.hadoop.hbase.master.HMaster:
Balance=true
2011-11-16 10:10:35,669 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321434455668
2011-11-16 10:10:35,669 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 10:10:35,693 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
2011-11-16 10:13:35,671 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
timed out:  message,99086_700177012,1302074416350.1895561433
state=PENDING_CLOSE, ts=1321434635670
2011-11-16 10:16:35,673 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=message,99086_700177012,1302074416350.1895561433
2011-11-16 10:16:35,679 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Server
serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657,
usedHeap=1380, maxHeap=11897) returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
message,99086_700177012,1302074416350.1895561433 but we are not serving it
for 1895561433
Wed Nov 16 10:19:33 CET 2011 Killing master
Wed Nov 16 10:19:37 CET 2011 Starting master on hdm1004
ulimit -n 65536
2011-11-16 10:19:38,823 INFO org.apache.hadoop.hbase.ipc.HBaseRpcMetrics:
Initializing RPC Metrics with hostName=HMaster, port=60000
2011-11-16 10:19:39,086 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder: starting
2011-11-16 10:19:39,086 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
listener on 60000: starting
2011-11-16 10:19:39,088 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 0 on 60000: starting
...................

On Wed, Nov 16, 2011 at 12:52 PM, <yu...@gmail.com> wrote:

> Can you post the stack trace and relevant log snippets ?
>
> Thanks
>
>
>
> On Nov 16, 2011, at 2:40 AM, Matthias Hofschen <ho...@gmail.com> wrote:
>
> > Hi,
> > we had a case today where the loadbalancer stopped working.
> > (cloudera-cdh3-u1, 52nodes). Basically we had a hot region that we moved
> to
> > another node. Shortly thereafter the regionserver of that region was
> > stopped. In the master logs we see that master is trying to contact this
> > regionserver to move the region. At the same time we had about 1000
> regions
> > missing because of the stopped regionserver. These where not assigned to
> > another regionserver. I suspect that the the load balancer on the master
> > machine was blocked by trying to move the one region which was not
> > possible. We then restarted the master which solved the problem.
> >
> > Is this a known problem, should I log an issue? I do have a stacktrace
> from
> > the master machine.
> >
> > Cheers Matthias
>

Re: balancer stopped working

Posted by yu...@gmail.com.
Can you post the stack trace and relevant log snippets ?

Thanks 



On Nov 16, 2011, at 2:40 AM, Matthias Hofschen <ho...@gmail.com> wrote:

> Hi,
> we had a case today where the loadbalancer stopped working.
> (cloudera-cdh3-u1, 52nodes). Basically we had a hot region that we moved to
> another node. Shortly thereafter the regionserver of that region was
> stopped. In the master logs we see that master is trying to contact this
> regionserver to move the region. At the same time we had about 1000 regions
> missing because of the stopped regionserver. These where not assigned to
> another regionserver. I suspect that the the load balancer on the master
> machine was blocked by trying to move the one region which was not
> possible. We then restarted the master which solved the problem.
> 
> Is this a known problem, should I log an issue? I do have a stacktrace from
> the master machine.
> 
> Cheers Matthias