Posted to dev@hbase.apache.org by Bogdan Ghidireac <bo...@ecstend.com> on 2011/04/04 11:30:27 UTC

zookeeper connection hangs during shutdown

Hi,

I have a cluster of 90 servers (HBase 0.90.1, Hadoop 0.20-append) that
runs a write-intensive MapReduce job. Occasionally one or more region
servers run out of memory and try to shut down, but the shutdown does
not always complete, so they get stuck.

If I dump the JVM threads in a console, it looks like the region
server wants to close all zookeeper connections and blocks until this
is done.
HRegionServer.java:672 --> HConnectionManager.deleteConnection(conf, true);

"regionserver60020-EventThread" daemon prio=10 tid=0x000000005d0dc000
nid=0x7e1d waiting on condition [0x0000000042941000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000781c9ce00> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)

"regionserver60020" prio=10 tid=0x00002aaab023e000 nid=0x7e1b in
Object.wait() [0x000000004273f000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000007be9ad330> (a org.apache.zookeeper.ClientCnxn$Packet)
	at java.lang.Object.wait(Object.java:485)
	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1317)
	- locked <0x00000007be9ad330> (a org.apache.zookeeper.ClientCnxn$Packet)
	at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1295)
	at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:531)
	- locked <0x0000000781fae170> (a org.apache.zookeeper.ZooKeeper)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.close(ZooKeeperWatcher.java:399)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.close(HConnectionManager.java:1050)
	at org.apache.hadoop.hbase.client.HConnectionManager.deleteConnection(HConnectionManager.java:175)
	- locked <0x00000007801b6800> (a
org.apache.hadoop.hbase.client.HConnectionManager$1)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:672)
	at java.lang.Thread.run(Thread.java:662)

"main-EventThread" daemon prio=10 tid=0x00002aaab0540000 nid=0x7e10
waiting on condition [0x0000000042139000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000781faf3e0> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)

Unfortunately, some connections are never closed so the server does
not shut down.

Is it possible to add a timeout and then force a System.exit()?

Bogdan

Re: zookeeper connection hangs during shutdown

Posted by Bogdan Ghidireac <bo...@ecstend.com>.

Please see my answers inline ...

On Mon, Apr 4, 2011 at 8:45 PM, Stack <st...@duboce.net> wrote:
> On Mon, Apr 4, 2011 at 2:30 AM, Bogdan Ghidireac <bo...@ecstend.com> wrote:
>> Is it possible to add a timeout and then force a System.exit()?
>>
>
> Yes. Of course.  Sounds bad.  How do you think this scenario came about?

My M/R job reads from a table and creates a lot of data that is
inserted into a second table. Because this new table is empty and I
did not split the keys in advance, the region server where the first
region was created is hit really hard (60-100K ops/sec).

The OOM exception happens during this time, only for one or maybe two
servers. The exception triggers a server shutdown...
Once the initial region splits and the traffic is distributed, the
problem does not happen any more.
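
As an illustration of splitting the keys in advance, here is a minimal
sketch that computes evenly spaced split points. It assumes row keys are
non-negative 8-byte big-endian longs spread uniformly over [0, maxKey),
which this thread does not state; SplitKeys and evenSplits are
hypothetical names, not HBase APIs.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration; not part of HBase. */
final class SplitKeys {
    private SplitKeys() {}

    /**
     * Returns numRegions - 1 split keys that divide [0, maxKey) into
     * numRegions roughly equal ranges. Each key is an 8-byte big-endian
     * long, so byte-wise ordering matches numeric ordering for
     * non-negative values.
     */
    static byte[][] evenSplits(long maxKey, int numRegions) {
        if (numRegions < 2 || maxKey < numRegions) {
            throw new IllegalArgumentException("need numRegions >= 2 and maxKey >= numRegions");
        }
        byte[][] splits = new byte[numRegions - 1][];
        long step = maxKey / numRegions; // divide first to avoid overflow
        for (int i = 1; i < numRegions; i++) {
            // ByteBuffer is big-endian by default.
            splits[i - 1] = ByteBuffer.allocate(8).putLong(step * i).array();
        }
        return splits;
    }
}
```

The resulting array could then be handed to a table-creation call that
accepts split keys (HBaseAdmin.createTable(desc, splitKeys), if I recall
the 0.90 client API correctly), so that the initial write load is spread
over several region servers instead of one.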

> Is the zk ensemble up and running still?

The ZK ensemble is running fine. I have 3 zk servers running ZK 3.3.2.

> What's the last thing in this regionserver log?

This is the RS log
http://pastebin.com/Cvx8zS54

> Anything in the .out file?

This is the System.out/err
http://pastebin.com/gNNVUzvZ

> I've not seen this
> before but, hey, the world is a wide and wonderful place.  We could
> run the zk close inside a thread and interrupt if it goes on too long
> (Let me ask the zk boys if they've seen this before too).
>

I am subscribed to the ZK list too and I have seen your email. I am
using ZK 3.3.2 ...

> St.Ack
>

Thank you,
Bogdan

Re: zookeeper connection hangs during shutdown

Posted by Stack <st...@duboce.net>.

On Mon, Apr 4, 2011 at 2:30 AM, Bogdan Ghidireac <bo...@ecstend.com> wrote:
> Is it possible to add a timeout and then force a System.exit()?
>

Yes. Of course.  Sounds bad.  How do you think this scenario came about?
Is the zk ensemble up and running still?  What's the last thing in this
regionserver log?  Anything in the .out file?  I've not seen this
before but, hey, the world is a wide and wonderful place.  We could
run the zk close inside a thread and interrupt if it goes on too long
(Let me ask the zk boys if they've seen this before too).

St.Ack
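
The idea above of running the zk close inside a thread and interrupting
it if it goes on too long could look roughly like the sketch below. This
is not the HBase implementation: TimedClose and runWithTimeout are
hypothetical names, and the timeout value is left to the caller. Whether
an interrupt actually unblocks the ZooKeeper client is also an
assumption; the worker is a daemon thread so that a close that stays
stuck does not by itself keep the JVM alive.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of HBase or ZooKeeper. */
final class TimedClose {
    private TimedClose() {}

    /**
     * Runs closeAction on a daemon thread and waits up to timeoutMs.
     * Returns true if the action finished in time; returns false if it
     * timed out (the worker is then interrupted), threw an exception,
     * or the calling thread was interrupted while waiting.
     */
    static boolean runWithTimeout(Runnable closeAction, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-close");
            t.setDaemon(true); // a stuck close must not block JVM exit
            return t;
        });
        Future<?> future = executor.submit(closeAction);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return false;
        } catch (ExecutionException e) {
            return false; // the close action itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller in the shutdown path could wrap the blocking call from the
thread dump, HConnectionManager.deleteConnection(conf, true), in a
Runnable, pass it to runWithTimeout, and fall back to System.exit() when
the method returns false.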