Posted to common-dev@hadoop.apache.org by Michael Stack <st...@duboce.net> on 2007/10/04 00:34:00 UTC

Soliciting suggestions on hung ipc Client connection flush

On Hudson, we've been seeing tests sporadically hang on an ipc Client 
flush of params.  I'm writing the list for suggestions or opinions on 
what folks think might be happening, or for ideas on what to try next.  
See below for the latest example, a thread dump from a recent patch build.

The usual scenario is that we are trying to simulate failed servers in a 
mini-cluster.  All servers -- hbase + dfs servers -- are up and running 
inside the same JVM.  The remote ipc Server suddenly has its stop method 
run to simulate a server crash.  The Client, unawares, tries to go about 
its usual business.

    [junit] "HMaster.metaScanner" daemon prio=10 tid=0x091ecde0 nid=0x4a runnable [0xe2af9000..0xe2af9b38]
    [junit] 	at java.net.SocketOutputStream.socketWrite0(Native Method)
    [junit] 	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
    [junit] 	at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    [junit] 	at org.apache.hadoop.ipc.Client$Connection$2.write(Client.java:190)
    [junit] 	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    [junit] 	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    [junit] 	- locked <0xf7bb40e0> (a java.io.BufferedOutputStream)
    [junit] 	at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    [junit] 	at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:325)
    [junit] 	- locked <0xf7bb3f68> (a java.io.DataOutputStream)
    [junit] 	at org.apache.hadoop.ipc.Client.call(Client.java:462)
    [junit] 	- locked <0xf7bb3fa8> (a org.apache.hadoop.ipc.Client$Call)
    [junit] 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
    [junit] 	at $Proxy8.openScanner(Unknown Source)
    [junit] 	at org.apache.hadoop.hbase.HMaster$BaseScanner.scanRegion(HMaster.java:207)
    [junit] 	at org.apache.hadoop.hbase.HMaster$MetaScanner.scanOneMetaRegion(HMaster.java:643)
    [junit] 	- locked <0xf7b6b460> (a java.lang.Integer)
    [junit] 	at org.apache.hadoop.hbase.HMaster$MetaScanner.maintenanceScan(HMaster.java:694)
    [junit] 	at org.apache.hadoop.hbase.HMaster$BaseScanner.chore(HMaster.java:188)
    [junit] 	at org.apache.hadoop.hbase.Chore.run(Chore.java:59)


Other threads in the thread dump are parked at the DataOutputStream 
synchronized block.

Please correct me if I am wrong, but it is my understanding that these 
writes do not time out, nor is this type of I/O interruptible.  The 
connection is probably already established, else it would have timed out 
trying to connect to the non-existent server; besides, the ipc Client 
pattern seems to keep the connection up, multiplexing 'commands' to the 
remote server...
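To illustrate the point (this is not Hadoop's code, just a standalone sketch): a blocking SocketOutputStream write honors no SO_TIMEOUT -- that option only covers reads -- so once the kernel send buffer and the peer's receive buffer fill, the writer is stuck for good.  One conceivable workaround is a non-blocking NIO SocketChannel, where write() returns 0 instead of blocking and the caller can enforce its own deadline.  The class and method names below are made up for the demo:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class WriteDeadlineDemo {
    // Returns true if the write hit our self-imposed deadline.
    public static boolean demo() throws IOException {
        // A local "server" that accepts the connection but never reads,
        // mimicking a remote that has stopped servicing the socket.
        ServerSocket server = new ServerSocket(0);
        SocketChannel ch = SocketChannel.open(
                new InetSocketAddress("127.0.0.1", server.getLocalPort()));
        Socket peer = server.accept();   // accepted, but never read from
        ch.configureBlocking(false);     // non-blocking: write() may return 0

        ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
        long deadline = System.currentTimeMillis() + 2000;  // 2-second budget
        boolean timedOut = false;
        while (!timedOut) {
            buf.clear();
            if (ch.write(buf) == 0) {                 // kernel buffers full
                if (System.currentTimeMillis() > deadline) {
                    timedOut = true;                  // our own write timeout
                } else {
                    // Real code would register OP_WRITE with a Selector
                    // here rather than sleep-polling.
                    try { Thread.sleep(50); } catch (InterruptedException e) { break; }
                }
            }
        }
        ch.close();
        peer.close();
        server.close();
        return timedOut;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("write hit deadline: " + demo());
    }
}
```

With the blocking stream in the thread dump there is no equivalent escape hatch, which would explain why the thread sits in socketWrite0 indefinitely.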

I'm wondering why we don't get an exception on the client side when the 
remote side of the socket goes away?

Am unable to reproduce locally.

Thanks for any input,
St.Ack