Posted to common-issues@hadoop.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2011/03/05 00:44:45 UTC

[jira] Commented: (HADOOP-6762) exception while doing RPC I/O closes channel

    [ https://issues.apache.org/jira/browse/HADOOP-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002871#comment-13002871 ] 

Todd Lipcon commented on HADOOP-6762:
-------------------------------------

Hi Sam,

We saw the following deadlock, which I think is related to this patch:

{noformat}
Thread 50994 (IPC Client (47) connection to XXXXXXXXX:8020 from hdfs):
State: BLOCKED
Blocked count: 7168
Waited count: 7122
Blocked on java.io.DataOutputStream@2e932fec
Blocked by 50828 (sendParams-14)
Stack:
org.apache.hadoop.ipc.Client$Connection.sendPing(Client.java:676)
org.apache.hadoop.ipc.Client$Connection.access$400(Client.java:210)
org.apache.hadoop.ipc.Client$Connection$PingInputStream.handleTimeout(Client.java:340)
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:370)
java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
java.io.BufferedInputStream.read(BufferedInputStream.java:237)
java.io.DataInputStream.readInt(DataInputStream.java:370)
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:781)
org.apache.hadoop.ipc.Client$Connection.run(Client.java:689)


Thread 50828 (sendParams-14):
State: BLOCKED
Blocked count: 54
Waited count: 8313
Blocked on org.apache.hadoop.ipc.Client$Connection@54aeb3dd
Blocked by 50994 (IPC Client (47) connection to XXXXXX:8020 from hdfs)
Stack:
org.apache.hadoop.ipc.Client$Connection.markClosed(Client.java:809)
org.apache.hadoop.ipc.Client$Connection.access$1200(Client.java:210)
org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:745)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
java.util.concurrent.FutureTask.run(FutureTask.java:138)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:619)
{noformat}

The issue is that the patch acquires the same two locks in opposite orders:
sendParam's senderFuture: Connection.out -> Connection (in markClosed)
sendPing: Connection -> Connection.out (explicit sync)
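
To make the inversion concrete, here is a minimal Java sketch of the two lock orders. It is not the code in Client.java: LockOrderSketch, sendParamInverted, sendParamOrdered and runOrdered are hypothetical names, and the plain Object named out only stands in for Connection.out. The "ordered" variant is just one possible way to break the cycle (call markClosed only after out has been released), not necessarily how the patch should be fixed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for Client.Connection; not the real Hadoop class.
class LockOrderSketch {
    private final Object out = new Object(); // stands in for Connection.out
    private boolean closed = false;
    private int pings = 0;

    // Order seen in the sendPing stack: Connection monitor first, then out.
    synchronized void sendPing() {
        synchronized (out) {
            pings++;
        }
    }

    // Order seen in the sendParams stack: out first, then Connection monitor.
    // Run concurrently with sendPing, this ordering can deadlock.
    void sendParamInverted() {
        synchronized (out) {
            markClosed();
        }
    }

    // Assumed alternative: note the failure while holding out, then call
    // markClosed only after out has been released, so the two locks are
    // never held at the same time on this path.
    void sendParamOrdered() {
        boolean failed;
        synchronized (out) {
            failed = true; // pretend the write threw an IOException
        }
        if (failed) {
            markClosed();
        }
    }

    synchronized void markClosed() {
        closed = true;
    }

    synchronized boolean isClosed() {
        return closed;
    }

    // Runs sendPing and sendParamOrdered on two threads and reports whether
    // both finished within the timeout and the connection was marked closed.
    static boolean runOrdered(int iterations) {
        final LockOrderSketch conn = new LockOrderSketch();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(() -> {
            for (int i = 0; i < iterations; i++) {
                conn.sendPing();
            }
        });
        pool.execute(() -> {
            for (int i = 0; i < iterations; i++) {
                conn.sendParamOrdered();
            }
        });
        pool.shutdown();
        try {
            return pool.awaitTermination(10, TimeUnit.SECONDS) && conn.isClosed();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ordered variant finished: " + runOrdered(100_000));
    }
}
```

Swapping sendParamOrdered for sendParamInverted in runOrdered reproduces the cycle from the thread dump: one thread holds out and waits for the Connection monitor while the other holds the Connection monitor and waits for out. Whether a given run actually hangs depends on timing.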

Have you guys seen this issue?

> exception while doing RPC I/O closes channel
> --------------------------------------------
>
>                 Key: HADOOP-6762
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6762
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: sam rash
>            Assignee: sam rash
>            Priority: Critical
>             Fix For: 0.22.0
>
>         Attachments: hadoop-6762-1.txt, hadoop-6762-10.txt, hadoop-6762-2.txt, hadoop-6762-3.txt, hadoop-6762-4.txt, hadoop-6762-6.txt, hadoop-6762-7.txt, hadoop-6762-8.txt, hadoop-6762-9.txt
>
>
> If a single process creates two unique FileSystem instances to the same NN using FileSystem.newInstance(), and one of them issues a close(), the leasechecker thread is interrupted.  This interrupt races with the RPC namenode.renew() and can cause a ClosedByInterruptException.  This closes the underlying channel, and the other filesystem, which shares the connection, will get errors.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira