You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Misha Dmitriev (JIRA)" <ji...@apache.org> on 2018/07/10 16:39:00 UTC

[jira] [Commented] (HADOOP-15538) Possible RPC deadlock in Client

    [ https://issues.apache.org/jira/browse/HADOOP-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538911#comment-16538911 ] 

Misha Dmitriev commented on HADOOP-15538:
-----------------------------------------

So, as far as I understand, we have two threads that try to grab the same lock (which is an instance of java.lang.Object referened by SocketChannelImpl.stateLock). When the JVM's deadlock detection mechanism tries to find who currently holds this lock, it cannot find any Java thread responsible for this. Such a situation is considered a variation of a deadlock, though it's not a classical one with two threads and two locks. Rather, it's a single lock, but it can never be grabbed by the waiting threads, because the only thread that can unlock it somehow disappeared. Note that the JVM's message, as well as the comments in the JVM code, are somewhat cryptic, and it took me some head-scratching and guessing before I understood what they try to say.

I don't think that some normal Java thread threw an exception and exited, but didn't clean up one of the locks that it was holding. At least I've never seen such a situation in the past. Probably such a bug would be relatively easy to reproduce, and thus would have been fixed long ago. So I think here we have something really non-standard in play, and therefore the following exotic scenarios are more likely here:
 # The lock is still being held by some thread that the JVM doesn't know about, e.g. one started from native code.
 # The thread that was holding the lock exited in some non-standard, non-graceful way, perhaps because of a failure in native code. I am not sure what happens in such a case, and my theory is that if the thread is terminated by the OS and the JVM doesn't have a chance to interfere, all the Java locks that such a thread holds won't be unlocked. So we really have an "orphaned" lock.
 # Native code generally doing something bad. Note that in the JDK bug mentioned above, one of the JDK guys gave the following possible reason for running into this condition: "My point is that the reason for the assert condition not holding (the owner of a monitor apparently not being in the Threadslist) may not be due to any inherent bug in deadlock detection or monitor management but may due to some other problem induced by the testcase - e.g. a memory stomp due to native code failing to check for errors after invoking JNI functions."

The bottom line is that I think we should really sniff for the native code in the app that runs this HDFS Client, and then check if that native code is doing something unusual.

> Possible RPC deadlock in Client
> -------------------------------
>
>                 Key: HADOOP-15538
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15538
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>            Priority: Major
>         Attachments: t1+13min.jstack, t1.jstack
>
>
> We have a jstack collection that spans 13 minutes. One frame per ~1.5 minutes. And for each of the frame, I observed the following:
> {code:java}
> Found one Java-level deadlock:
> =============================
> "IPC Parameter Sending Thread #294":
>   waiting to lock monitor 0x00007f68f21f3188 (object 0x0000000621745390, a java.lang.Object),
>   which is held by UNKNOWN_owner_addr=0x00007f68332e2800
> Java stack information for the threads listed above:
> ===================================================
> "IPC Parameter Sending Thread #294":
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:268)
>         - waiting to lock <0x0000000621745390> (a java.lang.Object)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>         - locked <0x0000000621745380> (a java.lang.Object)
>         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         - locked <0x0000000621749850> (a java.io.BufferedOutputStream)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>         at org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:1072)
>         - locked <0x000000062174b878> (a java.io.DataOutputStream)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Found one Java-level deadlock:
> =============================
> "IPC Client (297602875) connection to x.y.z.p:8020 from impala":
>   waiting to lock monitor 0x00007f68f21f3188 (object 0x0000000621745390, a java.lang.Object),
>   which is held by UNKNOWN_owner_addr=0x00007f68332e2800
> Java stack information for the threads listed above:
> ===================================================
> "IPC Client (297602875) connection to x.y.z.p:8020 from impala":
>         at sun.nio.ch.SocketChannelImpl.readerCleanup(SocketChannelImpl.java:279)
>         - waiting to lock <0x0000000621745390> (a java.lang.Object)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:390)
>         - locked <0x0000000621745370> (a java.lang.Object)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:553)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         - locked <0x00000006217476f0> (a java.io.BufferedInputStream)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006)
> Found 2 deadlocks.
> {code}
> This happens with jdk1.8.0_162 on 2.6.32-696.18.7.el6.x86_64.
> The code appears to match [https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/tree/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java].
> The first thread is blocked at:
> [https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=268]
> The second thread is blocked at:
>  [https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=279]
> There are two issues here:
>  # There seems to be a real deadlock because the stacks remain the same even if the first an last jstack frames captured is 13 minutes apart.
>  # Java deadlock report seems to be problematic, two threads that have deadlock should not be blocked on the same lock, but they appear to be in this case: the same SocketChannelImpl's stateLock.
> I found a relevant jdk jira [https://bugs.openjdk.java.net/browse/JDK-8007476], it explains where two deadlocks are reported and they are really for the same deadlock.
> I don't see a similar report about this issue in jdk jira database, and I'm thinking about filing a jdk jira for that, but would like to throw some discussion here before that.
> Issue#1 is important, because the client is hanging, which indicate a real problem; however, without a correct report before issue#2 is fixed, it's not clear how the deadlock really looks like.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org