Posted to issues@geode.apache.org by "Dan Smith (JIRA)" <ji...@apache.org> on 2019/08/08 17:09:00 UTC

[jira] [Commented] (GEODE-7055) Deadlock with StartupMessages if P2P error requiring a sendFailureReply

    [ https://issues.apache.org/jira/browse/GEODE-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903154#comment-16903154 ] 

Dan Smith commented on GEODE-7055:
----------------------------------

We found this in a test of the Tomcat session replication module. It turns out the session module has a membership listener that sends this BootstrappingFunction to *all* members as soon as they join, which is before they send the startup message.

If a member does not have the BootstrappingFunction on its classpath, it will try to send the above failure reply and hang. Our documentation is vague on whether locators need this class on their classpath, so some users may not have put it there. See https://geode.apache.org/docs/guide/19/tools_modules/http_session_mgmt/tomcat_setting_up_the_module.html.

Before the changes in 00ed2f3c, we would only hang for 15 seconds, but now we hang forever.
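
To make the hang concrete, here is a minimal sketch of the pattern, not Geode code. It assumes a single reader thread that processes inbound messages in order, and a "ready to send" gate that only opens once that same thread has processed the startup response. The class StartupGateSketch, the Msg enum, and the method failureReplySent are all hypothetical names made up for illustration; the timeout parameter stands in for the old 15 second limit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class StartupGateSketch {
    /** Hypothetical message kinds, for illustration only. */
    enum Msg { UNDESERIALIZABLE, STARTUP_RESPONSE }

    /**
     * Processes the inbound messages in order on one reader thread.
     * Returns true if the failure reply got through the gate, false if
     * waiting on the gate timed out.
     */
    static boolean failureReplySent(List<Msg> inbound, long gateTimeoutMillis) {
        // Stand-in for the "ready to send messages" condition.
        CountDownLatch readyToSend = new CountDownLatch(1);
        ExecutorService reader = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> result = reader.submit(() -> {
                boolean sent = false;
                for (Msg m : inbound) {
                    if (m == Msg.STARTUP_RESPONSE) {
                        // Only the reader thread can open the gate.
                        readyToSend.countDown();
                    } else {
                        // Sending the failure reply blocks on the gate. If the
                        // startup response is still queued behind this message,
                        // nothing else can open the gate while we wait here.
                        sent = readyToSend.await(gateTimeoutMillis, TimeUnit.MILLISECONDS);
                    }
                }
                return sent;
            });
            return result.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            reader.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Bad message arrives before the startup response: the wait times out.
        System.out.println(failureReplySent(
            Arrays.asList(Msg.UNDESERIALIZABLE, Msg.STARTUP_RESPONSE), 200)); // prints false
        // Startup response arrives first: the reply goes out immediately.
        System.out.println(failureReplySent(
            Arrays.asList(Msg.STARTUP_RESPONSE, Msg.UNDESERIALIZABLE), 200)); // prints true
    }
}
```

With a bounded wait the reader thread eventually gives up and moves on to the startup response. If the await had no timeout, the first case would never return, which matches the "hang forever" behavior described above.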

> Deadlock with StartupMessages if P2P error requiring a sendFailureReply 
> ------------------------------------------------------------------------
>
>                 Key: GEODE-7055
>                 URL: https://issues.apache.org/jira/browse/GEODE-7055
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Ernest Burghardt
>            Assignee: Dan Smith
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When an error/exception occurs on the P2P message thread that requires a FailureReply to be sent, but the StartupResponse message has not yet been received (on the P2P message thread), the failure reply will DEADLOCK on the call to
> org.apache.geode.distributed.internal.ClusterDistributionManager.waitUntilReadyToSendMsgs
> as the StartupOperation is already in a waitForReplies() for the StartupResponse
> Below is an example of an Exception triggering the DEADLOCK:
> {code:java}
> [fatal 2019/08/05 22:47:06.462 UTC <P2P message reader for 10.0.8.10(cacheserver-28663bad-c0b0-41f7-b723-5a2425fa54ff:1)<v5>:56152(version:GEODE 1.9.0) shared unordered uid=63 port=49194> tid=0x25] Error deserializing message
> java.lang.ClassNotFoundException: org.apache.geode.modules.util.BootstrappingFunction
>         at org.apache.geode.internal.ClassPathLoader.forName(ClassPathLoader.java:180)
>         at org.apache.geode.internal.InternalDataSerializer.getCachedClass(InternalDataSerializer.java:3274)
>         at org.apache.geode.DataSerializer.readClass(DataSerializer.java:264)
>         at org.apache.geode.internal.InternalDataSerializer.readDataSerializable(InternalDataSerializer.java:2398)
>         at org.apache.geode.internal.InternalDataSerializer.basicReadObject(InternalDataSerializer.java:2673)
>         at org.apache.geode.DataSerializer.readObject(DataSerializer.java:2968)
>         at org.apache.geode.internal.cache.MemberFunctionStreamingMessage.fromData(MemberFunctionStreamingMessage.java:277)
>         at org.apache.geode.internal.InternalDataSerializer.invokeFromData(InternalDataSerializer.java:2372)
>         at org.apache.geode.internal.DSFIDFactory.create(DSFIDFactory.java:997)
>         at org.apache.geode.internal.InternalDataSerializer.readDSFID(InternalDataSerializer.java:2516)
>         at org.apache.geode.internal.InternalDataSerializer.readDSFID(InternalDataSerializer.java:2528)
>         at org.apache.geode.internal.tcp.Connection.readMessage(Connection.java:3111)
>         at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2920)
>         at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1745)
>         at org.apache.geode.internal.tcp.Connection.run(Connection.java:1577)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>         "P2P message reader for 10.0.8.10(cacheserver-28663bad-c0b0-41f7-b723-5a2425fa54ff:1)<v5>:56152(version:GEODE 1.9.0) shared unordered uid=63 port=49194" #37 daemon prio=10 os_prio=0 tid=0x00007f4a108bb800 nid=0x2a in Object.wait() [0x00007f4a0dca7000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000006d39c4538> (a java.lang.Object)
>         at java.lang.Object.wait(Object.java:502)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.waitUntilReadyToSendMsgs(ClusterDistributionManager.java:1212)
>         - locked <0x00000006d39c4538> (a java.lang.Object)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.sendMessage(ClusterDistributionManager.java:2816)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.putOutgoing(ClusterDistributionManager.java:1528)
>         at org.apache.geode.distributed.internal.ReplyMessage.send(ReplyMessage.java:113)
>         at org.apache.geode.distributed.internal.ReplyMessage.send(ReplyMessage.java:86)
>         at org.apache.geode.internal.tcp.Connection.sendFailureReply(Connection.java:1954)
>         at org.apache.geode.internal.tcp.Connection.readMessage(Connection.java:3162)
>         at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2920)
>         at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1745)
>         at org.apache.geode.internal.tcp.Connection.run(Connection.java:1577)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
>  


