Posted to dev@sling.apache.org by "Robert Munteanu (JIRA)" <ji...@apache.org> on 2014/10/15 17:23:33 UTC

[jira] [Commented] (SLING-4061) Deadlock involving discovery services at startup with Oak

    [ https://issues.apache.org/jira/browse/SLING-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172469#comment-14172469 ] 

Robert Munteanu commented on SLING-4061:
----------------------------------------

There is an issue in the {{DiscoveryServiceImpl}} {{activate()}} method, but I'm not sure it's the root cause. The class leaks a reference to itself before {{activate()}} completes:

{code:java}
        // make sure the first heartbeat is issued as soon as possible - which
        // is right after this service starts. since the two (discoveryservice
        // and heartbeatHandler need to know each other, the discoveryservice
        // is passed on to the heartbeatHandler in this initialize call).
        heartbeatHandler.initialize(this,
                clusterViewService.getIsolatedClusterViewId());

        final TopologyEventListener[] registeredServices;
        synchronized (lock) {
            registeredServices = this.eventListeners;
            doUpdateProperties();

            TopologyViewImpl newView = (TopologyViewImpl) getTopology();
            TopologyEvent event = new TopologyEvent(Type.TOPOLOGY_INIT, null,
                    newView);
            for (final TopologyEventListener da : registeredServices) {
                sendTopologyEvent(da, event);
            }
            activated = true;
            oldView = newView;
        }
{code}

The deadlock itself is a lock ordering issue:

- in thread "pool-5-thread-1" the HeartbeatHandler wants to issue an update; the thread holds the DiscoveryServiceImpl.lock lock but can't acquire the SegmentNodeStoreService lock
- in thread "CM Event Dispatcher..." the SegmentNodeStoreService holds its own lock and the call stack ends up trying to invoke DiscoveryServiceImpl.bindTopologyEventListener, which needs the DiscoveryServiceImpl.lock

I wonder whether we need more fine-grained locking in the DiscoveryServiceImpl. A single lock object seems too coarse-grained, especially since a lot seems to happen during calls like updateProperties(), including invocation of foreign code (notifying event listeners). That is a bit worrisome, since invoking foreign code with locks held is prone to deadlocks.
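
As a rough illustration of what avoiding foreign calls under the lock could look like, here is a minimal sketch. It is not the actual {{DiscoveryServiceImpl}} code: {{OpenCallNotifier}}, its {{Listener}} interface and the String event are hypothetical stand-ins. The listener list is copied while the lock is held and the listeners are notified only after it has been released:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a service that notifies listeners; not the Sling class.
class OpenCallNotifier {

    // Hypothetical stand-in for TopologyEventListener.
    public interface Listener {
        void handle(String event);
    }

    private final Object lock = new Object();
    private final List<Listener> listeners = new ArrayList<Listener>();

    public void bind(Listener listener) {
        synchronized (lock) {
            listeners.add(listener);
        }
    }

    public void publish(String event) {
        final List<Listener> snapshot;
        synchronized (lock) {
            // Only the copy is made under the lock.
            snapshot = new ArrayList<Listener>(listeners);
        }
        // Foreign code runs without the lock, so it cannot be part of
        // a lock ordering cycle that involves this lock.
        for (Listener listener : snapshot) {
            listener.handle(event);
        }
    }

    // Diagnostic helper for the sketch: does the calling thread hold the lock?
    boolean lockHeldByCurrentThread() {
        return Thread.holdsLock(lock);
    }
}
{code}

The trade-off is that a listener may still receive an event shortly after it has been unbound, and events published concurrently from different threads are no longer serialized by the lock, so the ordering guarantees would need separate thought.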

Another alternative is to make use of concurrent collections, e.g. for the event listeners, but I'm not sure we wouldn't get bitten by the fact that they are weakly consistent.
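
To make the consistency concern concrete, here is a small sketch using {{java.util.concurrent.CopyOnWriteArrayList}}. The {{CowListenerRegistry}} class and its {{Listener}} interface are hypothetical, not existing Sling code. Iteration needs no lock, but it works on the array as it was when the iteration started, so a listener bound while an event is being delivered does not see that event:

{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical listener registry; not the Sling implementation.
class CowListenerRegistry {

    // Hypothetical stand-in for TopologyEventListener.
    public interface Listener {
        void handle(String event);
    }

    private final List<Listener> listeners = new CopyOnWriteArrayList<Listener>();

    public void bind(Listener listener) {
        listeners.add(listener);
    }

    public void unbind(Listener listener) {
        listeners.remove(listener);
    }

    // Notifies all listeners and returns how many were called.
    public int publish(String event) {
        int notified = 0;
        // The iterator works on a snapshot of the list: concurrent bind/unbind
        // calls never throw ConcurrentModificationException, but they are also
        // not reflected in an iteration that is already in progress.
        for (Listener listener : listeners) {
            listener.handle(event);
            notified++;
        }
        return notified;
    }
}
{code}

With this approach a listener registered during delivery of TOPOLOGY_INIT would miss the init event unless it is sent separately on bind, which is the kind of gap I'm worried about.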


> Deadlock involving discovery services at startup with Oak
> ---------------------------------------------------------
>
>                 Key: SLING-4061
>                 URL: https://issues.apache.org/jira/browse/SLING-4061
>             Project: Sling
>          Issue Type: Bug
>          Components: Extensions
>            Reporter: Bertrand Delacretaz
>         Attachments: discovery-deadlock.txt
>
>
> I just got a deadlock at startup when starting the launchpad integration tests instance on sling trunk revision 1632058 (so starting with Oak):
> {code}
> export DBG="-Xmx1G  -XX:MaxPermSize=256m -agentlib:jdwp=transport=dt_socket,address=30303,server=y,suspend=n"
> export MAVEN_OPTS="-Xmx1G  -XX:MaxPermSize=256m $DBG -Dsling.run.modes=oak"
> cd launchpad/testing
> mvn launchpad:run
> {code}
> I'll attach the stack trace. The discovery HeartbeatHandler and DiscoveryServiceImpl classes are involved.
> The deadlock happens often on my box (macosx 10.9.5, java version "1.7.0_45"), with the same deadlock pattern AFAICS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)