You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sling.apache.org by "Stefan Egli (JIRA)" <ji...@apache.org> on 2015/11/09 12:23:11 UTC

[jira] [Created] (SLING-5280) HeartbeatHandler can also be blocked while holding AnnouncementRegistry lock

Stefan Egli created SLING-5280:
----------------------------------

             Summary: HeartbeatHandler can also be blocked while holding AnnouncementRegistry lock
                 Key: SLING-5280
                 URL: https://issues.apache.org/jira/browse/SLING-5280
             Project: Sling
          Issue Type: Bug
          Components: Extensions
    Affects Versions: Discovery Impl 1.2.0
            Reporter: Stefan Egli
            Assignee: Stefan Egli
            Priority: Critical
             Fix For: Discovery Impl 1.2.2


SLING-5195 introduces a separate thread that periodically checks if the local HeartbeatHandler is properly functional and has written a heartbeat 'recently enough'. If not, it marks the local topology as TOPOLOGY_CHANGING and listeners are thus stepping back correctly (as the local instance is likely to be treated as dead by other instances in the local cluster soon).

Now this works fine as long as the new, separate thread can operate uninterruptively. But besides the necessity that it is properly scheduled regularly (which it might not, thus we should probably rather use a dedicated thread than rely on the scheduler's pool to be available) it also requires that the check can go ahead without running into a locked monitor.

https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/1960/testReport/org.apache.sling.discovery.impl.cluster/RepositoryDelaysTest/testSlowSessionSaves/ however reports a situation where the blocked commit was not in {{HeartbeatHandler.issueClusterLocalHeartbeat}} - but was in {{AnnouncementRegistry.checkExpiredAnnouncements}} - and that one does a {{synchronized(this)}} - which in turn blocks precisely this 2nd HeartbeatHandler's monitoring thread - yielding it useless..

The following log excerpt (grepping thread-3/-4/main) proofs the above described scenario:
{code}
06.11.2015 23:35:22.864 *INFO * [main] RepositoryDelaysTest: <main> both instances marked as delaying 1min - but with new background checks we should go changing within 3sec
06.11.2015 23:35:22.870 *DEBUG* [pool-1-thread-4] HeartbeatHandler: issueConnectorPings: not issuing remote heartbeat yet, startup not yet finished
06.11.2015 23:35:22.870 *INFO * [pool-1-thread-4] ArtificialDelay: delay: delaying [secondInstance] 'pre.commit' for 60000ms...
06.11.2015 23:35:22.870 *INFO * [pool-1-thread-4] ArtificialDelay: delay: delaying [secondInstance] 'pre.commit' now for 60000ms...
06.11.2015 23:35:22.909 *INFO * [Test-ViewCheckerRunner [firstInstance]] ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' for 59876ms...
06.11.2015 23:35:22.909 *INFO * [Test-ViewCheckerRunner [firstInstance]] ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' now for 59876ms...
06.11.2015 23:35:23.613 *DEBUG* [pool-1-thread-3] ViewHelper: getEstablishedView: no established view found: /var/discovery/clusterC/establishedView
06.11.2015 23:35:23.613 *DEBUG* [pool-1-thread-3] ClusterViewServiceImpl: getClusterView: no view established at the moment. isolated mode
06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] BaseDiscoveryService: getTopology: undefined cluster view: NO_ESTABLISHED_VIEW] org.apache.sling.discovery.base.commons.UndefinedClusterViewException: no established view at the moment
06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] MinEventDelayHandler: asyncDelay.run: done delaying. new view (still/again) not current, delaying again
06.11.2015 23:35:23.613 *TRACE* [pool-1-thread-3] MinEventDelayHandler: runAfter: trying with scheduler.fireJob
06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] MinEventDelayHandler: triggerAsyncDelaying: asynch delaying of 1 triggered: true
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] ClusterViewServiceImpl: getClusterView: the local instance (1b6e7530-0d41-46e7-8471-6ea7f5d3c2bc) is currently not included in the existing established view! This is normal at startup. At other times is pseudo-network-partitioning is an indicator for repository/network-delays or clocks-out-of-sync (SLING-3432). (increasing the heartbeatTimeout can help as a workaround too) The local instance will stay in TOPOLOGY_CHANGING or pre _INIT mode until a new vote was successful.
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] BaseDiscoveryService: getTopology: undefined cluster view: ISOLATED_FROM_TOPOLOGY] org.apache.sling.discovery.base.commons.UndefinedClusterViewException: established view does not include local instance - isolated
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] MinEventDelayHandler: asyncDelay.run: done delaying. new view (still/again) not current, delaying again
06.11.2015 23:35:23.705 *TRACE* [pool-1-thread-3] MinEventDelayHandler: runAfter: trying with scheduler.fireJob
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] MinEventDelayHandler: triggerAsyncDelaying: asynch delaying of 1 triggered: true
06.11.2015 23:35:23.840 *DEBUG* [pool-1-thread-3] HeartbeatHandler: checkForTopologyChange/.run: going to check for topology change...
06.11.2015 23:35:30.865 *DEBUG* [main] ClusterViewServiceImpl: getClusterView: current establishedView is marked as invalid: d5f1101c-f693-44a4-9c79-f7277230591f
06.11.2015 23:35:30.865 *INFO * [main] BaseDiscoveryService: getTopology: undefined cluster view: NO_ESTABLISHED_VIEW] org.apache.sling.discovery.base.commons.UndefinedClusterViewException: current established view was marked as invalid
06.11.2015 23:36:22.786 *INFO * [Test-ViewCheckerRunner [firstInstance]] ArtificialDelay: delay: delaying [firstInstance]'pre.commit' for 59876ms done.
06.11.2015 23:36:22.809 *INFO * [Test-ViewCheckerRunner [firstInstance]] ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' for 59876ms...
06.11.2015 23:36:22.809 *INFO * [Test-ViewCheckerRunner [firstInstance]] ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' now for 59876ms...
06.11.2015 23:36:22.871 *INFO * [pool-1-thread-4] ArtificialDelay: delay: delaying [secondInstance]'pre.commit' for 60000ms done.
06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb: analyzeVotings: start. slingId: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: listInstances: start. localClusterView: a ClusterView[2 instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: start. localClusterView: a ClusterView[2 instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] VotingHelper: listVotings: votings found: 0
06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb: analyzeVotings: no ongoing votings at the moment. done.
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: handling clusterInstance: 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: instance has announcements: 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: handling clusterInstance: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: matched localInstance, filling with cache: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: doCheckView: established view does not match. (details: old: 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5, old: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb)
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: doCheckViewWith: no matching established view, marking topology as changing
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: invalidateCurrentEstablishedView: invalidating slingId=8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, lastEstablishedViewId=d5f1101c-f693-44a4-9c79-f7277230591f
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] ClusterViewServiceImpl: invalidateEstablishedViewId: marking established view as invalid: d5f1101c-f693-44a4-9c79-f7277230591f
06.11.2015 23:36:22.872 *DEBUG* [pool-1-thread-4] DiscoveryServiceImpl: handleTopologyChanging: invoking viewStateManager.handlechanging
06.11.2015 23:36:22.872 *TRACE* [pool-1-thread-4] ViewStateManagerImpl: handleChanging: start
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: result: 0
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: listInstances: announcements added. end. instances: []
06.11.2015 23:36:22.873 *DEBUG* [pool-1-thread-3] AnnouncementRegistryImpl: listInstances: start. localClusterView: a ClusterView[2 instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.873 *DEBUG* [pool-1-thread-3] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: start. localClusterView: a ClusterView[2 instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.874 *INFO * [pool-1-thread-4] ViewStateManagerImpl: enqueueForAll: sending topologyEvent TopologyEvent [type=TOPOLOGY_CHANGING, oldView=DefaultTopologyView[current=false, num=2, instances=8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5[local=false,leader=false],8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb[local=true,leader=true]], newView=null], to all (0) listeners
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)