You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sling.apache.org by "Stefan Egli (JIRA)" <ji...@apache.org> on 2015/11/09 12:25:10 UTC

[jira] [Commented] (SLING-5280) HeartbeatHandler can also be blocked while holding AnnouncementRegistry lock

    [ https://issues.apache.org/jira/browse/SLING-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996391#comment-14996391 ] 

Stefan Egli commented on SLING-5280:
------------------------------------

So it seems the only solution to this very problem is to make this 2nd-HeartbeatHandler-self-monitor-check non-blocking at all times - which means it must not go into the AnnouncementRegistry as that might be blocked

> HeartbeatHandler can also be blocked while holding AnnouncementRegistry lock
> ----------------------------------------------------------------------------
>
>                 Key: SLING-5280
>                 URL: https://issues.apache.org/jira/browse/SLING-5280
>             Project: Sling
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: Discovery Impl 1.2.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Critical
>             Fix For: Discovery Impl 1.2.2
>
>
> SLING-5195 introduces a separate thread that periodically checks if the local HeartbeatHandler is properly functional and has written a heartbeat 'recently enough'. If not, it marks the local topology as TOPOLOGY_CHANGING and listeners are thus stepping back correctly (as the local instance is likely to be treated as dead by other instances in the local cluster soon).
> Now this works fine as long as the new, separate thread can operate uninterruptively. But besides the necessity that it is properly scheduled regularly (which it might not, thus we should probably rather use a dedicated thread than rely on the scheduler's pool to be available) it also requires that the check can go ahead without running into a locked monitor.
> https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/1960/testReport/org.apache.sling.discovery.impl.cluster/RepositoryDelaysTest/testSlowSessionSaves/ however reports a situation where the blocked commit was not in {{HeartbeatHandler.issueClusterLocalHeartbeat}} - but was in {{AnnouncementRegistry.checkExpiredAnnouncements}} - and that one does a {{synchronized(this)}} - which in turn blocks precisely this 2nd HeartbeatHandler's monitoring thread - yielding it useless..
> The following log excerpt (grepping thread-3/-4/main) proofs the above described scenario:
> {code}
> 06.11.2015 23:35:22.864 *INFO * [main] RepositoryDelaysTest: <main> both instances marked as delaying 1min - but with new background checks we should go changing within 3sec
> 06.11.2015 23:35:22.870 *DEBUG* [pool-1-thread-4] HeartbeatHandler: issueConnectorPings: not issuing remote heartbeat yet, startup not yet finished
> 06.11.2015 23:35:22.870 *INFO * [pool-1-thread-4] ArtificialDelay: delay: delaying [secondInstance] 'pre.commit' for 60000ms...
> 06.11.2015 23:35:22.870 *INFO * [pool-1-thread-4] ArtificialDelay: delay: delaying [secondInstance] 'pre.commit' now for 60000ms...
> 06.11.2015 23:35:22.909 *INFO * [Test-ViewCheckerRunner [firstInstance]] ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' for 59876ms...
> 06.11.2015 23:35:22.909 *INFO * [Test-ViewCheckerRunner [firstInstance]] ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' now for 59876ms...
> 06.11.2015 23:35:23.613 *DEBUG* [pool-1-thread-3] ViewHelper: getEstablishedView: no established view found: /var/discovery/clusterC/establishedView
> 06.11.2015 23:35:23.613 *DEBUG* [pool-1-thread-3] ClusterViewServiceImpl: getClusterView: no view established at the moment. isolated mode
> 06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] BaseDiscoveryService: getTopology: undefined cluster view: NO_ESTABLISHED_VIEW] org.apache.sling.discovery.base.commons.UndefinedClusterViewException: no established view at the moment
> 06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] MinEventDelayHandler: asyncDelay.run: done delaying. new view (still/again) not current, delaying again
> 06.11.2015 23:35:23.613 *TRACE* [pool-1-thread-3] MinEventDelayHandler: runAfter: trying with scheduler.fireJob
> 06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] MinEventDelayHandler: triggerAsyncDelaying: asynch delaying of 1 triggered: true
> 06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] ClusterViewServiceImpl: getClusterView: the local instance (1b6e7530-0d41-46e7-8471-6ea7f5d3c2bc) is currently not included in the existing established view! This is normal at startup. At other times is pseudo-network-partitioning is an indicator for repository/network-delays or clocks-out-of-sync (SLING-3432). (increasing the heartbeatTimeout can help as a workaround too) The local instance will stay in TOPOLOGY_CHANGING or pre _INIT mode until a new vote was successful.
> 06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] BaseDiscoveryService: getTopology: undefined cluster view: ISOLATED_FROM_TOPOLOGY] org.apache.sling.discovery.base.commons.UndefinedClusterViewException: established view does not include local instance - isolated
> 06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] MinEventDelayHandler: asyncDelay.run: done delaying. new view (still/again) not current, delaying again
> 06.11.2015 23:35:23.705 *TRACE* [pool-1-thread-3] MinEventDelayHandler: runAfter: trying with scheduler.fireJob
> 06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] MinEventDelayHandler: triggerAsyncDelaying: asynch delaying of 1 triggered: true
> 06.11.2015 23:35:23.840 *DEBUG* [pool-1-thread-3] HeartbeatHandler: checkForTopologyChange/.run: going to check for topology change...
> 06.11.2015 23:35:30.865 *DEBUG* [main] ClusterViewServiceImpl: getClusterView: current establishedView is marked as invalid: d5f1101c-f693-44a4-9c79-f7277230591f
> 06.11.2015 23:35:30.865 *INFO * [main] BaseDiscoveryService: getTopology: undefined cluster view: NO_ESTABLISHED_VIEW] org.apache.sling.discovery.base.commons.UndefinedClusterViewException: current established view was marked as invalid
> 06.11.2015 23:36:22.786 *INFO * [Test-ViewCheckerRunner [firstInstance]] ArtificialDelay: delay: delaying [firstInstance]'pre.commit' for 59876ms done.
> 06.11.2015 23:36:22.809 *INFO * [Test-ViewCheckerRunner [firstInstance]] ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' for 59876ms...
> 06.11.2015 23:36:22.809 *INFO * [Test-ViewCheckerRunner [firstInstance]] ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' now for 59876ms...
> 06.11.2015 23:36:22.871 *INFO * [pool-1-thread-4] ArtificialDelay: delay: delaying [secondInstance]'pre.commit' for 60000ms done.
> 06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb: analyzeVotings: start. slingId: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
> 06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: listInstances: start. localClusterView: a ClusterView[2 instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
> 06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: start. localClusterView: a ClusterView[2 instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
> 06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] VotingHelper: listVotings: votings found: 0
> 06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb: analyzeVotings: no ongoing votings at the moment. done.
> 06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: handling clusterInstance: 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5
> 06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: instance has announcements: 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5
> 06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: handling clusterInstance: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
> 06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: matched localInstance, filling with cache: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
> 06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: doCheckView: established view does not match. (details: old: 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5, old: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb)
> 06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: doCheckViewWith: no matching established view, marking topology as changing
> 06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: invalidateCurrentEstablishedView: invalidating slingId=8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, lastEstablishedViewId=d5f1101c-f693-44a4-9c79-f7277230591f
> 06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] ClusterViewServiceImpl: invalidateEstablishedViewId: marking established view as invalid: d5f1101c-f693-44a4-9c79-f7277230591f
> 06.11.2015 23:36:22.872 *DEBUG* [pool-1-thread-4] DiscoveryServiceImpl: handleTopologyChanging: invoking viewStateManager.handlechanging
> 06.11.2015 23:36:22.872 *TRACE* [pool-1-thread-4] ViewStateManagerImpl: handleChanging: start
> 06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: result: 0
> 06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: listInstances: announcements added. end. instances: []
> 06.11.2015 23:36:22.873 *DEBUG* [pool-1-thread-3] AnnouncementRegistryImpl: listInstances: start. localClusterView: a ClusterView[2 instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
> 06.11.2015 23:36:22.873 *DEBUG* [pool-1-thread-3] AnnouncementRegistryImpl: listAnnouncementsInSameCluster: start. localClusterView: a ClusterView[2 instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
> 06.11.2015 23:36:22.874 *INFO * [pool-1-thread-4] ViewStateManagerImpl: enqueueForAll: sending topologyEvent TopologyEvent [type=TOPOLOGY_CHANGING, oldView=DefaultTopologyView[current=false, num=2, instances=8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5[local=false,leader=false],8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb[local=true,leader=true]], newView=null], to all (0) listeners
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)