You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sling.apache.org by "Stefan Egli (JIRA)" <ji...@apache.org> on 2015/11/23 17:29:11 UTC

[jira] [Updated] (SLING-5302) HeartbeatTest.testSlowAndFastMachine fails due to unsynchronized issuing of a heartbeat

     [ https://issues.apache.org/jira/browse/SLING-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Egli updated SLING-5302:
-------------------------------
    Fix Version/s:     (was: Discovery Impl 1.2.4)
                   Discovery Impl 1.2.6

> HeartbeatTest.testSlowAndFastMachine fails due to unsynchronized issuing of a heartbeat
> ---------------------------------------------------------------------------------------
>
>                 Key: SLING-5302
>                 URL: https://issues.apache.org/jira/browse/SLING-5302
>             Project: Sling
>          Issue Type: Test
>          Components: Extensions
>    Affects Versions: Discovery Impl 1.2.2
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Minor
>             Fix For: Discovery Impl 1.2.6
>
>
> The following grep-extract of the log file of
> * https://builds.apache.org/job/sling-trunk-1.7/org.apache.sling$org.apache.sling.discovery.impl/2721/testReport/org.apache.sling.discovery.impl.common.heartbeat/HeartbeatTest/testSlowAndFastMachine/
> which was one of similar failing tests:
> * https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/1998/testReport/org.apache.sling.discovery.impl.common.heartbeat/HeartbeatTest/testSlowAndFastMachine/
> * https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/2004/testReport/org.apache.sling.discovery.impl.common.heartbeat/HeartbeatTest/testSlowAndFastMachine/
> shows that due to the unsynchronized call to {{issueHeartbeat}} - which later calls {{issueClusterLocalHeartbeat}} - can result in a race-condition with a background job and cause a false-positive from the duplicate sling Id detection. The background job is in this case started by {{triggerAsyncConnectorPing}} (which was only re-activated recently, so this failure pattern is rather recent):
> {code}
> 13.11.2015 10:44:07.000 *DEBUG* [main] HeartbeatHandler: issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for d27b4ec4-6530-445d-aa60-cb015611205e
> 13.11.2015 10:44:07.397 *DEBUG* [main] HeartbeatHandler: issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 332ef30b-33df-488f-807a-456d4b773812
> 13.11.2015 10:44:07.455 *INFO * [Discovery-AsyncEventSender] DiscoveryServiceImpl: changePropagationListener.handleTopologyEvent: topology changed - propagate through connectors
> 13.11.2015 10:44:07.455 *INFO * [Discovery-AsyncEventSender] HeartbeatHandler: triggerAsyncConnectorPing: firing job to trigger heartbeat
> 13.11.2015 10:44:07.456 *INFO * [Discovery-AsyncEventSender] HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[fast]: TOPOLOGY_CHANGED, newView contains 2, newView.isCurrent=true
> 13.11.2015 10:44:07.456 *DEBUG* [pool-1-thread-2] HeartbeatHandler: heartbeatAndCheckView: start. [for slingId=d27b4ec4-6530-445d-aa60-cb015611205e]
> 13.11.2015 10:44:07.467 *INFO * [Discovery-AsyncEventSender] DiscoveryServiceImpl: changePropagationListener.handleTopologyEvent: topology changed - propagate through connectors
> 13.11.2015 10:44:07.467 *INFO * [Discovery-AsyncEventSender] HeartbeatHandler: triggerAsyncConnectorPing: firing job to trigger heartbeat
> 13.11.2015 10:44:07.467 *DEBUG* [pool-1-thread-2] HeartbeatHandler: issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for d27b4ec4-6530-445d-aa60-cb015611205e
> 13.11.2015 10:44:07.467 *DEBUG* [pool-1-thread-1] HeartbeatHandler: heartbeatAndCheckView: start. [for slingId=332ef30b-33df-488f-807a-456d4b773812]
> 13.11.2015 10:44:07.467 *INFO * [Discovery-AsyncEventSender] HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[slow]: TOPOLOGY_CHANGED, newView contains 2, newView.isCurrent=true
> 13.11.2015 10:44:07.469 *DEBUG* [pool-1-thread-1] HeartbeatHandler: issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 332ef30b-33df-488f-807a-456d4b773812
> 13.11.2015 10:44:07.469 *INFO * [Discovery-AsyncEventSender] HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[slow]: TOPOLOGY_CHANGING, oldView contained 2, oldView.isCurrent=false
> 13.11.2015 10:44:07.469 *ERROR* [pool-1-thread-1] HeartbeatHandler: issueClusterLocalHeartbeat: SLING-2892: Detected unexpected, concurrent update of: /var/discovery/impl/clusterInstances/332ef30b-33df-488f-807a-456d4b773812 'lastHeartbeat'. If not done manually, this likely indicates that there is more than 1 instance running in this cluster with the same sling.id. My sling.id is 332ef30b-33df-488f-807a-456d4b773812. Check for sling.id.file in your installation of all instances in this cluster to verify this! Duplicate sling.ids are not allowed within a cluster!
> 13.11.2015 10:44:08.528 *DEBUG* [main] HeartbeatHandler: issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for d27b4ec4-6530-445d-aa60-cb015611205e
> 13.11.2015 10:44:08.550 *DEBUG* [main] HeartbeatHandler: issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 332ef30b-33df-488f-807a-456d4b773812
> 13.11.2015 10:44:08.577 *INFO * [Discovery-AsyncEventSender] DiscoveryServiceImpl: changePropagationListener.handleTopologyEvent: topology changed - propagate through connectors
> 13.11.2015 10:44:08.577 *INFO * [Discovery-AsyncEventSender] HeartbeatHandler: triggerAsyncConnectorPing: firing job to trigger heartbeat
> 13.11.2015 10:44:08.577 *INFO * [Discovery-AsyncEventSender] HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[slow]: TOPOLOGY_CHANGED, newView contains 2, newView.isCurrent=true
> 13.11.2015 10:44:08.577 *DEBUG* [pool-1-thread-5] HeartbeatHandler: heartbeatAndCheckView: start. [for slingId=332ef30b-33df-488f-807a-456d4b773812]
> 13.11.2015 10:44:08.592 *DEBUG* [pool-1-thread-5] HeartbeatHandler: issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 332ef30b-33df-488f-807a-456d4b773812
> 13.11.2015 10:44:09.577 *DEBUG* [main] HeartbeatHandler: issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for d27b4ec4-6530-445d-aa60-cb015611205e
> 13.11.2015 10:44:09.610 *DEBUG* [main] HeartbeatHandler: issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 332ef30b-33df-488f-807a-456d4b773812
> 13.11.2015 10:44:10.622 *INFO * [main] HeartbeatTest: doTestSlowAndFastMachine: by now the two should have joined
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)