You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sling.apache.org by "Stefan Egli (Jira)" <ji...@apache.org> on 2021/06/07 13:44:00 UTC

[jira] [Closed] (SLING-10353) ClusterSyncServiceChain can leak a background thread

     [ https://issues.apache.org/jira/browse/SLING-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Egli closed SLING-10353.
-------------------------------

> ClusterSyncServiceChain can leak a background thread
> ----------------------------------------------------
>
>                 Key: SLING-10353
>                 URL: https://issues.apache.org/jira/browse/SLING-10353
>             Project: Sling
>          Issue Type: Bug
>          Components: Discovery
>    Affects Versions: Discovery Commons 1.0.20
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Major
>             Fix For: Discovery Commons 1.0.24
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Invoking a {{ClusterSyncService.sync()}} should ensure that if a background thread is running, it gets terminated. In the case of a chain of sync services implemented by {{ClusterSyncServiceChain}} there is a race-condition which can leak a background thread running unintentionally.
> The order of events is eg:
> * setup: ClusterSyncServiceChain is configured with a chain of a OakBacklogClusterSyncService followed by a SyncTokenService
> * ViewStateManagerImpl.handleNewViewNonDelayed can be called consecutively, as it is protected by a lock. Let's assume on first invocation it starts a sync process by invoking the ClusterSyncServiceChain.sync()
> * the ClusterSyncServiceChain.sync() will first invoke the OakBacklogClusterSyncService.sync()
> * let's assume there is a backlog (ie a recovery hasn't finished yet) with a particular instance in the cluster - so OakBacklogClusterSyncService.sync() will spawn a background thread that checks for that backlog every 2 seconds (handled by AbstractServiceWithBackgroundCheck.BackgroundCheckRunnable.run()).
> * the OakBacklogClusterSyncService background check determines that the backlog is gone and that it can proceed. It will do so by continuing in the ClusterSyncServiceChain - ie it will invoke the SyncTokenService.sync() - that in turn invokes syncToken() - which invokes startBackgroundThread. At this point, let's assume, the check() returns false - ie there are sync tokens it needs to wait for. Hence it will start a new background Thread for the SyncTokenService.
> * let's now assume, at some later point ViewStateManagerImpl.handleNewViewNonDelayed is again invoked. So it will again call ClusterSyncServiceChain.sync(). Now it is the responsibility of ClusterSyncServiceChain.sync() to cancel any currently running background thread - but  the first element in the chain (the OakBacklogClusterSyncService) doesn't actually have any background thread anymore - it has already "handed over" to the SyncTokenService - so the OakBacklogClusterSyncService.cancelPreviousBackgroundCheck() will not have anything to cancel. Hence it goes ahead and follows the sync() chain process - which is first checking for backlog (that has cleared now, as we assumed), and then the sync token (which is not complete, as we assumed). So it will now create a *second* background thread.
> The above resulting in 2 SyncTokenService background threads.
> This can have a few consequences:
> * a behaviour was noticed, whereby {{SyncTokenService.storeMySyncToken}} keeps being invoked with toggling values, eg first it stores {{627}}, then {{628}}, then again {{627}}, then {{628}} etc. The following is an example of the generated data in DocumentNodeStore as a result ( excerpt from the {{"_id" : "4:/var/discovery/oak/syncTokens"}} document ) :
> {noformat}
> 	"<slingId>" : {
> 		"r179378d10da-0-4" : "\"627\"",
> 		"r179378d14ce-0-4" : "\"628\"",
> 		"r179378d18bc-0-4" : "\"627\"",
> 		"r179378d1ca1-0-4" : "\"628\"",
> 		"r179378d209c-0-4" : "\"627\"",
> 		"r179378d2477-0-4" : "\"628\"",
> 		"r179378d287f-0-4" : "\"627\"",
> 		"r179378d2c4b-0-4" : "\"628\"",
> 		"r179378d3068-0-4" : "\"627\"",
> 		"r179378d3421-0-4" : "\"628\""
> 	},
> {noformat}
> This eventually should clear out - but it can lead to 
> ** unnecessary lengthening the sync procedure as other cluster nodes will potentially have to wait longer
> ** confusion when reading such log files
> * it can also result in conflict exceptions if the two threads want to update simultaneously:
> {noformat}
> 04.05.2021 13:22:27.741 *ERROR* [Discovery-AsyncEventSender] org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService storeMySyncToken: got PersistenceException while storing my syncToken:
>  org.apache.sling.api.resource.PersistenceException: Unable to commit changes to session.
> org.apache.sling.api.resource.PersistenceException: Unable to commit changes to session.
>         at org.apache.sling.jcr.resource.internal.helper.jcr.JcrResourceProvider.commit(JcrResourceProvider.java:516) [org.apache.sling.jcr.resource:3.0.22]
>         at org.apache.sling.resourceresolver.impl.providers.stateful.AuthenticatedResourceProvider.commit(AuthenticatedResourceProvider.java:215) [org.apache.sling.resourceresolver:1.7.2]
>         at org.apache.sling.resourceresolver.impl.helper.ResourceResolverControl.commit(ResourceResolverControl.java:424) [org.apache.sling.resourceresolver:1.7.2]
>         at org.apache.sling.resourceresolver.impl.ResourceResolverImpl.commit(ResourceResolverImpl.java:1000) [org.apache.sling.resourceresolver:1.7.2]
>         at org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.storeMySyncToken(SyncTokenService.java:174) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.access$000(SyncTokenService.java:53) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService$1.check(SyncTokenService.java:137) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck.startBackgroundCheck(AbstractServiceWithBackgroundCheck.java:199) [org.apache.sling.discovery.commons:
> 1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.syncToken(SyncTokenService.java:131) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.sync(SyncTokenService.java:124) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.chainedSync(ClusterSyncServiceChain.java:65) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.access$000(ClusterSyncServiceChain.java:34) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain$1.run(ClusterSyncServiceChain.java:69) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck.startBackgroundCheck(AbstractServiceWithBackgroundCheck.java:204) [org.apache.sling.discovery.commons:
> 1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.OakBacklogClusterSyncService.waitWhileBacklog(OakBacklogClusterSyncService.java:142) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.OakBacklogClusterSyncService.sync(OakBacklogClusterSyncService.java:136) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.chainedSync(ClusterSyncServiceChain.java:65) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.sync(ClusterSyncServiceChain.java:54) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.base.ViewStateManagerImpl$1.trigger(ViewStateManagerImpl.java:548) [org.apache.sling.discovery.commons:1.0.20]
>         at org.apache.sling.discovery.commons.providers.base.AsyncEventSender.run(AsyncEventSender.java:118) [org.apache.sling.discovery.commons:1.0.20]
>         at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: javax.jcr.InvalidItemStateException: OakState0001: Unresolved conflicts in /var/discovery/oak/syncTokens
>         at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:238)
> ...
> {noformat}
> * the 2 threads can also be seen in a thread-dump of course:
> {noformat}
> "SyncTokenService" #504 daemon prio=5 os_prio=0 cpu=258.15ms elapsed=47.01s tid=0x000000000f315800 nid=0x449c in Object.wait()  [0x00007f74fd56c000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(java.base@11.0.6/Native Method)
>         - waiting on <no object reference available>
>         at org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck$BackgroundCheckRunnable.run(AbstractServiceWithBackgroundCheck.java:98)
>         - waiting to re-lock in wait() <0x00000007a8516420> (a java.lang.Object)
>         at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)
> "SyncTokenService" #530 daemon prio=5 os_prio=0 cpu=0.78ms elapsed=0.96s tid=0x000000000a638800 nid=0x4870 in Object.wait()  [0x00007f74f0bdf000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(java.base@11.0.6/Native Method)
>         - waiting on <no object reference available>
>         at org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck$BackgroundCheckRunnable.run(AbstractServiceWithBackgroundCheck.java:98)
>         - waiting to re-lock in wait() <0x00000007bfe29640> (a java.lang.Object)
>         at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)