Posted to oak-issues@jackrabbit.apache.org by "Stefan Egli (Jira)" <ji...@apache.org> on 2023/06/08 09:36:00 UTC

[jira] [Created] (OAK-10281) Introduce recoveryDelay to ClusterNodeInfo.isRecoveryNeeded

Stefan Egli created OAK-10281:
---------------------------------

             Summary: Introduce recoveryDelay to ClusterNodeInfo.isRecoveryNeeded
                 Key: OAK-10281
                 URL: https://issues.apache.org/jira/browse/OAK-10281
             Project: Jackrabbit Oak
          Issue Type: Task
          Components: documentmk
            Reporter: Stefan Egli


Oak instances periodically update their leases to signal to peers in the cluster that they are still alive. A lease that has timed out is hence taken as an indication that the corresponding Oak instance has crashed (and not released the lease). It is also assumed that the crashed Oak instance does not perform any further write operations after the lease timeout - otherwise it would still have been alive and would have updated its lease, which it did not.

As already reported elsewhere (e.g. OAK-10254), there is a case where writes do happen later than the lease timeout (aka "late writes"): a writing thread could get past the lease check, then a stop-the-world pause (e.g. a long JVM GC) could halt the thread for more than the lease timeout (e.g. 2min), and upon continuation that writing thread could then send the write operation to the DocumentStore.
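
The late-write window can be illustrated with a small, deterministic sketch. This is not Oak code: the class name, the leaseValid helper and the timestamps are hypothetical and only model the check-then-pause-then-write sequence described above with plain numbers.

```java
// Hypothetical illustration of a "late write"; none of these names are taken from Oak.
final class LateWriteSketch {

    // Lease timeout of 2min, matching the example in the text.
    static final long LEASE_TIMEOUT_MS = 120_000L;

    // A lease is considered valid as long as the current time is before its end.
    static boolean leaseValid(long leaseEndMs, long nowMs) {
        return nowMs < leaseEndMs;
    }

    public static void main(String[] args) {
        long leaseEndMs = LEASE_TIMEOUT_MS; // lease was last renewed at t=0
        long checkTimeMs = 100_000L;        // writer performs the lease check here
        long pauseMs = 150_000L;            // assumed stop-the-world pause after the check

        boolean passedCheck = leaseValid(leaseEndMs, checkTimeMs);
        long writeTimeMs = checkTimeMs + pauseMs;
        boolean validAtWrite = leaseValid(leaseEndMs, writeTimeMs);

        // The write is "late" if the check passed but the lease has expired
        // by the time the write actually reaches the store.
        boolean lateWrite = passedCheck && !validAtWrite;
        System.out.println("passedCheck=" + passedCheck
                + " validAtWrite=" + validAtWrite
                + " lateWrite=" + lateWrite);
    }
}
```

The point of the sketch is that the lease check and the actual write are not atomic, so any pause between them that is longer than the remaining lease time can produce a late write.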

One way to mitigate this late-write risk is to delay the recovery, i.e. to wait for e.g. 10min after a lease failure before doing the LastRevRecovery. That includes putting the state of the clusterNode back into inactive.
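
A minimal sketch of how such a delay could enter the recovery decision is shown below. It assumes that recovery is keyed off the lease end time; the class, the method signature and the parameter names are hypothetical and are not the actual ClusterNodeInfo.isRecoveryNeeded implementation.

```java
// Hypothetical sketch of a delayed recovery check; not the actual Oak implementation.
final class RecoveryDelaySketch {

    // Assumed delay of 10min, matching the example in the text.
    static final long RECOVERY_DELAY_MS = 10L * 60L * 1000L;

    // Recovery is only considered needed once the lease has been expired
    // for longer than recoveryDelayMs. A delay of 0 means recovery is
    // needed as soon as the lease has expired.
    static boolean isRecoveryNeeded(long leaseEndMs, long nowMs, long recoveryDelayMs) {
        return nowMs > leaseEndMs + recoveryDelayMs;
    }

    public static void main(String[] args) {
        long leaseEndMs = 120_000L;

        // 1min after the lease expired: still inside the delay window.
        long shortlyAfterMs = leaseEndMs + 60_000L;
        System.out.println(isRecoveryNeeded(leaseEndMs, shortlyAfterMs, RECOVERY_DELAY_MS));

        // 11min after the lease expired: the delay has passed.
        long muchLaterMs = leaseEndMs + 11L * 60L * 1000L;
        System.out.println(isRecoveryNeeded(leaseEndMs, muchLaterMs, RECOVERY_DELAY_MS));
    }
}
```

With the assumed 10min delay, a writer that resumes after a pause of a few minutes would still send its late write before any peer starts recovery, rather than after recovery has already completed.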

This ticket is about introducing such a recoveryDelay config parameter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)