Posted to dev@sling.apache.org by "Stefan Egli (JIRA)" <ji...@apache.org> on 2015/11/09 17:14:10 UTC

[jira] [Created] (SLING-5285) more aggressive self-check for heartbeat timeout

Stefan Egli created SLING-5285:
----------------------------------

             Summary: more aggressive self-check for heartbeat timeout
                 Key: SLING-5285
                 URL: https://issues.apache.org/jira/browse/SLING-5285
             Project: Sling
          Issue Type: Improvement
          Components: Extensions
    Affects Versions: Discovery Impl 1.2.0
            Reporter: Stefan Egli
            Assignee: Stefan Egli
             Fix For: Discovery Impl 1.2.2


SLING-5195 introduced a self-check that monitors whether the HeartbeatHandler is storing heartbeats regularly. This is needed because there are several reasons why that might not be the case, e.g. the HeartbeatHandler could be blocked by another long-running commit happening locally, or by thread-pool exhaustion, or by something else entirely.

The check raised an alarm when the time since the last heartbeat was bigger than the *heartbeatTimeout*. This, however, is not sufficient: the comparison should be much more aggressive. It should compare against *heartbeatTimeout minus 2 times heartbeatInterval* to leave enough safety margin. _2 times_ because 1 time is the bare minimum: this background check only _runs_ every heartbeatInterval, so in the worst case it could run just one _heartbeatInterval_ before the timeout hits and still be too late by a fraction. The _2_ therefore adds a safety margin of only 1 _heartbeatInterval_.
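
The proposed comparison can be sketched roughly as follows. This is only an illustration of the arithmetic, not the actual Discovery Impl code: the class name, method names and the example values (120s timeout, 30s interval) are assumptions made for this sketch, as is the strict "greater than" comparison.

```java
public final class HeartbeatSelfCheckSketch {

    // 1 interval is the bare minimum, the 2nd one is the safety margin.
    static final int MARGIN_INTERVALS = 2;

    // Age of the last heartbeat (in seconds) above which the self-check should alarm.
    static long alarmThresholdSeconds(long heartbeatTimeout, long heartbeatInterval) {
        return heartbeatTimeout - MARGIN_INTERVALS * heartbeatInterval;
    }

    // True if the last heartbeat is older than the (more aggressive) threshold.
    static boolean shouldAlarm(long secondsSinceLastHeartbeat,
                               long heartbeatTimeout, long heartbeatInterval) {
        return secondsSinceLastHeartbeat
                > alarmThresholdSeconds(heartbeatTimeout, heartbeatInterval);
    }

    public static void main(String[] args) {
        long timeout = 120;  // hypothetical example value, in seconds
        long interval = 30;  // hypothetical example value, in seconds
        System.out.println("threshold: " + alarmThresholdSeconds(timeout, interval));
        System.out.println("alarm at 59s: " + shouldAlarm(59, timeout, interval));
        System.out.println("alarm at 61s: " + shouldAlarm(61, timeout, interval));
    }
}
```

With these example values the alarm would already go off once the last heartbeat is more than 60 seconds old, rather than only after the full 120 seconds.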

*Note:* this also means that you should configure the heartbeatTimeout to be at least 4-5 times the heartbeatInterval.
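
To illustrate why the ratio matters, here is a small, hypothetical sanity check for a configuration. The class and method names are invented for this sketch, and the factor 4 is simply the lower end of the range mentioned in the note. It computes how many whole heartbeat intervals may pass without a stored heartbeat before the more aggressive self-check alarms.

```java
public final class HeartbeatConfigSketch {

    // Lower end of the "4-5 times" recommendation from the note.
    static final int MIN_RATIO = 4;

    // True if the timeout is at least MIN_RATIO times the interval.
    static boolean isTimeoutLargeEnough(long heartbeatTimeout, long heartbeatInterval) {
        return heartbeatInterval > 0 && heartbeatTimeout >= MIN_RATIO * heartbeatInterval;
    }

    // Whole intervals that fit below the alarm threshold (timeout minus 2 intervals).
    static long intervalsBeforeAlarm(long heartbeatTimeout, long heartbeatInterval) {
        if (heartbeatInterval <= 0) {
            throw new IllegalArgumentException("heartbeatInterval must be positive");
        }
        return (heartbeatTimeout - 2 * heartbeatInterval) / heartbeatInterval;
    }

    public static void main(String[] args) {
        // Hypothetical example values, in seconds.
        System.out.println(isTimeoutLargeEnough(120, 30));
        System.out.println(intervalsBeforeAlarm(120, 30));
        System.out.println(isTimeoutLargeEnough(60, 30));
        System.out.println(intervalsBeforeAlarm(60, 30));
    }
}
```

With a timeout of only 2 times the interval the threshold drops to zero, so the self-check would alarm practically all the time; at 4 times the interval there is room for 2 intervals before the alarm.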



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)