Posted to dev@sling.apache.org by Stefan Egli <eg...@adobe.com> on 2014/02/07 11:38:18 UTC

[discovery] different heartbeats for repository vs connectors

Hi,

During an offline discussion, Felix suggested lowering the heartbeat frequency of the topology connectors. Currently heartbeats are sent every 15 or 30 sec, which might seem a lot, especially as they used to be way too chatty (which is fixed now with SLING-3377).

The main reason for having a high heartbeat frequency is quicker failure detection - but it's obviously a trade-off as it increases load.

I would like to get some opinions on the following proposal:

  *   introduce two different sets of heartbeats, one for repository and one for connectors
  *   the repository ones would remain at the current frequency (suggested default: 30sec interval, 60sec timeout). The idea is that we would want to detect crashes within a cluster rather quickly, more quickly than in the topology in general.
  *   the connectors would get a back-off behavior: initially the values are the same (30sec/60sec), but then they send heartbeats less frequently over time, up to a maximum interval (e.g. 5min). This would have to be controlled by the receiving side, i.e. both sides of the connector have to agree on the same interval and timeout.
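
To make the connector back-off a bit more concrete, here is a rough sketch of how it could work. It is Python purely for illustration and not Sling code. The names (HeartbeatTerms, ConnectorReceiver, simulate) are hypothetical, and so are the doubling step and the rule that the timeout stays at twice the interval; only the 30sec/60sec start and the 5min maximum come from the proposal above.

```python
from dataclasses import dataclass
from typing import List

INITIAL_INTERVAL_S = 30  # from the proposal: start at a 30sec interval
MAX_INTERVAL_S = 300     # from the proposal: back off to a max of e.g. 5min
BACKOFF_FACTOR = 2       # assumption: double the interval at each step
TIMEOUT_FACTOR = 2       # assumption: keep the 30sec/60sec ratio


@dataclass(frozen=True)
class HeartbeatTerms:
    """Interval and timeout that both sides of a connector agree on."""
    interval_s: int
    timeout_s: int


def terms_for(interval_s: int) -> HeartbeatTerms:
    # Derive the timeout from the interval so the two never drift apart.
    return HeartbeatTerms(interval_s, interval_s * TIMEOUT_FACTOR)


class ConnectorReceiver:
    """Receiving side: decides the terms and returns them with each reply."""

    def __init__(self) -> None:
        self.current_terms = terms_for(INITIAL_INTERVAL_S)

    def on_heartbeat(self) -> HeartbeatTerms:
        # Each heartbeat that arrives stretches the interval, capped at the max.
        next_interval = min(
            self.current_terms.interval_s * BACKOFF_FACTOR, MAX_INTERVAL_S
        )
        self.current_terms = terms_for(next_interval)
        return self.current_terms


def simulate(heartbeats: int) -> List[int]:
    """Return the wait (in seconds) the sender uses before each heartbeat."""
    receiver = ConnectorReceiver()
    sender_terms = terms_for(INITIAL_INTERVAL_S)
    waits = []
    for _ in range(heartbeats):
        waits.append(sender_terms.interval_s)
        # The sender adopts whatever the receiver answers with, so both
        # sides always use the same interval and timeout.
        sender_terms = receiver.on_heartbeat()
        assert sender_terms == receiver.current_terms
    return waits


if __name__ == "__main__":
    print(simulate(6))
```

Because the sender simply adopts the terms from each reply, the receiving side stays in control and the two sides cannot disagree on interval or timeout. What should happen after a missed heartbeat (for example, falling back to the initial values) is left open here, as the proposal does not say.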

I've opened a Jira to track this, please comment there:

https://issues.apache.org/jira/browse/SLING-3382

Thanks,
Cheers,
Stefan

Re: [discovery] different heartbeats for repository vs connectors

Posted by Carsten Ziegeler <cz...@apache.org>.

Sounds like a good strategy to me

+1

Carsten


-- 
Carsten Ziegeler
cziegeler@apache.org