Posted to dev@ignite.apache.org by Yakov Zhdanov <yz...@apache.org> on 2015/04/14 22:25:58 UTC

Speed up failure detection

Guys,

I think we can (1) make grid configuration significantly easier and (2)
speed up failure detection.

Here are the discovery SPI configuration properties that are responsible
for failure detection:

   - reconnectCount
   - sockTimeout
   - networkTimeout
   - ackTimeout
   - maxAckTimeout
   - heartbeatFrequency
   - maxMissedHeartbeats

The same goes for the communication SPI:

   - reconnectCount
   - maxConnTimeout
   - connTimeout

That is 10 properties, or even more.
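
To illustrate why this is hard to reason about, here is a rough sketch of
how a few of these properties could combine into an effective detection
time. It is a simplified model, not the actual SPI logic: it assumes a
silent node is noticed after heartbeatFrequency * maxMissedHeartbeats, and
that the ack timeout doubles on every retry until it reaches maxAckTimeout.
The class and method names are hypothetical, and the numbers in main() are
illustrative only.

```java
// Hypothetical helper, not part of Ignite: estimates detection-time bounds
// under the simplified assumptions described above.
final class DetectionTimeEstimate {
    /** Assumed heartbeat bound: time until a silent node is noticed, in ms. */
    static long heartbeatBoundMs(long heartbeatFrequencyMs, int maxMissedHeartbeats) {
        return heartbeatFrequencyMs * maxMissedHeartbeats;
    }

    /**
     * Assumed ack bound: total time spent waiting for acks over all attempts,
     * with the timeout doubling per attempt and capped at maxAckTimeoutMs.
     */
    static long ackBoundMs(long ackTimeoutMs, long maxAckTimeoutMs, int reconnectCount) {
        long total = 0;
        long timeout = Math.min(ackTimeoutMs, maxAckTimeoutMs);
        for (int i = 0; i < reconnectCount; i++) {
            total += timeout;
            timeout = Math.min(timeout * 2, maxAckTimeoutMs);
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative values only, not Ignite defaults.
        System.out.println("heartbeat bound, ms: " + heartbeatBoundMs(2000, 3));
        System.out.println("ack bound, ms: " + ackBoundMs(1000, 4000, 4));
    }
}
```

Even in this simplified model the user has to tune five numbers to get one
effective timeout, which is the motivation for the proposal below.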

We did this to address the half-open socket problem (which is pretty common
in cloud environments) and GC pauses that may happen on cluster nodes: we
can increase ack timeouts so that such pauses are not mistaken for failures.

By setting values for these properties I am effectively setting a single
timeout for failure detection. Why do we need so many parameters instead of
having one on IgniteConfiguration, e.g. nodeResponseThreshold (or
failureDetectionThreshold; can anyone propose a better name)?

All other parameters would be calculated automatically. (I think the user
could still set some of them for full control over the situation; we need
to decide whether this is needed.)
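
As a minimal sketch of the idea, the derivation from one threshold could
look roughly like this. Everything here is an assumption for illustration:
the class name, the fixed number of missed heartbeats and connection
attempts, and the even split of the budget are all hypothetical, and the
real calculation would need to be agreed on in the ticket.

```java
// Hypothetical sketch, not an Ignite API: derives individual timeouts
// from one failure detection threshold.
final class DerivedTimeouts {
    // Assumed fixed policy values; a real implementation might choose differently.
    static final int MAX_MISSED_HEARTBEATS = 2;
    static final int ATTEMPTS = 2;

    final long heartbeatFrequencyMs;
    final long sockTimeoutMs;
    final long ackTimeoutMs;

    private DerivedTimeouts(long heartbeatFrequencyMs, long sockTimeoutMs, long ackTimeoutMs) {
        this.heartbeatFrequencyMs = heartbeatFrequencyMs;
        this.sockTimeoutMs = sockTimeoutMs;
        this.ackTimeoutMs = ackTimeoutMs;
    }

    /** Splits the threshold so that neither detection path exceeds it. */
    static DerivedTimeouts fromThreshold(long thresholdMs) {
        if (thresholdMs <= 0)
            throw new IllegalArgumentException("thresholdMs must be positive: " + thresholdMs);

        // Heartbeat path: MAX_MISSED_HEARTBEATS missed beats fit in the threshold.
        long heartbeatFrequency = thresholdMs / MAX_MISSED_HEARTBEATS;

        // Send path: each attempt gets an equal share of the threshold,
        // split evenly between the socket write and the ack wait.
        long perAttempt = thresholdMs / ATTEMPTS;
        long sockTimeout = perAttempt / 2;
        long ackTimeout = perAttempt - sockTimeout;

        return new DerivedTimeouts(heartbeatFrequency, sockTimeout, ackTimeout);
    }

    public static void main(String[] args) {
        DerivedTimeouts t = fromThreshold(10_000);
        System.out.println("heartbeatFrequency, ms: " + t.heartbeatFrequencyMs);
        System.out.println("sockTimeout, ms: " + t.sockTimeoutMs);
        System.out.println("ackTimeout, ms: " + t.ackTimeoutMs);
    }
}
```

With this split, both the heartbeat path and the send path stay within the
configured threshold, so the user only has to reason about one number.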

Ticket filed - https://issues.apache.org/jira/browse/IGNITE-752

Thoughts?

--Yakov

Re: Speed up failure detection

Posted by Ivan Veselovskiy <iv...@gridgain.com>.

It would be great!

- Does sockTimeout affect all the server sockets involved in an Ignite
node? (E.g. there are sockets in discovery, in the Hadoop job tracker, in
the IGFS interface, even in the shmem handshake.)
- To reduce GC pauses, the G1 collector could potentially be helpful. Is
there any experience with it in Ignite?

--ivan
