Posted to dev@ignite.apache.org by Yakov Zhdanov <yz...@apache.org> on 2015/04/14 22:25:58 UTC
Speed up failure detection
Guys,
I think we can (1) make grid configuration significantly easier and (2)
speed up failure detection.
Here are the discovery SPI configuration properties that are responsible
for failure detection:
- reconnectCount,
- sockTimeout,
- networkTimeout,
- ackTimeout,
- maxAckTimeout,
- heartbeatFrequency,
- maxMissedHeartbeats
The same goes for the communication SPI:
- reconnectCount,
- maxConnTimeout,
- connTimeout
That is 10 or even more properties.
We added them to address the half-open sockets problem (which is pretty
common in cloud environments) and GC pauses, which may happen on cluster
nodes - we can increase ack timeouts so that such pauses are tolerated.
By setting values for these properties I am effectively setting a timeout
for failure detection. Why do we need so many parameters instead of having
a single one on IgniteConfiguration - nodeResponseThreshold (or
failureDetectionThreshold - can anyone propose a better name?)?
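To illustrate how several of these knobs add up to one effective timeout, here is a rough sketch. It is a simplified model, not the actual SPI logic: the class and method names are hypothetical, and the assumption that the ack timeout doubles on every attempt up to maxAckTimeout is mine.

```java
/** Rough model of how several knobs combine into one effective detection window. */
final class DetectionWindowSketch {
    private DetectionWindowSketch() {}

    /** Time until a silent node is noticed if it must miss maxMissedHeartbeats heartbeats in a row. */
    static long heartbeatWindowMs(long heartbeatFrequencyMs, int maxMissedHeartbeats) {
        return heartbeatFrequencyMs * maxMissedHeartbeats;
    }

    /**
     * Total wait across send attempts, ASSUMING the ack timeout doubles
     * on each attempt and is capped at maxAckTimeoutMs.
     */
    static long ackWindowMs(long ackTimeoutMs, long maxAckTimeoutMs, int attempts) {
        long total = 0;
        long current = ackTimeoutMs;
        for (int i = 0; i < attempts; i++) {
            total += Math.min(current, maxAckTimeoutMs);
            current = Math.min(current * 2, maxAckTimeoutMs);
        }
        return total;
    }
}
```

With illustrative values (not defaults), heartbeatWindowMs(2000, 3) gives 6000 ms, and ackWindowMs(200, 1000, 4) gives 200 + 400 + 800 + 1000 = 2400 ms. The user has to tune three or four numbers to get what is really one number.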
All other parameters would be calculated automatically (I think the user
could still set some of them for full control over the situation - we need
to decide whether this is needed).
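A minimal sketch of what "calculated automatically" could look like, assuming a single threshold in milliseconds. The class DerivedTimeouts, the method fromThreshold and all the ratios below are hypothetical and only there to show the idea; the real derivation rules are still to be decided in the ticket.

```java
/** Hypothetical holder for timeouts derived from one failure detection threshold. */
final class DerivedTimeouts {
    final long heartbeatFrequencyMs;
    final int maxMissedHeartbeats;
    final long sockTimeoutMs;
    final long ackTimeoutMs;
    final long maxAckTimeoutMs;

    private DerivedTimeouts(long hbFreq, int missed, long sock, long ack, long maxAck) {
        this.heartbeatFrequencyMs = hbFreq;
        this.maxMissedHeartbeats = missed;
        this.sockTimeoutMs = sock;
        this.ackTimeoutMs = ack;
        this.maxAckTimeoutMs = maxAck;
    }

    /** Derives every timeout from the single user-facing threshold (ratios are assumptions). */
    static DerivedTimeouts fromThreshold(long failureDetectionThresholdMs) {
        if (failureDetectionThresholdMs <= 0)
            throw new IllegalArgumentException("Threshold must be positive.");

        int missed = 3; // Assumed: tolerate three missed heartbeats within the threshold.
        long hbFreq = Math.max(1, failureDetectionThresholdMs / missed);
        long sock = Math.max(1, failureDetectionThresholdMs / 2);  // Assumed ratio.
        long ack = Math.max(1, failureDetectionThresholdMs / 4);   // Assumed ratio.
        long maxAck = failureDetectionThresholdMs;                  // Never wait longer than the threshold.

        return new DerivedTimeouts(hbFreq, missed, sock, ack, maxAck);
    }
}
```

For example, DerivedTimeouts.fromThreshold(6000) yields a 2000 ms heartbeat frequency with 3 allowed misses, a 3000 ms socket timeout, a 1500 ms initial ack timeout and a 6000 ms maximum ack timeout. Whether explicit per-property overrides should win over the derived values is the open question mentioned above.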
Ticket filed - https://issues.apache.org/jira/browse/IGNITE-752
Thoughts?
--Yakov
Re: Speed up failure detection
Posted by Ivan Veselovskiy <iv...@gridgain.com>.
It would be great!
- Does sockTimeout affect all the server sockets used by an Ignite node?
(E.g. there are sockets in discovery, in the Hadoop job tracker, in the IGFS
interface, even in the shmem handshake.)
- The G1 collector could potentially help to reduce GC pauses. Is there any
experience with it in Ignite?
--ivan