Posted to dev@ignite.apache.org by "Yakov Zhdanov (JIRA)" <ji...@apache.org> on 2015/04/14 22:24:59 UTC
[jira] [Created] (IGNITE-752) Speed up failure detection
Yakov Zhdanov created IGNITE-752:
------------------------------------
Summary: Speed up failure detection
Key: IGNITE-752
URL: https://issues.apache.org/jira/browse/IGNITE-752
Project: Ignite
Issue Type: Bug
Reporter: Yakov Zhdanov
Priority: Critical
Fix For: sprint-4
I think we can (1) make grid configuration significantly easier and (2) speed up failure detection.
Here are the discovery SPI configuration properties that are responsible for failure detection:
# reconnectCount,
# sockTimeout,
# networkTimeout,
# ackTimeout,
# maxAckTimeout,
# heartbeatFrequency,
# maxMissedHeartbeats
The same applies to the communication SPI:
# reconnectCount,
# maxConnTimeout,
# connTimeout
So, we have 10 or even more properties.
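To illustrate how even two of these properties combine into an implicit detection timeout, here is a rough sketch. It assumes a silent node is noticed only after maxMissedHeartbeats consecutive heartbeats, each sent every heartbeatFrequency milliseconds, have been missed. The class and method names are hypothetical, and the numbers are illustrative values, not Ignite defaults.

```java
public class HeartbeatDetectionEstimate {
    /**
     * Rough estimate (in ms) of how long a silent node goes unnoticed
     * when detection relies on missed heartbeats alone.
     * Assumption: failure is declared after maxMissedHeartbeats
     * consecutive heartbeats are missed.
     */
    public static long heartbeatDetectionMillis(long heartbeatFrequency, int maxMissedHeartbeats) {
        if (heartbeatFrequency <= 0 || maxMissedHeartbeats <= 0)
            throw new IllegalArgumentException("Both values must be positive");

        return heartbeatFrequency * maxMissedHeartbeats;
    }

    public static void main(String[] args) {
        // Illustrative values only, not Ignite defaults.
        long estimate = heartbeatDetectionMillis(2000, 3);

        System.out.println("Estimated heartbeat-based detection time: " + estimate + " ms");
    }
}
```

The socket, ack and reconnect settings add further delays on top of this, which is why the effective failure detection timeout is hard to reason about from the individual properties.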
We introduced them to address the half-open socket problem (which is pretty common in cloud environments) and GC pauses that may happen on cluster nodes: we can increase ack timeouts to prevent paused nodes from being kicked out of the topology.
By setting values for these properties I am effectively setting a timeout for failure detection. Why do we need such a large number of parameters instead of having a single one on IgniteConfiguration, e.g. nodeResponseThreshold (or failureDetectionThreshold; can anyone propose a better name)?
All other parameters would be calculated automatically (I think the user could still set some of them for full control over the situation; we need to decide whether this is needed).
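A minimal sketch of what "calculated automatically" might look like, assuming one user-supplied threshold in milliseconds. Everything here is hypothetical: the class, the derive method, the 100 ms lower bound, and the split ratios are illustrative choices, not an existing Ignite API or an agreed design. The only constraints it tries to honor are that the heartbeat window and the total ack retry time both stay within the threshold.

```java
public class FailureDetectionSketch {
    /** Hypothetical holder for settings derived from a single threshold. */
    public static final class Derived {
        public final long heartbeatFrequency;
        public final int maxMissedHeartbeats;
        public final long ackTimeout;
        public final long maxAckTimeout;
        public final int reconnectCount;

        Derived(long heartbeatFrequency, int maxMissedHeartbeats,
                long ackTimeout, long maxAckTimeout, int reconnectCount) {
            this.heartbeatFrequency = heartbeatFrequency;
            this.maxMissedHeartbeats = maxMissedHeartbeats;
            this.ackTimeout = ackTimeout;
            this.maxAckTimeout = maxAckTimeout;
            this.reconnectCount = reconnectCount;
        }
    }

    /**
     * Derives the individual settings from one threshold (ms).
     * The ratios below are arbitrary illustrative assumptions.
     */
    public static Derived derive(long thresholdMillis) {
        // Hypothetical lower bound so that integer division never yields 0.
        if (thresholdMillis < 100)
            throw new IllegalArgumentException("Threshold must be at least 100 ms");

        int maxMissedHeartbeats = 2; // assumed fixed
        int reconnectCount = 3;      // assumed fixed

        // Heartbeat window fits in the threshold:
        // heartbeatFrequency * maxMissedHeartbeats <= thresholdMillis.
        long heartbeatFrequency = thresholdMillis / maxMissedHeartbeats;

        // All ack attempts together fit in the threshold:
        // ackTimeout * reconnectCount <= thresholdMillis.
        long ackTimeout = thresholdMillis / reconnectCount;

        // A single ack wait is never allowed to exceed the threshold itself.
        long maxAckTimeout = thresholdMillis;

        return new Derived(heartbeatFrequency, maxMissedHeartbeats,
            ackTimeout, maxAckTimeout, reconnectCount);
    }

    public static void main(String[] args) {
        Derived d = derive(10_000);

        System.out.println("heartbeatFrequency=" + d.heartbeatFrequency
            + ", maxMissedHeartbeats=" + d.maxMissedHeartbeats
            + ", ackTimeout=" + d.ackTimeout
            + ", maxAckTimeout=" + d.maxAckTimeout
            + ", reconnectCount=" + d.reconnectCount);
    }
}
```

If explicit overrides remain allowed, one option would be to apply derived values only to properties the user has not set, but that is exactly the open question above.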
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)