Posted to issues@ignite.apache.org by "Alexey Goncharuk (JIRA)" <ji...@apache.org> on 2016/08/01 07:21:20 UTC
[jira] [Created] (IGNITE-3616) Drop failed nodes from topology after a configured timeout
Alexey Goncharuk created IGNITE-3616:
----------------------------------------
Summary: Drop failed nodes from topology after a configured timeout
Key: IGNITE-3616
URL: https://issues.apache.org/jira/browse/IGNITE-3616
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 1.5.0.final
Reporter: Alexey Goncharuk
If an OutOfMemoryError (OOME) or an assertion error occurs on a node, it is not uncommon for partition exchange to get stuck, blocking the whole cluster. We should provide a mechanism to drop non-responsive nodes automatically.
When partition exchange times out, the coordinator should:
- print out IDs/IPs of non-responsive nodes at all times
- introduce a configurable kill timeout for non-responsive nodes (-1 means disabled)
- the timeout should be at least a minute after the first non-responsive node message is printed
- when the timeout expires, we should kill the nodes and automatically collect their thread dumps (on a best-effort basis)
- we should print out a message asking users to provide these thread dumps to us via Jira or the dev list
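
The bookkeeping behind the steps above could be sketched roughly as follows. This is only an illustration under assumptions: the class ExchangeTimeoutTracker and all of its method names are hypothetical and are not part of the Ignite API, node IDs are plain strings, and "at least a minute" is read here as a lower bound on the configured timeout, measured from the first warning printed for a node. Killing the nodes and collecting thread dumps is left to the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not an Ignite class. */
class ExchangeTimeoutTracker {
    /** Timeout value that disables automatic kills, as proposed in the issue. */
    static final long DISABLED = -1L;

    /** Assumed lower bound for the kill timeout: one minute. */
    static final long MIN_KILL_TIMEOUT_MS = 60_000L;

    private final long killTimeoutMs;

    /** Node ID -> time (ms) at which the first warning for that node was printed. */
    private final Map<String, Long> firstWarned = new LinkedHashMap<>();

    ExchangeTimeoutTracker(long killTimeoutMs) {
        if (killTimeoutMs != DISABLED && killTimeoutMs < MIN_KILL_TIMEOUT_MS)
            throw new IllegalArgumentException(
                "Kill timeout must be -1 or at least " + MIN_KILL_TIMEOUT_MS + " ms");
        this.killTimeoutMs = killTimeoutMs;
    }

    /** Remembers when the node was first reported and returns the warning to print. */
    String reportNonResponsive(String nodeId, String addr, long nowMs) {
        firstWarned.putIfAbsent(nodeId, nowMs);
        return "Node is not responding to partition exchange [id=" + nodeId
            + ", addr=" + addr + "]";
    }

    /** Clears the record once the node answers the exchange. */
    void markResponded(String nodeId) {
        firstWarned.remove(nodeId);
    }

    /** Returns IDs of nodes whose kill timeout has expired; always empty when disabled. */
    List<String> nodesToKill(long nowMs) {
        List<String> res = new ArrayList<>();
        if (killTimeoutMs == DISABLED)
            return res;
        for (Map.Entry<String, Long> e : firstWarned.entrySet()) {
            if (nowMs - e.getValue() >= killTimeoutMs)
                res.add(e.getKey());
        }
        return res;
    }
}
```

A coordinator could call reportNonResponsive each time the exchange times out, print the returned message, and periodically pass the current time to nodesToKill to decide which nodes to drop.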