Posted to issues@ignite.apache.org by "Alexey Goncharuk (JIRA)" <ji...@apache.org> on 2016/08/01 07:21:20 UTC

[jira] [Created] (IGNITE-3616) Drop failed nodes from topology after a configured timeout

Alexey Goncharuk created IGNITE-3616:
----------------------------------------

             Summary: Drop failed nodes from topology after a configured timeout
                 Key: IGNITE-3616
                 URL: https://issues.apache.org/jira/browse/IGNITE-3616
             Project: Ignite
          Issue Type: Improvement
          Components: cache
    Affects Versions: 1.5.0.final
            Reporter: Alexey Goncharuk


If an OutOfMemoryError (OOME) or an assertion failure occurs on a node, partition exchange often gets stuck and blocks the whole cluster. We should provide a mechanism to drop non-responsive nodes automatically.

When partition exchange times out, the coordinator should:
- always print the IDs/IPs of non-responsive nodes
- apply a configurable kill timeout to non-responsive nodes (-1 means
disabled)
- require the timeout to be at least one minute after the first
non-responsive node message is printed
- when the timeout expires, kill the nodes and automatically collect their
thread dumps (on a best-effort basis)
- print a message asking users to provide these thread dumps to us via Jira or the dev list
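
The timeout bookkeeping described above could look roughly like the sketch below. It is an illustration only, not existing Ignite code: the class name NonResponsiveNodeTracker and all of its methods are hypothetical, time is passed in explicitly as milliseconds, and it assumes the timeout is counted per node from the moment that node's first warning is printed (the description could also be read as counting from the first warning overall). Actually killing a node and collecting its thread dump are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed kill-timeout bookkeeping; not an Ignite API. */
final class NonResponsiveNodeTracker {
    /** Timeout value that disables automatic killing, as proposed in the issue. */
    static final long DISABLED = -1L;

    /** Lower bound from the issue: at least one minute after the first warning. */
    static final long MIN_KILL_TIMEOUT_MS = 60_000L;

    private final long killTimeoutMs;

    /** Node ID -> time (ms) at which the first non-responsive warning was printed. */
    private final Map<String, Long> firstWarned = new LinkedHashMap<>();

    NonResponsiveNodeTracker(long killTimeoutMs) {
        if (killTimeoutMs != DISABLED && killTimeoutMs < MIN_KILL_TIMEOUT_MS)
            throw new IllegalArgumentException(
                "Kill timeout must be -1 (disabled) or at least " + MIN_KILL_TIMEOUT_MS + " ms");

        this.killTimeoutMs = killTimeoutMs;
    }

    /**
     * Records that a node did not respond during partition exchange.
     * Nodes are recorded even when killing is disabled, so they can always be printed.
     *
     * @return true if this is the first warning for the node.
     */
    boolean reportNonResponsive(String nodeId, long nowMs) {
        return firstWarned.putIfAbsent(nodeId, nowMs) == null;
    }

    /** Forgets a node that has responded in the meantime. */
    void reportResponded(String nodeId) {
        firstWarned.remove(nodeId);
    }

    /** Returns the IDs of nodes whose kill timeout has expired (empty if disabled). */
    List<String> nodesToKill(long nowMs) {
        List<String> res = new ArrayList<>();

        if (killTimeoutMs == DISABLED)
            return res;

        for (Map.Entry<String, Long> e : firstWarned.entrySet()) {
            if (nowMs - e.getValue() >= killTimeoutMs)
                res.add(e.getKey());
        }

        return res;
    }
}
```

In this sketch the coordinator would call reportNonResponsive each time it prints a warning and poll nodesToKill periodically. Rejecting timeouts shorter than a minute in the constructor is one possible way to enforce the lower bound; clamping the value up to one minute would be another.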


