
Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2017/10/09 03:18:02 UTC

[jira] [Created] (KUDU-2183) Leaders changing to an unstable/stuck configuration

Todd Lipcon created KUDU-2183:
---------------------------------

             Summary: Leaders changing to an unstable/stuck configuration
                 Key: KUDU-2183
                 URL: https://issues.apache.org/jira/browse/KUDU-2183
             Project: Kudu
          Issue Type: Bug
          Components: consensus
    Affects Versions: 1.4.0, 1.5.0, 1.6.0
            Reporter: Todd Lipcon


We've long had this issue but finally figured out the sequence of events that results in a "stuck" raft configuration after some nodes crash and recover:

Initial state: A(LEADER), B, C

1) B goes down
2) A evicts B, commits config change: new config (A,C)
3) master sends a config change to A to add a new replica D
4) A commits config change to (A,C,D)
5) D is busy or actually not running and therefore doesn't accept the tablet copy request
6) C goes down
(at this point the tablet is not operational because C is down and D has no replica yet)
7) A creates a pending config change to evict C (new config A, D)
(it can't commit the config change because D still has no replica)
8) Tablet server "A" reboots so it is no longer the leader.
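
The commit conditions in steps 2, 4 and 7 can be checked with a small model. This is only a sketch, not Kudu code: it assumes the usual Raft rule that a config change commits once a strict majority of the new config's voters acknowledge it, and the names majority, can_commit and steps are made up for illustration.

```python
def majority(config):
    """Smallest number of voters that forms a strict majority of the config."""
    return len(config) // 2 + 1


def can_commit(config, responsive):
    """True if enough voters in `config` are up AND hold a replica to ack the change."""
    return len(set(config) & set(responsive)) >= majority(config)


# (step label, proposed config, voters that are up and have a replica at that time)
steps = [
    ("2) evict B", ("A", "C"), {"A", "C"}),
    ("4) add D", ("A", "C", "D"), {"A", "C"}),  # D has no replica yet
    ("7) evict C", ("A", "D"), {"A"}),          # C is down, D still has no replica
]

results = {label: can_commit(config, responsive) for label, config, responsive in steps}

for label, ok in results.items():
    print(label, "commits" if ok else "stays pending")
```

Under these assumptions the first two changes commit and the eviction of C stays pending, which matches the notes on steps 6 and 7.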

At this point even if we bring every server back up, we end up with:

A: has a pending config change from step 7 to (A, D), but can't be elected because D has no replica (and thus D won't vote until we implement KUDU-2169)
B: comes up and gets its replica deleted because it was evicted
C: comes up with config (A, C, D) but can't be elected because D has no replica and A has a higher OpId
D: has no replica
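
The final state above can be modelled to confirm that no node is electable. This is a simplified sketch, not Kudu's election code: it ignores terms, assumes a candidate counts votes against its own latest config (pending or committed), assumes a voter grants its vote only if the candidate's last op index is at least its own, and uses made-up op indexes (4 for A, 3 for C) purely to encode "A has a higher OpId". The vote_without_replica flag is a hypothetical stand-in for the behavior the description implies KUDU-2169 would add.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Replica:
    config: Optional[FrozenSet[str]]  # None means the node holds no replica
    last_op_index: int = 0            # hypothetical index standing in for the OpId


def can_win_election(candidate: str, cluster: Dict[str, Replica],
                     vote_without_replica: bool = False) -> bool:
    """True if `candidate` can gather a strict majority of votes from its own config."""
    cand = cluster[candidate]
    if cand.config is None:
        return False  # a node with no replica cannot stand for election
    votes = 0
    for name in cand.config:
        peer = cluster[name]
        if peer.config is None:
            # Replica-less peers abstain unless the hypothetical flag is set.
            if vote_without_replica:
                votes += 1
            continue
        # Grant the vote only if the candidate's log is at least as up to date.
        if cand.last_op_index >= peer.last_op_index:
            votes += 1
    return votes >= len(cand.config) // 2 + 1


cluster = {
    "A": Replica(frozenset({"A", "D"}), last_op_index=4),       # pending config from step 7
    "B": Replica(None),                                         # evicted, replica deleted
    "C": Replica(frozenset({"A", "C", "D"}), last_op_index=3),  # behind A
    "D": Replica(None),                                         # never received a tablet copy
}

electable = {name: can_win_election(name, cluster) for name in cluster}
print(electable)
```

In this model every entry of electable is False: A gets only its own vote out of the two it needs, and C gets only its own vote because A refuses and D abstains. Calling can_win_election("A", cluster, vote_without_replica=True) returns True, which illustrates why letting D vote would unstick the configuration.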




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)