Posted to issues@kudu.apache.org by "Mike Percy (JIRA)" <ji...@apache.org> on 2017/10/10 19:18:00 UTC

[jira] [Comment Edited] (KUDU-2183) Leaders changing to an unstable/stuck configuration

    [ https://issues.apache.org/jira/browse/KUDU-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199214#comment-16199214 ] 

Mike Percy edited comment on KUDU-2183 at 10/10/17 7:17 PM:
------------------------------------------------------------

On second thought, I'm not sure how KUDU-2048 helps with this. The fact that we were able to initiate a 2nd config change after committing the first one (the eviction) means that we didn't get stuck trying to commit the initial eviction, which is what KUDU-2048 would remedy.

Edit: Never mind. I had thought it got stuck being unable to commit in step 4, which is also possible.


> Leaders changing to an unstable/stuck configuration
> ---------------------------------------------------
>
>                 Key: KUDU-2183
>                 URL: https://issues.apache.org/jira/browse/KUDU-2183
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0
>            Reporter: Todd Lipcon
>
> We've long had this issue, but we finally figured out the sequence of events that results in a "stuck" Raft configuration after some nodes crash and recover:
> Initial state: A(LEADER), B, C
> 1) B goes down
> 2) A evicts B, commits config change: new config (A, C)
> 3) master sends a config change to A to add a new replica D
> 4) A commits config change to (A, C, D)
> 5) D is busy or actually not running and therefore doesn't accept the tablet copy request
> 6) C goes down
> (at this point the tablet is not operational because C is down and D has no replica yet)
> 7) A creates a pending config change to evict C (new config (A, D))
> (it can't commit the config change because D still has no replica)
> 8) Tablet server "A" reboots so it is no longer the leader.
> At this point, even if we bring every server back up, we end up with:
> A: has a pending config change from step 7 to (A, D), but can't be elected because D has no replica (thus won't vote until we implement KUDU-2169)
> B: comes up and gets its replica deleted because it was evicted
> C: comes up with config (A, C, D) but can't be elected because D has no replica and A has a higher OpId
> D: has no replica
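
To make the end state above concrete, here is a minimal standalone sketch of the quorum arithmetic (hypothetical code, not Kudu's actual implementation; HasMajority and the single-letter peer names are made up for illustration). A candidate needs a strict majority of the config it campaigns under, and in the state described above both A and C can count only their own vote:

// Hypothetical sketch, not Kudu source: models each peer only by whether it
// can grant a vote, and checks whether a candidate can reach a majority.
#include <iostream>
#include <set>
#include <string>

// True if the set of granted votes is a strict majority of the config.
bool HasMajority(const std::set<std::string>& config,
                 const std::set<std::string>& votes) {
  return votes.size() > config.size() / 2;
}

int main() {
  // A campaigns under its pending config (A, D). D has no replica and will
  // not vote (until KUDU-2169 is implemented), so A has only its own vote:
  // 1 of 2 is not a majority.
  std::set<std::string> pending = {"A", "D"};
  std::cout << "A elected under (A, D)? "
            << HasMajority(pending, {"A"}) << std::endl;    // prints 0

  // C campaigns under its committed config (A, C, D). D has no replica and
  // A rejects the request because A's last OpId is higher, so C has only
  // 1 of 3 votes: also not a majority.
  std::set<std::string> committed = {"A", "C", "D"};
  std::cout << "C elected under (A, C, D)? "
            << HasMajority(committed, {"C"}) << std::endl;  // prints 0
  return 0;
}

Per the note in the description, implementing KUDU-2169 would let D grant a vote even though it has no replica yet, which would give A a majority under its pending (A, D) config.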


