Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/05/26 07:28:12 UTC

[jira] [Commented] (KUDU-1469) Consensus can get "stuck" after a leader change when committed index is far behind replicated

    [ https://issues.apache.org/jira/browse/KUDU-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301678#comment-15301678 ] 

Todd Lipcon commented on KUDU-1469:
-----------------------------------

The symptoms were that the follower was spewing the following log line hundreds of times per second:
{code}
I0525 21:01:46.445459 90313 raft_consensus.cc:778] T 6df7158c963a42c5a6e3084f754259e2 P da3b59cc4d62404a9726dee44288ae8c [term 985 FOLLOWER]: Deduplicated request from leader. Original: 86.15648->[86.15649-86.15657]   Dedup: 86.15657->[]
{code}
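
To make that log line concrete: the leader's request is described by a preceding op id plus a batch of messages (86.15648->[86.15649-86.15657]), and the follower drops anything it already has, leaving 86.15657->[]. Roughly the following arithmetic, as a simplified standalone sketch (not the actual Kudu implementation):
{code}
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the real OpId (term.index).
struct OpId { int64_t term; int64_t index; };

// Drop any messages the follower already has; everything at or below the
// follower's last-received index becomes part of the new "preceding" op.
std::pair<OpId, std::vector<OpId>> Deduplicate(const OpId& last_received,
                                               OpId preceding,
                                               const std::vector<OpId>& messages) {
  std::vector<OpId> remaining;
  for (const OpId& m : messages) {
    if (m.index > last_received.index) {
      remaining.push_back(m);
    } else {
      preceding = m;  // already in the local log
    }
  }
  return {preceding, remaining};
}

int main() {
  // Numbers from the log line above: Original: 86.15648->[86.15649-86.15657]
  OpId last_received{86, 15657};
  std::vector<OpId> msgs;
  for (int64_t i = 15649; i <= 15657; ++i) msgs.push_back({86, i});
  auto result = Deduplicate(last_received, {86, 15648}, msgs);
  // Prints "Dedup: 86.15657->[0 msgs]": the whole request was deduplicated.
  std::printf("Dedup: %lld.%lld->[%zu msgs]\n",
              (long long)result.first.term, (long long)result.first.index,
              result.second.size());
}
{code}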

The leader was spewing errors like:
{code}
I0525 23:59:55.158216  9831 raft_consensus.cc:651] T 6df7158c963a42c5a6e3084f754259e2 P 7a000aedd0d34c25b740600345817f2e: Attempting to remove follower a291debfd0214af6a6d59c0a323ab412 from the Raft config. Reason: The logs necessary to catch up peer a291debfd0214af6a6d59c0a323ab412 have been garbage collected. The follower will never be able to catch up (Not found: Failed to read ops 11231..15648: Segment 19 which contained index 11231 has been GCed)
W0525 23:59:55.158288  9831 raft_consensus.cc:655] T 6df7158c963a42c5a6e3084f754259e2 P 7a000aedd0d34c25b740600345817f2e: Unable to remove follower a291debfd0214af6a6d59c0a323ab412: Illegal state: Latest committed op is not from this term: 86.15648
W0525 23:59:55.158305 122562 raft_consensus.cc:655] T 6df7158c963a42c5a6e3084f754259e2 P 7a000aedd0d34c25b740600345817f2e: Unable to remove follower a291debfd0214af6a6d59c0a323ab412: Illegal state: Latest committed op is not from this term: 86.15648
W0525 23:59:55.172535 86588 lock_manager.cc:359] Waited 119184 seconds to obtain row lock on key user3467137587177005948 cur holder: 0x1d88ef320
{code}
However, the follower referenced in those messages is the third replica, which had fallen behind the log retention threshold and had been evicted more than a day earlier.

So, clearly, something was preventing the new leader from making progress replicating operations to the follower.
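
As an aside, the "Latest committed op is not from this term" failure above appears to be the guard that a new leader must first commit an operation in its own term before it can perform a config change such as evicting a replica. A rough sketch of that kind of check (simplified, not the actual Kudu code):
{code}
#include <cstdint>

// Hypothetical, simplified sketch of the guard implied by that error message:
// a new leader refuses config changes (such as evicting a replica) until it
// has committed at least one op in its own term.
struct OpId { int64_t term; int64_t index; };

bool CanChangeConfig(const OpId& latest_committed, int64_t current_term) {
  // Here latest_committed is 86.15648, committed under an earlier leader's
  // term, so the eviction keeps failing until the commit index advances
  // into the current term, which is exactly what is stuck.
  return latest_committed.term == current_term;
}
{code}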

The follower is claiming that the leader's request was fully de-duplicated. Indeed, I dumped the WAL for this tablet on both the leader and the follower, and they matched except for the very last operation:

{code}
ve0136 (term 89 follower) wal:                             ve0128 (leader) wal:
86.15649@5997037470279757832    REPLICATE WRITE_OP
86.15650@5997037470290153475    REPLICATE WRITE_OP
86.15651@5997037470319505421    REPLICATE WRITE_OP
86.15652@5997037470335680516    REPLICATE WRITE_OP
86.15653@5997037470375796741    REPLICATE WRITE_OP
86.15654@5997037470385016832    REPLICATE WRITE_OP
86.15655@5997037474152280064    REPLICATE WRITE_OP
86.15656@5997037474260807680    REPLICATE WRITE_OP
86.15657@5997037474266337280    REPLICATE WRITE_OP
---------- batch being sent is cut off here by size limit --------------
86.15658@5997037474269536256    REPLICATE WRITE_OP
86.15659@5997037474292113408    REPLICATE WRITE_OP
86.15660@5997037474410983424    REPLICATE WRITE_OP
86.15661@5997037474436861952    REPLICATE WRITE_OP
86.15662@5997037474469752832    REPLICATE WRITE_OP
86.15663@5997037474517995520    REPLICATE WRITE_OP
86.15664@5997037474562711558    REPLICATE WRITE_OP
88.15665@5997037522648662016    REPLICATE NO_OP           89.15665@5997037522636103680    REPLICATE NO_OP
{code}

I used tracing.html to grab the RPC call response from the follower for this tablet:
{code}
  - response:
     status: {last_received: {term: 88,
                              index: 15665},
              last_committed_idx: 15648,
              last_received_current_leader: {term: 0,
                                             index: 0}}}
{code}

So we can see that the follower is responding with 'last_received_current_leader' of 0.0. This is probably due to this code:

{code}
    // It's possible that the leader didn't send us any new data -- it might be a completely
    // duplicate request. In that case, we don't need to update LastReceived at all.
    if (!deduped_req.messages.empty()) {
      OpId last_appended = deduped_req.messages.back()->get()->id();
      TRACE(Substitute("Updating last received op as $0", last_appended.ShortDebugString()));
      state_->UpdateLastReceivedOpIdUnlocked(last_appended);
    } else {
      DCHECK_GE(state_->GetLastReceivedOpIdUnlocked().index(),
                deduped_req.preceding_opid->index());
    }
{code}
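
The implication is that when the entire request is deduplicated, the else branch above runs and the follower never records anything as received from the current leader, so last_received_current_leader stays at MinimumOpId (0.0). A standalone sketch of that effect (hypothetical, simplified types, not the actual Kudu state class):
{code}
#include <cstdint>

// Hypothetical, simplified follower-side state illustrating the consequence of
// the branch above: the "last received" watermarks only move when new messages
// are actually appended, so a fully deduplicated request leaves
// last_received_current_leader at MinimumOpId (0.0).
struct OpId { int64_t term; int64_t index; };

struct FollowerState {
  OpId last_received{88, 15665};            // appended under an earlier leader
  OpId last_received_current_leader{0, 0};  // nothing appended from this leader yet

  void OnLeaderRequestProcessed(size_t num_appended, const OpId& last_appended) {
    if (num_appended > 0) {
      last_received = last_appended;
      last_received_current_leader = last_appended;
    }
    // num_appended == 0 (the request was fully deduplicated): neither watermark
    // changes, which matches the response captured via tracing.html above.
  }
};
{code}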

When the leader gets the response from the follower, it does:
{code}
    // If the reported last-received op for the replica is in our local log,
    // then resume sending entries from that point onward. Otherwise, resume
    // after the last op they received from us. If we've never successfully
    // sent them anything, start after the last-committed op in their log, which
    // is guaranteed by the Raft protocol to be a valid op.

    bool peer_has_prefix_of_log = IsOpInLog(status.last_received());
    if (peer_has_prefix_of_log) {
      // If the latest thing in their log is in our log, we are in sync.
      peer->last_received = status.last_received();
      peer->next_index = peer->last_received.index() + 1;

    } else if (!OpIdEquals(status.last_received_current_leader(), MinimumOpId())) {
      // Their log may have diverged from ours, however we are in the process
      // of replicating our ops to them, so continue doing so. Eventually, we
      // will cause the divergent entry in their log to be overwritten.
      peer->last_received = status.last_received_current_leader();
      peer->next_index = peer->last_received.index() + 1;

    } else {
      // The peer is divergent and they have not (successfully) received
      // anything from us yet. Start sending from their last committed index.
      // This logic differs from the Raft spec slightly because instead of
      // stepping back one-by-one from the end until we no longer have an LMP
      // error, we jump back to the last committed op indicated by the peer with
      // the hope that doing so will result in a faster catch-up process.
      DCHECK_GE(peer->last_known_committed_idx, 0);
      peer->next_index = peer->last_known_committed_idx + 1;
    }
{code}

The follower's last_received (88.15665) is not in the leader's log (the leader has 89.15665 at that index), so the first case doesn't apply; and because last_received_current_leader is not set, we fall through to the third case and just keep re-sending the operations following the commit index (which the follower already has and fully deduplicates), and thus never make progress.
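
Putting the two sides together: the leader keeps resending the same size-limited batch starting right after the committed index, the follower deduplicates all of it, and the response never advances last_received_current_leader. A rough simulation of that loop with the indexes from this incident (simplified, not Kudu code):
{code}
#include <cstdint>
#include <cstdio>

// Simulation of the stuck retry loop with the indexes from this incident.
// All names and constants here are simplified stand-ins, not Kudu code.
int main() {
  const int64_t committed_idx    = 15648;  // follower's last committed index
  const int64_t follower_last    = 15664;  // follower already has up to 86.15664
  const int64_t batch_size_limit = 9;      // the request is cut off at 86.15657

  int64_t next_index = committed_idx + 1;  // third case: start after committed
  for (int round = 0; round < 3; ++round) {
    const int64_t batch_start = next_index;                        // 15649
    const int64_t batch_end = batch_start + batch_size_limit - 1;  // 15657
    const bool follower_appended = batch_end > follower_last;      // always false
    std::printf("round %d: sent %lld..%lld, fully deduplicated: %s\n", round,
                (long long)batch_start, (long long)batch_end,
                follower_appended ? "no" : "yes");
    // Nothing was appended, so the response still carries
    // last_received_current_leader = 0.0 and the leader takes the third
    // branch again, resetting next_index to committed + 1: no progress.
    if (!follower_appended) next_index = committed_idx + 1;
  }
}
{code}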

----

A bit more investigation explained how we got in this situation:
- the leader's WAL got delayed due to IO contention, and a single append batch took a very long time:
{code}
W0524 13:08:10.869431 74285 log.cc:514] Time spent Append to log took a long time: real 14.729s user 0.000s sys 0.036s
{code}
- while the leader was blocked on that WAL append, more operations were submitted to it, and it was able to replicate those to the follower.
- before its own WAL write finished, it lost its leadership. So, despite the leader and the follower having matching logs, it was never able to advance and replicate the new commit index.

This resulted in a state where the committed index had fallen quite far behind the last replicated (and matching) log operation.
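
For concreteness, the arithmetic from the numbers above:
{code}
committed index              : 15648
last matching replicated op  : 15664          (16 ops replicated but not committed)
first size-limited batch     : 15649..15657   (9 ops, cut off by the size limit)
{code}
So catching the commit index up would have required more than one batch, and the loop described above never gets past the first one.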

This isn't covered well by existing integration tests: we don't stress the case where the leader's WAL is substantially slower than the round trip to the followers, combined with operations and a write rate large enough that it takes multiple batches to "catch up" after a leader election.

> Consensus can get "stuck" after a leader change when committed index is far behind replicated
> ---------------------------------------------------------------------------------------------
>
>                 Key: KUDU-1469
>                 URL: https://issues.apache.org/jira/browse/KUDU-1469
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> A YCSB stress test hit a bug whereby a tablet was "stuck" - operations were left in flight for hours/days and no progress could be made, despite a majority of replicas being alive.
> Detailed analysis below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)