Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/08/12 22:10:20 UTC

[jira] [Commented] (KUDU-1365) Leader flapping when one machine has a very slow disk

    [ https://issues.apache.org/jira/browse/KUDU-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419590#comment-15419590 ] 

Todd Lipcon commented on KUDU-1365:
-----------------------------------

I'm looking at another instance of this problem, and it's basically as described above. The bad node is "stuck" for a bit on a slow disk -- even a stall of just a few seconds will cause all of the ConsensusService handler threads to become blocked on the log, assuming there are more tablets than handler threads. This prevents the other tablets from receiving their heartbeats, so they trigger elections.
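As a rough illustration of the blocking (not actual Kudu code -- the handler count and the 3-second stall are made up), each UpdateConsensus handler synchronously appends to the WAL, so one slow fsync pins a handler thread and every other tablet's heartbeat just sits in the service queue:
{code}
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

std::mutex disk_mutex;  // stands in for the single slow disk

// Each UpdateConsensus handler appends to the WAL before responding, so a
// slow fsync holds the handler thread for the whole stall.
void HandleUpdateConsensus(int /*tablet_id*/) {
  std::lock_guard<std::mutex> lock(disk_mutex);           // serialize on the disk
  std::this_thread::sleep_for(std::chrono::seconds(3));   // hypothetical 3s stall
  // ... apply the request and respond; only now is the thread freed.
}

int main() {
  const int kNumHandlerThreads = 2;   // hypothetical: fewer handlers than tablets
  std::vector<std::thread> handlers;
  for (int t = 0; t < kNumHandlerThreads; ++t) {
    handlers.emplace_back([t] { HandleUpdateConsensus(t); });
  }
  // Heartbeats for every other tablet now wait in the service queue; once
  // their election timeouts fire, those tablets call elections even though
  // the leader is perfectly healthy.
  for (auto& h : handlers) h.join();
  return 0;
}
{code}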

Even though those elections aren't won, the other tablets have already bumped their terms. Then, a second later, they reject the next request from the leader with an 'invalid term' error, and an election gets triggered.
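For reference, this is just the standard Raft term rule at work. A hedged sketch with illustrative names (not Kudu's actual classes):
{code}
#include <cstdint>

// Illustrative names only, not Kudu's actual classes.
struct ReplicaState {
  int64_t current_term = 5;
};

enum class Outcome { kAccepted, kInvalidTerm };

// Election timer fired without a heartbeat: bump the term before requesting
// votes. The bump sticks even if the election is lost.
void StartElection(ReplicaState* state) {
  state->current_term++;  // 5 -> 6
  // ... send vote requests ...
}

// Next UpdateConsensus from the (still live) leader, which is still at term 5.
Outcome HandleLeaderRequest(ReplicaState* state, int64_t leader_term) {
  if (leader_term < state->current_term) {
    return Outcome::kInvalidTerm;  // the "invalid term" rejection seen above
  }
  return Outcome::kAccepted;
}
{code}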

This is exceedingly common under heavy write load in a large cluster. I think we should probably implement one of the ideas mentioned above (fast-pathing heartbeats or doing an election pre-vote seem like the most promising options).
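A minimal sketch of what the pre-vote check could look like (illustrative only, not a proposal for the actual implementation; all names are made up). The key property is that a replica which merely missed heartbeats never bumps its term, because peers that have recently heard from a live leader refuse the pre-vote:
{code}
#include <cstdint>
#include <vector>

// All names are illustrative; this is not the proposed Kudu API.
struct PreVoteRequest {
  int64_t proposed_term;    // current_term + 1, not yet adopted by anyone
  int64_t last_log_index;
};

struct PeerView {
  int64_t current_term;
  int64_t last_log_index;
  bool heard_from_leader_recently;
};

// Peer side: grant a pre-vote only if we have not heard from a live leader
// and the candidate's log is at least as up to date as ours.
bool GrantPreVote(const PeerView& peer, const PreVoteRequest& req) {
  if (peer.heard_from_leader_recently) return false;
  if (req.proposed_term <= peer.current_term) return false;
  return req.last_log_index >= peer.last_log_index;
}

// Candidate side: only bump the term and start a real election if a majority
// of the cluster (peers plus ourselves) would grant the vote.
bool ShouldStartRealElection(const std::vector<PeerView>& peers,
                             const PreVoteRequest& req) {
  int granted = 1;  // our own implicit vote
  for (const auto& p : peers) {
    if (GrantPreVote(p, req)) ++granted;
  }
  const int cluster_size = static_cast<int>(peers.size()) + 1;
  return granted > cluster_size / 2;
}
{code}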

> Leader flapping when one machine has a very slow disk
> -----------------------------------------------------
>
>                 Key: KUDU-1365
>                 URL: https://issues.apache.org/jira/browse/KUDU-1365
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.7.0
>            Reporter: Mike Percy
>            Assignee: Mike Percy
>
> There is an issue that [~decster] ran into in production where a machine had a bad disk. Unfortunately, the disk was not failing to write, it was just very slow. This resulted in the UpdateConsensus RPC queue filling up on that machine when it was a follower, which looked like this in the log (from the leader's perspective):
> {code}
> W0229 00:07:14.332468 18148 consensus_peers.cc:316] T 41e637c3c2b34e8db36d138c4d37d032 P 6434970484be4d29855f05e1f6aed1b8 -> Peer 7f331507718d477f96d60eb1bc573baa (st128:18700): Couldn't send request to peer 7f331507718d477f96d60eb1bc573baa for tablet 41e637c3c2b34e8db36d138c4d37d032 Status: Remote error: Service unavailable: UpdateConsensus request on kudu.consensus.ConsensusService dropped due to backpressure. The service queue is full; it has 50 items.. Retrying in the next heartbeat period. Already tried 25 times.
> {code}
> The result is that the follower could not receive heartbeat messages from the leader anymore. The follower (st128) would decide that the leader was dead and start an election. Because it had the same amount of data as the rest of the cluster, it won the election. Then, for reasons we still need to investigate, it would not heartbeat to its own followers. After some timeout, a different node (typically the previous leader) would start an election, get elected, and the flapping process would continue.
> It's possible that the bad node, when leader, was only partially transitioned to leadership, and was blocking on some disk operation before starting to heartbeat. Hopefully we can get logs from the bad node so we can better understand what was happening from its perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)