Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/12/19 05:32:58 UTC

[jira] [Commented] (KUDU-1170) Queue should reset all_replicated_opid when becoming LEADER

    [ https://issues.apache.org/jira/browse/KUDU-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760208#comment-15760208 ] 

Todd Lipcon commented on KUDU-1170:
-----------------------------------

I believe e21523f5fac41ae272886c89a6a45049173aa94b (passing the all-replicated index from the leader to followers and updating the queue) addressed this. I'm not 100% sure it was that patch, but I grepped for cases where the all-replicated index was reported to be greater than the majority-replicated index and could no longer find any:

{code}
grep 'Queue going to LEADER' kudu-tserver.* \
  | perl -ne 'print if /All replicated index: (\d+).*Majority replicated index: (\d+)/ && $1 > $2' \
  | less -S
{code}
(Note the slightly different pattern: the message now reports indexes rather than op ids.)
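Where perl isn't available, the same check can be done in plain bash. This is a minimal sketch: `find_bad_leader_transitions` is a hypothetical helper, not part of Kudu, and it assumes the log wording matched by the perl regex above. The two captured indexes are compared numerically, as the perl `>` operator does.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Kudu): a pure-bash equivalent of the perl
# filter. Reads log lines on stdin and prints each "Queue going to LEADER"
# line whose all-replicated index is greater than its majority-replicated
# index. Assumes the message wording matched by the perl regex.
find_bad_leader_transitions() {
  local re='All replicated index: ([0-9]+).*Majority replicated index: ([0-9]+)'
  local line
  while IFS= read -r line; do
    # Skip lines that are not leader transitions.
    [[ $line == *'Queue going to LEADER'* ]] || continue
    # Compare numerically in base 10 (a string comparison would rank 9 above 12).
    if [[ $line =~ $re ]] && (( 10#${BASH_REMATCH[1]} > 10#${BASH_REMATCH[2]} )); then
      printf '%s\n' "$line"
    fi
  done
}
```

For example: `cat kudu-tserver.* | find_bad_leader_transitions | less -S`.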

> Queue should reset all_replicated_opid when becoming LEADER
> -----------------------------------------------------------
>
>                 Key: KUDU-1170
>                 URL: https://issues.apache.org/jira/browse/KUDU-1170
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: Private Beta
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> Looking at the logs on a busy server, I see various cases like:
> {code}
> Queue going to LEADER mode. State: All replicated op: 10.6, Majority replicated op: 10.5,
> {code}
> I'm not sure whether it actually causes downstream problems, but it definitely seems counter-intuitive. I think the issue is that in SetLeaderMode we reset majority_replicated_op based on the committed index but don't reset all_replicated. It's possible that the all_replicated watermark from a previous term gets ahead of the committed index when we hit the "cannot advance committed index until we've replicated something in our own term" rule (or something similar), but there may be some other race here.
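The log line in the issue prints op ids in term.index form; reading 10.6 as term 10, index 6 is an assumption based on that line. Such ids can't be compared as decimal numbers, since 10.10 would sort below 10.9. A correct comparison checks the term first and the index only when the terms are equal. The bash sketch below flags the anomaly described in this issue in the older log format; `opid_gt` and `find_bad_leader_transitions_opid` are hypothetical helper names, not Kudu functions.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of Kudu) for the older log format, which
# prints op ids as "term.index", e.g. "10.6".

# opid_gt A B: succeed if op id A is strictly ahead of op id B, comparing the
# term first and the index only when the terms are equal.
opid_gt() {
  local a_term=${1%%.*} a_idx=${1#*.}
  local b_term=${2%%.*} b_idx=${2#*.}
  (( 10#$a_term > 10#$b_term || (10#$a_term == 10#$b_term && 10#$a_idx > 10#$b_idx) ))
}

# Print "Queue going to LEADER" lines whose all-replicated op id is ahead of
# the majority-replicated op id.
find_bad_leader_transitions_opid() {
  local re='All replicated op: ([0-9]+\.[0-9]+).*Majority replicated op: ([0-9]+\.[0-9]+)'
  local line
  while IFS= read -r line; do
    [[ $line == *'Queue going to LEADER'* ]] || continue
    if [[ $line =~ $re ]] && opid_gt "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"; then
      printf '%s\n' "$line"
    fi
  done
}
```

With the log line from the issue, `opid_gt 10.6 10.5` succeeds, so that line would be printed.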



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)