You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2016/08/26 09:28:02 UTC

[kudu-CR] WIP: cleanup/refactoring in consensus

Hello Mike Percy,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/4133

to review the following change.

Change subject: WIP: cleanup/refactoring in consensus
......................................................................

WIP: cleanup/refactoring in consensus

This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.

The key item here is to address a long-standing TODO that the
ConsensusQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark
   - calls NotifyObserversOfMajorityReplOp
   --> submits an async task to the raft threadpool
        - for each observer (best I can tell there is currently always
          only one, but that's a separate issue)
        -- observer->UpdateMajorityReplicated() to grab committed index
           (the observer is always the RaftConsensus instance)
        --- RaftConsensus::UpdateMajorityReplicated
        ---- ReplicaState::UpdateMajorityReplicatedUnlocked
        ----- this checks the condition that the majority-replicated
              watermark is within the leader's term before advancing the
              commit index.
        ----- calls AdvanceCommittedIndexUnlocked
        ------- commits stuff, etc
        ----- sets *committed_index out-param
        -- PeerMessageQueue picks up this out-param and sets the new
           value within the PeerMessageQueue

This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter.

The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. This required a fairly
invasive surgery because the PeerMessageQueue itself doesn't remember
the terms of pending operations, and thus the watermarks became indexes
instead of OpIds.

This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.

WIP because there are a few TODOs yet to iron out, although it seems
like this is at least passing most of the stress tests in
raft-consensus-itest. There does seem to be one more lurking issue,
though, because in a dist-test run, a couple of tests timedout.

Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
---
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
14 files changed, 386 insertions(+), 440 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] WIP: cleanup/refactoring in consensus

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: WIP: cleanup/refactoring in consensus
......................................................................


Patch Set 2:

Build Started http://104.196.14.100/job/kudu-gerrit/3088/

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4133

to look at the new patch set (#6).

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................

Cleanup/refactor tracking of consensus watermarks

This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.

The key item here is to address a long-standing TODO that the
PeerMessageQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark
   - calls NotifyObserversOfMajorityReplOp
   --> submits an async task to the raft threadpool
        - for each observer (best I can tell there is currently always
          only one, but that's a separate issue)
        -- observer->UpdateMajorityReplicated() to grab committed index
           (the observer is always the RaftConsensus instance)
        --- RaftConsensus::UpdateMajorityReplicated
        ---- ReplicaState::UpdateMajorityReplicatedUnlocked
        ----- this checks the condition that the majority-replicated
              watermark is within the leader's term before advancing the
              commit index.
        ----- calls AdvanceCommittedIndexUnlocked
        ------- commits stuff, etc
        ----- sets *committed_index out-param
        -- PeerMessageQueue picks up this out-param and sets the new
           value within the PeerMessageQueue

This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter. Moreover,
it wasn't obvious to see which class was "authoritative" for the
tracking and advancement of watermarks.

The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. The flow is now the
following, with '!' at the beginning of the line to denote where it is
changed from before:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark.
!  - If the majority-replicated watermark is within the current leader
!    term, advances the commit index (stored in QueueState)
!    - notifies observers of the new commit index
!      - calls observer->NotifyCommitIndex() with the new commit index
!        - RaftConsensus::NotifyCommitIndex()
!          - ReplicaState::AdvanceCommittedIndexUnlocked
             - commits stuff, etc.

This required a fairly invasive surgery because the PeerMessageQueue
itself doesn't remember the terms of pending operations, and thus the
watermarks became indexes instead of OpIds.

This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.

I looped raft-consensus-itest.MultiThreadedInsertWithFailovers 800
times and the whole suite 500 times and got no failures.

Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
---
M docs/release_notes.adoc
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
15 files changed, 463 insertions(+), 592 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/6
-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: cleanup/refactoring in consensus

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has uploaded a new patch set (#2).

Change subject: WIP: cleanup/refactoring in consensus
......................................................................

WIP: cleanup/refactoring in consensus

This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.

The key item here is to address a long-standing TODO that the
ConsensusQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark
   - calls NotifyObserversOfMajorityReplOp
   --> submits an async task to the raft threadpool
        - for each observer (best I can tell there is currently always
          only one, but that's a separate issue)
        -- observer->UpdateMajorityReplicated() to grab committed index
           (the observer is always the RaftConsensus instance)
        --- RaftConsensus::UpdateMajorityReplicated
        ---- ReplicaState::UpdateMajorityReplicatedUnlocked
        ----- this checks the condition that the majority-replicated
              watermark is within the leader's term before advancing the
              commit index.
        ----- calls AdvanceCommittedIndexUnlocked
        ------- commits stuff, etc
        ----- sets *committed_index out-param
        -- PeerMessageQueue picks up this out-param and sets the new
           value within the PeerMessageQueue

This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter.

The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. This required a fairly
invasive surgery because the PeerMessageQueue itself doesn't remember
the terms of pending operations, and thus the watermarks became indexes
instead of OpIds.

This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.

WIP because there are a few TODOs yet to iron out, although it seems
like this is at least passing most of the stress tests in
raft-consensus-itest. There does seem to be one more lurking issue,
though, because in a dist-test run, a couple of tests timedout.

Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
---
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
14 files changed, 387 insertions(+), 461 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4133

to look at the new patch set (#5).

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................

Cleanup/refactor tracking of consensus watermarks

This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.

The key item here is to address a long-standing TODO that the
PeerMessageQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark
   - calls NotifyObserversOfMajorityReplOp
   --> submits an async task to the raft threadpool
        - for each observer (best I can tell there is currently always
          only one, but that's a separate issue)
        -- observer->UpdateMajorityReplicated() to grab committed index
           (the observer is always the RaftConsensus instance)
        --- RaftConsensus::UpdateMajorityReplicated
        ---- ReplicaState::UpdateMajorityReplicatedUnlocked
        ----- this checks the condition that the majority-replicated
              watermark is within the leader's term before advancing the
              commit index.
        ----- calls AdvanceCommittedIndexUnlocked
        ------- commits stuff, etc
        ----- sets *committed_index out-param
        -- PeerMessageQueue picks up this out-param and sets the new
           value within the PeerMessageQueue

This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter. Moreover,
it wasn't obvious to see which class was "authoritative" for the
tracking and advancement of watermarks.

The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. This required a fairly
invasive surgery because the PeerMessageQueue itself doesn't remember
the terms of pending operations, and thus the watermarks became indexes
instead of OpIds.

This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.

Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
---
M docs/release_notes.adoc
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
15 files changed, 422 insertions(+), 591 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/5
-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 8: -Verified

Build Started http://104.196.14.100/job/kudu-gerrit/3267/

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Mike Percy, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4133

to look at the new patch set (#8).

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................

Cleanup/refactor tracking of consensus watermarks

This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.

The key item here is to address a long-standing TODO that the
PeerMessageQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark
   - calls NotifyObserversOfMajorityReplOp
   --> submits an async task to the raft threadpool
        - for each observer (best I can tell there is currently always
          only one, but that's a separate issue)
        -- observer->UpdateMajorityReplicated() to grab committed index
           (the observer is always the RaftConsensus instance)
        --- RaftConsensus::UpdateMajorityReplicated
        ---- ReplicaState::UpdateMajorityReplicatedUnlocked
        ----- this checks the condition that the majority-replicated
              watermark is within the leader's term before advancing the
              commit index.
        ----- calls AdvanceCommittedIndexUnlocked
        ------- commits stuff, etc
        ----- sets *committed_index out-param
        -- PeerMessageQueue picks up this out-param and sets the new
           value within the PeerMessageQueue

This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter. Moreover,
it wasn't obvious to see which class was "authoritative" for the
tracking and advancement of watermarks.

The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. The flow is now the
following, with '!' at the beginning of the line to denote where it is
changed from before:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark.
!  - If the majority-replicated watermark is within the current leader
!    term, advances the commit index (stored in QueueState)
!    - notifies observers of the new commit index
!      - calls observer->NotifyCommitIndex() with the new commit index
!        - RaftConsensus::NotifyCommitIndex()
!          - ReplicaState::AdvanceCommittedIndexUnlocked
             - commits stuff, etc.

This required a fairly invasive surgery because the PeerMessageQueue
itself doesn't remember the terms of pending operations, and thus the
watermarks became indexes instead of OpIds.

This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.

I looped raft-consensus-itest.MultiThreadedInsertWithFailovers 800
times and the whole suite 500 times and got no failures[1].

[1] http://dist-test.cloudera.org//job?job_id=todd.1473280738.28972

Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
---
M docs/release_notes.adoc
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
15 files changed, 464 insertions(+), 592 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/8
-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 7: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 7:

Build Started http://104.196.14.100/job/kudu-gerrit/3257/

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 7: Code-Review+1

(1 comment)

lgtm, just a syntax nit

http://gerrit.cloudera.org:8080/#/c/4133/7/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 703:           && queue_state_.majority_replicated_index > queue_state_.committed_index) {
nit: ampersands not consistent (put them at the end of the line)


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 4:

(23 comments)

http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

Line 333:   // The id of the last committed operation in the configuration. This is the
> s/id/index/
Done


Line 334:   // id of the last operation the leader deemed committed from a consensus
> s/id/index/
Done


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_peers-test.cc
File src/kudu/consensus/consensus_peers-test.cc:

Line 142:   message_queue_->SetLeaderMode(0, 0, BuildRaftConfigPBForTests(3));
> Maybe better to use kMinimumTerm and kMinimumOpIdIndex from opid_util.h her
Done


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 202:   queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(2));
> use MinimumOpId.index() of the equivalent constant
switched to using kMinimumOpIdIndex and kMinimumTerm here and elsewhere per Mike's suggestion


Line 242:   queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(2));
> same
Done


Line 305:   queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(3));
> same
Done


Line 311:   ASSERT_EQ(queue_->GetMajorityReplicatedIndexForTests(), 0);
> same here and below
I disagree in this case, since it's obvious what the numeric constant is representing (given the other side of the equality assertion). In the "SetLeaderMode" case it's nice because you don't necessarily remember the order of the two integer arguments.

Also I think it's important to realize that it's 0 as an integer, rather than maybe thinking it's 1 (which is actually the first term and index that any peer claims, right?)


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 112:   queue_state_.current_term = 0;
> use constants here and elsewhere?
actually per my reply elsewhere I think 0's clearer


Line 137:   // TODO: reset first_index_in_term? make it optional, reset here and set on the
hrm, having some trouble with this TODO actually. Trying to clean up the way the queue tracks current_term as leader, and I think some of our tests may be abusing the queue. In particular:

static inline void AppendReplicateMessagesToQueue(
    PeerMessageQueue* queue,
    const scoped_refptr<server::Clock>& clock,
    int first,
    int count,
    int payload_size = 0) {

  for (int i = first; i < first + count; i++) {
    int term = i / 7;
    int index = i;
    CHECK_OK(queue->AppendOperation(make_scoped_refptr_replicate(
        CreateDummyReplicate(term, index, clock->Now(), payload_size).release())));
  }
}

this code is appending ops from different terms to a leader-mode queue, which I don't think would ever happen in real life, right? Any time the term advances, the leader would step down and then have to call SetLeaderMode again for the next term?

Mike/David: If this seems right to you guys, I'll make the change so that the queue doesn't track current term by "snooping" on ops anymore, but instead takes a 'first_opid_in_term' parameter to SetLeaderMode, which will be the opid of the NO_OP.


Line 447:       // TODO: why is this only if successful? shouldn't we still use
added this TODO in this patch but I think it's a pre-existing issue. Any thoughts on this issue? otherwise I"m inclined to leave this as a TODO here


Line 499:   boost::optional<int64_t> updated_commit_index;
> do we really need to introduce a boost dependency here? since this is just 
given we already use boost::optional pretty often elsewhere, I dont think the "introduce a dependency" argument carries that much weight. I like optional because it's impossible to accidentally miss the "special" value (eg accidentally compare vs the -1 value and decide something is above this commit index, or whatever)


Line 636:     // If we're the leader, we are in charge of computing the new
> I think it would make sense to document somewhere that calculating the comm
hrm, why do you say we might not be the leader? We're still inside the lock here, so we know we're in leader mode.


Line 658:       if (queue_state_.majority_replicated_index >= queue_state_.first_index_in_term) {
> maybe rename this var to "first_index_in_current_term"
Done


Line 669:         VLOG_WITH_PREFIX_UNLOCKED(2) << "Commit index advanced to " << *updated_commit_index;
> add the "before" ci
Done


Line 676:         (peer->last_known_committed_idx < queue_state_.committed_index);
> nit: rename the suffix of this var to _index for consistency
Done


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

Line 139:   // of the commit index from the follower code.
> having some trouble parsing this sentence: "updating the queue of the commi
yea this was some note to myself and I don't think it's relevant anymore.


Line 415:   // TODO: doc me
> todo
Done


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus-test.cc
File src/kudu/consensus/raft_consensus-test.cc:

Line 388:   LOG(WARNING) << "============================================================";
> ??
ah, was trying to separate out the gmock-related junk above from the gmock-related junk below while debugging something :) will remove


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1267:   // TODO: interrogate queue rather than state?
> what would be the difference/advantage?
trying to reduce the amount of duplication of responsibility (ideally could actually remove all knowledge of the commit index from 'state', so that 'state' is really just a subscriber to 'hey, start this operation', 'abort these operations', 'commit these operations' and not partially-responsible for this other stuff. In other words 'state' should represent the state machine, not the raft-internal state, and I think it'd be a cleaner design


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus.h
File src/kudu/consensus/raft_consensus.h:

Line 171: 
> nit: extra line
Done


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus_state.cc
File src/kudu/consensus/raft_consensus_state.cc:

Line 575:         LOG(FATAL) << "TODO: remove me once convinced that this doesnt happen in tests";
> add an actual error message that people can parse also add prefix
yea I was just using this during test looping, will just remove it now.


Line 583:     LOG_WITH_PREFIX_UNLOCKED(WARNING) << "given start C_O=" << committed_op
> more info in the log statement or remove
Done


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2543:   req.set_committed_index(0);
> use constant
same response as elsewhere


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: cleanup/refactoring in consensus

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4133

to look at the new patch set (#3).

Change subject: WIP: cleanup/refactoring in consensus
......................................................................

WIP: cleanup/refactoring in consensus

This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.

The key item here is to address a long-standing TODO that the
ConsensusQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark
   - calls NotifyObserversOfMajorityReplOp
   --> submits an async task to the raft threadpool
        - for each observer (best I can tell there is currently always
          only one, but that's a separate issue)
        -- observer->UpdateMajorityReplicated() to grab committed index
           (the observer is always the RaftConsensus instance)
        --- RaftConsensus::UpdateMajorityReplicated
        ---- ReplicaState::UpdateMajorityReplicatedUnlocked
        ----- this checks the condition that the majority-replicated
              watermark is within the leader's term before advancing the
              commit index.
        ----- calls AdvanceCommittedIndexUnlocked
        ------- commits stuff, etc
        ----- sets *committed_index out-param
        -- PeerMessageQueue picks up this out-param and sets the new
           value within the PeerMessageQueue

This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter.

The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. This required a fairly
invasive surgery because the PeerMessageQueue itself doesn't remember
the terms of pending operations, and thus the watermarks became indexes
instead of OpIds.

This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.

WIP because there are a few TODOs yet to iron out

Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
---
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
14 files changed, 386 insertions(+), 440 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/3
-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4133/5//COMMIT_MSG
Commit Message:

Line 45: The new design makes the PeerMessageQueue itself fully responsible for
> Can you please provide the equivalent diagram for the new design, or add it
Done


Line 47: invasive surgery because the PeerMessageQueue itself doesn't remember
> hmm, it is a bug that the peer message queue doesn't "remember" the terms o
nah, they aren't important for the logic of raft (note that the "volatile state on leaders" described in the raft paper is all index-based)


Line 51: This is itself also a simplification -- we previously were very messy
> TBH I would rather have this change split out from the rest of the patch, s
yea it's quite difficult to separate the two, sorry


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 447:   // following:
> would have into the history to remind myself why this is here, but I seem t
OK, I'll leave this TODO here and maybe we can try removing the if() later. It seems to me removing it might be beneficial in a case like:

- local peer has replicated opid 100
- remote peer A has replicated opid 100
- remote peer B has replication opid 10 and is catching up
- then remote peer 'A' goes down

Here we'd start getting 'is_last_exchange_successful == false' for peer 'A'. In that case, the 'all_replicated_watermark', which requires 3 peers, would not be updateable, even once we've replicated peer 'B' up to opid 100. It would get "stuck" at 10.

I added this scenario to the TODO comment.


Line 636:     }
> queue is decoupled from state so isn't it possible that we lost an election
added a comment, hopefully sorts it out


http://gerrit.cloudera.org:8080/#/c/4133/5/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 261:       // TODO: this code seems wrong. When we are leader, we've already bumped
> seems wrong to me too...
planning on reworking this a bit, so that the leader term change is set more explicitly


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 5:

Build Started http://104.196.14.100/job/kudu-gerrit/3183/

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] WIP: cleanup/refactoring in consensus

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: WIP: cleanup/refactoring in consensus
......................................................................


Patch Set 1:

Build Started http://104.196.14.100/job/kudu-gerrit/3087/

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 6:

Build Started http://104.196.14.100/job/kudu-gerrit/3252/

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4133/7/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 267:   // changes, so the refactor isn't trivial.
> seems like premature optimization -- will make the code longer, and this ha
k, fair enough


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/4133/5//COMMIT_MSG
Commit Message:

Line 45: The new design makes the PeerMessageQueue itself fully responsible for
Can you please provide the equivalent diagram for the new design, or add it to the docs / header file comments somewhere?


Line 47: invasive surgery because the PeerMessageQueue itself doesn't remember
hmm, it is a bug that the peer message queue doesn't "remember" the terms of pending ops?


Line 51: This is itself also a simplification -- we previously were very messy
TBH I would rather have this change split out from the rest of the patch, since it's distracting, but maybe that is hard to do.


http://gerrit.cloudera.org:8080/#/c/4133/5/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 261:       // TODO: this code seems wrong. When we are leader, we've already bumped
seems wrong to me too...


Line 455:       // TODO: why is this only if successful? shouldn't we still use
i'm not sure either


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4133

to look at the new patch set (#7).

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................

Cleanup/refactor tracking of consensus watermarks

This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.

The key item here is to address a long-standing TODO that the
PeerMessageQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark
   - calls NotifyObserversOfMajorityReplOp
   --> submits an async task to the raft threadpool
        - for each observer (best I can tell there is currently always
          only one, but that's a separate issue)
        -- observer->UpdateMajorityReplicated() to grab committed index
           (the observer is always the RaftConsensus instance)
        --- RaftConsensus::UpdateMajorityReplicated
        ---- ReplicaState::UpdateMajorityReplicatedUnlocked
        ----- this checks the condition that the majority-replicated
              watermark is within the leader's term before advancing the
              commit index.
        ----- calls AdvanceCommittedIndexUnlocked
        ------- commits stuff, etc
        ----- sets *committed_index out-param
        -- PeerMessageQueue picks up this out-param and sets the new
           value within the PeerMessageQueue

This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter. Moreover,
it wasn't obvious to see which class was "authoritative" for the
tracking and advancement of watermarks.

The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. The flow is now the
following, with '!' at the beginning of the line to denote where it is
changed from before:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark.
!  - If the majority-replicated watermark is within the current leader
!    term, advances the commit index (stored in QueueState)
!    - notifies observers of the new commit index
!      - calls observer->NotifyCommitIndex() with the new commit index
!        - RaftConsensus::NotifyCommitIndex()
!          - ReplicaState::AdvanceCommittedIndexUnlocked
             - commits stuff, etc.

This required a fairly invasive surgery because the PeerMessageQueue
itself doesn't remember the terms of pending operations, and thus the
watermarks became indexes instead of OpIds.

This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.

I looped raft-consensus-itest.MultiThreadedInsertWithFailovers 800
times and the whole suite 500 times and got no failures[1].

[1] http://dist-test.cloudera.org//job?job_id=todd.1473280738.28972

Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
---
M docs/release_notes.adoc
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
15 files changed, 464 insertions(+), 592 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/7
-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4133/7/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 267:   // changes, so the refactor isn't trivial.
> final q: do you think it might be worth it to just look at the last message
seems like premature optimization -- will make the code longer, and this has never shown up as a hot spot


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 4:

Build Started http://104.196.14.100/job/kudu-gerrit/3145/

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 4:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 311:   ASSERT_EQ(queue_->GetMajorityReplicatedIndexForTests(), 0);
> I disagree in this case, since it's obvious what the numeric constant is re
ok


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 137:   // TODO: reset first_index_in_term? make it optional, reset here and set on the
> hrm, having some trouble with this TODO actually. Trying to clean up the wa
sounds reasonable


Line 447:       // TODO: why is this only if successful? shouldn't we still use
> added this TODO in this patch but I think it's a pre-existing issue. Any th
would have into the history to remind myself why this is here, but I seem to recall this being an important thing


Line 499:   boost::optional<int64_t> updated_commit_index;
> given we already use boost::optional pretty often elsewhere, I dont think t
ok


Line 636:     // If we're the leader, we are in charge of computing the new
> hrm, why do you say we might not be the leader? We're still inside the lock
queue is decoupled from state so isn't it possible that we lost an election but the queue doesn't know about it yet (i.e. that the call to change the queue's mode is queued behind this thread competing for the lock?


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1267:   // TODO: interrogate queue rather than state?
> trying to reduce the amount of duplication of responsibility (ideally could
that makes sense in general, though I'm not totally sure what it looks like


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 8:

Build Started http://104.196.14.100/job/kudu-gerrit/3266/

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 4:

(4 comments)

I am still working my way through this patch. There are several TODOs, not sure if you know the answers to those yet. But I will investigate when I pick up again on reviewing the logic changes in this patch.

http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

Line 333:   // The id of the last committed operation in the configuration. This is the
s/id/index/


Line 334:   // id of the last operation the leader deemed committed from a consensus
s/id/index/


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_peers-test.cc
File src/kudu/consensus/consensus_peers-test.cc:

Line 142:   message_queue_->SetLeaderMode(0, 0, BuildRaftConfigPBForTests(3));
Maybe better to use kMinimumTerm and kMinimumOpIdIndex from opid_util.h here and elsewhere


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

Line 415:   // TODO: doc me
todo


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 4:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 202:   queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(2));
use MinimumOpId.index() of the equivalent constant


Line 242:   queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(2));
same


Line 305:   queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(3));
same


Line 311:   ASSERT_EQ(queue_->GetMajorityReplicatedIndexForTests(), 0);
same here and below


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 112:   queue_state_.current_term = 0;
use constants here and elsewhere?


Line 499:   boost::optional<int64_t> updated_commit_index;
do we really need to introduce a boost dependency here? since this is just an int why not just have a NOT_UPDATED = -1 constant?


Line 636:     // If we're the leader, we are in charge of computing the new
I think it would make sense to document somewhere that calculating the commit index here is safe, independently of whether we're leader or not (we might not be by the time we reach here) since the commit index is a property of the shared state machine and not something that is "decided" by the leader.


Line 658:       if (queue_state_.majority_replicated_index >= queue_state_.first_index_in_term) {
maybe rename this var to "first_index_in_current_term"


Line 669:         VLOG_WITH_PREFIX_UNLOCKED(2) << "Commit index advanced to " << *updated_commit_index;
add the "before" ci


Line 676:         (peer->last_known_committed_idx < queue_state_.committed_index);
nit: rename the suffix of this var to _index for consistency


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

Line 139:   // of the commit index from the follower code.
having some trouble parsing this sentence: "updating the queue of the commit index"


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus-test.cc
File src/kudu/consensus/raft_consensus-test.cc:

Line 388:   LOG(WARNING) << "============================================================";
??


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1267:   // TODO: interrogate queue rather than state?
what would be the difference/advantage?


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus.h
File src/kudu/consensus/raft_consensus.h:

Line 171: 
nit: extra line


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus_state.cc
File src/kudu/consensus/raft_consensus_state.cc:

Line 575:         LOG(FATAL) << "TODO: remove me once convinced that this doesnt happen in tests";
add an actual error message that people can parse also add prefix


Line 583:     LOG_WITH_PREFIX_UNLOCKED(WARNING) << "given start C_O=" << committed_op
more info in the log statement or remove


http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2543:   req.set_committed_index(0);
use constant


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Cleanup/refactor tracking of consensus watermarks

This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.

The key item here is to address a long-standing TODO that the
PeerMessageQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark
   - calls NotifyObserversOfMajorityReplOp
   --> submits an async task to the raft threadpool
        - for each observer (best I can tell there is currently always
          only one, but that's a separate issue)
        -- observer->UpdateMajorityReplicated() to grab committed index
           (the observer is always the RaftConsensus instance)
        --- RaftConsensus::UpdateMajorityReplicated
        ---- ReplicaState::UpdateMajorityReplicatedUnlocked
        ----- this checks the condition that the majority-replicated
              watermark is within the leader's term before advancing the
              commit index.
        ----- calls AdvanceCommittedIndexUnlocked
        ------- commits stuff, etc
        ----- sets *committed_index out-param
        -- PeerMessageQueue picks up this out-param and sets the new
           value within the PeerMessageQueue

This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter. Moreover,
it wasn't obvious to see which class was "authoritative" for the
tracking and advancement of watermarks.

The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. The flow is now the
following, with '!' at the beginning of the line to denote where it is
changed from before:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark.
!  - If the majority-replicated watermark is within the current leader
!    term, advances the commit index (stored in QueueState)
!    - notifies observers of the new commit index
!      - calls observer->NotifyCommitIndex() with the new commit index
!        - RaftConsensus::NotifyCommitIndex()
!          - ReplicaState::AdvanceCommittedIndexUnlocked
             - commits stuff, etc.

This required a fairly invasive surgery because the PeerMessageQueue
itself doesn't remember the terms of pending operations, and thus the
watermarks became indexes instead of OpIds.

This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.

I looped raft-consensus-itest.MultiThreadedInsertWithFailovers 800
times and the whole suite 500 times and got no failures[1].

[1] http://dist-test.cloudera.org//job?job_id=todd.1473280738.28972

Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Reviewed-on: http://gerrit.cloudera.org:8080/4133
Reviewed-by: Mike Percy <mp...@apache.org>
Tested-by: Kudu Jenkins
---
M docs/release_notes.adoc
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
15 files changed, 464 insertions(+), 592 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4133/6/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 137:   if (current_term != queue_state_.current_term) {
> are we verifying that the term doesn't go back? should we?
sure, I'll give that a try. I think we are, elsewhere, but maybe not here.


Line 263:       // TODO: this code seems wrong. When we are leader, we've already bumped
> yeah, this seems like it's making things a bit messier (not that they were 
oops, I actually meant to remove this TODO in the current revision.

I agree it would be nice to do the SetLeaderAndAppendNoOp thing, but we're currently abusing SetLeaderMode to handle refreshing the tracked peers on a configuration change, and I didn't want to open the can of worms of trying to add more refactoring on top. So, I just kept this as is. I'll update the TODO, though, to suggest this refactor.


Line 459:     // TODO: The fact that we only consider peers whose last exchange was
> this is technically true, but not sure how important/impactful it is. If A 
yea, agreed that it's not a super impactful optimization, but more about code cleanliness and the fact that our "watermarks" actually move backwards, not just forwards (and thus aren't technically watermarks!)

But, would like to address it as a follow-on.


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4133/5/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 261:       // TODO: this code seems wrong. When we are leader, we've already bumped
> planning on reworking this a bit, so that the leader term change is set mor
ended up making this be a boost::optional. When we bump the term in SetLeaderMode, it gets set to boost::none, and in this block, we set it on the first op we see if it's not already set.

Making it a more explicit argument to SetLeaderMode() would be nice, but that function is also currently overloaded to be used for config changes where the term doesn't change.


Line 455:       // TODO: why is this only if successful? shouldn't we still use
> i'm not sure either
I tried to remove this and then figured out why it's important after some tests started failing 1-2% of the time. Left a lengthy TODO.


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: cleanup/refactoring in consensus

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: WIP: cleanup/refactoring in consensus
......................................................................


Patch Set 3:

the timeout/failures I was seeing before in RaftConsensusITest seemed to be pre-existing. rebased this on top of a couple patches to fix them.

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] WIP: cleanup/refactoring in consensus

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: WIP: cleanup/refactoring in consensus
......................................................................


Patch Set 3:

Build Started http://104.196.14.100/job/kudu-gerrit/3102/

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 8: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4133

to look at the new patch set (#4).

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................

Cleanup/refactor tracking of consensus watermarks

This is a fairly invasive cleanup/refactor to consensus in preparation
for propagating replication watermarks from the leader to the followers.

The key item here is to address a long-standing TODO that the
PeerMessageQueue should itself calculate the commit index, rather than
delegate that job to ReplicaState. The flow used to be something like
this on the leader upon receiving a response from a peer that it has
replicated some new operations:

 - PeerMessageQueue::ResponseFromPeer
   - updates peer last_received
   - calls PeerMessageQueue::AdvanceQueueWatermark
     - updates PeerMessageQueue::majority_replicated to the new majority
       watermark
   - calls NotifyObserversOfMajorityReplOp
   --> submits an async task to the raft threadpool
        - for each observer (best I can tell there is currently always
          only one, but that's a separate issue)
        -- observer->UpdateMajorityReplicated() to grab committed index
           (the observer is always the RaftConsensus instance)
        --- RaftConsensus::UpdateMajorityReplicated
        ---- ReplicaState::UpdateMajorityReplicatedUnlocked
        ----- this checks the condition that the majority-replicated
              watermark is within the leader's term before advancing the
              commit index.
        ----- calls AdvanceCommittedIndexUnlocked
        ------- commits stuff, etc
        ----- sets *committed_index out-param
        -- PeerMessageQueue picks up this out-param and sets the new
           value within the PeerMessageQueue

This was very difficult to follow, given that the "observer" in fact was
passing data back to the "notifier" through an out-parameter. Moreover,
it wasn't obvious to see which class was "authoritative" for the
tracking and advancement of watermarks.

The new design makes the PeerMessageQueue itself fully responsible for
calculating when the commit index can advance. This required a fairly
invasive surgery because the PeerMessageQueue itself doesn't remember
the terms of pending operations, and thus the watermarks became indexes
instead of OpIds.

This is itself also a simplification -- we previously were very messy
about using the word "index" when in fact we had an OpId type. In almost
every case, we only used the index of those OpIds. In the Raft paper
itself, these watermarks are just indexes and not OpIds, so this change
brings us closer to the implementation described in the original design.

Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
---
M docs/release_notes.adoc
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
15 files changed, 400 insertions(+), 569 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4133/6/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 137:   if (current_term != queue_state_.current_term) {
are we verifying that the term doesn't go back? should we?


Line 263:       // TODO: this code seems wrong. When we are leader, we've already bumped
yeah, this seems like it's making things a bit messier (not that they were much more nice and neat before). Would it be possible to create a new SetLeaderMode like method that also took the first round to be appended and that would set both the first index and do the SetLeaderMode stuff? Something like SetLeaderAndAppendNoOp then we would always snoop for both index and term (even f if you don't like snooping, at least we'd be consistent). This may not work great with the replicating peers though (right now we set leader mode, update the peers config and only then append the no op)


Line 459:     // TODO: The fact that we only consider peers whose last exchange was
this is technically true, but not sure how important/impactful it is. If A goes down we wouldn't be able to advance the index anyway, and we'll be able to again only once we've removed A and added a new node. So yeah, once we've stopped talking to a peer we stop moving the all_replicated and maybe we could advance it a little more, but IMO we should address it only if it becomes pathological. (not saying anything about the last_received, which I agree we should make match raft proper. iirc we actually tried to do that and hit some roadblock)


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4133/6/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 459:   for (const PeersMap::value_type& peer : peers_map_) {
> yea, agreed that it's not a super impactful optimization, but more about co
well the all replicated is always supposed to go back when we add a new node, so maybe we can change the name, but its nature seems to be necessary :) agreed on the follow on.


http://gerrit.cloudera.org:8080/#/c/4133/7/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 267:   // changes, so the refactor isn't trivial.
final q: do you think it might be worth it to just look at the last message, like we did before and then reverse iterate only if the term actually changed?


-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Cleanup/refactor tracking of consensus watermarks

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Cleanup/refactor tracking of consensus watermarks
......................................................................


Patch Set 5: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No