You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2017/03/21 19:44:49 UTC

[kudu-CR] KUDU-1506 Add consensus lag metrics

Andrew Wong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6451

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................

KUDU-1506 Add consensus lag metrics

It may be useful to see how far behind certain peers are in appending
their operations. As such, this patch adds a metric to do so.

The new metric counts the number of operations behind the maximum
operation seen by the leader each peer is. It is updated every time
the peer queue updates its metrics, i.e. after every append on a
follower and after parsing responses from peers on a leader.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
4 files changed, 49 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 9:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6451/9/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS9, Line 88: "Number of operations this server is behind the leader in appending "
            :                           "to the WAL."
nit: in a comment in a previous ref I had mentioned to fix the "WAL" references there and elsewhere. here's one of the "elsewhere" :)


PS9, Line 869: // Include a metric to indicate how far behind the highest peer's commit index
             :   // this peer is. This is also added to the leader, as the leader may fall behind.
wait, where does the commit index come into play? I would just remove this comment, since UpdateLagMetrics is already commented.


http://gerrit.cloudera.org:8080/#/c/6451/9/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS9, Line 246: // Called by the concensus implementation to update the lag metrics
             :   // before preparing and applying the contents of a request. Also updates the
             :   // queue state's 'last_index_appended_by_leader'.
I don't think that you need to refer to where this is called from. How about:
Updates the last op appended to the leader and the corresponding lag metric, which measures how many ops behind the leader the local replica is (0 if leader).

Then you can remove comments on the call sites unless they have more information than the above


PS9, Line 338: is
nit: comma or period. i.e. "is, as" or "is. This is a soft..."


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#12).

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................

[consensus] Add consensus op-level lag metrics

It may be useful to see how far behind the leaders each follower is
with regard to replicating operations. As such, this patch adds a
metric facilitate this.

The new metric counts the number of operations behind the leader each
peer is. It is updated when a follower receives a request from a
leader, and after the peer message queue appends operations.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
7 files changed, 78 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/12
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1506 Add consensus lag metrics

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................


Patch Set 3:

Is there any way we can get this metric to be time-based instead of op-based? op-based has the downside that it's dependent on the rate of incoming operations?

-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] WIP: KUDU-1506 Add consensus lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#6).

Change subject: WIP: KUDU-1506 Add consensus lag metrics
......................................................................

WIP: KUDU-1506 Add consensus lag metrics

It may be useful to see how far behind the leaders followers are with
regard to replicating operations. As such, this patch adds a metric
facilitate this.

The new metric counts the number of operations behind the maximum
applied operation, as seen by the leader, each peer is. It is updated
when a replica receives a request from a leader, and every time the
peer queue updates its metrics, i.e. after every append on a follower
and after parsing responses from peers on a leader.

This currently still needs to address using time as a lag metric, as
per the JIRA.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
5 files changed, 54 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/6
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 247:   // Note: This estimation must be precise in order to determine the exact number of messages that
This was a little hard to me to grok. Can you clarify in this message that all optional fields in the protobuf must appear in order for the estimate in this test to be accurate?


http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1152:     queue_->UpdateLagMetrics(request->last_idx_appended_to_leader());
I think this will cause the field to be 'required' without a default being specified in the .proto file. That will break rolling upgrades.


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 13: Code-Review+2

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6451/12/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS12, Line 602: queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index()
> Going off of David's comment, it seems truncation will occur when it checks
Great. Thanks for checking.


http://gerrit.cloudera.org:8080/#/c/6451/12/src/kudu/consensus/raft_consensus_quorum-test.cc
File src/kudu/consensus/raft_consensus_quorum-test.cc:

Line 981:   ASSERT_EQ(0, follower->queue_->metrics_.num_ops_behind_leader->value());
> Done and changed in the other test (there are a bunch of places where these
Sounds good. Yes it's a common mistake.


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#9).

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................

[consensus] Add consensus op-level lag metrics

It may be useful to see how far behind the leaders each follower is
with regard to replicating operations. As such, this patch adds a
metric facilitate this.

The new metric counts the number of operations behind the leader each
peer is. It is updated when a follower receives a request from a
leader, and after the peer message queue appends operations.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
7 files changed, 84 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/9
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6451/11/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 156: 
> spurious change
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 8:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6451/9/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS9, Line 88: "Number of operations this server is behind the leader in appending "
            :                           "to the WAL."
> nit: in a comment in a previous ref I had mentioned to fix the "WAL" refere
Done


PS9, Line 869:   queue_state_.last_appended.index() - queue_state_.committed_index);
             : 
> wait, where does the commit index come into play? I would just remove this 
Good catch. Done


http://gerrit.cloudera.org:8080/#/c/6451/9/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS9, Line 246: // Called by the concensus implementation to update the lag metrics
             :   // before preparing and applying the contents of a request. Also updates the
             :   // queue state's 'last_index_appended_by_leader'.
> I don't think that you need to refer to where this is called from. How abou
Fair point. It's not very ubiquitous so it's less necessary to describe the call sites.

I think call sites in test are valuable since the contextualize calling a bit, particularly when the tests don't have a leader/follower.


PS9, Line 338: is
> nit: comma or period. i.e. "is, as" or "is. This is a soft..."
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 12:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6451/12/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS12, Line 602: queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index()
> I don't think this is true, given where this is called in Consensus::Update
Right, seems like truncation happens before updating the lag metric.


PS12, Line 602: queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index()
> Just thought of one last thing. I think it is possible that the leader tell
Going off of David's comment, it seems truncation will occur when it checks the leader request (raft_consensus.cc:1137), which comes before we update the "last..to_leader". And this truncation will change the last_appended.


http://gerrit.cloudera.org:8080/#/c/6451/12/src/kudu/consensus/raft_consensus_quorum-test.cc
File src/kudu/consensus/raft_consensus_quorum-test.cc:

Line 981:   ASSERT_EQ(follower->queue_->metrics_.num_ops_behind_leader->value(), 0);
> nit: the order should be ASSERT_EQ(expected, actual) in order to get a non-
Done and changed in the other test (there are a bunch of places where these are swapped, but won't touch those for this patch).


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#13).

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................

[consensus] Add consensus op-level lag metrics

It may be useful to see how far behind the leaders each follower is
with regard to replicating operations. As such, this patch adds a
metric facilitate this.

The new metric counts the number of operations behind the leader each
peer is. It is updated when a follower receives a request from a
leader, and after the peer message queue appends operations.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
7 files changed, 84 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/13
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: KUDU-1506 Add consensus lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: WIP: KUDU-1506 Add consensus lag metrics
......................................................................


Patch Set 6:

> Is there any way we can get this metric to be time-based instead of
 > op-based? op-based has the downside that it's dependent on the rate
 > of incoming operations?

I know the above mentioned time-based doesn't measure the same thing, but it's lower-hanging and still potentially useful.

To convert this follower lag metric to time, I think we can keep track of when we append each op on the leader side and add to the follower request the time that the leader appended the last op in the request. There're some implementation details (e.g. upper bound on time) missing, but I think that might work.

-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-1506 Add consensus lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#5).

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................

KUDU-1506 Add consensus lag metrics

It may be useful to see how far behind the leaders followers are with
regard to replicating operations. As such, this patch adds a metric
facilitate this.

The new metric counts the number of operations behind the maximum
applied operation, as seen by the leader, each peer is. It is updated
when a replica receives a request from a leader, and every time the
peer queue updates its metrics, i.e. after every append on a follower
and after parsing responses from peers on a leader.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
5 files changed, 52 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 12:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 253:   page_size_estimator.set_committed_index(0);
> The key thing is that in this and only this test, the size of the protobuf 
Thanks for the explanation. Makes sense.


http://gerrit.cloudera.org:8080/#/c/6451/12/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS12, Line 602: queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index()
Just thought of one last thing. I think it is possible that the leader tells us it has truncated the log but we are under memory pressure and rejecting requests. This will cause the follower to be ahead of the leader which will result in a negative lag, which possibly could throw off averages tracked by monitoring systems etc. So we may want to set a floor of 0 for this metric and make this line:

    std::min<uint64_t>(0, queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index())


http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1152:     queue_->UpdateLastIndexAppendedToLeader(request->last_idx_appended_to_leader());
> As discussed there is a _default_ default that should be fine.
Yeah, I forgot there is a default default. :) Sounds good.


http://gerrit.cloudera.org:8080/#/c/6451/12/src/kudu/consensus/raft_consensus_quorum-test.cc
File src/kudu/consensus/raft_consensus_quorum-test.cc:

Line 981:   ASSERT_EQ(follower->queue_->metrics_.num_ops_behind_leader->value(), 0);
nit: the order should be ASSERT_EQ(expected, actual) in order to get a non-confusing error message from gtest if it fails.


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6451/7/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

PS7, Line 359: it is
they are


PS7, Line 360: by
s/by/to


http://gerrit.cloudera.org:8080/#/c/6451/7/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS7, Line 302: the highest index to be added to the leader's WAL.
this is only appended below, right? also might not be highest (rewriting). I don't think it matters in terms of "correctness" I'd just rephrase to (here and elsewhere);
If LEADER, also update the index of the last appended operation.


PS7, Line 607: last_idx_appended_by_leader
this needs to be the max you've taken above, no?


http://gerrit.cloudera.org:8080/#/c/6451/7/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS7, Line 1150: Update the lag metrics while the contents of the request are still fresh from the leader.
How about:
Update lag metrics.
We do it here instead of when appending to the queue below so to be able to update lag metrics even when operations are rejected.


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: KUDU-1506 Add consensus lag metrics

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: KUDU-1506 Add consensus lag metrics
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6451/6/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

PS6, Line 358: The highest index that has been replicated by any peer in the config. Leaders will set this
             :   // to be the highest committed index it has seen as it receives responses from followers.
             :   // On the follower side, this can be used to determine how far behind a given peer is.
Wait, this doesn't seem to be quite true, according to the implementation. When you increase this in PeerMessageQueue::AppendOperations the message might not have been replicated anywhere, right?

My comments on the metric name were to make it consistent with this index, but it seems like it would make more sense to call this index something like: 'max_appended_index'.

Btw 'max' makes sense here since even if we end up appending messages that are overwritten in the log, we still keep the highest index, according to the implementation (and which I think makes sense).


http://gerrit.cloudera.org:8080/#/c/6451/6/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

PS6, Line 251: page_size_estimator.set_max_replicated_index(0);
why is this needed in this particular test? btw could you add checking for the new index to some existing test (likely not worth it having its own test)


http://gerrit.cloudera.org:8080/#/c/6451/6/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS6, Line 86: ops_behind_max_replicated
make it consistent with the proto field, when you change it


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: KUDU-1506 Add consensus lag metrics

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: KUDU-1506 Add consensus lag metrics
......................................................................


Patch Set 6:

We could probably do the time thing by tracking assigned timestamps in addition to indexes.

The problem is that it's largely arbitrary. A replica might be lagging by a little time and have thousands of ops in the queue, or be lagging by a large chunk of time but have a relatively small amount of ops in the queue.

I think the problem we set out to solve here is to give users some insight into whether a replica is lagging, and this cannot be conveyed by a single number, its something that needs to be tracked over time to have meaning. That is a user won't care about or even understand that a replica is lagging by 1000 ops, or that it's lagging by 5 minutes (where 5 mins is the timestamp diff between the last appended op on the leader and the last received op by the replica). It cares whether this number goes down over time (replica is catching up) or whether it goes up over time (replica won't ever catch up).

To this point how accurate we are in defining this number is largely irrelevant as long as we do it consistently.

-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6451/8/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS8, Line 309: pdate the index of the last appended operation. This must also be in the unlocked
             :   // section to avoid any races with LocalPeerAppendFinished(), which updates metrics.
             :   if (queue_state_.mode == LEADER) {
             :     queue_state_.last_idx_appended_to_leader = last_id.index();
             :   }
Actually this doesn't make sense, will move it back.

There is a race, but it's in a test, not here.


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 10:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

Line 343:   // The highest index that is known to be replicated by all members of
Ha, nice catch


http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 253:   page_size_estimator.set_last_idx_appended_to_leader(0);
I'm not sure of the purpose of the changes to this test. Does this mean this field is now required by the server? If so, this patch will break rolling upgrades (which are a nice-to-have when possible). Maybe we should provide a default value of 0 for this field so it doesn't have to be specified.


http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS10, Line 88: is
believes it is


Line 156:   queue_state_.last_idx_appended_to_leader = queue_state_.last_appended.index();
Maybe we should not track this in leader mode and instead do this re-initialization in NonLeaderMode(). It should be safe to assume (for our purposes with this metric) that the leader must be ahead of us in order to be the leader to we can move this line to line 183 or so to imply that our lag between the time we are demoted to follower and the time we get our next message from the leader would be 0.


PS10, Line 301:     // If LEADER, update the index of the last appended operation.
              :     queue_state_.last_idx_appended_to_leader = last_id.index();
Seems strange to me to track this at all as the leader; see below for a suggestion.


PS10, Line 380: last_idx_appended_to_leader
last_appended.index()


Line 603:   queue_state_.last_idx_appended_to_leader = std::max({queue_state_.last_idx_appended_to_leader,
Based on the changes we have made, I don't think we want the max() anymore because we can assume that the leader is always right, and the latest message received from the leader should contain the correct value. The reason max() is a problem is because if there is a leader change and the new leader truncates the log then our metric will be inaccurate until the leader catches up to the index of the previous leader.

I also find it a little bizarre that in this file we call UpdateLagMetrics(queue_state_.last_idx_appended_to_leader) only to again compare that value with queue_state_.last_idx_appended_to_leader

I think it would make more sense to have 2 methods here:

    void PeerMessageQueue::UpdateLagMetrics() {
      // Don't update queue_state_, only set the metric.
    }

    void PeerMessageQueue::UpdateLastIndexAppendedToLeader(int64_t index) {
      queue_state_.last_idx_appended_to_leader = index;
      UpdateLagMetrics();
    }

That way, from within the PeerMessageQueue class we would only call UpdateLagMetrics() and from RaftConsensus (and tests) we would call UpdateLastIndexAppendedToLeader(...).


PS10, Line 605:   metrics_.num_ops_behind_leader->set_value(
              :       queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index());
How about:

    int64_t num_ops_behind_leader = queue_state_.mode == LEADER ? 0 : queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index());
    metrics_.num_ops_behind_leader->set_value(num_ops_behind_leader);


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 11: Code-Review+2

(1 comment)

+2 from my side. just a little nit: you can repost and re- +2 from my side.  Mike might still want to take a look though.

http://gerrit.cloudera.org:8080/#/c/6451/11/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 156: 
spurious change


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 7:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6451/7/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

PS7, Line 359: it is
> they are
Done


PS7, Line 360: by
> s/by/to
Done


http://gerrit.cloudera.org:8080/#/c/6451/7/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS7, Line 302: the highest index to be added to the leader's WAL.
> this is only appended below, right? also might not be highest (rewriting). 
Done


PS7, Line 607: last_idx_appended_by_leader
> this needs to be the max you've taken above, no?
Good catch. Done


http://gerrit.cloudera.org:8080/#/c/6451/7/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

Line 249:   void UpdateLagMetrics(int64_t last_index_appended_by_leader);
> warning: function 'kudu::consensus::PeerMessageQueue::UpdateLagMetrics' has
Done


http://gerrit.cloudera.org:8080/#/c/6451/7/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS7, Line 1150: Update the lag metrics while the contents of the request are still fresh from the leader.
> How about:
Good elaboration. Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#8).

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................

[consensus] Add consensus op-level lag metrics

It may be useful to see how far behind the leaders each follower is
with regard to replicating operations. As such, this patch adds a
metric facilitate this.

The new metric counts the number of operations behind the leader each
peer is. It is updated when a follower receives a request from a
leader, and after the peer message queue appends operations.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
7 files changed, 86 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 6:

(3 comments)

As you guys in the above comments, since this doesn't address time, I've changed the tag to [consensus] rather than the jira number, and will link this as a related patch to KUDU-1506.

http://gerrit.cloudera.org:8080/#/c/6451/6/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

PS6, Line 358: The highest index that has been replicated by any peer in the config. Leaders will set this
             :   // to be the highest committed index it has seen as it receives responses from followers.
             :   // On the follower side, this can be used to determine how far behind a given peer is.
> Wait, this doesn't seem to be quite true, according to the implementation. 
Right, the leader only knows that it appended this op to its own WAL.

Seeing as there's also a "last_received_current_leader" which sounds similar but is not the same, I'm renaming this to "last_idx_appended_by_leader" (one source of ambiguity may be leaving out "by"s, "on"s, "from"s, etc.).


http://gerrit.cloudera.org:8080/#/c/6451/6/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

PS6, Line 251: page_size_estimator.set_max_replicated_index(0);
> why is this needed in this particular test? btw could you add checking for 
The page_size_estimator is used to set the max_batch_size_bytes flag. Without adding this index, we have an underestimate of the request size, which is results in an underestimate in the max batch size, calculated in RequestForPeer (consensus_queue.cc:384). This test expects a certain number of ops in the request and doesn't get it because of this.

Done


http://gerrit.cloudera.org:8080/#/c/6451/6/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS6, Line 86: ops_behind_max_replicated
> make it consistent with the proto field, when you change it
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1506 Add consensus lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#3).

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................

KUDU-1506 Add consensus lag metrics

It may be useful to see how far behind certain peers are in appending
their operations. As such, this patch adds a metric facilitate this.

The new metric counts the number of operations behind the maximum
applied operation, as seen by the leader, each peer is. It is updated
when a replica receives a request from a leader, and every time the
peer queue updates its metrics, i.e. after every append on a follower
and after parsing responses from peers on a leader.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
4 files changed, 55 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has submitted this change and it was merged.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


[consensus] Add consensus op-level lag metrics

It may be useful to see how far behind the leaders each follower is
with regard to replicating operations. As such, this patch adds a
metric facilitate this.

The new metric counts the number of operations behind the leader each
peer is. It is updated when a follower receives a request from a
leader, and after the peer message queue appends operations.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Reviewed-on: http://gerrit.cloudera.org:8080/6451
Reviewed-by: Mike Percy <mp...@apache.org>
Tested-by: Kudu Jenkins
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
7 files changed, 84 insertions(+), 15 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#7).

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................

[consensus] Add consensus op-level lag metrics

It may be useful to see how far behind the leaders each follower is
with regard to replicating operations. As such, this patch adds a
metric facilitate this.

The new metric counts the number of operations behind the leader each
peer is. It is updated when a follower receives a request from a
leader, and after the peer message queue appends operations.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
7 files changed, 82 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1506 Add consensus lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6451/3//COMMIT_MSG
Commit Message:

PS3, Line 9: behind
> "... behind the leader followers are with regard to replicated operations."
Done


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

PS3, Line 359: peer
> peer is
Done


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS3, Line 86: Operations behind max
> ... max replicated
Done


PS3, Line 314: if (queue_state_.max_replicated_index < last_id.index()) {
             :     queue_state_.max_replicated_index = last_id.index();
             :   }
> use std::max(queue_state_.max_replicated_index, last_id.index())
Done


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS3, Line 248:   // Called by the concensus implementation to update the lag metrics
             :   // before applying the contents of a request. Also used throughout consensus
             :   // via UpdateMetrics() as operations are applied.
             :   void UpdateLagMetrics(int64_t max_replicated_index);
> why not add an argument to the method above?
See other comment.


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS3, Line 127: METRIC_DEFINE_gauge_int64(tablet, num_ops_behind_max,
             :                           "Operations Behind Max Index",
             :                           kudu::MetricUnit::kOperations,
             :                           "Number of operations behind the max replicated index received by and "
             :                           "reported by leader.");
> wait why have the same metric twice?
Done


PS3, Line 1047: queue_->UpdateLagMetrics(request->max_replicated_index());
> I think it's fine to just reuse UpdateFollowerWatermarks with the extra arg
I'd wanted to keep it separate since only the lag metric will be updated right after the request is received, whereas all metrics are updated after the transactions are committed i.e. UpdateFollowerWatermarks() gets called later on in this function and this UpdateLagMetrics(). In that sense, UpdateLagMetrics() and UpdateFollowerMetrics() would have slightly different use-cases.

Good point about the leader checking though. An extra locking call for something with such a short critical path should be fine.


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] WIP: KUDU-1506 Add consensus lag metrics

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: WIP: KUDU-1506 Add consensus lag metrics
......................................................................


Patch Set 6:

I agree w/ Andrew and David on the time thing, I think it's useful and much easier to put in an op-based lag metric now and defer to a later patch a time-based metric. David had a great point that when you plot the graph of this number you can see the trend which provides the bulk of the information people are currently lacking.

-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-1506 Add consensus lag metrics

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS3, Line 1047: queue_->UpdateLagMetrics(request->max_replicated_index());
> I'd wanted to keep it separate since only the lag metric will be updated ri
The leader's validity is only checked in LOC 1145, so this would have to be below at least LOC 1151.

That being said I do see the point of updating this even if we can't prepare (if block in 1183). Actually we probably need this particularly bad _when_ we can't prepare, so I'm ok with keeping two separate methods.


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1506 Add consensus lag metrics

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................


Patch Set 3:

(4 comments)

I haven't been following this but I was just curious and thought I'd take a look. I have some high level questions though:

1. I am having trouble deciding whether these metrics are meant to have meaning on nodes other than the leader. I think yes.
2. I am not sure what this lag metric means exactly. It seems like it means means highest index that the leader knows about or highest index the current node has ever seen.

http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

PS3, Line 358: any peer in the config
Just looking at the .proto file, it's hard for me to understand what this means. Let's make this more precise by saying something like "if this node is the leader, this is the highest index that this leader has replicated to at least one other machine." ... I am guessing that is what it means?


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS3, Line 106: max
How about calling this 'latest' instead of 'max'? Here and elsewhere. When I think maximum OpId I think `MakeOpId(INT64_MAX, INT64_MAX)`.


PS3, Line 157: queue_state_.max_replicated_index = committed_index
Hmm, can't we get a more accurate / tighter bound than this?


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS3, Line 338: max_replicated_index
Shouldn't this be bounded by queue_state_.last_appended? Is not, is that a bug?


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#10).

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................

[consensus] Add consensus op-level lag metrics

It may be useful to see how far behind the leaders each follower is
with regard to replicating operations. As such, this patch adds a
metric facilitate this.

The new metric counts the number of operations behind the leader each
peer is. It is updated when a follower receives a request from a
leader, and after the peer message queue appends operations.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
7 files changed, 78 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/10
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 10:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

Line 343:   // The highest index that is known to be replicated by all members of
> Ha, nice catch
:)


http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 247:   // Note: This estimation must be precise in order to determine the exact number of messages that
> This was a little hard to me to grok. Can you clarify in this message that 
Done


Line 253:   page_size_estimator.set_last_idx_appended_to_leader(0);
> I'm not sure of the purpose of the changes to this test. Does this mean thi
The key thing is that in this and only this test, the size of the protobuf must be the expected size of a request, since it uses it to set a flag that would otherwise be set by the user.


http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS10, Line 88: is
> believes it is
Done


Line 156:   queue_state_.last_idx_appended_to_leader = queue_state_.last_appended.index();
> Maybe we should not track this in leader mode and instead do this re-initia
Hmm good observation that one of the reasons we track on LEADER is that if we don't, when we step down as leader, we could end up with a negative metric (and this can be avoided by moving this down to when we start as nonleader).


PS10, Line 301:     // If LEADER, update the index of the last appended operation.
              :     queue_state_.last_idx_appended_to_leader = last_id.index();
> Seems strange to me to track this at all as the leader; see below for a sug
Done


PS10, Line 380: last_idx_appended_to_leader
> last_appended.index()
Done


Line 603:   queue_state_.last_idx_appended_to_leader = std::max({queue_state_.last_idx_appended_to_leader,
> Based on the changes we have made, I don't think we want the max() anymore 
Hmm. So when processing the request, we can UpdateLastIndexAppendedToLeader and just UpdateLagMetrics when there's no new information from the leader. Makes sense.


PS10, Line 605:   metrics_.num_ops_behind_leader->set_value(
              :       queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index());
> How about:
Given the above change, makes sense.


http://gerrit.cloudera.org:8080/#/c/6451/10/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1152:     queue_->UpdateLagMetrics(request->last_idx_appended_to_leader());
> I think this will cause the field to be 'required' without a default being 
As discussed there is a _default_ default that should be fine.


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1506 Add consensus lag metrics

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/6451/3//COMMIT_MSG
Commit Message:

PS3, Line 9: behind
"... behind the leader followers are with regard to replicated operations."


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

PS3, Line 359: peer
peer is


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS3, Line 86: Operations behind max
... max replicated


PS3, Line 86: num_ops_behind_max
num_ops_behind_max_replicated


PS3, Line 314: if (queue_state_.max_replicated_index < last_id.index()) {
             :     queue_state_.max_replicated_index = last_id.index();
             :   }
use std::max(queue_state_.max_replicated_index, last_id.index())


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS3, Line 248:   // Called by the concensus implementation to update the lag metrics
             :   // before applying the contents of a request. Also used throughout consensus
             :   // via UpdateMetrics() as operations are applied.
             :   void UpdateLagMetrics(int64_t max_replicated_index);
why not add an argument to the method above?


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS3, Line 127: METRIC_DEFINE_gauge_int64(tablet, num_ops_behind_max,
             :                           "Operations Behind Max Index",
             :                           kudu::MetricUnit::kOperations,
             :                           "Number of operations behind the max replicated index received by and "
             :                           "reported by leader.");
wait why have the same metric twice?


PS3, Line 1047: queue_->UpdateLagMetrics(request->max_replicated_index());
I think it's fine to just reuse UpdateFollowerWatermarks with the extra argument. This has the extra advantage of only updating this for valid leaders (whereas here an invalid leader could move this index)


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1506 Add consensus lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................


Patch Set 3:

(5 comments)

> 1. I am having trouble deciding whether these metrics are meant to
 > have meaning on nodes other than the leader. I think yes.
Yeah, so the leader will determine the max replicated index that it knows about from all the peers it's tracking and send that index out to its followers via the ConsensusRequestPB. To a follower, this can indicate how far behind in terms of number of ops it is. The follower should see this when it UpdateReplica()s

 > 2. I am not sure what this lag metric means exactly. It seems like
 > it means means highest index that the leader knows about or highest
 > index the current node has ever seen.
It's the highest index of an op committed by a peer that's being tracked by the leader.

 > Is there any way we can get this metric to be time-based instead of
 > op-based? op-based has the downside that it's dependent on the rate
 > of incoming operations?
Have been thinking about this as well, since using op-id was more a means to get familiar with consensus. Moving forward, I think so. If the follower can track the time it receives a replicate request, it just needs to update the metric when it finishes applying.

http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

PS3, Line 358: any peer in the config
> Just looking at the .proto file, it's hard for me to understand what this m
Yep! I think that's an apt description. Also important to note that the request gets sent to the followers, allowing them to track their own op lag.


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS3, Line 106: max
> How about calling this 'latest' instead of 'max'? Here and elsewhere. When 
This has been changed to "num_ops_behind_max_replicated


PS3, Line 157: queue_state_.max_replicated_index = committed_index
> Hmm, can't we get a more accurate / tighter bound than this?
Hmm, good point, think we can make it max(max_replicated_index, committed_index)


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

PS3, Line 338: max_replicated_index
> Shouldn't this be bounded by queue_state_.last_appended? Is not, is that a 
That's the case if this is the leader, but consensus_queue still functions when not in LEADER mode. If another follower has a higher index, max_replicated_index could be higher than this follower's last_appended.


http://gerrit.cloudera.org:8080/#/c/6451/3/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS3, Line 1047: queue_->UpdateLagMetrics(request->max_replicated_index());
> The leader's validity is only checked in LOC 1145, so this would have to be
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#11).

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................

[consensus] Add consensus op-level lag metrics

It may be useful to see how far behind the leaders each follower is
with regard to replicating operations. As such, this patch adds a
metric facilitate this.

The new metric counts the number of operations behind the leader each
peer is. It is updated when a follower receives a request from a
leader, and after the peer message queue appends operations.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
7 files changed, 79 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/11
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1506 Add consensus lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has uploaded a new patch set (#2).

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................

KUDU-1506 Add consensus lag metrics

It may be useful to see how far behind certain peers are in appending
their operations. As such, this patch adds a metric to do so.

The new metric counts the number of operations behind the maximum
operation seen by the leader each peer is. It is updated every time
the peer queue updates its metrics, i.e. after every append on a
follower and after parsing responses from peers on a leader.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
4 files changed, 41 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 10: Code-Review+2

Leaving open in case mike wants to take a look.

-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [consensus] Add consensus op-level lag metrics

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] Add consensus op-level lag metrics
......................................................................


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6451/12/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS12, Line 602: queue_state_.last_idx_appended_to_leader - queue_state_.last_appended.index()
> Just thought of one last thing. I think it is possible that the leader tell
I don't think this is true, given where this is called in Consensus::UpdateReplica (truncation happened before we update this index in the queue)


-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1506 Add consensus lag metrics

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6451

to look at the new patch set (#4).

Change subject: KUDU-1506 Add consensus lag metrics
......................................................................

KUDU-1506 Add consensus lag metrics

It may be useful to see how far behind the leaders followers are with
regard to replicating operations. As such, this patch adds a metric
facilitate this.

The new metric counts the number of operations behind the maximum
applied operation, as seen by the leader, each peer is. It is updated
when a replica receives a request from a leader, and every time the
peer queue updates its metrics, i.e. after every append on a follower
and after parsing responses from peers on a leader.

Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
---
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
5 files changed, 48 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/51/6451/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6451
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ida8e992cc2397ca8d5873e62961a65f618d52c36
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>