Posted to reviews@kudu.apache.org by "Will Berkeley (Code Review)" <ge...@cloudera.org> on 2018/06/21 20:54:41 UTC

[kudu-CR] Add a simple metric for cluster skew

Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/10787 )


Change subject: Add a simple metric for cluster skew
......................................................................

Add a simple metric for cluster skew

This adds a very simple 'cluster_skew' metric to the master that reports
on the difference in number of replicas between the most and least
loaded tablet servers. This information was already computable from the
tablets_num_* metrics available on all the tablet servers, but this
centralizes it in one place and handles counting the correct tablet
states, so it's much easier to consume. This simple metric should be
useful for operators trying to set up simple alerting schemes based on
cluster balance.
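
As a rough illustration of what the metric reports, the sketch below computes the skew from a map of tablet server to replica count. The function name `replica_skew`, the server names, and the sample counts are hypothetical and not taken from the patch; the actual metric is computed inside the master.

```python
def replica_skew(replicas_per_tserver):
    """Return the max minus the min replica count across tablet servers.

    An empty cluster is treated as having zero skew (an assumption made
    for this sketch, not something stated in the patch).
    """
    counts = list(replicas_per_tserver.values())
    if not counts:
        return 0
    return max(counts) - min(counts)


# Hypothetical cluster: three tablet servers with made-up replica counts.
counts = {"ts-1": 40, "ts-2": 37, "ts-3": 45}
print(replica_skew(counts))  # 45 - 37 = 8
```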

Why not introduce a more comprehensive set of metrics around balance?
Because eventually rebalancing should be tightly integrated with the
master. This metric is just meant as a useful "canary" for when the
rebalancer ought to be run, until a more sophisticated and automated
procedure can be put in place. At that time there will likely be better
metrics exposed to gauge the balance of the cluster and the behavior of
the rebalancer.

I also wrote a quick script to simulate placing replicas on tablet
servers and measure the resulting distribution of skew. The results of
the simulations show skew is almost certainly 6 or less when replica
distribution is determined solely by the current power of two choices
algorithm with a fixed number of tablet servers. This can provide some
guide to operators looking to set a theshold for concerning skew- a
value of e.g. 10 should be vanishingly unlikely to result except by some
external force like unbalanced re-replication or the addition of a tablet
server, so it should suffice as a threshold.
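
The simulation described above could look roughly like the following. This is a simplified sketch and not the contents of max_skew_estimate.py, which is not shown in this thread; the function names, the per-replica selection rule, and all parameters are assumptions. Each replica of a tablet is placed by sampling two tablet servers that do not yet host that tablet and picking the less loaded one.

```python
import random


def place_tablet(load, num_replicas, rng):
    """Place one tablet's replicas, one power-of-two-choices pick per replica."""
    if num_replicas > len(load):
        raise ValueError("need at least as many tablet servers as replicas")
    chosen = set()
    for _ in range(num_replicas):
        # Only servers that do not already host a replica of this tablet.
        candidates = [ts for ts in range(len(load)) if ts not in chosen]
        # Sample two distinct candidates (or one if only one remains).
        picks = rng.sample(candidates, min(2, len(candidates)))
        best = min(picks, key=lambda ts: load[ts])
        chosen.add(best)
        load[best] += 1


def simulate_skew(num_tservers, num_tablets, num_replicas=3, seed=None):
    """Return max - min replica count after placing all tablets."""
    rng = random.Random(seed)
    load = [0] * num_tservers
    for _ in range(num_tablets):
        place_tablet(load, num_replicas, rng)
    return max(load) - min(load)


def max_skew_over_trials(trials, num_tservers, num_tablets, seed=0):
    """Largest skew seen across several independent simulated clusters."""
    return max(
        simulate_skew(num_tservers, num_tablets, seed=seed + i)
        for i in range(trials)
    )


if __name__ == "__main__":
    # Hypothetical cluster size; compare the result to a chosen alert threshold.
    print(max_skew_over_trials(trials=100, num_tservers=20, num_tablets=500))
```

Running many trials and comparing the largest observed skew against a candidate threshold is one way to sanity-check an alerting value for a given cluster size.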

Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
---
M src/kudu/master/master.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
A src/kudu/scripts/max_skew_estimate.py
4 files changed, 129 insertions(+), 3 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/10787/1
-- 
To view, visit http://gerrit.cloudera.org:8080/10787
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Gerrit-Change-Number: 10787
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>

[kudu-CR] Add a simple metric for cluster skew

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has removed a vote on this change.

Change subject: Add a simple metric for cluster skew
......................................................................


Removed Verified-1 by Kudu Jenkins (120)

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Gerrit-Change-Number: 10787
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] Add a simple metric for cluster skew

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/10787

to look at the new patch set (#2).

Change subject: Add a simple metric for cluster skew
......................................................................

Add a simple metric for cluster skew

This adds a very simple 'cluster_skew' metric to the master that reports
on the difference in number of replicas between the most and least
loaded tablet servers. This information was already computable from the
tablets_num_* metrics available on all the tablet servers, but this
centralizes it in one place and handles counting the correct tablet
states, so it's much easier to consume. This simple metric should be
useful for operators trying to set up simple alerting schemes based on
cluster balance.

Why not introduce a more comprehensive set of metrics around balance?
Because eventually rebalancing should be tightly integrated with the
master. This metric is just meant as a useful "canary" for when the
rebalancer ought to be run, until a more sophisticated and automated
procedure can be put in place. At that time there will likely be better
metrics exposed to gauge the balance of the cluster and the behavior of
the rebalancer.

I also wrote a quick script to simulate placing replicas on tablet
servers and measure the resulting distribution of skew. The results of
the simulations show skew is almost certainly 6 or less when replica
distribution is determined solely by the current power of two choices
algorithm with a fixed number of tablet servers. This can provide some
guide to operators looking to set a theshold for concerning skew- a
value of e.g. 10 should be vanishingly unlikely to result except by some
external force like unbalanced re-replication or the addition of a tablet
server, so it should suffice as a threshold.

Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
---
M src/kudu/master/master.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
A src/kudu/scripts/max_skew_estimate.py
4 files changed, 132 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/10787/2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Gerrit-Change-Number: 10787
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] Add a simple metric for cluster skew

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10787 )

Change subject: Add a simple metric for cluster skew
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10787/1/src/kudu/master/ts_manager.h
File src/kudu/master/ts_manager.h:

http://gerrit.cloudera.org:8080/#/c/10787/1/src/kudu/master/ts_manager.h@26
PS1, Line 26: #include "kudu/util/metrics.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Gerrit-Change-Number: 10787
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Thu, 21 Jun 2018 22:13:46 +0000
Gerrit-HasComments: Yes

[kudu-CR] Add a simple metric for cluster skew

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10787 )

Change subject: Add a simple metric for cluster skew
......................................................................

Add a simple metric for cluster skew

This adds a very simple 'cluster_skew' metric to the master that reports
on the difference in number of replicas between the most and least
loaded tablet servers. This information was already computable from the
tablets_num_* metrics available on all the tablet servers, but this
centralizes it in one place and handles counting the correct tablet
states, so it's much easier to consume. This simple metric should be
useful for operators trying to set up simple alerting schemes based on
cluster balance.

Why not introduce a more comprehensive set of metrics around balance?
Because eventually rebalancing should be tightly integrated with the
master. This metric is just meant as a useful "canary" for when the
rebalancer ought to be run, until a more sophisticated and automated
procedure can be put in place. At that time there will likely be better
metrics exposed to gauge the balance of the cluster and the behavior of
the rebalancer.

I also wrote a quick script to simulate placing replicas on tablet
servers and measure the resulting distribution of skew. The results of
the simulations show skew is almost certainly 6 or less when replica
distribution is determined solely by the current power of two choices
algorithm with a fixed number of tablet servers. This can provide some
guide to operators looking to set a threshold for concerning skew: a
value of e.g. 10 should be vanishingly unlikely to result except by some
external force like unbalanced re-replication or the addition of a
tablet server, so it should suffice as a threshold.

Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Reviewed-on: http://gerrit.cloudera.org:8080/10787
Tested-by: Will Berkeley <wd...@gmail.com>
Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
M src/kudu/master/master.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
A src/kudu/scripts/max_skew_estimate.py
4 files changed, 129 insertions(+), 3 deletions(-)

Approvals:
  Will Berkeley: Verified
  Alexey Serbin: Looks good to me, approved


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Gerrit-Change-Number: 10787
Gerrit-PatchSet: 4
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] Add a simple metric for cluster skew

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/10787

to look at the new patch set (#3).

Change subject: Add a simple metric for cluster skew
......................................................................

Add a simple metric for cluster skew

This adds a very simple 'cluster_skew' metric to the master that reports
on the difference in number of replicas between the most and least
loaded tablet servers. This information was already computable from the
tablets_num_* metrics available on all the tablet servers, but this
centralizes it in one place and handles counting the correct tablet
states, so it's much easier to consume. This simple metric should be
useful for operators trying to set up simple alerting schemes based on
cluster balance.

Why not introduce a more comprehensive set of metrics around balance?
Because eventually rebalancing should be tightly integrated with the
master. This metric is just meant as a useful "canary" for when the
rebalancer ought to be run, until a more sophisticated and automated
procedure can be put in place. At that time there will likely be better
metrics exposed to gauge the balance of the cluster and the behavior of
the rebalancer.

I also wrote a quick script to simulate placing replicas on tablet
servers and measure the resulting distribution of skew. The results of
the simulations show skew is almost certainly 6 or less when replica
distribution is determined solely by the current power of two choices
algorithm with a fixed number of tablet servers. This can provide some
guide to operators looking to set a threshold for concerning skew: a
value of e.g. 10 should be vanishingly unlikely to result except by some
external force like unbalanced re-replication or the addition of a
tablet server, so it should suffice as a threshold.

Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
---
M src/kudu/master/master.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
A src/kudu/scripts/max_skew_estimate.py
4 files changed, 129 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/10787/3

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Gerrit-Change-Number: 10787
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] Add a simple metric for cluster skew

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10787 )

Change subject: Add a simple metric for cluster skew
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/10787/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10787/2//COMMIT_MSG@31
PS2, Line 31: theshold
> threshold
Done


http://gerrit.cloudera.org:8080/#/c/10787/2//COMMIT_MSG@31
PS2, Line 31: -
> nit: add space or replace with a colon?
Done


http://gerrit.cloudera.org:8080/#/c/10787/2//COMMIT_MSG@33
PS2, Line 33: external force like unbalanced re-replication or the addition of a tablet
> nit: it would be nice to have a line with 72 chars or less of length for th
Done


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/master/ts_manager.cc@36
PS2, Line 36: cluster_skew
> Nit: would it make sense to include 'replica' or 'tablet' in the name of th
Done


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/master/ts_manager.cc@145
PS2, Line 145: int&
> nit: why a reference, not just a copy of an integer value?
Done


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/scripts/max_skew_estimate.py
File src/kudu/scripts/max_skew_estimate.py:

http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/scripts/max_skew_estimate.py@20
PS2, Line 20: This
> The
Done


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/scripts/max_skew_estimate.py@20
PS2, Line 20: aximum
> maximum
Done


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/scripts/max_skew_estimate.py@21
PS2, Line 21:  
> nit: extra space
Done


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/scripts/max_skew_estimate.py@22
PS2, Line 22: is 
> drop
Added "which" in front.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Gerrit-Change-Number: 10787
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 26 Jun 2018 18:02:28 +0000
Gerrit-HasComments: Yes

[kudu-CR] Add a simple metric for cluster skew

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/10787 )

Change subject: Add a simple metric for cluster skew
......................................................................


Patch Set 3: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10787/3/src/kudu/scripts/max_skew_estimate.py
File src/kudu/scripts/max_skew_estimate.py:

http://gerrit.cloudera.org:8080/#/c/10787/3/src/kudu/scripts/max_skew_estimate.py@31
PS3, Line 31: xrange
Nit for here and below, but I think it's just a general comment that doesn't need to be addressed in this version of the patch:  it seems xrange was removed in Python 3 (where range does what xrange did in Python 2).  If there are no memory issues with that code, maybe it's worth using range() in the future, since one day it could help with switching to Python 3.
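
A small illustration of the suggestion: a loop written with `range` runs unchanged on both Python 2 and Python 3. On Python 2, `range` builds a full list in memory, which only matters for very large iteration counts. The function `count_iterations` is a hypothetical stand-in and not code from the script.

```python
def count_iterations(n):
    """Count loop iterations using range(), available in Python 2 and 3."""
    total = 0
    for _ in range(n):
        total += 1
    return total


print(count_iterations(1000))  # 1000
```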




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Gerrit-Change-Number: 10787
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Wed, 27 Jun 2018 17:54:34 +0000
Gerrit-HasComments: Yes

[kudu-CR] Add a simple metric for cluster skew

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10787 )

Change subject: Add a simple metric for cluster skew
......................................................................


Patch Set 3: Verified+1

Unrelated Spark test failure.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Gerrit-Change-Number: 10787
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 26 Jun 2018 18:54:35 +0000
Gerrit-HasComments: No

[kudu-CR] Add a simple metric for cluster skew

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/10787 )

Change subject: Add a simple metric for cluster skew
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/10787/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10787/2//COMMIT_MSG@31
PS2, Line 31: theshold
threshold


http://gerrit.cloudera.org:8080/#/c/10787/2//COMMIT_MSG@31
PS2, Line 31: -
nit: add space or replace with a colon?


http://gerrit.cloudera.org:8080/#/c/10787/2//COMMIT_MSG@33
PS2, Line 33: external force like unbalanced re-replication or the addition of a tablet
nit: it would be nice to have a line with 72 chars or less of length for the commit message.


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/master/ts_manager.cc@36
PS2, Line 36: cluster_skew
Nit: would it make sense to include 'replica' or 'tablet' in the name of this metric (e.g. 'tablet_replicas_cluster_skew')?  Or does it already have some implicit namespacing?


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/master/ts_manager.cc@145
PS2, Line 145: int&
nit: why a reference, not just a copy of an integer value?


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/scripts/max_skew_estimate.py
File src/kudu/scripts/max_skew_estimate.py:

http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/scripts/max_skew_estimate.py@20
PS2, Line 20: This
The


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/scripts/max_skew_estimate.py@20
PS2, Line 20: aximum
maximum


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/scripts/max_skew_estimate.py@21
PS2, Line 21:  
nit: extra space


http://gerrit.cloudera.org:8080/#/c/10787/2/src/kudu/scripts/max_skew_estimate.py@22
PS2, Line 22: is 
drop




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I107256de604998cbf9206a8fccb3a43de86f81a8
Gerrit-Change-Number: 10787
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 22 Jun 2018 20:19:31 +0000
Gerrit-HasComments: Yes