You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2017/09/14 03:47:37 UTC

[kudu-CR] KUDU-2144. Add metrics for Reactor load

Hello Mike Percy, Michael Ho,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/8064

to review the following change.

Change subject: KUDU-2144. Add metrics for Reactor load
......................................................................

KUDU-2144. Add metrics for Reactor load

This adds two new metrics:

1) reactor_load_percent

This measures the percentage of time that a reactor spends doing active
work (i.e not blocked in epoll_wait()). As this approaches 100%, it
indicates that the server may be reactor-bound (eg due to skew, a
performance bug, or insufficient number of reactor threads)

At first glance, it might seem like this should just be a simple counter
of cycles spent in epoll_wait(), so that metrics systems would calculate
a derived cycles/second rate. However, a simple rate metric would not
well represent the fact that there are multiple reactor threads, and
it's possible (and often the case) that only one of them is highly
loaded due to skew. Additionally, there are some bursty workloads where
the load average over minute granularity is relatively low, but when
viewed on a short time scale the reactor thread approaches 100% load.

So, to capture that, this new metric is a histogram of percentages.
Values are contributed to the histogram in the periodic TimerHandler
which runs every 100ms. So, if there is any 100ms period in which the
reactor is fully loaded, it will contribute a "100%" sample to the
histogram. We can then inspect the percentiles and the raw counts to see
if there are even any short bursts where the reactors are the
bottleneck.

To test this out, I ran rpc-bench and modified the number of server
reactors. As I added more server reactors, I could see that the load
percentage (particularly the 95th percentile) went down as each reactor
was able to spend more time idle.

2) reactor_active_latency_us

This metric takes another approach to measure potential latency issues
caused by reactors by measuring the histogram of time spent invoking
watcher callbacks. If, for example, we see that callback execution
frequently takes 100ms, we can assume that a similar amount of latency
would be contributed to any inbound or outbound RPCs associated with the
same reactor.

To test this out, I simulated KUDU-1944 (OpenSSL lock contention slowing
down socket IO) by adding a usleep(10ms) call in Socket::Recv and
running rpc-bench. I could see that the new metric shot up accordingly.

Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
---
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-bench.cc
M src/kudu/util/metrics.h
4 files changed, 125 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/8064/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8064
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2144. Add metrics for Reactor load

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-2144. Add metrics for Reactor load
......................................................................


KUDU-2144. Add metrics for Reactor load

This adds two new metrics:

1) reactor_load_percent

This measures the percentage of time that a reactor spends doing active
work (i.e not blocked in epoll_wait()). As this approaches 100%, it
indicates that the server may be reactor-bound (eg due to skew, a
performance bug, or insufficient number of reactor threads)

At first glance, it might seem like this should just be a simple counter
of cycles spent in epoll_wait(), so that metrics systems would calculate
a derived cycles/second rate. However, a simple rate metric would not
well represent the fact that there are multiple reactor threads, and
it's possible (and often the case) that only one of them is highly
loaded due to skew. Additionally, there are some bursty workloads where
the load average over minute granularity is relatively low, but when
viewed on a short time scale the reactor thread approaches 100% load.

So, to capture that, this new metric is a histogram of percentages.
Values are contributed to the histogram in the periodic TimerHandler
which runs every 100ms. So, if there is any 100ms period in which the
reactor is fully loaded, it will contribute a "100%" sample to the
histogram. We can then inspect the percentiles and the raw counts to see
if there are even any short bursts where the reactors are the
bottleneck.

To test this out, I ran rpc-bench and modified the number of server
reactors. As I added more server reactors, I could see that the load
percentage (particularly the 95th percentile) went down as each reactor
was able to spend more time idle.

2) reactor_active_latency_us

This metric takes another approach to measure potential latency issues
caused by reactors by measuring the histogram of time spent invoking
watcher callbacks. If, for example, we see that callback execution
frequently takes 100ms, we can assume that a similar amount of latency
would be contributed to any inbound or outbound RPCs associated with the
same reactor.

To test this out, I simulated KUDU-1944 (OpenSSL lock contention slowing
down socket IO) by adding a usleep(10ms) call in Socket::Recv and
running rpc-bench. I could see that the new metric shot up accordingly.

Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
Reviewed-on: http://gerrit.cloudera.org:8080/8064
Tested-by: Kudu Jenkins
Reviewed-by: David Ribeiro Alves <da...@gmail.com>
---
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-bench.cc
M src/kudu/util/metrics.h
4 files changed, 133 insertions(+), 2 deletions(-)

Approvals:
  David Ribeiro Alves: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/8064
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2144. Add metrics for Reactor load

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/8064

to look at the new patch set (#3).

Change subject: KUDU-2144. Add metrics for Reactor load
......................................................................

KUDU-2144. Add metrics for Reactor load

This adds two new metrics:

1) reactor_load_percent

This measures the percentage of time that a reactor spends doing active
work (i.e not blocked in epoll_wait()). As this approaches 100%, it
indicates that the server may be reactor-bound (eg due to skew, a
performance bug, or insufficient number of reactor threads)

At first glance, it might seem like this should just be a simple counter
of cycles spent in epoll_wait(), so that metrics systems would calculate
a derived cycles/second rate. However, a simple rate metric would not
well represent the fact that there are multiple reactor threads, and
it's possible (and often the case) that only one of them is highly
loaded due to skew. Additionally, there are some bursty workloads where
the load average over minute granularity is relatively low, but when
viewed on a short time scale the reactor thread approaches 100% load.

So, to capture that, this new metric is a histogram of percentages.
Values are contributed to the histogram in the periodic TimerHandler
which runs every 100ms. So, if there is any 100ms period in which the
reactor is fully loaded, it will contribute a "100%" sample to the
histogram. We can then inspect the percentiles and the raw counts to see
if there are even any short bursts where the reactors are the
bottleneck.

To test this out, I ran rpc-bench and modified the number of server
reactors. As I added more server reactors, I could see that the load
percentage (particularly the 95th percentile) went down as each reactor
was able to spend more time idle.

2) reactor_active_latency_us

This metric takes another approach to measure potential latency issues
caused by reactors by measuring the histogram of time spent invoking
watcher callbacks. If, for example, we see that callback execution
frequently takes 100ms, we can assume that a similar amount of latency
would be contributed to any inbound or outbound RPCs associated with the
same reactor.

To test this out, I simulated KUDU-1944 (OpenSSL lock contention slowing
down socket IO) by adding a usleep(10ms) call in Socket::Recv and
running rpc-bench. I could see that the new metric shot up accordingly.

Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
---
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-bench.cc
M src/kudu/util/metrics.h
4 files changed, 133 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/8064/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8064
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2144. Add metrics for Reactor load

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-2144. Add metrics for Reactor load
......................................................................


Patch Set 3: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/8064
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-2144. Add metrics for Reactor load

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/8064

to look at the new patch set (#2).

Change subject: KUDU-2144. Add metrics for Reactor load
......................................................................

KUDU-2144. Add metrics for Reactor load

This adds two new metrics:

1) reactor_load_percent

This measures the percentage of time that a reactor spends doing active
work (i.e not blocked in epoll_wait()). As this approaches 100%, it
indicates that the server may be reactor-bound (eg due to skew, a
performance bug, or insufficient number of reactor threads)

At first glance, it might seem like this should just be a simple counter
of cycles spent in epoll_wait(), so that metrics systems would calculate
a derived cycles/second rate. However, a simple rate metric would not
well represent the fact that there are multiple reactor threads, and
it's possible (and often the case) that only one of them is highly
loaded due to skew. Additionally, there are some bursty workloads where
the load average over minute granularity is relatively low, but when
viewed on a short time scale the reactor thread approaches 100% load.

So, to capture that, this new metric is a histogram of percentages.
Values are contributed to the histogram in the periodic TimerHandler
which runs every 100ms. So, if there is any 100ms period in which the
reactor is fully loaded, it will contribute a "100%" sample to the
histogram. We can then inspect the percentiles and the raw counts to see
if there are even any short bursts where the reactors are the
bottleneck.

To test this out, I ran rpc-bench and modified the number of server
reactors. As I added more server reactors, I could see that the load
percentage (particularly the 95th percentile) went down as each reactor
was able to spend more time idle.

2) reactor_active_latency_us

This metric takes another approach to measure potential latency issues
caused by reactors by measuring the histogram of time spent invoking
watcher callbacks. If, for example, we see that callback execution
frequently takes 100ms, we can assume that a similar amount of latency
would be contributed to any inbound or outbound RPCs associated with the
same reactor.

To test this out, I simulated KUDU-1944 (OpenSSL lock contention slowing
down socket IO) by adding a usleep(10ms) call in Socket::Recv and
running rpc-bench. I could see that the new metric shot up accordingly.

Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
---
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-bench.cc
M src/kudu/util/metrics.h
4 files changed, 132 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/8064/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8064
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Mike Percy <mp...@apache.org>