You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by jtuple <gi...@git.apache.org> on 2018/07/20 20:09:12 UTC

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

GitHub user jtuple opened a pull request:

    https://github.com/apache/zookeeper/pull/580

    ZOOKEEPER-3098: Add additional server metrics

    This patch adds several new server-side metrics as well as makes it easier to add new metrics in the future. This patch also includes a handful of other minor metrics-related changes.
    
    Here's a high-level summary of the changes.
    
    1. This patch extends the request latency tracked in `ServerStats` to
       track `read` and `update` latency separately. Updates are any request
       that must be voted on and can change data, reads are all requests that
       can be handled locally and don't change data.
    
    2. This patch adds the `ServerMetrics` logic and the related `AvgMinMaxCounter`
       and `SimpleCounter` classes. This code is designed to make it incredibly easy to
       add new metrics. To add a new metric you just add one line to `ServerMetrics` and
       then directly reference that new metric anywhere in the code base. The `ServerMetrics`
       logic handles creating the metric, properly adding the metric to the JSON output of
       the `/monitor` admin command, and properly resetting the metric when necessary.
    
       The motivation behind `ServerMetrics` is to make things easy enough that it encourages
       new metrics to be added liberally. Lack of in-depth metrics/visibility is a long-standing
       ZooKeeper weakness. At Facebook, most of our internal changes build on `ServerMetrics` and
       we have nearly 100 internal metrics at this time -- all of which we'll be upstreaming
       in the coming months as we publish more internal patches.
    
    3. This patch adds 20 new metrics, 14 which are handled by `ServerMetrics`.
    
    4. This patch replaces some uses of `synchronized` in `ServerStats` with atomic operations.
    
    Here's a list of new metrics added in this patch:
    
    - `uptime`: time that a peer has been in a stable leading/following/observing state
    - `leader_uptime`: uptime for peer in leading state
    - `global_sessions`: count of global sessions
    - `local_sessions`: count of local sessions
    - `quorum_size`: configured ensemble size
    - `synced_observers`: similar to existing `synced_followers` but for observers
    - `fsynctime`: time to fsync transaction log (avg/min/max)
    - `snapshottime`: time to write a snapshot (avg/min/max)
    - `dbinittime`: time to reload database -- read snapshot + apply transactions (avg/min/max)
    - `readlatency`: read request latency (avg/min/max)
    - `updatelatency`: update request latency (avg/min/max)
    - `propagation_latency`: end-to-end latency for updates, from proposal on leader to committed-to-datatree on a given host (avg/min/max)
    - `follower_sync_time`: time for follower to sync with leader (avg/min/max)
    - `election_time`: time between entering and leaving election (avg/min/max)
    - `looking_count`: number of transitions into looking state
    - `diff_count`: number of diff syncs performed
    - `snap_count`: number of snap syncs performed
    - `commit_count`: number of commits performed on leader
    - `connection_request_count`: number of incoming client connection requests
    - `bytes_received_count`: similar to existing `packets_received` but tracks bytes

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jtuple/zookeeper ZOOKEEPER-3098

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/580.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #580
    
----
commit e6935f8d99eace05d29c2d6659e68e8b90b9a633
Author: Joseph Blomstedt <jd...@...>
Date:   2018-07-19T19:47:15Z

    ZOOKEEPER-3098: Add additional server metrics
    
    This patch adds several new server-side metrics as well as makes it easier
    to add new metrics in the future. This patch also includes a handful of
    other minor metrics-related changes.
    
    Here's a high-level summary of the changes.
    
    1. This patch extends the request latency tracked in `ServerStats` to
       track `read` and `update` latency separately. Updates are any request
       that must be voted on and can change data, reads are all requests that
       can be handled locally and don't change data.
    
    2. This patch adds the `ServerMetrics` logic and the related `AvgMinMaxCounter`
       and `SimpleCounter` classes. This code is designed to make it incredibly easy to
       add new metrics. To add a new metric you just add one line to `ServerMetrics` and
       then directly reference that new metric anywhere in the code base. The `ServerMetrics`
       logic handles creating the metric, properly adding the metric to the JSON output of
       the `/monitor` admin command, and properly resetting the metric when necessary.
    
       The motivation behind `ServerMetrics` is to make things easy enough that it encourages
       new metrics to be added liberally. Lack of in-depth metrics/visibility is a long-standing
       ZooKeeper weakness. At Facebook, most of our internal changes build on `ServerMetrics` and
       we have nearly 100 internal metrics at this time -- all of which we'll be upstreaming
       in the coming months as we publish more internal patches.
    
    3. This patch adds 20 new metrics, 14 which are handled by `ServerMetrics`.
    
    4. This patch replaces some uses of `synchronized` in `ServerStats` with atomic operations.
    
    Here's a list of new metrics added in this patch:
    
    - `uptime`: time that a peer has been in a stable leading/following/observing state
    - `leader_uptime`: uptime for peer in leading state
    - `global_sessions`: count of global sessions
    - `local_sessions`: count of local sessions
    - `quorum_size`: configured ensemble size
    - `synced_observers`: similar to existing `synced_followers` but for observers
    - `fsynctime`: time to fsync transaction log (avg/min/max)
    - `snapshottime`: time to write a snapshot (avg/min/max)
    - `dbinittime`: time to reload database -- read snapshot + apply transactions (avg/min/max)
    - `readlatency`: read request latency (avg/min/max)
    - `updatelatency`: update request latency (avg/min/max)
    - `propagation_latency`: end-to-end latency for updates, from proposal on leader to committed-to-datatree on a given host (avg/min/max)
    - `follower_sync_time`: time for follower to sync with leader (avg/min/max)
    - `election_time`: time between entering and leaving election (avg/min/max)
    - `looking_count`: number of transitions into looking state
    - `diff_count`: number of diff syncs performed
    - `snap_count`: number of snap syncs performed
    - `commit_count`: number of commits performed on leader
    - `connection_request_count`: number of incoming client connection requests
    - `bytes_received_count`: similar to existing `packets_received` but tracks bytes

----


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r209038790
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -127,36 +129,45 @@ public String toString(){
             sb.append("Mode: " + getServerState() + "\n");
             return sb.toString();
         }
    -    // mutators
    -    synchronized void updateLatency(long requestCreateTime) {
    -        long latency = Time.currentElapsedTime() - requestCreateTime;
    -        totalLatency += latency;
    -        count++;
    -        if (latency < minLatency) {
    -            minLatency = latency;
    +
    +    /**
    +     * Update request statistic. This should only be called from a request
    +     * that originated from that machine.
    +     */
    +    public void updateLatency(Request request, long currentTime) {
    --- End diff --
    
    Why do you need to pass `currentTime` here?
    For testing?


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r212104426
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -33,17 +34,17 @@
     public class ServerStats {
         private static final Logger LOG = LoggerFactory.getLogger(ServerStats.class);
     
    -    private long packetsSent;
    -    private long packetsReceived;
    -    private long maxLatency;
    -    private long minLatency = Long.MAX_VALUE;
    -    private long totalLatency = 0;
    -    private long count = 0;
    +    private final AtomicLong packetsSent = new AtomicLong();
    --- End diff --
    
    It's being counted in requestLatency.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by jtuple <gi...@git.apache.org>.
Github user jtuple commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    Rebased to resolve merge conflict.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    @eolivelli @jtuple 
    I think the best would be to join the efforts and work together to establish the new framework and implement these new metrics with a suitable library like dropwizard. At the end everybody will be happy with the results for sure.


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by jtuple <gi...@git.apache.org>.
Github user jtuple commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r216049921
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -33,17 +34,17 @@
     public class ServerStats {
         private static final Logger LOG = LoggerFactory.getLogger(ServerStats.class);
     
    -    private long packetsSent;
    -    private long packetsReceived;
    -    private long maxLatency;
    -    private long minLatency = Long.MAX_VALUE;
    -    private long totalLatency = 0;
    -    private long count = 0;
    +    private final AtomicLong packetsSent = new AtomicLong();
    --- End diff --
    
    When making this change, we kept all existing metrics "as-is" and only added new metrics to `ServerMetrics`. The packet sent/receive and request_latency are both examples of metrics existing prior to this pull-request which are directly exposed in both in the `/metrics` admin command (where `ServerMetrics` metrics also report) as well as the `mntr` four letter command (where `ServerMetrics` metrics do not report).
    
    The only changes to these existing metrics in this pull-request was converting some of them to `AtomicLong` for minor performance wins.
    
    I'm happy to move both of these metrics to `ServerMetrics` if we want. I'm not sure if the names will 100% match the current values though. We'd also need to decide if we're happy losing these metrics in `mntr` or if I should port the existing `mntr` reporting logic to still query these from `ServerMetrics`.
    
    This is also an open question for all other pre-existing metrics in `ServerStats`.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by eolivelli <gi...@git.apache.org>.
Github user eolivelli commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    @anmolnar  this change is a backport from Facebook fork.
    
    in PR #582 I have started a brand new work, to design and implement a long term ZK internal API to instrument server and client java code base.
    Users are requesting support for Prometheus for instance.
    The idea of PR #582 is to design how a "MetricsProvider" can interact with ZK code.
    
    This great work from @jtuple and his collegues will be (quite) easily refactored to the new system once we have merged Facebook contribs and merged the new system.
    
    On the new Metrics API the most interesting key points are:
    - we are going to support several providers and/or let the users contribute/use other own providers
    - we are not using 'static' variables, so you can have several ZooKeeper clients (for instance a HBase client, a Curator based client...) inside the same JVM
    
    I would like to move forward and send the next patches but we need to have agreement on the high level decisions
    
    cc @hanm @phunt 



---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    One question about gathering metrics using this system: let's say I have a use case where I want to gather the number of requests processed by FinalRequestProcessor for each request type (e.g. Ping, CreateSession, and other OpCode types). How to do this using this metrics system? I think I have to add, for each OpCode, a new enum type in ServerMetrics, is this correct? 


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    @jtuple do you have time to resolve the comments and rebase this onto latest branch? It would be great if we can get this in this week.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by jtuple <gi...@git.apache.org>.
Github user jtuple commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    Sorry for the delay, August was a super busy month.
    
    I've rebased onto latest master + added basic unit tests for `AvgMinMaxCounter` and `SimpleCounter`. I also resolved the findbugs issue and the `testMonitor` unit tests. Jenkin still seems unhappy, but I'm not sure what's up -- the linked test results show no failures.
    
    I've also addressed the majority of inline comments. Please let me know if there's anything else I overlooked.
    
    ---
    
    As mentioned in a prior comment, this pull request is just the first of many. In this PR we add the `AvgMinMaxCounter` and `SimpleCounter` metrics.
    
    We have additional metrics internally that we'll upstream in the future, including `AvgMinMaxPercentileCounter` as well as set variants: `AvgMinMaxCounterSet` and `AvgMinMaxPercentileCounterSet`. I talked about both of these in [a previous comment](https://github.com/apache/zookeeper/pull/580#issuecomment-410340039)
    
    Those are all the types we currently have internally. Other types like a Gauge could easily be done in a future pull-request, although our experience has been that gauge's are less useful in a polling based approach since you only see the most recent value. Something like an `AvgMinMaxCounter` metric gives you more information when there are multiple events, and trivially reduces to a gauge-equivalent when there's only a single value during the sampling interval.
    
    For things like commit processor queue sizes and the like, we simply query the queue size every time we enqueue an item and report that instantaneous size as a value in an `AvgMinMaxCounter`. When we periodically query metrics, we then know the min/max/avg queue size during the interval, and as well as the population count which is equivalent to "number of enqueues" during the interval.
    
    ---
    
    @hanm: For your example, you have a few options:
    
    1. Do as you mentioned and add a new metric to the `ServerMetric` enum for each request type.
    
    2. Once we land the set-variants, you could add a single metric to the `ServerMetric` enum and rely upon the set logic. Eg. add a `REQUEST_LATENCY(new AvgMinMaxCounterSet("request_latency"))` metric and then do `ServerMetrics.REQUEST_LATENCY.add("create_session", latency)`, `ServerMetrics.REQUEST_LATENCY.add("set_data", latency)`, etc. For your example, you wanted to track counts so you'd probably want a `SimpleCounterSet` which we don't have, but would be easy to create.
    
    3. Use something custom. For example, the majority of our internal metrics build on `ServerMetrics`, but we do have a handful that don't. One example that we'll be upstreaming soon is our so-called "request state" metrics that track the count for requests at different stages in the request pipeline. We maintain a count of requests queued, issued, and completed for different category of requests (all, read, write, create_session, close_session, sync, auth, set_watches, other). This matrix is also added to the output of `/metrics` but doesn't use any of the `ServerMetrics` logic.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    Thanks for detailed reply, @jtuple.
    
    Jenkins was not happy because one C client test failed (the 'Test Result' only shows Java tests). I kicked a 
    [new build](https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2141/), and it passes.
    
    >> Other types like a Gauge could easily be done in a future pull-request, although our experience has been that gauge's are less useful in a polling based approach since you only see the most recent value. 
    
    OK.
    
    >> Once we land the set-variants, you could add a single metric to the ServerMetric enum and rely upon the set logic.
    
    I think this is what I was looking for - a way of organizing metrics hierarchically like files/folder or namespaces. Good to know it's already supported and will be upstreamed.


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r207891221
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -33,17 +34,17 @@
     public class ServerStats {
         private static final Logger LOG = LoggerFactory.getLogger(ServerStats.class);
     
    -    private long packetsSent;
    -    private long packetsReceived;
    -    private long maxLatency;
    -    private long minLatency = Long.MAX_VALUE;
    -    private long totalLatency = 0;
    -    private long count = 0;
    +    private final AtomicLong packetsSent = new AtomicLong();
    --- End diff --
    
    Why are these 3 not part of `ServerMetrics`?
    Is that a future step of the migration?


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r217864986
  
    --- Diff: src/java/test/org/apache/zookeeper/server/ServerMetricsTest.java ---
    @@ -0,0 +1,91 @@
    +package org.apache.zookeeper.server;
    --- End diff --
    
    Please add Apache License Header to this file. Missing of the header causes release audit warnings like [this](https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2181/artifact/patchprocess/patchReleaseAuditProblems.txt)


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    @anmolnar If I understand correctly, @eolivelli's work is more about making it easier to export ZooKeeper's internal metric to external systems, while FB is focusing on adding a lot more useful new metrics, so they don't conflict with each other that much.
    
    And even we need to change something to adapt the new metric framework it should be trivial.


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by jtuple <gi...@git.apache.org>.
Github user jtuple commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r207633573
  
    --- Diff: src/java/main/org/apache/zookeeper/server/NettyServerCnxn.java ---
    @@ -271,7 +271,7 @@ private boolean checkFourLetterWord(final Channel channel,
             String cmd = FourLetterCommands.getCommandString(len);
     
             channel.setInterestOps(0).awaitUninterruptibly();
    -        packetReceived();
    +        packetReceived(4);
    --- End diff --
    
    `packetReceived()` now takes in the packet-size as an argument to update the packets received metric. The total packet size consists of packet header (4 bytes) + packet size. For "four letter words", there is no packet size -- just the 4-byte packet header.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by eolivelli <gi...@git.apache.org>.
Github user eolivelli commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    I think it will be easier for FB guys to merge their work to community repo and then we can move to the new system.
    Otherwise the two forks will never converge.
    
    So I will lean towards merging all the FB work about metrics and then convert to the new system.
    
    In parallel we can take decisions about the new system, we can merge my changes as long as they do not have significant impact at runtime..
    
    For me it would be a nice  approch to let @lvfangmin convert eventually the instrumentation points to the new system.
    
    I can focus on implementation of providers, like Prometheus and Dropwizard


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r212103989
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -127,36 +129,45 @@ public String toString(){
             sb.append("Mode: " + getServerState() + "\n");
             return sb.toString();
         }
    -    // mutators
    -    synchronized void updateLatency(long requestCreateTime) {
    -        long latency = Time.currentElapsedTime() - requestCreateTime;
    -        totalLatency += latency;
    -        count++;
    -        if (latency < minLatency) {
    -            minLatency = latency;
    +
    +    /**
    +     * Update request statistic. This should only be called from a request
    +     * that originated from that machine.
    +     */
    +    public void updateLatency(Request request, long currentTime) {
    --- End diff --
    
    Agree, we don't have to pass in the currentTime here, no test is calling this function.


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r214857251
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -33,17 +34,17 @@
     public class ServerStats {
         private static final Logger LOG = LoggerFactory.getLogger(ServerStats.class);
     
    -    private long packetsSent;
    -    private long packetsReceived;
    -    private long maxLatency;
    -    private long minLatency = Long.MAX_VALUE;
    -    private long totalLatency = 0;
    -    private long count = 0;
    +    private final AtomicLong packetsSent = new AtomicLong();
    --- End diff --
    
    I'm not sure if I see what you mean by "being counted"?
    I've thought that these three metrics should be part of ServerMetrics as:
    - packetsSent and packetsReceived as SimpleCounter,
    - requestLatency as AvgMinMaxCounter
    
    and should be referred as ServerMetrics.PACKETS_SENT, etc. like the rest.
    Please let me know if didn't get it right.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    @lvfangmin @eolivelli Thanks for the clarification. Let's do it the way you're suggesting.
    @jtuple I'm happy to commit your patch once you answered my questions and resolved the conflicts.


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r214857350
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -127,36 +129,45 @@ public String toString(){
             sb.append("Mode: " + getServerState() + "\n");
             return sb.toString();
         }
    -    // mutators
    -    synchronized void updateLatency(long requestCreateTime) {
    -        long latency = Time.currentElapsedTime() - requestCreateTime;
    -        totalLatency += latency;
    -        count++;
    -        if (latency < minLatency) {
    -            minLatency = latency;
    +
    +    /**
    +     * Update request statistic. This should only be called from a request
    +     * that originated from that machine.
    +     */
    +    public void updateLatency(Request request, long currentTime) {
    --- End diff --
    
    @jtuple Please remove `currentTime` from here.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    From the conversation so far, it looks like we have reached a consensus regarding the directions of this patch and future metrics related patch. For this specific patch, it looks good to me, and I only have one comment regarding FOLLOWER_SYNC_TIME instrumentation but that's not blocking.
    
    Any comments from other reviewers? @anmolnar @eolivelli @lvfangmin 
    Let's try to get this patch merge in soon to avoid constant turnovers on merge conflicts (unfortunately, it got another merge conflicts that requires another rebase.).


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r216487862
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/Follower.java ---
    @@ -86,7 +89,13 @@ void followLeader() throws InterruptedException {
                                 + " is less than our accepted epoch " + ZxidUtils.zxidToString(self.getAcceptedEpoch()));
                         throw new IOException("Error: Epoch of leader is lower");
                     }
    -                syncWithLeader(newEpochZxid);                
    +                long startTime = Time.currentElapsedTime();
    +                try {
    +                    syncWithLeader(newEpochZxid);
    +                } finally {
    +                    long syncTime = Time.currentElapsedTime() - startTime;
    +                    ServerMetrics.FOLLOWER_SYNC_TIME.add(syncTime);
    --- End diff --
    
    This will execute regardless of `syncWithLeader` succeeded or not. Should we only collect `syncTime` for the syncs that were successful? 


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by jtuple <gi...@git.apache.org>.
Github user jtuple commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    I'm happy to add tests for `AvgMinMaxCounter` and `SimpleCounter`, will go ahead and update this PR today to include that for completeness. Since it's easy change to make while we discuss any other reworkings.
    
    We actually landed a version of #440 internally at Facebook and have been running with Dropwiz-based percentile counters in production for a few months now. We build on the same `ServerMetrics` design in this PR and have an `AvgMinMaxPercentileCounter` that wraps a Dropwiz `Histogram` which uses a custom uniform-sampling reservoir implementation that enables resetting the metric/reservoir. This makes things consistent with all the other metric types that are resettable.
    
    In practice, the memory overhead for the Dropwiz histogram is much higher than the flat `AvgMinMaxCounter` metric, so we only converted about 20 of our 100 metrics to `AvgMinMaxPercentileCounter` metrics -- pretty much just the latency metrics and metrics that track time spent in various queues and stages of the request pipeline.
    
    We also have `AvgMinMaxCounterSet` and `AvgMinMaxPercentileCounterSet` metric-types that aggregate metrics across a dynamic set of entities. These are types that we were planning to submit in future PRs that build on this one. Example uses of this includes tracking learner queue size/time and ack latency on the leader for each quorum peer.
    
    ---
    
    In any case, happy to rework this PR with guidance and have the bandwidth to do so if we can come to consensus on a design. My biggest concern is that pretty much all of our remaining internal features at Facebook that we want to upstream are blocked on this feature, since we aggressively add metrics when adding new features (so that we can monitor/track in production, etc). The sooner we can land something, the sooner we can rebase on top of whatever the agreed approach is and unblock our other contributions.
    
    Open to any design that the community is happy with, however I think there's a few things which are important to maintain:
    
    1. Any solution should make it super easy to add new metrics and use them throughout the codebase. We've learned the hard way that lowering the barrier to entry was the only way to truly motivate adding new metrics while adding new features.
    
    2. It's useful to have a `Metric` type interface that allows us to enforce certain requirements. As an example, ZooKeeper metrics have historically been resettable, which isn't necessarily true for other metric libraries. Having an adaptor class allowed us to wrap the Dropwiz histograms and add a fast reset capability that then behaved like all our other metrics.
    
    3. Even if we export to external systems, having the ability to have Zookeeper maintain internal metrics/aggregation is still important. There should be a low-barrier to entry for getting visibility into the system from day one with minimal config/setup.


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r204179634
  
    --- Diff: src/java/main/org/apache/zookeeper/server/NettyServerCnxn.java ---
    @@ -271,7 +271,7 @@ private boolean checkFourLetterWord(final Channel channel,
             String cmd = FourLetterCommands.getCommandString(len);
     
             channel.setInterestOps(0).awaitUninterruptibly();
    -        packetReceived();
    +        packetReceived(4);
    --- End diff --
    
    I am not sure what's the intention of this change..


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    Got a [green build](https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2193/). Merging.


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r216489582
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -33,17 +34,17 @@
     public class ServerStats {
         private static final Logger LOG = LoggerFactory.getLogger(ServerStats.class);
     
    -    private long packetsSent;
    -    private long packetsReceived;
    -    private long maxLatency;
    -    private long minLatency = Long.MAX_VALUE;
    -    private long totalLatency = 0;
    -    private long count = 0;
    +    private final AtomicLong packetsSent = new AtomicLong();
    --- End diff --
    
    Since `ServerMetrics` is a superset of `ServerStats` in terms of scope, we probably want to keep `ServerStats` as is and ultimately deprecate it in favor of `ServerMetrics`. I don't think there is a need to duplicate metrics in two places, which would be both a burden to maintain and a potential source of confusion.
    
    Regarding reporting to `mntr`, we decided deprecate 4lw last year due to the limitation of its design, in particular around security, in favor of admin server endpoints  (`/metrics` in this case), so I don't there is a need to report newly added metrics to `mntr`. This also encourages users to migrate away from 4lw to admin end points.
    
    Overall the state in current patch w.r.t this looks good to me.



---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/zookeeper/pull/580


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r204180108
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerMetrics.java ---
    @@ -0,0 +1,103 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.zookeeper.server;
    +
    +import org.apache.zookeeper.server.metric.AvgMinMaxCounter;
    +import org.apache.zookeeper.server.metric.Metric;
    +import org.apache.zookeeper.server.metric.SimpleCounter;
    +
    +import java.util.LinkedHashMap;
    +import java.util.Map;
    +
    +public enum ServerMetrics {
    +    /**
    +     * Txnlog fsync time
    +     */
    +    FSYNC_TIME(new AvgMinMaxCounter("fsynctime")),
    +
    +    /**
    +     * Snapshot writing time
    +     */
    +    SNAPSHOT_TIME(new AvgMinMaxCounter("snapshottime")),
    +
    +    /**
    +     * Db init time (snapshot loading + txnlog replay)
    +     */
    +    DB_INIT_TIME(new AvgMinMaxCounter("dbinittime")),
    +
    +    /**
    +     * Stats for read request. The timing start from when the server see the
    +     * request until it leave final request processor.
    +     */
    +    READ_LATENCY(new AvgMinMaxCounter("readlatency")),
    +
    +    /**
    +     * Stats for request that need quorum voting. Timing is the same as read
    +     * request. We only keep track of stats for request that originated from
    +     * this machine only.
    +     */
    +    UPDATE_LATENCY(new AvgMinMaxCounter("updatelatency")),
    +
    +    /**
    +     * Stats for all quorum request. The timing start from when the leader
    +     * see the request until it reach the learner.
    +     */
    +    PROPAGATION_LATENCY(new AvgMinMaxCounter("propagation_latency")),
    +
    +    FOLLOWER_SYNC_TIME(new AvgMinMaxCounter("follower_sync_time")),
    +    ELECTION_TIME(new AvgMinMaxCounter("election_time")),
    --- End diff --
    
    Should election_time be recorded as a gauge instead of a counter? Similar for other times (fsync, init, etc)


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    We at Twitter manage large ZK clusters and we also have our own internal metrics system (you can't really operate things like ZK without metrics at large scale.). Our design is similar like this in terms of metrics collection and code instrumentation, but we coupled the metrics collection with publication. We are publishing metrics through Finagle (Twitter's RPC library) and the tight coupling caused several issues, such as circular dependencies between Finagle and ZooKeeper, where ZK depends on Finagle for metrics, and Finagle depends on ZK for ServerSets (naming resolution). 
    
    I think the current design proposed in the community would solve this problem. Basically we will have two systems:
    * The ZooKeeper metrics system for metrics collection.
    * The pluggable system @eolivelli is working on is the backends for the metrics.
    
    So IMO both work are sort of orthogonal and can be done in parallel. And I agree with the design decisions @jtuple proposed and @eolivelli commented previously.
    
    Regarding use an external metrics type system: I think it would be more flexible for us to define and own the metrics definitions so it's easy to add / change the system without depends on third parties. 


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by eolivelli <gi...@git.apache.org>.
Github user eolivelli commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    @jtuple in my opinion it is better  to allow all  of changes coming from Facebook to be merged in as soon as possible.
    Having only one common codebase is necessary in order to work together.
    
    So from my point of view it will be better to mark cleearly all of the changes coming from FB fork, this way reviewers and committer will take it in account and treat the patch under that light.
    
    Last year for instance in Bookkeeper we merged three 'diverged' forks from Yahoo, Twitter and Salesforce into the Apache codebase. It has been a long work but now it is super easy to collaborate.
    
    Once we have a common codebase refactoring will be easier for everyone.
    
    Just my 2cents


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    Thanks for rebase, @jtuple. Latest build failed, but those are flaky tests. Kicking Jenkins again to get a green build.
    
    @anmolnar Could you please also sign this off? I believe you raised a request of removing `currentTime` which @jtuple addressed by pointing out its usage in test code. Do you have other concerns regarding this patch?


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by jtuple <gi...@git.apache.org>.
Github user jtuple commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r216112301
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerMetrics.java ---
    @@ -0,0 +1,103 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.zookeeper.server;
    +
    +import org.apache.zookeeper.server.metric.AvgMinMaxCounter;
    +import org.apache.zookeeper.server.metric.Metric;
    +import org.apache.zookeeper.server.metric.SimpleCounter;
    +
    +import java.util.LinkedHashMap;
    +import java.util.Map;
    +
    +public enum ServerMetrics {
    +    /**
    +     * Txnlog fsync time
    +     */
    +    FSYNC_TIME(new AvgMinMaxCounter("fsynctime")),
    +
    +    /**
    +     * Snapshot writing time
    +     */
    +    SNAPSHOT_TIME(new AvgMinMaxCounter("snapshottime")),
    +
    +    /**
    +     * Db init time (snapshot loading + txnlog replay)
    +     */
    +    DB_INIT_TIME(new AvgMinMaxCounter("dbinittime")),
    +
    +    /**
    +     * Stats for read request. The timing start from when the server see the
    +     * request until it leave final request processor.
    +     */
    +    READ_LATENCY(new AvgMinMaxCounter("readlatency")),
    +
    +    /**
    +     * Stats for request that need quorum voting. Timing is the same as read
    +     * request. We only keep track of stats for request that originated from
    +     * this machine only.
    +     */
    +    UPDATE_LATENCY(new AvgMinMaxCounter("updatelatency")),
    +
    +    /**
    +     * Stats for all quorum request. The timing start from when the leader
    +     * see the request until it reach the learner.
    +     */
    +    PROPAGATION_LATENCY(new AvgMinMaxCounter("propagation_latency")),
    +
    +    FOLLOWER_SYNC_TIME(new AvgMinMaxCounter("follower_sync_time")),
    +    ELECTION_TIME(new AvgMinMaxCounter("election_time")),
    --- End diff --
    
    If by gauge you're referring to a metric that only reports a single value, I don't think that's a good fit for these metrics. If you're querying metrics from Zookeeper every minute or so, you'd miss all but the last election that occurred in that interval. With an avg/min/max counter, we can query periodically and have statistics that summarize all elections that occurred in that interval. Same logic applies to fsync, init, etc.
    
    Tracking individual event times is probably better left for the "push metrics to external system" approach that's orthogonal to this pull request.
    
    However, if we wanted to provide statistics + bounded history internally, we could consider adding an extended version of say `AvgMinMaxCounter` that also keeps a fixed number of values around that could be queried, dropping the oldest or doing some sort of reservoir sampling.
    
    Definitely something we can do in the future. If we do, it's unclear if we'd want to wire that up to `/metrics` admin command though vs exposing through a different means.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    Finally this happens, thanks for contributing the change back @jtuple . My quick comments inline.
    
    In addition, there are two things:
    * There is find bug warning. I believe it's the txn variable [here](https://github.com/jtuple/zookeeper/blob/e6935f8d99eace05d29c2d6659e68e8b90b9a633/src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java#L110) that should be removed. I know that's not part of this patch, and i am not sure why it gets triggered here, but please remove it so we can get a clean find bug check.
    
    * The test `org.apache.zookeeper.server.admin.CommandsTest` is failing. Please investigate.


---

[GitHub] zookeeper pull request #580: ZOOKEEPER-3098: Add additional server metrics

Posted by jtuple <gi...@git.apache.org>.
Github user jtuple commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/580#discussion_r216047184
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ServerStats.java ---
    @@ -127,36 +129,45 @@ public String toString(){
             sb.append("Mode: " + getServerState() + "\n");
             return sb.toString();
         }
    -    // mutators
    -    synchronized void updateLatency(long requestCreateTime) {
    -        long latency = Time.currentElapsedTime() - requestCreateTime;
    -        totalLatency += latency;
    -        count++;
    -        if (latency < minLatency) {
    -            minLatency = latency;
    +
    +    /**
    +     * Update request statistic. This should only be called from a request
    +     * that originated from that machine.
    +     */
    +    public void updateLatency(Request request, long currentTime) {
    --- End diff --
    
    This is used for testing in both `ServerStatsTest.testLatencyMetrics()` and `ServerStatsTest.testReset()`.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    And yes like others commented, please rebase and resolve conflicts, and reply to previous unattended comments. My previous comments need address are:
    
    * Existing unit test failures and one findbug issue.
    * Are there more types of metrics going to be added, such as Gauge? 
    * Would be good to have some tests around the system.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    Thanks @anmolnar, I've answered some of the questions here.
    
    @jtuple can you resolve some of these comments and the conflict, we might ready to go now, this will unblock other patches we're going to upstream, so I'd like to see it's being merged sooner.


---

[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/580
  
    @eolivelli 
    I got that this is Facebook work on adding new metrics and at the same time you're working on a generic long term solution for ZooKeeper metrics system.
    
    My point is that we should get these 2 initiatiatives as close as possible and minimize the need for refactoring on both sides.
    
    > will be (quite) easily refactored to the new system once we have merged Facebook contribs and merged the new system.
    
    I'm not sure that we can get to that point in the short term, hence I encourage to do these 2 things side by side.


---