You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@accumulo.apache.org by noedetore <gi...@git.apache.org> on 2017/02/23 01:52:32 UTC

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

GitHub user noedetore opened a pull request:

    https://github.com/apache/accumulo/pull/220

    ACCUMULO-4591 replication latency metrics added

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/noedetore/accumulo ACCUMULO-4591

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/220.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #220
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by joshelser <gi...@git.apache.org>.

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102778787
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -197,6 +200,7 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
         mutations.add(m);
         for (Entry<String,Long> entry : tableToTimeCreated.entrySet()) {
           log.info("Removing order mutation for table {} at {} for {}", entry.getKey(), entry.getValue(), row.toString());
    +      collectLatency(entry.getValue());
    --- End diff --
    
    Building on the point above: you would want to collect the latencies of each outstanding file to be replicated, likely not within the execution path of the Replication internals (metrics should not affect the latency of the system).
    
    It would be better to pull this logic out of `RCRR`, and add it into the Metrics2ReplicationMetrics, enumerating the files and computing the histogram of metrics during the metrics collection cycle.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by joshelser <gi...@git.apache.org>.

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102778301
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/metrics/Metrics2ReplicationMetrics.java ---
    @@ -58,6 +59,7 @@ protected void snapshot() {
         registry.add(PENDING_FILES, getNumFilesPendingReplication());
         registry.add(NUM_PEERS, getNumConfiguredPeers());
         registry.add(MAX_REPLICATION_THREADS, getMaxReplicationThreads());
    +    registry.add(LATENCY, getLatencyInSeconds());
    --- End diff --
    
    You're collecting the "maximum" latency in the system which isn't ideal as it paints a very skewed picture of the actual system.
    
    For example, if you have a single file which cannot be replicated (say, it's corrupted) it would have a very long latency. The rest of the files may be replicating in a (hypothetically) 5seconds. The maximum latency would continue to increase without bound, indicating to an operator that the system is not healthy, when the system is actually 99% healthy.
    
    Hadoop metrics2 exposes a histogram class for metrics which automatically aggregates min, median, avg, max as well as percentiles (e.g. p75, p85, p95, p98, p99).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by ctubbsii <gi...@git.apache.org>.

Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102801086
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
    
    I guess... but there's really no reason to fudge when there's a built-in API for this in the JVM. I realize we're still using currentTimeMillis elsewhere, but I'm trying to encourage us to move away from that, for correctness.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by keith-turner <gi...@git.apache.org>.

Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102770533
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
    
    I am not sure, but I think this createdTime may come from a persisted source?  @joshelser  is that right?  If its coming from a persisted source, then it could have come from another process.  In that case using `System.currentTimeMillis()` is reasonable.   `System.nanoTime()` can reliably compute time deltas within a process.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by joshelser <gi...@git.apache.org>.

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102820537
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/metrics/Metrics2ReplicationMetrics.java ---
    @@ -58,6 +59,7 @@ protected void snapshot() {
         registry.add(PENDING_FILES, getNumFilesPendingReplication());
         registry.add(NUM_PEERS, getNumConfiguredPeers());
         registry.add(MAX_REPLICATION_THREADS, getMaxReplicationThreads());
    +    registry.add(LATENCY, getLatencyInSeconds());
    --- End diff --
    
    Ah, crap. It looks like I was confusing what exists in other Metrics libraries as primitives.
    
    Hadoop Metrics2 supports gauges, counters, stats and quantiles. You would want to create `MutableQuantiles` and some `MutableGauges` from the `MetricsRegistry`. You can maintain the min, median, avg, and max as gauges, and then present the percentiles using the quantiles. Essentially, you'd have to track both separately. Like you have here, you could just instrument this class to compute the current metrics instead of doing it "in-band" in `RCRR`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by keith-turner <gi...@git.apache.org>.

Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102774271
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
    
    I looked into this a bit more to see if using nanoTime would be reasonable across processes.  The following is from the nanoTime javaodc
    
    ```
    Returns the current value of the running Java Virtual Machine's high-resolution time source, in nanoseconds. This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
    ```
    
    Given this documentation, if the times are persisted and used across multiple processes then I think using `nanoTime()` would be inappropriate. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo issue #220: ACCUMULO-4591 replication latency metrics added

Posted by keith-turner <gi...@git.apache.org>.

Github user keith-turner commented on the issue:

    https://github.com/apache/accumulo/pull/220
  
    @noedetore is this metric showing the current maximum time for all active replication across all tables with replication enabled?
    
    Does anyone know of a place where individual metrics are documented? I looked in the user manual and did not see anything.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by joshelser <gi...@git.apache.org>.

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102804050
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/metrics/Metrics2ReplicationMetrics.java ---
    @@ -58,6 +59,7 @@ protected void snapshot() {
         registry.add(PENDING_FILES, getNumFilesPendingReplication());
         registry.add(NUM_PEERS, getNumConfiguredPeers());
         registry.add(MAX_REPLICATION_THREADS, getMaxReplicationThreads());
    +    registry.add(LATENCY, getLatencyInSeconds());
    --- End diff --
    
    Yes, not saying that having the maximum isn't helpful, but it paints a very skewed picture.
    
    We have the primitives to collect and display metrics which paint a complete picture. I really think that we should use them instead of only displaying the maximum.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by noedetore <gi...@git.apache.org>.

Github user noedetore commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102800222
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/metrics/Metrics2ReplicationMetrics.java ---
    @@ -58,6 +59,7 @@ protected void snapshot() {
         registry.add(PENDING_FILES, getNumFilesPendingReplication());
         registry.add(NUM_PEERS, getNumConfiguredPeers());
         registry.add(MAX_REPLICATION_THREADS, getMaxReplicationThreads());
    +    registry.add(LATENCY, getLatencyInSeconds());
    --- End diff --
    
    Screen Shot 2017-02-23 at 9.31.06 AM shows the spike but still helpful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by joshelser <gi...@git.apache.org>.

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102805336
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/metrics/Metrics2ReplicationMetrics.java ---
    @@ -58,6 +59,7 @@ protected void snapshot() {
         registry.add(PENDING_FILES, getNumFilesPendingReplication());
         registry.add(NUM_PEERS, getNumConfiguredPeers());
         registry.add(MAX_REPLICATION_THREADS, getMaxReplicationThreads());
    +    registry.add(LATENCY, getLatencyInSeconds());
    --- End diff --
    
    Also, now that I think of this, "latency" is not the right word to use. Replication latency should be the time in which some record (maybe file, for simplicity) was created to when it was replicated. The "latency" which you have here is not that -- it's the current maximum time from WAL creation to being the replication record being cleaned up. This maximum value includes time for Accumulo to clean up the record as well as the time when a WAL was created before it received any data for a table being replicated -- it's an over-estimation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by ctubbsii <gi...@git.apache.org>.

Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102622870
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -57,9 +58,11 @@
       private static final Logger log = LoggerFactory.getLogger(RemoveCompleteReplicationRecords.class);
     
       private Connector conn;
    +  private Master master;
     
    -  public RemoveCompleteReplicationRecords(Connector conn) {
    +  public RemoveCompleteReplicationRecords(Connector conn, Master master) {
    --- End diff --
    
    Is this connector for the remote peer? If not, the master actually contains a client context and the connector is redundant if the master is being passed in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by noedetore <gi...@git.apache.org>.

Github user noedetore commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102771300
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
    
    Using nanoTime() is new to me. In making sure there is no conflict, I did find in org.apache.accumulo.tserver.log.TabletServerLogger setting createdTime
    Status status = StatusUtil.fileCreated(System.currentTimeMillis());


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by ctubbsii <gi...@git.apache.org>.

Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102801369
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
    
    Definitely don't use nanoTime across processes. It's only guaranteed that `(newtime - oldtime) > 0` within the same process.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by keith-turner <gi...@git.apache.org>.

Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102783229
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
    
    I like the idea of taking the max of 0 and computed latency.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo issue #220: ACCUMULO-4591 replication latency metrics added

Posted by joshelser <gi...@git.apache.org>.

Github user joshelser commented on the issue:

    https://github.com/apache/accumulo/pull/220
  
    > Does anyone know of a place where individual metrics are documented? I looked in the user manual and did not see anything.
    
    I don't think we have this broken out into the user manual. It's very ad-hoc. We could experiment with lifting all of the metrics keys into an interface, creating a description for each, and then automate the addition of these into the user manual like the configuration properties work presently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by ctubbsii <gi...@git.apache.org>.

Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102623116
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
    
    `System.currentTimeMillis()` is unreliable for tracking time differences, and this subtraction could even be negative. Use `System.nanoTime()` for time differences/durations. The difference is guaranteed to be positive.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by ctubbsii <gi...@git.apache.org>.

Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102622494
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/Master.java ---
    @@ -192,6 +193,7 @@
       private ReplicationDriver replicationWorkDriver;
       private WorkDriver replicationWorkAssigner;
       RecoveryManager recoveryManager = null;
    +  private AtomicLong replicationLatency = new AtomicLong(0l);
    --- End diff --
    
    Personal preference: use upper-case `L` for Long literals. It's much easier to read (this looks like `01`).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by noedetore <gi...@git.apache.org>.

Github user noedetore commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102776327
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/Master.java ---
    @@ -192,6 +193,7 @@
       private ReplicationDriver replicationWorkDriver;
       private WorkDriver replicationWorkAssigner;
       RecoveryManager recoveryManager = null;
    +  private AtomicLong replicationLatency = new AtomicLong(0l);
    --- End diff --
    
    will fix


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by joshelser <gi...@git.apache.org>.

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102823576
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/metrics/Metrics2ReplicationMetrics.java ---
    @@ -58,6 +59,7 @@ protected void snapshot() {
         registry.add(PENDING_FILES, getNumFilesPendingReplication());
         registry.add(NUM_PEERS, getNumConfiguredPeers());
         registry.add(MAX_REPLICATION_THREADS, getMaxReplicationThreads());
    +    registry.add(LATENCY, getLatencyInSeconds());
    --- End diff --
    
    Oh, this is why I was confused. HBase has its own MutableHistogram metrics classes.
    
    https://github.com/apache/hbase/blob/bb3d9ccd489fb64e3cb2020583935a393382a678/hbase-hadoop2-compat/src/main/java/org/apache/hadoop/metrics2/lib/MutableTimeHistogram.java
    https://github.com/apache/hbase/blob/bb3d9ccd489fb64e3cb2020583935a393382a678/hbase-hadoop2-compat/src/main/java/org/apache/hadoop/metrics2/lib/MutableSizeHistogram.java
    https://github.com/apache/hbase/blob/c64a1d199402d6bf2d6ff4168c00c756dcaa59e4/hbase-hadoop2-compat/src/main/java/org/apache/hadoop/metrics2/lib/MutableHistogram.java
    
    We could consider lifting these to make your life easier.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by joshelser <gi...@git.apache.org>.

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102777287
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -57,9 +58,11 @@
       private static final Logger log = LoggerFactory.getLogger(RemoveCompleteReplicationRecords.class);
     
       private Connector conn;
    +  private Master master;
     
    -  public RemoveCompleteReplicationRecords(Connector conn) {
    +  public RemoveCompleteReplicationRecords(Connector conn, Master master) {
    --- End diff --
    
    bq. Is this connector for the remote peer?
    
    No, it's for the "local" instance. This is just cleaning up stuff in the accumulo.replication table on the "source" instance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo issue #220: ACCUMULO-4591 replication latency metrics added

Posted by noedetore <gi...@git.apache.org>.

Github user noedetore commented on the issue:

    https://github.com/apache/accumulo/pull/220
  
    @keith-turner yes showing the current maximum for given time(10sec, configurable) when snapshot is called to collect metrics


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by joshelser <gi...@git.apache.org>.

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102779143
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
    
    > I am not sure, but I think this createdTime may come from a persisted source
    
    I believe createdTime comes from the TabletServer, likely calling currentTimeMillis(). As long as we're reporting non-negative latencies, I think it will be fine (e.g. take the max of 0 and the computed latency).
    
    The createdTime is coming from a remote JVM (the tserver), so I dont' think there's a need to use nanoTime here for the monotonically increasing time. We can just fudge it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

Posted by noedetore <gi...@git.apache.org>.

Github user noedetore commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102815549
  
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/metrics/Metrics2ReplicationMetrics.java ---
    @@ -58,6 +59,7 @@ protected void snapshot() {
         registry.add(PENDING_FILES, getNumFilesPendingReplication());
         registry.add(NUM_PEERS, getNumConfiguredPeers());
         registry.add(MAX_REPLICATION_THREADS, getMaxReplicationThreads());
    +    registry.add(LATENCY, getLatencyInSeconds());
    --- End diff --
    
    I agree there are better ways of getting a better metric. This approach is a 90%(maybe to generous) solution using the information that already exists to give an end user some bearing of confidence as to current replication latency. From a users perspective, they want a historical way to know what is the greatest latency they can expect their data to be replicated at a given point in time. Spikes help explain the possible -1% unhealthy replicated files. As in the picture above there are couple spikes, but they don't seem to paint a bad picture. I will look into the histogram class. Are there any points of reference in the code you can point me to?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---