Posted to issues@hbase.apache.org by "Wellington Chevreuil (JIRA)" <ji...@apache.org> on 2018/11/04 01:21:00 UTC

[jira] [Commented] (HBASE-21406) "status 'replication'" should not show SINK if the cluster does not act as sink

    [ https://issues.apache.org/jira/browse/HBASE-21406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674255#comment-16674255 ] 

Wellington Chevreuil commented on HBASE-21406:
----------------------------------------------

Added an initial patch proposal for *branch-1*. The idea here is to not show SINK stats until the sink has actually received edits. I also added a metric showing the sink startup time, displayed as below:
{noformat}
SINK  : TimeStampStarted=1541292912227, Waiting for OPs...{noformat}
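A minimal sketch of the display rule described above. It assumes a hypothetical `SinkStatus` snapshot that carries a counter of applied ops (`appliedOps`) so "no edits received yet" can be detected explicitly. This is not the code in the attached patch: the class, field and method names are illustrative, and only the "TimeStampStarted=..., Waiting for OPs..." wording comes from the output shown.

```java
import java.util.Date;

/** Hypothetical formatter for the SINK line of "status 'replication'" (illustrative only). */
class SinkStatusFormatter {

    /** Hypothetical snapshot of one region server's sink metrics. */
    static final class SinkStatus {
        final long timeStampStarted;         // sink startup time (ms since epoch)
        final long ageOfLastAppliedOp;       // ms; meaningful only once an edit was applied
        final long timeStampOfLastAppliedOp; // initialized to the startup time, as in the issue output
        final long appliedOps;               // hypothetical counter of applied ops

        SinkStatus(long timeStampStarted, long ageOfLastAppliedOp,
                   long timeStampOfLastAppliedOp, long appliedOps) {
            this.timeStampStarted = timeStampStarted;
            this.ageOfLastAppliedOp = ageOfLastAppliedOp;
            this.timeStampOfLastAppliedOp = timeStampOfLastAppliedOp;
            this.appliedOps = appliedOps;
        }
    }

    static String format(SinkStatus s) {
        if (s.appliedOps == 0) {
            // Nothing received yet: show only the startup time instead of misleading stats.
            return "SINK  : TimeStampStarted=" + s.timeStampStarted + ", Waiting for OPs...";
        }
        // The sink has applied edits, so the usual metrics are meaningful.
        return "SINK  : AgeOfLastAppliedOp=" + s.ageOfLastAppliedOp
                + ", TimeStampsOfLastAppliedOp=" + new Date(s.timeStampOfLastAppliedOp);
    }

    public static void main(String[] args) {
        System.out.println(format(new SinkStatus(1541292912227L, 0L, 1541292912227L, 0L)));
        // prints: SINK  : TimeStampStarted=1541292912227, Waiting for OPs...
    }
}
```

Keying the decision on an explicit counter, rather than on the last-applied timestamp, avoids relying on that timestamp's initial value, which the issue description shows is the RS startup time rather than a sentinel.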
 
BTW, while testing, I noticed additional issues with the source metrics on the current branch-1 version:
1) Once the source has started, and while no OP eligible for replication occurs, TimeStampsOfLastShippedOp shows "Thu Jan 01 01:00:00 GMT 1970" and a huge Replication Lag is reported. This seems to be due to HBASE-15995, which removed the code in the ReplicationSource class that initialized AgeOfLastShippedOp to the startup time:

{noformat}
-          // Reset the sleep multiplier if nothing has actually gone wrong
-          if (!gotIOE) {
-            sleepMultiplier = 1;
-            // if there was nothing to ship and it's not an error
-            // set "ageOfLastShippedOp" to <now> to indicate that we're current
-            metrics.setAgeOfLastShippedOp(EnvironmentEdgeManager.currentTime(), walGroupId);
+          WALEntryBatch entryBatch = entryReader.take();
+          for (Map.Entry<String, Long> entry : entryBatch.getLastSeqIds().entrySet()) {
+            waitingUntilCanPush(entry);
{noformat}

2) After the source gets OPs to replicate and successfully ships them to the target, the source metrics keep showing lag, even when there are no new edits to replicate. This is also wrong, and was apparently introduced by HBASE-15093, which modified the way the log queue size is accounted; the replication lag calculation in ReplicationLoad relies on that log queue size:
{noformat}
      long ageOfLastShippedOp = sm.getAgeOfLastShippedOp();
      int sizeOfLogQueue = sm.getSizeOfLogQueue();
      long timeStampOfLastShippedOp = sm.getTimeStampOfLastShippedOp();
      long replicationLag;
      long timePassedAfterLastShippedOp =
          EnvironmentEdgeManager.currentTime() - timeStampOfLastShippedOp;
      if (sizeOfLogQueue != 0) {
        // err on the large side
        replicationLag = Math.max(ageOfLastShippedOp, timePassedAfterLastShippedOp);
      } else if (timePassedAfterLastShippedOp < 2 * ageOfLastShippedOp) {
        replicationLag = ageOfLastShippedOp; // last shipped happen recently
      } else {
        // last shipped may happen last night,
        // so NO real lag although ageOfLastShippedOp is non-zero
        replicationLag = 0;
      }
{noformat}
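To make the dependency on the queue size concrete, the sketch below lifts the quoted rule into a standalone static method. The class and method names are hypothetical, and the `now` parameter replaces `EnvironmentEdgeManager.currentTime()` so the result is deterministic. It assumes the symptom above means the reported queue size stays non-zero after the last WAL has been drained: for the same idle source, a queue size of 1 instead of 0 turns a zero lag into one that grows with wall-clock time.

```java
/** The replication lag rule quoted above, extracted into a pure function for illustration. */
final class ReplicationLagSketch {
    private ReplicationLagSketch() {
    }

    static long replicationLag(long ageOfLastShippedOp, int sizeOfLogQueue,
                               long timeStampOfLastShippedOp, long now) {
        long timePassedAfterLastShippedOp = now - timeStampOfLastShippedOp;
        if (sizeOfLogQueue != 0) {
            // err on the large side: the lag grows with time since the last shipped op
            return Math.max(ageOfLastShippedOp, timePassedAfterLastShippedOp);
        } else if (timePassedAfterLastShippedOp < 2 * ageOfLastShippedOp) {
            return ageOfLastShippedOp; // last shipped op happened recently
        } else {
            return 0L; // idle source: no real lag even though the age is non-zero
        }
    }

    public static void main(String[] args) {
        long now = 1541292912227L;
        long lastShipped = now - 3_600_000L; // last op shipped one hour ago, with an age of 70 ms
        System.out.println(replicationLag(70L, 1, lastShipped, now)); // 3600000
        System.out.println(replicationLag(70L, 0, lastShipped, now)); // 0
    }
}
```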

I'll be opening another jira to fix the source metrics issues mentioned above.

> "status 'replication'" should not show SINK if the cluster does not act as sink
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-21406
>                 URL: https://issues.apache.org/jira/browse/HBASE-21406
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Daisuke Kobayashi
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: HBASE-21406-branch-1.001.patch, Screen Shot 2018-10-31 at 18.12.54.png
>
>
> When replicating one way, from source to target, {{status 'replication'}} on the source always dumps SINK with meaningless metrics. It only makes sense when running the command on the target cluster.
> For example, in {{status 'replication'}} on the source, {{AgeOfLastAppliedOp}} is always zero and {{TimeStampsOfLastAppliedOp}} stays at the time the RS started, since the RS is not acting as a sink.
> {noformat}
>     source-1.com
>        SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=0, TimeStampsOfLastShippedOp=Mon Oct 29 23:44:14 PDT 2018, Replication Lag=0
>        SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Thu Oct 25 23:56:53 PDT 2018
> {noformat}
> {{status 'replication'}} on the target works as expected. SOURCE is empty, as the target is not acting as a source:
> {noformat}
>     target-1.com
>        SOURCE:
>        SINK  : AgeOfLastAppliedOp=70, TimeStampsOfLastAppliedOp=Mon Oct 29 23:44:08 PDT 2018
> {noformat}
> This is because {{getReplicationLoadSink}}, called in {{admin.rb}}, always returns a value (not null).
> 1.X
> https://github.com/apache/hbase/blob/rel/1.4.0/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerLoad.java#L194-L204
> 2.X
> https://github.com/apache/hbase/blob/rel/2.0.0/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerLoad.java#L392-L399



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)